16->5 "Sort"

- rickman
  
Tue, May 12, 2015 10:29 PM

Why do you have to use the output registers? The clock to out time on a BRAM has always been very fast as is the setup time. The ones I've worked with were only slightly slower than a FF in the context of typical delays in logic and fabric. What is your clock speed?

If you are working in a large part the LUTs are not an unreasonable way to implement this. Not sure how fast the resulting logic will be, but it should be in the same ballpark as the BRAM but purely combinatorial. Do you need to run faster than 100 MHz?

--

Rick

- Kevin Neilson
  
Tue, May 12, 2015 10:59 PM

I'm using 350 MHz, or a period of about 2.86 ns. The clk->out time for a V7 -1 BRAM (without output reg) is about 2.1 ns, so if I didn't use the BRAM output register, I'd barely have enough time to get the output across a net to a FF. And I know even that usually won't meet timing, because Vivado is fond of pulling the output registers out of my BRAMs and putting them into slices, I guess because it thinks it has extra slack and can give some of it to the next path. But then the net to the FF will be 600 ps and the path will fail. I have not figured out how to make Vivado stop doing this (except by instantiating BRAM primitives).
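The timing budget described here can be worked through numerically. This is only a rough sketch using the figures quoted in the post (350 MHz clock, 2.1 ns clock-to-out, 600 ps net). The FF setup time, clock skew, and clock uncertainty are not given in the thread, so they are left out, and the variable names are illustrative.

```python
# Rough timing budget from the figures quoted in the post.
# FF setup time, skew, and uncertainty are NOT given in the thread,
# so the margin below is an upper bound on what is really available.
clock_mhz = 350.0
period_ns = 1000.0 / clock_mhz      # about 2.857 ns
bram_clk_to_out_ns = 2.1            # V7 -1 BRAM, output register not used
net_delay_ns = 0.6                  # net from the BRAM to a slice FF, as quoted

# Time left after the BRAM clock-to-out alone.
after_bram_ns = period_ns - bram_clk_to_out_ns
# Time left once the 600 ps net is also on the path.
margin_ns = after_bram_ns - net_delay_ns

print(f"period:               {period_ns:.3f} ns")
print(f"left after clk->out:  {after_bram_ns:.3f} ns")
print(f"left after 600ps net: {margin_ns:.3f} ns (before setup and skew)")
```

With roughly 0.16 ns left before setup time is even counted, it is easy to see why the path fails once the output register is moved out of the BRAM.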

- rickman
  
Tue, May 12, 2015 11:37 PM

Sorry, I just can't picture what you are doing. What is the "running sum" for? I think I might understand. You look at the first 5 inputs and output codes for all five positions. I'm not sure why you can't look at the first 6 inputs though. This outputs a three-bit code of the number of 1's found. The second block looks at the next five inputs and outputs five codes. The last five bits would be like the second group and have a mux with the second group, which in turn is what actually feeds the first mux. The first group would be one level of LUTs. The following two groups

Let me try to draw this...

      ,------,  3                        ,-----,
0-5   |      |--/------------------------|SEL  |
 -->--|      | 20                        |     | 20
      |      |--/------------------------| BUM*|--/-->--
      '------'                           |     |
                                     ,---|     |
      ,------,  3        ,-----,     |   '-----'
6-10  |      |--/--------|SEL  |     |
 -->--|      | 20        |     |     |
      |      |--/--------|     |     |
      '------'           |     | 20  |
                         | BUM*|--/--'
      ,------,           |     |
11-15 |      |           |     |
 -->--|      | 20        |     |
      |      |--/--------|     |
      '------'           '-----'

*Big, Ugly Mux

The mux might be hard to work out and will surely be more than 1 level of LUTs.... unless you can use the magic muxes in the slice to combine multiple LUTs into a 6 input mux. You don't need any adders for the counts since each 3 bit count controls a separate mux. This might just work in three levels of LUTs if you can use multiple LUTs to form a 6 input mux.

I just read your post where you said you were running at 350 MHz. I guess even this will have to be pipelined. But it should be less logic than the brute force distributed RAM approach. But who knows until the LUTs are counted? In essence this is the same thing I guess. It might work better with the larger front end blocks and just one mux.

I'm very surprised the clock to out time on the V7 BRAM is 2.1ns. I think that is about the same number as the Spartan 3s from long ago. Am I mistaken?

--

Rick

- Kevin Neilson
  
Wed, May 13, 2015 7:58 AM

[...]oing, but this is how I would expect it would work: the first output is the easiest. You just find the leading 1 with a priority encoder and encode it. You can look at the first 5 bits with the first level, using the 6th LUT input for an input from the next level if none of those 5 bits are set, and so on. This requires 4 levels of LUTs. One could use the carry chain muxes to speed things up but you'd have to instantiate them because Vivado doesn't seem to know how to do that. So that first output requires 4 LUTs x 4 bits.
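The cascaded priority encoder for the first output can be modelled in software. This is a behavioural sketch only, not HDL and not the poster's actual design. It assumes "leading" means the most significant bit and that each stage sees 5 data bits; the function and parameter names are made up for illustration.

```python
import math

def leading_one_cascade(vec, width=16, bits_per_stage=5):
    """Return (index of the most significant set bit or None, total stages).

    Each stage models one LUT level: it inspects `bits_per_stage` input
    bits and defers to the next stage only if none of them is set
    (the role played by the 6th LUT input in the post).
    """
    total_stages = math.ceil(width / bits_per_stage)
    hi = width - 1
    for _ in range(total_stages):
        lo = max(hi - bits_per_stage + 1, 0)
        for i in range(hi, lo - 1, -1):   # priority within the stage
            if (vec >> i) & 1:
                return i, total_stages
        hi = lo - 1                       # nothing set: fall through
    return None, total_stages

# Bits 10 and 4 set: the leading one is bit 10, found by the second stage.
print(leading_one_cascade(0b0000_0100_0001_0000))
```

For 16 inputs at 5 bits per stage the model needs 4 stages, which matches the 4 levels of LUTs mentioned in the post.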

[...]g 3-bit sum of the number of set bits already encountered, so 3 bits of each LUT after the first are needed for the running sum, and the sum itself requires 2 levels of logic. (I can't post pictures here, can I?) So now you end up with what I calculate should be 7 levels of logic, or 3 levels of LUT and 5 levels of carry chain mux. I could maybe do this if I pipeline it and I can get Vivado to synthesize it properly. But it just seems like there should be some easier way.

The BRAM output is 2.1 ns, but if you use the output register (which I have to) it's 750 ps. Then the BRAM has 2 cycles of latency.

Yes, something like you show would work. The design I'd written up had the sums as inputs to the LUTs. So the top LUT could look at 6 bits (I said 5 originally because I was going to use the MUXCY but I abandoned that). Then the next LUT looks at 4 bits, and the other 2 inputs would be the 2-bit sum of the first 5 bits. And the next LUT looks at 4 more bits and also has a 2-bit sum of the first 10 bits. (This is for the 3rd encoded output so we're looking for the 3rd bit set.) I end up with 4 of these LUTs, 2 levels of LUTs to do the sums, and an F7/F8 mux afterward to pick one of the 4 LUTs. So that's 3 levels of LUTs and an F7/F8, which would work in 1 cycle. The whole thing would be about 100 LUTs.

I couldn't get that to work, though, because I can't get Vivado to synthesize anything right, and I was going to have to instantiate a lot of primitives (including the F7/F8 muxes). I couldn't even get Vivado to do the sums correctly. You should be able to find the mod-2 sum of up to 18 bits with 8 LUTs in 2 levels, but Vivado does 3 levels. It's pitiful.

I ended up doing something else. I did a trailing-one detector like this:

wire [15:0] trailing_1 = ~(input_vec[15:0]-1) & input_vec[15:0];

This uses the carry chain. I think the idea is from Knuth. That gives you a 16-bit vector with just the trailing 1 set.
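To check the expression by hand, here is a small Python model of the same trick. It is a software sketch, not synthesizable code. The explicit 16-bit mask is needed only because Python integers are unbounded, and the function name is illustrative.

```python
MASK16 = 0xFFFF

def trailing_one(vec):
    """Isolate the least significant set bit of a 16-bit vector.

    Mirrors the Verilog expression ~(input_vec - 1) & input_vec.
    Subtracting 1 flips the trailing 1 and every 0 below it, so only
    the trailing 1 survives the AND with the inverted result.
    """
    return ~(vec - 1) & vec & MASK16

print(bin(trailing_one(0b0000_0000_0001_0100)))  # 0b100
```

An all-zero input gives an all-zero output, so no special case is needed.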

You encode that for the 1st output. You do the same thing with a mirrored version of input_vec to do a leading-one detector and encode that for the 2nd output. Then you XOR those two vectors with the original to get a vector with just the 3 middle bits still set. You do another leading/trailing 1 detector and encode those two and then XOR those with the original and you have a vector with 1 bit set and you encode that.
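The peeling scheme can be sketched end to end as a software model. Several things here are assumptions rather than details from the post: the input is taken to have exactly 5 of its 16 bits set, the second XOR is applied to the 3-bit intermediate vector, and the five codes are returned in ascending bit order. All function names are hypothetical.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1

def trailing_one(vec):
    """One-hot vector holding only the least significant set bit."""
    return ~(vec - 1) & vec & MASK

def mirror(vec):
    """Reverse the bit order of a WIDTH-bit vector."""
    out = 0
    for i in range(WIDTH):
        if (vec >> i) & 1:
            out |= 1 << (WIDTH - 1 - i)
    return out

def leading_one(vec):
    """One-hot vector holding only the most significant set bit."""
    return mirror(trailing_one(mirror(vec)))

def encode(one_hot):
    """Encode a one-hot vector as a bit index."""
    return one_hot.bit_length() - 1

def five_positions(vec):
    """Return the indices of the 5 set bits, lowest first."""
    lo1, hi1 = trailing_one(vec), leading_one(vec)
    mid3 = vec ^ lo1 ^ hi1            # 3 middle bits remain
    lo2, hi2 = trailing_one(mid3), leading_one(mid3)
    mid1 = mid3 ^ lo2 ^ hi2           # single middle bit remains
    return [encode(x) for x in (lo1, lo2, mid1, hi2, hi1)]

print(five_positions(0b1000_0001_1000_1001))  # [0, 3, 7, 8, 15]
```

In hardware the mirroring is just wiring, so the leading-one detector costs the same as the trailing-one detector.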

That's 200 LUTs in all, and I pipelined it for 3 cycles of latency. There's a lot of slack so I might be able to do it in 2 but I'm not sure if I want to risk it.