16->5 "Sort"

K

Kevin Neilson 11 years ago

I'm trying to design a circuit (Virtex-7) which you might call either a pri ority encoder or a sorter. This is what it should do:

Given a 16-bit vector with 5 bits set, create a list of 5 4-bit encoded values of each bit set. These needn't be in order.

This turns out to be a lot harder than I thought. Writing the behavioral R TL isn't hard, but Vivado synthesizes it to 16 levels of logic, and when I draw out an optimized version, I still get at least 5 levels (using 6-input LUTs). I'd like to do it with minimal latency, but I can only do about 3 levels of logic at my clock speed. I've tried thinking about how I can use the carry chain muxes but they don't seem to be helpful. I can do a leadi ng-ones detector and encode the leading 1, but I'd probably have to pipelin e each stage so that would take 5 cycles.

Vote

R

rickman 11 years ago

I'm not sure what you mean by "list". Logic circuits don't normally use lists. You want to know which five lines are set. It is easy to encode that into four bit values. But once you have the values what exactly do you want to do with them to provide them in a "list"?

If you want them accessible on the same four data lines in sequence, what determines the timing of the sequence? What exactly is your interface?

Rick

Vote

K

Kevin Neilson 11 years ago

The list is a 20-bit vector comprising the 5 4-bit values. This is put into a 20-bit wide FIFO; each of the 5 values will be processed simultaneously when read from the FIFO. So another way you could describe this is a 16-bit/5-hot -> 20-bit encoder.

Vote

K

Kevin Neilson 11 years ago

To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:

1111 1110 0010 0001 0000

Vote

R

rickman 11 years ago

I am old school so I have to picture these things....

I see five priority encoders. The first priority encoder output is used to disable that input to the second priority encoder, etc.

How did you code it? I can see how this would be a lot more than three levels of logic. The equations get quite long. When you talk about levels of logic I assume you mean layers of LUTs? I can see maybe each

16 input priority encoder being no more than 3 levels of LUTs, but all five layers with the inhibit logic... I don't think so. I expect the tools did a pretty good job of optimizing it and I don't easily see any way of using carry chains.

Rick

Vote

G

glen herrmannsfeldt 11 years ago

(snip)

It is slightly easier than that.

(Sometime ago I did this with 36 and 3.)

Consider the case of only two bits high. Use one priority encoder the usual way, and one upside down. (That is, the highest and lowest set bit.) Now, as you said, subtract off the highest and lowest, and two more priority encoders, and finally subtract one more and the last one.

But first I would add the logic to determine that only (or at least) five were set.

I suspect that, as the OP noted, it is combinatorial hard. Consider the logic needed to, separately, compute each bit of the result. Even just the first one.

It does, however, pipeline very well. In the one I was working on, it had a fairly fast clock (66MHz) but I could stand some levels of latency. The OP claims the need for low latency.

-- glen

Vote

R

Rob Doyle 11 years ago

64K x 20 ROM?

How badly do you want one level of logic?

Rob.

Vote

G

GaborSzakacs 11 years ago

Two ROMs, 256 by 23 and 256 by 20. The first encodes the lower 8 bits into 0 to 5 4-bit numbers right-justified plus a 3-bit indication of how many bits were set. The second just encodes 0 to 5 4-bit numbers, left justified (you assume it has the remaining 1's not covered by the low 8 bits. Then one more level to select how many bits to take from each ROM based on the number of ones in the lower 8 bits (each output bit is only a 2:1 mux in this case - that's why the ROMs are justified right and left as noted).

Gabor

Vote

G

glen herrmannsfeldt 11 years ago

Hmm, that probably does work. Especially with the BRAM in many FPGAs.

When I did it some years ago, it was three out of 36 bits, and ROMs weren't, and still aren't, that big.

If not in BRAM, that seems a good way.

How many levels of logic is a 256 bit ROM in FPGA LUTs?

-- glen

Vote

R

rickman 11 years ago

Like most things, "it depends". The older devices have 16 bits per LUT, if any, and many of the newer devices have 64 bits per LUT.

Nearly all devices have BRAMs so it's a no brainer by the split table method. Only trouble is BRAMs use a clock cycle... I'm just sayin'...

Rick

Vote

G

glen herrmannsfeldt 11 years ago

(snip, I wrote)

(snip)

OK, going back the OP says that three levels of logic is fine. Doesn't say how many pipeline stages.

I presume a clock is available for the BRAM, but I am not sure.

-- glen

Vote

G

GaborSzakacs 11 years ago

In Xilinx 7-series:

A 256-bit LUT fits in 1 SLICEL. It uses all 4 64-bit LUTS and three muxes (two for each pair of LUTS and one to combine the result), so it would show up as 3 levels of logic, but it all routes internally to the slice and the muxes are really fast.

Convincing the tools to use LUT memory is the fun part. Here's my test code:

module simple_lut ( input wire [7:0] addr, output wire [7:0] data );

(* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255];

initial $readmemh ("../source/lutcontents.hex", lut_mem);

assign data = lut_mem[addr];

endmodule

Gabor

Vote

R

rickman 11 years ago

The OP said, "minimal latency" and I think I over spec'd that to mean combinatorial. So one clock for the BRAM and one clock for the muxing should be ok. Better than the 5 clock cycles he mentions.

Rick

Vote

R

rickman 11 years ago

It would be 43 x 4 or 172 LUTs. A fair amount plus the muxes to combine the outputs. Still, it is mostly in parallel so it should be fairly fast.

Rick

Vote

K

Kevin Neilson 11 years ago

The idea of working from both sides is useful. Finding the 5th bit set is a lot harder than finding the 1st because you have to keep a running sum of the bits already set so that eats up a few inputs of each LUT. If I searc h from both ends the running sum would be no bigger than 2 (finding, say, 3 starting from the top and 2 starting from the bottom). When I draw this o ut I still have at least 3 levels of logic plus a few levels of carry chain mux, which I'd probably still have to pipeline one stage, but that's accep table. Vivado surely won't use the carry chain mux unless I instantiate i t anyway, and then it would be 5-6 levels of logic, so I'd definitely need to pipeline that one stage.

Vote

K

Kevin Neilson 11 years ago

Vivado made it 16 levels of logic, and I can't tell exactly what it's doing , but this is how I would expect it would work: the first output is the ea siest. You just find the leading 1 with a priority encoder and encode it. You can look at the first 5 bits with the first level, using the 6th LUT i nput for an input from the next level if none of those 5 bits are set, and so on. This requires 4 levels of LUTs. One could use the carry chain muxe s to speed things up but you'd have to instantiate them because Vivado does n't seem to know how to do that. So that first output requires 4 LUTs x 4 bits.

But the 5th encoded output is harder, because you have to keep a running 3- bit sum of the number of set bits already encountered, so 3 bits of each LU T after the first are needed for the running sum, and the sum itself requir es 2 levels of logic. (I can't post pictures here, can I?) So now you end up with what I calculate should be 7 levels of logic, or 3 levels of LUT a nd 5 levels of carry chain mux. I could maybe do this if I pipeline it and I can get Vivado to synthesize it properly. But it just seems like there should be some easier way.

Vote

K

Kevin Neilson 11 years ago

Yes, that would work. I think it would be about about 180 LUTs, which is quite a bit. It would probably work in one cycle: there is a LUT, F7/F8 mux, and a second level of LUT for the mux, and only 2 levels of net routed on the fabric.

Vote

K

Kevin Neilson 11 years ago

I have plenty of BRAMs and don't mind using them, but they're a pain someti mes. I have to use the output registers so they have 2 cycles of latency, and often I have to add another cycle just to route data to or from the BRA M column. They come in handy, though. Lately I've been using them for a l ot of Galois arithmetic, such as lookup tables for 1/x.

Vote

K

Kevin Neilson 11 years ago

The ROM can be fast if you use the F7/F8 muxes built into the slice. I've found the key thing for V7 is to minimize the number of routes on the fabri c. The F7/F8 muxes are slower than LUTs, I think, but since you don't have to route the connecting net onto the fabric you save a lot of time.

Vote

G

GaborSzakacs 11 years ago

If you decide to pipeline it, a register placed after the 256-deep LUT will go into the same slice with the 4 LUTs, 2 F7 muxes and F8 mux. Then the final 2:1 would go after a standard fabric register, which has pretty small clock to Q (much better than BRAM without the output register). Even in a -2 Artix you can run above 500 MHz with this arrangement.

Gabor

Vote

16->5 "Sort"

Join the Discussion

Didn't find your answer?