Sorting large amounts of floats

Hi, I'm trying to figure out the best way to sort a large number (thousands) of floats or fixed-point values on an FPGA. With this much data, it needs to be stored in RAM because it obviously won't fit in registers, but this means there can only be one access to that RAM per clock cycle. Since any sorting algorithm needs to access at least two pieces of data per comparison, I can't figure out how to parallelize the sorting. Has anyone got any experience in this area? I would really appreciate any advice.

Thanks, Keith

Reply to
Keith O'Conor

One simple solution would be to use a dual-port RAM.

Another is to use two separate RAM blocks, one for each data set.

Cheers

PeteS

Reply to
PeteS

Sorting floats is no different from sorting fixed point; that is the reason for using an excess-N (biased) code for the exponent.
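To make Ray's point concrete: because the exponent is biased, an IEEE-754 bit pattern compares correctly as an unsigned integer once the sign is fixed up (set the sign bit for non-negative values, invert all bits for negatives). A minimal Python sketch of that key transform (function name is illustrative):

```python
import struct

def float_sort_key(x: float) -> int:
    """Map an IEEE-754 double to an unsigned integer whose ordering
    matches numeric ordering, so floats can be sorted with plain
    integer comparators. The biased (excess-N) exponent is what
    makes the raw bit pattern monotonic for non-negative values."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits & (1 << 63):                      # negative: invert all bits
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | (1 << 63)                   # non-negative: set sign bit

data = [3.5, -1.25, 0.0, -100.0, 2.0]
assert sorted(data, key=float_sort_key) == sorted(data)
```

In hardware this means the comparators in the sorter can be ordinary integer comparators, with no special float datapath.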

To accomplish the sort efficiently, you partition the data, storing and sorting subsets of the full data set on the FPGA and then merging the results. That way you have fewer accesses to the off-chip RAM to deal with.
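The scheme above is a classic external merge sort. A Python sketch of the two phases, assuming sorted runs fit on-chip and the merge streams over off-chip RAM (chunk_size and the function name are illustrative):

```python
import heapq

def chunked_merge_sort(data, chunk_size=1024):
    """Partition-then-merge: sort chunk_size-element subsets
    independently (on-chip, in hardware), then merge the sorted
    runs in a single streaming pass over the full data set."""
    runs = [sorted(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]
    return list(heapq.merge(*runs))
```

The merge phase only needs one sequential read stream per run plus one write stream, which maps well onto burst accesses to external RAM.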

Reply to
Ray Andraka

Two accesses per cycle. Or four, if your application will let you read and write the same location in the same clock cycle, which is unlikely, so call it two.

That's two accesses per block of RAM, and there can be a hundred or more of those in an FPGA. If you can exploit this, you can improve parallelism considerably.

- Brian

Reply to
Brian Drummond

Great, you at least know where the bottleneck is. The first rule in configurable computing is to stop using central memories, as they are unavoidably single-threaded. Using LUT-based RAMs suddenly gives you thousands of little memories with concurrent access.

Frequently the next step is to stop thinking word-parallel and go bit-serial, which immediately suggests divide and conquer in a massively parallel way. Consider a 2-bit-in, 2-bit-out node with a tiny state machine that resets on the word clock, passes bits from input to output until they differ, then latches one of them as the mux selector and continues passing bits in this selected state until the next word clock. These butterflies can be cascaded 2^n deep to concurrently sort thousands of serial data streams, with a one-clock latency per stage and a latency of log2(width) stages for the whole sort.
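In software terms, each node is a bit-serial compare-exchange: stream both words MSB-first, pass bits through while they are equal, and at the first difference latch a mux select that routes the larger word to one output for the rest of the word. A Python simulation of one node (names are illustrative; both inputs are assumed to be equal-length MSB-first bit lists):

```python
def serial_compare_exchange(a_bits, b_bits):
    """Simulate the 2-in/2-out bit-serial node: while the incoming
    bits are equal they pass straight through; at the first
    difference a mux select is latched so the larger word goes to
    the 'hi' output for the remainder of the word."""
    hi, lo = [], []
    swap = None                    # mux select, latched at first difference
    for a, b in zip(a_bits, b_bits):
        if swap is None and a != b:
            swap = (b > a)         # b larger -> route b to hi thereafter
        if swap:
            hi.append(b); lo.append(a)
        else:
            hi.append(a); lo.append(b)
    return hi, lo
```

In hardware the state machine is reset by the word clock, so the same node sorts a new pair of words every word period with no per-word control logic.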

I hope this wasn't your class assignment in reconfigurable computing for parallel processing applications. Monday or Tuesday I will post Beta-2 of FpgaC, which adds LUT RAMs for small arrays along with structures and some other improvements. If you aren't currently taking a parallel programming class, it might be useful to at least review the course notes from a major university and pick up one of the texts, as they will give you a lot of insight into approaching these problems on traditional processors, much of which applies directly to reconfigurable computing.

formatting link

When Beta-2 is available for download, look at the user's manual in docs, and at the examples. Setting up multiple stages of concurrent pipelines is pretty easy in traditional C, with a cute trick of building the procedure block backwards: bottom-to-top data flow rather than top-to-bottom.
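My reading of the bottom-to-top trick, sketched in Python rather than FpgaC (the stage variables and the x*2+1 computation are made up for illustration): within each "clock cycle" the last stage is updated first, so every stage consumes the value its predecessor held on the previous cycle, and the assignment order itself provides the pipeline registers.

```python
def run_pipeline(samples):
    """Three-stage pipeline written bottom-to-top: because 'out' is
    assigned before s2, and s2 before s1, each stage reads its
    predecessor's previous-cycle value, i.e. the ordering of the
    assignments acts as the pipeline registers."""
    s1 = s2 = out = 0
    results = []
    for x in samples:
        out = s2 + 1       # stage 3 reads s2 from the previous cycle
        s2 = s1 * 2        # stage 2 reads s1 from the previous cycle
        s1 = x             # stage 1 captures the new input
        results.append(out)
    return results
```

Here the pipeline computes 2*x + 1 with a two-cycle latency; writing the same three assignments top-to-bottom would instead collapse into one combinational pass per iteration.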

have fun, John

Reply to
fpga_toys
