Sorting algorithm for FPGA available?

Hello,

we have a Virtex-4 FPGA and are looking for available hardware sorting algorithms. I couldn't find anything at opencores.org; however, I assume this subject has been worked on before.

The Virtex-4 has two PowerPCs and lots of Block RAMs, so everything seems to be there. Are there any projects, and what speeds could be estimated for sorting 64-bit integers?

thx Heiner

Reply to
heinerlitz

I'd suggest a binary-tree-based sorting algorithm with destination vectors, so that once the final order has been found you can sort in parallel with more than one unit (e.g., one per Block RAM), or even one per cell in my case :-).

If you have no special demands concerning speed, you should be able to run a software-based algorithm on the PowerPC core, using one of the algorithms available on the net.

Reply to
homoalteraiensis

Hello,

There was a similar thread in this newsgroup. If you are looking to solve a specific problem and have performance targets, you should share those -- it will help people give you more focused answers.

You do not need PowerPCs or Block RAMs to sort integers. Sure, you can write a sequential sort algorithm and run it on a processor, but if you are interested in performance, I think that approach may be one of the last things you would want to do. If you are interested in two examples of a hardware sort, take a look at:

formatting link

There is a bubble sort and an odd-even transposition sort. The examples sort a set of five 16-bit values. They could certainly be extended to handle fewer/more and/or wider/narrower inputs. One example is pipelined; the pipeline depth can certainly be changed. In fact, you can turn it into a combinational sorter if you want.
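For reference, the odd-even transposition sort can be modeled in a few lines of C. This is only a software sketch of the algorithm itself, not the linked hardware examples; in hardware, each pass corresponds to a rank of compare-exchange units, and n passes sort n inputs.

```c
#include <stdint.h>

/* Swap two values if they are out of order (one compare-exchange unit). */
static void compare_exchange(uint16_t *a, uint16_t *b) {
    if (*a > *b) { uint16_t t = *a; *a = *b; *b = t; }
}

/* Odd-even transposition sort: n passes, alternating between
 * even-indexed and odd-indexed adjacent pairs. */
void oe_transposition_sort(uint16_t *v, int n) {
    for (int pass = 0; pass < n; pass++) {
        int start = pass % 2;               /* alternate odd/even pairs */
        for (int i = start; i + 1 < n; i += 2)
            compare_exchange(&v[i], &v[i + 1]);
    }
}
```

Since each pass only touches disjoint pairs, all compare-exchange units in a pass can operate concurrently, which is exactly what makes the algorithm attractive in hardware.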

If anyone expands the examples to sort 1 million 64-bit integers, please let me know what your synthesis tool thinks about it.

Certainly, I am not advertising the examples as "the best" and I am no "expert" in sorting. Or anything. But they might be a place for you to start. I only knew the bubble sort, but used Google to search for "parallel sorting algorithms" because I figured there would be one which I could implement in hardware. The odd-even transposition sort is indeed quite easy.

Eric

Reply to
Eric Crabill

How many objects, and how many bits in each, for the sort? How many clocks do you have to complete the sort? Are the results accessed sequentially or as a table? These are some of the important questions you need to answer before you can determine how best to sort the data.

Reply to
Ray Andraka

It all depends on what type of data you have: is it already almost sorted, or so-so, or fully random? Look at Knuth's texts for the algorithms.

If you need to sort a huge array of integer keys, I favor the radix sort. For 64-bit values, use 8 passes with 256 variable-length output buckets whose total size will be the same as the initial array. The time will go towards 16N memory cycles, which may use the fastest rate of memory access, potentially even at DDR rates.

It is just like sorting a telephone book, where you take the initial, say random, name list and put all the names beginning with 'a' into the 'a' bucket, and similarly for the other letters. In this case you have 256 letters. On the 2nd pass you inspect the 2nd letter. Every time you have, say, 16 'a' words, you bag them and put them into the 'a' pile. You want to do mostly reads and mostly writes as bursts rather than interleaving reads and writes. Block RAMs can be used as your letter bags.

It gets more interesting when you figure out how to manage the buckets. Block RAMs can be useful for holding blocks of each letter. One Block RAM with 256 64-bit words would hold only a 1-word buffer for each of the 256 buckets, so 16 Block RAMs would hold 256 buckets 16 words deep. You want them to be as deep as possible to reduce the cost of bucket management. You could think of these Block RAM buckets as 256 soft FIFOs (16 words deep) into main memory.

Only when a letter bucket fills do you need to write it out and find a new place to write it. One way to do that is to assume the input data is matched by a similar-sized output array which is broken into 16-word cells. Add to that a linked list to manage the cells; then it is a matter of managing 256 linked-list heads and a simple table that will need 1 word per cell.

If you get really good at this, with small buffers for inputs and outputs, you can effectively do this in place in main memory, though many will use separate in and out buffers. Every time 16 words are read in, that cell is added to a free list. On average, you write 1 of the 256 letter cells for every cell you read in. Memory cost is therefore N + N/16 for 16-word-deep buffers.

Of course you should prototype this in C or whatever and observe its operation. The HDL code will look similar.
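As a starting point for such a C prototype, a single "telephone book" distribution pass might look like the following. This is just an illustrative sketch with my own names; it sizes each bucket exactly with malloc instead of the Block-RAM FIFO and linked-list cell scheme described above.

```c
#include <stdint.h>
#include <stdlib.h>

/* One distribution pass: scatter n keys into 256 buckets by one byte
 * of the key (shift selects which byte plays the role of the "letter").
 * The hardware version would stage these buckets through Block-RAM
 * FIFOs and burst them to DRAM instead of calling malloc. */
void distribute(const uint64_t *in, size_t n, int shift,
                uint64_t **bucket, size_t *count) {
    /* First count, so each bucket can be sized exactly. */
    for (int b = 0; b < 256; b++) count[b] = 0;
    for (size_t i = 0; i < n; i++)
        count[(in[i] >> shift) & 0xFF]++;

    size_t fill[256];
    for (int b = 0; b < 256; b++) {
        bucket[b] = malloc(count[b] * sizeof(uint64_t));
        fill[b] = 0;
    }

    /* Second pass: scatter keys into their buckets, preserving order. */
    for (size_t i = 0; i < n; i++) {
        unsigned b = (in[i] >> shift) & 0xFF;
        bucket[b][fill[b]++] = in[i];
    }
}
```

Running this with shift = 56 sorts by the most significant byte, which matches the "first letter" of the telephone-book analogy.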

John Jakson transputer guy

Reply to
JJ

Hi John, I don't understand the following memory allocation issue you mentioned: "Every time 16 words are read in, that cell is added to a free list. On avg, you write 1 of the 256 letter cells for every cell you read in. Memory cost is therefore N+N/16 for 16 word deep buffers. "

For example, what if all 1 million keywords start with the same one of the 256 letters?

Could you please give more details? Or any references in Knuth's Volume 3?

Thank you.

Weng

JJ wrote:

Reply to
Weng Tianxiang

Well, it's not that difficult to write a small C program to explore this.

Get a small rand function, like R250, fill up a 1M-word array, make a pointer table to every 16th word, and make 256 empty head pointers. Set up 256 empty 16-word FIFOs.

Arrange the C code to use an FSM in a loop so that in every iteration, data moves from the array to 1 of the 256 buffers in 16-cycle bursts. If any of the buckets should fill, copy them to a write-back buffer. During the 16-word read intervals, insert write-backs.

It would be much easier to just use 2x as much Block RAM and let the buffers be 32 words deep. While all buffers are less than 16 deep, keep reading in bursts. When any buffer crosses 16, write out the previous 16 during a read interval. Basically you have a DMA engine that is always reading or writing 16-word blocks so DRAM can be efficient. You can change the 16 for another burst size by changing the bucket depths. Use consts, of course.

It doesn't really matter if all the letters are evenly distributed or even all mostly the same. That only changes the routing of bucket fifos into the output stream.

The input data, 64K 16-word entries all starting on 1 linked list, is going to be replaced by 64K 16-word blocks spread over 256 distinct lists, even if most might actually be on 1 list for uneven letters. On the next pass, the engine will iterate through each of these 256 input lists and write back, sort of in place, to the same list.

I could see 2 memory systems in play: the data reads and writes to the 1M array, and a separate system to do the header linked-list management. Every write-back of a bucket needs the linked-list FSM to track the 256 headers.

The HDL FSM will be pretty much the same structure.

Of course, writing a program in an NG post is much easier than writing it for real, but really this isn't rocket science. This is about a day's worth of C coding and testing.

Perhaps a basic text on linked lists might be more useful than Knuth, since that's really all this is.

John Jakson transputer guy

Reply to
JJ

Hi JJ, Thank you.

Weng

Reply to
Weng Tianxiang

JJ wrote:

Actually that would be bucket sort. The disadvantage of bucket sort is that you need to manage many buckets of varying size. This is either complicated or wastes a lot of memory. With radix sort you start by sorting counterintuitively by the LAST letter of the name (or the least significant byte) to avoid all that hassle.

Each pass can be done by counting sort with an internal Block RAM with 256 entries for counting. In a V4 you can probably even hold 64K entries, so you can get along with 4 passes of 16 bits each.

See Section 9.3 of Cormen, Leiserson, and Rivest, "Introduction to Algorithms".
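A minimal software model of this LSD approach: eight stable counting-sort passes for 64-bit keys, least significant byte first. The 256-entry count table is the part that would map to a small Block RAM; the names below are my own.

```c
#include <stdint.h>
#include <string.h>

/* LSD radix sort for 64-bit keys: 8 counting-sort passes, one byte
 * per pass, least significant byte first. Stability of each pass is
 * what makes the overall sort correct. 'tmp' must hold n elements. */
void radix_sort_u64(uint64_t *a, uint64_t *tmp, size_t n) {
    for (int pass = 0; pass < 8; pass++) {
        size_t count[256] = {0};          /* the "Block RAM" table */
        int shift = pass * 8;

        for (size_t i = 0; i < n; i++)    /* histogram this byte */
            count[(a[i] >> shift) & 0xFF]++;

        size_t pos = 0;                   /* prefix sums -> start offsets */
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }

        for (size_t i = 0; i < n; i++)    /* stable scatter */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof(uint64_t));
    }
}
```

Because each pass only needs the histogram, the prefix sums, and one sequential scatter, the memory traffic is streaming reads and writes, which fits the burst-oriented DRAM access discussed earlier in the thread.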

Smaller datasets can be sorted in hardware by systolic priority queues as fast as you can input the data.

Kolja Sulimma.

Reply to
Kolja Sulimma

You can also sort the words bit serial using a log2(N) array of muxes for each serial input. This strategy has a serial latency of log2(N) to the first bit, plus word length clocks, and can be built as wide as you want it. It's easily reused as part of a block sort.

formatting link

The strategy is a two-input, two-output mux that passes bits through as pairs. Each bit is compared, and on the first difference, the mux selector is latched so that the low-value bit stream passes to the left. The mux selectors are reset on a word flag. With a log2 mux arrangement, each mux layer sorts with a stream offset of 2, 4, 8, ..., 2^N.

Depending on the FPGA, the self selecting bit muxes are a few LUTs each, making the total cost of the sort engine N * log2(N) * K (where K is the number of LUTs per bit mux).
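A software model of one such self-selecting bit mux might look like this (MSB-first bitstreams; the names and the exact interface are my own assumptions, and the real LUT-level details will differ):

```c
/* Model of one bit-serial compare-exchange node. Both inputs arrive
 * MSB first; on the first bit where they differ, the selector locks
 * so the smaller word is steered to 'lo'. In hardware, a word flag
 * would reset 'locked' between words. */
typedef struct { int locked; int swap; } cex_state;

void cex_reset(cex_state *s) { s->locked = 0; s->swap = 0; }

void cex_bit(cex_state *s, int a, int b, int *lo, int *hi) {
    if (!s->locked && a != b) {
        s->locked = 1;
        s->swap = (a > b);      /* input a carries the larger word */
    }
    *lo = s->swap ? b : a;
    *hi = s->swap ? a : b;
}
```

Feeding two 4-bit words through this node MSB first routes the smaller word to the 'lo' output bit by bit, one clock per bit, which is where the log2(N)-to-first-bit latency figure comes from when nodes are layered.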

Reply to
fpga_toys

Hi Eric, I have difficulty understanding your slide p. 35.

It relates to calculation of latency in ns:

2 - 9; - 1 | | | 6; 5 3;

';' is a register. Each number is a delay in ns.

The first pipeline stage delay should be 2+9; the second pipeline stage delay should be 5+3.

Total delay should be 19 ns, but your slide says it is 22 ns.

Page 37 is totally wrong in its use of pipeline registers, because they are not synchronous at all. There is a shortcut from the entry to the end point.

Thank you.

Weng

Eric Crabill wrote:

Reply to
Weng Tianxiang

Hi Weng,

This class is about synchronous logic design using FPGAs. All of my examples are completely synchronous; hopefully the diagrams on pages 7, 8, 29, and 30 convey this information -- not in words -- but by the common clock that clocks all flip-flops in the circuit.

While there are such things as asynchronous pipelines, they are not something I would want to discuss with undergraduate students who aren't sure about the difference between a flip-flop and a latch!

For your first question -- 22 ns is correct. A synchronous pipeline cannot run with a cycle time that is lower than the cycle time you compute for any particular segment. It may have been Henry Ford who first realized this. As you point out, the first stage (in isolation) can operate with a cycle time of 2+9=11 ns. The second stage (in isolation) can operate with a cycle time of 5+3=8 ns. However, when joined to form a synchronous pipeline, it is the stage with the 11 ns delay that sets the minimum cycle time for the complete pipeline. Two cycles of 11 ns yields 22 ns.
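The arithmetic can be reduced to a one-line model: total latency = number of stages x the slowest stage's cycle time. A trivial sketch (the 2+9 and 5+3 ns figures are the stage delays under discussion):

```c
/* Pipeline latency model: every stage must run at the clock period
 * set by the slowest stage, so total latency is that period times
 * the number of stages. */
int pipeline_latency_ns(const int *stage_ns, int stages) {
    int period = 0;
    for (int i = 0; i < stages; i++)
        if (stage_ns[i] > period) period = stage_ns[i];
    return period * stages;   /* 2 stages of max(11, 8) ns -> 22 ns */
}
```

This is why summing the stage delays (11 + 8 = 19 ns) understates the real pipeline latency of 22 ns.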

I do not understand your comment regarding page 37. Can you clarify what you meant?

Thanks, Eric

Reply to
Eric Crabill

Hi Eric, Thank you for your response. Your explanation of slide 35 is right.

Now I know how you calculate the slide 37 timing. But in real designs, the way flip-flops move forward or backward, as your slides show, is very dangerous; one must be very careful and clearly check all signals' validity. There must be some criteria to govern the movements. I really would like to know the rules.

In my design, whether signals are valid or not is based on every clock. A 1-clock delay for one signal may cause big trouble, and it usually takes me a lot of time to check; sometimes it requires modifying related logic equations and even state machines.

Thank you.

Weng

Reply to
Weng Tianxiang

Hi Weng,

This is an excellent question; the answer is not in my handout, but I do address it during the lecture. While pipelining a circuit, you cannot draw arbitrary lines and call them a pipeline stage. For retiming a circuit, you cannot randomly move registers around.

The following rules are what I discuss in class. I believe they are sufficient to avoid an incorrect circuit but I cannot claim they are by any means "optimal". In fact, they are very simple and do not address how you might modify the circuit to correct for pipeline hazards.

For a proposed pipeline stage:

  1. The stage must completely cut the circuit. I tell students to visualize the circuit on a piece of paper, and visualize using scissors to cut along the proposed boundary. The circuit must be cut into separate pieces.
  2. I tell students to label one side of the proposed boundary "A" and the other side "B". I then tell students to evaluate each signal that crosses the boundary to determine the flow of information. All information must flow through the boundary in the same direction. A boundary represents flip flops that capture data from one stage and provide it to the next.

I pose retiming as an extension of pipelining -- a perfectly pipelined circuit does not benefit from retiming. Retiming is going back to a (poorly or haphazardly) pipelined design and trying to redraw the pipeline boundaries to gain some advantage by moving delay out of the critical path.

I ask students to assume all flip-flops in the design to be retimed are part of a pipeline stage. They don't know how many stages, or which flip-flops are associated with a given stage, because this information is not available. However, at any flip-flop, it's clear which direction the information is moving.

To validate a potential retiming move, I tell them to visualize reaching "through" the flip-flop (which is a pipeline boundary) to grab logic on the other side (it doesn't matter on which side we started) and then physically pull the logic through the flip-flop. Along with the logic will come the signals feeding that logic. Those signals are now passing through the pipeline stage; the original flip-flop is replaced by new flip-flop(s) where the signals have been pulled through the boundary. The test for validity of this move is to revisit check (2) for pipelining. I know which direction the information was moving originally. For the retiming move to be valid, all the information crossing the boundary must move in the same direction as it did originally.

It is much easier to illustrate on the chalkboard using my hands and making slurping sound effects, which is why I did not include an essay like this in the lecture notes.

Eric

Reply to
Eric Crabill

Hi Eric, Thank you for your response.

I have read an article about a median filter implementation: "New bit-level algorithm for general purpose median filtering" by Khaled Benkrid and Danny Crookes. Its FPGA implementation runs at 27 frames/s.

Let your students try to increase its implementation speed.

If they can, they know how to beef up performance.

Thank you.

Weng

Reply to
Weng Tianxiang
