Fast/low area Sorting hardware.

Hi all , I wanted to implement a fast and low area sorting algorithm in Verilog RTL, does anyone have any suggestions? Any links to IEEE papers, articles are higly welcome...

regards Deepu John

Reply to
john.deepu
Loading thread data ...

The question is how many values you want to sort... basically, this is iterative algorithm, so for a large set, a big memory and controller will be small but long.

as you posted on comp.arch.fpga, you might like this one:

formatting link

else googlize "rank order filter"

snipped-for-privacy@gmail.com wrote:

Reply to
Stephane

I wanted to sort 48 8bit unsigned numbers

Reply to
john.deepu

Depends on how fast you want it and what the input order usually is.

Do it in C 1st and cost it so that every read and write into data array is your memory access cost function and every compare done is your hw compare cost function.

For a 2 input a,b = sort x,y a,b

Reply to
JJ

Hi Stephane. I wanted to sort 48 8bit unsigned numbers

thanks Deepu

Reply to
john.deepu

And do you want the 384 bit result for a new set of numbers every 10 ns? Can you wait 1 ms for the result? Is it fine to keep the inputs and outputs in a BlockRAM or are you presenting them parallel?

Reply to
John_H

schrieb im Newsbeitrag news: snipped-for-privacy@g49g2000cwa.googlegroups.com...

Just some ideas. Hold the numbers in a BRAM. Create a small FSM to pull out the numbers, store them in a temporary register and compare. Bubble sort is usually the easiest but slowest approach in software. Maybe this is also true for an FPGA. Misc sort is fastest, so it sould be for hardware too. You could use both ports aof a BRAm to simultaneously access two pieces of data to increase speed.

Regards Falk

Reply to
Falk Brunner

: schrieb im Newsbeitrag : news: snipped-for-privacy@g49g2000cwa.googlegroups.com... : > Hi Stephane. : > I wanted to sort 48 8bit unsigned numbers

: Just some ideas. Hold the numbers in a BRAM. Create a small FSM to pull out : the numbers, store them in a temporary register and compare. Bubble sort is : usually the easiest but slowest approach in software. Maybe this is also : true for an FPGA. Misc sort is fastest, so it sould be for hardware too. You : could use both ports aof a BRAm to simultaneously access two pieces of data : to increase speed.

Bubble sort should actually be quite fast - you can store all 48 values in registers, then compare and swap if necessary odd pairs on odd clock cycles and even pairs on even cycles. After 48 cycles the registers should hold a sorted data set.

Probably this aproach is about as fast as you can get? The speedup comes from the fact that many many bubble sort opperations can occour in parallel. Is this the most hardware efficient sort that runs at this speed though?

If possible data should be shifted sequentially into and out of (or at least out of) the registers as a readout mux would be a resource hog. Using a Generate statement in VHDL (or equiv in Verilog) I'm guessing the sort will come to less than 100 lines of code.

cheers cds

Reply to
c d saunter

"c d saunter" schrieb im Newsbeitrag news:d84pvd$5pu$ snipped-for-privacy@heffalump.dur.ac.uk...

cycles

This sounds like a almost full parallel approach. Could be quite fast, but also quite resouce hungry.

speed

least

Thats why a BRAM is quite handy. Using both ports you can pull two datas out in one cycle, compare, and write them back on a second cycle.

Regards Falk

Reply to
Falk Brunner

Falk Brunner ( snipped-for-privacy@gmx.de) wrote: : "c d saunter" schrieb im Newsbeitrag : news:d84pvd$5pu$ snipped-for-privacy@heffalump.dur.ac.uk...

: > Bubble sort should actually be quite fast - you can store all 48 values in : > registers, then compare and swap if necessary odd pairs on odd clock : cycles : > and even pairs on even cycles. After 48 cycles the registers should hold : > a sorted data set.

: This sounds like a almost full parallel approach. Could be quite fast, but : also quite resouce hungry.

Yup. Unless the OP posts some details about desired performance it's impossible to know which one is best...

: Thats why a BRAM is quite handy. Using both ports you can pull two datas out : in one cycle, compare, and write them back on a second cycle.

Indeed. There is a nice intermediate level of parallelism availible by using LUT RAMs to build units 16 words deep, each of which sort sequentially, with every other set of sorts crossing the LUT RAM boundaries. This would also be the most complicated case to code :-)

cheers cds

Reply to
c d saunter

snipped-for-privacy@gmail.com schrieb:

As others suggested the big question is: How are you numbers presented? If you have 48 bit serial inputs that present the data MSB first you can build a systolic 48x6 array of units containing two LUTs and two registers . As each cell connects only to three neighbours it can run at extremely high clock rates. The throughput in your case is 7 words per clock.

If your data arrives in words, one word at a time consider a systolic priority queue. (Similar to parallel heap sort). The simplest version has about 48x8 registers and can still run at fairly high speeds. The throughput is one word per clock.

If your data is allready in RAM (on or of chip) sort as you would on a CPU. Counting sort comes to mind. In your case it uses a single BRAM and is guaranteed to complete in 256+2N clock cycles on a single BRAM port. If you use both ports the number of cycles is halved.

3.7 clock cycles per word is not much better than other sort algorithms for a small data set as yours, but the implementation of counting sort is very simple and the number of clock cycles per word will actually decrease if the number of words increases.

Kolja Sulimma

Reply to
Kolja Sulimma

a BRAM-based "software like" solution will surely be the most area efficient; anyway, for signal processing, John might need real time?

here is another solution to find the MEDIAN of M values of N bits.

inputs are stored in M registers. (You might want to shift them along your sampled values) design a M bits inputs, 1 bit output module. The output value is 1 if the number of ones is greater than the number of zeroes. Plug this on all MSb, the output is the MSb of the final result.

Now design quite the same module, but with Mx2+1 inputs. The Mx2 are from your M values, starting with D_M(N downto N-1), the lonely input 'i' is the previous module's output. for each pair, if i=D_M(N), keep D_M(N-1) as is. if not, take not(D_M(N-1)). plug the eventually modified M bits, and plug them into a M-->1 module just like in step 1. you instanciate N-1 of such modules, and chain them. N outputs make the median value requested.

for small M, you can do it in one cycle. with 48, you'll get poor performance, so you'll design a sequential counter, or pipeline the modules. If you try, give feedback, it's interresting.

------------------------------- Now consider this: the method was patented in 92 by Thomson. I don't know if it is still pending... How can they dare patent such a method for a given algorithm? Would you dare patenting for example a fast carry adder??? I think it is a weak patent (haven't checked the claims), because just an implementation of a well know algorithm, and an anteriority shall be findable.

c d saunter wrote:

impossible

Reply to
Stephane

Hi, Here you are for 'carry adder' key word in all patents title field:

PAT. NO. Title

1 5,045,681 Optoelectric ripple carry adder 2 4,931,981 Multi-place ripple-carry adder 3 4,899,305 Manchester carry adder circuit 4 4,839,849 Ripple-carry adder 5 4,704,701 Conditional carry adder for a multibit digital computer 6 4,675,838 Conditional-carry adder for multibit digital computer 7 4,660,165 Pyramid carry adder circuit
Reply to
Weng Tianxiang

Stephane, Can you give a digital example on the median value algorithm or any references to its original paper.

Thank you.

Weng

Reply to
Weng Tianxiang

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.