How much time does it need to sort 1 million random 64-bit/32-bit integers?

How many I/O pins does that take?

-- The suespammers.org mail server is located in California. So are all my other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited commercial e-mail to my suespammers.org address or any of my other addresses. These are my opinions, not necessarily my employer's. I hate spam.

Reply to
Hal Murray
Loading thread data ...

OK, so I looked it up... it's the same as the Batcher sort/merge that I've played with before. Sorting 64 items requires 543 compare/swap blocks, and the longest path goes through

21 of these blocks. Each block consists of a 64-bit magnitude comparator and two 64-bit 2:1 muxes to do the swap.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which 
are not the views of Doulos Ltd., unless specifically stated.
Reply to
Jonathan Bromley

Hi Jonathan, Could you please give the paper address so that I am able to have a look at the algorithm?

Thank you.

Weng

Reply to
Weng Tianxiang

This is where I found it...

formatting link

and it references the same Knuth book that I used to find out about Batcher sort.

-- Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK snipped-for-privacy@MYCOMPANY.com

formatting link

The contents of this message may contain personal views which are not the views of Doulos Ltd., unless specifically stated.

Reply to
Jonathan Bromley

Hi Jonathan, Thank you very much.

Weng

Reply to
Weng Tianxiang

Hi Brannon, As NIST website quoted, bitonic sorting takes O((log n)2/2) stages (or steps) with n/2 comparators at each stage.

Can you release your design running frequency for 64*64 bitonic sorting and what type of Xilinx chip or larger is to be used?

Thank you.

Weng

Reply to
Weng Tianxiang

I'm not sure that question makes a whole lot of sense. On the SGI box I was getting 128bits per clock cycle at 200MHz on a 2v6000. The number of pins is irrelevant. The sort algorithm was only running at 100MHz. And it doesn't need to run that fast because it requires at least a few clock cycles to build up enough data to make the sort worth it.

Reply to
Brannon

Actually, I was just thinking, on my Starbridge HC box, I can get

128bits x 4 memory channels x 66MHz. 512@66MHz = 34Gb/s. 128@200MHz = 25Gb/s. Can anybody outrun 34Gb/s input to their FPGA these days? I know that using all the rocket i/o / LVDS / whatever that current chips should support input rates significantly faster than that -- but who actually has a board that implements it?

What rate is SRC getting with their direct FSB connection? I still dream about a board using an HTX slot, or even a cheap board with some large Spartans and 4 PCIe in and 4PCIe out (10Gb/s each direction for a few hundred $). These sort algorithms are all worthless if you cannot get your data in and out of the chip fast enough.

Reply to
Brannon

When you claim your FPGA solution will process 4096 bits of data per clock cycle questioning how your solution gets 4096 bits of data per clock cycle in and out of the FPGA makes a lot of sense.

Reply to
nospam

Hi Jonathan, It seems you are very familiar with Batcher algorithm.

  1. I would like to know why you get 543 compare/swap blocks?

It has (lg N)*((lg N)+1)/2 stages based on formula of Knuth Vol 3, p.113. For N= 64, it has 21 stages.

With n/2 comparators at each stage, it means it has 32 comparators at each stage.

32*21 = 672 /= 543?

  1. How did you get the conclusion: the longest path goes through 21 of these blocks.

I think because it is pipelined, the longest path goes through only 1 block, why it is 21 blocks?

Thank you.

Weng

Reply to
Weng Tianxiang

Yes, that's where I get my figure for 21 levels of logic.

Not all levels have the maximum population of n/2 blocks.

By counting them :-) See Knuth's algorithm. The little console- mode C program, attached, implements the algorithm and shows you the sorting network as ASCII-art. Make sure its output is rendered in a fixed-width font.

I think it would be quite easy to modify the program so that it writes a VHDL or Verilog netlist.

Of course, if you pipeline it then you can reduce the critical path to only one level of compare/swap - but you then get 21 clocks of latency. Your choice.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // Batcher's merge-exchange sort // Follows notation in Knuth, Sorting and searching, p112 // // This is (nearly!) the best way to create a purely combinational // sorting network, given that the basic available sorting operation // is to compare two values and swap them if they are out of order. // The hardware structures created by this program are easy to // implement, and can readily be pipelined to improve throughput. // // // Revision info: // // 1.0 Jonathan Bromley 18 Feb 2001 // first version // 2.0 Jonathan Bromley 31 Jan 2002 // - improved argc/argv checking // - added "usage" message if user supplies bad arguments // - added statistics line // - improved comments // There's still a problem with wasteful use of space on // lines where compare/exchange operations overlap. // The program shows the correct set of swap operations, // but it prints them in a slightly clumsy way in some situations. // To fix this, we would need to gather up all the swaps // at any given logic level, and sort them before printing. // I can't be bothered doing that just now.

#include #include

// find /lg2(n)\, i.e. number of bits needed to represent 0..n-1 // int ceil_lg2(int n) { int t=1; n--; while (n/=2) t++; return t; }

// print separator between sets of compare/exchange // void sep(int n, int p, int q, int r, int d) { int i; printf("p,q,r,d = %2d,%2d,%2d,%2d |", p,q,r,d); for (i=1; i=a) { // Overlap! We need to start on a new line. // Print to the end... while (++i 0) { // Make sure a, b are in order (paranoia!!) if (a>b) { t=a; a=b; b=t; } // Leading spaces (left margin) if (i==0) printf(" "); // Move forward to the starting position, printing blanks while (++i < a) printf(" |"); // Print starting position printf(" O"); // leaves i==a // Move forward to the end position, printing links while (++i < b) printf("=="); // Print end position printf("=O"); // leaves i==b } }

// Print statistics info // void stats(int c, int l, int n) { printf("\nSorted %d elements using:\n", n); printf("%4d levels of logic;\n", l); printf("%4d compare/exchange modules.\n\n", c); }

// Main program // void Batcher( int N ) {

// Variables required by Batcher's algorithm. // The very short names exactly match the names of // variables in Knuth's description. int t, p, q, r, d, i;

// Keep track of how many compares we did, and how many levels // of logic it required int comps=0; int levels=0;

// M1: initialise p t = ceil_lg2(N); p = 1

Reply to
Jonathan Bromley

Yes, I see what you're saying now. I didn't realize I had claimed to be running that fast currently. I was talking about parallelizing the data once it was inside the chip -- some kind of serial to parallel operation. That's what I do currently. Still, I think 4096 bits per clock cycle is possible on a v5 using 533MHz LVDS input with 512 pin pairs should allow you to run the search algorithm at 66MHz. If you layed out the board well, you should be able to run some DCI DDR stuff at 266MHz using the newer chips. That's not out of spec, is it? I thought the DDR2 memory ran that fast. Currently my fastest input is

512bits at 66MHz.
Reply to
Brannon

So here are my results. First, I've never ran 64x64. I didn't mean to imply that I could -- just that it was possible. I can run at 100MHz on a 2v6000 with 16x64bit. It should run twice that fast on a V4. I compiled a 32x64bit and got this result: 50837 LUTs and 72720 FFs. That's larger than my chip. I compiled the 64x64bit version and got

140461 LUTs and 199888 FFs. That's pretty darn big. It includes registers for parallelizing the input which is coming in at 64bits per clock for those examples. I'm sure my implementation is not the most efficient possible, but I still think you'll need a pretty large chip to do any kind of serious parallel sorting.
Reply to
Brannon

Hi Brannon, Thank you for sharing your information with us. You are experienced expert in this regard. I appreciate your work and I have never seen a project with such huge resource demand. I was said a v2-6000 costs nearly $8K.

By the way, I cannot find 'bitonic' name in Microsoft Encarta computer dictionary. What does it mean?

Weng

Brann> > Can you release your design running frequency for 64*64 bitonic sorting

Reply to
Weng Tianxiang

Hi Jonathan, Thank you for your thorough understanding the Batcher algorithm and its program in C.

Weng

Reply to
Weng Tianxiang

2v-6000-4 is $2k. Don't pay any more than that for one.

I have no idea what 'tonic' is refering to. I got all my information including the exact hardware implementation from this page:

formatting link

Reply to
Brannon

Hi Jonathan, "1. I would like to know why you get 543 compare/swap blocks?

It has (lg N)*((lg N)+1)/2 stages based on formula of Knuth Vol 3, p.113. For N= 64, it has 21 stages. Yes, that's where I get my figure for 21 levels of logic. With n/2 comparators at each stage, it means it has 32 comparators at each stage.

32*21 = 672 /= 543? Not all levels have the maximum population of n/2 blocks. "

I read Knuth's Volumn Batcher's algorhtm and didn't find the formula for the number of comparators. Do you have a ready formula for it?

Thank you.

Weng

Reply to
Weng Tianxiang

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.