How long does it take to sort 1 million random 64-bit/32-bit integers?

Hi, is there a website that lists the time needed to sort 1 million random 64-bit/32-bit integers using the fastest software sorting algorithm on a PC?

Thank you.

Weng

Reply to
Weng Tianxiang

Apparently not.

Note that, even if you could, such results would be highly hardware-dependent (although the /relative/ ordering of various sorting algorithms should be fairly consistent on any given platform). PCs vary wildly in performance characteristics. The first PC I used was an Atari ST, running at - what was it, 8 MHz? The first IBM-compatible I used was an 8088 running at - er, I forget, but it was in the 8-12 MHz range, I think. The first PC I /owned/ was an 80386-SX running at 20 MHz. Whoa, that baby was *fast*. Currently, I'm using a PIII which is probably a brazillion times faster than the 386. The term "a PC" doesn't mean a lot when it comes to benchmarking.

I suggest you just write up the algorithms yourself, or grab them off the Net, and try them on your own hardware.

Beware of the results being corrupted by caching. Do not measure I/O times. Do each test a few hundred times.
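For instance, a minimal timing harness along these lines (a sketch only; sort_under_test is a placeholder for whatever algorithm you are measuring, and clock() resolution varies by platform):

#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Placeholder for the algorithm being measured. */
void sort_under_test(int *a, size_t n);

/* Average time in seconds to sort n random ints, over 'reps' runs.
   Input generation and the copy happen outside the timed region, so
   no I/O or setup cost is measured; repeating the run averages out
   cache and scheduling noise. */
double time_sort(size_t n, int reps)
{
    int *data = malloc(n * sizeof *data);
    int *work = malloc(n * sizeof *work);
    double total = 0.0;
    int r;
    size_t i;

    for (r = 0; r < reps; r++) {
        for (i = 0; i < n; i++)
            data[i] = rand();                    /* fresh random input */
        memcpy(work, data, n * sizeof *work);    /* untimed copy */

        clock_t t0 = clock();
        sort_under_test(work, n);
        total += (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
    free(data);
    free(work);
    return total / reps;
}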

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
Reply to
Richard Heathfield

Given that you can't even find the relevant usenet newsgroup to ask this question... I highly doubt it.

That aside...

The "fastest" algorithm is going to depend on a few things; for example the actual 'randomness' of the data (its distribution), the format of said data and the results (e.g. do you need to sort in-place?), and the implementation of the algorithm (by which I mean, is the algorithm optimal for implementation in a typical PC?).

You definitely won't find any answers on the absolute time required by a "PC" - whatever that is. And even if you did, the answer would be obsolete in about a week's time.

You really need to find one or two candidate algorithms and then implement them on your chosen platform to get any meaningful solution to your problem.

Regards,

--
Mark McDougall, Engineer
Virtual Logic Pty Ltd, 
Reply to
Mark McDougall

Weng Tianxiang wrote:

That depends on a lot of parameters, but I can give you one datapoint as an example:

- Sorting 16777216 pairwise-different 24-bit numbers takes no time at all (2^24 distinct 24-bit numbers must be exactly 0 through 16777215, so the sorted output is known without looking at the input).

Kolja Sulimma

Reply to
Kolja Sulimma

Hi, actually I did read a related article on a website where the author listed the timings he measured, but I couldn't find it again.

For example, he listed the times alongside the number of data items, the frequency of his computer, and all related parameters that might affect performance; he also listed the performance of different algorithms for comparison and drew some conclusions.

Thank you.

Weng

Reply to
Weng Tianxiang

Since you posted on fpga and vhdl newsgroups in addition to programming, are you interested in an FPGA solution? Most of the information that one *can* find on the web deals with serial implementations. Parallel implementations for an FPGA such as the Bitonic sort can speed up the implementation significantly.

Reply to
John_H

A post to c.p, c.a.f, c.l.v for 3 different sets of answers.

Wikipedia is probably a good place to start for a software-only solution.

It largely depends on overall memory performance, since very little work is done relative to the cost of the memory accesses, and also on what you really want to sort: whether the data is already mostly sorted or always randomly ordered, and whether you are only counting frequencies rather than doing a true sort.

Research quicksort and mergesort, plus radix sort and counting sorts for the larger cases, and bubble or insertion sort for the tiny cases. If you have Knuth's 3-volume set, that will help; the material has been repeated many times in every CS text. Most give you the answer in terms of ideal machines using big-O notation. Knuth's reference books were written in the 60s, when machines ran below 1 MIPS with memory in the same ballpark as the CPU, so no caches.

For the PC case: for very small sorts, say up to many thousands of items, the entire sort can be done from the L1/L2 caches and is likely to be a good test of the CPU and cache system. Bigger sorts can use these smaller sorts in succession with some sort of merging. 1 million random items isn't even remotely going to take longer than a blink, since L2 caches are big enough to hold a good fraction of the 1M dataset.

On any *nix system you already have a basic quicksort on the command line, but IIRC its performance is not stable for data that is already nearly sorted. If you code up quicksort yourself, you can randomize the input order to keep quicksort at its expected O(N log N) time; otherwise it can degrade badly. Mergesort always takes a somewhat larger O(N log N) time since it does more work, but it doesn't degrade with data patterns. A quicksort with pre-randomization probably comes out even.
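A minimal sketch of that pre-randomization idea, using the standard library's qsort() (rand() is only for illustration; it is a weak generator):

#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort() on 32-bit unsigned keys. */
static int cmp_u32(const void *pa, const void *pb)
{
    uint32_t a = *(const uint32_t *)pa, b = *(const uint32_t *)pb;
    return (a > b) - (a < b);
}

/* Fisher-Yates shuffle: randomizing the input order keeps quicksort's
   expected running time at O(N log N) regardless of the initial order. */
static void shuffle_u32(uint32_t *a, size_t n)
{
    size_t i;
    for (i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        uint32_t t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Usage: shuffle_u32(data, n); qsort(data, n, sizeof data[0], cmp_u32); */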

Now if computers still had a form of memory with the performance of SRAM but the cost and size of DRAM, the answer would be much easier to predict, and radix sort would come out handily, since its sort time follows the number of partition passes over the word times the number of items. Most radix sorts use byte-sized digits, which means 4 passes of testing a byte and writing each item into 1 of 256 respective buckets, which can get very large, especially if the data is mostly the same. If you have enough of this flat memory to hold the data on input and output, then you can do it in 2 passes with 65536 buckets, though that will break modern PC caches.

Since most PCs don't have SRAM-like memory but instead simulate it with a cache hierarchy, the answer gets more complicated to predict. Fully random uncached accesses into PC main memory are surprisingly slow, approaching several hundred nanoseconds. That's called the memory wall.

If you sort N randomly ordered items and sweep N from very small to enormous, quicksort and the others follow O(N log N) time, but with abrupt steps in the constant factor that increase as the hardware works ever harder to simulate flat memory access times. IIRC the constant can vary by over a factor of 10. It isn't anywhere near constant except on ideal machines with something like RLDRAM or infinitely big caches. I believe that mergesort will have a much more consistent constant factor.

For the FPGA case: each compare-and-swap in a merge sort could be done as an atomic operation needing 2 reads and possibly 2 conditional writes. With a dual-port memory that might take 1 or 2 clocks in a pipelined fashion for sorts that fit into block RAM. If the data is spread over many sort engines, the parallelism may make up for the much lower clock.

I might favor a 3-pass radix sort, since you could effectively use block RAMs as temporary buckets. Basically, stream all the DRAM values through to 2048 distinct buckets, keyed in turn on the upper, middle, and lower 11-bit fields. As each bucket fills, save it back to DRAM and reset it to empty. The same can be done more slowly as 4 passes with 256 buckets, or even faster as 2 passes with 65536 buckets on a really big FPGA. The sort time will approach 6M (or respectively 8M, 4M) memory cycles at the fastest possible rate for just the keys. Some complexity comes in managing the buckets in the output stream into seemingly contiguous buckets.

If you are just doing a counting sort of values limited to x distinct values, things get much simpler: just funnel a memory read stream through x counters. Does x have to be 2^32?
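As a rough software illustration of that counting idea, assuming the keys really are limited to x = 65536 distinct values (the function name is mine, not from any particular library):

#include <stdint.h>
#include <string.h>

/* Counting sort for keys limited to x = 65536 distinct values:
   one streaming read pass to count, one pass to rewrite the output.
   With x = 2^32 the counter table alone would be 16 GB of 32-bit
   counters, which is why "does x have to be 2^32?" matters. */
void counting_sort_u16(uint16_t *a, size_t n)
{
    static uint32_t count[65536];
    size_t i, k = 0;
    uint32_t v, c;

    memset(count, 0, sizeof count);
    for (i = 0; i < n; i++)            /* streaming read: count each key */
        count[a[i]]++;

    for (v = 0; v < 65536; v++)        /* emit each value count[v] times */
        for (c = count[v]; c > 0; c--)
            a[k++] = (uint16_t)v;
}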

John Jakson transputer guy

Reply to
JJ

Here's a ballpark figure:

Let n = 1000000

n*log2(n) / 3 GHz = 6.64385619 milliseconds

- Oliver

Reply to
Oliver Wong

Hi Wong, n*log2(n) is roughly the minimum number of comparison operations that a comparison sorting algorithm must perform.

There is also overhead to run the loop and to swap each compared pair of data items into place.

That is the reason the formula has a 14.5 coefficient.

Weng


Reply to
Weng Tianxiang

Weng Tianxiang wrote:

Forget about comparisons on PC hardware. You are interested in the number of loads and stores, because these are a lot slower than comparisons. A typical comparison sort will have two loads and two stores per comparison.

Radix sort with 11-bit digits requires three O(N) passes, for a total of about 12M loads and 6M stores. If your cache is big enough, you could also do only two passes using 16-bit digits.

Kolja Sulimma

Reply to
Kolja Sulimma

Hi Kolja, I am talking about real-world measurements, not theoretical ones:

  1. Input data are random;
  2. Input data are 32-bit or 64-bit;
  3. Sorting must be stable.

Radix sort cannot be applied to my case: how can you sort by 11-bit digits when the data set is 32 bits wide? Imagine a set of data in which all the lowest 11 bits are the same but the upper 21 bits differ.

Now I have a clearer idea of what I really want: as someone indicated above, 14.5*N*(lg N) was derived for Knuth's MIX machine and is not useful for computing real PC-world times.

So now I want to know the coefficient that replaces Knuth's in the following formula for a real PC environment: Coefficient * N * (lg N).

PC clock frequency and cache size are not my concern, because once a coefficient is given, the PC frequency and cache size can be attached to the result, and everyone knows how they are related.

If a table of coefficients indexed by frequency and cache size were available, it would be convenient for everyone to get a feel for how fast a sorting algorithm is.

Thank you.

Weng

Reply to
Weng Tianxiang

Weng Tianxiang wrote:

Maybe this would be the time to read up on how radix sort works? Hint: try Google.

Radix sort is stable. It can be applied to all sorting problems; it just might not be the fastest one. E.g., unlike other methods, it will run at half the speed for 64-bit values compared to 32-bit values. Hint: in my post I spoke about three passes with 11-bit digits or two passes with 16-bit digits. How does that relate to the 32 bits of the values?

Definitely not. Cache access time is access-pattern dependent.

Kolja Sulimma

Reply to
Kolja Sulimma

Would you add another axis to the table for issue width? Retirement width? The number of ALUs? What about ALUs of a certain type? Perhaps an entry for the number of architected vs. physical registers? What about the latency of a cache hit? The latency of a miss? What about the size of L1 vs. L2 vs. L3? Inclusive or exclusive caches? Interconnect network? Memory interface? Microcode issues? (This list could go on for pages.)

I believe I understand what you're attempting to do but the number of performance altering variables you would be omitting is astounding. That's part of the reason such constants are ignored in complexity theory.

If you would like to make up some magic number for your particular hardware, algorithm, coding style, compiler, etc., feel free. Will that constant mean anything on another system? Generally not.

Reply to
dmackay

In the case of LSD radix sort, you can do both 16-bit counting passes in a single pass (count the high-16-bit bucket and the low-16-bit bucket as you access each element), hence only three passes over the data are needed (one pass to count and two passes to distribute). Of course, you need 65536*2 buckets to hold your counts. If there are one million elements, you would need 32-bit buckets (actually, 20-bit buckets would do, but a 20-bit integer type would be pretty unusual).
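A rough sketch of that scheme for 32-bit keys, to make the pass structure concrete (the function name and the caller-supplied scratch buffer are my own choices, not anything specified in the thread):

#include <stdint.h>
#include <string.h>

/* LSD radix sort of n 32-bit keys using two 16-bit digits.
   One counting pass histograms both digits of every element, then two
   stable distribution passes move the data: three passes in total.
   'tmp' is caller-supplied scratch space holding n elements. */
void radix_sort_u32(uint32_t *a, uint32_t *tmp, size_t n)
{
    static uint32_t lo_count[65536], hi_count[65536];   /* 65536*2 buckets */
    size_t i;
    uint32_t d, c, lo_sum = 0, hi_sum = 0;

    memset(lo_count, 0, sizeof lo_count);
    memset(hi_count, 0, sizeof hi_count);

    /* Single counting pass: histogram low and high digits together. */
    for (i = 0; i < n; i++) {
        lo_count[a[i] & 0xFFFF]++;
        hi_count[a[i] >> 16]++;
    }

    /* Turn counts into starting offsets (exclusive prefix sums). */
    for (d = 0; d < 65536; d++) {
        c = lo_count[d]; lo_count[d] = lo_sum; lo_sum += c;
        c = hi_count[d]; hi_count[d] = hi_sum; hi_sum += c;
    }

    /* Distribution pass 1: stable scatter by low 16 bits, a -> tmp. */
    for (i = 0; i < n; i++)
        tmp[lo_count[a[i] & 0xFFFF]++] = a[i];

    /* Distribution pass 2: stable scatter by high 16 bits, tmp -> a. */
    for (i = 0; i < n; i++)
        a[hi_count[tmp[i] >> 16]++] = tmp[i];
}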

Reply to
Dann Corbit

Hi, I received a Google alert on the keyword "fastest sorting algorithm" this afternoon.

The following is one of its message addresses:

formatting link

"Why? It only takes that run to disprove your assertion. You claimed that the radix sort is the fastest algorithm, this example disproves that.

10 elements, qsort: ~2500 clocks, bucket: ~37500 clocks.

FYI, with 1000 values, the times were: ~275000 for bucket, ~435000 for qsort."

The unit is clocks, which makes it easier for everyone to compare different sorting algorithms and gives me a good sense of how fast a sorting algorithm runs.

The message was from 1998, but the relative ordering is still valid.

Weng

Reply to
Weng Tianxiang


Fastest usually matters for big data sets. In other words, if we have 10 elements, it does not matter much which sorting method is used; the sort will be done much faster than we can sneeze.

And typically, sort algorithms fall back to routines that are asymptotically inferior but faster in practice for smaller sets.

For example, sorting a huge volume of data:

Start with MSD Radix sort at the top level (as a function of keysize and rowcount). Switch to Introspective sort when the set size is under some threshold (keysize * set_rowcount * system_specific_constant > set_rowcount * log2(set_rowcount)). Switch to Insertion sort when the set size is very small (usually 20-30 elements or so).
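A minimal sketch of that kind of dispatch; the back-end routines and the thresholds here are hypothetical placeholders, not tuned values:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical back ends; substitute real implementations. */
void msd_radix_sort(uint32_t *a, size_t n);
void introsort(uint32_t *a, size_t n);
void insertion_sort(uint32_t *a, size_t n);

#define TINY_SET   24          /* "20-30 elements or so" */
#define HUGE_SET   (1u << 20)  /* illustrative threshold only */

/* Distribution sort for huge sets, introspective sort for medium sets,
   insertion sort for tiny sets. */
void hybrid_sort(uint32_t *a, size_t n)
{
    if (n >= HUGE_SET)
        msd_radix_sort(a, n);
    else if (n > TINY_SET)
        introsort(a, n);
    else
        insertion_sort(a, n);
}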

Typically, if a dataset sorts in 10 milliseconds or 20 microseconds, we won't care very much. If it is one day versus one hour, then it matters a lot.

P.S. Here's the world's fastest sort (it only works on sets that have 0 or 1 element):

void worlds_fastest_sort() { return; }

Reply to
Dann Corbit

Hi Dann, I have three good websites dealing with sorting speed:

  1. formatting link
  2. formatting link
  3. formatting link

None of them mentions radix sort as a candidate for the fastest sorting algorithm.

In the last email I referenced a measurement: with 1000 values, the times were ~275000 clocks for bucket sort and ~435000 clocks for qsort.

The measurement was made in 1999.

I would like to have more data like that, in clocks, for the best algorithm on the best CPU available now.

I really don't understand why, for each data item, the measurement above needs about 1000 clocks.

Why? Any comments?

Weng

Reply to
Weng Tianxiang


If a tree falls in a forest with no ear to hear it, does it make a sound?

Radix sort is the fastest way to sort a long list of integers. Period.

1000 values is a trivial set. There are many optimizations that can be done for radix sort. For instance, you can count all the buckets in a single pass for LSD radix sort. That means you have one counting pass and (with 16-bit digits) 2 distribution passes. There is simply no way for a comparison sort to do better.

You were talking about a million objects in the list. With a list like that, Radix or bucket sort will dominate. If you have a billion items, then it is no contest. If (on the other hand) you have very long keys, then the comparison sorts become competitive. But for keys that are only a few bytes long, distribution rather than comparison sorts are the obvious choice.

With special data sets and with small data sets, other sorts will be faster. But you can special case the entry point into your sort routine.

It is not difficult to do it yourself.

Probably a function of CLOCKS_PER_SEC.

Reply to
Dann Corbit

Here's my FPGA solution. I have a bitonic sort core for an FPGA. Fill up a large FPGA and it will do 64 bits by 64 rows every clock cycle, pipelined of course. If you had a high-speed connection to the FPGA, you could stream your data through it and get out sorted sets of 64. You would then run a merge on those on the host processor. It should chop lg(64) off your quicksort or mergesort.
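For what it's worth, here is a small host-side C model of the bitonic compare-exchange network (n must be a power of two); a pipelined FPGA core essentially unrolls these loops into hardware stages:

#include <stddef.h>
#include <stdint.h>

/* Bitonic sort of n 64-bit values, n a power of two, ascending order.
   k is the size of the bitonic subsequences being built; j is the
   compare-exchange distance within each merge stage. */
void bitonic_sort_u64(uint64_t *a, size_t n)
{
    size_t i, j, k;
    for (k = 2; k <= n; k <<= 1) {
        for (j = k >> 1; j > 0; j >>= 1) {
            for (i = 0; i < n; i++) {
                size_t l = i ^ j;
                if (l > i) {
                    int ascending = ((i & k) == 0);
                    if ((ascending  && a[i] > a[l]) ||
                        (!ascending && a[i] < a[l])) {
                        uint64_t t = a[i]; a[i] = a[l]; a[l] = t;
                    }
                }
            }
        }
    }
}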

Reply to
Brannon

As a matter of interest, how big is the bitonic sort? I guess I'm really asking "how many 64-bit magnitude comparators do you need", although there's presumably some swap or routing logic also associated with each comparator. (sorry, I'm not familiar with the bitonic sort algorithm - must go look it up). And how many levels of logic? I guess it's best to pipeline it...

TIA

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
Reply to
Jonathan Bromley
