validity of ... reasons for preferring C over C++

(snip on sorting algorithms and implementations)

The recent suggestion for quicksort is, when doing the recursion (as previously noted, using an explicit stack, not recursive calls), not to process very small partitions, but just return. Then, after the quicksort part is finished, do an insertion sort on the whole array.

If elements are close to their final position, insertion sort (with the right choices) is fast, making up for quicksort's per-partition startup overhead on tiny partitions. (The cutoff partition size might be about three.)
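
(For illustration, a minimal C sketch of that final clean-up pass, assuming the quicksort phase has already left every small partition unsorted. The element type, function name, and comments are mine, not from the post.)

#include <stddef.h>

/* Final clean-up pass: one insertion sort over the whole array.
 * Because the quicksort phase left every element close to its final
 * position, the inner while loop only runs a few steps per element. */
static void insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}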

(Good quicksort doesn't use a fixed array element as the pivot for a partition, but maybe the median of three elements. That pretty much avoids the O(N**2) worst case.)
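
(A hedged sketch of median-of-three pivot selection in C; the element type and function name are illustrative, not from the post. It returns the index of the median of the first, middle and last elements of the partition.)

#include <stddef.h>

/* Return the index of the median of a[lo], a[mid] and a[hi].
 * Only the index variables are permuted; the array is untouched. */
static size_t median_of_three(const int *a, size_t lo, size_t hi)
{
    size_t mid = lo + (hi - lo) / 2;

    if (a[lo]  > a[mid]) { size_t t = lo;  lo  = mid; mid = t; }
    if (a[mid] > a[hi])  { size_t t = mid; mid = hi;  hi  = t; }
    if (a[lo]  > a[mid]) { size_t t = lo;  lo  = mid; mid = t; }
    return mid;
}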

-- glen

Reply to
glen herrmannsfeldt

The qsort() routine in the C library is usually some variant of quicksort (the standard doesn't actually require any particular algorithm). There are a handful of common variants which differ in how they select the pivot and when they switch between recursion and iteration.
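
(For reference, ordinary qsort() usage; whichever algorithm the library uses internally, the interface is the same. The comparator below is the usual overflow-safe idiom; the sample data is made up.)

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *pa, const void *pb)
{
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    return (a > b) - (a < b);   /* avoids the overflow of a - b */
}

int main(void)
{
    int v[] = { 42, 7, 19, 3, 25 };
    size_t n = sizeof v / sizeof v[0];

    qsort(v, n, sizeof v[0], cmp_int);
    for (size_t i = 0; i < n; i++)
        printf("%d ", v[i]);
    printf("\n");
    return 0;
}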

I haven't seen the C++11 standard document so I can't comment on that. std::sort has always been encouraged to be O(n*log(n)) worst case ... but you are correct that it has not been required in the past.

George

Reply to
George Neuner

There are things to recommend either approach - insertion sorting small partitions, or insertion sorting the whole array at once. The latter does reduce the startup overhead, but it is much less cache friendly than immediately sorting the small partitions (which will almost always still be in cache from the partitioning pass that created them). I've not done a serious survey, but most recent implementations I've seen sort small partitions immediately.
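
(A self-contained sketch of the sort-small-partitions-immediately variant. The element type, the SMALL threshold, plain recursion, and the Lomuto-style partition are my illustrative choices, not anything from the posts above.)

#include <stddef.h>

#define SMALL 8   /* illustrative cutoff */

static void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

static void insertion_range(int *a, size_t lo, size_t hi)
{
    for (size_t i = lo + 1; i <= hi; i++) {
        int key = a[i];
        size_t j = i;
        while (j > lo && a[j - 1] > key) { a[j] = a[j - 1]; j--; }
        a[j] = key;
    }
}

/* Quicksort that insertion-sorts a small partition as soon as it is
 * produced, while its contents are still in cache.  lo and hi are
 * inclusive indices. */
void quicksort_small_now(int *a, size_t lo, size_t hi)
{
    if (hi - lo + 1 <= SMALL) {
        insertion_range(a, lo, hi);
        return;
    }

    int pivot = a[hi];            /* plain Lomuto partition, for brevity */
    size_t i = lo;
    for (size_t j = lo; j < hi; j++)
        if (a[j] < pivot)
            swap_int(&a[i++], &a[j]);
    swap_int(&a[i], &a[hi]);

    if (i > lo) quicksort_small_now(a, lo, i - 1);
    if (i < hi) quicksort_small_now(a, i + 1, hi);
}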

In this day and age, "pretty much" is often not good enough. Anyone not considering that inputs might be maliciously constructed is being pretty foolish, IMO. Musser's original Introsort paper is worth reading for its very interesting discussion of just how easy it is to actually create Quicksort inputs that result in quadratic behavior (even with the standard optimizations like median-of-three). The part about Introsort itself is unfortunately less interesting (not that it isn't a bloody good idea, it's just one of those obvious ones in retrospect*, and most of the theoretical basis and implementation considerations can be summed up in a paragraph).

*And caused thousands of foreheads to be slapped, usually accompanied by a statement along the lines of "why didn't I think of that?"
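
(A compact sketch of the Introsort idea for anyone who hasn't read the paper: ordinary quicksort, plus a depth budget of roughly 2*log2(n); any range that exhausts the budget gets heapsorted instead, which caps the worst case at O(n log n). All names, the Lomuto partition, and the int element type are illustrative.)

#include <stddef.h>

static void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

static void sift_down(int *a, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= n) return;
        if (child + 1 < n && a[child + 1] > a[child]) child++;
        if (a[root] >= a[child]) return;
        swap_int(&a[root], &a[child]);
        root = child;
    }
}

static void heapsort_range(int *a, size_t n)
{
    for (size_t i = n / 2; i-- > 0; ) sift_down(a, i, n);
    for (size_t end = n; end > 1; ) {
        swap_int(&a[0], &a[--end]);
        sift_down(a, 0, end);
    }
}

static void introsort_rec(int *a, size_t lo, size_t hi, int depth)
{
    while (lo < hi) {
        if (depth-- == 0) {                 /* budget gone: heapsort it */
            heapsort_range(a + lo, hi - lo + 1);
            return;
        }
        int pivot = a[hi];                  /* Lomuto partition */
        size_t i = lo;
        for (size_t j = lo; j < hi; j++)
            if (a[j] < pivot) swap_int(&a[i++], &a[j]);
        swap_int(&a[i], &a[hi]);

        if (i > lo) introsort_rec(a, lo, i - 1, depth);
        lo = i + 1;                         /* iterate on the right side */
    }
}

void introsort(int *a, size_t n)
{
    int depth = 0;
    for (size_t m = n; m > 1; m >>= 1) depth++;   /* roughly log2(n) */
    if (n > 1) introsort_rec(a, 0, n - 1, 2 * depth);
}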
Reply to
Robert Wessel

(snip, I wrote)

Hmm. How cache unfriendly is the part that goes back from the end of the array?

Seems like something that someone should study in detail, and on a variety of processors.

I suppose it helps to know about the source of the data, but yes, in some cases that might be true. You could always use a random pivot selector. Then all you have to do is find a good seed source that the malicious constructor doesn't know about.
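
(A tiny sketch of random pivot selection; rand() here is only a placeholder, since, as noted, the whole point is a generator and seed the attacker can't predict, and rand() is usually too predictable. The modulo also has a slight bias for very large ranges.)

#include <stdlib.h>
#include <stddef.h>

/* Pick a pivot index roughly uniformly in [lo, hi]. */
static size_t random_pivot(size_t lo, size_t hi)
{
    return lo + (size_t)rand() % (hi - lo + 1);
}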

(snip)

-- glen

Reply to
glen herrmannsfeldt

While the details will vary from implementation to implementation, you're basically looking at an extra trip of the entire array through the cache for the one-big-insertion-sort-at-the-end approach. Yes, hardware prefetchers should manage to deal with that, but you still have to stream the entire dataset into (and out of) cache.

You avoid that by doing the "small" insertion sorts, because the partitions are already in cache, and will already need to be written back to memory (since the partitioning will almost always have reordered the items in the partition). So you get the last phase of sorting done with no additional memory traffic.

Of course that assumes the dataset is too large to fit entirely into cache.

Reply to
Robert Wessel

Not really. It just cuts the number of steps in half: rather than the worst case consistently choosing the highest/lowest element as the pivot, the worst case consistently chooses the second highest/lowest element.

Reply to
Nobody

(snip, I wrote)

OK, if you don't like it, use the random pivot selector.

The problem with original quicksort wasn't that the worst case was so bad, but that it was so common.

Note that the worst case for you is that all the air molecules randomly move to the side of the room you are not sitting in. Thermodynamics doesn't say that it is impossible, only that it is unlikely.

As someone noted, in some cases a worst case can be supplied by a malicious user. For large N, not likely if you use a random pivot selector and the user can't guess the seed.

-- glen

Reply to
glen herrmannsfeldt

It is not only a problem of a malicious user. In hard real-time systems you just have to be able to guarantee worst-case behavior. Saying it is not probable is just not enough.

--
Reinhardt
Reply to
Reinhardt Behm

OTOH, a good (pseudo) random number generator is likely more work than implementing Introsort, and would probably add enough overhead to Quicksort that it would be only marginally faster than plain Heapsort, if at all. And it still doesn't *guarantee* non-quadratic behavior, even if it does make it unlikely (which is an issue in some environments).

IMO, there are four sorts non-specialist programmers should know about, and be ready to use. Insertion (

Reply to
Robert Wessel

(snip, I wrote)

(snip)

Do you consider the probability that your system will be hit by a large meteorite, or that a nuclear bomb will blow up nearby? (The latter gets more likely every day, with countries like Iran trying to build one.)

If you are sorting 10 values, then you probably shouldn't use quicksort, anyway.

If you are sorting a billion values (maybe unusual for embedded systems, but not impossible), and if people aren't choosing the values (that is, natural randomness instead of human randomness), there are 1000000000! possible initial permutations, all reasonably equally probable. Consider not only the absolute worst case but also some nearby bad cases, and the probability of getting a bad case is still only about 1 in 900000000! instead of the absolute worst 1 in 1000000000!.

But as someone noted, the rules all change if someone can intentionally generate a worst case permutation. Assume open source, so that they know the actual algorithm and random number generator in use. If they can't predict the seed, then the probability is up to 1 in (seed choices), which should still be pretty low, but maybe not low enough.

Factorial grows darn fast, so the probabilities decrease very fast with increasing N. I have done plenty of sorts with billions of items, but even with millions, or tens of thousands, the probability is plenty low enough.

-- glen

Reply to
glen herrmannsfeldt

(snip, I wrote)

(I also wrote)

(snip about a very unlikely way to die)

I wondered a little about that. Seems to me that if you do random pivot selection for the large enough partitions, maybe not all that many of them, and use median of three for the smaller ones, you get most of the advantage. For even medium sized N, the number of permutations gets large darn fast, so your main worry is someone intentionally supplying a worst case.
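
(A sketch of that hybrid, assuming a cutoff below which median-of-three is cheap enough; the cutoff value, names, and int element type are illustrative.)

#include <stdlib.h>
#include <stddef.h>

#define BIG_PARTITION 4096   /* illustrative cutoff */

/* Random pivot for large partitions (hard to attack), cheap
 * median-of-three for everything smaller. */
static size_t choose_pivot(const int *a, size_t lo, size_t hi)
{
    if (hi - lo + 1 >= BIG_PARTITION)
        return lo + (size_t)rand() % (hi - lo + 1);

    size_t mid = lo + (hi - lo) / 2;
    if (a[lo]  > a[mid]) { size_t t = lo;  lo  = mid; mid = t; }
    if (a[mid] > a[hi])  { size_t t = mid; mid = hi;  hi  = t; }
    if (a[lo]  > a[mid]) { size_t t = lo;  lo  = mid; mid = t; }
    return mid;
}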

And for some values of N.

When I had a CS course (large number) years ago, for one assignment we wrote insertion sort, selection sort, and bubble sort. For the next one, quicksort, heapsort, and shell sort.

As was mentioned earlier, cache behavior should be considered. My first thought is that heap sort and shell sort aren't so good in cache use, but I haven't thought about it all that much. I've always liked the 3N+1 shell sort; it should work well with interleaved memory, but maybe not with cache.
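
(A sketch of shell sort with that 3N+1 gap sequence (1, 4, 13, 40, ...); the int element type is illustrative.)

#include <stddef.h>

void shellsort(int *a, size_t n)
{
    size_t h = 1;
    while (h < n / 3)
        h = 3 * h + 1;              /* grow through 1, 4, 13, 40, ... */

    for (; h >= 1; h /= 3) {        /* gaps shrink 40 -> 13 -> 4 -> 1 */
        for (size_t i = h; i < n; i++) {
            int key = a[i];         /* insertion sort with stride h */
            size_t j = i;
            while (j >= h && a[j - h] > key) {
                a[j] = a[j - h];
                j -= h;
            }
            a[j] = key;
        }
    }
}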

-- glen

Reply to
glen herrmannsfeldt

Sounds like a Nash equilibrium.

--
Les Cargill
Reply to
Les Cargill
