In most real applications Quicksort far outperforms the others. However, Quicksort has a worst-case running time of $O(n^2)$, although the worst cases can be made vanishingly rare by choosing the pivot elements (see below) with some intelligence. [2]

None of the algorithms based on the ``divide and conquer'' approach (mergesort and quicksort) provides optimal performance when parallelized. The reason is that the top-level divide involves all the data and must be done sequentially.

Take, for example, the Perfect Shuffle sort as described in [5]. It requires $n/2$ threads and has an asymptotic limit of $O\left(\frac{n}{p} \log_2^2 n\right)$. Implemented on a shared-memory machine, it requires approximately ten times as many operations per element as sequential Quicksort. The sorting must also execute synchronously, for a total of $n \log_2^2 n$ synchronization operations. Clearly, one processor executing sequential Quicksort would perform better than this except in very special cases.

All the other multicomputer algorithms have similar limitations, and cannot be used in real applications. (After all, what good is a parallel algorithm using $p$ processors if a sequential algorithm on one processor outperforms it?) I therefore turned my attention to the sequential algorithms that have performed well on uniprocessor computers.

In the algorithm in table 2, $X$ is the array of elements to be sorted, and $R$ is the result array.

As the thread creation time$^{2}$ is far longer than the time to compare and exchange two elements in most cases, the algorithm cannot start a new thread for each partitioning. Instead, the number of elements to be partitioned is compared to a hard-coded limit; if the number of elements is sufficiently large, the algorithm creates a new thread, otherwise it just makes a recursive function call.
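The control structure described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the threshold value, the function name, and the use of Python's `threading` module (which, due to the interpreter lock, shows the structure rather than the speedup) are all assumptions.

```python
import threading

# Hypothetical threshold; the paper uses a hard-coded limit tuned to the
# thread-creation cost, not this particular value.
PARALLEL_THRESHOLD = 10_000

def parallel_quicksort(a, lo=0, hi=None):
    """Sort a[lo:hi] in place; spawn a thread only for large partitions."""
    if hi is None:
        hi = len(a)
    if hi - lo <= 1:
        return
    # Simple Hoare-style partition around the middle element.
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi - 1
    while i <= j:
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    threads = []
    for sub_lo, sub_hi in ((lo, j + 1), (i, hi)):
        if sub_hi - sub_lo > PARALLEL_THRESHOLD:
            # Large partition: worth the thread-creation overhead.
            t = threading.Thread(target=parallel_quicksort,
                                 args=(a, sub_lo, sub_hi))
            t.start()
            threads.append(t)
        else:
            # Small partition: a plain recursive call is cheaper.
            parallel_quicksort(a, sub_lo, sub_hi)
    for t in threads:
        t.join()
```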

The algorithm given in table 3 is the one used for the string sorting measurements, and it is slightly more optimized than the one used for integer sorting. $P_{0 \ldots n-1}$ is an array of pointers to the elements to be sorted, $EP_i$ refers to the element pointed to by $P_i$, and $s$ is the parallel threshold.

In the algorithm in table 4, $s_{\mathrm{partition}}$ and $s_{\mathrm{recurse}}$ are the parallel thresholds for partitioning and sort recursion, respectively. Most of the algorithm is identical to Parallel Quicksort, except for the parallel partitioning step.

In words, the parallel partitioning step works as follows: Choose the middle element of the array as the pivot. Set a lower index variable to the beginning and an upper index variable to the end of the array. Start one partitioning thread for each processor available. Thread $j$ starts with element $j$ in the array and steps through the array in steps of length $p$. For example, if there are four processors, thread three will examine the third, seventh, eleventh element, and so on.

If the examined element is less than the pivot, lock the lower index variable $l$, place the element into the temporary array at the position given by the lower index, increment the index, and unlock. Otherwise, if it is not the pivot element itself, lock the upper index variable $u$, place the element at the position given by the upper index, decrement the upper index, and unlock.
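A minimal sketch of this strided partitioning step is given below. The function and variable names are assumptions for illustration; only the slot reservation is done under the lock, so each thread writes its element into a slot no other thread can claim.

```python
import threading

def strided_partition(x, p):
    """Partition x around its middle element using p strided worker threads.

    Elements less than the pivot are packed from the low end of a temporary
    array; the rest (except the pivot itself) from the high end.
    """
    n = len(x)
    pivot_idx = n // 2
    pivot = x[pivot_idx]
    tmp = [None] * n
    lock_l, lock_u = threading.Lock(), threading.Lock()
    idx = {'l': 0, 'u': n - 1}          # lower and upper index variables

    def worker(j):
        # Thread j examines elements j, j+p, j+2p, ...
        for k in range(j, n, p):
            if x[k] < pivot:
                with lock_l:            # reserve a slot at the low end
                    pos = idx['l']
                    idx['l'] += 1
                tmp[pos] = x[k]
            elif k != pivot_idx:
                with lock_u:            # reserve a slot at the high end
                    pos = idx['u']
                    idx['u'] -= 1
                tmp[pos] = x[k]

    threads = [threading.Thread(target=worker, args=(j,)) for j in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    tmp[idx['l']] = pivot               # the pivot lands in its final slot
    return tmp, idx['l']
```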

As sequential partitioning is very efficient, I had to spend much effort tuning and tweaking the parallel partitioning part in order to get it reasonably fast. For example, the locked index access in step (3) is buffered so that a thread only has to access the locks every 32 elements. The implementation is far from optimal though, and there is still much room for improvement.

External sorting performance is very dependent on assumptions about the problem, such as data size, I/O system, organization of input and output files, etc. The performance estimate must also take into account the I/O time for input and output to be useful. In many cases the I/O time is large compared to the compare and sort overhead times, and the problem changes from one of multiprocessor sorting to one of overall system optimization.

Lack of time prevented me from doing anything but a few theoretical investigations.

The External Tree sort assumes that the total main memory available to all processors is equal to the input data size. In this case, a shared memory computer could as well use an internal sorting algorithm.

The Pipelined External sort assumes that there are $\log_2 n + 1$ processors (processes) available. This is not unrealistic, but the algorithm also uses $2 + 4\log_2 n$ files, all of which must be read and written in parallel. The file requirement disqualified it from consideration.

In the algorithm in table 5, $m$ is the number of elements that will fit in main memory at one time. This should be a relatively large number in most cases. For example, if the main memory is 50 MB and the element size is 1000 B, $m$ will be approximately 50000.
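The arithmetic behind $m$ can be written out directly (the function name is illustrative):

```python
def elements_in_memory(mem_bytes, elem_bytes):
    """Number of elements m that fit in main memory at one time."""
    return mem_bytes // elem_bytes

# The example from the text: 50 MB of memory, 1000 B elements.
m = elements_in_memory(50 * 10**6, 1000)
```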

There are two steps in the algorithm.

First, $n/m$ sorted subfiles are produced by reading $m$ elements at a time from the input file, sorting them using the Parallel Quicksort internal sort, and writing the sorted elements.

In the second step the subfiles are merged, all at the same time, producing the output file. If the subfiles are to reside on the same physical device, the merge step must assume adequate buffering, for example $m/(n/m+1)$-element buffers for each file, in order to reduce seek overhead.
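The two steps can be sketched in a few lines. This is an in-memory stand-in, assuming lists in place of files and Python's built-in sort in place of the Parallel Quicksort internal sort; a real implementation would read and write buffered files as described above.

```python
import heapq

def external_sort(elements, m):
    """Two-step external sort sketch: make n/m sorted runs, then merge them."""
    # Step 1: read m elements at a time and sort each chunk.  The paper uses
    # the Parallel Quicksort internal sort here; sorted() stands in for it.
    runs = [sorted(elements[i:i + m]) for i in range(0, len(elements), m)]
    # Step 2: merge all runs at once into the output.
    return list(heapq.merge(*runs))
```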

The time to sort $n$ elements, assuming sequential I/O, is

\begin{eqnarray*} t(n) & = & \frac{n}{m} \left( m t_r + t_\alpha \frac{m}{p} \log_2 m + m t_w \right) \\ & & {} + n \left( t_w + t_r + \frac{n}{m} t_\beta \right), \end{eqnarray*}

where $t_r$ and $t_w$ are the read and write times per element, $t_\alpha$ is the internal-sort time constant, and $t_\beta$ is the merge time per element and subfile. If $n/m \le \log_2 n$, this reduces to

$$ t(n) \approx 2n\,(t_r + t_w) + n \log_2 n \left( \frac{t_\alpha}{p} + t_\beta \right). $$

A further refinement would be to do the merge step in parallel, although $n/m$ would have to be relatively large in order to get any noticeable performance benefit.
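The full model and its approximation can be compared numerically. All parameter values below are illustrative assumptions, not measurements from the paper; the point is only that the two expressions agree when $n/m$ is near $\log_2 n$.

```python
from math import log2

def t_full(n, m, p, t_r, t_w, t_a, t_b):
    """Full external-sort time model: run formation plus merge."""
    return (n / m) * (m * t_r + t_a * (m / p) * log2(m) + m * t_w) \
        + n * (t_w + t_r + (n / m) * t_b)

def t_approx(n, p, t_r, t_w, t_a, t_b):
    """Approximation valid when n/m <= log2(n)."""
    return 2 * n * (t_r + t_w) + n * log2(n) * (t_a / p + t_b)
```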

If $\nu$ is the probability that two elements are identical in their first $r$ bytes, and $t_r$ is the read time, $t_{sr}$ the seek-and-read time, and $t_s$ the sort time per element, respectively, the sorting time will be

$$ t(n) = n t_r + (t_s + \nu t_{sr})\, n \log_2 n. $$

Here $b = n/(mp)$ is the number of memory pages per processor (at least two), and $k$ is the number of elements per page.

With the time approximations given above, it is possible to do theoretical comparisons of the Simple External and Improved Block Bitonic sorts.

Using only $bk=3$ elements per processor for the Improved Block Bitonic sort might degrade I/O performance. However, with adequate buffering and one I/O device per subfile, the performance should be within the limits given above.

The examples above show that the Improved Block Bitonic sort has higher parallelism, but worse worst-case performance than the Simple External sort.

$^{2}$About 90 $\mu$s for FastThreads.