In most real applications Quicksort far outperforms the others. However, Quicksort has a worst case running time of $O(n^2)$, although the number of worst cases encountered can be reduced to almost none by choosing the pivot elements (see below) with some intelligence. [2]
None of the algorithms based on the ``divide and conquer'' approach (mergesort and quicksort) provides optimal performance when parallelized. The reason is that the top-level divide involves all the data and must be done sequentially.
Take, for example, the Perfect Shuffle sort as described in [5]. It requires on the order of $n$ threads for $n$ elements, and has an asymptotic running time of $O(\log^2 n)$. Implemented on a shared memory machine it requires approximately ten times as many operations per element as sequential Quicksort. The sorting must also execute synchronously, with a synchronization operation at each of its $O(\log^2 n)$ stages. Clearly, one processor executing sequential Quicksort would perform better than this except in very special cases.
All the other multicomputer algorithms have similar limitations, and cannot be used in real applications. (After all, what good is a parallel algorithm using many processors if a sequential algorithm on one processor outperforms it?) I therefore turned my attention to the sequential algorithms that have performed well on uniprocessor computers.
In the algorithm in table 2, one array holds the elements to be sorted and a second array receives the result.
As the thread creation time is, in most cases, far longer than the time to compare and exchange two elements, the algorithm cannot start a new thread for each partitioning. Instead, the number of elements to be partitioned is compared to a hard-coded limit: if the number of elements is sufficiently large, a new thread is created, otherwise a recursive function call is made.
The algorithm given in table 3 is the one used for the string sorting measurements, and it is slightly more optimized than the one used for integer sorting. It operates on an array of pointers to the elements to be sorted, comparisons dereference those pointers to reach the elements themselves, and a parallel threshold controls when new threads are created.
In the algorithm in table 4, two parallel thresholds control the partitioning and the sort recursion, respectively. Most of the algorithm is identical to Parallel Quicksort, except for the parallel partitioning step.
In words, the parallel partitioning step works as follows: Choose the middle element of the array as the pivot. Set a lower index variable to the beginning and an upper index variable to the end of the array. Start one partitioning thread for each processor available. Thread $k$ starts with element $k$ of the array and steps through the array in increments of $p$, the number of processors. For example, if there are four processors, thread three will examine the third, seventh, and eleventh element, and so on.
If the examined element is less than the pivot, lock the lower index variable, place the element into the temporary array at the position given by the lower index, increment the index, and unlock. Otherwise, if it is not the pivot element itself, lock the upper index variable, place the element at the position given by the upper index, decrement the upper index, and unlock.
As sequential partitioning is very efficient, I had to spend much effort tuning and tweaking the parallel partitioning part in order to get it reasonably fast. For example, the locked index access in step (3) is buffered so that a thread only has to access the locks once every 32 elements. The implementation is far from optimal though, and there is still much room for improvement.
External sorting performance is very dependent on assumptions about the problem, such as data size, I/O system, organization of input and output files, etc. To be useful, the performance estimate must also take the I/O time for input and output into account. In many cases the I/O time is large compared to the comparison and sorting overhead, and the problem changes from one of multiprocessor sorting to one of overall system optimization.
Lack of time prevented me from doing anything but a few theoretical investigations.
The External Tree sort assumes that the total main memory available to all processors is equal to the input data size. In this case, a shared memory computer could just as well use an internal sorting algorithm.
The Pipelined External sort assumes that a certain number of processors (processes) is available. This is not unrealistic, but the algorithm also uses a correspondingly large number of files, all of which must be read and written in parallel. The file requirement disqualified it from consideration.
In the algorithm in table 5, $m$ is the number of elements that will fit in main memory at one time. This should be a relatively large number in most cases. For example, if the main memory is 50 MB and the element size is 1000 B, $m$ will be approximately 50000.
There are two steps in the algorithm.
First, sorted subfiles are produced by reading $m$ elements at a time from the input file, sorting them using the Parallel Quicksort internal sort, and writing the sorted elements.
In the second step the $n/m$ subfiles are merged, all at the same time, producing the output file. If the subfiles are to reside on the same physical device, the merge step must assume adequate buffering, for example a multi-element buffer for each file, in order to reduce seek overhead.
The time to sort $n$ elements, assuming sequential I/O, is
\begin{eqnarray*} t(n) & = & \frac{n}{m} \left( m t_r + t_\alpha \frac{m}{p} \log_2 m + m t_w \right) \\ & & {} + n \left( t_w + t_r + \frac{n}{m} t_\beta \right), \end{eqnarray*}
where $t_r$, $t_w$, $t_\alpha$, and $t_\beta$ are the read, write, sort, and merge times per element, respectively. Expanding the expression,
the sorting time is $t(n) = 2n(t_r + t_w) + t_\alpha \frac{n}{p} \log_2 m + \frac{n^2}{m} t_\beta$. If we further assume that $\frac{t_\alpha}{p} \log_2 m \ll t_r + t_w$ and $\frac{n}{m} t_\beta \ll t_r + t_w$, then the sorting time can be approximated by $t(n) \approx 2n(t_r + t_w)$,
which is close to optimal given the assumption about I/O and comparison times.
A further refinement would be to perform the merge step in parallel, although the input would have to be relatively large in order to get any noticeable performance benefit.
Given the probability that two elements are identical in their leading bytes, together with the per-element read, seek-and-read, and sort times, the sorting time can be estimated.
This is close to optimal, provided that the probability of identical leading bytes is close to zero.
where the parameters are the number of memory pages per processor (at least two), the number of elements per page, and the times to move an element in main memory and to compare two elements, respectively. The other notation is the same as in ``Simple External sort'' above. The elements are assumed to be read and written a page at a time. Of course, the total number of elements resident in memory times the element size equals the amount of main memory used. The algorithm also assumes that the I/O is distributed between several devices so that there is no I/O contention between the different processors.
With the time approximations given above, it is possible to do theoretical comparisons of the Simple External and Improved Block Bitonic sorts.
Using only a small number of elements per processor for the Improved Block Bitonic sort might degrade I/O performance. However, with adequate buffering and one I/O device per subfile, the performance should be within the limits given above.
The examples above show that the Improved Block Bitonic sort has higher parallelism, but worse worst-case performance than the Simple External sort.