Examined Algorithms

In all of the algorithms described below, n and p denote the number of elements to be sorted and the number of processors, respectively.

Sequential Algorithms

Of the sequential algorithms having an ``average'' running time of O(n log2 n), Heapsort, Mergesort and Quicksort are the most commonly used. Mergesort and Quicksort can be parallelized quite simply, whereas Heapsort requires somewhat more effort.

In most real applications Quicksort far outperforms the others. However, Quicksort has a worst case running time of O(n2), although worst cases can be made very rare by choosing the pivot elements (see below) with some intelligence [2].
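One common pivot rule of this kind is median-of-three; the report does not state which rule was actually used, so the following C sketch is only an illustration of the idea:

    /* Median-of-three pivot selection, a common way to make Quicksort's
     * O(n^2) worst case rare on (nearly) sorted input.  A sketch only;
     * not necessarily the pivot rule used in the measurements. */
    static int median3(int a, int b, int c)
    {
        if ((a < b) != (a < c)) return a;   /* a lies between b and c */
        if ((b < a) != (b < c)) return b;   /* b lies between a and c */
        return c;
    }

Called as median3(a[lo], a[mid], a[hi]) before partitioning, it makes already-sorted input behave like the average case.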

None of the algorithms based on the ``divide and conquer'' approach (Mergesort and Quicksort) provides optimal performance when parallelized. The reason is that the top-level divide involves all the data and must be done sequentially.
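A back-of-the-envelope bound makes this concrete. Assuming a per-element partition cost c, and that only the recursive calls (not the partitioning itself) run in parallel, the critical path through the partitioning steps limits the speedup regardless of p:

\begin{eqnarray*}
T_p & \geq & c \left( n + \frac{n}{2} + \frac{n}{4} + \cdots \right) \approx 2cn, \\
S_p & = & \frac{T_1}{T_p} \leq \frac{c \, n \log_2 n}{2cn} = \frac{\log_2 n}{2}.
\end{eqnarray*}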

Parallel Algorithms

Internal

I started by looking at the different parallel algorithms mentioned in the literature. After some calculation and a few test runs, I realized that none of the algorithms for multicomputer architectures, such as Bitonic Sort, Shuffle Sort, etc., would be suitable for a multiprocessor like the Symmetry. All of them are based on some form of message or element passing, or assume that the memory available is private to each processor, which is simply too inefficient.

Take, for example, the Perfect Shuffle sort as described in [5]. It requires n/2 threads and has an asymptotic limit of O((n/p) (log2 n)^2). Implemented on a shared-memory machine, it requires approximately ten times as many operations per element as sequential Quicksort. The sorting must also execute synchronously, for a total of n (log2 n)^2 synchronization operations. Clearly, one processor executing sequential Quicksort would perform better than this except in very special cases.

All the other multicomputer algorithms have similar limitations, and cannot be used in real applications. (After all, what good is a parallel algorithm using p processors if a sequential algorithm on one processor outperforms it?) I therefore turned my attention to the sequential algorithms that have performed well on uniprocessor computers.

Enumeration Sort

This algorithm uses 2n memory and has a running time of O(n2), which of course makes it unusable for anything but very small problems. I thought that the lack of synchronization and the overall simplicity might make it worth considering for small problems, n < 100.

In the algorithm in table 2, X is the array of elements to be sorted, and R is the result array.


Table 2: Enumeration sort.
1. for i ← 1 to n do
   - k ← 1; e ← X_i
   - for j ← 1 to n do
     - if X_j < e then k ← k + 1
   - R_k ← e
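A C rendering of table 2, with zero-based indexing and ties broken by element index so that every result slot is written exactly once; this independence between iterations is what makes the algorithm synchronization-free:

    /* Enumeration sort: compute each element's rank by counting, then
     * place it directly.  Each iteration of the outer loop is independent,
     * so the iterations can be handed to separate threads with no locking. */
    void enumeration_sort(const int *X, int *R, int n)
    {
        for (int i = 0; i < n; i++) {        /* one thread per i, if desired */
            int k = 0;
            for (int j = 0; j < n; j++)
                if (X[j] < X[i] || (X[j] == X[i] && j < i))
                    k++;                     /* rank of X[i], ties by index */
            R[k] = X[i];                     /* sole writer of R[k] */
        }
    }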


Parallel Quicksort

Quicksort works by selecting a pivot element and moving all elements less than (or equal to) the pivot to one side of the array, and the remaining elements to the other. If there is more than one element in either of the two subsequences, it recursively calls itself on that subsequence. This is quite easily parallelized by starting a new process or thread for each recursion [5,6].

As the thread creation time2 is in most cases far longer than the time to compare and exchange two elements, the algorithm cannot start a new thread for each partitioning. Instead, the number of elements to be partitioned is compared to a hard-coded limit; if the number of elements is sufficiently large, a new thread is created, otherwise an ordinary recursive function call is made.

The algorithm given in table 3 is the one used for the string-sorting measurements; it is slightly more optimized than the one used for integer sorting. P0...n-1 is an array of pointers to the elements to be sorted, EPi refers to the element pointed to by Pi, and s is the parallel threshold.


Table 3: Parallel Quicksort.
- Choose pivot element EP_0
- Partition P_0...n-1 around the pivot into P_0...j and P_j+1...n-1
- If P_0...j holds more than s elements, start a new thread sorting it; otherwise sort it with a recursive call
- Sort P_j+1...n-1
- Wait for thread to complete
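A condensed C sketch of table 3, using POSIX threads in place of the FastThreads package and sorting integers directly rather than strings through a pointer array; S_LIMIT plays the role of s, and its value here is made up:

    #include <pthread.h>

    #define S_LIMIT 4096        /* parallel threshold s (an assumed value) */

    struct job { int *a; int lo, hi; };

    static void qsort_par(int *a, int lo, int hi);

    static void *qsort_thread(void *arg)
    {
        struct job *j = arg;
        qsort_par(j->a, j->lo, j->hi);
        return NULL;
    }

    static void qsort_par(int *a, int lo, int hi)
    {
        if (lo >= hi)
            return;
        /* Hoare partition around a[lo] */
        int pivot = a[lo], i = lo - 1, j = hi + 1;
        for (;;) {
            do i++; while (a[i] < pivot);
            do j--; while (a[j] > pivot);
            if (i >= j)
                break;
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
        if (j - lo + 1 > S_LIMIT) {          /* subsequence large: new thread */
            pthread_t tid;
            struct job jb = { a, lo, j };
            pthread_create(&tid, NULL, qsort_thread, &jb);
            qsort_par(a, j + 1, hi);
            pthread_join(tid, NULL);         /* wait for thread to complete */
        } else {                             /* small: ordinary recursion */
            qsort_par(a, lo, j);
            qsort_par(a, j + 1, hi);
        }
    }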


Improved Parallel Quicksort

The problem with Parallel Quicksort is that the partitioning step must be completed before the subsequences can be sorted in parallel, which limits parallelism during the first log2 p phases. The solution is, of course, to do the partitioning in parallel as well. The improved parallel Quicksort algorithm uses 2n pointers of memory and increases parallelism by performing the partitioning step in parallel.

In the algorithm in table 4, spartition and srecurse are the parallel thresholds for partitioning and sort recursion, respectively. Most of the algorithm is identical to Parallel Quicksort, except for the parallel partitioning step.

In words, the parallel partitioning step works as follows: Choose the middle element of the array as the pivot. Set a lower index variable to the beginning and an upper index variable to the end of the array. Start one partitioning thread for each available processor. Thread j starts with element j of the array and steps through it in strides of p. For example, if there are four processors, thread three will examine the third, seventh, eleventh element, and so on.

If the examined element is less than the pivot, lock the lower index variable l, place the element in the temporary array at the position given by the lower index, increment the index, and unlock. Otherwise, if it is not the pivot element itself, lock the upper index variable u, place the element at the position given by the upper index, decrement the index, and unlock.

As sequential partitioning is very efficient, I had to spend much effort tuning and tweaking the parallel partitioning in order to get it reasonably fast. For example, the locked index access in step (3) is buffered so that a thread only has to take the locks once every 32 elements. The implementation is far from optimal, though, and there is still much room for improvement.
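The buffering might look roughly like this; POSIX mutexes stand in for the Symmetry's locks, the batch size 32 is from the text, and the rest of the names are mine:

    #include <pthread.h>

    #define BATCH 32                       /* take the lock once per 32 elements */

    static pthread_mutex_t l_lock = PTHREAD_MUTEX_INITIALIZER;
    static int l;                          /* shared lower index into the temp array */

    /* Reserve BATCH consecutive slots at the low end of the temporary
     * array; the calling thread then fills them without holding the lock. */
    static int reserve_lower(void)
    {
        pthread_mutex_lock(&l_lock);
        int base = l;
        l += BATCH;
        pthread_mutex_unlock(&l_lock);
        return base;
    }

The catch is the final, partially filled batch: its unused slots must be handed back or compacted before the recursion step, which is part of what makes the code fiddly.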


Table 4: Improved parallel quicksort.
1. Setup
   - Allocate a temporary pointer array T_0...n-1
   - Choose the middle element as pivot
   - l ← 0; u ← n-1
2. If n > spartition, start one partitioning thread per processor; thread j examines elements j, j+p, j+2p, ...
   - if EP_i < pivot then
     - Lock l; T_l ← P_i; l ← l + 1; Unlock l
   - else if P_i is not the pivot then
     - Lock u; T_u ← P_i; u ← u - 1; Unlock u
3. Place the pivot in the remaining slot and swap T and P
4. Recurse on the two subsequences as in Parallel Quicksort, using srecurse as the thread threshold
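A C sketch of one partitioning thread (step 2 of table 4), again with POSIX threads and without the batching optimization described above; it places elements rather than pointers for brevity, and all names are mine:

    #include <pthread.h>

    static pthread_mutex_t lo_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t hi_lock = PTHREAD_MUTEX_INITIALIZER;
    static int lo_idx, hi_idx;        /* shared lower/upper indices (l and u) */
    static int *elem, *tmp;           /* input array and temporary array */
    static int pivot_val, pivot_pos, nelem, nproc;

    /* Thread j scans elements j, j+p, j+2p, ... and places each one
     * on the proper side of the temporary array. */
    static void *partition_thread(void *arg)
    {
        int j = *(int *)arg;          /* each thread gets its own index */
        for (int i = j; i < nelem; i += nproc) {
            if (i == pivot_pos)
                continue;                     /* the pivot is placed last */
            if (elem[i] < pivot_val) {
                pthread_mutex_lock(&lo_lock);
                tmp[lo_idx++] = elem[i];      /* fill from the bottom */
                pthread_mutex_unlock(&lo_lock);
            } else {
                pthread_mutex_lock(&hi_lock);
                tmp[hi_idx--] = elem[i];      /* fill from the top */
                pthread_mutex_unlock(&hi_lock);
            }
        }
        return NULL;
    }

When all threads have finished, lo_idx and hi_idx meet, the pivot is written to the remaining slot, and the two sides of tmp become the subsequences for the recursion step.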


Parallel External Algorithms

External sorting performance depends heavily on assumptions about the problem, such as data size, I/O system, and the organization of the input and output files. To be useful, a performance estimate must also take the I/O time into account. In many cases the I/O time is large compared to the comparison and sorting overhead, and the problem changes from one of multiprocessor sorting to one of overall system optimization.

Lack of time prevented me from doing anything but a few theoretical investigations.

Useless Algorithms

Many different external sorting algorithms have been proposed for multicomputer environments. Most of them are not suitable for multiprocessor computers. For example, in [5] two classes of external algorithms are given: the External Tree sort and the Pipelined External sort.

The External Tree sort assumes that the total main memory available to all processors equals the input data size. In that case, a shared-memory computer could just as well use an internal sorting algorithm.

The Pipelined External sort assumes that log2(n) + 1 processors (processes) are available. This is not unrealistic, but the algorithm also uses 2 + 4 log2 n files, all of which must be read and written in parallel. The file requirement disqualified it from consideration.

Simple External Sort

The assumptions for the ``Simple External sort'' are that I/O is done sequentially with no parallelism (i.e., the input and output files are read and written sequentially, one element at a time), and that the time to compare two elements is small compared to the time to read or write an element. The goal is, of course, to produce a sorted version of the input file as quickly as possible.


Table 5: Simple External sort.
- for i ← 1 to n/m do
  - Read m elements from the input file
  - Sort them using Parallel Quicksort
  - Write them to subfile i
- while elements remain do
  - Select the smallest element among the heads of subfiles 1 ... n/m
  - Write element to output file


In the algorithm in table 5, m is the number of elements that fit in main memory at one time. This should be a relatively large number in most cases: for example, with 50 MB of main memory and an element size of 1000 B, m will be approximately 50000.

There are two steps in the algorithm.

First, n/m sorted subfiles are produced by reading m elements at a time from the input file, sorting them using the Parallel Quicksort internal sort, and writing the sorted elements.
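In C, this first step might look as follows; qsort() stands in for the Parallel Quicksort internal sort, and the element type (int) and run-file naming are mine:

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Step 1: read m elements at a time, sort them in memory, and write
     * each sorted batch to its own subfile.  Returns the number of runs. */
    static int make_runs(FILE *in, long m)
    {
        int *buf = malloc(m * sizeof *buf);
        int runs = 0;
        size_t got;
        while ((got = fread(buf, sizeof *buf, m, in)) > 0) {
            qsort(buf, got, sizeof *buf, cmp_int);    /* internal sort */
            char name[32];
            snprintf(name, sizeof name, "run%04d.tmp", runs++);
            FILE *sub = fopen(name, "wb");
            fwrite(buf, sizeof *buf, got, sub);
            fclose(sub);
        }
        free(buf);
        return runs;                                  /* n/m subfiles */
    }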

In the second step the subfiles are merged, all at the same time, producing the output file. If the subfiles reside on the same physical device, the merge step requires adequate buffering, for example m/(n/m + 1) element buffers for each file, in order to reduce seek overhead.
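A sketch of the merge step, keeping only one element per subfile in memory for clarity; a real implementation would use the m/(n/m + 1)-element buffers mentioned above:

    #include <stdio.h>
    #include <stdlib.h>

    /* Step 2: merge k sorted run files into out by repeatedly writing
     * the smallest head element.  A linear scan over the k heads is
     * fine for small k; a heap would be better for many runs. */
    static void merge_runs(FILE **run, int k, FILE *out)
    {
        int *head = malloc(k * sizeof *head);
        char *live = malloc(k);
        for (int i = 0; i < k; i++)       /* prime one head per run */
            live[i] = fread(&head[i], sizeof *head, 1, run[i]) == 1;
        for (;;) {
            int best = -1;
            for (int i = 0; i < k; i++)   /* pick the smallest live head */
                if (live[i] && (best < 0 || head[i] < head[best]))
                    best = i;
            if (best < 0)
                break;                    /* every run is exhausted */
            fwrite(&head[best], sizeof *head, 1, out);
            live[best] = fread(&head[best], sizeof *head, 1, run[best]) == 1;
        }
        free(head);
        free(live);
    }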

The time to sort n elements, assuming sequential I/O, is

\begin{eqnarray*}
t(n) & = & \frac{n}{m} \left( m t_r + t_\alpha \frac{m}{p}
           \log_2 m + m t_w \right) + {} \\
     & & n \left( t_w + t_r + \frac{n}{m} t_\beta \right),
\end{eqnarray*}

where tr, tw, tα, and tβ are the read, write, sort, and merge times per element, respectively. Assuming that

n/m ≤ log2 n,

the sorting time is O(n log2 n). If we further assume that m = n/log2 n and that n is large, the sorting time can be approximated by

t(n) ≈ 2n (tr + tw) + n log2 n (tα/p + tβ),

which is close to optimal given the assumptions about I/O and comparison times.
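The approximation follows by substituting n/m = log2 n into the full expression:

\begin{eqnarray*}
t(n) & = & n (t_r + t_w) + \frac{t_\alpha}{p} n \log_2 m
           + n (t_w + t_r) + n \log_2 n \; t_\beta \\
     & \approx & 2n (t_r + t_w) + n \log_2 n \left( \frac{t_\alpha}{p} + t_\beta \right),
\end{eqnarray*}

since log2 m = log2 n - log2 log2 n ≈ log2 n for large n.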

A further refinement would be to do the merge step in parallel, although n/m would have to be relatively large in order to get any noticeable performance benefit.

External Index Sort

The External Index sort is an algorithm that produces a sorted index to the input file instead of a sorted output file. It only works for problems whose elements are large, stored on a random-access device, and relatively random (i.e., most of the elements should differ in their first bytes). In the algorithm in table 6, r is the number of bytes stored in main memory from each element, for example four.


Table 6: External Index sort.
1. Read file.
   - for i ← 1 to n do
     - Store the first r bytes of element i, and its position, in the index
2. Sort the index using Parallel Quicksort, comparing as follows:
   - Compare the stored r-byte prefixes
   - If they are identical, seek and read the full elements and compare them
   - Return comparison result.
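The comparison step of table 6 could be sketched in C as follows; the index layout, the fixed prefix length R, and the assumption that elements are at most 1024 B are all mine:

    #include <stdio.h>
    #include <string.h>

    #define R 4                     /* cached prefix bytes per element (r) */

    struct idx { unsigned char pre[R]; long off; };

    static FILE *data;              /* input file, opened elsewhere */
    static long elem_size;          /* full element size, assumed <= 1024 B */

    /* Compare the in-core r-byte prefixes first; only on a tie
     * (probability nu) seek and read the full elements. */
    static int cmp_index(const struct idx *a, const struct idx *b)
    {
        int c = memcmp(a->pre, b->pre, R);
        if (c != 0)
            return c;               /* decided by the cached prefix */
        /* tie: fetch both full elements (the expensive t_sr case) */
        unsigned char ea[1024], eb[1024];
        fseek(data, a->off, SEEK_SET); fread(ea, 1, elem_size, data);
        fseek(data, b->off, SEEK_SET); fread(eb, 1, elem_size, data);
        return memcmp(ea, eb, elem_size);
    }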


If ν is the probability of two elements being identical in their first r bytes, and tr, tsr, and ts are the read, seek-and-read, and sort times per element, respectively, the sorting time will be

t(n) = n tr + (1/p) (ts + ν tsr) n log2 n.

This is close to optimal, provided that ν is close to zero.

Improved Block Bitonic Sort

If the merge step in the Simple External sort described above is replaced with a Block Bitonic Merge of the n/m sorted subfiles, the result is the Improved Block Bitonic sort as given in [7, pages 304-307], together with an expression for its execution time.

In that expression, b = n/(mp) is the number of memory pages (input buffers) per processor (at least two), k is the number of elements per page, and tm and tc are the times to move an element in main memory and to compare two elements, respectively. The other notation is the same as for the ``Simple External sort'' above. Elements are assumed to be read and written a page at a time. Of course, bpk times the element size equals the amount of main memory used. The algorithm also assumes that the I/O is distributed over several devices so that there is no I/O contention between the different processors.

Simple External Sort Compared to Improved Block Bitonic Sort

With the time approximations given above, it is possible to do theoretical comparisons of the Simple External and Improved Block Bitonic sorts.

Example 1.

Assume the available main memory to be 80 MB, n = 10^7, the element size 100 B, and p = 4; we then get m = 800000 and b = 3. Reasonable assumptions for the times are tα = 10 μs, tβ = 5 μs, tr = tw = 100 μs, tm = 10 μs, and tc = 2 μs. The page size is set to k = 1 element for the Improved Block Bitonic sort, in order to get reasonable performance. The time for the Simple External sort is then 5115 s, and for the Improved Block Bitonic sort 4089 s, an improvement of about 20%.
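The Simple External sort figure can be checked directly against the formula above (with n/m = 12.5, log2 800000 ≈ 19.6, so the sort term is about 39.2 s per subfile and the read and write terms 80 s each):

\begin{eqnarray*}
t(10^7) & = & 12.5 \left( 80\,\mathrm{s} + 39.2\,\mathrm{s} + 80\,\mathrm{s} \right)
              + 10^7 \left( 100\,\mu\mathrm{s} + 100\,\mu\mathrm{s}
              + 12.5 \cdot 5\,\mu\mathrm{s} \right) \\
        & = & 2490\,\mathrm{s} + 2625\,\mathrm{s} \;\approx\; 5115\,\mathrm{s}.
\end{eqnarray*}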

Example 2.

If only one processor is used, p = 1, the sorting time is 6586 s for the Simple External sort and 10356 s for the Improved Block Bitonic sort, which is far worse.

Using only bk=3 elements per processor for the Improved Block Bitonic sort might degrade I/O performance. However, with adequate buffering and one I/O device per subfile, the performance should be within the limits given above.

The examples above show that the Improved Block Bitonic sort has higher parallelism, but worse worst-case performance than the Simple External sort.



Footnotes

2. About 90 μs for FastThreads.