Subsections

Experimental Results for Internal Algorithms

The sorting algorithms are very sensitive to the ``go parallel'' threshold. In general, a higher threshold gives shorter execution time for one processor but lower parallelism. In timing my algorithms I've used fixed values for each input data set, adjusted to provide good parallelism and the shortest execution time.


Sorting Times

In an ideal case the sort time should be proportional to n log2 n with a constant number of processors. Looking at figure 1, 2, and 3 you can see that only the C library Quicksort (included for reference) has this relationship, and that most of the others perform better with more elements.

Figure 1: Relative sorting time for integer sort.
\begin{figure}\psfig{file=figures/unsigned.time.ps,width=\linewidth} \end{figure}


Table 7: Relative sorting time in μ s for integers with one processor.
  Parallel Improved Sequential
n Quicksort Parallel QS Quicksort
101 6.0 11.8  
102 4.0 7.0 11.1
103 3.4 6.5 10.8
104 4.0 8.1 10.9
105 3.5 9.9 11.1
106 3.2 11.1 11.1
2 ·106 3.3    



Table 8: Relative sorting time in μ s for integers with four processors.
  Parallel Improved
n Quicksort Parallel QS
101 6.0 11.2
102 4.0 7.1
103 3.4 6.4
104 2.3 4.0
105 4.0 2.7
106 2.7 2.8
2 ·106 2.8  


In figure 1 the sequential quicksort show near perfect behaviour. The high time per element is due to implementation details. For example, the sequential quicksort is implemented as a general-purpose library routine using function calls to compare elements, whereas the parallel ones are specifically written for integer sorting and have the compare part hard-coded into the sort functions.

Both of the parallel algorithms show unreasonable performance for less than 100 integers. This increase for small number of elements can certainly be limited by applying the parallel threshold limits more intelligently. Far worse is the Improved Parallel Quicksort for one processor. The relative sorting time for 1 processor increases markedly with more elements, although no other setting of the parallel threshold provided better performance for four processors.

Figure 2: Relative sorting time for random string sort.
\begin{figure}\psfig{file=figures/string.time.ps,width=\linewidth} \end{figure}


Table 9: Relative sorting time in μ s for random strings with one processor.
  Parallel Improved Sequential
n Quicksort Parallel QS Quicksort
101 11.0   20.0
102 8.7 72.5 15.2
103 12.4 31.2 16.4
104 11.6 27.7 17.6
105 12.0 27.2 17.9
2 ·105 12.7 27.8 18.7



Table 10: Relative sorting time in μ s for random strings with four processors.
  Parallel Improved
n Quicksort Parallel QS
101 11.4  
102 8.7 69.7
103 11.0 21.4
104 5.0 9.6
105 3.8 7.7
2 ·105 3.9 7.9


For the string sort shown in figure 2 the parallel algorithms were implemented similarly (but slightly more efficient) to the C-library quicksort, and with better limits for the go parallel thresholds. The hump at 1000 elements for Parallel Quicksort should disappear with better tuning. The degradation for Improved Quicksort for small number of elements is caused by a shared-memory allocation in the sort algorithm.3

Both for string and integer sorting, the ``Improved'' Parallel Quicksort performs worse than Parallel Quicksort. The implementation of Improved Quicksort is far from optimal though, and with some more effort it should be possible to get at least twice as fast.

Figure 3: Relative sorting time for 1000 character records.
\begin{figure}\psfig{file=figures/equal.time.ps,width=\linewidth} \end{figure}


Table 11: Relative sorting time in μ s for 1000 character records with one processor.
  Parallel Improved Sequential
n Quicksort Parallel QS Quicksort
10 920   1404
30 1161 1763 1404
100 1199 1320 1276
300 1383 1325 1337
1000 1478 1453 1327



Table 12: Relative sorting time in μ s for 1000 character records with four processors.
  Parallel Improved
n Quicksort Parallel QS
10 920  
30 1159 1789
100 640 631
300 632 425
1000 530 372


In figure 3 the comparison time per element is large compared to the overhead of the sorting algorithm. The memory allocation mentioned above still limits the Improved Quicksort performance, but it does indeed outperform Parallel Quicksort for more than 100 elements. As can be seen, the performance for both parallel algorithms on one processor is approximately the same as for the sequential Quicksort.

The Improved Quicksort is very sensitive to the time to compare two elements (compare figure 1, 2, and 3). If the time to compare two elements is large compared to the overhead of the sorting routine, the Improved Quicksort is more efficient. Adding more processors should also improve the efficiency relative to Parallel Quicksort.


Parallelism

The failure to achieve any parallelism for small number of elements is due to several factors. First of all, to go parallel means the creation of threads, which is expensive compared to the compare-exchange operation.

Secondly, there is thread latency. The actual time that each thread run is so short that the operating system and FastThread mechanism combine to make some threads execute and finish before the others have even started. Interrupts and context switches by the operating system may delay thread execution several milliseconds, which is enough to loose parallelism with short sorting times.

Even worse is that the interrupts and context switches may come in the middle of a critical section, such as while thread switching, or while the thread has a lock.

Figure 4: Event trace of Enumeration sort for 100 integers
\begin{figure*}\psfig{file=figures/enum100-4.ps,width=12cm} \hfill
\psfig{file=figures/legend.ps,width=3cm} \end{figure*}

For a real example, look at the processor event trace from the enumeration algorithm sorting 100 integers in figure 4. The areas with a lot of numbers run together above are the actual sorting, the dense hatching is thread management overhead, the sparse hatching represents idle time, and the black areas are user thread processing. The sorting started when the bottom three processors started executing, and stopped when they went idle at the end. As can be seen, the time actual spent sorting is quite small compared to the total time. The other algorithms show similar behaviour.

Figure 5: Parallelism for integer sort
\begin{figure}\psfig{file=figures/unsigned.ps,width=\linewidth} \end{figure}

There is really not so much room for improvement, short of redesigning the whole system. A tighter thread management (part of the operating system), and leaving one processor free for interrupt handling would improve the performance slightly.

In practice, if the compare times for the elements to be sorted are on the same order of magnitude as the overhead for the sorting functions, none of the parallel algorithm are useful for less than 10000 elements.

Figure 6: Parallelism for random string sort
\begin{figure}\psfig{file=figures/string.ps,width=\linewidth} \end{figure}



Footnotes

... algorithm.3
The shared memory allocator is implemented with file mapping, which is incredibly slow. A better shared-memory implementation would correct the problem.
© Ola Sigurdson 1991-08-20 Ola@Sigurdson.SE