|
In figure 1 the sequential quicksort show near perfect behaviour. The high time per element is due to implementation details. For example, the sequential quicksort is implemented as a general-purpose library routine using function calls to compare elements, whereas the parallel ones are specifically written for integer sorting and have the compare part hard-coded into the sort functions.
Both of the parallel algorithms show unreasonable performance for less than 100 integers. This increase for small number of elements can certainly be limited by applying the parallel threshold limits more intelligently. Far worse is the Improved Parallel Quicksort for one processor. The relative sorting time for 1 processor increases markedly with more elements, although no other setting of the parallel threshold provided better performance for four processors.
|
|
For the string sort shown in figure 2 the parallel algorithms were implemented similarly (but slightly more efficient) to the C-library quicksort, and with better limits for the go parallel thresholds. The hump at 1000 elements for Parallel Quicksort should disappear with better tuning. The degradation for Improved Quicksort for small number of elements is caused by a shared-memory allocation in the sort algorithm.3
Both for string and integer sorting, the ``Improved'' Parallel Quicksort performs worse than Parallel Quicksort. The implementation of Improved Quicksort is far from optimal though, and with some more effort it should be possible to get at least twice as fast.
|
|
In figure 3 the comparison time per element is large compared to the overhead of the sorting algorithm. The memory allocation mentioned above still limits the Improved Quicksort performance, but it does indeed outperform Parallel Quicksort for more than 100 elements. As can be seen, the performance for both parallel algorithms on one processor is approximately the same as for the sequential Quicksort.
The Improved Quicksort is very sensitive to the time to compare two elements (compare figure 1, 2, and 3). If the time to compare two elements is large compared to the overhead of the sorting routine, the Improved Quicksort is more efficient. Adding more processors should also improve the efficiency relative to Parallel Quicksort.
Secondly, there is thread latency. The actual time that each thread run is so short that the operating system and FastThread mechanism combine to make some threads execute and finish before the others have even started. Interrupts and context switches by the operating system may delay thread execution several milliseconds, which is enough to loose parallelism with short sorting times.
Even worse is that the interrupts and context switches may come in the middle of a critical section, such as while thread switching, or while the thread has a lock.
For a real example, look at the processor event trace from the enumeration algorithm sorting 100 integers in figure 4. The areas with a lot of numbers run together above are the actual sorting, the dense hatching is thread management overhead, the sparse hatching represents idle time, and the black areas are user thread processing. The sorting started when the bottom three processors started executing, and stopped when they went idle at the end. As can be seen, the time actual spent sorting is quite small compared to the total time. The other algorithms show similar behaviour.
There is really not so much room for improvement, short of redesigning the whole system. A tighter thread management (part of the operating system), and leaving one processor free for interrupt handling would improve the performance slightly.
In practice, if the compare times for the elements to be sorted are on the same order of magnitude as the overhead for the sorting functions, none of the parallel algorithm are useful for less than 10000 elements.