![]() |
|
|
| Lesson
2: Merge Stage: Distribution and Merging |
|
![]() |
4.3 Distribution and Merging Suppose that the sort stage produces R partitions. The optimal way of merging them is to have each partition in a separate file and then to perform a single R-way merge, However, in practice there will be operational restrictions, The most likely of these is on the number of files that a program can have open at any time. Therefore merging typically requires a series of phases. During each phase records are read from one set of files and merged partitions are written to a second set of files. There are many merging strategies, each having requirements regarding the distribution of the initial partitions. In the following we will assume that at any time we can have at most F files open for reading or writing. The larger the value of F, the faster merging is likely to proceed. A limit on the value of F is most likely to arise from an operating system restriction. We will consider three distribution and merging strategies: 1. balanced N-way merging 2. optimal merging 3. polyphase merging A measure of the efficiency of the merging stage of an external sorting algorithm is the number of passes over the data required to merge the partitions together. The number of passes is defined as follows: total number of records in the sorted file Balanced N-way Merge In the balanced N-way merge the available files are divided as equally as possible into two sets. The initial partitions are distributed as evenly as possible on the files in one set. During each phase of the merge, records on the input files are read and merged partitions are distributed cyclically onto the output files. At the end of a phase what was the input set becomes the output set and vice versa, The method is balanced in the sense that each of the files in the input set contains approximately the same number of records. Figure 4.7 contains a pseudocode algorithm for the balanced N-way merge. The variable input-set-first is used to remember which of the two sets of files (the files numbered 1 through N or those numbered N + 1 through F) is currently the input. The value of input-set-first is established at the start of each phase. During a phase, records are read from the input files and merged partitions are written to the output files. Partitions are written cyclically to the output files in order to distribute them as evenly as possible. If exactly one partition is written out during a particular phase, then merging is over. The value of input-set-first implicitly identifies the file containing the final sorted records. Consider 20 initial partitions and 4 files. We will use i X j to rep-resent i partitions each containing j records. Table 4.1 shows how merging proceeds when using the balanced N-way algorithm with F equal to 4. In the first phase, files 1 and 2 are the input files and files 3 and 4 are the output files. In phase 1, partitions of length 1 are read and, since we are performing two-way merging, partitions containing two records are produced. There are ten partitions produced, which are distributed over files 3 and 4. In phase 2, files 3 and 4 become the input files and files 1 and 2 become the output files. Partitions containing four records are produced, which are the result of merging partitions containing two records. Five partitions are produced in phase 2 and distributed over the two output files. The algorithm proceeds in this way until one phase, in this example phase 5, produces exactly one partition. Note that the balanced N-way merge algorithm, although simple, is far from the best. We are performing only (F/2)-way merging instead of (F – 1)-way merging, which is the best we could do (there must always be at least one output file). In addition, depending on the number of initial partitions and the value of F, during certain phases partitions are liable to be copied from one file to another without being merged with anything. In our example, a partition of length 4 is copied from file 1 to file 3 during phase 3. During phase 4 the same partition is copied from file 3 to file 2. The number of passes that the balanced N-way merge algorithm requires is always the same as the number of phases. This is because all the records are read in each phase. For our 20-partition example, therefore, there are five passes. We will see that this simple relation-ship between phases and passes does not hold for the other two algorithms. The optimal
merge algorithm was described by Lewis and Smith . Initial partitions
are written to separate files, and a record is kept of the length of each
partition. Thus we have a set of files, each containing one partition.
As with the other algorithms we examine, merging proceeds in a number
of phases. During each phase the F – 1 shortest partitions are
read and merged, and the merged partition is written to an output file.
The input files are then removed from the set of files, and the output
file is added to the set. The process of merging is repeated until the
set contains only one file. Table 4.2 shows how
this algorithm operates on the 20 partitions of the earlier example; again
we assume that each initial partition contains exactly one record and
that F = 4. We use x:y to represent file number x containing a partition
of y records. For example, file 21, containing 3 records, is created in
phase 1. In phase 7 file 21 is merged with files 19 and 20 to create file
27, containing 5 records. The number of passes required is 63/20, or 3.15.
The polyphase merge algorithm has neither of the disadvantages of the balanced N-way merge and does not require the record keeping of the optimal merge. However, it does require a more complex initial distribution of partitions. We will consider two algorithms – one for a special case and one for a general polyphase merge. Table
4.3 shows how the merging of 31 partitions with F = 4 would proceed,
(The number of partitions and their initial distribution are carefully
chosen.) We assume again that each initial partition contains exactly
one record. We use the notation introduced earlier for the balanced N-way
example, that is, iXj represents i partitions each containing
j records, The general algorithm will be able to merge an arbitrary number of partitions. However, it will be convenient to consider first an algorithm that requires that the total number of partitions be one of a special set of numbers. Special case. The problem with the polyphase merge is knowing how the initial partitions are to be distributed so that the algorithm can merge them optimally. To determine the initial distribution we work back from the final one. Consider the case where F = 4; that is, in any phase there are three input files. We are concerned with the number of partitions on each of these input files. The final distribution is one partition on one file. Before the final phase we must therefore have one partition on each of the input files. We denote this as You may observe that this merging method, when F = 4, will work optimally if the total number of initial partitions is in the series Ti = 1 i<F i-1 Ti = S Tk i > F K=I-F+1 Special case algorithm. We know in this special case that the total number of initial partitions is a member of the appropriate Fibonacci series. We have seen how each term in the series is associated with a particular distribution of partitions. We have also seen how a distribution is derived from the distribution for the preceding term in the series. This suggests the following algorithm for distributing a number of partitions onto files. Again we illustrate using F = 4. We regard Next, we consider a more general algorithm where it is no longer critical that the total number of partitions to be merged be a member of the appropriate Fibonacci series. General algorithm. We have seen how distribution and merging works optimally if the number of partitions is a member of an appropriate Fibonacci series, If the total is not guaranteed to be a member of the series, a solution is to introduce dummy partitions. Dummy partitions do not occupy file space; in fact, they exist only as numbers in a table, We introduce sufficient dummy partitions to make the total number of partitions a term in the series. The use of dummy partitions has consequences for the section of the sorting algorithm that merges partitions. It must be able to distinguish between dummy partitions and real ones. How are dummy partitions regarded in relation to the files containing real partitions? Note that during merging, the partitions toward the beginning of a file are read more often than those toward the end (consider the ex-ample of Table 4.3). It makes sense, therefore, to treat the dummy partitions as if they appeared at the beginning rather than at the end of a file, In this way real input/output is minimized. For the same reason, it is desirable to spread the dummy partitions as evenly as possible over the files. Figure 4.8 shows a general implementation of the polyphase algorithm*, Variable LEVEL and tables A and D are used in the following way. During the distribution phase, the variable LEVEL records how many targets have been reached, When merging, LEVEL is decreased after each phase; when it reaches zero we have finished. Table A at any time holds a target distribution of partitions, Each target is a row similar to those in Table 4.4. Table D holds the number of dummy partitions required to bring the number of actual partitions on each file up to the appropriate quota in the target. Thus, as we put real partitions on the file, the elements of D are decreased, If table D contains all zeros, we have reached the current target. If, at this point, we have not yet come to the end of the input file, then the target is replaced as in the earlier special case algorithm. The elements of D are modified accordingly. Note that the real partitions are distributed across the files in such a way as to even out the number of dummy partitions, In order to take dummy partitions into account, the merging process operates as follows for each merged partition produced, if D[k] > 0 for all k 1<=k<=(F-1) then increase D[F] by 1 decrease D[k] by 1 1<=k<=(F-1) else merge 1 partition from each FILE[k] where D[k] = 0 decrease D[k] by 1 where D[k] >0
Figure 4.9
also shows the final contents of table D, which holds the distribution
of dummy partitions. Note that only real reads are counted. Detaild
text explanation on how to build table-5 is available.
Comparison of Distribution and Merging Strategies Balanced N-way merging is simple to implement but not very efficient. The algorithm designated as optimal is not the best in all cases, as we have demonstrated with our example, and has record-keeping overhead. The overhead with the polyphase algorithm is proportional
to the order of the merge rather than the number of partitions. See Knuth
for a more detailed analysis and description of sorting algorithms.
|
Required Reading: Chapter 4. |