Subject: Re: Question about Polyphase merge alogrithm -class CIS501
Date: Mon, 12 Nov 2001 16:46:03 -0600
From: "C.Y. Hsieh" <cyhsieh@ksi.edu>
.
The polyphase algorithm is derived by Knuth. So, sometimes it is also called
Knuth's alogrithm as in exercise 4-11, and 4-12.

The purpose of this alogrithm is to have the last merge phase optimize(maximize) the
input files available: i.e., Maximum Input files = Total-Files - 1.

In Table4-2 of optimal merge, the last phase (phase 10)
does not have the optimal condition (it used only two input files).
But this last phase issued  the maximum I/O operations!
So, Knuth figured out a mathematical way to sacrifice the first few
phases, which involes less data, and make the last phase be optimal.

The concept utilizes the Fibonacci series:
A Fibonacci series of order 3 is
1  1  1  3  5  9  17   31 ... (1+1+1=3,   1+1+3=5,   1+3+5=9,   3+5+9=17, ...)
That is every number is the sum of the previous 3 numbers.
So, A Fibonacci number of order 4 is
1  1  1  1  4  7  13  25  49 ... (......., .4+7+13+25=49, ...)

Now if we have 4 files in total, then the optimal (maximum) input files would be 3(one for output).
So, what we need is a Fibonacci series of order 3.

In the following, I will create table 4-4 using reverse engineering.
First we assume that the final merge involved the max. three inputs.
But in this consideration, it will happened only when the  total-input-files number is in Fibonacci series
of order 3;  i.e., 1, 3, 5, 9, 17 or 31 partitions, then we can have the perfect result.

Since, in the text, we assume that total partitions is 20.
So, we need extra 11=(31-20) dummy partitions (Zero record partitions).
(31 is the smallest number no less then 20)
And the distribution of partitions should be 13, 11,7 as derived in table4-4.

(It is important to  know how to derive this table:)
The last phase is always   1   0    0 for order 3                    1   0   0   0 for order 4
and last-1 is always          1   1    1                                     1   1   1   1
and last-2 is always          2   2    1                                     2   2   2   1
last -3                              4   3    2                                     4   4   3   2
last -4                              7   6    4                                     8   7   6   4

The following is the trick to derive  the above distribution table:
To get the number of each row: (1) starting with the rightmost number of a row.
This number is the left most number of the previouse phase (such as 4 in last-3)
The rest of the number is computed as follows:
6= 4+ 2(  the 2 is right above 4 of  last-3)
7= 4+3 (the 3 is right above 6 of  last-3)
8=4+4 (the 4 is right above 7 of  last-3)
If you draw lines on the above table, you can see the realtionship better..

Now let us calculate the distribution of dummy partitions.
The best way to distribute dummy partition over the three inputs are "as evenly as possible".
So, 4 , 4, 3 is the choice.
This leads to the result of the 20 real partition distribution as 13  11  7  Needed
(Derived mathematically in reverse engineering to optimize the last phase)

4   4  3 Dummy (as evenly distrubuted as possible)
-----------------
9   7   4  the calculated real partution distribution
(13-4=9, 11-4=7 and 7-3=4)

This is Figure 4-9.
(The sequence of Ra, Rb, Rc.... is not important. It is the total partitions in each files that is our concern)

Now let me derive Table4-5 as follows:
I will represent the 20 real partitions as twenty M's
(Because each partition has fixed M records in our example)
D is used to represent Dummy partition.

F-1   F-2   F-3   F-4
----------------------------------------
D     D     D ..................................Row-1
D     D     D ..................................Row-2
D     D     D ..................................Row-3
D     D     M .................................Row-4
M    M     M .................................Row-5
M    M     M .................................Row-6
M    M     M .................................Row-7
M    M
M    M
M    M
M    M
M
M
===================================
Now I will merge F-1, F-2 and F-3, if  there are inputs (D and M):(That is from  Row-1 to Row-7),
and ouput the merged data to F-4 as follows:

Phase-1:
F-1   F-2   F-3   F-4
----------------------------------------
D     D     D  ........... Row-1
D     D     D ............ Row-2
D     D     D ............ Row-3
D     D     M ............Row-4
M     M    M ............Row-5
M    M     M ............Row-6
M    M     M ............Row-7
-----------------------------------------
M   M              D (D+D+D=D, just like 0+0+0=0) From Row-1... to be Row -8
M   M              D (D+D+D=D) From Row-2.............................  . to be Row -9
M   M              D (D+D+D=D) .................................................    to be Row-10
M   M              M (D+D+M)   ........................................................   ...Row -11
M                    3M(M+M+M)...........................................................  ..Row-12
M                    3M (M+M+M)..........................................................  ..Row-13
                       3M (M+M+M)..........................................................  ..Row-14
=======================================

Up to now we used up all the input data from F-3.
So, we switch inputs files to F-1 F-2 and F-4, and used F-3 as output file!
And merge the F-1, F-2 , F-4, so long as there are data in these three files, to F-3:
(Row-8 to Row-11.)
This setup the second phase.
Phase-2:
F-1     F-2     F-3     F-4
----------------------------------------
D       D         D       .......... Row-1
D       D         D     ............ Row-2
D       D         D     ............ Row-3
D       D         M     ............Row-4
M      M         M     ............Row-5
M      M         M     ............Row-6
M      M         M     ............Row-7
-----------------------------------------
M      M                  D (D+D+D=D, just like 0+0+0-0) From Row-1.... .Row -8
M      M                  D (D+D+D=D) From Row-2...............................   Row -9
M      M                  D (D+D+D=D)....................................................   Row-10
M      M                  M (D+D+M)........................................................  Row -11
--------------------------------------------------------------------------------------------------------
M               2M       3M (M+M+M)......................................................Row-12
M               2M       3M (M+M+M)......................................................Row-13
                  2M       3M (M+M+M)......................................................Row-14
                 3M........................................................................              to be Row-15
================================================================

We used up all the data in F-2 now, so we switch inputs to F-1, F-3, F-4, and use F-2 as
output file to merge Row-12, Row-13 to F-2:
Phase-3:
F-1     F-2     F-3     F-4
----------------------------------------
D       D       D      ..........  .Row-1
D       D       D       ............Row-2
D       D       D       ............Row-3
D       D       M       ............Row-4
M      M      M        ............Row-5
M      M      M       ............Row-6
M      M      M       ............Row-7
-----------------------------------------
M      M                  D (D+D+D=D, just like 0+0+0-0)           Row -8
M      M                  D (D+D+D=D)                                       Row -9
M      M                  D (D+D+D=D)                                       Row-10
M      M                  M (D+D+M).........................................  Row -11
--------------------------------------------------------------------------------------------------------
M              2M        3M(M+M+M).........................................Row-12
M              2M        3M(M+M+M).........................................Row-13
--------------------------------------------------------------------------------------------------------
        6M    2M        3M(M+M+M)..........................................Row-14
        6M    3M ................................................... .................   Row-15
===============================================================
Again we used up all the data in F-1, so we switch inputs to F-2, F-2 and F-4
and use F-1 as output file to merge Row-14

Phase-4:
F-1     F-2     F-3     F-4
----------------------------------------
D       D       D      ..........  .Row-1
D       D       D       ............Row-2
D       D       D       ............Row-3
D       D       M       ............Row-4
M      M      M        ............Row-5
M      M      M       ............Row-6
M      M      M       ............Row-7
-----------------------------------------
M      M                   D (D+D+D=D, just like 0+0+0-0)           Row -8
M      M                   D (D+D+D=D)                                       Row -9
M      M                   D (D+D+D=D)                                       Row-10
M      M                   M (D+D+M)........................................  .Row -11
--------------------------------------------------------------------------------------------------------
M               2M        3M(M+M+M).........................................Row-12
M               2M        3M(M+M+M).........................................Row-13
--------------------------------------------------------------------------------------------------------
          6M   2M       3M(M+M+M)..........................................Row-14
--------------------------------------------------------------------------------------------------------
11M  6M   3M ................................................... .................  Row-15
==================================================================
Finally, we used up all the data in F-4, so we switch inputs back to F-1,
F-2 and F-3 and merge them into F-4.

Phase-5:
F-1     F-2     F-3     F-4
----------------------------------------
D       D       D      ..........  . Row-1
D       D       D       ............ Row-2
D       D       D       ............ Row-3
D       D       M       ............Row-4
M      M      M        ............Row-5
M      M      M       ............ Row-6
M      M      M       ............ Row-7
-----------------------------------------
M      M                  D (D+D+D=D, just like 0+0+0-0)           Row -8
M      M                  D (D+D+D=D)                                       Row -9
M      M                  D (D+D+D=D)                                       Row-10
M      M                  M (D+D+M).........................................  Row -11
--------------------------------------------------------------------------------------------------------
M              2M        3M(M+M+M).........................................Row-12
M              2M        3M(M+M+M).........................................Row-13
--------------------------------------------------------------------------------------------------------
         6M   2M       3M(M+M+M)..........................................Row-14
--------------------------------------------------------------------------------------------------------

11M  6M   3M ................................................... ................. Row-15
-------------------------------------------------------------------------------------------------------
                                20M (11M+6M+3M)
This is Table4-5.
________________________________________________________________________________