Skip to content

15 External Sorting

基本概念

In external sorting, the sorting is done by reading all the data into memory, sorting it there, and then writing it back out to disk.

答案

内部排序是在主存中运行的排序算法,当待排序数据量过大时,需要借助磁盘空间使用外部排序,故答案为 \(\textnormal{F}\)

If only one tape drive is available to perform the external sorting, then the tape access time for any algorithm will be \(\Omega (N^2)\).

答案

磁带是顺序访问设备,单磁带机无法并行存储多顺串,归并时需反复扫描已合并数据,累计访问时间随 \(N\) 的平方增长,下界为 \(Ω(N²)\),答案为 \(\textnormal{T}\)

The bottleneck of external sorting is to merge the records from input buffers to the output buffers.

答案

磁盘 I/O 速度远低于内存操作,外部排序真正的瓶颈是磁盘 I/O(归并段的读 / 写),答案为 \(\textnormal{F}\)

Pass Reduction

In external sorting, a \(k\)-way merging is usually used in order to reduce the number of passes and we will take the \(k\) as large as possible as long as we have enough amount of tapes.

答案

\(\textnormal{F}\)

Polyphase merge is a method for speeding up k-way merge in external sorting.

答案

\(\textnormal{F}\),多项合并在保持子序列个数不变的情况下减少所需磁带数

Replacement selection is a method for generating longer runs in external sorting.

答案

\(\textnormal{T}\)

k-way Merge

For a \(k\)-way merge in external sorting, the primary reason for \(k\) not assuming a large value is that:

\(\textnormal{A}\). \(k\) has to be a finite integer

\(\textnormal{B}\). \(k\) is bounded above by the number of runs

\(\textnormal{C}\). during merging, the number of comparisons would increase

\(\textnormal{D}\). the I\O time would increase

答案

\(\textnormal{D}\)

Given 2000 runs and 10 tapes. If simple k-way merge is used, the minimum number of passes required is 5 (runs generation pass is not counted).

答案

\(1 + \lceil \log_9 2000 \rceil = 5\),故答案为 \(\textnormal{T}\)

To sort \(N\) numbers by external sorting using a \(k\)-way merge and a \(k\)-size heap, which statement is TRUE about the total comparison times \(T(N, k)\) and \(k\)?

\(\textnormal{A}\). \(T(N, k)\) has nothing to do with \(k\).

\(\textnormal{B}\). \(T(N, k)\) is \(O(k)\) for fixed \(N\).

\(\textnormal{C}\). \(T(N, k)\) is \(O(k \log k)\) for fixed \(N\).

\(\textnormal{D}\). \(T(N, k)\) is \(O(k^2)\) for fixed \(N\).

答案

在外部排序中,总比较次数 \(T(N, k)\) 包括初始排序和归并阶段:

  • 初始排序将数据分成块,每块内部排序的比较次数为 \(O(N \log M)\),其中 \(M\) 是内存可容纳的块大小
  • 归并阶段使用 \(k\)-路归并和大小为 \(k\) 的堆,每趟归并的比较次数为 \(O(N \log k)\),归并趟数为 \(O(\log_k \frac{N}{M})\),归并阶段的总比较次数为 \(O(N \log k \cdot \log_k \frac{N}{M}) = O(N \log \frac{N}{M})\)

故总比较次数为 \(O(N \log M + N \log \frac{N}{M}) = O(N \log N)\),与 \(k\) 无关,答案为 \(\textnormal{A}\)

Buffer Handling

In general, for a 3-way merge we need 6 input buffers and 2 output buffers for parallel operations.

答案

k-way merge 通常需要 \(2k\) 个 input buffers 和 \(2\) 个 output buffer,答案为 \(\textnormal{T}\)

pass 的计算

Given 100,000,000 records of 256 bytes each, and the size of the internal memory is 128MB. If simple 2-way merges are used, how many passes do we have to do?

答案

\(M = 128 \, \text{MB} = 2^7 \times 2^{20} \, \text{B} = 2^{27} \, \text{B}\)\(N = 10^8 \times 2^8\)\(number\ of\ passes = 1 + \lceil \log_k \frac{N}{M} \rceil = 9\)

Given 13 runs and 4 tapes to do external sorting. If we apply the polyphase merge with Fibonacci splitting to perform 3-way merges, how many passes (the initial run generating pass is not included) must we have?

答案

答案是 4,分割成 \([6, 5, 2]\) 可以实现。但是我有点不理解这是怎么得到的,因为如果直接按三阶斐波那契数列分割的话初始分布应该是 \([7, 4, 2]\),但这样会是 \(5\) pass,或许可以用 \(number\ of\ passes = \text{对应斐波那契数索引} - 2\) 吧,虽然我也不知道这个式子哪里来的以及正不正确。

简单的变体

Suppose we only have 2 tapes, \(T_a\) and \(T_b\), to do external sorting. Suppose that the data which has \(N\) records is initially on \(T_a\). Suppose further that the internal memory can hold (and sort) \(M\) records at a time. A simple algorithm works as the following:

  • Step 1: read \(M\) records at a time from \(T_a\), sort the records internally, and then write the sorted records to \(T_b\).

  • Step 2: read \(M\) records at a time from \(T_a\), sort the records internally, and merge them with sorted records from \(T_b\), and write them (\(2M\) records) to \(T_a\).

  • Step 3: read \(M\) records from \(T_a\), sort them internally, and merge them with sorted \(2M\) records from \(T_a\), and write them (\(3M\) records) to \(T_b\).

Repeat steps 2 and 3 until all the records are sorted. This algorithm will require __ passes.

答案

\(\lceil N/M\rceil\)