14 Parallel Algorithms¶

读取写入模式

专一读取 — 专一写入 (exclusive-read exclusive-write, EREW)
并发读取 — 专一写入 (concurrent-read exclusive-write, CREW)
并发读取 — 并发写入 (concurrent-read concurrent-write, CRCW)

基本性质

When we solve the summation problem via designing the parallel algorithms, we shorten the asymptotic time complexity but take more asymptotic work loads comparing with the sequential algorithms.

答案

并行求和算法渐近时间复杂度缩短，但渐近工作量与串行算法均为 \(O(n)\)，并未增加，答案为 \(\textnormal{F}\)

If we translate a serial algorithm into a reasonably efficient parallel algorithm, the work load and the worst-case running time are usually reduced.

答案

高效并行算法仅能减少运行时间，但无法减少总工作量，甚至可能增加总工作量，故答案为 \(\textnormal{F}\)

Managing shared memory for parallel programs is simpler than normal sequential programs.

答案

并行程序的共享内存引入了竞态条件、同步开销、缓存一致性等特有问题，管理难度远高于顺序程序，答案为 \(\textnormal{F}\)

Measurement

When measure the performance of parallel algorithm, we often use work load \(W(n)\) and worst-case running time \(T(n)\). How many evaluation metrics are asymptotically equivalent to \(W(n)\) and \(T(n)\)?

\(P(n) = W(n)/T(n)\) processors and \(T(n)\) time (on a PRAM)
\(W(n)/p\) time using any number of \(p \geq W(n)/T(n)\) processors (on a PRAM)
\(W(n)/p + T(n)\) time using any number of \(p\) processors (on a PRAM)

答案

对于指标 \(1\)，PRAM 中，要在 \(T(n)\) 时间内完成总工作量 \(W(n)\)，每个处理器需在 \(T(n)\) 时间内执行 \(T(n)\) 次操作（无空闲），因此最小处理器数满足 \(P_{\text{opt}}(n) \times T(n) = W(n) \implies P_{\text{opt}}(n) = W(n)/T(n)\)，满足渐近等价性

对于指标 \(2\)，当处理器数 \(p \geq P_{\text{opt}}(n) = W(n)/T(n)\) 时，处理器已足够多（超过最小需求），但并行算法的运行时间由 并行深度（即 \(T(n)\)） 决定，而非处理器数，不满足渐近等价性

对于指标 \(3\)，当 \(p < P_{\text{opt}}(n)\)（处理器不足）时，时间由 \(W(n)/p\) 主导（总工作量分摊到每个处理器的操作数多）；当 \(p \geq P_{\text{opt}}(n)\)（处理器充足）时，时间由 \(T(n)\) 主导（并行深度限制，再多处理器也无法缩短时间），满足渐近等价性

综上所述，答案为 \(2\)

Consider two parallel algorithms for the same problem, with a total of \(W_i(n)\) operations in \(T_i(n)\) time \((i = 1, 2)\). If \(W_1(n) = O(n), T_1(n) = O(n),\) and \(W_2(n) = O(n\log n), T_2(n) = O(\log n)\). Then:

\(\textnormal{A}\). the first algorithm is more efficient than the second algorithm

\(\textnormal{B}\). the first algorithm is less efficient than the second algorithm

\(\textnormal{C}\). the performances of the two algorithms are asymptotically equivalent

\(\textnormal{D}\). none of the other three options are correct

答案

效率的比较通常需要参考最佳串行算法的运行时间，\(W(n)/T(n)\) 只能反映算法的平均并行度，即理论上需要的最小处理器数量，不能衡量效率，题目中的两个算法各有利弊，答案为 \(\textnormal{D}\)

Prefix - Sums problem

The following psuedo-code is for solving the Prefix Sums problem parallely, where the input \(N\) numbers are stored in the array A. Which of the following gives the time and work load of the algorithm?

for Pi , 1 <= i <= n pardo
    B(0, i) := A(i)
for h = 1 to log(n)
    for i, 1 <= i<= n/(2^h) pardo
        B(h, i) := B(h-1, 2*i-1) + B(h-1, 2*i)
for h = log(n) to 0
    for i even, 1 <= i<= n/(2^h) pardo
        C(h, i) := C(h+1, i/2)
    for i = 1 pardo
        C(h, 1) := B(h, 1)
    for i odd, 3 <= i <= n/(2^h) pardo
        C(h, i) := C(h+1, (i-1)/2) + B(h, i)
for Pi , 1 <= i<= n pardo
Output C(0, i)

答案

\(T(n) = O (log n)， W(n) = O(n)\)

To evaluate the prefix-sums of a sequence of 16 numbers by the parallel algorithm with Balanced Binary Trees, \(C(3, 1)\) is found before \(C(2, 1)\). \(C(h, i) = \sum_{k=1}^{a} A(k)\) where \((0, a)\) is the rightmost descendant leaf of node \((h, i)\).

答案

先自底向上计算求和问题，然后根据求和问题的结果自顶向下计算前缀和问题，答案为 \(\textnormal{T}\)

There is an array \(A, A[i] = i(1 \leq i \leq 1024)\). If we use Balanced Binary Trees Parallel Algorithm to calculate the Prefix-Sum of \(A\). Define \(C(h, i) = \sum\limits_{k=1}^{\alpha} A(k)\), where \((0, \alpha)\) is the rightmost descendant leaf of node \((h, i)\). Node \((h, i)\) indicates the \(i\)-th node in height \(h\). For example, the root is node \((10,1)\) and leaves are node \((0, i) \ 1 \leq i \leq 1024\). The value of \(C(2,223)\) is ?

答案

\(\alpha = 223 \times 2^2 = 892\)，\(C(h, i) = 1 + 2 + 3 + \cdots + 892 = 398278\)

The nearest-one problem is defined as follows. Input: An array A of size n of bits; namely, the value of each entry of A is either 0 or 1. The nearest-one problem is to find for each i, 1 ≤ i ≤ n, the largest index j ≤ i, such that A(j) = 1. There is an parallel algorithm that runs in \(O(n)\) work and \(O(1)\) time for the problem.

答案

\(\textnormal{F}\)，等价于前缀最大值问题，\(T(n) = O(\log n)\)，\(W(n) = O(n)\)

Ranking problem

算法对比

	\(T(n)\)	\(W(n)\)
二分查找（binary search）	\(O(\log n)\)	\(O(n\log n)\)
顺序排行（serial ranking）	\(O(n+m)\)	\(O(n+m)\)
并行排行（parallel ranking）	\(O(\log n)\)	\(O(n)\)

When we solve the ranking problem via designing the parallel algorithms based on binary search, we shorten the worst-case running time but take more work loads comparing with the sequential algorithms.

答案

\(\textnormal{T}\)

Maximum Finding

算法对比

	\(T(n)\)	\(W(n)\)	备注
求和问题算法改版	\(O(\log n)\)	\(O(n)\)
全元素对比较法	\(O(1)\)	\(O(n^2)\)
划分范式	\(O(\log \log n)\)	\(O(n \log \log n)\)	子问题规模 \(h = \sqrt{n}\)
Doubly-Logarithmic Paradigm	\(O(\log \log n)\)	\(O(n)\)	子问题规模 \(h = \log \log n\)
Random Sampling	\(O(1)\)	\(O(n)\)	失败概率为 \(O(n^c)\)，\(c\) 为负常数

There are \(N\) different numbers. What's the minimum time complexity to find the \(K\)-th largest number with parallel algorithms? (No need to worry about access conflicts)

答案

\(O(1)\)

To solve the Maximum Finding problem with parallel Random Sampling method, \(T(n) = O(1)\) and \(W(n) = O(n)\) can be achieved with \(O(n)\) processors.

答案

\(\textnormal{F}\)，为啥啊？

平行算法的递归描述

Sorting-by-merging is a classic serial algorithm. It can be translated directly into a reasonably efficient parallel algorithm. A recursive description follows.

MERGE-SORT(A(1), A(2), ..., A(n); B(1), B(2), ..., B(n))

Assume that \(n = 2^l\) for some integer \(l \geq 0\)

if \(n = 1\) then return B(1) := A(1)

else call, in parallel, MERGE-SORT(A(1), ..., A(n/2); C(1), ..., C(n/2)) and

MERGE-SORT(A(n/2+1), ..., A(n); C(n/2+1), ..., C(n))
Merge(C(1)....C(n/2)) and (C(n/2 + 1)....C(n)) into (B(1), B(2), ..., B(n))

Then the MERGE-SORT runs in ___ work and ___ time.

答案

由 \(W(n) = 2 \cdot W(n/2) + O(n)\)，\(T(n) = T(n/2) + O(\log n)\) 得：

\[ W(n) = O(n \log n)，O(\log^2 n) \]