Key Bottlenecks in PC Parallelization

Written by probabilistic | Published 2025/08/24
Tech Story Tags: scalable-generative-models | gpu-accelerated-computation | probabilistic-circuits-(pcs) | pyjuice | efficient-parallelization | memory-efficient-training | block-based-parallelization | probabilistic-inference

TLDR This article examines the major bottlenecks in parallelizing probabilistic circuits (PCs), focusing on the forward pass. While product layers scale efficiently, sum layers create unexpected overhead due to repeated IO reloads of product node outputs. This imbalance leads to drastically higher runtime costs than predicted by memory read/write counts. The proposed solution involves smarter grouping of sum edges across processors, reducing redundant reloads and framing core operations as matrix multiplications. By doing so, modern GPUs can leverage Tensor Cores to significantly reduce computational overhead, making PC implementations faster and more efficient.

Table of Links

Abstract and 1. Introduction

  2. Preliminaries and Related Work

  3. Key Bottlenecks in PC Parallelization

  4. Harnessing Block-Based PC Parallelization

    4.1. Fully Connected Sum Layers

    4.2. Generalizing To Practical Sum Layers

    4.3. Efficient Implementations by Compiling PC Layers

    4.4. Analysis: IO and Computation Overhead

  5. Optimizing Backpropagation with PC Flows

  6. Experiments

    6.1. Faster Models with PyJuice

    6.2. Better PCs At Scale

    6.3. Benchmarking Existing PCs

  7. Conclusion, Acknowledgements, Impact Statement, and References

A. Algorithm Details

B. Additional Technical Details

C. Experimental Details

D. Additional Experiments

3. Key Bottlenecks in PC Parallelization

This section aims to lay out the key bottlenecks to efficient PC implementations. For ease of illustration, we focus solely on the forward pass, and leave the unique challenges posed by the backward pass and their solution to Section 5.

We start by illustrating the layering procedure used for PCs. Starting from the input nodes, we perform a topological sort of all nodes and cluster nodes with the same topological depth into a layer. For example, in Figure 1, the PC on the left is transformed into an equivalent layered representation on the right, where nodes of the same color belong to the same layer. The forward pass then processes the layers sequentially and finally returns the root node's output. To avoid numerical underflow, all probabilities are stored in log space. Product layers therefore only need to sum up the corresponding input log-probabilities, while sum layers compute weighted sums of input log-probabilities using the logsumexp trick.
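As a concrete illustration, here is a minimal PyTorch sketch of these two layer types. The tensor names and the index-based child gathering are our own illustrative assumptions, not the paper's actual implementation:

```python
import torch

def product_layer_forward(node_logp, child_ids):
    # node_logp: [B, N] log-probabilities of all previously computed nodes
    # child_ids: [M, K] indices of the K children of each of the M product nodes
    # In log space, a product node simply sums its children's log-probabilities.
    return node_logp[:, child_ids].sum(dim=-1)  # [B, M]

def sum_layer_forward(node_logp, child_ids, log_w):
    # log_w: [M, K] log-weights attached to each sum node's K edges
    # A weighted sum in probability space becomes a logsumexp in log space,
    # which keeps the computation numerically stable.
    return torch.logsumexp(node_logp[:, child_ids] + log_w, dim=-1)  # [B, M]
```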

Assume for now that all nodes in every layer have the same number of children. A straightforward strategy is to parallelize over every node and every sample. Specifically, given a layer of size M and batch size B, we need to compute M×B output values in total, which are evenly distributed across all processors (e.g., thread blocks on GPUs). We apply this idea to a PC with the PD structure (Poon & Domingos, 2011), which has ∼1M nodes and ∼150M edges. Additionally, all nodes within a layer have the same number of children, making it an ideal testbed for the aforementioned algorithm.
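The index arithmetic behind this naive strategy is simple; a rough sketch (with hypothetical names, ignoring GPU-specific details such as thread-block shapes) might look like this:

```python
def assign_work_items(num_nodes, batch_size, num_processors):
    # One work item per (node, sample) pair; the M*B items are split evenly
    # across processors, mirroring how thread blocks would be scheduled on a GPU.
    total = num_nodes * batch_size
    per_proc = (total + num_processors - 1) // num_processors
    schedule = []
    for p in range(num_processors):
        start, end = p * per_proc, min((p + 1) * per_proc, total)
        # Item i computes node (i // batch_size) for sample (i % batch_size).
        schedule.append([(i // batch_size, i % batch_size) for i in range(start, end)])
    return schedule
```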

Figure 2 illustrates the runtime breakdown of the forward pass (with batch size 512). As shown in the pie chart, both the IO and the computation overhead of the sum layers are much larger than those of the product layers. We would expect sum layers to exhibit a higher computation overhead because (i) there are ∼85x more sum edges than product edges (see the table in Fig. 2), and (ii) each sum edge requires more compute than a product edge. However, we would not expect the gap in IO overhead to be as pronounced as the pie chart indicates. Specifically, with batch size 512, the ideal memory read count of the product layers should be roughly [batch size]×[#sum nodes] ≈ 102M, since all children of product nodes are sum or input nodes (the number of input nodes is an order of magnitude smaller and is omitted). Similarly, the number of memory reads required by the sum layers is approximately [batch size]×[#prod nodes]+[#parameters] ≈ 571M, which is only about 5.6x that of the product layers. The ideal memory write count of the product layers should even be larger, since there are about 4x more product nodes than sum nodes.

While the ideal IO overhead of the sum layers is not much larger than that of the product layers, the drastic difference in runtime (over 50x) can be explained by the large number of reloads of child nodes' probabilities in the sum layers. Specifically, in the adopted PD structure, every sum node has no more than 12 parents, while most product nodes have 256 parents.[3] Recall that the parents of product nodes are sum nodes and vice versa. As a result, each sum layer needs to reload the output of every product node many times. Although the GPU's caching mechanism prevents this from turning into 256x loads from High-Bandwidth Memory (HBM), the excessive IO access still significantly slows down the algorithm.
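To see why this matters, a back-of-the-envelope estimate (our own hedged calculation, not taken from the paper) compares the child reads issued by the naive per-node scheme with the ideal count:

```python
def child_read_amplification(batch_size, num_prod_nodes, num_sum_edges):
    # With purely per-node parallelization, every sum edge independently fetches its
    # child product node's output, so reads scale with the number of sum edges.
    naive_reads = batch_size * num_sum_edges
    # Ideally, each product node's output would be loaded once per sample.
    ideal_reads = batch_size * num_prod_nodes
    # The ratio is roughly the average number of parents per product node.
    return naive_reads / ideal_reads

# With most product nodes having 256 parents, the amplification approaches 256x
# before caching; caching absorbs part of it, but the extra traffic remains substantial.
```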

The fundamental principle guiding our design is to properly group, or allocate, sum edges to different processors so as to minimize the reloading of product nodes' outputs. As an added benefit, this allows us to interpret part of the core computation as matrix multiplications, letting us harness the Tensor Cores available in modern GPUs and significantly reduce the sum layers' computational overhead.
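For instance, when a block of sum nodes shares the same set of child product nodes, their outputs can be computed jointly with a single, numerically stable matrix multiplication. Below is a minimal PyTorch sketch of this idea (our own simplification, not PyJuice's actual kernel):

```python
import torch

def sum_block_as_matmul(child_logp, log_w):
    # child_logp: [B, C] log-probabilities of C child product nodes shared by the block
    # log_w:      [S, C] log-weights of S sum nodes over those same C children
    # Computes out[b, s] = log(sum_c exp(log_w[s, c] + child_logp[b, c]))
    m = child_logp.max(dim=1, keepdim=True).values   # [B, 1] per-sample max for stability
    p = torch.exp(child_logp - m)                    # safe to exponentiate after shifting
    out = torch.log(p @ torch.exp(log_w).t()) + m    # the matmul maps onto Tensor Cores
    return out                                       # [B, S]
```

Because each child's output is read once per sample for the whole block rather than once per edge, this grouping is exactly what cuts down the redundant reloads described above.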

Authors:

(1) Anji Liu, Department of Computer Science, University of California, Los Angeles, USA ([email protected]);

(2) Kareem Ahmed, Department of Computer Science, University of California, Los Angeles, USA;

(3) Guy Van den Broeck, Department of Computer Science, University of California, Los Angeles, USA;


This paper is available on arXiv under a CC BY 4.0 DEED license.

[3] Only the children of the root sum node have 1 parent.

