Scaling Probabilistic Circuits with PyJuice: 100x Faster Training & 5x Lower Memory Use

Written by probabilistic | Published 2025/08/24
Tech Story Tags: scalable-generative-models | gpu-accelerated-computation | probabilistic-circuits-(pcs) | pyjuice | efficient-parallelization | memory-efficient-training | block-based-parallelization | probabilistic-inference

TL;DR: Probabilistic Circuits are powerful but often limited by slow, memory-intensive implementations. PyJuice overcomes these bottlenecks with a GPU-optimized system that enables 1-2 orders of magnitude faster training and 2-5x memory savings. By compiling circuits into compact structures for block-based parallelization and leveraging Tensor Cores, PyJuice scales PCs to real-world image and language datasets. This advancement sets new baselines for large-scale generative modeling and paves the way for broader adoption of PCs in AI research.

Authors:

(1) Anji Liu, Department of Computer Science, University of California, Los Angeles, USA ([email protected]);

(2) Kareem Ahmed, Department of Computer Science, University of California, Los Angeles, USA;

(3) Guy Van den Broeck, Department of Computer Science, University of California, Los Angeles, USA;

Table of Links

Abstract and 1. Introduction

2. Preliminaries and Related Work

3. Key Bottlenecks in PC Parallelization

4. Harnessing Block-Based PC Parallelization

  4.1. Fully Connected Sum Layers

  4.2. Generalizing to Practical Sum Layers

  4.3. Efficient Implementations by Compiling PC Layers

  4.4. Analysis: IO and Computation Overhead

5. Optimizing Backpropagation with PC Flows

6. Experiments

  6.1. Faster Models with PyJuice

  6.2. Better PCs at Scale

  6.3. Benchmarking Existing PCs

7. Conclusion, Acknowledgements, Impact Statement, and References

A. Algorithm Details

B. Additional Technical Details

C. Experimental Details

D. Additional Experiments

Abstract

Probabilistic Circuits (PCs) are a general framework for tractable deep generative models, which support exact and efficient probabilistic inference on their learned distributions. Recent modeling and training advancements have enabled their application to complex real-world tasks. However, the time and memory inefficiency of existing PC implementations hinders further scaling up. This paper proposes PyJuice, a general GPU implementation design for PCs that improves prior art in several regards. Specifically, PyJuice is 1-2 orders of magnitude faster than existing systems (including very recent ones) at training large-scale PCs. Moreover, PyJuice consumes 2-5x less GPU memory, which enables us to train larger models. At the core of our system is a compilation process that converts a PC into a compact representation amenable to efficient block-based parallelization, which significantly reduces IO and makes it possible to leverage Tensor Cores available in modern GPUs. Empirically, PyJuice can be used to improve state-of-the-art PCs trained on image (e.g., ImageNet32) and language (e.g., WikiText, CommonGen) datasets. We further establish a new set of baselines on natural image and language datasets by benchmarking existing PC structures but with much larger sizes and more training epochs, with the hope of incentivizing future research. Code is available at https://github.com/Tractables/pyjuice.

1. Introduction

Many tasks require not only precise modeling of intricate, high-dimensional data distributions but also the efficient execution of probabilistic inference on the learned model. To satisfy inference-side demands, tractable deep generative models are designed to support efficient computation of various probabilistic queries. Probabilistic Circuits (PCs) (Choi et al., 2020; Vergari et al., 2020) are a unified framework that abstracts a myriad of tractable model families. PCs have been applied to many domains such as explainability and causality (Correia et al., 2020; Wang & Kwiatkowska, 2023), graph link prediction (Loconte et al., 2023), and neuro-symbolic AI (Xu et al., 2018; Manhaeve et al., 2018; Ahmed et al., 2022a). In particular, there is a trend of using PCs’ tractability to control expressive deep generative models, including (large) language models (Zhang et al., 2023), image diffusion models (Liu et al., 2024), and reinforcement learning models (Liu et al., 2023b).

These application-side advancements are built on recent breakthroughs on the modeling and learning side of PCs, which include designing better PC structures (Peharz et al., 2020b; Correia et al., 2023; Mathur et al., 2023; Loconte et al., 2024; Gala et al., 2024), effective structure learning algorithms (Gens & Pedro, 2013; Dang et al., 2020; 2022; Yang et al., 2023), and distilling from expressive deep generative models (Liu et al., 2023a). Despite such algorithmic innovations, a fundamental obstacle to further scaling up PC learning and inference is the time and memory inefficiency of existing implementations, which hinders the training of large PC models and their application to large-scale datasets.

In this work, we develop an efficient and flexible system called PyJuice that addresses various training and inference tasks for PCs. As shown in Table 1, PyJuice is orders of magnitude faster than previous implementations for PCs (e.g., SPFlow (Molina et al., 2019), EiNet (Peharz et al., 2020a), and Juice.jl (Dang et al., 2021)) as well as Hidden Markov Models[1] (e.g., Dynamax (Murphy et al., 2023)). Additionally, as we shall demonstrate in the experiments, PyJuice is more memory efficient than the baselines, enabling us to train larger PCs with a fixed memory quota.

Unlike other deep generative models built from neural network layers that map readily onto efficient systems (e.g., a fully connected layer can be emulated by a single matrix multiplication and addition kernel plus an element-wise activation kernel), PCs cannot be efficiently computed using well-established operands due to (i) the unique connection patterns of their computation graph,[2] and (ii) the presence of log probabilities at drastically different scales in the model, which requires properly handling numerical underflow. To parallelize PCs at scale, we propose a compilation phase that converts a PC into a compact data structure amenable to block-based parallelization on modern GPUs. Further, we improve the backpropagation process by indirectly computing the parameter updates through backpropagating a quantity called the PC flow (Choi et al., 2021), which is more numerically convenient yet mathematically equivalent.
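To make point (ii) concrete, the sketch below evaluates a fully connected sum layer entirely in log space using a log-sum-exp reduction, the standard way to avoid underflow when child probabilities span wildly different scales. This is only a conceptual illustration of the computation a sum layer performs, not PyJuice's compiled block-based kernel.

```python
import torch

def fc_sum_layer_logspace(child_log_probs, log_weights):
    """Evaluate a fully connected sum layer in log space.

    child_log_probs: (batch, num_children)   log-probabilities of the child nodes
    log_weights:     (num_sums, num_children) log of the (normalized) sum-node weights

    Returns (batch, num_sums) log-probabilities of the sum nodes.
    Computing log sum_j w_ij * p_j directly would underflow for tiny p_j,
    so we work with log w_ij + log p_j and reduce with logsumexp.
    """
    # Broadcast to (batch, num_sums, num_children), then reduce over children.
    scores = child_log_probs.unsqueeze(1) + log_weights.unsqueeze(0)
    return torch.logsumexp(scores, dim=-1)

# Usage: 4 sum nodes over 8 children, batch of 2 samples (random placeholder values).
child_lp = torch.log_softmax(torch.randn(2, 8), dim=-1)
log_w = torch.log_softmax(torch.randn(4, 8), dim=-1)
print(fc_sum_layer_logspace(child_lp, log_w).shape)  # torch.Size([2, 4])
```

PyJuice's contribution lies in how this kind of reduction is organized: the compilation step groups nodes into blocks so that the inner reduction maps onto dense, Tensor-Core-friendly operations rather than scattered per-edge work.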

In the following, we first formally define PCs and discuss common ways to parallelize their computation in Section 2. Section 3 examines the key bottlenecks in PC parallelization. Sections 4 and 5 explain our design in detail.

This paper is available on arxiv under CC BY 4.0 DEED license.

[1] Every HMM has an equivalent PC representation.

[2] Commonly used neural network layers mainly employ “regular” tensor operations such as matrix multiplications and tensor inner-/outer-products. In contrast, PC layers can contain nodes that are sparsely connected.
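As a purely illustrative example of such sparse connectivity, a sum layer can be stored in a flattened, CSR-like layout of child indices and per-edge parameters and evaluated with gathers instead of a dense matrix product. The array names below are assumptions made for this sketch, not the compact representation PyJuice actually compiles to.

```python
import torch

# Hypothetical CSR-like layout for a sparse sum layer with 3 sum nodes over 5 children:
# sum node 0 has children {0, 2}, node 1 has {1}, node 2 has {2, 3, 4}.
child_ids = torch.tensor([0, 2, 1, 2, 3, 4])      # flattened child indices, one entry per edge
node_ptr = torch.tensor([0, 2, 3, 6])             # node i owns edges node_ptr[i]:node_ptr[i+1]
log_w = torch.randn(6).log_softmax(dim=0)         # one log-weight per edge (placeholder values;
                                                  # per-node normalization omitted for brevity)

def sparse_sum_layer_logspace(child_log_probs):
    """child_log_probs: (batch, 5); returns (batch, 3) sum-node log-probabilities."""
    out = []
    for i in range(len(node_ptr) - 1):
        lo, hi = node_ptr[i], node_ptr[i + 1]
        # Gather only this node's children, add its edge log-weights, reduce in log space.
        edges = child_log_probs[:, child_ids[lo:hi]] + log_w[lo:hi]
        out.append(torch.logsumexp(edges, dim=-1))
    return torch.stack(out, dim=-1)

print(sparse_sum_layer_logspace(torch.log_softmax(torch.randn(2, 5), dim=-1)).shape)  # torch.Size([2, 3])
```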

