
**Using Autodiff to Estimate Posterior Moments, Marginals and Samples: Experimental Datasets and Model**

by Bayesian Inference, April 15th, 2024

This paper is available on arXiv under a CC 4.0 license.

**Authors:**

(1) Sam Bowyer, Equal contribution, Department of Mathematics, and [email protected];

(2) Thomas Heap, Equal contribution, Department of Computer Science, University of Bristol, and [email protected];

(3) Laurence Aitchison, Department of Computer Science, University of Bristol, and [email protected].

- Abstract & Introduction
- Related work
- Background
- Methods
- Experiments
- Conclusion, Limitations, and References
- Derivations
- Algorithms
- Experimental Datasets And Model

The MovieLens100K dataset (Harper and Konstan 2015) contains 100k ratings of N = 1682 films from M = 943 users. The original user ratings run from 0 to 5. Following Geffner and Domke (2022), we binarise the ratings into likes/dislikes, taking user ratings of {0, 1, 2, 3} as a binarised rating of 0 (dislike) and user ratings of {4, 5} as a binarised rating of 1 (like). We assume binarised ratings of 0 for films which users have not previously rated.
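The binarisation rule above can be sketched in a few lines of numpy. The function name and the toy ratings matrix are illustrative, not from the paper; the thresholding and the default of 0 for unrated films follow the text.

```python
import numpy as np

def binarise_ratings(ratings: np.ndarray) -> np.ndarray:
    """Map raw 0-5 ratings to binary likes (1) / dislikes (0):
    {0, 1, 2, 3} -> 0 and {4, 5} -> 1."""
    return (ratings >= 4).astype(np.int64)

M, N = 943, 1682                          # users, films in MovieLens100K
raw = np.zeros((M, N), dtype=np.int64)    # unrated (user, film) pairs stay 0
raw[0, 0], raw[0, 1] = 5, 3               # two example ratings
binary = binarise_ratings(raw)            # binary[0, 0] = 1, binary[0, 1] = 0
```

Storing unrated entries as 0 means they are indistinguishable from explicit dislikes after binarisation, which is exactly the assumption stated above.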

The probabilistic graphical model is given in Fig. 3. We use n to index films and m to index users. Each film has a known feature vector, xn, while each user has a latent weight vector, zm, of the same length, describing whether or not they like any given feature. There are 18 features, indicating which genre tags the film has (Action, Adventure, Animation, Childrens, ...). Each film may have more than one tag. The probability of a user liking a film is given by taking the dot product of the film's feature vector with the user's latent weight vector and applying a sigmoid, σ(·) (the final line of the generative model).

Additionally, we have latent vectors for the global mean, µ, and variance, ψ, of the weight vectors. We use a random subset of N = 20 films and M = 450 users for our experiment to ensure high levels of uncertainty. An equally sized but disjoint subset of users is held aside for calculating the predictive log-likelihood. In Fig. 1c,d we consider the variance and MSE of estimators of zm.
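A forward-sampling sketch of this generative structure is below. The specific priors on µ and ψ are placeholders (the paper's exact hyperparameters are not reproduced in this chunk); the shapes and the sigmoid-of-dot-product likelihood follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 20, 450, 18  # films, users, genre features (subset sizes from the text)

x = rng.integers(0, 2, size=(N, D)).astype(float)  # known film genre-tag vectors

# Hypothetical priors on the global mean and variance of the user weights:
mu = rng.normal(0.0, 1.0, size=D)              # global mean, mu
psi = rng.exponential(1.0, size=D)             # global per-feature variance, psi
z = rng.normal(mu, np.sqrt(psi), size=(M, D))  # latent user weight vectors, z_m

logits = z @ x.T                               # (M, N) dot products x_n . z_m
p_like = 1.0 / (1.0 + np.exp(-logits))         # sigmoid gives like-probability
ratings = rng.binomial(1, p_like)              # binarised likes/dislikes
```

The inference problem the paper studies is recovering the posterior over z, µ and ψ given the observed binary ratings; this sketch only runs the model forwards.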

In our second experiment, we model the length of delays to New York school bus journeys, working with a dataset supplied by the City of New York (DOE 2023). Our goal is to predict the length of a delay, based on the year, y, borough, b, bus company, c, and journey type, j. Specifically, our data includes the years 2015 − 2022 inclusive and covers the five New York boroughs (Brooklyn, Manhattan, The Bronx, Queens, Staten Island) as well as some surrounding areas (Nassau County, New Jersey, Connecticut, Rockland County, Westchester). There are 57 bus companies, and 6 different journey types (e.g. pre-K/elementary school route, general education AM/PM route, etc.). We take I = 60 delayed buses in each borough and year, and take Y = 3 years and B = 3 boroughs. We then split along the I dimension to get two equally sized train and test sets. Thus, each delay is uniquely identified by the year, y, the borough, b, and the index, i, giving delayybi. The delays are recorded as an integer number of minutes and we discard any entries greater than 130 minutes.

We have a hierarchical latent variable model describing the impact of each of the features (year, borough, bus company and journey type) on the length of delays. Specifically, the integer delay is modelled with a Negative Binomial distribution with a fixed total count of 130, as we discard entries with delays longer than this. The expected delay length is controlled by a latent variable, logitsybi, with one logit for each delayed bus. Each logit is a sum of three terms: one for the borough and year jointly, one for the bus company, and one for the journey type. Each of these three terms is itself a latent variable that must be inferred.
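The Negative Binomial likelihood with a fixed total count of 130 can be sketched as follows. The exact parameterisation (which logit sign corresponds to longer delays) is an assumption here; only the fixed total count and the one-logit-per-bus structure come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_delay(logits: np.ndarray, total_count: int = 130) -> np.ndarray:
    # Sigmoid maps each logit to a per-trial probability; under this
    # (assumed) parameterisation, larger logits mean longer expected delays.
    p_fail = 1.0 / (1.0 + np.exp(-logits))
    # numpy's negative_binomial takes the success probability, so pass 1 - p_fail.
    return rng.negative_binomial(total_count, 1.0 - p_fail)

logits = np.array([-3.0, -1.0, 0.0])  # one logit per delayed bus (toy values)
delays = sample_delay(logits)         # integer minutes of delay
```

With total count n and failure probability p, the mean delay is n·p/(1 − p), so a logit of 0 (p = 0.5) gives an expected delay of 130 minutes.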

First, we have a term for the year and borough, YearBoroughWeightyb, which has a hierarchical prior. Specifically, we begin by sampling a global mean and variance, GlobalMean and GlobalVariance. Then for each year, we use GlobalMean and GlobalVariance to sample a mean for each year, YearMeany. Additionally, we sample a variance for each borough, BoroughVarianceb. Then we sample a YearBoroughWeightyb from a Gaussian distribution with a year-dependent mean, YearMeany, and a borough-dependent variance, BoroughVarianceb.
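The hierarchical prior for YearBoroughWeightyb described above can be sampled forwards as follows. The hyperpriors on the global mean/variance and on BoroughVarianceb are placeholders; the dependency structure (year-dependent mean, borough-dependent variance) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
Y, B = 3, 3  # years and boroughs used in the experiment

# Hypothetical hyperpriors; the paper's exact choices may differ.
GlobalMean = rng.normal(0.0, 1.0)
GlobalVariance = rng.exponential(1.0)

# Per-year means drawn from the global mean and variance.
YearMean = rng.normal(GlobalMean, np.sqrt(GlobalVariance), size=Y)
# Per-borough variances.
BoroughVariance = rng.exponential(1.0, size=B)

# YearBoroughWeight[y, b] ~ Normal(YearMean[y], BoroughVariance[b])
YearBoroughWeight = rng.normal(
    YearMean[:, None], np.sqrt(BoroughVariance)[None, :]
)
```

Broadcasting the (Y, 1) means against the (1, B) standard deviations yields the full Y × B table of weights in one call.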

Next, the weights for the bus company and journey type are handled very similarly. Specifically, we have one latent weight for each bus company, CompanyWeightc, with c ∈ {1, . . . , 57}, and for each journey type, JourneyTypeWeightj, with j ∈ {1, . . . , 6}. We have a table identifying the bus company, cybi, and journey type, jybi, for each delayed bus journey (remember that a particular delayed bus journey is uniquely identified by the year, y, borough, b, and index, i). In logitsybi we use these tables to pick out the right company and journey-type weight for that particular delayed bus journey, CompanyWeightcybi and JourneyTypeWeightjybi. The graphical model for the full generative process is given in Fig. 4.
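The table-lookup construction of logitsybi can be sketched with fancy indexing. The weight values and table contents are random placeholders; the indexing pattern (sum of a year-borough term, a company term picked by cybi, and a journey-type term picked by jybi) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
Y, B, I, C, J = 3, 3, 30, 57, 6  # years, boroughs, buses, companies, journey types

YearBoroughWeight = rng.normal(size=(Y, B))
CompanyWeight = rng.normal(size=C)
JourneyTypeWeight = rng.normal(size=J)

# Tables identifying the company and journey type of each delayed journey.
company_id = rng.integers(0, C, size=(Y, B, I))
journey_id = rng.integers(0, J, size=(Y, B, I))

# logits[y, b, i] = YearBoroughWeight[y, b]
#                 + CompanyWeight[company_id[y, b, i]]
#                 + JourneyTypeWeight[journey_id[y, b, i]]
logits = (
    YearBoroughWeight[:, :, None]
    + CompanyWeight[company_id]
    + JourneyTypeWeight[journey_id]
)
```

Indexing a 1-D weight vector with a (Y, B, I) integer table returns a (Y, B, I) array, so the three terms broadcast cleanly into one logit per delayed bus.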

We ran all experiments on an 80GB NVIDIA A100 GPU and averaged the results over three different dataset splits generated using different random seeds. On each dataset split, the MovieLens experiments required a maximum of 26.9 GB of memory at any one time; for the NYC Bus Breakdown experiments this value was 0.9 GB. For each experiment using a single dataset split, we used global IS, MP IS, VI and HMC methods to obtain (for each value of K in the IS methods, or each iteration of VI/HMC) 1000 values of the ELBO, the predictive log-likelihood (on a test set disjoint from the training set), and posterior mean estimates for each latent variable in the model. The 1000 values were averaged for the ELBO and predictive log-likelihood, whilst the 1000 mean estimates were used to calculate the mean squared error or variance per variable, depending on whether the observed data had been sampled from the model itself (in which case the true variable values were known) or not. We present error bars in the top rows of Fig. 1 and Fig. 2 representing the standard deviation over the three dataset splits.
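The MSE-vs-variance bookkeeping described above can be made concrete. The 1000 estimates here are synthetic stand-ins; the point is the relationship between the two metrics: when the true value is known, MSE decomposes as variance plus squared bias.

```python
import numpy as np

# Stand-in for 1000 posterior-mean estimates of one latent variable.
estimates = np.random.default_rng(0).normal(1.0, 0.1, size=1000)

# When the data were NOT sampled from the model, only the spread is reportable:
variance = estimates.var()

# When the data WERE sampled from the model, the true value is known:
true_value = 1.0  # hypothetical ground truth
mse = ((estimates - true_value) ** 2).mean()
```

Since MSE = variance + (mean − truth)², the MSE is always at least as large as the variance, with equality only for an unbiased estimator.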

For VI, we optimize our approximate posterior using Adam with learning rates of 10−2, 10−3 and 10−4 (we found that learning rates larger than 10−2 were unstable), and ran 100 optimization steps for both the MovieLens and Bus Breakdown models. For HMC, we ran the No-U-Turn Sampler on our GPU using JAX (Bradbury et al. 2018) to obtain 10 consecutive samples after a single step with PyMC's (Salvatier, Wiecki, and Fonnesbeck 2016) pre-tuning, with all other parameters set to their default values. We present results for the first 5 samples (not including the tuning sample) in order to align as closely as possible with the small time-scale of the IS methods.
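For reference, the Adam update used in the VI runs has the following shape. This is a generic from-scratch sketch on a toy quadratic objective standing in for the negative ELBO, not the authors' code; the learning rate and the 100-step budget match the settings above.

```python
import numpy as np

def grad(param: np.ndarray) -> np.ndarray:
    # Gradient of a toy negative-ELBO surrogate, 0.5 * (param - 2)^2,
    # whose optimum sits at param = 2.
    return param - 2.0

param = np.zeros(1)
m, v = np.zeros(1), np.zeros(1)
lr, b1, b2, eps = 1e-2, 0.9, 0.999, 1e-8  # one of the learning rates above

for t in range(1, 101):                   # 100 optimization steps, as in the text
    g = grad(param)
    m = b1 * m + (1 - b1) * g             # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g**2          # second-moment estimate
    m_hat = m / (1 - b1**t)               # bias correction
    v_hat = v / (1 - b2**t)
    param -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

Because Adam's step size is roughly lr regardless of gradient magnitude, 100 steps at lr = 10−2 move the parameter by about 1.0; in practice one would use an autodiff framework rather than hand-written updates.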
