Assessing Non-linearities and Distribution Assumptions in Barrier-to-Exit Analysisby@escholar

Assessing Non-linearities and Distribution Assumptions in Barrier-to-Exit Analysis

tldt arrow

Too Long; Didn't Read

Analysis of alternative models reveals challenges in model robustness and distribution assumptions in Amazon's Barrier-to-Exit study. Post-hoc adjustments using Gamma regression show potential for improved fit, yet caution is warranted due to non-random residuals and distribution suitability concerns. Further exploration into distribution fitting methods may be necessary for a comprehensive understanding.
featured image - Assessing Non-linearities and Distribution Assumptions in Barrier-to-Exit Analysis
EScholar: Electronic Academic Papers for Scholars HackerNoon profile picture


(1) Jonathan H. Rystrøm.

Abstract and Introduction

Previous Literature

Methods and Data



Conclusions and References

A. Validation of Assumptions

B. Other Models

C. Pre-processing steps

B Other Models

B.1 Model without activity-level

However, when we ran the model (also using lmerTest (Kuznetsova et al., 2017)) and plotted the residuals, we got the following:

Figure 9: Residuals of the initial model. From a visual inspection, the residuals are not randomly distributed

Just from a brief visual inspection, it is clear to see that the residuals are not randomly distributed: There are two distinct ”bands” that both seem to trend upward. This breaks the assumption that the residuals are randomly distributed (Poole & O’Farrell, 1971). While many assumption violations are reduced with enough data (Baayen et al., 2008; Schielzeth et al., 2020), non-linearity is not one of them (Poole & O’Farrell, 1971).

Fortunately, the non-random residuals were (partially) fixed by introducing activity-level for the reasons described in section 3.3.

B.2 Problematic Categories Removed

Here we fit the main model (eq. 5, with problematic categories removed. We define a problematic category as a category with a fitted random effect of less than -0.5. We obtain this threshold by visually inspecting Fig. 7.

The results of this fit can be seen below in 2:

Table 2: Ablation with problematic categories removed

B.3 Gamma mixed-effects model

In the following, we refit the main model (eq. 5) using a Gamma regression. This is the most widely recommended solution in the literature on fitting right-tailed, heteroscedastic outcomes (Feng et al., 2014; Villadsen & Wulff, 2021).

However, since we discovered this after running our initial models, we could only justify doing this as a post-hoc test.

I use the lme4-package to fit the model (Bates et al., 2014). To avoid convergence errors and adapt the model to the formula, we make the following alterations:

  1. Add a gamma log-link function (Fox, 2015)

  2. change the year (β1) estimate to decades. This has the effect of rescaling the effect size.

  3. We still log-transform the activity-level to rescale it. As this is not part of the hypothesis, this does not affect our interpretation.

Table 3: Results of Gamma GLMM

Figure 10: Residuals for the Gamma GLM. The residuals are heteroscedastic and have visible non-randomness.

This leads us to the interpretation. The effect size per decade is 0.31, which is highly significant (SE=0.006, T=24, p ≪ 0.001). This translates into 34% increase per decade or 3% increase per year. This is the same direction as our transformed linear model, with somewhat larger results. However, as these come from an almost unidentifiable fit with extremely non-random residuals, no inferences can be drawn from this.

Some of this indicates that the conditional distribution of Barrier-to-Exit is not a Gamma distribution. Pursuing the GLMM path would require further assessments of the best-fitting distribution. This could be by e.g. applying the Box-Cox method as described by Villadsen and Wulff (2021).

This paper is available on arxiv under CC 4.0 license.