
Assessing Non-linearities and Distribution Assumptions in Barrier-to-Exit Analysis


Too Long; Didn't Read

Analysis of alternative models reveals challenges in model robustness and distribution assumptions in Amazon's Barrier-to-Exit study. Post-hoc adjustments using Gamma regression show potential for improved fit, yet caution is warranted due to non-random residuals and distribution suitability concerns. Further exploration into distribution fitting methods may be necessary for a comprehensive understanding.

Authors:

(1) Jonathan H. Rystrøm.

Abstract and Introduction

Previous Literature

Methods and Data

Results

Discussions

Conclusions and References

A. Validation of Assumptions

B. Other Models

C. Pre-processing steps

B Other Models

B.1 Model without activity-level


However, when we ran the model (also using lmerTest (Kuznetsova et al., 2017)) and plotted the residuals, we got the following:


Figure 9: Residuals of the initial model. From a visual inspection, the residuals are not randomly distributed


Even a brief visual inspection shows that the residuals are not randomly distributed: there are two distinct "bands" that both seem to trend upward. This violates the assumption that the residuals are randomly distributed (Poole & O’Farrell, 1971). While the impact of many assumption violations diminishes with enough data (Baayen et al., 2008; Schielzeth et al., 2020), non-linearity is not one of them (Poole & O’Farrell, 1971).


Fortunately, the non-random residuals were (partially) fixed by introducing activity-level for the reasons described in section 3.3.
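One common way for banded residuals to arise is a covariate that is missing from the model. The following is a toy sketch on synthetic data, not a reproduction of the paper's model: the variables `x` and `g` and all coefficients are hypothetical, and the omitted covariate is binary here purely to keep the example small.

```python
# Toy illustration: leaving a covariate out of the fit splits the residuals into bands.
# All names and numbers are invented for this sketch.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, returning (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Balanced design: every x appears once with g = 0 and once with g = 1.
rows = [(x, g) for x in range(10) for g in (0, 1)]
xs = [x for x, _ in rows]
gs = [g for _, g in rows]
ys = [1.0 + 0.5 * x + 3.0 * g for x, g in rows]  # data-generating model includes g

# Misspecified fit that leaves g out.
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Group the residuals by the omitted covariate to expose the two bands.
band_low = [r for r, g in zip(residuals, gs) if g == 0]
band_high = [r for r, g in zip(residuals, gs) if g == 1]
print(min(band_low), max(band_low))
print(min(band_high), max(band_high))
```

In this toy setup the residuals sit in two separate bands, at about -1.5 and +1.5, one per level of the omitted covariate. Adding the covariate to the fit would remove the bands.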

B.2 Problematic Categories Removed

Here we fit the main model (eq. 5) with problematic categories removed. We define a problematic category as one with a fitted random effect of less than -0.5. We obtained this threshold by visually inspecting Fig. 7.
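The filtering rule can be sketched as follows. This is a minimal illustration that assumes the fitted random effects are available as a mapping from category to value. The category names and values are invented, and `split_categories` is a hypothetical helper, not part of the original analysis.

```python
# Sketch of the exclusion rule: drop categories whose fitted random effect is below -0.5.
# The dictionary below is invented example data, not output from the paper's model.

THRESHOLD = -0.5

def split_categories(random_effects, threshold=THRESHOLD):
    """Return (kept, dropped) category names, each sorted alphabetically."""
    kept = sorted(c for c, e in random_effects.items() if e >= threshold)
    dropped = sorted(c for c, e in random_effects.items() if e < threshold)
    return kept, dropped

example_effects = {
    "cat_a": 0.12,
    "cat_b": -0.73,
    "cat_c": -0.49,
    "cat_d": -0.50,  # exactly at the threshold, so it is kept ("less than" is strict)
    "cat_e": 0.40,
}

kept, dropped = split_categories(example_effects)
print("kept:", kept)
print("dropped:", dropped)
```

The model would then be refit on the rows belonging to the kept categories only.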


The results of this fit can be seen below in Table 2:


Table 2: Ablation with problematic categories removed


B.3 Gamma mixed-effects model

In the following, we refit the main model (eq. 5) using a Gamma regression. This is the most widely recommended solution in the literature on fitting right-tailed, heteroscedastic outcomes (Feng et al., 2014; Villadsen & Wulff, 2021).


However, since we discovered this after running our initial models, we could only justify doing this as a post-hoc test.


We use the lme4 package to fit the model (Bates et al., 2014). To avoid convergence errors and adapt the model to the formula, we make the following alterations:


  1. Add a Gamma log-link function (Fox, 2015).


  2. Change the year (β1) estimate to decades. This has the effect of rescaling the effect size.


  3. Keep the log-transformation of the activity-level to rescale it. As this is not part of the hypothesis, this does not affect our interpretation.
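The two rescaling steps above can be sketched in a few lines. This is an illustration only: the argument names `year` and `activity` are assumptions, the data are synthetic, and whether the original analysis also centred the year variable is not stated in the text. The sketch also shows why measuring time in decades multiplies the slope by ten without changing the fit.

```python
import math

def rescale(year, activity):
    """Return (decade, log_activity); both argument names are hypothetical."""
    return year / 10.0, math.log(activity)

def ols_slope(xs, ys):
    """Slope of a simple least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

years = list(range(2000, 2010))
outcome = [1.0 + 0.03 * (yr - 2000) for yr in years]  # synthetic linear trend
decades = [rescale(yr, 1.0)[0] for yr in years]       # activity is a dummy value here

slope_per_year = ols_slope(years, outcome)
slope_per_decade = ols_slope(decades, outcome)
print(slope_per_year, slope_per_decade)
```

The per-decade slope is ten times the per-year slope. Predictors on a similar numeric scale also tend to help the optimiser converge, which is one reason to rescale them.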


Table 3: Results of Gamma GLMM



Figure 10: Residuals for the Gamma GLMM. The residuals are heteroscedastic and have visible non-randomness.


This leads us to the interpretation. The effect size per decade is 0.31, which is highly significant (SE=0.006, T=24, p ≪ 0.001). This translates into a 34% increase per decade, or about a 3% increase per year. The direction is the same as in our transformed linear model, with a somewhat larger effect. However, as these estimates come from an almost unidentifiable fit with extremely non-random residuals, no inferences can be drawn from them.
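Under a log link, a coefficient acts multiplicatively on the expected outcome: a one-unit increase in the predictor multiplies the mean by exp(β). The following sketch shows the generic conversion from a coefficient to a percentage change, and from a per-decade coefficient to a per-year change. The function names are hypothetical, and the input value is chosen for easy checking rather than taken from Table 3.

```python
import math

def percent_change(beta):
    """Percentage change in the expected outcome per unit of the predictor (log link)."""
    return (math.exp(beta) - 1.0) * 100.0

def per_year_from_per_decade(beta_decade):
    """Percentage change per year implied by a per-decade log-link coefficient."""
    return percent_change(beta_decade / 10.0)

# Illustrative coefficient: log(2) per decade means the outcome doubles every decade.
beta_decade = math.log(2.0)
print(percent_change(beta_decade))            # per-decade change in percent
print(per_year_from_per_decade(beta_decade))  # per-year change in percent
```

Because the effect compounds, the per-year percentage is not simply the per-decade percentage divided by ten.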


This suggests that the conditional distribution of Barrier-to-Exit is not a Gamma distribution. Pursuing the GLMM path would require further assessment of the best-fitting distribution, for example by applying the Box-Cox method as described by Villadsen and Wulff (2021).
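As a rough idea of what a Box-Cox assessment involves, the sketch below runs a grid search over the Box-Cox parameter λ using the standard profile log-likelihood for an intercept-only model. It is a generic textbook version on synthetic data, not the procedure of Villadsen and Wulff (2021). It ignores the covariates and random effects of the main model, and the grid range is an arbitrary assumption.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y; the limit lam -> 0 is log(y)."""
    if abs(lam) < 1e-12:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def profile_loglik(ys, lam):
    """Profile log-likelihood of lam (up to a constant), intercept-only normal model."""
    n = len(ys)
    zs = [box_cox(y, lam) for y in ys]
    mean_z = sum(zs) / n
    var_z = sum((z - mean_z) ** 2 for z in zs) / n
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var_z) + (lam - 1.0) * sum(math.log(y) for y in ys)

def best_lambda(ys, grid=None):
    """Grid value of lam that maximises the profile log-likelihood."""
    if grid is None:
        grid = [i / 10.0 for i in range(-20, 21)]  # -2.0 to 2.0 in steps of 0.1
    return max(grid, key=lambda lam: profile_loglik(ys, lam))

# Synthetic right-skewed data whose logarithm is symmetric around zero.
ys = [math.exp(-2.0 + 0.1 * i) for i in range(41)]
print(best_lambda(ys))
```

A value of λ near 0 points towards a log transformation, while a value near 1 suggests leaving the outcome untransformed. In a full analysis the likelihood would be profiled on the model residuals rather than on the raw outcome.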


This paper is available on arXiv under CC 4.0 license.