Authors:
(1) Yuan Wang, University of Rochester;
(2) Yangxin Fan, University of Rochester.
WP fatal police shooting dataset insight
Fatal police shooting rate and victims' race prediction
In this part, we use the insights drawn from the WP data and the multi-attribute correlation analysis to build predictive models. We constructed a series of regression models to predict state-level fatal police shooting rates and a series of classification models to predict the race of fatal police shooting victims.
Based on the correlation analysis above, we chose the violent crime rate, land area, gun ownership rate, and state joined year as predictors because they have the highest correlation coefficients with the fatal police shooting rate. To obtain more data points, we treated each state in each year from 2015 to 2019 as a separate observation.
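The selection step described here can be illustrated with a short sketch. It is a minimal example, not the authors' code: the feature names and values are hypothetical, and ranking by the absolute value of the Pearson coefficient is an assumption about what "highest correlation" means.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values for illustration only; these are NOT the paper's data.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "feature_a": [2.0, 4.0, 6.0, 8.0, 10.0],   # moves exactly with the target
    "feature_b": [5.0, 3.0, 4.0, 1.0, 2.0],    # moves mostly against the target
    "feature_c": [1.0, -1.0, 0.0, -1.0, 1.0],  # no linear relationship
}

# Rank candidate predictors by the strength (absolute value) of their correlation.
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], target)),
    reverse=True,
)
for name in ranked:
    print(name, round(pearson(features[name], target), 3))
```

A strongly negative coefficient is as useful for prediction as a strongly positive one, which is why the sketch sorts on the absolute value.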
In the Weka machine learning software, we tried all available models and kept the three best-performing ones based on ten-fold cross-validation performance. The best is Kstar [1], which achieved a cross-validation relative absolute error of 28.04% and explained 88.53% of the variance, followed by K-Nearest-Neighbor regression and Random Forest. All three performed much better than the baseline linear regression model (see Table-2).
Figure-19 displays the cross-validation prediction error for each data point in the Kstar model, where each data point represents the fatal police shooting rate of one state in one year. The X-axis is the actual fatal police shooting rate and the Y-axis is the predicted rate. A larger cross indicates a larger prediction error.
The prediction model tells us that the causes of fatal police shootings could be complex: the rate is related to the state joined year, state land area, gun ownership rate, and violent crime rate. This suggests that the problem should be understood from multiple dimensions.
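To make the evaluation metric concrete, the following sketch computes a relative absolute error under k-fold cross-validation for a toy nearest-neighbor regressor. It is pure Python and is not the Weka implementation. The function names, the toy data, the fold assignment, and the value of k are assumptions for illustration. Relative absolute error is taken here as the model's total absolute error divided by the total absolute error of always predicting the mean; Weka's own computation may differ in detail.

```python
from statistics import mean

def relative_absolute_error(actual, predicted):
    """Model's total absolute error over that of always predicting mean(actual)."""
    baseline = mean(actual)
    model_error = sum(abs(p - a) for a, p in zip(actual, predicted))
    baseline_error = sum(abs(baseline - a) for a in actual)
    return model_error / baseline_error

def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k training rows closest to x (squared Euclidean)."""
    by_distance = sorted(
        (sum((u - v) ** 2 for u, v in zip(row, x)), y)
        for row, y in zip(train_X, train_y)
    )
    return mean([y for _, y in by_distance[:k]])

def cross_val_predict(X, y, n_folds=10, k=3):
    """Predict every row using only the rows that fall in the other folds."""
    predictions = [None] * len(X)
    for fold in range(n_folds):
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in range(len(X)):
            if i % n_folds == fold:
                predictions[i] = knn_predict(train_X, train_y, X[i], k)
    return predictions

# Hypothetical toy data (NOT the paper's): one feature, target roughly 2 * feature.
X = [[float(i)] for i in range(30)]
y = [2.0 * i + (1 if i % 2 else -1) for i in range(30)]

preds = cross_val_predict(X, y)
rae = relative_absolute_error(y, preds)
print(f"cross-validated relative absolute error: {rae:.2%}")
```

A value below 100% means the model beats the predict-the-mean baseline, which is the sense in which a 28.04% error is a large improvement.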
This prediction is intended to test whether there is racial discrimination in fatal police shootings. The null hypothesis is that the model cannot predict the victim’s race (no racial discrimination). The alternative hypothesis is that the model can predict the victim’s race (racial discrimination). We used WP data from 01/01/2015 to 02/12/2020 and excluded records with missing race information, leaving 4518 records in total. Since “age” is the only numeric variable, we applied the chi-square test to the remaining categorical variables to select predictors.
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where $\chi^2$ = chi squared, $O_i$ = observed value, $E_i$ = expected value.
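As a worked illustration of the statistic with observed values O_i and expected values E_i, the sketch below computes a chi-square test of independence for a contingency table. The function name and the counts are hypothetical and are not taken from the WP data. Expected counts are derived from the row and column totals under the independence assumption.

```python
def chi_square_independence(table):
    """Return (chi2, dof) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Hypothetical counts: rows = attribute present / absent, columns = two groups.
table = [[30, 20],
         [20, 30]]
chi2, dof = chi_square_independence(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}")  # chi2 = 4.00, dof = 1

# The critical value at the 0.05 level with 1 degree of freedom is 3.841,
# so independence would be rejected for these toy counts.
print("reject independence:", chi2 > 3.841)
```

The degrees of freedom grow with the number of categories, (rows - 1) x (columns - 1), which is why attributes with very many distinct values are awkward to test this way.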
After applying the chi-square test to the above categorical variables, we find that threat level, signs of mental illness, armed, flee, body camera, and gender are not independent of race at the 0.05 significance level (see Table 4). On the other hand, manner of death and is geocoding exact are independent of race at the 0.05 significance level. For city and state, the number of degrees of freedom (DF) is too large to apply the chi-square test. Finally, we chose armed, age, gender, signs of mental illness, threat level, flee, and body camera as predictors, with city and state as backup predictors for the racial classification model.
In the Weka machine learning software and a Python AutoML package, we tried all available models and chose the top three based on stratified five-fold cross-validation performance (see Table-5 below).
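Stratified cross-validation keeps the class proportions roughly equal in every fold, which matters when the race classes are imbalanced. The sketch below shows one simple way to build such folds. It is an assumption-laden illustration, not the procedure used by Weka or the AutoML package: the function name and the label counts are hypothetical, and real implementations usually shuffle records first.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds=5):
    """Assign each record a fold id so every fold has similar class proportions."""
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)

    fold_of = [None] * len(labels)
    for indices in indices_by_class.values():
        # Deal the records of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of

# Hypothetical label distribution for illustration only.
labels = ["W"] * 50 + ["B"] * 25 + ["H"] * 25
fold_of = stratified_folds(labels)

for fold in range(5):
    counts = Counter(lab for lab, f in zip(labels, fold_of) if f == fold)
    print(fold, dict(counts))
```

Each fold then serves once as the held-out test set while the other four are used for training.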
We find that adding the city and state attributes boosts model performance. The Gradient Boosting Machine (GBM) [4] performs best, with 0.589 precision and 0.611 recall, only slightly better than predicting all victims to be white (about 50% precision and recall). The GBM algorithm also indicates the importance of the attributes we selected for prediction: city, state, armed, and age play essential roles in racial prediction (see Figure-20 below). We therefore fail to reject the null hypothesis: even the best-performing model cannot predict victims’ race well, which under this test indicates no racial discrimination in the fatal police shootings recorded in the WP data.
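The comparison against the "predict everyone as white" baseline can be sketched as follows. The source does not state how precision and recall were averaged across classes. This example assumes micro-averaging, under which both metrics equal plain accuracy for single-label classification. The function name and the class shares are hypothetical, chosen so that the majority class makes up half of the records.

```python
from collections import Counter

def micro_precision_recall(y_true, y_pred):
    """Micro-averaged precision and recall, pooling counts over all classes."""
    true_pos = false_pos = false_neg = 0
    for cls in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == cls and truth == cls:
                true_pos += 1
            elif pred == cls and truth != cls:
                false_pos += 1
            elif pred != cls and truth == cls:
                false_neg += 1
    return true_pos / (true_pos + false_pos), true_pos / (true_pos + false_neg)

# Hypothetical class shares for illustration only (NOT the WP counts).
y_true = ["W"] * 50 + ["B"] * 30 + ["H"] * 20

# Majority-class baseline: always predict the most frequent label.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

precision, recall = micro_precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.50, 0.50
```

Under a different scheme, such as support-weighted averaging, the baseline precision would come out lower, so the size of the gap to the GBM scores depends on this choice.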
This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.