Table of Links
- Abstract and Introduction
- Related Work
- Feedback Mechanisms
- The NewsUnfold Platform
- Results
- Discussion
- Conclusion
- Acknowledgments and References
A. Feedback Mechanism Study Texts
B. Detailed UX Survey Results for NewsUnfold
C. Material Bias and Demographics of Feedback Mechanism Study
5 Results
From March 4th to March 11th, 2023, NewsUnfold had 187 unique visitors. Of these, 158 read articles, 33 (20.89%) provided sentence feedback, and eight offered 25 additional written reasons for their feedback, mainly on sentences they perceived as biased (84%) but that were highlighted as not biased (80%). 45 visitors (28.48%) completed the tutorial, and 13 (6.9%) completed the UX survey. Geographically, 61% were from Germany, 25% from Japan, and 6% from the United States. Language-wise, 45% preferred English and 42% preferred German. Notably, 52% accessed the platform via mobile, highlighting the importance of mobile optimization.[14]
The 357 sentences collectively received 1997 individual annotations, each representing agreement or disagreement with the presented classifier outcome. We identify two spammers within the 5% spammer score range and remove their 47 annotations, leaving 1950 valid annotations in the dataset. 316 sentences attain a label through the repeated-labeling method. A sentence is categorized as decided if there is a majority, controversial if the biased-to-unbiased feedback ratio lies between 40% and 60%[15], and undecided if the ratio stands at exactly 50%, as listed in Figure 6. The 310 decided sentences, spanning nine topics, form NUDA.[16]
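The thresholds above translate into a simple per-sentence labeling rule. The sketch below illustrates one possible implementation, assuming per-sentence counts of "biased" and "not biased" feedback; the column names and toy data are illustrative and not taken from the paper's code, and the spammer-filtering step is omitted.

```python
import pandas as pd

def label_sentence(n_biased: int, n_unbiased: int) -> str:
    """Assign a repeated-labeling outcome from raw feedback counts.

    Thresholds follow the description in the text: an exact 50/50 split is
    'undecided', a biased share between 40% and 60% is 'controversial',
    and anything else is 'decided' (clear majority).
    """
    total = n_biased + n_unbiased
    if total == 0:
        return "unlabeled"
    ratio = n_biased / total
    if ratio == 0.5:
        return "undecided"
    if 0.4 <= ratio <= 0.6:
        return "controversial"
    return "decided"

# Hypothetical feedback table: one row per (sentence, annotator) vote.
feedback = pd.DataFrame({
    "sentence_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "vote": ["biased", "biased", "not_biased",
             "biased", "not_biased",
             "not_biased", "not_biased", "not_biased", "biased"],
})

# Count votes per sentence and apply the labeling rule.
counts = feedback.pivot_table(index="sentence_id", columns="vote",
                              aggfunc="size", fill_value=0)
counts["label"] = [label_sentence(row.get("biased", 0), row.get("not_biased", 0))
                   for _, row in counts.iterrows()]
print(counts)
```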
Data Quality
To evaluate whether NewsUnfold increases data quality, we calculate the inter-annotator agreement (IAA) score Krippendorff's α. The NUDA dataset achieves a Krippendorff's α of .504. The 26.31% increase in IAA over the baseline's IAA of .399 (Spinde et al. 2021b) is statistically significant, as demonstrated in Figure 5 by the non-overlapping bootstrapped confidence intervals. To demonstrate that the IAA does not merely increase with sample size but through higher data quality, we take 100 randomly sized dataset samples (n = 10 to n = 1950), calculate the IAA for each, and employ a regression model.
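As a rough sketch of this evaluation, Krippendorff's α and a bootstrapped confidence interval can be computed as below. The sketch assumes the `krippendorff` Python package and an annotators-by-sentences matrix with NaN for missing annotations; the synthetic data, variable names, and number of bootstrap resamples are illustrative and not taken from the paper.

```python
import numpy as np
import krippendorff  # pip install krippendorff

rng = np.random.default_rng(42)

# Hypothetical reliability matrix: rows = annotators, columns = sentences,
# values = 1 (biased) / 0 (not biased), NaN where an annotator gave no feedback.
reliability = rng.choice([0.0, 1.0, np.nan], size=(30, 357), p=[0.4, 0.4, 0.2])

alpha = krippendorff.alpha(reliability_data=reliability,
                           level_of_measurement="nominal")

# Bootstrap over sentences (columns) to obtain a 95% confidence interval.
boot_alphas = []
n_units = reliability.shape[1]
for _ in range(1000):
    cols = rng.integers(0, n_units, size=n_units)
    boot_alphas.append(krippendorff.alpha(reliability_data=reliability[:, cols],
                                          level_of_measurement="nominal"))
ci_low, ci_high = np.percentile(boot_alphas, [2.5, 97.5])
print(f"alpha = {alpha:.3f}, 95% bootstrap CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Non-overlapping intervals of this kind, computed for two datasets, are what Figure 5 uses to argue the IAA difference is statistically significant.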
The model's explanatory power (R² = .009, adjusted R² = −.002) suggests a negligible linear relationship between sample size and the IAA (Table 4). This implies that the model does not explain the variance in IAA when accounting for the increase in data points. Moreover, the F-statistic of .8424 (p = .361) does not provide evidence to reject the null hypothesis that there is no linear relationship between sample size and IAA (x1 = −.000004, SD = .000004, t = −.918, CI [−.00001, .000004]). Therefore, we conclude that the collected data is reliable and that increases in quantity do not necessarily translate into increased data quality. Further, we conducted a manual evaluation by annotating 310 sentences and comparing these expert annotations against the labels provided by NUDA. The comparison yielded an agreement of 90.97% across 282 labels and a disagreement of 9.03% over 28 labels. Specifically, the experts identified 25 sentences as biased that NUDA did not, whereas only three sentences deemed not biased by the experts were classified as biased by NUDA. A closer examination of the disagreeing labels revealed that the primary source of discrepancy was sentences containing direct quotes. When we removed 69 sentences predominantly consisting of direct quotes, the agreement increased to 95.44% on 230 labels, with the disagreement rate dropping to 4.56% on 11 labels. Of these, ten sentences the experts labeled as biased were not labeled as biased by NUDA, and one sentence the experts labeled as not biased was labeled as biased by NUDA. This high agreement rate suggests that NewsUnfold can gather high-quality annotations and labels.
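A minimal sketch of this sample-size check, assuming the per-sample IAA values have already been computed, fits an ordinary least squares model with `statsmodels`; the placeholder IAA values and random sample sizes below are illustrative, not the paper's data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical inputs: 100 random sample sizes between 10 and 1950 annotations,
# and the IAA (Krippendorff's alpha) computed on each subsample.
sample_sizes = rng.integers(10, 1951, size=100)
iaa_values = 0.5 + rng.normal(scale=0.02, size=100)  # placeholder IAA values

# Ordinary least squares: does IAA depend linearly on sample size?
X = sm.add_constant(sample_sizes.astype(float))
fit = sm.OLS(iaa_values, X).fit()

print(f"R^2 = {fit.rsquared:.3f}, adjusted R^2 = {fit.rsquared_adj:.3f}")
print(f"F = {fit.fvalue:.4f}, p = {fit.f_pvalue:.3f}")
print("slope (x1):", fit.params[1], "95% CI:", fit.conf_int()[1])
```

A slope indistinguishable from zero, as reported above, indicates that the agreement level is not an artifact of the growing number of annotations.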
Classifier Performance
After merging NUDA with the BABE dataset, the average F1 score (5-fold cross-validation) is .824 (Table 3), a 2.49% improvement over the BABE baseline (Spinde et al. 2021b). While this may not constitute a substantial improvement, it is a positive increment in the anticipated direction. We conduct five 5-fold cross-validations with different distributions to control for potential biases in the F1 score due to an imbalanced dataset distribution. Folds and repetitions show only marginal differences, with a variance of .000022, suggesting that the data quality provides reliable results.
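As a rough illustration of this repeated cross-validation check, the sketch below uses scikit-learn's `RepeatedStratifiedKFold` to run five 5-fold splits with different shuffles and reports the mean and variance of the F1 score. The TF-IDF plus logistic regression pipeline is only a lightweight stand-in for the transformer-based classifier used in the paper, and the toy sentences are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data: sentences labeled 1 (biased) or 0 (not biased).
sentences = [
    "The senator delivered a speech on the new budget.",
    "The reckless senator rammed through yet another wasteful budget.",
    "Officials confirmed the figures on Tuesday.",
    "So-called experts once again misled the public.",
] * 25
labels = np.array([0, 1, 0, 1] * 25)

# Lightweight stand-in classifier (the paper fine-tunes a transformer model).
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Five repetitions of 5-fold cross-validation with different shuffles.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(model, sentences, labels, scoring="f1", cv=cv)

print(f"mean F1 = {scores.mean():.3f}, variance = {scores.var():.6f}")
```

A low variance across folds and repetitions, as in the paper's .000022, indicates that the reported F1 score is not driven by a particular split of the data.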
User Experience Survey Results
Thirteen participants took part in the UX survey. They express positive feelings about the platform and the bias highlights (Appendix B). The platform's ease of use receives a high rating of 8.46 on a 10-point scale, indicating a user-friendly design, which is affirmed by participants' descriptions of the interface as intuitive and concise. While almost all users state a positive effect on reading more critically, some raise concerns about the calibration of the highlights, their ineffectiveness with unbiased articles, and bias introduced by direct quotes in news articles.
Participants exhibit varied opinions on providing feedback: most enjoy it, some are undecided, and one finds it work-like (Appendix B). For those interested in giving feedback, the survey indicates that the process is easy.
One participant mentioned that skipping the tutorial leads to confusion; future iterations could therefore make the tutorial mandatory. In conclusion, we expect that the ease of use facilitates higher retention rates and engagement, while the self-reported heightened media bias awareness positively correlates with data quality.
Authors:
(1) Smi Hinterreiter;
(2) Martin Wessel;
(3) Fabian Schliski;
(4) Isao Echizen;
(5) Marc Erich Latoschik;
(6) Timo Spinde.
This paper is available on arxiv under CC0 1.0 license.
[14] We detail all statistics on https://doi.org/10.5281/zenodo.8344891.
