Authors:
(1) David Novoa-Paradela, Universidade da Coruña, CITIC, Campus de Elviña s/n, 15008, A Coruña, Spain, corresponding author (Email: [email protected]);
(2) Oscar Fontenla-Romero, Universidade da Coruña, CITIC, Campus de Elviña s/n, 15008, A Coruña, Spain (Email: [email protected]);
(3) Bertha Guijarro-Berdiñas, Universidade da Coruña, CITIC, Campus de Elviña s/n, 15008, A Coruña, Spain (Email: [email protected]).
In this work, we have proposed a pipeline for detecting anomalous reviews associated with Amazon products, which can be directly extrapolated to other online review platforms or scenarios with similar characteristics. Representing the reviews with MPNet embeddings has enabled the training of classical anomaly detection algorithms, which have achieved very good performance. These algorithms have been evaluated using reviews from different products and categories, and the score they emit allows us to sort the reviews by their normality.
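As a minimal sketch of how an anomaly score can be used to sort reviews by normality, the example below scores toy vectors by their distance to the centroid of a "normal" training set. This is not the authors' implementation: the three-dimensional vectors stand in for MPNet embeddings, the centroid-distance scorer stands in for the classical anomaly detection algorithms, and all names (`centroid`, `anomaly_score`, `rank_by_normality`, `review_a`, and so on) are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def anomaly_score(vector, center):
    """Euclidean distance to the centroid; higher means more anomalous.

    Assumption: a stand-in for the score emitted by a trained detector.
    """
    return math.dist(vector, center)


def rank_by_normality(train_vectors, candidates):
    """Return (name, score) pairs sorted from most normal to most anomalous."""
    center = centroid(train_vectors)
    scored = [(name, anomaly_score(vec, center)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy vectors standing in for embeddings of "normal" reviews of one product.
train = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1], [0.8, 0.2, 0.1]]

# Toy vectors standing in for embeddings of new reviews to be ranked.
candidates = {
    "review_a": [0.9, 0.1, 0.1],
    "review_b": [0.0, 1.0, 0.9],
    "review_c": [0.7, 0.3, 0.0],
}

ranking = rank_by_normality(train, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setting, the review whose vector lies farthest from the training centroid is ranked last, which mirrors the idea of ordering reviews from most to least normal.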
A technique based on the occurrence of frequent terms has been proposed to generate explanations associated with the classifications of the reviews. This technique has been compared with SHAP, one of the reference post-hoc techniques in the field of explainability, and with GPT-3, due to its high power and versatility. To evaluate this aspect of the pipeline, we conducted a two-part survey in which 241 members of the university community participated.
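The idea of explaining a classification through frequent terms can be illustrated with a short sketch. It assumes, purely for illustration, that the frequent terms are the words appearing in the most normal reviews of a product, and that an explanation lists which of those terms a given review contains or lacks. The exact procedure used in the paper is not described in this section, so the stopword list, the functions `frequent_terms` and `explain`, and the sample reviews are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the toy example.
STOPWORDS = {"the", "is", "and", "a", "all", "be", "could", "was", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]


def frequent_terms(reviews, top_k=3):
    """Terms that occur in the most reviews (document frequency).

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for review in reviews:
        counts.update(set(tokenize(review)))
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ordered[:top_k]]


def explain(review, terms):
    """Report which frequent terms the review contains and which it lacks."""
    tokens = set(tokenize(review))
    return {
        "present": [t for t in terms if t in tokens],
        "missing": [t for t in terms if t not in tokens],
    }


normal_reviews = [
    "The battery life is great and the screen is sharp",
    "Great screen, battery lasts all day",
    "Battery is fine, screen could be brighter",
]

terms = frequent_terms(normal_reviews)
typical = explain("Great battery", terms)
unusual = explain("The delivery was late and the box was damaged", terms)
print(terms, typical, unusual)
```

A review that shares none of the frequent terms, like the last one here, could then be presented to the user together with the terms it lacks as a simple, human-readable justification.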
From the first part of the explainability test we can conclude that, in general terms, the effect of the explanations has not been beneficial for the users. In any case, these tests allow us to reflect on the difficulty of using explainability and evaluation techniques in borderline scenarios where subjectivity plays an important role, such as the one presented in this article or in other fields of NLP, as well as in areas such as image or audio generation.
Regarding the second part of the explainability test, we conclude that respondents preferred explanations with a more natural and familiar appearance over more condensed and concise explanations such as those provided by SHAP, regardless of the explanatory effect they provide. Respondents preferred the explanations based on term frequency analysis and those from GPT-3. However, our approach has a significantly lower computational cost, and both its use and the explanations it produces are simpler for users.
As future work, it would be interesting to evaluate GPT-3 or other large models on the complete process followed by the proposed pipeline, rather than only in the explainability module. We have not carried out this test because of the high computational cost of processing the thousands of reviews to be evaluated with GPT-3. Another possible line of future work would be to broaden the scope of the survey, in terms of both the number of products and the number of reviews, in order to clarify the conclusions reached at the forward simulation stage. Lastly, it would be very useful to present the explanations issued by SHAP in a more familiar or natural format for the end user, to see whether this increases the general public's preference for them.
This work was supported in part by the grant Machine Learning on the Edge - Ayudas Fundación BBVA a Equipos de Investigación Científica 2019; the Spanish National Plan for Scientific and Technical Research and Innovation (PID2019-109238GB-C22 and TED2021-130599A-I00); the Xunta de Galicia (ED431C 2022/44); and ERDF funds. CITIC, as a Research Center of the University System of Galicia, is funded by the Consellería de Educación, Universidade e Formación Profesional of the Xunta de Galicia, Spain, through the European Regional Development Fund (ERDF) and the Secretaría Xeral de Universidades (Ref. ED431G 2019/01).
This paper is available on arXiv under a CC 4.0 license.