By some estimates, more than 75% of internet users do not speak English, even though nearly half of all online content is created in English. This imbalance has far-reaching consequences. Big Tech is often accused of disproportionately investing in English-speaking audiences. For example, Meta has been criticised for spending 87% of its content moderation budget on English-language content, even though English speakers make up just 9% of its users.

Meanwhile, modern AI systems are already responsible for a massive share of online tasks, from moderation and recommendations to text generation, and will only penetrate deeper into daily life. Yet they are mostly trained on English data. This leads to cultural bias, which already results in real-world incidents. In Colombia, for example, Meta's algorithms mistakenly removed posts featuring a satirical police cartoon, resulting in 215 appeals, 210 of which were upheld, while in Myanmar, Facebook's systems contributed to the amplification of anti-Rohingya hate speech.

This has worried me for quite some time, not only from an ethical perspective but also from a technical one. That is why, working in a lab at Queen Mary University of London, I investigated how modern multilingual language models interpret moral categories depending on the language. For the research, I used Moral Foundations Theory (which identifies five universal moral dimensions: Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, and Purity/Degradation) to annotate song lyrics across six languages and then test how well different multilingual NLP models (mBERT, XLM-R, and GPT-4 in zero-shot mode) could recognise these moral categories. This allowed me to see not only whether the models could detect moral values at all, but also whether their performance depended on the language and on the type of moral foundation.

Moral Foundations Theory, source: https://moralfoundations.github.io/

In this article, I publish the results of my experiment. They showed that models indeed perform best in English and with universal moral categories (such as care and harm), but struggle significantly with culturally specific categories (such as purity or authority).

Today, developers bear a responsibility: the standards of moral interpretation we embed into AI now will shape the way future systems handle communication, education, healthcare, and content moderation worldwide. At the same time, developers lack a standard benchmark or guideline (something equivalent to BLEU scores in machine translation) to measure how well AI systems handle cultural and moral nuance. That's why I propose what I call the Cultural Intelligence Standard, a framework designed to evaluate whether AI models can maintain both universal and culture-specific moral understanding across languages. I hope this standard can serve as a foundation for building AI systems that are not only technically powerful but also culturally aware and ethically reliable.

Research process

To test how AI interprets morality across cultures, I built a dataset of:

- 700 real songs in six languages (English, Russian, Greek, Turkish, Spanish, and French);
- 1,800 synthetic songs generated with GPT-4 for model fine-tuning;
- annotations by native speakers for five moral foundations: Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, and Purity/Degradation.
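To make the dataset side concrete, here is a minimal sketch of how one annotated record could be represented. The field names and the agreement rule are illustrative assumptions, not the actual schema or adjudication procedure used in the study.

```python
# Sketch of one possible record layout for the annotated corpus.
# Field names are illustrative; each song carries binary labels for the
# five moral foundations from two independent native-speaker annotators.
from dataclasses import dataclass, field
from typing import Dict, List

FOUNDATIONS = ["care_harm", "fairness_cheating", "loyalty_betrayal",
               "authority_subversion", "purity_degradation"]

@dataclass
class AnnotatedSong:
    song_id: str
    language: str          # "en", "ru", "el", "tr", "es", "fr"
    source: str            # e.g. "genius", "spotify", or "synthetic"
    lyrics: str
    # one {foundation: 0/1} dict per annotator
    annotations: List[Dict[str, int]] = field(default_factory=list)

    def gold_labels(self) -> Dict[str, int]:
        """Resolve the two annotators by requiring agreement (logical AND);
        other rules (union, adjudication by a third annotator) are equally
        plausible and not specified by the study."""
        return {f: int(all(a.get(f, 0) for a in self.annotations))
                for f in FOUNDATIONS}
```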
Songs were chosen from Genius and Spotify to cover a variety of genres, while the synthetic ones were created to give the models extra training material. Each song was annotated by two independent native speakers using the five Moral Foundations Theory categories listed above.

To evaluate performance, I relied on standard classification metrics: precision, recall, and F1-score. They were calculated both per language and per moral category, and averaged across the dataset. F1-score was the primary metric, since it balances precision and recall and is widely used in multi-label classification tasks.

I then compared three models: mBERT, XLM-R, and GPT-4 in zero-shot mode. The first two were fine-tuned on the dataset, while GPT-4 was tested as-is, to see how well it could generalise moral categories without special training.

What I discovered

1. Universal values are easier to detect than culture-specific ones

Among all the moral categories I explored, Care / Harm was the most predictable and stable for the models. For example, mBERT achieved F1-scores ranging from 0.67 to 0.87 depending on the language. Even GPT-4 in zero-shot mode delivered remarkable results here, reaching up to 0.97 F1 in Turkish. This suggests that values related to care and the prevention of harm are expressed through universal linguistic patterns that AI can reliably capture.

The picture was very different for Purity / Degradation. Here, performance dropped sharply: in the Greek corpus, F1-scores were as low as 0.04, and in other languages they rarely exceeded 0.25. Concepts of "purity" and "degradation" are tied to culturally specific metaphors, which makes them harder for models to interpret.

In short, AI can reliably recognise universal moral categories such as care and harm, but fails with culture-specific categories like purity.

2. Models differ in stability and reliability

The comparison between models revealed clear differences in how they handle moral classification across languages.

mBERT proved to be the most stable overall, with micro F1-scores ranging from 0.45 to 0.53. While it wasn't always the top performer, its consistency made it the most reliable baseline across all six languages.

XLM-R, by contrast, showed greater variability. Its performance ranged from 0.37 to 0.55, with particularly weak results on French data. In Spanish and Turkish, it sometimes outperformed mBERT, but overall its performance was much less stable.

GPT-4 in zero-shot mode showed strong potential, especially for the Care / Harm category, where it sometimes outperformed the fine-tuned models; for instance, it achieved up to 0.97 F1 on Turkish Care / Harm lyrics. However, its results on other categories were highly inconsistent, showing that large language models can generalise some universal moral concepts but remain unreliable when cultural specificity is required.
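For reference, all of the scores quoted in these findings come from standard multi-label classification metrics computed per language and per foundation. A minimal sketch of that computation, assuming predictions have already been binarised and using illustrative names, could look like this:

```python
# Sketch: per-language, per-foundation evaluation for multi-label
# moral-foundation classification. Assumes y_true and y_pred are binary
# arrays with one column per foundation; names are illustrative.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

FOUNDATIONS = ["care_harm", "fairness_cheating", "loyalty_betrayal",
               "authority_subversion", "purity_degradation"]

def evaluate(y_true, y_pred, languages):
    """y_true, y_pred: (n_songs, 5) binary arrays; languages: list of str."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    report = {}
    for lang in sorted(set(languages)):
        mask = np.array([l == lang for l in languages])
        t, p = y_true[mask], y_pred[mask]
        report[lang] = {
            "micro_f1": f1_score(t, p, average="micro", zero_division=0),
            "precision": precision_score(t, p, average="micro", zero_division=0),
            "recall": recall_score(t, p, average="micro", zero_division=0),
            # per-foundation F1 (e.g. Care/Harm vs Purity/Degradation gaps)
            "per_foundation_f1": dict(zip(
                FOUNDATIONS, f1_score(t, p, average=None, zero_division=0))),
        }
    return report
```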
These findings show that no single model can yet deliver both high performance and stability across all languages and moral foundations. mBERT provides the most consistent baseline, XLM-R occasionally outperforms it but with high variance, and GPT-4 demonstrates the promise (and the current limits) of zero-shot cultural reasoning.

3. Models confuse similar moral categories

The error analysis revealed that models often confused moral categories that are semantically or culturally close.

A common mistake was mixing up Fairness / Cheating with Loyalty / Betrayal. Both involve ideas of trust and reciprocity, but fairness relates to abstract principles of justice, while loyalty reflects obligations within a group. For the models, this distinction was too subtle, leading to systematic misclassifications.

Another frequent error was the mislabeling of Authority / Subversion as Loyalty / Betrayal. In many cultural contexts, respect for authority and loyalty to one's group are closely intertwined. A lyric about obedience to elders, for instance, could be understood as authority in one culture but as loyalty in another. The models often failed to separate the two, reflecting how tightly these values are linked in natural language.

There was also a clear tendency toward over-classification of Care / Harm. Because this foundation has strong and obvious linguistic markers (words about love, protection, pain, or harm), the models defaulted to it more often than appropriate.

These results show that AI is sensitive to strong universal signals but struggles with finer distinctions between moral foundations, particularly where those distinctions depend on cultural context.

What developers can do about this problem

This is how I came to realise that AI, at its current stage of development, remains culturally blind. And the problem will only get worse, both as data volumes grow and as the technology becomes more widely adopted, unless we establish standards and regulations as soon as possible. For the developer community behind these technologies, I suggest several practical steps to help address the problem.

First, test AI across all five moral foundations, not just the common ones like Care / Harm. Detecting harm is important, but it is not enough: if a system cannot handle categories like authority or purity, it is by definition incomplete for global use.

Second, introduce a Cultural Intelligence Standard, a benchmark similar to BLEU in machine translation, but designed to evaluate cultural and moral adequacy. For instance, it could include (see the sketch after this list):

- transparency labels (similar to energy efficiency ratings), showing how well a model performs across moral foundations and languages;
- the 80/20 rule: performance in any supported language should reach at least 80% of the English baseline, and no moral foundation should fall below 20%;
- a new metric, the Cultural Alignment Score, measuring how closely model predictions align with the judgments of native speakers.
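As a rough illustration of how such checks might work in practice, here is a minimal sketch. The 0.20 F1 floor is one possible reading of "no moral foundation should fall below 20%", and the Cultural Alignment Score shown here (mean agreement with native-speaker labels) is just one way the metric could be operationalised.

```python
# Sketch of the proposed checks, assuming per-language and per-foundation
# F1 scores are already available (e.g. from an evaluate() helper like the
# one above). Thresholds and the alignment formula are illustrative.
import numpy as np

def passes_80_20(per_lang_f1, per_foundation_f1, english_key="en"):
    """per_lang_f1: {language: micro F1}; per_foundation_f1: {foundation: F1}.
    Interprets 'below 20%' as an absolute F1 floor of 0.20; reading it as
    20% of the English baseline would be another plausible choice."""
    baseline = per_lang_f1[english_key]
    lang_ok = all(score >= 0.8 * baseline for score in per_lang_f1.values())
    foundation_ok = all(score >= 0.20 for score in per_foundation_f1.values())
    return lang_ok and foundation_ok

def cultural_alignment_score(model_labels, native_labels):
    """Fraction of (song, foundation) judgments where the model agrees with
    native-speaker annotators; both inputs are (n_songs, 5) binary arrays."""
    model_labels = np.asarray(model_labels)
    native_labels = np.asarray(native_labels)
    return float((model_labels == native_labels).mean())
```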
Third, consider using hybrid systems that combine the stability of fine-tuned models (such as mBERT) with the broad, intuitive knowledge of large language models.

Fourth, invest more in multilingual training data. My experiment showed that synthetic texts are too plain and smoothed out to reflect the richness of real cultural contexts. Reliable systems need more diverse and authentic corpora.

Fifth, always keep cultural context in mind while building. The same expression can belong to different categories in different societies: respect for elders might be classified as authority in one culture and loyalty in another. Systems need to be sensitive to such distinctions.

And finally, involve native speakers, gather feedback, and allow models to "hesitate" by flagging cases where moral interpretation is uncertain.
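One simple way to let a model "hesitate" is to abstain whenever its score for a foundation falls inside an uncertainty band and route that lyric to a human reviewer instead. The sketch below assumes sigmoid outputs per foundation; the 0.35-0.65 band is an illustrative choice, not a value from the study.

```python
# Sketch of a "hesitation" step: instead of forcing a hard yes/no for every
# moral foundation, the model abstains when its sigmoid score falls inside
# an uncertainty band, flagging the case for native-speaker review.
import numpy as np

def predict_with_hesitation(probs, low=0.35, high=0.65):
    """probs: (n_songs, 5) sigmoid outputs. Returns labels in {0, 1, -1},
    where -1 means 'uncertain - route to a native-speaker reviewer'."""
    probs = np.asarray(probs)
    labels = (probs >= high).astype(int)          # confident positives
    labels[(probs > low) & (probs < high)] = -1   # abstain on the middle band
    return labels
```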