Making decisions from dirty data can have disastrous outcomes In 2016, Microsoft released , an AI chatbot designed to interact with Twitter users, answer their questions, and learn from users' replies. Unlike the other chatbots of the time, however, Tay was meant to have a personality, and to have a sense of humor, engaging in “casual and playful conversation”. Tay Yet, in a few hours of release, the conversations were somewhat less “playful”: :no_upscale()/cdn.vox-cdn.com/uploads/chorus_asset/file/6238309/Screen_Shot_2016-03-24_at_10.46.22_AM.0.png) By 4 am the next day Microsoft had shut down Tay – only 16 hours after it had launched. Microsoft may have had a great idea. Tay was meant to learn from its interactions, becoming more advanced with each interaction with a user. But they seemingly didn’t account for one crucial thing – . the data itself may be bad But dirty data is not just hate speech or outliers. It’s more than that. And understanding how to clean data is essential to ensure your data tells an accurate story What is dirty data Microsoft’s chatbot fell prey to data bias, a problem most AI and predictive systems are affected by. If the dataset used to train an AI excludes certain groups or has inherent societal biases, these systems may exacerbate societal biases. Think of the infamous . Within a few minutes of Exploratory Data Analysis, you would realize that females were more likely to survive than men. If a travel insurance firm trained its system using the Titanic dataset, it would charge men markedly higher premiums than women. With that, a system meant to assess risk based on the destination of travel and the number of people traveling instead assesses risk based on gender. Titanic dataset vastly This may sound hypothetical, but to people of color. This may have been caused, in part, by AI systems that consider black borrowers riskier. banks have been charging, on average, higher interest rates How to (try) clean dirty data It’s true that data will never be clean – it will always have some level of bias and error, but by systematically evaluating the source, freshness, and attributes of the data being collected, bias and error can be reduced considerably. But how can this be done? To borrow from Kathy Baxter’s : We can evaluate the correctness of data by asking 4 W’s – , , , and ? very good article Where When Who What WHERE – Is your data source credible? The quality of your data is very important, and so it is crucial to consider where your data comes from. Could that source have a vested interest in having the data be a certain way? Might they have tweaked the data? It is very important that you get your data from a trusted, credible source. If you’re collecting the data yourself: Could the sample you collected the data from introducing bias into the data? These are all very key considerations to make your data credible. have you followed the proper procedures for collecting data systematically? If you’re starting out in Data Science and are looking for credible, real-world datasets, public datasets from The , , and the will do the job. World Bank Google US Census Bureau WHEN – Is your data source recent? You’ve probably heard that . It’s an incredible statistic, but one of the problems with this is that there is way more data about the recent past than the more distant past. the total amount of data in the world doubles every two years For example, a stock trading bot maker would have way more information about the movements in the stock market in the last month, than they would have of 5 years ago. This phenomenon is called , and while it can make your AI system good at finding short-term trends, it may falter on the more long-term, bigger-picture trends. recency bias However, the opposite is also true: a medical insurance firm making decisions based on data collected 5 years ago may not take into consideration the COVID-19 pandemic we’re still in the midst of. You will need to take into account the recency of your data, and decide if it is appropriate for your particular use case. WHO – Is your data source representative? Your dataset needs to be representative of your target users/customers. For example, if an insurance firm was targeting European clients, it would not make sense to train their system with a dataset of American customers. But it’s not often that easy to know if your dataset is representative: People of color have . While the COVID-19 vaccine trials were markedly more diverse, this was still observable. This was not intentional – hospitals typically found in POC neighborhoods tend to be underfunded and therefore unable to conduct trials. historically been underrepresented in clinical trials Pharmaceutical companies also accounted for age in these trials. Even though people over the age of 65 make up just 9% of the US population, they are vastly more likely to die of COVID-19. Pharmaceutical companies have therefore included a disproportionate amount of older people in vaccine trials. WHAT – Could the data categories you are analyzing have an inherent bias? Data categories are the different or you are analyzing - things like name, age, or location. Choosing what data categories to include is important because the presence of certain variables could introduce bias into your dataset. pieces of information variables Due to certain laws, or organizational values, certain attributes are restricted: Age Race Gender National origin Sexual orientation These variables are prohibited in certain contexts like or providing . Organizations tend to avoid them because they tend to introduce unwanted bias. It is generally advisable to stay away from these attributes unless your use case is specifically linked to them (e.g. for research purposes). employing people government services It is true that data will never be clean, but that doesn’t mean it can’t still be useful. To quote Dean Abbott, “No data is clean, but most is useful.” So, go on and make your data useful! (after cleaning it)