Machine learning is a remarkable technology. It promises to disrupt business analytics, customer service, hiring, and more. But all this potential comes with some equally weighty concerns. Data poisoning could render these algorithms useless or even harmful.
Governance and security challenges are already among the biggest barriers to adopting machine learning, and data poisoning sits at the intersection of both.
Businesses need to understand these risks to use this technology safely and effectively. With that in mind, here’s a closer look at data poisoning and what companies can do to prevent it.
Despite its massive potential for disruption, data poisoning is fairly straightforward. It involves injecting misleading or erroneous information into machine learning training data sets. When the model learns from this poisoned data, it’ll produce inaccurate results.
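To make the mechanics concrete, here is a minimal sketch of a label-flipping attack, one of the simplest forms of data poisoning. Everything in it is a stand-in for illustration: the synthetic data set, the logistic regression model, and the flip fractions are arbitrary choices, not a reference to any specific incident.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def poison_labels(y, fraction, rng):
    """Return a copy of y with `fraction` of its binary labels flipped."""
    y_poisoned = y.copy()
    flip_idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y_poisoned[flip_idx] = 1 - y_poisoned[flip_idx]
    return y_poisoned

rng = np.random.default_rng(0)
for fraction in (0.0, 0.1, 0.3):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, poison_labels(y_train, fraction, rng))
    print(f"labels flipped: {fraction:.0%} -> test accuracy: "
          f"{model.score(X_test, y_test):.3f}")
```

Even this crude attack measurably erodes accuracy as the poisoned fraction grows. Subtler attacks can plant targeted backdoors while leaving overall accuracy largely untouched, which makes them far harder to notice.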
How destructive these attacks are varies with the model and the poisoned data in question. They could be as tame as making a natural language processing (NLP) algorithm fail to recognize some misspelled words. Alternatively, they could jeopardize people’s careers.
Amazon’s infamous recruiting algorithm project hints at what a data poisoning attack could do. The company scrapped the project after realizing the model had taught itself to penalize résumés containing the word “women’s,” systematically downgrading female applicants. That failure stemmed from biased historical data rather than a deliberate attack, but an attacker could engineer similar outcomes on purpose.
Data poisoning can happen with any machine learning model, black box or white box, supervised or unsupervised. While the approaches may vary in each scenario, the overall goal remains the same: inject or alter data in training data sets to compromise algorithm integrity.
This threat is more than theoretical, too. Organizations have already experienced data poisoning attacks, and as machine learning gains more prominence, these attacks may become more common.
Data poisoning attacks date back as far as 2004, when attackers compromised email spam filters. Cybercriminals would inject new data to make these fairly simple algorithms fail to recognize some messages as spam. The criminals could then send malicious messages that flew under these defenses’ radar.
One of the most famous examples of data poisoning occurred in 2016 with Microsoft’s Tay chatbot. The bot, which learned from how people interacted with it, quickly began parroting the racist and offensive messages coordinated users fed it, and Microsoft pulled it offline within a day.
While subpar spam filters and rude chatbots aren’t ideal, they may not seem particularly threatening. However, data poisoning can be far more dangerous, especially as businesses rely more heavily on machine learning.
In 2019, researchers showed how they could manipulate a leading machine learning-based antivirus engine into classifying malicious files as benign. If attackers can bend security models this way, the tools meant to protect organizations become liabilities instead.
Data poisoning attacks also become more concerning as automation becomes more common in cybersecurity. Just as early attacks hindered spam filters’ efficacy, new ones could affect intrusion detection models or other security algorithms. Attackers could cause monitoring software to fail to recognize irregular behavior, opening the door to more destructive attacks.
With security teams increasingly trusting these models to catch threats automatically, a poisoned detection algorithm could fail silently at the worst possible moment.
Given how damaging data poisoning can be, data professionals must defend against it. The key to these defenses is preventing unauthorized access to training data sets and actively watching for signs of tampering.
Businesses must consider their training data sources carefully, reviewing even data from trusted sources before using it. If organizations must move their training data, they should check it before and after to ensure no poisoning occurred in transit. Phased data migration also helps: moving data in smaller, verifiable stages minimizes downtime and narrows attackers’ windows to inject malicious or misleading data.
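One lightweight way to perform that before-and-after check is to fingerprint the data with a cryptographic hash and compare digests once the move completes. This sketch assumes a single CSV file at hypothetical paths; a real pipeline would record a digest for every file or shard in a manifest.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths for illustration only.
before = sha256_of_file(Path("training_data.csv"))
# ... migrate the file to its new location ...
after = sha256_of_file(Path("/new/location/training_data.csv"))

if before != after:
    raise RuntimeError("Training data changed in transit; investigate before training.")
```

A mismatched digest doesn’t say what changed, only that something did, which is exactly the signal needed to pause the pipeline and investigate.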
Machine learning developers should restrict access to their training data as much as possible and implement strong identity and access controls. Since compromised credentials are one of the most common ways attackers get inside, strong access management alone can stop many poisoning attempts before they start.
Frequent audits can reveal changes to data sets, indicating a poisoning attack. Data professionals should also understand their role in how these models learn, being careful to prevent their own biases from seeping in and unintentionally poisoning the data.
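Audits don’t have to be entirely manual, either. As one possible approach, a simple anomaly detector trained on trusted baseline data can flag records in a new batch that look statistically out of place. The synthetic numbers below stand in for a real feature set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
baseline = rng.normal(size=(5000, 10))   # trusted, previously vetted data
new_batch = rng.normal(size=(200, 10))   # incoming records to audit
new_batch[:5] += 6.0                     # a few injected outliers

# Fit on the trusted baseline, then score the new batch; -1 marks outliers.
detector = IsolationForest(contamination=0.05, random_state=0).fit(baseline)
flags = detector.predict(new_batch)
print(f"{(flags == -1).sum()} of {len(new_batch)} records flagged for review")
```

Flagged records aren’t proof of poisoning, but they give reviewers a short list to inspect rather than an entire data set.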
As companies rely more on data and data-centric technologies, data vulnerabilities become more concerning. Machine learning can yield impressive results, but businesses must be careful to make sure no attackers lead their algorithms astray.
Data poisoning can render an otherwise game-changing model useless or even harmful. Data professionals must understand this threat to develop machine learning models safely and effectively.