Adversarial Examples for Humans — An Introduction
This article is based on a twenty-minute talk I gave for TrendMicro Philippines Decode Event 2018. It’s about how malicious people can attack deep neural networks. A trained neural network is a model; I’ll be using the terms network (short for neural network) and model interchangeably throughout this article.
The basic building block of any neural network is an artificial neuron.
Essentially, a neuron takes a bunch of inputs and produces a single output. It computes the weighted sum of its inputs, adds a number called a bias, and feeds the result to a non-linear activation function. The function's output can then be used as one of the inputs to other neurons.
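To make that concrete, here's a minimal sketch of a single artificial neuron in Python (using NumPy). The specific activation function, weights, and bias below are just illustrative.

```python
import numpy as np

def relu(z):
    # A common non-linear activation function: max(0, z).
    return np.maximum(0.0, z)

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, plus the bias, fed to the activation.
    z = np.dot(weights, inputs) + bias
    return relu(z)

# Example with three inputs and made-up weights/bias.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
print(neuron(x, w, b))  # a single output value, usable as input to other neurons
```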
You can connect neurons in various (usually complicated!) ways and you get a neural network. How the neurons are connected is called the architecture of the neural network. If there are many layers of neurons between the inputs and the outputs, then it's a deep neural network.
When properly trained, a deep neural network can produce a correct set of outputs given a set of inputs.
Training a deep neural network means using techniques to find good weights (and biases) for its artificial neurons. Recall from earlier that each neuron computes the weighted sum of its inputs (plus the bias) before producing its output.
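As a rough illustration of what "finding good weights" means, here's a sketch of gradient descent on a single neuron with a squared-error loss. Everything about it (one neuron, one training example, the sigmoid activation, the learning rate) is a deliberate simplification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training example: inputs x and the desired output y.
x = np.array([0.5, -1.0, 2.0])
y = 1.0

# Start from arbitrary parameters and nudge them repeatedly.
w = np.zeros(3)
b = 0.0
learning_rate = 0.1

for step in range(100):
    z = np.dot(w, x) + b                  # weighted sum plus bias
    pred = sigmoid(z)                     # the neuron's current prediction

    error = pred - y                      # how wrong we are
    grad_z = error * pred * (1.0 - pred)  # chain rule through the sigmoid
    grad_w = grad_z * x
    grad_b = grad_z

    # Move the weights and bias in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```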
It's as if the deep neural network learned how to map inputs to outputs! That's why it's called Deep Learning!
Deep neural networks using deep learning techniques are very good at finding patterns given huge amounts of data.
Deep neural networks are popular because, in theory, they can learn at different levels of abstraction.
For example, layers near the input could learn simple features like lines and curves of different orientations. The middle layers can use these as inputs to distinguish more complex shapes like face parts (eyes, nose, etc) and the layers near the output could use that information to recognize specific faces based on facial structures.
You can imagine how hard it would be to build face recognition software without deep learning. Because deep learning is great at finding patterns, it can automate tasks that many previously thought impossible.
Deep Learning has many applications today. However, deep learning models are very prone to attacks!
Here’s an example:
Here’s an image of my very good friend Jessica.
Say we then feed this image to a state-of-the-art model, and it says it’s 70% sure it’s Jessica.
On the right of the image below is another image of Jessica. As you can see, it looks identical to the previous image. When we feed this other image to the same model, is it possible that the model will say that it's 99% Kris Aquino? Yes, it's possible!
In 2013, researchers published papers with images like the ones below.
As you can see, each pair of images looks identical. However, while the images on the left were classified correctly, the images on the right were all confidently classified as an ostrich.
It turns out you can slightly change the input, and the model could say it's 99% sure of a drastically wrong answer. Slightly change a photo of Jessica, and a well-trained model could insist it's 99% sure it's someone else entirely!
The two funny-looking images below are from a highly referenced 2015 paper. The well-trained model said it’s 99% sure that the left image is a bikini. It’s also 99% sure that the right image is an assault rifle.
There are many weird images in that paper. Below are some of them; each one is classified with 99% confidence as something specific, like an African chameleon or the number nine.
It seems like these high accuracy models don’t really understand what they’re doing.
In 2016, one paper demonstrated how you can trick commercial face recognition software into thinking you are someone else, just by wearing intentionally designed fake glasses. While the presented results are not that robust, they're promising, and I imagine the technique will only get better over time, like almost all other technologies.
Here's a video (Nov 2, 2017) from MIT where they tricked a deep learning model into classifying a 3D-printed turtle as a rifle from most viewing angles.
These intentionally-crafted inputs to neural networks are called adversarial examples.
Adversarial examples are inputs designed by an attacker so that the model makes a mistake. That means that by slightly modifying a normal input in a malicious way, you can make the model produce an obviously wrong output.
If you show the modified input to humans, they’d easily give the correct output. The changes are negligible to humans.
This matters, obviously, because it's useless to rely on a model whose outputs can't be trusted. It's also dangerous to deploy such a model in the real world when the stakes are high.
Evil Jessica can get away with watching NSFW content in the office. Evil Jessica can write a one-hundred-peso check and cash it for ten thousand pesos. Evil eye-glasses Jessica can rob a bank, and the A.I. could report that it's 99% sure the robber is Kris Aquino.
Any company with huge amounts of data relies heavily on understanding the meaning behind that data. Twitter, for example, might want to analyze many sequences of words to understand which topics are trending and what the general sentiment around a topic is.
Sentiment analysis could also be used as feedback on how well a company is doing. For example, if autistic Jessica could hack all the private conversations of her officemates, she could give herself a daily satisfaction or trust rating and use it to improve how she interacts with others.
A 2017 paper demonstrated that by simply replacing certain words with their synonyms, you can make a model say that a negative sentiment is a positive one, or vice versa.
It’s a disaster for autistic Jessica if she makes decisions on how to act based on opposite information!
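To give a feel for how such an attack might work (this is only an illustration, not the paper's actual algorithm), here's a toy sketch: try synonym substitutions one word at a time and keep whichever one pushes the model's score the furthest. The `sentiment_score` function and the synonym dictionary are hypothetical placeholders.

```python
# Toy synonym-substitution attack (an illustration, not the paper's method).
# `sentiment_score` is a stand-in for any sentiment model: it takes a sentence
# and returns a number, say higher = more positive.

SYNONYMS = {
    "terrible": ["dreadful", "awful"],
    "boring": ["dull", "tedious"],
}

def flip_sentiment(sentence, sentiment_score):
    words = sentence.split()
    best, best_score = sentence, sentiment_score(sentence)
    for i, word in enumerate(words):
        for alt in SYNONYMS.get(word, []):
            candidate = " ".join(words[:i] + [alt] + words[i + 1:])
            score = sentiment_score(candidate)
            # Keep the substitution that pushes the prediction furthest toward
            # "positive" while still reading the same to a human.
            if score > best_score:
                best, best_score = candidate, score
    return best
```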
Here's a paper about adversarial examples applied to reading comprehension systems: "The accuracy of sixteen published models drops from an average of 75% F1 score to 36%" — Adversarial Examples for Evaluating Reading Comprehension Systems (2017)
In theory, it is also possible to make small changes to a malware file such that it remains malware but a malware-detection model says it is not. This is dangerous and obviously defeats the entire purpose of such a detector.
Many experiments, including some from Google Brain, have found that an adversarial input to one network is most likely also adversarial to another, even if the networks have different architectures. This property is called Adversarial Sample Transferability.
Also, you can create adversarial inputs to any known network today. Many algorithms have been specifically developed for this purpose alone, as we will discuss shortly.
The actual noise added to the image below is designed to fool this specific network; it is not random noise. We can create this noise if we know the architecture and the values of all the parameters (weights and biases) of the network. This is what's called a "white-box" approach.
In normal training, we have fixed inputs and we tweak the parameters in a good direction in order to get a high score for the correct class. That's what good training algorithms do.
To create adversaries, one idea is to flip this process: hold the parameters fixed and tweak the inputs instead. We can tweak the original input a tiny bit in the opposite direction until we get a high score for the wrong class. The result is the adversarial version of the original input. (Andrej Karpathy — Breaking Convnets)
These changes are so tiny that they're unobservable to humans but observable to the network. It's like we're exploiting the very basis of all network training algorithms.
A lot of techniques use this idea, but there are many others. The point is, it can be done.
Jacobian-based Saliency Map Attack (JSMA). L0, L2, and L-infinity attacks. DeepFool. Fast Gradient Sign. Iterative Gradient Sign. Etcetera.
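Here's a minimal sketch of the Fast Gradient Sign idea from that list, in PyTorch, since it's the simplest one. It assumes `model` is any differentiable classifier, `image` is a batched tensor with pixel values in [0, 1], and `epsilon` controls how visible the perturbation is; real implementations add more refinements.

```python
import torch
import torch.nn.functional as F

def fgsm(model, image, true_label, epsilon=0.01):
    """One-step Fast Gradient Sign attack (untargeted sketch)."""
    image = image.clone().detach().requires_grad_(True)

    # Loss of the model's prediction against the *correct* label.
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()

    # Nudge every pixel by +/- epsilon in the direction that increases the loss,
    # i.e. the direction that pushes the model away from the correct answer.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()  # keep valid pixel values
```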
We've discussed the white-box approach, which simply means that if we know everything about the network we're trying to attack, we can attack it.
What if all we can do is use the network we want to attack? That is, we can only feed it inputs and get the corresponding outputs; we don't know anything else. It turns out that yes, we can still attack a network under these black-box constraints.
Recent papers have actually shown success in attacking commercial models, like those from Amazon and Google.
The key is to exploit Adversarial Sample Transferability: if we can fool a substitute network, we can most likely fool the target network we want to attack.
Given the input-output pairs we collected from the target network, we can train a substitute network. Because we know everything about our substitute (its weights, biases, and architecture), we can create adversaries for it. And because of transferability, these adversarial inputs will most likely also be adversarial to the network we actually want to attack.
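Here's a rough sketch of that recipe, reusing the `fgsm` helper from earlier. The names (`query_target`, `substitute`, `probe_images`) are illustrative, and the published black-box attacks train the substitute far more carefully than this.

```python
import torch
import torch.nn.functional as F

def substitute_attack(query_target, substitute, probe_images, image, label,
                      epochs=10, lr=0.01, epsilon=0.03):
    # 1. The only access we have to the target: feed it inputs, read its scores.
    with torch.no_grad():
        target_labels = query_target(probe_images).argmax(dim=1)

    # 2. Train our own substitute network to imitate those answers.
    optimizer = torch.optim.SGD(substitute.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.cross_entropy(substitute(probe_images), target_labels)
        loss.backward()
        optimizer.step()

    # 3. Attack the substitute with a white-box method (the FGSM sketch above)
    #    and rely on transferability to fool the target as well.
    return fgsm(substitute, image, label, epsilon)
```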
Here's a recent video from MIT researchers showcasing a black-box attack algorithm that is 1000x faster than existing black-box attacks. (6 Apr 2018, Query-Efficient Black-box Adversarial Examples, arXiv: 1712.07113)
We can generate many adversarial inputs ourselves and explicitly train our network not to be fooled by them. But an attacker can always generate new adversaries against our strengthened network, so by itself this isn't really a good defense strategy. Creating defenses is an active research area.
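As a sketch of what that first defense (often called adversarial training) might look like, here's one training step that mixes a clean batch with adversarial versions of it, again reusing the `fgsm` helper from earlier. This is a simplification, not a recommended production defense.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, images, labels, epsilon=0.03):
    # Craft adversarial versions of this batch against the current model
    # (using the fgsm sketch defined earlier).
    adv_images = fgsm(model, images, labels, epsilon)

    # Train on both the clean and the adversarial batch, so the model
    # learns to give the correct label for both.
    optimizer.zero_grad()
    loss = (F.cross_entropy(model(images), labels) +
            F.cross_entropy(model(adv_images), labels))
    loss.backward()
    optimizer.step()
    return loss.item()
```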
It's hard to create a defense against adversarial inputs because a network would have to produce the correct output for every possible input, yet during training it only ever encounters a tiny subset of those inputs.
Some Google researchers believe that although neural networks can be highly nonlinear globally, common activation functions are almost linear, which makes the networks behave linearly in local regions. They argue that this near-linearity not only makes training easier but is also part of why the networks perform well.
One may also conclude that the model families we use are intrinsically flawed. Ease of optimization has come at the cost of models that are easily misled. — Explaining and Harnessing Adversarial Examples (2015)
It seems like the very properties that make neural networks effective are also the reason they're vulnerable to adversarial attacks.
There are tools for visualizing the layers of your neural network, so you can see what the network is seeing or paying attention to: tools like Deep-Visualization-Toolbox, Keras-Vis, and Keras-GradCam.
There are also tools, like Foolbox and Cleverhans, for testing the robustness of your neural network against adversaries.
Cleverhans is named after Clever Hans, a real horse from the early 1900s that people believed could answer math questions. In reality, the horse had just learned to read subtle social cues from the humans around it. It's a metaphor for high-accuracy neural networks that do not really understand what they're doing.
Deep learning is being used, and can be used, to automate tasks previously thought to be impossible. Adversarial examples (or adversarial inputs) are hard-to-defend attacks against neural networks; they cause the network to make mistakes that humans wouldn't.
If there's only one thing you take away from this article, it's this: along with needing huge amounts of data to correctly answer questions, deep neural networks are fooled by adversarial examples in ways that suggest they don't think, learn, or reason at a high level the way humans do.
There is still a lot of work to be done.