Prehistory

Hello everyone! Recently I ran into a problem: for unexplained reasons, my memory card began moving all files to the LOST.DIR folder and stripping their extensions. Over time, more than 500 files of different types accumulated there: pictures, video, audio, documents. It was impossible to work out the format of each file by hand, so I started looking for a way to solve the problem programmatically.

Looking for a solution

I did not want to use ready-made solutions in the form of web services or programs, so I had the idea of writing a console utility that would go through all the files and set the extensions automatically. Python was chosen to write the utility. The search for suitable modules and libraries did not bring results, for several reasons:

- Lack of support from the developer
- Excessive functionality
- Lack of support for new versions of Python
- Excessive code complexity

Of the many libraries, python-magic is very popular (almost 1000 stars on GitHub). It is a wrapper for the libmagic library. But python-magic cannot be used under Windows without a DLL build of that Unix library, so this option wasn't good enough.

Solution of the problem

Given the above, I decided not to use third-party libraries and modules and to solve the problem without them. After a short search for information on how to implement this, the only reliable approach I found was to determine the format by the file's signature, also called a "magic number". A file signature is a sequence of bytes that identifies the file format. In hexadecimal notation, a signature looks like this:

    50 4D 4F 43 43 4D 4F 43

Fortunately, there are two good sites on the Internet with a lot of signatures for different formats. I targeted the most common formats. As it turned out, some signatures match several file formats, such as the signature of Microsoft Office files.
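To make the idea of a signature concrete, here is a minimal sketch (not the utility's actual code) that checks whether a byte string begins with a given hexadecimal signature. The helper name `starts_with_signature` is hypothetical. The signature `D0 CF 11 E0 A1 B1 1A E1` is the one shared by legacy Microsoft Office files (doc, xls, ppt), which is why a match on it alone cannot tell those formats apart.

```python
def starts_with_signature(header: bytes, signature_hex: str) -> bool:
    """Return True if `header` begins with the bytes written in `signature_hex`."""
    # bytes.fromhex accepts space-separated pairs of hex digits.
    signature = bytes.fromhex(signature_hex)
    return header[:len(signature)] == signature


# Signature shared by legacy Microsoft Office documents (doc, xls, ppt).
OLE_SIGNATURE = "D0 CF 11 E0 A1 B1 1A E1"

# A made-up 32-byte header that starts with that signature.
header = bytes.fromhex(OLE_SIGNATURE) + b"\x00" * 24

print(starts_with_signature(header, OLE_SIGNATURE))  # True
print(starts_with_signature(header, "FF D8 FF E0"))  # False
```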
Based on this, in some cases it will be necessary to return a list of suitable file extensions.

    print(get("D:\\some_ms_office_document"))  # prints ['doc', 'ppt', 'xls']

Also, signatures often have an offset from the beginning of the file, as in the 3GP multimedia container.

1. Compiling a list of data

For the data list, I decided to use a JSON file with a 'data' object whose value is an array of objects of the following form:

    {"format": "jpg", "offset": 0, "signature": ["FF D8 FF E0", "FF D8 FF E1", "FF D8 FF E2", "FF D8 FF E8"]}

Where:

- format: the file format;
- offset: the offset of the signature from the beginning of the file, in bytes;
- signature: an array of suitable signatures for the specified file format.

2. Writing a utility

Import the necessary modules:

    import os
    import json

Read the data list:

    abspath = os.path.abspath(os.path.dirname(__file__))
    with open(os.path.join(abspath, "data.json"), "r", encoding="utf-8") as f:
        data = json.load(f)["data"]

Great, the data list is loaded. Now we read the file as an array of bytes. We will only read the first 32 bytes, since identifying common formats doesn't require more, and reading a large file in full would take a long time.
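Whether 32 bytes is enough depends on the entries in the data list: the window has to cover the offset plus the length of the longest signature. A minimal sketch of that check, assuming the JSON layout described above. The `required_header_size` helper and the sample entries are illustrative and are not taken from the utility's data file.

```python
def required_header_size(entries):
    """Smallest number of leading bytes that covers every signature in `entries`."""
    size = 0
    for entry in entries:
        for signature in entry["signature"]:
            # Each space-separated token in a signature string is one byte.
            size = max(size, entry["offset"] + len(signature.split()))
    return size


sample = [
    {"format": "jpg", "offset": 0, "signature": ["FF D8 FF E0", "FF D8 FF E1"]},
    # Illustrative entry: 3GP files carry the ASCII text "ftyp3gp" starting at byte 4.
    {"format": "3gp", "offset": 4, "signature": ["66 74 79 70 33 67 70"]},
]

print(required_header_size(sample))  # 11
```

If the result ever exceeds the number of bytes read, the affected signatures could never match, so the read size would need to grow with the data list.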
In code:

    with open("path_to_the_file", "rb") as f:
        file = f.read(32)

If you print the 'file' variable, you will see something similar to this:

    b'\x90\x00\x03\x00\x00\x00\x04'

Now the bytes must be converted to hexadecimal notation:

    hex_bytes = " ".join(['{:02X}'.format(byte) for byte in file])

Next, we create a list to which the matching formats will be added:

    out = []

And now we create a loop that checks every signature of every format:

    for element in data:
        for signature in element["signature"]:
            offset = element["offset"] * 2 + element["offset"]
            if signature == hex_bytes[offset:len(signature) + offset].upper():
                out.append(element["format"])

About this line:

    offset = element["offset"] * 2 + element["offset"]

Since our bytes are represented as a string in which two characters stand for one byte, we multiply the offset by 2 and add the number of spaces between the "bytes" that precede it. In other words, the byte at offset n starts at character 3n of the string.

The only thing left is to output the list of suitable formats, which is held in the 'out' variable.

    print(out)  # prints something like ['extension_1', 'extension_2']

Conclusion

As it turned out, various projects face the need to recognize file formats, so I decided to release my solution as an open-source Python module called fleep. You can install the module using the standard Python utility 'pip':

    pip install fleep

There are also usage examples and a complete list of supported file formats on the GitHub project page. I improve fleep every day, adding new features and formats. You can use it in your project :)

Thank you for your attention!

P.S. I would be glad to hear your opinion about my module.

P.P.S. English is not my native language, so please excuse any mistakes :)
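For reference, the steps from the article can be collected into a single function. This is a sketch under stated assumptions, not fleep's implementation: the name `detect_formats` and its parameters are hypothetical, the comparison is done on raw bytes instead of on the hexadecimal string, and each entry is reported at most once even if several of its signatures match.

```python
import os
import tempfile


def detect_formats(path, data, header_size=32):
    """Return the formats in `data` whose signature matches the file at `path`."""
    with open(path, "rb") as f:
        header = f.read(header_size)  # only the first bytes are needed

    matches = []
    for element in data:
        start = element["offset"]  # offset in bytes from the start of the file
        for signature_hex in element["signature"]:
            signature = bytes.fromhex(signature_hex)
            if header[start:start + len(signature)] == signature:
                matches.append(element["format"])
                break  # one matching signature is enough for this entry
    return matches


# Illustrative data list in the layout described in the article.
data = [
    {"format": "jpg", "offset": 0, "signature": ["FF D8 FF E0", "FF D8 FF E1"]},
    {"format": "doc", "offset": 0, "signature": ["D0 CF 11 E0 A1 B1 1A E1"]},
    {"format": "xls", "offset": 0, "signature": ["D0 CF 11 E0 A1 B1 1A E1"]},
]

# Write a small fake file that starts with the shared Office signature.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes.fromhex("D0 CF 11 E0 A1 B1 1A E1") + b"\x00" * 100)

print(detect_formats(tmp.name, data))  # ['doc', 'xls']
os.remove(tmp.name)
```

Comparing bytes directly removes the need for the offset arithmetic on the hexadecimal string, at the cost of converting every signature with `bytes.fromhex` on each call.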

Determining file format using Python
