If you are a programmer, you probably use Git. But have you ever wondered how Git works under the hood? I have. Fortunately, there are many documents on the web about Git internals. Reading them, I realized that Git is a relatively simple but ingenious system. In this article, I will show you how Git works deep inside. Come with me down the rabbit hole.
When you clone your favorite repo from GitHub or any other Git server, you get the files and a .git folder. This single .git folder contains everything. It's not a problem if you delete the other files; you can simply restore them with a ‘checkout’ command. This is possible because the whole file tree is described in the .git folder.
Let’s look into the .git folder. It contains some files and folders. One of the most important is the objects folder. Git is something like a special filesystem that stores files with the same content only once. When you store a file in a Git repo, Git calculates the SHA-1 hash of the file content (prefixed with a short header that holds the object type and size) and stores the content in the objects folder under that hash. If the same file exists in different places in the tree, it is stored only once, because identical content always maps to the same hash and therefore to the same object file.
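To make the hashing step concrete, here is a minimal Python sketch of how a blob's object ID can be computed. It assumes a repository that uses SHA-1 object names; the function name git_blob_hash is my own and not part of Git.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the object ID Git assigns to a blob with this content."""
    # Git hashes a header ("blob <size>" plus a NUL byte) followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two "files" with identical content get the same object ID,
# so the object store only needs to keep one copy.
file_in_src = b"hello\n"
file_in_docs = b"hello\n"
print(git_blob_hash(file_in_src) == git_blob_hash(file_in_docs))
```

If you have Git installed, you can compare the result with the output of git hash-object for a file with the same content.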
The SHA-1 hash is 20 bytes long. The first byte (2 hex characters) names a subfolder in the objects folder, and the other 19 bytes (38 hex characters) are the name of the file. For example, if the hash is 10116ede2f0bcf2ec0720843616e4a5250ae5268 then the object is stored at objects/10/116ede2f0bcf2ec0720843616e4a5250ae5268.
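The mapping from hash to file path is simple enough to sketch in a few lines of Python. The helper name loose_object_path is hypothetical, and the sketch assumes a 40-character SHA-1 hex string.

```python
import string
from pathlib import PurePosixPath


def loose_object_path(object_id: str) -> PurePosixPath:
    """Map a 40-character hex object ID to its path below the .git folder."""
    if len(object_id) != 40 or any(c not in string.hexdigits for c in object_id):
        raise ValueError("expected a 40-character hex string")
    # First 2 hex characters: subfolder. Remaining 38: file name.
    return PurePosixPath("objects") / object_id[:2] / object_id[2:]


print(loose_object_path("10116ede2f0bcf2ec0720843616e4a5250ae5268"))
```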
If you cloned the repository and haven’t changed anything, you will probably not find any individual object files, only a pack folder with a .pack file in it. This is an optimization: Git fetches the objects from the server bundled into one pack file. You can unpack this file if you move it outside of the .git directory and run
git unpack-objects < ./{pack_file_name}.pack
This command will unpack the objects into the objects folder in the above format.
The object files are compressed with zlib (not stored as zip archives), so if you open one of them you won’t be able to read it, but you can easily decompress it with the following command:
pigz -d < ./.git/objects/10/116ede2f0bcf2ec0720843616e4a5250ae5268
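The same decompression can be done with Python's zlib module. The sketch below builds a small object in memory instead of reading a real file, so it runs anywhere; read_loose_object is a name I made up for illustration.

```python
import zlib


def read_loose_object(raw: bytes) -> tuple[str, bytes]:
    """Decompress a loose object and split it into (type, body)."""
    data = zlib.decompress(raw)
    # The decompressed data starts with "<type> <size>", then a NUL byte.
    header, _, body = data.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode("ascii"), body


# Build a loose object in memory the way a blob is stored, then read it back.
stored = zlib.compress(b"blob 6\0hello\n")
print(read_loose_object(stored))
```

To try it on a real repository, pass the bytes of an object file (for example, read with pathlib.Path.read_bytes) instead of the in-memory sample.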
The objects are organized into trees. A tree is something like a folder in a filesystem, and it is itself stored as another file in the objects folder. A tree looks like this:
100644 blob 5f71dbb20efc1dc9bd95e116ebc403659556b58a .gitignore
100644 blob f288702d2fa16d3cdf0035b15a9fcbc552cd88e7 LICENSE
100644 blob 49e96aecc3c354402c153d759e900354cfcb7c80 README.md
040000 tree 7054d5d9fd2431c4ff4f27537d6a5388b3c73ca9 database
100644 blob 9b50d8c47e0ad56aab6aa570f344c6db5409a955 env.development
100644 blob a473235e1bf1461feef090b2a62b2066d75c7d97 env.template
100644 blob a0f18dc0b81d5122a8eeca6903868f1ea4721ebc package.json
040000 tree ae9e90c2dcc818fab099dd22093ac5e5adb87bbb public
040000 tree 0bef5a72fa773367998e501275c262bb0ec75544 scripts
040000 tree 878c06bb25e1752fa6271c6eef51edad0942c3ff src
100644 blob 604c913eebc2578696d37b7346be681db2591816 tsconfig.json
040000 tree cdd80d4ee72ce05a172e9d6bc05b2d946767d079 views
This output is generated by the git cat-file command, which can read and parse any object from the objects folder by its hash. The listing above was produced by:
git cat-file -p 54ca9b88af96f27e181b9a059ca4be1f60e720ba
The first column shows a Linux-like file mode, the second column shows the object type, the third column is the object hash, and the last column is the filename. A Git tree is very similar to a Linux folder that can contain files (blobs) and other folders (trees). If you like, you can inspect the content of any of these files or trees with the same commands that we used before.
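The listing from git cat-file is a pretty-printed view. On disk, the decompressed body of a tree is a sequence of binary entries: the mode as ASCII text, a space, the file name, a NUL byte, and then the 20 raw bytes of the hash. Here is a rough Python sketch of a parser for that layout. The sample data is hand-built from two entries of the listing above, and parse_tree is my own name, not a Git API.

```python
def parse_tree(body: bytes) -> list[tuple[str, str, str]]:
    """Parse the body of a raw tree object into (mode, object_id, name) entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        # The hash is stored as 20 raw bytes, not as 40 hex characters.
        object_id = body[nul + 1:nul + 21].hex()
        entries.append((mode, object_id, name))
        pos = nul + 21
    return entries


# Hand-built sample: the raw format stores the folder mode as "40000",
# without the leading zero that the pretty-printed listing shows.
sample = (
    b"100644 LICENSE\0" + bytes.fromhex("f288702d2fa16d3cdf0035b15a9fcbc552cd88e7")
    + b"40000 src\0" + bytes.fromhex("878c06bb25e1752fa6271c6eef51edad0942c3ff")
)
for mode, object_id, name in parse_tree(sample):
    # Simplification: treat everything that is not a folder as a blob.
    kind = "tree" if mode == "40000" else "blob"
    print(f"{mode:0>6} {kind} {object_id} {name}")
```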
Git can be imagined as a virtual filesystem where every commit on every branch is a folder. When you do a checkout, you copy the contents of the chosen folder outside of the .git directory. In a standard filesystem, this would need a huge amount of disk space, but thanks to Git’s clever hash-based and compressed storage, unchanged files are shared between these folders instead of being copied.
Creating a branch would need a full directory copy in a standard filesystem, but Git only creates a single small file that contains the hash of the commit the branch starts from. If you change a file and commit, Git writes only a new blob for the changed file, a new tree for each folder on the path from the file up to the root, and a commit object that points to the new root tree (3 files for a file in the root folder, instead of a full directory copy).
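To show how cheap a branch is, here is a small Python sketch that "creates a branch" by writing one file below refs/heads. It works on a temporary folder rather than a real repository. The commit hash is a made-up placeholder and create_branch is a hypothetical helper. Real Git also does some bookkeeping that is left out here, and it may store refs in a packed-refs file instead.

```python
import tempfile
from pathlib import Path


def create_branch(git_dir: Path, name: str, commit_id: str) -> Path:
    """Create a branch by writing a commit ID into refs/heads/<name>."""
    ref = git_dir / "refs" / "heads" / name
    ref.parent.mkdir(parents=True, exist_ok=True)
    ref.write_text(commit_id + "\n", encoding="ascii")
    return ref


# Placeholder commit ID, not a real object.
PLACEHOLDER_COMMIT = "0123456789abcdef0123456789abcdef01234567"

with tempfile.TemporaryDirectory() as tmp:
    ref = create_branch(Path(tmp), "feature", PLACEHOLDER_COMMIT)
    # The whole "branch" is one file of 41 bytes: 40 hex characters and a newline.
    print(ref.relative_to(tmp).as_posix(), ref.stat().st_size)
```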
Every commit contains the hash of its parent commit (like a blockchain), so the history is fully trackable; a merge commit has more than one parent, and the first commit has none. This makes this special filesystem a version control system.
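The chain of commits can be followed with very little code. A commit object is plain text: a few header lines (tree, parent, author, committer), an empty line, and the message. The Python sketch below walks such a chain in a tiny in-memory store. The IDs c1, c2 and c3, the tree names, and the author line are made-up placeholders, and both function names are my own.

```python
def parse_commit(body: str) -> dict:
    """Split a commit body into its tree, parents, and message."""
    headers, _, message = body.partition("\n\n")
    commit = {"tree": None, "parents": [], "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            commit["tree"] = value
        elif key == "parent":
            commit["parents"].append(value)
    return commit


def first_parent_history(store: dict[str, str], start: str) -> list[str]:
    """Follow first parents from `start` back to the root commit."""
    history = []
    current = start
    while current is not None:
        history.append(current)
        parents = parse_commit(store[current])["parents"]
        current = parents[0] if parents else None
    return history


# Made-up store: placeholder IDs stand in for real 40-character hashes.
AUTHOR = "author A U Thor <author@example.com> 0 +0000"
store = {
    "c1": f"tree t1\n{AUTHOR}\n\nInitial commit\n",
    "c2": f"tree t2\nparent c1\n{AUTHOR}\n\nAdd README\n",
    "c3": f"tree t3\nparent c2\n{AUTHOR}\n\nFix typo\n",
}
print(first_parent_history(store, "c3"))
```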
When you pull or push, Git sends these objects to the other side in a packed format. Because of the hash-based naming, objects with different content will, in practice, never collide. You could simply copy every object from every Git repository in the world into a single folder without any problem. This is why forking a repository on GitHub takes only a few seconds: GitHub doesn’t need to copy the objects, it only creates an entry in its database, similar to branching.
In a nutshell, this is how Git works. The hash-based file storage that is the essence of the system is used by many other decentralized systems. IPFS and Swarm also use this hash-based representation. The difference is that these systems add a discovery layer to locate a given hash in the distributed storage of the network's nodes.
Mixing the discovery system of these decentralized filesystems with Git’s versioning abilities could be the basis of a fully decentralized GitHub alternative, but that is another story…
If you want to understand Git more deeply, you can find everything on the Git website or in the Git book.