When I started using Git, I did what most people do. I memorized commands to get the job done without really understanding what was happening under the hood. In most cases, I was getting the results I wanted. But I was still frustrated that I was occasionally ‘breaking’ the repo—getting it into a state I didn't expect and not knowing how to fix it.
Is your experience similar?
The shortcut approach to using a repository is an attempt to use a tool without doing the essential homework to learn how it works. In my case, everything ‘clicked’ as soon as I read about the internal data model used by Git. You see, Git is a kind of a database, and one would never be able to work with SQL, for example, without knowing what a table, record, etc. is. Let’s cover the knowledge gap and see a bit of the internals of a Git repository.
.git
Git is a distributed version control software, which means you don’t need an external server to use it. All the data that Git needs is stored in the .git
folder. As a Git user, you have no business changing those files, but for the purposes of this article, we’ll take a look inside to see how Git stores the data.
Just after creating the repository with git init
, you’ll find inside:
$ ls -R .git
HEAD config description hooks info objects refs
.git/hooks:
applypatch-msg.sample pre-applypatch.sample pre-rebase.sample update.sample
commit-msg.sample pre-commit.sample pre-receive.sample
fsmonitor-watchman.sample pre-merge-commit.sample prepare-commit-msg.sample
post-update.sample pre-push.sample push-to-checkout.sample
.git/info:
exclude
.git/objects:
info pack
.git/objects/info:
.git/objects/pack:
.git/refs:
heads tags
.git/refs/heads:
.git/refs/tags:
Right now, it’s almost empty: we have a few folders, mostly example files for hooks. We will ignore these; our focus in this article will be mostly .git/objects
content—the primary data storage in Git.
Git stores every single version of each file it tracks as a blob. Git identifies blobs by the hash of their content and keeps them in .git/objects
. Any change to the file content will generate a completely new blob object.
The easiest way to create an object is to add an object to the stage. What is in the stage will be part of the next commit. Staging is the “pre-commit” state in git. It’s where we keep files that are not already committed but already tracked by Git.
Let’s create a simple file and make a blob to represent it:
$ echo "Test" > test.txt
With this command, we write “Test” to the test.txt
file. To make it a blob, we just need to add it to the stage by running:
$ git add .
After adding our new file to the stage, inside .git/objects
, we have:
$ ls -R .git/objects
34 info pack
.git/objects/34:
5e6aef713208c8d50cdea23b85e6ad831f0449
.git/objects/info:
.git/objects/pack:
We have a new folder, 34
, and inside that folder a file 5e6aef713208c8d50cdea23b85e6ad831f0449
. This is because the content hash is 345e….
: the two chars from the front are used as a directory. The content of this file is:
$ cat .git/objects/34/5e6aef713208c8d50cdea23b85e6ad831f0449
xKOR0I-.
It’s compressed for storage efficiency. We can see what’s inside by running the following Git command:
$ git cat-file blob 345e6aef713208c8d50cdea23b85e6ad831f0449
Test
We have only the content inside—no metadata for the file.
Let’s see what happens if we make some changes to the file and add the updated version:
$ echo "Test 2" >> test.txt
This command adds a new line,“Test 2”, to the existing file test.txt
.
Let’s add the current version to the stage:
$ git add .
And see what we have inside the .git/objects
folder:
$ ls -R .git/objects
34 d2 info pack
.git/objects/34:
5e6aef713208c8d50cdea23b85e6ad831f0449
.git/objects/d2:
77ba2806ce99d418b0b5d6c28643deca0e36dc
...
Now we have two objects, the second one inside the d2
subfolder. Its content is:
$ git cat-file blob d277ba2806ce99d418b0b5d6c28643deca0e36dc
Test
Test 2
It’s the same as our updated text.txt
:
$ cat test.txt
Test
Test 2
As we can see, Git stores the complete file for each version.
The tree objects are how Git is storing folders. They reference other things as their content:
For each reference, a tree stores:
Like with blobs, Git identifies each tree by the hash of its content. Because the tree is referencing the hash of each file it contains, any change to the content of files will cause the creation of an entirely new tree object.
Similarly, because different versions of the same file will have multiple blobs, Git will create another tree object for each folder version.
Usually, you create a tree as part of the commit. We will cover commits later in this article, but in the meantime, let’s use git write-tree
—a plumbing command that creates a tree based on what is inside our staging.
Plumbing and porcelain commands come from an analogy used in Git:
Unless you are doing advanced stuff, you don’t need to know plumbing commands.
With our staging as before, we run:
$ git write-tree
fd4f9947de2805e460bfeeca3346e3d36d617d37
The returned value is the ID of our new tree object. To look inside, you can run:
$ git cat-file -p fd4f9947de2805e460bfeeca3346e3d36d617d37
100644 blob d277ba2806ce99d418b0b5d6c28643deca0e36dc test.txt
Even though it’s a different data type than blobs, their value is stored in the same place:
$ ls -R .git/objects
34 d2 fd info pack
.git/objects/34:
5e6aef713208c8d50cdea23b85e6ad831f0449
.git/objects/d2:
77ba2806ce99d418b0b5d6c28643deca0e36dc
.git/objects/fd:
4f9947de2805e460bfeeca3346e3d36d617d37
…
All the data is in the same folder structure.
Now, we’ll add another folder inside to see how nested trees are stored:
create a new folder:
$ mkdir nested
add a file & it’s content
$ echo 'lorem' > nested/ipsum
adding it to the stage
$ git add .
Creating a tree now will give us a new ID:
$ git write-tree
25517090ae5d0eb08f694de6d38d613615fe99e4
Its content:
$ git ls-tree 25517090ae5d0eb08f694de6d38d613615fe99e4
040000 tree bc9a36d27aa303a3b1cab543b64c6944fea5ce8b nested
100644 blob d277ba2806ce99d418b0b5d6c28643deca0e36dc test.txt
We can see that nested
was added as a tree reference. Let’s see what is inside:
$ git ls-tree bc9a36d27aa303a3b1cab543b64c6944fea5ce8b
100644 blob 3e9ffe066cd7b2ce4c6fb5c8f858496194e1c251 ipsum
As you can see, it’s another tree object that describes a folder's content. With many tree objects, you can describe any nested folder structure.
A commit is a complete description of the state of the repository. It contains the following information:
Most commits have only one parent, with the following exceptions:
As before, Git identifies each commit by the hash of its content. Therefore, any change to the files, folder, or commit metadata will create a new commit.
We can create our first commit with the standard commit command:
$ git commit -m 'first commit'
[main (root-commit) 26349a2] first commit
2 files changed, 3 insertions(+)
create mode 100644 nested/ipsum
create mode 100644 test.txt
The output shows the truncated commit ID. Let’s find a complete value:
$ git show
commit 26349a25253f9b316db1a5d3c3f23c1ca5ca4e0e (HEAD -> main)
Author: Marcin Wosinek <[email protected]>
Date: Thu Apr 28 18:18:07 2022 +0200
first commit
…
To see the content of commit object, we can use:
$ git cat-file -p 26349a25253f9b316db1a5d3c3f23c1ca5ca4e0e
tree 25517090ae5d0eb08f694de6d38d613615fe99e4
author Marcin Wosinek <[email protected]> 1651162687 +0200
committer Marcin Wosinek <[email protected]> 1651162687 +0200
first commit
The tree reference is the same as what we had in the previous example. We can see that commits stay in the same folder as other objects:
ls -R .git/objects
25 26 34 3e bc d2 fd info pack
…
.git/objects/26:
349a25253f9b316db1a5d3c3f23c1ca5ca4e0e
…
Let’s restore the first version of our test.txt
file:
$ echo "Test" > test.txt
This command overwrites the existing file with “Test”.
$ git add .
Adds the updated version to the staging.
$ git commit -m 'second commit'
[main 7f54a43] second commit
1 file changed, 1 deletion(-)
Commits changes.
Let’s find the full ID:
$ git show
commit 7f54a437d87cd1f241cfb893c4823bc7e60c19ec (HEAD -> main)
Author: Marcin Wosinek <[email protected]>
Date: Thu Apr 28 18:37:55 2022 +0200
second commit
…
The commit content is thus:
$ git cat-file -p 7f54a437d87cd1f241cfb893c4823bc7e60c19ec
tree 04b0192c1c88ac1c1a96f386e84e5388ef8a509a
parent 26349a25253f9b316db1a5d3c3f23c1ca5ca4e0e
author Marcin Wosinek <[email protected]> 1651163875 +0200
committer Marcin Wosinek <[email protected]> 1651163875 +0200
second commit
Git has added the parent line because we commit on top of another commit.
Other important data kept by Git are just references to a most recent commit. So my main branch is store in .git/refs/heads/main
, and its content is
$ cat .git/refs/heads/main
7f54a437d87cd1f241cfb893c4823bc7e60c19ec
or the ID of its topmost commit. We can find all the relevant information from the ever-expanding tree of commits:
When I create a simple tag:
$ git tag v1
A file is created in .git/refs/tags
:
$ cat .git/refs/tags/v1
7f54a437d87cd1f241cfb893c4823bc7e60c19ec
As you can see, both tags and branches are explicit references to a commit. The only difference between them is how Git treats them when we create a new commit:
The blob, tree, and commits are how Git stores the complete history of your repository. It does all the references by the object hash: there is no way of manipulating the history or files tracked in the repository without breaking the relations.
Originally published here.