Understanding Git — Data Model

Since its birth in 2005, git has become massively popular, especially in the open source world, and many of us use it at work too. It is a great VCS tool with many advantages, but being easy to learn is not one of them, which can be frustrating given how often we use it. In my opinion, the only way to get comfortable with git, and maybe even start loving it, is to learn how it works internally. The reason was perfectly summarized by Edward Thomson in his lecture Deep Dive Into Git:

The Git commands are just a leaky abstraction over the data storage.

This is why, no matter how many git commands or tips and tricks you memorize or store in your git cheatsheet, you will stay confused by git’s strange ways unless you understand how it works under the hood: every once in a while, git’s internals leak through the abstraction layer its (frontend) commands give you.

You can do better

So in this Understanding Git series we will cover git’s internals (don’t worry, we will not go into git’s source code), and the first thing on that list is git’s heart and soul: the data model.

To start, we will initialise an empty git repository in our project directory:

git init

Git will inform us that it has created a .git directory inside our project’s directory, so let’s take a quick peek at what it looks like:

$ tree .git/

.git/
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

8 directories, 14 files

Some of these files and directories may sound familiar to you (particularly HEAD), but for now we will focus on the .git/objects directory. It holds no objects yet, but we will change that in a moment.

Let’s create an index.php file

touch index.php

give it some content

<?php
echo "Hello World";

and a README.md file

touch README.md

and give it some content too:

# Description
This is my hello world project

Now let’s stage and commit them:

git add .
git commit -m "Initial Commit"

OK, nothing special here, adding and committing — we’ve all “been there, done that”.

If we take a look again at our .git directory we can see that the .git/objects directory has some subdirectories and files now:

├── objects
│   ├── 5d
│   │   └── 92c127156d3d86b70ae41c73973434bf4bf341
│   ├── a6
│   │   └── dbf05551541dc86b7a49212b62cfe1e9bb14f2
│   ├── cf
│   │   └── 59e02c3d2a2413e2da9e535d3c116af1077906
│   ├── f8
│   │   └── 9e64bdfcc08a8b371ee76a74775cfe096655ce
│   ├── info
│   └── pack

(Note: directories and files can/will have different names on your computer)

We will get back to .git/objects, but for now notice that every directory name is two characters long. Git computes a 40-character SHA-1 checksum for every object; the first two characters of that checksum are used as the directory name and the remaining 38 as the file (object) name.
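The naming convention can be sketched in a few lines of Python. This is only an illustration of the rule described above; the function name `loose_object_path` is made up for this example and is not part of git.

```python
from pathlib import Path


def loose_object_path(sha1_hex: str, git_dir: str = ".git") -> Path:
    """Return the path where an object with this checksum would be stored.

    The first two hex characters become the directory name,
    the remaining 38 become the file name.
    """
    if len(sha1_hex) != 40:
        raise ValueError("expected a 40-character SHA-1 hex digest")
    return Path(git_dir) / "objects" / sha1_hex[:2] / sha1_hex[2:]


# The commit checksum from the listing above:
print(loose_object_path("a6dbf05551541dc86b7a49212b62cfe1e9bb14f2").as_posix())
# .git/objects/a6/dbf05551541dc86b7a49212b62cfe1e9bb14f2
```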

The first kind of object git creates is the blob object (strictly speaking, blobs are already written when we stage files with git add). In our case there are two of them, one for each file we committed:

Blob objects associated with our index.php and README.md files

They contain snapshots of our files (the content of our files at the time of the commit), and each one is identified by its checksum.
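To make the checksum less mysterious, here is a minimal Python sketch of how a blob's checksum is derived: git hashes a small header (`blob`, a space, the content size, and a NUL byte) followed by the file content. The helper name `hash_object` is borrowed from the git hash-object command, but the code is an illustration, not git's implementation.

```python
import hashlib


def hash_object(obj_type: str, data: bytes) -> str:
    """Compute the SHA-1 checksum git assigns to an object.

    The hashed bytes are "<type> <size>\\0" followed by the raw content.
    """
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


# An empty file always gets the same blob checksum:
print(hash_object("blob", b""))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# So does any other content, regardless of the file's name:
print(hash_object("blob", b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
```

Note that the file name is not part of the hashed bytes, so two files with identical content share one blob.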

The next kind of object git creates is the tree object. In our case there is only one, and it contains a list of all files in our project, each with a pointer to its blob object (this is how git associates your files with their blob objects):

Tree object pointing to blob objects
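As a rough sketch of what such a tree object holds, the Python below serializes a flat list of entries the way git stores them: for each entry the file mode, a space, the file name, a NUL byte, and the 20 raw bytes of the blob's checksum, sorted by name. The function names are made up for this example; the two checksums are the blob names from the directory listing earlier.

```python
import hashlib


def hash_object(obj_type: str, data: bytes) -> str:
    """SHA-1 over "<type> <size>\\0" + content, as git does for every object."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, sha1_hex) entries into a tree object's content."""
    def sort_key(entry):
        mode, name, _ = entry
        # git orders entries by name, treating directories as "name/"
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, sha1_hex in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha1_hex)
    return body


entries = [
    ("100644", "index.php", "5d92c127156d3d86b70ae41c73973434bf4bf341"),
    ("100644", "README.md", "cf59e02c3d2a2413e2da9e535d3c116af1077906"),
]
body = tree_body(entries)
print(len(body))                  # 74 bytes: two entries of 37 bytes each
print(hash_object("tree", body))  # the checksum this tree would be stored under
```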

And finally, git creates a commit object that has a pointer to its tree object (along with some other information):

Commit object points to its tree object

If we look back at our .git/objects directory things should look clearer now.

With git log we can see our commit history:

commit a6dbf05551541dc86b7a49212b62cfe1e9bb14f2
Author: zspajich <[email protected]>
Date: Tue Jan 23 13:31:43 2018 +0100

Initial Commit

And using the naming convention we mentioned earlier, we can find our commit object in .git/objects:

├── objects
│   ├── a6
│   │   └── dbf05551541dc86b7a49212b62cfe1e9bb14f2

To look at its content we can’t simply use the cat command, since these are not plain text files (git stores them zlib-compressed), but git has a cat-file command we can use:

git cat-file commit a6dbf05551541dc86b7a49212b62cfe1e9bb14f2

to get the content of our commit object:

tree f89e64bdfcc08a8b371ee76a74775cfe096655ce
author zspajich <[email protected]> 1516710703 +0100
committer zspajich <[email protected]> 1516710703 +0100

Initial Commit
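For the curious, this is roughly what cat-file has to do for a loose object: inflate the file with zlib, split off the header, and return the rest. The Python sketch below works on bytes held in memory so it can run anywhere; in a real repository those bytes would be read from a file under .git/objects. The function name `parse_object` is made up for this example.

```python
import zlib


def parse_object(compressed: bytes):
    """Inflate a loose object and split it into (type, content)."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return obj_type, body


# Simulate what git would have written for a small blob:
content = b"hello\n"
stored = zlib.compress(b"blob %d\x00" % len(content) + content)

print(parse_object(stored))
# ('blob', b'hello\n')
```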

Here we see the pointer to our commit’s tree object, and to examine its content we use the git ls-tree command:

git ls-tree f89e64bdfcc08a8b371ee76a74775cfe096655ce

and as expected it does contain a list of our files with pointers to their blob objects:

100644 blob cf59e02c3d2a2413e2da9e535d3c116af1077906 README.md
100644 blob 5d92c127156d3d86b70ae41c73973434bf4bf341 index.php

We can look at the blob object representing (for example) index.php with the cat-file command:

git cat-file blob 5d92c127156d3d86b70ae41c73973434bf4bf341

and we see that it contains our index.php file’s content:

<?
echo "Hello World!"

So that is what happens when we create and commit some files.

Now we’ll do another commit. This time, let’s say we made some changes to our index.php file (added some code magic) and committed those changes:

Git creates a new blob object for the file that has changed

As we see, git has now created a new blob object with a new snapshot of index.php. Since README.md hasn’t changed, no new blob object is created for it; git reuses the existing one instead (we’ll see in a second how).

Now, when git creates the new tree object, the blob pointer assigned to index.php is updated, while the blob pointer assigned to README.md simply stays the same as in the previous commit’s tree.

Pointer to index.php blob is updated and pointer to README.md blob stays the same
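The reuse falls out of the fact that objects are stored under the checksum of their content. The Python sketch below models the object store as a plain dictionary (an assumption made purely for illustration) and takes two snapshots; the new index.php content is invented for the example.

```python
import hashlib


def blob_hash(data: bytes) -> str:
    """Checksum of a blob: SHA-1 over "blob <size>\\0" + content."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


store = {}  # stand-in for .git/objects: checksum -> content


def snapshot(files: dict) -> dict:
    """Store every file as a blob and return a name -> checksum mapping."""
    tree = {}
    for name, content in files.items():
        sha = blob_hash(content)
        store.setdefault(sha, content)  # identical content is stored only once
        tree[name] = sha
    return tree


readme = b"# Description\nThis is my hello world project\n"
first = snapshot({"index.php": b'<?php\necho "Hello World";\n', "README.md": readme})
second = snapshot({"index.php": b'<?php\necho "Hello World";\necho "Magic";\n', "README.md": readme})

print(first["README.md"] == second["README.md"])  # True: same blob is reused
print(first["index.php"] == second["index.php"])  # False: new blob for new content
print(len(store))                                 # 3 blobs in total, not 4
```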

And at the end, git creates a commit object with a pointer to its tree object and also a pointer to its parent commit object (every commit except the first one has at least one parent):

Commit object points to its tree and also has a pointer to its parent commit object
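Parent pointers are what turn individual commits into a history. The following Python sketch builds two commit objects in the same text layout that cat-file showed earlier and then walks from the newest commit back to the first one. The author identity, the second tree checksum, and the commit messages are placeholders, and the dictionary again stands in for .git/objects.

```python
import hashlib


def hash_object(obj_type: str, data: bytes) -> str:
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


WHO = "Jane Doe <jane@example.com> 1516710703 +0100"  # hypothetical identity


def make_commit(store, tree, parents, message):
    """Build a commit object's text, store it, and return its checksum."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {WHO}", f"committer {WHO}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    sha = hash_object("commit", body)
    store[sha] = body
    return sha


def parents_of(store, sha):
    """Read the parent pointers from a stored commit's header lines."""
    header = store[sha].decode().split("\n\n", 1)[0]
    return [l.split(" ", 1)[1] for l in header.splitlines() if l.startswith("parent ")]


def history(store, sha):
    """Follow first-parent pointers back to the root commit."""
    while sha:
        yield sha
        parents = parents_of(store, sha)
        sha = parents[0] if parents else None


store = {}
first = make_commit(store, "f89e64bdfcc08a8b371ee76a74775cfe096655ce", [], "Initial Commit")
second = make_commit(store, "1" * 40, [first], "Add some code magic")  # placeholder tree

print(list(history(store, second)) == [second, first])  # True
```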

So now that we know how git handles file adding and editing, the only thing that remains is to see how it handles file deletion:

Git deletes the entry for index.php in tree object

It’s very simple: git deletes the file entry (the file name with the pointer to its blob object) from the tree object. In this case we deleted index.php in our commit, so there is no longer an index.php entry in that commit’s tree object (in other words, our commit’s tree object no longer has a pointer to a blob object representing index.php).

There is just one more addition to the data model we presented: tree objects can be nested (they can point to other tree objects). You can think of it this way: every blob object represents a file and every tree object represents a directory, so if we have nested directories we will have nested tree objects.

Let’s look at an example:

Tree objects can point to other tree objects

Here, our project would have one README.md file and one app directory with two files (app.php and app_dev.php).

Git uses blob objects to recreate the content of our files at any given point in time (commit) and tree objects to reproduce our project’s folder structure.
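The nesting can be sketched recursively in Python: a directory is written by first writing everything inside it and then recording the resulting checksums in its own tree object. The project layout mirrors the example above, while the file contents, the dictionary used as an object store, and the function names are assumptions made for illustration.

```python
import hashlib


def hash_object(obj_type: str, data: bytes) -> str:
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def write_tree(directory: dict, store: dict) -> str:
    """Store a nested dict (name -> bytes or dict) as blobs and trees."""
    entries = []
    for name, value in directory.items():
        if isinstance(value, dict):
            # A subdirectory becomes its own tree object (mode 40000)
            entries.append(("40000", name, write_tree(value, store)))
        else:
            sha = hash_object("blob", value)
            store[sha] = ("blob", value)
            entries.append(("100644", name, sha))
    # git orders entries by name, treating directories as "name/"
    entries.sort(key=lambda e: (e[1] + "/" if e[0] == "40000" else e[1]).encode())
    body = b"".join(
        f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha)
        for mode, name, sha in entries
    )
    sha = hash_object("tree", body)
    store[sha] = ("tree", body)
    return sha


project = {
    "README.md": b"# Description\n",
    "app": {"app.php": b"<?php // prod\n", "app_dev.php": b"<?php // dev\n"},
}
store = {}
root = write_tree(project, store)
kinds = sorted(kind for kind, _ in store.values())
print(kinds)  # ['blob', 'blob', 'blob', 'tree', 'tree']
```

The root tree ends up with two entries: a blob pointer for README.md and a tree pointer for app.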

So there you have it: git’s data model. It is in fact a simple data model, and in the next post we’ll look at branching and how git’s data model makes branching very cheap and simple.

If you wish to dig deeper into git’s data model, I would recommend this lecture from Scott Chacon and also going through the Git Internals chapter of his Pro Git book.