Stupid Gopher


Git from zero to hero - starting with foundations

As described in git page:

“Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.” — git-scm

Let’s first analyse what does this means.

Being a version control system, means that it allows you to create various versions of your project along the way.

Being distributed implies that it can be replicated along the network. It makes perfectly sense because it’s used by entire teams.

The first thing that it’s important to know is how git manages things to perform this version control. This article will focus in git foundations so that in next articles we can tackle more advanced concepts.

How version control is made

The first thing that must happen in order to use git, is to initialise a git repository inside a folder. Sure you can also clone an existing project, but let’s begin from the start.

> mkdir git-project
> cd git-project
> git init
Initialized empty Git repository in /Users/stupidgopher/git-project/.git/

So as you can see, an empty git repository was initialised in a .git folder inside our directory.

In order to git save versions of your code you have to perform an action, make an interaction. So the first conclusion is that there is a point of interaction between you and git.

Well this point of interaction is called commit. A commit can be described as a project change transaction to git.

In order to commit some changes we must add files or folders to the staging area. The staging are can be described as a bucket where you put the changes you want to commit next.

So let’s see this in action, by adding to stage our first file and folder:

> mkdir folder
> echo "a file" > folder/A
> echo "a file" > B
> git add folder B
> git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file:   B
new file: folder/A

The first feedback line says that we are on master branch. Branches will be covered in the next article. For now think that master is our context.

We added the folder and the file B and both of them are now changes to be committed.

Let’s commit the changes, using the option -m to associate a message. Associating a message is mandatory, so if you don’t provide one with -m, the default system editor will prompt you one message.

> git commit -m "Adding the first files"
[master (root-commit) 967d0ac] Adding the first files
2 files changed, 2 insertions(+)
create mode 100644 B
create mode 100644 folder/A

So we committed our first changes, creating the first version of our project. Let’s do the second commit so that you can start to see the big picture:

> echo "one more line" >> B # >> appends a new line to the file
> git add B
> git commit -m "Add changes to file B"

We have now two commits, the first one added the files to your git repository with the first lines. The second commit was built on top of the first one.

Git is like a graph database, each commit is built on the top of the previous one, except for the first one that don’t have any ancestor. So from here you can already grasp a model like the one I am going to show in the next image:

Graph showing a sequence of commits

Every commit hides behind a snapshot corresponding to a version. This is an important feature of git in opposition to other version control systems that save the changes instead. Why this is important?

Well if we are storing several versions of our project it’s expectable that in the future somehow we want to travel to a previous commit. By saving a snapshot instead of changes, we can go from point D to point A without revisiting intermediary points.

Commits IDs and reference/pointers

Like in any database it’s desirable that a record has an ID. Git is no exception, each time a commit is created a new unique ID is associated to that commit. In this case a sha1 hash is generated so that we can identify this commit later.

Another important characteristic of a commit is that once is created cannot be changed. It’s immutable, so if we need to change anything including the message, a new commit must be generated.

Let’s execute git log command with a couple of options that will make the output prettier:

> git log --pretty --graph --oneline
* d316650 (HEAD -> master) Add changes to file B
* 967d0ac Adding the first files

As you can see in this case our first commit has an hash starting with 967d0ac and the second commit d316650. In case you are following along and reproducing my steps your hashes are going to be different.

You may noticed that in the last commit we have (HEAD -> master). This is an indication that HEAD is pointing to master branch and master is pointing to the last commit.

HEAD is a reference that points to the current state of your project.

As you may have guessed the a branch is also a pointer, but with different characteristics from HEAD. This will be explained in the next article.

Drill-down version storage

To end this article in beauty we are going to drill-down the snapshot storage, using the commit as the start point. The commands that I am going to use are not essential to everyday usage. But in my opinion is important to know how things works under the hood to use git with more confidence.

Until now a commit is defined by an SHA1 ID and a commit message. If we execute git log without options, we are going to see more information:

> git log
commit d316650f83e2cc2ec2f44485c521b3f42872d403 (HEAD -> master)
Author: *** <***>
Date: Sun Jun 17 01:52:53 2018 +0100
Add changes to file B
commit 967d0acee11b5156c48b833c24cc6696e6804bd3
Author: *** <***>
Date: Sun Jun 17 01:38:03 2018 +0100
Adding the first files

Until now this is what we got:

We know that a commit as a message, author and date. We didn’t see how a commit references an ancestor and the snapshot of our tree at that version.

We are going to use a new command git cat-file. Git cat-file provides content or type and size information for repository objects like described in git manual.

So git uses objects to store information. We are going to discover this objects in an interactive way, so let’s start:

> git log --pretty --graph --oneline
* d316650 (HEAD -> master) Add changes to file B
* 967d0ac Adding the first files

First let’s analyse the first commit:

> git cat-file -t 967d0ac
commit # the type of this object is commit
> git cat-file -p 967d0ac
tree 1b8369392f39b30c2da28ae285476572394e145e
author *** <***> 1529195883 +0100
committer *** <***> 1529195883 +0100
Adding the first files

So we executed first the cat-file with -t to see the type of commit object. As expected the type is commit. Secondly we executed cat-file with -p option “pretty print” and we have the author and message inside the object. But the new information is a tree reference. What is a tree object?

> git cat-file -t 1b8369392f39b30c2da28ae285476572394e145e
> git cat-file -p 1b8369392f39b30c2da28ae285476572394e145e
100644 blob 02f6335fc4f28cc4ea2d0846aacff267a149effb B
040000 tree da97193026f17ed933494b723517904586648324 folder

The tree object as his type says it’s a tree of files. In this case the tree object as two references. The first one is a blob and notice that in the front of this reference we have the filename in this case B. The second one is another tree, as expected, because as you know we added a file and a folder to our project.

Let’s analyse the blob:

> git cat-file -t 02f6335fc4f28cc4ea2d0846aacff267a149effb
> git cat-file -p 02f6335fc4f28cc4ea2d0846aacff267a149effb
a file

A blob it’s an object that stores the file content. Just for the sake of consistency let’s see the tree at the same level of this blob.

> git cat-file -t da97193026f17ed933494b723517904586648324
> git cat-file -p da97193026f17ed933494b723517904586648324
100644 blob 02f6335fc4f28cc4ea2d0846aacff267a149effb A

Wow, this tree as a reference to a file named A with the same blob that we saw previously. This is because the content of file A inside folder is the same as the content of file B. Git is very clever, it only stores what he needs to store nothing more than that.

So we analysed the first commit that is a special case, because it hasn’t any ancestor. So let’s see the second commit:

> git cat-file -t HEAD # HEAD > MASTER > current commit

> git cat-file -p HEAD
tree 3c7e122c50226fdb638ad38f44beffe717d4922e
parent 967d0acee11b5156c48b833c24cc6696e6804bd3
author *** <***> 1529196773 +0100
committer *** <***> 1529196773 +0100
Add changes to file B

Notice the reference in this commit to is ancestor.

Finally let’s just see the tree. It’s important to have in mind that the second commit only introduces changes in file B.

> git cat-file -p 3c7e122c50226fdb638ad38f44beffe717d4922e
100644 blob 84a9ec3809e10f40ac7e1b248e036c0ce1237ee6 B
040000 tree da97193026f17ed933494b723517904586648324 folder

The hash of the tree in folder remains the same, git only builds new blobs and trees for the changes. Everything that is untouched remains the same.

how git keeps track of versions

Hope you enjoyed this first article, sorry if it was too long. In the next article we are going to focus on branches.


Stupid Gopher

More by Stupid Gopher

Topics of interest

More Related Stories