Tim Pettersen

@kannonboy

What’s new in Git 2.11?

A lot, as it turns out! Git 2.11 has been released with a bunch of new features and usability improvements, particularly for those of you working on projects with deep histories, large files, or submodules. Here’s the awesome stuff that piqued our interest on the Bitbucket team:

Auto-sized SHA-1 abbreviations

Git uses SHA-1 hashes, usually displayed as big ugly hexadecimal strings, to uniquely identify commits and other objects. When developers switch from centralized version control (where monotonically incrementing IDs are easy because they are centrally generated) to distributed version control (where each contributor generates their own IDs) the question of “what would happen if two different objects create the same hash?!” sometimes comes up. Fortunately, the chance of an accidental SHA-1 collision is vanishingly small. Scott Chacon, author of Pro Git, puts it best:

If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (3.6 million Git objects) and pushing it into one enormous Git repository, it would take roughly 2 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision. A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.

However, to save on terminal real estate, many common commands in Git 2.10 and earlier abbreviate SHAs to the first 7 hexdigits of the hash.

Abbreviated SHAs

This can be a problem as abbreviated SHAs are short enough that they may eventually lose their uniqueness. I say eventually because Git will always output enough characters to ensure the abbreviated SHA is unique at the time the command is run. But if you record a short SHA in a commit message, an email, or another external system, it may no longer be unique when you copy and paste it back into a Git command run against the repository at some future date.

Abbreviating from 40 hexdigits to 7 brings the number of possible SHAs down from the unthinkably large 1.46 x 10⁴⁸ to the still quite large 268,435,456. However, Git creates a lot of objects that each require their own SHA-1: one for each commit and annotated tag, one for each version of each file, and one for each tree object (which are created each time a directory is created or its contents change). Even worse, the birthday paradox means that you only need a relatively small number of objects — just 19,290 in fact — to have a 50% chance of two of them having the same leading 7 hexdigits.
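If you'd like to check those numbers yourself, here is a minimal sketch using the standard birthday-problem approximation. It assumes SHA-1 output behaves like uniformly random data, and the function name is my own invention:

```python
import math

def objects_for_collision(hex_digits, probability=0.5):
    """Approximate number of random objects needed before two of them share
    the same leading `hex_digits`, with the given probability."""
    space = 16 ** hex_digits  # number of distinct abbreviations of this length
    # Birthday-problem approximation: n ~ sqrt(2 * N * ln(1 / (1 - p)))
    return math.sqrt(2 * space * math.log(1 / (1 - probability)))

print(16 ** 7)                    # 268435456 possible 7-hexdigit abbreviations
print(f"{16 ** 40:.2e}")          # 1.46e+48 possible full SHA-1 values
print(objects_for_collision(7))   # roughly 19,290 objects for a 50% chance
```

The approximation is slightly rough for exact counts, but it lands on the same order of magnitude however you round it.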

Git 2.11 introduces a new feature where SHA abbreviation length is calculated based on an estimate of the number of objects in your repository. The Linux kernel has just passed 5 million objects in its Git repository, and defaults to twelve hexdigits under Git 2.11:

linux.git

Git’s own repository, with just over 220,000 objects, defaults to nine hexdigits:

git.git
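To get a feel for how an object count could translate into an abbreviation length, here is a rough sketch. The rule is an assumption on my part rather than something lifted from Git's source: take the number of bits needed to write the object count, halve it (rounding up) to get hex digits, and never go below the old default of 7. It does reproduce the two figures quoted above:

```python
MINIMUM_ABBREV = 7  # assumed floor, matching the pre-2.11 default length

def estimated_abbrev_length(object_count):
    """Guess an abbreviation length for a repository with `object_count` objects."""
    # With about 2**bits objects, birthday collisions become likely unless the
    # abbreviation carries about 2 * bits bits. At 4 bits per hex digit that is
    # bits / 2 hex digits, rounded up.
    bits = object_count.bit_length()
    hex_digits = (bits + 1) // 2
    return max(MINIMUM_ABBREV, hex_digits)

print(estimated_abbrev_length(5_000_000))  # 12, as for linux.git
print(estimated_abbrev_length(220_000))    # 9, as for git.git
print(estimated_abbrev_length(1_000))      # 7, small repositories keep the old default
```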

If you’re in the habit of recording short SHAs for your own projects, you may wish to future proof yourself by overriding the SHA abbreviation length manually:

$ git config --global core.abbrev 12

As a bonus, Git 2.11 has also learnt to list conflicting objects referenced by an abbreviated SHA:

So if you do have the misfortune to run into an ambiguously abbreviated SHA-1, Git now gives you enough information to resolve the ambiguity yourself.
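Here is a toy illustration of how a short SHA that was unique when you recorded it can become ambiguous later. The object IDs are made up for the example and the helper names are hypothetical. Git's real lookup works against its object store, not a Python list:

```python
def candidates(prefix, object_ids):
    """All object IDs that start with the given abbreviation."""
    return sorted(oid for oid in object_ids if oid.startswith(prefix))

def shortest_unique_abbrev(oid, object_ids, min_len=7):
    """Shortest prefix of `oid` (at least `min_len` long) matching only `oid`."""
    for length in range(min_len, len(oid) + 1):
        if len(candidates(oid[:length], object_ids)) == 1:
            return oid[:length]
    return oid

# Made-up 40-hexdigit IDs; the first two share their leading 7 digits.
a = "1f3a9c2" + "0" * 33
b = "1f3a9c2" + "b" + "1" * 32
c = "9e107d9" + "d" * 33

repo = [a, c]
print(shortest_unique_abbrev(a, repo))  # 1f3a9c2, unique today

repo.append(b)                          # a new object arrives later
print(len(candidates("1f3a9c2", repo))) # 2, the recorded short SHA is now ambiguous
print(shortest_unique_abbrev(a, repo))  # 1f3a9c20, one more digit is needed
```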

Submodules, submodules, submodules

Git submodules have seen a raft of improvements in recent Git releases:

  • parallelized fetching in Git 2.8
  • git clone --shallow-submodules in Git 2.9
  • .gitmodules “shallow” flags in 2.10

I still recommend submodules only as a last resort, as dependency management systems are typically more effective for combining projects. However, Git 2.11 does have three notable improvements that make submodules significantly easier to use if you do need them. Feel free to skip to the section on git diff improvements if you have no interest in submodules.

Alternates for submodules

The --reference option can be used with git clone to specify another local repository as an "alternate" object store, to save recopying objects over the network that you already have locally. The syntax is:

$ git clone --reference <local repo> <url>

As of Git 2.11, you can use --reference in combination with --recurse-submodules to set up submodule alternates pointing to submodules from another local repository. The syntax is:

$ git clone --recurse-submodules --reference <local repo> <url>

This can potentially save a huge amount of bandwidth and local disk, but will fail if the referenced local repository does not have all the required submodules of the remote repository that you’re cloning from.

Fortunately the handy --reference-if-able option will fail gracefully and fall back to a normal clone for any submodules missing from the referenced local repository:

$ git clone --recurse-submodules --reference-if-able <local repo> <url>

Submodule diffs

Previously Git had two modes for displaying diffs of commits that updated your repository’s submodules:

  • git diff --submodule=short displays the old commit and new commit from the submodule referenced by your project (this is also the default if you omit the --submodule option altogether):
git diff --submodule=short
  • git diff --submodule=log is a bit more useful, and displays the summary line from the commit message of any new or removed commits in the updated submodule:
git diff --submodule=log
  • Git 2.11 introduces a third option: --submodule=diff. This displays a full diff of all changes in the updated submodule:
git diff --submodule=diff

git ls-files --recurse-submodules

git ls-files learnt the --recurse-submodules option. Traditionally git ls-files listed tracked files in your repository, and as of Git 2.11 can list files tracked by your submodules as well:

git ls-files --recurse-submodules

This perhaps isn’t a command that you’d run every day, but it’s handy if you want to write a script that walks every file in your repository and its submodules.

Experimental diff improvements

git diff can produce some slightly confusing output when the lines before and after a modified section are the same. This can happen when you have two or more similarly structured functions in a file. For a slightly contrived example, imagine we have a simple file that contains a single function:

/* @return {string} "Bitbucket" */
function productName() {
    return "Bitbucket";
}

Now imagine we’ve committed a change that prepends another function that does something similar:

/* @return {string} "Bitbucket" */
function productId() {
    return "Bitbucket";
}

/* @return {string} "Bitbucket" */
function productName() {
    return "Bitbucket";
}

You’d expect git diff to show the top five lines as added, but it actually incorrectly attributes the very first line to the original commit:

The wrong comment is included in the diff.

Not the end of the world, but the couple of seconds of cognitive overhead from the whaaat? every time this happens can add up.

Git 2.11 introduces a new experimental diff option, --indent-heuristic, that attempts to produce more aesthetically pleasing diffs:

git diff --indent-heuristic

Under the hood, --indent-heuristic cycles through the possible diffs for each change and assigns each a “badness” score. This is based on heuristics like whether the diff block starts and ends with different levels of indentation (which is aesthetically bad) and whether the diff block has leading and trailing blank lines (which is aesthetically pleasing). Then the block with the lowest badness score is output.
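To make the idea concrete, here is a heavily simplified sketch of that scoring loop. It is not Git's actual algorithm: the two rules and their weights are invented for illustration, and the function names are hypothetical. It uses the example from above, written with indentation and a blank line between the two functions:

```python
def indent(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))

def candidate_blocks(lines, start, end):
    """Every placement of the added block lines[start:end] that describes the same change."""
    # Slide the block up while the line above it equals the block's last line.
    while start > 0 and lines[start - 1] == lines[end - 1]:
        start, end = start - 1, end - 1
    blocks = [(start, end)]
    # Then slide down while the block's first line equals the line just below it.
    while end < len(lines) and lines[start] == lines[end]:
        start, end = start + 1, end + 1
        blocks.append((start, end))
    return blocks

def badness(lines, start, end):
    """Toy score: lower is prettier. The weights are arbitrary."""
    first, last = lines[start], lines[end - 1]
    score = 0
    if indent(first) != indent(last):
        score += 2  # block starts and ends at different indentation levels
    if first.strip() == "" or last.strip() == "":
        score -= 1  # block has a leading or trailing blank line
    return score

new_file = [
    '/* @return {string} "Bitbucket" */',
    'function productId() {',
    '    return "Bitbucket";',
    '}',
    '',
    '/* @return {string} "Bitbucket" */',
    'function productName() {',
    '    return "Bitbucket";',
    '}',
]

# The confusing diff marks indices 1..5 as added, borrowing the second comment.
blocks = candidate_blocks(new_file, 1, 6)
best = min(blocks, key=lambda b: badness(new_file, *b))
print(blocks)  # [(0, 5), (1, 6)]
print(best)    # (0, 5), the block that starts at the first comment
```

Both placements produce exactly the same file, which is why a diff tool is free to pick whichever one reads best.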

This feature is experimental, but you can test it out ad-hoc by applying the --indent-heuristic option to any git diff command. Or if you like to live on the bleeding edge, you can enable it across your system with:

$ git config --global diff.indentHeuristic true

Simpler stash IDs

The git stash command is a nifty little tool for temporarily shelving changes while you work on something else. If you're a fan, you probably know that you can shelve multiple changes, and view them with git stash list:

git stash list

However you may not know why Git’s stashes have such awkward identifiers — stash@{1}, stash@{2}, etc. — and may have written them off as “just one of those Git idiosyncrasies”. It turns out that like many Git features, these weird IDs are actually a symptom of a very clever use (or abuse) of the Git data model.

Under the hood, the git stash command actually creates a set of special commit objects that encode your stashed changes, and maintains a reflog that holds references to these special commits. This is why the output from git stash list looks a lot like the output from the git reflog command. When you run git stash apply stash@{1}, you're actually saying “apply the commit at position 1 from the stash reflog”.

As of Git 2.11, you no longer have to use the full stash@{n} syntax. Instead, you can reference stashes with a simple integer indicating their position in the stash reflog:

$ git stash show 1
$ git stash apply 1
$ git stash pop 1

And so forth. If you’d like to learn more about how stashes are stored, I wrote a little bit about it in How git stash works.
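Conceptually, the integer shorthand is just an expansion back into the reflog syntax. A minimal sketch of that mapping, assuming a bare run of digits is the only thing that gets rewritten (the function name is hypothetical, and the real logic lives inside git stash itself):

```python
def stash_ref(arg):
    """Expand a bare stash index such as "1" into the reflog form "stash@{1}"."""
    if arg.isascii() and arg.isdigit():
        return "stash@{" + arg + "}"
    return arg  # anything else is passed through untouched

print(stash_ref("1"))           # stash@{1}
print(stash_ref("stash@{2}"))   # stash@{2}
```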

Long running filter processes

Bitbucket supports the popular Git LFS (Large File Storage) extension to help users who need to efficiently track large binary files in their repositories. Git 2.11 comes with a couple of improvements to make Git LFS much faster and more pleasant to use. The most important change is that Git now supports long-running clean and smudge filter processes for transforming LFS pointers, rather than having to invoke a new process each time.

When you git add a file, clean filters can be used to transform (or clean) the file’s contents before being written to the Git object store. Git LFS reduces your repository size by using a clean filter to squirrel away large file content in the LFS cache, and adds a tiny “pointer” file to the Git object store instead.

The Git LFS clean filter converts large files into tiny pointer files.

Smudge filters are the opposite of clean filters — hence the name. When file content is read from the Git object store during a git checkout, smudge filters have a chance to transform it before it’s written to the user’s working copy. The Git LFS smudge filter transforms pointer files by replacing them with the corresponding large file, either from your LFS cache or by reading through to your Git LFS store on Bitbucket.

The Git LFS smudge filter converts pointer files back into the large file content.
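The clean and smudge round trip can be sketched in a few lines. This is a toy model, not the Git LFS implementation: the cache is a plain dictionary standing in for the local LFS cache, there is no remote store, and the function names are my own. The pointer layout follows the published Git LFS pointer format (a version line, a SHA-256 oid, and a size):

```python
import hashlib

POINTER_VERSION = "version https://git-lfs.github.com/spec/v1"

lfs_cache = {}  # stand-in for the local LFS cache: oid -> original file content

def clean(content: bytes) -> bytes:
    """Stash the real content in the cache and return a small pointer instead."""
    oid = hashlib.sha256(content).hexdigest()
    lfs_cache[oid] = content
    pointer = f"{POINTER_VERSION}\noid sha256:{oid}\nsize {len(content)}\n"
    return pointer.encode("utf-8")

def smudge(pointer: bytes) -> bytes:
    """Turn a pointer back into the content it refers to."""
    fields = dict(line.split(" ", 1) for line in pointer.decode("utf-8").splitlines())
    oid = fields["oid"].split(":", 1)[1]
    return lfs_cache[oid]

big_file = b"\x89PNG" + bytes(1000)   # pretend this is a large binary
pointer = clean(big_file)             # what would land in the Git object store
print(len(big_file), len(pointer))    # the pointer is far smaller than the file
print(smudge(pointer) == big_file)    # True, checkout restores the original bytes
```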

Traditionally smudge and clean filter processes were invoked once per file that was being added or checked out. So a project with 1,000 files tracked with Git LFS invoked the git-lfs-smudge command 1,000 times for a fresh checkout! While each operation is relatively quick, the overhead of spinning up 1,000 individual smudge processes is costly.

As of Git 2.11, smudge and clean filters can be defined as long running processes that are invoked once for the first filtered file, then fed subsequent files that need smudging or cleaning until the parent Git operation exits. Lars Schneider, who contributed long running filters to Git, nicely summarized the impact of the change on Git LFS performance:

Filter process is 💥80x faster💥 on macOS and 💥 58x faster💥 on Windows for the test repo with 12k files. On Windows that means the tests runs in 57 seconds instead of 55 minutes!

That’s a seriously impressive performance gain! Note that you’ll need to upgrade to both Git 2.11 and Git LFS 1.5 to take advantage of these speed improvements.

git cat-file --filters

Another small improvement for users of Git LFS and other filter-based extensions is the new --filters option for the git cat-file command. git cat-file -p lets you inspect objects in your Git object store (the -p is for “pretty print”). For example, I can view the contents of the file images/bitbucket.png on the branch v2 with:

$ git cat-file -p v2:images/bitbucket.png

The contents of a Git LFS pointer file, courtesy of git cat-file -p

However, bitbucket.png happens to be tracked with Git LFS, and git cat-file skips the Git LFS smudge filter (and any other filters that we've defined), so rather than getting our image back we just get the contents of the Git LFS pointer file.

As of Git 2.11, you can ensure the appropriate filters are applied with the --filters option. For Git LFS files, this will dereference the pointer and look up the actual content, typically printing an impressive amount of binary junk to your terminal:

The actual LFS file content, courtesy of git cat-file --filters

For binary content, it’s usually better to redirect the output to a temporary file instead of your terminal. Or, if you’re on macOS, you can pipe it straight to an appropriate application in one step with open -a <app> -f:

git cat-file --filters <commit-ish>:<path> | open -a Preview.app -f

Alternatively, Atlassian’s free Git client SourceTree has superb support for Git LFS, including binary previews of LFS-tracked content. And I don’t use the term superb lightly: SourceTree’s founding developer, Steve “Sinbad” Streeting, also happens to have contributed 31 KLOC to the Git LFS project.

receive.maxInputSize

Git 2.11 also introduces a server-side setting for restricting the number of bytes that can be transferred in a single push. This is handy for repo administrators who want to prevent users from accidentally pushing large binaries (that are better tracked with Git LFS) to the upstream repository. The receive.maxInputSize setting is useful if you're running your own homegrown Git server or otherwise administering your server config by hand. Running the following command on your Git server will limit the push size to 10 megabytes:

$ git config --global receive.maxInputSize 10485760
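The magic number in that command is simply 10 × 1024 × 1024 bytes. If you want a different limit, a throwaway helper like this one (the name is my own) gives you the value to plug in:

```python
def mebibytes(n):
    """Convert a size in binary megabytes (MiB) to bytes."""
    return n * 1024 * 1024

print(mebibytes(10))  # 10485760
print(mebibytes(50))  # 52428800
```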

If you’re using Bitbucket Server, I’d recommend using a pre-receive hook add-on like ScriptRunner to constrain pushes. The Bitbucket hook API gives you much more flexibility over things like file size and type, naming conventions, and enforcing commit provenance, rather than just limiting the total number of bytes pushed.

Merge comparison shorthand

Git has a bunch of neat shorthand for traversing your commit history. One popular command for showing the changes that were merged in a merge commit is:

$ git log <merge commit>^..<merge commit>

which roughly translates to:

Show me all the new commits that were merged with <merge commit>!

The ^ in the command means “parent of”, and the double-dot syntax specifies that we're interested in commits that are reachable from the merge commit, but not reachable from its first parent. (The first parent of a merge is the commit that was at the tip of the checked out branch when the merge was performed.)

git log <merge>^..<merge> shows the commits merged into the target branch by <merge>
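If the double-dot semantics feel abstract, this toy model may help. The commit graph below is hypothetical (M merges the feature commits F1 and F2 into a branch whose tip was B), and a range is computed as a plain set difference of reachable commits:

```python
# Hypothetical history: each commit maps to its list of parents.
parents = {
    "A": [],
    "B": ["A"],
    "F1": ["A"],
    "F2": ["F1"],
    "M": ["B", "F2"],  # first parent B, second parent F2
}

def reachable(commit):
    """The commit itself plus every ancestor of it."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def rev_range(excluded, included):
    """Commits selected by `excluded..included`."""
    return reachable(included) - reachable(excluded)

first_parent, second_parent = parents["M"]
print(sorted(rev_range(first_parent, "M")))   # ['F1', 'F2', 'M']
print(sorted(rev_range(second_parent, "M")))  # ['B', 'M']
```

The first result is what M^..M selects: the merge commit plus everything it brought in. The second is the same comparison made against the second parent.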

Git 2.11 introduces a shorthand for this syntax:

$ git log <merge commit>^-1

You can also replace ^-1 with ^-2 to compare the merge commit to its second parent, to see commits that were not on the merged branch.

git log <merge>^-n shows commits from all branches that aren’t the nth parent of <merge>

Or, if you’re into octopus merges (merges involving more than two parents), you can use ^-n to compare against the nth parent.

This new form works anywhere you can specify a revision range! For example, you can use:

$ git diff <merge commit>^-1

to generate a diff of all the changes introduced by a merge commit.
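One way to read the ^-n shorthand is as a purely textual expansion into the longer range form. A small sketch of that expansion follows. The function name is hypothetical, and it assumes that a missing n means 1:

```python
import re

def expand_merge_shorthand(rev):
    """Rewrite "<rev>^-<n>" as "<rev>^<n>..<rev>"; leave anything else alone."""
    match = re.fullmatch(r"(.+)\^-(\d*)", rev)
    if match is None:
        return rev
    base = match.group(1)
    n = match.group(2) or "1"  # assumption: no number means the first parent
    return f"{base}^{n}..{base}"

print(expand_merge_shorthand("HEAD^-1"))    # HEAD^1..HEAD
print(expand_merge_shorthand("abc123^-2"))  # abc123^2..abc123
print(expand_merge_shorthand("master"))     # master
```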

</tome>

You made it, thanks for reading! This is a long post, but we only managed to get through ten of the 100+ features, fixes, and performance improvements shipped by Junio and his merry band of Git wizards in version 2.11. Check out the full release notes, and then install the latest version of Git (or upgrade using your favorite package manager).

And if you ever want to chat about Git or Bitbucket, grab me on Twitter: I’m @kannonboy.
