I’m not sure how it happened, but I’ve gone all this time without knowing about Git Large File Storage (LFS). Developed by GitHub, this tool stores a pointer in a Git blob object in the Git object database in place of the large file, which is then stored on a remote server managed by GitHub.
This enables a developer to “upload” large files to their local or remote repositories for versioning just like any other Git blob object. In essence, no matter how large the file, the pointer will always stay the same size since it is just a simple text file, and the size of your Git database will not be negatively affected.
While this tool may not seem useful to a developer working mostly with text files, it’s extremely handy when dealing with large binary files such as images, audio, and in my case, tar archives of root filesystems. These situations, i.e., needing to store large files in Git, may not arise often, but when they do, it’s good to know that there are useful tools available to you.
A fairly large drawback to frequent use of Git LFS is that you’re only given a small amount of storage for free. This makes sense, of course, but I quickly hit the 1GB limit, and upgrading is expensive. It would be cheaper to get a VPS or another cloud storage solution.
Still, it’s a viable solution, and one that frankly may not be that well known. Of course, it’s probably just me that was unaware of it. I am, after all, just another bozo on the bus.
Install
At the time of this writing, the latest version is 3.1.1: https://github.com/git-lfs/git-lfs/releases/tag/v3.1.1
After downloading it, extract the archive and execute the install.sh
script. This will install the pre-compiled Go binary git-lfs
and run the following command:
$ git lfs install
Updated git hooks.
Git LFS initialized.
Running this adds the following section to the global Git config file (.gitconfig
):
[filter "lfs"]
clean = git-lfs clean -- %f
smudge = git-lfs smudge -- %f
process = git-lfs filter-process
required = true
Of course, you don’t have to run the install script. If you look at its contents, it’s actually quite simple and really only saves a couple of manual steps.
Further, it doesn’t install the man files located in the archive’s man/
directory. If you’re interested in installing those, you’ll have to do that yourself:
$ sudo install --mode 0644 man/*1 /usr/local/man/man1
$ sudo install --mode 0644 man/*5 -D --target-directory /usr/local/man/man5
The
-D
flag will create all components of the--target-directory
, which was necessary on my system becauseman5
did not exist.
Now, view the man pages!
$ man git-lfs-update
$ man 5 git-lfs-config
Weeeeeeeeeeeeeeeeeeeeeeeeeeeeee
If you’re not sure where
man
looks for man files, run themanpath
command.
With the installation out of the way, let’s look at a simple example.
Example
There is clearly more to the CLI than what I’m covering here, however, this is demonstrating the most common use case.
The git-lfs
tool isn’t tracking any file paths until they are added. You can use globs to target multiple files, such as globbing all files with the extension .tar
, but for my case I prefer to track them explicitly so I will give the full relative path.
Here’s my working directory:
$ tree -aI .git . ├── CHANGELOG.md ├── COPYING ├── .gitattributes ├── hugo │ ├── Dockerfile │ ├── .Dockerignore │ ├── hugo.nspawn │ ├── hugo.sh │ ├── hugo.tar.xz │ └── README.md ├── README.md └── tor-browser ├── Dockerfile ├── .Dockerignore ├── install_tor_browser.sh ├── README.md ├── tor-browser.nspawn └── tor-browser.tar.xz
Note that the paths are relative to the root of the repository.
$ git lfs track "hugo/hugo.tar.xz"
Tracking "hugo/hugo.tar.xz"
The first time that git-lfs-track
is called, it will create a .gitattributes
file in the current working directory, if it doesn’t exist.
Tracking another file will add another entry to the .gitattributes
file.
$ git lfs track "tor-browser/tor-browser.tar.xz"
Tracking "tor-browser/tor-browser.tar.xz"
$ cat .gitattributes
hugo/hugo.tar.xz filter=lfs diff=lfs merge=lfs -text
tor-browser/tor-browser.tar.xz filter=lfs diff=lfs merge=lfs -text
Let’s verify that it’s tracking the paths that we just added:
$ git lfs track
Listing tracked patterns
hugo/hugo.tar.xz (.gitattributes)
tor-browser/tor-browser.tar.xz (.gitattributes)
Listing excluded patterns
We’ll now add and commit the new files:
git add *.tar.xz .gitattributes
git commit -m 'Added nspawn container tarballs'
Now that the objects have been committed to the database, the tool is aware of the files that we added above. Let’s verify that:
Running the next command will verify the files and/or file patterns that are being tracked:
$ git lfs ls-files
bfd840fa63 * hugo/hugo.tar.xz
68540ee38c * tor-browser/tor-browser.tar.xz
$ git lfs ls-files -s
bfd840fa63 * hugo/hugo.tar.xz (22 MB)
68540ee38c * tor-browser/tor-browser.tar.xz (201 MB)
Note that running
git-lfs-track
only adds the file paths to track, not the files themselves. It’s only when the files are added to the Git object database bygit-add
andgit-commit
thatgit-lfs
knows how to manage them.
How It Works
So, how does this work? Simply. A pointer to the large file is stored in the Git history, and the actual file itself resides on another server. Obviously, all this is abstracted and hidden from the user, and the large file appears as a regular file stored in the Git object database like any other blob.
On GitHub, the file in the repo appears as though it’s been committed in the history, even though we know now that it’s just a small file that contains a pointer to the real location.
So, can we download the large file from the remote repository like we’re used to? Yes!
We can prove this using curl
. We’ll turn on verbose
output to get extra info when dumping the header information for all of the redirects. Each hop will be prefaced with the text < location
(at the beginning of new line), so a simple pipeline to a tool that can do textual searches will do the trick:
$ curl -vIL https://github.com/btoll/machines/raw/master/tor-browser/tor-browser.tar.xz 2>&1 \
| ag "^< location:"
< location: https://media.githubusercontent.com/media/btoll/machines/master/tor-browser/tor-browser.tar.xz
We see that there’s only one hop, and the file itself resides on the media
server of the githubusercontent.com
domain.
Now that we’ve proven that we have full access to the large file through the usual means, let’s investigate what’s stored in the Git history.
Let’s drill down into the Git object database to verify that Git itself is only saving a pointer to the large file, which should be only a few kilobytes.
Before we do, let’s see the size of the large file on disk (in bytes):
$ stat --format=%s tor-browser/tor-browser.tar.xz
201371436
Now, let’s get the contents of the tree
object from the latest commit of the branch pointed to by HEAD
. First, we’ll verify that the last commit contains the objects we’re interested in.
$ git log -1 HEAD --oneline
f3c7847 (HEAD -> master, origin/master) Added nspawn container tarballs
That indeed contains the tarballs, so let’s drill into the commit’s tree
object:
$ git cat-file -p master^{tree}
100644 blob bfbf5761375d136171d57bd47a65561cc4763cdc .gitattributes
040000 tree 340818d02e10f3367f36e2816e9c02c9abc6472b hugo
040000 tree 65ce727ccb86801c1e5b8b2a8e115eb78c1269ca tor-browser
And take a look at the tree
object that represents tor-browser
:
$ git cat-file -p 65ce727
100644 blob 6a7707ceea144076d2fa9d20307bfd0b2e743163 .Dockerignore
100644 blob 7d21223cd1f1053bbc3a76c46a44c977038a9481 Dockerfile
100644 blob b750388bfaf5e08fbf38847194795f9970434280 README.md
100755 blob 57ff4a74e17febe8497f6a1fc016c74ac9523499 install_tor_browser.sh
100644 blob 3b3eab8bd420a5006a7aa0b91d5bac6216e62b47 tor-browser.nspawn
100644 blob d82da0c1d752847d10bdbc270041103cf8c8165f tor-browser.tar.xz
Finally, let’s see the size of the tor-browser.tar.xz
in bytes. Again, this is the pointer to the large file stored on the aforementioned server:
$ git cat-file -s d82da0c
134
Ok, about what we’d expect, and clearly not the same file as what is on the hard disk.
As a sanity to assure ourselves that the Git blob objects are indeed storing a representation of the files on disk (and they’re not all pointers, for example), let’s see what it reports for another file that we’ve committed to Git that resides within the same commit and tree
object.
Is the size of the shell script install_tor_browser.sh
in the working directory the same as that of the blob
boject in the Git object database?
$ stat --format=%s tor-browser/install_tor_browser.sh
1335
$ git cat-file -s 57ff4a7
It is!
Ok, this all looks good, and we can feel pretty good that the Git LFS tool is indeed doing what’s described on the tin.
Before we call this a wrap, let’s look at something that everyone who uses this tools should be aware of.
Caveat Emptor
Let’s say that you want to use Git LFS to track and manage files that have already been committed in your git history.
No problem, you think to yourself, I’ll check out the repo’s README
and see what it has to say about what is probably a fairly common use case.
At the time of this writing, there was no warning in the examples section of the Git LFS
README
that warned of the consequences of running thegit-lfs-migrate
command inimport
mode.
Armed with the suggestion from the README
, I’ll just sidle up to the keyboard and type that in.
$ git lfs migrate import --include="*.zed" --everything
migrate: Sorting commits: ..., done.
migrate: Rewriting commits: 100% (2/2), done.
master f3c784746b91e5d0f9a68a47468bddc1df2b4650 -> cdb98a1b3e2efe3f50c2e7b462b91c4b899d3c79
migrate: Updating refs: ..., done.
migrate: checkout: ..., done.
Well, friend, you just rewrote your commit history, and when you have to force push the objects to the remote repository, you will immediately become persona non grata to anyone else working on the project!
I present the following before and after snapshots of the git log as evidence.
Before
$ git log --oneline
f3c7847 (HEAD -> master) Added nspawn container tarballs
5373111 Initial commit
After
$ git log --oneline
cdb98a1 (HEAD -> master) Added nspawn container tarballs
275933a Initial commit
You can see that the hashes have changed, which of course will happen anytime its contents changed. If you’re not expecting this side effect, you will be in for quite a surprise.
The new commit with the “Initial commit” commit message (hash 275933a) now includes a new .gitattributes
file that was not part of the original commit.
There are other differences, but I’ll only focus on the addition of the
.gitattributes
file to the first commit.
Let’s explore further. Since the Git history only includes two commits, we can view the Git objects that make up the first commit with the following command (the first output displayed below is before the git-lfs-migrate
command that we issued above):
Before
$ git cat-file -p master^^{tree}
040000 tree 4d32b2a7c05a99a091f4739340138a14ba00a167 hugo
040000 tree de471a5e090040da0b376f75cbe744046696fa7f tor-browser
After
$ git cat-file -p master^^{tree}
100644 blob 5ff9029b4eb0b849c29f4ca93546e7044478349e .gitattributes
040000 tree 4d32b2a7c05a99a091f4739340138a14ba00a167 hugo
040000 tree de471a5e090040da0b376f75cbe744046696fa7f tor-browser
The side effect of writing the entire Git history is alarming to me, especially since this isn’t called out in the README
until further down in the Uninistall
section.
I submitted a pull request to call out this behavior.
Why is the first commit futzed with? Well, it is here, but I’m not sure if that will always be the case. I haven’t read through the code, so I don’t know what algorithm is being used, but I do know that every reference is being inspected, both locally and remotely.
git-lfs
may only insert the .gitattributes
file immediately preceding the first commit that contains a file that matches a given tracked file path, or it may always just alter the first to be safe.
Ultimately, it is the user’s responsibility to read the documentation and be aware of the impact of the commands they are running. I just feel it is a bit cavalier to not call more attention to commands that could impact every commit object in the database.
Note that running the inverse command
git lfs migrate export
will reverse theimport
, but it will still give the commit objects new (and different) hashes and will cause you to force push your changes which will (again) have anyone also working on this repository to hate your guts.
Conclusion
Although this may only be of limited use to me, it’s good to know that such a tool exists.
My original use case for this was as a remote storage for my root filesystem tarballs that I download for running containers using the systemd-nspawn
container manager.
Essentially, I’d export the nspawn
containers to a tarball and then upload them using Git LFS to a public repository. However, having discovered through the course of writing this article that there is a relatively small limit to the free tier, I’ll probably upload them to a VPS instead.
In a world where things were more equitable and only a very small number of companies run by horrible people didn’t control most of the world’s infrastructure, I’d consider uploading the archives to a storage offering like S3.
Better to give my money to a small company or use local backup.