How Git Internally Works

Understanding this powerful tool better

“Do you know how to use Git?”

This was one of the first questions I was asked on my first day working as a software developer. Probably the same thing has happened to many of you (or will happen to you, soon).

I remember back then I knew that Git was a tool that allowed many people to work simultaneously on the same project hosted in a repository (usually in the cloud).

Each developer made a copy of the project on their computer and performed their tasks there. Once a task was finished, the main repository was updated with the new changes, and everyone else could update their local version of the project by downloading those changes from the main repository. I also knew that Git kept a history of all previous versions of the code. If at any time you needed one of the previous versions, you could go back and look for it. I also knew how to use some basic Git commands such as git add or git push.

That’s it. That’s all I knew about Git.

  • “Don’t worry. 95% of the time you are going to use the same 6 or 7 commands. There are only two things you must remember. Every time you start a new task, create a copy of the development branch and work within it. And never, ever, EVER, push anything directly to the production branch.”

 

As time went on, I realized that what they told me was true: I almost always used the same commands: git pull, git checkout, git add, git commit, git push, and git merge. I learned the concept behind a commit (checkpoints in the version history) and I began to get familiar with the use of branches (paths that fork from a commit).

Over time, I faced situations where it was necessary to use new commands like git reset or git revert and I also learned some tricks like git stash or git cherry-pick.

By then, I already felt like a Git expert. 😎

One day, I decided to take a test to assess how much I knew about Git. These were some of the questions I had to answer:

  1. ‘What is a blob?’
  2. ‘What information does a commit contain?’
  3. ‘What are the three states of the Git workflow?’

 

On an approximate scale of 1 to 200, the test gave me a 37.

That day I felt that I was a fraud.

I wasn’t the Git expert I thought I was. Actually, I was a beginner who had learned how to manipulate Git. I had learned to use Git in software projects with other people, but the truth was, I didn’t know what Git was really doing.

That day, I decided to understand the model that Git was based on, the behind-the-scenes of Git.

It was then, when I understood how Git works under the hood, that everything started to make sense.

In this article, we’ll stop using Git on autopilot and understand once and for all how Git internally works.

What is Git?

Note: Remember that you must have Git installed on your machine. If you don’t have it, you can download it from the official Git site.

Let’s look at the official documentation to see what Git says about itself. The NAME line of its manual page (man git) reads: “git - the stupid content tracker”.

Wait, what? It’s a joke, right? Actually no, it’s not.

The official definition is as follows: “Git is a fast, scalable, distributed revision control system with an unusually rich command set that provides both high-level operations and full access to internals.” But if you look at the core of Git, it’s just that: a tool to track content.

This was the first big revelation I had. I thought Git was only for software projects, but it’s not. Git is a content tracker and the content can be whatever you want.

This may seem like a very simple revelation, but in fact, it made me realize how little I really knew about Git.

How does Git track content?

We have to think of Git as a table that maps a unique key to a value (our content). Each value is a sequence of bytes (our content converted into a sequence of bytes).

When we give Git a value, Git stores it and generates a hash key for it, using the SHA-1 algorithm. A hash key is a small value that is used to represent a large piece of data in a hash system. Git uses these keys to identify the value that was stored. It doesn’t matter what machine or operating system we’re on. If the value is the same, we will get the same hash key from Git every time. This was the second big revelation I had.
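To make the table idea concrete, here is a minimal sketch that stores a value and reads it back by its key. It uses the low-level hash-object and cat-file commands that appear later in this article; the temporary folder is just an assumption so that nothing real is touched.

```shell
# Throwaway repository (the temporary path is only illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Store a value: -w writes it to the object database and prints its key.
key=$(printf "Rodrigo" | git hash-object -w --stdin)
echo "$key"

# Look the value up again using only its key.
value=$(git cat-file -p "$key")
echo "$value"

# Asking for the key of the same value again gives the same key.
key_again=$(printf "Rodrigo" | git hash-object --stdin)
echo "$key_again"
```

The first and last lines printed should be identical, and the middle one should be the original content.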

Let’s understand by doing

I’m going to create a project to manage the tasks of my family.

Let’s initialize a Git repository:

				
					# Initialize an empty Git repository inside the family-to-do-list folder
$ git init family-to-do-list

# Open the family-to-do-list folder
$ cd family-to-do-list

# Examine the .git folder
$ ls .git
HEAD  config  description  hooks/  info/  objects/  refs/
				
			

When initializing a new Git repository, a .git folder is created by default, with several folders and files inside it. For now, let’s focus on the objects/ folder, which is Git’s object database. This is where Git will create all the objects needed to save the different versions of our project.

				
					$ ls .git/objects
info/  pack/
				
			

The objects/ folder contains two folders: info/ and pack/ . These folders are only used by Git for low-level optimizations. Just ignore them.

We could say that by default, the objects/ folder is empty.

Let’s add our first file to the project:

				
					# Create a Members.txt file with Rodrigo as its first member.
$ printf "Rodrigo" >> Members.txt

# cat is a utility command that prints the content of a file.
$ cat Members.txt
Rodrigo

# Let's check the current status after adding our new file
$ git status
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        Members.txt
				
			

What Git is telling us here is that we have files that are untracked. These files are living in what Git calls the Working Directory. If we want to add these changes to the next commit, we must first add them to an intermediary area.

				
					# Add new or modified files to the staging area
$ git add .
				
			

This command doesn’t create a commit; it doesn’t put anything into the history yet. This command says: “Hey Git, track this please, I want you to keep an eye on it”.

This second area, for tracked files, is called the Staging Area (its technical name is the index).
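If you are curious about what the index actually holds, the low-level command git ls-files --stage lists its entries. The following is only a sketch in a throwaway repository (the temporary folder is an assumption). It suggests that the index already records the hash of the file content as soon as git add runs.

```shell
# Throwaway repository (the temporary path is only illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q .

printf "Rodrigo" > Members.txt
git add Members.txt

# Each index entry has the form: <mode> <hash> <stage number> <path>
staged=$(git ls-files --stage)
echo "$staged"

# The hash stored in the index matches the hash of the file content.
staged_hash=$(echo "$staged" | cut -d' ' -f2)
content_hash=$(git hash-object Members.txt)
echo "$staged_hash"
echo "$content_hash"
```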

Let’s suppose I am sure that I want to create a commit with these changes and no others. Everything tracked in the Staging Area will be part of the new commit.

				
					# Create a new commit with the message "Create members file"
$ git commit -m "Create members file"
[master (root-commit) 54725fc] Create members file
 1 file changed, 1 insertion(+)
 create mode 100644 Members.txt
				
			

So, what happened here? Yes, I created my first commit.

But what else? New content now exists in the objects/ folder, our object database.

Let’s check it:

				
					# Examine the .git/objects folder
$ ls .git/objects
49/  54/  76/  info/  pack/
				
			

Now, besides the info/ and pack/ folders, we also have three new folders, one for each of the 3 new objects created.

Before continuing, you must know that Git has three fundamental object types: the blob, the tree, and the commit. Don’t worry, we will explain throughout this example what each of them is.

Let’s dig into these newly created objects and see exactly what they are.

We will start by looking at the commit we just created.

				
					# Show the commit log
$ git log
commit 54725fc233a67fa1ba4807271b840532e136f624 (HEAD -> master)
Author: Rodrigo Trelles <rodrigo.trelles5@gmail.com>
Date:   Sun Apr 17 10:43:48 2022 -0300

    Create members file
				
			

We can see that our first commit hash is 54725fc233a67fa1ba4807271b840532e136f624.

If we look closely, we can see that one of the new folders created inside the objects/ folder is named after the first 2 characters of the hash: 54/.

Let’s dig into this folder to see what we have inside it.

				
					# Examine the .git/objects/54 folder
$ ls .git/objects/54
725fc233a67fa1ba4807271b840532e136f624
				
			

The name of the file inside the 54/ folder is equal to the rest of the characters that make up the hash key of the commit.

54/725fc233a67fa1ba4807271b840532e136f624 = 54725fc233a67fa1ba4807271b840532e136f624

Note: The name of the folder created in the object database is the first two characters of the hash, and the name of the file inside this folder is the remaining 38 characters. This is the naming convention Git follows to store objects, and it helps Git find them more efficiently.
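The naming convention can be reproduced by hand. This sketch (run in a throwaway repository with a made-up identity, both assumptions) splits a commit hash into its two parts and checks that a file exists at the resulting path. It assumes the object is still stored as a loose file, which is the case for a freshly created commit.

```shell
# Throwaway repository with a hypothetical identity.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

printf "Rodrigo" > Members.txt
git add Members.txt
git commit -q -m "Create members file"

# Split the hash: first two characters = folder, the rest = file name.
hash=$(git rev-parse HEAD)
dir=$(printf '%s' "$hash" | cut -c1-2)
file=$(printf '%s' "$hash" | cut -c3-)
object_path=".git/objects/$dir/$file"

echo "$hash"
echo "$object_path"
ls -l "$object_path"
```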

We can confirm that this is the object inside the object database for this commit.

First object type: The commit

So, what does this object look like? What is the content of this object?

We can use a low-level git command called cat-file, which provides the content or the type for any object in the repository.

Note: It is absolutely fine if you never used this command before. It is very likely that you will never use it again. 😅

				
					# Run the command with the -t flag to return the object type
$ git cat-file 54725fc233a67fa1ba4807271b840532e136f624 -t

commit  

# Run the command with the -p flag to return the object content
$ git cat-file 54725fc233a67fa1ba4807271b840532e136f624 -p

tree 490296214553ccc9a66cd62ad8b05d56775bf465
author Rodrigo Trelles <rodrigo.trelles5@gmail.com> 1650203028 -0300
committer Rodrigo Trelles <rodrigo.trelles5@gmail.com> 1650203028 -0300

Create members file
				
			

This is what a commit looks like. In fact, this is literally the commit we have just created.

A commit is just plain text. Like any other content in Git, it was compressed (with zlib), assigned a hash, and finally persisted in the object database. A commit object holds metadata, including the author, the committer, the commit date, and the commit message.
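To see that the hash really is derived from that plain text, we can try to recompute it ourselves. Git hashes a small header (the object type, the size in bytes, and a NUL byte) followed by the content. The sketch below assumes the sha1sum utility is available and that the repository uses the default SHA-1 object format; the repository and identity are throwaway values.

```shell
# Throwaway repository with a hypothetical identity.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

printf "Rodrigo" > Members.txt
git add Members.txt
git commit -q -m "Create members file"

commit_id=$(git rev-parse HEAD)

# Size in bytes of the commit's plain-text content.
size=$(git cat-file -s "$commit_id")

# Hash "commit <size>\0" followed by the raw commit content.
recomputed=$(
  { printf 'commit %s\0' "$size"; git cat-file commit "$commit_id"; } |
    sha1sum | cut -d' ' -f1
)

echo "$commit_id"
echo "$recomputed"
```

Both printed lines should be the same hash.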

But we can see one more thing in the commit: a tree, with its own hash key, is also included.

If we perform the same search exercise that we did with the commit hash key we can find that there is an object in our repository with this same hash key.

				
					$ ls .git/objects
49/  54/  76/  info/  pack/

$ ls .git/objects/49
0296214553ccc9a66cd62ad8b05d56775bf465
				
			

49/0296214553ccc9a66cd62ad8b05d56775bf465 = 490296214553ccc9a66cd62ad8b05d56775bf465

This is the object inside the object database representing this tree.

Second object type: The tree

A tree is the object used to store directories in our project. A tree can point to other trees to build a complete hierarchy of files and subdirectories. It can also point to blobs.

Each commit points to a tree object that captures, in one complete snapshot, the state of the repository at the time the commit was performed. This snapshot is the project version we save in our Git history.

Let’s see what a tree looks like:

				
					$ git cat-file 490296214553ccc9a66cd62ad8b05d56775bf465 -t
tree 

$ git cat-file 490296214553ccc9a66cd62ad8b05d56775bf465 -p
100644 blob 76871a05855a2389ddac28d1b167dd48b2226141    Members.txt
				
			

The tree object contains one line per file or subdirectory. For each of them, it records permissions, object type, object hash, and filename.
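As a small sketch, the same four fields can be read one entry at a time with git ls-tree, which prints a tree in the format shown above. The throwaway repository and identity are assumptions for illustration.

```shell
# Throwaway repository with a hypothetical identity.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

printf "Rodrigo" > Members.txt
git add Members.txt
git commit -q -m "Create members file"

# Ask Git for the tree that the current commit points to.
tree_id=$(git rev-parse 'HEAD^{tree}')
tree_type=$(git cat-file -t "$tree_id")
echo "$tree_id is a $tree_type"

# Split every entry into its four fields.
git ls-tree "$tree_id" | while read -r mode type hash name; do
  echo "mode=$mode type=$type hash=$hash name=$name"
done

# Keep the type of the (only) entry for a quick check.
entry_type=$(git ls-tree "$tree_id" | awk '{print $2}')
```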

Here comes my third big revelation: The file names are controlled by the tree object, not by the files themselves. We’ll see soon why.

So, we have the reference to a blob object inside our tree. If we go back to check our objects database we will find that indeed this third object exists in it.

				
					$ ls .git/objects
49/  54/  76/  info/  pack/

$ ls .git/objects/76
871a05855a2389ddac28d1b167dd48b2226141
				
			

76/871a05855a2389ddac28d1b167dd48b2226141 = 76871a05855a2389ddac28d1b167dd48b2226141

Finally, let’s see what this blob object type looks like:

				
					$ git cat-file 76871a05855a2389ddac28d1b167dd48b2226141 -t
blob

$ git cat-file 76871a05855a2389ddac28d1b167dd48b2226141 -p
Rodrigo
				
			

Third object type: The blob

Blob is a contraction of “binary large object” and it holds a file’s data, but does not contain any metadata about the file or even its name. Each version of a file is represented as a blob.

Summarizing

  • We have three types of objects in Git: blobs, trees, and commits. Git always generates a key for any of them, a SHA-1 hash, and then persists the content zipped in the repository.
  • A commit points to a tree object that captures in a snapshot the entire state of the repository at a given point in time.
  • A tree points to blobs, or it can point to other trees to create a hierarchical structure.
  • A blob object is the content of a file and nothing more.

Every commit you create as you progress through your project adheres to these rules and saves these objects in the Git object database. This is how Git stores and handles our content.

This is what our first commit snapshot looks like:

[Figure: snapshot of the first commit. Commit 54725fc points to tree 4902962, which points to blob 76871a0 (Members.txt).]

We left a pending question, a really important one when we talk about trees and blobs. Why is it that the file name is stored in the tree and not in the blob? The answer to this question is one of the greatest revelations of all when it comes to understanding Git.

Let’s create a new commit to find out why.

The second commit

First, we add a new folder called Tasks/ to the root of the project. Inside it, we create a file for each task with the name of the person responsible for that task.

The first task is washing the dishes and I am in charge of doing it 🙂.

				
					# Create the Tasks folder
$ mkdir Tasks

$ cd Tasks

# Create a "Wash the dishes.txt" file with Rodrigo as the person
# in charge of doing it.
$ printf "Rodrigo" >> Wash\ the\ dishes.txt

# Go back to the project root before staging
$ cd ..

$ git add .

$ git commit -m "Create wash the dishes task"
[master 6522e62] Create wash the dishes task
 1 file changed, 1 insertion(+)
 create mode 100644 Tasks/Wash the dishes.txt
				
			

We have just added a new commit and we already know that something was created in our objects database folder. Let’s find out what.

As we did for the first commit, we start by looking at the new commit. This time, we use the git log flag --oneline to summarize the info of each commit.

				
					$ git log --oneline 
6522e62... (HEAD -> master) Create wash the dishes task
54725fc... Create members file

$ git cat-file 6522e62ca0294b9288ee188e0a6a0e09f75c4756 -t
commit 

$ git cat-file 6522e62ca0294b9288ee188e0a6a0e09f75c4756 -p
tree 5adf115c14c967d43674d9b26add84ae6d1a3bd9
parent 54725fc233a67fa1ba4807271b840532e136f624
author Rodrigo Trelles <rodrigo.trelles5@gmail.com> 1653248905 -0300
committer Rodrigo Trelles <rodrigo.trelles5@gmail.com> 1653248905 -0300

Create wash the dishes task

				
			

We get the expected content for the second commit, but we also have something new. This commit contains a parent entry. What is this parent? Well, it is our first commit. You can check it by comparing its id with the first commit id. Except for the first commit, all commits have (at least) one parent commit.
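The parent line is enough to walk the history backwards by hand. This sketch rebuilds two commits in a throwaway repository (the path and identity are assumptions), reads the parent out of the commit text, and follows the first parent of each commit until it reaches one without a parent.

```shell
# Throwaway repository with a hypothetical identity.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

printf "Rodrigo" > Members.txt
git add .
git commit -q -m "Create members file"
first=$(git rev-parse HEAD)

mkdir Tasks
printf "Rodrigo" > "Tasks/Wash the dishes.txt"
git add .
git commit -q -m "Create wash the dishes task"

# Read the parent from the commit header (stop at the first blank line,
# so the commit message is never scanned).
parent=$(git cat-file -p HEAD | sed -n '/^$/q; s/^parent //p')
echo "parent of HEAD: $parent"

# Follow first parents until a commit has none.
current=$(git rev-parse HEAD)
count=0
while [ -n "$current" ]; do
  count=$((count + 1))
  echo "$current"
  current=$(git cat-file -p "$current" |
    sed -n '/^$/q; s/^parent //p' | head -n 1)
done
echo "commits visited: $count"
```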

Let’s look into the tree now:

				
					$ git cat-file 5adf115c14c967d43674d9b26add84ae6d1a3bd9 -t
tree

$ git cat-file 5adf115c14c967d43674d9b26add84ae6d1a3bd9 -p
100644 blob 76871a05855a2389ddac28d1b167dd48b2226141    Members.txt
040000 tree 860fffa5f80edd65512f9d87fa7f343a67607819    Tasks
				
			

The tree contains two object references:

  1. The same blob reference with the name Members.txt as in the previous commit.
  2. A new tree reference called Tasks.

 

We said before, that trees are responsible for building a hierarchy in our project. We added a new folder to our project, so this new tree makes sense.

The blob reference with the name Members.txt didn’t change, so the content must be the same.

				
					$ git cat-file 76871a05855a2389ddac28d1b167dd48b2226141 -t
blob

$ git cat-file 76871a05855a2389ddac28d1b167dd48b2226141 -p
Rodrigo
				
			

Let’s see the new tree:

				
					$ git cat-file 860fffa5f80edd65512f9d87fa7f343a67607819 -t
tree

$ git cat-file 860fffa5f80edd65512f9d87fa7f343a67607819 -p
100644 blob 76871a05855a2389ddac28d1b167dd48b2226141    Wash the dishes.txt
				
			

As we expected, this tree holds the id of a blob object, referenced with the name Wash the dishes.txt.

Finally, let’s see the content of this blob.

				
					$ git cat-file 76871a05855a2389ddac28d1b167dd48b2226141 -t
blob

$ git cat-file 76871a05855a2389ddac28d1b167dd48b2226141 -p
Rodrigo
				
			

Indeed, this is our file. However, something just happened here. Did you spot it? The id of the Wash the dishes.txt blob object is equal to the Members.txt blob id.

But both are different files. Aren’t object ids supposed to be unique? They are not actually repeating themselves. Git is just being efficient. 😉

Git recognizes that both files have exactly the same content, so Git decides to create just one blob and to use the same reference twice, instead of creating two identical objects in the database.

This is more or less how Git thinks:

“Hey, I already have this zipped content in my database, there is no need to save another identical object. I’d better reuse this object and refer both cases to the same blob from the trees.”

Remember our second big revelation. If the value is exactly the same, we will get the same hash key from Git. This is what just happened here.

Note: I used the same colors for the same hash keys throughout the example to make it easier to identify match ups.

Git has a low-level command, hash-object (https://git-scm.com/docs/git-hash-object), which takes some piece of content and returns the hash key for it. We can pass a value to this command by piping it in. To tell hash-object to read from standard input, we must also use the --stdin flag.

In a nutshell, we are telling Git: “Here you have a string, give me the hash key for it”.

Both files we created in our example have the content Rodrigo. Let’s give Git the string Rodrigo and see what happens:

				
					$ printf "Rodrigo" | git hash-object --stdin
76871a05855a2389ddac28d1b167dd48b2226141
				
			

We get the same hash key. 🤯

Does this mean that if we create another .txt file with the content “Rodrigo”, the new commit will point to the same blob? Yes, exactly.

This is the biggest revelation I had. Git handles the objects created in the database so efficiently that it reuses a blob every time the content is the same. It doesn’t matter if it’s another file created in another folder.

This is the reason why the file names are saved in the trees. This decision allows Git to handle references to the same blob with different file names. Git is much smarter than I thought.
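Here is a small sketch to check this reuse yourself. It commits three files with the same content under different names and folders in a throwaway repository (the path and identity are assumptions), then counts how many distinct blobs the commit references.

```shell
# Throwaway repository with a hypothetical identity.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

# Three files, three names, one content.
printf "Rodrigo" > Members.txt
mkdir Tasks
printf "Rodrigo" > "Tasks/Wash the dishes.txt"
printf "Rodrigo" > Tasks/Laundry.txt
git add .
git commit -q -m "Three files, one content"

# -r makes ls-tree descend into subtrees and list every file.
git ls-tree -r HEAD

# The third column is the blob hash: count files and distinct blobs.
file_count=$(git ls-tree -r HEAD | wc -l | tr -d ' ')
unique_blobs=$(git ls-tree -r HEAD | awk '{print $3}' | sort -u | wc -l | tr -d ' ')
echo "files: $file_count, distinct blobs: $unique_blobs"
```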

This is what the snapshot of the second commit looks like:

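[Figure: snapshot of the second commit. Commit 6522e62 points to tree 5adf115, which contains blob 76871a0 (Members.txt) and tree 860fffa (Tasks). The Tasks tree points to the same blob 76871a0 (Wash the dishes.txt).]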

Finally, this is the snapshot of both commits sharing the same blob in the objects database.

2 commits, 3 trees, but only 1 blob. I have to be honest with you, this is simply beautiful to me.

Note: Let’s make a stop here. I know this is a lot of information (and hash keys) to process. Have a glass of water before continuing. 😊

References to remember commits in an easier way

Let’s go back to the second commit for a moment.

As we saw, a new entry called parent is inside it. This is the reference to its parent commit.

This reference is what allows us to maintain a historical order of the commits. With them, Git builds a thread of commits. By knowing the hash key of each commit, we can return to any previous state of our project whenever we want.

But a hash is a 40-digit hexadecimal number. Not easy for the human brain to remember. We feel more comfortable remembering words with a logical meaning, not random letters and numbers concatenated.

That’s why Git provides us with friendly references to avoid having to remember long hashes.

Let’s open the refs/ folder inside the .git folder and see what we have there for our example.

				
					$ ls .git/refs
heads/ tags/
				
			

Inside the refs/ folder we have heads/ and tags/ subfolders. Let’s open the heads folder first and see what we have here:

				
					$ ls .git/refs/heads
master
				
			

This is our master branch. Git set it as the current branch the moment we initialized the repository, and its file in heads/ appeared with our first commit.

Let’s create a new branch:

				
					# Create a new branch
$ git checkout -b branchA
Switched to a new branch 'branchA'

$ ls .git/refs/heads/
branchA master
				
			

Branches are references and all branches are saved inside the heads/ folder inside refs/.

Great, but I want to go one step deeper. What is a branch physically?

Let’s find out.

				
					$ cat .git/refs/heads/master
6522e62ca0294b9288ee188e0a6a0e09f75c4756
				
			

A branch is a hash. Did you recognize this hash? It is the hash of the second commit (You can check the colors). A branch is a pointer to a commit with a human-friendly name. Even simpler, a branch is a file with a string inside of it (the commit hash key).
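The sketch below checks this claim in a throwaway repository (the path and identity are assumptions). It compares the file content with what Git itself reports, and then creates a second branch simply by copying that file. Copying ref files by hand is only for illustration and assumes refs are stored as plain files, as in a fresh repository; in real work, git branch is the safe way.

```shell
# Throwaway repository with a hypothetical identity.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

printf "Rodrigo" > Members.txt
git add .
git commit -q -m "Create members file"

# Name of the current branch (master or whatever the default is).
branch=$(git symbolic-ref --short HEAD)

# The branch file contains exactly the hash Git resolves the branch to.
from_file=$(cat ".git/refs/heads/$branch")
from_git=$(git rev-parse "$branch")
echo "$from_file"
echo "$from_git"

# Illustration only: "create" a branch named handmade by copying the file.
cp ".git/refs/heads/$branch" .git/refs/heads/handmade
handmade=$(git rev-parse handmade)
echo "$handmade"
```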

With Git branches you can go back and forth in time within our commit history very quickly.

				
					$ cat .git/refs/heads/branchA
6522e62ca0294b9288ee188e0a6a0e09f75c4756
				
			

If we read the content of branchA, the hash key is the same because this branch was created from master branch, and no more commits were created afterward. Both pointers are “pointing” to the same commit.

But how does Git know what the current branch is?

Well, Git stores this information in the HEAD file, at the root of the .git folder.

				
					$ cat .git/HEAD
ref: refs/heads/branchA
				
			

Why is the content of this file not also a hash? As we saw before, only objects in the database have hashes. Branches are not objects (they are references) and what the HEAD is pointing at is a branch.

So, what HEAD stores is the reference to another reference. In conclusion, HEAD is a pointer to another pointer.
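Following the two pointers by hand looks roughly like this. It is a sketch in a throwaway repository (the path and identity are assumptions), and it assumes HEAD points to a branch stored as a plain file, as in our example.

```shell
# Throwaway repository with a hypothetical identity.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

printf "Rodrigo" > Members.txt
git add .
git commit -q -m "Create members file"
git checkout -q -b branchA

# First pointer: HEAD holds "ref: <path of a branch>".
head_content=$(cat .git/HEAD)
ref_path=${head_content#ref: }
echo "$ref_path"

# Second pointer: the branch file holds the commit hash.
resolved=$(cat ".git/$ref_path")
echo "$resolved"

# Git performs the same two steps when asked to resolve HEAD.
git_head=$(git rev-parse HEAD)
echo "$git_head"
```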

Let’s create a new commit to see what happens:

				
					$ cd Tasks

# Create a "Laundry.txt" file with Rodrigo as the person
# in charge of doing it.
$ printf "Rodrigo" >> Laundry.txt

# Go back to the project root before staging
$ cd ..

$ git add .

$ git commit -m "Create laundry task"
[branchA 17c4571] Create laundry task
 1 file changed, 1 insertion(+)
 create mode 100644 Tasks/Laundry.txt
				
			

Let’s check the current value of our references:

				
					
$ cat .git/refs/heads/master
6522e62ca0294b9288ee188e0a6a0e09f75c4756

$ cat .git/refs/heads/branchA
17c457182627b1a560a68a927a413fdeae1efc9b

$ cat .git/HEAD
ref: refs/heads/branchA
				
			

Makes sense, right?

  1. After the new commit was created, branchA updates its value because now it is pointing to a new commit. branchA is the current branch.
  2. HEAD file doesn’t change. It continues to point to branchA.
  3. master branch doesn’t change. It keeps the same hash.

The Tags

Our last reference is the tag.

A tag is a special reference used to mark a commit in history. Sounds similar to a branch, right?

The main difference between a tag and a branch is that a tag doesn’t move when new commits are created. Once created, a tag keeps pointing to the same commit unless someone deliberately moves or deletes it.

Note: tags are mostly used to identify release versions of our code. You can always be sure you will get the same snapshot used for that release.

Let’s create a tag. It will be saved inside the tags/ subfolder inside refs/.

				
					# Create a tag
$ git tag v1.0

$ cat .git/refs/tags/v1.0
17c457182627b1a560a68a927a413fdeae1efc9b
				
			

As we expected, the tag v1.0 is just a pointer to a commit (the current one).

When this project has grown a lot and branchA is already much further along in time, we can always quickly return to the state of our project in this commit using the tag v1.0.
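A quick sketch of that behavior: tag a commit, create one more commit, and compare where the tag and the branch point afterwards. The throwaway repository and identity are assumptions for illustration.

```shell
# Throwaway repository with a hypothetical identity.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

printf "Rodrigo" > Members.txt
git add .
git commit -q -m "Create members file"

# Tag the current commit and remember where the tag points.
git tag v1.0
tagged=$(git rev-parse v1.0)

# Move the project forward with one more commit.
printf "Rodrigo" > Laundry.txt
git add .
git commit -q -m "Create laundry task"

# The branch moved to the new commit, the tag did not.
after_tag=$(git rev-parse v1.0)
after_head=$(git rev-parse HEAD)
echo "tag before: $tagged"
echo "tag after:  $after_tag"
echo "HEAD now:   $after_head"
```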

Conclusions

Let’s summarize the 5 most important points of this article:

  1. Git stores three types of objects in its database: commits, trees, and blobs. Every object in the Git database has a hash key associated with it.
  2. With commits, trees, and blobs and using their hash keys as pointers, Git builds our project’s data hierarchy efficiently and without duplicating content.
  3. Hash keys are not easy for the human mind to remember, and that is why Git provides a friendlier way of referring to hashes: the references. We have branches, HEAD, and tags.
  4. The branch is a pointer to a commit. The default branch name in Git is master, but you can create as many as you want. The HEAD reference tells us which branch we are currently working on. Every time you create a new commit, the branch pointer moves forward automatically.
  5. Lastly, the tag reference is immutable and is always linked to the same commit to be able to always remember a moment in time.

 

I’m thinking of writing a second article, explaining the states of the Git workflow and seeing how each command we run affects it. But for now, it’s enough. It’s been a long road. 😇

I don’t expect you to believe me just because you read this blog.

In fact, I encourage you to create a repository and confirm for yourself that this is all true. Even more, you can create different repositories on different machines and compare the hashes created for exactly the same content.
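If you want a starting point, here is a sketch that builds two independent repositories with the same content and compares the hash of the resulting tree. The function name make_repo and the temporary folders are made up for this example. Commit hashes would still differ between the two, because a commit also contains the author and the date.

```shell
# make_repo (a hypothetical helper) creates a fresh repository with one
# file, stages it, and prints the hash of the tree built from the index.
make_repo() {
  dir=$(mktemp -d)
  git -C "$dir" init -q
  printf "Rodrigo" > "$dir/Members.txt"
  git -C "$dir" add Members.txt
  git -C "$dir" write-tree
}

# Two unrelated repositories, same content.
tree_a=$(make_repo)
tree_b=$(make_repo)

echo "$tree_a"
echo "$tree_b"
```

Both lines should show the same hash, even though the repositories know nothing about each other.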

Git is a super powerful tool that we use a lot, and understanding it will allow you to understand more about what you do on a daily basis. And if you ask me, there is nothing more beautiful than understanding why things happen.
