Understanding this powerful tool better
“Do you know how to use Git?”
This was one of the first questions I was asked on my first day working as a software developer. Probably the same thing has happened to many of you (or will happen to you, soon).
I remember back then I knew that Git was a tool that allowed many people to work simultaneously on the same project hosted in a repository (usually in the cloud).
Each developer made a copy of the project on their computer and performed their tasks there. Once a task was finished, the main repository was updated with the new changes, and everyone else could update their local version of the project by downloading the new changes from the main repository. I also knew that Git kept a history of all previous versions of the code. If at any time you needed one of the previous versions, you could go back and look for it. I also knew how to use some basic Git commands, such as git add.
That’s it. That’s all I knew about Git.
- “Don’t worry. 95% of the time you are going to use the same 6 or 7 commands. There are only two things you must remember. Every time you start a new task, create a copy of the development branch and work within it. And never, ever, EVER, push anything directly to the production branch.”
As time went on, I realized that what they told me was true: I almost always used the same commands, git push and git merge among them. I learned the concept behind a commit (a checkpoint in the version history) and I began to get familiar with the use of branches.
By then, I already felt like a Git expert. 😎
One day, I decided to take a test to assess how much I knew about Git. These were some of the questions I had to answer:
- ‘What is a blob?’
- ‘What information does a commit contain?’
- ‘What are the three states of the Git workflow?’
On an approximate scale of 1 to 200, the test gave me a 37.
That day I felt that I was a fraud.
I wasn’t the Git expert I thought I was. Actually, I was a beginner who had learned how to manipulate Git. I had learned to use Git in software projects with other people, but the truth was, I didn’t know what Git was really doing.
That day, I decided to understand the model that Git was based on, the behind-the-scenes of Git.
It was only when I understood how Git works under the hood that everything started to make sense.
In this article, we’ll stop using Git on autopilot and understand once and for all how Git works internally.
What is Git?
Let’s see what the official Git documentation says.
Note: Remember that you must have Git installed on your machine. If you don’t have it, you can download it from the official Git site.
Let’s look at the official documentation to see what Git says about itself: “git - the stupid content tracker”.
Wait, what? It’s a joke, right? Actually no, it’s not.
The official description goes on: “Git is a fast, scalable, distributed revision control system with an unusually rich command set that provides both high-level operations and full access to internals.” But if you look at the core of Git, it’s just that: a tool to track content.
This was the first big revelation I had. I thought Git was only for software projects, but it’s not. Git is a content tracker and the content can be whatever you want.
This may seem like a very simple revelation, but in fact, it made me realize how little I really knew about Git.
How does Git track content?
We have to think of Git as a table that maps a unique key to a value (our content). Each value is a sequence of bytes (our content converted into a sequence of bytes).
When we give Git a value, Git stores it and generates a hash key for it, using the SHA-1 algorithm. A hash key is a small value that is used to represent a large piece of data in a hash system. Git uses these keys to identify the value that was stored. It doesn’t matter what machine or operating system we’re on. If the value is the same, we will get the same hash key from Git every time. This was the second big revelation I had.
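To make the table idea concrete, here is a minimal Python sketch of a content-addressed store. It is a simplification under stated assumptions: the ContentStore class and its method names are made up for illustration, and it hashes the raw bytes directly with SHA-1, whereas real Git hashes a small header together with the content, so these keys will not match the ones Git produces.

```python
import hashlib


class ContentStore:
    """Toy key-value table: the key is derived from the value itself."""

    def __init__(self):
        self._table = {}

    def put(self, value: bytes) -> str:
        # The key is the SHA-1 digest of the bytes, written as 40 hex characters.
        key = hashlib.sha1(value).hexdigest()
        # Storing the same bytes twice just overwrites an identical entry.
        self._table[key] = value
        return key

    def get(self, key: str) -> bytes:
        return self._table[key]

    def __len__(self):
        return len(self._table)


store = ContentStore()
first_key = store.put(b"Rodrigo")
second_key = store.put(b"Rodrigo")   # same content, so the same key comes back
other_key = store.put(b"Rodrigo\n")  # one extra byte gives a different key
print(first_key == second_key, first_key == other_key, len(store))
```

The key depends only on the bytes, never on the machine or on any name we might give the content, which is the property the rest of the article relies on.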
Let’s understand by doing
I’m going to create a project to manage the tasks of my family.
Let’s initialize a Git repository:
# Initialize an empty Git repository inside the familiy-to-do-list folder
$ git init familiy-to-do-list

# Open the familiy-to-do-list folder
$ cd familiy-to-do-list

# Examine the .git folder
$ ls .git
HEAD  config  description  hooks/  info/  objects/  refs/
When initializing a new Git repository, a .git folder is created by default, with several folders and files inside it. For now, let’s focus on the objects/ folder, which is Git’s object database. This is where Git will create all the objects it needs to save the different versions of our project.
$ ls .git/objects
info/  pack/
The objects/ folder contains two folders: info/ and pack/. These folders are only used by Git for low-level optimizations. Just ignore them.
We could say that, by default, the objects/ folder is empty.
Let’s add our first file to the project:
# Create a Members.txt file with Rodrigo as its first member
$ printf "Rodrigo" >> Members.txt

# cat is a utility command that prints the content of a file
$ cat Members.txt
Rodrigo

# Check the current status after adding our new file
$ git status
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        Members.txt
What Git is telling us here is that we have files that are untracked. These files are living in what Git calls the Working Directory. If we want to add these changes to the next commit, we must first add them to an intermediary area.
# Add new or modified files to the staging area
$ git add .
This command doesn’t create a commit; it doesn’t put anything into the history yet. This command says: “Hey Git, track this please, I want you to keep an eye on it”.
This second area for tracked files is called the Staging Area (its technical name is the index).
Let’s suppose I am sure that I want to create a commit with these changes and no others. Everything tracked in the Staging Area will be part of the new commit.
# Create a new commit with the message "Create members file"
$ git commit -m "Create members file"
[master (root-commit) 54725fc] Create members file
 1 file changed, 1 insertion(+)
 create mode 100644 Members.txt
So, what happened here? Yes, I created my first commit.
But what else? Git just created some objects in the objects/ folder, our object database.
Let’s check it:
# Examine the .git/objects folder
$ ls .git/objects
49/  54/  76/  info/  pack/
Besides the info/ and pack/ folders, we also have 3 new objects created.
Before continuing, you must know that Git has three fundamental object types: the blob, the tree, and the commit. Don’t worry, we will explain throughout this example what each of them is.
Let’s dig into these newly created objects and see exactly what they are.
We will start by looking at the commit we just created.
# Show the commit log
$ git log
commit 54725fc233a67fa1ba4807271b840532e136f624 (HEAD -> master)
Author: Rodrigo Trelles
Date:   Sun Apr 17 10:43:48 2022 -0300

    Create members file
We can see that our first commit hash is 54725fc233a67fa1ba4807271b840532e136f624.
If we look closely, we can see that the name of one of the new folders created inside the objects/ folder matches the first 2 characters of the hash: 54.
Let’s dig into this folder to see what we have inside it.
# Examine the .git/objects/54 folder
$ ls .git/objects/54
725fc233a67fa1ba4807271b840532e136f624
The name of the file inside the 54/ folder is equal to the rest of the characters that make up the hash key of the commit.
Note: The name of each folder created in the objects database is the first two characters of an object’s hash, and the name of the file inside that folder is the rest of the hash. This is the naming convention Git follows to store data, and it helps Git find objects more efficiently.
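The naming convention is simple enough to sketch in a few lines of Python. The helper name loose_object_path is hypothetical, and the sketch only covers objects stored as individual files like the ones in this example.

```python
from pathlib import Path


def loose_object_path(git_dir: str, object_hash: str) -> Path:
    """Return where an object with this hash would live under git_dir."""
    if len(object_hash) != 40:
        raise ValueError("expected a full 40-character SHA-1 hash")
    # The first two characters name the folder, the remaining 38 name the file.
    return Path(git_dir) / "objects" / object_hash[:2] / object_hash[2:]


commit_path = loose_object_path(".git", "54725fc233a67fa1ba4807271b840532e136f624")
print(commit_path.as_posix())  # .git/objects/54/725fc233a67fa1ba4807271b840532e136f624
```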
We can confirm that this is the object inside the object database for this commit.
First object type: The commit
So, what does this object look like? What is the content of this object?
We can use a low-level Git command called cat-file, which provides the content or the type of any object in the repository.
Note: It is absolutely fine if you have never used this command before. It is very likely that you will never use it again. 😅
# Run the command with the -t flag to return the object type
$ git cat-file -t 54725fc233a67fa1ba4807271b840532e136f624
commit

# Run the command with the -p flag to return the object content
$ git cat-file -p 54725fc233a67fa1ba4807271b840532e136f624
tree 490296214553ccc9a66cd62ad8b05d56775bf465
author Rodrigo Trelles 1650203028 -0300
committer Rodrigo Trelles 1650203028 -0300

Create members file
This is what a commit looks like. In fact, this is literally the commit we have just created.
A commit is just plain text. Like any other content in Git, it was given a hash, compressed, and finally persisted in the object database. A commit object holds metadata, including the author, the committer, the commit date, and the commit message.
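Here is a rough Python sketch of that "hash, compress, persist" step. It assumes the layout Git uses for objects stored as individual files: a short header (the type, a space, the size in bytes, and a NUL byte) followed by the content, all compressed with zlib. The function names are hypothetical, and the commit text is only illustrative: it leaves out the author and committer lines, so its hash will not match the commit from this article.

```python
import hashlib
import zlib
from typing import Tuple


def encode_object(kind: str, content: bytes) -> Tuple[str, bytes]:
    """Return (hash, compressed bytes) for an object of the given kind."""
    header = f"{kind} {len(content)}".encode() + b"\x00"
    raw = header + content
    # The hash is computed over the uncompressed header plus content.
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


def decode_object(compressed: bytes) -> Tuple[str, bytes]:
    """Undo encode_object: return (kind, content)."""
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\x00")
    kind, size = header.decode().split(" ")
    assert int(size) == len(content), "size in header should match content"
    return kind, content


# Illustrative commit text only; not byte-for-byte the commit from the article.
commit_text = (
    b"tree 490296214553ccc9a66cd62ad8b05d56775bf465\n"
    b"\n"
    b"Create members file\n"
)
object_hash, stored = encode_object("commit", commit_text)
kind, content = decode_object(stored)
print(kind, len(object_hash), content == commit_text)
```

Reading an object back is the reverse trip: decompress the bytes, split off the header, and what remains is the plain text that cat-file -p shows.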
But we can see one more thing in the commit: a tree, with its own hash key, is also included.
If we perform the same search exercise that we did with the commit hash key, we find that there is an object in our repository with this same hash key.
$ ls .git/objects
49/  54/  76/  info/  pack/

$ ls .git/objects/49
0296214553ccc9a66cd62ad8b05d56775bf465
This is the object inside the object database representing this tree.
Second object type: The tree
A tree is the object used to store directories in our project. A tree can point to other trees to build a complete hierarchy of files and subdirectories. It can also point to blobs.
Each commit points to a tree object that captures, in one complete snapshot, the state of the repository at the time the commit was performed. This snapshot is the project version we save in our Git history.
Let’s see what a tree looks like:
$ git cat-file -t 490296214553ccc9a66cd62ad8b05d56775bf465
tree

$ git cat-file -p 490296214553ccc9a66cd62ad8b05d56775bf465
100644 blob 76871a05855a2389ddac28d1b167dd48b2226141    Members.txt
A tree object contains one line per file or subdirectory. For each of them, it records permissions, object type, object hash, and filename.
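To see the four fields side by side, here is a small Python sketch that parses the text printed by cat-file -p for a tree. The TreeEntry type and the parse_tree_listing function are hypothetical names. The sketch assumes the hash and the file name are separated by whitespace (a tab, as far as I can tell) and that everything after the hash is the name.

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    mode: str         # permissions, e.g. 100644 for a regular file
    kind: str         # "blob" or "tree"
    object_hash: str  # key of the object this entry points to
    name: str         # file or folder name, stored here and not in the blob


def parse_tree_listing(listing: str) -> list:
    """Parse the text printed by `git cat-file -p <tree>` into entries."""
    entries = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        # Split on the first three runs of whitespace only, so that file
        # names containing spaces stay in one piece.
        mode, kind, object_hash, name = line.split(None, 3)
        entries.append(TreeEntry(mode, kind, object_hash, name))
    return entries


listing = "100644 blob 76871a05855a2389ddac28d1b167dd48b2226141\tMembers.txt\n"
for entry in parse_tree_listing(listing):
    print(entry.kind, entry.name)  # blob Members.txt
```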
Here comes my third big revelation: The file names are controlled by the tree object, not by the files themselves. We’ll see soon why.
So, we have the reference to a blob object inside our tree. If we go back to check our objects database we will find that indeed this third object exists in it.
$ ls .git/objects
49/  54/  76/  info/  pack/

$ ls .git/objects/76
871a05855a2389ddac28d1b167dd48b2226141
Finally, let’s see what this blob object type looks like:
$ git cat-file -t 76871a05855a2389ddac28d1b167dd48b2226141
blob

$ git cat-file -p 76871a05855a2389ddac28d1b167dd48b2226141
Rodrigo
Third object type: The blob
Blob is a contraction of “binary large object” and it holds a file’s data, but does not contain any metadata about the file or even its name. Each version of a file is represented as a blob.
- We have three types of objects in Git: blobs, trees, and commits. Git always generates a key for each of them, a SHA-1 hash, and then persists the compressed content in the repository.
- A commit points to a tree object that captures in a snapshot the entire state of the repository at a given point in time.
- A tree points to blobs, or it can point to other trees to create a hierarchical structure.
- A blob object is the content of a file and nothing more.
Every commit you create as you progress through your project adheres to these rules and saves these objects in the Git object database. This is how Git stores and handles our content.
This is what our first commit snapshot looks like:
We left a pending question, a really important one when we talk about blobs. Why is it that the file name is stored in the tree and not in the blob? The answer to this question is one of the greatest revelations of all when it comes to understanding Git.
Let’s create a new commit to find out why.
The second commit
First, we add a new folder called Tasks/ to the root of the project. Inside it, we create a file for each task, containing the name of the person responsible for that task.
The first task is washing the dishes and I am in charge of doing it 🙂.
# Create the Tasks folder
$ mkdir Tasks
$ cd Tasks

# Create a "Wash the dishes.txt" file with Rodrigo as the person
# in charge of doing it
$ printf "Rodrigo" >> Wash\ the\ dishes.txt
$ git add .
$ git commit -m "Create wash the dishes task"
[master 6522e62] Create wash the dishes task
 1 file changed, 1 insertion(+)
 create mode 100644 Tasks/Wash the dishes.txt

# Go back to the root of the project
$ cd ..
We have just added a new commit and we already know that something was created in our objects database folder. Let’s find out what.
As we did for the first commit, we start by looking at the new commit. This time, we use the --oneline flag to summarize the info of each commit.
$ git log --oneline
6522e62... (HEAD -> master) Create wash the dishes task
54725fc... Create members file

$ git cat-file -t 6522e62ca0294b9288ee188e0a6a0e09f75c4756
commit

$ git cat-file -p 6522e62ca0294b9288ee188e0a6a0e09f75c4756
tree 5adf115c14c967d43674d9b26add84ae6d1a3bd9
parent 54725fc233a67fa1ba4807271b840532e136f624
author Rodrigo Trelles 1653248905 -0300
committer Rodrigo Trelles 1653248905 -0300

Create wash the dishes task
We get the expected content for the second commit, but we also have something new. This commit contains a parent reference. Who is this parent? Well, it is our first commit. You can check it by comparing its id with the first commit id. Except for the first commit, all commits have (at least) one parent commit.
Let’s look into the tree container now:
$ git cat-file -t 5adf115c14c967d43674d9b26add84ae6d1a3bd9
tree

$ git cat-file -p 5adf115c14c967d43674d9b26add84ae6d1a3bd9
100644 blob 76871a05855a2389ddac28d1b167dd48b2226141    Members.txt
040000 tree 860fffa5f80edd65512f9d87fa7f343a67607819    Tasks
This tree contains two object references:
- The same blob reference with the name Members.txt from the previous commit.
- A new tree reference with the name Tasks.
We said before that trees are responsible for building a hierarchy in our project. We added a new folder to our project, so this new tree makes sense.
The blob reference named Members.txt didn’t change, so the content must be the same.
$ git cat-file -t 76871a05855a2389ddac28d1b167dd48b2226141
blob

$ git cat-file -p 76871a05855a2389ddac28d1b167dd48b2226141
Rodrigo
Let’s see the new tree:
$ git cat-file -t 860fffa5f80edd65512f9d87fa7f343a67607819
tree

$ git cat-file -p 860fffa5f80edd65512f9d87fa7f343a67607819
100644 blob 76871a05855a2389ddac28d1b167dd48b2226141    Wash the dishes.txt
As we expected, this tree has the id of the blob object referenced with the name Wash the dishes.txt.
Finally, let’s see the content of this blob.
$ git cat-file -t 76871a05855a2389ddac28d1b167dd48b2226141
blob

$ git cat-file -p 76871a05855a2389ddac28d1b167dd48b2226141
Rodrigo
Indeed, this is our file. However, something just happened here. Did you spot it? The id of the Wash the dishes.txt blob object is equal to the Members.txt blob id.
But these are two different files. Aren’t object ids supposed to be unique? They are not actually repeating themselves. Git is just being efficient. 😉
Git recognizes that both files have exactly the same content, so Git decides to create just one blob and to use the same reference twice, instead of creating two identical objects in the database.
This is more or less how Git thinks:
“Hey, I already have this zipped content in my database, there is no need to save another identical object. I’d better reuse this object and refer both cases to the same blob from the trees.”
Remember our second big revelation. If the value is exactly the same, we will get the same hash key from Git. This is what just happened here.
Note: I used the same colors for the same hash keys throughout the example to make it easier to identify match ups.
Git has a low-level command, hash-object (https://git-scm.com/docs/git-hash-object), which takes some piece of content and returns the hash key for it. We can pass a value to this command by piping it in. To tell the hash-object command to read the piped string, we must also use the --stdin flag.
In a nutshell, we are telling Git: “Here you have a string, give me the hash key for it”.
Both files we created in our example have the content Rodrigo. Let’s give Git the string Rodrigo and see what happens:
$ printf "Rodrigo" | git hash-object --stdin
76871a05855a2389ddac28d1b167dd48b2226141
We get the same hash key. 🤯
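If you are curious about what hash-object does with the string, here is a minimal Python sketch. It assumes the default SHA-1 object format, in which Git hashes a short header ("blob", a space, the size in bytes, and a NUL byte) followed by the content. The function name git_blob_hash is hypothetical.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute the key Git would give a blob with this content."""
    # Git hashes a header ("blob", a space, the size in bytes, a NUL byte)
    # followed by the content itself.
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


members_key = git_blob_hash(b"Rodrigo")  # content of Members.txt
dishes_key = git_blob_hash(b"Rodrigo")   # content of Wash the dishes.txt
print(members_key == dishes_key)         # True: the file name plays no part
print(git_blob_hash(b""))                # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The second line printed is the well-known key of an empty file. Running the sketch on your machine lets you compare the key it computes for the string Rodrigo with the one printed by hash-object.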
Does this mean that if we create another .txt file with the content “Rodrigo”, the new commit will point to the same blob? Yes, exactly.
This is the biggest revelation I had. Git handles the objects created in the database so efficiently that it reuses a blob every time the content is the same. It doesn’t matter if it’s another file created in another folder.
This is the reason why file names are saved in the trees. This decision allows Git to handle references to the same blob with different file names. Git is much smarter than I thought.
This is what the snapshot of the second commit looks like:
Finally, this is the snapshot of both commits sharing the same blob in the objects database.
2 commits, 3 trees, but only 1 blob. I have to be honest with you, this is simply beautiful to me.
Note: Let’s make a stop here. I know this is a lot of information (and hash keys) to process. Have a glass of water before continuing. 😊
References to remember commits in an easier way
Let’s go back to the second commit for a moment.
As we saw, a new entry called parent is inside it. This is the reference to its parent commit.
This reference is what allows us to maintain a historical order of the commits. With these references, Git builds a thread of commits. By knowing the hash key of each commit, we can return to any previous state of our project whenever we want.
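A tiny Python sketch shows how that thread can be followed. The dictionary below is a toy stand-in for the object database that mirrors the two commits of this article, and walk_history is a hypothetical helper. It only follows a single parent per commit, so it does not cover commits with more than one parent.

```python
# Toy history: each commit hash maps to its message and its parent's hash.
# The two entries mirror the commits created in this article.
commits = {
    "6522e62ca0294b9288ee188e0a6a0e09f75c4756": {
        "message": "Create wash the dishes task",
        "parent": "54725fc233a67fa1ba4807271b840532e136f624",
    },
    "54725fc233a67fa1ba4807271b840532e136f624": {
        "message": "Create members file",
        "parent": None,  # the first commit has no parent
    },
}


def walk_history(start_hash: str) -> list:
    """Follow parent links from start_hash back to the first commit."""
    messages = []
    current = start_hash
    while current is not None:
        commit = commits[current]
        messages.append(commit["message"])
        current = commit["parent"]
    return messages


print(walk_history("6522e62ca0294b9288ee188e0a6a0e09f75c4756"))
# ['Create wash the dishes task', 'Create members file']
```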
But a hash is a 40-digit hexadecimal number. Not easy for the human brain to remember. We feel more comfortable remembering words with a logical meaning, not random letters and numbers concatenated.
That’s why Git provides us with friendly references to avoid having to remember long hashes.
Let’s open the refs/ folder inside the .git folder and see what we have here for our example.
$ ls .git/refs
heads/  tags/
Inside the refs/ folder we have heads/ and tags/ subfolders. Let’s open the heads folder first and see what we have here:
$ ls .git/refs/heads
master
This is our master branch. Git set it up as the default branch the moment we initialized the repository.
Let’s create a new branch:
# Create a new branch
$ git checkout -b branchA
Switched to a new branch 'branchA'

$ ls .git/refs/heads/
branchA  master
Branches are references, and all branches are saved inside the heads/ folder inside refs/.
Great, but I want to go one step deeper. What is a branch physically?
Let’s find out.
$ cat .git/refs/heads/master
6522e62ca0294b9288ee188e0a6a0e09f75c4756
A branch is a hash. Did you recognize this hash? It is the hash of the second commit (you can check the colors). A branch is a pointer to a commit with a human-friendly name. Even simpler, a branch is a file with a string inside of it (the commit hash key).
With branches, you can go back and forth in time within our commit history very quickly.
$ cat .git/refs/heads/branchA
6522e62ca0294b9288ee188e0a6a0e09f75c4756
If we read the content of branchA, the hash key is the same because this branch was created from the master branch, and no more commits were created afterward. Both pointers are “pointing” to the same commit.
But how does Git know what the current branch is?
Well, Git stores this information in the HEAD file, at the root of the .git folder.
$ cat .git/HEAD
ref: refs/heads/branchA
Why is the content of this file not also a hash? As we saw before, only objects in the database have hashes. Branches are not objects (they are references), and what HEAD is pointing at is a branch. What HEAD stores is the reference to another reference. In conclusion, HEAD is a pointer to another pointer.
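Following the two pointers by hand can be sketched in Python. The resolve_head function is a hypothetical helper, and the sketch builds a throwaway folder that mimics the files shown above instead of touching a real repository. It assumes every branch lives in its own file under refs/heads/, as in this example; Git can also keep references packed together in a single file, which this sketch ignores.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Return the commit hash that HEAD ultimately points to."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # HEAD names a branch file; that file holds the commit hash.
        ref_path = git_dir / head[len("ref: "):]
        return ref_path.read_text().strip()
    # Otherwise HEAD already holds a commit hash.
    return head


# Build a tiny fake .git folder that mirrors the example above.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/branchA\n")
    (git_dir / "refs" / "heads" / "branchA").write_text(
        "6522e62ca0294b9288ee188e0a6a0e09f75c4756\n"
    )
    resolved = resolve_head(git_dir)

print(resolved)  # 6522e62ca0294b9288ee188e0a6a0e09f75c4756
```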
Let’s create a new commit to see what happens:
$ cd Tasks

# Create a "Laundry.txt" file with Rodrigo as the person
# in charge of doing it
$ printf "Rodrigo" >> Laundry.txt
$ git add .
$ git commit -m "Create laundry task"
[branchA 17c4571] Create laundry task
 1 file changed, 1 insertion(+)
 create mode 100644 Tasks/Laundry.txt

# Go back to the root of the project
$ cd ..
Let’s check the current value of our references:
$ cat .git/refs/heads/master
6522e62ca0294b9288ee188e0a6a0e09f75c4756

$ cat .git/refs/heads/branchA
17c457182627b1a560a68a927a413fdeae1efc9b

$ cat .git/HEAD
ref: refs/heads/branchA
Makes sense, right?
- After the new commit was created, branchA updates its value because now it is pointing to a new commit. branchA is the current branch.
- The HEAD file doesn’t change. It continues to point to branchA.
- The master branch doesn’t change. It keeps the same hash.
Our last reference is the tag. A tag is a special reference used to mark a commit in history. Sounds similar to a branch, right?
The main difference between a tag and a branch is that a tag doesn’t change its value after a commit. A tag is an immutable reference to a specific commit.
Note: tags are mostly used to identify release versions of our code. You can always be sure you will get the same snapshot used for that release.
Let’s create a tag. It will be saved inside the tags/ subfolder inside refs/.
# Create a tag
$ git tag v1.0
$ cat .git/refs/tags/v1.0
17c457182627b1a560a68a927a413fdeae1efc9b
As we expected, this tag v1.0 is just a pointer to a commit (the current one).
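The difference between the three kinds of references can be played out in a toy Python model. Everything here is hypothetical: ToyRefs is not how Git is implemented, the hashes are shortened, and abc1234 is a made-up hash standing in for some later commit.

```python
class ToyRefs:
    """Toy model of branches, tags, and HEAD as plain name-to-hash tables."""

    def __init__(self, branch: str, commit_hash: str):
        self.branches = {branch: commit_hash}
        self.tags = {}
        self.head = branch  # HEAD stores a branch name, not a hash

    def create_branch(self, name: str):
        # A new branch starts at the commit the current branch points to.
        self.branches[name] = self.branches[self.head]
        self.head = name  # mirrors `git checkout -b`

    def tag(self, name: str):
        self.tags[name] = self.branches[self.head]

    def commit(self, new_hash: str):
        # Only the current branch moves; tags and other branches stay put.
        self.branches[self.head] = new_hash


refs = ToyRefs("master", "6522e62")
refs.create_branch("branchA")
refs.commit("17c4571")
refs.tag("v1.0")
refs.commit("abc1234")  # made-up hash standing in for a later commit
print(refs.branches, refs.tags, refs.head)
```

After the last commit, branchA has moved on, while master and the v1.0 tag still hold the hashes they had before.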
When this project has grown a lot and branchA is already much further along in time, we can always quickly return to the state of our project in this commit using the v1.0 tag.
Let’s summarize the 5 most important points of this article:
- Git stores three types of objects in its database: commits, trees, and blobs. Every object in the Git database has a hash key associated with it.
- By linking commits, trees, and blobs and using their hash keys as pointers, Git builds our project’s data hierarchy efficiently and without duplicating content.
- Hash keys are not easy for the human mind to remember, and that is why Git provides a friendlier way of remembering hashes: references. We have seen branches, HEAD, and tags.
- A branch is a pointer to a commit. The default branch name in Git is master, but you can create as many branches as you want. The HEAD reference tells us which branch we are currently working on. Every time you create a new commit, the branch pointer moves forward automatically.
- Lastly, the tag reference is immutable and is always linked to the same commit, so that we can always remember a moment in time.
I’m thinking of writing a second article, explaining the states of the Git workflow and seeing how each command we run affects it. But for now, it’s enough. It’s been a long road. 😇
I don’t expect you to believe me just because you read this blog.
In fact, I encourage you to create a repository and confirm for yourself that this is all true. Even more, you can create different repositories on different machines and compare the hashes created for exactly the same content.
Git is a super powerful tool that we use a lot, and understanding it will allow you to understand more about what you do on a daily basis. And if you ask me, there is nothing more beautiful than understanding why things happen.