Using git without feeling stupid (part 1)


More and more projects are switching over to git or other distributed VCS. Even projects using centralized servers are doing so: even if your project doesn't have a network of developers each with their own repository, distributed VCS offer a very nice set of additional features. For example, almost every operation is available offline, and as a consequence, not relying on a network connection makes the system much faster even when you are online. Also, the possibility to quickly create and throw away branches makes it easier to experiment. Of course, not every distributed VCS enjoys all these advantages. The best and most widespread distributed VCS nowadays are git and mercurial (hg); this is a great step forward from a couple of years ago, when most systems had serious scalability problems and a much smaller feature set.

I switched to a distributed VCS for GNU Smalltalk three years ago, and chose arch at the time. Since I work on GNU Smalltalk mostly on my commutes, having the possibility to commit offline was enough of a boon to bear the huge time to do a single commit (1 minute) and the huge time to synchronize upstream (10 seconds per commit, at least). But since better tools are now available, after finishing the 3.0 release I took the opportunity to switch to git.

I had already switched all my "local" projects to git a while before, and had not regretted it, so I was already pretty comfortable with how the system worked. I didn't find git that hard to use, especially after they rewrote the way you manage remote repositories in recent versions (1.5.3 or newer). However, I had big problems finding a tutorial that teaches you git with a relatively gentle learning curve.

It’s not hard to get started with git if you start from a single principle: a git working tree also hosts a full-fledged repository. Copy the working tree, and you have actually cloned the repository. The source of the copy can be local (à la cp -R) or remote (à la rsync). You can try this now; if you already have git installed, you can get the latest development version via git itself:

git clone git://git.kernel.org/pub/scm/git/git.git
cd git
git log

The last command will show the whole history of git development without any need to access the network.
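
If you want to convince yourself that copying a working tree really does clone the repository, here is a minimal sketch. It assumes a POSIX shell with git installed; the directory names, the README file and the "Example User" identity are made up for illustration.

```shell
set -e
# Hypothetical identity, so that the commit works on a machine with no git config.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

demo=$(mktemp -d)        # scratch directory for the experiment
cd "$demo"

# Create a tiny repository with a single commit.
mkdir original
cd original
git init -q
echo "hello" > README
git add README
git commit -q -m "First commit"
cd ..

# A plain recursive copy carries the .git directory along...
cp -R original copy

# ...so the copy knows the full history, with no network involved.
original_log=$(cd original && git log --format=%H)
copy_log=$(cd copy && git log --format=%H)
echo "original: $original_log"
echo "copy:     $copy_log"
```

Both lines should print the same commit id. In practice you would still use git clone, which also records where the copy came from.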

Actually, git is not the first version control system to store metadata side-by-side with the working tree. Ancient systems like SCCS or RCS did the same! So, basic usage of git (without branches and with a single developer) is probably more similar to RCS than to anything else!

Of course you don't have locks, you have atomic commits as in subversion, and so on, so the similarity does not last long. But a major point is that using git in this scenario is probably even easier than using CVS or Subversion, and it gives you a way to learn the following basic ideas:

  • git init does not mean "I want to store versioning data here", but rather "I want to store versioning information for the files that are already here".
  • you can use git add and git rm as you do in CVS, but git add won't fail if the file is already under version control; for now, refrain from running it on such files.
  • a plain git commit will only record the additions and removals that you marked with git add or git rm. To commit modifications as well, you have to specify the files with git commit FILE1 FILE2..., or invoke git commit -a to commit all modified files as in CVS.
  • as in CVS, you can also inspect the version history and review changes with git diff and git log.
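
The points above can be tried out in a scratch directory. This is only a sketch of a single-developer session, assuming a POSIX shell; the file names and the identity are invented for the example.

```shell
set -e
# Hypothetical identity, so that commits work without any prior git config.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

project=$(mktemp -d)
cd "$project"
echo "first draft" > notes.txt      # the file is 'already' here...

git init -q                         # ...and git init starts versioning in place
git add notes.txt                   # mark the new file
git commit -q -m "Import existing notes"

echo "second line" >> notes.txt
git --no-pager diff                 # review the uncommitted change
git commit -q -a -m "Extend notes"  # -a: commit all modified files

echo "scratch" > old.txt
git add old.txt
git commit -q -m "Add old.txt"
git rm -q old.txt                   # mark the removal (also deletes the file)
git commit -q -m "Remove old.txt"

git --no-pager log --format='%h %s' # inspect the history
commit_count=$(git rev-list --count HEAD)
echo "commits so far: $commit_count"
```

Nothing here talks to a server, which is why this stage feels so close to RCS.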

Now, let's add a server to the picture. Concurrent development was the biggest innovation of CVS, and we can think of git as a different offspring of RCS which took a radically different approach to concurrent development. CVS (and subversion) completely centralized the repository: they keep all the revisions on the server, so that all the operations require a connection to this server. Committing something (cvs ci) writes a new revision to the server, and there is a command to fetch a batch of updates from the server (cvs up). In git, committing something writes it locally (as in RCS's ci), and you have two separate commands: one to send a batch of updates to the server (git push), and one to fetch a batch of updates from it (git pull).

From this small difference, entirely different workflows arise. It is premature to explain them now, however. Let's look at a typical CVS workflow:

cvs -d PATH co DIR
cvs update
... work work work ...
cvs update
... fix conflicts ...
cvs ci

A 1:1 mapping in git looks like this (this uses new features from version 1.5.4; I suggest you fetch bleeding-edge sources with the git clone command above, and then compile with make && make install):

git clone PATH DIR
git pull --rebase
... work work work ...
git pull --rebase
... fix conflicts ...
git push

There's an interesting point that is not clear from the simple scheme above. Committing (with git commit -a as in the single user case) happens during the work, not after. This is true in any distributed VCS, but cleaner designs (as in git and hg) make it extremely natural for the developer. It is a boon, because it makes it easier to revert mistakes, to review changes, to establish milestones. Overall, it makes your job easier.
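
To make the "commit during the work" point concrete, here is a sketch of the whole round trip run against a throwaway local repository that stands in for the server. Everything in it (the server.git, alice and bob directories, the file names, the identity) is hypothetical, and I use git push origin HEAD rather than a bare git push only so that the script does not depend on your push configuration.

```shell
set -e
# Hypothetical identity, so that commits work without any prior git config.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

root=$(mktemp -d)
cd "$root"
git init -q --bare server.git              # stands in for the central server

# Seed the server with one commit from a first working copy.
git clone -q server.git alice 2>/dev/null  # hides the "empty repository" warning
(cd alice && echo "hello" > README && git add README &&
    git commit -q -m "Initial commit" && git push -q origin HEAD)

# The CVS-like round trip, seen from a second developer.
git clone -q server.git bob                # cvs -d PATH co DIR
cd bob
echo "step 1" > work.txt
git add work.txt
git commit -q -m "Start work"              # commit while you work...
echo "step 2" >> work.txt
git commit -q -a -m "Continue work"        # ...as often as you like

# Meanwhile, someone else publishes a change.
(cd ../alice && echo "more" >> README &&
    git commit -q -a -m "Update README" && git push -q origin HEAD)

git pull -q --rebase                       # cvs update
git push -q origin HEAD                    # cvs ci
cd ..

server_commits=$(cd server.git && git rev-list --count HEAD)
echo "commits on the server: $server_commits"
```

The server ends up with four commits: the two from the first working copy and the two local ones, which only became public at the final push.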

Like cvs up, git pull will bring in changes from the remote repository and put them in the current repository. That single command, git pull --rebase, hides quite a lot of things that git does. It fetches from the remote server, and it reapplies the user's commits one by one (letting the user fix conflicts) on top of the remote server's trunk. At the end, the user sees that his history has changed from this:

                A---B---C  (user branch)
               /
          D---E---F---G  (server branch)

to this:

                        A'--B'--C'  (user branch)
                       /
          D---E---F---G  (server branch)

This change makes sure that the user's changes are up-to-date with the latest changes on the server. A', B' and C' represent the same changes as A, B, C; however, git considers them different because the former are based off G, while the latter are based off E. Commits A, B and C have disappeared; this is not a problem because they were never made public, i.e., pushed to the server. (Note that while locally you can be cavalier and "rewrite past history", the history on a centralized server should move rigorously forward, as CVS and subversion force you to do.)
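
You can watch a commit turn into its primed counterpart with a small experiment. This is a sketch under the same assumptions as before: a local bare repository plays the server, and the directory names, file names and identity are invented. It uses a single local commit A instead of three to keep things short.

```shell
set -e
# Hypothetical identity, so that commits work without any prior git config.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

root=$(mktemp -d)
cd "$root"
git init -q --bare server.git

# The server branch reaches commit E.
git clone -q server.git upstream 2>/dev/null
(cd upstream && echo e > file_e && git add file_e &&
    git commit -q -m "E" && git push -q origin HEAD)

# The user clones at E and commits A locally.
git clone -q server.git user
(cd user && echo a > file_a && git add file_a && git commit -q -m "A")
before=$(cd user && git rev-parse HEAD)

# The server branch moves on to G.
(cd upstream && echo g > file_g && git add file_g &&
    git commit -q -m "G" && git push -q origin HEAD)
server_tip=$(cd upstream && git rev-parse HEAD)

# Rebasing replays A on top of G, producing A'.
(cd user && git pull -q --rebase)
after=$(cd user && git rev-parse HEAD)
parent=$(cd user && git rev-parse HEAD^)

echo "A            = $before"
echo "A'           = $after"
echo "parent of A' = $parent"
echo "server tip G = $server_tip"
```

The ids of A and A' differ even though the change is the same, and the parent of A' is the server's latest commit.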

In this installment, I showed how basic usage of git does not need any concept that is unique to a particular version control system. To some extent, this is true until you have to deal with conflicts. In the next installment, I'll talk about conflicts.

Your 'latter and 'former' need switching in the penultimate paragraph.

...this site doesn't have a background-color: value; on the <body/> element!

... how is that marginally stupid?

Because some people set a default background of black. If you set a text color you should set a background color, or else you are black on black on some systems.

Good tutorial. You guys might find this series of tutorials complementary to this one; if anyone's interested, it can be found here.

(I am also trying to contact you via email.)
Which license governs the blog posts? It is not clear, and I was told that if it is not clear, it must be All Rights Reserved. In particular, I am interested in posting sample code from a previous post on a GNU FDL licensed site, as I explained to you by mail, in case you received it.

All on my blog is CC-BY.

I've recently started playing with Git and I was really happy when I found this article. Up until I read your explanations I really did feel like an idiot; you have done a great job of making git approachable, where most other articles are more expert-friendly.

I find it funny that you even mention CVS. Who uses that? You must have switched to git pretty early to use CVS as the comparison, most people would be using SVN now.

Helpful guide in any case, thanks.

Excellent gentle introduction. The other tutorials I looked at were not nearly as thorough. Much thanks!

Just a minor update to the 1st step

git clone....
cd git
git log

You'll need to cd into the repository.
Then it all flows nicely.

fixed, thanks.

Really useful, thanks.
One question though: how can I find out whether a specific file is in the git repository or not?
Thanks,
Andrei

"git ls-files" or "git ls-files FILE" (lists the file name if under version control, or nothing otherwise) seem to do exactly what you're looking for!

You can use git cat-file -t HEAD:file. If it gives an error, the file is not in the repository. Don't forget the HEAD: part.

You could alias git cat-file -t to git test, for example (see part 2).

Paolo

There are better ways than to suggest "cat-file", which is really not meant for humans, but scripts.

My preference would be "git show HEAD:file", because it makes it easy to see what it does: look into a revision, for a specific file.

"git show" is really what you want to use a lot of the time when looking at the repository's objects themselves.
