
Version control is the practice of keeping track of the state of a project at specific points in time. It is traditionally used in software development, where many small changes can interact in complex ways and it is important to know what changes were made, and when, in case they caused errors. I have used version control for my work ever since the first time, as a graduate student, that I broke one of my numerical codes while adding a feature. I really wanted to just go back to the way it was before I broke it; instead I had to spend two days discovering exactly what I had done and undoing it. I've been using Subversion for five years now and I have been quite happy with it. With the emergence of Distributed Version Control Systems (DVCS), I saw an opportunity to change my version control software and gain features that I had long desired.

Disclaimer: I am an academic. I work at a college. I teach classes. I do some research. I have an advanced degree. I do not do professional software work. I might be a geek, but that's not why I earn the big bucks. Your mileage may vary.

My Subversion repository stored all of my professional work together in one place. The primary benefits were that I could:

synchronize projects between multiple computers (very important for itinerant graduate students),
get to my work from anywhere (ditto),
restore known working versions (the initial motivation for a VCS),
and easily re-use work that I had done previously.

Note that because I am a one-man band I made minimal use of tags, branching or merging. I did occasionally branch a project when I chose to take the same code in two directions, but due to Subversion's immature support for tracking merges, I never merged a bug fix from one branch to another. I did make heavy use of the centralized server to synchronize my work on various computers and to allow me to fetch my work from anywhere that had an Internet connection. (I initially served my repository over HTTP, then later over SVN+SSH when I changed institutions to one with a more restrictive network infrastructure.)

My Subversion organization

Since I was one person and interested in reusing stuff from one project in others, I put all of my version controlled work in one Subversion repository whose top-level eventually looked like this:
branches/ career/ doc/ empty/ gfd2003/ home/ latex/ notebooks/ numerics/ papers/ software/ songs/ tags/ teaching/ vendor/ wacsg/ www/
You can see I had top-level directories for the different kinds of work that I did. The doc/ and numerics/ directories contained most of my professional output: papers, presentations and various simulation software. Some unrelated stuff also accreted there, such as wacsg/ (a website I was developing for someone else) and career/ (documents about jobs, etc.). Also of interest to me was the papers/ directory, which I used as a library of PDFs of papers that I had read in my research, so that they were available in a central location.

When I worked with this repository, I would check out a subdirectory, say numerics/pvsim/ to my computer at ~/Research/pvsim and then work on it there, committing regularly when I had something to save, or I wanted to move my most recent work to another computer. I never checked out the entire root of the repository (nor would I want to, because its organization does not fit the organization that I use for files on my computer) and I rarely checked out even an entire first-level directory (e.g. notebooks/) with the exception of teaching/, software/ and www/ which mirrored actual directories on my computer(s).

Why DVCS?

With the above advantages of my Subversion setup, and also my experience working within its model and limitations, why would I think of switching? One reason is that I am an inveterate switcher. I tinker constantly with things that already work very, very well. It is one of my personal crosses to bear. The other is that Subversion lacked (and will always lack) one feature that I would like to have. I would like to be able to work on my laptop while I travel and be able to commit to my repository without having internet access. I often work on version-controlled projects (such as presentation slides written in LaTeX/beamer) while I am travelling and I don't get paid well enough (see above disclaimer) to feel I can afford to pay for connectivity while I travel. This limits me to possibly being able to commit my day's work from my hotel room at the end of the day if the WiFi is free. (SVK is one solution for this problem, but not one that I would want to use.)

Which DVCS?

The three front-runners in modern DVCS are Bazaar, Mercurial and Git. Git was written by Linus Torvalds to host the development of the Linux kernel and its pedigree shows. It also (as befits a Torvalds creation) has a very clear, no-nonsense model of version control. It is written in C and is reportedly quite snappy and useful for large projects (e.g. the kernel). Mercurial and Bazaar are similar DVCS projects written in Python that even have very similar commands. All three allow disconnected work of the type that I wish to do on my laptop: I make changes all day long, commit them locally, and then share those changes with my central repository when connectivity is present.

So how did I choose between the three? Since DVCS is still an emerging arena, it was important to me that I choose a product with a stable future. If changing away from Subversion was going to be a big process, I hardly wanted to go through the process twice when the tool I selected this year went belly-up next year. The other thing that I value in the software that I use is a responsive, lively developer community. I am not a developer of software, but what I use, I usually use hard, report bugs and expect responses. So, as part of my selection process, I followed the mailing lists for Mercurial and Bazaar for a couple of weeks (okay, months) to see what felt like the right fit for me. I want to re-iterate that all three of these packages had the same features and that what finally decided me was the community behind the software. (This was pre-miniscent of an essay by Ian Clatworthy on important features of a VCS including #3 Community).

I decided on Bazaar for the following reasons:

It is written in Python, a language I use and understand (thus I may someday be able to contribute, however slightly)
It explicitly supports my workflow: a central repository with local checkouts, occasionally disconnected.
The developer community is very responsive, supportive and involved. (Some of those guys must be getting paid to do this.)
The project is continually evolving and refining itself. They have recently added things like Shared Repositories, heavy- and lightweight checkouts, new storage formats and tags. (Mercurial in particular seems not to be adding many new features. I could be mistaken about this.)
There is a native plugin for interoperating with Subversion: bzr-svn

Converting to Bazaar

Having settled on a product, I needed to convert my work and all of its history from Subversion to Bazaar. I wanted to rearrange the structure of my repository anyway, and I would need to rearrange it to fit in with the Bazaar model. One feature that most VCSs other than Subversion and CVS lack is the ability to check out only a portion of the directory structure and work there. As I mentioned above, I used this feature heavily in my workflow. In Bazaar, you can only check out an entire branch. Since I never want to check out all of my numerical simulation code at once, I needed to split my Subversion repository into multiple Bazaar branches. This turned out to be harder than it looks.

First try: bzr-svn

At first, I tried to just run bzr branch $REPO/numerics/pvsim and count on the bzr-svn plugin to do the work. After plenty of work getting the plugin installed and working on my Macs (funny, under Ubuntu it took one apt-get), I was ready to try the restructuring. The branch command failed, though, because bzr-svn will only work on the top level of a repository (or on one in the trunk-tags-branches structure).

Second try: svndumpfilter

Next, I figured that I could use the svndumpfilter program that came with Subversion to chop my Subversion repository up into pieces (by including only one subdirectory and excluding all others) and then use bzr-svn on each piece. This did not work either, because svndumpfilter (fundamentally) cannot deal with files that are copied between included and excluded paths. This happened a lot in my repository as I said "I was using that LaTeX style file in that project, I'll just svn cp it over here and use it again." The reuse of work that was an important part of Version Control in the first place was making my repository restructuring much more complicated.
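To see how tangled a repository is before trying to split it, one can scan the dump file for copies that cross the filter boundary. The following is a minimal sketch, not part of svndumpfilter or of any of the tools mentioned here. It assumes the usual dump-file node headers (Node-path and Node-copyfrom-path), reads the dump as plain text lines, and ignores the fact that file contents inside a real dump could happen to contain lines that look like headers. The file names in the sample are made up for illustration.

```python
def find_crossing_copies(dump_lines, include_prefix):
    """Return (destination, source) pairs for copies whose destination is
    inside include_prefix but whose source lies outside it."""
    wanted = [part for part in include_prefix.split("/") if part]

    def inside(path):
        # Compare whole path components, so "numerics/pvsim2" is not
        # mistaken for something inside "numerics/pvsim".
        parts = [part for part in path.split("/") if part]
        return parts[:len(wanted)] == wanted

    crossings = []
    node_path = None
    for raw in dump_lines:
        line = raw.rstrip("\r\n")
        if line.startswith("Node-path: "):
            node_path = line[len("Node-path: "):]
        elif line.startswith("Node-copyfrom-path: ") and node_path is not None:
            source = line[len("Node-copyfrom-path: "):]
            if inside(node_path) and not inside(source):
                crossings.append((node_path, source))
    return crossings


# Hypothetical dump fragment: one copy from outside, one from inside.
SAMPLE = """\
Node-path: numerics/pvsim/style.sty
Node-kind: file
Node-action: add
Node-copyfrom-rev: 12
Node-copyfrom-path: doc/talk/style.sty

Node-path: numerics/pvsim/solver.py
Node-kind: file
Node-action: add
Node-copyfrom-rev: 20
Node-copyfrom-path: numerics/pvsim/old_solver.py
"""

for destination, source in find_crossing_copies(SAMPLE.splitlines(), "numerics/pvsim"):
    print(f"{destination} was copied from excluded path {source}")
```

Only the first copy is reported, because its source sits outside numerics/pvsim. A repository with many such copies is one where a plain path filter will struggle.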

Third time's a charm

I am not the only one who has noticed this limitation of svndumpfilter. There is an svndumpfilter2 that handles copies between included and excluded paths and an svndumpfilter3 that does the same thing in a much slower but more memory-efficient way. Unfortunately, neither tool did what I wanted. In particular, both only allow filtering on top-level directories (because that is the easiest case). But I wanted to split out the subdirectories of numerics/ and doc/ into separate branches so I could check them out individually. So, I rewrote part of svndumpfilter2 to produce the predictably named svndumpfilter4, which can filter paths of any depth.
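The difference between filtering on top-level directories and filtering at any depth comes down to how a node's path is compared with the requested path. The sketch below illustrates the idea only; it is not the code from svndumpfilter2 or svndumpfilter4, and both function names are invented for this example.

```python
def split_path(path):
    """Split a repository path into its components, ignoring stray slashes."""
    return [part for part in path.split("/") if part]


def matches_top_level(path, include):
    """Top-level-only filtering: just the first component is compared."""
    parts = split_path(path)
    return bool(parts) and parts[0] == include


def matches_any_depth(path, include):
    """Any-depth filtering: every component of `include` must match the
    leading components of `path`."""
    wanted = split_path(include)
    return split_path(path)[:len(wanted)] == wanted


# A top-level filter can select all of numerics/ but nothing narrower.
print(matches_top_level("numerics/pvsim/src/main.f90", "numerics"))
print(matches_top_level("numerics/pvsim/src/main.f90", "numerics/pvsim"))

# An any-depth filter can pick out a single project under numerics/.
print(matches_any_depth("numerics/pvsim/src/main.f90", "numerics/pvsim"))
print(matches_any_depth("numerics/pvsim2/src/main.f90", "numerics/pvsim"))
```

Comparing whole components rather than raw string prefixes keeps a sibling directory such as numerics/pvsim2 (a made-up name) from being swept in along with numerics/pvsim.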

Repository Conversion

So now I could convert my Subversion repository into an equivalent collection of Bazaar branches. Here is an example of the process for one piece of the repository. I started with a dump of my Subversion repository (located at ./repo) in the file svn-1444-dump:

1. Filter the path from the original repository
   svndumpfilter4 --untangle=./repo numerics/pvsim < svn-1444-dump > svn-1444-dump-pvsim
2. Load the filtered repository into its own repository
   svnadmin create repo-pvsim
   svnadmin load --ignore-uuid repo-pvsim < svn-1444-dump-pvsim
3. Branch the filtered repository using bzr-svn
   bzr branch file://`pwd`/repo-pvsim bzr-pvsim
4. Push the bzr repository to its central location
   bzr branch bzr-pvsim bzr+ssh://central-server/repos/research/proj/pvsim

In step 2, the --ignore-uuid flag to svnadmin load is important if you would like to load your bzr branches into a shared repository (since bzr-svn uses the uuid to generate the ID for each file, and your branches all contain a numerics directory).
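Since the same steps have to be repeated for every project that gets split out, it may help to generate the commands rather than retype them. The following is a rough sketch that only prints the command lines from the process above for a given subdirectory; it does not run anything. The function name and its default arguments are assumptions made for this example, the naming scheme simply mirrors the pvsim example, and the paths are assumed to contain no spaces or other shell-special characters.

```python
def conversion_plan(subpath,
                    dump_file="svn-1444-dump",
                    source_repo="./repo",
                    central="bzr+ssh://central-server/repos/research/proj"):
    """Return the shell commands that convert one subdirectory of the
    Subversion repository into a Bazaar branch."""
    name = subpath.rstrip("/").split("/")[-1]   # e.g. "pvsim"
    filtered = f"{dump_file}-{name}"            # filtered dump file
    svn_repo = f"repo-{name}"                   # temporary Subversion repository
    bzr_branch = f"bzr-{name}"                  # local Bazaar branch
    return [
        # Step 1: filter the path out of the full dump.
        f"svndumpfilter4 --untangle={source_repo} {subpath} < {dump_file} > {filtered}",
        # Step 2: load the filtered dump into its own repository.
        f"svnadmin create {svn_repo}",
        f"svnadmin load --ignore-uuid {svn_repo} < {filtered}",
        # Step 3: branch the filtered repository using bzr-svn.
        f'bzr branch "file://$(pwd)/{svn_repo}" {bzr_branch}',
        # Step 4: push the result to its central location.
        f"bzr branch {bzr_branch} {central}/{name}",
    ]


for command in conversion_plan("numerics/pvsim"):
    print(command)
```

Printing the plan first makes it easy to check the names before pasting the commands into a shell, which matters because a mistake in the later steps can mean rebuilding a branch from scratch.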

I found that working with shared repositories was harder than standalone branches, so I used them sparingly in my new setup. In particular, if you make a mistake and corrupt the index in a shared repository, you have to blow away the entire thing and rebuild all of the branches that it contains. In addition, I never found out the correct way to remove a branch from a shared repository. Just deleting the branch using rm -rf repos/branch1 does not delete the revisions that are stored in repos/.bzr/....

The Happy Ending

Now, I have a collection of bzr branches that looks like this

 
career
research/
  proj/
    pvsim
    QGmodel
    ...
  doc/
    paper/
      physfluids
      ...
    talk/
      snowbird
      maa_iowa
      ...
software
teaching
www

where research/, proj/ and doc/ are regular subdirectories and paper/ and talk/ are the only shared repositories in my structure (because this is where I am likely to clone branches to make new talks and papers).

I have gained the ability to work offline with heavyweight checkouts (for my laptop). I can still use lightweight checkouts, just like I did under SVN, on my office workstation, which always has a network connection. I have lost the historical connections between work shared among different projects, but those connections are not so important to me: my projects diverge from one another so rapidly that I don't need to propagate bug fixes from one project to another. The conversion has been a long process, and I am grateful to the Bazaar community for making me want to carry it through.


An Academic Gets Bazaar