In our company, most days of the week start with a two-hour meeting. We do code reviews, design product features and discuss new solutions. The only unusual thing about the meetings is that their participants are usually separated by nine time zones.
When we started Reliable Software in 1997, we took a big risk. We had no choice but to experiment with distributed development. I lived in Seattle, WA, and my partner in Gdansk, Poland. We decided to try the newly introduced Internet voice phone technology to communicate. In the beginning there were days when we couldn't even establish a connection. When we did, it was like talking on a walkie-talkie to an astronaut on the Moon (actually, the delay in communications between the Earth and the Moon is shorter). Over the years, Internet technology improved: we started using full-duplex connections (we no longer needed to say "over" after every statement) and DSL on both ends. We tried many Internet phone products, but we finally settled on Microsoft NetMeeting. Besides providing a decent voice connection, it lets you share a whiteboard as well as any application running on your computer.
Being able to talk is essential for collaboration, but it's not enough for software development. We had to be able to share source code and synchronize our changes. We needed a version control system (VCS). Since there wasn't any that worked over such distances, we decided to create one from scratch. We named it Code Co-op.
The basic functionality of a VCS is to let many developers work on the same project. Each developer may check some files out, edit them, and check them back in. Every check-in makes source changes available to other developers.
There were several difficulties we had to overcome in order to build a distributed VCS. First of all, we didn't have a central server to store project files. We had to have a copy of the whole project on each developer's machine. A back-of-the-envelope calculation showed that this is indeed a reasonable solution. Not only are disks getting bigger, but we've also found that the files created by most compilers during the build process require far more space than the sources themselves.
The second difficulty was the exchange of data between distributed developers. Normally, during a check-in, a developer is connected to the server. He or she gets exclusive access to the project repository and transmits the differential information describing the changes. Similarly, in order to synchronize the project, a developer connects to the server and downloads the differences between his version and the most current version. Obviously, such a scheme requires that every developer have fast access to the server. On a local area network it's a reasonable assumption. On the Internet, especially at such distances, this scheme would create a severe bottleneck. Besides, if we were to use a server, every time the connection to the server broke down, developers would be locked out of the project.
We decided to replicate the repository on every machine. Keeping these repositories in synch over a slow and unreliable transport was our next challenge. After considering various options, we decided to use e-mail to synchronize our repositories. Because of its slowness and unreliability, we had to design a very robust distributed protocol. Since synchronization scripts could arrive out of order or get lost in transport, we had to come up with a script numbering system and a pecking order between users to resolve conflicts.
This system works very well. It detects gaps in script order and lets the user wait for a missing script (if the script is lost, it's possible to re-send it). The only drawback is that, when two scripts miss each other, one of them has to be temporarily rejected. We made the rejection of a script as painless as possible. The author of the rejected script learns about it as soon as the winning script arrives at his machine. At that point his check-in is reversed and the files involved in the change are checked out, ready to be checked back in. No changes are lost.
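The numbering scheme described above can be sketched roughly as follows. Everything here is illustrative: the class and method names are invented, and the pecking order (alphabetically earlier author wins) stands in for whatever rule Code Co-op actually uses.

```python
class Repository:
    """Toy model of one replicated repository receiving numbered sync scripts."""

    def __init__(self):
        self.next_ordinal = 1   # ordinal of the next script we can apply
        self.held = {}          # scripts that arrived early, keyed by ordinal
        self.applied = {}       # ordinal -> author of the script we applied

    def receive(self, ordinal, author):
        if ordinal in self.applied:
            # Two check-ins raced for the same slot.  The pecking order picks
            # a winner; the loser's check-in is reversed on its author's
            # machine and must be checked in again on top of the winner.
            winner = min(author, self.applied[ordinal])
            loser = max(author, self.applied[ordinal])
            self.applied[ordinal] = winner
            return f'rejected {loser}'
        if ordinal > self.next_ordinal:
            # Gap detected: an earlier script is still in transit (or was
            # lost and must be re-sent).  Hold this one until the gap closes.
            self.held[ordinal] = author
            return 'held'
        self._apply(ordinal, author)
        # Applying one script may close the gap in front of held ones.
        while self.next_ordinal in self.held:
            self._apply(self.next_ordinal, self.held.pop(self.next_ordinal))
        return 'applied'

    def _apply(self, ordinal, author):
        self.applied[ordinal] = author
        self.next_ordinal = ordinal + 1
```

If scripts 1 and 3 arrive but 2 is still in transit, script 3 is simply held; once 2 shows up, both are applied in order. A second script claiming an already filled slot takes the conflict path instead.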
Next we had to design a database to store the state of the project on each machine. In order to make a reliable VCS, we decided to transact every operation. For instance, when somebody is checking in his changes, the system first prepares a new state on the side. If anything goes wrong (a memory allocation fails, a file is missing, or even a bug surfaces in the program), the whole new state is discarded and the system goes back to its previous consistent state.
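A minimal sketch of this prepare-aside, commit-or-discard discipline (the class is hypothetical; the real system transacts files and persistent structures, not an in-memory dictionary):

```python
import copy

class TransactedDatabase:
    """Toy transacted store: every change is prepared on a copy 'on the side'
    and becomes visible only when the operation completes successfully."""

    def __init__(self):
        self._state = {}    # the last committed, consistent state

    def run(self, operation):
        # Prepare the new state on the side, never touching the original.
        scratch = copy.deepcopy(self._state)
        try:
            operation(scratch)      # may fail at any point, even mid-way
        except Exception:
            return False            # discard scratch; committed state intact
        self._state = scratch       # success: switch to the new state
        return True

    def snapshot(self):
        return copy.deepcopy(self._state)
```

An operation that dies halfway through, whether from a failed allocation or an outright bug, leaves the committed state exactly as it was.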
Since we have always used the most current build of Code Co-op in our own development, on several occasions we discovered fresh bugs in mid-transaction. After fixing the bugs, we could repeat the transaction without any loss of database integrity. Even when running Code Co-op under a debugger, we could stop the execution and exit in the middle of a transaction with no danger of losing data.
We were so confident in our transactions that at some point we implemented a new feature by using the ability to abort a transaction in progress. First we implemented a feature that allowed Code Co-op to restore old versions of the project. Of course, the whole operation was transacted. Next we needed a way to view old versions of files without restoring them permanently. We implemented it by reusing the code that restored old versions, except that this time we didn't commit the transaction. We used the restored files for viewing, but since the transaction was aborted, no change to the state of the project took place.
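The view-by-aborting trick can be illustrated with a tiny model. The names are invented, and the `commit=False` flag stands in for deliberately abandoning the transaction:

```python
import copy

class Project:
    """Toy project with a version history, restored under a transaction."""

    def __init__(self, files, history):
        self.files = files          # current state of the project
        self.history = history      # version number -> full file snapshot

    def restore(self, version, commit=True):
        # The same restore code serves both purposes: prepare the old
        # version on the side, as any transacted operation would.
        scratch = copy.deepcopy(self.history[version])
        if commit:
            self.files = scratch    # commit: the project really goes back
        # With commit=False the transaction is abandoned; the caller may
        # view the restored files, but the project itself is untouched.
        return scratch
```

Calling `restore(v, commit=False)` hands back an old snapshot for viewing while leaving the working files alone; the same call with the default `commit=True` actually rolls the project back.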
Over the years of using and developing Code Co-op, we added a lot of features and streamlined the user interface. Code Co-op can now work not only over e-mail but also on a local network. The dispatcher, a separate component of Code Co-op, takes care of the transport. It has a small database of users and re-routes synchronization scripts as necessary. The dispatcher can send a script by e-mail to one participant and, at the same time, copy it over the network to another. It uses the Simple MAPI interface to send and receive scripts by e-mail, and it copies scripts between machines on the network using shared directories. The system is very easy to configure, and it works dependably due to its simplicity.
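The dispatcher's routing can be sketched as a table mapping each member to a transport. Transports are plain callables in this sketch; the real dispatcher speaks Simple MAPI for e-mail and copies files into shared directories.

```python
class Dispatcher:
    """Toy dispatcher: fans one synchronization script out to every other
    project member over whichever transport that member is configured for."""

    def __init__(self):
        self.routes = {}    # member name -> transport callable

    def add_member(self, name, transport):
        self.routes[name] = transport

    def dispatch(self, script, sender):
        deliveries = []
        for member, transport in self.routes.items():
            if member != sender:    # don't echo the script back to its author
                deliveries.append(transport(member, script))
        return deliveries
```

One member can be reached by e-mail and another by a network copy within the same dispatch call, which is the mixed-transport setup described above.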
Reliable Software added two more developers in Poland, each in a different town, plus one developer in Sweden doing the port to Linux. All five of us can simultaneously work on the same software project and keep in synch with each other. We are aware that ten years ago this kind of collaboration would have been impossible. Today the Internet is almost everywhere. Programmers in Poland, Sweden, Russia, or India can browse the web, use e-mail, and talk to each other using Internet telephony. We hope that many other distributed software companies will spring into existence in the coming years. Such a company has very low overhead, and its developers can conveniently work from their own homes on flexible schedules.
It used to be that subcontracting software was risky because of very limited possibilities of supervision and minimal feedback. A subcontractor would be given, for instance, half a year to write a program according to some specification. We all know how difficult it is to provide a detailed software specification and how constraining such a specification can be when unexpected problems are discovered in the course of development. Without a tight feedback loop, the results of such collaboration were rarely satisfactory and sometimes disastrous.
With the advent of the Internet and such tools as Code Co-op and NetMeeting, the situation has changed. It is possible now to have a distributed team with some members working on-site and others at remote locations. The on-site members have up-to-date knowledge of the state of the project and can very quickly alert the management about potential problems. If necessary, they can continually supervise remote development. This way the company can get the benefits of subcontracting without the risk of unpleasant surprises at the time of delivery.