At deviantART, every developer gets their own virtual machine. Our VMs are our private laboratories. In them we work, test, and experiment, with no fear of harming the site.
The VM is also the most significant change we've made to our deployment chain in the past 6 years.
How Does a Virtual Machine Improve Development?
If you have to commit to the repository on a staging server every time you need to test a change, you end up with a lot of commits. Not only does this take time, but it creates a lot of noise. Before the VM, a single change could result in as many as 20 commit messages. We found it very difficult to keep track of what was changing on the site. Now, developers tend to commit code in single logical chunks with meaningful messages.
Reduced Contention for the Staging Server
Staging acts as a final check for code that's already been tested on a VM. It uses the production databases and other production daemons, and it gives us a good idea of what performance will be like. But it's a shared resource, and when it's locked it can very quickly prevent other developers from getting work done. Fewer, better-tested commits keep the staging server in a known-good state for longer periods of time.
Freedom to Experiment
Some code changes would be too tedious or dangerous to test on the staging server. Branches can help in some cases, but you also need to isolate or create copies of resources (such as the database), which can be impractical and time-consuming. Better to test the changes on a machine that doesn't have a connection to anything in production.
This isolation of the VM also helps developers learn new systems faster. Developers can break their VMs like kids breaking apart their parents' expensive electronics to figure out how they work. The worst that can happen is that the developer has to build a new VM.
How the VM is Made
Almost all deviantART servers run from a common base netboot image. We took this netboot image, converted it to a VMware disk image, and added a boot loader. Then, we configured it to download and install the same binary packages used by the live servers. Most packages require some conditional tweaks in their configuration files, but they're all relatively straightforward.
Creating a new VM is mostly automatic. A developer downloads a disk image, adds it to VMware, and then runs a setup script from within the VM. The script downloads the correct versions of the necessary binaries, configures them, and starts any associated daemons. All that's left for the developer is to mount the source code directory on their host via SSHFS.
What to Do About Databases
Accessing the production databases from the VM is out of the question. The VM cannot be allowed to make changes to any production resource. Additionally, the added latency would make testing on the VM much more tedious. For example, if a page on production runs 45 queries in 30ms, the same page on the VM would take no less than 2 seconds to run the same queries (assuming 50ms of round-trip latency). Instead, we have to bring a copy of the databases to the VM.
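The latency estimate above is easy to check with a few lines of arithmetic. This is only a sketch: it assumes the queries run one after another and that each pays one full round trip, and the function name is made up for illustration.

```python
def remote_query_time_ms(query_count, local_total_ms, round_trip_ms):
    """Estimate total query time when every query pays one network round trip.

    Assumes the queries are issued sequentially, one round trip each.
    """
    return local_total_ms + query_count * round_trip_ms


# Figures from the text: 45 queries, 30 ms on production, 50 ms round trip.
estimate_ms = remote_query_time_ms(45, 30, 50)
print(f"{estimate_ms} ms")  # 30 + 45 * 50 = 2280 ms
```

Under those assumptions the network alone accounts for 2250 ms, which is where the "no less than 2 seconds" figure comes from.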
Imagine a parallel universe in which deviantART only has a handful of members. That's essentially what the VM is. We take more than 2TB of database data and extract the approximately 50MB used by just the developers' accounts.
We wrote a program that pulls this data from the production database servers nightly. The program is driven by a configuration file editable by developers. The file begins with a list of usernames; only data on these users will be pulled. To protect users' data, we only add developer usernames. Next, the file defines how various data is related (sort of like foreign key relations, but not quite). These relationships are used to define the subsets of data required to be pulled from each of the database tables.
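To make the idea concrete, here is a minimal sketch of how a list of usernames plus a set of table relationships could be turned into subset queries. It is not the actual program: the table names, column names, and config layout are all hypothetical, and it uses an in-memory SQLite database only so the example can run on its own.

```python
import sqlite3

# Hypothetical configuration: seed usernames plus table relationships.
CONFIG = {
    "usernames": ["dev_alice", "dev_bob"],
    # table -> (column in that table, parent table, parent column)
    "relations": {
        "deviations": ("user_id", "users", "id"),
        "comments": ("deviation_id", "deviations", "id"),
    },
}


def subset_query(table, config, select="*"):
    """Return (sql, params) selecting only rows reachable from the seed users."""
    if table == "users":
        marks = ", ".join("?" for _ in config["usernames"])
        sql = f"SELECT {select} FROM users WHERE username IN ({marks})"
        return sql, list(config["usernames"])
    column, parent, parent_column = config["relations"][table]
    # Recurse up the chain of relations until we reach the users table.
    parent_sql, params = subset_query(parent, config, select=parent_column)
    return f"SELECT {select} FROM {table} WHERE {column} IN ({parent_sql})", params


# Tiny stand-in for production: one developer, one ordinary member.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER, username TEXT);
    CREATE TABLE deviations (id INTEGER, user_id INTEGER, title TEXT);
    CREATE TABLE comments (id INTEGER, deviation_id INTEGER, body TEXT);
    INSERT INTO users VALUES (1, 'dev_alice'), (2, 'member_x');
    INSERT INTO deviations VALUES (10, 1, 'kept'), (11, 2, 'skipped');
    INSERT INTO comments VALUES (100, 10, 'kept'), (101, 11, 'skipped');
""")

sql, params = subset_query("comments", CONFIG)
rows = db.execute(sql, params).fetchall()
print(rows)  # only the comment attached to the developer's deviation
```

Real relationships are rarely this tidy (the text notes they are only "sort of like" foreign keys), so a production version would need more than a single parent column per table.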
As the program runs, it outputs a .sql (text) file resembling the output of mysqldump. When complete, it places the file on a server and rotates the previous files as backups. At any time, a developer can run a command to automatically update their VM's database with the latest data (or optionally roll it back to some point in the past).
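The publish-and-rotate step might look something like the sketch below. The file name `vm.sql`, the numbered-suffix scheme, and the number of backups kept are assumptions for illustration; the text does not say how the real files are named or how many are retained.

```python
import tempfile
from pathlib import Path


def rotate_and_publish(new_dump, directory, keep=7, name="vm.sql"):
    """Shift vm.sql -> vm.sql.1 -> vm.sql.2 ..., then install the new dump as vm.sql."""
    directory = Path(directory)
    oldest = directory / f"{name}.{keep}"
    if oldest.exists():
        oldest.unlink()  # drop the backup that falls off the end
    # Shift from the highest number down so no target is overwritten.
    for index in range(keep - 1, 0, -1):
        source = directory / f"{name}.{index}"
        if source.exists():
            source.rename(directory / f"{name}.{index + 1}")
    current = directory / name
    if current.exists():
        current.rename(directory / f"{name}.1")
    Path(new_dump).replace(current)
    return current


# Demo in a throwaway directory: publish three nightly dumps, keeping two backups.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for night in range(3):
        incoming = folder / "incoming.sql"
        incoming.write_text(f"-- dump {night}\n")
        rotate_and_publish(incoming, folder, keep=2)
    latest = (folder / "vm.sql").read_text()
    previous = (folder / "vm.sql.1").read_text()
    two_nights_ago = (folder / "vm.sql.2").read_text()
```

Rolling a VM back to an earlier point would then amount to loading one of the numbered files instead of the newest one.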
What to Do with Resources
Even after reducing the database to 50MB, the associated image files would still be too large to download and store locally. Luckily, we don't need to. Instead, requests for images and other non-CSS/JS files are proxied (transparently) by Apache to the live site. For us, this is simplified by the fact that these files are hosted on distinct subdomains, but you can achieve the same with any regular directory structure.
Sometimes, though, you don't want the request to be proxied. For example, when testing uploads and submissions, we don't want to have to upload the test file to the site itself. Instead, we upload it to the VM. Apache, instead of proxying the request immediately, checks to see if the file exists locally first and serves it from the VM filesystem if it does.
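The "check locally first, then proxy" decision can be sketched in a few lines. This models only the logic, not the Apache configuration that implements it; the host name `static.example.com`, the directory layout, and the function name are hypothetical.

```python
import tempfile
from pathlib import Path


def route_resource(host, path, local_root):
    """Return ("local", file_path) if the VM has the file, else ("proxy", live_url)."""
    root = Path(local_root).resolve()
    relative = path.lstrip("/")
    candidate = (root / host / relative).resolve()
    # Only serve files that really live under the local root (no ../ escapes).
    if root in candidate.parents and candidate.is_file():
        return ("local", str(candidate))
    return ("proxy", f"http://{host}/{relative}")


# Demo: one file was "uploaded" to the VM, the other exists only on the live site.
with tempfile.TemporaryDirectory() as tmp:
    uploaded = Path(tmp) / "static.example.com" / "uploads" / "test.png"
    uploaded.parent.mkdir(parents=True)
    uploaded.write_bytes(b"fake image")
    hit = route_resource("static.example.com", "/uploads/test.png", tmp)
    miss = route_resource("static.example.com", "/art/other.png", tmp)
```

Here `hit` resolves to the local file while `miss` falls through to the live URL, which mirrors the behavior described above.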
Hosting the VM on a Separate Domain
Originally, we edited our /etc/hosts file to point deviantart.com to the VM when we wanted to work with it. But doing so was quite a hassle, especially considering you had to switch browsers or use a cookie-switching plugin to avoid cookie collisions. Also, it wasn't always clear whether you were browsing the VM or the live site. So we decided to move the VM to its own domain: deviantart.lan.
deviantART is unusual in that it has millions of subdomains: one for each registered member, as well as a collection of reserved subdomains. Enumerating all the possible subdomains in /etc/hosts would be tedious and error-prone. So instead, we set up an instance of tinydns on the VM to answer requests for *.deviantart.lan. Tinydns can be configured to then proxy DNS requests for domains it's not responsible for to another DNS server. Or, if you're using OS X, you can create a file in /etc/resolver to tell OS X to pass any (and only) requests for deviantart.lan domains to the VM.
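Either way, the routing rule is the same: names under deviantart.lan go to the VM and everything else goes wherever it normally would. The sketch below expresses that rule in code. It is not tinydns or resolver configuration, and the VM address and the `"upstream"` placeholder are made up for the example.

```python
def pick_resolver(hostname, vm_ip="192.168.56.10", upstream="upstream"):
    """Decide who should answer a DNS lookup for hostname.

    vm_ip is a hypothetical address for the VM; upstream stands in for
    whatever DNS server the host normally uses.
    """
    name = hostname.rstrip(".").lower()
    if name == "deviantart.lan" or name.endswith(".deviantart.lan"):
        return vm_ip
    return upstream


print(pick_resolver("someartist.deviantart.lan"))  # handled by the VM
print(pick_resolver("www.deviantart.com"))         # left to the normal resolver
```

Matching on the leading dot matters: a name such as notdeviantart.lan should not be sent to the VM.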
Avoiding Code Changes
Making the VM respond to deviantart.lan posed a new problem. All of our code was written with the assumption that we're responding to deviantart.com. We first tried removing all literal occurrences of the string "deviantart.com" from the code and replacing them with a constant/global variable reference (which could vary between the VM and production). But there were too many places in the code to change, and it would be easy to forget and commit a literal "deviantart.com" later.
We decided to step back and solve the problem at a different layer: we'd translate the name at the HTTP level. We looked first at using an Apache directive or extension. Unfortunately, none handled rewriting both the body and the headers (necessary for catching cookies and redirects) in both directions. So we wrote a daemon: tcpbf.
tcpbf is our "TCP Bi-directional Filter". It's a simple program that allows us to make regexp-based replacements in the requests to and responses from the VM. For example, we change "deviantart.lan" to "deviantart.com" on the way in and vice versa on the way out. Additionally, we strip any Accept-Encoding header on requests to prevent the response coming back as gzipped data (which we couldn't do text replacement on).
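The two rewrites can be illustrated with a pair of small functions. This is not tcpbf itself, just a sketch of the replacements described above applied to a whole message held in memory; the function names are invented, and a real stream filter would also have to cope with a match being split across two TCP reads.

```python
import re

# Matches a complete Accept-Encoding header line, including its line ending.
ACCEPT_ENCODING = re.compile(
    rb"^Accept-Encoding:[^\r\n]*\r\n", re.IGNORECASE | re.MULTILINE
)


def filter_request(data):
    """Browser -> VM: drop Accept-Encoding, then rename .lan to .com."""
    data = ACCEPT_ENCODING.sub(b"", data)
    return re.sub(rb"deviantart\.lan", b"deviantart.com", data)


def filter_response(data):
    """VM -> browser: rename .com back to .lan in headers and body alike."""
    return re.sub(rb"deviantart\.com", b"deviantart.lan", data)


request = (
    b"GET / HTTP/1.1\r\n"
    b"Host: me.deviantart.lan\r\n"
    b"Accept-Encoding: gzip\r\n"
    b"Cookie: a=b\r\n\r\n"
)
response = (
    b"HTTP/1.1 302 Found\r\n"
    b"Location: http://www.deviantart.com/\r\n"
    b"Set-Cookie: s=1; domain=.deviantart.com\r\n\r\n"
)

print(filter_request(request).decode())
print(filter_response(response).decode())
```

Because the response filter touches headers as well as the body, redirects and cookie domains come back pointing at deviantart.lan, which is the part the Apache options could not handle.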
The text replacement is also why we chose ".lan" as the new TLD for the virtual deviantART: keeping it to 3 letters, the same length as "com", keeps the Content-Length of the response correct. If we didn't, we'd have a mismatch that would break browsers that use chunked transfer encoding, like Chrome and IE.
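A quick way to see the effect is to measure how much a body grows or shrinks after the swap. The sketch below compares ".lan" with a longer, purely hypothetical alternative (".local"); the helper name and sample body are made up.

```python
def swap_domain(body, old, new):
    """Replace old with new and report how many bytes the body grew by."""
    rewritten = body.replace(old, new)
    return rewritten, len(rewritten) - len(body)


body = b'<a href="http://www.deviantart.com/">home</a>'

_, drift_lan = swap_domain(body, b"deviantart.com", b"deviantart.lan")
_, drift_local = swap_domain(body, b"deviantart.com", b"deviantart.local")

print(drift_lan)    # 0: the declared Content-Length still matches
print(drift_local)  # 2: every occurrence adds two bytes the header doesn't count
```

With a same-length TLD the filter never has to parse or rewrite length information, which keeps it a plain text-substitution tool.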
To handle SSL traffic, we have 2 copies of stunnel running with tcpbf sitting between them. That way both the browser and Apache see SSL packets while tcpbf sees cleartext.
Getting to 100% Accurate Emulation
The VM isn't perfect. Sometimes it's missing data. Occasionally changes are made on production that aren't compatible with it. Sometimes we hit an edge case on the VM that we wouldn't have hit on the production servers due to a difference in resource limits, and vice versa. Luckily, the emulation doesn't have to be 100% accurate to produce benefits. The most important thing we did was make it good enough that developers wanted to use it. Since everyone adopted it, it's been improving naturally over time.