hq already gave a non-technical summary of these events. We thought that you might enjoy a little more detail, though.
I warn you ahead of time that this entry assumes some basic familiarity with the techniques associated with data storage on large websites. If you've never heard the term "shard" applied to databases before, this might not make sense in places. In a later post we're going to do an overview of how deviantART's servers are structured, which will cover this.
Breaking down
Last Monday at about 1:30am, people started noticing some really weird things happening on deviantART. Their comments would appear somewhere other than where they were left. Their message centers filled up with notifications of journals from people they didn't watch. They found inexplicable notes in their Message Center. Polls they posted had the options from someone else's poll. All sorts of things were failing: posting deviations, creating galleries and collections, almost anything that involved making something new on the site. Obviously, something was wrong.
Since it was the middle of the night for most of us, it took a little while for reports to filter through helpdesk tickets and reach someone who could react. The only people around were a few of our more insomniac and/or European developers: 20after4, kouiskas, and pachunka. All that had been reported so far was that people in :devdevbug: were getting journals in their message center from people they didn't watch. The natural assumption was that it had something to do with the new features we were having :devdevbug: beta test at the time. Before long the mixed-up comments were noticed, and more reports came in from non-:devdevbug: people, so that was ruled out.
At that point we knew that something was really messed up, so the site was thrown into read-only mode. This is something we try to avoid doing, as you might imagine. We get somewhere close to 100,000 new deviations every day, well over a million comments, and thousands of new deviants... and the site being read-only shuts all of that down. But data corruption is serious business.
With some time to look into the issue in relative peace, we found out that data with the same id was appearing on different database servers in our cluster. Since we shard the data by user, this was an indication that things were seriously mixed up; apparently we'd been giving out the same id to multiple pieces of content.
With this clue we investigated the source of our ids. Since we're using sharded servers, we can't rely on auto_increment for these, so we store sequence values on one database server and increment them whenever we assign an id. Looking at these, we saw that some of the sequence values had decreased at about 1:30am, when the trouble started. We then worked out that at about this time some fairly routine database maintenance had occurred, which involved swapping that server with its backup. These servers are supposed to be identical, and their replication was up to date when the swap occurred.
As far as we can tell, what happened was a failure in statement-based replication. When we update the sequence values we don't set them directly to a new number, we just send "value = value + 1", and rely on MySQL's LAST_INSERT_ID to get the new value. So if occasionally one of these queries just didn't get replicated, the backup's sequence values would slowly fall behind.
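To make that failure mode concrete, here is a toy simulation in Python. It is only a sketch, not the code that runs on deviantART: the `SequenceServer` class, the `simulate` function, and the drop rate are all made up for illustration, and the "lost statement" is simply assumed rather than reproduced from MySQL itself.

```python
import random


class SequenceServer:
    """Toy stand-in for the database server holding one sequence row."""

    def __init__(self, value=0):
        self.value = value

    def apply_increment(self):
        # Equivalent of the relative statement "value = value + 1".
        self.value += 1
        return self.value


def simulate(n_ids, drop_rate, seed=0):
    """Hand out n_ids ids on the master while the backup replays the statements."""
    rng = random.Random(seed)
    master, backup = SequenceServer(), SequenceServer()
    issued = []
    for _ in range(n_ids):
        issued.append(master.apply_increment())
        # Statement-based replication replays the same relative statement.
        # Assumption for this sketch: it is occasionally lost on the way.
        if rng.random() >= drop_rate:
            backup.apply_increment()
    return master, backup, issued


master, backup, issued = simulate(n_ids=10_000, drop_rate=0.001)
lag = master.value - backup.value  # how far the backup has fallen behind

# After swapping in the backup, the next ids it hands out repeat old ones.
reissued = [backup.apply_increment() for _ in range(lag)]
collisions = set(reissued) & set(issued)
print(f"backup lag: {lag}, colliding ids after swap: {len(collisions)}")
```

Because the statement is relative, a single missed increment is never corrected by later ones. Every id the backup is behind by becomes one duplicate id after the swap.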
We immediately added a fairly large number to all of the sequence values by hand, making sure that they were all above the largest observed id value in use, and took the site out of read-only once we were sure that this had stopped new data corruption.
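The emergency fix amounts to something like the following Python sketch. The sequence names, the numbers, and the size of the safety margin are invented for the example; the only part taken from the incident is the rule that every sequence must end up above the largest id already in use.

```python
def bump_sequences(sequences, max_observed_ids, margin=1_000_000):
    """Return new sequence values that sit safely above every id in use.

    sequences:        current value of each sequence (hypothetical names)
    max_observed_ids: largest id actually found in the data, per sequence
    margin:           extra headroom; the size here is an arbitrary choice
    """
    bumped = {}
    for name, current in sequences.items():
        highest_seen = max_observed_ids.get(name, 0)
        # Start from whichever is larger, then add headroom on top.
        bumped[name] = max(current, highest_seen) + margin
    return bumped


# Made-up numbers: both sequences have slipped below ids already handed out.
sequences = {"comment_id": 1_204_500, "note_id": 88_120}
max_observed_ids = {"comment_id": 1_237_712, "note_id": 90_672}

new_values = bump_sequences(sequences, max_observed_ids)
for name, value in new_values.items():
    print(name, value)
```

Skipping a range of ids costs nothing, so a generous margin is the cheap way to be sure no new id can collide with an old one.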
Cleaning up
Now we were stuck with the problem of fixing as much of the broken data as we could.
(This is about where I woke up, and can legitimately claim to be part of "we" in this story. Because I'm in charge of our "Reactor" team, I got to coordinate the cleanup effort.)
Fortunately, because of our sharded servers, the problem was often just that different data with the same id existed on multiple servers, and the wrong server was being read from. In those cases the solution was to find the duplicate ids, and to assign new ids to some of them. That was pretty easy; I wrote a quick tool to find the duplicate ids, and the new ids were seamlessly handed out.
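The idea behind that tool can be sketched in a few lines of Python. This is an illustration rather than the tool itself: the shard names and ids are invented, and the rule that the first shard keeps the original id is an assumption made for the example.

```python
from collections import defaultdict


def find_duplicate_ids(shards):
    """Map each id found on more than one shard to the shards holding it."""
    seen = defaultdict(list)
    for shard_name, ids in shards.items():
        for item_id in set(ids):
            seen[item_id].append(shard_name)
    return {
        item_id: sorted(names)
        for item_id, names in seen.items()
        if len(names) > 1
    }


def reassign(duplicates, next_id):
    """Plan new ids: the first shard keeps the id, every other copy moves."""
    plan = []
    for item_id in sorted(duplicates):
        for shard_name in duplicates[item_id][1:]:
            plan.append((shard_name, item_id, next_id))
            next_id += 1
    return plan


# Hypothetical shards, each listing the ids of the rows it stores.
shards = {
    "shard_a": [101, 102, 103],
    "shard_b": [103, 104],
    "shard_c": [102, 105, 103],
}

duplicates = find_duplicate_ids(shards)
plan = reassign(duplicates, next_id=5000)
for shard_name, old_id, new_id in plan:
    print(f"{shard_name}: {old_id} -> {new_id}")
```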
However, some of the older parts of the site maintained data on a non-sharded server. This old code also often didn't check whether or not the initial insert succeeded... resulting in data loss when further queries went ahead. Journal entries got mixed up terribly, for instance. In these cases we unmixed them as well as we could, generally putting one of the entries back together, and deleting the shattered remnants of the other.
Notes turned out to be especially bad, because they had mixed together all the recipient lists for the original notes, and we had no way of telling what the original list was. Notes are some of our very few totally private pieces of information, and we absolutely couldn't risk anyone reading a note that wasn't meant for them. So we deleted the 2,552 notes that were sent during the incident.
Comments were intimidating just because of their scale. 33,212 comments were jumbled up, the text for one comment appearing in the thread where another was supposed to be. Luckily this turned out to be mostly fixable; 32,861 comments were put back in their proper place, and we only had to delete 351.
This could have been a lot worse. We were initially worried that deviants' credit cards might have been getting mixed up, but this turned out not to be possible. Similarly, user/group widgets and privileges could also potentially have been mixed up... and we were happy to discover that they were immune.
Prevention
As you might imagine, this has made us start looking into ways to avoid problems with replication in the future. We'd never seen a replication bug this subtle before, and the preventative measures we had in place were for more blatant issues. To patch up this particular hole we've put some automatic monitoring on the sequence values, so that if the master and backup drift out of alignment we'll know immediately. More generally we're evaluating switching to row-based replication. We've also been considering moving to a system like Twitter's Snowflake to get ids, without having to rely on database integrity.
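For anyone unfamiliar with the Snowflake approach, here is a minimal Python sketch of a generator in that style. It assumes the commonly described 64-bit layout (a millisecond timestamp, a 10-bit worker id, and a 12-bit per-millisecond sequence); the class name, the epoch, and the injectable `clock` parameter are choices made for this example, not anything we have built.

```python
import threading
import time

EPOCH_MS = 1_262_304_000_000  # 2010-01-01T00:00:00Z, an arbitrary custom epoch
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER = (1 << WORKER_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeLikeGenerator:
    """Issues ids from local state only: time, worker id, and a counter."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id <= MAX_WORKER:
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Reusing an old timestamp could repeat ids, so refuse.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Counter exhausted for this millisecond: wait it out.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


gen_a = SnowflakeLikeGenerator(worker_id=1)
gen_b = SnowflakeLikeGenerator(worker_id=2)
ids = [gen_a.next_id() for _ in range(1000)]
ids += [gen_b.next_id() for _ in range(1000)]
print(len(ids), len(set(ids)))
```

The appeal for us is that no shared counter exists to fall behind: two generators with different worker ids can never produce the same id, whatever state a database replica is in. The trade-off is that correctness now depends on clocks and on worker ids being assigned uniquely.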
Because it's not paranoia when MySQL really is out to get you, while we were cleaning up we added error checking to a lot of old code which was assuming that its inserts couldn't possibly fail. This means that if our sequences ever do fall back again in spite of the precautions we mentioned there won't be any corruption occurring before we can respond.
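The kind of check we mean looks roughly like this. The sketch uses Python's built-in `sqlite3` module as a stand-in for our real database layer, and the `journals` and `notifications` tables and the `post_journal` function are hypothetical; the point is only that the follow-up queries never run when the first insert fails.

```python
import sqlite3


def post_journal(conn, journal_id, author, body):
    """Insert a journal and its notification only if the id is really free."""
    try:
        with conn:  # commits on success, rolls back if anything raises
            conn.execute(
                "INSERT INTO journals (id, author, body) VALUES (?, ?, ?)",
                (journal_id, author, body),
            )
            # Only reached when the insert above succeeded.
            conn.execute(
                "INSERT INTO notifications (journal_id, author) VALUES (?, ?)",
                (journal_id, author),
            )
    except sqlite3.IntegrityError:
        # Duplicate id: stop here instead of attaching data to someone
        # else's journal.
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE journals (id INTEGER PRIMARY KEY, author TEXT, body TEXT)"
)
conn.execute("CREATE TABLE notifications (journal_id INTEGER, author TEXT)")

first = post_journal(conn, 42, "alice", "hello")
second = post_journal(conn, 42, "bob", "oops")  # same id handed out twice
print(first, second)
```

With the check in place, a repeated id produces a failed post that can be retried, rather than two pieces of content tangled together.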
This incident has added some twists to how we're going to investigate similar problems in the future. Normally we start out by looking at the last code to be launched before an issue began, since that's probably related. We also didn't immediately think it would be the database maintenance, because the sequence slip meant that timestamps on affected content were from before the maintenance. Now that we know this sort of replication bug is possible, we can check for it early, instead of looking for it as a last resort.
A lot of credit must go to our insomniac heroes, 20after4, kouiskas, and pachunka, who quickly realized how screwed we were and made the judgement call to pull the plug on the site until we could fix it.
© 2010 - 2025 dt