I recently worked on a project that tried to improve the performance of DeviantArt muro. As part of this project I did a lot of testing and benchmarking of various HTML5 operations. I learned a lot about where one must be careful when writing web applications that use HTML5’s <canvas> element. The following is a diary of sorts that I made while working on the project. Of course your mileage may vary depending on the setup of your application, but as you will see, DeviantArt muro realized significant performance improvements when I applied the lessons I learned.
For those who don't want to read this whole long and rambling article, here are the main rules to live by as suggested by my testing:
Reduce direct pixel manipulation as much as possible. Use the line drawing API when possible, and when you must sample pixels, get as few as possible.
Rendering shadows, especially ones with a high blur, will greatly reduce your performance. You can draw quite a few shadowless lines in the time it takes to draw a single one with a soft shadow.
Bundle as many lines as possible into a single call to stroke().
Since I was interested in a scenario similar to what DeviantArt muro will often see, I ran all tests on a canvas that was 1200px wide and 500px tall. This would be the approximate size of the drawing area if a user with an average-sized monitor maximized their browser window. All tests were run on my laptop (2009 MacBook Pro with 3.06 GHz Intel Core 2 Duo processor, 8 GB RAM, and Intel X25-M drive) with minimal other apps running (Terminal, vi, and standard background tasks). The browsers that I tested on were: Firefox 3.6.13, Safari 5.0.3 (6533.19.4), and Google Chrome 8.0.552.237. Late in the game a colleague asked me how Firefox 4 beta compared, so I re-ran some of the tests using Firefox 4.0b10.
Some people will surely ask how the Internet Explorer 9 release candidate stacks up against the rest of the browsers. I apologize that I did not run the tests using IE9; I did not have a Windows machine handy, and did not feel it would be accurate to run these tests on a virtual machine.
For all the tests I would time how long it took to run a small section of code a bunch of times, and then subtract the time needed to run the same code without the critical bit I was testing. This would mean that little costs of doing things to prepare the test would not contribute to the time measured by the test itself. The results shown here are averages of running the tests several hundred times each. Though I did not calculate standard deviations for each result, I kept an eye on the time distributions to make sure that they did not change too much from test to test.
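The subtraction described above can be sketched roughly as follows. This is only an illustration of the idea, not the harness I actually used; the names timeRuns and netCost are hypothetical, and the clock is passed in as a parameter so the sketch can be run outside a browser.

```javascript
// Hypothetical helper: total time (in ms) to call `fn` `iterations` times.
function timeRuns(fn, iterations, now) {
  const start = now();
  for (let i = 0; i < iterations; i++) {
    fn(i);
  }
  return now() - start;
}

// Hypothetical helper: per-iteration cost of the critical operation alone.
// `withCritical` does setup plus the operation under test;
// `withoutCritical` does only the setup, so its time is subtracted out.
function netCost(withCritical, withoutCritical, iterations, now = Date.now) {
  const full = timeRuns(withCritical, iterations, now);
  const baseline = timeRuns(withoutCritical, iterations, now);
  return (full - baseline) / iterations;
}
```

In a browser one might call netCost(drawAndStroke, drawOnly, 500), where both callbacks are whatever pair of test functions is being compared.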
The first thing I tested was basic line drawing: moveTo() a random location on the canvas, lineTo() a different random location. I did the test once using a single fully opaque color, and again using various random colors and opacities. This was meant to give me an idea of how much of a penalty one pays for making a canvas calculate more intricate blending. I also did the same tests using quadraticCurveTo() and bezierCurveTo() so I could see how much more expensive it is to draw with smooth lines. Of course, if you are using one of those functions in your app you will also have the overhead of having to calculate the proper control points to use.
There is not much that is surprising here. The four browsers that were tested performed pretty similarly. Using more mathematically complex curves comes at a cost.
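For anyone who wants to repeat the line and curve test, one iteration could look something like the sketch below. The function names are hypothetical and the random control points are an assumption made for illustration; a real app would compute control points from the user's stroke.

```javascript
// Hypothetical helper: a random point inside a width x height canvas.
function randomPoint(width, height, random = Math.random) {
  return { x: random() * width, y: random() * height };
}

// Hypothetical helper: a random color with random opacity, for the
// variant of the test that forces more intricate blending.
function randomStrokeStyle(random = Math.random) {
  const channel = () => Math.floor(random() * 256);
  const r = channel();
  const g = channel();
  const b = channel();
  return `rgba(${r}, ${g}, ${b}, ${random().toFixed(2)})`;
}

// Draw one random stroke. `kind` is "line", "quadratic", or "bezier".
function drawRandomStroke(ctx, kind, width, height, random = Math.random) {
  const from = randomPoint(width, height, random);
  const to = randomPoint(width, height, random);
  ctx.beginPath();
  ctx.moveTo(from.x, from.y);
  if (kind === "line") {
    ctx.lineTo(to.x, to.y);
  } else if (kind === "quadratic") {
    const c = randomPoint(width, height, random);
    ctx.quadraticCurveTo(c.x, c.y, to.x, to.y);
  } else if (kind === "bezier") {
    const c1 = randomPoint(width, height, random);
    const c2 = randomPoint(width, height, random);
    ctx.bezierCurveTo(c1.x, c1.y, c2.x, c2.y, to.x, to.y);
  } else {
    throw new Error("unknown stroke kind: " + kind);
  }
  ctx.stroke();
}
```

For the blending variant, set ctx.strokeStyle = randomStrokeStyle() before each call to drawRandomStroke().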
Next I wanted to see if it makes a difference how often one calls stroke() when drawing with lines. I ran tests where I compared calling stroke after each line segment was drawn and where I drew a number of line segments and then stroked them all at once. As can be seen by the approximately linear graph, stroke() takes close to the same amount of time each time it is called. If you can draw 50 line segments before calling stroke(), you can save in the neighborhood of 20% of the cost of drawing.
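The two approaches compared in this test can be sketched as below. The function names are hypothetical, and each segment is assumed to be an array of the form [x1, y1, x2, y2].

```javascript
// One stroke() call per segment: the slower pattern.
function strokeEachSegment(ctx, segments) {
  for (const [x1, y1, x2, y2] of segments) {
    ctx.beginPath();
    ctx.moveTo(x1, y1);
    ctx.lineTo(x2, y2);
    ctx.stroke();
  }
}

// All segments added to a single path, then stroked once.
function strokeBatched(ctx, segments) {
  ctx.beginPath();
  for (const [x1, y1, x2, y2] of segments) {
    ctx.moveTo(x1, y1);
    ctx.lineTo(x2, y2);
  }
  ctx.stroke();
}
```

Batching only applies when all the segments share one stroke style and line width. Also note that a single path is painted as one shape, so where semi-transparent segments cross, the batched result will likely not darken the way separately stroked segments do.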
Shadows are a really useful tool, not only for drawing actual shadows, but for anything that needs a nice soft edge. However, that soft edge comes at a really high price. deviantART has known for a while that WebKit browsers struggle when we use a lot of shadows. I was really curious to get a handle on just what was going on that made Gecko and WebKit browsers behave so differently. When one times how long it takes to draw straight lines with various amounts of shadowBlur, a really interesting graph appears. WebKit browsers can draw small shadows quickly, but as shadowBlur increases their rendering time increases slightly worse than linearly. Firefox, on the other hand, renders shadows at near constant speed. If the shadows do not have much blur it is slower than the WebKit browsers, but when shadowBlur gets up to 100 it is four times faster.
Interestingly, Firefox 4 beta now has performance closer to that of the WebKit browsers. The shadows of the same blur in Firefox 4 are also a lot wider than they were previously (Firefox 3 has always had smaller shadows than WebKit). I do not know the details, but it would seem that the canvas spec must be settling on a softer, but more computationally intensive shadow as its reference.
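For reference, a soft-edged line of the kind timed in the shadow test can be drawn roughly like this. The function name is hypothetical, and using the stroke color as the shadow color is just one assumption about how a soft edge might be produced. The save() and restore() calls keep the shadow settings from leaking into later, shadowless drawing.

```javascript
// Hypothetical helper: stroke one line with a blurred shadow of the same color.
function strokeSoftLine(ctx, x1, y1, x2, y2, blur, color) {
  ctx.save();              // remember the current drawing state
  ctx.strokeStyle = color;
  ctx.shadowColor = color; // shadow matches the line, giving a soft edge
  ctx.shadowBlur = blur;   // larger values were far slower in WebKit
  ctx.beginPath();
  ctx.moveTo(x1, y1);
  ctx.lineTo(x2, y2);
  ctx.stroke();
  ctx.restore();           // shadow settings revert to what they were
}
```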
Earlier profiling that I did showed that the worst performance issues in DeviantArt muro came from having to move buffers around at an inopportune time. Any complex graphics app is going to have to store and/or move image data around, and I was really curious about what the best way to do this is.
There are a number of different ways to get at the data that is on a canvas. One can use drawImage() to copy the contents of one canvas to another canvas. You can get the contents of a canvas as a base64-encoded PNG by using toDataURL(). You can also get essentially an array of pixels using getImageData(). As can be seen below, the toDataURL() method is the clear loser; apparently the cost of encoding the data is pretty high. Which of the other two methods to use is a little less clear until Firefox 4 usage is widespread. As can be seen, Firefox 3 has some problems getting and putting image data quickly, but WebKit browsers are much faster at that than using another canvas element as a buffer.
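The three approaches can be sketched side by side as follows. The function names are hypothetical, and makeCanvas is an assumed factory (in a browser, something like () => document.createElement("canvas")) that is passed in so the sketch does not depend on a global document. Note that the canvas method is spelled toDataURL().

```javascript
// 1. Copy the whole canvas onto a second canvas used as a buffer.
function copyToCanvas(source, makeCanvas) {
  const buffer = makeCanvas();
  buffer.width = source.width;
  buffer.height = source.height;
  buffer.getContext("2d").drawImage(source, 0, 0);
  return buffer;
}

// 2. Encode the canvas as a base64 PNG data URL (the slowest in my tests).
function copyToDataURL(source) {
  return source.toDataURL("image/png");
}

// 3. Read the raw RGBA pixels of the whole canvas.
function copyToImageData(source) {
  const ctx = source.getContext("2d");
  return ctx.getImageData(0, 0, source.width, source.height);
}

// Write pixels saved by copyToImageData() back at the top-left corner.
function restoreFromImageData(target, imageData) {
  target.getContext("2d").putImageData(imageData, 0, 0);
}
```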
An advantage of using getImageData() is that it can sample a portion of the canvas. I did the same getImageData() test, but sampled squares of increasingly larger size, and for all four browsers tested, getImageData() had close to constant speed per pixel sampled. Before I had thought that the overhead of getting any pixels would be large, so I would sometimes sample more than I needed if I thought there was information that I would be needing at a later point. As this graph shows though, it is better to grab only what you need, because you do not pay a noticeable penalty for sampling a second time down the road.
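Grabbing only what you need might look like the sketch below, which clamps a requested rectangle to the canvas and reads just those pixels. The function name and the shape of the returned object are hypothetical.

```javascript
// Hypothetical helper: read only the part of the rectangle (x, y, w, h)
// that lies on a canvasWidth x canvasHeight canvas. Returns null when the
// rectangle is entirely off-canvas.
function sampleRegion(ctx, canvasWidth, canvasHeight, x, y, w, h) {
  const left = Math.max(0, Math.floor(x));
  const top = Math.max(0, Math.floor(y));
  const right = Math.min(canvasWidth, Math.ceil(x + w));
  const bottom = Math.min(canvasHeight, Math.ceil(y + h));
  if (right <= left || bottom <= top) {
    return null; // nothing to sample
  }
  return {
    x: left,
    y: top,
    imageData: ctx.getImageData(left, top, right - left, bottom - top),
  };
}
```

The returned x and y record where the pixels came from, so they can later be written back with putImageData(result.imageData, result.x, result.y).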
Applying to dA muro
While all this data is somewhat interesting, one must wonder how much the knowledge helps in performance tuning a web application. Of course everybody’s mileage will vary on this; I am sure that there are programmers out there who have a much better intuition for speed optimizations than I do. They would have written faster code than I did right from the get-go. However, I think that the code I started from was probably fairly typical of what one might expect from an experienced coder who was fairly new to HTML5 and used all of the available APIs in the interest of simple, straightforward code rather than premature optimization.
The main lesson I learned is that any kind of getImageData() call or canvas copy should be avoided at all costs, delayed until a “down time” if it cannot be avoided, and if all else fails, great care should be taken to sample only the pixels that you absolutely need. It is alright to call lineTo() many more times if it means you can avoid a call to getImageData().
The first place that I tried to optimize was measured by a test that simulates a user making a bunch of short strokes relatively close to one another. An artist would typically do this if they were cross-hatching, stippling, or applying a Van Gogh-esque texture to their drawing. A lot of the changes that I made are particular to the internals of DeviantArt muro, and a description would not make sense to somebody unfamiliar with our codebase. However, I will describe two of the optimizations. When a user draws a line, the new line needs to be pushed into an undo buffer, and it also needs to be reflected in the zoom “navigator” panel that is in the corner of the screen. These two tasks cannot avoid some kind of buffer copying, but smarter buffer copying led to some noticeable speed improvements as can be seen below. Note that Firefox 4 is not shown in the first three sets of results because I did not start testing it until later (and I did not feel like re-coding all my inefficiencies just to see how much it improved).
The next place I turned my attention was individual brushes that were taking longer than they needed to. In most cases this came from unnecessary calls to clearRect() or fillRect() (these calls perform similarly to putImageData() calls). The bulk of deviantART muro’s brushes are now quite a bit faster. Once again, Firefox 4 is not in these results because I did not benchmark it at the beginning of the project.
I figured that the good news about the filter problem was that it is something that can be easily parallelized. Most modern computers have at least 2 processor cores, so it is a shame to leave one of those idle while a single browser UI thread is churning away. So, I prototyped a change that split the canvas into several chunks and passed the filtering off to some web worker threads.
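The chunking step of that prototype could be sketched as follows. This is not muro's actual code: splitIntoBands is a hypothetical name, and the sketch assumes a filter that treats each pixel independently, so that horizontal bands need no overlap at their edges.

```javascript
// Hypothetical helper: split `height` pixel rows into at most `count`
// horizontal bands of near-equal size. Bands with zero rows are omitted.
function splitIntoBands(height, count) {
  if (!Number.isInteger(count) || count < 1) {
    throw new RangeError("count must be a positive integer");
  }
  const bands = [];
  const base = Math.floor(height / count); // rows every band gets
  const extra = height % count;            // first `extra` bands get one more
  let y = 0;
  for (let i = 0; i < count; i++) {
    const rows = base + (i < extra ? 1 : 0);
    if (rows > 0) {
      bands.push({ y: y, height: rows });
    }
    y += rows;
  }
  return bands;
}
```

Each band would then be read with ctx.getImageData(0, band.y, width, band.height), handed to a worker through postMessage(), and written back with putImageData() when the worker replies.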
The first problem I ran into is that web workers do not have a concept of shared memory. Data passed to and from them must go through calls to postMessage(). An article on the Mozilla blog indicated that internally these messages are passed as JSON strings. This is a problem for a task like filters that are operating quickly on a very large data set. The cost of JSON encoding is not small compared to the cost of the actual computation. Note also that in WebKit browsers you cannot assign an array reference into an imageData’s data, so you have to pay the penalty of doing a memory copy from the JSON decoded array into an imageData object.
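The copy mentioned above amounts to something like the following sketch. The function name is hypothetical, and the length check is my own addition for safety rather than anything a browser requires.

```javascript
// Hypothetical helper: copy RGBA values from a plain array (for example,
// one decoded from a worker message) into an existing ImageData object.
// The `data` property itself cannot be reassigned, so values are copied
// element by element.
function copyIntoImageData(imageData, pixels) {
  const data = imageData.data;
  if (pixels.length !== data.length) {
    throw new RangeError("pixel array length does not match image data");
  }
  for (let i = 0; i < data.length; i++) {
    data[i] = pixels[i];
  }
  return imageData;
}
```

In current browsers, where data is a Uint8ClampedArray, imageData.data.set(pixels) performs the same copy in a single call.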
The results of my experiment were mixed. Firefox 3 was quite slow before the change, and sped up by a factor of 3 when it was parallelized. Safari, on the other hand, spent a long time churning before the threads even started executing (I cannot be sure, but I suspect that this was while the JSON encoding was happening), and then for some unknown reason the multiple threads each took a lot longer than the single UI thread. Chrome’s threads ran very quickly at first, but then it sat for a little while before returning the data to the UI thread.
I spent a little bit of time trying to debug these issues, but eventually gave up. From my experiences I would say that web workers are a cool technology that will be really useful someday, but at the moment some browsers are not quite ready for this particular use case.
Below you can see the CPU utilization of the two cores of my machine when the filtering code is running in a single thread vs when it is running in web worker threads.