Vega Visualization Performance Benchmark
Table of contents
- Abstract (TLDR)
- What is Vega
- Interactive client-side graphs on web pages with Vega
- Problems scaling Vega
- Benchmark
- Observations
- Performance profile
- Conclusions
Abstract (TLDR)
Vega is a visualization grammar (a language) and a library for server- and client-side visualizations. A live benchmark showing the client-side performance of Vega on differently sized datasets is presented, along with the Python program that generates the experiments used in the article. Vega is shown to scale well up to 10 000 datapoints. No performance difference was noticed between the json and csv input formats.
The spec/data generator, a Python program used to set up the experiments, is here.
What is Vega
Vega is a “visualization grammar”: essentially, a specification language and a JavaScript library that interprets this grammar (both in the browser and server-side). Vega allows creating interactive graphs, called “visualizations”, in a declarative way. Declarative here means requiring no coding in the traditional sense; we only write a JSON specification. For example, the picture above is the result of rendering this specification.
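To give a flavour of the grammar, here is a minimal sketch of a scatter-plot spec, expressed as a Python dict (in the spirit of the programmatic generator used later in this article). It is only an illustration, not the specification linked above; "data.json" is a placeholder file name:

```python
import json

# A bare-bones Vega v5 scatter-plot spec, sketched as a Python dict.
spec = {
    "$schema": "https://vega.github.io/schema/vega/v5.json",
    "width": 400,
    "height": 300,
    # "data.json" is a placeholder file name for the dataset
    "data": [{"name": "points", "url": "data.json"}],
    "scales": [
        {"name": "xscale", "type": "linear",
         "domain": {"data": "points", "field": "x"}, "range": "width"},
        {"name": "yscale", "type": "linear",
         "domain": {"data": "points", "field": "y"}, "range": "height"},
    ],
    "axes": [
        {"orient": "bottom", "scale": "xscale"},
        {"orient": "left", "scale": "yscale"},
    ],
    # one "symbol" mark (a circle by default) per datapoint, placed at (x, y)
    "marks": [{
        "type": "symbol",
        "from": {"data": "points"},
        "encode": {"enter": {
            "x": {"scale": "xscale", "field": "x"},
            "y": {"scale": "yscale", "field": "y"},
            "size": {"value": 50},
        }},
    }],
}

print(json.dumps(spec, indent=2))  # the JSON a page would actually load
```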
Vega is widely used in scientific and business circles and has a thriving community, great documentation and tutorials.
The “grammar” is far from trivial, but once you get the hang of it, creating graphs can be a relatively painless and even enjoyable process. The docs are extremely helpful and definitely worth checking out. There is also a simplified version of the framework called “Vega-Lite”.
Interactive client-side graphs on web pages with Vega
One of the most prominent use cases of Vega is the following:
- the data to be visualized is stored somewhere as json or csv
- a declarative json specification is embedded into the page, or loaded from outside
- a JavaScript runtime of Vega is loaded as well (they provide it via CDN)
- and voilà, a juicy interactive graph is there on the page
See the above in action here.
Note that in the use case above, all the heavy lifting of visualizing things is done by the client (the machine running your browser). Once your data and visualization specification are loaded into the browser, it is your PC and your PC only that is responsible for the visualization; no server is involved.
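To make this concrete, below is a minimal sketch of such a page, written out by a small Python helper. The script tags are the standard jsdelivr CDN URLs from the Vega documentation; "spec.json" is an assumed file name:

```python
# Minimal sketch of a page embedding a Vega visualization, emitted by a
# small Python helper. "spec.json" is an assumed file name.
PAGE = """<!DOCTYPE html>
<html>
  <head>
    <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
    <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
  </head>
  <body>
    <div id="vis"></div>
    <script>
      // vega-embed fetches the spec (and any data it references) and
      // renders it entirely in the browser; the renderer option switches
      // between canvas and svg, the two modes compared below.
      vegaEmbed("#vis", "spec.json", {renderer: "canvas"});
    </script>
  </body>
</html>
"""

with open("index.html", "w") as f:
    f.write(PAGE)
```

Note that the page has to be served over HTTP (for example with python -m http.server) so that the browser is allowed to fetch the spec and data files.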
Problems scaling Vega
For reasonably sized datasets (the definition of “reasonably” is, in fact, the main question of this article), client-side Vega works flawlessly. However, as the data grows, performance bottlenecks become evident. This is not to say that Vega is not performant, but rather that any frontend-based visualization is bound to have scalability issues outside its comfort zone (in terms of dataset size).
Benchmark
Keeping the above in mind, I got curious about how far we can actually push it before it becomes unusable. Knowing some ballpark figures might help me and others quickly get a rough idea of whether client-side Vega will work in a particular use case.
To answer this question, a simple spec of a scatter plot showing multi-attribute datapoints was created using a programmatic generator; a sketch of such a generator follows the lists below.
The properties of the data are:
- x, y - the center of the circle
- category - the fictional category the point belongs to, this affects the color
- attr_0 through attr_4 - real-valued attributes with no specific meaning, except attr_0, which defines the size of the circle
The expected interactive behaviours:
- when hovering over a circle, all attributes are shown as a concatenated string
- clicking on a category in the legend area highlights the datapoints belonging to that category
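As promised, here is a minimal sketch of what such a data generator might look like (the category names and value ranges are made up; the actual generator linked above may differ):

```python
import csv
import json
import random

CATEGORIES = ["alpha", "beta", "gamma", "delta"]  # hypothetical category names

def make_points(n, seed=42):
    """Generate n random datapoints with the properties described above."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        point = {
            "x": rng.uniform(0, 100),
            "y": rng.uniform(0, 100),
            "category": rng.choice(CATEGORIES),
        }
        # attr_0 drives the circle size; the rest carry no specific meaning
        for i in range(5):
            point[f"attr_{i}"] = rng.uniform(0, 1)
        points.append(point)
    return points

def write_dataset(points, path, fmt):
    """Dump the points as either json or csv, the two input formats benchmarked."""
    if fmt == "json":
        with open(path, "w") as f:
            json.dump(points, f)
    elif fmt == "csv":
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(points[0]))
            writer.writeheader()
            writer.writerows(points)
```

Writing the same points in both formats is what allows the benchmark to compare csv against json with everything else held fixed.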
Variables
The graph described above was built for different values of:
- dataset size (1000, 5000, 10 000, 30 000 points)
- format of the input data (csv or json)
- the renderer (canvas or svg)
giving us in total 4 * 2 * 2 = 16 different experiments (enumerated in the sketch below). You can find and check all of them here.
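Enumerating the full matrix is a one-liner with itertools.product; a sketch (the naming scheme is made up for illustration):

```python
from itertools import product

SIZES = [1000, 5000, 10_000, 30_000]
FORMATS = ["csv", "json"]
RENDERERS = ["canvas", "svg"]

# 4 sizes x 2 formats x 2 renderers = 16 experiment pages
for size, fmt, renderer in product(SIZES, FORMATS, RENDERERS):
    name = f"scatter_{size}_{fmt}_{renderer}"
    print(name)  # the real generator emits a spec, a dataset and an HTML page here
```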
Observations
Here I describe my experiences and conclusions after testing the graphs in my setup: Linux, Chrome, and a decent i5-based laptop.
One thousand datapoints
I have no problems at all interacting with the visualization when the dataset size is 1000. Neither do I perceive any difference between canvas and svg as the renderer.
Five thousand datapoints
At 5000 datapoints, it gets very slightly more sluggish, although I would probably not notice it if I didn’t pay attention. Surprisingly, canvas performance is still on par with that of svg.
Ten thousand datapoints
At 10k it gets a little hot. Although the rendering itself still takes quite a reasonable time, the “show attributes on hover” behaviour is now definitely sluggish (when using canvas as the renderer). The svg renderer feels, again, similar to canvas in terms of rendering and, surprisingly, shows noticeably better “hover” performance. This is something I did not expect at all. I theorise that the svg renderer somehow uses the browser’s compiled routines to see which circle we are “hovering” over, whereas with canvas Vega has to rely on the JavaScript interpreter to infer that.
Thirty thousand datapoints
Well, they both still work, I would say. They show me the picture without the waiting time being unbearable (though it does give a sluggish impression). With the svg renderer, I see “hover” working unreliably (sometimes not showing anything). Hovering in the canvas-based graph is more reliable, but the delay is very high, up to the point of making it unusable.
All in all, I think that 30k is already too much for Vega to handle.
Performance profile
Here I put some screenshots of the Chrome profiler to provide some quantitative data.
Conclusions
So we have these three competing demands:
- responsiveness
- size of data being displayed (the bigger the better)
- speed of rendering
and we would want all three to be satisfied at the same time.
I would say, based on the experiments performed, that the sweet spot is somewhere around 10 000 datapoints. Anything above that, although somewhat workable at least up to 30k, is already not comfortable to interact with.
The most surprising discovery for me was that the svg renderer does not show any immediately obvious inferiority in comparison with canvas, and in some respects, at certain data sizes, in fact delivers better performance (I’m referring to the hover behaviour at 10k).
Another noteworthy finding is that there is no noticeable performance difference between csv and json as the data format.
So 10 000 points seem to be OK, but another question is whether we really want to display this much data at all. Several thousand datapoints on a single chart is already too much for any human being to digest. Further scaling should therefore be possible by dynamically loading small portions of data and performing server-side aggregation, so that at any given point the graphical frontend has comfortably sized data to display.