Abstract (TLDR)

Vega is a visualization grammar (a language) and a library for server- and client-side visualizations. A live benchmark showing the client-side performance of Vega on differently sized datasets is presented, along with the Python program that generates the experiments used in the article. Vega is shown to scale well up to 10,000 datapoints. No performance difference is observed between JSON and CSV input formats.

The spec/data generator, a Python program used to set up the experiments, is here.

[Image: an interactive scatter plot rendered by Vega]

What is Vega

Vega is a “visualization grammar”: essentially, a specification language plus a JavaScript library that interprets this grammar (both in the browser and server-side). Vega allows creating interactive graphs, called “visualizations”, in a declarative way. Declarative here means no coding in the traditional sense is required: we only write a JSON specification. For example, the picture above is the result of rendering this specification.
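To give a taste of the grammar, below is a minimal sketch of a Vega scatter-plot spec, built as a Python dict and dumped to JSON. The file names and values are illustrative assumptions; the article's actual spec is richer.

```python
import json

spec = {
    "$schema": "https://vega.github.io/schema/vega/v5.json",
    "width": 400,
    "height": 300,
    # data is referenced by URL; Vega fetches and parses it client-side
    "data": [{"name": "points", "url": "data.json"}],
    "scales": [
        {"name": "xscale", "type": "linear", "range": "width",
         "domain": {"data": "points", "field": "x"}},
        {"name": "yscale", "type": "linear", "range": "height",
         "domain": {"data": "points", "field": "y"}},
    ],
    "axes": [
        {"orient": "bottom", "scale": "xscale"},
        {"orient": "left", "scale": "yscale"},
    ],
    # one circle ("symbol" mark) per datapoint
    "marks": [{
        "type": "symbol",
        "from": {"data": "points"},
        "encode": {"enter": {
            "x": {"scale": "xscale", "field": "x"},
            "y": {"scale": "yscale", "field": "y"},
            "size": {"value": 40},
        }},
    }],
}

with open("spec.json", "w") as f:
    json.dump(spec, f, indent=2)
```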

Vega is widely used in scientific and business circles and has a thriving community, great documentation and tutorials.

The “grammar” is far from trivial, but once you get the hang of it, creating graphs can be a relatively painless and even enjoyable process. The docs are extremely helpful and definitely worth checking out. There is also a simplified version of the framework called “Vega-Lite”.

Interactive client-side graphs on web pages with Vega

One of the most prominent use cases of Vega is the following:

  • the data to be visualized is stored somewhere as JSON or CSV
  • a declarative JSON specification is embedded into the page, or loaded from an external source
  • the JavaScript runtime of Vega is loaded as well (it is provided via CDN)
  • and voilà, a juicy interactive graph is there on the page

See the above in action here.
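A minimal sketch of this workflow, assuming a spec saved as spec.json next to the page (the script below is an illustration, not the article's generator):

```python
# Writes a self-contained HTML page that loads the Vega runtime from a
# CDN and renders spec.json entirely in the visitor's browser.
html_page = """<!DOCTYPE html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
</head>
<body>
  <div id="vis"></div>
  <script>
    // vegaEmbed fetches the spec (and any data it references)
    // and renders it client-side
    vegaEmbed('#vis', 'spec.json', {renderer: 'canvas'});
  </script>
</body>
</html>"""

with open("index.html", "w") as f:
    f.write(html_page)
```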

Note that in the use case above, all the heavy lifting of visualization is done by the client (the machine running your browser). Once your data and visualization specification are loaded into the browser, it is your PC, and your PC only, that is responsible for the visualization; no server is involved.

Problems scaling Vega

For reasonably sized datasets (the definition of “reasonably” is, in fact, the main question of this article), client-side Vega works flawlessly. However, as the data starts getting big, performance bottlenecks become evident. This is not to say that Vega is not performant, but rather that any frontend-based visualization is bound to hit scalability issues outside its comfort zone (in terms of dataset size).

Benchmark

Keeping the above in mind, I got curious about how far we can actually push it before it becomes unusable. Knowing some ballpark figures might help me and others quickly get a rough idea of whether client-side Vega will work in a particular use case.

To answer this question, a simple spec of a scatter plot showing multi-attribute datapoints was created using a programmatic generator (a simplified sketch of the generated data follows the lists below).

The properties of the data are:

  • x, y - the center of the circle
  • category - the fictional category the point belongs to; it determines the color
  • attr_0 - attr_4 - real-valued attributes without any specific meaning, except attr_0, which defines the size of the circle

The expected interactive behaviors:

  • when hovering over a circle, all its attributes are shown as a concatenated string
  • clicking a category in the legend area highlights the datapoints belonging to that category
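
Here is a simplified sketch of how such data can be generated; the field names mirror the property list above, while the value ranges and category count are illustrative assumptions (the actual generator is linked at the top of the article):

```python
import json
import random

CATEGORIES = ["category_%d" % i for i in range(5)]  # assumed number of categories

def make_points(n):
    """Generate n random datapoints with the fields described above."""
    points = []
    for _ in range(n):
        point = {
            "x": random.uniform(0, 100),
            "y": random.uniform(0, 100),
            "category": random.choice(CATEGORIES),
        }
        for i in range(5):
            # attr_0 will drive the circle size in the spec
            point["attr_%d" % i] = random.random()
        points.append(point)
    return points

with open("data_1000.json", "w") as f:
    json.dump(make_points(1000), f)
```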

Variables

The graph described above was built for different values of:

  • dataset size (1000, 5000, 10,000, or 30,000 points)
  • format of the input data (CSV or JSON)
  • the renderer (canvas or svg)

giving us in total 4 × 2 × 2 = 16 different experiments. You can find and check all of them here.
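
The full grid of experiments is easy to enumerate programmatically; here is a sketch (the file-naming scheme is an assumption, not the generator's):

```python
from itertools import product

SIZES = [1000, 5000, 10_000, 30_000]
FORMATS = ["csv", "json"]
RENDERERS = ["canvas", "svg"]

# 4 sizes x 2 formats x 2 renderers = 16 experiment pages
for size, fmt, renderer in product(SIZES, FORMATS, RENDERERS):
    print(f"experiment_{size}_{fmt}_{renderer}.html")
```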

Observations

Here I describe my experiences and conclusions after testing the graphs in my setup: Linux, Chrome, and a decent i5-based laptop.

One thousand datapoints

I have no problems at all interacting with the visualization at a dataset size of 1000.

Neither do I perceive any difference between canvas and svg as the renderer.

Five thousand datapoints

At 5000 datapoints, it gets very slightly more sluggish, although I would probably not notice it if I didn't pay attention.

Surprisingly, svg performance is still on par with canvas.

Ten thousand datapoints

At 10k, it gets a little hot. Although the rendering itself takes quite a reasonable time, the “show attributes on hover” behaviour is now definitely sluggish (when using canvas as the renderer).

The svg renderer feels, again, similar to canvas in terms of rendering and, surprisingly, shows noticeably better “hover” performance. This is something I did not expect at all. My theory is that the svg renderer gets hit-testing (figuring out which circle we are hovering over) from the browser's native, compiled event handling, whereas with canvas Vega has to perform that test itself in JavaScript.

Thirty thousand datapoints

Well, both still work, I would say. They show me the picture without the waiting time being unbearable (though the overall impression is sluggish). With the svg renderer, I see “hover” working unreliably (sometimes showing nothing). Hovering in the canvas-based graph is more reliable, but the delay is very high, to the point of making it unusable.

All in all, I think that 30k is already too much for Vega to handle.

Performance profile

Here are some screenshots from the Chrome profiler to provide quantitative data.

[Images: Chrome profiler screenshots]

Conclusions

So we have these three competing demands:

  • responsiveness
  • size of data being displayed (the bigger the better)
  • speed of rendering

and we would want all three to be satisfied at the same time.

I would say, based on the experiments performed, that the sweet spot is somewhere around 10,000 datapoints. Anything above that, although somewhat workable at least up to 30k, is already not comfortable to interact with.

The most surprising discovery for me was that the svg renderer shows no immediately obvious inferiority compared with canvas, and in some respects and at certain data sizes in fact delivers better performance (I am referring to the hover behaviour at 10k).

Another noteworthy finding is that there is no noticeable performance difference between CSV and JSON as the data format.

So 10,000 seems to be OK, but another question is whether we really want to display this much data in the first place. Several thousand facts on a single chart is already too much for any human being to digest. Further scaling should therefore be possible by dynamically loading small portions of data and performing server-side aggregation, so that at any given moment the graphical frontend has a comfortably sized dataset to display (a sketch of this idea follows).
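
A minimal sketch of such server-side aggregation, assuming the data lives in a pandas DataFrame with the columns described earlier (the function name, bin count, and chosen aggregates are illustrative assumptions):

```python
import pandas as pd

def aggregate(df: pd.DataFrame, bins: int = 50) -> pd.DataFrame:
    """Bin raw points into a coarse grid so the frontend receives at most
    bins * bins * n_categories rows, regardless of the raw input size."""
    df = df.copy()
    # replace each coordinate by the midpoint of its bin
    df["x_bin"] = pd.cut(df["x"], bins=bins).apply(lambda iv: iv.mid)
    df["y_bin"] = pd.cut(df["y"], bins=bins).apply(lambda iv: iv.mid)
    return (
        df.groupby(["x_bin", "y_bin", "category"], observed=True)
          .agg(count=("x", "size"), attr_0=("attr_0", "mean"))
          .reset_index()
    )
```

The frontend would then load the aggregated output instead of the raw points, keeping the displayed dataset comfortably below the ~10k sweet spot found above.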