Monday 22 January 2018

java.il talk on Flamegraphs and the JVM

Thanks to everyone who came to the talk. The slides are available here.

Thanks for Java.il for making it happen.

Video from the talk (in hebrew) to appear soon.

Saturday 30 December 2017

Flame graph animations

Flame graphs 

Flame graph is a wonderful visualisation tool for sampling profiling of code-paths.
For me, the best thing about it is that once you understand what any property of a rectangle (i.e. a frame) stands for, the information encapsulated inside the flame graph becomes very intuitive.
It is just so natural and simple to compare the size or position of a rectangle rather then reading and matching numbers.

If we're talking CPU flame graphs for example, by looking at a single image we can see:
- All the stack trace samples
- Self time of each frame
- Total time of each frame
- The difference between each of them (in terms of time on the CPU)
- How did the code branch out of a specific frame
- How did we arrive at a specific frame
- What is the type of each frame (with the help of a colour scheme)

The only thing we don't get from a flame graph is the concept of change over time.
There is no time axis in flame graphs.
What we see is a snapshot of all the samples that were taken at a specific (usually very small) time window.
In this blog entry we'll assume some previous knowledge of flamegraphs.
For an introduction and a lot more documentation on the topic look here.

Why change over time is interesting

If we take on-CPU stack traces, knowing the pattern of CPU usage of, for example, "socket-write" could tell us a lot about the process. This in itself is already interesting as a lot of the value in flamegraphs is insights we gain from it about simply how the process works.
Knowing the pattern of usage over time could also lead us to discover correlations between different (unrelated as far as we know) code paths which are hard to see today with a single image.

Capturing change over time in flame graphs


Several approaches were attempted to compare between flamegraphs that were taken at different points of time. These types of flamegraphs tried to compare and highlight the differences between the two sets of data.
Very useful for checking performance regression issues between two different runs for example but not necessarily to show the change over time for a single process.
The ideas so far were either place two graphs side by side or create a new graph where the colour will convey growth or reduction in the measurement.
These approaches are covered here.

On the left is a red-blue differential flame graph I created for a simple Java webserver (dropwizard example).
red = growth, blue = reduction. svg source

Issues with current approach


The main issue with this approach is that it is impossible to show a frame that disappeared (its measurement changed to 0 therefore the size of rectangle will be 0).

Another problem is that unlike a "normal" flame graph, a differential red/blue graph seems a lot less intuitive and you need to constantly remind yourself what red or blue means, making the real visualisation in your head.

In JIT environments such as Java flame graphs, this approach also loses the information of a change to a frame type (e.g. a frame becomes inlined).

Flame Graph animation


I'd like to propose a different approach to adding the concept of change over time to flame graphs - Flame graph animation.

A combination of SVG elements and javascript code allows us to create a SVG animation.

The basic idea is to shift the rectangles and move between one flamegraph image to another.

So far, other grav flamegraph utilities used to generate data from specific sources or modify the data in the folded stacks file. This time we need to modify the end product i.e. generate a different svg file.

In order to generate the svg animation I modified  the original flamegraph.pl file.

To make it cleaner and similar to the original, I added a new intermediate file format - The  "animated folded stacks format" which is very close to the original folded file format but instead of having a single value next to every stack trace, it has multiple values (one for each phase of the animation).
In the animated file,  0 is also a legal value (indicating the stack does not appear at all in this phase).

For example:

a;b;c;d;e 1 2 3 4 0

In this example, the stack a;b;c;d;e has the values 1 in phase 1, 2 in phase 2 etc. until it disappears in phase 5.
We read any number of folded files (and sort them to phases alphabetically). Then we generate the new format by merging the files using a python script.

This new intermediate file is being fed into the new animated_flamegraph.pl perl script

The result is a flamegraph animation.

Note that in order to start the animation you must click the "Animate" button on the top right corner after the page has finished loading.
During the animation, the "Animate" text will change into the phase number and the rectangles will start to "dance".

Synthetic example

Let's start from a very simple example. Three phases:

Phase 1

a;b;c;d;e;f 2
a;b;c 3
a;x;y;z 5
a;b;c;d 2
a;b;c;d;g 3
a;i 3

Phase 2

a;b;c;d;e;f 3
a;b;c 4
a;x;y;z 2
a;i 7

Phase 3

a;b;c;d;e;f 6
a;b;c 2
a;x;y;z 4
a;q;r;s;t;u 8

Running our merge script produces:

a;b;c;d;e;f 2 3 6
a;i 3 7 0
a;x;y;z 5 2 4
a;q;r;s;t;u 0 0 8
a;b;c 3 4 2
a;b;c;d 2 0 0
a;b;c;d;g 3 0 0

And our animated_flamegraph.pl script will generate:

Please use a modern browser to view SVG
Click on the animate button at the top right corner to start the animation.

Real world example


This example shows a Java dropwizard webserver starting to serve clients.
It was taken with the "aggregate threads by prefix" option turned on. (All worker threads represented in a single "flame". This aggregation is part of grav).

It is the same profiling used for the differential flame graph above. The extra information contained in the animated flamegraph means the SVG file size increased from 1MB (in the differential flamegraph) to around 6MB.

However, as we already know, animated flamegraphs are not limited to comparison between two states.
The animation below is taken from 5 different phases of the same dropwizard process (previous animation used only phase 2 & 4)

In this animation (link below) we can actually see the JVM in action.
Notice how at the start there are no "dw-*" worker threads.
It then moves to a phase dominated by two compiler threads.
Later phases are dominated by the worker "dw-*" threads.
Notice also how the GC threads suddenly raise their heads at the end.
Looking carefully at the stacks also shows a few changes in type.
Frame Lorg/eclipse/jetty/server/handler/HandlerWrapper:::handle gets inlined between phase 3 and 4
and Lorg/eclipse/jetty/util/thread/strategy/EatWhatYouKill:::run does the same between phase 4 and 5 (among others).

Note: This "Five flamegraphs in one" does come with a price. It grew to a massive 22.5MB SVG source file (see Future work). Some browsers even refuse to load it (In my experience, Google chrome seems to be the browser that handles big SVG files the best).
Wait for the SVG file to finish loading, then click the animate button...


Future Work


Better control over animation

  • Pause button
  • Control animation speed
  • Restart animation (Currently this is only done by reloading the page).
  • Go Back / Forward
  • Have a time line to control movement (like videos)

Highlight differences between phases

Either with colour, size or possibly the order in which we modify the rectangles.

File Size

Because the SVG file contains all the information as multiple SVG flame graphs (+ the animation elements), it can get very big, very quickly.
The dropwizard example above exceeds 22MB.
This size of SVG graphics is problematic as some browsers refuse to even display it.
One avenue to explore is how to reduce the size of the file. Maybe reduce granularity level? (avoid tiny movements or tiny frames)? Allow to filter by regex at creation?

Better Scripts Reuse / Refactor 

In order to still be able to enjoy changes / fixes / updates to the original flamegraph.pl


Try rectangle size relative only to current phase

Currently, rectangle size is calculated in a global context which does give us a view of the difference in the amount of activity between phases.
However, if that difference is huge, we get phases which only take a tiny proportion of the window and are hard to see.

Where is the code?

After collaborating on several flame graph utilities with Mark Price in the grav toolkit, it seems only natural to put it there as well.

Have a look at the scripts here under src/animation.
There is also a bash script that will take several snapshots (including generating a full flamegraph) and then gerneate an animation from all their data here
Give it a try,  send feedback, contribute, enjoy.