This page contains tips and tricks for capturing and understanding profiling reports.
Installation
Install valgrind, kcachegrind, and graphviz. For example, on Ubuntu:
sudo apt-get install valgrind kcachegrind graphviz
Compiling your example program
When profiling, you should use a representative example program.
Compile your example program with debug symbols (line numbers) enabled, but
still in optimized mode (-O2). That will allow you to drill down into
line-by-line performance costs while maintaining representative
performance. These builds are usually termed “release with debug info”.
This is different from “debug builds”, which typically have compiler
optimizations disabled (-O0).
In Bazel, build like this:
bazel build -c opt --copt=-g //foo/bar:example
In CMake, configure like this:
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo
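For instance, a complete configure-and-build sequence might look like the
following sketch, assuming an out-of-source build in a directory named
build (the source path and directory name are placeholders):
cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build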
Capturing a trace
Run valgrind with callgrind to get profiling data. For example:
valgrind --tool=callgrind bazel-bin/foo/bar/example
Each time this is run, it will create an output file named
callgrind.out.${PID}. Typically you’d capture just one trace and inspect it
on its own, but in advanced uses you can combine multiple traces.
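If you want the trace written to a predictable filename (e.g., for
scripting), callgrind accepts a --callgrind-out-file option. The output
name below is just an example:
valgrind --tool=callgrind \
  --callgrind-out-file=callgrind.out.example \
  bazel-bin/foo/bar/example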
Oftentimes the call patterns in Drake will be difficult to view under the
default instrumentation settings, so we recommend running with extra flags.
These flags will slow down data collection, but are usually worth it:
valgrind --tool=callgrind \
--separate-callers=10 \
bazel-bin/foo/bar/example
If you’re trying to micro-optimize a function, use --dump-instr=yes to see
per-instruction costs in the object code disassembly.
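For example, combining that with the flags above might look like this (the
--collect-jumps flag is optional, but it makes the disassembly view in
kcachegrind more informative):
valgrind --tool=callgrind \
  --separate-callers=10 \
  --dump-instr=yes \
  --collect-jumps=yes \
  bazel-bin/foo/bar/example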
Viewing the trace
Run kcachegrind to analyze profiling data. For example:
kcachegrind callgrind.out.19482
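If you only need a quick text summary without a GUI, valgrind also ships a
callgrind_annotate script that prints per-function costs to the terminal:
callgrind_annotate callgrind.out.19482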
Profiling Google Benchmark executables
Google Benchmark automatically tunes the number of iterations it runs for
each benchmark, which will skew the performance metrics you see in the
profiler. You may want to manually specify the number of iterations to run
for each benchmark.
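For example, newer versions of Google Benchmark accept an exact iteration
count on the command line via --benchmark_min_time with an "x" suffix
(older versions only accept a minimum time in seconds); the binary path,
filter name, and count below are placeholders:
valgrind --tool=callgrind \
  bazel-bin/foo/bar/example_benchmark \
  --benchmark_filter=BM_Example \
  --benchmark_min_time=1000x
Alternatively, you can hard-code the count in the benchmark source by
adding an Iterations() setting to the benchmark registration.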