Wink Saville’s Blog

September 8, 2009

Using performance counters on x86; perf

Filed under: linux — wink @ 6:34 am

This email from Ingo Molnar gives a simple introduction to using perf on an x86 machine to gather information about system performance.

Do you have an x86 box to test it on?

If yes then perfcounters can be used for _much_ more precise
measurements that you can trust. Do something like this:

perf stat -a --repeat 3 sleep 1

The ‘-a/--all’ option will measure all CPUs - everything: IRQ
context, irqs-off regions, etc. That output will be comparable before
your threaded patch and after the patch.

Here’s an example. I started one infinite loop on a testbox, which
is using 100% of a single CPU. The system-wide stats look like this:

# perf stat -a --repeat 3 sleep 1

Performance counter stats for ‘sleep 1’ (3 runs):

16003.320239  task-clock-msecs  #     15.993 CPUs    ( +-   0.044% )
94  context-switches            #      0.000 M/sec   ( +-  11.373% )
3  CPU-migrations               #      0.000 M/sec   ( +-  25.000% )
170  page-faults                #      0.000 M/sec   ( +-  0.518% )
3294001334  cycles              #    205.832 M/sec   ( +-  0.896% )
1088670782  instructions        #      0.331 IPC     ( +-  0.905% )
1720926  cache-references       #      0.108 M/sec   ( +-  1.880% )
61253  cache-misses             #      0.004 M/sec   ( +-  4.401% )

1.000623219  seconds time elapsed   ( +-   0.002% )

The instruction count or the cycle count will go up or down,
precisely according to how the threaded handlers behave. These stats
are not time-sampled but ‘real’ counts, so they reflect reality and
show whether your workload had to spend more (or fewer) cycles,
instructions, etc.
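The derived figures in the right-hand column can be reproduced from the raw counters. The sketch below uses the numbers from the run above; the function names are made up for illustration, and the assumption that the M/sec rates are normalized by task-clock time (rather than elapsed time) is inferred from the numbers, not taken from the perf documentation.

```python
# Raw values copied from the first perf stat run (one busy loop).
task_clock_msecs = 16003.320239
cycles = 3294001334
instructions = 1088670782
elapsed_seconds = 1.000623219


def cpus_utilized(task_clock_msecs, elapsed_seconds):
    """CPU time consumed divided by wall-clock time."""
    return task_clock_msecs / (elapsed_seconds * 1000.0)


def rate_m_per_sec(count, task_clock_msecs):
    """Events per second of task-clock time, in millions (assumed normalization)."""
    return count / (task_clock_msecs / 1000.0) / 1e6


def ipc(instructions, cycles):
    """Instructions retired per cycle."""
    return instructions / cycles


print(f"{cpus_utilized(task_clock_msecs, elapsed_seconds):.3f} CPUs")
print(f"{rate_m_per_sec(cycles, task_clock_msecs):.3f} M/sec")
print(f"{ipc(instructions, cycles):.3f} IPC")
```

With sixteen CPUs measured for one second, a task-clock of about 16000 msecs is what one would expect from ‘-a’, regardless of how busy the CPUs are.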

I started a second loop in addition to the first one, and perf stat
now gives me this output:

# perf stat -a --repeat 3 sleep 1

Performance counter stats for ‘sleep 1’ (3 runs):

16003.289509  task-clock-msecs  #     15.994 CPUs    ( +-   0.046% )
88  context-switches            #      0.000 M/sec   ( +-  15.933% )
2  CPU-migrations               #      0.000 M/sec   ( +-  14.286% )
188  page-faults                #      0.000 M/sec   ( +-   9.414% )
6481963224  cycles              #    405.039 M/sec   ( +-   0.011% )
2152924468  instructions        #      0.332 IPC     ( +-   0.054% )
397564  cache-references        #      0.025 M/sec   ( +-   1.217% )
59835  cache-misses             #      0.004 M/sec   ( +-   3.732% )

1.000576354  seconds time elapsed   ( +-   0.005% )

Compare the two results:

before (one loop):
3294001334  cycles              #    205.832 M/sec   ( +-  0.896% )
1088670782  instructions        #      0.331 IPC     ( +-  0.905% )

after (two loops):
6481963224  cycles              #    405.039 M/sec   ( +-   0.011% )
2152924468  instructions        #      0.332 IPC     ( +-   0.054% )

The cycles/sec roughly doubled (205.832 to 405.039 M/sec) - as
expected. You could do the same with your test and not have to rely
on the very imprecise (and often misleading) ‘top’ statistics for
kernel development.

The IPC (instructions per cycle) factor stayed roughly constant -
showing that both workloads can push the same number of instructions
when normalized to a single CPU. If a workload becomes very
cache-missy or executes a lot of system calls then the IPC factor
goes down; if it becomes more optimal ‘tight’ code then the IPC
factor goes up.
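As a rough check of the comparison, the ratio between the two runs and the IPC of each can be computed directly from the counters quoted in the two outputs. This is only a sketch; the dictionary layout and helper name are illustrative.

```python
# Counter values copied from the two perf stat outputs.
one_loop = {"cycles": 3294001334, "instructions": 1088670782}
two_loops = {"cycles": 6481963224, "instructions": 2152924468}


def ipc(run):
    """Instructions per cycle for one measurement."""
    return run["instructions"] / run["cycles"]


# How much each counter grew when the second loop was added.
cycles_ratio = two_loops["cycles"] / one_loop["cycles"]
instructions_ratio = two_loops["instructions"] / one_loop["instructions"]

print(f"cycles grew by a factor of       {cycles_ratio:.3f}")
print(f"instructions grew by a factor of {instructions_ratio:.3f}")
print(f"IPC with one loop:  {ipc(one_loop):.3f}")
print(f"IPC with two loops: {ipc(two_loops):.3f}")
```

Both counters grow by nearly the same factor, just under two, which is why their quotient, the IPC, barely moves.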

(The cache-miss rate was very low in both cases - it’s a simple
infinite loop I tested.)

Furthermore the error bars in the rightmost column help you know
whether any difference in results is statistically significant, or
within the noise level.
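One simple way to apply the error bars is to turn each "+- x%" into an interval around the measured value and see whether the two intervals overlap. The sketch below is a rough heuristic under that assumption, not a formal statistical test, and the helper names are hypothetical.

```python
def noise_interval(mean, rel_err_percent):
    """Return (low, high) bounds for a value with a relative +- error in percent."""
    half_width = mean * rel_err_percent / 100.0
    return mean - half_width, mean + half_width


def outside_noise(a, b):
    """True if the intervals of two (mean, rel_err_percent) pairs do not overlap."""
    low_a, high_a = noise_interval(*a)
    low_b, high_b = noise_interval(*b)
    return high_a < low_b or high_b < low_a


# (value, +- percent) pairs copied from the two runs above.
cycles_one, cycles_two = (3294001334, 0.896), (6481963224, 0.011)
misses_one, misses_two = (61253, 4.401), (59835, 3.732)

print("cycles differ beyond noise:      ", outside_noise(cycles_one, cycles_two))
print("cache-misses differ beyond noise:", outside_noise(misses_one, misses_two))
```

By this measure the change in cycles is far outside the noise, while the small difference in cache-misses is not.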

Hope this helps,

Ingo
