Rami On The Web: Scalar Profiling on Linux

Gathering information about a program’s behavior, specifically its use of resources, is called profiling. The most important “resource” in terms of high performance computing is runtime. Hence, a common profiling strategy is to find out how much time is spent in the different functions, and maybe even lines, of a code in order to identify hot spots, i.e., the parts of the program that require the dominant fraction of runtime. These hot spots are then analyzed for possible optimization opportunities.

Function profiling - gprof

The most widely used profiling tool is gprof from the GNU binutils package. gprof uses both instrumentation and sampling to collect a flat function profile as well as a callgraph profile, also called a butterfly graph. In order to activate profiling, the code must be compiled with an appropriate option (many modern compilers can generate gprof-compliant instrumentation; for GCC, use -pg) and run once. This produces a non-human-readable file gmon.out, to be interpreted by the gprof program. The flat profile contains information about execution times of all the program’s functions and how often they were called.
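To make the idea of a flat profile more concrete, here is a minimal Python sketch that records, for each function, how often it was called and how much time it took. This is only an illustration of the kind of data a flat profile holds, not how gprof is implemented: gprof gets its call counts from compiler-inserted instrumentation and its timings from sampling, whereas this sketch times every call directly. The names `profiled`, `cheap`, `expensive`, `workload` and `flat_profile` are made up for the example.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process "flat profile": call count and wall time per function.
calls = defaultdict(int)
seconds = defaultdict(float)

def profiled(func):
    """Instrument func so that every call is counted and timed."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            calls[func.__name__] += 1
            seconds[func.__name__] += time.perf_counter() - start
    return wrapper

@profiled
def cheap(n):
    return n + 1

@profiled
def expensive(n):
    return sum(i * i for i in range(n))

def workload():
    """Call cheap() often and expensive() rarely."""
    total = 0
    for i in range(100):
        total += cheap(i)
    for _ in range(5):
        total += expensive(20_000)
    return total

def flat_profile():
    """Return (name, calls, seconds) rows, most time-consuming function first."""
    rows = [(name, calls[name], seconds[name]) for name in calls]
    return sorted(rows, key=lambda row: row[2], reverse=True)

workload()
for name, n_calls, secs in flat_profile():
    print(f"{name:<10} {n_calls:>5} calls {secs:10.6f} s")
```

The function at the top of the sorted table is the hot spot candidate. Note that the sketch measures the time of each call including anything it calls; here the two functions do not call each other, so that distinction does not matter.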

Hardware performance counters - perf stat

Modern processors feature a small number of performance counters (often far fewer than ten), which are special on-chip registers that get incremented each time a certain event occurs. Among the usually several hundred events that can be monitored, there are a few that are most useful for profiling:

Number of bus transactions, i.e., cache line transfers
Number of loads and stores
Number of floating-point operations
Mispredicted branches
Pipeline stalls
Number of instructions executed

On Linux, we can count these events with perf stat. The perf tool supports a list of measurable events.

As an example:

 cpu-cycles OR cycles                       [Hardware event]
 instructions                               [Hardware event]
 cache-references                           [Hardware event]
 cache-misses                               [Hardware event]
 branch-instructions OR branches            [Hardware event]
 branch-misses                              [Hardware event]
 bus-cycles                                 [Hardware event]

 cpu-clock                                  [Software event]
 task-clock                                 [Software event]
 page-faults OR faults                      [Software event]
 minor-faults                               [Software event]
 major-faults                               [Software event]
 context-switches OR cs                     [Software event]
 cpu-migrations OR migrations               [Software event]
 alignment-faults                           [Software event]
 emulation-faults                           [Software event]

 L1-dcache-loads                            [Hardware cache event]
 L1-dcache-load-misses                      [Hardware cache event]
 L1-dcache-stores                           [Hardware cache event]
 L1-dcache-store-misses                     [Hardware cache event]
 L1-dcache-prefetches                       [Hardware cache event]
 L1-dcache-prefetch-misses                  [Hardware cache event]
 L1-icache-loads                            [Hardware cache event]
 L1-icache-load-misses                      [Hardware cache event]
 L1-icache-prefetches                       [Hardware cache event]
 L1-icache-prefetch-misses                  [Hardware cache event]
 LLC-loads                                  [Hardware cache event]
 LLC-load-misses                            [Hardware cache event]
 LLC-stores                                 [Hardware cache event]
 LLC-store-misses                           [Hardware cache event]

 LLC-prefetch-misses                        [Hardware cache event]
 dTLB-loads                                 [Hardware cache event]
 dTLB-load-misses                           [Hardware cache event]
 dTLB-stores                                [Hardware cache event]
 dTLB-store-misses                          [Hardware cache event]
 dTLB-prefetches                            [Hardware cache event]
 dTLB-prefetch-misses                       [Hardware cache event]
 iTLB-loads                                 [Hardware cache event]
 iTLB-load-misses                           [Hardware cache event]
 branch-loads                               [Hardware cache event]
 branch-load-misses                         [Hardware cache event]

 rNNN (see 'perf list --help' on how to encode it) [Raw hardware event descriptor]

 mem:<addr>[:access]                        [Hardware breakpoint]

 kvmmmu:kvm_mmu_pagetable_walk              [Tracepoint event]

 [...]

 sched:sched_stat_runtime                   [Tracepoint event]
 sched:sched_pi_setprio                     [Tracepoint event]
 syscalls:sys_enter_socket                  [Tracepoint event]
 syscalls:sys_exit_socket                   [Tracepoint event]
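The list above offers far more events than the handful of counters a processor has. As far as I understand it, when more hardware events are requested than there are counters, perf lets the events take turns on the counters and then scales each raw count by the ratio of the time the event was enabled to the time it was actually counting. That is what a "scaled from" note in the output refers to. The sketch below shows that scaling rule in Python; the function name `scale_count` and the numbers are made up for illustration.

```python
def scale_count(raw_count: int, time_enabled: float, time_running: float) -> float:
    """Estimate a full-run count from a counter that was active only part of the time."""
    if time_running <= 0:
        raise ValueError("the event was never scheduled on a counter")
    return raw_count * time_enabled / time_running

# Hypothetical numbers: the event was enabled for 1.0 s but only counted for 0.5 s,
# so the estimate is twice the raw count.
estimate = scale_count(raw_count=1_000_000, time_enabled=1.0, time_running=0.5)
print(f"estimated count: {estimate:,.0f}")
```

A scaled value is an estimate, not an exact count, so it is most trustworthy when the program behaves uniformly over the run.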

Example Run:

perf stat -B dd if=/dev/zero of=/dev/null count=1000000

1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

            5,099 cache-misses             #      0.005 M/sec (scaled from 66.58%)
          235,384 cache-references         #      0.246 M/sec (scaled from 66.56%)
        9,281,660 branch-misses            #      3.858 %     (scaled from 33.50%)
      240,609,766 branches                 #    251.559 M/sec (scaled from 33.66%)
    1,403,561,257 instructions             #      0.679 IPC   (scaled from 50.23%)
    2,066,201,729 cycles                   #   2160.227 M/sec (scaled from 66.67%)
              217 page-faults              #      0.000 M/sec
                3 CPU-migrations           #      0.000 M/sec
               83 context-switches         #      0.000 M/sec
       956.474238 task-clock-msecs         #      0.999 CPUs

       0.957617512  seconds time elapsed
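The values in the right-hand column of this output are simple ratios of the counts on the left. The short Python sketch below recomputes them from the numbers printed above, as a way to check how they are derived. The cache miss ratio is not printed by perf in this run; it is an extra ratio computed here for illustration.

```python
# Counts copied from the perf stat output above.
counts = {
    "cache-misses": 5_099,
    "cache-references": 235_384,
    "branch-misses": 9_281_660,
    "branches": 240_609_766,
    "instructions": 1_403_561_257,
    "cycles": 2_066_201_729,
}
task_clock_ms = 956.474238

# Instructions per cycle.
ipc = counts["instructions"] / counts["cycles"]
# Share of branches that were mispredicted, in percent.
branch_miss_pct = 100 * counts["branch-misses"] / counts["branches"]
# Share of cache references that missed, in percent (not shown by perf here).
cache_miss_pct = 100 * counts["cache-misses"] / counts["cache-references"]
# Cycles per second of CPU time, in millions.
cycles_m_per_sec = counts["cycles"] / (task_clock_ms / 1000) / 1e6

print(f"IPC:               {ipc:.3f}")
print(f"branch miss rate:  {branch_miss_pct:.3f} %")
print(f"cache miss ratio:  {cache_miss_pct:.3f} %")
print(f"cycle rate:        {cycles_m_per_sec:.3f} M/sec")
```

The IPC, the branch miss rate and the cycle rate come out as 0.679, 3.858 % and 2160.227 M/sec, the same values perf prints. Since most of the counts in this run are marked "scaled from", the ratios are estimates too.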

Links

Perf Manual

AMD & Intel Processors specs

https://docs.google.com/open?id=0B83rvqbRt-ksMEl4U05Od1hUS2c

https://docs.google.com/open?id=0B83rvqbRt-ksR1VXaWMyTUdIQWs

Source

Introduction to High Performance Computing

Tuesday, December 25, 2012
