Tip 11: gprof

From Vlsiwiki
Jump to: navigation, search

gprof is the GNU Profiler. It can be used to find out where your program is spending most of its time when running. This won't help if your code is broken, has an infinite loop, etc, but it can be used to improve performance on working code.

First, you should probably be using optimized code. There are many things that the compiler can improve. Specify -O3 in g++ or gcc. Anything more in unnecessary.

Second, you must recompile your code and specify the "-pg" flag. You must use this for compilation AND linking. This will add extra instructions for performance counters in your code.

Third, run your code on a typical (but fairly small) test case. You will see the normal output and behavior. This will generate the function call data in a file called gmon.out.

Last, after the program has finished run gprof with your executable:

gprof myprogram > gprof.out

This will use the binary and gmon.out to create the output file gprof.out which will contain a bunch of useful information such as this:


Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 52.10     11.90    11.90 1326020941     0.00     0.00  cieq
 26.71     18.00     6.10  3104531     0.00     0.00  findvec
  6.74     19.54     1.54   225037     0.00     0.00  com_measure_when
  3.50     20.34     0.80                             scannum
  1.84     20.76     0.42 19926865     0.00     0.00  spice_dstring_append
  1.75     21.16     0.40  3209201     0.00     0.00  gettok
  1.27     21.45     0.29     1468     0.00     0.00  BSIM4v2load
  0.57     21.58     0.13   826248     0.00     0.00  vec_fromplot
  0.53     21.70     0.12      726     0.00     0.03  do_measure
  0.48     21.81     0.11     1468     0.00     0.00  CAPload
  0.35     21.89     0.08   400828     0.00     0.00  ft_numparse
  0.31     21.96     0.07  4050975     0.00     0.00  copy
  0.26     22.02     0.06 19857822     0.00     0.00  spice_dstring_append_char
  0.26     22.08     0.06  6886171     0.00     0.00  tmalloc

This tells me that 52% of the time is spent in the "cieq" function. This is 11.9 seconds of wall time and a huge number of function calls. If you keep scrolling down, you will get to the "Call Graph" section:

                    Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) for 0.04% of 22.84 seconds
index % time    self  children    called     name
                                                 <spontaneous>
[1]     96.0    0.00   21.93                 main [1]
               0.00   21.92       1/1           ft_dorun [3]
               0.00    0.01       1/1           inp_spsource [63]
               0.00    0.00       1/1           ft_savedotargs [74]
               0.00    0.00       1/1           ft_cpinit [76]
               0.00    0.00       1/1           ft_cktcoms [103]
               0.00    0.00       1/8           cp_vset [154]
               0.00    0.00       1/1           sp_shutdown [252]
               0.00    0.00       1/1           SIMinit [257]
               0.00    0.00       4/4           read_initialisation_file [364]
               0.00    0.00       1/1           ivars [462]
               0.00    0.00       1/1           init_time [456]
               0.00    0.00       1/1           init_rlimits [455]
               0.00    0.00       1/1           append_to_stream [388]
-----------------------------------------------
               0.00   21.92       1/1           ft_dorun [3]
[2]     96.0    0.00   21.92       1         dosim [2]
               0.00   21.89       1/1           if_run [5]
               0.00    0.03       1/726         do_measure [7]
               0.00    0.00       1/1083        cp_getvar [52]
               0.00    0.00       1/4050975     copy [29]
               0.00    0.00       1/6886171     tmalloc [31]
               0.00    0.00       2/5662407     txfree [56]
-----------------------------------------------
               0.00   21.92       1/1           main [1]
[3]     96.0    0.00   21.92       1         ft_dorun [3]
               0.00   21.92       1/1           dosim [2]

etc.

This information is a bit difficult to read, but it is all based around the primary line such as this:

 [2]     96.0    0.00   21.92       1         dosim [2]

The lines ABOVE this line tell us what functions call this function. The ones BELOW this line are the functions called by this function. So, dosim is called by:

               0.00   21.92       1/1           ft_dorun [3]

one time and it calls these functions:

               0.00   21.89       1/1           if_run [5]
               0.00    0.03       1/726         do_measure [7]
               0.00    0.00       1/1083        cp_getvar [52]
               0.00    0.00       1/4050975     copy [29]
               0.00    0.00       1/6886171     tmalloc [31]
               0.00    0.00       2/5662407     txfree [56]

We can see that do_measure is only called once, but is called another 725 times by some other function.

If we go back to the initial problem with "cieq", we simply search for this function and find this:

               0.00    0.00       1/1326020941     plot_alloc [211]
               0.00    0.00       1/1326020941     plot_setcur [249]
               0.00    0.00      26/1326020941     vec_basename [163]
               0.00    0.00      84/1326020941     vec_eq [161]
               0.00    0.00     921/1326020941     plotInit [145]
               0.00    0.00    1658/1326020941     beginPlot [114]
               0.00    0.00  100188/1326020941     vec_get [10]
               0.00    0.00  200376/1326020941     measure_parse_when [18]
               0.00    0.00  400752/1326020941     measure_parse_trigtarg [15]
               0.99    0.00 110529682/1326020941     com_measure_when [16]
              10.90    0.00 1214787252/1326020941     findvec [12]
[13]    52.1   11.90    0.00 1326020941         cieq [13]

Most of these calls are from findvec (which is also the second CPU hog) but it is also called in many other functions too. Using this information, you can diagnose where the calls to cieq are coming from. So you can either 1) optimize a function or 2) call it less.