Tip 11: gprof
gprof is the GNU Profiler. It can be used to find out where your program is spending most of its time when running. This won't help if your code is broken, has an infinite loop, etc, but it can be used to improve performance on working code.
First, you should probably be using optimized code. There are many things that the compiler can improve. Specify -O3 in g++ or gcc. Anything more in unnecessary.
Second, you must recompile your code and specify the "-pg" flag. You must use this for compilation AND linking. This will add extra instructions for performance counters in your code.
Third, run your code on a typical (but fairly small) test case. You will see the normal output and behavior. This will generate the function call data in a file called gmon.out.
Last, after the program has finished run gprof with your executable:
gprof myprogram > gprof.out
This will use the binary and gmon.out to create the output file gprof.out which will contain a bunch of useful information such as this:
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 52.10 11.90 11.90 1326020941 0.00 0.00 cieq 26.71 18.00 6.10 3104531 0.00 0.00 findvec 6.74 19.54 1.54 225037 0.00 0.00 com_measure_when 3.50 20.34 0.80 scannum 1.84 20.76 0.42 19926865 0.00 0.00 spice_dstring_append 1.75 21.16 0.40 3209201 0.00 0.00 gettok 1.27 21.45 0.29 1468 0.00 0.00 BSIM4v2load 0.57 21.58 0.13 826248 0.00 0.00 vec_fromplot 0.53 21.70 0.12 726 0.00 0.03 do_measure 0.48 21.81 0.11 1468 0.00 0.00 CAPload 0.35 21.89 0.08 400828 0.00 0.00 ft_numparse 0.31 21.96 0.07 4050975 0.00 0.00 copy 0.26 22.02 0.06 19857822 0.00 0.00 spice_dstring_append_char 0.26 22.08 0.06 6886171 0.00 0.00 tmalloc
This tells me that 52% of the time is spent in the "cieq" function. This is 11.9 seconds of wall time and a huge number of function calls. If you keep scrolling down, you will get to the "Call Graph" section:
Call graph (explanation follows) granularity: each sample hit covers 2 byte(s) for 0.04% of 22.84 seconds index % time self children called name <spontaneous> [1] 96.0 0.00 21.93 main [1] 0.00 21.92 1/1 ft_dorun [3] 0.00 0.01 1/1 inp_spsource [63] 0.00 0.00 1/1 ft_savedotargs [74] 0.00 0.00 1/1 ft_cpinit [76] 0.00 0.00 1/1 ft_cktcoms [103] 0.00 0.00 1/8 cp_vset [154] 0.00 0.00 1/1 sp_shutdown [252] 0.00 0.00 1/1 SIMinit [257] 0.00 0.00 4/4 read_initialisation_file [364] 0.00 0.00 1/1 ivars [462] 0.00 0.00 1/1 init_time [456] 0.00 0.00 1/1 init_rlimits [455] 0.00 0.00 1/1 append_to_stream [388] ----------------------------------------------- 0.00 21.92 1/1 ft_dorun [3] [2] 96.0 0.00 21.92 1 dosim [2] 0.00 21.89 1/1 if_run [5] 0.00 0.03 1/726 do_measure [7] 0.00 0.00 1/1083 cp_getvar [52] 0.00 0.00 1/4050975 copy [29] 0.00 0.00 1/6886171 tmalloc [31] 0.00 0.00 2/5662407 txfree [56] ----------------------------------------------- 0.00 21.92 1/1 main [1] [3] 96.0 0.00 21.92 1 ft_dorun [3] 0.00 21.92 1/1 dosim [2]
etc.
This shows for each function how many times it calls the sub-functions. In main, for example, this calls ft_dorun one time and this is 21.92 out of 21.93 seconds of total run-time. I typically don't use this call graph information much, but it can help you diagnose where the calls to cieq are coming from. So you can either 1) optimize cieq or 2) call it less.