|
Cachegrind:
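Cachegrind simulates the CPU's first-level caches (I1 and D1) and its last-level cache (LL), collects cache statistics while the application runs, and then reports both summary and detailed data.
1. Abbreviations used for the collected events: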
I cache reads (Ir, which equals the number of instructions executed), I1 cache read misses (I1mr) and LL cache instruction read misses (ILmr).
D cache reads (Dr, which equals the number of memory reads), D1 cache read misses (D1mr), and LL cache data read misses (DLmr).
D cache writes (Dw, which equals the number of memory writes), D1 cache write misses (D1mw), and LL cache data write misses (DLmw).
Conditional branches executed (Bc) and conditional branches mispredicted (Bcm).
Indirect branches executed (Bi) and indirect branches mispredicted (Bim).
Note that total D1 misses are given by D1mr + D1mw, and that total LL misses are given by ILmr + DLmr + DLmw.
2. Running Cachegrind:
valgrind --tool=cachegrind your_application
Cachegrind prints the following summary statistics when the program exits:
==31751== I refs: 27,742,716
==31751== I1 misses: 276
==31751== LLi misses: 275
==31751== I1 miss rate: 0.0%
==31751== LLi miss rate: 0.0%
==31751==
==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)
==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)
==31751== LLd misses: 23,085 ( 3,987 rd + 19,098 wr)
==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%)
==31751== LLd miss rate: 0.1% ( 0.0% + 0.4%)
==31751==
==31751== LL misses: 23,360 ( 4,262 rd + 19,098 wr)
==31751== LL miss rate: 0.0% ( 0.0% + 0.4%)
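To make the relationship between the raw counts and the reported totals concrete, the following Python sketch recomputes the miss totals and two of the miss rates from the read and write components in the summary above. The function name and dictionary keys are illustrative and not part of Cachegrind. Taking the LL miss rate relative to all instruction and data references is an assumption; it is consistent with the 0.0% shown in the report.

```python
# Raw event counts copied from the sample summary above.
counts = {
    "Ir": 27_742_716, "ILmr": 275,
    "Dr": 10_955_517, "D1mr": 21_905, "DLmr": 3_987,
    "Dw": 4_474_773, "D1mw": 19_280, "DLmw": 19_098,
}

def summarize(c):
    """Derive total misses and miss rates (as fractions) from raw event counts."""
    d_refs = c["Dr"] + c["Dw"]
    d1_misses = c["D1mr"] + c["D1mw"]
    lld_misses = c["DLmr"] + c["DLmw"]
    ll_misses = c["ILmr"] + lld_misses
    return {
        "D refs": d_refs,
        "D1 misses": d1_misses,
        "LLd misses": lld_misses,
        "LL misses": ll_misses,
        "D1 miss rate": d1_misses / d_refs,
        # Assumption: the LL rate is relative to all instruction + data refs.
        "LL miss rate": ll_misses / (c["Ir"] + d_refs),
    }

summary = summarize(counts)
for name, value in summary.items():
    if isinstance(value, float):
        print(f"{name}: {value:.4%}")
    else:
        print(f"{name}: {value:,}")
```

The integer totals match the report exactly. The percentages are printed with more digits than the one-decimal figures Cachegrind shows.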
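Cachegrind also writes more detailed results to an output file. The default file name is cachegrind.out.<pid>, where <pid> is the ID of the profiled process. The --cachegrind-out-file option lets you choose a more readable name. This file is the input to cg_annotate.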
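As a sketch of how the output file feeds cg_annotate, the snippet below assembles the two command lines in Python and runs them only when the tools are installed. The helper names, the program name ./your_application, and the output file name cachegrind.out.myrun are placeholders chosen for illustration.

```python
import shutil
import subprocess

def cachegrind_command(program_args, out_file):
    """Build the argv for a Cachegrind run that writes to a named output file."""
    return ["valgrind", "--tool=cachegrind",
            f"--cachegrind-out-file={out_file}", *program_args]

def annotate_command(out_file):
    """Build the argv for cg_annotate, which reads the Cachegrind output file."""
    return ["cg_annotate", out_file]

def profile(program_args, out_file):
    """Run Cachegrind, then cg_annotate; return the report, or None if the tools are missing."""
    if shutil.which("valgrind") is None or shutil.which("cg_annotate") is None:
        return None
    subprocess.run(cachegrind_command(program_args, out_file), check=True)
    result = subprocess.run(annotate_command(out_file),
                            check=True, capture_output=True, text=True)
    return result.stdout

# Show the command that would be executed (placeholder program and file names).
print(" ".join(cachegrind_command(["./your_application"], "cachegrind.out.myrun")))
```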
3. cg_annotate:
cg_annotate <filename>
The output of cg_annotate starts with the following summary information:
I1 cache: 65536 B, 64 B, 2-way associative
D1 cache: 65536 B, 64 B, 2-way associative
LL cache: 262144 B, 64 B, 8-way associative
Command: concord vg_to_ucode.c
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Threshold: 99%
Chosen for annotation:
Auto-annotation: off
cg_annotate then prints the detailed function-by-function statistics:
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc
5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word
2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp
2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash
2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower
1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert
897,991 51 51 897,831 95 30 62 1 1 ???:???
598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile
598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile
598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc
446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing
341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER
320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table
298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create
149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0
149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0
95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node
85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue
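Note: a dot in a column means that the event did not occur in that function. An entry shown as ???:??? means that the file name and function name could not be determined from the debug info. If the program was compiled without -g, there will be many such unknown entries.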
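When post-processing this table in a script, each row can be split into its event counts and its file:function label. The Python sketch below assumes the nine-event column order shown in the header above and stores a dot as None; the function name parse_row is hypothetical.

```python
EVENTS = ["Ir", "I1mr", "ILmr", "Dr", "D1mr", "DLmr", "Dw", "D1mw", "DLmw"]

def parse_row(line, events=EVENTS):
    """Split one function-by-function row into (event counts, 'file:function')."""
    fields = line.split()
    raw_counts = fields[:len(events)]
    location = " ".join(fields[len(events):])
    counts = {}
    for event, raw in zip(events, raw_counts):
        # A dot means no count was recorded for this event; keep it as None.
        counts[event] = None if raw == "." else int(raw.replace(",", ""))
    return counts, location

row = "2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp"
counts, location = parse_row(row)
print(location, counts["Ir"], counts["Dw"])  # vg_main.c:strcmp 2649248 None
```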
4. Line-by-line annotation:
Running cg_annotate <filename> concord.c prints line-by-line statistics for concord.c, as follows:
--------------------------------------------------------------------------------
-- User-annotated source: concord.c
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
. . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
3 1 1 . . . 1 0 0 {
. . . . . . . . . FILE *file_ptr;
. . . . . . . . . Word_Info *data;
1 0 0 . . . 1 1 1 int line = 1, i;
. . . . . . . . .
5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
. . . . . . . . .
4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
. . . . . . . . .
. . . . . . . . . /* Open file, check it. */
6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
2 0 0 1 0 0 . . . if (!(file_ptr)) {
. . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
1 1 1 . . . . . . exit(EXIT_FAILURE);
. . . . . . . . . }
. . . . . . . . .
165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
146,712 0 0 73,356 0 0 73,356 0 0 insert(data->word, data->line, table);
. . . . . . . . .
4 0 0 1 0 0 2 0 0 free(data);
4 0 0 1 0 0 2 0 0 fclose(file_ptr);
3 0 0 2 0 0 . . . }
5. cg_diff file1 file2
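Compares two Cachegrind output files. This is useful when you measure the performance of a feature, make some changes, and then want to compare the results before and after.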
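Conceptually, comparing two profiles amounts to subtracting the counts of the first run from those of the second, function by function. The Python sketch below illustrates that idea on made-up Ir counts. It is not cg_diff's implementation, and the function names and numbers are hypothetical.

```python
def diff_counts(before, after):
    """Return after - before for every function seen in either profile."""
    functions = sorted(set(before) | set(after))
    return {f: after.get(f, 0) - before.get(f, 0) for f in functions}

# Hypothetical Ir counts per function from two runs of the same program.
before = {"concord.c:hash": 2_500_000, "concord.c:insert": 1_500_000}
after = {"concord.c:hash": 1_900_000, "concord.c:insert": 1_500_000,
         "concord.c:cache_lookup": 120_000}

for function, delta in diff_counts(before, after).items():
    print(f"{delta:+,} {function}")
```

A negative value means the function executed fewer instructions after the change.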
6. Cachegrind command-line options:
--cache-sim=no|yes [yes]
Specifies whether to collect cache access and miss counts.
--branch-sim=no|yes [no]
Specifies whether to collect branch instruction and misprediction counts.
7. cg_annotate command-line options:
--show=A,B,C [default: all, using order in cachegrind.out.<pid>]
Specifies which event columns to show, e.g. --show=D1mr,DLmr or --show=DLmr,DLmw.
--sort=A,B,C [default: order in cachegrind.out.<pid>]
Specifies the events by which the function-by-function entries are sorted.
--threshold=X [default: 0.1%]
Sets the threshold for the function-by-function summary. A function is shown only if it accounts for more than X% of the counts for the primary sort event. If auto-annotating, this also affects which files are annotated.
Note: thresholds can be set for more than one of the events by appending any events for the --sort option with a colon and a number (no spaces, though). E.g. if you want to see each function that covers more than 1% of LL read misses or 1% of LL write misses, use this option:
--sort=DLmr:1,DLmw:1
--auto=<no|yes> [default: no]
When enabled, automatically annotates every file that is mentioned in the function-by-function summary that can be found. Also gives a list of those that couldn't be found.
--context=N [default: 8]
Print N lines of context before and after each annotated line. Avoids printing large sections of source files that were not executed. Use a large number (e.g. 100000) to show all source lines.
-I<dir> --include=<dir> [default: none]
Specifies a directory to search for source files. Repeat -I/--include to add more directories.
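The --threshold rule described above can be illustrated with a short Python sketch: a function is kept when its share of the primary sort event exceeds X percent. The Ir counts are taken from a few rows of the sample table. Treating their sum as the program total is a simplifying assumption made only for this example, and the helper name is hypothetical.

```python
def above_threshold(counts, percent):
    """Keep functions whose count exceeds `percent`% of the total, largest first."""
    total = sum(counts.values())
    kept = {f: n for f, n in counts.items() if n * 100 > percent * total}
    return sorted(kept, key=kept.get, reverse=True)

# Ir counts from four rows of the sample table (assumed to be the whole program).
ir = {
    "getc.c:_IO_getc": 8_821_482,
    "concord.c:get_word": 5_222_023,
    "concord.c:hash": 2_521_927,
    "concord.c:new_word_node": 95_983,
}

print(above_threshold(ir, 1))
```

With a 1% threshold, concord.c:new_word_node drops out because it accounts for less than 1% of the summed Ir counts.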
Callgrind:
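1. Profiling a specific code fragment precisely:
--instr-atstart=no Start the program with this option set to no so that Callgrind does not collect any profile data at startup. When the program is about to reach the code you want to measure, run callgrind_control -i on in another terminal window. For more precise measurement, include <valgrind/callgrind.h> and invoke the CALLGRIND_START_INSTRUMENTATION macro immediately before the code fragment and CALLGRIND_STOP_INSTRUMENTATION immediately after it.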
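The two-terminal workflow above can be scripted. The following Python sketch only assembles the two command lines: one to start the program with instrumentation off, and one to switch it on or off later with callgrind_control. The helper names and ./your_application are placeholders, and running the commands is left to the caller.

```python
def callgrind_command(program_args, instr_atstart=False):
    """Build the argv that starts a program under Callgrind."""
    flag = "yes" if instr_atstart else "no"
    return ["valgrind", "--tool=callgrind", f"--instr-atstart={flag}", *program_args]

def instrumentation_command(enable):
    """Build the callgrind_control argv that turns instrumentation on or off."""
    return ["callgrind_control", "-i", "on" if enable else "off"]

# Terminal 1: start the program without collecting data.
print(" ".join(callgrind_command(["./your_application"])))
# Terminal 2: enable instrumentation just before the code of interest runs.
print(" ".join(instrumentation_command(True)))
```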
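2. Dumping statistics for specific functions (Callgrind command-line options):
--dump-before=function: dump the statistics to a file when entering the function.
--dump-after=function: dump the statistics to a file when leaving the function.
--zero-before=function: reset all counters to zero when entering the function. To reset the counters at an exact point, call the CALLGRIND_ZERO_STATS macro in the code.
These options can be given multiple times to specify several functions.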
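3. Callgrind --cache-sim=yes Setting this option to yes makes Callgrind simulate cache behavior, which yields additional cache statistics.
Callgrind --branch-sim=yes Setting this option to yes makes Callgrind simulate branch prediction, which helps expose performance problems such as those caused by inefficient switch statements.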
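4. Callgrind command-line options:
1) --callgrind-out-file=<file>
Writes the profile data to the given file instead of the file generated by the default naming rule.
2) --dump-line=<no|yes> [default: yes]
Counts events at source-line granularity. This requires the program to be compiled with -g.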
3) --collect-systime=<no|yes> [default: no]
This specifies whether information for system call times should be collected.
5. callgrind_annotate command-line options (most are the same as for cg_annotate; the following two are specific to callgrind_annotate):
1) --inclusive=<yes|no> [default: no]
When computing costs, adds the cost of callees to the cost of the caller.
2) --tree=<none|caller|calling|both> [default: none]
Print for each function their callers, the called functions or both.
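The difference between self cost and inclusive cost can be shown with a small Python sketch. It assumes a tiny acyclic call tree in which every function has exactly one caller; the cost numbers and the call relationships are made up for illustration and do not come from a real Callgrind run.

```python
def inclusive_cost(function, self_cost, callees):
    """Self cost of `function` plus the inclusive cost of everything it calls."""
    return self_cost[function] + sum(
        inclusive_cost(callee, self_cost, callees)
        for callee in callees.get(function, [])
    )

# Hypothetical self costs (e.g. Ir) and call relationships.
self_cost = {"main": 10, "init_hash_table": 300, "insert": 1500, "hash": 2500}
callees = {"main": ["init_hash_table"], "init_hash_table": ["insert"], "insert": ["hash"]}

for name in self_cost:
    print(name, self_cost[name], inclusive_cost(name, self_cost, callees))
```

With --inclusive=no a report would correspond to the self-cost column, while --inclusive=yes corresponds to the last column, where main carries the cost of the whole tree.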
Helgrind:
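1. --track-lockorders=no|yes [default: yes]
Controls whether Helgrind checks lock acquisition order while the program runs. If you are not concerned with this class of problem for now, consider turning it off temporarily.
2. --read-var-info=yes
Gives more detailed descriptions of the addresses involved, including where the variables are declared.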
http://www.cnblogs.com/stephen-liu74/archive/2011/06/05/2073341.html
|
|