All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 0/3] collect LBR callstack together with thread stack data
@ 2019-08-09 15:16 Alexey Budankov
  2019-08-09 15:23 ` [PATCH v1 1/3] perf record: enable LBR callstack capture jointly with thread stack Alexey Budankov
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Alexey Budankov @ 2019-08-09 15:16 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, linux-kernel
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Andi Kleen, Kan Liang, Jin, Yao


The patch set unblocks collection of LBR call stack data simultaneously with
raw thread stack data by --call-graph dwarf,SIZE option:

  $perf record -g --call-graph dwarf,1024 -j stack,u -- stack_test

Collected LBR call stack can be used to augment dwarf call stack calculated 
from the raw thread stack data and to provide more comprehensive call stack 
information for cases when collected SIZE is not enough to cover complete 
thread stack.

Such cases are typical for workloads that allocate large arrays of data on 
its threads stacks or the possible SIZE to collect can't be large enough due 
to workload nature or system configuration and this is where hardware 
captured LBR call stacks can provide missing stack frames. Possible dwarf plus 
LBR call stacks consolidation algorithm description follows.

With this patch set perf report command UI currently ignores collected LBR 
call stack data and still provides dwarf based call stacks information.

===========================================================================

Overview:

   Legend:

   THS - thread stack
   CTX - thread register context
   SWS - software stack
   SSF - skipped stack frames
   PSS - Perf sample stack

   ip,sp,bp - HW registers values
   d        - allocated stack regions
   kip      - ip address in the kernel space
   K        - captured thread stack size

        THS

       -----
       |   |<-stack bottom
        ... 
       |---|
       |ip4|
       |---|         PSS = SWS(THS(K))
       |   |
   --> |   |
   |   |d3 |                  user/
   |   |---|         user PSS kernel PSS
   |   |ip3|         ------   ------
   |   |---|         |SSF |   |SSF |
   |   |   |          ....     ....
   |   |   |         ------   ------
   |   |d2 |         | -1 |   | -1 |
       |---|   user  ------   ------
   K   |ip2|   CTX   |ip3 |   |ip3 |
       |---|         |----|   |----|
   |   |d1 |   ...   |ip2 | , |ip2 |
   |   |---|  |---|  |----|   |----|
   |   |ip1|  |bp0|  |ip1 |   |ip1 |
   |   |---|  |---|  |----|   |----|
   |   |   |  |ip0|->|ip0 |   |ip0 |<-user stack top
   |   |   |  |---|  ------   ------
   |   |   |<-|sp0|<-stack    |kip0|<-kernel stack bottom
   --> -----  -----   top     |----|
                              |kip1|
                              |----|
		              |kip2|
		              |----|
                               ....
			      |    |<-kernel stack top
                              ------

Algorithm details:

   Legend:

   HWS - hardware stack
   K-SWS - kernel software stack

			 BRANCH
			 TABLE

		 HWS      ip   ip
			  from to
		 ------  -----------
		 |ip7`|  |ip7`|    |
		 |----|  |----|----|
		 |ip6`|  |ip6`|    |
	user PSS |----|  |----|----|
		 |ip5`|  |ip5`|    |
	------   |----|  |----|----|
	| -1 |   |ip4`|  |ip4`|    |
	------   |----|  |----|----|
	|ip3 |~~~|ip3`|  |ip3`|    |
	|----|   |----|  |----|----|
	|ip2 |~~~|ip2`|  |ip2`|    |
	|----| 	 |----|  |----|----|
	|ip1 |~~~|ip1`|  |ip1`|ip0`|
	|----| 	 |----|  -----------
	|ip0 |~~~|ip0`|<---------'
	------   ------

	1. if (sym(ipj) == sym(ipj`)), j=0-3 ===> user PSS
	2. ipj`                      , j=4-7 ===> user PSS

Augmented PSS = A_SWS(SWS(THS(K)), HWS):

	         user/
       user PSS  kernel PSS

	------   ------
	|ip7`|   |ip7`|<-user PSS bottom
	|----|   |----|
	|ip6`|   |ip6`|
	|----|   |----|
    HWS	|ip5`|   |ip5`|
	|----|   |----|
	|ip4`|   |ip4`|
	------   ------
	|ip3 |   |ip3 |
	|----|   |----|
    SWS |ip2 |   |ip2 |
	|----|   |----|
	|ip1 |   |ip1 |
	|----|   |----|
	|ip0 |   |ip0 |<-user PSS top
	------   ------
		 |kip0|<-kernel PSS bottom
		 |----|
		 |kip1|
	   K-SWS |----|
		 |kip2|
		 |----|
		 |kip3|<-kernel PSS top
		 ------

                  APSS

===========================================================================

---
Alexey Budankov (3):
  perf record: enable LBR callstack capture jointly with thread stack
  perf report: dump LBR callstack data by -D jointly with thread stack
  perf report: prefer dwarf callstacks to LBR ones when captured both

 tools/perf/builtin-report.c            |  2 ++
 tools/perf/util/parse-branch-options.c |  1 +
 tools/perf/util/session.c              | 31 ++++++++++++++++----------
 3 files changed, 22 insertions(+), 12 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v1 1/3] perf record: enable LBR callstack capture jointly with thread stack
  2019-08-09 15:16 [PATCH v1 0/3] collect LBR callstack together with thread stack data Alexey Budankov
@ 2019-08-09 15:23 ` Alexey Budankov
  2019-08-23  2:28   ` [tip: perf/core] perf record: Enable " tip-bot2 for Alexey Budankov
  2019-08-09 15:26 ` [PATCH v1 2/3] perf report: dump LBR callstack data by -D " Alexey Budankov
  2019-08-09 15:31 ` [PATCH v1 3/3] perf report: prefer dwarf callstacks to LBR ones when captured both Alexey Budankov
  2 siblings, 1 reply; 7+ messages in thread
From: Alexey Budankov @ 2019-08-09 15:23 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Andi Kleen, Kan Liang, Jin, Yao, linux-kernel


Enable '-j stack' applicability together with '--call-graph dwarf'
option so thread stack data and LBR call stack could be captured 
jointly:

  $ perf record -g --call-graph dwarf,1024 -j stack,u -- stack_test

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/util/parse-branch-options.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/util/parse-branch-options.c b/tools/perf/util/parse-branch-options.c
index 726e8d9e8c54..4ed20c833d44 100644
--- a/tools/perf/util/parse-branch-options.c
+++ b/tools/perf/util/parse-branch-options.c
@@ -30,6 +30,7 @@ static const struct branch_mode branch_modes[] = {
 	BRANCH_OPT("ind_jmp", PERF_SAMPLE_BRANCH_IND_JUMP),
 	BRANCH_OPT("call", PERF_SAMPLE_BRANCH_CALL),
 	BRANCH_OPT("save_type", PERF_SAMPLE_BRANCH_TYPE_SAVE),
+	BRANCH_OPT("stack", PERF_SAMPLE_BRANCH_CALL_STACK),
 	BRANCH_END
 };
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v1 2/3] perf report: dump LBR callstack data by -D jointly with thread stack
  2019-08-09 15:16 [PATCH v1 0/3] collect LBR callstack together with thread stack data Alexey Budankov
  2019-08-09 15:23 ` [PATCH v1 1/3] perf record: enable LBR callstack capture jointly with thread stack Alexey Budankov
@ 2019-08-09 15:26 ` Alexey Budankov
  2019-08-23  2:28   ` [tip: perf/core] perf report: Dump " tip-bot2 for Alexey Budankov
  2019-08-09 15:31 ` [PATCH v1 3/3] perf report: prefer dwarf callstacks to LBR ones when captured both Alexey Budankov
  2 siblings, 1 reply; 7+ messages in thread
From: Alexey Budankov @ 2019-08-09 15:26 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Andi Kleen, Kan Liang, Jin, Yao, linux-kernel


Make perf report -D command print captured LBR callstack chain when it is
collected together with raw thread stack data:

  2752673087247083 0x5d10 [0x548]: PERF_RECORD_SAMPLE(IP, 0x4002): 5841/5841: 0x40121f period: 1543862 addr: 0
  ... FP chain: nr:0
  ... branch callstack: nr:3
  .....  0: 00000000004011d0
  .....  1: 00007f393c388411
  .....  2: 0000000000401098
  ... user regs: mask 0xff0fff ABI 64-bit
  .... AX    0x34e7
  .... BX    0x7fff5f6dd3c0
  .... CX    0xffffffff
  .... DX    0x34e6
  .... SI    0x7f393c5268d0
  .... DI    0x0
  .... BP    0x401260
  .... SP    0x7fff5f6dd3c0
  .... IP    0x40121f
  .... FLAGS 0x29f
  .... CS    0x33
  .... SS    0x2b
  .... R8    0x7f393c526800
  .... R9    0x7f393c525da0
  .... R10   0xfffffffffffff70a
  .... R11   0x246
  .... R12   0x401070
  .... R13   0x7fff5f6ddcb0
  .... R14   0x0
  .... R15   0x0
  ... ustack: size 1024, offset 0x130
   . data_src: 0x5080021
   ... thread: stack_test:5841
   ...... dso: /root/abudanko/stacks/stack_test

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/util/session.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index b9fe71d11bf6..82e0438a9160 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1051,23 +1051,30 @@ static void callchain__printf(struct evsel *evsel,
 		       i, callchain->ips[i]);
 }
 
-static void branch_stack__printf(struct perf_sample *sample)
+static void branch_stack__printf(struct perf_sample *sample, bool callstack)
 {
 	uint64_t i;
 
-	printf("... branch stack: nr:%" PRIu64 "\n", sample->branch_stack->nr);
+	printf("%s: nr:%" PRIu64 "\n",
+		!callstack ? "... branch stack" : "... branch callstack",
+		sample->branch_stack->nr);
 
 	for (i = 0; i < sample->branch_stack->nr; i++) {
 		struct branch_entry *e = &sample->branch_stack->entries[i];
 
-		printf("..... %2"PRIu64": %016" PRIx64 " -> %016" PRIx64 " %hu cycles %s%s%s%s %x\n",
-			i, e->from, e->to,
-			(unsigned short)e->flags.cycles,
-			e->flags.mispred ? "M" : " ",
-			e->flags.predicted ? "P" : " ",
-			e->flags.abort ? "A" : " ",
-			e->flags.in_tx ? "T" : " ",
-			(unsigned)e->flags.reserved);
+		if (!callstack) {
+			printf("..... %2"PRIu64": %016" PRIx64 " -> %016" PRIx64 " %hu cycles %s%s%s%s %x\n",
+				i, e->from, e->to,
+				(unsigned short)e->flags.cycles,
+				e->flags.mispred ? "M" : " ",
+				e->flags.predicted ? "P" : " ",
+				e->flags.abort ? "A" : " ",
+				e->flags.in_tx ? "T" : " ",
+				(unsigned)e->flags.reserved);
+		} else {
+			printf("..... %2"PRIu64": %016" PRIx64 "\n",
+				i, i > 0 ? e->from : e->to);
+		}
 	}
 }
 
@@ -1217,8 +1224,8 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
 	if (evsel__has_callchain(evsel))
 		callchain__printf(evsel, sample);
 
-	if ((sample_type & PERF_SAMPLE_BRANCH_STACK) && !perf_evsel__has_branch_callstack(evsel))
-		branch_stack__printf(sample);
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK)
+		branch_stack__printf(sample, perf_evsel__has_branch_callstack(evsel));
 
 	if (sample_type & PERF_SAMPLE_REGS_USER)
 		regs_user__printf(sample);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v1 3/3] perf report: prefer dwarf callstacks to LBR ones when captured both
  2019-08-09 15:16 [PATCH v1 0/3] collect LBR callstack together with thread stack data Alexey Budankov
  2019-08-09 15:23 ` [PATCH v1 1/3] perf record: enable LBR callstack capture jointly with thread stack Alexey Budankov
  2019-08-09 15:26 ` [PATCH v1 2/3] perf report: dump LBR callstack data by -D " Alexey Budankov
@ 2019-08-09 15:31 ` Alexey Budankov
  2019-08-23  2:28   ` [tip: perf/core] perf report: Prefer DWARF " tip-bot2 for Alexey Budankov
  2 siblings, 1 reply; 7+ messages in thread
From: Alexey Budankov @ 2019-08-09 15:31 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Andi Kleen, Kan Liang, Jin, Yao, linux-kernel


Display dwarf based callchains when Perf trace contains as raw thread
stack data as LBR callstack data.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-report.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index d4288dcce156..5ecac3a1ed5b 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1271,6 +1271,8 @@ int cmd_report(int argc, const char **argv)
 
 	has_br_stack = perf_header__has_feat(&session->header,
 					     HEADER_BRANCH_STACK);
+	if (perf_evlist__combined_sample_type(session->evlist) & PERF_SAMPLE_STACK_USER)
+		has_br_stack = false;
 
 	setup_forced_leader(&report, session->evlist);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [tip: perf/core] perf report: Prefer DWARF callstacks to LBR ones when captured both
  2019-08-09 15:31 ` [PATCH v1 3/3] perf report: prefer dwarf callstacks to LBR ones when captured both Alexey Budankov
@ 2019-08-23  2:28   ` tip-bot2 for Alexey Budankov
  0 siblings, 0 replies; 7+ messages in thread
From: tip-bot2 for Alexey Budankov @ 2019-08-23  2:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, Peter Zijlstra, Namhyung Kim, Kan Liang, Jiri Olsa,
	Jin Yao, Andi Kleen, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Alexey Budankov

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     10ccbc1cc0b8a05a5c8491630d36d1e2672036c1
Gitweb:        https://git.kernel.org/tip/10ccbc1cc0b8a05a5c8491630d36d1e2672036c1
Author:        Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate:    Fri, 09 Aug 2019 18:31:28 +03:00
Committer:     Arnaldo Carvalho de Melo <acme@redhat.com>
CommitterDate: Tue, 20 Aug 2019 12:20:16 -03:00

perf report: Prefer DWARF callstacks to LBR ones when captured both

Display DWARF based callchains when the perf.data file contains raw thread
stack data as LBR callstack data.

Commiter testing:

This changes the output from the branch stack based one, i.e. without
this patch, for the same file as in the previous csets:

  # perf report --stdio
  # To display the perf.data header info, please use --header/--header-only options.
  #
  # Total Lost Samples: 0
  #
  # Samples: 13  of event 'cycles'
  # Event count (approx.): 13
  #
  # Overhead  Command  Source Shared Object  Source Symbol                Target Symbol                              Basic Block Cycles
  # ........  .......  ....................  ...........................  .........................................  ..................
  #
       7.69%  ls       libpthread-2.29.so    [.] _init                    [.] __pthread_initialize_minimal_internal  6827
       7.69%  ls       ld-2.29.so            [k] _start                   [k] _dl_start                              -
       7.69%  ls       ld-2.29.so            [.] _dl_start_user           [.] _dl_init                               -24790
       7.69%  ls       ld-2.29.so            [k] _dl_start                [k] _dl_sysdep_start                       278
       7.69%  ls       ld-2.29.so            [k] dl_main                  [k] _dl_map_object_deps                    15581
       7.69%  ls       ld-2.29.so            [k] open_verify.constprop.0  [k] lseek64                                4228
       7.69%  ls       ld-2.29.so            [k] _dl_map_object           [k] open_verify.constprop.0                55
       7.69%  ls       ld-2.29.so            [k] openaux                  [k] _dl_map_object                         67
       7.69%  ls       ld-2.29.so            [k] _dl_map_object_deps      [k] 0x00007f441b57c090                     112
       7.69%  ls       ld-2.29.so            [.] call_init.part.0         [.] _init                                  334
       7.69%  ls       ld-2.29.so            [.] _dl_init                 [.] call_init.part.0                       383
       7.69%  ls       ld-2.29.so            [k] _dl_sysdep_start         [k] dl_main                                45
       7.69%  ls       ld-2.29.so            [k] _dl_catch_exception      [k] openaux                                116

  #
  # (Tip: For memory address profiling, try: perf mem record / perf mem report)
  #

To the one that shows call chains:

  # perf report --stdio
  # To display the perf.data header info, please use --header/--header-only options.
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 10  of event 'cycles'
  # Event count (approx.): 3204047
  #
  # Children      Self  Command  Shared Object       Symbol
  # ........  ........  .......  ..................  .........................................
  #
      55.01%     0.00%  ls       [kernel.vmlinux]    [k] entry_SYSCALL_64_after_hwframe
              |
              ---entry_SYSCALL_64_after_hwframe
                 do_syscall_64
                 |
                  --16.01%--__x64_sys_execve
                            __do_execve_file.isra.0
                            search_binary_handler
                            load_elf_binary
                            elf_map
                            vm_mmap_pgoff
                            do_mmap
                            mmap_region
                            perf_event_mmap
                            perf_iterate_sb
                            perf_iterate_ctx
                            perf_event_mmap_output
                            perf_output_copy
                            memcpy_erms

      55.01%    39.00%  ls       [kernel.vmlinux]    [k] do_syscall_64
              |
              |--39.00%--0xffffffffffffffff
              |          _dl_map_object
              |          open_verify.constprop.0
              |          __lseek64 (inlined)
              |          entry_SYSCALL_64_after_hwframe
              |          do_syscall_64
              |
               --16.01%--do_syscall_64
                         __x64_sys_execve
                         __do_execve_file.isra.0
                         search_binary_handler
                         load_elf_binary
                         elf_map
                         vm_mmap_pgoff
                         do_mmap
                         mmap_region
                         perf_event_mmap
                         perf_iterate_sb
                         perf_iterate_ctx
                         perf_event_mmap_output
                         perf_output_copy
                         memcpy_erms

      42.95%    42.95%  ls       libpthread-2.29.so  [.] __pthread_initialize_minimal_internal
              |
              ---_init
                 __pthread_initialize_minimal_internal

      42.95%     0.00%  ls       libpthread-2.29.so  [.] _init
              |
              ---_init
                 __pthread_initialize_minimal_internal

  <SNIP>

  #
  # (Tip: Profiling branch (mis)predictions with: perf record -b / perf report)
  #
  #

The branch stack view be explicitely selected using:

  # perf report -h branch-stack

   Usage: perf report [<options>]

      -b, --branch-stack    use branch records for per branch histogram filling

  #

I.e. after this patch:

  # perf report -b --stdio
  # To display the perf.data header info, please use --header/--header-only options.
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 13  of event 'cycles'
  # Event count (approx.): 13
  #
  # Overhead  Command  Source Shared Object  Source Symbol                Target Symbol                              Basic Block Cycles
  # ........  .......  ....................  ...........................  .........................................  ..................
  #
       7.69%  ls       libpthread-2.29.so    [.] _init                    [.] __pthread_initialize_minimal_internal  6827
       7.69%  ls       ld-2.29.so            [k] _start                   [k] _dl_start                              -
       7.69%  ls       ld-2.29.so            [.] _dl_start_user           [.] _dl_init                               -24790
       7.69%  ls       ld-2.29.so            [k] _dl_start                [k] _dl_sysdep_start                       278
       7.69%  ls       ld-2.29.so            [k] dl_main                  [k] _dl_map_object_deps                    15581
       7.69%  ls       ld-2.29.so            [k] open_verify.constprop.0  [k] lseek64                                4228
       7.69%  ls       ld-2.29.so            [k] _dl_map_object           [k] open_verify.constprop.0                55
       7.69%  ls       ld-2.29.so            [k] openaux                  [k] _dl_map_object                         67
       7.69%  ls       ld-2.29.so            [k] _dl_map_object_deps      [k] 0x00007f441b57c090                     112
       7.69%  ls       ld-2.29.so            [.] call_init.part.0         [.] _init                                  334
       7.69%  ls       ld-2.29.so            [.] _dl_init                 [.] call_init.part.0                       383
       7.69%  ls       ld-2.29.so            [k] _dl_sysdep_start         [k] dl_main                                45
       7.69%  ls       ld-2.29.so            [k] _dl_catch_exception      [k] openaux                                116

  #
  # (Tip: Show current config key-value pairs: perf config --list)
  #
  #

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ccbd9583-82f4-dec5-7e84-64bf56e351fb@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-report.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 5e003d0..79dfb11 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1281,6 +1281,8 @@ int cmd_report(int argc, const char **argv)
 
 	has_br_stack = perf_header__has_feat(&session->header,
 					     HEADER_BRANCH_STACK);
+	if (perf_evlist__combined_sample_type(session->evlist) & PERF_SAMPLE_STACK_USER)
+		has_br_stack = false;
 
 	setup_forced_leader(&report, session->evlist);
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [tip: perf/core] perf record: Enable LBR callstack capture jointly with thread stack
  2019-08-09 15:23 ` [PATCH v1 1/3] perf record: enable LBR callstack capture jointly with thread stack Alexey Budankov
@ 2019-08-23  2:28   ` tip-bot2 for Alexey Budankov
  0 siblings, 0 replies; 7+ messages in thread
From: tip-bot2 for Alexey Budankov @ 2019-08-23  2:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, Peter Zijlstra, Namhyung Kim, Kan Liang, Jiri Olsa,
	Jin Yao, Andi Kleen, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Alexey Budankov

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     2566349648b40aa3f5edf6af8f7f893ccd6e4eae
Gitweb:        https://git.kernel.org/tip/2566349648b40aa3f5edf6af8f7f893ccd6e4eae
Author:        Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate:    Fri, 09 Aug 2019 18:23:58 +03:00
Committer:     Arnaldo Carvalho de Melo <acme@redhat.com>
CommitterDate: Tue, 20 Aug 2019 12:18:58 -03:00

perf record: Enable LBR callstack capture jointly with thread stack

Enable '-j stack' applicability together with '--call-graph dwarf'
option so thread stack data and LBR call stack could be captured
jointly:

  $ perf record -g --call-graph dwarf,1024 -j stack,u -- stack_test

Collected LBR call stack can be used to augment DWARF call stack
calculated from the raw thread stack data and to provide more
comprehensive call stack information for cases when collected SIZE is
not enough to cover complete thread stack.

Such cases are typical for workloads that allocate large arrays of data
on its threads stacks or the possible SIZE to collect can't be large
enough due to workload nature or system configuration and this is where
hardware captured LBR call stacks can provide missing stack frames.
Possible DWARF plus LBR call stacks consolidation algorithm description
follows.

With this patch set perf report command UI currently ignores collected
LBR call stack data and still provides DWARF based call stacks
information.

  ===========================================================================

  Overview:

   Legend:

   THS - thread stack
   CTX - thread register context
   SWS - software stack
   SSF - skipped stack frames
   PSS - Perf sample stack

   ip,sp,bp - HW registers values
   d        - allocated stack regions
   kip      - ip address in the kernel space
   K        - captured thread stack size

        THS

       -----
       |   |<-stack bottom
        ...
       |---|
       |ip4|
       |---|         PSS = SWS(THS(K))
       |   |
   --> |   |
   |   |d3 |                  user/
   |   |---|         user PSS kernel PSS
   |   |ip3|         ------   ------
   |   |---|         |SSF |   |SSF |
   |   |   |          ....     ....
   |   |   |         ------   ------
   |   |d2 |         | -1 |   | -1 |
       |---|   user  ------   ------
   K   |ip2|   CTX   |ip3 |   |ip3 |
       |---|         |----|   |----|
   |   |d1 |   ...   |ip2 | , |ip2 |
   |   |---|  |---|  |----|   |----|
   |   |ip1|  |bp0|  |ip1 |   |ip1 |
   |   |---|  |---|  |----|   |----|
   |   |   |  |ip0|->|ip0 |   |ip0 |<-user stack top
   |   |   |  |---|  ------   ------
   |   |   |<-|sp0|<-stack    |kip0|<-kernel stack bottom
   --> -----  -----   top     |----|
                              |kip1|
                              |----|
		              |kip2|
		              |----|
                               ....
			      |    |<-kernel stack top
                              ------

  Algorithm details:

   Legend:

   HWS - hardware stack
   K-SWS - kernel software stack

			 BRANCH
			 TABLE

		 HWS      ip   ip
			  from to
		 ------  -----------
		 |ip7`|  |ip7`|    |
		 |----|  |----|----|
		 |ip6`|  |ip6`|    |
	user PSS |----|  |----|----|
		 |ip5`|  |ip5`|    |
	------   |----|  |----|----|
	| -1 |   |ip4`|  |ip4`|    |
	------   |----|  |----|----|
	|ip3 |~~~|ip3`|  |ip3`|    |
	|----|   |----|  |----|----|
	|ip2 |~~~|ip2`|  |ip2`|    |
	|----| 	 |----|  |----|----|
	|ip1 |~~~|ip1`|  |ip1`|ip0`|
	|----| 	 |----|  -----------
	|ip0 |~~~|ip0`|<---------'
	------   ------

	1. if (sym(ipj) == sym(ipj`)), j=0-3 ===> user PSS
	2. ipj`                      , j=4-7 ===> user PSS

  Augmented PSS = A_SWS(SWS(THS(K)), HWS):

	         user/
       user PSS  kernel PSS

	------   ------
	|ip7`|   |ip7`|<-user PSS bottom
	|----|   |----|
	|ip6`|   |ip6`|
	|----|   |----|
    HWS	|ip5`|   |ip5`|
	|----|   |----|
	|ip4`|   |ip4`|
	------   ------
	|ip3 |   |ip3 |
	|----|   |----|
    SWS |ip2 |   |ip2 |
	|----|   |----|
	|ip1 |   |ip1 |
	|----|   |----|
	|ip0 |   |ip0 |<-user PSS top
	------   ------
		 |kip0|<-kernel PSS bottom
		 |----|
		 |kip1|
	   K-SWS |----|
		 |kip2|
		 |----|
		 |kip3|<-kernel PSS top
		 ------

                  APSS

Committer testing:

Before:

  # perf record -g --call-graph dwarf,1024 -j stack,u ls > /dev/null
  unknown branch filter stack, check man page

   Usage: perf record [<options>] [<command>]
      or: perf record [<options>] -- <command> [<options>]

      -j, --branch-filter <branch filter mask>
                            branch stack filter modes
  # perf record -g --call-graph dwarf,1024 -j u ls > /dev/null
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.054 MB perf.data (12 samples) ]
  # perf evlist -v
  cycles: size: 112, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|CALLCHAIN|PERIOD|BRANCH_STACK|REGS_USER|STACK_USER|DATA_SRC, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, precise_ip: 3, mmap_data: 1, sample_id_all: 1, exclude_guest: 1, exclude_callchain_user: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1, branch_sample_type: ANY, sample_regs_user: 0xff0fff, sample_stack_user: 1024
   #

After:

  # perf record -g --call-graph dwarf,1024 -j stack,u ls > /dev/null
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.044 MB perf.data (11 samples) ]
  [root@quaco ~]# perf evlist -v
  cycles: size: 112, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|CALLCHAIN|PERIOD|BRANCH_STACK|REGS_USER|STACK_USER|DATA_SRC, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, precise_ip: 3, mmap_data: 1, sample_id_all: 1, exclude_guest: 1, exclude_callchain_user: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1, branch_sample_type: USER|CALL_STACK, sample_regs_user: 0xff0fff, sample_stack_user: 1024
  #

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/e9e00090-66fb-d2a4-c90f-1d12344f7788@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/parse-branch-options.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/util/parse-branch-options.c b/tools/perf/util/parse-branch-options.c
index 726e8d9..4ed20c8 100644
--- a/tools/perf/util/parse-branch-options.c
+++ b/tools/perf/util/parse-branch-options.c
@@ -30,6 +30,7 @@ static const struct branch_mode branch_modes[] = {
 	BRANCH_OPT("ind_jmp", PERF_SAMPLE_BRANCH_IND_JUMP),
 	BRANCH_OPT("call", PERF_SAMPLE_BRANCH_CALL),
 	BRANCH_OPT("save_type", PERF_SAMPLE_BRANCH_TYPE_SAVE),
+	BRANCH_OPT("stack", PERF_SAMPLE_BRANCH_CALL_STACK),
 	BRANCH_END
 };
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [tip: perf/core] perf report: Dump LBR callstack data by -D jointly with thread stack
  2019-08-09 15:26 ` [PATCH v1 2/3] perf report: dump LBR callstack data by -D " Alexey Budankov
@ 2019-08-23  2:28   ` tip-bot2 for Alexey Budankov
  0 siblings, 0 replies; 7+ messages in thread
From: tip-bot2 for Alexey Budankov @ 2019-08-23  2:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, Peter Zijlstra, Namhyung Kim, Kan Liang, Jiri Olsa,
	Jin Yao, Andi Kleen, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Alexey Budankov

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     d2720c3dad58def723f9617f7cf2a48c752ef50a
Gitweb:        https://git.kernel.org/tip/d2720c3dad58def723f9617f7cf2a48c752ef50a
Author:        Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate:    Fri, 09 Aug 2019 18:26:30 +03:00
Committer:     Arnaldo Carvalho de Melo <acme@redhat.com>
CommitterDate: Tue, 20 Aug 2019 12:19:44 -03:00

perf report: Dump LBR callstack data by -D jointly with thread stack

Make perf report -D command print captured LBR callstack chain when it is
collected together with raw thread stack data:

  2752673087247083 0x5d10 [0x548]: PERF_RECORD_SAMPLE(IP, 0x4002): 5841/5841: 0x40121f period: 1543862 addr: 0
  ... FP chain: nr:0
  ... branch callstack: nr:3
  .....  0: 00000000004011d0
  .....  1: 00007f393c388411
  .....  2: 0000000000401098
  ... user regs: mask 0xff0fff ABI 64-bit
  .... AX    0x34e7
  .... BX    0x7fff5f6dd3c0
  .... CX    0xffffffff
  .... DX    0x34e6
  .... SI    0x7f393c5268d0
  .... DI    0x0
  .... BP    0x401260
  .... SP    0x7fff5f6dd3c0
  .... IP    0x40121f
  .... FLAGS 0x29f
  .... CS    0x33
  .... SS    0x2b
  .... R8    0x7f393c526800
  .... R9    0x7f393c525da0
  .... R10   0xfffffffffffff70a
  .... R11   0x246
  .... R12   0x401070
  .... R13   0x7fff5f6ddcb0
  .... R14   0x0
  .... R15   0x0
  ... ustack: size 1024, offset 0x130
   . data_src: 0x5080021
   ... thread: stack_test:5841
   ...... dso: /root/abudanko/stacks/stack_test

Committer testing:

  # perf record -g --call-graph dwarf,1024 -j stack,u ls > /dev/null
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.042 MB perf.data (10 samples) ]
  #

Before:

  # perf report -D |& grep PERF_RECORD_SAMPLE -A28 | tail -29
  67538909824483 0xa7a0 [0x560]: PERF_RECORD_SAMPLE(IP, 0x4002): 9721/9721: 0x7f441b2b1e20 period: 1376095 addr: 0
  ... FP chain: nr:0
  ... user regs: mask 0xff0fff ABI 64-bit
  .... AX    0x7f441b2b1000
  .... BX    0x7f441b55b970
  .... CX    0x7fff6e2db218
  .... DX    0x7fff6e2db218
  .... SI    0x7fff6e2db208
  .... DI    0x1
  .... BP    0x1
  .... SP    0x7fff6e2db178
  .... IP    0x7f441b2b1e20
  .... FLAGS 0x20a
  .... CS    0x33
  .... SS    0x2b
  .... R8    0x1
  .... R9    0x7f441b371c18
  .... R10   0x7f441b5a5f10
  .... R11   0x202
  .... R12   0x7fff6e2db208
  .... R13   0x7fff6e2db218
  .... R14   0x7f441b5a7150
  .... R15   0x0
  ... ustack: size 1024, offset 0x148
   . data_src: 0x5080021
   ... thread: ls:9721
   ...... dso: /usr/lib64/libpthread-2.29.so

  0xad00 [0x60]: event: 10
  #

After:

  # perf report -D |& grep PERF_RECORD_SAMPLE -A31 | tail -32
  67538909824483 0xa7a0 [0x560]: PERF_RECORD_SAMPLE(IP, 0x4002): 9721/9721: 0x7f441b2b1e20 period: 1376095 addr: 0
  ... FP chain: nr:0
  ... branch callstack: nr:4
  .....  0: 00007f441b2b1e20
  .....  1: 00007f441b58af1a
  .....  2: 00007f441b58b0e1
  .....  3: 00007f441b57c145
  ... user regs: mask 0xff0fff ABI 64-bit
  .... AX    0x7f441b2b1000
  .... BX    0x7f441b55b970
  .... CX    0x7fff6e2db218
  .... DX    0x7fff6e2db218
  .... SI    0x7fff6e2db208
  .... DI    0x1
  .... BP    0x1
  .... SP    0x7fff6e2db178
  .... IP    0x7f441b2b1e20
  .... FLAGS 0x20a
  .... CS    0x33
  .... SS    0x2b
  .... R8    0x1
  .... R9    0x7f441b371c18
  .... R10   0x7f441b5a5f10
  .... R11   0x202
  .... R12   0x7fff6e2db208
  .... R13   0x7fff6e2db218
  .... R14   0x7f441b5a7150
  .... R15   0x0
  ... ustack: size 1024, offset 0x148
   . data_src: 0x5080021
   ... thread: ls:9721
   ...... dso: /usr/lib64/libpthread-2.29.so
  #

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/aa82e5dd-def2-0ca8-a064-db9e2e8ad076@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/session.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index b9fe71d..82e0438 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1051,23 +1051,30 @@ static void callchain__printf(struct evsel *evsel,
 		       i, callchain->ips[i]);
 }
 
-static void branch_stack__printf(struct perf_sample *sample)
+static void branch_stack__printf(struct perf_sample *sample, bool callstack)
 {
 	uint64_t i;
 
-	printf("... branch stack: nr:%" PRIu64 "\n", sample->branch_stack->nr);
+	printf("%s: nr:%" PRIu64 "\n",
+		!callstack ? "... branch stack" : "... branch callstack",
+		sample->branch_stack->nr);
 
 	for (i = 0; i < sample->branch_stack->nr; i++) {
 		struct branch_entry *e = &sample->branch_stack->entries[i];
 
-		printf("..... %2"PRIu64": %016" PRIx64 " -> %016" PRIx64 " %hu cycles %s%s%s%s %x\n",
-			i, e->from, e->to,
-			(unsigned short)e->flags.cycles,
-			e->flags.mispred ? "M" : " ",
-			e->flags.predicted ? "P" : " ",
-			e->flags.abort ? "A" : " ",
-			e->flags.in_tx ? "T" : " ",
-			(unsigned)e->flags.reserved);
+		if (!callstack) {
+			printf("..... %2"PRIu64": %016" PRIx64 " -> %016" PRIx64 " %hu cycles %s%s%s%s %x\n",
+				i, e->from, e->to,
+				(unsigned short)e->flags.cycles,
+				e->flags.mispred ? "M" : " ",
+				e->flags.predicted ? "P" : " ",
+				e->flags.abort ? "A" : " ",
+				e->flags.in_tx ? "T" : " ",
+				(unsigned)e->flags.reserved);
+		} else {
+			printf("..... %2"PRIu64": %016" PRIx64 "\n",
+				i, i > 0 ? e->from : e->to);
+		}
 	}
 }
 
@@ -1217,8 +1224,8 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
 	if (evsel__has_callchain(evsel))
 		callchain__printf(evsel, sample);
 
-	if ((sample_type & PERF_SAMPLE_BRANCH_STACK) && !perf_evsel__has_branch_callstack(evsel))
-		branch_stack__printf(sample);
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK)
+		branch_stack__printf(sample, perf_evsel__has_branch_callstack(evsel));
 
 	if (sample_type & PERF_SAMPLE_REGS_USER)
 		regs_user__printf(sample);

^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-08-23  2:28 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-09 15:16 [PATCH v1 0/3] collect LBR callstack together with thread stack data Alexey Budankov
2019-08-09 15:23 ` [PATCH v1 1/3] perf record: enable LBR callstack capture jointly with thread stack Alexey Budankov
2019-08-23  2:28   ` [tip: perf/core] perf record: Enable " tip-bot2 for Alexey Budankov
2019-08-09 15:26 ` [PATCH v1 2/3] perf report: dump LBR callstack data by -D " Alexey Budankov
2019-08-23  2:28   ` [tip: perf/core] perf report: Dump " tip-bot2 for Alexey Budankov
2019-08-09 15:31 ` [PATCH v1 3/3] perf report: prefer dwarf callstacks to LBR ones when captured both Alexey Budankov
2019-08-23  2:28   ` [tip: perf/core] perf report: Prefer DWARF " tip-bot2 for Alexey Budankov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.