All of lore.kernel.org
 help / color / mirror / Atom feed
* [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
@ 2015-04-24 13:31 Yunlong Song
  2015-04-24 13:49 ` Yunlong Song
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Yunlong Song @ 2015-04-24 13:31 UTC (permalink / raw)
  To: a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo
  Cc: linux-kernel, wangnan0

[Profiling Background]
Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
we find a case that ext4 is better than f2fs in random write under the test of
"iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
the overall sys_write (262144/64=4096). We want to find out the call stack during
this specific 1% sys_write. Our idea is to record the stack in a certain time period
and since the specific 1% case takes up 60% time, the total number of records of its
stack should also takes up 60% of the total records, then we can recognize those stacks
and figure out what the I/O stack of f2fs is doing in the 1% case.

[Profiling Problem]

Although perf can record the events (with call stack) of a specified pid, e.g. using
"perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted
and the CPU is scheduled to other process. As a result, perf will not record any events
of iozone until iozone's context is recovered and the CPU is scheduled to continue
processing the sys_write of iozone. This obeys our initial idea which is described in
[Profiling Background], since we cannot recognize the call stacks of the specific 1% case
by using the ratio of the record number.

[Alternative Solution without Perf]
We can use /proc/#pid/stack to record the stack in a certain period (e.g. 1ms) of iozone,
no matter whether iozone is interrupted or not. However, we have not taken a deep sight
into this, since we want to use perf to do this kind of thing.

[Question about Perf]
So we have a question that "How does perf still record the stack of a specified pid even
when that process is interrupted and CPU is scheduled to other process?"

-- 
Thanks,
Yunlong Song


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-04-24 13:31 [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process Yunlong Song
@ 2015-04-24 13:49 ` Yunlong Song
  2015-04-25 14:03   ` Yunlong Song
  2015-04-24 13:49 ` David Ahern
  2015-04-24 13:58 ` David Ahern
  2 siblings, 1 reply; 12+ messages in thread
From: Yunlong Song @ 2015-04-24 13:49 UTC (permalink / raw)
  To: a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo
  Cc: linux-kernel, wangnan0

On 2015/4/24 21:31, Yunlong Song wrote:
> [Profiling Background]
> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
> we find a case that ext4 is better than f2fs in random write under the test of
> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
> the overall sys_write (262144/64=4096). We want to find out the call stack during
> this specific 1% sys_write. Our idea is to record the stack in a certain time period
> and since the specific 1% case takes up 60% time, the total number of records of its
> stack should also takes up 60% of the total records, then we can recognize those stacks
> and figure out what the I/O stack of f2fs is doing in the 1% case.
> 
> [Profiling Problem]
> 
> Although perf can record the events (with call stack) of a specified pid, e.g. using
> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted
> and the CPU is scheduled to other process. As a result, perf will not record any events
> of iozone until iozone's context is recovered and the CPU is scheduled to continue
> processing the sys_write of iozone. 

>This obeys our initial idea which is described in [Profiling Background],

This "disobeys" our initial idea which is described in [Profiling Background],

 since we cannot recognize the call stacks of the specific 1% case
> by using the ratio of the record number.
> 
> [Alternative Solution without Perf]
> We can use /proc/#pid/stack to record the stack in a certain period (e.g. 1ms) of iozone,
> no matter whether iozone is interrupted or not. However, we have not taken a deep sight
> into this, since we want to use perf to do this kind of thing.
> 
> [Question about Perf]
> So we have a question that "How does perf still record the stack of a specified pid even
> when that process is interrupted and CPU is scheduled to other process?"
> 


-- 
Thanks,
Yunlong Song


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-04-24 13:31 [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process Yunlong Song
  2015-04-24 13:49 ` Yunlong Song
@ 2015-04-24 13:49 ` David Ahern
  2015-04-24 13:56   ` Yunlong Song
  2015-04-24 13:58 ` David Ahern
  2 siblings, 1 reply; 12+ messages in thread
From: David Ahern @ 2015-04-24 13:49 UTC (permalink / raw)
  To: Yunlong Song, a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo
  Cc: linux-kernel, wangnan0

On 4/24/15 7:31 AM, Yunlong Song wrote:
> [Profiling Problem]
>
> Although perf can record the events (with call stack) of a specified pid, e.g. using
> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted

That command tells perf to *only* collect data for iozone. This command:

     perf record -g -a -- iozone ....

tells perf to collect samples system wide as long as the iozone process 
is alive.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-04-24 13:49 ` David Ahern
@ 2015-04-24 13:56   ` Yunlong Song
  0 siblings, 0 replies; 12+ messages in thread
From: Yunlong Song @ 2015-04-24 13:56 UTC (permalink / raw)
  To: David Ahern, a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo
  Cc: linux-kernel, wangnan0

On 2015/4/24 21:49, David Ahern wrote:
> On 4/24/15 7:31 AM, Yunlong Song wrote:
>> [Profiling Problem]
>>
>> Although perf can record the events (with call stack) of a specified pid, e.g. using
>> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted
> 
> That command tells perf to *only* collect data for iozone. This command:
> 
>     perf record -g -a -- iozone ....
> 
> tells perf to collect samples system wide as long as the iozone process is alive.
> 
> 
> 

But we only want the records of iozone rather than the overall system, since other process
will interfere the targeted I/O stack of the specific 1% case, and we cannot figure out the
"clean" specific stack which takes up 60% time.

-- 
Thanks,
Yunlong Song


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-04-24 13:31 [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process Yunlong Song
  2015-04-24 13:49 ` Yunlong Song
  2015-04-24 13:49 ` David Ahern
@ 2015-04-24 13:58 ` David Ahern
  2015-04-25 14:05   ` Yunlong Song
  2 siblings, 1 reply; 12+ messages in thread
From: David Ahern @ 2015-04-24 13:58 UTC (permalink / raw)
  To: Yunlong Song, a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo
  Cc: linux-kernel, wangnan0

On 4/24/15 7:31 AM, Yunlong Song wrote:
> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
> we find a case that ext4 is better than f2fs in random write under the test of
> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
> the overall sys_write (262144/64=4096). We want to find out the call stack during
> this specific 1% sys_write. Our idea is to record the stack in a certain time period
> and since the specific 1% case takes up 60% time, the total number of records of its
> stack should also takes up 60% of the total records, then we can recognize those stacks
> and figure out what the I/O stack of f2fs is doing in the 1% case.

And to address this specific profiling problem have you tried:

    perf trace record -- iozone ...
    perf trace -i perf.data -S



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-04-24 13:49 ` Yunlong Song
@ 2015-04-25 14:03   ` Yunlong Song
  0 siblings, 0 replies; 12+ messages in thread
From: Yunlong Song @ 2015-04-25 14:03 UTC (permalink / raw)
  To: a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo
  Cc: linux-kernel, wangnan0, dsahern

On 2015/4/24 21:49, Yunlong Song wrote:
> On 2015/4/24 21:31, Yunlong Song wrote:
>> [Profiling Background]
>> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
>> we find a case that ext4 is better than f2fs in random write under the test of
>> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
>> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
>> the overall sys_write (262144/64=4096). We want to find out the call stack during
>> this specific 1% sys_write. Our idea is to record the stack in a certain time period
>> and since the specific 1% case takes up 60% time, the total number of records of its
>> stack should also takes up 60% of the total records, then we can recognize those stacks
>> and figure out what the I/O stack of f2fs is doing in the 1% case.
>>
>> [Profiling Problem]
>>
>> Although perf can record the events (with call stack) of a specified pid, e.g. using
>> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted
>> and the CPU is scheduled to other process. As a result, perf will not record any events
>> of iozone until iozone's context is recovered and the CPU is scheduled to continue
>> processing the sys_write of iozone. 
> 
>> This obeys our initial idea which is described in [Profiling Background],
> 
> This "disobeys" our initial idea which is described in [Profiling Background],
> 
>  since we cannot recognize the call stacks of the specific 1% case
>> by using the ratio of the record number.
>>
>> [Alternative Solution without Perf]
>> We can use /proc/#pid/stack to record the stack in a certain period (e.g. 1ms) of iozone,
>> no matter whether iozone is interrupted or not. However, we have not taken a deep sight
>> into this, since we want to use perf to do this kind of thing.
>>
>> [Question about Perf]
>> So we have a question that "How does perf still record the stack of a specified pid even
>> when that process is interrupted and CPU is scheduled to other process?"
>>
> 
> 

Here is another example:

$ cat test.c
void func_loop(void)
{
        int i, sum = 0;
        for (i = 0; i < 100000; i++)
                sum += 1;
}

void func_sleep(void)
{
        sleep(1);
}

int main(void)
{
        while (1) {
                func_loop();
                func_sleep();
        }
        return 0;
}

$ gcc test.c -o test -O0
$ ./test
$ ps aux | grep test
Yunlong+ 133798  0.0  0.0   4056   348 pts/7    S+   21:52   0:00 ./test
$ perf record -g -p 133798 sleep 5
$ perf report -g --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 16  of event 'cycles'
# Event count (approx.): 2439331
#
# Children      Self  Command  Shared Object      Symbol
# ........  ........  .......  .................  ...............................
#
    79.46%    76.48%  test     test               [.] func_loop
               |
               ---func_loop
                  |
                  |--97.07%-- main
                  |          __libc_start_main
                   --2.93%-- [...]

    77.14%     0.00%  test     libc-2.19.so       [.] __libc_start_main
               |
               ---__libc_start_main

    77.14%     0.00%  test     test               [.] main
               |
               ---main
                  __libc_start_main

    20.54%     0.00%  test     [unknown]          [.] 0000000000000000
               |
               ---0

    20.54%     0.00%  test     libc-2.19.so       [.] __GI___libc_nanosleep
               |
               ---__GI___libc_nanosleep
                  0

    20.54%     0.00%  test     [unknown]          [k] 0x00007f2fa07ef6b0
               |
               ---0x7f2fa07ef6b0
                  __GI___libc_nanosleep
                  0

    20.54%     0.00%  test     [kernel.kallsyms]  [k] system_call_fastpath
               |
               ---system_call_fastpath
                  0x7f2fa07ef6b0
                  __GI___libc_nanosleep

As shown above, the func_loop is regarded as the main operation and its call stack
is showed on the top of perf report. However, we know that func_sleep is really the
main operation from the view of time duration indeed. And if we want to its actual
call stack in the real time, we can see the result as following:

$ while true ; do cat /proc/133798/stack  ; sleep 0.1 ; done
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff

As shown above, func_sleep is regarded as the main operation from the view of time and
its call stack is printed in real time.

-- 
Thanks,
Yunlong Song


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-04-24 13:58 ` David Ahern
@ 2015-04-25 14:05   ` Yunlong Song
  2015-04-25 15:53     ` David Ahern
  0 siblings, 1 reply; 12+ messages in thread
From: Yunlong Song @ 2015-04-25 14:05 UTC (permalink / raw)
  To: David Ahern, a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo
  Cc: linux-kernel, wangnan0

On 2015/4/24 21:58, David Ahern wrote:
> On 4/24/15 7:31 AM, Yunlong Song wrote:
>> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
>> we find a case that ext4 is better than f2fs in random write under the test of
>> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
>> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
>> the overall sys_write (262144/64=4096). We want to find out the call stack during
>> this specific 1% sys_write. Our idea is to record the stack in a certain time period
>> and since the specific 1% case takes up 60% time, the total number of records of its
>> stack should also takes up 60% of the total records, then we can recognize those stacks
>> and figure out what the I/O stack of f2fs is doing in the 1% case.
> 
> And to address this specific profiling problem have you tried:
> 
>    perf trace record -- iozone ...
>    perf trace -i perf.data -S
> 
> 
> 
> 

But this only shows the system call like strace, but we want the call stack of kernel functions
in fact.

-- 
Thanks,
Yunlong Song


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-04-25 14:05   ` Yunlong Song
@ 2015-04-25 15:53     ` David Ahern
  2015-05-05 21:53       ` Rabin Vincent
  2015-05-06  4:10       ` Yunlong Song
  0 siblings, 2 replies; 12+ messages in thread
From: David Ahern @ 2015-04-25 15:53 UTC (permalink / raw)
  To: Yunlong Song, a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo
  Cc: linux-kernel, wangnan0

On 4/25/15 8:05 AM, Yunlong Song wrote:
> On 2015/4/24 21:58, David Ahern wrote:
>> On 4/24/15 7:31 AM, Yunlong Song wrote:
>>> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
>>> we find a case that ext4 is better than f2fs in random write under the test of
>>> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
>>> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
>>> the overall sys_write (262144/64=4096). We want to find out the call stack during
>>> this specific 1% sys_write. Our idea is to record the stack in a certain time period
>>> and since the specific 1% case takes up 60% time, the total number of records of its
>>> stack should also takes up 60% of the total records, then we can recognize those stacks
>>> and figure out what the I/O stack of f2fs is doing in the 1% case.
>>
>> And to address this specific profiling problem have you tried:
>>
>>     perf trace record -- iozone ...
>>     perf trace -i perf.data -S
>>
>>
>>
>>
>
> But this only shows the system call like strace, but we want the call stack of kernel functions
> in fact.
>

We haven't added the callchain option yet; on the to-do list.

perf trace record -g -- iozone ...
perf trace -i perf.data -s
--> summary of system calls, max/min/average times

perf trace -i perf.data --duration 10.0 -T
--> note the timestamp where the write took a "long" time

perf script
--> search down to *around* the time of interest; you want the syscall 
entry; timestamp is for exit

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-04-25 15:53     ` David Ahern
@ 2015-05-05 21:53       ` Rabin Vincent
  2015-05-05 22:24         ` David Ahern
  2015-05-06  4:13         ` Yunlong Song
  2015-05-06  4:10       ` Yunlong Song
  1 sibling, 2 replies; 12+ messages in thread
From: Rabin Vincent @ 2015-05-05 21:53 UTC (permalink / raw)
  To: David Ahern, Yunlong Song
  Cc: a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo,
	linux-kernel, wangnan0

On Sat, Apr 25, 2015 at 09:53:57AM -0600, David Ahern wrote:
> On 4/25/15 8:05 AM, Yunlong Song wrote:
> >But this only shows the system call like strace, but we want the call
> >stack of kernel functions in fact.
> >
> We haven't added the callchain option yet; on the to-do list.
> 
> perf trace record -g -- iozone ...
> perf trace -i perf.data -s
> --> summary of system calls, max/min/average times
> 
> perf trace -i perf.data --duration 10.0 -T
> --> note the timestamp where the write took a "long" time
> 
> perf script
> --> search down to *around* the time of interest; you want the syscall
> entry; timestamp is for exit

Now if I understood the use case right, what Yulong Song wants to know
is what the iozone process is doing in the kernel (i.e. the stacktrace of why
exactly it goes to sleep / what it's waiting for) during these
sys_writes which take a long time.  

The commands above will identify the sys_write which takes time but only
provide the stacktrace at the entry and exit of the syscall, but this do
not show why the process blocked or what it did inside the system call.

So a way to get what is required for the use case would be to make the
following changes to the above sequence:

(1) include the sched:* events when perf trace record is run

(2) around the time of interest, look at the kernel stack st the sched:switch 
    events between the entry and the exit.  This will show what the process was
    waiting for when it when it blocked.  The stacktraces at the
    stat_runtime events in the process may also be useful to understand what
    was going on.

Example:

$ perf trace record -g -e sched:* -- dd if=/dev/zero of=/x bs=10M count=100 conv=fsync
$ perf trace -i perf.data -s

 dd (147), 364 events, 94.3%, 0.000 msec

   syscall            calls      min       avg       max      stddev
   --------------- -------- --------- --------- ---------     ------
   write                 63   266.019   316.896   963.413      4.69%
   ...

$ perf trace -i perf.data --duration 960 -T
 91916.354 (963.413 ms): dd/147 write(arg0: 1, arg1: 139729327423488, arg2: 10485760, arg3: 582, arg4: 100, arg5: 72340172838076673) = 10485760

$ perf script
...

dd   147 [002]    90.952941: raw_syscalls:sys_enter: NR 1 (1, 7f1544ed0000, a00000, 246, 64, 101010101010101)
	          211a00 syscall_trace_enter_phase2 ([kernel.kallsyms])
	          7d79c3 tracesys_phase2 ([kernel.kallsyms])
	            ed20 [unknown] (/lib/libpthread-2.16.so)

(... trimmed everything except the longest block: ...)

dd   147 [002]    91.204723: sched:sched_switch: prev_comm=dd prev_pid=147 prev_prio=120 prev_state=D ==> next_comm=swapper/2 next_pid=0 next_prio=120
	          7d1121 __schedule ([kernel.kallsyms])
	          7d1817 schedule ([kernel.kallsyms])
	          4578a5 jbd2_log_wait_commit ([kernel.kallsyms])
	          4546ff jbd2_log_do_checkpoint ([kernel.kallsyms])
	          45489f __jbd2_log_wait_for_space ([kernel.kallsyms])
	          44c2d8 start_this_handle ([kernel.kallsyms])
	          44c7b6 jbd2__journal_start ([kernel.kallsyms])
	          42fc3b __ext4_journal_start_sb ([kernel.kallsyms])
	          4012e9 ext4_write_begin ([kernel.kallsyms])
	          33795f generic_perform_write ([kernel.kallsyms])
	          338b68 __generic_file_write_iter ([kernel.kallsyms])
	          3f4371 ext4_file_write_iter ([kernel.kallsyms])
	          381ae1 __vfs_write ([kernel.kallsyms])
	          381f07 vfs_write ([kernel.kallsyms])
	          3825dd sys_write ([kernel.kallsyms])
	          7d7a1a tracesys_phase2 ([kernel.kallsyms])
	            ed20 [unknown] (/lib/libpthread-2.16.so)

swapper     0 [002]    91.875573: sched:sched_wakeup: comm=dd pid=147 prio=120 success=1 target_cpu=002

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-05-05 21:53       ` Rabin Vincent
@ 2015-05-05 22:24         ` David Ahern
  2015-05-06  4:13         ` Yunlong Song
  1 sibling, 0 replies; 12+ messages in thread
From: David Ahern @ 2015-05-05 22:24 UTC (permalink / raw)
  To: Rabin Vincent, Yunlong Song
  Cc: a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo,
	linux-kernel, wangnan0

On 5/5/15 3:53 PM, Rabin Vincent wrote:
> The commands above will identify the sys_write which takes time but only
> provide the stacktrace at the entry and exit of the syscall, but this do
> not show why the process blocked or what it did inside the system call.
>
> So a way to get what is required for the use case would be to make the
> following changes to the above sequence:
>
> (1) include the sched:* events when perf trace record is run
>
> (2) around the time of interest, look at the kernel stack st the sched:switch
>      events between the entry and the exit.  This will show what the process was
>      waiting for when it when it blocked.  The stacktraces at the
>      stat_runtime events in the process may also be useful to understand what
>      was going on.

agreed.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-04-25 15:53     ` David Ahern
  2015-05-05 21:53       ` Rabin Vincent
@ 2015-05-06  4:10       ` Yunlong Song
  1 sibling, 0 replies; 12+ messages in thread
From: Yunlong Song @ 2015-05-06  4:10 UTC (permalink / raw)
  To: David Ahern, a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo
  Cc: linux-kernel, wangnan0

On 2015/4/25 23:53, David Ahern wrote:
> On 4/25/15 8:05 AM, Yunlong Song wrote:
>> On 2015/4/24 21:58, David Ahern wrote:
>>> On 4/24/15 7:31 AM, Yunlong Song wrote:
>>>> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
>>>> we find a case that ext4 is better than f2fs in random write under the test of
>>>> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
>>>> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
>>>> the overall sys_write (262144/64=4096). We want to find out the call stack during
>>>> this specific 1% sys_write. Our idea is to record the stack in a certain time period
>>>> and since the specific 1% case takes up 60% time, the total number of records of its
>>>> stack should also takes up 60% of the total records, then we can recognize those stacks
>>>> and figure out what the I/O stack of f2fs is doing in the 1% case.
>>>
>>> And to address this specific profiling problem have you tried:
>>>
>>>     perf trace record -- iozone ...
>>>     perf trace -i perf.data -S
>>>
>>>
>>>
>>>
>>
>> But this only shows the system call like strace, but we want the call stack of kernel functions
>> in fact.
>>
> 
> We haven't added the callchain option yet; on the to-do list.
> 
> perf trace record -g -- iozone ...
> perf trace -i perf.data -s
> --> summary of system calls, max/min/average times
> 
> perf trace -i perf.data --duration 10.0 -T
> --> note the timestamp where the write took a "long" time
> 
> perf script
> --> search down to *around* the time of interest; you want the syscall entry; timestamp is for exit
> 
> .
> 

Hi, David,

It's almost what we want, we are eager to see it can work as a callchain option, since it's really a useful
tool in analyzing latency of I/O performance in production case.

-- 
Thanks,
Yunlong Song


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process
  2015-05-05 21:53       ` Rabin Vincent
  2015-05-05 22:24         ` David Ahern
@ 2015-05-06  4:13         ` Yunlong Song
  1 sibling, 0 replies; 12+ messages in thread
From: Yunlong Song @ 2015-05-06  4:13 UTC (permalink / raw)
  To: Rabin Vincent, David Ahern
  Cc: a.p.zijlstra, paulus, mingo, Arnaldo Carvalho de Melo,
	linux-kernel, wangnan0

On 2015/5/6 5:53, Rabin Vincent wrote:
> On Sat, Apr 25, 2015 at 09:53:57AM -0600, David Ahern wrote:
>> On 4/25/15 8:05 AM, Yunlong Song wrote:
>>> But this only shows the system call like strace, but we want the call
>>> stack of kernel functions in fact.
>>>
>> We haven't added the callchain option yet; on the to-do list.
>>
>> perf trace record -g -- iozone ...
>> perf trace -i perf.data -s
>> --> summary of system calls, max/min/average times
>>
>> perf trace -i perf.data --duration 10.0 -T
>> --> note the timestamp where the write took a "long" time
>>
>> perf script
>> --> search down to *around* the time of interest; you want the syscall
>> entry; timestamp is for exit
> 
> Now if I understood the use case right, what Yulong Song wants to know
> is what the iozone process is doing in the kernel (i.e. the stacktrace of why
> exactly it goes to sleep / what it's waiting for) during these
> sys_writes which take a long time.  
> 
> The commands above will identify the sys_write which takes time but only
> provide the stacktrace at the entry and exit of the syscall, but this do
> not show why the process blocked or what it did inside the system call.
> 
> So a way to get what is required for the use case would be to make the
> following changes to the above sequence:
> 
> (1) include the sched:* events when perf trace record is run
> 
> (2) around the time of interest, look at the kernel stack st the sched:switch 
>     events between the entry and the exit.  This will show what the process was
>     waiting for when it when it blocked.  The stacktraces at the
>     stat_runtime events in the process may also be useful to understand what
>     was going on.
> 
> Example:
> 
> $ perf trace record -g -e sched:* -- dd if=/dev/zero of=/x bs=10M count=100 conv=fsync
> $ perf trace -i perf.data -s
> 
>  dd (147), 364 events, 94.3%, 0.000 msec
> 
>    syscall            calls      min       avg       max      stddev
>    --------------- -------- --------- --------- ---------     ------
>    write                 63   266.019   316.896   963.413      4.69%
>    ...
> 
> $ perf trace -i perf.data --duration 960 -T
>  91916.354 (963.413 ms): dd/147 write(arg0: 1, arg1: 139729327423488, arg2: 10485760, arg3: 582, arg4: 100, arg5: 72340172838076673) = 10485760
> 
> $ perf script
> ...
> 
> dd   147 [002]    90.952941: raw_syscalls:sys_enter: NR 1 (1, 7f1544ed0000, a00000, 246, 64, 101010101010101)
> 	          211a00 syscall_trace_enter_phase2 ([kernel.kallsyms])
> 	          7d79c3 tracesys_phase2 ([kernel.kallsyms])
> 	            ed20 [unknown] (/lib/libpthread-2.16.so)
> 
> (... trimmed everything except the longest block: ...)
> 
> dd   147 [002]    91.204723: sched:sched_switch: prev_comm=dd prev_pid=147 prev_prio=120 prev_state=D ==> next_comm=swapper/2 next_pid=0 next_prio=120
> 	          7d1121 __schedule ([kernel.kallsyms])
> 	          7d1817 schedule ([kernel.kallsyms])
> 	          4578a5 jbd2_log_wait_commit ([kernel.kallsyms])
> 	          4546ff jbd2_log_do_checkpoint ([kernel.kallsyms])
> 	          45489f __jbd2_log_wait_for_space ([kernel.kallsyms])
> 	          44c2d8 start_this_handle ([kernel.kallsyms])
> 	          44c7b6 jbd2__journal_start ([kernel.kallsyms])
> 	          42fc3b __ext4_journal_start_sb ([kernel.kallsyms])
> 	          4012e9 ext4_write_begin ([kernel.kallsyms])
> 	          33795f generic_perform_write ([kernel.kallsyms])
> 	          338b68 __generic_file_write_iter ([kernel.kallsyms])
> 	          3f4371 ext4_file_write_iter ([kernel.kallsyms])
> 	          381ae1 __vfs_write ([kernel.kallsyms])
> 	          381f07 vfs_write ([kernel.kallsyms])
> 	          3825dd sys_write ([kernel.kallsyms])
> 	          7d7a1a tracesys_phase2 ([kernel.kallsyms])
> 	            ed20 [unknown] (/lib/libpthread-2.16.so)
> 
> swapper     0 [002]    91.875573: sched:sched_wakeup: comm=dd pid=147 prio=120 success=1 target_cpu=002
> 
> .
> 

Good, it's clearer with "sched:*" event.

-- 
Thanks,
Yunlong Song


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2015-05-06  4:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-24 13:31 [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process Yunlong Song
2015-04-24 13:49 ` Yunlong Song
2015-04-25 14:03   ` Yunlong Song
2015-04-24 13:49 ` David Ahern
2015-04-24 13:56   ` Yunlong Song
2015-04-24 13:58 ` David Ahern
2015-04-25 14:05   ` Yunlong Song
2015-04-25 15:53     ` David Ahern
2015-05-05 21:53       ` Rabin Vincent
2015-05-05 22:24         ` David Ahern
2015-05-06  4:13         ` Yunlong Song
2015-05-06  4:10       ` Yunlong Song

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.