Tracing sched_switch events for client application when process is switched back in

All of lore.kernel.org
 help / color / mirror / Atom feed

* Tracing sched_switch events for client application when process is switched back in
@ 2016-10-06 15:52 Milian Wolff
  2016-10-06 16:21 ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 9+ messages in thread
From: Milian Wolff @ 2016-10-06 15:52 UTC (permalink / raw)
  To: linux-perf-users

[-- Attachment #1: Type: text/plain, Size: 2595 bytes --]

Hey all,

when you have an application like this:

~~~~~~
#include <unistd.h>

int main()
{
   for (int i = 0; i < 100; ++i) {
       usleep(10);
   }
}
~~~~~~

Then record sched:sched_switch events for it:

~~~~~~
$ perf record -e sched:sched_switch ./a.out 
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.031 MB perf.data (100 samples) ]
~~~~~~

Note how it's only 100 samples, instead of 200 for both, the switch-out and 
switch-in events. Apparently we only get the switch-out events:

~~~~~
$ perf script -F trace | uniq -c
     99 a.out:3994 [120] S ==> swapper/7:0 [120]
      1 a.out:3994 [120] S ==> swapper/7:0 [1
~~~~~

Doing a global trace, or using LTTNG, we also see the switch-in events:

~~~~~
$ perf record -e sched:sched_switch -a ./a.out 
$ perf script -F trace | grep a.out | sort | uniq -c
    100 a.out:11570 [120] S ==> swapper/4:0 [120]
      1 a.out:11570 [120] x ==> swapper/4:0 [120]
    100 swapper/4:0 [120] R ==> a.out:11570 [120]
~~~~~

Is there any way I can get both sched switch events, i.e. not only the one 
with prev_tid=$client_tid, but also the one with next_tid=$client_tid? One way 
that seems to work is the following:

~~~~~
$ perf record -e sched:sched_switch --filter 'prev_comm=="a.out" || 
next_comm=="a.out"' -a ./a.out 
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.499 MB perf.data (201 samples) ]
$ perf script -F trace | grep a.out | sort | uniq -c
      1 akonadi_notes_a:2160 [120] S ==> a.out:14710 [120]
      3 a.out:14710 [120] S ==> swapper/2:0 [120]
     97 a.out:14710 [120] S ==> swapper/6:0 [120]
      1 a.out:14710 [120] x ==> swapper/6:0 [120]
      1 ksoftirqd/6:47 [120] S ==> a.out:14710 [120]
      1 mysqld:14707 [120] S ==> a.out:14710 [120]
      1 QDBusConnection:2140 [120] S ==> a.out:14710 [120]
      1 QDBusConnection:2213 [120] S ==> a.out:14710 [120]
      2 swapper/2:0 [120] R ==> a.out:14710 [120]
     93 swapper/6:0 [120] R ==> a.out:14710 [120]
~~~~~

The above workaround will not work for more complex applications, with 
secondary threads that can have different comm names. So:

Where is the "default" filtering done, which simply drops all events from 
sources that do not match the pid of the created process? I assume this 
happens in kernel space? Could it be adapted to also include switch_switch 
events when next_tid belongs to the pid we use for filtering?

Thanks
-- 
Milian Wolff | milian.wolff@kdab.com | Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5903 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Tracing sched_switch events for client application when process is switched back in
  2016-10-06 15:52 Tracing sched_switch events for client application when process is switched back in Milian Wolff
@ 2016-10-06 16:21 ` Arnaldo Carvalho de Melo
  2016-10-06 16:38   ` Milian Wolff
  0 siblings, 1 reply; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2016-10-06 16:21 UTC (permalink / raw)
  To: Milian Wolff; +Cc: linux-perf-users

Em Thu, Oct 06, 2016 at 05:52:51PM +0200, Milian Wolff escreveu:
> Hey all,
> 
> when you have an application like this:
> 
> ~~~~~~
> #include <unistd.h>
> 
> int main()
> {
>    for (int i = 0; i < 100; ++i) {
>        usleep(10);
>    }
> }
> ~~~~~~
> 
> Then record sched:sched_switch events for it:
> 
> ~~~~~~
> $ perf record -e sched:sched_switch ./a.out 
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.031 MB perf.data (100 samples) ]
> ~~~~~~
> 
> Note how it's only 100 samples, instead of 200 for both, the switch-out and 
> switch-in events. Apparently we only get the switch-out events:
> 
> ~~~~~
> $ perf script -F trace | uniq -c
>      99 a.out:3994 [120] S ==> swapper/7:0 [120]
>       1 a.out:3994 [120] S ==> swapper/7:0 [1
> ~~~~~
> 
> Doing a global trace, or using LTTNG, we also see the switch-in events:
> 
> ~~~~~
> $ perf record -e sched:sched_switch -a ./a.out 
> $ perf script -F trace | grep a.out | sort | uniq -c
>     100 a.out:11570 [120] S ==> swapper/4:0 [120]
>       1 a.out:11570 [120] x ==> swapper/4:0 [120]
>     100 swapper/4:0 [120] R ==> a.out:11570 [120]
> ~~~~~
> 
> Is there any way I can get both sched switch events, i.e. not only the one 
> with prev_tid=$client_tid, but also the one with next_tid=$client_tid? One way 
> that seems to work is the following:

Have you considered using PERF_RECORD_SWITCH?

[acme@jouet linux]$ perf record --switch-events sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.020 MB perf.data (7 samples) ]
[acme@jouet linux]$ perf script --show-switch-events
  sleep 20679  5578.172409:          1 cycles:u:      7fb4e5a51c80 _start+0x0 (/usr/lib64/ld-2.23.so)
  sleep 20679  5578.172411:          1 cycles:u:      7fb4e5a51c80 _start+0x0 (/usr/lib64/ld-2.23.so)
  sleep 20679  5578.172412:         12 cycles:u:      7fb4e5a51c80 _start+0x0 (/usr/lib64/ld-2.23.so)
  sleep 20679  5578.172413:        397 cycles:u:      7fb4e5a51c80 _start+0x0 (/usr/lib64/ld-2.23.so)
  sleep 20679  5578.172414:      13936 cycles:u:  ffffffffb77ee7a0 page_fault+0x0 (/usr/lib/debug/lib/modules/4.7.5-200.fc24.x86_64/vmlinux)
  sleep 20679  5578.172431:     330542 cycles:u:      7fb4e5a5301f dl_main+0x6cf (/usr/lib64/ld-2.23.so)
  sleep 20679  5578.172700: PERF_RECORD_SWITCH OUT
  sleep 20679  5579.172821: PERF_RECORD_SWITCH IN 
  sleep 20679  5579.172838:     802931 cycles:u:      7fb4e5a6122e _dl_fini+0x8e (/usr/lib64/ld-2.23.so)
[acme@jouet linux]$ 
 
> ~~~~~
> $ perf record -e sched:sched_switch --filter 'prev_comm=="a.out" || 
> next_comm=="a.out"' -a ./a.out 
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.499 MB perf.data (201 samples) ]
> $ perf script -F trace | grep a.out | sort | uniq -c
>       1 akonadi_notes_a:2160 [120] S ==> a.out:14710 [120]
>       3 a.out:14710 [120] S ==> swapper/2:0 [120]
>      97 a.out:14710 [120] S ==> swapper/6:0 [120]
>       1 a.out:14710 [120] x ==> swapper/6:0 [120]
>       1 ksoftirqd/6:47 [120] S ==> a.out:14710 [120]
>       1 mysqld:14707 [120] S ==> a.out:14710 [120]
>       1 QDBusConnection:2140 [120] S ==> a.out:14710 [120]
>       1 QDBusConnection:2213 [120] S ==> a.out:14710 [120]
>       2 swapper/2:0 [120] R ==> a.out:14710 [120]
>      93 swapper/6:0 [120] R ==> a.out:14710 [120]
> ~~~~~
> 
> The above workaround will not work for more complex applications, with 
> secondary threads that can have different comm names. So:
> 
> Where is the "default" filtering done, which simply drops all events from 
> sources that do not match the pid of the created process? I assume this 
> happens in kernel space? Could it be adapted to also include switch_switch 
> events when next_tid belongs to the pid we use for filtering?
> 
> Thanks
> -- 
> Milian Wolff | milian.wolff@kdab.com | Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Tracing sched_switch events for client application when process is switched back in
  2016-10-06 16:21 ` Arnaldo Carvalho de Melo
@ 2016-10-06 16:38   ` Milian Wolff
  2016-10-11  0:36     ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 9+ messages in thread
From: Milian Wolff @ 2016-10-06 16:38 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users

[-- Attachment #1: Type: text/plain, Size: 4099 bytes --]

On Thursday, October 6, 2016 1:21:25 PM CEST Arnaldo Carvalho de Melo wrote:
> Em Thu, Oct 06, 2016 at 05:52:51PM +0200, Milian Wolff escreveu:
> > Hey all,
> > 
> > when you have an application like this:
> > 
> > ~~~~~~
> > #include <unistd.h>
> > 
> > int main()
> > {
> > 
> >    for (int i = 0; i < 100; ++i) {
> >    
> >        usleep(10);
> >    
> >    }
> > 
> > }
> > ~~~~~~
> > 
> > Then record sched:sched_switch events for it:
> > 
> > ~~~~~~
> > $ perf record -e sched:sched_switch ./a.out
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.031 MB perf.data (100 samples) ]
> > ~~~~~~
> > 
> > Note how it's only 100 samples, instead of 200 for both, the switch-out
> > and
> > switch-in events. Apparently we only get the switch-out events:
> > 
> > ~~~~~
> > $ perf script -F trace | uniq -c
> > 
> >      99 a.out:3994 [120] S ==> swapper/7:0 [120]
> >      
> >       1 a.out:3994 [120] S ==> swapper/7:0 [1
> > 
> > ~~~~~
> > 
> > Doing a global trace, or using LTTNG, we also see the switch-in events:
> > 
> > ~~~~~
> > $ perf record -e sched:sched_switch -a ./a.out
> > $ perf script -F trace | grep a.out | sort | uniq -c
> > 
> >     100 a.out:11570 [120] S ==> swapper/4:0 [120]
> >     
> >       1 a.out:11570 [120] x ==> swapper/4:0 [120]
> >     
> >     100 swapper/4:0 [120] R ==> a.out:11570 [120]
> > 
> > ~~~~~
> > 
> > Is there any way I can get both sched switch events, i.e. not only the one
> > with prev_tid=$client_tid, but also the one with next_tid=$client_tid? One
> > way
> > that seems to work is the following:
> Have you considered using PERF_RECORD_SWITCH?
> 
> [acme@jouet linux]$ perf record --switch-events sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.020 MB perf.data (7 samples) ]
> [acme@jouet linux]$ perf script --show-switch-events
>   sleep 20679  5578.172409:          1 cycles:u:      7fb4e5a51c80
> _start+0x0 (/usr/lib64/ld-2.23.so) sleep 20679  5578.172411:          1
> cycles:u:      7fb4e5a51c80 _start+0x0 (/usr/lib64/ld-2.23.so) sleep 20679 
> 5578.172412:         12 cycles:u:      7fb4e5a51c80 _start+0x0
> (/usr/lib64/ld-2.23.so) sleep 20679  5578.172413:        397 cycles:u:     
> 7fb4e5a51c80 _start+0x0 (/usr/lib64/ld-2.23.so) sleep 20679  5578.172414:  
>    13936 cycles:u:  ffffffffb77ee7a0 page_fault+0x0
> (/usr/lib/debug/lib/modules/4.7.5-200.fc24.x86_64/vmlinux) sleep 20679 
> 5578.172431:     330542 cycles:u:      7fb4e5a5301f dl_main+0x6cf
> (/usr/lib64/ld-2.23.so) sleep 20679  5578.172700: PERF_RECORD_SWITCH OUT
>   sleep 20679  5579.172821: PERF_RECORD_SWITCH IN
>   sleep 20679  5579.172838:     802931 cycles:u:      7fb4e5a6122e
> _dl_fini+0x8e (/usr/lib64/ld-2.23.so) [acme@jouet linux]$

First time I read about it, and the documentation is scarce, to not say non-
existing. It seems to record cycles for the switch events. But why does it 
only find 7, whereas tracing sched_switch finds all 100 that actually take 
place?

The reason why I want sched_switch is that I want to build a "better" sleep 
time profiler. The existing script [1] is cumbersome to use, and only shows 
parts of what is going on:

- only looks at sched_stat_sleep, ignores sched_stat_iowait or sched_stat_wait
- requires one to enable the sched_stat tracepoints in newer kernels 
  $ echo 1 | sudo tee /proc/sys/kernel/sched_schedstats
- as a user-space developer, I actually care about the time between getting 
switched out and getting switched in, so using that in perf would actually 
give me the duration I got switched out

If I could do something like this:

$ perf record -e cycles:ppp,sched:sched_switch <my app>

And have both switch-in and switch-out traced, than a GUI could visualize this 
together and show me both, CPU hotspots as well as sleep/io/blocked time where 
I got switched out

[1] https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times

-- 
Milian Wolff | milian.wolff@kdab.com | Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5903 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Tracing sched_switch events for client application when process is switched back in
  2016-10-06 16:38   ` Milian Wolff
@ 2016-10-11  0:36     ` Arnaldo Carvalho de Melo
  2016-11-14 14:10       ` Milian Wolff
  0 siblings, 1 reply; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2016-10-11  0:36 UTC (permalink / raw)
  To: Milian Wolff; +Cc: linux-perf-users

Em Thu, Oct 06, 2016 at 06:38:28PM +0200, Milian Wolff escreveu:
> On Thursday, October 6, 2016 1:21:25 PM CEST Arnaldo Carvalho de Melo wrote:
> > Em Thu, Oct 06, 2016 at 05:52:51PM +0200, Milian Wolff escreveu:
> > > Hey all,
> > > 
> > > when you have an application like this:
> > > 
> > > ~~~~~~
> > > #include <unistd.h>
> > > 
> > > int main()
> > > {
> > > 
> > >    for (int i = 0; i < 100; ++i) {
> > >    
> > >        usleep(10);
> > >    
> > >    }
> > > 
> > > }
> > > ~~~~~~
> > > 
> > > Then record sched:sched_switch events for it:
> > > 
> > > ~~~~~~
> > > $ perf record -e sched:sched_switch ./a.out
> > > [ perf record: Woken up 1 times to write data ]
> > > [ perf record: Captured and wrote 0.031 MB perf.data (100 samples) ]
> > > ~~~~~~
> > > 
> > > Note how it's only 100 samples, instead of 200 for both, the switch-out
> > > and
> > > switch-in events. Apparently we only get the switch-out events:
> > > 
> > > ~~~~~
> > > $ perf script -F trace | uniq -c
> > > 
> > >      99 a.out:3994 [120] S ==> swapper/7:0 [120]
> > >      
> > >       1 a.out:3994 [120] S ==> swapper/7:0 [1
> > > 
> > > ~~~~~
> > > 
> > > Doing a global trace, or using LTTNG, we also see the switch-in events:
> > > 
> > > ~~~~~
> > > $ perf record -e sched:sched_switch -a ./a.out
> > > $ perf script -F trace | grep a.out | sort | uniq -c
> > > 
> > >     100 a.out:11570 [120] S ==> swapper/4:0 [120]
> > >     
> > >       1 a.out:11570 [120] x ==> swapper/4:0 [120]
> > >     
> > >     100 swapper/4:0 [120] R ==> a.out:11570 [120]
> > > 
> > > ~~~~~
> > > 
> > > Is there any way I can get both sched switch events, i.e. not only the one
> > > with prev_tid=$client_tid, but also the one with next_tid=$client_tid? One
> > > way
> > > that seems to work is the following:
> > Have you considered using PERF_RECORD_SWITCH?
> > 
> > [acme@jouet linux]$ perf record --switch-events sleep 1
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.020 MB perf.data (7 samples) ]
> > [acme@jouet linux]$ perf script --show-switch-events
> >   sleep 20679  5578.172409:          1 cycles:u:      7fb4e5a51c80
> > _start+0x0 (/usr/lib64/ld-2.23.so) sleep 20679  5578.172411:          1
> > cycles:u:      7fb4e5a51c80 _start+0x0 (/usr/lib64/ld-2.23.so) sleep 20679 
> > 5578.172412:         12 cycles:u:      7fb4e5a51c80 _start+0x0
> > (/usr/lib64/ld-2.23.so) sleep 20679  5578.172413:        397 cycles:u:     
> > 7fb4e5a51c80 _start+0x0 (/usr/lib64/ld-2.23.so) sleep 20679  5578.172414:  
> >    13936 cycles:u:  ffffffffb77ee7a0 page_fault+0x0
> > (/usr/lib/debug/lib/modules/4.7.5-200.fc24.x86_64/vmlinux) sleep 20679 
> > 5578.172431:     330542 cycles:u:      7fb4e5a5301f dl_main+0x6cf
> > (/usr/lib64/ld-2.23.so) sleep 20679  5578.172700: PERF_RECORD_SWITCH OUT
> >   sleep 20679  5579.172821: PERF_RECORD_SWITCH IN
> >   sleep 20679  5579.172838:     802931 cycles:u:      7fb4e5a6122e
> > _dl_fini+0x8e (/usr/lib64/ld-2.23.so) [acme@jouet linux]$
> 
> First time I read about it, and the documentation is scarce, to not say non-
> existing. It seems to record cycles for the switch events. But why does it 



> only find 7, whereas tracing sched_switch finds all 100 that actually take 
> place?

This is looking only for the sched switches for the monitored workload,
which in this case is 'sleep 1', this wasn't a system wide session.

add -a and you'll get those other switches, if I got what you described.
 
> The reason why I want sched_switch is that I want to build a "better" sleep 
> time profiler. The existing script [1] is cumbersome to use, and only shows 
> parts of what is going on:
> 
> - only looks at sched_stat_sleep, ignores sched_stat_iowait or sched_stat_wait
> - requires one to enable the sched_stat tracepoints in newer kernels 
>   $ echo 1 | sudo tee /proc/sys/kernel/sched_schedstats
> - as a user-space developer, I actually care about the time between getting 
> switched out and getting switched in, so using that in perf would actually 
> give me the duration I got switched out
> 
> If I could do something like this:
> 
> $ perf record -e cycles:ppp,sched:sched_switch <my app>
> 
> And have both switch-in and switch-out traced, than a GUI could visualize this 
> together and show me both, CPU hotspots as well as sleep/io/blocked time where 
> I got switched out
> 
> [1] https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times
> 
> -- 
> Milian Wolff | milian.wolff@kdab.com | Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Tracing sched_switch events for client application when process is switched back in
  2016-10-11  0:36     ` Arnaldo Carvalho de Melo
@ 2016-11-14 14:10       ` Milian Wolff
  2016-11-14 15:36         ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 9+ messages in thread
From: Milian Wolff @ 2016-11-14 14:10 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users

[-- Attachment #1: Type: text/plain, Size: 4821 bytes --]

On Monday, October 10, 2016 9:36:55 PM CET Arnaldo Carvalho de Melo wrote:
> Em Thu, Oct 06, 2016 at 06:38:28PM +0200, Milian Wolff escreveu:
> > On Thursday, October 6, 2016 1:21:25 PM CEST Arnaldo Carvalho de Melo 
wrote:
> > > Em Thu, Oct 06, 2016 at 05:52:51PM +0200, Milian Wolff escreveu:
> > > > Hey all,
> > > > 
> > > > when you have an application like this:
> > > > 
> > > > ~~~~~~
> > > > #include <unistd.h>
> > > > 
> > > > int main()
> > > > {
> > > > 
> > > >    for (int i = 0; i < 100; ++i) {
> > > >    
> > > >        usleep(10);
> > > >    
> > > >    }
> > > > 
> > > > }
> > > > ~~~~~~
> > > > 
> > > > Then record sched:sched_switch events for it:
> > > > 
> > > > ~~~~~~
> > > > $ perf record -e sched:sched_switch ./a.out
> > > > [ perf record: Woken up 1 times to write data ]
> > > > [ perf record: Captured and wrote 0.031 MB perf.data (100 samples) ]
> > > > ~~~~~~
> > > > 
> > > > Note how it's only 100 samples, instead of 200 for both, the
> > > > switch-out
> > > > and
> > > > switch-in events. Apparently we only get the switch-out events:
> > > > 
> > > > ~~~~~
> > > > $ perf script -F trace | uniq -c
> > > > 
> > > >      99 a.out:3994 [120] S ==> swapper/7:0 [120]
> > > >      
> > > >       1 a.out:3994 [120] S ==> swapper/7:0 [1
> > > > 
> > > > ~~~~~
> > > > 
> > > > Doing a global trace, or using LTTNG, we also see the switch-in
> > > > events:
> > > > 
> > > > ~~~~~
> > > > $ perf record -e sched:sched_switch -a ./a.out
> > > > $ perf script -F trace | grep a.out | sort | uniq -c
> > > > 
> > > >     100 a.out:11570 [120] S ==> swapper/4:0 [120]
> > > >     
> > > >       1 a.out:11570 [120] x ==> swapper/4:0 [120]
> > > >     
> > > >     100 swapper/4:0 [120] R ==> a.out:11570 [120]
> > > > 
> > > > ~~~~~
> > > > 
> > > > Is there any way I can get both sched switch events, i.e. not only the
> > > > one
> > > > with prev_tid=$client_tid, but also the one with next_tid=$client_tid?
> > > > One
> > > > way
> > > 
> > > > that seems to work is the following:
> > > Have you considered using PERF_RECORD_SWITCH?
> > > 
> > > [acme@jouet linux]$ perf record --switch-events sleep 1
> > > [ perf record: Woken up 1 times to write data ]
> > > [ perf record: Captured and wrote 0.020 MB perf.data (7 samples) ]
> > > [acme@jouet linux]$ perf script --show-switch-events
> > > 
> > >   sleep 20679  5578.172409:          1 cycles:u:      7fb4e5a51c80
> > > 
> > > _start+0x0 (/usr/lib64/ld-2.23.so) sleep 20679  5578.172411:          1
> > > cycles:u:      7fb4e5a51c80 _start+0x0 (/usr/lib64/ld-2.23.so) sleep
> > > 20679
> > > 5578.172412:         12 cycles:u:      7fb4e5a51c80 _start+0x0
> > > (/usr/lib64/ld-2.23.so) sleep 20679  5578.172413:        397 cycles:u:
> > > 
> > > 7fb4e5a51c80 _start+0x0 (/usr/lib64/ld-2.23.so) sleep 20679  
5578.172414:
> > >    13936 cycles:u:  ffffffffb77ee7a0 page_fault+0x0
> > > 
> > > (/usr/lib/debug/lib/modules/4.7.5-200.fc24.x86_64/vmlinux) sleep 20679
> > > 5578.172431:     330542 cycles:u:      7fb4e5a5301f dl_main+0x6cf
> > > (/usr/lib64/ld-2.23.so) sleep 20679  5578.172700: PERF_RECORD_SWITCH OUT
> > > 
> > >   sleep 20679  5579.172821: PERF_RECORD_SWITCH IN
> > >   sleep 20679  5579.172838:     802931 cycles:u:      7fb4e5a6122e
> > > 
> > > _dl_fini+0x8e (/usr/lib64/ld-2.23.so) [acme@jouet linux]$
> > 
> > First time I read about it, and the documentation is scarce, to not say
> > non- existing. It seems to record cycles for the switch events. But why
> > does it
> > 
> > 
> > 
> > only find 7, whereas tracing sched_switch finds all 100 that actually take
> > place?

Hey there,

long delay but I looked at this again today.

> This is looking only for the sched switches for the monitored workload,
> which in this case is 'sleep 1', this wasn't a system wide session.
>
> add -a and you'll get those other switches, if I got what you described.

I've used the sources of the simple C application I showed in the first email 
of this thread (still quoted above). Then I try:

perf record --switch-events -a ./a.out

And indeed, I get all switch events. But I now also profile all applications 
on the machine, which is not what I want. I want to combine the following 
features:

- sample CPU events for a given application with backtraces
	perf record --call-graph dwarf ./foo
- record switch-out events with backtraces (`-e sched:sched_switch/fp=dwarf/`)
- somehow record the switch-in events, which are associated with a different 
process and thus currently discarded

Any advise on where to look for the filtering? Not discarding the sched_switch 
of a different process based on matching next_tid can't be that hard, or is 
it?

Cheers

-- 
Milian Wolff | milian.wolff@kdab.com | Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5903 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Tracing sched_switch events for client application when process is switched back in
  2016-11-14 14:10       ` Milian Wolff
@ 2016-11-14 15:36         ` Arnaldo Carvalho de Melo
  2016-11-14 18:17           ` Milian Wolff
  0 siblings, 1 reply; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2016-11-14 15:36 UTC (permalink / raw)
  To: Milian Wolff; +Cc: linux-perf-users

Em Mon, Nov 14, 2016 at 03:10:26PM +0100, Milian Wolff escreveu:
> On Monday, October 10, 2016 9:36:55 PM CET Arnaldo Carvalho de Melo wrote:
> > Em Thu, Oct 06, 2016 at 06:38:28PM +0200, Milian Wolff escreveu:
> > > only find 7, whereas tracing sched_switch finds all 100 that actually take
> > > place?
 
> Hey there,
 
> long delay but I looked at this again today.
 
> > This is looking only for the sched switches for the monitored workload,
> > which in this case is 'sleep 1', this wasn't a system wide session.

> > add -a and you'll get those other switches, if I got what you described.
 
> I've used the sources of the simple C application I showed in the first email 
> of this thread (still quoted above). Then I try:
 
> perf record --switch-events -a ./a.out
 
> And indeed, I get all switch events. But I now also profile all applications 
> on the machine, which is not what I want. I want to combine the following 

So you remove that '-a' from that command line and gets just the context
switches for that ./a.out app, no?

Lets see:

[root@jouet ~]# perf record --switch-events sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.020 MB perf.data (7 samples) ]
[root@jouet ~]#

[root@jouet ~]# perf script --show-switch-events
   perf 29395 176148.426824:          1 cycles:ppp:  ffffffffae1ad256 perf_event_exec+0x146 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux
   perf 29395 176148.426827:          1 cycles:ppp:  ffffffffae00ca1b intel_pmu_handle_irq+0x21b (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vm
   perf 29395 176148.426829:         10 cycles:ppp:  ffffffffae02fd93 native_sched_clock+0x33 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlin
   perf 29395 176148.426830:        221 cycles:ppp:  ffffffffae02fd93 native_sched_clock+0x33 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlin
   perf 29395 176148.426831:       5213 cycles:ppp:  ffffffffae02fe79 sched_clock+0x9 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
  sleep 29395 176148.426834:     120344 cycles:ppp:  ffffffffae36b0cf selinux_inode_permission+0x11f (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_6
  sleep 29395 176148.426873:    1717973 cycles:ppp:  ffffffffae259e3b free_pipe_info+0x4b (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
  sleep 29395 176148.427182: PERF_RECORD_SWITCH OUT
  sleep 29395 176149.427366: PERF_RECORD_SWITCH IN 

And if you use something that generates more switches, say:

[root@jouet ~]# perf record --switch-events ping -c 5 localhost
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=1 ttl=64 time=0.043 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=2 ttl=64 time=0.040 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=3 ttl=64 time=0.038 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=4 ttl=64 time=0.038 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=5 ttl=64 time=0.036 ms

--- localhost.localdomain ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4077ms
rtt min/avg/max/mdev = 0.036/0.039/0.043/0.002 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.021 MB perf.data (9 samples) ]
[root@jouet ~]# perf script --show-switch-events
   perf 29429 176239.731333:          1 cycles:ppp:  ffffffffae1ad250 perf_event_exec+0x140 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux
   perf 29429 176239.731336:          1 cycles:ppp:  ffffffffae00ca1b intel_pmu_handle_irq+0x21b (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vm
   perf 29429 176239.731338:          9 cycles:ppp:  ffffffffae02fd93 native_sched_clock+0x33 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlin
   perf 29429 176239.731339:        198 cycles:ppp:  ffffffffae02fd93 native_sched_clock+0x33 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlin
   perf 29429 176239.731340:       4667 cycles:ppp:  ffffffffae02fe7c sched_clock+0xc (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
   ping 29429 176239.731343:     108407 cycles:ppp:  ffffffffae258d09 setup_new_exec+0x109 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
   ping 29429 176239.731378:    1645300 cycles:ppp:  ffffffffae240c29 mem_cgroup_try_charge+0x19 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vm
   ping 29429 176239.731951:    2594316 cycles:ppp:  ffffffffae0e46fc __wake_up_bit+0x1c (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
   ping 29429 176239.732833:    2411373 cycles:ppp:  ffffffffae756ec0 icmpv4_xrlim_allow.isra.17+0x60 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_
   ping 29429 176239.732872: PERF_RECORD_SWITCH OUT
   ping 29429 176240.738063: PERF_RECORD_SWITCH IN 
   ping 29429 176240.738139: PERF_RECORD_SWITCH OUT
   ping 29429 176241.762048: PERF_RECORD_SWITCH IN 
   ping 29429 176241.762119: PERF_RECORD_SWITCH OUT
   ping 29429 176242.786143: PERF_RECORD_SWITCH IN 
   ping 29429 176242.786221: PERF_RECORD_SWITCH OUT
   ping 29429 176243.810099: PERF_RECORD_SWITCH IN 
   ping 29429 176243.810182: PERF_RECORD_SWITCH OUT
   ping 29429 176243.810190: PERF_RECORD_SWITCH IN

5 out, 5 in, in a lightly loaded machine, plenty of CPUs, looks about right, no?

But...

> features:
> 
> - sample CPU events for a given application with backtraces
> 	perf record --call-graph dwarf ./foo
> - record switch-out events with backtraces (`-e sched:sched_switch/fp=dwarf/`)
> - somehow record the switch-in events, which are associated with a different 
> process and thus currently discarded

You want the backtraces with that, and since this is a meta-event... Humm, for
the switch in would the above be sufficient? What value would be in knowing the
backtrace in that case?

So something like would do? Try not looking at the "PERF_RECORD_SWITCH OUT"
lines, they are generated by the kernel _after_ the corresponding
sched:sched_switch is registered:

[root@jouet ~]# perf record --call-graph=dwarf --switch-events -e sched:sched_switch ping -c 5 localhost
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=1 ttl=64 time=0.043 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=2 ttl=64 time=0.039 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=3 ttl=64 time=0.043 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=4 ttl=64 time=0.041 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=5 ttl=64 time=0.039 ms

--- localhost.localdomain ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4074ms
rtt min/avg/max/mdev = 0.039/0.041/0.043/0.002 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.069 MB perf.data (5 samples) ]
[root@jouet ~]# ^C
[root@jouet ~]# perf script --show-switch-events
ping 29565 [001] 176667.912276: sched:sched_switch: ping:29565 [120] S ==> swapper/1:0 [120]
                  9fd994 __schedule+0xa4200354 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  9fdde5 schedule+0xa4200035 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  a01259 schedule_timeout+0xa4200219 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  8cb431 __skb_wait_for_more_packets+0xa4200111 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  8cb796 __skb_recv_datagram+0xa4200066 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  8cb82f skb_recv_datagram+0xa420003f (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  94d197 raw_recvmsg+0xa4200097 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  95daee inet_recvmsg+0xa420007e (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  8bb61d sock_recvmsg+0xa420003d (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  8bc667 ___sys_recvmsg+0xa42000d7 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  8bd191 __sys_recvmsg+0xa4200051 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  8bd1e2 sys_recvmsg+0xa4200012 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  a02572 entry_SYSCALL_64_fastpath+0xa420001a (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  104050 __GI___libc_recvmsg+0xffff0206e8e70010 (/usr/lib64/libc-2.23.so)
                    7b5a _init+0xffff5455e66f3ab2 (/usr/bin/ping)
                    3b34 _init+0xffff5455e66efa8c (/usr/bin/ping)
                    2d9f _init+0xffff5455e66eecf7 (/usr/bin/ping)
                   20731 __libc_start_main+0xffff0206e8e700f1 (/usr/lib64/libc-2.23.so)
                    30a9 _init+0xffff5455e66ef001 (/usr/bin/ping)

ping 29565 [001] 176667.912281: PERF_RECORD_SWITCH OUT
ping 29565 [001] 176668.915107: PERF_RECORD_SWITCH IN 
ping 29565 [001] 176668.915114: sched:sched_switch: ping:29565 [120] R ==> kworker/1:179:28142 [120]
                  9fd994 __schedule+0xa4200354 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  9fdde5 schedule+0xa4200035 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  203282 exit_to_usermode_loop+0xa4200072 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  203be1 syscall_return_slowpath+0xa42000a1 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  a025fa entry_SYSCALL_64_fastpath+0xa42000a2 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  104050 __GI___libc_recvmsg+0xffff0206e8e70010 (/usr/lib64/libc-2.23.so)
                    7b5a _init+0xffff5455e66f3ab2 (/usr/bin/ping)
                    3b34 _init+0xffff5455e66efa8c (/usr/bin/ping)
                    2d9f _init+0xffff5455e66eecf7 (/usr/bin/ping)
                   20731 __libc_start_main+0xffff0206e8e700f1 (/usr/lib64/libc-2.23.so)
                    30a9 _init+0xffff5455e66ef001 (/usr/bin/ping)

ping 29565 [001] 176668.915121: PERF_RECORD_SWITCH OUT
ping 29565 [001] 176668.915126: PERF_RECORD_SWITCH IN 
ping 29565 [001] 176668.915197: sched:sched_switch: ping:29565 [120] S ==> swapper/1:0 [120]
                  9fd994 __schedule+0xa4200354 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
                  9fdde5 schedule+0xa4200035 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
[root@jouet ~]# 

Just to check that the backtrace for ping is ok, I did a 'dnf debuginfo-install
iputils', looks sensible:
 
ping 29619 [001] 176793.486584: sched:sched_switch: ping:29619 [120] S ==> swapper/1:0 [120]
        9fd994 __schedule+0xa4200354 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        9fdde5 schedule+0xa4200035 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        a01259 schedule_timeout+0xa4200219 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        8cb431 __skb_wait_for_more_packets+0xa4200111 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        8cb796 __skb_recv_datagram+0xa4200066 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        8cb82f skb_recv_datagram+0xa420003f (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        94d197 raw_recvmsg+0xa4200097 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        95daee inet_recvmsg+0xa420007e (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        8bb61d sock_recvmsg+0xa420003d (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        8bc667 ___sys_recvmsg+0xa42000d7 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        8bd191 __sys_recvmsg+0xa4200051 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        8bd1e2 sys_recvmsg+0xa4200012 (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        a02572 entry_SYSCALL_64_fastpath+0xa420001a (/usr/lib/debug/lib/modules/4.8.6-201.fc24.x86_64/vmlinux)
        104050 __GI___libc_recvmsg+0xffff020805e62010 (/usr/lib64/libc-2.23.so)
          7b5a main_loop+0xffff550a5a4dc1fa (/usr/bin/ping)
          3b34 ping4_run+0xffff550a5a4dc534 (/usr/bin/ping)
          2d9f main+0xffff550a5a4dc98f (/usr/bin/ping)
         20731 __libc_start_main+0xffff020805e620f1 (/usr/lib64/libc-2.23.so)
          30a9 _start+0xffff550a5a4dc029 (/usr/bin/ping)

> Any advise on where to look for the filtering? Not discarding the sched_switch 
> of a different process based on matching next_tid can't be that hard, or is 
> it?
> 
> Cheers
> 
> -- 
> Milian Wolff | milian.wolff@kdab.com | Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Tracing sched_switch events for client application when process is switched back in
  2016-11-14 15:36         ` Arnaldo Carvalho de Melo
@ 2016-11-14 18:17           ` Milian Wolff
  2016-11-15  1:17             ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 9+ messages in thread
From: Milian Wolff @ 2016-11-14 18:17 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users

[-- Attachment #1: Type: text/plain, Size: 3357 bytes --]

On Monday, November 14, 2016 12:36:11 PM CET Arnaldo Carvalho de Melo wrote:
> Em Mon, Nov 14, 2016 at 03:10:26PM +0100, Milian Wolff escreveu:
> > On Monday, October 10, 2016 9:36:55 PM CET Arnaldo Carvalho de Melo wrote:
> > > Em Thu, Oct 06, 2016 at 06:38:28PM +0200, Milian Wolff escreveu:
> > > > only find 7, whereas tracing sched_switch finds all 100 that actually
> > > > take
> > > > place?
> > 
> > Hey there,
> > 
> > long delay but I looked at this again today.
> > 
> > > This is looking only for the sched switches for the monitored workload,
> > > which in this case is 'sleep 1', this wasn't a system wide session.
> > > 
> > > add -a and you'll get those other switches, if I got what you described.
> > 
> > I've used the sources of the simple C application I showed in the first
> > email of this thread (still quoted above). Then I try:
> > 
> > perf record --switch-events -a ./a.out
> > 
> > And indeed, I get all switch events. But I now also profile all
> > applications on the machine, which is not what I want. I want to combine
> > the following
>
> So you remove that '-a' from that command line and gets just the context
> switches for that ./a.out app, no?
>
> Lets see:

<snip>

> 5 out, 5 in, in a lightly loaded machine, plenty of CPUs, looks about right,
> no?

Hm this is odd. Before writing the last mail, and the one from one month ago, 
I tried this and it did not work. But now it does seem to work as intended, 
even with my simple application code...

perf record --switch-events ./a.out
perf script --show-switch-events | grep -c SWITCH
200

So this clearly works as intended - sorry for the noise.

> But...
> 
> > features:
> > 
> > - sample CPU events for a given application with backtraces
> > 
> > 	perf record --call-graph dwarf ./foo
> > 
> > - record switch-out events with backtraces (`-e
> > sched:sched_switch/fp=dwarf/`) - somehow record the switch-in events,
> > which are associated with a different process and thus currently
> > discarded
> 
> You want the backtraces with that, and since this is a meta-event... Humm,
> for the switch in would the above be sufficient? What value would be in
> knowing the backtrace in that case?

Knowing the backtrace would ensure I get the real location that triggered a 
switch, i.e. often times that comes from a syscall like futex or the like. Or 
would I always get a CPU event (e.g. cycles) with a backtrace before the 
switch happens, pointing to the same location?

> So something like would do? Try not looking at the "PERF_RECORD_SWITCH OUT"
> lines, they are generated by the kernel _after_ the corresponding
> sched:sched_switch is registered:
> 
> [root@jouet ~]# perf record --call-graph=dwarf --switch-events -e

<snip>

Yes, that seems to work just fine:

perf record -e cycles -e sched:sched_switch \
	--call-graph dwarf --switch-events \
	./a.out

That gives me all information. With some post-processing one could then turn 
this into a really useful profiling tool, as one will only need a single perf 
record step to allow both, on-CPU as well as off-CPU profiling for application 
developers.

Many thanks for the help, Arnaldo, and sorry again for the confusion on my 
side.

Cheers
-- 
Milian Wolff | milian.wolff@kdab.com | Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5903 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Tracing sched_switch events for client application when process is switched back in
  2016-11-14 18:17           ` Milian Wolff
@ 2016-11-15  1:17             ` Arnaldo Carvalho de Melo
  2016-11-16 11:17               ` Milian Wolff
  0 siblings, 1 reply; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2016-11-15  1:17 UTC (permalink / raw)
  To: Milian Wolff; +Cc: linux-perf-users

Em Mon, Nov 14, 2016 at 07:17:01PM +0100, Milian Wolff escreveu:
> On Monday, November 14, 2016 12:36:11 PM CET Arnaldo Carvalho de Melo wrote:
> > Em Mon, Nov 14, 2016 at 03:10:26PM +0100, Milian Wolff escreveu:
> > > features:

> > > - sample CPU events for a given application with backtraces

> > > 	perf record --call-graph dwarf ./foo

> > > - record switch-out events with backtraces (`-e
> > > sched:sched_switch/fp=dwarf/`) - somehow record the switch-in events,
> > > which are associated with a different process and thus currently
> > > discarded

> > You want the backtraces with that, and since this is a meta-event... Humm,
> > for the switch in would the above be sufficient? What value would be in
> > knowing the backtrace in that case?

> Knowing the backtrace would ensure I get the real location that triggered a 
> switch, i.e. often times that comes from a syscall like futex or the like. Or 

Sure, sure, a backtrace for a sched switch _out_ of a thread is useful,
and that you get from the sched:sched_switch tracepoint, what I'm
alluding to is the lack of a backtrace with the PERF_RECORD_SWITCH _IN_,
meta-event i.e. when the kernel switchs back to the thread being
monitored.

> would I always get a CPU event (e.g. cycles) with a backtrace before the 

Is there a need for that? With sched:sched_switch + backtrace you get a
backtrace right at the moment of the sched_switch _out_, then, with
PERF_RECORD_SWITCH (--switch-events) you'll get the missing piece, the
sched switch _in_, that gets a CPU back to your thread. You'll not know
from where it came (the thread using the CPU right before this switch),
unless you're root, then PERF_RECORD_SWITCH _will_ have that info.

> switch happens, pointing to the same location?
> 
> > So something like would do? Try not looking at the "PERF_RECORD_SWITCH OUT"
> > lines, they are generated by the kernel _after_ the corresponding
> > sched:sched_switch is registered:
> > 
> > [root@jouet ~]# perf record --call-graph=dwarf --switch-events -e
> 
> <snip>
> 
> Yes, that seems to work just fine:
> 
> perf record -e cycles -e sched:sched_switch \
> 	--call-graph dwarf --switch-events \
> 	./a.out
> 
> That gives me all information. With some post-processing one could then turn 
> this into a really useful profiling tool, as one will only need a single perf 
> record step to allow both, on-CPU as well as off-CPU profiling for application 
> developers.

right, off-cpu info in a 'perf trace' session is something we should
have, for instance. It should appear like a syscall, i.e. another way
out of the thread that will take some time to complete (sched_out ...
sched_in).

- Arnaldo
 
> Many thanks for the help, Arnaldo, and sorry again for the confusion on my 
> side.
> 
> Cheers
> -- 
> Milian Wolff | milian.wolff@kdab.com | Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Tracing sched_switch events for client application when process is switched back in
  2016-11-15  1:17             ` Arnaldo Carvalho de Melo
@ 2016-11-16 11:17               ` Milian Wolff
  0 siblings, 0 replies; 9+ messages in thread
From: Milian Wolff @ 2016-11-16 11:17 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users

[-- Attachment #1: Type: text/plain, Size: 2862 bytes --]

On Monday, November 14, 2016 10:17:03 PM CET Arnaldo Carvalho de Melo wrote:
> Em Mon, Nov 14, 2016 at 07:17:01PM +0100, Milian Wolff escreveu:
> > On Monday, November 14, 2016 12:36:11 PM CET Arnaldo Carvalho de Melo 
wrote:
> > > Em Mon, Nov 14, 2016 at 03:10:26PM +0100, Milian Wolff escreveu:
> > > > features:
> > > > 
> > > > - sample CPU events for a given application with backtraces
> > > > 
> > > > 	perf record --call-graph dwarf ./foo
> > > > 
> > > > - record switch-out events with backtraces (`-e
> > > > sched:sched_switch/fp=dwarf/`) - somehow record the switch-in events,
> > > > which are associated with a different process and thus currently
> > > > discarded
> > > 
> > > You want the backtraces with that, and since this is a meta-event...
> > > Humm,
> > > for the switch in would the above be sufficient? What value would be in
> > > knowing the backtrace in that case?
> > 
> > Knowing the backtrace would ensure I get the real location that triggered
> > a
> > switch, i.e. often times that comes from a syscall like futex or the like.
> > Or
> Sure, sure, a backtrace for a sched switch _out_ of a thread is useful,
> and that you get from the sched:sched_switch tracepoint, what I'm
> alluding to is the lack of a backtrace with the PERF_RECORD_SWITCH _IN_,
> meta-event i.e. when the kernel switchs back to the thread being
> monitored.

I think my previous mail was confusing: Your approach works really well, 
thanks again. I do not need a backtrace for the switch-in in the general case.
 
> > would I always get a CPU event (e.g. cycles) with a backtrace before the
> 
> Is there a need for that? With sched:sched_switch + backtrace you get a
> backtrace right at the moment of the sched_switch _out_, then, with
> PERF_RECORD_SWITCH (--switch-events) you'll get the missing piece, the
> sched switch _in_, that gets a CPU back to your thread. You'll not know
> from where it came (the thread using the CPU right before this switch),
> unless you're root, then PERF_RECORD_SWITCH _will_ have that info.

Right, as said above - that is not too crucial. I simply need to know I'm 
getting switched in again, and at what time that happens.

Thanks again!

<snip>

> > That gives me all information. With some post-processing one could then
> > turn this into a really useful profiling tool, as one will only need a
> > single perf record step to allow both, on-CPU as well as off-CPU
> > profiling for application developers.
> 
> right, off-cpu info in a 'perf trace' session is something we should
> have, for instance. It should appear like a syscall, i.e. another way
> out of the thread that will take some time to complete (sched_out ...
> sched_in).

Good idea, yes.

-- 
Milian Wolff | milian.wolff@kdab.com | Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5903 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2016-11-16 11:17 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-06 15:52 Tracing sched_switch events for client application when process is switched back in Milian Wolff
2016-10-06 16:21 ` Arnaldo Carvalho de Melo
2016-10-06 16:38   ` Milian Wolff
2016-10-11  0:36     ` Arnaldo Carvalho de Melo
2016-11-14 14:10       ` Milian Wolff
2016-11-14 15:36         ` Arnaldo Carvalho de Melo
2016-11-14 18:17           ` Milian Wolff
2016-11-15  1:17             ` Arnaldo Carvalho de Melo
2016-11-16 11:17               ` Milian Wolff

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.