* Re: large liblttng-ust startup overhead (ust_lock)
       [not found] <2117884.c1kuPXoiOA@milian-kdab2>
@ 2017-09-05 14:51 ` Mathieu Desnoyers
  2017-09-05 14:52 ` Milian Wolff
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2017-09-05 14:51 UTC (permalink / raw)
  To: Milian Wolff; +Cc: lttng-dev

----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com wrote:

> Hey all,
> 
> I have noticed a very large overhead when linking against liblttng-ust:
> 
> ~~~~~
> ┌milian@milian-kdab2:/tmp
> └$ cat lttng-test.c
> int main()
> {
>  return 0;
> }
> ┌milian@milian-kdab2:/tmp
> └$ gcc -O2 -g -ldl lttng-test.c
> ┌milian@milian-kdab2:/tmp
> └$ perf stat -r 5 ./a.out
> 
> Performance counter stats for './a.out' (5 runs):
> 
>          0.209587      task-clock (msec)         #    0.596 CPUs utilized
> ( +-  8.76% )
>                 0      context-switches          #    0.000 K/sec
>                 0      cpu-migrations            #    0.000 K/sec
>                49      page-faults               #    0.235 M/sec
> ( +-  1.19% )
>           706,854      cycles                    #    3.373 GHz
> ( +-  8.82% )
>           773,603      instructions              #    1.09  insn per cycle
> ( +-  0.75% )
>           147,128      branches                  #  701.987 M/sec
> ( +-  0.70% )
>             4,096      branch-misses             #    2.78% of all branches
> ( +-  5.27% )
> 
>       0.000351422 seconds time elapsed
> ( +- 11.85% )
> 
> ┌milian@milian-kdab2:/tmp
> └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> ┌milian@milian-kdab2:/tmp
> └$ perf stat -r 5 ./a.out
> 
> Performance counter stats for './a.out' (5 runs):
> 
>          2.063040      task-clock (msec)         #    0.009 CPUs utilized
> ( +-  1.37% )
>                44      context-switches          #    0.021 M/sec
> ( +-  1.95% )
>                 2      cpu-migrations            #    0.776 K/sec
> ( +- 25.00% )
>               209      page-faults               #    0.101 M/sec
> ( +-  0.34% )
>         7,053,686      cycles                    #    3.419 GHz
> ( +-  2.03% )
>         6,893,783      instructions              #    0.98  insn per cycle
> ( +-  0.25% )
>         1,342,492      branches                  #  650.735 M/sec
> ( +-  0.20% )
>            29,390      branch-misses             #    2.19% of all branches
> ( +-  0.61% )
> 
>       0.225597302 seconds time elapsed
> ( +-  6.68% )
> ~~~~~
> 
> This is without any LTTng session configured. If I enable LTTng kernel and
> userspace events, this becomes even worse:
> 
> ~~~~~
> ┌milian@milian-kdab2:/tmp
> └$ cat $(which run_lttng_trace.sh)
> #!/bin/sh
> 
> if [ -z "$(pidof lttng-sessiond)" ]; then
>    sudo lttng-sessiond --daemonize
> fi
> 
> sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> sudo lttng enable-event -c kernel -k -a
> sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> sudo lttng start
> $@
> sudo lttng stop
> 
> sudo chmod a+rx -R ~/lttng-traces
> ┌milian@milian-kdab2:/tmp
> └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> Session auto-20170905-162818 created.
> Traces will be written in /home/milian/lttng-traces/2017-09-05T16:28:18+02:00
> Kernel channel kernel enabled for session auto-20170905-162818
> All Kernel events are enabled in channel kernel
> UST channel ust enabled for session auto-20170905-162818
> UST event lttng_ust_tracef:* created in channel ust
> Tracing started for session auto-20170905-162818
> 
> Performance counter stats for './a.out' (5 runs):
> 
>          4.937820      task-clock (msec)         #    0.005 CPUs utilized
> ( +-  1.28% )
>               117      context-switches          #    0.024 M/sec
> ( +-  0.70% )
>                 3      cpu-migrations            #    0.608 K/sec
> ( +- 23.57% )
>               245      page-faults               #    0.050 M/sec
> ( +-  0.20% )
>        16,541,355      cycles                    #    3.350 GHz
> ( +-  0.94% )
>        20,611,637      instructions              #    1.25  insn per cycle
> ( +-  0.23% )
>         3,580,525      branches                  #  725.123 M/sec
> ( +-  0.19% )
>            52,093      branch-misses             #    1.45% of all branches
> ( +-  1.73% )
> 
>       0.965545882 seconds time elapsed
> ( +- 16.96% )
> 
> Waiting for data availability...
> Tracing stopped for session auto-20170905-162818
> ~~~~~
> 
> I see this ~1s delay during startup in nearly all test applications where I
> wanted to use lttng-ust. In tracecompass, and according to perf sleep time
> profiling, this seems to be mostly due to extensive sleeping on mutex locks
> (from ust_lock in liblttng-ust). With perf trace, I also see membarrier taking
> a lot of time:
> 
> ~~~~~
> # without a running lttng session:
> ┌milian@milian-kdab2:/tmp
> └$ perf trace --duration 1 ./a.out
>     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg:
> 0x7fbe2fbb1070                  ) = 1
>     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|
> PRIV, val: 2                    ) = 0
>   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
> 
> # with a running lttng session:
> ┌milian@milian-kdab2:/tmp
> └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg:
> 0x7fcd81d8a0b0                  ) = 612
>   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> ) = 0
>   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> ) = 0
>   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg:
> 0x7fcd81d8a070                  ) = 1
>     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|
> PRIV, val: 2                    ) = 0
>   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op:
> WAIT|PRIV, val: 2                    ) = 0
>   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|
> PRIV, val: 2                    ) = 0
>   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> WAIT|PRIV, val: 2                    ) = 0
>   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> ~~~~~
> 
> This makes lttng-ust pretty much unusable for any short-lived application. Is
> this a known limitation?

This kind of delay is not completely unexpected, but it is higher than I would
expect. I'm interested in figuring out whether most of this delay is caused by
sys_membarrier, either directly or indirectly (through a lock).
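
For reference, the "membarrier(cmd: 1)" calls in your perf trace are
MEMBARRIER_CMD_SHARED: the call only returns once every running thread on
the system is guaranteed to have gone through a full memory barrier, which
is why a single invocation can take tens of milliseconds on a loaded
machine. A minimal sketch to measure that cost in isolation (illustrative
only, this is the raw syscall, not the code path used inside
liblttng-ust/liburcu):

~~~~~
/* Illustrative only: time a single sys_membarrier(MEMBARRIER_CMD_SHARED)
 * call, the same command ("cmd: 1") that shows up repeatedly in the perf
 * trace above.  Requires kernel headers >= 4.3. */
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0) != 0) {
		perror("membarrier");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("MEMBARRIER_CMD_SHARED took %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}
~~~~~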

> 
> I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've seen
> this problem also on Ubuntu 17.04 though I didn't do any of the above
> measurements there. So far, I suspected the ust statedump to take too much
> time, but I disabled that one for the measurements above.
> 
> Any help would be appreciated

Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
share the performance results?
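
Without rebuilding anything, you can also quickly check what the running
kernel exposes: with CONFIG_MEMBARRIER=n the query below fails with ENOSYS,
which is the condition that would push the library onto a different code
path. Again only an illustrative sketch of the raw syscall:

~~~~~
/* Illustrative probe: MEMBARRIER_CMD_QUERY returns a bitmask of the
 * commands supported by the running kernel, or fails with ENOSYS when
 * the kernel was built with CONFIG_MEMBARRIER=n. */
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
	long ret = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);

	if (ret < 0 && errno == ENOSYS)
		printf("sys_membarrier not available (CONFIG_MEMBARRIER=n?)\n");
	else if (ret < 0)
		perror("membarrier");
	else
		printf("supported membarrier commands: 0x%lx\n", ret);
	return 0;
}
~~~~~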

Thanks,

Mathieu


> 
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


* Re: large liblttng-ust startup overhead (ust_lock)
       [not found] <2117884.c1kuPXoiOA@milian-kdab2>
  2017-09-05 14:51 ` large liblttng-ust startup overhead (ust_lock) Mathieu Desnoyers
@ 2017-09-05 14:52 ` Milian Wolff
       [not found] ` <119280371.2823.1504623102155.JavaMail.zimbra@efficios.com>
       [not found] ` <5409371.Vv1znlTa4X@milian-kdab2>
  3 siblings, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-05 14:52 UTC (permalink / raw)
  To: lttng-dev


On Tuesday, September 5, 2017 4:34:54 PM CEST Milian Wolff wrote:
> Hey all,
> 
> I have noticed a very large overhead when linking against liblttng-ust:
> 
> ~~~~~
> ┌milian@milian-kdab2:/tmp
> └$ cat lttng-test.c
> int main()
> {
>   return 0;
> }
> ┌milian@milian-kdab2:/tmp
> └$ gcc -O2 -g -ldl lttng-test.c
> ┌milian@milian-kdab2:/tmp
> └$ perf stat -r 5 ./a.out
> 
>  Performance counter stats for './a.out' (5 runs):
> 
>           0.209587      task-clock (msec)         #    0.596 CPUs utilized
> ( +-  8.76% )
>                  0      context-switches          #    0.000 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 49      page-faults               #    0.235 M/sec
> ( +-  1.19% )
>            706,854      cycles                    #    3.373 GHz
> ( +-  8.82% )
>            773,603      instructions              #    1.09  insn per cycle
> ( +-  0.75% )
>            147,128      branches                  #  701.987 M/sec
> ( +-  0.70% )
>              4,096      branch-misses             #    2.78% of all branches
> ( +-  5.27% )
> 
>        0.000351422 seconds time elapsed
> ( +- 11.85% )
> 
> ┌milian@milian-kdab2:/tmp
> └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> ┌milian@milian-kdab2:/tmp
> └$ perf stat -r 5 ./a.out
> 
>  Performance counter stats for './a.out' (5 runs):
> 
>           2.063040      task-clock (msec)         #    0.009 CPUs utilized
> ( +-  1.37% )
>                 44      context-switches          #    0.021 M/sec
> ( +-  1.95% )
>                  2      cpu-migrations            #    0.776 K/sec
> ( +- 25.00% )
>                209      page-faults               #    0.101 M/sec
> ( +-  0.34% )
>          7,053,686      cycles                    #    3.419 GHz
> ( +-  2.03% )
>          6,893,783      instructions              #    0.98  insn per cycle
> ( +-  0.25% )
>          1,342,492      branches                  #  650.735 M/sec
> ( +-  0.20% )
>             29,390      branch-misses             #    2.19% of all branches
> ( +-  0.61% )
> 
>        0.225597302 seconds time elapsed
> ( +-  6.68% )
> ~~~~~
> 
> This is without any LTTng session configured. If I enable LTTng kernel and
> userspace events, this becomes even worse:
> 
> ~~~~~
> ┌milian@milian-kdab2:/tmp
> └$ cat $(which run_lttng_trace.sh)
> #!/bin/sh
> 
> if [ -z "$(pidof lttng-sessiond)" ]; then
>     sudo lttng-sessiond --daemonize
> fi
> 
> sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> sudo lttng enable-event -c kernel -k -a
> sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> sudo lttng start
> $@
> sudo lttng stop
> 
> sudo chmod a+rx -R ~/lttng-traces
> ┌milian@milian-kdab2:/tmp
> └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> Session auto-20170905-162818 created.
> Traces will be written in
> /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel kernel
> enabled for session auto-20170905-162818
> All Kernel events are enabled in channel kernel
> UST channel ust enabled for session auto-20170905-162818
> UST event lttng_ust_tracef:* created in channel ust
> Tracing started for session auto-20170905-162818
> 
>  Performance counter stats for './a.out' (5 runs):
> 
>           4.937820      task-clock (msec)         #    0.005 CPUs utilized
> ( +-  1.28% )
>                117      context-switches          #    0.024 M/sec
> ( +-  0.70% )
>                  3      cpu-migrations            #    0.608 K/sec
> ( +- 23.57% )
>                245      page-faults               #    0.050 M/sec
> ( +-  0.20% )
>         16,541,355      cycles                    #    3.350 GHz
> ( +-  0.94% )
>         20,611,637      instructions              #    1.25  insn per cycle
> ( +-  0.23% )
>          3,580,525      branches                  #  725.123 M/sec
> ( +-  0.19% )
>             52,093      branch-misses             #    1.45% of all branches
> ( +-  1.73% )
> 
>        0.965545882 seconds time elapsed
> ( +- 16.96% )
> 
> Waiting for data availability...
> Tracing stopped for session auto-20170905-162818
> ~~~~~
> 
> I see this ~1s delay during startup in nearly all test applications where I
> wanted to use lttng-ust. In tracecompass, and according to perf sleep time
> profiling, this seems to be mostly due to extensive sleeping on mutex locks
> (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
> taking a lot of time:
> 
> ~~~~~
> # without a running lttng session:
> ┌milian@milian-kdab2:/tmp
> └$ perf trace --duration 1 ./a.out
>      6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg:
> 0x7fbe2fbb1070                  ) = 1
>      5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>     59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>    138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>     59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|
> PRIV, val: 2                    ) = 0
>    179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>    258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>    299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>    378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>    439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
>    508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> ) = 0
> 
> # with a running lttng session:
> ┌milian@milian-kdab2:/tmp
> └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>      6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
> msg: 0x7fcd81d8a0b0                  ) = 612
>    400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> ) = 0
>    468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> ) = 0
>    510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
> msg: 0x7fcd81d8a070                  ) = 1
>      6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>    610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|
> PRIV, val: 2                    ) = 0
>    616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>    678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>    616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op:
> WAIT|PRIV, val: 2                    ) = 0
>    738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>    788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>    739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> WAIT| PRIV, val: 2                    ) = 0
>    838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>    888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>    838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> WAIT|PRIV, val: 2                    ) = 0
>    958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>   1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> ) = 0
>      6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> ~~~~~
> 
> This makes lttng-ust pretty much unusable for any short-lived application.
> Is this a known limitation?
> 
> I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
> seen this problem also on Ubuntu 17.04 though I didn't do any of the above
> measurements there. So far, I suspected the ust statedump to take too much
> time, but I disabled that one for the measurements above.
> 
> Any help would be appreciated

Here's the perf sleep record data when no session is configured:

~~~~~
100% = 848.063336375 ms

# Overhead      Pid:Command    Shared Object     Symbol          Source:Line
# ........  .................  ................  ..............  ...........
#
    76.50%    10336:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
            |
            ---__schedule core.c:0
               schedule
               futex_wait_queue_me futex.c:0
               futex_wait futex.c:0
               do_futex
               sys_futex
               do_syscall_64
               return_from_SYSCALL_64 .tmp_entry_64.o:0
               syscall
               ust_listener_thread futex.h:65
               /usr/include/urcu/futex.h:65 (inline)
               /usr/include/urcu/futex.h:97 (inline)
               /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
ust-comm.c:1311 (inline)
               /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
ust-comm.c:1373 (inline)

    15.79%    10335:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
            |
            ---__schedule core.c:0
               schedule
               |          
                --15.70%--futex_wait_queue_me futex.c:0
                          futex_wait futex.c:0
                          do_futex
                          sys_futex
                          do_syscall_64
                          return_from_SYSCALL_64 .tmp_entry_64.o:0
                          __lll_lock_wait
                          pthread_mutex_lock
                          |          
                           --15.69%--0x2338
                                     0x3b29
                                     __nptl_deallocate_tsd.part.4 
pthread_create.c:0
                                     start_thread pthread_create.c:0
                                     __clone

     7.72%    10334:a.out      [kernel.vmlinux]  [k] __schedule  core.c:0
            |
            ---__schedule core.c:0
               schedule
               futex_wait_queue_me futex.c:0
               futex_wait futex.c:0
               do_futex
               sys_futex
               do_syscall_64
               return_from_SYSCALL_64 .tmp_entry_64.o:0
               do_futex_wait sem_timedwait.c:0
               __new_sem_wait_slow sem_timedwait.c:0
               lttng_ust_init.part.3 lttng-ust-comm.c:1757
               call_init.part.0 dl-init.c:0
               _dl_init
               _dl_start_user

~~~~~

Here's the perf sleep record data for an existing session:

~~~~~
100% = 3.0793293265 s

# Overhead      Pid:Command    Shared Object     Symbol          Source:Line
# ........  .................  ................  ..............  ...........
#
    52.10%    10762:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
            |
            ---__schedule core.c:0
               schedule
               futex_wait_queue_me futex.c:0
               futex_wait futex.c:0
               do_futex
               sys_futex
               do_syscall_64
               return_from_SYSCALL_64 .tmp_entry_64.o:0
               syscall
               ust_listener_thread futex.h:65
               /usr/include/urcu/futex.h:65 (inline)
               /usr/include/urcu/futex.h:97 (inline)
               /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
ust-comm.c:1311 (inline)
               /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
ust-comm.c:1373 (inline)

    27.09%    10761:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
            |
            ---__schedule core.c:0
               schedule
               |          
               |--13.69%--schedule_timeout
               |          unix_stream_read_generic af_unix.c:0
               |          unix_stream_recvmsg af_unix.c:0
               |          sock_recvmsg
               |          ___sys_recvmsg socket.c:0
               |          __sys_recvmsg
               |          sys_recvmsg
               |          do_syscall_64
               |          return_from_SYSCALL_64 .tmp_entry_64.o:0
               |          __libc_recvmsg
               |          ustcomm_recv_unix_sock lttng-ust-comm.c:307
               |          
                --13.40%--futex_wait_queue_me futex.c:0
                          futex_wait futex.c:0
                          do_futex
                          sys_futex
                          do_syscall_64
                          return_from_SYSCALL_64 .tmp_entry_64.o:0
                          __lll_lock_wait
                          pthread_mutex_lock
                          |          
                          |--8.41%--0x2338
                          |          0x3b29
                          |          __nptl_deallocate_tsd.part.4 
pthread_create.c:0
                          |          start_thread pthread_create.c:0
                          |          __clone
                          |          
                           --4.99%--ust_lock_nocheck lttng-ust-comm.c:178

    20.81%    10760:a.out      [kernel.vmlinux]  [k] __schedule  core.c:0
            |
            ---__schedule core.c:0
               schedule
               |          
                --20.81%--futex_wait_queue_me futex.c:0
                          futex_wait futex.c:0
                          do_futex
                          sys_futex
                          do_syscall_64
                          return_from_SYSCALL_64 .tmp_entry_64.o:0
                          |          
                           --20.77%--do_futex_wait sem_timedwait.c:0
                                     __new_sem_wait_slow sem_timedwait.c:0
                                     lttng_ust_init.part.3 lttng-ust-comm.c:
1757
                                     call_init.part.0 dl-init.c:0
                                     _dl_init
                                     _dl_start_user

~~~~~
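
In both profiles the large buckets are the listener threads parked in
sys_futex through the urcu futex.h helpers, plus the constructor blocked in
a semaphore wait. For what it's worth, here is a bare-bones sketch of the
futex wait/wake pattern those helpers boil down to (illustrative only, not
the actual liburcu code):

~~~~~
/* Illustrative only: the raw futex wait/wake pair that the urcu/futex.h
 * compat helpers (futex.h:65/97 in the stacks above) ultimately issue.
 * A thread sleeps in FUTEX_WAIT as long as the 32-bit word still holds
 * the expected value; the waker changes the word and calls FUTEX_WAKE.
 * Build with: gcc -O2 -pthread futex-sketch.c */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static int32_t wait_word;	/* 0 = nothing to do, 1 = work available */

static long futex(int32_t *uaddr, int op, int32_t val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void *listener(void *arg)
{
	(void)arg;
	/* Sleep until someone flips wait_word away from 0. */
	while (__atomic_load_n(&wait_word, __ATOMIC_ACQUIRE) == 0)
		futex(&wait_word, FUTEX_WAIT, 0);
	printf("listener woken up\n");
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, listener, NULL);
	sleep(1);	/* let the listener park itself in FUTEX_WAIT */

	__atomic_store_n(&wait_word, 1, __ATOMIC_RELEASE);
	futex(&wait_word, FUTEX_WAKE, 1);	/* wake it up */

	pthread_join(t, NULL);
	return 0;
}
~~~~~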

Hope that helps
-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts


* Re: large liblttng-ust startup overhead (ust_lock)
       [not found] ` <119280371.2823.1504623102155.JavaMail.zimbra@efficios.com>
@ 2017-09-05 15:08   ` Milian Wolff
       [not found]   ` <3731578.YsMjlWojEO@milian-kdab2>
  1 sibling, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-05 15:08 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev


On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com wrote:
> > Hey all,
> > 
> > I have noticed a very large overhead when linking against liblttng-ust:
> > 
> > ~~~~~
> > ┌milian@milian-kdab2:/tmp
> > └$ cat lttng-test.c
> > int main()
> > {
> > 
> >  return 0;
> > 
> > }
> > ┌milian@milian-kdab2:/tmp
> > └$ gcc -O2 -g -ldl lttng-test.c
> > ┌milian@milian-kdab2:/tmp
> > └$ perf stat -r 5 ./a.out
> > 
> > Performance counter stats for './a.out' (5 runs):
> >          0.209587      task-clock (msec)         #    0.596 CPUs utilized
> > 
> > ( +-  8.76% )
> > 
> >                 0      context-switches          #    0.000 K/sec
> >                 0      cpu-migrations            #    0.000 K/sec
> >                
> >                49      page-faults               #    0.235 M/sec
> > 
> > ( +-  1.19% )
> > 
> >           706,854      cycles                    #    3.373 GHz
> > 
> > ( +-  8.82% )
> > 
> >           773,603      instructions              #    1.09  insn per cycle
> > 
> > ( +-  0.75% )
> > 
> >           147,128      branches                  #  701.987 M/sec
> > 
> > ( +-  0.70% )
> > 
> >             4,096      branch-misses             #    2.78% of all
> >             branches
> > 
> > ( +-  5.27% )
> > 
> >       0.000351422 seconds time elapsed
> > 
> > ( +- 11.85% )
> > 
> > ┌milian@milian-kdab2:/tmp
> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> > ┌milian@milian-kdab2:/tmp
> > └$ perf stat -r 5 ./a.out
> > 
> > Performance counter stats for './a.out' (5 runs):
> >          2.063040      task-clock (msec)         #    0.009 CPUs utilized
> > 
> > ( +-  1.37% )
> > 
> >                44      context-switches          #    0.021 M/sec
> > 
> > ( +-  1.95% )
> > 
> >                 2      cpu-migrations            #    0.776 K/sec
> > 
> > ( +- 25.00% )
> > 
> >               209      page-faults               #    0.101 M/sec
> > 
> > ( +-  0.34% )
> > 
> >         7,053,686      cycles                    #    3.419 GHz
> > 
> > ( +-  2.03% )
> > 
> >         6,893,783      instructions              #    0.98  insn per cycle
> > 
> > ( +-  0.25% )
> > 
> >         1,342,492      branches                  #  650.735 M/sec
> > 
> > ( +-  0.20% )
> > 
> >            29,390      branch-misses             #    2.19% of all
> >            branches
> > 
> > ( +-  0.61% )
> > 
> >       0.225597302 seconds time elapsed
> > 
> > ( +-  6.68% )
> > ~~~~~
> > 
> > This is without any LTTng session configured. If I enable LTTng kernel and
> > userspace events, this becomes even worse:
> > 
> > ~~~~~
> > ┌milian@milian-kdab2:/tmp
> > └$ cat $(which run_lttng_trace.sh)
> > #!/bin/sh
> > 
> > if [ -z "$(pidof lttng-sessiond)" ]; then
> > 
> >    sudo lttng-sessiond --daemonize
> > 
> > fi
> > 
> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> > sudo lttng enable-event -c kernel -k -a
> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> > sudo lttng start
> > $@
> > sudo lttng stop
> > 
> > sudo chmod a+rx -R ~/lttng-traces
> > ┌milian@milian-kdab2:/tmp
> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> > Session auto-20170905-162818 created.
> > Traces will be written in
> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel kernel
> > enabled for session auto-20170905-162818
> > All Kernel events are enabled in channel kernel
> > UST channel ust enabled for session auto-20170905-162818
> > UST event lttng_ust_tracef:* created in channel ust
> > Tracing started for session auto-20170905-162818
> > 
> > Performance counter stats for './a.out' (5 runs):
> >          4.937820      task-clock (msec)         #    0.005 CPUs utilized
> > 
> > ( +-  1.28% )
> > 
> >               117      context-switches          #    0.024 M/sec
> > 
> > ( +-  0.70% )
> > 
> >                 3      cpu-migrations            #    0.608 K/sec
> > 
> > ( +- 23.57% )
> > 
> >               245      page-faults               #    0.050 M/sec
> > 
> > ( +-  0.20% )
> > 
> >        16,541,355      cycles                    #    3.350 GHz
> > 
> > ( +-  0.94% )
> > 
> >        20,611,637      instructions              #    1.25  insn per cycle
> > 
> > ( +-  0.23% )
> > 
> >         3,580,525      branches                  #  725.123 M/sec
> > 
> > ( +-  0.19% )
> > 
> >            52,093      branch-misses             #    1.45% of all
> >            branches
> > 
> > ( +-  1.73% )
> > 
> >       0.965545882 seconds time elapsed
> > 
> > ( +- 16.96% )
> > 
> > Waiting for data availability...
> > Tracing stopped for session auto-20170905-162818
> > ~~~~~
> > 
> > I see this ~1s delay during startup in nearly all test applications where
> > I
> > wanted to use lttng-ust. In tracecompass, and according to perf sleep time
> > profiling, this seems to be mostly due to extensive sleeping on mutex
> > locks
> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
> > taking a lot of time:
> > 
> > ~~~~~
> > # without a running lttng session:
> > ┌milian@milian-kdab2:/tmp
> > └$ perf trace --duration 1 ./a.out
> > 
> >     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg:
> > 0x7fbe2fbb1070                  ) = 1
> > 
> >     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> > 
> >    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|
> > 
> > PRIV, val: 2                    ) = 0
> > 
> >   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> > # with a running lttng session:
> > ┌milian@milian-kdab2:/tmp
> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> > 
> >     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, 
> > msg:
> > 0x7fcd81d8a0b0                  ) = 612
> > 
> >   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, 
> > msg:
> > 0x7fcd81d8a070                  ) = 1
> > 
> >     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> > 
> >   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|
> > 
> > PRIV, val: 2                    ) = 0
> > 
> >   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op:
> > WAIT|PRIV, val: 2                    ) = 0
> > 
> >   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> >   WAIT|
> > 
> > PRIV, val: 2                    ) = 0
> > 
> >   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> > WAIT|PRIV, val: 2                    ) = 0
> > 
> >   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> > 
> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> > ~~~~~
> > 
> > This makes lttng-ust pretty much unusable for any short-lived application.
> > Is this a known limitation?
> 
> This kind of delay is not completely unexpected, but it is higher than I
> would expect. I'm interested in figuring out whether most of this delay is
> caused by sys_membarrier, either directly or indirectly (through a lock).

See my other email which adds some harder numbers from a perf sleep record.

> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
> > above measurements there. So far, I suspected the ust statedump to take
> > too much time, but I disabled that one for the measurements above.
> > 
> > Any help would be appreciated
> 
> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
> share the performance results?

I'd have to recompile the kernel, which is something I don't like to do. Why
is this required? Put differently: why would the kernel option influence the
userspace LTTng library?

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts


* Re: large liblttng-ust startup overhead (ust_lock)
       [not found] ` <5409371.Vv1znlTa4X@milian-kdab2>
@ 2017-09-05 15:10   ` Milian Wolff
       [not found]   ` <5963588.vVBsjV779J@milian-kdab2>
  1 sibling, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-05 15:10 UTC (permalink / raw)
  To: lttng-dev


On Tuesday, September 5, 2017 4:52:42 PM CEST Milian Wolff wrote:
> On Tuesday, September 5, 2017 4:34:54 PM CEST Milian Wolff wrote:
> > Hey all,
> > 
> > I have noticed a very large overhead when linking against liblttng-ust:
> > 
> > ~~~~~
> > ┌milian@milian-kdab2:/tmp
> > └$ cat lttng-test.c
> > int main()
> > {
> > 
> >   return 0;
> > 
> > }
> > ┌milian@milian-kdab2:/tmp
> > └$ gcc -O2 -g -ldl lttng-test.c
> > ┌milian@milian-kdab2:/tmp
> > └$ perf stat -r 5 ./a.out
> > 
> >  Performance counter stats for './a.out' (5 runs):
> >           0.209587      task-clock (msec)         #    0.596 CPUs utilized
> > 
> > ( +-  8.76% )
> > 
> >                  0      context-switches          #    0.000 K/sec
> >                  0      cpu-migrations            #    0.000 K/sec
> >                 
> >                 49      page-faults               #    0.235 M/sec
> > 
> > ( +-  1.19% )
> > 
> >            706,854      cycles                    #    3.373 GHz
> > 
> > ( +-  8.82% )
> > 
> >            773,603      instructions              #    1.09  insn per
> >            cycle
> > 
> > ( +-  0.75% )
> > 
> >            147,128      branches                  #  701.987 M/sec
> > 
> > ( +-  0.70% )
> > 
> >              4,096      branch-misses             #    2.78% of all
> >              branches
> > 
> > ( +-  5.27% )
> > 
> >        0.000351422 seconds time elapsed
> > 
> > ( +- 11.85% )
> > 
> > ┌milian@milian-kdab2:/tmp
> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> > ┌milian@milian-kdab2:/tmp
> > └$ perf stat -r 5 ./a.out
> > 
> >  Performance counter stats for './a.out' (5 runs):
> >           2.063040      task-clock (msec)         #    0.009 CPUs utilized
> > 
> > ( +-  1.37% )
> > 
> >                 44      context-switches          #    0.021 M/sec
> > 
> > ( +-  1.95% )
> > 
> >                  2      cpu-migrations            #    0.776 K/sec
> > 
> > ( +- 25.00% )
> > 
> >                209      page-faults               #    0.101 M/sec
> > 
> > ( +-  0.34% )
> > 
> >          7,053,686      cycles                    #    3.419 GHz
> > 
> > ( +-  2.03% )
> > 
> >          6,893,783      instructions              #    0.98  insn per
> >          cycle
> > 
> > ( +-  0.25% )
> > 
> >          1,342,492      branches                  #  650.735 M/sec
> > 
> > ( +-  0.20% )
> > 
> >             29,390      branch-misses             #    2.19% of all
> >             branches
> > 
> > ( +-  0.61% )
> > 
> >        0.225597302 seconds time elapsed
> > 
> > ( +-  6.68% )
> > ~~~~~
> > 
> > This is without any LTTng session configured. If I enable LTTng kernel and
> > userspace events, this becomes even worse:
> > 
> > ~~~~~
> > ┌milian@milian-kdab2:/tmp
> > └$ cat $(which run_lttng_trace.sh)
> > #!/bin/sh
> > 
> > if [ -z "$(pidof lttng-sessiond)" ]; then
> > 
> >     sudo lttng-sessiond --daemonize
> > 
> > fi
> > 
> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> > sudo lttng enable-event -c kernel -k -a
> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> > sudo lttng start
> > $@
> > sudo lttng stop
> > 
> > sudo chmod a+rx -R ~/lttng-traces
> > ┌milian@milian-kdab2:/tmp
> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> > Session auto-20170905-162818 created.
> > Traces will be written in
> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel kernel
> > enabled for session auto-20170905-162818
> > All Kernel events are enabled in channel kernel
> > UST channel ust enabled for session auto-20170905-162818
> > UST event lttng_ust_tracef:* created in channel ust
> > Tracing started for session auto-20170905-162818
> > 
> >  Performance counter stats for './a.out' (5 runs):
> >           4.937820      task-clock (msec)         #    0.005 CPUs utilized
> > 
> > ( +-  1.28% )
> > 
> >                117      context-switches          #    0.024 M/sec
> > 
> > ( +-  0.70% )
> > 
> >                  3      cpu-migrations            #    0.608 K/sec
> > 
> > ( +- 23.57% )
> > 
> >                245      page-faults               #    0.050 M/sec
> > 
> > ( +-  0.20% )
> > 
> >         16,541,355      cycles                    #    3.350 GHz
> > 
> > ( +-  0.94% )
> > 
> >         20,611,637      instructions              #    1.25  insn per
> >         cycle
> > 
> > ( +-  0.23% )
> > 
> >          3,580,525      branches                  #  725.123 M/sec
> > 
> > ( +-  0.19% )
> > 
> >             52,093      branch-misses             #    1.45% of all
> >             branches
> > 
> > ( +-  1.73% )
> > 
> >        0.965545882 seconds time elapsed
> > 
> > ( +- 16.96% )
> > 
> > Waiting for data availability...
> > Tracing stopped for session auto-20170905-162818
> > ~~~~~
> > 
> > I see this ~1s delay during startup in nearly all test applications where
> > I
> > wanted to use lttng-ust. In tracecompass, and according to perf sleep time
> > profiling, this seems to be mostly due to extensive sleeping on mutex
> > locks
> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
> > taking a lot of time:
> > 
> > ~~~~~
> > # without a running lttng session:
> > ┌milian@milian-kdab2:/tmp
> > └$ perf trace --duration 1 ./a.out
> > 
> >      6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg:
> > 0x7fbe2fbb1070                  ) = 1
> > 
> >      5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> > 
> >     59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >     59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
> >     WAIT|
> > 
> > PRIV, val: 2                    ) = 0
> > 
> >    179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> > # with a running lttng session:
> > ┌milian@milian-kdab2:/tmp
> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> > 
> >      6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
> > 
> > msg: 0x7fcd81d8a0b0                  ) = 612
> > 
> >    400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
> > 
> > msg: 0x7fcd81d8a070                  ) = 1
> > 
> >      6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> > 
> >    610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|
> > 
> > PRIV, val: 2                    ) = 0
> > 
> >    616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op:
> > WAIT|PRIV, val: 2                    ) = 0
> > 
> >    738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> > WAIT| PRIV, val: 2                    ) = 0
> > 
> >    838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >    838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> > WAIT|PRIV, val: 2                    ) = 0
> > 
> >    958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >   1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> > 
> > ) = 0
> > 
> >      6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> > 
> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> > ~~~~~
> > 
> > This makes lttng-ust pretty much unusable for any short-lived application.
> > Is this a known limitation?
> > 
> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
> > seen this problem also on Ubuntu 17.04 though I didn't do any of the above
> > measurements there. So far, I suspected the ust statedump to take too much
> > time, but I disabled that one for the measurements above.
> > 
> > Any help would be appreciated
> 
> Here's the perf sleep record data when no session is configured:
> 
> ~~~~~
> 100% = 848.063336375 ms
> 
> # Overhead      Pid:Command    Shared Object     Symbol          Source:Line
> # ........  .................  ................  .............. 
> ........... #
>     76.50%    10336:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
> 
>             ---__schedule core.c:0
>                schedule
>                futex_wait_queue_me futex.c:0
>                futex_wait futex.c:0
>                do_futex
>                sys_futex
>                do_syscall_64
>                return_from_SYSCALL_64 .tmp_entry_64.o:0
>                syscall
>                ust_listener_thread futex.h:65
>                /usr/include/urcu/futex.h:65 (inline)
>                /usr/include/urcu/futex.h:97 (inline)
>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
> ust-comm.c:1311 (inline)
>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
> ust-comm.c:1373 (inline)
> 
>     15.79%    10335:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
> 
>             ---__schedule core.c:0
>                schedule
> 
>                 --15.70%--futex_wait_queue_me futex.c:0
>                           futex_wait futex.c:0
>                           do_futex
>                           sys_futex
>                           do_syscall_64
>                           return_from_SYSCALL_64 .tmp_entry_64.o:0
>                           __lll_lock_wait
>                           pthread_mutex_lock
> 
>                            --15.69%--0x2338
>                                      0x3b29
>                                      __nptl_deallocate_tsd.part.4
> pthread_create.c:0
>                                      start_thread pthread_create.c:0
>                                      __clone
> 
>      7.72%    10334:a.out      [kernel.vmlinux]  [k] __schedule  core.c:0
> 
>             ---__schedule core.c:0
>                schedule
>                futex_wait_queue_me futex.c:0
>                futex_wait futex.c:0
>                do_futex
>                sys_futex
>                do_syscall_64
>                return_from_SYSCALL_64 .tmp_entry_64.o:0
>                do_futex_wait sem_timedwait.c:0
>                __new_sem_wait_slow sem_timedwait.c:0
>                lttng_ust_init.part.3 lttng-ust-comm.c:1757
>                call_init.part.0 dl-init.c:0
>                _dl_init
>                _dl_start_user
> 
> ~~~~~
> 
> Here's the perf sleep record data for an existing session:
> 
> ~~~~~
> 100% = 3.0793293265 s
> 
> # Overhead      Pid:Command    Shared Object     Symbol          Source:Line
> # ........  .................  ................  .............. 
> ........... #
>     52.10%    10762:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
> 
>             ---__schedule core.c:0
>                schedule
>                futex_wait_queue_me futex.c:0
>                futex_wait futex.c:0
>                do_futex
>                sys_futex
>                do_syscall_64
>                return_from_SYSCALL_64 .tmp_entry_64.o:0
>                syscall
>                ust_listener_thread futex.h:65
>                /usr/include/urcu/futex.h:65 (inline)
>                /usr/include/urcu/futex.h:97 (inline)
>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
> ust-comm.c:1311 (inline)
>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
> ust-comm.c:1373 (inline)
> 
>     27.09%    10761:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
> 
>             ---__schedule core.c:0
>                schedule
> 
>                |--13.69%--schedule_timeout
>                |
>                |          unix_stream_read_generic af_unix.c:0
>                |          unix_stream_recvmsg af_unix.c:0
>                |          sock_recvmsg
>                |          ___sys_recvmsg socket.c:0
>                |          __sys_recvmsg
>                |          sys_recvmsg
>                |          do_syscall_64
>                |          return_from_SYSCALL_64 .tmp_entry_64.o:0
>                |          __libc_recvmsg
>                |          ustcomm_recv_unix_sock lttng-ust-comm.c:307
> 
>                 --13.40%--futex_wait_queue_me futex.c:0
>                           futex_wait futex.c:0
>                           do_futex
>                           sys_futex
>                           do_syscall_64
>                           return_from_SYSCALL_64 .tmp_entry_64.o:0
>                           __lll_lock_wait
>                           pthread_mutex_lock
> 
>                           |--8.41%--0x2338
>                           |
>                           |          0x3b29
>                           |          __nptl_deallocate_tsd.part.4
> 
> pthread_create.c:0
> 
>                           |          start_thread pthread_create.c:0
>                           |          __clone
> 
>                            --4.99%--ust_lock_nocheck lttng-ust-comm.c:178
> 
>     20.81%    10760:a.out      [kernel.vmlinux]  [k] __schedule  core.c:0
> 
>             ---__schedule core.c:0
>                schedule
> 
>                 --20.81%--futex_wait_queue_me futex.c:0
>                           futex_wait futex.c:0
>                           do_futex
>                           sys_futex
>                           do_syscall_64
>                           return_from_SYSCALL_64 .tmp_entry_64.o:0
> 
>                            --20.77%--do_futex_wait sem_timedwait.c:0
>                                      __new_sem_wait_slow sem_timedwait.c:0
>                                      lttng_ust_init.part.3 lttng-ust-comm.c:
> 1757
>                                      call_init.part.0 dl-init.c:0
>                                      _dl_init
>                                      _dl_start_user
> 
> ~~~~~
> 
> Hope that helps

One thing that is apparent from the above, and can also be made visible by
adding some output to the application before it quits: it is not the main
thread that is blocked initially. Rather, the main thread spins up the
secondary -ust threads, then runs its code and finishes before the -ust
threads manage to initialize. Before the main thread quits, it then waits for
the ust threads to finish.
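
That matches the stacks from my previous mail: lttng_ust_init runs as a
library constructor during _dl_init and blocks in a semaphore wait until
registration with the session daemon completes or a timeout expires (if I
remember correctly that timeout is what LTTNG_UST_REGISTER_TIMEOUT
controls). A stripped-down illustration of that constructor/destructor
shape, not the actual lttng-ust code:

~~~~~
/* Illustration only (not lttng-ust source): a library constructor that
 * spawns a listener thread, then blocks on a semaphore with a timeout
 * until that thread has registered -- the shape of the wait seen under
 * lttng_ust_init -> __new_sem_wait_slow in the perf data above.
 * Build with: gcc -O2 -pthread -shared -fPIC ctor-sketch.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <semaphore.h>
#include <time.h>
#include <stdio.h>

static sem_t registration_done;
static pthread_t listener;

static void *listener_thread(void *arg)
{
	(void)arg;
	/* ... connect to the session daemon, exchange messages ... */
	sem_post(&registration_done);	/* unblock the constructor */
	/* (a real listener keeps running; this sketch just returns) */
	return NULL;
}

__attribute__((constructor))
static void tracing_init(void)
{
	struct timespec deadline;

	sem_init(&registration_done, 0, 0);
	pthread_create(&listener, NULL, listener_thread, NULL);

	/* Wait for registration, but give up after a few seconds so a
	 * short-lived application is not blocked forever. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += 3;
	if (sem_timedwait(&registration_done, &deadline) != 0)
		fprintf(stderr, "tracing: registration timed out\n");
}

__attribute__((destructor))
static void tracing_fini(void)
{
	/* At process exit the main thread waits for the listener --
	 * the "waits for the ust threads to finish" part above. */
	pthread_join(listener, NULL);
}
~~~~~

In a short-lived test like the a.out above, main() returns long before the
listener threads get anywhere, so essentially all of the observed wall-clock
time is spent in waits like these.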

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts


* Re: large liblttng-ust startup overhead (ust_lock)
       [not found]   ` <3731578.YsMjlWojEO@milian-kdab2>
@ 2017-09-05 18:11     ` Mathieu Desnoyers
       [not found]     ` <709369421.3343.1504635118648.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2017-09-05 18:11 UTC (permalink / raw)
  To: Milian Wolff; +Cc: lttng-dev

----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:

> On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
>> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com wrote:
>> > Hey all,
>> > 
>> > I have noticed a very large overhead when linking against liblttng-ust:
>> > 
>> > ~~~~~
>> > ┌milian@milian-kdab2:/tmp
>> > └$ cat lttng-test.c
>> > int main()
>> > {
>> > 
>> >  return 0;
>> > 
>> > }
>> > ┌milian@milian-kdab2:/tmp
>> > └$ gcc -O2 -g -ldl lttng-test.c
>> > ┌milian@milian-kdab2:/tmp
>> > └$ perf stat -r 5 ./a.out
>> > 
>> > Performance counter stats for './a.out' (5 runs):
>> >          0.209587      task-clock (msec)         #    0.596 CPUs utilized
>> > 
>> > ( +-  8.76% )
>> > 
>> >                 0      context-switches          #    0.000 K/sec
>> >                 0      cpu-migrations            #    0.000 K/sec
>> >                
>> >                49      page-faults               #    0.235 M/sec
>> > 
>> > ( +-  1.19% )
>> > 
>> >           706,854      cycles                    #    3.373 GHz
>> > 
>> > ( +-  8.82% )
>> > 
>> >           773,603      instructions              #    1.09  insn per cycle
>> > 
>> > ( +-  0.75% )
>> > 
>> >           147,128      branches                  #  701.987 M/sec
>> > 
>> > ( +-  0.70% )
>> > 
>> >             4,096      branch-misses             #    2.78% of all
>> >             branches
>> > 
>> > ( +-  5.27% )
>> > 
>> >       0.000351422 seconds time elapsed
>> > 
>> > ( +- 11.85% )
>> > 
>> > ┌milian@milian-kdab2:/tmp
>> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
>> > ┌milian@milian-kdab2:/tmp
>> > └$ perf stat -r 5 ./a.out
>> > 
>> > Performance counter stats for './a.out' (5 runs):
>> >          2.063040      task-clock (msec)         #    0.009 CPUs utilized
>> > 
>> > ( +-  1.37% )
>> > 
>> >                44      context-switches          #    0.021 M/sec
>> > 
>> > ( +-  1.95% )
>> > 
>> >                 2      cpu-migrations            #    0.776 K/sec
>> > 
>> > ( +- 25.00% )
>> > 
>> >               209      page-faults               #    0.101 M/sec
>> > 
>> > ( +-  0.34% )
>> > 
>> >         7,053,686      cycles                    #    3.419 GHz
>> > 
>> > ( +-  2.03% )
>> > 
>> >         6,893,783      instructions              #    0.98  insn per cycle
>> > 
>> > ( +-  0.25% )
>> > 
>> >         1,342,492      branches                  #  650.735 M/sec
>> > 
>> > ( +-  0.20% )
>> > 
>> >            29,390      branch-misses             #    2.19% of all
>> >            branches
>> > 
>> > ( +-  0.61% )
>> > 
>> >       0.225597302 seconds time elapsed
>> > 
>> > ( +-  6.68% )
>> > ~~~~~
>> > 
>> > This is without any LTTng session configured. If I enable LTTng kernel and
>> > userspace events, this becomes even worse:
>> > 
>> > ~~~~~
>> > ┌milian@milian-kdab2:/tmp
>> > └$ cat $(which run_lttng_trace.sh)
>> > #!/bin/sh
>> > 
>> > if [ -z "$(pidof lttng-sessiond)" ]; then
>> > 
>> >    sudo lttng-sessiond --daemonize
>> > 
>> > fi
>> > 
>> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
>> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
>> > sudo lttng enable-event -c kernel -k -a
>> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
>> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
>> > sudo lttng start
>> > $@
>> > sudo lttng stop
>> > 
>> > sudo chmod a+rx -R ~/lttng-traces
>> > ┌milian@milian-kdab2:/tmp
>> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
>> > Session auto-20170905-162818 created.
>> > Traces will be written in
>> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel kernel
>> > enabled for session auto-20170905-162818
>> > All Kernel events are enabled in channel kernel
>> > UST channel ust enabled for session auto-20170905-162818
>> > UST event lttng_ust_tracef:* created in channel ust
>> > Tracing started for session auto-20170905-162818
>> > 
>> > Performance counter stats for './a.out' (5 runs):
>> >          4.937820      task-clock (msec)         #    0.005 CPUs utilized
>> > 
>> > ( +-  1.28% )
>> > 
>> >               117      context-switches          #    0.024 M/sec
>> > 
>> > ( +-  0.70% )
>> > 
>> >                 3      cpu-migrations            #    0.608 K/sec
>> > 
>> > ( +- 23.57% )
>> > 
>> >               245      page-faults               #    0.050 M/sec
>> > 
>> > ( +-  0.20% )
>> > 
>> >        16,541,355      cycles                    #    3.350 GHz
>> > 
>> > ( +-  0.94% )
>> > 
>> >        20,611,637      instructions              #    1.25  insn per cycle
>> > 
>> > ( +-  0.23% )
>> > 
>> >         3,580,525      branches                  #  725.123 M/sec
>> > 
>> > ( +-  0.19% )
>> > 
>> >            52,093      branch-misses             #    1.45% of all
>> >            branches
>> > 
>> > ( +-  1.73% )
>> > 
>> >       0.965545882 seconds time elapsed
>> > 
>> > ( +- 16.96% )
>> > 
>> > Waiting for data availability...
>> > Tracing stopped for session auto-20170905-162818
>> > ~~~~~
>> > 
>> > I see this ~1s delay during startup in nearly all test applications where
>> > I
>> > wanted to use lttng-ust. In tracecompass, and according to perf sleep time
>> > profiling, this seems to be mostly due to extensive sleeping on mutex
>> > locks
>> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
>> > taking a lot of time:
>> > 
>> > ~~~~~
>> > # without a running lttng session:
>> > ┌milian@milian-kdab2:/tmp
>> > └$ perf trace --duration 1 ./a.out
>> > 
>> >     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg:
>> > 0x7fbe2fbb1070                  ) = 1
>> > 
>> >     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
>> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>> > 
>> >    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|
>> > 
>> > PRIV, val: 2                    ) = 0
>> > 
>> >   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> > # with a running lttng session:
>> > ┌milian@milian-kdab2:/tmp
>> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>> > 
>> >     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
> msg:
>> > 0x7fcd81d8a0b0                  ) = 612
>> > 
>> >   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
> msg:
>> > 0x7fcd81d8a070                  ) = 1
>> > 
>> >     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
>> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>> > 
>> >   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|
>> > 
>> > PRIV, val: 2                    ) = 0
>> > 
>> >   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op:
>> > WAIT|PRIV, val: 2                    ) = 0
>> > 
>> >   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
>> >   WAIT|
>> > 
>> > PRIV, val: 2                    ) = 0
>> > 
>> >   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
>> > WAIT|PRIV, val: 2                    ) = 0
>> > 
>> >   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
>> > 
>> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
>> > ~~~~~
>> > 
>> > This makes lttng-ust pretty much unusable for any short-lived application.
>> > Is this a known limitation?
>> 
>> This kind of delay is not completely unexpected, but a bit higher than I
>> would expect. I'm interested in figuring out if most of this delay happens
>> to be caused by sys_membarrier, either directly, or indirectly (through a
>> lock).
> 
> See my other email which adds some harder numbers from a perf sleep record.
> 
>> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
>> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
>> > above measurements there. So far, I suspected the ust statedump to take
>> > too much time, but I disabled that one for the measurements above.
>> > 
>> > Any help would be appreciated
>> 
>> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
>> share the performance results ?
> 
> I'd have to recompile the kernel, which is something I don't like to do. Why
> is this required, or put differently - why would the kernel option influence
> the userspace LTTng library?

liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) use the membarrier system
call to speed up tracing, at the expense of slower execution of tracing state
updates.

I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
(single-process) use cases, which should execute much faster than the
non-expedited SHARED command currently implemented.

liburcu 0.9 and 0.10 detect whether this system call is available, and use
it when they can.
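
To illustrate the mechanism, here is a simplified sketch (not liburcu's actual
code): sys_membarrier is queried once, and when the SHARED command (cmd 1, the
value visible in the perf trace output above) is usable, the tracing fast path
can get away with a compiler barrier while state updates pay for the system
call; otherwise both sides fall back to full memory barriers. The constants
come from linux/membarrier.h, so a 4.3+ kernel with matching headers is
assumed:

~~~~~
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

static int has_sys_membarrier;

/* Query once at init whether MEMBARRIER_CMD_SHARED is supported. */
static void detect_membarrier(void)
{
	long mask = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);

	if (mask >= 0 && (mask & MEMBARRIER_CMD_SHARED))
		has_sys_membarrier = 1;
}

/* Fast path (per-event tracing): cheap when sys_membarrier is usable. */
static void fast_path_barrier(void)
{
	if (has_sys_membarrier)
		__asm__ __volatile__("" ::: "memory");	/* compiler barrier only */
	else
		__sync_synchronize();			/* full memory barrier */
}

/* Slow path (tracing state update): pays for the system call. */
static void update_side_barrier(void)
{
	if (has_sys_membarrier)
		(void) syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0);
	else
		__sync_synchronize();
}

int main(void)
{
	detect_membarrier();
	fast_path_barrier();
	update_side_barrier();
	printf("sys_membarrier usable: %d\n", has_sys_membarrier);
	return 0;
}
~~~~~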

Thanks,

Mathieu


> 
> Thanks
> 
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: large liblttng-ust startup overhead (ust_lock)
       [not found]   ` <5963588.vVBsjV779J@milian-kdab2>
@ 2017-09-05 18:20     ` Mathieu Desnoyers
       [not found]     ` <859976545.3393.1504635610385.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2017-09-05 18:20 UTC (permalink / raw)
  To: Milian Wolff; +Cc: lttng-dev

----- On Sep 5, 2017, at 11:10 AM, Milian Wolff milian.wolff@kdab.com wrote:

> On Tuesday, September 5, 2017 4:52:42 PM CEST Milian Wolff wrote:
>> On Tuesday, September 5, 2017 4:34:54 PM CEST Milian Wolff wrote:
>> > Hey all,
>> > 
>> > I have noticed a very large overhead when linking against liblttng-ust:
>> > 
>> > ~~~~~
>> > ┌milian@milian-kdab2:/tmp
>> > └$ cat lttng-test.c
>> > int main()
>> > {
>> > 
>> >   return 0;
>> > 
>> > }
>> > ┌milian@milian-kdab2:/tmp
>> > └$ gcc -O2 -g -ldl lttng-test.c
>> > ┌milian@milian-kdab2:/tmp
>> > └$ perf stat -r 5 ./a.out
>> > 
>> >  Performance counter stats for './a.out' (5 runs):
>> >           0.209587      task-clock (msec)         #    0.596 CPUs utilized
>> > 
>> > ( +-  8.76% )
>> > 
>> >                  0      context-switches          #    0.000 K/sec
>> >                  0      cpu-migrations            #    0.000 K/sec
>> >                 
>> >                 49      page-faults               #    0.235 M/sec
>> > 
>> > ( +-  1.19% )
>> > 
>> >            706,854      cycles                    #    3.373 GHz
>> > 
>> > ( +-  8.82% )
>> > 
>> >            773,603      instructions              #    1.09  insn per
>> >            cycle
>> > 
>> > ( +-  0.75% )
>> > 
>> >            147,128      branches                  #  701.987 M/sec
>> > 
>> > ( +-  0.70% )
>> > 
>> >              4,096      branch-misses             #    2.78% of all
>> >              branches
>> > 
>> > ( +-  5.27% )
>> > 
>> >        0.000351422 seconds time elapsed
>> > 
>> > ( +- 11.85% )
>> > 
>> > ┌milian@milian-kdab2:/tmp
>> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
>> > ┌milian@milian-kdab2:/tmp
>> > └$ perf stat -r 5 ./a.out
>> > 
>> >  Performance counter stats for './a.out' (5 runs):
>> >           2.063040      task-clock (msec)         #    0.009 CPUs utilized
>> > 
>> > ( +-  1.37% )
>> > 
>> >                 44      context-switches          #    0.021 M/sec
>> > 
>> > ( +-  1.95% )
>> > 
>> >                  2      cpu-migrations            #    0.776 K/sec
>> > 
>> > ( +- 25.00% )
>> > 
>> >                209      page-faults               #    0.101 M/sec
>> > 
>> > ( +-  0.34% )
>> > 
>> >          7,053,686      cycles                    #    3.419 GHz
>> > 
>> > ( +-  2.03% )
>> > 
>> >          6,893,783      instructions              #    0.98  insn per
>> >          cycle
>> > 
>> > ( +-  0.25% )
>> > 
>> >          1,342,492      branches                  #  650.735 M/sec
>> > 
>> > ( +-  0.20% )
>> > 
>> >             29,390      branch-misses             #    2.19% of all
>> >             branches
>> > 
>> > ( +-  0.61% )
>> > 
>> >        0.225597302 seconds time elapsed
>> > 
>> > ( +-  6.68% )
>> > ~~~~~
>> > 
>> > This is without any LTTng session configured. If I enable LTTng kernel and
>> > userspace events, this becomes even worse:
>> > 
>> > ~~~~~
>> > ┌milian@milian-kdab2:/tmp
>> > └$ cat $(which run_lttng_trace.sh)
>> > #!/bin/sh
>> > 
>> > if [ -z "$(pidof lttng-sessiond)" ]; then
>> > 
>> >     sudo lttng-sessiond --daemonize
>> > 
>> > fi
>> > 
>> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
>> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
>> > sudo lttng enable-event -c kernel -k -a
>> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
>> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
>> > sudo lttng start
>> > $@
>> > sudo lttng stop
>> > 
>> > sudo chmod a+rx -R ~/lttng-traces
>> > ┌milian@milian-kdab2:/tmp
>> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
>> > Session auto-20170905-162818 created.
>> > Traces will be written in
>> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel kernel
>> > enabled for session auto-20170905-162818
>> > All Kernel events are enabled in channel kernel
>> > UST channel ust enabled for session auto-20170905-162818
>> > UST event lttng_ust_tracef:* created in channel ust
>> > Tracing started for session auto-20170905-162818
>> > 
>> >  Performance counter stats for './a.out' (5 runs):
>> >           4.937820      task-clock (msec)         #    0.005 CPUs utilized
>> > 
>> > ( +-  1.28% )
>> > 
>> >                117      context-switches          #    0.024 M/sec
>> > 
>> > ( +-  0.70% )
>> > 
>> >                  3      cpu-migrations            #    0.608 K/sec
>> > 
>> > ( +- 23.57% )
>> > 
>> >                245      page-faults               #    0.050 M/sec
>> > 
>> > ( +-  0.20% )
>> > 
>> >         16,541,355      cycles                    #    3.350 GHz
>> > 
>> > ( +-  0.94% )
>> > 
>> >         20,611,637      instructions              #    1.25  insn per
>> >         cycle
>> > 
>> > ( +-  0.23% )
>> > 
>> >          3,580,525      branches                  #  725.123 M/sec
>> > 
>> > ( +-  0.19% )
>> > 
>> >             52,093      branch-misses             #    1.45% of all
>> >             branches
>> > 
>> > ( +-  1.73% )
>> > 
>> >        0.965545882 seconds time elapsed
>> > 
>> > ( +- 16.96% )
>> > 
>> > Waiting for data availability...
>> > Tracing stopped for session auto-20170905-162818
>> > ~~~~~
>> > 
>> > I see this ~1s delay during startup in nearly all test applications where
>> > I
>> > wanted to use lttng-ust. In tracecompass, and according to perf sleep time
>> > profiling, this seems to be mostly due to extensive sleeping on mutex
>> > locks
>> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
>> > taking a lot of time:
>> > 
>> > ~~~~~
>> > # without a running lttng session:
>> > ┌milian@milian-kdab2:/tmp
>> > └$ perf trace --duration 1 ./a.out
>> > 
>> >      6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg:
>> > 0x7fbe2fbb1070                  ) = 1
>> > 
>> >      5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
>> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>> > 
>> >     59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >     59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
>> >     WAIT|
>> > 
>> > PRIV, val: 2                    ) = 0
>> > 
>> >    179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> > # with a running lttng session:
>> > ┌milian@milian-kdab2:/tmp
>> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>> > 
>> >      6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
>> > 
>> > msg: 0x7fcd81d8a0b0                  ) = 612
>> > 
>> >    400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
>> > 
>> > msg: 0x7fcd81d8a070                  ) = 1
>> > 
>> >      6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
>> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>> > 
>> >    610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|
>> > 
>> > PRIV, val: 2                    ) = 0
>> > 
>> >    616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op:
>> > WAIT|PRIV, val: 2                    ) = 0
>> > 
>> >    738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
>> > WAIT| PRIV, val: 2                    ) = 0
>> > 
>> >    838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >    838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
>> > WAIT|PRIV, val: 2                    ) = 0
>> > 
>> >    958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >   1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
>> > 
>> > ) = 0
>> > 
>> >      6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
>> > 
>> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
>> > ~~~~~
>> > 
>> > This makes lttng-ust pretty much unusable for any short-lived application.
>> > Is this a known limitation?
>> > 
>> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
>> > seen this problem also on Ubuntu 17.04 though I didn't do any of the above
>> > measurements there. So far, I suspected the ust statedump to take too much
>> > time, but I disabled that one for the measurements above.
>> > 
>> > Any help would be appreciated
>> 
>> Here's the perf sleep record data when no session is configured:
>> 
>> ~~~~~
>> 100% = 848.063336375 ms
>> 
>> # Overhead      Pid:Command    Shared Object     Symbol          Source:Line
>> # ........  .................  ................  ..............
>> ........... #
>>     76.50%    10336:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>>                futex_wait_queue_me futex.c:0
>>                futex_wait futex.c:0
>>                do_futex
>>                sys_futex
>>                do_syscall_64
>>                return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                syscall
>>                ust_listener_thread futex.h:65
>>                /usr/include/urcu/futex.h:65 (inline)
>>                /usr/include/urcu/futex.h:97 (inline)
>>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
>> ust-comm.c:1311 (inline)
>>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
>> ust-comm.c:1373 (inline)
>> 
>>     15.79%    10335:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>> 
>>                 --15.70%--futex_wait_queue_me futex.c:0
>>                           futex_wait futex.c:0
>>                           do_futex
>>                           sys_futex
>>                           do_syscall_64
>>                           return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                           __lll_lock_wait
>>                           pthread_mutex_lock
>> 
>>                            --15.69%--0x2338
>>                                      0x3b29
>>                                      __nptl_deallocate_tsd.part.4
>> pthread_create.c:0
>>                                      start_thread pthread_create.c:0
>>                                      __clone
>> 
>>      7.72%    10334:a.out      [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>>                futex_wait_queue_me futex.c:0
>>                futex_wait futex.c:0
>>                do_futex
>>                sys_futex
>>                do_syscall_64
>>                return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                do_futex_wait sem_timedwait.c:0
>>                __new_sem_wait_slow sem_timedwait.c:0
>>                lttng_ust_init.part.3 lttng-ust-comm.c:1757
>>                call_init.part.0 dl-init.c:0
>>                _dl_init
>>                _dl_start_user
>> 
>> ~~~~~
>> 
>> Here's the perf sleep record data for an existing session:
>> 
>> ~~~~~
>> 100% = 3.0793293265 s
>> 
>> # Overhead      Pid:Command    Shared Object     Symbol          Source:Line
>> # ........  .................  ................  ..............
>> ........... #
>>     52.10%    10762:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>>                futex_wait_queue_me futex.c:0
>>                futex_wait futex.c:0
>>                do_futex
>>                sys_futex
>>                do_syscall_64
>>                return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                syscall
>>                ust_listener_thread futex.h:65
>>                /usr/include/urcu/futex.h:65 (inline)
>>                /usr/include/urcu/futex.h:97 (inline)
>>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
>> ust-comm.c:1311 (inline)
>>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-
>> ust-comm.c:1373 (inline)
>> 
>>     27.09%    10761:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>> 
>>                |--13.69%--schedule_timeout
>>                |
>>                |          unix_stream_read_generic af_unix.c:0
>>                |          unix_stream_recvmsg af_unix.c:0
>>                |          sock_recvmsg
>>                |          ___sys_recvmsg socket.c:0
>>                |          __sys_recvmsg
>>                |          sys_recvmsg
>>                |          do_syscall_64
>>                |          return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                |          __libc_recvmsg
>>                |          ustcomm_recv_unix_sock lttng-ust-comm.c:307
>> 
>>                 --13.40%--futex_wait_queue_me futex.c:0
>>                           futex_wait futex.c:0
>>                           do_futex
>>                           sys_futex
>>                           do_syscall_64
>>                           return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                           __lll_lock_wait
>>                           pthread_mutex_lock
>> 
>>                           |--8.41%--0x2338
>>                           |
>>                           |          0x3b29
>>                           |          __nptl_deallocate_tsd.part.4
>> 
>> pthread_create.c:0
>> 
>>                           |          start_thread pthread_create.c:0
>>                           |          __clone
>> 
>>                            --4.99%--ust_lock_nocheck lttng-ust-comm.c:178
>> 
>>     20.81%    10760:a.out      [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>> 
>>                 --20.81%--futex_wait_queue_me futex.c:0
>>                           futex_wait futex.c:0
>>                           do_futex
>>                           sys_futex
>>                           do_syscall_64
>>                           return_from_SYSCALL_64 .tmp_entry_64.o:0
>> 
>>                            --20.77%--do_futex_wait sem_timedwait.c:0
>>                                      __new_sem_wait_slow sem_timedwait.c:0
>>                                      lttng_ust_init.part.3 lttng-ust-comm.c:
>> 1757
>>                                      call_init.part.0 dl-init.c:0
>>                                      _dl_init
>>                                      _dl_start_user
>> 
>> ~~~~~
>> 
>> Hope that helps
> 
> One thing that is apparent from the above, and can also be made visible by
> adding some visible output to the application before quitting it: It's not the
> main thread that's blocked initially. Rather, the main thread spins up the
> secondary -ust threads, then runs its code and finishes before the -ust
> threads manage to initialize. Before the main thread then quits, it waits for
> the ust threads to finish.

Those threads appear to be talking to the session daemon (on recvmsg). Please
note that there are two cases there:

- If you use per-pid buffers (explicitly at channel creation), the consumer
  daemon needs to allocate buffers on behalf of each and every new process
  that appears while tracing, which takes a significant amount of time.
- If you use default buffers (per-uid), only the first process that registers
  for a specific UID will require that the consumer daemon allocates buffers.
  After that, the same buffers will be sent to other applications registering
  for the same user ID.

Clearly, per-pid buffers incur a significant slowdown at application start. This
is one reason why those are not used by default.

If you benchmark per-uid buffers, you will want to benchmark those two scenarios:

1) startup time of the first process that registers for a UID,

2) startup time of subsequent process registrations

Technically, the buffer allocation delay only impacts the first process that
registers for a UID, and not all of them.
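
For concreteness, here is a rough sketch of how one could compare the two
buffer schemes and the two scenarios above. The --buffers-uid/--buffers-pid
flag names are from memory of lttng 2.9 (check `lttng enable-channel --help`),
and separate sessions are used to keep the two schemes apart:

~~~~~
# Per-UID buffers (the default): only the first app per UID pays the
# buffer-allocation cost.
sudo lttng create demo-uid
sudo lttng enable-channel ust -u --buffers-uid --subbuf-size 16M --num-subbuf 8
sudo lttng enable-event -c ust -u 'lttng_ust_tracef:*'
sudo lttng start
perf stat ./a.out        # scenario 1: first registration for this UID
perf stat -r 5 ./a.out   # scenario 2: subsequent registrations
sudo lttng destroy demo-uid

# Per-PID buffers: every new traced process triggers buffer allocation in
# the consumer daemon, hence a larger per-process startup cost.
sudo lttng create demo-pid
sudo lttng enable-channel ust -u --buffers-pid --subbuf-size 16M --num-subbuf 8
sudo lttng enable-event -c ust -u 'lttng_ust_tracef:*'
sudo lttng start
perf stat -r 5 ./a.out   # every run pays the allocation cost
sudo lttng destroy demo-pid
~~~~~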

Thanks,

Mathieu



> 
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: large liblttng-ust startup overhead (ust_lock)
       [not found]     ` <709369421.3343.1504635118648.JavaMail.zimbra@efficios.com>
@ 2017-09-06  7:35       ` Milian Wolff
       [not found]       ` <3211887.IErR8xQyUx@agathebauer>
  1 sibling, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-06  7:35 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev

On Tuesday, 5 September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:
> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com 
wrote:
> >> > Hey all,
> >> > 
> >> > I have noticed a very large overhead when linking against liblttng-ust:
> >> > 
> >> > ~~~~~
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ cat lttng-test.c
> >> > int main()
> >> > {
> >> > 
> >> >  return 0;
> >> > 
> >> > }
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ gcc -O2 -g -ldl lttng-test.c
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ perf stat -r 5 ./a.out
> >> > 
> >> > Performance counter stats for './a.out' (5 runs):
> >> >          0.209587      task-clock (msec)         #    0.596 CPUs
> >> >          utilized
> >> > 
> >> > ( +-  8.76% )
> >> > 
> >> >                 0      context-switches          #    0.000 K/sec
> >> >                 0      cpu-migrations            #    0.000 K/sec
> >> >                
> >> >                49      page-faults               #    0.235 M/sec
> >> > 
> >> > ( +-  1.19% )
> >> > 
> >> >           706,854      cycles                    #    3.373 GHz
> >> > 
> >> > ( +-  8.82% )
> >> > 
> >> >           773,603      instructions              #    1.09  insn per
> >> >           cycle
> >> > 
> >> > ( +-  0.75% )
> >> > 
> >> >           147,128      branches                  #  701.987 M/sec
> >> > 
> >> > ( +-  0.70% )
> >> > 
> >> >             4,096      branch-misses             #    2.78% of all
> >> >             branches
> >> > 
> >> > ( +-  5.27% )
> >> > 
> >> >       0.000351422 seconds time elapsed
> >> > 
> >> > ( +- 11.85% )
> >> > 
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ perf stat -r 5 ./a.out
> >> > 
> >> > Performance counter stats for './a.out' (5 runs):
> >> >          2.063040      task-clock (msec)         #    0.009 CPUs
> >> >          utilized
> >> > 
> >> > ( +-  1.37% )
> >> > 
> >> >                44      context-switches          #    0.021 M/sec
> >> > 
> >> > ( +-  1.95% )
> >> > 
> >> >                 2      cpu-migrations            #    0.776 K/sec
> >> > 
> >> > ( +- 25.00% )
> >> > 
> >> >               209      page-faults               #    0.101 M/sec
> >> > 
> >> > ( +-  0.34% )
> >> > 
> >> >         7,053,686      cycles                    #    3.419 GHz
> >> > 
> >> > ( +-  2.03% )
> >> > 
> >> >         6,893,783      instructions              #    0.98  insn per
> >> >         cycle
> >> > 
> >> > ( +-  0.25% )
> >> > 
> >> >         1,342,492      branches                  #  650.735 M/sec
> >> > 
> >> > ( +-  0.20% )
> >> > 
> >> >            29,390      branch-misses             #    2.19% of all
> >> >            branches
> >> > 
> >> > ( +-  0.61% )
> >> > 
> >> >       0.225597302 seconds time elapsed
> >> > 
> >> > ( +-  6.68% )
> >> > ~~~~~
> >> > 
> >> > This is without any LTTng session configured. If I enable LTTng kernel
> >> > and
> >> > userspace events, this becomes even worse:
> >> > 
> >> > ~~~~~
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ cat $(which run_lttng_trace.sh)
> >> > #!/bin/sh
> >> > 
> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >> > 
> >> >    sudo lttng-sessiond --daemonize
> >> > 
> >> > fi
> >> > 
> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >> > sudo lttng enable-event -c kernel -k -a
> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >> > sudo lttng start
> >> > $@
> >> > sudo lttng stop
> >> > 
> >> > sudo chmod a+rx -R ~/lttng-traces
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> >> > Session auto-20170905-162818 created.
> >> > Traces will be written in
> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
> >> > kernel
> >> > enabled for session auto-20170905-162818
> >> > All Kernel events are enabled in channel kernel
> >> > UST channel ust enabled for session auto-20170905-162818
> >> > UST event lttng_ust_tracef:* created in channel ust
> >> > Tracing started for session auto-20170905-162818
> >> > 
> >> > Performance counter stats for './a.out' (5 runs):
> >> >          4.937820      task-clock (msec)         #    0.005 CPUs
> >> >          utilized
> >> > 
> >> > ( +-  1.28% )
> >> > 
> >> >               117      context-switches          #    0.024 M/sec
> >> > 
> >> > ( +-  0.70% )
> >> > 
> >> >                 3      cpu-migrations            #    0.608 K/sec
> >> > 
> >> > ( +- 23.57% )
> >> > 
> >> >               245      page-faults               #    0.050 M/sec
> >> > 
> >> > ( +-  0.20% )
> >> > 
> >> >        16,541,355      cycles                    #    3.350 GHz
> >> > 
> >> > ( +-  0.94% )
> >> > 
> >> >        20,611,637      instructions              #    1.25  insn per
> >> >        cycle
> >> > 
> >> > ( +-  0.23% )
> >> > 
> >> >         3,580,525      branches                  #  725.123 M/sec
> >> > 
> >> > ( +-  0.19% )
> >> > 
> >> >            52,093      branch-misses             #    1.45% of all
> >> >            branches
> >> > 
> >> > ( +-  1.73% )
> >> > 
> >> >       0.965545882 seconds time elapsed
> >> > 
> >> > ( +- 16.96% )
> >> > 
> >> > Waiting for data availability...
> >> > Tracing stopped for session auto-20170905-162818
> >> > ~~~~~
> >> > 
> >> > I see this ~1s delay during startup in nearly all test applications
> >> > where
> >> > I
> >> > wanted to use lttng-ust. In tracecompass, and according to perf sleep
> >> > time
> >> > profiling, this seems to be mostly due to extensive sleeping on mutex
> >> > locks
> >> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
> >> > taking a lot of time:
> >> > 
> >> > ~~~~~
> >> > # without a running lttng session:
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ perf trace --duration 1 ./a.out
> >> > 
> >> >     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, 
msg:
> >> > 0x7fbe2fbb1070                  ) = 1
> >> > 
> >> >     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> >> > 
> >> >    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
> >> >    WAIT|
> >> > 
> >> > PRIV, val: 2                    ) = 0
> >> > 
> >> >   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> > # with a running lttng session:
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> >> > 
> >> >     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
> >> >     3<socket:[1185362]>,
> > 
> > msg:
> >> > 0x7fcd81d8a0b0                  ) = 612
> >> > 
> >> >   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
> > 
> > msg:
> >> > 0x7fcd81d8a070                  ) = 1
> >> > 
> >> >     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> >> > 
> >> >   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
> >> >   WAIT|
> >> > 
> >> > PRIV, val: 2                    ) = 0
> >> > 
> >> >   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, 
op:
> >> > WAIT|PRIV, val: 2                    ) = 0
> >> > 
> >> >   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> >> >   WAIT|
> >> > 
> >> > PRIV, val: 2                    ) = 0
> >> > 
> >> >   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, 
op:
> >> > WAIT|PRIV, val: 2                    ) = 0
> >> > 
> >> >   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> >> > 
> >> > ) = 0
> >> > 
> >> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> >> > 
> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> >> > ~~~~~
> >> > 
> >> > This makes lttng-ust pretty much unusable for any short-lived
> >> > application.
> >> > Is this a known limitation?
> >> 
> >> This kind of delay is not completely unexpected, but a bit higher than I
> >> would expect. I'm interested in figuring out if most of this delay
> >> happens
> >> to be caused by sys_membarrier, either directly, or indirectly (through a
> >> lock).
> > 
> > See my other email which adds some harder numbers from a perf sleep
> > record.
> > 
> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
> >> > I've
> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
> >> > above measurements there. So far, I suspected the ust statedump to take
> >> > too much time, but I disabled that one for the measurements above.
> >> > 
> >> > Any help would be appreciated
> >> 
> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
> >> share the performance results ?
> > 
> > I'd have to recompile the kernel, which is something I don't like to do.
> > Why is this required, or put differently - why would the kernel option
> > influence the userspace LTTng library?
> 
> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
> system call to speed up tracing, at the expense of slower execution of
> tracing state update.
> 
> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
> (single-process) use cases which should execute much faster than the
> non-expedited SHARED command currently implemented.
> 
> liburcu 0.9 and 0.10 detects that this system call is available, and uses
> it when it can.

May I ask whether you have such a configuration available? The test case I 
have submitted should be easy to reproduce for everyone. Compiling my own 
kernel version takes a huge amount of time which I'm quite frankly not willing 
to invest at this point.

I simply wonder whether anyone else is actually using lttng-ust, or if they 
are, whether they are only doing that for long-lived applications where this 
overhead is not noticeable?

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts


_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: large liblttng-ust startup overhead (ust_lock)
       [not found]     ` <859976545.3393.1504635610385.JavaMail.zimbra@efficios.com>
@ 2017-09-06  7:42       ` Milian Wolff
  0 siblings, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-06  7:42 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev

On Tuesday, 5 September 2017 20:20:10 CEST Mathieu Desnoyers wrote:
> ----- On Sep 5, 2017, at 11:10 AM, Milian Wolff milian.wolff@kdab.com wrote:
> > On Tuesday, September 5, 2017 4:52:42 PM CEST Milian Wolff wrote:
> >> On Tuesday, September 5, 2017 4:34:54 PM CEST Milian Wolff wrote:
> >> > Hey all,
> >> > 
> >> > I have noticed a very large overhead when linking against liblttng-ust:

<snip>

> >> > 
> >> > This is without any LTTng session configured. If I enable LTTng kernel
> >> > and
> >> > userspace events, this becomes even worse:
> >> > 
> >> > ~~~~~
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ cat $(which run_lttng_trace.sh)
> >> > #!/bin/sh
> >> > 
> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >> > 
> >> >     sudo lttng-sessiond --daemonize
> >> > 
> >> > fi
> >> > 
> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >> > sudo lttng enable-event -c kernel -k -a
> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >> > sudo lttng start
> >> > $@
> >> > sudo lttng stop
> >> > 
> >> > sudo chmod a+rx -R ~/lttng-traces

<snip>

> > One thing that is apparent from the above, and can also be made visible by
> > adding some visible output to the application before quitting it: It's not
> > the main thread that's blocked initially. Rather, the main thread spins
> > up the secondary -ust threads, then runs its code and finishes before the
> > -ust threads manage to initialize. Before the main thread then quits, it
> > waits for the ust threads to finish.
> 
> Those threads appear to be talking to the session daemon (on recvmsg).
> Please note that there are two cases there:
> 
> - If you use per-pid buffers (explicitly at channel creation), the consumer
>   daemon need to allocate buffers on behalf of each and every new process
>   that appear while tracing, which takes a significant amount of time,
> - If you use default buffers (per-uid), only the first process that
> registers for a specific UID will require that the consumer daemon
> allocates buffers. After that, the same buffers will be sent to other
> applications registering for the same user ID.
> 
> Clearly, per-pid buffers incur a significant slowdown at application start.
> This is one reason why those are not used by default.
> 
> If you benchmark per-uid buffers, you will want to benchmark those two
> scenarios:
> 
> 1) startup time of the first process that registers for a UID,
> 
> 2) startup time of following process registrations
> 
> Technically, the buffer allocation delay only impacts the first process that
> registers for a UID, and not all of them.

See my lttng commands above: I'm using the default buffer scheme, so if I
understand you correctly, that should be per-UID. I can confirm this from the
directory hierarchy in the lttng trace folder:

~~~~~
┌milian@milian-kdab2:~/lttng-traces/2017-09-05T16:44:39+02:00
└$ find
.
./kernel
./kernel/kernel_6
./kernel/kernel_4
./kernel/kernel_2
./kernel/metadata
./kernel/kernel_1
./kernel/kernel_5
./kernel/index
./kernel/index/kernel_1.idx
./kernel/index/kernel_7.idx
./kernel/index/kernel_4.idx
./kernel/index/kernel_0.idx
./kernel/index/kernel_5.idx
./kernel/index/kernel_3.idx
./kernel/index/kernel_6.idx
./kernel/index/kernel_2.idx
./kernel/kernel_7
./kernel/kernel_0
./kernel/kernel_3
./ust
./ust/uid
./ust/uid/1000
./ust/uid/1000/64-bit
./ust/uid/1000/64-bit/ust_6
./ust/uid/1000/64-bit/ust_7
./ust/uid/1000/64-bit/ust_1
./ust/uid/1000/64-bit/metadata
./ust/uid/1000/64-bit/ust_0
./ust/uid/1000/64-bit/ust_3
./ust/uid/1000/64-bit/index
./ust/uid/1000/64-bit/index/ust_2.idx
./ust/uid/1000/64-bit/index/ust_3.idx
./ust/uid/1000/64-bit/index/ust_5.idx
./ust/uid/1000/64-bit/index/ust_6.idx
./ust/uid/1000/64-bit/index/ust_0.idx
./ust/uid/1000/64-bit/index/ust_4.idx
./ust/uid/1000/64-bit/index/ust_7.idx
./ust/uid/1000/64-bit/index/ust_1.idx
./ust/uid/1000/64-bit/ust_5
./ust/uid/1000/64-bit/ust_4
./ust/uid/1000/64-bit/ust_2
~~~~~

Note that my measurements somewhat contradict your assumptions, though: it is
not only the first task that I run while tracing that gets slowed down, it's
all of them. Have a look at the `perf stat` output I pasted:

       0.965545882 seconds time elapsed ( +- 16.96% )

I can reproduce values within that variance by running the example code 
multiple times and measuring each time individually, while keeping the lttng 
session up and running.
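
For reference, a minimal sketch of that measurement loop (session setup as in
the script earlier in this thread; each run is timed on its own while the
session stays active):

~~~~~
sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
sudo lttng enable-event -c ust -u 'lttng_ust_tracef:*'
sudo lttng start
for i in 1 2 3 4 5; do
    perf stat ./a.out      # each invocation measured individually
done
sudo lttng stop
~~~~~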

Bye

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts


_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: large liblttng-ust startup overhead (ust_lock)
       [not found]       ` <3211887.IErR8xQyUx@agathebauer>
@ 2017-09-06 19:57         ` Mathieu Desnoyers
       [not found]         ` <1506174693.4712.1504727831374.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2017-09-06 19:57 UTC (permalink / raw)
  To: Milian Wolff; +Cc: lttng-dev

[-- Attachment #1: Type: text/plain, Size: 13899 bytes --]

----- On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wolff@kdab.com wrote:

> On Tuesday, 5 September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
>> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:
>> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
>> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com
> wrote:
>> >> > Hey all,
>> >> > 
>> >> > I have noticed a very large overhead when linking against liblttng-ust:
>> >> > 
>> >> > ~~~~~
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ cat lttng-test.c
>> >> > int main()
>> >> > {
>> >> > 
>> >> >  return 0;
>> >> > 
>> >> > }
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ gcc -O2 -g -ldl lttng-test.c
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ perf stat -r 5 ./a.out
>> >> > 
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> >          0.209587      task-clock (msec)         #    0.596 CPUs
>> >> >          utilized
>> >> > 
>> >> > ( +-  8.76% )
>> >> > 
>> >> >                 0      context-switches          #    0.000 K/sec
>> >> >                 0      cpu-migrations            #    0.000 K/sec
>> >> >                
>> >> >                49      page-faults               #    0.235 M/sec
>> >> > 
>> >> > ( +-  1.19% )
>> >> > 
>> >> >           706,854      cycles                    #    3.373 GHz
>> >> > 
>> >> > ( +-  8.82% )
>> >> > 
>> >> >           773,603      instructions              #    1.09  insn per
>> >> >           cycle
>> >> > 
>> >> > ( +-  0.75% )
>> >> > 
>> >> >           147,128      branches                  #  701.987 M/sec
>> >> > 
>> >> > ( +-  0.70% )
>> >> > 
>> >> >             4,096      branch-misses             #    2.78% of all
>> >> >             branches
>> >> > 
>> >> > ( +-  5.27% )
>> >> > 
>> >> >       0.000351422 seconds time elapsed
>> >> > 
>> >> > ( +- 11.85% )
>> >> > 
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ perf stat -r 5 ./a.out
>> >> > 
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> >          2.063040      task-clock (msec)         #    0.009 CPUs
>> >> >          utilized
>> >> > 
>> >> > ( +-  1.37% )
>> >> > 
>> >> >                44      context-switches          #    0.021 M/sec
>> >> > 
>> >> > ( +-  1.95% )
>> >> > 
>> >> >                 2      cpu-migrations            #    0.776 K/sec
>> >> > 
>> >> > ( +- 25.00% )
>> >> > 
>> >> >               209      page-faults               #    0.101 M/sec
>> >> > 
>> >> > ( +-  0.34% )
>> >> > 
>> >> >         7,053,686      cycles                    #    3.419 GHz
>> >> > 
>> >> > ( +-  2.03% )
>> >> > 
>> >> >         6,893,783      instructions              #    0.98  insn per
>> >> >         cycle
>> >> > 
>> >> > ( +-  0.25% )
>> >> > 
>> >> >         1,342,492      branches                  #  650.735 M/sec
>> >> > 
>> >> > ( +-  0.20% )
>> >> > 
>> >> >            29,390      branch-misses             #    2.19% of all
>> >> >            branches
>> >> > 
>> >> > ( +-  0.61% )
>> >> > 
>> >> >       0.225597302 seconds time elapsed
>> >> > 
>> >> > ( +-  6.68% )
>> >> > ~~~~~
>> >> > 
>> >> > This is without any LTTng session configured. If I enable LTTng kernel
>> >> > and
>> >> > userspace events, this becomes even worse:
>> >> > 
>> >> > ~~~~~
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ cat $(which run_lttng_trace.sh)
>> >> > #!/bin/sh
>> >> > 
>> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
>> >> > 
>> >> >    sudo lttng-sessiond --daemonize
>> >> > 
>> >> > fi
>> >> > 
>> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
>> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
>> >> > sudo lttng enable-event -c kernel -k -a
>> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
>> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
>> >> > sudo lttng start
>> >> > $@
>> >> > sudo lttng stop
>> >> > 
>> >> > sudo chmod a+rx -R ~/lttng-traces
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
>> >> > Session auto-20170905-162818 created.
>> >> > Traces will be written in
>> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
>> >> > kernel
>> >> > enabled for session auto-20170905-162818
>> >> > All Kernel events are enabled in channel kernel
>> >> > UST channel ust enabled for session auto-20170905-162818
>> >> > UST event lttng_ust_tracef:* created in channel ust
>> >> > Tracing started for session auto-20170905-162818
>> >> > 
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> >          4.937820      task-clock (msec)         #    0.005 CPUs
>> >> >          utilized
>> >> > 
>> >> > ( +-  1.28% )
>> >> > 
>> >> >               117      context-switches          #    0.024 M/sec
>> >> > 
>> >> > ( +-  0.70% )
>> >> > 
>> >> >                 3      cpu-migrations            #    0.608 K/sec
>> >> > 
>> >> > ( +- 23.57% )
>> >> > 
>> >> >               245      page-faults               #    0.050 M/sec
>> >> > 
>> >> > ( +-  0.20% )
>> >> > 
>> >> >        16,541,355      cycles                    #    3.350 GHz
>> >> > 
>> >> > ( +-  0.94% )
>> >> > 
>> >> >        20,611,637      instructions              #    1.25  insn per
>> >> >        cycle
>> >> > 
>> >> > ( +-  0.23% )
>> >> > 
>> >> >         3,580,525      branches                  #  725.123 M/sec
>> >> > 
>> >> > ( +-  0.19% )
>> >> > 
>> >> >            52,093      branch-misses             #    1.45% of all
>> >> >            branches
>> >> > 
>> >> > ( +-  1.73% )
>> >> > 
>> >> >       0.965545882 seconds time elapsed
>> >> > 
>> >> > ( +- 16.96% )
>> >> > 
>> >> > Waiting for data availability...
>> >> > Tracing stopped for session auto-20170905-162818
>> >> > ~~~~~
>> >> > 
>> >> > I see this ~1s delay during startup in nearly all test applications
>> >> > where
>> >> > I
>> >> > wanted to use lttng-ust. In tracecompass, and according to perf sleep
>> >> > time
>> >> > profiling, this seems to be mostly due to extensive sleeping on mutex
>> >> > locks
>> >> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
>> >> > taking a lot of time:
>> >> > 
>> >> > ~~~~~
>> >> > # without a running lttng session:
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ perf trace --duration 1 ./a.out
>> >> > 
>> >> >     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
> msg:
>> >> > 0x7fbe2fbb1070                  ) = 1
>> >> > 
>> >> >     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>> >> > 
>> >> >    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
>> >> >    WAIT|
>> >> > 
>> >> > PRIV, val: 2                    ) = 0
>> >> > 
>> >> >   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> > # with a running lttng session:
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>> >> > 
>> >> >     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
>> >> >     3<socket:[1185362]>,
>> > 
>> > msg:
>> >> > 0x7fcd81d8a0b0                  ) = 612
>> >> > 
>> >> >   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
>> > 
>> > msg:
>> >> > 0x7fcd81d8a070                  ) = 1
>> >> > 
>> >> >     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>> >> > 
>> >> >   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
>> >> >   WAIT|
>> >> > 
>> >> > PRIV, val: 2                    ) = 0
>> >> > 
>> >> >   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
> op:
>> >> > WAIT|PRIV, val: 2                    ) = 0
>> >> > 
>> >> >   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
>> >> >   WAIT|
>> >> > 
>> >> > PRIV, val: 2                    ) = 0
>> >> > 
>> >> >   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> op:
>> >> > WAIT|PRIV, val: 2                    ) = 0
>> >> > 
>> >> >   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
>> >> > 
>> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
>> >> > ~~~~~
>> >> > 
>> >> > This makes lttng-ust pretty much unusable for any short-lived
>> >> > application.
>> >> > Is this a known limitation?
>> >> 
>> >> This kind of delay is not completely unexpected, but a bit higher than I
>> >> would expect. I'm interested in figuring out if most of this delay
>> >> happens
>> >> to be caused by sys_membarrier, either directly, or indirectly (through a
>> >> lock).
>> > 
>> > See my other email which adds some harder numbers from a perf sleep
>> > record.
>> > 
>> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
>> >> > I've
>> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
>> >> > above measurements there. So far, I suspected the ust statedump to take
>> >> > too much time, but I disabled that one for the measurements above.
>> >> > 
>> >> > Any help would be appreciated
>> >> 
>> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
>> >> share the performance results ?
>> > 
>> > I'd have to recompile the kernel, which is something I don't like to do.
>> > Why is this required, or put differently - why would the kernel option
>> > influence the userspace LTTng library?
>> 
>> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
>> system call to speed up tracing, at the expense of slower execution of
>> tracing state update.
>> 
>> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
>> (single-process) use cases which should execute much faster than the
>> non-expedited SHARED command currently implemented.
>> 
>> liburcu 0.9 and 0.10 detects that this system call is available, and uses
>> it when it can.
> 
> May I ask whether you have such a configuration available? The test case I
> have submitted should be easy to reproduce for everyone. Compiling my own
> kernel version takes a huge amount of time which I'm quite frankly not willing
> to invest at this point.

We have reproduced the situation, and it points towards our use of the
sys_membarrier system call. We'll try to narrow down whether it comes from its
use by lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).

You can try patching liburcu so it won't use sys_membarrier at all
(see the attached patch).
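
If you want to give it a try, a rough sketch of applying it (source-tree
layout and build steps assumed — adjust to your liburcu checkout):

~~~~~
cd userspace-rcu
patch -p1 < liburcu-no-sys-mb.patch
./bootstrap && ./configure && make
# Run the test program against the patched, uninstalled libraries:
LD_LIBRARY_PATH=$PWD/src/.libs ./a.out
~~~~~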

> 
> I simply wonder whether anyone else is actually using lttng-ust, or if they
> are, whether they are only doing that for long-lived applications where this
> overhead is not noticeable?

This application startup slowdown appears mainly with LTTng linked against
liburcu 0.9 and 0.10 running on rather recent kernels.
This would explain why you are the first one to report this across our user
base.

Thanks,

Mathieu


> 
> Thanks
> 
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: liburcu-no-sys-mb.patch --]
[-- Type: text/x-patch; name=liburcu-no-sys-mb.patch, Size: 799 bytes --]

diff --git a/src/urcu-bp.c b/src/urcu-bp.c
index c0fac83..2d8bcca 100644
--- a/src/urcu-bp.c
+++ b/src/urcu-bp.c
@@ -95,7 +95,7 @@ static
 int rcu_bp_refcount;
 
 /* If the headers do not support membarrier system call, fall back smp_mb. */
-#ifdef __NR_membarrier
+#if 0 //def __NR_membarrier
 # define membarrier(...)		syscall(__NR_membarrier, __VA_ARGS__)
 #else
 # define membarrier(...)		-ENOSYS
diff --git a/src/urcu.c b/src/urcu.c
index a81b129..1aafbc0 100644
--- a/src/urcu.c
+++ b/src/urcu.c
@@ -64,7 +64,7 @@
 #define RCU_QS_ACTIVE_ATTEMPTS 100
 
 /* If the headers do not support membarrier system call, fall back on RCU_MB */
-#ifdef __NR_membarrier
+#if 0 //def __NR_membarrier
 # define membarrier(...)		syscall(__NR_membarrier, __VA_ARGS__)
 #else
 # define membarrier(...)		-ENOSYS

[-- Attachment #3: Type: text/plain, Size: 156 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: large liblttng-ust startup overhead (ust_lock)
       [not found]         ` <1506174693.4712.1504727831374.JavaMail.zimbra@efficios.com>
@ 2017-09-06 20:23           ` Mathieu Desnoyers
       [not found]           ` <643211580.4746.1504729420940.JavaMail.zimbra@efficios.com>
  2017-09-07  7:50           ` Milian Wolff
  2 siblings, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2017-09-06 20:23 UTC (permalink / raw)
  To: Milian Wolff, Paul E. McKenney; +Cc: lttng-dev

----- On Sep 6, 2017, at 3:57 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wolff@kdab.com wrote:
> 
>>> On Tuesday, September 5, 2017 20:11:58 CEST Mathieu Desnoyers wrote:
>>> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:
>>> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
>>> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com
>> wrote:
>>> >> > Hey all,
>>> >> > 
>>> >> > I have noticed a very large overhead when linking against liblttng-ust:
>>> >> > 
>>> >> > ~~~~~
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ cat lttng-test.c
>>> >> > int main()
>>> >> > {
>>> >> > 
>>> >> >  return 0;
>>> >> > 
>>> >> > }
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ gcc -O2 -g -ldl lttng-test.c
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ perf stat -r 5 ./a.out
>>> >> > 
>>> >> > Performance counter stats for './a.out' (5 runs):
>>> >> >          0.209587      task-clock (msec)         #    0.596 CPUs
>>> >> >          utilized
>>> >> > 
>>> >> > ( +-  8.76% )
>>> >> > 
>>> >> >                 0      context-switches          #    0.000 K/sec
>>> >> >                 0      cpu-migrations            #    0.000 K/sec
>>> >> >                
>>> >> >                49      page-faults               #    0.235 M/sec
>>> >> > 
>>> >> > ( +-  1.19% )
>>> >> > 
>>> >> >           706,854      cycles                    #    3.373 GHz
>>> >> > 
>>> >> > ( +-  8.82% )
>>> >> > 
>>> >> >           773,603      instructions              #    1.09  insn per
>>> >> >           cycle
>>> >> > 
>>> >> > ( +-  0.75% )
>>> >> > 
>>> >> >           147,128      branches                  #  701.987 M/sec
>>> >> > 
>>> >> > ( +-  0.70% )
>>> >> > 
>>> >> >             4,096      branch-misses             #    2.78% of all
>>> >> >             branches
>>> >> > 
>>> >> > ( +-  5.27% )
>>> >> > 
>>> >> >       0.000351422 seconds time elapsed
>>> >> > 
>>> >> > ( +- 11.85% )
>>> >> > 
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ perf stat -r 5 ./a.out
>>> >> > 
>>> >> > Performance counter stats for './a.out' (5 runs):
>>> >> >          2.063040      task-clock (msec)         #    0.009 CPUs
>>> >> >          utilized
>>> >> > 
>>> >> > ( +-  1.37% )
>>> >> > 
>>> >> >                44      context-switches          #    0.021 M/sec
>>> >> > 
>>> >> > ( +-  1.95% )
>>> >> > 
>>> >> >                 2      cpu-migrations            #    0.776 K/sec
>>> >> > 
>>> >> > ( +- 25.00% )
>>> >> > 
>>> >> >               209      page-faults               #    0.101 M/sec
>>> >> > 
>>> >> > ( +-  0.34% )
>>> >> > 
>>> >> >         7,053,686      cycles                    #    3.419 GHz
>>> >> > 
>>> >> > ( +-  2.03% )
>>> >> > 
>>> >> >         6,893,783      instructions              #    0.98  insn per
>>> >> >         cycle
>>> >> > 
>>> >> > ( +-  0.25% )
>>> >> > 
>>> >> >         1,342,492      branches                  #  650.735 M/sec
>>> >> > 
>>> >> > ( +-  0.20% )
>>> >> > 
>>> >> >            29,390      branch-misses             #    2.19% of all
>>> >> >            branches
>>> >> > 
>>> >> > ( +-  0.61% )
>>> >> > 
>>> >> >       0.225597302 seconds time elapsed
>>> >> > 
>>> >> > ( +-  6.68% )
>>> >> > ~~~~~
>>> >> > 
>>> >> > This is without any LTTng session configured. If I enable LTTng kernel
>>> >> > and
>>> >> > userspace events, this becomes even worse:
>>> >> > 
>>> >> > ~~~~~
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ cat $(which run_lttng_trace.sh)
>>> >> > #!/bin/sh
>>> >> > 
>>> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
>>> >> > 
>>> >> >    sudo lttng-sessiond --daemonize
>>> >> > 
>>> >> > fi
>>> >> > 
>>> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
>>> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
>>> >> > sudo lttng enable-event -c kernel -k -a
>>> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
>>> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
>>> >> > sudo lttng start
>>> >> > $@
>>> >> > sudo lttng stop
>>> >> > 
>>> >> > sudo chmod a+rx -R ~/lttng-traces
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
>>> >> > Session auto-20170905-162818 created.
>>> >> > Traces will be written in
>>> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
>>> >> > kernel
>>> >> > enabled for session auto-20170905-162818
>>> >> > All Kernel events are enabled in channel kernel
>>> >> > UST channel ust enabled for session auto-20170905-162818
>>> >> > UST event lttng_ust_tracef:* created in channel ust
>>> >> > Tracing started for session auto-20170905-162818
>>> >> > 
>>> >> > Performance counter stats for './a.out' (5 runs):
>>> >> >          4.937820      task-clock (msec)         #    0.005 CPUs
>>> >> >          utilized
>>> >> > 
>>> >> > ( +-  1.28% )
>>> >> > 
>>> >> >               117      context-switches          #    0.024 M/sec
>>> >> > 
>>> >> > ( +-  0.70% )
>>> >> > 
>>> >> >                 3      cpu-migrations            #    0.608 K/sec
>>> >> > 
>>> >> > ( +- 23.57% )
>>> >> > 
>>> >> >               245      page-faults               #    0.050 M/sec
>>> >> > 
>>> >> > ( +-  0.20% )
>>> >> > 
>>> >> >        16,541,355      cycles                    #    3.350 GHz
>>> >> > 
>>> >> > ( +-  0.94% )
>>> >> > 
>>> >> >        20,611,637      instructions              #    1.25  insn per
>>> >> >        cycle
>>> >> > 
>>> >> > ( +-  0.23% )
>>> >> > 
>>> >> >         3,580,525      branches                  #  725.123 M/sec
>>> >> > 
>>> >> > ( +-  0.19% )
>>> >> > 
>>> >> >            52,093      branch-misses             #    1.45% of all
>>> >> >            branches
>>> >> > 
>>> >> > ( +-  1.73% )
>>> >> > 
>>> >> >       0.965545882 seconds time elapsed
>>> >> > 
>>> >> > ( +- 16.96% )
>>> >> > 
>>> >> > Waiting for data availability...
>>> >> > Tracing stopped for session auto-20170905-162818
>>> >> > ~~~~~
>>> >> > 
>>> >> > I see this ~1s delay during startup in nearly all test applications
>>> >> > where
>>> >> > I
>>> >> > wanted to use lttng-ust. In tracecompass, and according to perf sleep
>>> >> > time
>>> >> > profiling, this seems to be mostly due to extensive sleeping on mutex
>>> >> > locks
>>> >> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
>>> >> > taking a lot of time:
>>> >> > 
>>> >> > ~~~~~
>>> >> > # without a running lttng session:
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ perf trace --duration 1 ./a.out
>>> >> > 
>>> >> >     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
>> msg:
>>> >> > 0x7fbe2fbb1070                  ) = 1
>>> >> > 
>>> >> >     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
>>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>>> >> > 
>>> >> >    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
>>> >> >    WAIT|
>>> >> > 
>>> >> > PRIV, val: 2                    ) = 0
>>> >> > 
>>> >> >   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> > # with a running lttng session:
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>>> >> > 
>>> >> >     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
>>> >> >     3<socket:[1185362]>,
>>> > 
>>> > msg:
>>> >> > 0x7fcd81d8a0b0                  ) = 612
>>> >> > 
>>> >> >   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
>>> > 
>>> > msg:
>>> >> > 0x7fcd81d8a070                  ) = 1
>>> >> > 
>>> >> >     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
>>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>>> >> > 
>>> >> >   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
>>> >> >   WAIT|
>>> >> > 
>>> >> > PRIV, val: 2                    ) = 0
>>> >> > 
>>> >> >   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
>> op:
>>> >> > WAIT|PRIV, val: 2                    ) = 0
>>> >> > 
>>> >> >   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
>>> >> >   WAIT|
>>> >> > 
>>> >> > PRIV, val: 2                    ) = 0
>>> >> > 
>>> >> >   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
>> op:
>>> >> > WAIT|PRIV, val: 2                    ) = 0
>>> >> > 
>>> >> >   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
>>> >> > 
>>> >> > ) = 0
>>> >> > 
>>> >> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
>>> >> > 
>>> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
>>> >> > ~~~~~
>>> >> > 
>>> >> > This makes lttng-ust pretty much unusable for any short-lived
>>> >> > application.
>>> >> > Is this a known limitation?
>>> >> 
>>> >> This kind of delay is not completely unexpected, but a bit higher than I
>>> >> would expect. I'm interested in figuring out if most of this delay
>>> >> happens
>>> >> to be caused by sys_membarrier, either directly, or indirectly (through a
>>> >> lock).
>>> > 
>>> > See my other email which adds some harder numbers from a perf sleep
>>> > record.
>>> > 
>>> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
>>> >> > I've
>>> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
>>> >> > above measurements there. So far, I suspected the ust statedump to take
>>> >> > too much time, but I disabled that one for the measurements above.
>>> >> > 
>>> >> > Any help would be appreciated
>>> >> 
>>> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
>>> >> share the performance results ?
>>> > 
>>> > I'd have to recompile the kernel, which is something I don't like to do.
>>> > Why is this required, or put differently - why would the kernel option
>>> > influence the userspace LTTng library?
>>> 
>>> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
>>> system call to speed up tracing, at the expense of slower execution of
>>> tracing state update.
>>> 
>>> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
>>> (single-process) use cases which should execute much faster than the
>>> non-expedited SHARED command currently implemented.
>>> 
>>> liburcu 0.9 and 0.10 detects that this system call is available, and uses
>>> it when it can.
>> 
>> May I ask whether you have such a configuration available? The test case I
>> have submitted should be easy to reproduce for everyone. Compiling my own
>> kernel version takes a huge amount of time which I'm quite frankly not willing
>> to invest at this point.
> 
> We have reproduced the situation, and it's pointing towards our use of the
> sys_membarrier system call. We'll try narrowing it down between its use by
> lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).
> 
> You can try to patch liburcu so it won't try using sys_membarrier at all
> (see attached patch)
> 
>> 
>> I simply wonder whether anyone else is actually using lttng-ust, or if they
>> are, whether they are only doing that for long-lived applications where this
>> overhead is not noticeable?
> 
> It appears that this application startup slowdown mainly appears with
> LTTng linked against liburcu 0.9 and 0.10 running with rather recent kernels.
> This would explain why you are the first one to report this across our user
> base.

We have confirmed that it's the use of sys_membarrier SHARED by liburcu-bp,
used by lttng-ust, that causes those delays.

lttng-ust's usage pattern is to call synchronize_rcu() directly. Issuing
membarrier system calls on every such update adds up to significant delays,
because membarrier SHARED blocks until a scheduler switch happens on each core.
We are invoking synchronize_rcu() directly rather than using the call_rcu
batching facilities because we want to minimize the number of threads we
have to insert into the application.
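
To make that trade-off concrete, here is a minimal sketch (not lttng-ust
code) of the two reclaim styles, using the usual liburcu-bp API; the
"struct node" and "shared" pointer are made up for illustration:

~~~~~
/* Sketch only: contrast inline synchronize_rcu() with call_rcu() batching.
 * Header and symbol names are the usual liburcu 0.9/0.10 ones; link with
 * -lurcu-bp. */
#include <stdlib.h>
#include <urcu-bp.h>		/* rcu_read_lock(), synchronize_rcu() */
#include <urcu-call-rcu.h>	/* call_rcu() */
#include <urcu-pointer.h>	/* rcu_assign_pointer() */
#include <urcu/compiler.h>	/* caa_container_of() */

struct node {
	int value;
	struct rcu_head rcu;	/* only needed for the call_rcu() variant */
};

static struct node *shared;	/* dereferenced by readers under rcu_read_lock() */

/* Style used by lttng-ust: wait for the grace period inline.  No extra
 * thread, but every call may end up in sys_membarrier (SHARED), which
 * blocks until each core has gone through the scheduler. */
static void replace_and_wait(struct node *new_node)
{
	struct node *old = shared;

	rcu_assign_pointer(shared, new_node);
	synchronize_rcu();	/* wait out all pre-existing readers */
	free(old);
}

/* call_rcu() batching: reclamation is deferred to a call_rcu worker thread,
 * amortizing grace periods -- but that worker thread is exactly what
 * lttng-ust does not want to inject into the traced application. */
static void free_node_cb(struct rcu_head *head)
{
	free(caa_container_of(head, struct node, rcu));
}

static void replace_deferred(struct node *new_node)
{
	struct node *old = shared;

	rcu_assign_pointer(shared, new_node);
	if (old)
		call_rcu(&old->rcu, free_node_cb);
}
~~~~~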

I'm currently proposing a PRIVATE_EXPEDITED command for the membarrier
system call in Linux. It will actively send IPIs rather than blocking.
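
A rough sketch of how a runtime choice between the proposed command and the
existing ones could look is below. MEMBARRIER_CMD_QUERY and
MEMBARRIER_CMD_SHARED are the commands that exist today; the PRIVATE_EXPEDITED
value and any registration requirement depend on what finally lands upstream,
so treat those parts as assumptions:

~~~~~
/* Sketch only: pick the cheapest membarrier command available at runtime. */
#include <unistd.h>
#include <sys/syscall.h>

#ifdef __NR_membarrier
# define membarrier(cmd, flags)	syscall(__NR_membarrier, (cmd), (flags))
#else
# define membarrier(cmd, flags)	(-1)
#endif

#define MEMBARRIER_CMD_QUERY			0
#define MEMBARRIER_CMD_SHARED			(1 << 0)
#define MEMBARRIER_CMD_PRIVATE_EXPEDITED	(1 << 3)	/* assumed value */

static int choose_membarrier_cmd(void)
{
	long mask = membarrier(MEMBARRIER_CMD_QUERY, 0);

	if (mask < 0)
		return -1;		/* no membarrier: use smp_mb() on both sides */
	if (mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED)
		return MEMBARRIER_CMD_PRIVATE_EXPEDITED;	/* IPI-based, does not block */
	if (mask & MEMBARRIER_CMD_SHARED)
		return MEMBARRIER_CMD_SHARED;	/* blocks until every core reschedules */
	return -1;
}
~~~~~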

I'm going to push a "Fix" in liburcu master, 0.10, 0.9 that will make it
stop using sys_membarrier in liburcu-bp. Then, after the new PRIVATE_EXPEDITED
command makes its way into the upstream Linux kernel, a future liburcu
(0.11 or later) will introduce usage of that new command, given that it has
a much, much smaller impact compared to the SHARED command.

The reason why we want sys_membarrier there is to speed up tracing. We
can trade memory barriers for simple compiler barriers on the tracing
fast-path by using sys_membarrier on the update side.
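
In other words, the reader fast-path only needs to stop the compiler from
reordering, while the updater pays for the real memory barrier on every CPU.
A minimal sketch of that asymmetry follows (liburcu spells the reader side
cmm_barrier(); the wrapper below is an assumption for illustration, not the
actual liburcu code):

~~~~~
#include <unistd.h>
#include <sys/syscall.h>

#ifdef __NR_membarrier
# define membarrier(cmd, flags)	syscall(__NR_membarrier, (cmd), (flags))
#else
# define membarrier(cmd, flags)	(-1)
#endif

#define MEMBARRIER_CMD_SHARED	(1 << 0)	/* "cmd: 1" in the perf traces above */

/* Tracing fast-path (readers): a compiler barrier only, no fence
 * instruction, so instrumented code stays cheap. */
static inline void reader_barrier(void)
{
	__asm__ __volatile__ ("" : : : "memory");
}

/* Update side (rare): force a full memory barrier on every running thread.
 * With the SHARED command this waits for a context switch on each core,
 * which is where the tens-of-milliseconds delays above come from. */
static void updater_barrier(void)
{
	if (membarrier(MEMBARRIER_CMD_SHARED, 0))
		__sync_synchronize();	/* fallback: plain memory barrier */
}
~~~~~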

Thanks for reporting your finding!

Mathieu


> 
> Thanks,
> 
> Mathieu
> 
> 
>> 
>> Thanks
>> 
>> --
>> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
>> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
>> Tel: +49-30-521325470
>> KDAB - The Qt Experts
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: large liblttng-ust startup overhead (ust_lock)
       [not found]           ` <643211580.4746.1504729420940.JavaMail.zimbra@efficios.com>
@ 2017-09-06 21:22             ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2017-09-06 21:22 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev

BTW, your expedited commit hit mainline earlier this week.  Here is
hoping!  ;-)

							Thanx, Paul

On Wed, Sep 06, 2017 at 08:23:40PM +0000, Mathieu Desnoyers wrote:
> ----- On Sep 6, 2017, at 3:57 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
> 
> > ----- On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wolff@kdab.com wrote:
> > 
> >>> On Tuesday, September 5, 2017 20:11:58 CEST Mathieu Desnoyers wrote:
> >>> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:
> >>> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> >>> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com
> >> wrote:
> >>> >> > Hey all,
> >>> >> > 
> >>> >> > I have noticed a very large overhead when linking against liblttng-ust:
> >>> >> > 
> >>> >> > ~~~~~
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ cat lttng-test.c
> >>> >> > int main()
> >>> >> > {
> >>> >> > 
> >>> >> >  return 0;
> >>> >> > 
> >>> >> > }
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ gcc -O2 -g -ldl lttng-test.c
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ perf stat -r 5 ./a.out
> >>> >> > 
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> >          0.209587      task-clock (msec)         #    0.596 CPUs
> >>> >> >          utilized
> >>> >> > 
> >>> >> > ( +-  8.76% )
> >>> >> > 
> >>> >> >                 0      context-switches          #    0.000 K/sec
> >>> >> >                 0      cpu-migrations            #    0.000 K/sec
> >>> >> >                
> >>> >> >                49      page-faults               #    0.235 M/sec
> >>> >> > 
> >>> >> > ( +-  1.19% )
> >>> >> > 
> >>> >> >           706,854      cycles                    #    3.373 GHz
> >>> >> > 
> >>> >> > ( +-  8.82% )
> >>> >> > 
> >>> >> >           773,603      instructions              #    1.09  insn per
> >>> >> >           cycle
> >>> >> > 
> >>> >> > ( +-  0.75% )
> >>> >> > 
> >>> >> >           147,128      branches                  #  701.987 M/sec
> >>> >> > 
> >>> >> > ( +-  0.70% )
> >>> >> > 
> >>> >> >             4,096      branch-misses             #    2.78% of all
> >>> >> >             branches
> >>> >> > 
> >>> >> > ( +-  5.27% )
> >>> >> > 
> >>> >> >       0.000351422 seconds time elapsed
> >>> >> > 
> >>> >> > ( +- 11.85% )
> >>> >> > 
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ perf stat -r 5 ./a.out
> >>> >> > 
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> >          2.063040      task-clock (msec)         #    0.009 CPUs
> >>> >> >          utilized
> >>> >> > 
> >>> >> > ( +-  1.37% )
> >>> >> > 
> >>> >> >                44      context-switches          #    0.021 M/sec
> >>> >> > 
> >>> >> > ( +-  1.95% )
> >>> >> > 
> >>> >> >                 2      cpu-migrations            #    0.776 K/sec
> >>> >> > 
> >>> >> > ( +- 25.00% )
> >>> >> > 
> >>> >> >               209      page-faults               #    0.101 M/sec
> >>> >> > 
> >>> >> > ( +-  0.34% )
> >>> >> > 
> >>> >> >         7,053,686      cycles                    #    3.419 GHz
> >>> >> > 
> >>> >> > ( +-  2.03% )
> >>> >> > 
> >>> >> >         6,893,783      instructions              #    0.98  insn per
> >>> >> >         cycle
> >>> >> > 
> >>> >> > ( +-  0.25% )
> >>> >> > 
> >>> >> >         1,342,492      branches                  #  650.735 M/sec
> >>> >> > 
> >>> >> > ( +-  0.20% )
> >>> >> > 
> >>> >> >            29,390      branch-misses             #    2.19% of all
> >>> >> >            branches
> >>> >> > 
> >>> >> > ( +-  0.61% )
> >>> >> > 
> >>> >> >       0.225597302 seconds time elapsed
> >>> >> > 
> >>> >> > ( +-  6.68% )
> >>> >> > ~~~~~
> >>> >> > 
> >>> >> > This is without any LTTng session configured. If I enable LTTng kernel
> >>> >> > and
> >>> >> > userspace events, this becomes even worse:
> >>> >> > 
> >>> >> > ~~~~~
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ cat $(which run_lttng_trace.sh)
> >>> >> > #!/bin/sh
> >>> >> > 
> >>> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >>> >> > 
> >>> >> >    sudo lttng-sessiond --daemonize
> >>> >> > 
> >>> >> > fi
> >>> >> > 
> >>> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >>> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >>> >> > sudo lttng enable-event -c kernel -k -a
> >>> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >>> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >>> >> > sudo lttng start
> >>> >> > $@
> >>> >> > sudo lttng stop
> >>> >> > 
> >>> >> > sudo chmod a+rx -R ~/lttng-traces
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> >>> >> > Session auto-20170905-162818 created.
> >>> >> > Traces will be written in
> >>> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
> >>> >> > kernel
> >>> >> > enabled for session auto-20170905-162818
> >>> >> > All Kernel events are enabled in channel kernel
> >>> >> > UST channel ust enabled for session auto-20170905-162818
> >>> >> > UST event lttng_ust_tracef:* created in channel ust
> >>> >> > Tracing started for session auto-20170905-162818
> >>> >> > 
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> >          4.937820      task-clock (msec)         #    0.005 CPUs
> >>> >> >          utilized
> >>> >> > 
> >>> >> > ( +-  1.28% )
> >>> >> > 
> >>> >> >               117      context-switches          #    0.024 M/sec
> >>> >> > 
> >>> >> > ( +-  0.70% )
> >>> >> > 
> >>> >> >                 3      cpu-migrations            #    0.608 K/sec
> >>> >> > 
> >>> >> > ( +- 23.57% )
> >>> >> > 
> >>> >> >               245      page-faults               #    0.050 M/sec
> >>> >> > 
> >>> >> > ( +-  0.20% )
> >>> >> > 
> >>> >> >        16,541,355      cycles                    #    3.350 GHz
> >>> >> > 
> >>> >> > ( +-  0.94% )
> >>> >> > 
> >>> >> >        20,611,637      instructions              #    1.25  insn per
> >>> >> >        cycle
> >>> >> > 
> >>> >> > ( +-  0.23% )
> >>> >> > 
> >>> >> >         3,580,525      branches                  #  725.123 M/sec
> >>> >> > 
> >>> >> > ( +-  0.19% )
> >>> >> > 
> >>> >> >            52,093      branch-misses             #    1.45% of all
> >>> >> >            branches
> >>> >> > 
> >>> >> > ( +-  1.73% )
> >>> >> > 
> >>> >> >       0.965545882 seconds time elapsed
> >>> >> > 
> >>> >> > ( +- 16.96% )
> >>> >> > 
> >>> >> > Waiting for data availability...
> >>> >> > Tracing stopped for session auto-20170905-162818
> >>> >> > ~~~~~
> >>> >> > 
> >>> >> > I see this ~1s delay during startup in nearly all test applications
> >>> >> > where
> >>> >> > I
> >>> >> > wanted to use lttng-ust. In tracecompass, and according to perf sleep
> >>> >> > time
> >>> >> > profiling, this seems to be mostly due to extensive sleeping on mutex
> >>> >> > locks
> >>> >> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
> >>> >> > taking a lot of time:
> >>> >> > 
> >>> >> > ~~~~~
> >>> >> > # without a running lttng session:
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ perf trace --duration 1 ./a.out
> >>> >> > 
> >>> >> >     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
> >> msg:
> >>> >> > 0x7fbe2fbb1070                  ) = 1
> >>> >> > 
> >>> >> >     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> >>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> >>> >> > 
> >>> >> >    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
> >>> >> >    WAIT|
> >>> >> > 
> >>> >> > PRIV, val: 2                    ) = 0
> >>> >> > 
> >>> >> >   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> > # with a running lttng session:
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> >>> >> > 
> >>> >> >     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
> >>> >> >     3<socket:[1185362]>,
> >>> > 
> >>> > msg:
> >>> >> > 0x7fcd81d8a0b0                  ) = 612
> >>> >> > 
> >>> >> >   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
> >>> > 
> >>> > msg:
> >>> >> > 0x7fcd81d8a070                  ) = 1
> >>> >> > 
> >>> >> >     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> >>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> >>> >> > 
> >>> >> >   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
> >>> >> >   WAIT|
> >>> >> > 
> >>> >> > PRIV, val: 2                    ) = 0
> >>> >> > 
> >>> >> >   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
> >> op:
> >>> >> > WAIT|PRIV, val: 2                    ) = 0
> >>> >> > 
> >>> >> >   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> >>> >> >   WAIT|
> >>> >> > 
> >>> >> > PRIV, val: 2                    ) = 0
> >>> >> > 
> >>> >> >   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> >> op:
> >>> >> > WAIT|PRIV, val: 2                    ) = 0
> >>> >> > 
> >>> >> >   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> >>> >> > 
> >>> >> > ) = 0
> >>> >> > 
> >>> >> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> >>> >> > 
> >>> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> >>> >> > ~~~~~
> >>> >> > 
> >>> >> > This makes lttng-ust pretty much unusable for any short-lived
> >>> >> > application.
> >>> >> > Is this a known limitation?
> >>> >> 
> >>> >> This kind of delay is not completely unexpected, but a bit higher than I
> >>> >> would expect. I'm interested in figuring out if most of this delay
> >>> >> happens
> >>> >> to be caused by sys_membarrier, either directly, or indirectly (through a
> >>> >> lock).
> >>> > 
> >>> > See my other email which adds some harder numbers from a perf sleep
> >>> > record.
> >>> > 
> >>> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
> >>> >> > I've
> >>> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
> >>> >> > above measurements there. So far, I suspected the ust statedump to take
> >>> >> > too much time, but I disabled that one for the measurements above.
> >>> >> > 
> >>> >> > Any help would be appreciated
> >>> >> 
> >>> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
> >>> >> share the performance results ?
> >>> > 
> >>> > I'd have to recompile the kernel, which is something I don't like to do.
> >>> > Why is this required, or put differently - why would the kernel option
> >>> > influence the userspace LTTng library?
> >>> 
> >>> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
> >>> system call to speed up tracing, at the expense of slower execution of
> >>> tracing state update.
> >>> 
> >>> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
> >>> (single-process) use cases which should execute much faster than the
> >>> non-expedited SHARED command currently implemented.
> >>> 
> >>> liburcu 0.9 and 0.10 detects that this system call is available, and uses
> >>> it when it can.
> >> 
> >> May I ask whether you have such a configuration available? The test case I
> >> have submitted should be easy to reproduce for everyone. Compiling my own
> >> kernel version takes a huge amount of time which I'm quite frankly not willing
> >> to invest at this point.
> > 
> > We have reproduced the situation, and it's pointing towards our use of the
> > sys_membarrier system call. We'll try narrowing it down between its use by
> > lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).
> > 
> > You can try to patch liburcu so it won't try using sys_membarrier at all
> > (see attached patch)
> > 
> >> 
> >> I simply wonder whether anyone else is actually using lttng-ust, or if they
> >> are, whether they are only doing that for long-lived applications where this
> >> overhead is not noticeable?
> > 
> > It appears that this application startup slowdown mainly appears with
> > LTTng linked against liburcu 0.9 and 0.10 running with rather recent kernels.
> > This would explain why you are the first one to report this across our user
> > base.
> 
> We have confirmed that it's the use of sys_membarrier SHARED by liburcu-bp,
> used by lttng-ust, that causes those delays.
> 
> lttng-ust usage pattern is to call synchronize_rcu() directly. Having
> membarrier system calls each time adds up to significant delays, because
> membarrier SHARED blocks until a scheduler switch happens on each core.
> We are invoking synchronize_rcu() directly rather than using the call_rcu
> batching facilities because we want to minimize the number of threads we
> have to insert into the application.
> 
> I'm currently proposing a PRIVATE_EXPEDITED command for the membarrier
> system call in Linux. It will actively send IPIs rather than blocking.
> 
> I'm going to push a "Fix" in liburcu master, 0.10, 0.9 that will make it
> stop using sys_membarrier in liburcu-bp. Then, after the new PRIVATE_EXPEDITED
> command makes its way into the upstream Linux kernel, a future liburcu
> (0.11 or +) will introduce usage of that new command, given that it has
> a much, much smaller impact compared to the SHARED command.
> 
> The reason why we want sys_membarrier there is to speed up tracing. We
> can trade memory barriers for simple compiler barriers on the tracing
> fast-path by using sys_membarrier on the update side.
> 
> Thanks for reporting your finding!
> 
> Mathieu
> 
> 
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > 
> >> 
> >> Thanks
> >> 
> >> --
> >> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> >> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> >> Tel: +49-30-521325470
> >> KDAB - The Qt Experts
> > 
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: large liblttng-ust startup overhead (ust_lock)
       [not found]         ` <1506174693.4712.1504727831374.JavaMail.zimbra@efficios.com>
  2017-09-06 20:23           ` Mathieu Desnoyers
       [not found]           ` <643211580.4746.1504729420940.JavaMail.zimbra@efficios.com>
@ 2017-09-07  7:50           ` Milian Wolff
  2 siblings, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-07  7:50 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 19679 bytes --]

On Wednesday, September 6, 2017 9:57:11 PM CEST Mathieu Desnoyers wrote:
> ----- On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wolff@kdab.com wrote:
> > On Tuesday, September 5, 2017 20:11:58 CEST Mathieu Desnoyers wrote:
> >> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com 
wrote:
> >> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> >> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com
> > 
> > wrote:
> >> >> > Hey all,
> >> >> > 
> >> >> > I have noticed a very large overhead when linking against
> >> >> > liblttng-ust:
> >> >> > 
> >> >> > ~~~~~
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ cat lttng-test.c
> >> >> > int main()
> >> >> > {
> >> >> > 
> >> >> >  return 0;
> >> >> > 
> >> >> > }
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ gcc -O2 -g -ldl lttng-test.c
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ perf stat -r 5 ./a.out
> >> >> > 
> >> >> > Performance counter stats for './a.out' (5 runs):
> >> >> >          0.209587      task-clock (msec)         #    0.596 CPUs
> >> >> >          utilized
> >> >> > 
> >> >> > ( +-  8.76% )
> >> >> > 
> >> >> >                 0      context-switches          #    0.000 K/sec
> >> >> >                 0      cpu-migrations            #    0.000 K/sec
> >> >> >                
> >> >> >                49      page-faults               #    0.235 M/sec
> >> >> > 
> >> >> > ( +-  1.19% )
> >> >> > 
> >> >> >           706,854      cycles                    #    3.373 GHz
> >> >> > 
> >> >> > ( +-  8.82% )
> >> >> > 
> >> >> >           773,603      instructions              #    1.09  insn per
> >> >> >           cycle
> >> >> > 
> >> >> > ( +-  0.75% )
> >> >> > 
> >> >> >           147,128      branches                  #  701.987 M/sec
> >> >> > 
> >> >> > ( +-  0.70% )
> >> >> > 
> >> >> >             4,096      branch-misses             #    2.78% of all
> >> >> >             branches
> >> >> > 
> >> >> > ( +-  5.27% )
> >> >> > 
> >> >> >       0.000351422 seconds time elapsed
> >> >> > 
> >> >> > ( +- 11.85% )
> >> >> > 
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ perf stat -r 5 ./a.out
> >> >> > 
> >> >> > Performance counter stats for './a.out' (5 runs):
> >> >> >          2.063040      task-clock (msec)         #    0.009 CPUs
> >> >> >          utilized
> >> >> > 
> >> >> > ( +-  1.37% )
> >> >> > 
> >> >> >                44      context-switches          #    0.021 M/sec
> >> >> > 
> >> >> > ( +-  1.95% )
> >> >> > 
> >> >> >                 2      cpu-migrations            #    0.776 K/sec
> >> >> > 
> >> >> > ( +- 25.00% )
> >> >> > 
> >> >> >               209      page-faults               #    0.101 M/sec
> >> >> > 
> >> >> > ( +-  0.34% )
> >> >> > 
> >> >> >         7,053,686      cycles                    #    3.419 GHz
> >> >> > 
> >> >> > ( +-  2.03% )
> >> >> > 
> >> >> >         6,893,783      instructions              #    0.98  insn per
> >> >> >         cycle
> >> >> > 
> >> >> > ( +-  0.25% )
> >> >> > 
> >> >> >         1,342,492      branches                  #  650.735 M/sec
> >> >> > 
> >> >> > ( +-  0.20% )
> >> >> > 
> >> >> >            29,390      branch-misses             #    2.19% of all
> >> >> >            branches
> >> >> > 
> >> >> > ( +-  0.61% )
> >> >> > 
> >> >> >       0.225597302 seconds time elapsed
> >> >> > 
> >> >> > ( +-  6.68% )
> >> >> > ~~~~~
> >> >> > 
> >> >> > This is without any LTTng session configured. If I enable LTTng
> >> >> > kernel
> >> >> > and
> >> >> > userspace events, this becomes even worse:
> >> >> > 
> >> >> > ~~~~~
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ cat $(which run_lttng_trace.sh)
> >> >> > #!/bin/sh
> >> >> > 
> >> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >> >> > 
> >> >> >    sudo lttng-sessiond --daemonize
> >> >> > 
> >> >> > fi
> >> >> > 
> >> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >> >> > sudo lttng enable-event -c kernel -k -a
> >> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >> >> > sudo lttng start
> >> >> > $@
> >> >> > sudo lttng stop
> >> >> > 
> >> >> > sudo chmod a+rx -R ~/lttng-traces
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> >> >> > Session auto-20170905-162818 created.
> >> >> > Traces will be written in
> >> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
> >> >> > kernel
> >> >> > enabled for session auto-20170905-162818
> >> >> > All Kernel events are enabled in channel kernel
> >> >> > UST channel ust enabled for session auto-20170905-162818
> >> >> > UST event lttng_ust_tracef:* created in channel ust
> >> >> > Tracing started for session auto-20170905-162818
> >> >> > 
> >> >> > Performance counter stats for './a.out' (5 runs):
> >> >> >          4.937820      task-clock (msec)         #    0.005 CPUs
> >> >> >          utilized
> >> >> > 
> >> >> > ( +-  1.28% )
> >> >> > 
> >> >> >               117      context-switches          #    0.024 M/sec
> >> >> > 
> >> >> > ( +-  0.70% )
> >> >> > 
> >> >> >                 3      cpu-migrations            #    0.608 K/sec
> >> >> > 
> >> >> > ( +- 23.57% )
> >> >> > 
> >> >> >               245      page-faults               #    0.050 M/sec
> >> >> > 
> >> >> > ( +-  0.20% )
> >> >> > 
> >> >> >        16,541,355      cycles                    #    3.350 GHz
> >> >> > 
> >> >> > ( +-  0.94% )
> >> >> > 
> >> >> >        20,611,637      instructions              #    1.25  insn per
> >> >> >        cycle
> >> >> > 
> >> >> > ( +-  0.23% )
> >> >> > 
> >> >> >         3,580,525      branches                  #  725.123 M/sec
> >> >> > 
> >> >> > ( +-  0.19% )
> >> >> > 
> >> >> >            52,093      branch-misses             #    1.45% of all
> >> >> >            branches
> >> >> > 
> >> >> > ( +-  1.73% )
> >> >> > 
> >> >> >       0.965545882 seconds time elapsed
> >> >> > 
> >> >> > ( +- 16.96% )
> >> >> > 
> >> >> > Waiting for data availability...
> >> >> > Tracing stopped for session auto-20170905-162818
> >> >> > ~~~~~
> >> >> > 
> >> >> > I see this ~1s delay during startup in nearly all test applications
> >> >> > where
> >> >> > I
> >> >> > wanted to use lttng-ust. In tracecompass, and according to perf
> >> >> > sleep
> >> >> > time
> >> >> > profiling, this seems to be mostly due to extensive sleeping on
> >> >> > mutex
> >> >> > locks
> >> >> > (from ust_lock in liblttng-ust). With perf trace, I also see
> >> >> > membarrier
> >> >> > taking a lot of time:
> >> >> > 
> >> >> > ~~~~~
> >> >> > # without a running lttng session:
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ perf trace --duration 1 ./a.out
> >> >> > 
> >> >> >     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
> > 
> > msg:
> >> >> > 0x7fbe2fbb1070                  ) = 1
> >> >> > 
> >> >> >     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> >> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> >> >> > 
> >> >> >    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
> >> >> >    WAIT|
> >> >> > 
> >> >> > PRIV, val: 2                    ) = 0
> >> >> > 
> >> >> >   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> > # with a running lttng session:
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> >> >> > 
> >> >> >     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
> >> >> >     3<socket:[1185362]>,
> >> > 
> >> > msg:
> >> >> > 0x7fcd81d8a0b0                  ) = 612
> >> >> > 
> >> >> >   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd:
> >> >> >   3<socket:[1185362]>,
> >> > 
> >> > msg:
> >> >> > 0x7fcd81d8a070                  ) = 1
> >> >> > 
> >> >> >     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> >> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> >> >> > 
> >> >> >   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
> >> >> >   WAIT|
> >> >> > 
> >> >> > PRIV, val: 2                    ) = 0
> >> >> > 
> >> >> >   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
> > 
> > op:
> >> >> > WAIT|PRIV, val: 2                    ) = 0
> >> >> > 
> >> >> >   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> >> >> >   op:
> >> >> >   WAIT|
> >> >> > 
> >> >> > PRIV, val: 2                    ) = 0
> >> >> > 
> >> >> >   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> > 
> > op:
> >> >> > WAIT|PRIV, val: 2                    ) = 0
> >> >> > 
> >> >> >   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> >> >> > 
> >> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0,
> >> >> > 128)=22
> >> >> > ~~~~~
> >> >> > 
> >> >> > This makes lttng-ust pretty much unusable for any short-lived
> >> >> > application.
> >> >> > Is this a known limitation?
> >> >> 
> >> >> This kind of delay is not completely unexpected, but a bit higher than
> >> >> I
> >> >> would expect. I'm interested in figuring out if most of this delay
> >> >> happens
> >> >> to be caused by sys_membarrier, either directly, or indirectly
> >> >> (through a
> >> >> lock).
> >> > 
> >> > See my other email which adds some harder numbers from a perf sleep
> >> > record.
> >> > 
> >> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
> >> >> > I've
> >> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
> >> >> > above measurements there. So far, I suspected the ust statedump to
> >> >> > take
> >> >> > too much time, but I disabled that one for the measurements above.
> >> >> > 
> >> >> > Any help would be appreciated
> >> >> 
> >> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
> >> >> share the performance results ?
> >> > 
> >> > I'd have to recompile the kernel, which is something I don't like to
> >> > do.
> >> > Why is this required, or put differently - why would the kernel option
> >> > influence the userspace LTTng library?
> >> 
> >> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
> >> system call to speed up tracing, at the expense of slower execution of
> >> tracing state update.
> >> 
> >> I'm currently proposing a new sys_membarrier command for
> >> PRIVATE_EXPEDITED
> >> (single-process) use cases which should execute much faster than the
> >> non-expedited SHARED command currently implemented.
> >> 
> >> liburcu 0.9 and 0.10 detects that this system call is available, and uses
> >> it when it can.
> > 
> > May I ask whether you have such a configuration available? The test case I
> > have submitted should be easy to reproduce for everyone. Compiling my own
> > kernel version takes a huge amount of time which I'm quite frankly not
> > willing to invest at this point.
> 
> We have reproduced the situation, and it's pointing towards our use of the
> sys_membarrier system call. We'll try narrowing it down between its use by
> lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).
> 
> You can try to patch liburcu so it won't try using sys_membarrier at all
> (see attached patch)

It does improve the situation tremendously. I just compiled v0.10.0-6-g6447802
with and without the patch to record the numbers below. Note that apparently
your latest patch [1] in git master alone is not enough to fix the performance
bug; I also need to apply the patch you sent by mail.

[1]: https://github.com/urcu/userspace-rcu/commit/64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6

But once I also apply the patch you sent by mail, things start to fly. Below,
"before" means running against userspace-rcu v0.10.0-6-g6447802, and "after"
means running against the same version with your emailed patch applied
on top.

Measurement without an active LTTng session:

~~~~~
# before:
┌milian@milian-kdab2:/tmp
└$ perf stat -r 5 ./a.out

 Performance counter stats for './a.out' (5 runs):

          3.973288      task-clock (msec)         #    0.005 CPUs utilized            
( +-  0.39% )
               153      context-switches          #    0.039 M/sec                    
( +-  1.43% )
                 2      cpu-migrations            #    0.554 K/sec                    
( +- 26.50% )
               315      page-faults               #    0.079 M/sec                    
( +-  0.21% )
        13,257,360      cycles                    #    3.337 GHz                      
( +-  0.34% )
        14,906,461      instructions              #    1.12  insn per cycle           
( +-  0.11% )
         2,910,368      branches                  #  732.483 M/sec                    
( +-  0.10% )
            42,766      branch-misses             #    1.47% of all branches          
( +-  0.89% )

       0.851654352 seconds time elapsed                                          
( +-  4.33% )

# after:
┌milian@milian-kdab2:/tmp
└$ perf stat -r 5 ./a.out

 Performance counter stats for './a.out' (5 runs):

          3.608242      task-clock (msec)         #    0.056 CPUs utilized            
( +-  1.84% )
               138      context-switches          #    0.038 M/sec                    
( +-  2.70% )
                 1      cpu-migrations            #    0.277 K/sec                  
               313      page-faults               #    0.087 M/sec                    
( +-  0.27% )
        12,022,499      cycles                    #    3.332 GHz                      
( +-  1.73% )
        14,861,318      instructions              #    1.24  insn per cycle           
( +-  0.32% )
         2,902,920      branches                  #  804.525 M/sec                    
( +-  0.32% )
            38,264      branch-misses             #    1.32% of all branches          
( +-  0.78% )

       0.063929538 seconds time elapsed                                          
( +- 10.93% )
~~~~~

With an active LTTng session:

~~~~~
# before:
 Performance counter stats for './a.out' (5 runs):

          9.152925      task-clock (msec)         #    0.005 CPUs utilized            
( +-  1.13% )
               216      context-switches          #    0.024 M/sec                    
( +-  2.83% )
                 4      cpu-migrations            #    0.459 K/sec                    
( +- 11.66% )
               370      page-faults               #    0.040 M/sec                    
( +-  0.20% )
        31,311,283      cycles                    #    3.421 GHz                      
( +-  0.69% )
        40,509,922      instructions              #    1.29  insn per cycle           
( +-  0.40% )
         7,192,651      branches                  #  785.831 M/sec                    
( +-  0.33% )
            90,264      branch-misses             #    1.25% of all branches          
( +-  1.46% )

       1.839545247 seconds time elapsed                                          
( +-  6.95% )

# after:
 Performance counter stats for './a.out' (5 runs):

          7.052094      task-clock (msec)         #    0.026 CPUs utilized            
( +-  1.45% )
               190      context-switches          #    0.027 M/sec                    
( +-  4.88% )
                 1      cpu-migrations            #    0.142 K/sec                    
( +- 44.72% )
               335      page-faults               #    0.048 M/sec                    
( +-  0.37% )
        24,105,408      cycles                    #    3.418 GHz                      
( +-  1.39% )
        35,100,950      instructions              #    1.46  insn per cycle           
( +-  0.17% )
         6,197,972      branches                  #  878.884 M/sec                    
( +-  0.15% )
            56,045      branch-misses             #    0.90% of all branches          
( +-  1.41% )

       0.272121270 seconds time elapsed                                          
( +- 52.41% )
~~~~~

> > I simply wonder whether anyone else is actually using lttng-ust, or if
> > they
> > are, whether they are only doing that for long-lived applications where
> > this overhead is not noticeable?
> 
> It appears that this application startup slowdown mainly appears with
> LTTng linked against liburcu 0.9 and 0.10 running with rather recent
> kernels. This would explain why you are the first one to report this across
> our user base.

Thank you, that explains it.

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

[-- Attachment #2: Type: text/plain, Size: 156 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* large liblttng-ust startup overhead (ust_lock)
@ 2017-09-05 14:34 Milian Wolff
  0 siblings, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-05 14:34 UTC (permalink / raw)
  To: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 9953 bytes --]

Hey all,

I have noticed a very large overhead when linking against liblttng-ust:

~~~~~
┌milian@milian-kdab2:/tmp
└$ cat lttng-test.c  
int main()
{
  return 0;
}
┌milian@milian-kdab2:/tmp
└$ gcc -O2 -g -ldl lttng-test.c                     
┌milian@milian-kdab2:/tmp
└$ perf stat -r 5 ./a.out 

 Performance counter stats for './a.out' (5 runs):

          0.209587      task-clock (msec)         #    0.596 CPUs utilized            
( +-  8.76% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                49      page-faults               #    0.235 M/sec                    
( +-  1.19% )
           706,854      cycles                    #    3.373 GHz                      
( +-  8.82% )
           773,603      instructions              #    1.09  insn per cycle           
( +-  0.75% )
           147,128      branches                  #  701.987 M/sec                    
( +-  0.70% )
             4,096      branch-misses             #    2.78% of all branches          
( +-  5.27% )

       0.000351422 seconds time elapsed                                          
( +- 11.85% )

┌milian@milian-kdab2:/tmp
└$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
┌milian@milian-kdab2:/tmp
└$ perf stat -r 5 ./a.out 

 Performance counter stats for './a.out' (5 runs):

          2.063040      task-clock (msec)         #    0.009 CPUs utilized            
( +-  1.37% )
                44      context-switches          #    0.021 M/sec                    
( +-  1.95% )
                 2      cpu-migrations            #    0.776 K/sec                    
( +- 25.00% )
               209      page-faults               #    0.101 M/sec                    
( +-  0.34% )
         7,053,686      cycles                    #    3.419 GHz                      
( +-  2.03% )
         6,893,783      instructions              #    0.98  insn per cycle           
( +-  0.25% )
         1,342,492      branches                  #  650.735 M/sec                    
( +-  0.20% )
            29,390      branch-misses             #    2.19% of all branches          
( +-  0.61% )

       0.225597302 seconds time elapsed                                          
( +-  6.68% )
~~~~~

This is without any LTTng session configured. If I enable LTTng kernel and 
userspace events, this becomes even worse:

~~~~~
┌milian@milian-kdab2:/tmp
└$ cat $(which run_lttng_trace.sh)
#!/bin/sh

if [ -z "$(pidof lttng-sessiond)" ]; then
    sudo lttng-sessiond --daemonize
fi

sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
sudo lttng enable-event -c kernel -k -a
sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
sudo lttng enable-event -c ust -u lttng_ust_tracef:*
sudo lttng start
$@
sudo lttng stop

sudo chmod a+rx -R ~/lttng-traces
┌milian@milian-kdab2:/tmp
└$ run_lttng_trace.sh perf stat -r 5 ./a.out 
Session auto-20170905-162818 created.
Traces will be written in /home/milian/lttng-traces/2017-09-05T16:28:18+02:00
Kernel channel kernel enabled for session auto-20170905-162818
All Kernel events are enabled in channel kernel
UST channel ust enabled for session auto-20170905-162818
UST event lttng_ust_tracef:* created in channel ust
Tracing started for session auto-20170905-162818

 Performance counter stats for './a.out' (5 runs):

          4.937820      task-clock (msec)         #    0.005 CPUs utilized            
( +-  1.28% )
               117      context-switches          #    0.024 M/sec                    
( +-  0.70% )
                 3      cpu-migrations            #    0.608 K/sec                    
( +- 23.57% )
               245      page-faults               #    0.050 M/sec                    
( +-  0.20% )
        16,541,355      cycles                    #    3.350 GHz                      
( +-  0.94% )
        20,611,637      instructions              #    1.25  insn per cycle           
( +-  0.23% )
         3,580,525      branches                  #  725.123 M/sec                    
( +-  0.19% )
            52,093      branch-misses             #    1.45% of all branches          
( +-  1.73% )

       0.965545882 seconds time elapsed                                          
( +- 16.96% )

Waiting for data availability...
Tracing stopped for session auto-20170905-162818
~~~~~

I see this ~1s delay during startup in nearly all test applications where I
wanted to use lttng-ust. According to Trace Compass and to perf sleep-time
profiling, this seems to be mostly due to extensive sleeping on mutex locks
(taken via ust_lock in liblttng-ust). With perf trace, I also see membarrier
taking a lot of time:

~~~~~
# without a running lttng session:
┌milian@milian-kdab2:/tmp
└$ perf trace --duration 1 ./a.out
     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg: 
0x7fbe2fbb1070                  ) = 1
     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op: 
WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1                                                     
) = 0
   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1                                                     
) = 0
    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|
PRIV, val: 2                    ) = 0
   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1                                                     
) = 0
   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1                                                     
) = 0
   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1                                                     
) = 0
   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1                                                     
) = 0
   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1                                                     
) = 0
   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1                                                     
) = 0

# with a running lttng session:
┌milian@milian-kdab2:/tmp
└$ run_lttng_trace.sh perf trace --duration 1 ./a.out 
     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 
0x7fcd81d8a0b0                  ) = 612
   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1                                                     
) = 0
   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1                                                     
) = 0
   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 
0x7fcd81d8a070                  ) = 1
     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op: 
WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|
PRIV, val: 2                    ) = 0
   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op: 
WAIT|PRIV, val: 2                    ) = 0
   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|
PRIV, val: 2                    ) = 0
   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: 
WAIT|PRIV, val: 2                    ) = 0
   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1                                                     
) = 0
     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000                                           
) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
~~~~~
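
To confirm that this time is really spent outside the application's own code
(i.e. in the library constructors/destructors where ust_lock is taken), one
rough check is to time main() itself and compare against the elapsed time
perf reports. A minimal sketch, meant as a drop-in replacement for the empty
test program above (file name illustrative):

~~~~~
/* main-window.c: print how long main() itself runs so it can be
 * compared against the total elapsed time reported by perf stat.
 * Any large difference is time spent in process setup/teardown,
 * i.e. in the constructors/destructors of the linked libraries.
 * Rough sketch only. */
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	/* the application's actual work would go here */
	clock_gettime(CLOCK_MONOTONIC, &end);

	double ms = (end.tv_sec - start.tv_sec) * 1e3 +
		    (end.tv_nsec - start.tv_nsec) / 1e6;
	fprintf(stderr, "time inside main: %.3f ms\n", ms);
	return 0;
}
~~~~~

Built with the same gcc -O2 -g -ldl -llttng-ust invocation as above: if the
delay really sits in the liblttng-ust startup/shutdown path, this should
report only microseconds inside main() while perf's elapsed time stays in the
hundreds of milliseconds.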

This makes lttng-ust pretty much unusable for any short-lived application. Is
this a known limitation?

I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've also
seen this problem on Ubuntu 17.04, though I did not repeat the above
measurements there. So far I had suspected the UST statedump of taking too
much time, but I disabled it for the measurements above.

Any help would be appreciated.

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-09-07  7:50 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <2117884.c1kuPXoiOA@milian-kdab2>
2017-09-05 14:51 ` large liblttng-ust startup overhead (ust_lock) Mathieu Desnoyers
2017-09-05 14:52 ` Milian Wolff
     [not found] ` <119280371.2823.1504623102155.JavaMail.zimbra@efficios.com>
2017-09-05 15:08   ` Milian Wolff
     [not found]   ` <3731578.YsMjlWojEO@milian-kdab2>
2017-09-05 18:11     ` Mathieu Desnoyers
     [not found]     ` <709369421.3343.1504635118648.JavaMail.zimbra@efficios.com>
2017-09-06  7:35       ` Milian Wolff
     [not found]       ` <3211887.IErR8xQyUx@agathebauer>
2017-09-06 19:57         ` Mathieu Desnoyers
     [not found]         ` <1506174693.4712.1504727831374.JavaMail.zimbra@efficios.com>
2017-09-06 20:23           ` Mathieu Desnoyers
     [not found]           ` <643211580.4746.1504729420940.JavaMail.zimbra@efficios.com>
2017-09-06 21:22             ` Paul E. McKenney
2017-09-07  7:50           ` Milian Wolff
     [not found] ` <5409371.Vv1znlTa4X@milian-kdab2>
2017-09-05 15:10   ` Milian Wolff
     [not found]   ` <5963588.vVBsjV779J@milian-kdab2>
2017-09-05 18:20     ` Mathieu Desnoyers
     [not found]     ` <859976545.3393.1504635610385.JavaMail.zimbra@efficios.com>
2017-09-06  7:42       ` Milian Wolff
2017-09-05 14:34 Milian Wolff
