* Re: large liblttng-ust startup overhead (ust_lock)
[not found] <2117884.c1kuPXoiOA@milian-kdab2>
@ 2017-09-05 14:51 ` Mathieu Desnoyers
2017-09-05 14:52 ` Milian Wolff
` (2 subsequent siblings)
3 siblings, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2017-09-05 14:51 UTC (permalink / raw)
To: Milian Wolff; +Cc: lttng-dev
----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com wrote:
> Hey all,
>
> I have noticed a very large overhead when linking against liblttng-ust:
>
> ~~~~~
> ┌milian@milian-kdab2:/tmp
> └$ cat lttng-test.c
> int main()
> {
>     return 0;
> }
> ┌milian@milian-kdab2:/tmp
> └$ gcc -O2 -g -ldl lttng-test.c
> ┌milian@milian-kdab2:/tmp
> └$ perf stat -r 5 ./a.out
>
> Performance counter stats for './a.out' (5 runs):
>
> 0.209587 task-clock (msec) # 0.596 CPUs utilized ( +- 8.76% )
> 0 context-switches # 0.000 K/sec
> 0 cpu-migrations # 0.000 K/sec
> 49 page-faults # 0.235 M/sec ( +- 1.19% )
> 706,854 cycles # 3.373 GHz ( +- 8.82% )
> 773,603 instructions # 1.09 insn per cycle ( +- 0.75% )
> 147,128 branches # 701.987 M/sec ( +- 0.70% )
> 4,096 branch-misses # 2.78% of all branches ( +- 5.27% )
>
> 0.000351422 seconds time elapsed ( +- 11.85% )
>
> ┌milian@milian-kdab2:/tmp
> └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> ┌milian@milian-kdab2:/tmp
> └$ perf stat -r 5 ./a.out
>
> Performance counter stats for './a.out' (5 runs):
>
> 2.063040 task-clock (msec) # 0.009 CPUs utilized ( +- 1.37% )
> 44 context-switches # 0.021 M/sec ( +- 1.95% )
> 2 cpu-migrations # 0.776 K/sec ( +- 25.00% )
> 209 page-faults # 0.101 M/sec ( +- 0.34% )
> 7,053,686 cycles # 3.419 GHz ( +- 2.03% )
> 6,893,783 instructions # 0.98 insn per cycle ( +- 0.25% )
> 1,342,492 branches # 650.735 M/sec ( +- 0.20% )
> 29,390 branch-misses # 2.19% of all branches ( +- 0.61% )
>
> 0.225597302 seconds time elapsed ( +- 6.68% )
> ~~~~~
>
> This is without any LTTng session configured. If I enable LTTng kernel and
> userspace events, this becomes even worse:
>
> ~~~~~
> ┌milian@milian-kdab2:/tmp
> └$ cat $(which run_lttng_trace.sh)
> #!/bin/sh
>
> if [ -z "$(pidof lttng-sessiond)" ]; then
> sudo lttng-sessiond --daemonize
> fi
>
> sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> sudo lttng enable-event -c kernel -k -a
> sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> sudo lttng start
> "$@"
> sudo lttng stop
>
> sudo chmod a+rx -R ~/lttng-traces
> ┌milian@milian-kdab2:/tmp
> └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> Session auto-20170905-162818 created.
> Traces will be written in /home/milian/lttng-traces/2017-09-05T16:28:18+02:00
> Kernel channel kernel enabled for session auto-20170905-162818
> All Kernel events are enabled in channel kernel
> UST channel ust enabled for session auto-20170905-162818
> UST event lttng_ust_tracef:* created in channel ust
> Tracing started for session auto-20170905-162818
>
> Performance counter stats for './a.out' (5 runs):
>
> 4.937820 task-clock (msec) # 0.005 CPUs utilized ( +- 1.28% )
> 117 context-switches # 0.024 M/sec ( +- 0.70% )
> 3 cpu-migrations # 0.608 K/sec ( +- 23.57% )
> 245 page-faults # 0.050 M/sec ( +- 0.20% )
> 16,541,355 cycles # 3.350 GHz ( +- 0.94% )
> 20,611,637 instructions # 1.25 insn per cycle ( +- 0.23% )
> 3,580,525 branches # 725.123 M/sec ( +- 0.19% )
> 52,093 branch-misses # 1.45% of all branches ( +- 1.73% )
>
> 0.965545882 seconds time elapsed ( +- 16.96% )
>
> Waiting for data availability...
> Tracing stopped for session auto-20170905-162818
> ~~~~~
>
> I see this ~1s delay during startup in nearly all test applications where I
> wanted to use lttng-ust. In tracecompass, and according to perf sleep time
> profiling, this seems to be mostly due to extensive sleeping on mutex locks
> (from ust_lock in liblttng-ust). With perf trace, I also see membarrier taking
> a lot of time:
>
> ~~~~~
> # without a running lttng session:
> ┌milian@milian-kdab2:/tmp
> └$ perf trace --duration 1 ./a.out
> 6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg: 0x7fbe2fbb1070) = 1
> 5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> 59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1) = 0
> 138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1) = 0
> 59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|PRIV, val: 2) = 0
> 179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1) = 0
> 258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1) = 0
> 299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1) = 0
> 378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1) = 0
> 439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1) = 0
> 508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1) = 0
>
> # with a running lttng session:
> ┌milian@milian-kdab2:/tmp
> └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> 6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 0x7fcd81d8a0b0) = 612
> 400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1) = 0
> 468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1) = 0
> 510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 0x7fcd81d8a070) = 1
> 6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> 610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|PRIV, val: 2) = 0
> 616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1) = 0
> 678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1) = 0
> 616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op: WAIT|PRIV, val: 2) = 0
> 738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1) = 0
> 788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1) = 0
> 739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|PRIV, val: 2) = 0
> 838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1) = 0
> 888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1) = 0
> 838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|PRIV, val: 2) = 0
> 958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1) = 0
> 1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1) = 0
> 1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1) = 0
> 1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1) = 0
> 1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1) = 0
> 1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1) = 0
> 1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1) = 0
> 1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1) = 0
> 6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> ~~~~~
>
> This makes lttng-ust pretty much unusable for any short-lived application. Is
> this a known limitation?
This kind of delay is not completely unexpected, but it is a bit higher than I
would expect. I'm interested in figuring out whether most of this delay is
caused by sys_membarrier, either directly or indirectly (through a lock).
>
> I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've seen
> this problem also on Ubuntu 17.04 though I didn't do any of the above
> measurements there. So far, I suspected the ust statedump to take too much
> time, but I disabled that one for the measurements above.
>
> Any help would be appreciated
Can you reproduce this with a Linux kernel that has CONFIG_MEMBARRIER=n and
share the performance results?
Thanks,
Mathieu
>
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] <2117884.c1kuPXoiOA@milian-kdab2>
2017-09-05 14:51 ` large liblttng-ust startup overhead (ust_lock) Mathieu Desnoyers
@ 2017-09-05 14:52 ` Milian Wolff
[not found] ` <119280371.2823.1504623102155.JavaMail.zimbra@efficios.com>
[not found] ` <5409371.Vv1znlTa4X@milian-kdab2>
3 siblings, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-05 14:52 UTC (permalink / raw)
To: lttng-dev
On Tuesday, September 5, 2017 4:34:54 PM CEST Milian Wolff wrote:
> Hey all,
>
> I have noticed a very large overhead when linking against liblttng-ust:
> [measurements and traces snipped; quoted in full in the first message of this thread]
Here's the perf sleep record data when no session is configured:
~~~~~
100% = 848.063336375 ms
# Overhead Pid:Command Shared Object Symbol Source:Line
# ........ ................. ................ .............. ...........
#
76.50% 10336:a.out-ust [kernel.vmlinux] [k] __schedule core.c:0
|
---__schedule core.c:0
schedule
futex_wait_queue_me futex.c:0
futex_wait futex.c:0
do_futex
sys_futex
do_syscall_64
return_from_SYSCALL_64 .tmp_entry_64.o:0
syscall
ust_listener_thread futex.h:65
/usr/include/urcu/futex.h:65 (inline)
/usr/include/urcu/futex.h:97 (inline)
/home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1311 (inline)
/home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1373 (inline)
15.79% 10335:a.out-ust [kernel.vmlinux] [k] __schedule core.c:0
|
---__schedule core.c:0
schedule
|
--15.70%--futex_wait_queue_me futex.c:0
futex_wait futex.c:0
do_futex
sys_futex
do_syscall_64
return_from_SYSCALL_64 .tmp_entry_64.o:0
__lll_lock_wait
pthread_mutex_lock
|
--15.69%--0x2338
0x3b29
__nptl_deallocate_tsd.part.4 pthread_create.c:0
start_thread pthread_create.c:0
__clone
7.72% 10334:a.out [kernel.vmlinux] [k] __schedule core.c:0
|
---__schedule core.c:0
schedule
futex_wait_queue_me futex.c:0
futex_wait futex.c:0
do_futex
sys_futex
do_syscall_64
return_from_SYSCALL_64 .tmp_entry_64.o:0
do_futex_wait sem_timedwait.c:0
__new_sem_wait_slow sem_timedwait.c:0
lttng_ust_init.part.3 lttng-ust-comm.c:1757
call_init.part.0 dl-init.c:0
_dl_init
_dl_start_user
~~~~~
Here's the perf sleep record data for an existing session:
~~~~~
100% = 3.0793293265 s
# Overhead Pid:Command Shared Object Symbol Source:Line
# ........ ................. ................ .............. ...........
#
52.10% 10762:a.out-ust [kernel.vmlinux] [k] __schedule core.c:0
|
---__schedule core.c:0
schedule
futex_wait_queue_me futex.c:0
futex_wait futex.c:0
do_futex
sys_futex
do_syscall_64
return_from_SYSCALL_64 .tmp_entry_64.o:0
syscall
ust_listener_thread futex.h:65
/usr/include/urcu/futex.h:65 (inline)
/usr/include/urcu/futex.h:97 (inline)
/home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1311 (inline)
/home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1373 (inline)
27.09% 10761:a.out-ust [kernel.vmlinux] [k] __schedule core.c:0
|
---__schedule core.c:0
schedule
|
|--13.69%--schedule_timeout
| unix_stream_read_generic af_unix.c:0
| unix_stream_recvmsg af_unix.c:0
| sock_recvmsg
| ___sys_recvmsg socket.c:0
| __sys_recvmsg
| sys_recvmsg
| do_syscall_64
| return_from_SYSCALL_64 .tmp_entry_64.o:0
| __libc_recvmsg
| ustcomm_recv_unix_sock lttng-ust-comm.c:307
|
--13.40%--futex_wait_queue_me futex.c:0
futex_wait futex.c:0
do_futex
sys_futex
do_syscall_64
return_from_SYSCALL_64 .tmp_entry_64.o:0
__lll_lock_wait
pthread_mutex_lock
|
|--8.41%--0x2338
| 0x3b29
| __nptl_deallocate_tsd.part.4
pthread_create.c:0
| start_thread pthread_create.c:0
| __clone
|
--4.99%--ust_lock_nocheck lttng-ust-comm.c:178
20.81% 10760:a.out [kernel.vmlinux] [k] __schedule core.c:0
|
---__schedule core.c:0
schedule
|
--20.81%--futex_wait_queue_me futex.c:0
futex_wait futex.c:0
do_futex
sys_futex
do_syscall_64
return_from_SYSCALL_64 .tmp_entry_64.o:0
|
--20.77%--do_futex_wait sem_timedwait.c:0
__new_sem_wait_slow sem_timedwait.c:0
lttng_ust_init.part.3 lttng-ust-comm.c:1757
call_init.part.0 dl-init.c:0
_dl_init
_dl_start_user
~~~~~
Hope that helps
--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] ` <119280371.2823.1504623102155.JavaMail.zimbra@efficios.com>
@ 2017-09-05 15:08 ` Milian Wolff
[not found] ` <3731578.YsMjlWojEO@milian-kdab2>
1 sibling, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-05 15:08 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: lttng-dev
On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com wrote:
> > [original report and measurements snipped; quoted in full in the first message of this thread]
>
> This kind of delay is not completely unexpected, but a bit higher than I
> would expect. I'm interested in figuring out if most of this delay happens
> to be caused by sys_membarrier, either directly, or indirectly (through a
> lock).
See my other email which adds some harder numbers from a perf sleep record.
> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
> > above measurements there. So far, I suspected the ust statedump to take
> > too much time, but I disabled that one for the measurements above.
> >
> > Any help would be appreciated
>
> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
> share the performance results ?
I'd have to recompile the kernel, which is something I don't like to do. Why
is this required? Or, put differently: why would the kernel option influence
the userspace LTTng library?
Thanks
--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] ` <5409371.Vv1znlTa4X@milian-kdab2>
@ 2017-09-05 15:10 ` Milian Wolff
[not found] ` <5963588.vVBsjV779J@milian-kdab2>
1 sibling, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-05 15:10 UTC (permalink / raw)
To: lttng-dev
On Tuesday, September 5, 2017 4:52:42 PM CEST Milian Wolff wrote:
> On Tuesday, September 5, 2017 4:34:54 PM CEST Milian Wolff wrote:
> > Hey all,
> >
> > I have noticed a very large overhead when linking against liblttng-ust:
> >
> > ~~~~~
> > ┌milian@milian-kdab2:/tmp
> > └$ cat lttng-test.c
> > int main()
> > {
> >
> > return 0;
> >
> > }
> > ┌milian@milian-kdab2:/tmp
> > └$ gcc -O2 -g -ldl lttng-test.c
> > ┌milian@milian-kdab2:/tmp
> > └$ perf stat -r 5 ./a.out
> >
> > Performance counter stats for './a.out' (5 runs):
> > 0.209587 task-clock (msec) # 0.596 CPUs utilized
> >
> > ( +- 8.76% )
> >
> > 0 context-switches # 0.000 K/sec
> > 0 cpu-migrations # 0.000 K/sec
> >
> > 49 page-faults # 0.235 M/sec
> >
> > ( +- 1.19% )
> >
> > 706,854 cycles # 3.373 GHz
> >
> > ( +- 8.82% )
> >
> > 773,603 instructions # 1.09 insn per
> > cycle
> >
> > ( +- 0.75% )
> >
> > 147,128 branches # 701.987 M/sec
> >
> > ( +- 0.70% )
> >
> > 4,096 branch-misses # 2.78% of all
> > branches
> >
> > ( +- 5.27% )
> >
> > 0.000351422 seconds time elapsed
> >
> > ( +- 11.85% )
> >
> > ┌milian@milian-kdab2:/tmp
> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> > ┌milian@milian-kdab2:/tmp
> > └$ perf stat -r 5 ./a.out
> >
> > Performance counter stats for './a.out' (5 runs):
> > 2.063040 task-clock (msec) # 0.009 CPUs utilized
> >
> > ( +- 1.37% )
> >
> > 44 context-switches # 0.021 M/sec
> >
> > ( +- 1.95% )
> >
> > 2 cpu-migrations # 0.776 K/sec
> >
> > ( +- 25.00% )
> >
> > 209 page-faults # 0.101 M/sec
> >
> > ( +- 0.34% )
> >
> > 7,053,686 cycles # 3.419 GHz
> >
> > ( +- 2.03% )
> >
> > 6,893,783 instructions # 0.98 insn per
> > cycle
> >
> > ( +- 0.25% )
> >
> > 1,342,492 branches # 650.735 M/sec
> >
> > ( +- 0.20% )
> >
> > 29,390 branch-misses # 2.19% of all
> > branches
> >
> > ( +- 0.61% )
> >
> > 0.225597302 seconds time elapsed
> >
> > ( +- 6.68% )
> > ~~~~~
> >
> > This is without any LTTng session configured. If I enable LTTng kernel and
> > userspace events, this becomes even worse:
> >
> > ~~~~~
> > ┌milian@milian-kdab2:/tmp
> > └$ cat $(which run_lttng_trace.sh)
> > #!/bin/sh
> >
> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >     sudo lttng-sessiond --daemonize
> > fi
> >
> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> > sudo lttng enable-event -c kernel -k -a
> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> > sudo lttng start
> > $@
> > sudo lttng stop
> >
> > sudo chmod a+rx -R ~/lttng-traces
> > ┌milian@milian-kdab2:/tmp
> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> > Session auto-20170905-162818 created.
> > Traces will be written in
> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel kernel
> > enabled for session auto-20170905-162818
> > All Kernel events are enabled in channel kernel
> > UST channel ust enabled for session auto-20170905-162818
> > UST event lttng_ust_tracef:* created in channel ust
> > Tracing started for session auto-20170905-162818
> >
> > Performance counter stats for './a.out' (5 runs):
> >
> >        4.937820      task-clock (msec)    # 0.005 CPUs utilized     ( +-  1.28% )
> >             117      context-switches     # 0.024 M/sec             ( +-  0.70% )
> >               3      cpu-migrations       # 0.608 K/sec             ( +- 23.57% )
> >             245      page-faults          # 0.050 M/sec             ( +-  0.20% )
> >      16,541,355      cycles               # 3.350 GHz               ( +-  0.94% )
> >      20,611,637      instructions         # 1.25  insn per cycle    ( +-  0.23% )
> >       3,580,525      branches             # 725.123 M/sec           ( +-  0.19% )
> >          52,093      branch-misses        # 1.45% of all branches   ( +-  1.73% )
> >
> >     0.965545882 seconds time elapsed                                ( +- 16.96% )
> >
> > Waiting for data availability...
> > Tracing stopped for session auto-20170905-162818
> > ~~~~~
> >
> > I see this ~1s delay during startup in nearly all test applications where I
> > wanted to use lttng-ust. In tracecompass, and according to perf sleep time
> > profiling, this seems to be mostly due to extensive sleeping on mutex locks
> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
> > taking a lot of time:
> >
> > ~~~~~
> > # without a running lttng session:
> > ┌milian@milian-kdab2:/tmp
> > └$ perf trace --duration 1 ./a.out
> >
> >     6.492 ( 52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg: 0x7fbe2fbb1070) = 1
> >     5.077 ( 54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> >    59.598 ( 79.379 ms): a.out/23671 membarrier(cmd: 1) = 0
> >   138.984 ( 39.945 ms): a.out/23671 membarrier(cmd: 1) = 0
> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|PRIV, val: 2) = 0
> >   179.045 ( 79.918 ms): a.out/23671 membarrier(cmd: 1) = 0
> >   258.971 ( 39.997 ms): a.out/23671 membarrier(cmd: 1) = 0
> >   299.052 ( 79.883 ms): a.out/23671 membarrier(cmd: 1) = 0
> >   378.942 ( 59.988 ms): a.out/23671 membarrier(cmd: 1) = 0
> >   439.022 ( 69.908 ms): a.out/23671 membarrier(cmd: 1) = 0
> >   508.937 ( 80.027 ms): a.out/23671 membarrier(cmd: 1) = 0
> >
> > # with a running lttng session:
> > ┌milian@milian-kdab2:/tmp
> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> >
> >     6.666 ( 392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 0x7fcd81d8a0b0) = 612
> >   400.238 (  68.452 ms): a.out-ust/23790 membarrier(cmd: 1) = 0
> >   468.698 (  39.983 ms): a.out-ust/23790 membarrier(cmd: 1) = 0
> >   510.567 (  98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 0x7fcd81d8a070) = 1
> >     6.467 ( 603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> >   610.020 (   6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|PRIV, val: 2) = 0
> >   616.446 (  62.265 ms): a.out/23789 membarrier(cmd: 1) = 0
> >   678.721 (  59.916 ms): a.out/23789 membarrier(cmd: 1) = 0
> >   616.332 ( 122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op: WAIT|PRIV, val: 2) = 0
> >   738.976 (  49.658 ms): a.out/23789 membarrier(cmd: 1) = 0
> >   788.642 (  49.987 ms): a.out/23789 membarrier(cmd: 1) = 0
> >   739.014 (  99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|PRIV, val: 2) = 0
> >   838.651 (  49.977 ms): a.out/23789 membarrier(cmd: 1) = 0
> >   888.636 (  69.983 ms): a.out/23789 membarrier(cmd: 1) = 0
> >   838.662 ( 119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|PRIV, val: 2) = 0
> >   958.739 (  59.873 ms): a.out/23789 membarrier(cmd: 1) = 0
> >  1018.621 (  80.058 ms): a.out/23789 membarrier(cmd: 1) = 0
> >  1098.782 (  49.844 ms): a.out/23789 membarrier(cmd: 1) = 0
> >  1148.634 (  59.998 ms): a.out/23789 membarrier(cmd: 1) = 0
> >  1208.731 (  69.940 ms): a.out/23789 membarrier(cmd: 1) = 0
> >  1278.680 (  79.947 ms): a.out/23789 membarrier(cmd: 1) = 0
> >  1358.727 (  39.881 ms): a.out/23789 membarrier(cmd: 1) = 0
> >  1398.616 (  39.980 ms): a.out/23789 membarrier(cmd: 1) = 0
> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> > ~~~~~
> >
> > This makes lttng-ust pretty much unusable for any short-lived application.
> > Is this a known limitation?
> >
> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
> > seen this problem also on Ubuntu 17.04 though I didn't do any of the above
> > measurements there. So far, I suspected the ust statedump to take too much
> > time, but I disabled that one for the measurements above.
> >
> > Any help would be appreciated
>
> Here's the perf sleep record data when no session is configured:
>
> ~~~~~
> 100% = 848.063336375 ms
>
> # Overhead Pid:Command Shared Object Symbol Source:Line
> # ........ ................. ................ ..............
> ........... #
> 76.50% 10336:a.out-ust [kernel.vmlinux] [k] __schedule core.c:0
>
> ---__schedule core.c:0
> schedule
> futex_wait_queue_me futex.c:0
> futex_wait futex.c:0
> do_futex
> sys_futex
> do_syscall_64
> return_from_SYSCALL_64 .tmp_entry_64.o:0
> syscall
> ust_listener_thread futex.h:65
> /usr/include/urcu/futex.h:65 (inline)
> /usr/include/urcu/futex.h:97 (inline)
> /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1311 (inline)
> /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1373 (inline)
>
> 15.79% 10335:a.out-ust [kernel.vmlinux] [k] __schedule core.c:0
>
> ---__schedule core.c:0
> schedule
>
> --15.70%--futex_wait_queue_me futex.c:0
> futex_wait futex.c:0
> do_futex
> sys_futex
> do_syscall_64
> return_from_SYSCALL_64 .tmp_entry_64.o:0
> __lll_lock_wait
> pthread_mutex_lock
>
> --15.69%--0x2338
> 0x3b29
> __nptl_deallocate_tsd.part.4 pthread_create.c:0
> start_thread pthread_create.c:0
> __clone
>
> 7.72% 10334:a.out [kernel.vmlinux] [k] __schedule core.c:0
>
> ---__schedule core.c:0
> schedule
> futex_wait_queue_me futex.c:0
> futex_wait futex.c:0
> do_futex
> sys_futex
> do_syscall_64
> return_from_SYSCALL_64 .tmp_entry_64.o:0
> do_futex_wait sem_timedwait.c:0
> __new_sem_wait_slow sem_timedwait.c:0
> lttng_ust_init.part.3 lttng-ust-comm.c:1757
> call_init.part.0 dl-init.c:0
> _dl_init
> _dl_start_user
>
> ~~~~~
>
> Here's the perf sleep record data for an existing session:
>
> ~~~~~
> 100% = 3.0793293265 s
>
> # Overhead Pid:Command Shared Object Symbol Source:Line
> # ........ ................. ................ ..............
> ........... #
> 52.10% 10762:a.out-ust [kernel.vmlinux] [k] __schedule core.c:0
>
> ---__schedule core.c:0
> schedule
> futex_wait_queue_me futex.c:0
> futex_wait futex.c:0
> do_futex
> sys_futex
> do_syscall_64
> return_from_SYSCALL_64 .tmp_entry_64.o:0
> syscall
> ust_listener_thread futex.h:65
> /usr/include/urcu/futex.h:65 (inline)
> /usr/include/urcu/futex.h:97 (inline)
> /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1311 (inline)
> /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1373 (inline)
>
> 27.09% 10761:a.out-ust [kernel.vmlinux] [k] __schedule core.c:0
>
> ---__schedule core.c:0
> schedule
>
> |--13.69%--schedule_timeout
> |
> | unix_stream_read_generic af_unix.c:0
> | unix_stream_recvmsg af_unix.c:0
> | sock_recvmsg
> | ___sys_recvmsg socket.c:0
> | __sys_recvmsg
> | sys_recvmsg
> | do_syscall_64
> | return_from_SYSCALL_64 .tmp_entry_64.o:0
> | __libc_recvmsg
> | ustcomm_recv_unix_sock lttng-ust-comm.c:307
>
> --13.40%--futex_wait_queue_me futex.c:0
> futex_wait futex.c:0
> do_futex
> sys_futex
> do_syscall_64
> return_from_SYSCALL_64 .tmp_entry_64.o:0
> __lll_lock_wait
> pthread_mutex_lock
>
> |--8.41%--0x2338
> |
> | 0x3b29
> | __nptl_deallocate_tsd.part.4 pthread_create.c:0
>
> | start_thread pthread_create.c:0
> | __clone
>
> --4.99%--ust_lock_nocheck lttng-ust-comm.c:178
>
> 20.81% 10760:a.out [kernel.vmlinux] [k] __schedule core.c:0
>
> ---__schedule core.c:0
> schedule
>
> --20.81%--futex_wait_queue_me futex.c:0
> futex_wait futex.c:0
> do_futex
> sys_futex
> do_syscall_64
> return_from_SYSCALL_64 .tmp_entry_64.o:0
>
> --20.77%--do_futex_wait sem_timedwait.c:0
> __new_sem_wait_slow sem_timedwait.c:0
> lttng_ust_init.part.3 lttng-ust-comm.c:1757
> call_init.part.0 dl-init.c:0
> _dl_init
> _dl_start_user
>
> ~~~~~
>
> Hope that helps
One thing that is apparent from the above, and can also be made visible by
adding some output to the application before it quits: it's not the main
thread that is blocked initially. Rather, the main thread spins up the
secondary -ust threads, runs its code, and finishes before the -ust threads
manage to initialize. Before the main thread then quits, it waits for the ust
threads to finish.
--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] ` <3731578.YsMjlWojEO@milian-kdab2>
@ 2017-09-05 18:11 ` Mathieu Desnoyers
[not found] ` <709369421.3343.1504635118648.JavaMail.zimbra@efficios.com>
1 sibling, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2017-09-05 18:11 UTC (permalink / raw)
To: Milian Wolff; +Cc: lttng-dev
----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:
> On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
>> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com wrote:
>> > Hey all,
>> >
>> > I have noticed a very large overhead when linking against liblttng-ust:
>> >
>> > [... quoted benchmark output snipped; see the original message above ...]
>> >
>> > This makes lttng-ust pretty much unusable for any short-lived application.
>> > Is this a known limitation?
>>
>> This kind of delay is not completely unexpected, but a bit higher than I
>> would expect. I'm interested in figuring out if most of this delay happens
>> to be caused by sys_membarrier, either directly, or indirectly (through a
>> lock).
>
> See my other email which adds some harder numbers from a perf sleep record.
>
>> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
>> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
>> > above measurements there. So far, I suspected the ust statedump to take
>> > too much time, but I disabled that one for the measurements above.
>> >
>> > Any help would be appreciated
>>
>> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
>> share the performance results ?
>
> I'd have to recompile the kernel, which is something I don't like to do. Why
> is this required, or put differently - why would the kernel option influence
> the userspace LTTng library?
liburcu 0.9.x and 0.10.x (dependencies of lttng-ust) use the membarrier system
call to speed up tracing, at the expense of slower tracing state updates.
I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
(single-process) use cases, which should execute much faster than the
non-expedited SHARED command implemented today.
liburcu 0.9 and 0.10 detect whether this system call is available, and use
it when they can.
Thanks,
Mathieu
>
> Thanks
>
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] ` <5963588.vVBsjV779J@milian-kdab2>
@ 2017-09-05 18:20 ` Mathieu Desnoyers
[not found] ` <859976545.3393.1504635610385.JavaMail.zimbra@efficios.com>
1 sibling, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2017-09-05 18:20 UTC (permalink / raw)
To: Milian Wolff; +Cc: lttng-dev
----- On Sep 5, 2017, at 11:10 AM, Milian Wolff milian.wolff@kdab.com wrote:
> On Tuesday, September 5, 2017 4:52:42 PM CEST Milian Wolff wrote:
>> On Tuesday, September 5, 2017 4:34:54 PM CEST Milian Wolff wrote:
>> > Hey all,
>> >
>> > I have noticed a very large overhead when linking against liblttng-ust:
>> >
>> > [... quoted benchmark output snipped; see the original message above ...]
>> >
>> > This makes lttng-ust pretty much unusable for any short-lived application.
>> > Is this a known limitation?
>> >
>> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
>> > seen this problem also on Ubuntu 17.04 though I didn't do any of the above
>> > measurements there. So far, I suspected the ust statedump to take too much
>> > time, but I disabled that one for the measurements above.
>> >
>> > Any help would be appreciated
>>
>> [... quoted perf sleep record data snipped; see the earlier message ...]
>> Hope that helps
>
> One thing that is apparent from the above, and can also be made visible by
> adding some visible output to the application before quitting it: It's not the
> main thread that's blocked initially. Rather, the main thread spins up the
> secondary -ust threads, then runs its code and finishes before the -ust
> threads manage to initialize. Before the main thread then quits, it waits for
> the ust threads to finish.
Those threads appear to be talking to the session daemon (on recvmsg). Please
note that there are two cases there:
- If you use per-pid buffers (explicitly at channel creation), the consumer
  daemon needs to allocate buffers on behalf of each and every new process
  that appears while tracing, which takes a significant amount of time,
- If you use default buffers (per-uid), only the first process that registers
for a specific UID will require that the consumer daemon allocates buffers.
After that, the same buffers will be sent to other applications registering
for the same user ID.
Clearly, per-pid buffers incur a significant slowdown at application start. This
is one reason why those are not used by default.
If you benchmark per-uid buffers, you will want to benchmark these two scenarios:
1) startup time of the first process that registers for a UID,
2) startup time of subsequent process registrations.
Technically, the buffer allocation delay only impacts the first process that
registers for a UID, and not all of them.
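The two schemes can be selected explicitly at channel creation. A minimal
sketch, assuming the same session setup as in your script (the channel names
are placeholders; `--buffers-uid` is already the default for user-space
channels):

```shell
# Per-UID buffers (the default): the consumer daemon allocates buffers once
# per user ID, so only the first registering process pays that cost.
lttng enable-channel ust -u --buffers-uid --subbuf-size 16M --num-subbuf 8

# Per-PID buffers: buffers are allocated for each and every new process
# that appears while tracing, slowing down every application start.
lttng enable-channel ust-pid -u --buffers-pid --subbuf-size 16M --num-subbuf 8
```

To cover the two per-UID scenarios, time the very first instrumented process
after `lttng start` separately from later ones while the session stays active.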
Thanks,
Mathieu
>
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] ` <709369421.3343.1504635118648.JavaMail.zimbra@efficios.com>
@ 2017-09-06 7:35 ` Milian Wolff
[not found] ` <3211887.IErR8xQyUx@agathebauer>
1 sibling, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-06 7:35 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: lttng-dev
On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:
> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com
wrote:
> >> > Hey all,
> >> >
> >> > I have noticed a very large overhead when linking against liblttng-ust:
> >> >
> >> > ~~~~~
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ cat lttng-test.c
> >> > int main()
> >> > {
> >> >
> >> > return 0;
> >> >
> >> > }
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ gcc -O2 -g -ldl lttng-test.c
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ perf stat -r 5 ./a.out
> >> >
> >> > Performance counter stats for './a.out' (5 runs):
> >> > 0.209587 task-clock (msec) # 0.596 CPUs
> >> > utilized
> >> >
> >> > ( +- 8.76% )
> >> >
> >> > 0 context-switches # 0.000 K/sec
> >> > 0 cpu-migrations # 0.000 K/sec
> >> >
> >> > 49 page-faults # 0.235 M/sec
> >> >
> >> > ( +- 1.19% )
> >> >
> >> > 706,854 cycles # 3.373 GHz
> >> >
> >> > ( +- 8.82% )
> >> >
> >> > 773,603 instructions # 1.09 insn per
> >> > cycle
> >> >
> >> > ( +- 0.75% )
> >> >
> >> > 147,128 branches # 701.987 M/sec
> >> >
> >> > ( +- 0.70% )
> >> >
> >> > 4,096 branch-misses # 2.78% of all
> >> > branches
> >> >
> >> > ( +- 5.27% )
> >> >
> >> > 0.000351422 seconds time elapsed
> >> >
> >> > ( +- 11.85% )
> >> >
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ perf stat -r 5 ./a.out
> >> >
> >> > Performance counter stats for './a.out' (5 runs):
> >> > 2.063040 task-clock (msec) # 0.009 CPUs
> >> > utilized
> >> >
> >> > ( +- 1.37% )
> >> >
> >> > 44 context-switches # 0.021 M/sec
> >> >
> >> > ( +- 1.95% )
> >> >
> >> > 2 cpu-migrations # 0.776 K/sec
> >> >
> >> > ( +- 25.00% )
> >> >
> >> > 209 page-faults # 0.101 M/sec
> >> >
> >> > ( +- 0.34% )
> >> >
> >> > 7,053,686 cycles # 3.419 GHz
> >> >
> >> > ( +- 2.03% )
> >> >
> >> > 6,893,783 instructions # 0.98 insn per
> >> > cycle
> >> >
> >> > ( +- 0.25% )
> >> >
> >> > 1,342,492 branches # 650.735 M/sec
> >> >
> >> > ( +- 0.20% )
> >> >
> >> > 29,390 branch-misses # 2.19% of all
> >> > branches
> >> >
> >> > ( +- 0.61% )
> >> >
> >> > 0.225597302 seconds time elapsed
> >> >
> >> > ( +- 6.68% )
> >> > ~~~~~
> >> >
> >> > This is without any LTTng session configured. If I enable LTTng kernel
> >> > and
> >> > userspace events, this becomes even worse:
> >> >
> >> > ~~~~~
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ cat $(which run_lttng_trace.sh)
> >> > #!/bin/sh
> >> >
> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >> >
> >> > sudo lttng-sessiond --daemonize
> >> >
> >> > fi
> >> >
> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >> > sudo lttng enable-event -c kernel -k -a
> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >> > sudo lttng start
> >> > $@
> >> > sudo lttng stop
> >> >
> >> > sudo chmod a+rx -R ~/lttng-traces
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> >> > Session auto-20170905-162818 created.
> >> > Traces will be written in
> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
> >> > kernel
> >> > enabled for session auto-20170905-162818
> >> > All Kernel events are enabled in channel kernel
> >> > UST channel ust enabled for session auto-20170905-162818
> >> > UST event lttng_ust_tracef:* created in channel ust
> >> > Tracing started for session auto-20170905-162818
> >> >
> >> > Performance counter stats for './a.out' (5 runs):
> >> > 4.937820 task-clock (msec) # 0.005 CPUs
> >> > utilized
> >> >
> >> > ( +- 1.28% )
> >> >
> >> > 117 context-switches # 0.024 M/sec
> >> >
> >> > ( +- 0.70% )
> >> >
> >> > 3 cpu-migrations # 0.608 K/sec
> >> >
> >> > ( +- 23.57% )
> >> >
> >> > 245 page-faults # 0.050 M/sec
> >> >
> >> > ( +- 0.20% )
> >> >
> >> > 16,541,355 cycles # 3.350 GHz
> >> >
> >> > ( +- 0.94% )
> >> >
> >> > 20,611,637 instructions # 1.25 insn per
> >> > cycle
> >> >
> >> > ( +- 0.23% )
> >> >
> >> > 3,580,525 branches # 725.123 M/sec
> >> >
> >> > ( +- 0.19% )
> >> >
> >> > 52,093 branch-misses # 1.45% of all
> >> > branches
> >> >
> >> > ( +- 1.73% )
> >> >
> >> > 0.965545882 seconds time elapsed
> >> >
> >> > ( +- 16.96% )
> >> >
> >> > Waiting for data availability...
> >> > Tracing stopped for session auto-20170905-162818
> >> > ~~~~~
> >> >
> >> > I see this ~1s delay during startup in nearly all test applications
> >> > where
> >> > I
> >> > wanted to use lttng-ust. In tracecompass, and according to perf sleep
> >> > time
> >> > profiling, this seems to be mostly due to extensive sleeping on mutex
> >> > locks
> >> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
> >> > taking a lot of time:
> >> >
> >> > ~~~~~
> >> > # without a running lttng session:
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ perf trace --duration 1 ./a.out
> >> >
> >> > 6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
msg:
> >> > 0x7fbe2fbb1070 ) = 1
> >> >
> >> > 5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> >> >
> >> > 59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
> >> > WAIT|
> >> >
> >> > PRIV, val: 2 ) = 0
> >> >
> >> > 179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > # with a running lttng session:
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> >> >
> >> > 6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
> >> > 3<socket:[1185362]>,
> >
> > msg:
> >> > 0x7fcd81d8a0b0 ) = 612
> >> >
> >> > 400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
> >
> > msg:
> >> > 0x7fcd81d8a070 ) = 1
> >> >
> >> > 6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> >> >
> >> > 610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
> >> > WAIT|
> >> >
> >> > PRIV, val: 2 ) = 0
> >> >
> >> > 616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
op:
> >> > WAIT|PRIV, val: 2 ) = 0
> >> >
> >> > 738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> >> > WAIT|
> >> >
> >> > PRIV, val: 2 ) = 0
> >> >
> >> > 838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
op:
> >> > WAIT|PRIV, val: 2 ) = 0
> >> >
> >> > 958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> >> >
> >> > ) = 0
> >> >
> >> > 6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> >> >
> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> >> > ~~~~~
> >> >
> >> > This makes lttng-ust pretty much unusable for any short-lived
> >> > application.
> >> > Is this a known limitation?
> >>
> >> This kind of delay is not completely unexpected, but a bit higher than I
> >> would expect. I'm interested in figuring out if most of this delay
> >> happens
> >> to be caused by sys_membarrier, either directly, or indirectly (through a
> >> lock).
> >
> > See my other email which adds some harder numbers from a perf sleep
> > record.
> >
> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
> >> > I've
> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
> >> > above measurements there. So far, I suspected the ust statedump to take
> >> > too much time, but I disabled that one for the measurements above.
> >> >
> >> > Any help would be appreciated
> >>
> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
> >> share the performance results ?
> >
> > I'd have to recompile the kernel, which is something I don't like to do.
> > Why is this required, or put differently - why would the kernel option
> > influence the userspace LTTng library?
>
> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
> system call to speed up tracing, at the expense of slower execution of
> tracing state update.
>
> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
> (single-process) use cases which should execute much faster than the
> non-expedited SHARED command currently implemented.
>
> liburcu 0.9 and 0.10 detects that this system call is available, and uses
> it when it can.
May I ask whether you have such a configuration available? The test case I
have submitted should be easy to reproduce for everyone. Compiling my own
kernel takes a huge amount of time, which I'm quite frankly not willing to
invest at this point.
I simply wonder whether anyone else is actually using lttng-ust, or if they
are, whether they are only doing that for long-lived applications where this
overhead is not noticeable?
Thanks
--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] ` <859976545.3393.1504635610385.JavaMail.zimbra@efficios.com>
@ 2017-09-06 7:42 ` Milian Wolff
0 siblings, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-06 7:42 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: lttng-dev
On Dienstag, 5. September 2017 20:20:10 CEST Mathieu Desnoyers wrote:
> ----- On Sep 5, 2017, at 11:10 AM, Milian Wolff milian.wolff@kdab.com wrote:
> > On Tuesday, September 5, 2017 4:52:42 PM CEST Milian Wolff wrote:
> >> On Tuesday, September 5, 2017 4:34:54 PM CEST Milian Wolff wrote:
> >> > Hey all,
> >> >
> >> > I have noticed a very large overhead when linking against liblttng-ust:
<snip>
> >> >
> >> > This is without any LTTng session configured. If I enable LTTng kernel
> >> > and
> >> > userspace events, this becomes even worse:
> >> >
> >> > ~~~~~
> >> > ┌milian@milian-kdab2:/tmp
> >> > └$ cat $(which run_lttng_trace.sh)
> >> > #!/bin/sh
> >> >
> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >> >
> >> > sudo lttng-sessiond --daemonize
> >> >
> >> > fi
> >> >
> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >> > sudo lttng enable-event -c kernel -k -a
> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >> > sudo lttng start
> >> > $@
> >> > sudo lttng stop
> >> >
> >> > sudo chmod a+rx -R ~/lttng-traces
<snip>
> > One thing that is apparent from the above, and can also be made visible by
> > adding some visible output to the application before quitting it: It's not
> > the main thread that's blocked initially. Rather, the main thread spins
> > up the secondary -ust threads, then runs its code and finishes before the
> > -ust threads manage to initialize. Before the main thread then quits, it
> > waits for the ust threads to finish.
>
> Those threads appear to be talking to the session daemon (on recvmsg).
> Please note that there are two cases there:
>
> - If you use per-pid buffers (explicitly at channel creation), the consumer
> daemon need to allocate buffers on behalf of each and every new process
> that appear while tracing, which takes a significant amount of time,
> - If you use default buffers (per-uid), only the first process that
> registers for a specific UID will require that the consumer daemon
> allocates buffers. After that, the same buffers will be sent to other
> applications registering for the same user ID.
>
> Clearly, per-pid buffers incur a significant slowdown at application start.
> This is one reason why those are not used by default.
>
> If you benchmark per-uid buffers, you will want to benchmark those two
> scenarios:
>
> 1) startup time of the first process that registers for a UID,
>
> 2) startup time of following process registrations
>
> Technically, the buffer allocation delay only impacts the first process that
> registers for a UID, and not all of them.
See my lttng commands above: I'm using the default buffer scheme, so if I
understand you correctly, that should be per-UID. The directory hierarchy in
the lttng trace folder confirms this:
~~~~~
┌milian@milian-kdab2:~/lttng-traces/2017-09-05T16:44:39+02:00
└$ find
.
./kernel
./kernel/kernel_6
./kernel/kernel_4
./kernel/kernel_2
./kernel/metadata
./kernel/kernel_1
./kernel/kernel_5
./kernel/index
./kernel/index/kernel_1.idx
./kernel/index/kernel_7.idx
./kernel/index/kernel_4.idx
./kernel/index/kernel_0.idx
./kernel/index/kernel_5.idx
./kernel/index/kernel_3.idx
./kernel/index/kernel_6.idx
./kernel/index/kernel_2.idx
./kernel/kernel_7
./kernel/kernel_0
./kernel/kernel_3
./ust
./ust/uid
./ust/uid/1000
./ust/uid/1000/64-bit
./ust/uid/1000/64-bit/ust_6
./ust/uid/1000/64-bit/ust_7
./ust/uid/1000/64-bit/ust_1
./ust/uid/1000/64-bit/metadata
./ust/uid/1000/64-bit/ust_0
./ust/uid/1000/64-bit/ust_3
./ust/uid/1000/64-bit/index
./ust/uid/1000/64-bit/index/ust_2.idx
./ust/uid/1000/64-bit/index/ust_3.idx
./ust/uid/1000/64-bit/index/ust_5.idx
./ust/uid/1000/64-bit/index/ust_6.idx
./ust/uid/1000/64-bit/index/ust_0.idx
./ust/uid/1000/64-bit/index/ust_4.idx
./ust/uid/1000/64-bit/index/ust_7.idx
./ust/uid/1000/64-bit/index/ust_1.idx
./ust/uid/1000/64-bit/ust_5
./ust/uid/1000/64-bit/ust_4
./ust/uid/1000/64-bit/ust_2
~~~~~
Note that my measurements somewhat contradict your assumptions, though: it is
not only the first task run while tracing that gets slowed down, it is all of
them. Have a look at the `perf stat` output I pasted:
0.965545882 seconds time elapsed ( +- 16.96% )
I can reproduce values within that variance by running the example code
multiple times and measuring each time individually, while keeping the lttng
session up and running.
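That measurement can be sketched as a simple loop (assuming the session from
`run_lttng_trace.sh` is still running and `./a.out` is the test binary from
the top of the thread):

```shell
#!/bin/sh
# Keep the LTTng session running, then time several individual startups.
# Under the per-UID model only the first run should pay the buffer
# allocation cost; similar times across all runs contradict that.
for i in 1 2 3 4 5; do
    perf stat ./a.out 2>&1 | grep 'seconds time elapsed'
done
```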
Bye
--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] ` <3211887.IErR8xQyUx@agathebauer>
@ 2017-09-06 19:57 ` Mathieu Desnoyers
[not found] ` <1506174693.4712.1504727831374.JavaMail.zimbra@efficios.com>
1 sibling, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2017-09-06 19:57 UTC (permalink / raw)
To: Milian Wolff; +Cc: lttng-dev
[-- Attachment #1: Type: text/plain, Size: 13899 bytes --]
----- On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wolff@kdab.com wrote:
> On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
>> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:
>> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
>> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com
> wrote:
>> >> > Hey all,
>> >> >
>> >> > I have noticed a very large overhead when linking against liblttng-ust:
>> >> >
>> >> > ~~~~~
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ cat lttng-test.c
>> >> > int main()
>> >> > {
>> >> >
>> >> > return 0;
>> >> >
>> >> > }
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ gcc -O2 -g -ldl lttng-test.c
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ perf stat -r 5 ./a.out
>> >> >
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> > 0.209587 task-clock (msec) # 0.596 CPUs
>> >> > utilized
>> >> >
>> >> > ( +- 8.76% )
>> >> >
>> >> > 0 context-switches # 0.000 K/sec
>> >> > 0 cpu-migrations # 0.000 K/sec
>> >> >
>> >> > 49 page-faults # 0.235 M/sec
>> >> >
>> >> > ( +- 1.19% )
>> >> >
>> >> > 706,854 cycles # 3.373 GHz
>> >> >
>> >> > ( +- 8.82% )
>> >> >
>> >> > 773,603 instructions # 1.09 insn per
>> >> > cycle
>> >> >
>> >> > ( +- 0.75% )
>> >> >
>> >> > 147,128 branches # 701.987 M/sec
>> >> >
>> >> > ( +- 0.70% )
>> >> >
>> >> > 4,096 branch-misses # 2.78% of all
>> >> > branches
>> >> >
>> >> > ( +- 5.27% )
>> >> >
>> >> > 0.000351422 seconds time elapsed
>> >> >
>> >> > ( +- 11.85% )
>> >> >
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ perf stat -r 5 ./a.out
>> >> >
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> > 2.063040 task-clock (msec) # 0.009 CPUs
>> >> > utilized
>> >> >
>> >> > ( +- 1.37% )
>> >> >
>> >> > 44 context-switches # 0.021 M/sec
>> >> >
>> >> > ( +- 1.95% )
>> >> >
>> >> > 2 cpu-migrations # 0.776 K/sec
>> >> >
>> >> > ( +- 25.00% )
>> >> >
>> >> > 209 page-faults # 0.101 M/sec
>> >> >
>> >> > ( +- 0.34% )
>> >> >
>> >> > 7,053,686 cycles # 3.419 GHz
>> >> >
>> >> > ( +- 2.03% )
>> >> >
>> >> > 6,893,783 instructions # 0.98 insn per
>> >> > cycle
>> >> >
>> >> > ( +- 0.25% )
>> >> >
>> >> > 1,342,492 branches # 650.735 M/sec
>> >> >
>> >> > ( +- 0.20% )
>> >> >
>> >> > 29,390 branch-misses # 2.19% of all
>> >> > branches
>> >> >
>> >> > ( +- 0.61% )
>> >> >
>> >> > 0.225597302 seconds time elapsed
>> >> >
>> >> > ( +- 6.68% )
>> >> > ~~~~~
>> >> >
>> >> > This is without any LTTng session configured. If I enable LTTng kernel
>> >> > and
>> >> > userspace events, this becomes even worse:
>> >> >
>> >> > ~~~~~
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ cat $(which run_lttng_trace.sh)
>> >> > #!/bin/sh
>> >> >
>> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
>> >> >
>> >> > sudo lttng-sessiond --daemonize
>> >> >
>> >> > fi
>> >> >
>> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
>> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
>> >> > sudo lttng enable-event -c kernel -k -a
>> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
>> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
>> >> > sudo lttng start
>> >> > $@
>> >> > sudo lttng stop
>> >> >
>> >> > sudo chmod a+rx -R ~/lttng-traces
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
>> >> > Session auto-20170905-162818 created.
>> >> > Traces will be written in
>> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
>> >> > kernel
>> >> > enabled for session auto-20170905-162818
>> >> > All Kernel events are enabled in channel kernel
>> >> > UST channel ust enabled for session auto-20170905-162818
>> >> > UST event lttng_ust_tracef:* created in channel ust
>> >> > Tracing started for session auto-20170905-162818
>> >> >
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> > 4.937820 task-clock (msec) # 0.005 CPUs
>> >> > utilized
>> >> >
>> >> > ( +- 1.28% )
>> >> >
>> >> > 117 context-switches # 0.024 M/sec
>> >> >
>> >> > ( +- 0.70% )
>> >> >
>> >> > 3 cpu-migrations # 0.608 K/sec
>> >> >
>> >> > ( +- 23.57% )
>> >> >
>> >> > 245 page-faults # 0.050 M/sec
>> >> >
>> >> > ( +- 0.20% )
>> >> >
>> >> > 16,541,355 cycles # 3.350 GHz
>> >> >
>> >> > ( +- 0.94% )
>> >> >
>> >> > 20,611,637 instructions # 1.25 insn per
>> >> > cycle
>> >> >
>> >> > ( +- 0.23% )
>> >> >
>> >> > 3,580,525 branches # 725.123 M/sec
>> >> >
>> >> > ( +- 0.19% )
>> >> >
>> >> > 52,093 branch-misses # 1.45% of all
>> >> > branches
>> >> >
>> >> > ( +- 1.73% )
>> >> >
>> >> > 0.965545882 seconds time elapsed
>> >> >
>> >> > ( +- 16.96% )
>> >> >
>> >> > Waiting for data availability...
>> >> > Tracing stopped for session auto-20170905-162818
>> >> > ~~~~~
>> >> >
>> >> > I see this ~1s delay during startup in nearly all test applications
>> >> > where
>> >> > I
>> >> > wanted to use lttng-ust. In tracecompass, and according to perf sleep
>> >> > time
>> >> > profiling, this seems to be mostly due to extensive sleeping on mutex
>> >> > locks
>> >> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
>> >> > taking a lot of time:
>> >> >
>> >> > ~~~~~
>> >> > # without a running lttng session:
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ perf trace --duration 1 ./a.out
>> >> >
>> >> > 6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
> msg:
>> >> > 0x7fbe2fbb1070 ) = 1
>> >> >
>> >> > 5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>> >> >
>> >> > 59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
>> >> > WAIT|
>> >> >
>> >> > PRIV, val: 2 ) = 0
>> >> >
>> >> > 179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > # with a running lttng session:
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>> >> >
>> >> > 6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
>> >> > 3<socket:[1185362]>,
>> >
>> > msg:
>> >> > 0x7fcd81d8a0b0 ) = 612
>> >> >
>> >> > 400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
>> >
>> > msg:
>> >> > 0x7fcd81d8a070 ) = 1
>> >> >
>> >> > 6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>> >> >
>> >> > 610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
>> >> > WAIT|
>> >> >
>> >> > PRIV, val: 2 ) = 0
>> >> >
>> >> > 616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
> op:
>> >> > WAIT|PRIV, val: 2 ) = 0
>> >> >
>> >> > 738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
>> >> > WAIT|
>> >> >
>> >> > PRIV, val: 2 ) = 0
>> >> >
>> >> > 838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> op:
>> >> > WAIT|PRIV, val: 2 ) = 0
>> >> >
>> >> > 958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
>> >> >
>> >> > ) = 0
>> >> >
>> >> > 6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
>> >> >
>> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
>> >> > ~~~~~
>> >> >
>> >> > This makes lttng-ust pretty much unusable for any short-lived
>> >> > application.
>> >> > Is this a known limitation?
>> >>
>> >> This kind of delay is not completely unexpected, but a bit higher than I
>> >> would expect. I'm interested in figuring out if most of this delay
>> >> happens
>> >> to be caused by sys_membarrier, either directly, or indirectly (through a
>> >> lock).
>> >
>> > See my other email which adds some harder numbers from a perf sleep
>> > record.
>> >
>> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
>> >> > I've
>> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
>> >> > above measurements there. So far, I suspected the ust statedump to take
>> >> > too much time, but I disabled that one for the measurements above.
>> >> >
>> >> > Any help would be appreciated
>> >>
>> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
>> >> share the performance results ?
>> >
>> > I'd have to recompile the kernel, which is something I don't like to do.
>> > Why is this required, or put differently - why would the kernel option
>> > influence the userspace LTTng library?
>>
>> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
>> system call to speed up tracing, at the expense of slower execution of
>> tracing state update.
>>
>> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
>> (single-process) use cases which should execute much faster than the
>> non-expedited SHARED command currently implemented.
>>
>> liburcu 0.9 and 0.10 detects that this system call is available, and uses
>> it when it can.
>
> May I ask whether you have such a configuration available? The test case I
> have submitted should be easy to reproduce for everyone. Compiling my own
> kernel version takes a huge amount of time which I'm quite frankly not willing
> to invest at this point.
We have reproduced the situation, and it's pointing towards our use of the
sys_membarrier system call. We'll try narrowing it down between its use by
lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).
You can try patching liburcu so it won't use sys_membarrier at all (see the
attached patch).
>
> I simply wonder whether anyone else is actually using lttng-ust, or if they
> are, whether they are only doing that for long-lived applications where this
> overhead is not noticeable?
It appears that this application startup slowdown mainly occurs when LTTng
is linked against liburcu 0.9 or 0.10 and runs on a rather recent kernel.
This would explain why you are the first one to report this across our user
base.
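To see whether a given system matches that configuration, one can check which
liburcu lttng-ust pulls in and the running kernel version (the library paths
below are assumptions; adjust for your distribution — sys_membarrier exists on
Linux >= 4.3):

```shell
# Which liburcu-bp does this system ship? (paths are assumptions)
for lib in /usr/lib/liburcu-bp.so.* /usr/lib/x86_64-linux-gnu/liburcu-bp.so.*; do
    [ -e "$lib" ] && echo "liburcu-bp: $lib"
done
# Kernel version; sys_membarrier was merged in Linux 4.3
echo "kernel: $(uname -r)"
```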
Thanks,
Mathieu
>
> Thanks
>
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
[-- Attachment #2: liburcu-no-sys-mb.patch --]
[-- Type: text/x-patch; name=liburcu-no-sys-mb.patch, Size: 799 bytes --]
diff --git a/src/urcu-bp.c b/src/urcu-bp.c
index c0fac83..2d8bcca 100644
--- a/src/urcu-bp.c
+++ b/src/urcu-bp.c
@@ -95,7 +95,7 @@ static
int rcu_bp_refcount;
/* If the headers do not support membarrier system call, fall back smp_mb. */
-#ifdef __NR_membarrier
+#if 0 //def __NR_membarrier
# define membarrier(...) syscall(__NR_membarrier, __VA_ARGS__)
#else
# define membarrier(...) -ENOSYS
diff --git a/src/urcu.c b/src/urcu.c
index a81b129..1aafbc0 100644
--- a/src/urcu.c
+++ b/src/urcu.c
@@ -64,7 +64,7 @@
#define RCU_QS_ACTIVE_ATTEMPTS 100
/* If the headers do not support membarrier system call, fall back on RCU_MB */
-#ifdef __NR_membarrier
+#if 0 //def __NR_membarrier
# define membarrier(...) syscall(__NR_membarrier, __VA_ARGS__)
#else
# define membarrier(...) -ENOSYS
[-- Attachment #3: Type: text/plain, Size: 156 bytes --]
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] ` <1506174693.4712.1504727831374.JavaMail.zimbra@efficios.com>
@ 2017-09-06 20:23 ` Mathieu Desnoyers
[not found] ` <643211580.4746.1504729420940.JavaMail.zimbra@efficios.com>
2017-09-07 7:50 ` Milian Wolff
2 siblings, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2017-09-06 20:23 UTC (permalink / raw)
To: Milian Wolff, Paul E. McKenney; +Cc: lttng-dev
----- On Sep 6, 2017, at 3:57 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
> ----- On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wolff@kdab.com wrote:
>
>> On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
>>> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:
>>> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
>>> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com
>> wrote:
>>> >> > Hey all,
>>> >> >
>>> >> > I have noticed a very large overhead when linking against liblttng-ust:
>>> >> >
>>> >> > ~~~~~
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ cat lttng-test.c
>>> >> > int main()
>>> >> > {
>>> >> >
>>> >> > return 0;
>>> >> >
>>> >> > }
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ gcc -O2 -g -ldl lttng-test.c
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ perf stat -r 5 ./a.out
>>> >> >
>>> >> > Performance counter stats for './a.out' (5 runs):
>>> >> > 0.209587 task-clock (msec) # 0.596 CPUs
>>> >> > utilized
>>> >> >
>>> >> > ( +- 8.76% )
>>> >> >
>>> >> > 0 context-switches # 0.000 K/sec
>>> >> > 0 cpu-migrations # 0.000 K/sec
>>> >> >
>>> >> > 49 page-faults # 0.235 M/sec
>>> >> >
>>> >> > ( +- 1.19% )
>>> >> >
>>> >> > 706,854 cycles # 3.373 GHz
>>> >> >
>>> >> > ( +- 8.82% )
>>> >> >
>>> >> > 773,603 instructions # 1.09 insn per
>>> >> > cycle
>>> >> >
>>> >> > ( +- 0.75% )
>>> >> >
>>> >> > 147,128 branches # 701.987 M/sec
>>> >> >
>>> >> > ( +- 0.70% )
>>> >> >
>>> >> > 4,096 branch-misses # 2.78% of all
>>> >> > branches
>>> >> >
>>> >> > ( +- 5.27% )
>>> >> >
>>> >> > 0.000351422 seconds time elapsed
>>> >> >
>>> >> > ( +- 11.85% )
>>> >> >
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ perf stat -r 5 ./a.out
>>> >> >
>>> >> > Performance counter stats for './a.out' (5 runs):
>>> >> > 2.063040 task-clock (msec) # 0.009 CPUs
>>> >> > utilized
>>> >> >
>>> >> > ( +- 1.37% )
>>> >> >
>>> >> > 44 context-switches # 0.021 M/sec
>>> >> >
>>> >> > ( +- 1.95% )
>>> >> >
>>> >> > 2 cpu-migrations # 0.776 K/sec
>>> >> >
>>> >> > ( +- 25.00% )
>>> >> >
>>> >> > 209 page-faults # 0.101 M/sec
>>> >> >
>>> >> > ( +- 0.34% )
>>> >> >
>>> >> > 7,053,686 cycles # 3.419 GHz
>>> >> >
>>> >> > ( +- 2.03% )
>>> >> >
>>> >> > 6,893,783 instructions # 0.98 insn per
>>> >> > cycle
>>> >> >
>>> >> > ( +- 0.25% )
>>> >> >
>>> >> > 1,342,492 branches # 650.735 M/sec
>>> >> >
>>> >> > ( +- 0.20% )
>>> >> >
>>> >> > 29,390 branch-misses # 2.19% of all
>>> >> > branches
>>> >> >
>>> >> > ( +- 0.61% )
>>> >> >
>>> >> > 0.225597302 seconds time elapsed
>>> >> >
>>> >> > ( +- 6.68% )
>>> >> > ~~~~~
>>> >> >
>>> >> > This is without any LTTng session configured. If I enable LTTng kernel
>>> >> > and
>>> >> > userspace events, this becomes even worse:
>>> >> >
>>> >> > ~~~~~
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ cat $(which run_lttng_trace.sh)
>>> >> > #!/bin/sh
>>> >> >
>>> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
>>> >> >
>>> >> > sudo lttng-sessiond --daemonize
>>> >> >
>>> >> > fi
>>> >> >
>>> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
>>> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
>>> >> > sudo lttng enable-event -c kernel -k -a
>>> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
>>> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
>>> >> > sudo lttng start
>>> >> > $@
>>> >> > sudo lttng stop
>>> >> >
>>> >> > sudo chmod a+rx -R ~/lttng-traces
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
>>> >> > Session auto-20170905-162818 created.
>>> >> > Traces will be written in
>>> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
>>> >> > kernel
>>> >> > enabled for session auto-20170905-162818
>>> >> > All Kernel events are enabled in channel kernel
>>> >> > UST channel ust enabled for session auto-20170905-162818
>>> >> > UST event lttng_ust_tracef:* created in channel ust
>>> >> > Tracing started for session auto-20170905-162818
>>> >> >
>>> >> > Performance counter stats for './a.out' (5 runs):
>>> >> > 4.937820 task-clock (msec) # 0.005 CPUs
>>> >> > utilized
>>> >> >
>>> >> > ( +- 1.28% )
>>> >> >
>>> >> > 117 context-switches # 0.024 M/sec
>>> >> >
>>> >> > ( +- 0.70% )
>>> >> >
>>> >> > 3 cpu-migrations # 0.608 K/sec
>>> >> >
>>> >> > ( +- 23.57% )
>>> >> >
>>> >> > 245 page-faults # 0.050 M/sec
>>> >> >
>>> >> > ( +- 0.20% )
>>> >> >
>>> >> > 16,541,355 cycles # 3.350 GHz
>>> >> >
>>> >> > ( +- 0.94% )
>>> >> >
>>> >> > 20,611,637 instructions # 1.25 insn per
>>> >> > cycle
>>> >> >
>>> >> > ( +- 0.23% )
>>> >> >
>>> >> > 3,580,525 branches # 725.123 M/sec
>>> >> >
>>> >> > ( +- 0.19% )
>>> >> >
>>> >> > 52,093 branch-misses # 1.45% of all
>>> >> > branches
>>> >> >
>>> >> > ( +- 1.73% )
>>> >> >
>>> >> > 0.965545882 seconds time elapsed
>>> >> >
>>> >> > ( +- 16.96% )
>>> >> >
>>> >> > Waiting for data availability...
>>> >> > Tracing stopped for session auto-20170905-162818
>>> >> > ~~~~~
>>> >> >
>>> >> > I see this ~1s delay during startup in nearly all test applications
>>> >> > where
>>> >> > I
>>> >> > wanted to use lttng-ust. In tracecompass, and according to perf sleep
>>> >> > time
>>> >> > profiling, this seems to be mostly due to extensive sleeping on mutex
>>> >> > locks
>>> >> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
>>> >> > taking a lot of time:
>>> >> >
>>> >> > ~~~~~
>>> >> > # without a running lttng session:
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ perf trace --duration 1 ./a.out
>>> >> >
>>> >> > 6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
>> msg:
>>> >> > 0x7fbe2fbb1070 ) = 1
>>> >> >
>>> >> > 5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
>>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>>> >> >
>>> >> > 59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
>>> >> > WAIT|
>>> >> >
>>> >> > PRIV, val: 2 ) = 0
>>> >> >
>>> >> > 179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > # with a running lttng session:
>>> >> > ┌milian@milian-kdab2:/tmp
>>> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>>> >> >
>>> >> > 6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
>>> >> > 3<socket:[1185362]>,
>>> >
>>> > msg:
>>> >> > 0x7fcd81d8a0b0 ) = 612
>>> >> >
>>> >> > 400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
>>> >
>>> > msg:
>>> >> > 0x7fcd81d8a070 ) = 1
>>> >> >
>>> >> > 6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
>>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>>> >> >
>>> >> > 610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
>>> >> > WAIT|
>>> >> >
>>> >> > PRIV, val: 2 ) = 0
>>> >> >
>>> >> > 616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
>> op:
>>> >> > WAIT|PRIV, val: 2 ) = 0
>>> >> >
>>> >> > 738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
>>> >> > WAIT|
>>> >> >
>>> >> > PRIV, val: 2 ) = 0
>>> >> >
>>> >> > 838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
>> op:
>>> >> > WAIT|PRIV, val: 2 ) = 0
>>> >> >
>>> >> > 958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
>>> >> >
>>> >> > ) = 0
>>> >> >
>>> >> > 6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
>>> >> >
>>> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
>>> >> > ~~~~~
>>> >> >
>>> >> > This makes lttng-ust pretty much unusable for any short-lived
>>> >> > application.
>>> >> > Is this a known limitation?
>>> >>
>>> >> This kind of delay is not completely unexpected, but a bit higher than I
>>> >> would expect. I'm interested in figuring out if most of this delay
>>> >> happens
>>> >> to be caused by sys_membarrier, either directly, or indirectly (through a
>>> >> lock).
>>> >
>>> > See my other email which adds some harder numbers from a perf sleep
>>> > record.
>>> >
>>> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
>>> >> > I've
>>> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
>>> >> > above measurements there. So far, I suspected the ust statedump to take
>>> >> > too much time, but I disabled that one for the measurements above.
>>> >> >
>>> >> > Any help would be appreciated
>>> >>
>>> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
>>> >> share the performance results ?
>>> >
>>> > I'd have to recompile the kernel, which is something I don't like to do.
>>> > Why is this required, or put differently - why would the kernel option
>>> > influence the userspace LTTng library?
>>>
>>> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
>>> system call to speed up tracing, at the expense of slower execution of
>>> tracing state update.
>>>
>>> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
>>> (single-process) use cases which should execute much faster than the
>>> non-expedited SHARED command currently implemented.
>>>
>>> liburcu 0.9 and 0.10 detects that this system call is available, and uses
>>> it when it can.
>>
>> May I ask whether you have such a configuration available? The test case I
>> have submitted should be easy to reproduce for everyone. Compiling my own
>> kernel version takes a huge amount of time which I'm quite frankly not willing
>> to invest at this point.
>
> We have reproduced the situation, and it's pointing towards our use of the
> sys_membarrier system call. We'll try narrowing it down between its use by
> lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).
>
> You can try to patch liburcu so it won't try using sys_membarrier at all
> (see attached patch)
>
>>
>> I simply wonder whether anyone else is actually using lttng-ust, or if they
>> are, whether they are only doing that for long-lived applications where this
>> overhead is not noticeable?
>
> It appears that this application startup slowdown mainly appears with
> LTTng linked against liburcu 0.9 and 0.10 running with rather recent kernels.
> This would explain why you are the first one to report this across our user
> base.
We have confirmed that it's the use of sys_membarrier SHARED by liburcu-bp,
used by lttng-ust, that causes those delays.
lttng-ust's usage pattern is to call synchronize_rcu() directly. A
membarrier system call on each of those invocations adds up to significant
delays, because membarrier SHARED blocks until a scheduler switch happens on
each core.
We are invoking synchronize_rcu() directly rather than using the call_rcu
batching facilities because we want to minimize the number of threads we
have to insert into the application.
I'm currently proposing a PRIVATE_EXPEDITED command for the membarrier
system call in Linux. It will actively send IPIs rather than blocking.
I'm going to push a fix to liburcu master, 0.10, and 0.9 that will make it
stop using sys_membarrier in liburcu-bp. Then, after the new PRIVATE_EXPEDITED
command makes its way into the upstream Linux kernel, a future liburcu
(0.11 or later) will introduce usage of that new command, given that it has
a much, much smaller impact than the SHARED command.
The reason why we want sys_membarrier there is to speed up tracing. We
can trade memory barriers for simple compiler barriers on the tracing
fast-path by using sys_membarrier on the update side.
Thanks for reporting your finding!
Mathieu
>
> Thanks,
>
> Mathieu
>
>
>>
>> Thanks
>>
>> --
>> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
>> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
>> Tel: +49-30-521325470
>> KDAB - The Qt Experts
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] ` <643211580.4746.1504729420940.JavaMail.zimbra@efficios.com>
@ 2017-09-06 21:22 ` Paul E. McKenney
0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2017-09-06 21:22 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: lttng-dev
BTW, your expedited commit hit mainline earlier this week. Here is
hoping! ;-)
Thanx, Paul
On Wed, Sep 06, 2017 at 08:23:40PM +0000, Mathieu Desnoyers wrote:
> ----- On Sep 6, 2017, at 3:57 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
>
> > ----- On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wolff@kdab.com wrote:
> >
> >> On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
> >>> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:
> >>> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> >>> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com
> >> wrote:
> >>> >> > Hey all,
> >>> >> >
> >>> >> > I have noticed a very large overhead when linking against liblttng-ust:
> >>> >> >
> >>> >> > ~~~~~
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ cat lttng-test.c
> >>> >> > int main()
> >>> >> > {
> >>> >> >
> >>> >> > return 0;
> >>> >> >
> >>> >> > }
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ gcc -O2 -g -ldl lttng-test.c
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ perf stat -r 5 ./a.out
> >>> >> >
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> > 0.209587 task-clock (msec) # 0.596 CPUs
> >>> >> > utilized
> >>> >> >
> >>> >> > ( +- 8.76% )
> >>> >> >
> >>> >> > 0 context-switches # 0.000 K/sec
> >>> >> > 0 cpu-migrations # 0.000 K/sec
> >>> >> >
> >>> >> > 49 page-faults # 0.235 M/sec
> >>> >> >
> >>> >> > ( +- 1.19% )
> >>> >> >
> >>> >> > 706,854 cycles # 3.373 GHz
> >>> >> >
> >>> >> > ( +- 8.82% )
> >>> >> >
> >>> >> > 773,603 instructions # 1.09 insn per
> >>> >> > cycle
> >>> >> >
> >>> >> > ( +- 0.75% )
> >>> >> >
> >>> >> > 147,128 branches # 701.987 M/sec
> >>> >> >
> >>> >> > ( +- 0.70% )
> >>> >> >
> >>> >> > 4,096 branch-misses # 2.78% of all
> >>> >> > branches
> >>> >> >
> >>> >> > ( +- 5.27% )
> >>> >> >
> >>> >> > 0.000351422 seconds time elapsed
> >>> >> >
> >>> >> > ( +- 11.85% )
> >>> >> >
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ perf stat -r 5 ./a.out
> >>> >> >
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> > 2.063040 task-clock (msec) # 0.009 CPUs
> >>> >> > utilized
> >>> >> >
> >>> >> > ( +- 1.37% )
> >>> >> >
> >>> >> > 44 context-switches # 0.021 M/sec
> >>> >> >
> >>> >> > ( +- 1.95% )
> >>> >> >
> >>> >> > 2 cpu-migrations # 0.776 K/sec
> >>> >> >
> >>> >> > ( +- 25.00% )
> >>> >> >
> >>> >> > 209 page-faults # 0.101 M/sec
> >>> >> >
> >>> >> > ( +- 0.34% )
> >>> >> >
> >>> >> > 7,053,686 cycles # 3.419 GHz
> >>> >> >
> >>> >> > ( +- 2.03% )
> >>> >> >
> >>> >> > 6,893,783 instructions # 0.98 insn per
> >>> >> > cycle
> >>> >> >
> >>> >> > ( +- 0.25% )
> >>> >> >
> >>> >> > 1,342,492 branches # 650.735 M/sec
> >>> >> >
> >>> >> > ( +- 0.20% )
> >>> >> >
> >>> >> > 29,390 branch-misses # 2.19% of all
> >>> >> > branches
> >>> >> >
> >>> >> > ( +- 0.61% )
> >>> >> >
> >>> >> > 0.225597302 seconds time elapsed
> >>> >> >
> >>> >> > ( +- 6.68% )
> >>> >> > ~~~~~
> >>> >> >
> >>> >> > This is without any LTTng session configured. If I enable LTTng kernel
> >>> >> > and
> >>> >> > userspace events, this becomes even worse:
> >>> >> >
> >>> >> > ~~~~~
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ cat $(which run_lttng_trace.sh)
> >>> >> > #!/bin/sh
> >>> >> >
> >>> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >>> >> >
> >>> >> > sudo lttng-sessiond --daemonize
> >>> >> >
> >>> >> > fi
> >>> >> >
> >>> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >>> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >>> >> > sudo lttng enable-event -c kernel -k -a
> >>> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >>> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >>> >> > sudo lttng start
> >>> >> > $@
> >>> >> > sudo lttng stop
> >>> >> >
> >>> >> > sudo chmod a+rx -R ~/lttng-traces
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> >>> >> > Session auto-20170905-162818 created.
> >>> >> > Traces will be written in
> >>> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
> >>> >> > kernel
> >>> >> > enabled for session auto-20170905-162818
> >>> >> > All Kernel events are enabled in channel kernel
> >>> >> > UST channel ust enabled for session auto-20170905-162818
> >>> >> > UST event lttng_ust_tracef:* created in channel ust
> >>> >> > Tracing started for session auto-20170905-162818
> >>> >> >
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> > 4.937820 task-clock (msec) # 0.005 CPUs
> >>> >> > utilized
> >>> >> >
> >>> >> > ( +- 1.28% )
> >>> >> >
> >>> >> > 117 context-switches # 0.024 M/sec
> >>> >> >
> >>> >> > ( +- 0.70% )
> >>> >> >
> >>> >> > 3 cpu-migrations # 0.608 K/sec
> >>> >> >
> >>> >> > ( +- 23.57% )
> >>> >> >
> >>> >> > 245 page-faults # 0.050 M/sec
> >>> >> >
> >>> >> > ( +- 0.20% )
> >>> >> >
> >>> >> > 16,541,355 cycles # 3.350 GHz
> >>> >> >
> >>> >> > ( +- 0.94% )
> >>> >> >
> >>> >> > 20,611,637 instructions # 1.25 insn per
> >>> >> > cycle
> >>> >> >
> >>> >> > ( +- 0.23% )
> >>> >> >
> >>> >> > 3,580,525 branches # 725.123 M/sec
> >>> >> >
> >>> >> > ( +- 0.19% )
> >>> >> >
> >>> >> > 52,093 branch-misses # 1.45% of all
> >>> >> > branches
> >>> >> >
> >>> >> > ( +- 1.73% )
> >>> >> >
> >>> >> > 0.965545882 seconds time elapsed
> >>> >> >
> >>> >> > ( +- 16.96% )
> >>> >> >
> >>> >> > Waiting for data availability...
> >>> >> > Tracing stopped for session auto-20170905-162818
> >>> >> > ~~~~~
> >>> >> >
> >>> >> > I see this ~1s delay during startup in nearly all test applications
> >>> >> > where
> >>> >> > I
> >>> >> > wanted to use lttng-ust. In tracecompass, and according to perf sleep
> >>> >> > time
> >>> >> > profiling, this seems to be mostly due to extensive sleeping on mutex
> >>> >> > locks
> >>> >> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
> >>> >> > taking a lot of time:
> >>> >> >
> >>> >> > ~~~~~
> >>> >> > # without a running lttng session:
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ perf trace --duration 1 ./a.out
> >>> >> >
> >>> >> > 6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
> >> msg:
> >>> >> > 0x7fbe2fbb1070 ) = 1
> >>> >> >
> >>> >> > 5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> >>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> >>> >> >
> >>> >> > 59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
> >>> >> > WAIT|
> >>> >> >
> >>> >> > PRIV, val: 2 ) = 0
> >>> >> >
> >>> >> > 179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > # with a running lttng session:
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> >>> >> >
> >>> >> > 6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
> >>> >> > 3<socket:[1185362]>,
> >>> >
> >>> > msg:
> >>> >> > 0x7fcd81d8a0b0 ) = 612
> >>> >> >
> >>> >> > 400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
> >>> >
> >>> > msg:
> >>> >> > 0x7fcd81d8a070 ) = 1
> >>> >> >
> >>> >> > 6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> >>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> >>> >> >
> >>> >> > 610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
> >>> >> > WAIT|
> >>> >> >
> >>> >> > PRIV, val: 2 ) = 0
> >>> >> >
> >>> >> > 616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
> >> op:
> >>> >> > WAIT|PRIV, val: 2 ) = 0
> >>> >> >
> >>> >> > 738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
> >>> >> > WAIT|
> >>> >> >
> >>> >> > PRIV, val: 2 ) = 0
> >>> >> >
> >>> >> > 838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> >> op:
> >>> >> > WAIT|PRIV, val: 2 ) = 0
> >>> >> >
> >>> >> > 958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> >>> >> >
> >>> >> > ) = 0
> >>> >> >
> >>> >> > 6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> >>> >> >
> >>> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> >>> >> > ~~~~~
> >>> >> >
> >>> >> > This makes lttng-ust pretty much unusable for any short-lived
> >>> >> > application.
> >>> >> > Is this a known limitation?
> >>> >>
> >>> >> This kind of delay is not completely unexpected, but a bit higher than I
> >>> >> would expect. I'm interested in figuring out if most of this delay
> >>> >> happens
> >>> >> to be caused by sys_membarrier, either directly, or indirectly (through a
> >>> >> lock).
> >>> >
> >>> > See my other email which adds some harder numbers from a perf sleep
> >>> > record.
> >>> >
> >>> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
> >>> >> > I've
> >>> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
> >>> >> > above measurements there. So far, I suspected the ust statedump to take
> >>> >> > too much time, but I disabled that one for the measurements above.
> >>> >> >
> >>> >> > Any help would be appreciated
> >>> >>
> >>> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
> >>> >> share the performance results ?
> >>> >
> >>> > I'd have to recompile the kernel, which is something I don't like to do.
> >>> > Why is this required, or put differently - why would the kernel option
> >>> > influence the userspace LTTng library?
> >>>
> >>> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
> >>> system call to speed up tracing, at the expense of slower execution of
> >>> tracing state update.
> >>>
> >>> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
> >>> (single-process) use cases which should execute much faster than the
> >>> non-expedited SHARED command currently implemented.
> >>>
> >>> liburcu 0.9 and 0.10 detects that this system call is available, and uses
> >>> it when it can.
> >>
> >> May I ask whether you have such a configuration available? The test case I
> >> have submitted should be easy to reproduce for everyone. Compiling my own
> >> kernel version takes a huge amount of time which I'm quite frankly not willing
> >> to invest at this point.
> >
> > We have reproduced the situation, and it's pointing towards our use of the
> > sys_membarrier system call. We'll try narrowing it down between its use by
> > lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).
> >
> > You can try to patch liburcu so it won't try using sys_membarrier at all
> > (see attached patch)
> >
> >>
> >> I simply wonder whether anyone else is actually using lttng-ust, or if they
> >> are, whether they are only doing that for long-lived applications where this
> >> overhead is not noticeable?
> >
> > It appears that this application startup slowdown mainly appears with
> > LTTng linked against liburcu 0.9 and 0.10 running with rather recent kernels.
> > This would explain why you are the first one to report this across our user
> > base.
>
> We have confirmed that it's the use of sys_membarrier SHARED by liburcu-bp,
> used by lttng-ust, that causes those delays.
>
> lttng-ust usage pattern is to call synchronize_rcu() directly. Having
> membarrier system calls each time adds up to significant delays, because
> membarrier SHARED blocks until a scheduler switch happens on each core.
> We are invoking synchronize_rcu() directly rather than using the call_rcu
> batching facilities because we want to minimize the number of threads we
> have to insert into the application.
>
> I'm currently proposing a PRIVATE_EXPEDITED command for the membarrier
> system call in Linux. It will actively send IPIs rather than blocking.
>
> I'm going to push a fix to liburcu master, 0.10, and 0.9 that will make it
> stop using sys_membarrier in liburcu-bp. Then, after the new PRIVATE_EXPEDITED
> command makes its way into the upstream Linux kernel, a future liburcu
> (0.11 or later) will introduce usage of that new command, given that it has
> a much, much smaller impact than the SHARED command.
>
> The reason why we want sys_membarrier there is to speed up tracing. We
> can trade memory barriers for simple compiler barriers on the tracing
> fast-path by using sys_membarrier on the update side.
>
> Thanks for reporting your finding!
>
> Mathieu
>
>
> >
> > Thanks,
> >
> > Mathieu
> >
> >
> >>
> >> Thanks
> >>
> >> --
> >> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> >> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> >> Tel: +49-30-521325470
> >> KDAB - The Qt Experts
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: large liblttng-ust startup overhead (ust_lock)
[not found] ` <1506174693.4712.1504727831374.JavaMail.zimbra@efficios.com>
2017-09-06 20:23 ` Mathieu Desnoyers
[not found] ` <643211580.4746.1504729420940.JavaMail.zimbra@efficios.com>
@ 2017-09-07 7:50 ` Milian Wolff
2 siblings, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-07 7:50 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: lttng-dev
[-- Attachment #1.1: Type: text/plain, Size: 19679 bytes --]
On Wednesday, September 6, 2017 9:57:11 PM CEST Mathieu Desnoyers wrote:
> ----- On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wolff@kdab.com wrote:
> > On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
> >> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff@kdab.com wrote:
> >> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> >> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff@kdab.com wrote:
> >> >> > Hey all,
> >> >> >
> >> >> > I have noticed a very large overhead when linking against
> >> >> > liblttng-ust:
> >> >> >
> >> >> > ~~~~~
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ cat lttng-test.c
> >> >> > int main()
> >> >> > {
> >> >> >
> >> >> > return 0;
> >> >> >
> >> >> > }
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ gcc -O2 -g -ldl lttng-test.c
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ perf stat -r 5 ./a.out
> >> >> >
> >> >> > Performance counter stats for './a.out' (5 runs):
> >> >> > 0.209587 task-clock (msec) # 0.596 CPUs
> >> >> > utilized
> >> >> >
> >> >> > ( +- 8.76% )
> >> >> >
> >> >> > 0 context-switches # 0.000 K/sec
> >> >> > 0 cpu-migrations # 0.000 K/sec
> >> >> >
> >> >> > 49 page-faults # 0.235 M/sec
> >> >> >
> >> >> > ( +- 1.19% )
> >> >> >
> >> >> > 706,854 cycles # 3.373 GHz
> >> >> >
> >> >> > ( +- 8.82% )
> >> >> >
> >> >> > 773,603 instructions # 1.09 insn per
> >> >> > cycle
> >> >> >
> >> >> > ( +- 0.75% )
> >> >> >
> >> >> > 147,128 branches # 701.987 M/sec
> >> >> >
> >> >> > ( +- 0.70% )
> >> >> >
> >> >> > 4,096 branch-misses # 2.78% of all
> >> >> > branches
> >> >> >
> >> >> > ( +- 5.27% )
> >> >> >
> >> >> > 0.000351422 seconds time elapsed
> >> >> >
> >> >> > ( +- 11.85% )
> >> >> >
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ perf stat -r 5 ./a.out
> >> >> >
> >> >> > Performance counter stats for './a.out' (5 runs):
> >> >> > 2.063040 task-clock (msec) # 0.009 CPUs
> >> >> > utilized
> >> >> >
> >> >> > ( +- 1.37% )
> >> >> >
> >> >> > 44 context-switches # 0.021 M/sec
> >> >> >
> >> >> > ( +- 1.95% )
> >> >> >
> >> >> > 2 cpu-migrations # 0.776 K/sec
> >> >> >
> >> >> > ( +- 25.00% )
> >> >> >
> >> >> > 209 page-faults # 0.101 M/sec
> >> >> >
> >> >> > ( +- 0.34% )
> >> >> >
> >> >> > 7,053,686 cycles # 3.419 GHz
> >> >> >
> >> >> > ( +- 2.03% )
> >> >> >
> >> >> > 6,893,783 instructions # 0.98 insn per
> >> >> > cycle
> >> >> >
> >> >> > ( +- 0.25% )
> >> >> >
> >> >> > 1,342,492 branches # 650.735 M/sec
> >> >> >
> >> >> > ( +- 0.20% )
> >> >> >
> >> >> > 29,390 branch-misses # 2.19% of all
> >> >> > branches
> >> >> >
> >> >> > ( +- 0.61% )
> >> >> >
> >> >> > 0.225597302 seconds time elapsed
> >> >> >
> >> >> > ( +- 6.68% )
> >> >> > ~~~~~
> >> >> >
> >> >> > This is without any LTTng session configured. If I enable LTTng
> >> >> > kernel
> >> >> > and
> >> >> > userspace events, this becomes even worse:
> >> >> >
> >> >> > ~~~~~
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ cat $(which run_lttng_trace.sh)
> >> >> > #!/bin/sh
> >> >> >
> >> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >> >> >
> >> >> > sudo lttng-sessiond --daemonize
> >> >> >
> >> >> > fi
> >> >> >
> >> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >> >> > sudo lttng enable-event -c kernel -k -a
> >> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >> >> > sudo lttng start
> >> >> > $@
> >> >> > sudo lttng stop
> >> >> >
> >> >> > sudo chmod a+rx -R ~/lttng-traces
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> >> >> > Session auto-20170905-162818 created.
> >> >> > Traces will be written in
> >> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
> >> >> > kernel
> >> >> > enabled for session auto-20170905-162818
> >> >> > All Kernel events are enabled in channel kernel
> >> >> > UST channel ust enabled for session auto-20170905-162818
> >> >> > UST event lttng_ust_tracef:* created in channel ust
> >> >> > Tracing started for session auto-20170905-162818
> >> >> >
> >> >> > Performance counter stats for './a.out' (5 runs):
> >> >> > 4.937820 task-clock (msec) # 0.005 CPUs
> >> >> > utilized
> >> >> >
> >> >> > ( +- 1.28% )
> >> >> >
> >> >> > 117 context-switches # 0.024 M/sec
> >> >> >
> >> >> > ( +- 0.70% )
> >> >> >
> >> >> > 3 cpu-migrations # 0.608 K/sec
> >> >> >
> >> >> > ( +- 23.57% )
> >> >> >
> >> >> > 245 page-faults # 0.050 M/sec
> >> >> >
> >> >> > ( +- 0.20% )
> >> >> >
> >> >> > 16,541,355 cycles # 3.350 GHz
> >> >> >
> >> >> > ( +- 0.94% )
> >> >> >
> >> >> > 20,611,637 instructions # 1.25 insn per
> >> >> > cycle
> >> >> >
> >> >> > ( +- 0.23% )
> >> >> >
> >> >> > 3,580,525 branches # 725.123 M/sec
> >> >> >
> >> >> > ( +- 0.19% )
> >> >> >
> >> >> > 52,093 branch-misses # 1.45% of all
> >> >> > branches
> >> >> >
> >> >> > ( +- 1.73% )
> >> >> >
> >> >> > 0.965545882 seconds time elapsed
> >> >> >
> >> >> > ( +- 16.96% )
> >> >> >
> >> >> > Waiting for data availability...
> >> >> > Tracing stopped for session auto-20170905-162818
> >> >> > ~~~~~
> >> >> >
> >> >> > I see this ~1s delay during startup in nearly all test applications
> >> >> > where
> >> >> > I
> >> >> > wanted to use lttng-ust. In tracecompass, and according to perf
> >> >> > sleep
> >> >> > time
> >> >> > profiling, this seems to be mostly due to extensive sleeping on
> >> >> > mutex
> >> >> > locks
> >> >> > (from ust_lock in liblttng-ust). With perf trace, I also see
> >> >> > membarrier
> >> >> > taking a lot of time:
> >> >> >
> >> >> > ~~~~~
> >> >> > # without a running lttng session:
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ perf trace --duration 1 ./a.out
> >> >> >
> >> >> > 6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
> >
> > msg:
> >> >> > 0x7fbe2fbb1070 ) = 1
> >> >> >
> >> >> > 5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> >> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> >> >> >
> >> >> > 59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
> >> >> > WAIT|
> >> >> >
> >> >> > PRIV, val: 2 ) = 0
> >> >> >
> >> >> > 179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > # with a running lttng session:
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> >> >> >
> >> >> > 6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
> >> >> > 3<socket:[1185362]>,
> >> >
> >> > msg:
> >> >> > 0x7fcd81d8a0b0 ) = 612
> >> >> >
> >> >> > 400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd:
> >> >> > 3<socket:[1185362]>,
> >> >
> >> > msg:
> >> >> > 0x7fcd81d8a070 ) = 1
> >> >> >
> >> >> > 6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> >> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> >> >> >
> >> >> > 610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
> >> >> > WAIT|
> >> >> >
> >> >> > PRIV, val: 2 ) = 0
> >> >> >
> >> >> > 616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
> >
> > op:
> >> >> > WAIT|PRIV, val: 2 ) = 0
> >> >> >
> >> >> > 738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> >> >> > op:
> >> >> > WAIT|
> >> >> >
> >> >> > PRIV, val: 2 ) = 0
> >> >> >
> >> >> > 838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> >
> > op:
> >> >> > WAIT|PRIV, val: 2 ) = 0
> >> >> >
> >> >> > 958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> >> >> >
> >> >> > ) = 0
> >> >> >
> >> >> > 6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> >> >> >
> >> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0,
> >> >> > 128)=22
> >> >> > ~~~~~
> >> >> >
> >> >> > This makes lttng-ust pretty much unusable for any short-lived
> >> >> > application.
> >> >> > Is this a known limitation?
> >> >>
> >> >> This kind of delay is not completely unexpected, but a bit higher than
> >> >> I
> >> >> would expect. I'm interested in figuring out if most of this delay
> >> >> happens
> >> >> to be caused by sys_membarrier, either directly, or indirectly
> >> >> (through a
> >> >> lock).
> >> >
> >> > See my other email which adds some harder numbers from a perf sleep
> >> > record.
> >> >
> >> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
> >> >> > I've
> >> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
> >> >> > above measurements there. So far, I suspected the ust statedump to
> >> >> > take
> >> >> > too much time, but I disabled that one for the measurements above.
> >> >> >
> >> >> > Any help would be appreciated
> >> >>
> >> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
> >> >> share the performance results ?
> >> >
> >> > I'd have to recompile the kernel, which is something I don't like to
> >> > do.
> >> > Why is this required, or put differently - why would the kernel option
> >> > influence the userspace LTTng library?
> >>
> >> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
> >> system call to speed up tracing, at the expense of slower execution of
> >> tracing state update.
> >>
> >> I'm currently proposing a new sys_membarrier command for
> >> PRIVATE_EXPEDITED
> >> (single-process) use cases which should execute much faster than the
> >> non-expedited SHARED command currently implemented.
> >>
> >> liburcu 0.9 and 0.10 detects that this system call is available, and uses
> >> it when it can.
> >
> > May I ask whether you have such a configuration available? The test case I
> > have submitted should be easy to reproduce for everyone. Compiling my own
> > kernel version takes a huge amount of time which I'm quite frankly not
> > willing to invest at this point.
>
> We have reproduced the situation, and it's pointing towards our use of the
> sys_membarrier system call. We'll try narrowing it down between its use by
> lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).
>
> You can try to patch liburcu so it won't try using sys_membarrier at all
> (see attached patch)
It does improve the situation tremendously. I just compiled v0.10.0-6-g6447802
with and without the patch to record the numbers below. Note that your latest
patch [1] in git master alone is apparently not enough to fix the performance
bug; I also need to apply the patch you sent by mail.
[1]: https://github.com/urcu/userspace-rcu/commit/64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6
But once I also apply the patch you sent by mail, things start to fly. In the
numbers below, "before" means running against userspace-rcu v0.10.0-6-g6447802,
and "after" means running against the same version with your emailed patch
applied on top.
Measurement without an active LTTng session:
~~~~~
# before:
┌milian@milian-kdab2:/tmp
└$ perf stat -r 5 ./a.out
Performance counter stats for './a.out' (5 runs):
3.973288 task-clock (msec) # 0.005 CPUs utilized
( +- 0.39% )
153 context-switches # 0.039 M/sec
( +- 1.43% )
2 cpu-migrations # 0.554 K/sec
( +- 26.50% )
315 page-faults # 0.079 M/sec
( +- 0.21% )
13,257,360 cycles # 3.337 GHz
( +- 0.34% )
14,906,461 instructions # 1.12 insn per cycle
( +- 0.11% )
2,910,368 branches # 732.483 M/sec
( +- 0.10% )
42,766 branch-misses # 1.47% of all branches
( +- 0.89% )
0.851654352 seconds time elapsed
( +- 4.33% )
# after:
┌milian@milian-kdab2:/tmp
└$ perf stat -r 5 ./a.out
Performance counter stats for './a.out' (5 runs):
3.608242 task-clock (msec) # 0.056 CPUs utilized
( +- 1.84% )
138 context-switches # 0.038 M/sec
( +- 2.70% )
1 cpu-migrations # 0.277 K/sec
313 page-faults # 0.087 M/sec
( +- 0.27% )
12,022,499 cycles # 3.332 GHz
( +- 1.73% )
14,861,318 instructions # 1.24 insn per cycle
( +- 0.32% )
2,902,920 branches # 804.525 M/sec
( +- 0.32% )
38,264 branch-misses # 1.32% of all branches
( +- 0.78% )
0.063929538 seconds time elapsed
( +- 10.93% )
~~~~~
With an active LTTng session:
~~~~~
# before:
Performance counter stats for './a.out' (5 runs):
9.152925 task-clock (msec) # 0.005 CPUs utilized
( +- 1.13% )
216 context-switches # 0.024 M/sec
( +- 2.83% )
4 cpu-migrations # 0.459 K/sec
( +- 11.66% )
370 page-faults # 0.040 M/sec
( +- 0.20% )
31,311,283 cycles # 3.421 GHz
( +- 0.69% )
40,509,922 instructions # 1.29 insn per cycle
( +- 0.40% )
7,192,651 branches # 785.831 M/sec
( +- 0.33% )
90,264 branch-misses # 1.25% of all branches
( +- 1.46% )
1.839545247 seconds time elapsed
( +- 6.95% )
# after:
Performance counter stats for './a.out' (5 runs):
7.052094 task-clock (msec) # 0.026 CPUs utilized
( +- 1.45% )
190 context-switches # 0.027 M/sec
( +- 4.88% )
1 cpu-migrations # 0.142 K/sec
( +- 44.72% )
335 page-faults # 0.048 M/sec
( +- 0.37% )
24,105,408 cycles # 3.418 GHz
( +- 1.39% )
35,100,950 instructions # 1.46 insn per cycle
( +- 0.17% )
6,197,972 branches # 878.884 M/sec
( +- 0.15% )
56,045 branch-misses # 0.90% of all branches
( +- 1.41% )
0.272121270 seconds time elapsed
( +- 52.41% )
~~~~~
> > I simply wonder whether anyone else is actually using lttng-ust, or if
> > they
> > are, whether they are only doing that for long-lived applications where
> > this overhead is not noticeable?
>
> It appears that this application startup slowdown mainly occurs with
> LTTng linked against liburcu 0.9 and 0.10 running on rather recent
> kernels. This would explain why you are the first one to report this across
> our user base.
Thank you, that explains it.
--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts
[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]
[-- Attachment #2: Type: text/plain, Size: 156 bytes --]
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
^ permalink raw reply [flat|nested] 13+ messages in thread
* large liblttng-ust startup overhead (ust_lock)
@ 2017-09-05 14:34 Milian Wolff
0 siblings, 0 replies; 13+ messages in thread
From: Milian Wolff @ 2017-09-05 14:34 UTC (permalink / raw)
To: lttng-dev
[-- Attachment #1.1: Type: text/plain, Size: 9953 bytes --]
Hey all,
I have noticed a very large overhead when linking against liblttng-ust:
~~~~~
┌milian@milian-kdab2:/tmp
└$ cat lttng-test.c
int main()
{
return 0;
}
┌milian@milian-kdab2:/tmp
└$ gcc -O2 -g -ldl lttng-test.c
┌milian@milian-kdab2:/tmp
└$ perf stat -r 5 ./a.out
Performance counter stats for './a.out' (5 runs):
0.209587 task-clock (msec) # 0.596 CPUs utilized
( +- 8.76% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
49 page-faults # 0.235 M/sec
( +- 1.19% )
706,854 cycles # 3.373 GHz
( +- 8.82% )
773,603 instructions # 1.09 insn per cycle
( +- 0.75% )
147,128 branches # 701.987 M/sec
( +- 0.70% )
4,096 branch-misses # 2.78% of all branches
( +- 5.27% )
0.000351422 seconds time elapsed
( +- 11.85% )
┌milian@milian-kdab2:/tmp
└$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
┌milian@milian-kdab2:/tmp
└$ perf stat -r 5 ./a.out
Performance counter stats for './a.out' (5 runs):
2.063040 task-clock (msec) # 0.009 CPUs utilized
( +- 1.37% )
44 context-switches # 0.021 M/sec
( +- 1.95% )
2 cpu-migrations # 0.776 K/sec
( +- 25.00% )
209 page-faults # 0.101 M/sec
( +- 0.34% )
7,053,686 cycles # 3.419 GHz
( +- 2.03% )
6,893,783 instructions # 0.98 insn per cycle
( +- 0.25% )
1,342,492 branches # 650.735 M/sec
( +- 0.20% )
29,390 branch-misses # 2.19% of all branches
( +- 0.61% )
0.225597302 seconds time elapsed
( +- 6.68% )
~~~~~
This is without any LTTng session configured. If I enable LTTng kernel and
userspace events, this becomes even worse:
~~~~~
┌milian@milian-kdab2:/tmp
└$ cat $(which run_lttng_trace.sh)
#!/bin/sh
if [ -z "$(pidof lttng-sessiond)" ]; then
sudo lttng-sessiond --daemonize
fi
sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
sudo lttng enable-event -c kernel -k -a
sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
sudo lttng enable-event -c ust -u lttng_ust_tracef:*
sudo lttng start
$@
sudo lttng stop
sudo chmod a+rx -R ~/lttng-traces
┌milian@milian-kdab2:/tmp
└$ run_lttng_trace.sh perf stat -r 5 ./a.out
Session auto-20170905-162818 created.
Traces will be written in /home/milian/lttng-traces/2017-09-05T16:28:18+02:00
Kernel channel kernel enabled for session auto-20170905-162818
All Kernel events are enabled in channel kernel
UST channel ust enabled for session auto-20170905-162818
UST event lttng_ust_tracef:* created in channel ust
Tracing started for session auto-20170905-162818
Performance counter stats for './a.out' (5 runs):
4.937820 task-clock (msec) # 0.005 CPUs utilized
( +- 1.28% )
117 context-switches # 0.024 M/sec
( +- 0.70% )
3 cpu-migrations # 0.608 K/sec
( +- 23.57% )
245 page-faults # 0.050 M/sec
( +- 0.20% )
16,541,355 cycles # 3.350 GHz
( +- 0.94% )
20,611,637 instructions # 1.25 insn per cycle
( +- 0.23% )
3,580,525 branches # 725.123 M/sec
( +- 0.19% )
52,093 branch-misses # 1.45% of all branches
( +- 1.73% )
0.965545882 seconds time elapsed
( +- 16.96% )
Waiting for data availability...
Tracing stopped for session auto-20170905-162818
~~~~~
I see this ~1s delay during startup in nearly all test applications where I
wanted to use lttng-ust. In tracecompass, and according to perf sleep time
profiling, this seems to be mostly due to extensive sleeping on mutex locks
(from ust_lock in liblttng-ust). With perf trace, I also see membarrier taking
a lot of time:
~~~~~
# without a running lttng session:
┌milian@milian-kdab2:/tmp
└$ perf trace --duration 1 ./a.out
6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg:
0x7fbe2fbb1070 ) = 1
5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
) = 0
138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
) = 0
59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|
PRIV, val: 2 ) = 0
179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
) = 0
258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
) = 0
299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
) = 0
378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
) = 0
439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
) = 0
508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
) = 0
# with a running lttng session:
┌milian@milian-kdab2:/tmp
└$ run_lttng_trace.sh perf trace --duration 1 ./a.out
6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg:
0x7fcd81d8a0b0 ) = 612
400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
) = 0
468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
) = 0
510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg:
0x7fcd81d8a070 ) = 1
6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|
PRIV, val: 2 ) = 0
616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
) = 0
678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
) = 0
616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op:
WAIT|PRIV, val: 2 ) = 0
738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
) = 0
788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
) = 0
739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|
PRIV, val: 2 ) = 0
838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
) = 0
888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
) = 0
838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
WAIT|PRIV, val: 2 ) = 0
958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
) = 0
1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
) = 0
1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
) = 0
1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
) = 0
1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
) = 0
1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
) = 0
1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
) = 0
1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
) = 0
6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
~~~~~
This makes lttng-ust pretty much unusable for any short-lived application. Is
this a known limitation?
I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've seen
this problem also on Ubuntu 17.04 though I didn't do any of the above
measurements there. So far, I suspected the ust statedump to take too much
time, but I disabled that one for the measurements above.
Any help would be appreciated
--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts
[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]
[-- Attachment #2: Type: text/plain, Size: 156 bytes --]
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2017-09-07 7:50 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <2117884.c1kuPXoiOA@milian-kdab2>
2017-09-05 14:51 ` large liblttng-ust startup overhead (ust_lock) Mathieu Desnoyers
2017-09-05 14:52 ` Milian Wolff
[not found] ` <119280371.2823.1504623102155.JavaMail.zimbra@efficios.com>
2017-09-05 15:08 ` Milian Wolff
[not found] ` <3731578.YsMjlWojEO@milian-kdab2>
2017-09-05 18:11 ` Mathieu Desnoyers
[not found] ` <709369421.3343.1504635118648.JavaMail.zimbra@efficios.com>
2017-09-06 7:35 ` Milian Wolff
[not found] ` <3211887.IErR8xQyUx@agathebauer>
2017-09-06 19:57 ` Mathieu Desnoyers
[not found] ` <1506174693.4712.1504727831374.JavaMail.zimbra@efficios.com>
2017-09-06 20:23 ` Mathieu Desnoyers
[not found] ` <643211580.4746.1504729420940.JavaMail.zimbra@efficios.com>
2017-09-06 21:22 ` Paul E. McKenney
2017-09-07 7:50 ` Milian Wolff
[not found] ` <5409371.Vv1znlTa4X@milian-kdab2>
2017-09-05 15:10 ` Milian Wolff
[not found] ` <5963588.vVBsjV779J@milian-kdab2>
2017-09-05 18:20 ` Mathieu Desnoyers
[not found] ` <859976545.3393.1504635610385.JavaMail.zimbra@efficios.com>
2017-09-06 7:42 ` Milian Wolff
2017-09-05 14:34 Milian Wolff