It is v2.17 (for the problematic system) and 2.23 for the other

Shehab Y. Elsayed, MSc.
PhD Student
The Edwards S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto

E-mail: shehabyomn@gmail.com

On Thu, Mar 22, 2018 at 1:00 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
----- On Mar 22, 2018, at 12:24 PM, Shehab Elsayed <shehabyomn@gmail.com> wrote:
Actually, i am not sure if this would help. I have been trying to reproduce the problem on a different machine, but I can't. Even without the patches !!!!!

Does it have the same glibc version ?

Thanks,

Mathieu




Shehab Y. Elsayed, MSc.
PhD Student
The Edwards S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto

E-mail: shehabyomn@gmail.com


On Wed, Mar 21, 2018 at 8:13 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
----- On Mar 21, 2018, at 8:01 PM, Shehab Elsayed <shehabyomn@gmail.com> wrote:
Just to clarify more what I meant by custom events; I have new tracepoints added in the source code of the benchmark. However, I don't enable the corresponding events when I do the actual tracing.

This is the sequence followed in building the benchmark:
gcc-7.2 -c -O2 -pthread -D_XOPEN_SOURCE=500 -D_POSIX_C_SOURCE=200112 -std=c11 -g -fno-strict-aliasing -DLTTNG_INST lu.c
gcc-7.2 -O2 -pthread -D_XOPEN_SOURCE=500 -D_POSIX_C_SOURCE=200112 -std=c11 -g -fno-strict-aliasing -DLTTNG_INST -o LU_NCB lu.o ../../instrumentation/lttng_tp/tp.o -lm -llttng-ust -ldl

LTTNG_INST is just a preprocessor flag I have and tp.o is my custom tracepoints
Could you share a repository with your custom instrumentation on github, so I could
try it out ?

My current problem is that I cannot reproduce your issue on my end.

Thanks,

Mathieu




Shehab Y. Elsayed, MSc.
PhD Student
The Edwards S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto

E-mail: shehabyomn@gmail.com


On Wed, Mar 21, 2018 at 7:55 PM, Shehab Elsayed <shehabyomn@gmail.com> wrote:
Still running into same problem. I attached the debug trace I got after applying the 2 patches.

The benchmark I am running also includes some custom created tracepoints. I am not adding the events being traced in the files I have provided. Do you think this might be causing a problem when I have tracpoints from 2 different packages?

Shehab Y. Elsayed, MSc.
PhD Student
The Edwards S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto

E-mail: shehabyomn@gmail.com


On Wed, Mar 21, 2018 at 4:22 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
----- On Mar 21, 2018, at 1:55 PM, Shehab Elsayed <shehabyomn@gmail.com> wrote:
I am so sorry for the late replies.

I have tried the first patch you sent and the problem still happens (although seemingly less frequently especially with debugging enabled!!!!). I have attached the output I got from one of the erroneous runs.

I will try the updated patch and let you know how it goes.

Also, I am not sure if these points make any difference:
1- The error actually happens at the end of the application. It actually finishes execution, but then something goes wrong.
2- I run into this problem only for some of the benchmarks (or at least the problems happens much less frequently for others that I didn't run into it, not yet)

Thanks you very much, and sorry again for the late replies.

No worries! Looking through your log, I notice that pthread set cancel state has problems when
called from application threads. We do not restore the original thread's cancel state. Can you try
with the attached patch applied on top of the previous one ?

Thanks,

Mathieu




Shehab Y. Elsayed, MSc.
PhD Student
The Edwards S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto

E-mail: shehabyomn@gmail.com


On Wed, Mar 21, 2018 at 1:27 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
----- On Mar 20, 2018, at 5:42 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
----- On Mar 20, 2018, at 4:58 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
----- On Mar 20, 2018, at 12:07 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

----- On Mar 19, 2018, at 4:21 PM, Shehab Elsayed <shehabyomn@gmail.com> wrote:
I did "echo "-1" > /proc/sys/kernel/perf_event_paranoid" and made sure the value was actually written by "cat /proc/sys/kernel/perf_event_paranoid"

It executed normally 2 times but then got the same error.

Can you provide the output when reproducing the issue with the
LTTNG_UST_DEBUG=1 environment variable set when starting
your test program ?
I just noticed something that might explain what goes wrong here:

lttng-context-perf-counters.c: add_thread_field() grabs the ust_lock(). Pthread mutex
in your case is instrumented with the pthread wrapper. This "add_thread_field" is invoked
the first time the perf counter is hit by each given thread. When this happens, the
instrumented pthread mutex will try to call into the instrumentation tracepoint again,
which will call add_thread_field() (again), and so on until we reach the libringbuffer
"lib_ring_buffer_nesting" threshold of 4 calls deep.

I suspect this situation where we recursively call add_thread_field is unexpected,
which may trigger your double-free here.

Will investigate more.
Can you try with the attached patch applied ?
Here is an updated v2 of the patch which tests the notrace tls counter sooner (before evaluating
filter). Please let me know how it goes.

Thanks,

Mathieu





Thanks,

Mathieu





Thanks,

Mathieu




Thanks,

Mathieu




Shehab Y. Elsayed, MSc.
PhD Student
The Edwards S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto

E-mail: shehabyomn@gmail.com


On Mon, Mar 19, 2018 at 4:01 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:


----- On Mar 19, 2018, at 3:53 PM, Shehab Elsayed <shehabyomn@gmail.com> wrote:
cat /proc/sys/kernel/perf_event_paranoid ---> returns 1

I run the program as a normal user

The glibc version I get by running "ldd --version" is 2.17

Can you reproduce the issue after you do this as root ?

echo "-1" > /proc/sys/kernel/perf_event_paranoid

Based on this documentation of the Linux kernel:

Documentation/sysctl/kernel.txt:

perf_event_paranoid:

Controls use of the performance events system by unprivileged
users (without CAP_SYS_ADMIN).  The default value is 2.

 -1: Allow use of (almost) all events by all users
     Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>=0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
     Disallow raw tracepoint access by users without CAP_SYS_ADMIN
>=1: Disallow CPU event access by users without CAP_SYS_ADMIN
>=2: Disallow kernel profiling by users without CAP_SYS_ADMIN

Thanks,

Mathieu



Shehab Y. Elsayed, MSc.
PhD Student
The Edwards S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto

E-mail: shehabyomn@gmail.com


On Mon, Mar 19, 2018 at 3:31 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
---- On Mar 19, 2018, at 3:26 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
----- On Mar 19, 2018, at 2:26 PM, Shehab Elsayed <shehabyomn@gmail.com> wrote:
Yes, I tried with only one of those contexts and I still ran into the same problem.
What is the setting returned by

cat /proc/sys/kernel/perf_event_paranoid

on your system ? And do you run your test program as root or normal user ?

Please CC the mailing list on your reply.

I will also need to know what glibc version you have on your system.

Thanks,

Mathieu




Thanks,

Mathieu



Shehab Y. Elsayed, MSc.
PhD Student
The Edwards S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto

E-mail: shehabyomn@gmail.com


On Mon, Mar 19, 2018 at 2:24 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
----- On Mar 19, 2018, at 12:36 PM, Shehab Elsayed <shehabyomn@gmail.com> wrote:
Hi Mathieu,

Thank you very much for your reply.

I manually built lttng-ust from source (commit #: 8a208943e21700211beee3ea64180a5a534c7d2a).

This is how I set up the tracing session:
1- lttng create lu_ncb_8_native -o {path}
2- lttng enable-event --userspace lttng_ust_pthread:pthread_mutex_lock_req
    lttng enable-event --userspace lttng_ust_pthread:pthread_mutex_lock_acq
    lttng enable-event --userspace lttng_ust_pthread:pthread_mutex_lock_trylock
    lttng enable-event --userspace lttng_ust_pthread:pthread_mutex_lock_unlock
3- lttng add-context -u -t procname
    lttng add-context -u -t vpid
    lttng add-context -u -t pthread_id
    lttng add-context -u -t vtid
    lttng add-context -u -t ip
    lttng add-context -u -t perf:thread:cpu-cycles
    lttng add-context -u -t perf:thread:cycles
    lttng add-context -u -t perf:thread:instructions
4- lttng start
5- LD_PRELOAD=/usr/local/lib/liblttng-ust-pthread-wrapper.so ./lu_ncb -p8 -n8096 -b32
6- lttng stop
7- lttng destroy

Can you reproduce if you remove the following contexts ?

    lttng add-context -u -t perf:thread:cpu-cycles
    lttng add-context -u -t perf:thread:cycles
    lttng add-context -u -t perf:thread:instructions

And if you only keep a single of those contexts ?

Thanks,

Mathieu






Shehab Y. Elsayed, MSc.
PhD Student
The Edwards S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto

E-mail: shehabyomn@gmail.com


On Mon, Mar 19, 2018 at 12:21 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:


----- On Mar 16, 2018, at 5:37 PM, Shehab Elsayed <shehabyomn@gmail.com> wrote:
Hello All,

I am trying to instrument a pthread application using the provided pthread wrapper, but I sometimes run into a "Double free or corruption error (fasttop)" error.

Please provide more information about the version of lttng-ust you are using, and how you setup
your tracing session.

Thanks,

Mathieu



Here is a description of what I have tried and noticed:
1- The problem isn't consistent. It sometimes happen and sometimes works as expected.
2- From my experiments, the problem happens (more frequently at least) when adding performance counter contexts (I tried cycles, cpu_cycles and instructions).
3- I am testing using lu_ncb from splash3 benchmark suite after setting LD_PRELOAD to use the pthread wrapper as described in the LTTng documents.
4- Here is the backtrace printed after exiting:
======= Backtrace: =========
/lib64/libc.so.6([Thread 0x7ffff5611700 (LWP 97229) exited]
/usr/local/lib/liblttng-ust.so.0(lttng_destroy_context+0x35)[0x7ffff7471575]
/usr/local/lib/liblttng-ust.so.0(lttng_session_destroy+0x21c)[0x7ffff747363c]
/usr/local/lib/liblttng-ust.so.0(+0x1e906)[0x7ffff746d906]
/usr/local/lib/liblttng-ust.so.0(lttng_ust_objd_unref+0x9f)[0x7ffff746dccf]
/usr/local/lib/liblttng-ust.so.0(lttng_ust_objd_unref+0x9f)[0x7ffff746dccf]
/usr/local/lib/liblttng-ust.so.0(lttng_ust_objd_unref+0x9f)[0x7ffff746dccf]
/usr/local/lib/liblttng-ust.so.0(lttng_ust_abi_exit+0x68)[0x7ffff746ead8]
/usr/local/lib/liblttng-ust.so.0(+0x191d3)[0x7ffff74681d3]
/usr/local/lib/liblttng-ust.so.0(lttng_ust_exit+0x67)[0x7ffff745ed57]
/lib64/ld-linux-x86-64.so.2(+0xf85a)[0x7ffff7dec85a]
/lib64/libc.so.6(+0x38a49)[0x7ffff6ca6a49]
/lib64/libc.so.6(+0x38a95)[0x7ffff6ca6a95]
/aenao-99/elsayed9/LTTng/data/scripts/tmp/lu_ncb[0x401b51]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff6c8fb35]
/aenao-99/elsayed9/LTTng/data/scripts/tmp/lu_ncb[0x401c44]
5- Also, this is a backtrace I obtained from gdb:
#0  0x00007ffff6eac1d7 in raise () from /lib64/libc.so.6
#1  0x00007ffff6ead8c8 in abort () from /lib64/libc.so.6
#2  0x00007ffff6eebf07 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff6ef3503 in _int_free () from /lib64/libc.so.6
#4  0x00007ffff768ad25 in lttng_destroy_perf_counter_field (
    field=<optimized out>) at lttng-context-perf-counters.c:418
#5  0x00007ffff767a575 in lttng_destroy_context (
ctx=0x7ffff0011090) at lttng-context.c:278
#6  0x00007ffff767c63c in _lttng_channel_unmap (
lttng_chan=0x7ffff0010f40) at lttng-events.c:172
#7  lttng_session_destroy (session=0x7ffff0000900)
    at lttng-events.c:247
#8  0x00007ffff7676906 in lttng_release_session (
objd=<optimized out>) at lttng-ust-abi.c:601
#9  0x00007ffff7676ccf in lttng_ust_objd_unref (id=1,
    is_owner=<optimized out>) at lttng-ust-abi.c:216
#10 0x00007ffff7676ccf in lttng_ust_objd_unref (id=2,
    is_owner=<optimized out>) at lttng-ust-abi.c:216
#11 0x00007ffff7676ccf in lttng_ust_objd_unref (id=id@entry=18,
    is_owner=is_owner@entry=1) at lttng-ust-abi.c:216
#12 0x00007ffff7677ad8 in objd_table_destroy ()
    at lttng-ust-abi.c:235
#13 lttng_ust_abi_exit () at lttng-ust-abi.c:1002
#14 0x00007ffff76711d3 in lttng_ust_cleanup (exiting=1)
    at lttng-ust-comm.c:1807
#15 0x00007ffff7667d57 in lttng_ust_exit ()
    at lttng-ust-comm.c:1874
#16 0x00007ffff7dec85a in _dl_fini ()
   from /lib64/ld-linux-x86-64.so.2
#17 0x00007ffff6eafa49 in __run_exit_handlers ()
   from /lib64/libc.so.6
#18 0x00007ffff6eafa95 in exit () from /lib64/libc.so.6
#19 0x0000000000401b51 in main (argc=<optimized out>,
argv=<optimized out>) at lu.c:368

Any ideas, why this happens and how to fix it?

Thanks,
Shehab



_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com



--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com