* Re: Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
       [not found] <20998D40D9A2B7499CA5A3A2666CB1EB2D9DDA59@ZURMSG1.QUANTUM.com>
@ 2014-12-11 15:36 ` Mathieu Desnoyers
       [not found] ` <1745838195.26177.1418312190246.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2014-12-11 15:36 UTC (permalink / raw)
  To: David OShea; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 4749 bytes --]

----- Original Message -----

> From: "David OShea" <David.OShea@quantum.com>
> To: "lttng-dev" <lttng-dev@lists.lttng.org>
> Sent: Sunday, December 7, 2014 10:30:04 PM
> Subject: [lttng-dev] Segfault at v_read() called from
> lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> dependent

> Hi all,

> We have encountered a problem with using LTTng-UST tracing with our
> application, where on a particular VMware vCenter cluster we almost always get
> segfaults when tracepoints are enabled, whereas on another vCenter cluster,
> and on every other machine we’ve ever used, we don’t hit this problem.

> I can reproduce this using lttng-ust/tests/hello after using:

> """

> lttng create

> lttng enable-channel channel0 --userspace

> lttng add-context --userspace -t vpid -t vtid -t procname

> lttng enable-event --userspace "ust_tests_hello:*" -c channel0

> lttng start

> """

> In which case I get the following stack trace with an obvious NULL pointer
> dereference:

> """

> Program terminated with signal SIGSEGV, Segmentation fault.

> #0 v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48

> 48 return uatomic_read(&v_a->a);

> [...]

> #0 v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48

> #1 0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (

> buf=0x7f4a98008a00, chan=0x7f4a98008a00, offsets=0x7fffef67c620,

> ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677

> #2 0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow (ctx=0x7fffef67ca40)

> at ring_buffer_frontend.c:1819

> #3 0x00007f4aa1095b75 in lib_ring_buffer_reserve (ctx=0x7fffef67ca40,

> config=0x7f4aa12b8ae0 <client_config>)

> at ../libringbuffer/frontend_api.h:211

> #4 lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0)

> at lttng-ring-buffer-client.h:473

> #5 0x000000000040135f in __event_probe__ust_tests_hello___tptest (

> __tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50,

> text=0x7fffef67cb70 "test", textlen=<optimized out>, doublearg=2,

> floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32

> #6 0x0000000000400d2c in __tracepoint_cb_ust_tests_hello___tptest (

> boolarg=true, floatarg=2222, doublearg=2, textlen=4,

> text=0x7fffef67cb70 "test", values=0x7fffef67cb50,

> netint=<optimized out>, anint=0) at ust_tests_hello.h:32

> #7 main (argc=<optimized out>, argv=<optimized out>) at hello.c:92

> """

> I hit this segfault 10 out of 10 times I ran “hello” on a VM on one vCenter
> and 0 out of 10 times I ran it on the other, and the VMs otherwise had the
> same software installed on them:

> - CentOS 6-based

> - kernel-2.6.32-504.1.3.el6 with some minor changes made in networking

> - userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2 which might have
> some minor patches backported, and leftovers of changes to get them to build
> on CentOS 5

> On the “good” vCenter, I tested on two different VM hosts:

> Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz

> EVC Mode: Intel(R) "Nehalem" Generation

> Image Profile: (Updated) ESXi-5.1.0-799733-standard

> Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz

> EVC Mode: Intel(R) "Nehalem" Generation

> Image Profile: (Updated) ESXi-5.1.0-799733-standard

> The “bad” vCenter VM host that I tested on had this configuration:

> ESX Version: VMware ESXi, 5.0.0, 469512

> Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz

> Any ideas?

My bet would be that the OS is lying to userspace about the 
number of possible CPUs. I wonder what liblttng-ust 
libringbuffer/shm.h num_possible_cpus() is returning compared 
to what lib_ring_buffer_get_cpu() returns. 

Can you check this out ? 
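
As a quick standalone check (just a sketch built on sched_getcpu() and sysconf(),
not the LTTng-internal helpers named above), something like this shows whether the
CPU number the kernel reports ever exceeds the configured count:

"""
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Standalone sanity check, independent of LTTng: if sched_getcpu() ever
 * reports a CPU number >= sysconf(_SC_NPROCESSORS_CONF), then any per-CPU
 * table sized from the latter will be indexed out of range.
 */
int main(void)
{
	long conf = sysconf(_SC_NPROCESSORS_CONF);
	long i;

	for (i = 0; i < 10000000; i++) {
		int cpu = sched_getcpu();

		if (cpu < 0 || cpu >= conf) {
			printf("observed cpu %d, outside 0..%ld\n", cpu, conf - 1);
			return 1;
		}
	}
	printf("all observed cpu numbers within 0..%ld\n", conf - 1);
	return 0;
}
"""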

Thanks, 

Mathieu 

> Thanks in advance,
> David

> The information contained in this transmission may be confidential. Any
> disclosure, copying, or further distribution of confidential information is
> not permitted unless such privilege is explicitly granted in writing by
> Quantum. Quantum reserves the right to have electronic communications,
> including email and attachments, sent across its networks filtered through
> anti virus and spam software programs and retain such messages in order to
> comply with applicable data security and retention requirements. Quantum is
> not responsible for the proper and complete transmission of the substance of
> this communication or for any delay in its receipt.

> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 

[-- Attachment #1.2: Type: text/html, Size: 8339 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
       [not found] ` <1745838195.26177.1418312190246.JavaMail.zimbra@efficios.com>
@ 2015-01-12  6:33   ` David OShea
       [not found]   ` <20998D40D9A2B7499CA5A3A2666CB1EB2D9E464C@ZURMSG1.QUANTUM.com>
  1 sibling, 0 replies; 9+ messages in thread
From: David OShea @ 2015-01-12  6:33 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev

Hi Mathieu,

Apologies for the delay in getting back to you, please see below:

> -----Original Message-----
> From: Mathieu Desnoyers [mailto:mathieu.desnoyers@efficios.com]
> Sent: Friday, 12 December 2014 2:07 AM
> To: David OShea
> Cc: lttng-dev
> Subject: Re: [lttng-dev] Segfault at v_read() called from
> lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> dependent
> 
> ________________________________
> 
> 	From: "David OShea" <David.OShea@quantum.com>
> 	To: "lttng-dev" <lttng-dev@lists.lttng.org>
> 	Sent: Sunday, December 7, 2014 10:30:04 PM
> 	Subject: [lttng-dev] Segfault at v_read() called from
> lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> dependent
> 
> 
> 
> 	Hi all,
> 
> 	We have encountered a problem with using LTTng-UST tracing with
> our application, where on a particular VMware vCenter cluster we almost
> always get segfaults when tracepoints are enabled, whereas on another
> vCenter cluster, and on every other machine we’ve ever used, we don’t
> hit this problem.
> 
> 	I can reproduce this using lttng-ust/tests/hello after using:
> 
> 	"""
> 
> 	lttng create
> 
> 	lttng enable-channel channel0 --userspace
> 
> 	lttng add-context --userspace -t vpid -t vtid -t procname
> 
> 	lttng enable-event --userspace "ust_tests_hello:*" -c channel0
> 
> 	lttng start
> 
> 	"""
> 
> 	In which case I get the following stack trace with an obvious
> NULL pointer dereference:
> 
> 	"""
> 
> 	Program terminated with signal SIGSEGV, Segmentation fault.
> 
> 	#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> 
> 	48              return uatomic_read(&v_a->a);
> 
> 	[...]
> 
> 	#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> 
> 	#1  0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (
> 
> 	    buf=0x7f4a98008a00, chan=0x7f4a98008a00,
> offsets=0x7fffef67c620,
> 
> 	    ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677
> 
> 	#2  0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow
> (ctx=0x7fffef67ca40)
> 
> 	    at ring_buffer_frontend.c:1819
> 
> 	#3  0x00007f4aa1095b75 in lib_ring_buffer_reserve
> (ctx=0x7fffef67ca40,
> 
> 	    config=0x7f4aa12b8ae0 <client_config>)
> 
> 	    at ../libringbuffer/frontend_api.h:211
> 
> 	#4  lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0)
> 
> 	    at lttng-ring-buffer-client.h:473
> 
> 	#5  0x000000000040135f in __event_probe__ust_tests_hello___tptest
> (
> 
> 	    __tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50,
> 
> 	    text=0x7fffef67cb70 "test", textlen=<optimized out>,
> doublearg=2,
> 
> 	    floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32
> 
> 	#6  0x0000000000400d2c in
> __tracepoint_cb_ust_tests_hello___tptest (
> 
> 	    boolarg=true, floatarg=2222, doublearg=2, textlen=4,
> 
> 	    text=0x7fffef67cb70 "test", values=0x7fffef67cb50,
> 
> 	    netint=<optimized out>, anint=0) at ust_tests_hello.h:32
> 
> 	#7  main (argc=<optimized out>, argv=<optimized out>) at
> hello.c:92
> 
> 	"""
> 
> 	I hit this segfault 10 out of 10 times I ran “hello” on a VM on
> one vCenter and 0 out of 10 times I ran it on the other, and the VMs
> otherwise had the same software installed on them:
> 
> 	- CentOS 6-based
> 
> 	- kernel-2.6.32-504.1.3.el6 with some minor changes made in
> networking
> 
> 	- userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2
> which might have some minor patches backported, and leftovers of
> changes to get them to build on CentOS 5
> 
> 	On the “good” vCenter, I tested on two different VM hosts:
> 
> 	Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
> 
> 	EVC Mode: Intel(R) "Nehalem" Generation
> 
> 	Image Profile: (Updated) ESXi-5.1.0-799733-standard
> 
> 	Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
> 
> 	EVC Mode: Intel(R) "Nehalem" Generation
> 
> 	Image Profile: (Updated) ESXi-5.1.0-799733-standard
> 
> 	The “bad” vCenter VM host that I tested on had this
> configuration:
> 
> 	ESX Version: VMware ESXi, 5.0.0, 469512
> 
> 	Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
> 
> 	Any ideas?
> 
> 
> My bet would be that the OS is lying to userspace about the
> number of possible CPUs. I wonder what liblttng-ust
> libringbuffer/shm.h num_possible_cpus() is returning compared
> to what lib_ring_buffer_get_cpu() returns.
> 
> 
> Can you check this out ?

Yes, this seems to be the case - 'gdb' on the core dump shows:

(gdb) p __num_possible_cpus
$1 = 2

which is consistent with how I configured the virtual machine, and with this output:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 26
Stepping:              4
CPU MHz:               1995.000
BogoMIPS:              3990.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              18432K
NUMA node0 CPU(s):     0,1

Despite the fact that there are 2 CPUs, when I hacked lttng-ring-buffer-client.h to output the result of lib_ring_buffer_get_cpu() and then ran tests/hello with tracing enabled, I could see it would sit on CPU 0 for a while, or CPU 1, and perhaps move between the two, but eventually either 2 or 3 would appear, immediately followed by the segfault.
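
A toy model of the failure mode, purely illustrative and not the actual lttng-ust data structures, is sketched below: per-CPU slots exist only for the CPUs the kernel says are possible, while the lookup uses whatever CPU number the running thread observes, so an out-of-range CPU id finds no slot and ends in a NULL dereference like the v_read(v_a=0x0) frame above.

"""
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_CPUS 64	/* arbitrary upper bound for the illustration */

int main(void)
{
	long nr_possible = sysconf(_SC_NPROCESSORS_CONF);	/* 2 on the affected VM */
	long *slot[MAX_CPUS] = { 0 };				/* entries with no buffer stay NULL */
	long cpu;

	/* Allocate one slot per "possible" CPU, as reported by the kernel. */
	for (cpu = 0; cpu < nr_possible && cpu < MAX_CPUS; cpu++)
		slot[cpu] = calloc(1, sizeof(long));

	/* Look up the slot for whatever CPU we are actually running on. */
	cpu = sched_getcpu();	/* observed to be 2 or 3 on the bad host */
	if (cpu < 0 || cpu >= MAX_CPUS || !slot[cpu]) {
		fprintf(stderr, "cpu %ld has no slot; dereferencing it would crash\n", cpu);
		return 1;
	}
	printf("cpu %ld has a slot, counter = %ld\n", cpu, *slot[cpu]);
	return 0;
}
"""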

The VM host has 4 sockets, 8 cores per socket, with Hyper-Threading enabled.  The VM has its "HT Sharing" option set to "Any", which according to https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vm_admin.doc_50%2FGUID-101176D4-9866-420D-AB4F-6374025CABDA.html means that each one of the virtual machine's virtual cores can share a physical core with another virtual machine, each virtual core using a different thread on that physical core.  I assume none of this should be relevant except perhaps if there are bugs in VMware.

Is it possible that this is an issue in LTTng, or should I work out how the kernel works out which CPU it is running on and then look into whether there are any VMware bugs in this area?

Thanks in advance,
David

----------------------------------------------------------------------
The information contained in this transmission may be confidential. Any disclosure, copying, or further distribution of confidential information is not permitted unless such privilege is explicitly granted in writing by Quantum. Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt.
_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
       [not found]   ` <20998D40D9A2B7499CA5A3A2666CB1EB2D9E464C@ZURMSG1.QUANTUM.com>
@ 2015-01-12 15:34     ` Mathieu Desnoyers
       [not found]     ` <1205251820.37907.1421076877053.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-01-12 15:34 UTC (permalink / raw)
  To: David OShea; +Cc: lttng-dev

----- Original Message -----
> From: "David OShea" <David.OShea@quantum.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: "lttng-dev" <lttng-dev@lists.lttng.org>
> Sent: Monday, January 12, 2015 1:33:07 AM
> Subject: RE: [lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> - CPU/VMware dependent
> 
> Hi Mathieu,
> 
> Apologies for the delay in getting back to you, please see below:
> 
> > -----Original Message-----
> > From: Mathieu Desnoyers [mailto:mathieu.desnoyers@efficios.com]
> > Sent: Friday, 12 December 2014 2:07 AM
> > To: David OShea
> > Cc: lttng-dev
> > Subject: Re: [lttng-dev] Segfault at v_read() called from
> > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > dependent
> > 
> > ________________________________
> > 
> > 	From: "David OShea" <David.OShea@quantum.com>
> > 	To: "lttng-dev" <lttng-dev@lists.lttng.org>
> > 	Sent: Sunday, December 7, 2014 10:30:04 PM
> > 	Subject: [lttng-dev] Segfault at v_read() called from
> > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > dependent
> > 
> > 
> > 
> > 	Hi all,
> > 
> > 	We have encountered a problem with using LTTng-UST tracing with
> > our application, where on a particular VMware vCenter cluster we almost
> > always get segfaults when tracepoints are enabled, whereas on another
> > vCenter cluster, and on every other machine we’ve ever used, we don’t
> > hit this problem.
> > 
> > 	I can reproduce this using lttng-ust/tests/hello after using:
> > 
> > 	"""
> > 
> > 	lttng create
> > 
> > 	lttng enable-channel channel0 --userspace
> > 
> > 	lttng add-context --userspace -t vpid -t vtid -t procname
> > 
> > 	lttng enable-event --userspace "ust_tests_hello:*" -c channel0
> > 
> > 	lttng start
> > 
> > 	"""
> > 
> > 	In which case I get the following stack trace with an obvious
> > NULL pointer dereference:
> > 
> > 	"""
> > 
> > 	Program terminated with signal SIGSEGV, Segmentation fault.
> > 
> > 	#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> > 
> > 	48              return uatomic_read(&v_a->a);
> > 
> > 	[...]
> > 
> > 	#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> > 
> > 	#1  0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (
> > 
> > 	    buf=0x7f4a98008a00, chan=0x7f4a98008a00,
> > offsets=0x7fffef67c620,
> > 
> > 	    ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677
> > 
> > 	#2  0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow
> > (ctx=0x7fffef67ca40)
> > 
> > 	    at ring_buffer_frontend.c:1819
> > 
> > 	#3  0x00007f4aa1095b75 in lib_ring_buffer_reserve
> > (ctx=0x7fffef67ca40,
> > 
> > 	    config=0x7f4aa12b8ae0 <client_config>)
> > 
> > 	    at ../libringbuffer/frontend_api.h:211
> > 
> > 	#4  lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0)
> > 
> > 	    at lttng-ring-buffer-client.h:473
> > 
> > 	#5  0x000000000040135f in __event_probe__ust_tests_hello___tptest
> > (
> > 
> > 	    __tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50,
> > 
> > 	    text=0x7fffef67cb70 "test", textlen=<optimized out>,
> > doublearg=2,
> > 
> > 	    floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32
> > 
> > 	#6  0x0000000000400d2c in
> > __tracepoint_cb_ust_tests_hello___tptest (
> > 
> > 	    boolarg=true, floatarg=2222, doublearg=2, textlen=4,
> > 
> > 	    text=0x7fffef67cb70 "test", values=0x7fffef67cb50,
> > 
> > 	    netint=<optimized out>, anint=0) at ust_tests_hello.h:32
> > 
> > 	#7  main (argc=<optimized out>, argv=<optimized out>) at
> > hello.c:92
> > 
> > 	"""
> > 
> > 	I hit this segfault 10 out of 10 times I ran “hello” on a VM on
> > one vCenter and 0 out of 10 times I ran it on the other, and the VMs
> > otherwise had the same software installed on them:
> > 
> > 	- CentOS 6-based
> > 
> > 	- kernel-2.6.32-504.1.3.el6 with some minor changes made in
> > networking
> > 
> > 	- userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2
> > which might have some minor patches backported, and leftovers of
> > changes to get them to build on CentOS 5
> > 
> > 	On the “good” vCenter, I tested on two different VM hosts:
> > 
> > 	Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
> > 
> > 	EVC Mode: Intel(R) "Nehalem" Generation
> > 
> > 	Image Profile: (Updated) ESXi-5.1.0-799733-standard
> > 
> > 	Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
> > 
> > 	EVC Mode: Intel(R) "Nehalem" Generation
> > 
> > 	Image Profile: (Updated) ESXi-5.1.0-799733-standard
> > 
> > 	The “bad” vCenter VM host that I tested on had this
> > configuration:
> > 
> > 	ESX Version: VMware ESXi, 5.0.0, 469512
> > 
> > 	Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
> > 
> > 	Any ideas?
> > 
> > 
> > My bet would be that the OS is lying to userspace about the
> > number of possible CPUs. I wonder what liblttng-ust
> > libringbuffer/shm.h num_possible_cpus() is returning compared
> > to what lib_ring_buffer_get_cpu() returns.
> > 
> > 
> > Can you check this out ?
> 
> Yes, this seems to be the case - 'gdb' on the core dump shows:
> 
> (gdb) p __num_possible_cpus
> $1 = 2
> 
> which is consistent with how I configured the virtual machine, which is
> consistent with this output:
> 
> # lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                2
> On-line CPU(s) list:   0,1
> Thread(s) per core:    1
> Core(s) per socket:    1
> Socket(s):             2
> NUMA node(s):          1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 26
> Stepping:              4
> CPU MHz:               1995.000
> BogoMIPS:              3990.00
> Hypervisor vendor:     VMware
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              18432K
> NUMA node0 CPU(s):     0,1
> 
> Despite the fact that there are 2 CPUs, when I hacked
> lttng-ring-buffer-client.h to output the result of lib_ring_buffer_get_cpu()
> and then ran tests/hello with tracing enabled, I could see it would sit on
> CPU 0 for a while, or CPU 1, and perhaps move between the two, but
> eventually either 2 or 3 would appear, immediately followed by the segfault.
> 
> The VM host has 4 sockets, 8 cores per socket, with Hyper-Threading enabled.
> The VM has its "HT Sharing" option set to "Any", which according to
> https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vm_admin.doc_50%2FGUID-101176D4-9866-420D-AB4F-6374025CABDA.html
> means that each one of the virtual machine's virtual cores can share a
> physical core with another virtual machine, each virtual core using a
> different thread on that physical core.  I assume none of this should be
> relevant except perhaps if there are bugs in VMware.
> 
> Is it possible that this is an issue in LTTng, or should I work out how the
> kernel works out which CPU it is running on and then look into whether there
> are any VMware bugs in this area?

This appears to be very likely a VMware bug. /proc/cpuinfo should show
4 CPUs (and sysconf(_SC_NPROCESSORS_CONF) should return 4) if the current
CPU number can be 0, 1, 2, 3 throughout execution.

Thanks,

Mathieu


> 
> Thanks in advance,
> David
> 
> ----------------------------------------------------------------------
> The information contained in this transmission may be confidential. Any
> disclosure, copying, or further distribution of confidential information is
> not permitted unless such privilege is explicitly granted in writing by
> Quantum. Quantum reserves the right to have electronic communications,
> including email and attachments, sent across its networks filtered through
> anti virus and spam software programs and retain such messages in order to
> comply with applicable data security and retention requirements. Quantum is
> not responsible for the proper and complete transmission of the substance of
> this communication or for any delay in its receipt.
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
       [not found]     ` <1205251820.37907.1421076877053.JavaMail.zimbra@efficios.com>
@ 2015-01-12 15:36       ` Mathieu Desnoyers
       [not found]       ` <1686087516.37913.1421076968595.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-01-12 15:36 UTC (permalink / raw)
  To: David OShea; +Cc: lttng-dev

----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> To: "David OShea" <David.OShea@quantum.com>
> Cc: "lttng-dev" <lttng-dev@lists.lttng.org>
> Sent: Monday, January 12, 2015 10:34:37 AM
> Subject: Re: [lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> - CPU/VMware dependent
> 
> ----- Original Message -----
> > From: "David OShea" <David.OShea@quantum.com>
> > To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> > Cc: "lttng-dev" <lttng-dev@lists.lttng.org>
> > Sent: Monday, January 12, 2015 1:33:07 AM
> > Subject: RE: [lttng-dev] Segfault at v_read() called from
> > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> > - CPU/VMware dependent
> > 
> > Hi Mathieu,
> > 
> > Apologies for the delay in getting back to you, please see below:
> > 
> > > -----Original Message-----
> > > From: Mathieu Desnoyers [mailto:mathieu.desnoyers@efficios.com]
> > > Sent: Friday, 12 December 2014 2:07 AM
> > > To: David OShea
> > > Cc: lttng-dev
> > > Subject: Re: [lttng-dev] Segfault at v_read() called from
> > > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > > dependent
> > > 
> > > ________________________________
> > > 
> > > 	From: "David OShea" <David.OShea@quantum.com>
> > > 	To: "lttng-dev" <lttng-dev@lists.lttng.org>
> > > 	Sent: Sunday, December 7, 2014 10:30:04 PM
> > > 	Subject: [lttng-dev] Segfault at v_read() called from
> > > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > > dependent
> > > 
> > > 
> > > 
> > > 	Hi all,
> > > 
> > > 	We have encountered a problem with using LTTng-UST tracing with
> > > our application, where on a particular VMware vCenter cluster we almost
> > > always get segfaults when tracepoints are enabled, whereas on another
> > > vCenter cluster, and on every other machine we’ve ever used, we don’t
> > > hit this problem.
> > > 
> > > 	I can reproduce this using lttng-ust/tests/hello after using:
> > > 
> > > 	"""
> > > 
> > > 	lttng create
> > > 
> > > 	lttng enable-channel channel0 --userspace
> > > 
> > > 	lttng add-context --userspace -t vpid -t vtid -t procname
> > > 
> > > 	lttng enable-event --userspace "ust_tests_hello:*" -c channel0
> > > 
> > > 	lttng start
> > > 
> > > 	"""
> > > 
> > > 	In which case I get the following stack trace with an obvious
> > > NULL pointer dereference:
> > > 
> > > 	"""
> > > 
> > > 	Program terminated with signal SIGSEGV, Segmentation fault.
> > > 
> > > 	#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> > > 
> > > 	48              return uatomic_read(&v_a->a);
> > > 
> > > 	[...]
> > > 
> > > 	#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> > > 
> > > 	#1  0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (
> > > 
> > > 	    buf=0x7f4a98008a00, chan=0x7f4a98008a00,
> > > offsets=0x7fffef67c620,
> > > 
> > > 	    ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677
> > > 
> > > 	#2  0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow
> > > (ctx=0x7fffef67ca40)
> > > 
> > > 	    at ring_buffer_frontend.c:1819
> > > 
> > > 	#3  0x00007f4aa1095b75 in lib_ring_buffer_reserve
> > > (ctx=0x7fffef67ca40,
> > > 
> > > 	    config=0x7f4aa12b8ae0 <client_config>)
> > > 
> > > 	    at ../libringbuffer/frontend_api.h:211
> > > 
> > > 	#4  lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0)
> > > 
> > > 	    at lttng-ring-buffer-client.h:473
> > > 
> > > 	#5  0x000000000040135f in __event_probe__ust_tests_hello___tptest
> > > (
> > > 
> > > 	    __tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50,
> > > 
> > > 	    text=0x7fffef67cb70 "test", textlen=<optimized out>,
> > > doublearg=2,
> > > 
> > > 	    floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32
> > > 
> > > 	#6  0x0000000000400d2c in
> > > __tracepoint_cb_ust_tests_hello___tptest (
> > > 
> > > 	    boolarg=true, floatarg=2222, doublearg=2, textlen=4,
> > > 
> > > 	    text=0x7fffef67cb70 "test", values=0x7fffef67cb50,
> > > 
> > > 	    netint=<optimized out>, anint=0) at ust_tests_hello.h:32
> > > 
> > > 	#7  main (argc=<optimized out>, argv=<optimized out>) at
> > > hello.c:92
> > > 
> > > 	"""
> > > 
> > > 	I hit this segfault 10 out of 10 times I ran “hello” on a VM on
> > > one vCenter and 0 out of 10 times I ran it on the other, and the VMs
> > > otherwise had the same software installed on them:
> > > 
> > > 	- CentOS 6-based
> > > 
> > > 	- kernel-2.6.32-504.1.3.el6 with some minor changes made in
> > > networking
> > > 
> > > 	- userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2
> > > which might have some minor patches backported, and leftovers of
> > > changes to get them to build on CentOS 5
> > > 
> > > 	On the “good” vCenter, I tested on two different VM hosts:
> > > 
> > > 	Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
> > > 
> > > 	EVC Mode: Intel(R) "Nehalem" Generation
> > > 
> > > 	Image Profile: (Updated) ESXi-5.1.0-799733-standard
> > > 
> > > 	Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
> > > 
> > > 	EVC Mode: Intel(R) "Nehalem" Generation
> > > 
> > > 	Image Profile: (Updated) ESXi-5.1.0-799733-standard
> > > 
> > > 	The “bad” vCenter VM host that I tested on had this
> > > configuration:
> > > 
> > > 	ESX Version: VMware ESXi, 5.0.0, 469512
> > > 
> > > 	Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
> > > 
> > > 	Any ideas?
> > > 
> > > 
> > > My bet would be that the OS is lying to userspace about the
> > > number of possible CPUs. I wonder what liblttng-ust
> > > libringbuffer/shm.h num_possible_cpus() is returning compared
> > > to what lib_ring_buffer_get_cpu() returns.
> > > 
> > > 
> > > Can you check this out ?
> > 
> > Yes, this seems to be the case - 'gdb' on the core dump shows:
> > 
> > (gdb) p __num_possible_cpus
> > $1 = 2
> > 
> > which is consistent with how I configured the virtual machine, which is
> > consistent with this output:
> > 
> > # lscpu
> > Architecture:          x86_64
> > CPU op-mode(s):        32-bit, 64-bit
> > Byte Order:            Little Endian
> > CPU(s):                2
> > On-line CPU(s) list:   0,1
> > Thread(s) per core:    1
> > Core(s) per socket:    1
> > Socket(s):             2
> > NUMA node(s):          1
> > Vendor ID:             GenuineIntel
> > CPU family:            6
> > Model:                 26
> > Stepping:              4
> > CPU MHz:               1995.000
> > BogoMIPS:              3990.00
> > Hypervisor vendor:     VMware
> > Virtualization type:   full
> > L1d cache:             32K
> > L1i cache:             32K
> > L2 cache:              256K
> > L3 cache:              18432K
> > NUMA node0 CPU(s):     0,1
> > 
> > Despite the fact that there are 2 CPUs, when I hacked
> > lttng-ring-buffer-client.h to output the result of
> > lib_ring_buffer_get_cpu()
> > and then ran tests/hello with tracing enabled, I could see it would sit on
> > CPU 0 for a while, or CPU 1, and perhaps move between the two, but
> > eventually either 2 or 3 would appear, immediately followed by the
> > segfault.
> > 
> > The VM host has 4 sockets, 8 cores per socket, with Hyper-Threading
> > enabled.
> > The VM has its "HT Sharing" option set to "Any", which according to
> > https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vm_admin.doc_50%2FGUID-101176D4-9866-420D-AB4F-6374025CABDA.html
> > means that each one of the virtual machine's virtual cores can share a
> > physical core with another virtual machine, each virtual core using a
> > different thread on that physical core.  I assume none of this should be
> > relevant except perhaps if there are bugs in VMware.
> > 
> > Is it possible that this is an issue in LTTng, or should I work out how the
> > kernel works out which CPU it is running on and then look into whether
> > there
> > are any VMware bugs in this area?
> 
> This appears to be very likely a VMware bug. /proc/cpuinfo should show
> 4 CPUs (and sysconf(_SC_NPROCESSORS_CONF) should return 4) if the current
> CPU number can be 0, 1, 2, 3 throughout execution.

You might want to look at the sysconf(3) manpage, especially the parts about
_SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN. My guess is that vmware is lying
about the number of "possible" CPUs (_SC_NPROCESSORS_CONF).
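
A few lines of C are enough to print both values on the affected guest (a standalone
check, not LTTng code):

"""
#include <stdio.h>
#include <unistd.h>

/* Print the configured and online CPU counts discussed above. */
int main(void)
{
	printf("_SC_NPROCESSORS_CONF = %ld\n", sysconf(_SC_NPROCESSORS_CONF));
	printf("_SC_NPROCESSORS_ONLN = %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
	return 0;
}
"""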

Thanks,

Mathieu


> 
> Thanks,
> 
> Mathieu
> 
> 
> > 
> > Thanks in advance,
> > David
> > 
> > ----------------------------------------------------------------------
> > The information contained in this transmission may be confidential. Any
> > disclosure, copying, or further distribution of confidential information is
> > not permitted unless such privilege is explicitly granted in writing by
> > Quantum. Quantum reserves the right to have electronic communications,
> > including email and attachments, sent across its networks filtered through
> > anti virus and spam software programs and retain such messages in order to
> > comply with applicable data security and retention requirements. Quantum is
> > not responsible for the proper and complete transmission of the substance
> > of
> > this communication or for any delay in its receipt.
> > 
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
       [not found]       ` <1686087516.37913.1421076968595.JavaMail.zimbra@efficios.com>
@ 2015-01-15  2:45         ` David OShea
       [not found]         ` <20998D40D9A2B7499CA5A3A2666CB1EB2D9E72E6@ZURMSG1.QUANTUM.com>
  1 sibling, 0 replies; 9+ messages in thread
From: David OShea @ 2015-01-15  2:45 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev

Hi Mathieu,

> -----Original Message-----
> From: Mathieu Desnoyers [mailto:mathieu.desnoyers@efficios.com]
> Sent: Tuesday, 13 January 2015 2:06 AM
> To: David OShea
> Cc: lttng-dev
> Subject: Re: [lttng-dev] Segfault at v_read() called from
> lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> dependent
[...]
> > > Is it possible that this is an issue in LTTng, or should I work out
> how the
> > > kernel works out which CPU it is running on and then look into
> whether
> > > there
> > > are any VMware bugs in this area?
> >
> > This appears to be very likely a VMware bug. /proc/cpuinfo should
> show
> > 4 CPUs (and sysconf(_SC_NPROCESSORS_CONF) should return 4) if the
> current
> > CPU number can be 0, 1, 2, 3 throughout execution.

/proc/cpuinfo shows two CPUs:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
stepping        : 4
microcode       : 8
cpu MHz         : 1995.000
cache size      : 18432 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips        : 3990.00
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
stepping        : 4
microcode       : 8
cpu MHz         : 1995.000
cache size      : 18432 KB
physical id     : 2
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips        : 3990.00
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

> You might want to look at the sysconf(3) manpage, especially the parts
> about
> _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN. My guess is that vmware
> is lying
> about the number of "possible" CPUs (_SC_NPROCESSORS_CONF).

_SC_NPROCESSORS_CONF = 2
_SC_NPROCESSORS_ONLN = 2

Thanks for the pointers, I will look into possible VMware bugs.

Out of curiosity, what happens if I happened to have a system with hot-pluggable CPUs - does _SC_NPROCESSORS_CONF reflect the maximum number of CPUs I can insert, and that is how many LTTng will support?

Thanks,
David

----------------------------------------------------------------------
The information contained in this transmission may be confidential. Any disclosure, copying, or further distribution of confidential information is not permitted unless such privilege is explicitly granted in writing by Quantum. Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
       [not found]         ` <20998D40D9A2B7499CA5A3A2666CB1EB2D9E72E6@ZURMSG1.QUANTUM.com>
@ 2015-01-15  2:50           ` Mathieu Desnoyers
       [not found]           ` <214724236.41002.1421290246531.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-01-15  2:50 UTC (permalink / raw)
  To: David OShea; +Cc: lttng-dev

----- Original Message -----
> From: "David OShea" <David.OShea@quantum.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: "lttng-dev" <lttng-dev@lists.lttng.org>
> Sent: Wednesday, January 14, 2015 9:45:01 PM
> Subject: RE: [lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> - CPU/VMware dependent
> 
> Hi Mathieu,
> 
> > -----Original Message-----
> > From: Mathieu Desnoyers [mailto:mathieu.desnoyers@efficios.com]
> > Sent: Tuesday, 13 January 2015 2:06 AM
> > To: David OShea
> > Cc: lttng-dev
> > Subject: Re: [lttng-dev] Segfault at v_read() called from
> > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > dependent
> [...]
> > > > Is it possible that this is an issue in LTTng, or should I work out
> > how the
> > > > kernel works out which CPU it is running on and then look into
> > whether
> > > > there
> > > > are any VMware bugs in this area?
> > >
> > > This appears to be very likely a VMware bug. /proc/cpuinfo should
> > show
> > > 4 CPUs (and sysconf(_SC_NPROCESSORS_CONF) should return 4) if the
> > current
> > > CPU number can be 0, 1, 2, 3 throughout execution.
> 
> /proc/cpuinfo shows two CPUs:
> 
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 26
> model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
> stepping        : 4
> microcode       : 8
> cpu MHz         : 1995.000
> cache size      : 18432 KB
> physical id     : 0
> siblings        : 1
> core id         : 0
> cpu cores       : 1
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 11
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm
> constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc
> aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor
> lahf_lm ida dts
> bogomips        : 3990.00
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 40 bits physical, 48 bits virtual
> power management:
> 
> processor       : 1
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 26
> model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
> stepping        : 4
> microcode       : 8
> cpu MHz         : 1995.000
> cache size      : 18432 KB
> physical id     : 2
> siblings        : 1
> core id         : 0
> cpu cores       : 1
> apicid          : 2
> initial apicid  : 2
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 11
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm
> constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc
> aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor
> lahf_lm ida dts
> bogomips        : 3990.00
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 40 bits physical, 48 bits virtual
> power management:
> 
> > You might want to look at the sysconf(3) manpage, especially the parts
> > about
> > _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN. My guess is that vmware
> > is lying
> > about the number of "possible" CPUs (_SC_NPROCESSORS_CONF).
> 
> _SC_NPROCESSORS_CONF = 2
> _SC_NPROCESSORS_ONLN = 2
> 
> Thanks for the pointers, I will look into possible VMware bugs.
> 
> Out of curiosity, what happens if I happened to have a system with
> hot-pluggable CPUs - does _SC_NPROCESSORS_CONF reflect the maximum number of
> CPUs I can insert, and that is how many LTTng will support?

Yes, exactly.

Thanks,

Mathieu

> 
> Thanks,
> David
> 
> ----------------------------------------------------------------------
> The information contained in this transmission may be confidential. Any
> disclosure, copying, or further distribution of confidential information is
> not permitted unless such privilege is explicitly granted in writing by
> Quantum. Quantum reserves the right to have electronic communications,
> including email and attachments, sent across its networks filtered through
> anti virus and spam software programs and retain such messages in order to
> comply with applicable data security and retention requirements. Quantum is
> not responsible for the proper and complete transmission of the substance of
> this communication or for any delay in its receipt.
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
       [not found]           ` <214724236.41002.1421290246531.JavaMail.zimbra@efficios.com>
@ 2015-09-03  2:14             ` David OShea
       [not found]             ` <20998D40D9A2B7499CA5A3A2666CB1EB5EAD6F13@ZURMSG1.QUANTUM.com>
  1 sibling, 0 replies; 9+ messages in thread
From: David OShea @ 2015-09-03  2:14 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev

For the record, it appears that upgrading from VMware ESXi version 5.0.0, 469512 to version 5.5.0, 2068190 ("Update 2") resolved this issue.  However, we had other hosts running version 5.1.0, 799733, which should have been set to the same CPU architecture (Nehalem) and which didn't have the issue, so presumably the fix was already included in that version.

Thanks,
David

> -----Original Message-----
> From: Mathieu Desnoyers [mailto:mathieu.desnoyers@efficios.com]
> Sent: Thursday, 15 January 2015 1:21 PM
> To: David OShea
> Cc: lttng-dev
> Subject: Re: [lttng-dev] Segfault at v_read() called from
> lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> dependent
> 
> ----- Original Message -----
> > From: "David OShea" <David.OShea@quantum.com>
> > To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> > Cc: "lttng-dev" <lttng-dev@lists.lttng.org>
> > Sent: Wednesday, January 14, 2015 9:45:01 PM
> > Subject: RE: [lttng-dev] Segfault at v_read() called from
> lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> > - CPU/VMware dependent
> >
> > Hi Mathieu,
> >
> > > -----Original Message-----
> > > From: Mathieu Desnoyers [mailto:mathieu.desnoyers@efficios.com]
> > > Sent: Tuesday, 13 January 2015 2:06 AM
> > > To: David OShea
> > > Cc: lttng-dev
> > > Subject: Re: [lttng-dev] Segfault at v_read() called from
> > > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app -
> CPU/VMware
> > > dependent
> > [...]
> > > > > Is it possible that this is an issue in LTTng, or should I work
> out
> > > how the
> > > > > kernel works out which CPU it is running on and then look into
> > > whether
> > > > > there
> > > > > are any VMware bugs in this area?
> > > >
> > > > This appears to be very likely a VMware bug. /proc/cpuinfo should
> > > show
> > > > 4 CPUs (and sysconf(_SC_NPROCESSORS_CONF) should return 4) if the
> > > current
> > > > CPU number can be 0, 1, 2, 3 throughout execution.
> >
> > /proc/cpuinfo shows two CPUs:
> >
> > processor       : 0
> > vendor_id       : GenuineIntel
> > cpu family      : 6
> > model           : 26
> > model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
> > stepping        : 4
> > microcode       : 8
> > cpu MHz         : 1995.000
> > cache size      : 18432 KB
> > physical id     : 0
> > siblings        : 1
> > core id         : 0
> > cpu cores       : 1
> > apicid          : 0
> > initial apicid  : 0
> > fpu             : yes
> > fpu_exception   : yes
> > cpuid level     : 11
> > wp              : yes
> > flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca
> > cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx
> rdtscp lm
> > constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc
> > aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt
> hypervisor
> > lahf_lm ida dts
> > bogomips        : 3990.00
> > clflush size    : 64
> > cache_alignment : 64
> > address sizes   : 40 bits physical, 48 bits virtual
> > power management:
> >
> > processor       : 1
> > vendor_id       : GenuineIntel
> > cpu family      : 6
> > model           : 26
> > model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
> > stepping        : 4
> > microcode       : 8
> > cpu MHz         : 1995.000
> > cache size      : 18432 KB
> > physical id     : 2
> > siblings        : 1
> > core id         : 0
> > cpu cores       : 1
> > apicid          : 2
> > initial apicid  : 2
> > fpu             : yes
> > fpu_exception   : yes
> > cpuid level     : 11
> > wp              : yes
> > flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca
> > cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx
> rdtscp lm
> > constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc
> > aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt
> hypervisor
> > lahf_lm ida dts
> > bogomips        : 3990.00
> > clflush size    : 64
> > cache_alignment : 64
> > address sizes   : 40 bits physical, 48 bits virtual
> > power management:
> >
> > > You might want to look at the sysconf(3) manpage, especially the
> parts
> > > about
> > > _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN. My guess is that
> vmware
> > > is lying
> > > about the number of "possible" CPUs (_SC_NPROCESSORS_CONF).
> >
> > _SC_NPROCESSORS_CONF = 2
> > _SC_NPROCESSORS_ONLN = 2
> >
> > Thanks for the pointers, I will look into possible VMware bugs.
> >
> > Out of curiosity, what happens if I happened to have a system with
> > hot-pluggable CPUs - does _SC_NPROCESSORS_CONF reflect the maximum
> number of
> > CPUs I can insert, and that is how many LTTng will support?
> 
> Yes, exactly.
> 
> Thanks,
> 
> Mathieu
> 
> >
> > Thanks,
> > David
> >
> > ---------------------------------------------------------------------
> -
> > The information contained in this transmission may be confidential.
> Any
> > disclosure, copying, or further distribution of confidential
> information is
> > not permitted unless such privilege is explicitly granted in writing
> by
> > Quantum. Quantum reserves the right to have electronic
> communications,
> > including email and attachments, sent across its networks filtered
> through
> > anti virus and spam software programs and retain such messages in
> order to
> > comply with applicable data security and retention requirements.
> Quantum is
> > not responsible for the proper and complete transmission of the
> substance of
> > this communication or for any delay in its receipt.
> >
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
       [not found]             ` <20998D40D9A2B7499CA5A3A2666CB1EB5EAD6F13@ZURMSG1.QUANTUM.com>
@ 2015-09-03 15:25               ` Mathieu Desnoyers
  0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-09-03 15:25 UTC (permalink / raw)
  To: David OShea; +Cc: lttng-dev

----- On Sep 2, 2015, at 10:14 PM, David OShea David.OShea@quantum.com wrote:

> For the record, it appears that upgrading from VMware ESXi version 5.0.0, 469512
> to version 5.5.0, 2068190 ("Update 2") resolved this issue.  However, we had
> other hosts running version 5.1.0, 799733 which should have been set to the
> same CPU architecture (Nehalem) which didn't have the issue, so presumably the
> fix was included in that version.

That's good news, thanks for letting us know!

Mathieu

> 
> Thanks,
> David
> 
>> -----Original Message-----
>> From: Mathieu Desnoyers [mailto:mathieu.desnoyers@efficios.com]
>> Sent: Thursday, 15 January 2015 1:21 PM
>> To: David OShea
>> Cc: lttng-dev
>> Subject: Re: [lttng-dev] Segfault at v_read() called from
>> lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
>> dependent
>> 
>> ----- Original Message -----
>> > From: "David OShea" <David.OShea@quantum.com>
>> > To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
>> > Cc: "lttng-dev" <lttng-dev@lists.lttng.org>
>> > Sent: Wednesday, January 14, 2015 9:45:01 PM
>> > Subject: RE: [lttng-dev] Segfault at v_read() called from
>> lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
>> > - CPU/VMware dependent
>> >
>> > Hi Mathieu,
>> >
>> > > -----Original Message-----
>> > > From: Mathieu Desnoyers [mailto:mathieu.desnoyers@efficios.com]
>> > > Sent: Tuesday, 13 January 2015 2:06 AM
>> > > To: David OShea
>> > > Cc: lttng-dev
>> > > Subject: Re: [lttng-dev] Segfault at v_read() called from
>> > > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app -
>> CPU/VMware
>> > > dependent
>> > [...]
>> > > > > Is it possible that this is an issue in LTTng, or should I work
>> out
>> > > how the
>> > > > > kernel works out which CPU it is running on and then look into
>> > > whether
>> > > > > there
>> > > > > are any VMware bugs in this area?
>> > > >
>> > > > This appears to be very likely a VMware bug. /proc/cpuinfo should
>> > > show
>> > > > 4 CPUs (and sysconf(_SC_NPROCESSORS_CONF) should return 4) if the
>> > > current
>> > > > CPU number can be 0, 1, 2, 3 throughout execution.
>> >
>> > /proc/cpuinfo shows two CPUs:
>> >
>> > processor       : 0
>> > vendor_id       : GenuineIntel
>> > cpu family      : 6
>> > model           : 26
>> > model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
>> > stepping        : 4
>> > microcode       : 8
>> > cpu MHz         : 1995.000
>> > cache size      : 18432 KB
>> > physical id     : 0
>> > siblings        : 1
>> > core id         : 0
>> > cpu cores       : 1
>> > apicid          : 0
>> > initial apicid  : 0
>> > fpu             : yes
>> > fpu_exception   : yes
>> > cpuid level     : 11
>> > wp              : yes
>> > flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
>> pge mca
>> > cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx
>> rdtscp lm
>> > constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc
>> > aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt
>> hypervisor
>> > lahf_lm ida dts
>> > bogomips        : 3990.00
>> > clflush size    : 64
>> > cache_alignment : 64
>> > address sizes   : 40 bits physical, 48 bits virtual
>> > power management:
>> >
>> > processor       : 1
>> > vendor_id       : GenuineIntel
>> > cpu family      : 6
>> > model           : 26
>> > model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
>> > stepping        : 4
>> > microcode       : 8
>> > cpu MHz         : 1995.000
>> > cache size      : 18432 KB
>> > physical id     : 2
>> > siblings        : 1
>> > core id         : 0
>> > cpu cores       : 1
>> > apicid          : 2
>> > initial apicid  : 2
>> > fpu             : yes
>> > fpu_exception   : yes
>> > cpuid level     : 11
>> > wp              : yes
>> > flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
>> pge mca
>> > cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx
>> rdtscp lm
>> > constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc
>> > aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt
>> hypervisor
>> > lahf_lm ida dts
>> > bogomips        : 3990.00
>> > clflush size    : 64
>> > cache_alignment : 64
>> > address sizes   : 40 bits physical, 48 bits virtual
>> > power management:
>> >
>> > > You might want to look at the sysconf(3) manpage, especially the
>> parts
>> > > about
>> > > _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN. My guess is that
>> vmware
>> > > is lying
>> > > about the number of "possible" CPUs (_SC_NPROCESSORS_CONF).
>> >
>> > _SC_NPROCESSORS_CONF = 2
>> > _SC_NPROCESSORS_ONLN = 2
>> >
>> > Thanks for the pointers, I will look into possible VMware bugs.
>> >
>> > Out of curiosity, what happens if I happened to have a system with
>> > hot-pluggable CPUs - does _SC_NPROCESSORS_CONF reflect the maximum
>> number of
>> > CPUs I can insert, and that is how many LTTng will support?
>> 
>> Yes, exactly.
>> 
>> Thanks,
>> 
>> Mathieu
>> 
>> >
>> > Thanks,
>> > David
>> >
>> > ---------------------------------------------------------------------
>> -
>> > The information contained in this transmission may be confidential.
>> Any
>> > disclosure, copying, or further distribution of confidential
>> information is
>> > not permitted unless such privilege is explicitly granted in writing
>> by
>> > Quantum. Quantum reserves the right to have electronic
>> communications,
>> > including email and attachments, sent across its networks filtered
>> through
>> > anti virus and spam software programs and retain such messages in
>> order to
>> > comply with applicable data security and retention requirements.
>> Quantum is
>> > not responsible for the proper and complete transmission of the
>> substance of
>> > this communication or for any delay in its receipt.
>> >
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
@ 2014-12-08  3:30 David OShea
  0 siblings, 0 replies; 9+ messages in thread
From: David OShea @ 2014-12-08  3:30 UTC (permalink / raw)
  To: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 3802 bytes --]

Hi all,

We have encountered a problem with using LTTng-UST tracing with our application, where on a particular VMware vCenter cluster we almost always get segfaults when tracepoints are enabled, whereas on another vCenter cluster, and on every other machine we've ever used, we don't hit this problem.

I can reproduce this using lttng-ust/tests/hello after using:

"""
lttng create
lttng enable-channel channel0 --userspace
lttng add-context --userspace -t vpid -t vtid -t procname
lttng enable-event --userspace "ust_tests_hello:*" -c channel0
lttng start
"""

In which case I get the following stack trace with an obvious NULL pointer dereference:

"""
Program terminated with signal SIGSEGV, Segmentation fault.
#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
48              return uatomic_read(&v_a->a);
[...]
#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
#1  0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (
    buf=0x7f4a98008a00, chan=0x7f4a98008a00, offsets=0x7fffef67c620,
    ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677
#2  0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow (ctx=0x7fffef67ca40)
    at ring_buffer_frontend.c:1819
#3  0x00007f4aa1095b75 in lib_ring_buffer_reserve (ctx=0x7fffef67ca40,
    config=0x7f4aa12b8ae0 <client_config>)
    at ../libringbuffer/frontend_api.h:211
#4  lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0)
    at lttng-ring-buffer-client.h:473
#5  0x000000000040135f in __event_probe__ust_tests_hello___tptest (
    __tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50,
    text=0x7fffef67cb70 "test", textlen=<optimized out>, doublearg=2,
    floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32
#6  0x0000000000400d2c in __tracepoint_cb_ust_tests_hello___tptest (
    boolarg=true, floatarg=2222, doublearg=2, textlen=4,
    text=0x7fffef67cb70 "test", values=0x7fffef67cb50,
    netint=<optimized out>, anint=0) at ust_tests_hello.h:32
#7  main (argc=<optimized out>, argv=<optimized out>) at hello.c:92
"""

I hit this segfault 10 out of 10 times I ran "hello" on a VM on one vCenter and 0 out of 10 times I ran it on the other, and the VMs otherwise had the same software installed on them:

- CentOS 6-based
- kernel-2.6.32-504.1.3.el6 with some minor changes made in networking
- userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2 which might have some minor patches backported, and leftovers of changes to get them to build on CentOS 5

On the "good" vCenter, I tested on two different VM hosts:

Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
EVC Mode: Intel(R) "Nehalem" Generation
Image Profile: (Updated) ESXi-5.1.0-799733-standard

Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
EVC Mode: Intel(R) "Nehalem" Generation
Image Profile: (Updated) ESXi-5.1.0-799733-standard

The "bad" vCenter VM host that I tested on had this configuration:

ESX Version: VMware ESXi, 5.0.0, 469512
Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz

Any ideas?

Thanks in advance,
David

----------------------------------------------------------------------
The information contained in this transmission may be confidential. Any disclosure, copying, or further distribution of confidential information is not permitted unless such privilege is explicitly granted in writing by Quantum. Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt.

[-- Attachment #1.2: Type: text/html, Size: 8318 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-09-03 15:26 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20998D40D9A2B7499CA5A3A2666CB1EB2D9DDA59@ZURMSG1.QUANTUM.com>
2014-12-11 15:36 ` Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent Mathieu Desnoyers
     [not found] ` <1745838195.26177.1418312190246.JavaMail.zimbra@efficios.com>
2015-01-12  6:33   ` David OShea
     [not found]   ` <20998D40D9A2B7499CA5A3A2666CB1EB2D9E464C@ZURMSG1.QUANTUM.com>
2015-01-12 15:34     ` Mathieu Desnoyers
     [not found]     ` <1205251820.37907.1421076877053.JavaMail.zimbra@efficios.com>
2015-01-12 15:36       ` Mathieu Desnoyers
     [not found]       ` <1686087516.37913.1421076968595.JavaMail.zimbra@efficios.com>
2015-01-15  2:45         ` David OShea
     [not found]         ` <20998D40D9A2B7499CA5A3A2666CB1EB2D9E72E6@ZURMSG1.QUANTUM.com>
2015-01-15  2:50           ` Mathieu Desnoyers
     [not found]           ` <214724236.41002.1421290246531.JavaMail.zimbra@efficios.com>
2015-09-03  2:14             ` David OShea
     [not found]             ` <20998D40D9A2B7499CA5A3A2666CB1EB5EAD6F13@ZURMSG1.QUANTUM.com>
2015-09-03 15:25               ` Mathieu Desnoyers
2014-12-08  3:30 David OShea
