* [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
@ 2021-01-28 21:08 mwilck
  2021-02-02 20:52 ` Martin Wilck
  2021-02-02 22:23 ` Benjamin Marzinski
  0 siblings, 2 replies; 34+ messages in thread
From: mwilck @ 2021-01-28 21:08 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel, Martin Wilck

From: Martin Wilck <mwilck@suse.com>

Crashes have been observed in the unwinder stack of uevent_listen().
This can only be explained by "udev" not being a valid object at that
time. Be sure to pass a valid pointer, and don't call udev_unref() if
it has been set to NULL already.

I'm not quite sure how this would come to pass, as we join the threads
before setting udev to NULL, but this is unwinder code, so I guess it
might actually be executed after the thread has terminated.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/uevent.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/libmultipath/uevent.c b/libmultipath/uevent.c
index d3061bf..4e662ff 100644
--- a/libmultipath/uevent.c
+++ b/libmultipath/uevent.c
@@ -397,10 +397,11 @@ service_uevq(struct list_head *tmpq)
 
 static void uevent_cleanup(void *arg)
 {
-	struct udev *udev = arg;
+	struct udev **pudev = arg;
 
+	if (*pudev)
+		udev_unref(*pudev);
 	condlog(3, "Releasing uevent_listen() resources");
-	udev_unref(udev);
 }
 
 static void monitor_cleanup(void *arg)
@@ -560,7 +561,7 @@ int uevent_listen(struct udev *udev)
 		return 1;
 	}
 	udev_ref(udev);
-	pthread_cleanup_push(uevent_cleanup, udev);
+	pthread_cleanup_push(uevent_cleanup, &udev);
 
 	monitor = udev_monitor_new_from_netlink(udev, "udev");
 	if (!monitor) {
-- 
2.29.2


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-01-28 21:08 [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup() mwilck
@ 2021-02-02 20:52 ` Martin Wilck
  2021-02-03 10:48   ` lixiaokeng
  2021-02-02 22:23 ` Benjamin Marzinski
  1 sibling, 1 reply; 34+ messages in thread
From: Martin Wilck @ 2021-02-02 20:52 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel

lixiaokeng,

did this fix your "crash on exit" issue?

Martin

On Thu, 2021-01-28 at 22:08 +0100, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
> 
> Crashes have been observed in the unwinder stack of uevent_listen().
> This can only be explained by "udev" not being a valid object at that
> time. Be sure to pass a valid pointer, and don't call udev_unref() if
> it has been set to NULL already.
> 
> I'm not quite sure how this would come to pass, as we join the
> threads
> before setting udev to NULL, but this is unwinder code, so I guess it
> might actually be executed after the thread has terminated.
> 
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  libmultipath/uevent.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/libmultipath/uevent.c b/libmultipath/uevent.c
> index d3061bf..4e662ff 100644
> --- a/libmultipath/uevent.c
> +++ b/libmultipath/uevent.c
> @@ -397,10 +397,11 @@ service_uevq(struct list_head *tmpq)
>  
>  static void uevent_cleanup(void *arg)
>  {
> -       struct udev *udev = arg;
> +       struct udev **pudev = arg;
>  
> +       if (*pudev)
> +               udev_unref(*pudev);
>         condlog(3, "Releasing uevent_listen() resources");
> -       udev_unref(udev);
>  }
>  
>  static void monitor_cleanup(void *arg)
> @@ -560,7 +561,7 @@ int uevent_listen(struct udev *udev)
>                 return 1;
>         }
>         udev_ref(udev);
> -       pthread_cleanup_push(uevent_cleanup, udev);
> +       pthread_cleanup_push(uevent_cleanup, &udev);
>  
>         monitor = udev_monitor_new_from_netlink(udev, "udev");
>         if (!monitor) {




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-01-28 21:08 [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup() mwilck
  2021-02-02 20:52 ` Martin Wilck
@ 2021-02-02 22:23 ` Benjamin Marzinski
  1 sibling, 0 replies; 34+ messages in thread
From: Benjamin Marzinski @ 2021-02-02 22:23 UTC (permalink / raw)
  To: mwilck; +Cc: lixiaokeng, dm-devel

On Thu, Jan 28, 2021 at 10:08:52PM +0100, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
> 
> Crashes have been observed in the unwinder stack of uevent_listen().
> This can only be explained by "udev" not being a valid object at that
> time. Be sure to pass a valid pointer, and don't call udev_unref() if
> it has been set to NULL already.
> 
> I'm not quite sure how this would come to pass, as we join the threads
> before setting udev to NULL, but this is unwinder code, so I guess it
> might actually be executed after the thread has terminated.
> 
> Signed-off-by: Martin Wilck <mwilck@suse.com>

Assuming that this does resolve the issue,

Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
>  libmultipath/uevent.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/libmultipath/uevent.c b/libmultipath/uevent.c
> index d3061bf..4e662ff 100644
> --- a/libmultipath/uevent.c
> +++ b/libmultipath/uevent.c
> @@ -397,10 +397,11 @@ service_uevq(struct list_head *tmpq)
>  
>  static void uevent_cleanup(void *arg)
>  {
> -	struct udev *udev = arg;
> +	struct udev **pudev = arg;
>  
> +	if (*pudev)
> +		udev_unref(*pudev);
>  	condlog(3, "Releasing uevent_listen() resources");
> -	udev_unref(udev);
>  }
>  
>  static void monitor_cleanup(void *arg)
> @@ -560,7 +561,7 @@ int uevent_listen(struct udev *udev)
>  		return 1;
>  	}
>  	udev_ref(udev);
> -	pthread_cleanup_push(uevent_cleanup, udev);
> +	pthread_cleanup_push(uevent_cleanup, &udev);
>  
>  	monitor = udev_monitor_new_from_netlink(udev, "udev");
>  	if (!monitor) {
> -- 
> 2.29.2


* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-02 20:52 ` Martin Wilck
@ 2021-02-03 10:48   ` lixiaokeng
  2021-02-03 13:57     ` Martin Wilck
  2021-02-08  7:41     ` lixiaokeng
  0 siblings, 2 replies; 34+ messages in thread
From: lixiaokeng @ 2021-02-03 10:48 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel



On 2021/2/3 4:52, Martin Wilck wrote:
> did this fix your "crash on exit" issue?

Unfortunately, the issue is not solved by this patch.


There are 100 LUNs, and two scripts reproduce the issue.

#!/bin/bash
while true
do
        for i in `seq 1 20`
        do
                udevadm trigger
        done
        sleep 1
done

#!/bin/bash
while true
do
        for i in `seq 1 10`
        do
                systemctl restart multipathd
        done
        kill -9 `pidof /sbin/multipathd`
        sleep 1
done

There are several different coredump stacks.

0.8.5 (I'm not sure there are only two stacks in 0.8.5)
First stack:
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f59a0109647 in ?? ()
[Current thread is 1 (LWP 1997690)]
(gdb) bt
#0  0x00007f59a0109647 in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb) info threads
  Id   Target Id         Frame
* 1    LWP 1997690       0x00007f59a0109647 in ?? ()
  2    LWP 1996840       0x00007f59a0531de7 in ?? ()
  3    LWP 1997692       0x00007f59a0109647 in ?? ()
  4    LWP 1996857       0x00007f59a020d169 in ?? ()

Second stack:
#0  0x0000ffffb6118f4c in aarch64_fallback_frame_state (context=0xffffb523f200, context=0xffffb523f200, fs=0xffffb523e700) at ./md-unwind-support.h:74
#1  uw_frame_state_for (context=context@entry=0xffffb523f200, fs=fs@entry=0xffffb523e700) at ../../../libgcc/unwind-dw2.c:1257
#2  0x0000ffffb6119ef4 in _Unwind_ForcedUnwind_Phase2 (exc=exc@entry=0xffffb52403b0, context=context@entry=0xffffb523f200) at ../../../libgcc/unwind.inc:155
#3  0x0000ffffb611a284 in _Unwind_ForcedUnwind (exc=0xffffb52403b0, stop=stop@entry=0xffffb64846c0 <unwind_stop>, stop_argument=0xffffb523f630) at ../../../libgcc/unwind.inc:207
#4  0x0000ffffb6484860 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:121
#5  0x0000ffffb6482d08 in __do_cancel () at pthreadP.h:304
#6  __GI___pthread_testcancel () at pthread_testcancel.c:26
#7  0x0000ffffb5c528e8 in ?? ()

There are other stacks in 0.7.7
Second stack:
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000ffff9d8d281c in __GI_abort () at abort.c:79
#2  0x0000ffff9d90b818 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xffff9d9cb888 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x0000ffff9d911f6c in malloc_printerr (str=str@entry=0xffff9d9c90d0 "free(): invalid pointer") at malloc.c:5389
#4  0x0000ffff9d913780 in _int_free (av=0xffff9da0ba58 <main_arena>, p=0xffff98000070, have_lock=0) at malloc.c:4172
#5  0x0000ffff9dc2b608 in internal_hashmap_clear (h=h@entry=0xffff9814dfa0, default_free_key=<optimized out>, default_free_value=<optimized out>) at ../src/basic/hashmap.c:902
#6  0x0000ffff9dc2b700 in internal_hashmap_free (h=<optimized out>, default_free_key=<optimized out>, default_free_value=<optimized out>, default_free_value=<optimized out>, default_free_key=<optimized out>,
    h=<optimized out>) at ../src/basic/hashmap.c:874
#7  0x0000ffff9dc2b88c in ordered_hashmap_free_free_free () at ../src/basic/hashmap.h:118
#8  device_free (device=0xffff9814d420) at ../src/libsystemd/sd-device/sd-device.c:68
#9  sd_device_unref (p=<optimized out>) at ../src/libsystemd/sd-device/sd-device.c:78
#10 0x0000ffff9dc36cc8 in sd_device_unrefp () at ../src/systemd/sd-device.h:118
#11 device_new_from_nulstr (len=<optimized out>, nulstr=0xffff9d3253d0 "SEQNUM", ret=<synthetic pointer>) at ../src/libsystemd/sd-device/device-private.c:448
#12 device_monitor_receive_device (m=0xffff98000b20, ret=ret@entry=0xffff9d327388) at ../src/libsystemd/sd-device/device-monitor.c:447
#13 0x0000ffff9dc38bf4 in udev_monitor_receive_sd_device (ret=0xffff9d327388, udev_monitor=0xffff98000c70) at ../src/libudev/libudev-monitor.c:207
#14 udev_monitor_receive_device (udev_monitor=0xffff98000c70, udev_monitor@entry=0xffff9d3273a0) at ../src/libudev/libudev-monitor.c:253
#15 0x0000ffff9dcd9478 in uevent_listen (udev=0xffff9d327f40) at uevent.c:853
#16 0x0000aaaae3783514 in ueventloop (ap=0xffffe160fe80) at main.c:1518
#17 0x0000ffff9dbb67ac in start_thread (arg=0xffff9d3ad380) at pthread_create.c:486
#18 0x0000ffff9d97047c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Some stacks are in glibc and libgcc.

If exit() is called before all the pthread_cancel() calls in child() of 0.7.7, there is no crash at all.
So I believe there are races in thread cancellation, but I don't know where they come from.

Regards,
Lixiaokeng


* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-03 10:48   ` lixiaokeng
@ 2021-02-03 13:57     ` Martin Wilck
  2021-02-04  1:40       ` lixiaokeng
  2021-03-01 14:53       ` lixiaokeng
  2021-02-08  7:41     ` lixiaokeng
  1 sibling, 2 replies; 34+ messages in thread
From: Martin Wilck @ 2021-02-03 13:57 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel

On Wed, 2021-02-03 at 18:48 +0800, lixiaokeng wrote:
> 
> 
> On 2021/2/3 4:52, Martin Wilck wrote:
> > did this fix your "crash on exit" issue?
> 
> Unfortunately, the issue is not solved.
> 
> There will be some different coredump stack.
> 
> 0.8.5 (I'm not sure there are only two stacks in 0.8.5)
> First stack:
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00007f59a0109647 in ?? ()
> [Current thread is 1 (LWP 1997690)]
> (gdb) bt
> #0  0x00007f59a0109647 in ?? ()
> #1  0x0000000000000000 in ?? ()
> (gdb) info threads
>   Id   Target Id         Frame
> * 1    LWP 1997690       0x00007f59a0109647 in ?? ()
>   2    LWP 1996840       0x00007f59a0531de7 in ?? ()
>   3    LWP 1997692       0x00007f59a0109647 in ?? ()
>   4    LWP 1996857       0x00007f59a020d169 in ?? ()
> 
> Second stack:
> #0  0x0000ffffb6118f4c in aarch64_fallback_frame_state
> (context=0xffffb523f200, context=0xffffb523f200, fs=0xffffb523e700)
> at ./md-unwind-support.h:74
> #1  uw_frame_state_for (context=context@entry=0xffffb523f200, 
> fs=fs@entry=0xffffb523e700) at ../../../libgcc/unwind-dw2.c:1257
> #2  0x0000ffffb6119ef4 in _Unwind_ForcedUnwind_Phase2
> (exc=exc@entry=0xffffb52403b0, context=context@entry=0xffffb523f200)
> at ../../../libgcc/unwind.inc:155
> #3  0x0000ffffb611a284 in _Unwind_ForcedUnwind (exc=0xffffb52403b0, 
> stop=stop@entry=0xffffb64846c0 <unwind_stop>,
> stop_argument=0xffffb523f630) at ../../../libgcc/unwind.inc:207
> #4  0x0000ffffb6484860 in __GI___pthread_unwind (buf=<optimized out>)
> at unwind.c:121
> #5  0x0000ffffb6482d08 in __do_cancel () at pthreadP.h:304
> #6  __GI___pthread_testcancel () at pthread_testcancel.c:26
> #7  0x0000ffffb5c528e8 in ?? ()
> 

Please check your glibc and gcc versions. Google turns up some general
issues with pthread_cancel() and certain library versions, especially
on ARM. There's a chance that this isn't a multipath-tools bug after
all.

https://www.google.com/search?q=SIGSEGV+_Unwind_ForcedUnwind_Phase2

You could also try to compile with clang and see if the error goes
away.

> If exit() before all pthread_cancel in child of 0.7.7, there is no
> any crash.

What do you mean by "exit() before all pthread_cancel"? If the crash
happens in pthread_cancel(), and you don't call that function, this
would actually be expected.

Regards,
Martin

> So I believe there are many races in thread cancel but I don't know
> how it comes.



> 
> Regards,
> Lixiaokeng
> 




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-03 13:57     ` Martin Wilck
@ 2021-02-04  1:40       ` lixiaokeng
  2021-02-04 15:06         ` Martin Wilck
  2021-03-01 14:53       ` lixiaokeng
  1 sibling, 1 reply; 34+ messages in thread
From: lixiaokeng @ 2021-02-04  1:40 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel



On 2021/2/3 21:57, Martin Wilck wrote:
>> If exit() before all pthread_cancel in child of 0.7.7, there is no
>> any crash.
> What do you mean with "exit() before all pthread_cancel"? If this
> happens on pthread_cancel(), and you don't call that function, this
> would actually be expected.

When running_state is DAEMON_SHUTDOWN, I break out of the while loop
and then call _exit(0). But it is not a great method.

Regards,
Lixiaokeng


* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-04  1:40       ` lixiaokeng
@ 2021-02-04 15:06         ` Martin Wilck
  2021-02-05 11:08           ` Martin Wilck
  0 siblings, 1 reply; 34+ messages in thread
From: Martin Wilck @ 2021-02-04 15:06 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel

On Thu, 2021-02-04 at 09:40 +0800, lixiaokeng wrote:
> 
> 
> On 2021/2/3 21:57, Martin Wilck wrote:
> > > If exit() before all pthread_cancel in child of 0.7.7, there is
> > > no
> > > any crash.
> > What do you mean with "exit() before all pthread_cancel"? If this
> > happens on pthread_cancel(), and you don't call that function, this
> > would actually be expected.
> 
> When running_state is DAEMON_SHUTDOWN, break while then _exit(0). But
> is is not a great method.

I wonder if it would be possible to figure out the LWP numbers (process
IDs) of the different threads before the crash occurs, and compare this
to the gdb output

(gdb) info threads
  Id   Target Id         Frame
* 1    LWP 1997690       0x00007f59a0109647 in ?? ()
  2    LWP 1996840       0x00007f59a0531de7 in ?? ()
  3    LWP 1997692       0x00007f59a0109647 in ?? ()
  4    LWP 1996857       0x00007f59a020d169 in ?? ()

... to identify which thread crashed, and if it's always the same one.

Martin



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-04 15:06         ` Martin Wilck
@ 2021-02-05 11:08           ` Martin Wilck
  2021-02-05 11:09             ` Martin Wilck
  2021-02-07  7:05             ` lixiaokeng
  0 siblings, 2 replies; 34+ messages in thread
From: Martin Wilck @ 2021-02-05 11:08 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel

On Thu, 2021-02-04 at 16:06 +0100, Martin Wilck wrote:
> On Thu, 2021-02-04 at 09:40 +0800, lixiaokeng wrote:
> > 
> > 
> > On 2021/2/3 21:57, Martin Wilck wrote:
> > > > If exit() before all pthread_cancel in child of 0.7.7, there is
> > > > no
> > > > any crash.
> > > What do you mean with "exit() before all pthread_cancel"? If this
> > > happens on pthread_cancel(), and you don't call that function,
> > > this
> > > would actually be expected.
> > 
> > When running_state is DAEMON_SHUTDOWN, break while then _exit(0).
> > But
> > is is not a great method.
> 
> I wonder if it would be possible to figure out the LWP numbers
> (process
> IDs) of the different threads before the crash occurs, and compare
> this
> to the gdb output
> 
> (gdb) info threads
>   Id   Target Id         Frame
> * 1    LWP 1997690       0x00007f59a0109647 in ?? ()
>   2    LWP 1996840       0x00007f59a0531de7 in ?? ()
>   3    LWP 1997692       0x00007f59a0109647 in ?? ()
>   4    LWP 1996857       0x00007f59a020d169 in ?? ()
> 
> ... to identify which thread crashed, and if it's always the same
> one.

From the LWP numbers, threads 2 and 4 are probably TUR checkers
(temporary threads). Thread 1 can't be easily identified. Could you
provide the stack of thread 3? From that, we might be able to infer
which thread crashed, because multipathd always starts its threads in
the same sequence.

Martin




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-05 11:08           ` Martin Wilck
@ 2021-02-05 11:09             ` Martin Wilck
  2021-02-07  7:05             ` lixiaokeng
  1 sibling, 0 replies; 34+ messages in thread
From: Martin Wilck @ 2021-02-05 11:09 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel

On Fri, 2021-02-05 at 12:08 +0100, Martin Wilck wrote:
> On Thu, 2021-02-04 at 16:06 +0100, Martin Wilck wrote:
> 
> 
> From the LWP numbers, thread 2 and 4 are probably TUR checkers
> (temporary threads). thread 1 can't be easily identified. Could you 
> provide the stack of thread 3? From that, we might be able to infer
> which thread crashed, because multipathd always starts its threads in
> the same sequence.

btw, do you have core dumps?



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-05 11:08           ` Martin Wilck
  2021-02-05 11:09             ` Martin Wilck
@ 2021-02-07  7:05             ` lixiaokeng
  1 sibling, 0 replies; 34+ messages in thread
From: lixiaokeng @ 2021-02-07  7:05 UTC (permalink / raw)
  To: dm-devel




On 2021/2/5 19:08, Martin Wilck wrote:
> On Thu, 2021-02-04 at 16:06 +0100, Martin Wilck wrote:
>> On Thu, 2021-02-04 at 09:40 +0800, lixiaokeng wrote:
>>>
>>>
>>> On 2021/2/3 21:57, Martin Wilck wrote:
>>>>> If exit() before all pthread_cancel in child of 0.7.7, there is
>>>>> no
>>>>> any crash.
>>>> What do you mean with "exit() before all pthread_cancel"? If this
>>>> happens on pthread_cancel(), and you don't call that function,
>>>> this
>>>> would actually be expected.
>>>
>>> When running_state is DAEMON_SHUTDOWN, break while then _exit(0).
>>> But
>>> is is not a great method.
>>
>> I wonder if it would be possible to figure out the LWP numbers
>> (process
>> IDs) of the different threads before the crash occurs, and compare
>> this
>> to the gdb output
>>
>> (gdb) info threads
>>   Id   Target Id         Frame
>> * 1    LWP 1997690       0x00007f59a0109647 in ?? ()
>>   2    LWP 1996840       0x00007f59a0531de7 in ?? ()
>>   3    LWP 1997692       0x00007f59a0109647 in ?? ()
>>   4    LWP 1996857       0x00007f59a020d169 in ?? ()
>>
>> ... to identify which thread crashed, and if it's always the same
>> one.
> 
> From the LWP numbers, threads 2 and 4 are probably TUR checkers
> (temporary threads). Thread 1 can't be easily identified. Could you
> provide the stack of thread 3? From that, we might be able to infer
> which thread crashed, because multipathd always starts its threads in
> the same sequence.
> 

Here is another core stack (the attachment contains the core dump):

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/sbin/multipathd -d -s'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fe17dd97456 in ?? ()
[Current thread is 1 (Thread 0x7fe17cc00700 (LWP 3093458))]
(gdb) bt
#0  0x00007fe17dd97456 in ?? ()
#1  0x0000006800007530 in ?? ()
#2  0x00007fe17cbffb2e in ?? ()
#3  0xffffffff00000053 in ?? ()
#4  0x0000000000002006 in ?? ()
#5  0x0000000000000000 in ?? ()
(gdb) info thread
  Id   Target Id                           Frame
* 1    Thread 0x7fe17cc00700 (LWP 3093458) 0x00007fe17dd97456 in ?? ()
  2    Thread 0x7fe17d421700 (LWP 3092869) 0x00007fe17da8a929 in __GI___poll (fds=fds@entry=0x0, nfds=nfds@entry=0, timeout=timeout@entry=10) at ../sysdeps/unix/sysv/linux/poll.c:29
  3    Thread 0x7fe17d4a7a80 (LWP 3092860) 0x00007fe17da96e27 in socket () at ../sysdeps/unix/syscall-template.S:78
  4    Thread 0x7fe17cc0e700 (LWP 3093459) 0x00007fe17da8c507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  5    Thread 0x7fe17cbc1700 (LWP 3093460) 0x00007fe17da8c507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  6    Thread 0x7fe17cb4c700 (LWP 3093461) 0x00007fe17da8c507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  7    Thread 0x7fe17cb5e700 (LWP 3093462) 0x00007fe17da8c507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  8    Thread 0x7fe17cb67700 (LWP 3093463) 0x00007fe17da8c507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  9    Thread 0x7fe17cb70700 (LWP 3093464) 0x00007fe17da8c507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fe17d421700 (LWP 3092869))]
#0  0x00007fe17da8a929 in __GI___poll (fds=fds@entry=0x0, nfds=nfds@entry=0, timeout=timeout@entry=10) at ../sysdeps/unix/sysv/linux/poll.c:29
29	  return SYSCALL_CANCEL (poll, fds, nfds, timeout);
(gdb) bt
#0  0x00007fe17da8a929 in __GI___poll (fds=fds@entry=0x0, nfds=nfds@entry=0, timeout=timeout@entry=10) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fe17dcdcd91 in poll (__timeout=10, __nfds=0, __fds=0x0) at /usr/include/bits/poll2.h:46
#2  call_rcu_thread (arg=0x557ab760e210) at urcu-call-rcu-impl.h:383
#3  0x00007fe17dcbef4b in start_thread (arg=0x7fe17d421700) at pthread_create.c:486
#4  0x00007fe17da957ef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fe17d4a7a80 (LWP 3092860))]
#0  0x00007fe17da96e27 in socket () at ../sysdeps/unix/syscall-template.S:78
78	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x00007fe17da96e27 in socket () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007fe17db95f34 in sd_pid_notify_with_fds (pid=0, unset_environment=0, state=0x557ab58af801 "ERRNO=0", fds=0x0, n_fds=0) at ../src/libsystemd/sd-daemon/sd-daemon.c:481
#2  0x0000557ab58a6ef7 in child (param=<optimized out>) at main.c:3140
#3  0x0000557ab589f503 in main (argc=<optimized out>, argv=0x7fffe51c3c08) at main.c:3325
(gdb) thread 4
[Switching to thread 4 (Thread 0x7fe17cc0e700 (LWP 3093459))]
#0  0x00007fe17da8c507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
78	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x00007fe17da8c507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007fe17dd97456 in ?? ()
#2  0x0000006900007530 in ?? ()
#3  0x00007fe17cc0db2e in ?? ()
#4  0xffffffff00000053 in ?? ()
#5  0x0000000000002006 in ?? ()
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (Thread 0x7fe17cbc1700 (LWP 3093460))]
#0  0x00007fe17da8c507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
78	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x00007fe17da8c507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007fe17dd97456 in ?? ()
#2  0x0000006a00007530 in ?? ()
#3  0x00007fe17cbc0b2e in ?? ()
#4  0xffffffff00000053 in ?? ()
#5  0x0000000000002006 in ?? ()
#6  0x0000000000000000 in ?? ()

Regards
Lixiaokeng

[-- Attachment #2: core.multipathd.0.5912fc2ca07945d8ab7d921ca6ff28ab.3092860.1612673516000000000000.lz4 --]
[-- Type: application/octet-stream, Size: 1533834 bytes --]


* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-03 10:48   ` lixiaokeng
  2021-02-03 13:57     ` Martin Wilck
@ 2021-02-08  7:41     ` lixiaokeng
  2021-02-08  9:50       ` Martin Wilck
  1 sibling, 1 reply; 34+ messages in thread
From: lixiaokeng @ 2021-02-08  7:41 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel


> 
> There are other stacks in 0.7.7
> Second stack:
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x0000ffff9d8d281c in __GI_abort () at abort.c:79
> #2  0x0000ffff9d90b818 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xffff9d9cb888 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
> #3  0x0000ffff9d911f6c in malloc_printerr (str=str@entry=0xffff9d9c90d0 "free(): invalid pointer") at malloc.c:5389
> #4  0x0000ffff9d913780 in _int_free (av=0xffff9da0ba58 <main_arena>, p=0xffff98000070, have_lock=0) at malloc.c:4172
> #5  0x0000ffff9dc2b608 in internal_hashmap_clear (h=h@entry=0xffff9814dfa0, default_free_key=<optimized out>, default_free_value=<optimized out>) at ../src/basic/hashmap.c:902
> #6  0x0000ffff9dc2b700 in internal_hashmap_free (h=<optimized out>, default_free_key=<optimized out>, default_free_value=<optimized out>, default_free_value=<optimized out>, default_free_key=<optimized out>,
>     h=<optimized out>) at ../src/basic/hashmap.c:874
> #7  0x0000ffff9dc2b88c in ordered_hashmap_free_free_free () at ../src/basic/hashmap.h:118
> #8  device_free (device=0xffff9814d420) at ../src/libsystemd/sd-device/sd-device.c:68
> #9  sd_device_unref (p=<optimized out>) at ../src/libsystemd/sd-device/sd-device.c:78
> #10 0x0000ffff9dc36cc8 in sd_device_unrefp () at ../src/systemd/sd-device.h:118
> #11 device_new_from_nulstr (len=<optimized out>, nulstr=0xffff9d3253d0 "SEQNUM", ret=<synthetic pointer>) at ../src/libsystemd/sd-device/device-private.c:448
> #12 device_monitor_receive_device (m=0xffff98000b20, ret=ret@entry=0xffff9d327388) at ../src/libsystemd/sd-device/device-monitor.c:447
> #13 0x0000ffff9dc38bf4 in udev_monitor_receive_sd_device (ret=0xffff9d327388, udev_monitor=0xffff98000c70) at ../src/libudev/libudev-monitor.c:207
> #14 udev_monitor_receive_device (udev_monitor=0xffff98000c70, udev_monitor@entry=0xffff9d3273a0) at ../src/libudev/libudev-monitor.c:253
> #15 0x0000ffff9dcd9478 in uevent_listen (udev=0xffff9d327f40) at uevent.c:853
> #16 0x0000aaaae3783514 in ueventloop (ap=0xffffe160fe80) at main.c:1518
> #17 0x0000ffff9dbb67ac in start_thread (arg=0xffff9d3ad380) at pthread_create.c:486
> #18 0x0000ffff9d97047c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
> 

Hi Martin,

There is a _cleanup_ attribute in device_new_from_nulstr(). If the
uevent thread exits inside device_new_from_nulstr() while some keys
have not yet been appended to the sd_device, the _cleanup_ handler is
called, which leads to multipathd crashing with the stack quoted above.

When I use your advice,


On 2021/1/26 16:34, Martin Wilck wrote:
>     int oldstate;
>
>     pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
>
>     udev_monitor_receive_device(...)
>
>     pthread_setcancelstate(oldstate, NULL);
>     pthread_testcancel();

this coredump does not seem to appear anymore (after several hours with the test scripts).

There is some discussion in https://github.com/systemd/systemd/issues/18376.

Should the pthread cancellation state be handled like this in multipath-tools?

Regards,
Lixiaokeng




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-08  7:41     ` lixiaokeng
@ 2021-02-08  9:50       ` Martin Wilck
  2021-02-08 10:49         ` lixiaokeng
  0 siblings, 1 reply; 34+ messages in thread
From: Martin Wilck @ 2021-02-08  9:50 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel

On Mon, 2021-02-08 at 15:41 +0800, lixiaokeng wrote:
> 
> Hi Martin,
> 
> There is a _cleanup_ in device_new_from_nulstr. If uevent_thr exit in
> device_new_from_nulstr and some keys is not be append to sd_device,
> the _cleanup_ will be called, which leads to multipathd crashes with
> the stack.
> 
> When I use your advice,
> 
> 
> On 2021/1/26 16:34, Martin Wilck wrote:
> >     int oldstate;
> > 
> >     pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
> > 
> >     udev_monitor_receive_device(...)
> > 
> >     pthread_setcancelstate(oldstate, NULL);
> >     pthread_testcancel();
> 
> this coredump does not seem to appear anymore (several hours with
> test scripts).

Thanks for your continued hard work on this, but I can't follow you. In
this post:

https://listman.redhat.com/archives/dm-devel/2021-January/msg00396.html

you said that this advice did _not_ help. Please clarify.

Regards
Martin




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-08  9:50       ` Martin Wilck
@ 2021-02-08 10:49         ` lixiaokeng
  2021-02-08 11:03           ` Martin Wilck
  0 siblings, 1 reply; 34+ messages in thread
From: lixiaokeng @ 2021-02-08 10:49 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel



On 2021/2/8 17:50, Martin Wilck wrote:
> On Mon, 2021-02-08 at 15:41 +0800, lixiaokeng wrote:
>>
>> Hi Martin,
>>
>> There is a _cleanup_ in device_new_from_nulstr. If uevent_thr exit in
>> device_new_from_nulstr and some keys is not be append to sd_device,
>> the _cleanup_ will be called, which leads to multipathd crashes with
>> the stack.
>>
>> When I use your advice,
>>
>>
>> On 2021/1/26 16:34, Martin Wilck wrote:
>>>     int oldstate;
>>>
>>>     pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
>>>
>>>     udev_monitor_receive_device(...)
>>>
>>>     pthread_setcancelstate(oldstate, NULL);
>>>     pthread_testcancel();
>>
>> this coredump does not seem to appear anymore (several hours with
>> test scripts).
> 
> Thanks for your continued hard work on this, but I can't follow you. In
> this post:
> 
> https://listman.redhat.com/archives/dm-devel/2021-January/msg00396.html
> 
> you said that this advice did _not_ help. Please clarify.
> 

Hi Martin,
At that time, I did not know how the crash occurred in the systemd interface.
There were still some crashes with pthread_testcancel(), for example:
#0  0x0000ffffb6118f4c in aarch64_fallback_frame_state (context=0xffffb523f200, context=0xffffb523f200, fs=0xffffb523e700) at ./md-unwind-support.h:74
#1  uw_frame_state_for (context=context@entry=0xffffb523f200, fs=fs@entry=0xffffb523e700) at ../../../libgcc/unwind-dw2.c:1257
#2  0x0000ffffb6119ef4 in _Unwind_ForcedUnwind_Phase2 (exc=exc@entry=0xffffb52403b0, context=context@entry=0xffffb523f200) at ../../../libgcc/unwind.inc:155
#3  0x0000ffffb611a284 in _Unwind_ForcedUnwind (exc=0xffffb52403b0, stop=stop@entry=0xffffb64846c0 <unwind_stop>, stop_argument=0xffffb523f630) at ../../../libgcc/unwind.inc:207
#4  0x0000ffffb6484860 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:121
#5  0x0000ffffb6482d08 in __do_cancel () at pthreadP.h:304
#6  __GI___pthread_testcancel () at pthread_testcancel.c:26
#7  0x0000ffffb5c528e8 in ?? ()

I thought these crashes might be related to the crash in the systemd
interface.

However, after analyzing the core dumps and discussing with the
community, I now think these are independent issues. So I tested again:
the ?? and _Unwind_XXX crashes still exist, but there is no crash in
device_monitor_receive_device anymore.

Regards,
Lixiaokeng



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-08 10:49         ` lixiaokeng
@ 2021-02-08 11:03           ` Martin Wilck
  2021-02-09  1:36             ` lixiaokeng
  0 siblings, 1 reply; 34+ messages in thread
From: Martin Wilck @ 2021-02-08 11:03 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel

On Mon, 2021-02-08 at 18:49 +0800, lixiaokeng wrote:
> 
> 
> On 2021/2/8 17:50, Martin Wilck wrote:
> > On Mon, 2021-02-08 at 15:41 +0800, lixiaokeng wrote:
> > > 
> > > Hi Martin,
> > > 
> > > There is a _cleanup_ in device_new_from_nulstr. If uevent_thr
> > > exit in
> > > device_new_from_nulstr and some keys is not be append to
> > > sd_device,
> > > the _cleanup_ will be called, which leads to multipathd crashes
> > > with
> > > the stack.
> > > 
> > > When I use your advice,
> > > 
> > > 
> > > On 2021/1/26 16:34, Martin Wilck wrote:
> > > >     int oldstate;
> > > > 
> > > >     pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
> > > > 
> > > >     udev_monitor_receive_device(...)
> > > > 
> > > >     pthread_setcancelstate(oldstate, NULL);
> > > >     pthread_testcancel();
> > > 
> > > this coredump does not seem to appear anymore (several hours with
> > > test scripts).
> > 
> > Thanks for your continued hard work on this, but I can't follow
> > you. In
> > this post:
> > 
> > https://listman.redhat.com/archives/dm-devel/2021-January/msg00396.html
> > 
> > you said that this advice did _not_ help. Please clarify.
> > 
> 
> Hi Martin,
> At that time, I did not know how the crash occurred in the systemd
> interface.
> There were still some crashes with pthread_testcancel(), for example
> #0  0x0000ffffb6118f4c in aarch64_fallback_frame_state
> (context=0xffffb523f200, context=0xffffb523f200, fs=0xffffb523e700)
> at ./md-unwind-support.h:74
> #1  uw_frame_state_for (context=context@entry=0xffffb523f200, 
> fs=fs@entry=0xffffb523e700) at ../../../libgcc/unwind-dw2.c:1257
> #2  0x0000ffffb6119ef4 in _Unwind_ForcedUnwind_Phase2
> (exc=exc@entry=0xffffb52403b0, context=context@entry=0xffffb523f200)
> at ../../../libgcc/unwind.inc:155
> #3  0x0000ffffb611a284 in _Unwind_ForcedUnwind (exc=0xffffb52403b0, 
> stop=stop@entry=0xffffb64846c0 <unwind_stop>,
> stop_argument=0xffffb523f630) at ../../../libgcc/unwind.inc:207
> #4  0x0000ffffb6484860 in __GI___pthread_unwind (buf=<optimized out>)
> at unwind.c:121
> #5  0x0000ffffb6482d08 in __do_cancel () at pthreadP.h:304
> #6  __GI___pthread_testcancel () at pthread_testcancel.c:26
> #7  0x0000ffffb5c528e8 in ?? ()
> 

I still don't fully understand. Above you said "this coredump doesn't
seem to appear any more". Am I understanding correctly that you
observed *other* core dumps instead?

The uw_frame_state_for() stack looks healthy (I learned that just
recently from one of our experts in this area). Most probably the actual
crash occurred in another thread in this case. It would be interesting
to look at a core dump.

The point of my suggestion was not the pthread_testcancel(), but the
blocking of thread cancellation during udev_monitor_receive_device().

> I thought these crashes might be related to crash in systemd
> interface.
> 
> However, I think these may be independent questions after analyzing
> coredump and discussing with the community. So I test it again.
> ?? and _Unwind_XXX crashes still exist but no crash in
> device_monitor_receive_device.

The "best" solution would probably be to generally disallow
cancellation, and only run pthread_testcancel() at certain points in
the code where we might block (and know that being cancelled would be
safe). That would not only make multipathd safer from crashing, it
would also enable us to remove hundreds of ugly
pthread_cleanup_push()/pop() calls from our code.

Finding all these points would be a challenge though, and if we don't
find them, we risk hanging on exit again, which is bad too, and was
just recently improved.

Regards
Martin




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-08 11:03           ` Martin Wilck
@ 2021-02-09  1:36             ` lixiaokeng
  2021-02-09 17:30               ` Martin Wilck
  0 siblings, 1 reply; 34+ messages in thread
From: lixiaokeng @ 2021-02-09  1:36 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel


> 
> I still don't fully understand. Above you said "this coredump doesn't
> seem to appear any more". Am I understanding correctly that you
> observed *other* core dumps instead?
> 

No, it is not "instead".
As shown in https://www.spinics.net/lists/dm-devel/msg45293.html,
there are several different crashes in multipathd with no code change.
When thread cancellation is blocked during udev_monitor_receive_device(),
the crash in udev_monitor_receive_device() no longer happens, but the
others still exist.

> 
> The "best" solution would probably be to generally disallow
> cancellation, and only run pthread_testcancel() at certain points in
> the code where we might block (and know that being cancelled would be
> safe). That would not only make multipathd safer from crashing, it
> would also enable us to remove hundreds of ugly
> pthread_cleanup_push()/pop() calls from our code.
> 
> Finding all these points would be a challenge though, and if we don't
> find them, we risk hanging on exit again, which is bad too, and was
> just recently improved.

Do you mean that some patches have been made to solve these problems?

Regards,
Lixiaokeng


* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-09  1:36             ` lixiaokeng
@ 2021-02-09 17:30               ` Martin Wilck
  2021-02-10  2:02                 ` lixiaokeng
  2021-02-19  1:36                 ` lixiaokeng
  0 siblings, 2 replies; 34+ messages in thread
From: Martin Wilck @ 2021-02-09 17:30 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: dm-devel

On Tue, 2021-02-09 at 09:36 +0800, lixiaokeng wrote:
> 
> > 
> > I still don't fully understand. Above you said "this coredump
> > doesn't
> > seem to appear any more". Am I understanding correctly that you
> > observed *other* core dumps instead?

> > 
> 
> No, it is not "instead".
> As shown in https://www.spinics.net/lists/dm-devel/msg45293.html,
> there are some different crashes in multipathd with no code change.
> When blocking of thread cancellation during
> udev_monitor_receive_device(),
> no crash in udev_monitor_receive_device happens but others still
> exist.

Now I got it, eventually :-) Thanks for the clarification. Would it be
possible for you to categorize the different issues and provide core
dumps?

You mentioned in the systemd issue that you were playing around with
the gcc -fexceptions flag. As I remarked there - how did it get set in
the first place? What distro are you using?
> > 
> > The "best" solution would probably be to generally disallow
> > cancellation, and only run pthread_testcancel() at certain points
> > in
> > the code where we might block (and know that being cancelled would
> > be
> > safe). That would not only make multipathd safer from crashing, it
> > would also enable us to remove hundreds of ugly
> > pthread_cleanup_push()/pop() calls from our code.
> > 
> > Finding all these points would be a challenge though, and if we
> > don't
> > find them, we risk hanging on exit again, which is bad too, and was
> > just recently improved.
> 
> Do you mean some patches have been made to solve these problem?

No. I could hack up some, but it will take some time.

Regards,
Martin





* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-09 17:30               ` Martin Wilck
@ 2021-02-10  2:02                 ` lixiaokeng
  2021-02-10  2:29                   ` Hexiaowen (Hex, EulerOS)
  2021-02-19  1:36                 ` lixiaokeng
  1 sibling, 1 reply; 34+ messages in thread
From: lixiaokeng @ 2021-02-10  2:02 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui, hexiaowen
  Cc: linfeilong, dm-devel

[-- Attachment #1: Type: text/plain, Size: 649 bytes --]



> 
> Now I got it, eventually :-) Thanks for the clarification. Would it be
> ossible for you to categorize the different issues and provide core
> dumps?
> 

With thread cancellation blocked during udev_monitor_receive_device(),
only the ??() and pthread_testcancel() crashes remain.
stack1.tgz contains the ??() core dump and stack2.tgz the
pthread_testcancel() core dumps.

> You mentioned in the systemd issue that you were playing around with
> the gcc -fexceptions flag. As I remarked there - how did it get set in
> the first place? What distro are you using?

He Xiaowen, please answer this question.

Regards,
Lixiaokeng

[-- Attachment #2: stack1.tgz --]
[-- Type: application/octet-stream, Size: 3113993 bytes --]

[-- Attachment #3: stack2.tgz --]
[-- Type: application/octet-stream, Size: 2860014 bytes --]

[-- Attachment #4: Type: text/plain, Size: 97 bytes --]


* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-10  2:02                 ` lixiaokeng
@ 2021-02-10  2:29                   ` Hexiaowen (Hex, EulerOS)
  2021-02-19 10:35                     ` Martin Wilck
  0 siblings, 1 reply; 34+ messages in thread
From: Hexiaowen (Hex, EulerOS) @ 2021-02-10  2:29 UTC (permalink / raw)
  To: lixiaokeng, Martin Wilck, Benjamin Marzinski, Christophe Varoqui
  Cc: linfeilong, dm-devel



On 2021/2/10 10:02, lixiaokeng wrote:
> 
> 
>>
>> Now I got it, eventually :-) Thanks for the clarification. Would it be
>> ossible for you to categorize the different issues and provide core
>> dumps?
>>
> 
> When blocking of thread cancellation during udev_monitor_receive_device(),
> there are just ??() and pthread_testcancel() crash.
> The stack1.tgz is ??() coredump and the stack2.tgz is pthread_testcancel()
> coredumps.
> 
>> You mentioned in the systemd issue that you were playing around with
>> the gcc -fexceptions flag. As I remarked there - how did it get set in
>> the first place? What distro are you using?
> 
> He Xiaowen, please answer this question.

We use the release version openEuler 20.03 LTS with systemd version 243.

During rpmbuild, -fexceptions is added globally to CFLAGS.

> 
> Regards,
> Lixiaokeng
> 



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-09 17:30               ` Martin Wilck
  2021-02-10  2:02                 ` lixiaokeng
@ 2021-02-19  1:36                 ` lixiaokeng
  1 sibling, 0 replies; 34+ messages in thread
From: lixiaokeng @ 2021-02-19  1:36 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui
  Cc: linfeilong, dm-devel, hexiaowen



On 2021/2/10 1:30, Martin Wilck wrote:
> On Tue, 2021-02-09 at 09:36 +0800, lixiaokeng wrote:
>>
>>>
>>> I still don't fully understand. Above you said "this coredump
>>> doesn't
>>> seem to appear any more". Am I understanding correctly that you
>>> observed *other* core dumps instead?
> 
>>>
>>
>> No, it is not "instead".
>> As shown in https://www.spinics.net/lists/dm-devel/msg45293.html,
>> there are some different crashes in multipathd with no code change.
>> When blocking of thread cancellation during
>> udev_monitor_receive_device(),
>> no crash in udev_monitor_receive_device happens but others still
>> exist.
> 
> Now I got it, eventually :-) Thanks for the clarification. Would it be
> ossible for you to categorize the different issues and provide core
> dumps?
> 
> You mentioned in the systemd issue that you were playing around with
> the gcc -fexceptions flag. As I remarked there - how did it get set in
> the first place? What distro are you using?
>>>
>>> The "best" solution would probably be to generally disallow
>>> cancellation, and only run pthread_testcancel() at certain points
>>> in
>>> the code where we might block (and know that being cancelled would
>>> be
>>> safe). That would not only make multipathd safer from crashing, it
>>> would also enable us to remove hundreds of ugly
>>> pthread_cleanup_push()/pop() calls from our code.
>>>
>>> Finding all these points would be a challenge though, and if we
>>> don't
>>> find them, we risk hanging on exit again, which is bad too, and was
>>> just recently improved.
>>
>> Do you mean some patches have been made to solve these problem?
> 
> No. I could hack up some, but it will take some time.
> 
Hi Martin,
    Is there any progress on this problem?

Regards
Lixiaokeng



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-10  2:29                   ` Hexiaowen (Hex, EulerOS)
@ 2021-02-19 10:35                     ` Martin Wilck
  0 siblings, 0 replies; 34+ messages in thread
From: Martin Wilck @ 2021-02-19 10:35 UTC (permalink / raw)
  To: Hexiaowen (Hex, EulerOS),
	lixiaokeng, Benjamin Marzinski, Christophe Varoqui
  Cc: linfeilong, dm-devel

On Wed, 2021-02-10 at 10:29 +0800, Hexiaowen (Hex, EulerOS) wrote:
> 
> We use the release version openEuler 20.03 LTS and the systemd
> version 243.
> 
> During RPMbuild, add -fexceptions globally to CFLAGS.

I have seen that OpenEuler does this for systemd (but not for
multipath).

From the understanding I've gathered so far, enabling -fexceptions in
libsystemd is at least part of the problem. As OpenEuler is maintained
by Huawei - why are you doing this? I've talked with our toolchain
maintainers, and they clearly did not recommend doing it for C
programs.

Regards,
Martin



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-02-03 13:57     ` Martin Wilck
  2021-02-04  1:40       ` lixiaokeng
@ 2021-03-01 14:53       ` lixiaokeng
  2021-03-02  8:41         ` lixiaokeng
  2021-03-02  9:56         ` Martin Wilck
  1 sibling, 2 replies; 34+ messages in thread
From: lixiaokeng @ 2021-03-01 14:53 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel



On 2021/2/3 21:57, Martin Wilck wrote:
> On Wed, 2021-02-03 at 18:48 +0800, lixiaokeng wrote:
>>
>>
>> On 2021/2/3 4:52, Martin Wilck wrote:
>>> did this fix your "crash on exit" issue?
>>
>> Unfortunately, the issue is not solved.
>>
>> There will be some different coredump stack.
>>
>> 0.8.5 (I'm not sure there are only two stacks in 0.8.5)
>> First stack:
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  0x00007f59a0109647 in ?? ()
>> [Current thread is 1 (LWP 1997690)]
>> (gdb) bt
>> #0  0x00007f59a0109647 in ?? ()
>> #1  0x0000000000000000 in ?? ()
>> (gdb) info threads
>>   Id   Target Id         Frame
>> * 1    LWP 1997690       0x00007f59a0109647 in ?? ()
>>   2    LWP 1996840       0x00007f59a0531de7 in ?? ()
>>   3    LWP 1997692       0x00007f59a0109647 in ?? ()
>>   4    LWP 1996857       0x00007f59a020d169 in ?? ()
>>
>> Second stack:
>> #0  0x0000ffffb6118f4c in aarch64_fallback_frame_state
>> (context=0xffffb523f200, context=0xffffb523f200, fs=0xffffb523e700)
>> at ./md-unwind-support.h:74
>> #1  uw_frame_state_for (context=context@entry=0xffffb523f200, 
>> fs=fs@entry=0xffffb523e700) at ../../../libgcc/unwind-dw2.c:1257
>> #2  0x0000ffffb6119ef4 in _Unwind_ForcedUnwind_Phase2
>> (exc=exc@entry=0xffffb52403b0, context=context@entry=0xffffb523f200)
>> at ../../../libgcc/unwind.inc:155
>> #3  0x0000ffffb611a284 in _Unwind_ForcedUnwind (exc=0xffffb52403b0, 
>> stop=stop@entry=0xffffb64846c0 <unwind_stop>,
>> stop_argument=0xffffb523f630) at ../../../libgcc/unwind.inc:207
>> #4  0x0000ffffb6484860 in __GI___pthread_unwind (buf=<optimized out>)
>> at unwind.c:121
>> #5  0x0000ffffb6482d08 in __do_cancel () at pthreadP.h:304
>> #6  __GI___pthread_testcancel () at pthread_testcancel.c:26
>> #7  0x0000ffffb5c528e8 in ?? ()
>>
> 

Hi Martin:
   Here I added condlog(2, "start funcname") and pthread_cleanup_push(func_print, NULL)
to every function started via pthread_create(). When these two cores happened, there
were fewer "exit tur_thread" messages than "start tur_thread" messages. So the
trouble may be in tur_thread().

I will use
	int oldstate;
	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
 	...
	pthread_setcancelstate(oldstate, NULL);
	pthread_testcancel();
to test it.

Regards,
Lixiaokeng



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-01 14:53       ` lixiaokeng
@ 2021-03-02  8:41         ` lixiaokeng
  2021-03-02 11:07           ` Martin Wilck
  2021-03-02  9:56         ` Martin Wilck
  1 sibling, 1 reply; 34+ messages in thread
From: lixiaokeng @ 2021-03-02  8:41 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel


> 
> Hi Martin:
>    Here I add condlog(2, "start funcname"), pthread_cleanup_push(func_print, NULL)
> in every pthread_create func. When these two core happened, "exit tur_thread"
> are less than "start tur_thread". So the trouble may be in tur_thread.
> 

If I am not mistaken, these core dumps happened when tur_thread (which is
libcheck_thread in the latest code) had not finished yet, but the checker
(tur) was already cleared by cleanup_checkers(). But I can't understand
this, because the ctx used in tur_thread wasn't freed by cleanup_checkers().

Please give me some help, thanks.

Regards,
Lixiaokeng



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-01 14:53       ` lixiaokeng
  2021-03-02  8:41         ` lixiaokeng
@ 2021-03-02  9:56         ` Martin Wilck
  2021-03-02 12:44           ` lixiaokeng
  1 sibling, 1 reply; 34+ messages in thread
From: Martin Wilck @ 2021-03-02  9:56 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel

Hi Lixiaokeng,

thanks for your continued efforts!

On Mon, 2021-03-01 at 22:53 +0800, lixiaokeng wrote:

> 
> Hi Martin:
>    Here I add condlog(2, "start funcname"),
> pthread_cleanup_push(func_print, NULL)
> in every pthread_create func. When these two core happened, "exit
> tur_thread"
> are less than "start tur_thread". So the trouble may be in
> tur_thread.

Note that unlike all other threads, TUR threads are _detached_ threads.
multipathd tries to cancel them, but it has no way to verify that they
actually stopped. It may be just a normal observation that you can't
see the messages when a TUR thread terminates, in particular if the
program is exiting and might have already closed the stderr file
descriptor.


If you look at the crashed processes with gdb, the thread IDs should
give you some clue which stack belongs to which thread. The TUR threads
will have higher thread IDs than the others because they are started
later.

> I will use
>         int oldstate;
>         pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
>         ...
>         pthread_setcancelstate(oldstate, NULL);
>         pthread_testcancel();
> to test it.

Where exactly do you want to put that code?

IIUC you don't compile multipathd with -fexceptions, do you? You
haven't answered my previous question why you do that for systemd.

Regards
Martin




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-02  8:41         ` lixiaokeng
@ 2021-03-02 11:07           ` Martin Wilck
  2021-03-02 15:49             ` lixiaokeng
  0 siblings, 1 reply; 34+ messages in thread
From: Martin Wilck @ 2021-03-02 11:07 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel

On Tue, 2021-03-02 at 16:41 +0800, lixiaokeng wrote:
> 
> > 
> > Hi Martin:
> >    Here I add condlog(2, "start funcname"),
> > pthread_cleanup_push(func_print, NULL)
> > in every pthread_create func. When these two core happened, "exit
> > tur_thread"
> > are less than "start tur_thread". So the trouble may be in
> > tur_thread.
> > 
> 
> If I made no mistake, these coredumps happened when the
> tur_thread(which is
> libcheck_thread in latest code) was not finished but checker(tur)was
> cleared by cleanup_checkers. But I can't understand this, because the
> ctx
> in tur_thread wasn't freed by cleanup_checkers.
> 
> Please give me some helps, thanks.

So you think the crashing thread is libcheck_thread? Do you have a core
dump from this situation?

Does the code you use include commit 38ffd89 ("libmultipath: prevent
DSO unloading with astray checker threads")?

Martin




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-02  9:56         ` Martin Wilck
@ 2021-03-02 12:44           ` lixiaokeng
  2021-03-02 15:29             ` Martin Wilck
  0 siblings, 1 reply; 34+ messages in thread
From: lixiaokeng @ 2021-03-02 12:44 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel


> Note that unlike all other threads, TUR threads are _detached_ threads.
> multipathd tries to cancel them, but it has no way to verify that they
> actually stopped. It may be just a normal observation that you can't
> see the messages when a TUR thread terminates, in particular if the
> program is exiting and might have already closed the stderr file
> descriptor.
> 
> 
> If you look at the crashed processes with gdb, the thread IDs should
> give you some clue which stack belongs to which thread. The TUR threads
> will have higher thread IDs than the others because they are started
> later.
>


??

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/sbin/multipathd -d -s'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f3f669e071d in ?? ()
[Current thread is 1 (Thread 0x7f3f65873700 (LWP 1645593))]
(gdb) i thread
  Id   Target Id                           Frame
* 1    Thread 0x7f3f65873700 (LWP 1645593) 0x00007f3f669e071d in ?? ()
  2    Thread 0x7f3f6611a000 (LWP 1645066) 0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78
  3    Thread 0x7f3f6609d700 (LWP 1645095) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
(gdb) bt
#0  0x00007f3f669e071d in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f3f6611a000 (LWP 1645066))]
#0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78
78	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007f3f669fb77d in _dl_unmap_segments (l=l@entry=0x557cb432ba10) at ./dl-unmap-segments.h:32
...
#10 0x00007f3f669b44ed in cleanup_prio () at prio.c:66  //cleanup_checkers() is finished.
#11 0x0000557cb26db794 in child (param=<optimized out>) at main.c:2932
#12 0x0000557cb26d44d3 in main (argc=<optimized out>, argv=0x7ffc98d47948) at main.c:3150


UNWIND

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/sbin/multipathd -d -s'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  x86_64_fallback_frame_state (fs=0x7fa9b2f576d0, context=0x7fa9b2f57980) at ./md-unwind-support.h:58
58	  if (*(unsigned char *)(pc+0) == 0x48
[Current thread is 1 (Thread 0x7fa9b2f58700 (LWP 1285074))]
(gdb) i thread
  Id   Target Id                                     Frame
* 1    Thread 0x7fa9b2f58700 (LWP 1285074) (Exiting) x86_64_fallback_frame_state (fs=0x7fa9b2f576d0, context=0x7fa9b2f57980) at ./md-unwind-support.h:58
  2    Thread 0x7fa9b383e000 (LWP 1284366)           0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27
  3    Thread 0x7fa9b37c1700 (LWP 1284374)           syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  4    Thread 0x7fa9b2f73700 (LWP 1285077)           0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  5    Thread 0x7fa9b2f61700 (LWP 1285076)           0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  6    Thread 0x7fa9b2f4f700 (LWP 1285079)           0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  7    Thread 0x7fa9b2fa9700 (LWP 1285080)           0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fa9b383e000 (LWP 1284366))]
#0  0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27
27	  return SYSCALL_CANCEL (close, fd);
(gdb) bt
#0  0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27
#1  0x00005606f030f95b in cleanup_dmevent_waiter () at dmevents.c:111
#2  0x00005606f03087a2 in child (param=<optimized out>) at main.c:2934
#3  0x00005606f03014d3 in main (argc=<optimized out>, argv=0x7ffdb782ab38) at main.c:3150


The LWPs of the ?? and UNWIND threads are much larger than that of
thread 2 (the main thread).

I added a print function like this:

@@ -228,6 +228,10 @@ static void copy_msg_to_tcc(void *ct_p, const char *msg)
        pthread_mutex_unlock(&ct->lock);
 }

+static void lxk10 (void)
+{
+       condlog(2, "lxk exit tur_thread");
+}
 static void *tur_thread(void *ctx)
 {
        struct tur_checker_context *ct = ctx;
@@ -235,6 +239,8 @@ static void *tur_thread(void *ctx)
        char devt[32];

        /* This thread can be canceled, so setup clean up */
+       condlog(2, "lxk start tur_thread");
+       pthread_cleanup_push(lxk10, NULL);
        tur_thread_cleanup_push(ct);

With four devices, the log around the core dump is:
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: exit (signal)
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sda: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdf: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sde: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdd: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdc: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdb: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 360014057a1353ec1bdd4dfcad19db6db: remove multipath map
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdg: orphan path, map flushed
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdg that holds hwe of 360014057a1353ec1bdd4dfcad19db6db
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 4
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405faf8a6c2920840ed8ba73b9ee: remove multipath map
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdj: orphan path, map flushed
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdj that holds hwe of 36001405faf8a6c2920840ed8ba73b9ee
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 3
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405044c0f50ba3c4e5b9b57e4de4: remove multipath map
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdi: orphan path, map flushed
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdi that holds hwe of 36001405044c0f50ba3c4e5b9b57e4de4
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 2
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405e0cbb950907b4a51af1a002ed: remove multipath map
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdh: orphan path, map flushed
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdh that holds hwe of 36001405e0cbb950907b4a51af1a002ed
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 1
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit ueventloop
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit uxlsnrloop
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit uevqloop
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit wait_dmevents
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit checkerloop
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: directio checker refcount 6
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk free tur checker  //checker_put
Mar 02 11:40:36 localhost.localdomain systemd-coredump[85547]: Process 85474 (multipathd) of user 0 dumped core

There are four "lxk start tur_thread" messages but only three "lxk exit tur_thread" messages.

>> I will use
>>         int oldstate;
>>         pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
>>         ...
>>         pthread_setcancelstate(oldstate, NULL);
>>         pthread_testcancel();
>> to test it.
> 
> Where exactly do you want to put that code?
> 
I added this at the beginning and end of tur_thread, but it did not help.

> IIUC you don't compile multipathd with -fexceptions, do you? You
> haven't answered my previous question why you do that for systemd.

I don't know why -fexceptions was used before, but we have removed it,
and the udev_monitor_receive_device core dump no longer occurs.

Regards,
Lixiaokeng


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-02 12:44           ` lixiaokeng
@ 2021-03-02 15:29             ` Martin Wilck
  2021-03-02 16:55               ` Martin Wilck
  2021-03-03 10:42               ` lixiaokeng
  0 siblings, 2 replies; 34+ messages in thread
From: Martin Wilck @ 2021-03-02 15:29 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel

On Tue, 2021-03-02 at 20:44 +0800, lixiaokeng wrote:
> 
> > Note that unlike all other threads, TUR threads are _detached_
> > threads.
> > multipathd tries to cancel them, but it has no way to verify that
> > they
> > actually stopped. It may be just a normal observation that you
> > can't
> > see the messages when a TUR thread terminates, in particular if the
> > program is exiting and might have already closed the stderr file
> > descriptor.
> > 
> > 
> > If you look at the crashed processes with gdb, the thread IDs
> > should
> > give you some clue which stack belongs to which thread. The TUR
> > threads
> > will have higher thread IDs than the others because they are
> > started
> > later.
> > 
> 
> 
> ??
> 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `/sbin/multipathd -d -s'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00007f3f669e071d in ?? ()
> [Current thread is 1 (Thread 0x7f3f65873700 (LWP 1645593))]
> (gdb) i thread
>   Id   Target Id                           Frame
> * 1    Thread 0x7f3f65873700 (LWP 1645593) 0x00007f3f669e071d in ??
> ()
>   2    Thread 0x7f3f6611a000 (LWP 1645066) 0x00007f3f669fede7 in
> munmap () at ../sysdeps/unix/syscall-template.S:78
>   3    Thread 0x7f3f6609d700 (LWP 1645095) syscall () at
> ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> (gdb) bt
> #0  0x00007f3f669e071d in ?? ()
> #1  0x0000000000000000 in ?? ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7f3f6611a000 (LWP 1645066))]
> #0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-
> template.S:78
> 78      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
> (gdb) bt
> #0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-
> template.S:78
> #1  0x00007f3f669fb77d in _dl_unmap_segments
> (l=l@entry=0x557cb432ba10) at ./dl-unmap-segments.h:32
> ...
> #10 0x00007f3f669b44ed in cleanup_prio () at prio.c:66 
> //cleanup_checkers() is finished.
> #11 0x0000557cb26db794 in child (param=<optimized out>) at
> main.c:2932
> #12 0x0000557cb26d44d3 in main (argc=<optimized out>,
> argv=0x7ffc98d47948) at main.c:3150
> 
> 
> UNWIND
> 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `/sbin/multipathd -d -s'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  x86_64_fallback_frame_state (fs=0x7fa9b2f576d0,
> context=0x7fa9b2f57980) at ./md-unwind-support.h:58
> 58        if (*(unsigned char *)(pc+0) == 0x48
> [Current thread is 1 (Thread 0x7fa9b2f58700 (LWP 1285074))]
> (gdb) i thread
>   Id   Target Id                                     Frame
> * 1    Thread 0x7fa9b2f58700 (LWP 1285074) (Exiting)
> x86_64_fallback_frame_state (fs=0x7fa9b2f576d0,
> context=0x7fa9b2f57980) at ./md-unwind-support.h:58
>   2    Thread 0x7fa9b383e000 (LWP 1284366)          
> 0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
>   3    Thread 0x7fa9b37c1700 (LWP 1284374)           syscall () at
> ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>   4    Thread 0x7fa9b2f73700 (LWP 1285077)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
>   5    Thread 0x7fa9b2f61700 (LWP 1285076)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
>   6    Thread 0x7fa9b2f4f700 (LWP 1285079)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
>   7    Thread 0x7fa9b2fa9700 (LWP 1285080)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fa9b383e000 (LWP 1284366))]
> #0  0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
> 27        return SYSCALL_CANCEL (close, fd);
> (gdb) bt
> #0  0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
> #1  0x00005606f030f95b in cleanup_dmevent_waiter () at dmevents.c:111
> #2  0x00005606f03087a2 in child (param=<optimized out>) at
> main.c:2934
> #3  0x00005606f03014d3 in main (argc=<optimized out>,
> argv=0x7ffdb782ab38) at main.c:3150
> 
> 
> The LWP of ?? and UNWIND is much larger than thread 2(main).
> 
> I add print_func like:
> 
> @@ -228,6 +228,10 @@ static void copy_msg_to_tcc(void *ct_p, const
> char *msg)
>         pthread_mutex_unlock(&ct->lock);
>  }
> 
> +static void lxk10 (void)
> +{
> +       condlog(2, "lxk exit tur_thread");
> +}
>  static void *tur_thread(void *ctx)
>  {
>         struct tur_checker_context *ct = ctx;
> @@ -235,6 +239,8 @@ static void *tur_thread(void *ctx)
>         char devt[32];
> 
>         /* This thread can be canceled, so setup clean up */
> +       condlog(2, "lxk start tur_thread");
> +       pthread_cleanup_push(lxk10, NULL);
>         tur_thread_cleanup_push(ct);
> 
> When there are four devices, core log:
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: exit
> (signal)
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sda:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdf:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sde:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdd:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdc:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdb:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 360014057a1353ec1bdd4dfcad19db6db: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdg: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdg that holds hwe of
> 360014057a1353ec1bdd4dfcad19db6db
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 4
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405faf8a6c2920840ed8ba73b9ee: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdj: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdj that holds hwe of
> 36001405faf8a6c2920840ed8ba73b9ee
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 3
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405044c0f50ba3c4e5b9b57e4de4: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdi: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdi that holds hwe of
> 36001405044c0f50ba3c4e5b9b57e4de4
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 2
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405e0cbb950907b4a51af1a002ed: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdh: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdh that holds hwe of
> 36001405e0cbb950907b4a51af1a002ed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 1
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> ueventloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> uxlsnrloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> uevqloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> wait_dmevents
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> checkerloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: directio
> checker refcount 6
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk free tur
> checker  //checker_put


So we do not see "unloading tur checker". Like you said, that suggests
that the crash occurs between libcheck_free() and the thread exiting.
I suggest you put a message in tur.c:libcheck_free (), AFTER the call
to cleanup_context(), printing the values of "running" and "holders".

Anyway:

	holders = uatomic_sub_return(&ct->holders, 1);
	if (!holders)
		cleanup_context(ct);

Whatever mistakes we have made, only one actor can have seen 
holders == 0, and have called cleanup_context().

The stacks you have shown indicate that the instruction pointers were
broken. That would suggest something similar to what was discussed in the ML
thread leading to 38ffd89 ("libmultipath: prevent DSO unloading with
astray checker threads"). Your logs show "tur checker refcount 1", so
the next call to checker_put would have unloaded the DSO. 

Please try commenting out the dlclose() call in free_checker_class(),
and see if it makes a difference.

Martin




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-02 11:07           ` Martin Wilck
@ 2021-03-02 15:49             ` lixiaokeng
  0 siblings, 0 replies; 34+ messages in thread
From: lixiaokeng @ 2021-03-02 15:49 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel



On 2021/3/2 19:07, Martin Wilck wrote:
> On Tue, 2021-03-02 at 16:41 +0800, lixiaokeng wrote:
>>
>>>
>>> Hi Martin:
>>>    Here I add condlog(2, "start funcname"),
>>> pthread_cleanup_push(func_print, NULL)
>>> in every pthread_create func. When these two core happened, "exit
>>> tur_thread"
>>> are less than "start tur_thread". So the trouble may be in
>>> tur_thread.
>>>
>>
>> If I made no mistake, these coredumps happened when the
>> tur_thread(which is
>> libcheck_thread in latest code) was not finished but checker(tur)was
>> cleared by cleanup_checkers. But I can't understand this, because the
>> ctx
>> in tur_thread wasn't freed by cleanup_checkers.
>>
>> Please give me some helps, thanks.
> 
> So you think the crashing thread is libcheck_thread? Do you have a core
> dump from this situation?
> 
> Does the code you use include commit 38ffd89 ("libmultipath: prevent
> DSO unloading with astray checker threads")?
> 
> Martin
> 

Hi Martin,

   I am sorry; the code I tested does not include commit 38ffd89. I
confused the master and queue branches, and only tested 0.7.7 and the
0.8.5 master. Please ignore libcheck_thread and focus on my last
email with the thread info.

Regards,
Lixiaokeng



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-02 15:29             ` Martin Wilck
@ 2021-03-02 16:55               ` Martin Wilck
  2021-03-03 10:42               ` lixiaokeng
  1 sibling, 0 replies; 34+ messages in thread
From: Martin Wilck @ 2021-03-02 16:55 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel

[-- Attachment #1: Type: text/plain, Size: 1006 bytes --]

Hi lixiaokeng,

On Tue, 2021-03-02 at 16:29 +0100, Martin Wilck wrote:
> On Tue, 2021-03-02 at 20:44 +0800, lixiaokeng wrote:
> > 
> > 
> 
> The stacks you have shown indicate that the instruction pointers were
> broken. That would suggest something similar to what was discussed in the ML
> thread leading to 38ffd89 ("libmultipath: prevent DSO unloading with
> astray checker threads"). Your logs show "tur checker refcount 1", so
> the next call to checker_put would have unloaded the DSO. 
> 
> Please try commenting out the dlclose() call in free_checker_class(),
> and see if it makes a difference.

I have two TENTATIVE patches here that I'd like to ask you to try (with
the dlclose() in place again). Also, please make sure you've got 38ffd89.

This is really tentative; I'm still pretty much in the dark. But my
theory is that the crash can happen while the thread is about to start. So
the most important part is the hunk that checks the return value of
checker_class_ref() in start_checker_thread().

Martin



[-- Attachment #2: 0001-libmultipath-protect-DSO-unloading-with-RCU.patch --]
[-- Type: text/x-patch, Size: 4795 bytes --]

From a4dd64808d49f5a0d2a94336e56401262ef99e55 Mon Sep 17 00:00:00 2001
From: Martin Wilck <mwilck@suse.com>
Date: Tue, 2 Mar 2021 17:03:15 +0100
Subject: [PATCH 1/2] libmultipath: protect DSO unloading with RCU

Some crashes possibly related to DSO unloading are still observed.
Try protecting the unloading with RCU.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/checkers.c | 79 ++++++++++++++++++++++++++++++-----------
 libmultipath/propsel.c  |  4 +++
 2 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/libmultipath/checkers.c b/libmultipath/checkers.c
index 2dd9915..25f07ce 100644
--- a/libmultipath/checkers.c
+++ b/libmultipath/checkers.c
@@ -3,6 +3,7 @@
 #include <stddef.h>
 #include <dlfcn.h>
 #include <sys/stat.h>
+#include <errno.h>
 #include <urcu.h>
 #include <urcu/uatomic.h>
 
@@ -25,6 +26,7 @@ struct checker_class {
 	void *(*thread)(void *);	     /* async thread entry point */
 	const char **msgtable;
 	short msgtable_size;
+	struct rcu_head rcu;
 };
 
 static const char *checker_state_names[PATH_MAX_STATE] = {
@@ -74,20 +76,16 @@ static int checker_class_unref(struct checker_class *cls)
 	return uatomic_sub_return(&cls->refcount, 1);
 }
 
-void free_checker_class(struct checker_class *c)
+static void free_checker_class_rcu(struct rcu_head *head)
 {
-	int cnt;
+	struct checker_class *c = container_of(head, struct checker_class, rcu);
 
-	if (!c)
-		return;
-	cnt = checker_class_unref(c);
-	if (cnt != 0) {
-		condlog(cnt < 0 ? 1 : 4, "%s checker refcount %d",
-			c->name, cnt);
+	if (uatomic_read(&c-refcount) > 0) {
+		condlog(1, "%s: RACE: refcount = %d, not freeing checker",
+			__func__, refcount);
 		return;
 	}
 	condlog(3, "unloading %s checker", c->name);
-	list_del(&c->node);
 	if (c->reset)
 		c->reset();
 	if (c->handle) {
@@ -99,6 +97,22 @@ void free_checker_class(struct checker_class *c)
 	FREE(c);
 }
 
+static void free_checker_class(struct checker_class *c)
+{
+	int cnt;
+
+	if (!c)
+		return;
+	cnt = checker_class_unref(c);
+	if (cnt != 0) {
+		condlog(cnt < 0 ? 1 : 4, "%s checker refcount %d",
+			c->name, cnt);
+		return;
+	}
+	list_del(&c->node);
+	call_rcu(&c->rcu, free_checker_class_rcu);
+}
+
 void cleanup_checkers (void)
 {
 	struct checker_class *checker_loop;
@@ -111,15 +125,32 @@ void cleanup_checkers (void)
 
 static struct checker_class *checker_class_lookup(const char *name)
 {
-	struct checker_class *c;
+	struct checker_class *c, *found = NULL;
+	int refcount = 0;
 
 	if (!name || !strlen(name))
 		return NULL;
+
+	rcu_read_lock();
 	list_for_each_entry(c, &checkers, node) {
-		if (!strncmp(name, c->name, CHECKER_NAME_LEN))
-			return c;
+		if (!strncmp(name, c->name, CHECKER_NAME_LEN)) {
+			found = c;
+			break;
+		}
 	}
-	return NULL;
+	if (found) {
+		refcount = checker_class_ref(found);
+		if (refcount == 1)
+			checker_class_unref(found);
+	}
+	rcu_read_unlock();
+
+	if (refcount <= 1) {
+		condlog(1, "%s: RACE: got refcount == %d", __func__, refcount);
+		found = NULL;
+	}
+
+	return found;
 }
 
 void reset_checker_classes(void)
@@ -387,11 +418,20 @@ static void *checker_thread_entry(void *arg)
 int start_checker_thread(pthread_t *thread, const pthread_attr_t *attr,
 			 struct checker_context *ctx)
 {
-	int rv;
+	int rv, refcount;
 
 	assert(ctx && ctx->cls && ctx->cls->thread);
+
 	/* Take a ref here, lest the class be freed before the thread starts */
-	(void)checker_class_ref(ctx->cls);
+	rcu_read_lock();
+	refcount = checker_class_ref(ctx->cls);
+	if (refcount <= 1)
+		checker_class_unref(ctx->cls);
+	rcu_read_unlock();
+	if (refcount <= 1)
+		condlog(1, "%s: RACE: got refcount == %d", __func_, refcount);
+		return EIO;
+	}
 	rv = pthread_create(thread, attr, checker_thread_entry, ctx);
 	if (rv != 0) {
 		condlog(1, "failed to start checker thread for %s: %m",
@@ -418,14 +458,13 @@ void checker_get(const char *multipath_dir, struct checker *dst,
 
 	if (name && strlen(name)) {
 		src = checker_class_lookup(name);
-		if (!src)
+		if (!src) {
 			src = add_checker_class(multipath_dir, name);
+			if (src && checker_class_ref(src) == 1)
+				src = NULL;
+		}
 	}
 	dst->cls = src;
-	if (!src)
-		return;
-
-	(void)checker_class_ref(dst->cls);
 }
 
 int init_checkers(const char *multipath_dir)
diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c
index f771a83..4add95a 100644
--- a/libmultipath/propsel.c
+++ b/libmultipath/propsel.c
@@ -536,6 +536,10 @@ int select_checker(struct config *conf, struct path *pp)
 	do_default(ckr_name, DEFAULT_CHECKER);
 out:
 	checker_get(conf->multipath_dir, c, ckr_name);
+	if (!checker_selected(c)) {
+		condlog(1, "%s: failed to grab checker", __func__);
+		return 1;
+	}
 	condlog(3, "%s: path_checker = %s %s", pp->dev,
 		checker_name(c), origin);
 	if (conf->checker_timeout) {
-- 
2.29.2


[-- Attachment #3: 0002-libmultipath-tur_thread-use-pthread_exit.patch --]
[-- Type: text/x-patch, Size: 2236 bytes --]

From c44375bb5e218b1e54ca4d9069b2b1632df87f75 Mon Sep 17 00:00:00 2001
From: Martin Wilck <mwilck@suse.com>
Date: Tue, 2 Mar 2021 17:05:26 +0100
Subject: [PATCH 2/2] libmultipath: tur_thread: use pthread_exit()

Using "return" would jump into a different DSO (libmultipath),
avoid that.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/checkers.c     | 11 ++++++-----
 libmultipath/checkers/tur.c |  2 ++
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/libmultipath/checkers.c b/libmultipath/checkers.c
index 25f07ce..99e48bc 100644
--- a/libmultipath/checkers.c
+++ b/libmultipath/checkers.c
@@ -79,8 +79,9 @@ static int checker_class_unref(struct checker_class *cls)
 static void free_checker_class_rcu(struct rcu_head *head)
 {
 	struct checker_class *c = container_of(head, struct checker_class, rcu);
+	int refcount;
 
-	if (uatomic_read(&c-refcount) > 0) {
+	if ((refcount = uatomic_read(&c->refcount)) > 0) {
 		condlog(1, "%s: RACE: refcount = %d, not freeing checker",
 			__func__, refcount);
 		return;
@@ -145,7 +146,7 @@ static struct checker_class *checker_class_lookup(const char *name)
 	}
 	rcu_read_unlock();
 
-	if (refcount <= 1) {
+	if (refcount == 1) {
 		condlog(1, "%s: RACE: got refcount == %d", __func__, refcount);
 		found = NULL;
 	}
@@ -425,11 +426,11 @@ int start_checker_thread(pthread_t *thread, const pthread_attr_t *attr,
 	/* Take a ref here, lest the class be freed before the thread starts */
 	rcu_read_lock();
 	refcount = checker_class_ref(ctx->cls);
-	if (refcount <= 1)
+	if (refcount == 1)
 		checker_class_unref(ctx->cls);
 	rcu_read_unlock();
-	if (refcount <= 1)
-		condlog(1, "%s: RACE: got refcount == %d", __func_, refcount);
+	if (refcount <= 1) {
+		condlog(1, "%s: RACE: got refcount == %d", __func__, refcount);
 		return EIO;
 	}
 	rv = pthread_create(thread, attr, checker_thread_entry, ctx);
diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index a4b4a21..0db50ba 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -284,6 +284,8 @@ void *libcheck_thread(struct checker_context *ctx)
 
 	tur_thread_cleanup_pop(ct);
 
+	pthread_exit(NULL);
+	/* not reached */
 	return ((void *)0);
 }
 
-- 
2.29.2



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-02 15:29             ` Martin Wilck
  2021-03-02 16:55               ` Martin Wilck
@ 2021-03-03 10:42               ` lixiaokeng
  2021-03-08  9:40                 ` Martin Wilck
  1 sibling, 1 reply; 34+ messages in thread
From: lixiaokeng @ 2021-03-03 10:42 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel


>> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: directio
>> checker refcount 6
>> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk free tur
>> checker  //checker_put
> 
> 
> So we do not see "unloading tur checker". Like you said, that suggests
> that the crash occurs between libcheck_free() and the thread exiting.



"lxk free tur checker" is added in free_checker(), which is called by
checker_put(). I didn't change the level of "unloading tur checker", so
we don't see it.

@@ -58,7 +58,7 @@ void free_checker (struct checker * c)
                return;
        c->refcount--;
        if (c->refcount) {
-               condlog(3, "%s checker refcount %d",
+               condlog(2, "%s checker refcount %d",
                        c->name, c->refcount);
                return;
        }
@@ -77,6 +77,7 @@ void free_checker (struct checker * c)
                        pthread_join(ct->thread, NULL);
                };
        }
+       condlog(2, "lxk free %s checker", c->name);
        FREE(c);
 }


> I suggest you put a message in tur.c:libcheck_free (), AFTER the call
> to cleanup_context(), printing the values of "running" and "holders"
> Anyway:
> 
> 	holders = uatomic_sub_return(&ct->holders, 1);
> 	if (!holders)
> 		cleanup_context(ct);
> 
> Whatever mistakes we have made, only one actor can have seen 
> holders == 0, and have called cleanup_context().
> 

diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index 4ea63af..900f960 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -105,8 +105,11 @@ void libcheck_free (struct checker * c)
                        pthread_cancel(ct->thread);
                ct->thread = 0;
                holders = uatomic_sub_return(&ct->holders, 1);
-               if (!holders)
+               if (!holders) {
+                       running = uatomic_xchg(&ct->running, 0);
                        cleanup_context(ct);
+                       condlog(2, "lxk tur running is %d", running);
+               }
                c->context = NULL;
        }
        return;


Here I added a print of "running", but it is zero.

> The stacks you have shown indicate that the instruction pointers were
> broken. That would suggest something similar to what was discussed in the ML
> thread leading to 38ffd89 ("libmultipath: prevent DSO unloading with
> astray checker threads"). Your logs show "tur checker refcount 1", so
> the next call to checker_put would have unloaded the DSO. 

Here I tested the 0.8.5 master code with commit 38ffd89. There was no
crash in five hours (without the patch, the crash happened after running
the test script for 30 to 40 minutes).

Regards,
Lixiaokeng




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-03 10:42               ` lixiaokeng
@ 2021-03-08  9:40                 ` Martin Wilck
  2021-03-15 13:00                   ` Martin Wilck
  0 siblings, 1 reply; 34+ messages in thread
From: Martin Wilck @ 2021-03-08  9:40 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel

Hello Lixiaokeng,

On Wed, 2021-03-03 at 18:42 +0800, lixiaokeng wrote:
> 
> > The stacks you have shown indicate that the instruction pointers
> > were
> > broken. That would suggest something similar to what was discussed in the ML
> > thread leading to 38ffd89 ("libmultipath: prevent DSO unloading
> > with
> > astray checker threads"). Your logs show "tur checker refcount 1",
> > so
> > the next call to checker_put would have unloaded the DSO. 
> 
> Here I test 0.8.5 master code with commit 38ffd89. There is no crash
> in five hours (without patch, crash happen in running test script
> for 30 to 40 minutes.)

Can you confirm that that commit fixes your issue?

Regards
Martin



* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-08  9:40                 ` Martin Wilck
@ 2021-03-15 13:00                   ` Martin Wilck
  2021-03-16 11:12                     ` lixiaokeng
  0 siblings, 1 reply; 34+ messages in thread
From: Martin Wilck @ 2021-03-15 13:00 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel

Hi Lixiaokeng,

On Mon, 2021-03-08 at 10:40 +0100, Martin Wilck wrote:
> Hello Lixiaokeng,
> 
> On Wed, 2021-03-03 at 18:42 +0800, lixiaokeng wrote:
> > 
> > > The stacks you have shown indicate that the instruction pointers
> > > were
> > > broken. That would suggest something similar to what was discussed in the
> > > ML
> > > thread leading to 38ffd89 ("libmultipath: prevent DSO unloading
> > > with
> > > astray checker threads"). Your logs show "tur checker refcount
> > > 1",
> > > so
> > > the next call to checker_put would have unloaded the DSO. 
> > 
> > Here I test 0.8.5 master code with commit 38ffd89. There is no
> > crash
> > in five hours (without patch, crash happen in running test script
> > for 30 to 40 minutes.)
> 
> Can you confirm that that commit fixes your issue?

Any updates on this?

Thanks,
Martin





* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-15 13:00                   ` Martin Wilck
@ 2021-03-16 11:12                     ` lixiaokeng
  2021-03-17 16:59                       ` Martin Wilck
  0 siblings, 1 reply; 34+ messages in thread
From: lixiaokeng @ 2021-03-16 11:12 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel



>>
>> Can you confirm that that commit fixes your issue?
> 
> Any updates on this?
> 

Hi Martin

  I'm sorry for missing this. 38ffd89 ("libmultipath: prevent
DSO unloading with astray checker threads") fixes my issue.

Regards
Lixiaokeng


* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-16 11:12                     ` lixiaokeng
@ 2021-03-17 16:59                       ` Martin Wilck
  2021-03-19  1:49                         ` lixiaokeng
  0 siblings, 1 reply; 34+ messages in thread
From: Martin Wilck @ 2021-03-17 16:59 UTC (permalink / raw)
  To: lixiaokeng, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel

Hello Lixiaokeng,


> Hi Martin
> 
>   I'm sorry for missing this. 38ffd89 ("libmultipath: prevent
> DSO unloading with astray checker threads") fixes my issue.

Thanks! Am I right in assuming that you did _not_ use my prior patch
"multipathd: avoid crash in uevent_cleanup()" (which would mean that we
don't need it)? Please confirm that, too.

Regards,
Martin




* Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
  2021-03-17 16:59                       ` Martin Wilck
@ 2021-03-19  1:49                         ` lixiaokeng
  0 siblings, 0 replies; 34+ messages in thread
From: lixiaokeng @ 2021-03-19  1:49 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski, Christophe Varoqui; +Cc: linfeilong, dm-devel



On 2021/3/18 0:59, Martin Wilck wrote:
> Hello Lixiaokeng,
> 
> 
>> Hi Martin
>>
>>   I'm sorry for missing this. 38ffd89 ("libmultipath: prevent
>> DSO unloading with astray checker threads") fixes my issue.
> 
> Thanks! Am I right assuming that you did _not_ use my prior patch
> "multipathd: avoid crash in uevent_cleanup()" (which would mean that we
> don't need it)? Please confirm that, too.
> 

The patch "multipathd: avoid crash in uevent_cleanup()" is _NOT_ used.

Regards,
Lixiaokeng



end of thread, other threads:[~2021-03-19  1:49 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-28 21:08 [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup() mwilck
2021-02-02 20:52 ` Martin Wilck
2021-02-03 10:48   ` lixiaokeng
2021-02-03 13:57     ` Martin Wilck
2021-02-04  1:40       ` lixiaokeng
2021-02-04 15:06         ` Martin Wilck
2021-02-05 11:08           ` Martin Wilck
2021-02-05 11:09             ` Martin Wilck
2021-02-07  7:05             ` lixiaokeng
2021-03-01 14:53       ` lixiaokeng
2021-03-02  8:41         ` lixiaokeng
2021-03-02 11:07           ` Martin Wilck
2021-03-02 15:49             ` lixiaokeng
2021-03-02  9:56         ` Martin Wilck
2021-03-02 12:44           ` lixiaokeng
2021-03-02 15:29             ` Martin Wilck
2021-03-02 16:55               ` Martin Wilck
2021-03-03 10:42               ` lixiaokeng
2021-03-08  9:40                 ` Martin Wilck
2021-03-15 13:00                   ` Martin Wilck
2021-03-16 11:12                     ` lixiaokeng
2021-03-17 16:59                       ` Martin Wilck
2021-03-19  1:49                         ` lixiaokeng
2021-02-08  7:41     ` lixiaokeng
2021-02-08  9:50       ` Martin Wilck
2021-02-08 10:49         ` lixiaokeng
2021-02-08 11:03           ` Martin Wilck
2021-02-09  1:36             ` lixiaokeng
2021-02-09 17:30               ` Martin Wilck
2021-02-10  2:02                 ` lixiaokeng
2021-02-10  2:29                   ` Hexiaowen (Hex, EulerOS)
2021-02-19 10:35                     ` Martin Wilck
2021-02-19  1:36                 ` lixiaokeng
2021-02-02 22:23 ` Benjamin Marzinski
