From: lixiaokeng <lixiaokeng@huawei.com>
To: Martin Wilck <mwilck@suse.com>,
	Benjamin Marzinski <bmarzins@redhat.com>,
	 Christophe Varoqui <christophe.varoqui@opensvc.com>
Cc: linfeilong <linfeilong@huawei.com>, dm-devel@redhat.com
Subject: Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
Date: Tue, 2 Mar 2021 20:44:43 +0800
Message-ID: <05c23ce9-4859-b0c3-3acb-c74f2c4510d6@huawei.com>
In-Reply-To: <79f18cdb19b41be24d082d5528ab2325e6552395.camel@suse.com>


> Note that unlike all other threads, TUR threads are _detached_ threads.
> multipathd tries to cancel them, but it has no way to verify that they
> actually stopped. It may be just a normal observation that you can't
> see the messages when a TUR thread terminates, in particular if the
> program is exiting and might have already closed the stderr file
> descriptor.
> 
> 
> If you look at the crashed processes with gdb, the thread IDs should
> give you some clue which stack belongs to which thread. The TUR threads
> will have higher thread IDs than the others because they are started
> later.
>


Core of the "??" crash:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/sbin/multipathd -d -s'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f3f669e071d in ?? ()
[Current thread is 1 (Thread 0x7f3f65873700 (LWP 1645593))]
(gdb) i thread
  Id   Target Id                           Frame
* 1    Thread 0x7f3f65873700 (LWP 1645593) 0x00007f3f669e071d in ?? ()
  2    Thread 0x7f3f6611a000 (LWP 1645066) 0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78
  3    Thread 0x7f3f6609d700 (LWP 1645095) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
(gdb) bt
#0  0x00007f3f669e071d in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f3f6611a000 (LWP 1645066))]
#0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78
78	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007f3f669fb77d in _dl_unmap_segments (l=l@entry=0x557cb432ba10) at ./dl-unmap-segments.h:32
...
#10 0x00007f3f669b44ed in cleanup_prio () at prio.c:66  //cleanup_checkers() is finished.
#11 0x0000557cb26db794 in child (param=<optimized out>) at main.c:2932
#12 0x0000557cb26d44d3 in main (argc=<optimized out>, argv=0x7ffc98d47948) at main.c:3150


Core of the "UNWIND" crash:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/sbin/multipathd -d -s'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  x86_64_fallback_frame_state (fs=0x7fa9b2f576d0, context=0x7fa9b2f57980) at ./md-unwind-support.h:58
58	  if (*(unsigned char *)(pc+0) == 0x48
[Current thread is 1 (Thread 0x7fa9b2f58700 (LWP 1285074))]
(gdb) i thread
  Id   Target Id                                     Frame
* 1    Thread 0x7fa9b2f58700 (LWP 1285074) (Exiting) x86_64_fallback_frame_state (fs=0x7fa9b2f576d0, context=0x7fa9b2f57980) at ./md-unwind-support.h:58
  2    Thread 0x7fa9b383e000 (LWP 1284366)           0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27
  3    Thread 0x7fa9b37c1700 (LWP 1284374)           syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  4    Thread 0x7fa9b2f73700 (LWP 1285077)           0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  5    Thread 0x7fa9b2f61700 (LWP 1285076)           0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  6    Thread 0x7fa9b2f4f700 (LWP 1285079)           0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
  7    Thread 0x7fa9b2fa9700 (LWP 1285080)           0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fa9b383e000 (LWP 1284366))]
#0  0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27
27	  return SYSCALL_CANCEL (close, fd);
(gdb) bt
#0  0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27
#1  0x00005606f030f95b in cleanup_dmevent_waiter () at dmevents.c:111
#2  0x00005606f03087a2 in child (param=<optimized out>) at main.c:2934
#3  0x00005606f03014d3 in main (argc=<optimized out>, argv=0x7ffdb782ab38) at main.c:3150


In both the "??" core and the "UNWIND" core, the LWP of the crashed
thread is much larger than that of thread 2 (main).

I added debug prints like this:

@@ -228,6 +228,10 @@ static void copy_msg_to_tcc(void *ct_p, const char *msg)
        pthread_mutex_unlock(&ct->lock);
 }

+static void lxk10(void *unused)
+{
+       condlog(2, "lxk exit tur_thread");
+}
 static void *tur_thread(void *ctx)
 {
        struct tur_checker_context *ct = ctx;
@@ -235,6 +239,8 @@ static void *tur_thread(void *ctx)
        char devt[32];

        /* This thread can be canceled, so setup clean up */
+       condlog(2, "lxk start tur_thread");
+       pthread_cleanup_push(lxk10, NULL);
        tur_thread_cleanup_push(ct);

With four devices, the log leading up to the core dump is:
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: exit (signal)
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sda: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdf: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sde: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdd: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdc: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdb: unusable path
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 360014057a1353ec1bdd4dfcad19db6db: remove multipath map
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdg: orphan path, map flushed
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdg that holds hwe of 360014057a1353ec1bdd4dfcad19db6db
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 4
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405faf8a6c2920840ed8ba73b9ee: remove multipath map
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdj: orphan path, map flushed
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdj that holds hwe of 36001405faf8a6c2920840ed8ba73b9ee
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 3
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405044c0f50ba3c4e5b9b57e4de4: remove multipath map
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdi: orphan path, map flushed
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdi that holds hwe of 36001405044c0f50ba3c4e5b9b57e4de4
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 2
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405e0cbb950907b4a51af1a002ed: remove multipath map
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdh: orphan path, map flushed
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdh that holds hwe of 36001405e0cbb950907b4a51af1a002ed
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 1
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit ueventloop
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit uxlsnrloop
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit uevqloop
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit wait_dmevents
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit checkerloop
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: directio checker refcount 6
Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk free tur checker  //checker_put
Mar 02 11:40:36 localhost.localdomain systemd-coredump[85547]: Process 85474 (multipathd) of user 0 dumped core

There are four "lxk start tur_thread" messages but only three
"lxk exit tur_thread" messages.
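
The start/exit counting used above can be reproduced outside multipathd.
What follows is only a minimal standalone sketch, not the tur checker
code: worker(), on_exit_handler() and count_missing_exits() are
hypothetical names, and atomic counters stand in for the condlog()
messages. Each detached thread counts itself as started, pushes a
cleanup handler, and then blocks in a cancellation point until it is
cancelled. Because detached threads cannot be joined, the caller can
only poll for the handlers to run.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define MAX_THREADS 64

static atomic_int started; /* threads that entered worker() */
static atomic_int exited;  /* cleanup handlers that have run */

static void sleep_ms(long ms)
{
	struct timespec ts = { .tv_sec = ms / 1000,
			       .tv_nsec = (ms % 1000) * 1000000L };
	nanosleep(&ts, NULL);
}

/* Stand-in for the "exit" log message; runs when the thread is cancelled. */
static void on_exit_handler(void *arg)
{
	(void)arg;
	atomic_fetch_add(&exited, 1);
}

static void *worker(void *arg)
{
	(void)arg;
	atomic_fetch_add(&started, 1); /* stand-in for the "start" message */
	pthread_cleanup_push(on_exit_handler, NULL);
	for (;;)
		sleep_ms(10); /* nanosleep() is a cancellation point */
	pthread_cleanup_pop(1); /* not reached, required to pair the push */
	return NULL;
}

/* Returns started - exited after cancelling all threads, or -1 on error. */
int count_missing_exits(int nthreads)
{
	pthread_t tids[MAX_THREADS];
	pthread_attr_t attr;
	int i, created = 0;

	if (nthreads < 0 || nthreads > MAX_THREADS)
		return -1;
	atomic_store(&started, 0);
	atomic_store(&exited, 0);

	pthread_attr_init(&attr);
	pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[created], &attr, worker, NULL) != 0)
			break;
		created++;
	}
	pthread_attr_destroy(&attr);

	/* Wait until every created thread has counted itself as started. */
	while (atomic_load(&started) < created)
		sleep_ms(1);
	for (i = 0; i < created; i++)
		pthread_cancel(tids[i]);

	/* Detached threads cannot be joined: poll for up to about 2 s. */
	for (i = 0; i < 2000 && atomic_load(&exited) < created; i++)
		sleep_ms(1);

	if (created != nthreads)
		return -1;
	return atomic_load(&started) - atomic_load(&exited);
}
```

Calling count_missing_exits(4) from a test program (built with -pthread)
should give 0 once all four handlers have run. A non-zero result would
correspond to the situation in the log above, where one thread had not
yet run its cleanup handler.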

>> I will use
>>         int oldstate;
>>         pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
>>         ...
>>         pthread_setcancelstate(oldstate, NULL);
>>         pthread_testcancel();
>> to test it.
> 
> Where exactly do you want to put that code?
> 
I added this at the beginning and end of tur_thread, but it did not help.
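
To make the placement concrete, here is a minimal standalone sketch of
the pattern. It is not the actual change to tur_thread():
guarded_thread() and run_cancel_deferral_demo() are hypothetical names,
and the thread is joinable here (unlike the detached TUR threads) so
that the result can be checked. Cancellation is disabled at the
beginning, the work runs to completion even though a cancel request is
already pending, and the request is only acted on at the
pthread_testcancel() call at the end.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int cancel_sent; /* set after pthread_cancel() was called */
static atomic_int work_done;   /* set when the protected section finished */

static void *guarded_thread(void *arg)
{
	int oldstate;
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000L };

	(void)arg;
	/* Beginning: defer any cancellation request from here on. */
	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);

	/*
	 * Stand-in for the real work. nanosleep() is normally a
	 * cancellation point, but cancellation is disabled here.
	 */
	while (!atomic_load(&cancel_sent))
		nanosleep(&ts, NULL);
	atomic_store(&work_done, 1);

	/* End: restore the old state and act on a pending request. */
	pthread_setcancelstate(oldstate, NULL);
	pthread_testcancel();
	return NULL; /* only reached if nobody cancelled the thread */
}

/* Returns 1 if the work completed and the thread was then cancelled. */
int run_cancel_deferral_demo(void)
{
	pthread_t tid;
	void *ret = NULL;

	atomic_store(&cancel_sent, 0);
	atomic_store(&work_done, 0);
	if (pthread_create(&tid, NULL, guarded_thread, NULL) != 0)
		return -1;
	pthread_cancel(tid);           /* request is queued, not acted on */
	atomic_store(&cancel_sent, 1); /* let the protected section finish */
	if (pthread_join(tid, &ret) != 0)
		return -1;
	return atomic_load(&work_done) == 1 && ret == PTHREAD_CANCELED;
}
```

This only shows that the cancel request is deferred to the end of the
protected section. It does not by itself explain the crash, which is
consistent with the change not helping.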

> IIUC you don't compile multipathd with -fexceptions, do you? You
> haven't answered my previous question why you do that for systemd.

I don't know why -fexceptions was used before, but we have removed it,
and the udev_monitor_receive_device core dump no longer occurs.

Regards,
Lixiaokeng


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

