All of lore.kernel.org
 help / color / mirror / Atom feed
From: Martin Wilck <mwilck@suse.com>
To: lixiaokeng <lixiaokeng@huawei.com>,
	Benjamin Marzinski <bmarzins@redhat.com>,
	Christophe Varoqui <christophe.varoqui@opensvc.com>
Cc: linfeilong <linfeilong@huawei.com>, dm-devel@redhat.com
Subject: Re: [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
Date: Tue, 02 Mar 2021 16:29:40 +0100	[thread overview]
Message-ID: <41e79d67f568baf8de6b28e4924620240f0a2731.camel@suse.com> (raw)
In-Reply-To: <05c23ce9-4859-b0c3-3acb-c74f2c4510d6@huawei.com>

On Tue, 2021-03-02 at 20:44 +0800, lixiaokeng wrote:
> 
> > Note that unlike all other threads, TUR threads are _detached_
> > threads.
> > multipathd tries to cancel them, but it has no way to verify that
> > they
> > actually stopped. It may be just a normal observation that you
> > can't
> > see the messages when a TUR thread terminates, in particular if the
> > program is exiting and might have already closed the stderr file
> > descriptor.
> > 
> > 
> > If you look at the crashed processes with gdb, the thread IDs
> > should
> > give you some clue which stack belongs to which thread. The TUR
> > threads
> > will have higher thread IDs than the others because they are
> > started
> > later.
> > 
> 
> 
> ??
> 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `/sbin/multipathd -d -s'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00007f3f669e071d in ?? ()
> [Current thread is 1 (Thread 0x7f3f65873700 (LWP 1645593))]
> (gdb) i thread
>   Id   Target Id                           Frame
> * 1    Thread 0x7f3f65873700 (LWP 1645593) 0x00007f3f669e071d in ??
> ()
>   2    Thread 0x7f3f6611a000 (LWP 1645066) 0x00007f3f669fede7 in
> munmap () at ../sysdeps/unix/syscall-template.S:78
>   3    Thread 0x7f3f6609d700 (LWP 1645095) syscall () at
> ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> (gdb) bt
> #0  0x00007f3f669e071d in ?? ()
> #1  0x0000000000000000 in ?? ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7f3f6611a000 (LWP 1645066))]
> #0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-
> template.S:78
> 78      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
> (gdb) bt
> #0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-
> template.S:78
> #1  0x00007f3f669fb77d in _dl_unmap_segments
> (l=l@entry=0x557cb432ba10) at ./dl-unmap-segments.h:32
> ...
> #10 0x00007f3f669b44ed in cleanup_prio () at prio.c:66 
> //cleanup_checkers() is finished.
> #11 0x0000557cb26db794 in child (param=<optimized out>) at
> main.c:2932
> #12 0x0000557cb26d44d3 in main (argc=<optimized out>,
> argv=0x7ffc98d47948) at main.c:3150
> 
> 
> UNWIND
> 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `/sbin/multipathd -d -s'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  x86_64_fallback_frame_state (fs=0x7fa9b2f576d0,
> context=0x7fa9b2f57980) at ./md-unwind-support.h:58
> 58        if (*(unsigned char *)(pc+0) == 0x48
> [Current thread is 1 (Thread 0x7fa9b2f58700 (LWP 1285074))]
> (gdb) i thread
>   Id   Target Id                                     Frame
> * 1    Thread 0x7fa9b2f58700 (LWP 1285074) (Exiting)
> x86_64_fallback_frame_state (fs=0x7fa9b2f576d0,
> context=0x7fa9b2f57980) at ./md-unwind-support.h:58
>   2    Thread 0x7fa9b383e000 (LWP 1284366)          
> 0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
>   3    Thread 0x7fa9b37c1700 (LWP 1284374)           syscall () at
> ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>   4    Thread 0x7fa9b2f73700 (LWP 1285077)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
>   5    Thread 0x7fa9b2f61700 (LWP 1285076)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
>   6    Thread 0x7fa9b2f4f700 (LWP 1285079)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
>   7    Thread 0x7fa9b2fa9700 (LWP 1285080)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fa9b383e000 (LWP 1284366))]
> #0  0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
> 27        return SYSCALL_CANCEL (close, fd);
> (gdb) bt
> #0  0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
> #1  0x00005606f030f95b in cleanup_dmevent_waiter () at dmevents.c:111
> #2  0x00005606f03087a2 in child (param=<optimized out>) at
> main.c:2934
> #3  0x00005606f03014d3 in main (argc=<optimized out>,
> argv=0x7ffdb782ab38) at main.c:3150
> 
> 
> The LWP of ?? and UNWIND is much larger than thread 2(main).
> 
> I add print_func like:
> 
> @@ -228,6 +228,10 @@ static void copy_msg_to_tcc(void *ct_p, const
> char *msg)
>         pthread_mutex_unlock(&ct->lock);
>  }
> 
> +static void lxk10 (void)
> +{
> +       condlog(2, "lxk exit tur_thread");
> +}
>  static void *tur_thread(void *ctx)
>  {
>         struct tur_checker_context *ct = ctx;
> @@ -235,6 +239,8 @@ static void *tur_thread(void *ctx)
>         char devt[32];
> 
>         /* This thread can be canceled, so setup clean up */
> +       condlog(2, "lxk start tur_thread");
> +       pthread_cleanup_push(lxk10, NULL);
>         tur_thread_cleanup_push(ct);
> 
> When there are four devices, core log:
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: exit
> (signal)
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sda:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdf:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sde:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdd:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdc:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdb:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 360014057a1353ec1bdd4dfcad19db6db: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdg: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdg that holds hwe of
> 360014057a1353ec1bdd4dfcad19db6db
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 4
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405faf8a6c2920840ed8ba73b9ee: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdj: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdj that holds hwe of
> 36001405faf8a6c2920840ed8ba73b9ee
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 3
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405044c0f50ba3c4e5b9b57e4de4: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdi: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdi that holds hwe of
> 36001405044c0f50ba3c4e5b9b57e4de4
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 2
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405e0cbb950907b4a51af1a002ed: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdh: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdh that holds hwe of
> 36001405e0cbb950907b4a51af1a002ed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 1
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> ueventloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> uxlsnrloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> uevqloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> wait_dmevents
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> checkerloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: directio
> checker refcount 6
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk free tur
> checker  //checker_put


So we do not see "unloading tur checker". Like you said, that suggests
that the crash occurs between libcheck_free() and the thread exiting.
I suggest you put a message in tur.c:libcheck_free (), AFTER the call
to cleanup_context(), printing the values of "running" and "holders".

Anyway:

	holders = uatomic_sub_return(&ct->holders, 1);
	if (!holders)
		cleanup_context(ct);

Whatever mistakes we have made, only one actor can have seen 
holders == 0, and have called cleanup_context().

The stacks you have shown indicate that the instruction pointers were
broken. That would suggest something similar as dicussed in the ML
thread leading to 38ffd89 ("libmultipath: prevent DSO unloading with
astray checker threads"). Your logs show "tur checker refcount 1", so
the next call to checker_put would have unloaded the DSO. 

Please try commenting out the dlclose() call in free_checker_class(),
and see if it makes a difference.

Martin



--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


  reply	other threads:[~2021-03-02 15:30 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-28 21:08 [dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup() mwilck
2021-02-02 20:52 ` Martin Wilck
2021-02-03 10:48   ` lixiaokeng
2021-02-03 13:57     ` Martin Wilck
2021-02-04  1:40       ` lixiaokeng
2021-02-04 15:06         ` Martin Wilck
2021-02-05 11:08           ` Martin Wilck
2021-02-05 11:09             ` Martin Wilck
2021-02-07  7:05             ` lixiaokeng
2021-03-01 14:53       ` lixiaokeng
2021-03-02  8:41         ` lixiaokeng
2021-03-02 11:07           ` Martin Wilck
2021-03-02 15:49             ` lixiaokeng
2021-03-02  9:56         ` Martin Wilck
2021-03-02 12:44           ` lixiaokeng
2021-03-02 15:29             ` Martin Wilck [this message]
2021-03-02 16:55               ` Martin Wilck
2021-03-03 10:42               ` lixiaokeng
2021-03-08  9:40                 ` Martin Wilck
2021-03-15 13:00                   ` Martin Wilck
2021-03-16 11:12                     ` lixiaokeng
2021-03-17 16:59                       ` Martin Wilck
2021-03-19  1:49                         ` lixiaokeng
2021-02-08  7:41     ` lixiaokeng
2021-02-08  9:50       ` Martin Wilck
2021-02-08 10:49         ` lixiaokeng
2021-02-08 11:03           ` Martin Wilck
2021-02-09  1:36             ` lixiaokeng
2021-02-09 17:30               ` Martin Wilck
2021-02-10  2:02                 ` lixiaokeng
2021-02-10  2:29                   ` Hexiaowen (Hex, EulerOS)
2021-02-19 10:35                     ` Martin Wilck
2021-02-19  1:36                 ` lixiaokeng
2021-02-02 22:23 ` Benjamin Marzinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=41e79d67f568baf8de6b28e4924620240f0a2731.camel@suse.com \
    --to=mwilck@suse.com \
    --cc=bmarzins@redhat.com \
    --cc=christophe.varoqui@opensvc.com \
    --cc=dm-devel@redhat.com \
    --cc=linfeilong@huawei.com \
    --cc=lixiaokeng@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.