From: Martin Wilck <mwilck@suse.com>
To: lixiaokeng <lixiaokeng@huawei.com>,
	Benjamin Marzinski <bmarzins@redhat.com>,
	Christophe Varoqui <christophe.varoqui@opensvc.com>,
	dm-devel mailing list <dm-devel@redhat.com>
Cc: linfeilong <linfeilong@huawei.com>,
	"liuzhiqiang (I)" <liuzhiqiang26@huawei.com>,
	lihaotian9@huawei.com
Subject: Re: [dm-devel] [QUESTION]: multipath device with wrong path lead to metadata err
Date: Wed, 27 Jan 2021 00:11:20 +0100
Message-ID: <5440d76a18994a7a214321c30fe8a1e99c0a3988.camel@suse.com>
In-Reply-To: <ef4f29d8-a20b-2b4d-97ab-a83fb4bca5ac@huawei.com>

On Tue, 2021-01-26 at 19:14 +0800, lixiaokeng wrote:
> 
> > > Hi,
> > >   Unfortunately, calling verify_path() before *and* after domap()
> > > in coalesce_paths() doesn't solve this problem. I think there is
> > > another way for a multipath device to end up with a wrong path,
> > > but so far I can't find it in the log.
> > 
> > Can you provide multipathd -v3 logs, and kernel logs? Maybe I'll see
> > something.

This is not a -v3 log, right? We can't see much of what multipathd is
doing. Anyway, I understand now that verify_paths() won't help. It
only looks for paths that have been removed (i.e. don't exist any more
in sysfs) since the last path detection. But when the error occurs,
it seems that sdf has been removed *and re-added*, so the check
whether the path still exists succeeds. The uevents were also missed
because the uevent handler didn't get the lock.
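
To illustrate why an existence-only check passes here, a minimal toy
model (not multipath-tools code; the Path class, both check functions
and the sysfs dict are hypothetical, only the WWIDs and SCSI addresses
are taken from the report quoted below):

```python
from dataclasses import dataclass


@dataclass
class Path:
    dev: str   # kernel name, e.g. "sdf"
    hctl: str  # SCSI address recorded when the path was discovered
    wwid: str  # WWID recorded when the path was discovered


def exists_check(path, sysfs):
    """Existence-only check: is there any device with this name?"""
    return path.dev in sysfs


def identity_check(path, sysfs):
    """Stricter check: is the device behind this name still the same one?"""
    current = sysfs.get(path.dev)
    return current is not None and current == (path.hctl, path.wwid)


# What was recorded for sdf before the iscsi logout/login.
stale = Path("sdf", "4:0:0:2", "36001405b7679bd96b094bccbf971bc90")

# Simulated sysfs view after the re-add: same name, different LUN.
sysfs_now = {"sdf": ("2:0:0:0", "3600140584e11eb1818c4afab12c17800")}

print(exists_check(stale, sysfs_now))    # name is present again
print(identity_check(stale, sysfs_now))  # but it is a different device
```

In this model the name-based check succeeds although the device behind
the name has changed, which is the situation described above.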


> 
> (1) multipath -r: sdf is found as a path of
>     36001405b7679bd96b094bccbf971bc90 (iscsi node 4:0:0:2).
> 
> (2) iscsi logout: sdf is removed at system time [1202538.467014].
> 
> (3) iscsi login: sdf appears at system time [1202538.825745].
>     It is now a path of 3600140584e11eb1818c4afab12c17800
>     (iscsi node 2:0:0:0).
> 
> Here I have a doubt. When I stop in domap() using gdb and do an iscsi
> logout/login, the name sdf is not used again because the disk
> refcount is not zero. I added a print for when the disk refcount
> reaches zero in put_disk_and_module() (for example "lxk ref put
> after: name sdi; count 0"), but there is no such print for sdf.

Yes, this is a very good point, and it's indeed strange. multipathd
should have opened a file descriptor to /dev/sdf in pathinfo(), and as
long as that file is open, the use count shouldn't drop to 0, the disk
devices (block device and scsi_disk device) shouldn't be released, and
the major/minor number shouldn't be reused. Unless I'm missing
something essential, that is.

> Jan 25 12:37:48 client1 kernel: [1202538.467014] sd 4:0:0:2: [sdf] Synchronizing SCSI cache
> Jan 25 12:37:48 client1 kernel: [1202538.568195] scsi 4:0:0:2: alua: Detached
> Jan 25 12:37:48 client1 kernel: [1202538.630507] sd 2:0:0:0: [sdf] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)

Less than 0.1s between the disappearance of 4:0:0:2 as sdf and reappearance
of 2:0:0:0, without any sign of multipathd having noticed this change,
is indeed quite strange.
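
For reference, the gaps between the three kernel messages quoted above
can be computed directly from their timestamps. A small sketch (the
helper name is made up; Decimal is used only to avoid float rounding):

```python
import re
from decimal import Decimal

LOG = """\
Jan 25 12:37:48 client1 kernel: [1202538.467014] sd 4:0:0:2: [sdf] Synchronizing SCSI cache
Jan 25 12:37:48 client1 kernel: [1202538.568195] scsi 4:0:0:2: alua: Detached
Jan 25 12:37:48 client1 kernel: [1202538.630507] sd 2:0:0:0: [sdf] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
"""


def kernel_timestamps(text):
    """Extract the bracketed kernel timestamps, in order of appearance."""
    return [Decimal(m.group(1)) for m in re.finditer(r"\[(\d+\.\d+)\]", text)]


sync_cache, detached, new_sdf = kernel_timestamps(LOG)

print(detached - sync_cache)  # cache sync -> alua detach
print(new_sdf - detached)     # alua detach -> new sdf on 2:0:0:0
print(new_sdf - sync_cache)   # whole sequence
```

This gives about 0.06s between the "alua: Detached" message for 4:0:0:2
and the first message for the new sdf, and about 0.16s for the whole
sequence.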

So we can only conclude that (if there's no kernel refcounting bug,
which I doubt) either orphan_path()->uninitialize_path() had been
called (closing the fd), or that opening the sd device had failed in
the first place (in which case the path WWID should have been nulled in
pathinfo()). In both cases it makes little sense that the path should
still be part of a struct multipath.

Please increase the log level of the "Couldn't open device node"
message in pathinfo(), and see whether the corresponding errors are
logged.

Can you verify in the debugger if multipathd still has the fd to the
disk device open?
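
As an alternative to the debugger, the open fds of a process can be
inspected through /proc. A rough sketch, assuming Linux and sufficient
privileges to read /proc/<pid>/fd of the target process; the function
name is made up, and the pid and major:minor values must be supplied by
the caller:

```python
import os
import stat


def fds_for_device(pid, major, minor):
    """Return the fd numbers of `pid` that refer to block device major:minor."""
    fd_dir = f"/proc/{pid}/fd"
    hits = []
    for name in os.listdir(fd_dir):
        try:
            # stat() follows the /proc link to the file that is actually
            # open, even if the device node has since been unlinked.
            st = os.stat(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed in the meantime, or is not accessible
        if not stat.S_ISBLK(st.st_mode):
            continue
        if (os.major(st.st_rdev), os.minor(st.st_rdev)) == (major, minor):
            hits.append(int(name))
    return sorted(hits)
```

The major:minor pair of sdf can be read from /sys/block/sdf/dev. An
empty result for the multipathd pid would suggest that the fd has been
closed or was never opened.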

Perhaps you could trace scsi_disk_release() in the kernel?

Martin


