Re: RFC: one more time: SCSI device identification

From: "Ewan D. Milne" <emilne@redhat.com>
To: Martin Wilck <martin.wilck@suse.com>,
	"Ulrich.Windl@rz.uni-regensburg.de" 
	<Ulrich.Windl@rz.uni-regensburg.de>,
	"martin.petersen@oracle.com" <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>, "hch@lst.de" <hch@lst.de>,
	"dgilbert@interlog.com" <dgilbert@interlog.com>,
	"dm-devel@redhat.com" <dm-devel@redhat.com>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	"jejb@linux.vnet.ibm.com" <jejb@linux.vnet.ibm.com>,
	"systemd-devel@lists.freedesktop.org" 
	<systemd-devel@lists.freedesktop.org>,
	"bmarzins@redhat.com" <bmarzins@redhat.com>
Subject: Re: RFC: one more time: SCSI device identification
Date: Tue, 27 Apr 2021 16:41:45 -0400	[thread overview]
Message-ID: <c8ede601244e1710dbf320c33c0f7853e249bbee.camel@redhat.com> (raw)
In-Reply-To: <488ef3e7fa0cca4f0a0cb2e9307ddaa08385d3f7.camel@suse.com>

On Tue, 2021-04-27 at 20:33 +0000, Martin Wilck wrote:
> On Tue, 2021-04-27 at 16:14 -0400, Ewan D. Milne wrote:
> > 
> > There's no way to do that, in principle.  Because there could be
> > other I/Os in flight.  You might (somehow) avoid retrying an I/O
> > that got a UA until you figured out if something changed, but other
> > I/Os can already have been sent to the target, or issued before you
> > get to look at the status.
> 
> Right. But in practice, a WWID change will hardly happen under full
> IO
> load. The storage side will probably have to block IO while this
> happens, at least for a short time period. So blocking and quiescing
> the queue upon an UA might still work, most of the time. Even if we
> were too late already, the sooner we stop the queue, the better.
> 
> The current algorithm in multipath-tools needs to detect a path going
> down and being reinstated. The time interval during which a WWID
> change
> will go unnoticed is one or more path checker intervals, typically on
> the order of 5-30 seconds. If we could decrease this interval to a
> sub-
> second or even millisecond range by blocking the queue in the kernel
> quickly, we'd have made a big step forward.

Yes, and in many situations this may help.  But in the general case
we can't protect against a storage array misconfiguration,
where something like this can happen.  So I worry about people
believing the host software will protect them against a mistake,
when we can't really do that.

All it takes is one I/O (a discard) to make a thorough mess of the LUN.

-Ewan

> 
> Regards
> Martin
>