On Tue, 2021-04-27 at 10:21 +0200, Hannes Reinecke wrote:
On 4/27/21 10:10 AM, Martin Wilck wrote:
On Tue, 2021-04-27 at 13:48 +1000, Erwin van Londen wrote:

Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
afaics.

In my view the WWID should never change. 

In an ideal world, perhaps not. But in the dm-multipath realm, we know
that WWID changes can happen with certain storage arrays. See 
https://listman.redhat.com/archives/dm-devel/2021-February/msg00116.html 
and follow-ups, for example.

And it's actually something which might happen quite easily.
The storage array can unmap a LUN, delete it, create a new one, and map
that one into the same LUN number as the old one.
If we didn't do I/O during that interval, then upon the next I/O we will
be getting the dreaded 'Power-On/Reset' sense code.
_And nothing else_, due to the arcane rules for sense code generation in
SAM.
But we end up with a completely different device.
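
To make the 'Power-On/Reset' case concrete, here is a minimal Python sketch (not kernel code) of how one could spot that sense code in a raw sense buffer. It assumes the usual SPC layouts: fixed format (response code 0x70/0x71, ASC/ASCQ at bytes 12/13) and descriptor format (0x72/0x73, ASC/ASCQ at bytes 2/3). The function names are made up for illustration.

```python
UNIT_ATTENTION = 0x6   # sense key
ASC_POWER_ON_RESET = 0x29  # ASC 29h family: power on, reset, I_T nexus loss, ...


def decode_sense(sense: bytes):
    """Return (sense_key, asc, ascq), or None if the buffer cannot be decoded."""
    if not sense:
        return None
    resp = sense[0] & 0x7F
    if resp in (0x70, 0x71):  # fixed-format sense data
        if len(sense) < 14:
            return None
        return sense[2] & 0x0F, sense[12], sense[13]
    if resp in (0x72, 0x73):  # descriptor-format sense data
        if len(sense) < 4:
            return None
        return sense[1] & 0x0F, sense[2], sense[3]
    return None


def needs_identity_recheck(sense: bytes) -> bool:
    """True for a UNIT ATTENTION with ASC 29h, i.e. the POR family.

    Hypothetical policy helper: the idea is that such a sense code is the
    only hint we get, so the device identity should be re-verified.
    """
    decoded = decode_sense(sense)
    if decoded is None:
        return False
    key, asc, _ascq = decoded
    return key == UNIT_ATTENTION and asc == ASC_POWER_ON_RESET
```

Note that nothing in the sense data itself says whether the LUN behind it is still the same one; it only tells us that we should go and look.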

The only way out of it is to do a rescan for every POR sense code, and
disable the device, e.g. via DID_NO_CONNECT, whenever we find that the
identification has changed. We already have a copy of the original VPD
page 0x83 at hand, so that should be reasonably easy.
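
As a rough illustration of the comparison step, the following Python sketch parses two copies of VPD page 0x83 and checks whether the logical-unit designators differ. It assumes the standard designation descriptor layout from SPC (4-byte header, association in bits 5-4 of byte 1, designator type in bits 3-0). The helper names are hypothetical, and this is not how the kernel does it.

```python
import struct


def parse_vpd83(page: bytes):
    """Return a list of (association, designator_type, code_set, designator)."""
    if len(page) < 4 or page[1] != 0x83:
        raise ValueError("not a VPD page 0x83 buffer")
    (page_len,) = struct.unpack_from(">H", page, 2)
    end = min(len(page), 4 + page_len)
    descriptors = []
    off = 4
    while off + 4 <= end:
        code_set = page[off] & 0x0F
        association = (page[off + 1] >> 4) & 0x03
        desig_type = page[off + 1] & 0x0F
        desig_len = page[off + 3]
        if off + 4 + desig_len > end:
            break  # truncated descriptor, stop parsing
        designator = bytes(page[off + 4:off + 4 + desig_len])
        descriptors.append((association, desig_type, code_set, designator))
        off += 4 + desig_len
    return descriptors


def lu_identity(page: bytes):
    """Set of designators with association 0 (addressed logical unit)."""
    return frozenset(d for d in parse_vpd83(page) if d[0] == 0)


def identity_changed(old_page: bytes, new_page: bytes) -> bool:
    """True if the logical-unit designators differ between the two pages.

    Target-port designators (association 1) are ignored on purpose, since
    they describe the path rather than the LUN.
    """
    return lu_identity(old_page) != lu_identity(new_page)
```

A caller would keep the page read at scan time as old_page, re-read the page after a POR unit attention, and take the device offline if identity_changed() returns True.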

The way out of this is to chuck the array in the bin. As I mentioned in one of my other emails, when a scenario like the one you described above happens and the array does not inform the initiator, it violates the SAM-5 standard.

That standard shows:
5.14 Unit attention conditions
5.14.1 Unit attention conditions that are not coalesced
Each logical unit shall establish a unit attention condition whenever one of the following events occurs:
a) a power on (see 6.3.1), hard reset (see 6.3.2), logical unit reset (see 6.3.3), I_T nexus loss (see 6.3.4), or power loss expected (see 6.3.5) occurs;
b) commands received on this I_T nexus have been cleared by a command or a task management function associated with another I_T nexus and the TAS bit was set to zero in the Control mode page associated with this I_T nexus (see 5.6);
c) the portion of the logical unit inventory that consists of administrative logical units and hierarchical logical units has been changed (see 4.6.18.1); or
d) any other event requiring the attention of the SCSI initiator device.

The I_T nexus loss under a) is an especially important trigger.

---
6.3.4 I_T nexus loss
An I_T nexus loss is a SCSI device condition resulting from:

a) a hard reset condition (see 6.3.2);
b) an I_T nexus loss event (e.g., logout) indicated by a Nexus Loss event notification (see 6.4);
c) indication that an I_T NEXUS RESET task management request (see 7.6) has been processed; or
d) an indication that a REMOVE I_T NEXUS command (see SPC-4) has been processed.
An I_T nexus loss event is an indication from the SCSI transport protocol to the SAL that an I_T nexus no
longer exists. SCSI transport protocols may define I_T nexus loss events.

Each SCSI transport protocol standard that defines I_T nexus loss events should specify when those events
result in the delivery of a Nexus Loss event notification to the SAL.

The I_T nexus loss condition applies to both SCSI initiator devices and SCSI target devices.

If a SCSI target port detects an I_T nexus loss, then a Nexus Loss event notification shall be delivered to
each logical unit to which the I_T nexus has access.

In response to an I_T nexus loss condition a logical unit shall take the following actions:
a) abort all commands received on the I_T nexus as described in 5.6;
b) abort all background third-party copy operations (see SPC-4) that are using the I_T nexus;
c) terminate all task management functions received on the I_T nexus;
d) clear all ACA conditions (see 5.9.5) associated with the I_T nexus;
e) establish a unit attention condition for the SCSI initiator port associated with the I_T nexus (see 5.14
and 6.2); and
f) perform any additional functions required by the applicable command standards.
---

This also means that any underlying transport protocol issues, such as on FC or on TCP for iSCSI, will very often trigger aborted commands or UAs as well, which will be picked up by the kernel and the respective drivers.


I had a rather lengthy discussion with Fred Knight @ NetApp about
Power-On/Reset handling, with him complaining that we don't handle
it correctly. So this really is something we should be looking into,
even independently of multipathing.

But actually I like the idea from Martin Petersen to expose the parsed
VPD identifiers to sysfs; that would allow us to drop sg_inq completely
from the udev rules.
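
For comparison, this is roughly what a userspace consumer can already do today without sg_inq. It is a small Python sketch that assumes the existing per-device sysfs attributes `wwid` (text) and `vpd_pg83` (raw binary page) are present under /sys/block/<dev>/device/, which depends on the kernel version and the device. The parsed-identifier attributes from the proposal are not shown, since their names are not given here. The function name and the sysfs_root parameter are made up for illustration.

```python
from pathlib import Path


def read_sysfs_ids(dev: str, sysfs_root: str = "/sys/block") -> dict:
    """Collect identification attributes for a SCSI disk from sysfs.

    Returns a dict that may contain 'wwid' (str) and 'vpd_pg83' (bytes).
    Attributes that are missing or unreadable are simply left out.
    """
    devdir = Path(sysfs_root) / dev / "device"
    ids = {}
    for attr, is_binary in (("wwid", False), ("vpd_pg83", True)):
        try:
            data = (devdir / attr).read_bytes()
        except OSError:
            continue  # attribute not provided for this device or kernel
        ids[attr] = data if is_binary else data.decode("ascii", "replace").strip()
    return ids
```

Usage would be something like read_sysfs_ids("sda"); the sysfs_root parameter only exists so the function can be pointed at a fake tree for testing.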

Cheers,

Hannes