All of lore.kernel.org
 help / color / mirror / Atom feed
* "Power-on or device reset occurred" after a LUN resize
@ 2020-09-23  9:17 Jean-Louis Dupond
  2023-01-30  8:39 ` Jean-Louis Dupond
  0 siblings, 1 reply; 2+ messages in thread
From: Jean-Louis Dupond @ 2020-09-23  9:17 UTC (permalink / raw)
  To: linux-scsi

Hi,

Last week some action that we do regularly caused some issues.

00:50:31 CEST -> We resized a iSCSI LUN on a SAN from 3TB -> 4TB.

The clients did detect the change fine, and resized it devices:

> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Capacity data has changed
> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Inquiry data has changed
> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: supports implicit 
> TPGS
> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: device 
> t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e9 rel port 8
> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Capacity data has changed
> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 8589934592 
> 512-byte logical blocks: (4.40 TB/4.00 TiB)
> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 4096-byte 
> physical blocks
> Sep 22 00:51:07 server001 kernel: sdf: detected capacity change from 
> 3298534883328 to 4398046511104
> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: port group 3e9 
> state A non-preferred supports TolUsNA
> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Inquiry data has changed
> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: supports implicit 
> TPGS
> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: device 
> t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e9 rel port 7
> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 8589934592 
> 512-byte logical blocks: (4.40 TB/4.00 TiB)
> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 4096-byte 
> physical blocks
> Sep 22 00:51:07 server001 kernel: sdi: detected capacity change from 
> 3298534883328 to 4398046511104
> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: port group 3e9 
> state A non-preferred supports TolUsNA
> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Capacity data has changed
> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Inquiry data has changed
> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: supports implicit 
> TPGS
> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: device 
> t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e8 rel port 6
> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 8589934592 
> 512-byte logical blocks: (4.40 TB/4.00 TiB)
> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 4096-byte 
> physical blocks
> Sep 22 00:51:12 server001 kernel: sdl: detected capacity change from 
> 3298534883328 to 4398046511104
> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: port group 3e8 
> state N non-preferred supports TolUsNA
> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Capacity data has changed
> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Inquiry data has changed
> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: supports implicit 
> TPGS
> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: device 
> t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e8 rel port 5
> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 8589934592 
> 512-byte logical blocks: (4.40 TB/4.00 TiB)
> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 4096-byte 
> physical blocks
> Sep 22 00:51:18 server001 kernel: sdc: detected capacity change from 
> 3298534883328 to 4398046511104
> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: port group 3e8 
> state N non-preferred supports TolUsNA
> Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: Power-on or device 
> reset occurred
> Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: alua: port group 3e9 
> state A non-preferred supports TolUsNA
> Sep 22 00:52:09 server001 kernel: sd 17:0:0:1: Power-on or device 
> reset occurred

But then it kept doing resets:
> Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: Power-on or device 
> reset occurred
> Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: alua: port group 3e9 
> state A non-preferred supports TolUsNA
> Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: Power-on or device 
> reset occurred
> Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: alua: port group 3e9 
> state A non-preferred supports TolUsNA
> Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: Power-on or device 
> reset occurred
> Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: alua: port group 3e8 
> state N non-preferred supports TolUsNA

This caused some multipath failovers until it stopped after ~10 minutes.

We do use ALUA multipath:
3600a098038305663785d505652713446 dm-15 NETAPP,LUN C-Mode
size=4.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 
alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 16:0:0:1 sdf 8:80  active ready running
| `- 17:0:0:1 sdi 8:128 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
   |- 15:0:0:1 sdc 8:32  active ready running
   `- 18:0:0:1 sdl 8:176 active ready running


Who is sending the Power-on or device reset?
Is that the SAN?
Or does the client trigger a reset (for which reason then?)?
The LUN is attachted to multiple servers (all CentOS 8), and all showed 
the same resets.

It would be nice to find out what caused this!

Thanks for having a look :)
Jean-Louis



^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: "Power-on or device reset occurred" after a LUN resize
  2020-09-23  9:17 "Power-on or device reset occurred" after a LUN resize Jean-Louis Dupond
@ 2023-01-30  8:39 ` Jean-Louis Dupond
  0 siblings, 0 replies; 2+ messages in thread
From: Jean-Louis Dupond @ 2023-01-30  8:39 UTC (permalink / raw)
  To: linux-scsi

Hi All,

Last week we've had the same issue again, but luckily we did some more 
debugging actions when it occurred.
And it looks like a NetApp issue but also partly a kernel issue.

Let me describe:

> Jan 23 13:36:57 srv001 kernel: sd 15:0:0:0: Capacity data has changed
> Jan 23 13:37:57 srv001 kernel: sd 15:0:0:0: Power-on or device reset 
> occurred
We do a LUN resize, and exactly 1 minute later -> A 'Power-on or device 
reset' event.
Now what it looks like, is that the SAN here did not send some 
confirmation for a read/write, and the kernel tried to abort it, but failed.
So it ended up sending a Logical Unit Reset to recover.
-> So this seems to be clearly a SAN bug, as it should always confirm 
read/write.

But then:
When the LUR was send by a host (20 hosts are connected to the same LUN 
here), the following seems to happen:
- Client (linux host) sends a write x
- Client (linux host) sends a write y
- SAN respons with a check condition 0x29 (Power-on or device reset) on 
write y
- Client (linux host) sends a NOP Out after 30 seconds
- NetApp responds with a NOP In
- Client sends an abort for write x after 1 minute (as it was still not 
confirmed from the netapp side)
- NetApp responds with '0x01' (Task not in set)
- Client sends a LUR to the NetApp to reset again, as it still didn't 
know what happend with write x and could not abort it
- The LUR completes, and causes the same issue again on other hosts.

As the NetApp seems to skip write confirmations during a reset, we end 
up in a reset storm because write confirmations get skipped, and a new 
reset is being send again.
And as this happens on all the 20 hosts, this causes an endless reset storm.

Now while I think the NetApp should never skip confirmations of 
read/writes, I think the kernel should remove all non-confirmed 
writes/reads on a LUR event?
This is what SAM-2 
(https://www.cs.cmu.edu/afs/club/usr/jhutz/project/Archives/scsi/sam2r24.pdf) 
specification tells about it:
> To process a logical unit reset the logical unit shall:
> a)Abort all tasks as described in 5.7;

If the kernel on all hosts would remove the unconfirmed writes/reads on 
all hosts after a 0x29, it would never send an abort (which fails), and 
then no more LUR's would be send by the kernel.
And everything would recover correctly after the first LUR.

It would be great if we could improve this :)
There is an issue open on NetApp side also for this!

Thanks
Jean-Louis

On 23/09/2020 11:17, Jean-Louis Dupond wrote:
> Hi,
>
> Last week some action that we do regularly caused some issues.
>
> 00:50:31 CEST -> We resized a iSCSI LUN on a SAN from 3TB -> 4TB.
>
> The clients did detect the change fine, and resized it devices:
>
>> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Capacity data has changed
>> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Inquiry data has changed
>> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: supports 
>> implicit TPGS
>> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: device 
>> t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e9 rel port 8
>> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Capacity data has changed
>> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 8589934592 
>> 512-byte logical blocks: (4.40 TB/4.00 TiB)
>> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 4096-byte 
>> physical blocks
>> Sep 22 00:51:07 server001 kernel: sdf: detected capacity change from 
>> 3298534883328 to 4398046511104
>> Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: port group 3e9 
>> state A non-preferred supports TolUsNA
>> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Inquiry data has changed
>> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: supports 
>> implicit TPGS
>> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: device 
>> t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e9 rel port 7
>> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 8589934592 
>> 512-byte logical blocks: (4.40 TB/4.00 TiB)
>> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 4096-byte 
>> physical blocks
>> Sep 22 00:51:07 server001 kernel: sdi: detected capacity change from 
>> 3298534883328 to 4398046511104
>> Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: port group 3e9 
>> state A non-preferred supports TolUsNA
>> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Capacity data has changed
>> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Inquiry data has changed
>> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: supports 
>> implicit TPGS
>> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: device 
>> t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e8 rel port 6
>> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 8589934592 
>> 512-byte logical blocks: (4.40 TB/4.00 TiB)
>> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 4096-byte 
>> physical blocks
>> Sep 22 00:51:12 server001 kernel: sdl: detected capacity change from 
>> 3298534883328 to 4398046511104
>> Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: port group 3e8 
>> state N non-preferred supports TolUsNA
>> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Capacity data has changed
>> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Inquiry data has changed
>> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: supports 
>> implicit TPGS
>> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: device 
>> t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e8 rel port 5
>> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 8589934592 
>> 512-byte logical blocks: (4.40 TB/4.00 TiB)
>> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 4096-byte 
>> physical blocks
>> Sep 22 00:51:18 server001 kernel: sdc: detected capacity change from 
>> 3298534883328 to 4398046511104
>> Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: port group 3e8 
>> state N non-preferred supports TolUsNA
>> Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: Power-on or device 
>> reset occurred
>> Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: alua: port group 3e9 
>> state A non-preferred supports TolUsNA
>> Sep 22 00:52:09 server001 kernel: sd 17:0:0:1: Power-on or device 
>> reset occurred
>
> But then it kept doing resets:
>> Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: Power-on or device 
>> reset occurred
>> Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: alua: port group 3e9 
>> state A non-preferred supports TolUsNA
>> Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: Power-on or device 
>> reset occurred
>> Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: alua: port group 3e9 
>> state A non-preferred supports TolUsNA
>> Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: Power-on or device 
>> reset occurred
>> Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: alua: port group 3e8 
>> state N non-preferred supports TolUsNA
>
> This caused some multipath failovers until it stopped after ~10 minutes.
>
> We do use ALUA multipath:
> 3600a098038305663785d505652713446 dm-15 NETAPP,LUN C-Mode
> size=4.0T features='3 queue_if_no_path pg_init_retries 50' 
> hwhandler='1 alua' wp=rw
> |-+- policy='service-time 0' prio=50 status=active
> | |- 16:0:0:1 sdf 8:80  active ready running
> | `- 17:0:0:1 sdi 8:128 active ready running
> `-+- policy='service-time 0' prio=10 status=enabled
>   |- 15:0:0:1 sdc 8:32  active ready running
>   `- 18:0:0:1 sdl 8:176 active ready running
>
>
> Who is sending the Power-on or device reset?
> Is that the SAN?
> Or does the client trigger a reset (for which reason then?)?
> The LUN is attachted to multiple servers (all CentOS 8), and all 
> showed the same resets.
>
> It would be nice to find out what caused this!
>
> Thanks for having a look :)
> Jean-Louis
>
>

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2023-01-30  8:39 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-23  9:17 "Power-on or device reset occurred" after a LUN resize Jean-Louis Dupond
2023-01-30  8:39 ` Jean-Louis Dupond

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.