* sd: Unaligned partial completion
@ 2022-02-16  6:35 Douglas Gilbert
  2022-02-19 22:46 ` Martin K. Petersen
  0 siblings, 1 reply; 10+ messages in thread
From: Douglas Gilbert @ 2022-02-16  6:35 UTC
  To: SCSI development list

What should the sd driver do when it gets the error in the subject
line? Try again, and again, and again, and again ...?

sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
sd 2:0:1:0: [sdb] tag#407 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00

Not very productive, IMO. Perhaps, after say 3 retries getting the _same_
resid, it might rescan that disk. There is a big hint in the logged
data shown above: trying to READ 1 block (sector_sz=4096) and getting a
resid of 3584. So it got back 512 bytes (again and again ...). The disk
isn't mounted so perhaps it is being prepared. And maybe that preparation
involved a MODE SELECT which changed the LB size in its block descriptor,
prior to a FORMAT UNIT.
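
To make the arithmetic and the suggestion above concrete, here is a minimal
Python sketch (not kernel code). The names bytes_received and
SameResidTracker, and the limit of 3, are illustrative assumptions only;
nothing like this exists in sd.c as far as this thread shows.

```python
def bytes_received(requested_blocks, sector_size, resid):
    """Bytes actually transferred: requested length minus the residual."""
    return requested_blocks * sector_size - resid


class SameResidTracker:
    """Hypothetical policy: stop retrying once the same resid keeps repeating."""

    def __init__(self, limit=3):
        self.limit = limit
        self.last_resid = None
        self.count = 0

    def should_rescan(self, resid):
        # Count consecutive completions that report an identical resid.
        if resid == self.last_resid:
            self.count += 1
        else:
            self.last_resid = resid
            self.count = 1
        return self.count >= self.limit
```

With the logged values, bytes_received(1, 4096, 3584) is 512, and a tracker
fed resid=3584 three times in a row would ask for a rescan on the third
completion instead of retrying again.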


Another issue with that error message: what does "unaligned" mean in
this context? Surely it is superfluous and "Partial completion" is
more accurate (unless the resid is negative).

Doug Gilbert



* Re: sd: Unaligned partial completion
  2022-02-16  6:35 sd: Unaligned partial completion Douglas Gilbert
@ 2022-02-19 22:46 ` Martin K. Petersen
  2022-02-20  0:56   ` Douglas Gilbert
  0 siblings, 1 reply; 10+ messages in thread
From: Martin K. Petersen @ 2022-02-19 22:46 UTC
  To: Douglas Gilbert; +Cc: SCSI development list


Douglas,

> What should the sd driver do when it gets the error in the subject
> line? Try again, and again, and again, and again ...?
>
> sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
> sd 2:0:1:0: [sdb] tag#407 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00
>
> Not very productive, IMO. Perhaps, after say 3 retries getting the
> _same_ resid, it might rescan that disk. There is a big hint in the
> logged data shown above: trying to READ 1 block (sector_sz=4096) and
> getting a resid of 3584. So it got back 512 bytes (again and again
> ...). The disk isn't mounted so perhaps it is being prepared. And
> maybe that preparation involved a MODE SELECT which changed the LB
> size in its block descriptor, prior to a FORMAT UNIT.

The kernel doesn't inspect passthrough commands to track whether an
application is doing MODE SELECT or FORMAT UNIT. The burden is generally
on the application to do the right thing.

I'm assuming we're trying to read the partition table. Did the device
somehow get closed between the MODE SELECT and the FORMAT UNIT?

> Another issue with that error message: what does "unaligned" mean in
> this context? Surely it is superfluous and "Partial completion" is
> more accurate (unless the resid is negative).

The "unaligned" term comes from ZBC.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: sd: Unaligned partial completion
  2022-02-19 22:46 ` Martin K. Petersen
@ 2022-02-20  0:56   ` Douglas Gilbert
  2022-02-20  1:35     ` Damien Le Moal
  2022-02-23  3:27     ` Martin K. Petersen
  0 siblings, 2 replies; 10+ messages in thread
From: Douglas Gilbert @ 2022-02-20  0:56 UTC
  To: Martin K. Petersen; +Cc: SCSI development list

On 2022-02-19 17:46, Martin K. Petersen wrote:
> 
> Douglas,
> 
>> What should the sd driver do when it gets the error in the subject
>> line? Try again, and again, and again, and again ...?
>>
>> sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
>> sd 2:0:1:0: [sdb] tag#407 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00
>>
>> Not very productive, IMO. Perhaps, after say 3 retries getting the
>> _same_ resid, it might rescan that disk. There is a big hint in the
>> logged data shown above: trying to READ 1 block (sector_sz=4096) and
>> getting a resid of 3584. So it got back 512 bytes (again and again
>> ...). The disk isn't mounted so perhaps it is being prepared. And
>> maybe that preparation involved a MODE SELECT which changed the LB
>> size in its block descriptor, prior to a FORMAT UNIT.
> 
> The kernel doesn't inspect passthrough commands to track whether an
> application is doing MODE SELECT or FORMAT UNIT. The burden is generally
> on the application to do the right thing.

No, of course not. But the kernel should inspect all UAs especially the one
that says: CAPACITY DATA HAS CHANGED !

> I'm assuming we're trying to read the partition table. Did the device
> somehow get closed between the MODE SELECT and the FORMAT UNIT?

Nope, look up the "format corrupt" state in SBC; there is an ASC/ASCQ code for
that, and it was _not_ reported in this case. The disk was fine after those
two commands; it was sd or the SCSI mid-level that didn't observe the UAs,
hence the snafu. Sending a READ command after a CAPACITY DATA HAS CHANGED
UA is "undefined behaviour", as they say in the C/C++ specs.

Also more and more settings in SCSI *** are giving the option to return an
error (even MEDIUM ERROR) if the initiator is reading a block that has never
been written. So if the sd driver is looking for a partition table (LBA 0 ?)
then you have a chicken and egg problem that retrying will not solve.

>> Another issue with that error message: what does "unaligned" mean in
>> this context? Surely it is superfluous and "Partial completion" is
>> more accurate (unless the resid is negative).
> 
> The "unaligned" term comes from ZBC.

The sd driver should take its lead from SBC, not ZBC.

Doug Gilbert


*** for example, FORMAT UNIT (FFMT=2)



* Re: sd: Unaligned partial completion
  2022-02-20  0:56   ` Douglas Gilbert
@ 2022-02-20  1:35     ` Damien Le Moal
  2022-02-20  7:16       ` Douglas Gilbert
  2022-02-23  3:27     ` Martin K. Petersen
  1 sibling, 1 reply; 10+ messages in thread
From: Damien Le Moal @ 2022-02-20  1:35 UTC
  To: dgilbert, Martin K. Petersen; +Cc: SCSI development list

On 2/20/22 09:56, Douglas Gilbert wrote:
> On 2022-02-19 17:46, Martin K. Petersen wrote:
>>
>> Douglas,
>>
>>> What should the sd driver do when it gets the error in the subject
>>> line? Try again, and again, and again, and again ...?
>>>
>>> sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
>>> sd 2:0:1:0: [sdb] tag#407 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00
>>>
>>> Not very productive, IMO. Perhaps, after say 3 retries getting the
>>> _same_ resid, it might rescan that disk. There is a big hint in the
>>> logged data shown above: trying to READ 1 block (sector_sz=4096) and
>>> getting a resid of 3584. So it got back 512 bytes (again and again
>>> ...). The disk isn't mounted so perhaps it is being prepared. And
>>> maybe that preparation involved a MODE SELECT which changed the LB
>>> size in its block descriptor, prior to a FORMAT UNIT.
>>
>> The kernel doesn't inspect passthrough commands to track whether an
>> application is doing MODE SELECT or FORMAT UNIT. The burden is generally
>> on the application to do the right thing.
> 
> No, of course not. But the kernel should inspect all UAs especially the one
> that says: CAPACITY DATA HAS CHANGED !
> 
>> I'm assuming we're trying to read the partition table. Did the device
>> somehow get closed between the MODE SELECT and the FORMAT UNIT?
> 
> Nope, look up the "format corrupt" state in SBC; there is an ASC/ASCQ code for
> that, and it was _not_ reported in this case. The disk was fine after those
> two commands; it was sd or the SCSI mid-level that didn't observe the UAs,
> hence the snafu. Sending a READ command after a CAPACITY DATA HAS CHANGED
> UA is "undefined behaviour", as they say in the C/C++ specs.
> 
> Also more and more settings in SCSI *** are giving the option to return an
> error (even MEDIUM ERROR) if the initiator is reading a block that has never
> been written. So if the sd driver is looking for a partition table (LBA 0 ?)
> then you have a chicken and egg problem that retrying will not solve.

It is not the scsi driver looking for partitions. This is generic block
layer code rescanning the partition table together with disk revalidate
after the bdev is closed. The disk revalidate should have caught the
change in LBA size, so it may be that the partition scan is before
revalidate instead of after... That would need checking.

>>> Another issue with that error message: what does "unaligned" mean in
>>> this context? Surely it is superfluous and "Partial completion" is
>>> more accurate (unless the resid is negative).
>>
>> The "unaligned" term comes from ZBC.
> 
> The sd driver should take its lead from SBC, not ZBC.

It was observed in the past that some HBAs (Broadcom I think it was)
returned a resid not aligned to the LBA size with 4Kn disks, making it
impossible to restart the command to process the remainder of the data.
This problem was especially apparent with ZBC disk writes.

So unaligned here is not just for ZBC disks.
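
As a rough illustration of the check under discussion, here is a small
Python sketch (not kernel C). It is only loosely modelled on what the sd
completion path does, as far as I understand it: the function name and
return shape are made up, and sector_size is assumed to be a power of two.

```python
def good_bytes_after_partial(xfer_len, resid, sector_size):
    """Return (aligned, usable_bytes) for a partially completed transfer."""
    aligned = (resid & (sector_size - 1)) == 0
    if not aligned:
        # Round resid up to a whole sector, capped at the transfer length,
        # so that only complete sectors are counted as transferred.
        rounded = (resid + sector_size - 1) // sector_size * sector_size
        resid = min(xfer_len, rounded)
    return aligned, xfer_len - resid
```

With the values from this thread (xfer_len=4096, resid=3584,
sector_size=4096) the result is (False, 0): no whole sector can be
accounted for, which would be consistent with the same one-block READ being
reissued in full each time.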

> 
> Doug Gilbert
> 
> 
> *** for example, FORMAT UNIT (FFMT=2)
> 


-- 
Damien Le Moal
Western Digital Research


* Re: sd: Unaligned partial completion
  2022-02-20  1:35     ` Damien Le Moal
@ 2022-02-20  7:16       ` Douglas Gilbert
  2022-02-21  0:13         ` Damien Le Moal
  0 siblings, 1 reply; 10+ messages in thread
From: Douglas Gilbert @ 2022-02-20  7:16 UTC
  To: Damien Le Moal, Martin K. Petersen; +Cc: SCSI development list

On 2022-02-19 20:35, Damien Le Moal wrote:
> On 2/20/22 09:56, Douglas Gilbert wrote:
>> On 2022-02-19 17:46, Martin K. Petersen wrote:
>>>
>>> Douglas,
>>>
>>>> What should the sd driver do when it gets the error in the subject
>>>> line? Try again, and again, and again, and again ...?
>>>>
>>>> sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
>>>> sd 2:0:1:0: [sdb] tag#407 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00
>>>>
>>>> Not very productive, IMO. Perhaps, after say 3 retries getting the
>>>> _same_ resid, it might rescan that disk. There is a big hint in the
>>>> logged data shown above: trying to READ 1 block (sector_sz=4096) and
>>>> getting a resid of 3584. So it got back 512 bytes (again and again
>>>> ...). The disk isn't mounted so perhaps it is being prepared. And
>>>> maybe that preparation involved a MODE SELECT which changed the LB
>>>> size in its block descriptor, prior to a FORMAT UNIT.
>>>
>>> The kernel doesn't inspect passthrough commands to track whether an
>>> application is doing MODE SELECT or FORMAT UNIT. The burden is generally
>>> on the application to do the right thing.
>>
>> No, of course not. But the kernel should inspect all UAs especially the one
>> that says: CAPACITY DATA HAS CHANGED !
>>
>>> I'm assuming we're trying to read the partition table. Did the device
>>> somehow get closed between the MODE SELECT and the FORMAT UNIT?
>>
>> Nope, look up the "format corrupt" state in SBC; there is an ASC/ASCQ code for
>> that, and it was _not_ reported in this case. The disk was fine after those
>> two commands; it was sd or the SCSI mid-level that didn't observe the UAs,
>> hence the snafu. Sending a READ command after a CAPACITY DATA HAS CHANGED
>> UA is "undefined behaviour", as they say in the C/C++ specs.
>>
>> Also more and more settings in SCSI *** are giving the option to return an
>> error (even MEDIUM ERROR) if the initiator is reading a block that has never
>> been written. So if the sd driver is looking for a partition table (LBA 0 ?)
>> then you have a chicken and egg problem that retrying will not solve.
> 
> It is not the scsi driver looking for partitions. This is generic block
> layer code rescanning the partition table together with disk revalidate
> after the bdev is closed. The disk revalidate should have caught the
> change in LBA size, so it may be that the partition scan is before
> revalidate instead of after... That would need checking.
> 
>>>> Another issue with that error message: what does "unaligned" mean in
>>>> this context? Surely it is superfluous and "Partial completion" is
>>>> more accurate (unless the resid is negative).
>>>
>>> The "unaligned" term comes from ZBC.
>>
>> The sd driver should take its lead from SBC, not ZBC.
> 
> It was observed in the past that some HBAs (Broadcom I think it was)
> returned a resid not aligned to the LBA size with 4Kn disks, making it
> impossible to restart the command to process the remainder of the data.

But restarting the READ of one "logical block" at LBA 0 when the kernel
thought that was 4096 bytes and the drive returned 512 bytes is exactly
what I observed; again and again.

IMO the kernel should be prepared for surprises when reading LBA 0,
such as:
   - the block size is not what it was expecting [as in this case]
   - that block has never been written and the disk has been told to
     return an (IO) error in that case

It is a pity that a SCSI pass-through like the bsg or sg driver cannot
establish its own I_T nexus, separate from the I_T nexus that the
sd driver uses. The reason is that if an I_T nexus causes a UA (e.g.
a MODE SELECT that changes the LB size) then the next command on that
same nexus (apart from INQUIRY, REPORT LUNS and friends) will _not_
receive that UA. [Other I_T nexi will receive that UA.]

> This problem was especially apparent with ZBC disk writes.
>
> So unaligned here is not just for ZBC disks.

SCSI data-out and data-in transfers are inherently unaligned (or byte
aligned) but I suppose the DMA silicon in the HBA may have some
alignment requirements.

> 
>>
>> Doug Gilbert
>>
>>
>> *** for example, FORMAT UNIT (FFMT=2)
>>
> 
> 



* Re: sd: Unaligned partial completion
  2022-02-20  7:16       ` Douglas Gilbert
@ 2022-02-21  0:13         ` Damien Le Moal
  0 siblings, 0 replies; 10+ messages in thread
From: Damien Le Moal @ 2022-02-21  0:13 UTC
  To: dgilbert, Martin K. Petersen; +Cc: SCSI development list

On 2022/02/20 16:16, Douglas Gilbert wrote:
> On 2022-02-19 20:35, Damien Le Moal wrote:
>> On 2/20/22 09:56, Douglas Gilbert wrote:
>>> On 2022-02-19 17:46, Martin K. Petersen wrote:
>>>>
>>>> Douglas,
>>>>
>>>>> What should the sd driver do when it gets the error in the subject
>>>>> line? Try again, and again, and again, and again ...?
>>>>>
>>>>> sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
>>>>> sd 2:0:1:0: [sdb] tag#407 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00
>>>>>
>>>>> Not very productive, IMO. Perhaps, after say 3 retries getting the
>>>>> _same_ resid, it might rescan that disk. There is a big hint in the
>>>>> logged data shown above: trying to READ 1 block (sector_sz=4096) and
>>>>> getting a resid of 3584. So it got back 512 bytes (again and again
>>>>> ...). The disk isn't mounted so perhaps it is being prepared. And
>>>>> maybe that preparation involved a MODE SELECT which changed the LB
>>>>> size in its block descriptor, prior to a FORMAT UNIT.
>>>>
>>>> The kernel doesn't inspect passthrough commands to track whether an
>>>> application is doing MODE SELECT or FORMAT UNIT. The burden is generally
>>>> on the application to do the right thing.
>>>
>>> No, of course not. But the kernel should inspect all UAs especially the one
>>> that says: CAPACITY DATA HAS CHANGED !
>>>
>>>> I'm assuming we're trying to read the partition table. Did the device
>>>> somehow get closed between the MODE SELECT and the FORMAT UNIT?
>>>
>>> Nope, look up the "format corrupt" state in SBC; there is an ASC/ASCQ code for
>>> that, and it was _not_ reported in this case. The disk was fine after those
>>> two commands; it was sd or the SCSI mid-level that didn't observe the UAs,
>>> hence the snafu. Sending a READ command after a CAPACITY DATA HAS CHANGED
>>> UA is "undefined behaviour", as they say in the C/C++ specs.
>>>
>>> Also more and more settings in SCSI *** are giving the option to return an
>>> error (even MEDIUM ERROR) if the initiator is reading a block that has never
>>> been written. So if the sd driver is looking for a partition table (LBA 0 ?)
>>> then you have a chicken and egg problem that retrying will not solve.
>>
>> It is not the scsi driver looking for partitions. This is generic block
>> layer code rescanning the partition table together with disk revalidate
>> after the bdev is closed. The disk revalidate should have caught the
>> change in LBA size, so it may be that the partition scan is before
>> revalidate instead of after... That would need checking.
>>
>>>>> Another issue with that error message: what does "unaligned" mean in
>>>>> this context? Surely it is superfluous and "Partial completion" is
>>>>> more accurate (unless the resid is negative).
>>>>
>>>> The "unaligned" term comes from ZBC.
>>>
>>> The sd driver should take its lead from SBC, not ZBC.
>>
>> It was observed in the past that some HBAs (Broadcom I think it was)
>> returned a resid not aligned to the LBA size with 4Kn disks, making it
>> impossible to restart the command to process the remainder of the data.
> 
> But restarting the READ of one "logical block" at LBA 0 when the kernel
> thought that was 4096 bytes and the drive returned 512 bytes is exactly
> what I observed; again and again.

As I said, it may be because the block layer disk revalidate call and partition
scan are reversed, or not synchronized, causing the partition scan read to be
dealt with without the sector size yet being updated in the sd driver. We should
check the block layer. Will have a look.

> 
> IMO the kernel should be prepared for surprises when reading LBA 0,
> such as:
>    - the block size is not what it was expecting [as in this case]
>    - that block has never been written and the disk has been told to
>      return an (IO) error in that case
> 
> It is a pity that a SCSI pass-through like the bsg or sg driver cannot
> establish its own I_T nexus, separate from the I_T nexus that the
> sd driver uses. The reason is that if an I_T nexus causes a UA (e.g.
> MODE SELECT change LB size) then the next command (apart from
> INQUIRY, REPORT LUNS and friends) will _not_ receive that UA. [Other
> I_T nexi will receive that UA.]
> 
>> This problem was especially apparent with ZBC disk writes.
>>
>> So unaligned here is not just for ZBC disks.
> 
> SCSI data-out and data-in transfers are inherently unaligned (or byte
> aligned) but I suppose the DMA silicon in the HBA may have some
> alignment requirements.

Sure, I know that. But the kernel never asks for unaligned reads/writes and the
disk will certainly never return a half-baked sector for reads or partially
written sectors. So getting back a resid that is not aligned on the LBA size is a
gross bug from the HBA and we should not allow that to go unnoticed.

> 
>>
>>>
>>> Doug Gilbert
>>>
>>>
>>> *** for example, FORMAT UNIT (FFMT=2)
>>>
>>
>>
> 


-- 
Damien Le Moal
Western Digital Research


* Re: sd: Unaligned partial completion
  2022-02-20  0:56   ` Douglas Gilbert
  2022-02-20  1:35     ` Damien Le Moal
@ 2022-02-23  3:27     ` Martin K. Petersen
  2022-02-23 21:37       ` Douglas Gilbert
  1 sibling, 1 reply; 10+ messages in thread
From: Martin K. Petersen @ 2022-02-23  3:27 UTC
  To: Douglas Gilbert; +Cc: Martin K. Petersen, SCSI development list


Douglas,

> No, of course not. But the kernel should inspect all UAs especially
> the one that says: CAPACITY DATA HAS CHANGED !

It does. And uses it to emit an event to userland.

In most cases when capacity has changed it is because the user grew
their LUN. And doing the right thing in that case is to let userland
decide how to deal with it.

You could argue that the kernel should do something if the device
capacity shrinks. But it is unclear to me what "the right thing" is in
all cases. What if there is not a mounted filesystem in the affected
block range? Maybe the reduced block range is not even described by
an entry in the partition table? Should we do something? How does SCSI
know how much of the capacity is actively in use, if any? Also, what's a
partition?

In addition, given our experience with NVMe devices which, for $OTHER_OS
reasons, truncated their capacity when they experienced media problems,
I am not sure we want to encourage anybody ever going down this
path. What a mess!

> Also more and more settings in SCSI *** are giving the option to
> return an error (even MEDIUM ERROR) if the initiator is reading a
> block that has never been written. So if the sd driver is looking for
> a partition table (LBA 0 ?)  then you have a chicken and egg problem
> that retrying will not solve.

For a general purpose OS it is completely unreasonable to expect that
the OS has prior knowledge about which blocks were written. How is that
even supposed to work if you plug in a USB drive from a different
machine/OS? It also breaks the notion of array disks being
self-describing which is now effectively an industry requirement.

I am very happy to take patches that prevent us from retrying forever
when a device is being disagreeable. But I am also very comfortable with
the notion of not bothering to support devices that behave in a
nonsensical way. Just because the SCSI spec allows something doesn't
mean we should support it.

> The sd driver should take its lead from SBC, not ZBC.

The sd driver is the driver for both protocols.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: sd: Unaligned partial completion
  2022-02-23  3:27     ` Martin K. Petersen
@ 2022-02-23 21:37       ` Douglas Gilbert
  2022-02-23 22:47         ` Damien Le Moal
  0 siblings, 1 reply; 10+ messages in thread
From: Douglas Gilbert @ 2022-02-23 21:37 UTC
  To: Martin K. Petersen; +Cc: SCSI development list

On 2022-02-22 22:27, Martin K. Petersen wrote:
> 
> Douglas,
> 
>> No, of course not. But the kernel should inspect all UAs especially
>> the one that says: CAPACITY DATA HAS CHANGED !
> 
> It does. And uses it to emit an event to userland.
> 
> In most cases when capacity has changed it is because the user grew
> their LUN. And doing the right thing in that case is to let userland
> decide how to deal with it.
> 
> You could argue that the kernel should do something if the device
> capacity shrinks. But it is unclear to me what "the right thing" is in
> all cases. What if there is not a mounted filesystem in the affected
> block range? Maybe the reduced block range is not even described by
> an entry in the partition table? Should we do something? How does SCSI
> know how much of the capacity is actively in use, if any? Also, what's a
> partition?
> 
> In addition, given our experience with NVMe devices which, for $OTHER_OS
> reasons, truncated their capacity when they experienced media problems,
> I am not sure we want to encourage anybody ever going down this
> path. What a mess!

But this misses my point. sbc5r01.pdf says this:

   "If the device server supports changing the block descriptor parameters
    by a MODE SELECT command and the number of logical blocks or the
    logical block length is changed, then the device server establishes
    a unit attention condition of:
       a) CAPACITY DATA HAS CHANGED as described in 4.10; and
       b) MODE PARAMETERS CHANGED as described in SPC-6."

My point is: if "the logical block length is changed" then the sd driver
can NOT reliably issue any IO commands (READ or WRITE) until it does a
READ CAPACITY command to find out whether the LB size has changed, and
if so, to what.
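
For illustration, this is roughly what "find out whether the LB size has
changed" amounts to once the READ CAPACITY(16) parameter data is in hand.
It is a Python sketch, not driver code: the function names are made up, and
it only decodes the first 12 bytes (an 8-byte big-endian last LBA followed
by a 4-byte big-endian logical block length in bytes).

```python
import struct


def parse_read_capacity16(data):
    """Decode last LBA and logical block length from READ CAPACITY(16) data."""
    last_lba, block_len = struct.unpack(">QI", data[:12])
    return last_lba, block_len


def block_size_changed(cached_block_len, data):
    """True if the device now reports a different logical block length."""
    _, block_len = parse_read_capacity16(data)
    return block_len != cached_block_len
```

In the case reported here the cached value would have been 4096 while the
device would now report 512, so block_size_changed() would return True and
any READ sized with the old value should not be issued.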

>> Also more and more settings in SCSI *** are giving the option to
>> return an error (even MEDIUM ERROR) if the initiator is reading a
>> block that has never been written. So if the sd driver is looking for
>> a partition table (LBA 0 ?)  then you have a chicken and egg problem
>> that retrying will not solve.
> 
> For a general purpose OS it is completely unreasonable to expect that
> the OS has prior knowledge about which blocks were written. How is that
> even supposed to work if you plug in a USB drive from a different
> machine/OS? It also breaks the notion of array disks being
> self-describing which is now effectively an industry requirement.
> 
> I am very happy to take patches that prevent us from retrying forever
> when a device is being disagreeable. But I am also very comfortable with
> the notion of not bothering to support devices that behave in a
> nonsensical way. Just because the SCSI spec allows something doesn't
> mean we should support it.
> 
>> The sd driver should take its lead from SBC, not ZBC.
> 
> The sd driver is the driver for both protocols.

This "unaligned" usage seems to come from ZBC and first appeared in
SPC-4, ASC/ASCQ code [0x21,0x4]: "Unaligned WRITE command". It is
the only use of the word "unaligned" in SPC-4, SPC-5 and spc6r06.pdf
and it is not defined (in those documents) or in the SBC specs.
Surprisingly it is used, but not defined in zbc2r12.pdf .

To me "unaligned" means some sort of transport issue which this is
not ***. It simply means the WRITE was not issued with a starting
LBA which corresponded to that zone's write pointer. This is
for "sequential write required" (swr) zones. IMO the ASC message
should be akin to: "Sequential write requirement violated".
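
A deliberately simplified Python model of that rule, for illustration only.
The function and parameter names are mine, not ZBC's, and real devices
check more than this (for example, where the write ends).

```python
def write_is_aligned(start_lba, zone_start, zone_len, write_pointer):
    """For a swr zone, a WRITE is accepted only if it begins at the write pointer."""
    in_zone = zone_start <= start_lba < zone_start + zone_len
    return in_zone and start_lba == write_pointer
```

So with a zone covering LBAs 0..65535 and its write pointer at 1000, a
WRITE starting at LBA 1000 passes and one starting at LBA 2000 is what gets
reported as "unaligned", even though no byte or block alignment is involved.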

Until Linux utilities catch up with zoned disks, users of zoned
disks are going to see a lot of that "unaligned"  error! Currently
you can't partition a zoned disk because those utilities try to
WRITE shadow copies further out on the disk and violate the
write pointer settings of swr zones (then crash and burn).
You can create a btrfs file system on a whole zoned disk (e.g. /dev/sdb)
but only if you have a recent enough btrfs-progs package ****. Any
Debian user caught in this bind, try using the binary Sid package at:
     https://packages.debian.org/sid/btrfs-progs


Life is a little easier for ZBC/ZAC zoned disks which typically
start with conventional (normal random WRITE capable) zones (for 1%
of the available storage) before the swr zones commence. ZNS (for
NVMe) doesn't support conventional zones.

Doug Gilbert


***  well where sd.c generated that "unaligned" error it was because
      it tried to READ one block at LBA 0 and thought it was 4096
      bytes long. It wasn't (due to a MODE SELECT) so it got back
      512 bytes. Is that an alignment error ??

**** building btrfs-progs from its GitHub source is not a pleasant
      experience, IMO


* Re: sd: Unaligned partial completion
  2022-02-23 21:37       ` Douglas Gilbert
@ 2022-02-23 22:47         ` Damien Le Moal
  2022-02-23 23:58           ` Douglas Gilbert
  0 siblings, 1 reply; 10+ messages in thread
From: Damien Le Moal @ 2022-02-23 22:47 UTC
  To: dgilbert, Martin K. Petersen; +Cc: SCSI development list

On 2/24/22 06:37, Douglas Gilbert wrote:
> On 2022-02-22 22:27, Martin K. Petersen wrote:
>>
>> Douglas,
>>
>>> No, of course not. But the kernel should inspect all UAs especially
>>> the one that says: CAPACITY DATA HAS CHANGED !
>>
>> It does. And uses it to emit an event to userland.
>>
>> In most cases when capacity has changed it is because the user grew
>> their LUN. And doing the right thing in that case is to let userland
>> decide how to deal with it.
>>
>> You could argue that the kernel should do something if the device
>> capacity shrinks. But it is unclear to me what "the right thing" is in
>> all cases. What if there is not a mounted filesystem in the affected
>> block range? Maybe the reduced block range is not even described by
>> an entry in the partition table? Should we do something? How does SCSI
>> know how much of the capacity is actively in use, if any? Also, what's a
>> partition?
>>
>> In addition, given our experience with NVMe devices which, for $OTHER_OS
>> reasons, truncated their capacity when they experienced media problems,
>> I am not sure we want to encourage anybody ever going down this
>> path. What a mess!
> 
> But this misses my point. sbc5r01.pdf says this:
> 
>    "If the device server supports changing the block descriptor parameters
>     by a MODE SELECT command and the number of logical blocks or the
>     logical block length is changed, then the device server establishes
>     a unit attention condition of:
>        a) CAPACITY DATA HAS CHANGED as described in 4.10; and
>        b) MODE PARAMETERS CHANGED as described in SPC-6."
> 
> My point is: if "the logical block length is changed" then the sd driver
> can NOT reliably issue any IO commands (READ or WRITE) until it does a
> READ CAPACITY command to find out whether the LB size has changed, and
> if so, to what.
> 
>>> Also more and more settings in SCSI *** are giving the option to
>>> return an error (even MEDIUM ERROR) if the initiator is reading a
>>> block that has never been written. So if the sd driver is looking for
>>> a partition table (LBA 0 ?)  then you have a chicken and egg problem
>>> that retrying will not solve.
>>
>> For a general purpose OS it is completely unreasonable to expect that
>> the OS has prior knowledge about which blocks were written. How is that
>> even supposed to work if you plug in a USB drive from a different
>> machine/OS? It also breaks the notion of array disks being
>> self-describing which is now effectively an industry requirement.
>>
>> I am very happy to take patches that prevent us from retrying forever
>> when a device is being disagreeable. But I am also very comfortable with
>> the notion of not bothering to support devices that behave in a
>> nonsensical way. Just because the SCSI spec allows something doesn't
>> mean we should support it.
>>
>>> The sd driver should take its lead from SBC, not ZBC.
>>
>> The sd driver is the driver for both protocols.
> 
> This "unaligned" usage seems to come from ZBC and first appeared in
> SPC-4, ASC/ASCQ code [0x21,0x4]: "Unaligned WRITE command". It is
> the only use of the word "unaligned" in SPC-4, SPC-5 and spc6r06.pdf
> and it is not defined (in those documents) or in the SBC specs.
> Surprisingly it is used, but not defined in zbc2r12.pdf .
> 
> To me "unaligned" means some sort of transport issue which this is
> not ***. It simply means the WRITE was not issued with a starting
> LBA which corresponded to that zone's write pointer. This is
> for "sequential write required" (swr) zones. IMO the ASC message
> should be akin to: "Sequential write requirement violated".
> 
> Until Linux utilities catch up with zoned disks, users of zoned
> disks are going to see a lot of that "unaligned"  error! Currently
> you can't partition a zoned disk because those utilities try to
> WRITE shadow copies further out on the disk and violate the
> write pointer settings of swr zones (then crash and burn).
> You can create a btrfs file system on a whole zoned disk (e.g. /dev/sdb)
> but only if you have a recent enough btrfs-progs package ****. Any
> Debian user caught in this bind, try using the binary Sid package at:
>      https://packages.debian.org/sid/btrfs-progs
> 
> 
> Life is a little easier for ZBC/ZAC zoned disks which typically
> start with conventional (normal random WRITE capable) zones (for 1%
> of the available storage) before the swr zones commence. ZNS (for
> NVMe) doesn't support conventional zones.
> 
> Doug Gilbert
> 
> 
> ***  well where sd.c generated that "unaligned" error it was because
>       it tried to READ one block at LBA 0 and thought it was 4096
>       bytes long. It wasn't (due to a MODE SELECT) so it got back
>       512 bytes. Is that an alignment error ??

Personally, I consider it as such because the retry to process the
remainder will necessarily fail, or worse, do bad things to the drive
sectors, since the addressing is off by a factor of 8. Retrying the
remainder of any of these "unaligned" commands is dangerous. For a read,
this can lead to data leaks, and for a write, that can destroy the FS on
the disk.
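
To make the "factor of 8" concrete, here is a minimal sketch of one way to
read that remark. It is an illustration only, not code from sd.c: the names
HOST_BLOCK_SIZE, DEVICE_BLOCK_SIZE and addressing_skew are hypothetical. It
assumes the host still believes blocks are 4096 bytes while the device now
uses 512-byte blocks, so the same LBA lands at a different byte offset.

```python
HOST_BLOCK_SIZE = 4096    # assumption: what the host still believes
DEVICE_BLOCK_SIZE = 512   # assumption: what the device uses after reformat


def addressing_skew(lba):
    """Return (byte offset the host intends, byte offset the device uses)."""
    intended = lba * HOST_BLOCK_SIZE
    actual = lba * DEVICE_BLOCK_SIZE
    return intended, actual


intended, actual = addressing_skew(100)
print(intended, actual)                       # 409600 51200
print(HOST_BLOCK_SIZE // DEVICE_BLOCK_SIZE)   # 8
```

Only LBA 0 maps to the same place under both block sizes; every other LBA
the host issues would touch a different part of the medium than intended.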

> 
> **** building btrfs-progs from its GitHub source is not a pleasant
>       experience, IMO


-- 
Damien Le Moal
Western Digital Research


* Re: sd: Unaligned partial completion
  2022-02-23 22:47         ` Damien Le Moal
@ 2022-02-23 23:58           ` Douglas Gilbert
  0 siblings, 0 replies; 10+ messages in thread
From: Douglas Gilbert @ 2022-02-23 23:58 UTC (permalink / raw)
  To: Damien Le Moal, Martin K. Petersen; +Cc: SCSI development list

On 2022-02-23 17:47, Damien Le Moal wrote:
> On 2/24/22 06:37, Douglas Gilbert wrote:
>> On 2022-02-22 22:27, Martin K. Petersen wrote:
>>>
>>> Douglas,
>>>
>>>> No, of course not. But the kernel should inspect all UAs especially
>>>> the one that says: CAPACITY DATA HAS CHANGED !
>>>
>>> It does. And uses it to emit an event to userland.
>>>
>>> In most cases when capacity has changed it is because the user grew
>>> their LUN. And doing the right thing in that case is to let userland
>>> decide how to deal with it.
>>>
>>> You could argue that the kernel should do something if the device
>>> capacity shrinks. But it is unclear to me what "the right thing" is in
>>> all cases. What if there is not a mounted filesystem in the affected
>>> block range? Maybe the reduced block range is not even described by
>>> an entry in the partition table? Should we do something? How does SCSI
>>> know how much of the capacity is actively in use, if any? Also, what's a
>>> partition?
>>>
>>> In addition, given our experience with NVMe devices which, for $OTHER_OS
>>> reasons, truncated their capacity when they experienced media problems,
>>> I am not sure we want to encourage anybody ever going down this
>>> path. What a mess!
>>
>> But this misses my point. sbc5r01.pdf says this:
>>
>>     "If the device server supports changing the block descriptor parameters
>>      by a MODE SELECT command and the number of logical blocks or the
>>      logical block length is changed, then the device server establishes
>>      a unit attention condition of:
>>         a) CAPACITY DATA HAS CHANGED as described in 4.10; and
>>         b) MODE PARAMETERS CHANGED as described in SPC-6."
>>
>> My point is: if "the logical block length is changed" then the sd driver
>> can NOT reliably issue any IO commands (READ or WRITE) until it does a
>> READ CAPACITY command to find out whether the LB size has changed, and
>> if so, to what.
>>
>>>> Also more and more settings in SCSI *** are giving the option to
>>>> return an error (even MEDIUM ERROR) if the initiator is reading a
>>>> block that has never been written. So if the sd driver is looking for
>>>> a partition table (LBA 0 ?)  then you have a chicken and egg problem
>>>> that retrying will not solve.
>>>
>>> For a general purpose OS it is completely unreasonable to expect that
>>> the OS has prior knowledge about which blocks were written. How is that
>>> even supposed to work if you plug in a USB drive from a different
>>> machine/OS? It also breaks the notion of array disks being
>>> self-describing which is now effectively an industry requirement.
>>>
>>> I am very happy to take patches that prevent us from retrying forever
>>> when a device is being disagreeable. But I am also very comfortable with
>>> the notion of not bothering to support devices that behave in a
>>> nonsensical way. Just because the SCSI spec allows something doesn't
>>> mean we should support it.
>>>
>>>> The sd driver should take its lead from SBC, not ZBC.
>>>
>>> The sd driver is the driver for both protocols.
>>
>> This "unaligned" usage seems to come from ZBC and first appeared in
>> SPC-4, ASC/ASCQ code [0x21,0x4]: "Unaligned WRITE command". It is
>> the only use of the word "unaligned" in SPC-4, SPC-5 and spc6r06.pdf
>> and it is not defined (in those documents) or in the SBC specs.
>> Surprisingly, it is used but not defined in zbc2r12.pdf.
>>
>> To me "unaligned" means some sort of transport issue which this is
>> not ***. It simply means the WRITE was not issued with a starting
>> LBA which corresponded to that zone's write pointer. This is
>> for "sequential write required" (swr) zones. IMO the ASC message
>> should be akin to: "Sequential write requirement violated".
>>
>> Until Linux utilities catch up with zoned disks, users of zoned
>> disks are going to see a lot of that "unaligned" error! Currently
>> you can't partition a zoned disk because those utilities try to
>> WRITE shadow copies further out on the disk and violate the
>> write pointer settings of swr zones (then crash and burn).
>> You can create a Btrfs file system on a whole zoned disk (e.g. /dev/sdb)
>> but only if you have a recent enough btrfs-progs package ****. Any
>> Debian user caught in this bind, try using the binary Sid package at:
>>       https://packages.debian.org/sid/btrfs-progs
>>
>>
>> Life is a little easier for ZBC/ZAC zoned disks which typically
>> start with conventional (normal random WRITE capable) zones (for 1%
>> of the available storage) before the swr zones commence. ZNS (for
>> NVMe) doesn't support conventional zones.
>>
>> Doug Gilbert
>>
>>
>> ***  well where sd.c generated that "unaligned" error it was because
>>        it tried to READ one block at LBA 0 and thought it was 4096
>>        bytes long. It wasn't (due to a MODE SELECT) so it got back
>>        512 bytes. Is that an alignment error ??
> 
> Personally, I consider it as such because the retry to process the
> remainder will necessarily fail, or worse, do bad things to the drive
> sectors, since the addressing is off by a factor of 8. Retrying the
> remainder of any of these "unaligned" commands is dangerous. For a read,
> this can lead to data leaks, and for a write, that can destroy the FS on
> the disk.

Here are the error messages I saw after the MODE_SELECT+FORMAT_UNIT
commands that changed the LB size from 4096 to 512 bytes. No command
was entered on the command line (after the format). The disk had no
mounted file systems on it.

[10490.819058] sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
[10490.819189] sd 2:0:1:0: [sdb] tag#392 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00
[10490.820349] sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
[10490.820356] sd 2:0:1:0: [sdb] tag#393 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00
[10490.820609] sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
[10490.820612] sd 2:0:1:0: [sdb] tag#394 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00
[10490.820768] sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
[10490.820769] sd 2:0:1:0: [sdb] tag#395 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00

That continued and the machine became unusable so I rebooted it.

The log shows that it is trying to read the partition table; that failed,
so let's try it again (ad infinitum).
Surely to goodness that is a BUG. And the information it needs is there:
wanted 4096 bytes, got 512, try again ... same result ... does that look
like a transport error? Not IMO.

What should it do? Well doing a READ CAPACITY would be a great start.
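
As a rough illustration of what a READ CAPACITY check would reveal, here is
a minimal sketch that decodes READ CAPACITY(16) parameter data and compares
the reported block length with a cached value. It is not how sd.c is
written: parse_read_capacity16 and block_size_changed are hypothetical
names, and the sample response is constructed by hand rather than captured
from a device. The layout used (8-byte last LBA followed by a 4-byte logical
block length, big-endian) is the one SBC defines.

```python
import struct


def parse_read_capacity16(data):
    """Return (number_of_blocks, block_length) from READ CAPACITY(16) data."""
    if len(data) < 12:
        raise ValueError("READ CAPACITY(16) data too short")
    # Bytes 0-7: returned (last) LBA; bytes 8-11: logical block length.
    last_lba, block_length = struct.unpack(">QI", data[:12])
    return last_lba + 1, block_length


def block_size_changed(cached_block_size, data):
    """True if the device reports a block length other than the cached one."""
    _, block_length = parse_read_capacity16(data)
    return block_length != cached_block_size


# Hand-built sample: last LBA 999, 512-byte blocks, padded to 32 bytes.
sample = struct.pack(">QI", 999, 512) + bytes(20)
print(parse_read_capacity16(sample))     # (1000, 512)
print(block_size_changed(4096, sample))  # True
```

With a cached sector size of 4096 and a device reporting 512, the mismatch
is visible after a single command, with no need to retry the failing READ.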

Doug Gilbert



end of thread, other threads:[~2022-02-23 23:58 UTC | newest]

Thread overview: 10+ messages
2022-02-16  6:35 sd: Unaligned partial completion Douglas Gilbert
2022-02-19 22:46 ` Martin K. Petersen
2022-02-20  0:56   ` Douglas Gilbert
2022-02-20  1:35     ` Damien Le Moal
2022-02-20  7:16       ` Douglas Gilbert
2022-02-21  0:13         ` Damien Le Moal
2022-02-23  3:27     ` Martin K. Petersen
2022-02-23 21:37       ` Douglas Gilbert
2022-02-23 22:47         ` Damien Le Moal
2022-02-23 23:58           ` Douglas Gilbert
