* RAID-10 keeps aborting
@ 2013-06-03  3:57 H. Peter Anvin
  2013-06-03  4:05 ` H. Peter Anvin
                   ` (2 more replies)
  0 siblings, 3 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03  3:57 UTC (permalink / raw)
  To: linux-raid

Hello,

I have a brand new server with a RAID-10 array.  The drives are a SAS
JBOD (mptsas) which I'm driving using Linux mdraid raid10.

Unfortunately, although the server did burn-in fine, once put in
production I have so far had multiple cases (about once every 24 hours)
of the raid10 failing, with a mirror pair dropping out in very short
succession:

Jun  2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Disk
failure on sdb6, disabling device.
Jun  2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Operation
continuing on 3 devices.
Jun  2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Disk
failure on sdc6, disabling device.
Jun  2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Operation
continuing on 2 devices.
Jun  2 20:23:05 terminus kernel: [83595.789234] md4: WRITE SAME failed.
Manually zeroing.

Unfortunately, those two devices that just dropped out are of course the
mirrors of each other, causing filesystem corruption and shutdown in
very short order.

There are no other kernel messages from the same time, and given the
timing (less than 90 ms apart) it would appear that this is a timeout of
some kind and not an actual disk failure.

Are there any tunables I can tweak, or do I have a $4000 paperweight?

	-hpa

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-03  3:57 RAID-10 keeps aborting H. Peter Anvin
@ 2013-06-03  4:05 ` H. Peter Anvin
  2013-06-03  5:47 ` Dan Williams
  2013-06-11 21:50 ` RAID-10 keeps aborting Joe Lawrence
  2 siblings, 0 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03  4:05 UTC (permalink / raw)
  To: linux-raid

On 06/02/2013 08:57 PM, H. Peter Anvin wrote:
> Hello,
> 
> I have a brand new server with a RAID-10 array.  The drives are a SAS
> JBOD (mptsas) which I'm driving using Linux mdraid raid10.
> 
> Unfortunately, although the server did burn-in fine, once put in
> production I have so far had multiple cases (about once every 24 hours)
> of the raid10 failing, with a mirror pair dropping out in very short
> succession:
> 
> Jun  2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Disk
> failure on sdb6, disabling device.
> Jun  2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Operation
> continuing on 3 devices.
> Jun  2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Disk
> failure on sdc6, disabling device.
> Jun  2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Operation
> continuing on 2 devices.
> Jun  2 20:23:05 terminus kernel: [83595.789234] md4: WRITE SAME failed.
> Manually zeroing.
> 

For the record, this is kernel 3.8.13 (-100.fc17 from Fedora).

	-hpa




* Re: RAID-10 keeps aborting
  2013-06-03  3:57 RAID-10 keeps aborting H. Peter Anvin
  2013-06-03  4:05 ` H. Peter Anvin
@ 2013-06-03  5:47 ` Dan Williams
  2013-06-03  6:06   ` H. Peter Anvin
  2013-06-11 21:50 ` RAID-10 keeps aborting Joe Lawrence
  2 siblings, 1 reply; 122+ messages in thread
From: Dan Williams @ 2013-06-03  5:47 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

On Sun, Jun 2, 2013 at 8:57 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> Hello,
>
> I have a brand new server with a RAID-10 array.  The drives are a SAS
> JBOD (mptsas) which I'm driving using Linux mdraid raid10.
>
> Unfortunately, although the server did burn-in fine, once put in
> production I have so far had multiple cases (about once every 24 hours)
> of the raid10 failing, with a mirror pair dropping out in very short
> succession:
>
> Jun  2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Disk
> failure on sdb6, disabling device.
> Jun  2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Operation
> continuing on 3 devices.
> Jun  2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Disk
> failure on sdc6, disabling device.
> Jun  2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Operation
> continuing on 2 devices.
> Jun  2 20:23:05 terminus kernel: [83595.789234] md4: WRITE SAME failed.
> Manually zeroing.
>
> Unfortunately, those two devices that just dropped out are of course the
> mirrors of each other, causing filesystem corruption and shutdown in
> very short order.
>
> There are no other kernel messages from the same time, and given the
> timing (less than 90 ms apart) it would appear that this is a timeout of
> some kind and not an actual disk failure.

Looks like the underlying devices just may not support write_same...
if the device lies about support we don't find out about it until the
first attempt fails and md drops the devices.

> Are there any tunables I can tweak, or do I have a $4000 paperweight?

One hack to prove this may be to explicitly disable write_same before
the array is assembled:

for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done

If this works then maybe md needs to be tolerant of write_same
failures since the block layer will simply retry with zeroes.

--
Dan


* Re: RAID-10 keeps aborting
  2013-06-03  5:47 ` Dan Williams
@ 2013-06-03  6:06   ` H. Peter Anvin
  2013-06-03  6:14     ` Dan Williams
  0 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03  6:06 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

On 06/02/2013 10:47 PM, Dan Williams wrote:
> 
> One hack to prove this may be to explicitly disable write_same before
> the array is assembled:
> 
> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done
> 
> If this works then maybe md needs to be tolerant of write_same
> failures since the block layer will simply retry with zeroes.
> 

Trying that (array is already assembled but is currently functional.)
Let's hope it works.

	-hpa




* Re: RAID-10 keeps aborting
  2013-06-03  6:06   ` H. Peter Anvin
@ 2013-06-03  6:14     ` Dan Williams
  2013-06-03  6:30       ` H. Peter Anvin
                         ` (2 more replies)
  0 siblings, 3 replies; 122+ messages in thread
From: Dan Williams @ 2013-06-03  6:14 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 06/02/2013 10:47 PM, Dan Williams wrote:
>>
>> One hack to prove this may be to explicitly disable write_same before
>> the array is assembled:
>>
>> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done
>>
>> If this works then maybe md needs to be tolerant of write_same
>> failures since the block layer will simply retry with zeroes.
>>
>
> Trying that (array is already assembled but is currently functional.)
> Let's hope it works.
>

If I'm reading things correctly that may still result in failure since
md will still pass the REQ_WRITE_SAME bios down to the devices and
will receive BLK_PREP_KILL for its trouble.  md only notices that
write same is disabled on underlying devices at assembly time.
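[If that reading is right, the sysfs trick would only take effect if applied before assembly. A rough sequence (untested sketch; md4 and the sysfs paths are the ones from this thread) might be:

```shell
# Untested sketch -- zero the per-disk WRITE SAME limit *before* the
# array is (re)assembled, so that md's limit stacking at assembly time
# picks up a zero limit for md4's own queue:
mdadm --stop /dev/md4
for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > "$i"; done
mdadm --assemble /dev/md4
# The stacked limit for the md device should then read 0:
cat /sys/block/md4/queue/write_same_max_bytes
```
]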


* Re: RAID-10 keeps aborting
  2013-06-03  6:14     ` Dan Williams
@ 2013-06-03  6:30       ` H. Peter Anvin
  2013-06-03 14:39       ` H. Peter Anvin
  2013-06-03 15:47       ` H. Peter Anvin
  2 siblings, 0 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03  6:30 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

Ok, will muck with that tomorrow. Hopefully it doesn't die tonight.

Dan Williams <dan.j.williams@gmail.com> wrote:

>On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 06/02/2013 10:47 PM, Dan Williams wrote:
>>>
>>> One hack to prove this may be to explicitly disable write_same
>before
>>> the array is assembled:
>>>
>>> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 >
>$i; done
>>>
>>> If this works then maybe md needs to be tolerant of write_same
>>> failures since the block layer will simply retry with zeroes.
>>>
>>
>> Trying that (array is already assembled but is currently functional.)
>> Let's hope it works.
>>
>
>If I'm reading things correctly that may still result in failure since
>md will still pass the REQ_WRITE_SAME bios down to the devices and
>will receive BLK_PREP_KILL for its trouble.  md only notices that
>write same is disabled on underlying devices at assembly time.

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


* Re: RAID-10 keeps aborting
  2013-06-03  6:14     ` Dan Williams
  2013-06-03  6:30       ` H. Peter Anvin
@ 2013-06-03 14:39       ` H. Peter Anvin
  2013-06-11 16:47         ` Joe Lawrence
  2013-06-03 15:47       ` H. Peter Anvin
  2 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03 14:39 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

On 06/02/2013 11:14 PM, Dan Williams wrote:
> On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 06/02/2013 10:47 PM, Dan Williams wrote:
>>>
>>> One hack to prove this may be to explicitly disable write_same before
>>> the array is assembled:
>>>
>>> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done
>>>
>>> If this works then maybe md needs to be tolerant of write_same
>>> failures since the block layer will simply retry with zeroes.
>>>
>>
>> Trying that (array is already assembled but is currently functional.)
>> Let's hope it works.
>>
> 
> If I'm reading things correctly that may still result in failure since
> md will still pass the REQ_WRITE_SAME bios down to the devices and
> will receive BLK_PREP_KILL for its trouble.  md only notices that
> write same is disabled on underlying devices at assembly time.
> 

Hmmm... that means getting dracut/udev to supply this little mod, unless
it can be fed as a kernel command-line option somehow.  Digging...
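[For what it's worth, a udev rule along these lines might cover the boot-time case -- untested sketch, with the attribute name taken from the sysfs path quoted above, and the file name purely hypothetical:

```shell
# Hypothetical /etc/udev/rules.d/99-no-write-same.rules -- clamp the
# WRITE SAME limit to zero as soon as each scsi_disk device appears,
# before mdadm assembles the array.  The rule would also need to be
# copied into the initramfs for early-boot assembly to be covered.
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{max_write_same_blocks}="0"
```
]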

	-hpa



* Re: RAID-10 keeps aborting
  2013-06-03  6:14     ` Dan Williams
  2013-06-03  6:30       ` H. Peter Anvin
  2013-06-03 14:39       ` H. Peter Anvin
@ 2013-06-03 15:47       ` H. Peter Anvin
  2013-06-03 16:09         ` Joe Lawrence
  2013-06-03 17:22         ` Dan Williams
  2 siblings, 2 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03 15:47 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

On 06/02/2013 11:14 PM, Dan Williams wrote:
> 
> If I'm reading things correctly that may still result in failure since
> md will still pass the REQ_WRITE_SAME bios down to the devices and
> will receive BLK_PREP_KILL for its trouble.  md only notices that
> write same is disabled on underlying devices at assembly time.
> 

I have to admit I don't see where md (as opposed to dm) even checks
whether the underlying devices have write same enabled.  It seems extremely
likely that it is write same that is causing the headaches, though.

	-hpa



* Re: RAID-10 keeps aborting
  2013-06-03 15:47       ` H. Peter Anvin
@ 2013-06-03 16:09         ` Joe Lawrence
  2013-06-03 17:22         ` Dan Williams
  1 sibling, 0 replies; 122+ messages in thread
From: Joe Lawrence @ 2013-06-03 16:09 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Dan Williams, linux-raid

On Mon, 3 Jun 2013, H. Peter Anvin wrote:

> On 06/02/2013 11:14 PM, Dan Williams wrote:
> > 
> > If I'm reading things correctly that may still result in failure since
> > md will still pass the REQ_WRITE_SAME bios down to the devices and
> > will receive BLK_PREP_KILL for its trouble.  md only notices that
> > write same is disabled on underlying devices at assembly time.
> > 
> 
> I have to admit I don't see where md (as opposed to dm) even checks
> whether the underlying devices have write same enabled.  It seems extremely
> likely that it is write same that is causing the headaches, though.

I'll try to take a look later, but for now maybe these threads could help 
you?

  http://thread.gmane.org/gmane.linux.raid/41035
  http://thread.gmane.org/gmane.linux.raid/41078

It might be that after commit 4363ac7 "block: Implement support for WRITE 
SAME", the max_write_same_sectors for the MD is set to the minimum of its 
component disks (at least for RAID1).  Not sure exactly how RAID10 treats 
it, but maybe there are enough clues in that first thread to figure it 
out.

When I have some more time I can help investigate.

Regards,

-- Joe


* Re: RAID-10 keeps aborting
  2013-06-03 15:47       ` H. Peter Anvin
  2013-06-03 16:09         ` Joe Lawrence
@ 2013-06-03 17:22         ` Dan Williams
  2013-06-03 17:40           ` H. Peter Anvin
  1 sibling, 1 reply; 122+ messages in thread
From: Dan Williams @ 2013-06-03 17:22 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

On Mon, Jun 3, 2013 at 8:47 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 06/02/2013 11:14 PM, Dan Williams wrote:
>>
>> If I'm reading things correctly that may still result in failure since
>> md will still pass the REQ_WRITE_SAME bios down to the devices and
>> will receive BLK_PREP_KILL for its trouble.  md only notices that
>> write same is disabled on underlying devices at assembly time.
>>
>
> I have to admit I don't see where md (as opposed to dm) even checks
> whether the underlying devices have write same enabled.  It seems extremely
> likely that it is write same that is causing the headaches, though.
>

raid10 calls disk_stack_limits() after blk_queue_max_write_same_sectors().

...and here is where scsi treats these failures as non-fatal in
sd_done().  I assume the REQ_QUIET is why there are no other kernel
messages.

        case ILLEGAL_REQUEST:
                if (sshdr.asc == 0x10)  /* DIX: Host detected corruption */
                        good_bytes = sd_completed_bytes(SCpnt);
                /* INVALID COMMAND OPCODE or INVALID FIELD IN CDB */
                if (sshdr.asc == 0x20 || sshdr.asc == 0x24) {
                        switch (op) {
                        case UNMAP:
                                sd_config_discard(sdkp, SD_LBP_DISABLE);
                                break;
                        case WRITE_SAME_16:
                        case WRITE_SAME:
                                if (unmap)
                                        sd_config_discard(sdkp, SD_LBP_DISABLE);
                                else {
                                        sdkp->device->no_write_same = 1;
                                        sd_config_write_same(sdkp);

                                        good_bytes = 0;
                                        req->__data_len = blk_rq_bytes(req);
                                        req->cmd_flags |= REQ_QUIET;
                                }
                        }
                }
                break;


* Re: RAID-10 keeps aborting
  2013-06-03 17:22         ` Dan Williams
@ 2013-06-03 17:40           ` H. Peter Anvin
  2013-06-03 18:35             ` Martin K. Petersen
  0 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03 17:40 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

On 06/03/2013 10:22 AM, Dan Williams wrote:
> On Mon, Jun 3, 2013 at 8:47 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 06/02/2013 11:14 PM, Dan Williams wrote:
>>>
>>> If I'm reading things correctly that may still result in failure since
>>> md will still pass the REQ_WRITE_SAME bios down to the devices and
>>> will receive BLK_PREP_KILL for its trouble.  md only notices that
>>> write same is disabled on underlying devices at assembly time.
>>>
>>
>> I have to admit I don't see where md (as opposed to dm) even checks
>> whether the underlying devices have write same enabled.  It seems extremely
>> likely that it is write same that is causing the headaches, though.
>>
> 
> raid10 calls disk_stack_limits() after blk_queue_max_write_same_sectors().
> 

OK, I see it now.

I wonder if changing blk_queue_max_write_same_sectors() to zero in the
kernel sources would do the trick here... might be easier than making
udev/dracut do the right thing... :-/

> ...and here is where scsi considers failures as non-fatal in
> sd_done().  I assume the REQ_QUIET is why there are no other kernel
> messages.

>                         case WRITE_SAME_16:
>                         case WRITE_SAME:
>                                 if (unmap)
>                                         sd_config_discard(sdkp, SD_LBP_DISABLE);
>                                 else {
>                                         sdkp->device->no_write_same = 1;
>                                         sd_config_write_same(sdkp);
> 
>                                         good_bytes = 0;
>                                         req->__data_len = blk_rq_bytes(req);
>                                         req->cmd_flags |= REQ_QUIET;
>                                 }
>                         }
>                 }
>                 break;

OK, so the device here says don't do this again, but fails the request
anyway expecting the block device to pick up the slack.

	-hpa




* Re: RAID-10 keeps aborting
  2013-06-03 17:40           ` H. Peter Anvin
@ 2013-06-03 18:35             ` Martin K. Petersen
  2013-06-03 18:38               ` H. Peter Anvin
                                 ` (3 more replies)
  0 siblings, 4 replies; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-03 18:35 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Dan Williams, linux-raid

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

hpa> OK, so the device here says don't do this again, but fails the
hpa> request anyway expecting the block device to pick up the slack.

Yes, the block layer function will resort to writing out zeroes directly
in this case.

MD should not consider a rejected WRITE SAME a failure.
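[The fallback described here lives in the block layer's zeroout path; in simplified form it looks roughly like this -- paraphrased from memory of the 3.x sources, not verbatim kernel code:

```c
/* Simplified sketch (not verbatim 3.x kernel source): try a WRITE SAME
 * of the zero page first, and fall back to plain zero-filled WRITEs if
 * the device rejects the command. */
int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
			 sector_t nr_sects, gfp_t gfp_mask)
{
	if (bdev_write_same(bdev)) {
		if (!blkdev_issue_write_same(bdev, sector, nr_sects,
					     gfp_mask, ZERO_PAGE(0)))
			return 0;	/* WRITE SAME accepted */
	}
	/* Device can't (or won't) do WRITE SAME: write zero pages. */
	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
}
```
]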

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RAID-10 keeps aborting
  2013-06-03 18:35             ` Martin K. Petersen
@ 2013-06-03 18:38               ` H. Peter Anvin
  2013-06-03 18:40               ` H. Peter Anvin
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03 18:38 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Dan Williams, linux-raid

On 06/03/2013 11:35 AM, Martin K. Petersen wrote:
>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
> 
> hpa> OK, so the device here says don't do this again, but fails the
> hpa> request anyway expecting the block device to pick up the slack.
> 
> Yes, the block layer function will resort to writing out zeroes directly
> in this case.
> 
> MD should not consider a rejected WRITE SAME a failure.
> 

Right.  That seems to be the root of the problem here.

	-hpa



* Re: RAID-10 keeps aborting
  2013-06-03 18:35             ` Martin K. Petersen
  2013-06-03 18:38               ` H. Peter Anvin
@ 2013-06-03 18:40               ` H. Peter Anvin
  2013-06-03 22:20                 ` H. Peter Anvin
  2013-06-03 23:19               ` H. Peter Anvin
  2013-06-04 17:36               ` Dan Williams
  3 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03 18:40 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Dan Williams, linux-raid

On 06/03/2013 11:35 AM, Martin K. Petersen wrote:
>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
> 
> hpa> OK, so the device here says don't do this again, but fails the
> hpa> request anyway expecting the block device to pick up the slack.
> 
> Yes, the block layer function will resort to writing out zeroes directly
> in this case.
> 
> MD should not consider a rejected WRITE SAME a failure.
> 

Presumably MD should do the same thing that SCSI does: disable
write_same and kick the failure upstack?

	-hpa



* Re: RAID-10 keeps aborting
  2013-06-03 18:40               ` H. Peter Anvin
@ 2013-06-03 22:20                 ` H. Peter Anvin
  2013-06-03 22:34                   ` H. Peter Anvin
  0 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03 22:20 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Dan Williams, linux-raid

On 06/03/2013 11:40 AM, H. Peter Anvin wrote:
> On 06/03/2013 11:35 AM, Martin K. Petersen wrote:
>>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
>>
>> hpa> OK, so the device here says don't do this again, but fails the
>> hpa> request anyway expecting the block device to pick up the slack.
>>
>> Yes, the block layer function will resort to writing out zeroes directly
>> in this case.
>>
>> MD should not consider a rejected WRITE SAME a failure.
>>
> 
> Presumably MD should do the same thing that SCSI does: disable
> write_same and kick the failure upstack?
> 

Also, given the seriousness of the failure mode, is this something that
should be addressed in stable?

	-hpa



* Re: RAID-10 keeps aborting
  2013-06-03 22:20                 ` H. Peter Anvin
@ 2013-06-03 22:34                   ` H. Peter Anvin
  2013-06-04 15:56                     ` Martin K. Petersen
  0 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03 22:34 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Dan Williams, linux-raid

On 06/03/2013 03:20 PM, H. Peter Anvin wrote:
> 
> Also, given the seriousness of the failure mode, is this something that
> should be addressed in stable?
> 

Note that the filesystem is not mounted with the "discard" option, and
given the rarity of the aborts, whatever causes the WRITE SAME bio to be
generated must be a very rare event.

	-hpa



* Re: RAID-10 keeps aborting
  2013-06-03 18:35             ` Martin K. Petersen
  2013-06-03 18:38               ` H. Peter Anvin
  2013-06-03 18:40               ` H. Peter Anvin
@ 2013-06-03 23:19               ` H. Peter Anvin
  2013-06-04 15:39                 ` Joe Lawrence
  2013-06-04 17:36               ` Dan Williams
  3 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-03 23:19 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Dan Williams, linux-raid, Joe Lawrence

On 06/03/2013 11:35 AM, Martin K. Petersen wrote:
>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
> 
> hpa> OK, so the device here says don't do this again, but fails the
> hpa> request anyway expecting the block device to pick up the slack.
> 
> Yes, the block layer function will resort to writing out zeroes directly
> in this case.
> 
> MD should not consider a rejected WRITE SAME a failure.
> 

We should probably add Joe Lawrence to this thread.

Joe: basically it seems that the error behavior of md (at least raid10,
but probably raid1 as well) on WRITE SAME is wrong, and it causes the
RAID to abort.

	-hpa



* Re: RAID-10 keeps aborting
  2013-06-03 23:19               ` H. Peter Anvin
@ 2013-06-04 15:39                 ` Joe Lawrence
  2013-06-04 15:46                   ` H. Peter Anvin
  2013-06-05 10:02                   ` Bernd Schubert
  0 siblings, 2 replies; 122+ messages in thread
From: Joe Lawrence @ 2013-06-04 15:39 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Martin K. Petersen, Dan Williams, linux-raid, Joe Lawrence

On Mon, 3 Jun 2013, H. Peter Anvin wrote:

> On 06/03/2013 11:35 AM, Martin K. Petersen wrote:
> >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
> > 
> > hpa> OK, so the device here says don't do this again, but fails the
> > hpa> request anyway expecting the block device to pick up the slack.
> > 
> > Yes, the block layer function will resort to writing out zeroes directly
> > in this case.
> > 
> > MD should not consider a rejected WRITE SAME a failure.
> > 
> 
> We should probably add Joe Lawrence to this thread.
> 
> Joe: basically it seems that the error behavior of md (at least raid10,
> but probably raid1 as well) on WRITE SAME is wrong, and it causes the
> RAID to abort.

Martin is probably the expert here (I had extended his initial WRITE SAME 
support in MD raid0 to raid1 and raid10), but I can try failing a WS cmd 
using our SANBlaze emulator to see the fallout. 

Just curious, what type of drives were in your RAID and what does
/sys/class/scsi_disk/*/max_write_same_blocks report?  If you have a spare 
drive to test, maybe you could try a quick sg_write_same command to see 
how the drive reacts?
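[Something along these lines, on a scratch drive only -- sg_write_same is from sg3_utils; the LBA and /dev/sdX below are placeholders:

```shell
# Placeholders: /dev/sdX must be a spare/scratch drive, NOT an array member.
cat /sys/class/scsi_disk/*/max_write_same_blocks
# Issue a one-block WRITE SAME(16) of zeroes at an arbitrary LBA:
sg_write_same --16 --lba=1048576 --num=1 --in=/dev/zero /dev/sdX
# A drive that doesn't really support the command should fail it with
# ILLEGAL REQUEST (invalid opcode / invalid field in CDB) sense data.
echo $?
```
]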

-- Joe


* Re: RAID-10 keeps aborting
  2013-06-04 15:39                 ` Joe Lawrence
@ 2013-06-04 15:46                   ` H. Peter Anvin
  2013-06-04 15:54                     ` Martin K. Petersen
  2013-06-05 10:02                   ` Bernd Schubert
  1 sibling, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-04 15:46 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Martin K. Petersen, Dan Williams, linux-raid

On 06/04/2013 08:39 AM, Joe Lawrence wrote:
>>
>> We should probably add Joe Lawrence to this thread.
>>
>> Joe: basically it seems that the error behavior of md (at least raid10,
>> but probably raid1 as well) on WRITE SAME is wrong, and it causes the
>> RAID to abort.
> 
> Martin is probably the expert here (I had extended his initial WRITE SAME 
> support in MD raid0 to raid1 and raid10), but I can try failing a WS cmd 
> using our San Blaze emulator to see the fall out. 
> 
> Just curious, what type drives were in your RAID and what does
> /sys/class/scsi_disk/*/max_write_same_blocks report?  If you have a spare 
> drive to test, maybe you could try a quick sg_write_same command to see 
> how the drive reacts?
> 

The drives are SATA drives connected via mptsas.  max_write_same_blocks
shows 65535.  Unfortunately the problems are rare enough that it didn't
pop up until the server was put in production, so I would like to avoid
experimenting on it as much as possible.

	-hpa




* Re: RAID-10 keeps aborting
  2013-06-04 15:46                   ` H. Peter Anvin
@ 2013-06-04 15:54                     ` Martin K. Petersen
  0 siblings, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-04 15:54 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

hpa> Unfortunately the problems are rare enough that it didn't pop up
hpa> until the server was put in production, so I would like to avoid
hpa> experimenting on it as much as possible.

Yeah, and this is trivial to reproduce.

I will take a look if Joe doesn't beat me to it...

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RAID-10 keeps aborting
  2013-06-03 22:34                   ` H. Peter Anvin
@ 2013-06-04 15:56                     ` Martin K. Petersen
  0 siblings, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-04 15:56 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Martin K. Petersen, Dan Williams, linux-raid

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

hpa> Note that the filesystem is not mounted with the "discard" option,
hpa> and given the rarity of the aborts, whatever causes the WRITE SAME
hpa> bio to be generated must be a very rare event.

It's probably not discard. I'm guessing it's something that needs to
zero out a block range. Not something that happens very often.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RAID-10 keeps aborting
  2013-06-03 18:35             ` Martin K. Petersen
                                 ` (2 preceding siblings ...)
  2013-06-03 23:19               ` H. Peter Anvin
@ 2013-06-04 17:36               ` Dan Williams
  2013-06-04 17:54                 ` Martin K. Petersen
  3 siblings, 1 reply; 122+ messages in thread
From: Dan Williams @ 2013-06-04 17:36 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: H. Peter Anvin, linux-raid

On Mon, Jun 3, 2013 at 11:35 AM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
>
> hpa> OK, so the device here says don't do this again, but fails the
> hpa> request anyway expecting the block device to pick up the slack.
>
> Yes, the block layer function will resort to writing out zeroes directly
> in this case.
>
> MD should not consider a rejected WRITE SAME a failure.

What should md do in the partial success case?  Seems to be a layering
violation to assume that the block layer will retry with zeroes.
Maybe just act on BIO_QUIET to retry the write with REQ_WRITE_SAME
cleared?

--
Dan


* Re: RAID-10 keeps aborting
  2013-06-04 17:36               ` Dan Williams
@ 2013-06-04 17:54                 ` Martin K. Petersen
  2013-06-04 17:57                   ` H. Peter Anvin
  0 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-04 17:54 UTC (permalink / raw)
  To: Dan Williams; +Cc: Martin K. Petersen, H. Peter Anvin, linux-raid

>>>>> "Dan" == Dan Williams <dan.j.williams@gmail.com> writes:

>> MD should not consider a rejected WRITE SAME a failure.

Dan> What should md do in the partial success case?  

What exactly do you mean when you say "partial success"? Either the
device accepts the command or it doesn't...


Dan> Maybe just act on BIO_QUIET to retry the write with REQ_WRITE_SAME
Dan> cleared?

No go.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RAID-10 keeps aborting
  2013-06-04 17:54                 ` Martin K. Petersen
@ 2013-06-04 17:57                   ` H. Peter Anvin
  2013-06-04 18:04                     ` Martin K. Petersen
  0 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-04 17:57 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Dan Williams, linux-raid

On 06/04/2013 10:54 AM, Martin K. Petersen wrote:
>>>>>> "Dan" == Dan Williams <dan.j.williams@gmail.com> writes:
> 
>>> MD should not consider a rejected WRITE SAME a failure.
> 
> Dan> What should md do in the partial success case?  
> 
> What exactly do you mean when you say "partial success"? Either the
> device accepts the command or it doesn't...
> 

One subdevice accepts it and the other doesn't, presumably.

	-hpa



* Re: RAID-10 keeps aborting
  2013-06-04 17:57                   ` H. Peter Anvin
@ 2013-06-04 18:04                     ` Martin K. Petersen
  2013-06-04 18:32                       ` Dan Williams
  0 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-04 18:04 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Martin K. Petersen, Dan Williams, linux-raid

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

hpa> One subdevice accepts it and the other doesn't, presumably.

Ah. Well fail the command and let the block layer deal with it. This is
really no different from the discard case.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RAID-10 keeps aborting
  2013-06-04 18:04                     ` Martin K. Petersen
@ 2013-06-04 18:32                       ` Dan Williams
  2013-06-04 18:38                         ` H. Peter Anvin
  0 siblings, 1 reply; 122+ messages in thread
From: Dan Williams @ 2013-06-04 18:32 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: H. Peter Anvin, linux-raid

On Tue, Jun 4, 2013 at 11:04 AM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
>
> hpa> One subdevice accepts it and the other doesn't, presumably.
>
> Ah. Well fail the command and let the block layer deal with it. This is
> really no different from the discard case.
>

Which md also does not handle if the device later returns "illegal
request" to a discard command.  My point about one device accepting
the write and another device dropping it is that we now have an
inconsistent array and a write command to complete.  So I don't see
how md can wait/trust that the upper layer will retry and fix things
up?    Translate and retry internally for these command types, return
success to the original request, and disable future requests.


* Re: RAID-10 keeps aborting
  2013-06-04 18:32                       ` Dan Williams
@ 2013-06-04 18:38                         ` H. Peter Anvin
  2013-06-04 18:56                           ` Dan Williams
  0 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-04 18:38 UTC (permalink / raw)
  To: Dan Williams; +Cc: Martin K. Petersen, linux-raid

On 06/04/2013 11:32 AM, Dan Williams wrote:
> On Tue, Jun 4, 2013 at 11:04 AM, Martin K. Petersen
> <martin.petersen@oracle.com> wrote:
>>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
>>
>> hpa> One subdevice accepts it and the other doesn't, presumably.
>>
>> Ah. Well fail the command and let the block layer deal with it. This is
>> really no different from the discard case.
> 
> Which md also does not handle if the device later returns "illegal
> request" to a discard command.  My point about one device accepting
> the write and another device dropping it is that we now have an
> inconsistent array and a write command to complete.  So I don't see
> how md can wait/trust that the upper layer will retry and fix things
> up?    Translate and retry internally for these command types, return
> success to the original request, and disable future requests.
> 

Well, if that is what the block device layer is defined to do then that
is what the block layer does.  It makes sense from the point of view of
a disk, where the block layer has to translate and redo, so if the block
layer is defined to do that, why not rely on it?

	-hpa



* Re: RAID-10 keeps aborting
  2013-06-04 18:38                         ` H. Peter Anvin
@ 2013-06-04 18:56                           ` Dan Williams
  2013-06-05  2:39                             ` H. Peter Anvin
  0 siblings, 1 reply; 122+ messages in thread
From: Dan Williams @ 2013-06-04 18:56 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Martin K. Petersen, linux-raid

On Tue, Jun 4, 2013 at 11:38 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> Well, if that is what the block device layer is defined to do then that
> is what the block layer does.  It makes sense from the point of view of
> a disk, where the block layer has to translate and redo, so if the block
> layer is defined to do that, why not rely on it?
>

I'm just hung up on when we can safely mark the array as not dirty.
At a minimum this means raid needs a "I have an ignored-write-failure
in-flight, awaiting retry from upper layer" state.


* Re: RAID-10 keeps aborting
  2013-06-04 18:56                           ` Dan Williams
@ 2013-06-05  2:39                             ` H. Peter Anvin
       [not found]                               ` <(H.>
  0 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-05  2:39 UTC (permalink / raw)
  To: Dan Williams; +Cc: Martin K. Petersen, linux-raid

On 06/04/2013 11:56 AM, Dan Williams wrote:
> On Tue, Jun 4, 2013 at 11:38 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> Well, if that is what the block device layer is defined to do then that
>> is what the block layer does.  It makes sense from the point of view of
>> a disk, where the block layer has to translate and redo, so if the block
>> layer is defined to do that, why not rely on it?
>>
> 
> I'm just hung up on when we can safely mark the array as not dirty.
> At a minimum this means raid needs a "I have an ignored-write-failure
> in-flight, awaiting retry from upper layer" state.
> 

Ah yes, if you rely on the block layer to retry for you, you don't see the
beginnings and ends of the entire transaction, and at least ideally the
RAID -- and the specific blocks -- should be marked dirty during that
operation.  The same applies to DISCARD presumably.

Yuck, this suddenly got complex.  Perhaps WRITE SAME should simply be
disabled on raid1/raid10 until this can be addressed?  Do we need to do
the same for DISCARD?

	-hpa


* Re: RAID-10 keeps aborting
  2013-06-04 15:39                 ` Joe Lawrence
  2013-06-04 15:46                   ` H. Peter Anvin
@ 2013-06-05 10:02                   ` Bernd Schubert
  2013-06-05 11:38                     ` Bernd Schubert
  1 sibling, 1 reply; 122+ messages in thread
From: Bernd Schubert @ 2013-06-05 10:02 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid

On 06/04/2013 05:39 PM, Joe Lawrence wrote:
>
> Just curious, what type drives were in your RAID and what does
> /sys/class/scsi_disk/*/max_write_same_blocks report?  If you have a spare
> drive to test, maybe you could try a quick sg_write_same command to see
> how the drive reacts?
>

I just ran into the same issue with an ancient system from 2006. Except 
that I'm in a hurry and need it to stress-test my own work, I can do 
anything with it - it is booted via NFS and the disks are only used for 
development/testing.

> (squeeze)fslab1:~# cat /sys/block/md126/queue/write_same_max_bytes
> 16384

> (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes
> 0
> 0
> 0
> 0


Ah, now I found the reason why it fails: the SCSI layer had set 
write_same_max_bytes to zero when it detected that the device does not 
support it, but after reloading the Areca module (arcmsr) I now get:

> (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes
> 33553920
> 33553920
> 33553920
> 33553920

Now for example

> [11:0:1:2]   disk    Hitachi  HDS724040KLSA80  R001  /dev/sdl  /dev/sg11

> (squeeze)fslab1:~# sg_write_same --num=100 /dev/sg11
> Write same(10) command not supported


> (squeeze)fslab1:~# sg_write_same --16 --num=100 /dev/sg11
> Write same(16) command not supported


Cheers,
Bernd


PS: This is the 2nd time I have run into this; on Sunday I had a similar 
issue at home with Ubuntu's 3.8 kernel, but somehow not with vanilla 
3.9. I need to recheck the logs in the evening to see if it is really the 
same issue.






* Re: RAID-10 keeps aborting
  2013-06-05 10:02                   ` Bernd Schubert
@ 2013-06-05 11:38                     ` Bernd Schubert
  2013-06-05 12:53                       ` [PATCH] scsi: Check if the device support WRITE_SAME_10 Bernd Schubert
  2013-06-05 19:11                       ` RAID-10 keeps aborting Martin K. Petersen
  0 siblings, 2 replies; 122+ messages in thread
From: Bernd Schubert @ 2013-06-05 11:38 UTC (permalink / raw)
  Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams,
	linux-raid, linux-scsi

On 06/05/2013 12:02 PM, Bernd Schubert wrote:
> On 06/04/2013 05:39 PM, Joe Lawrence wrote:
>>
>> Just curious, what type drives were in your RAID and what does
>> /sys/class/scsi_disk/*/max_write_same_blocks report?  If you have a spare
>> drive to test, maybe you could try a quick sg_write_same command to see
>> how the drive reacts?
>>
>
> I just ran into the same issue with an ancient system from 2006. Except
> that I'm in a hurry and need it to stress-test my own work, I can do
> anything with it - it is booted via NFS and the disks are only used for
> development/testing.
>
>> (squeeze)fslab1:~# cat /sys/block/md126/queue/write_same_max_bytes
>> 16384
>
>> (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes
>> 0
>> 0
>> 0
>> 0
>
>
> Ah, now I found the reason why it fails: the SCSI layer had set
> write_same_max_bytes to zero when it detected that the device does not
> support it, but after reloading the Areca module (arcmsr) I now get:
>
>> (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes
>> 33553920
>> 33553920
>> 33553920
>> 33553920

In sd_config_write_same() it sets

	if (sdkp->max_ws_blocks == 0)
		sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS;

except when sdkp->device->no_write_same is set.
But only ata_scsi_sdev_config() sets that. And I also don't see any 
driver setting max_ws_blocks, so everything other than libata gets the 
default of SD_MAX_WS10_BLOCKS. This also seems to be consistent with the 
33553920 I see, except that there is a bit shift somewhere.
So it is no surprise that mptsas and arcmsr (and everything else) have 
write_same_max_bytes set.
As the correct handling in the md layer seems to be difficult, can we 
send a fake request at device configuration time to figure out if the 
device really supports write-same?




* [PATCH] scsi: Check if the device support WRITE_SAME_10
  2013-06-05 11:38                     ` Bernd Schubert
@ 2013-06-05 12:53                       ` Bernd Schubert
  2013-06-05 19:14                         ` Martin K. Petersen
  2013-06-05 19:11                       ` RAID-10 keeps aborting Martin K. Petersen
  1 sibling, 1 reply; 122+ messages in thread
From: Bernd Schubert @ 2013-06-05 12:53 UTC (permalink / raw)
  Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams,
	linux-raid, linux-scsi

Here's a rather simple patch for the SCSI midlayer.

checkpatch.pl complains about a style issue, but 
I just did it the same way as the other lines there.

> schubert@squeeze@fsdevel3 linux-stable>scripts/checkpatch.pl patches-linux-3.9.y/ws10 
> ERROR: spaces prohibited around that ':' (ctx:WxW)
> #48: FILE: drivers/scsi/sd.h:87:
> +       unsigned        ws10 : 1;
>                              ^

If wanted, I can send another patch to fix the other
lines first.



scsi: Check if the device support WRITE_SAME_10

From: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>

The md layer currently cannot handle failed WRITE_SAME commands
and the initial easiest fix is to check if the device supports
WRITE_SAME at all. It already tested for WRITE_SAME_16 and
this commit adds a test for WRITE_SAME_10.

Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
---
 drivers/scsi/sd.c |    6 +++++-
 drivers/scsi/sd.h |    1 +
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 82910cc..368f0a4 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -742,7 +742,7 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
 	unsigned int logical_block_size = sdkp->device->sector_size;
 	unsigned int blocks = 0;
 
-	if (sdkp->device->no_write_same) {
+	if (sdkp->device->no_write_same || !(sdkp->ws10 || sdkp->ws16)) {
 		sdkp->max_ws_blocks = 0;
 		goto out;
 	}
@@ -2648,6 +2648,10 @@ static void sd_read_block_provisioning(struct scsi_disk *sdkp)
 static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
 {
 	if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE,
+			       WRITE_SAME))
+		sdkp->ws10 = 1;
+
+	if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE,
 			       WRITE_SAME_16))
 		sdkp->ws16 = 1;
 }
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 2386aeb..7a049de 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -84,6 +84,7 @@ struct scsi_disk {
 	unsigned	lbpws : 1;
 	unsigned	lbpws10 : 1;
 	unsigned	lbpvpd : 1;
+	unsigned	ws10 : 1;
 	unsigned	ws16 : 1;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)


* Re: RAID-10 keeps aborting
  2013-06-05 11:38                     ` Bernd Schubert
  2013-06-05 12:53                       ` [PATCH] scsi: Check if the device support WRITE_SAME_10 Bernd Schubert
@ 2013-06-05 19:11                       ` Martin K. Petersen
  1 sibling, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-05 19:11 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams,
	linux-raid, linux-scsi

>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes:

Bernd> And I also don't see any driver setting max_ws_blocks, so
Bernd> everything except of libata gets the default of
Bernd> SD_MAX_WS10_BLOCKS. 

Yes. That's intentional. Unless the device provides MAXIMUM WRITE SAME
BLOCKS in the BLOCK LIMITS VPD.


Bernd> As the correct handling in the md layer seems to be difficult,
Bernd> can we send a fake request at device configuration time to figure
Bernd> out if the device really support write-same?

The problem is that WRITE SAME is destructive. And unfortunately a block
count of 0 means "write entire device".

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] scsi: Check if the device support WRITE_SAME_10
  2013-06-05 12:53                       ` [PATCH] scsi: Check if the device support WRITE_SAME_10 Bernd Schubert
@ 2013-06-05 19:14                         ` Martin K. Petersen
  2013-06-05 20:09                           ` Bernd Schubert
  0 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-05 19:14 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams,
	linux-raid, linux-scsi

>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes:

Bernd> The md layer currently cannot handle failed WRITE_SAME commands
Bernd> and the initial easiest fix is to check if the device supports
Bernd> WRITE_SAME at all. It already tested for WRITE_SAME_16 and this
Bernd> commit adds a test for WRITE_SAME_10.

No go. That'll disable WRITE SAME for drives which don't support
RSOC. Which means almost all of them.

I propose the following...

-- 
Martin K. Petersen	Oracle Linux Engineering


[PATCH] sd: Update WRITE SAME heuristics

SATA drives located behind a SAS controller would incorrectly receive
WRITE SAME commands. Tweak the heuristics so that:

 - If REPORT SUPPORTED OPERATION CODES is provided we will use that to
   choose between WRITE SAME(16), WRITE SAME(10) and disabled. This also
   fixes an issue with the old code which would issue WRITE SAME(10)
   despite the command not being whitelisted in REPORT SUPPORTED
   OPERATION CODES.

 - If REPORT SUPPORTED OPERATION CODES is not provided we will fall back
   to WRITE SAME(10) unless the device has an ATA Information VPD page.
   The assumption is that a SATL which is smart enough to implement
   WRITE SAME would also provide REPORT SUPPORTED OPERATION CODES.

To facilitate the new heuristics scsi_report_opcode() has been modified
to so we can distinguish between "operation not supported" and "RSOC not
supported".

Reported-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2c0d0ec..3b1ea34 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -1070,8 +1070,8 @@ EXPORT_SYMBOL_GPL(scsi_get_vpd_page);
  * @opcode:	opcode for command to look up
  *
  * Uses the REPORT SUPPORTED OPERATION CODES to look up the given
- * opcode. Returns 0 if RSOC fails or if the command opcode is
- * unsupported. Returns 1 if the device claims to support the command.
+ * opcode. Returns -EINVAL if RSOC fails, 0 if the command opcode is
+ * unsupported and 1 if the device claims to support the command.
  */
 int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
 		       unsigned int len, unsigned char opcode)
@@ -1081,7 +1081,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
 	int result;
 
 	if (sdev->no_report_opcodes || sdev->scsi_level < SCSI_SPC_3)
-		return 0;
+		return -EINVAL;
 
 	memset(cmd, 0, 16);
 	cmd[0] = MAINTENANCE_IN;
@@ -1097,7 +1097,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
 	if (result && scsi_sense_valid(&sshdr) &&
 	    sshdr.sense_key == ILLEGAL_REQUEST &&
 	    (sshdr.asc == 0x20 || sshdr.asc == 0x24) && sshdr.ascq == 0x00)
-		return 0;
+		return -EINVAL;
 
 	if ((buffer[1] & 3) == 3) /* Command supported */
 		return 1;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a37eda9..366b661 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -442,8 +442,15 @@ sd_store_write_same_blocks(struct device *dev, struct device_attribute *attr,
 
 	if (max == 0)
 		sdp->no_write_same = 1;
-	else if (max <= SD_MAX_WS16_BLOCKS)
-		sdkp->max_ws_blocks = max;
+	else
+		sdp->no_write_same = 0;
+
+	if (sdkp->ws16)
+		sdkp->max_ws_blocks =
+			max_t(unsigned long, max, SD_MAX_WS16_BLOCKS);
+	else
+		sdkp->max_ws_blocks =
+			max_t(unsigned long, max, SD_MAX_WS10_BLOCKS);
 
 	sd_config_write_same(sdkp);
 
@@ -762,16 +769,16 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
 	 * blocks per I/O unless the device explicitly advertises a
 	 * bigger limit.
 	 */
-	if (sdkp->max_ws_blocks == 0)
-		sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS;
-
-	if (sdkp->ws16 || sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS)
+	if (sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS)
 		blocks = min_not_zero(sdkp->max_ws_blocks,
 				      (u32)SD_MAX_WS16_BLOCKS);
 	else
 		blocks = min_not_zero(sdkp->max_ws_blocks,
 				      (u32)SD_MAX_WS10_BLOCKS);
 
+	if (sdkp->ws16 || sdkp->ws10 || sdkp->device->no_report_opcodes)
+		sdkp->max_ws_blocks = blocks;
+
 out:
 	blk_queue_max_write_same_sectors(q, blocks * (logical_block_size >> 9));
 }
@@ -2645,9 +2652,24 @@ static void sd_read_block_provisioning(struct scsi_disk *sdkp)
 
 static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
 {
-	if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE,
-			       WRITE_SAME_16))
+	struct scsi_device *sdev = sdkp->device;
+
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) {
+		sdev->no_report_opcodes = 1;
+
+		/* Disable WRITE SAME if REPORT SUPPORTED OPERATION
+		 * CODES is unsupported and the device has an ATA
+		 * Information VPD page (SAT).
+		 */
+		if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE))
+			sdev->no_write_same = 1;
+	}
+
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME_16) == 1)
 		sdkp->ws16 = 1;
+
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME) == 1)
+		sdkp->ws10 = 1;
 }
 
 static int sd_try_extended_inquiry(struct scsi_device *sdp)
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 2386aeb..7a049de 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -84,6 +84,7 @@ struct scsi_disk {
 	unsigned	lbpws : 1;
 	unsigned	lbpws10 : 1;
 	unsigned	lbpvpd : 1;
+	unsigned	ws10 : 1;
 	unsigned	ws16 : 1;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)


* Re: RAID-10 keeps aborting
       [not found]                                                   ` <-0700")>
@ 2013-06-05 19:29                                                     ` Martin K. Petersen
  2013-06-06 18:27                                                       ` Joe Lawrence
  2013-06-12 14:43                                                     ` Martin K. Petersen
                                                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-05 19:29 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Dan Williams, Martin K. Petersen, linux-raid, joe.lawrence

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

hpa> Yuck, this suddenly got complex.  Perhaps WRITE SAME should simply
hpa> be disabled on raid1/raid10 until this can be addressed?  

Yeah, maybe.


hpa> Do we need to do the same for DISCARD?

For discard we have better heuristics from the device so the partial
completion case should be rare. But obviously a disk can reject a
command anytime with or without reason...

Also, discard is advisory. There is no guarantee that a pair of mirrored
drives will be consistent in the discarded region (unless the drives
promise to zero discarded blocks but MD doesn't check that for
raid1/raid10).

WRITE SAME requires strict consistency between the devices, however. So
it looks like there's some work that needs to be done in that
department.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] scsi: Check if the device support WRITE_SAME_10
  2013-06-05 19:14                         ` Martin K. Petersen
@ 2013-06-05 20:09                           ` Bernd Schubert
  2013-06-07  2:15                             ` Martin K. Petersen
  0 siblings, 1 reply; 122+ messages in thread
From: Bernd Schubert @ 2013-06-05 20:09 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Joe Lawrence, H. Peter Anvin, Dan Williams, linux-raid, linux-scsi

On 06/05/2013 09:14 PM, Martin K. Petersen wrote:
>>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes:
> 
> Bernd> The md layer currently cannot handle failed WRITE_SAME commands
> Bernd> and the initial easiest fix is to check if the device supports
> Bernd> WRITE_SAME at all. It already tested for WRITE_SAME_16 and this
> Bernd> commit adds a test for WRITE_SAME_10.
> 
> No go. That'll disable WRITE SAME for drives which don't support
> RSOC. Which means almost all of them.

Ah, sorry, I didn't check the specs.

> 
> I propose the following...
> 

> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -442,8 +442,15 @@ sd_store_write_same_blocks(struct device *dev, struct device_attribute *attr,
>  
>  	if (max == 0)
>  		sdp->no_write_same = 1;
> -	else if (max <= SD_MAX_WS16_BLOCKS)
> -		sdkp->max_ws_blocks = max;
> +	else
> +		sdp->no_write_same = 0;
> +
> +	if (sdkp->ws16)
> +		sdkp->max_ws_blocks =
> +			max_t(unsigned long, max, SD_MAX_WS16_BLOCKS);
> +	else
> +		sdkp->max_ws_blocks =
> +			max_t(unsigned long, max, SD_MAX_WS10_BLOCKS);
>  
>  	sd_config_write_same(sdkp);

Max? Not min_t()?


> @@ -762,16 +769,16 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
>  	 * blocks per I/O unless the device explicitly advertises a
>  	 * bigger limit.
>  	 */
> -	if (sdkp->max_ws_blocks == 0)
> -		sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS;
> -
> -	if (sdkp->ws16 || sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS)
> +	if (sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS)
>  		blocks = min_not_zero(sdkp->max_ws_blocks,
>  				      (u32)SD_MAX_WS16_BLOCKS);
>  	else
>  		blocks = min_not_zero(sdkp->max_ws_blocks,
>  				      (u32)SD_MAX_WS10_BLOCKS);
>  
> +	if (sdkp->ws16 || sdkp->ws10 || sdkp->device->no_report_opcodes)
> +		sdkp->max_ws_blocks = blocks;
> +
>  out:
>  	blk_queue_max_write_same_sectors(q, blocks * (logical_block_size >> 9));
>  }

blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks * (logical_block_size >> 9)) ?
Otherwise sdkp->max_ws_blocks and the queue might have different values, wouldn't they?


I can't provide a comment about scsi_get_vpd_page, I simply don't know. You certainly
know the SCSI specs way better than I do!


Thanks,
Bernd





* Re: RAID-10 keeps aborting
  2013-06-05 19:29                                                     ` Martin K. Petersen
@ 2013-06-06 18:27                                                       ` Joe Lawrence
       [not found]                                                         ` <(Joe>
  2013-06-06 18:36                                                         ` H. Peter Anvin
  0 siblings, 2 replies; 122+ messages in thread
From: Joe Lawrence @ 2013-06-06 18:27 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: H. Peter Anvin, Dan Williams, linux-raid

On Wed, 5 Jun 2013 12:29:32 -0700 (PDT)
"Martin K. Petersen" <martin.petersen@oracle.com> wrote:

> >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
> 
> hpa> Yuck, this suddenly got complex.  Perhaps WRITE SAME should simply
> hpa> be disabled on raid1/raid10 until this can be addressed?  
> 
> Yeah, maybe.

Martin, 

I'm looking at the changes in raid1.c and am confused as to why we did
this in the first place (commit c8dc9c6):

        if (mddev->queue)
                blk_queue_max_write_same_sectors(mddev->queue,
                                                 mddev->chunk_sectors);

for RAID1, aren't chunk_sectors always going to be zero?  (At least on
my machine, /sys/block/md*/queue/write_same_max_bytes for all md RAID1
devices are set to 0.)  This would have the effect of rendering
bdev_write_same() always false for these MD devices.

-- Joe


* Re: RAID-10 keeps aborting
  2013-06-06 18:27                                                       ` Joe Lawrence
       [not found]                                                         ` <(Joe>
@ 2013-06-06 18:36                                                         ` H. Peter Anvin
  1 sibling, 0 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-06 18:36 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Martin K. Petersen, Dan Williams, linux-raid

On 06/06/2013 11:27 AM, Joe Lawrence wrote:
> On Wed, 5 Jun 2013 12:29:32 -0700 (PDT)
> "Martin K. Petersen" <martin.petersen@oracle.com> wrote:
> 
>>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
>>
>> hpa> Yuck, this suddenly got complex.  Perhaps WRITE SAME should simply
>> hpa> be disabled on raid1/raid10 until this can be addressed?  
>>
>> Yeah, maybe.
> 
> Martin, 
> 
> I'm looking at the changes in raid1.c and am confused as to why we did
> this in the first place (commit c8dc9c6):
> 
>         if (mddev->queue)
>                 blk_queue_max_write_same_sectors(mddev->queue,
>                                                  mddev->chunk_sectors);
> 
> for RAID1, aren't chunk_sectors always going to be zero?  (At least on
> my machine, /sys/block/md*/queue/write_same_max_bytes for all md RAID1
> devices are set to 0.)  This would have the effect of rendering
> bdev_write_same() always false for these MD devices.
> 

That presumably also explains why only RAID-10 seems to be affected.

	-hpa




* Re: [PATCH] scsi: Check if the device support WRITE_SAME_10
  2013-06-05 20:09                           ` Bernd Schubert
@ 2013-06-07  2:15                             ` Martin K. Petersen
  2013-06-12 19:34                               ` Bernd Schubert
  0 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-07  2:15 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Martin K. Petersen, Joe Lawrence, H. Peter Anvin, Dan Williams,
	linux-raid, linux-scsi

>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes:

>> max_t(unsigned long, max, SD_MAX_WS10_BLOCKS);

Bernd> Max? Not min_t()?

Brain fart. Updated patch with a few other adjustments.

I have tested this on a couple of JBODs with a mishmash of SATA and SAS
drives, including a few specimens that report MAX WRITE SAME BLOCKS.

-- 
Martin K. Petersen	Oracle Linux Engineering


[PATCH] sd: Update WRITE SAME heuristics

SATA drives located behind a SAS controller would incorrectly receive
WRITE SAME commands. Tweak the heuristics so that:

 - If REPORT SUPPORTED OPERATION CODES is provided we will use that to
   choose between WRITE SAME(16), WRITE SAME(10) and disabled. This also
   fixes an issue with the old code which would issue WRITE SAME(10)
   despite the command not being whitelisted in REPORT SUPPORTED
   OPERATION CODES.

 - If REPORT SUPPORTED OPERATION CODES is not provided we will fall back
   to WRITE SAME(10) unless the device has an ATA Information VPD page.
   The assumption is that a SATL which is smart enough to implement
   WRITE SAME would also provide REPORT SUPPORTED OPERATION CODES.

To facilitate the new heuristics scsi_report_opcode() has been modified
to so we can distinguish between "operation not supported" and "RSOC not
supported".

Reported-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: <stable@vger.kernel.org>

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2c0d0ec..3b1ea34 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -1070,8 +1070,8 @@ EXPORT_SYMBOL_GPL(scsi_get_vpd_page);
  * @opcode:	opcode for command to look up
  *
  * Uses the REPORT SUPPORTED OPERATION CODES to look up the given
- * opcode. Returns 0 if RSOC fails or if the command opcode is
- * unsupported. Returns 1 if the device claims to support the command.
+ * opcode. Returns -EINVAL if RSOC fails, 0 if the command opcode is
+ * unsupported and 1 if the device claims to support the command.
  */
 int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
 		       unsigned int len, unsigned char opcode)
@@ -1081,7 +1081,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
 	int result;
 
 	if (sdev->no_report_opcodes || sdev->scsi_level < SCSI_SPC_3)
-		return 0;
+		return -EINVAL;
 
 	memset(cmd, 0, 16);
 	cmd[0] = MAINTENANCE_IN;
@@ -1097,7 +1097,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
 	if (result && scsi_sense_valid(&sshdr) &&
 	    sshdr.sense_key == ILLEGAL_REQUEST &&
 	    (sshdr.asc == 0x20 || sshdr.asc == 0x24) && sshdr.ascq == 0x00)
-		return 0;
+		return -EINVAL;
 
 	if ((buffer[1] & 3) == 3) /* Command supported */
 		return 1;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a37eda9..420e763 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -442,8 +442,10 @@ sd_store_write_same_blocks(struct device *dev, struct device_attribute *attr,
 
 	if (max == 0)
 		sdp->no_write_same = 1;
-	else if (max <= SD_MAX_WS16_BLOCKS)
+	else if (max <= SD_MAX_WS16_BLOCKS) {
+		sdp->no_write_same = 0;
 		sdkp->max_ws_blocks = max;
+	}
 
 	sd_config_write_same(sdkp);
 
@@ -750,7 +752,6 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
 {
 	struct request_queue *q = sdkp->disk->queue;
 	unsigned int logical_block_size = sdkp->device->sector_size;
-	unsigned int blocks = 0;
 
 	if (sdkp->device->no_write_same) {
 		sdkp->max_ws_blocks = 0;
@@ -762,18 +763,20 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
 	 * blocks per I/O unless the device explicitly advertises a
 	 * bigger limit.
 	 */
-	if (sdkp->max_ws_blocks == 0)
-		sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS;
-
-	if (sdkp->ws16 || sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS)
-		blocks = min_not_zero(sdkp->max_ws_blocks,
-				      (u32)SD_MAX_WS16_BLOCKS);
-	else
-		blocks = min_not_zero(sdkp->max_ws_blocks,
-				      (u32)SD_MAX_WS10_BLOCKS);
+	if (sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS)
+		sdkp->max_ws_blocks = min_not_zero(sdkp->max_ws_blocks,
+						   (u32)SD_MAX_WS16_BLOCKS);
+	else if (sdkp->ws16 || sdkp->ws10 || sdkp->device->no_report_opcodes)
+		sdkp->max_ws_blocks = min_not_zero(sdkp->max_ws_blocks,
+						   (u32)SD_MAX_WS10_BLOCKS);
+	else {
+		sdkp->device->no_write_same = 1;
+		sdkp->max_ws_blocks = 0;
+	}
 
 out:
-	blk_queue_max_write_same_sectors(q, blocks * (logical_block_size >> 9));
+	blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
+					 (logical_block_size >> 9));
 }
 
 /**
@@ -2645,9 +2648,24 @@ static void sd_read_block_provisioning(struct scsi_disk *sdkp)
 
 static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
 {
-	if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE,
-			       WRITE_SAME_16))
+	struct scsi_device *sdev = sdkp->device;
+
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) {
+		sdev->no_report_opcodes = 1;
+
+		/* Disable WRITE SAME if REPORT SUPPORTED OPERATION
+		 * CODES is unsupported and the device has an ATA
+		 * Information VPD page (SAT).
+		 */
+		if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE))
+			sdev->no_write_same = 1;
+	}
+
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME_16) == 1)
 		sdkp->ws16 = 1;
+
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME) == 1)
+		sdkp->ws10 = 1;
 }
 
 static int sd_try_extended_inquiry(struct scsi_device *sdp)
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 2386aeb..7a049de 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -84,6 +84,7 @@ struct scsi_disk {
 	unsigned	lbpws : 1;
 	unsigned	lbpws10 : 1;
 	unsigned	lbpvpd : 1;
+	unsigned	ws10 : 1;
 	unsigned	ws16 : 1;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)

^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
       [not found]                                                   ` <-0400")>
@ 2013-06-07  2:19                                                     ` Martin K. Petersen
  2013-06-10 14:15                                                       ` Joe Lawrence
  0 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-07  2:19 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid

>>>>> "Joe" == Joe Lawrence <joe.lawrence@stratus.com> writes:

Joe> I'm looking at the changes in raid1.c and am confused as to why we
Joe> did this in the first place (commit c8dc9c6):

Joe>         if (mddev->queue)
Joe>                 blk_queue_max_write_same_sectors(mddev->queue, mddev-> chunk_sectors);

Joe> for RAID1, aren't chunk_sectors always going to be zero?  (At least
Joe> on my machine, /sys/block/md*/queue/write_same_max_bytes for all md
Joe> RAID1 devices are set to 0.)  This would have the effect of
Joe> rendering bdev_write_same() always false for these MD devices.

I'm guessing that's a copy and paste from my raid0 support. In the raid0
case I set it to prevent straddling discard/write same commands.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-07  2:19                                                     ` RAID-10 keeps aborting Martin K. Petersen
@ 2013-06-10 14:15                                                       ` Joe Lawrence
  2013-06-12  3:15                                                         ` NeilBrown
  0 siblings, 1 reply; 122+ messages in thread
From: Joe Lawrence @ 2013-06-10 14:15 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: H. Peter Anvin, Dan Williams, linux-raid, Neil Brown

[Cc: Neil -- a few questions if you want to skip to the bottom ]

I looked a little more into the RAID1 case and tried a few things.
First, to enable WRITE SAME on RAID1, I had to apply the patch at the
bottom of this mail. With the patch in place, the component disk limits
propagate up to their MD:

  [2:0:4:0]    disk    SANBLAZE VLUN0001         0001  /dev/sds 
  [3:0:2:0]    disk    SEAGATE  ST9146853SS      0004  /dev/sdv

  /sys/class/scsi_disk/2:0:4:0/max_write_same_blocks: 65535
  /sys/class/scsi_disk/3:0:2:0/max_write_same_blocks: 65535
  /sys/block/md125/queue/write_same_max_bytes: 33553920

I was interested in observing how a failed WRITE SAME would interact
with MD, the intent bitmap, and resyncing. 

In my setup, I created a RAID1 out of a 4G partition from a SAS disk
(supporting WRITE SAME) and a SanBlaze VirtualLUN (claiming WRITE SAME
support, but returning [sense_key,asc,ascq]: [0x05,0x20,0x00] on the
first such command.)  I also added an external write intent bitmap file
chunk size of 4KB to create a large, granular bitmap:

  mdadm --create /dev/md125 --raid-devices=2 --level=1 \
        --bitmap=/mnt/bitmap/file --bitmap-chunk=4K /dev/sds1 /dev/sdv1

After creating the RAID and letting the initial synchronization finish,
I filled the entire MD with random data.  I would use this to verify
resync using the write intent bitmap later.

From previous tests, I knew that the first failed WRITE SAME to the
VirtualLUN would bounce that disk from the MD.  The current and
subsequent WRITE SAME commands would complete just fine on the member
disk that actually supported the command.

To kick off WRITE SAME commands, I added a new ext4 filesystem to the
disk.  When mounting (no special options) this executes the following
call chain:

  ext4_lazyinit_thread
    ext4_init_inode_table
      sb_issue_zeroout
        blkdev_issue_zeroout
          blkdev_issue_write_same

When the first WRITE SAME hits the VirtualLUN, MD kicks it from the RAID
and degrades the array:

  EXT4-fs (md125): mounted filesystem with ordered data mode. Opts: (null)
  sd 2:0:6:0: [sds] CDB: 
  Write same(16): 93 00 00 00 00 00 00 00 21 10 00 00 0f f8 00 00
  mpt2sas0:        sas_address(0x500605b0006c0ae0), phy(3)
  mpt2sas0:        handle(0x000b), ioc_status(scsi data underrun)(0x0045), smid(59)
  mpt2sas0:        scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
  mpt2sas0:        [sense_key,asc,ascq]: [0x05,0x20,0x00], count(96)
  md/raid1:md125: Disk failure on sds1, disabling device.
  md/raid1:md125: Operation continuing on 1 devices.

the bitmap file starts recording dirty chunks:

  Sync Size : 4192192 (4.00 GiB 4.29 GB)
     Bitmap : 1048048 bits (chunks), 16409 dirty (1.6%)

The MD's write_same_max_bytes is left at 33553920 until the VirtualLUN is
failed/removed/re-added.  After the WRITE SAME failure, the VirtualLUN's
max_write_same_blocks have been set to 0.  When re-added to the MD, this
value is reconsidered in the MD's write_same_max_bytes, which also gets
set to zero.  This behavior seems okay as the remaining good disk fully
supported WRITE SAME when the RAID was degraded.  Once the
non-supporting component disk was added to the RAID1, WRITE SAME was
disabled for the MD:

  /sys/class/scsi_disk/2:0:4:0/max_write_same_blocks: 0
  /sys/class/scsi_disk/3:0:2:0/max_write_same_blocks: 65535
  /sys/block/md125/queue/write_same_max_bytes: 0

When the VirtualLUN was re-added to the RAID1, resync initiated.  Recall
that earlier I had dumped random bits on the entire MD device, so the
state of the disks should have looked like this:

  SAS  = init RAID sync + random bits + ext4 WRITE SAME 0's + ext4 misc
  VLUN = init RAID sync + random bits

and resync would need to consult the bitmap to repair the VLUN chunks
that WRITE SAME and whatever else ext4_lazyinit_thread laid down.

By setting the bitmap chunksize so small, the idea was to spread the
failed WRITE SAME across tracking bits.  CDB WRITE SAME num blocks was
0x0FF8 (4088) and 4088 x 512 (block size) ~= 2MB (much greater than
4KB).  With a systemtap probe, I saw 32 WRITE SAME commands (each about
4K blocks, i.e. ~2MB) emitted from the block layer via
ext4_lazyinit_thread.  So
the estimated dirty bits for all 32 should be somewhere around:

  32 * (2MB disk dirty / 4K disk per bit) = 16384 dirty bits

pretty close to the observed 16409 (the rest I assume were other ext4
housekeeping).

At this point we know:

  - Failed WRITE SAME will kick disk from MD RAID1
  - WRITE SAME is disabled if unsupported disk added to MD
  - Failed WRITE SAME is properly handled by bitmap, even when
    spanning bitmap bits.

A few outstanding questions that I have, maybe Neil or someone more
familiar with the code could answer.

Q1 - Is mddev->chunk_sectors always zero for RAID1?

Q2 - I noticed handle_write_finished calls narrow_write_error to try and
     potentially avoid failing an entire device.  In my tests,
     narrow_write_error never succeeded as rdev->badblocks.shift = -1.

     I think this is part of the bad block list code Neil has been
     working on.  I don't suppose this is the proper place for MD to
     reset write_same_max_bytes to disable future WRITE SAME and handle
     the individual writes here instead of in the block layer?

Regards,

-- Joe


From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001
From: Joe Lawrence <joe.lawrence@stratus.com>
Date: Fri, 7 Jun 2013 15:25:54 -0400
Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors

MD RAID1 chunk_sectors will always be zero, unlike RAID0, so RAID1 does
not need to worry about limiting the write same sectors in that regard.
Let disk_stack_limits choose the minimum of the RAID1 components write
same values.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
 drivers/md/raid1.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fd86b37..3dc9ad6 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2821,9 +2821,6 @@ static int run(struct mddev *mddev)
 	if (IS_ERR(conf))
 		return PTR_ERR(conf);
 
-	if (mddev->queue)
-		blk_queue_max_write_same_sectors(mddev->queue,
-						 mddev->chunk_sectors);
 	rdev_for_each(rdev, mddev) {
 		if (!mddev->gendisk)
 			continue;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-03 14:39       ` H. Peter Anvin
@ 2013-06-11 16:47         ` Joe Lawrence
  2013-06-11 17:12           ` H. Peter Anvin
  0 siblings, 1 reply; 122+ messages in thread
From: Joe Lawrence @ 2013-06-11 16:47 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Dan Williams, linux-raid

On Mon, 03 Jun 2013 07:39:46 -0700
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On 06/02/2013 11:14 PM, Dan Williams wrote:
> > On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> >> On 06/02/2013 10:47 PM, Dan Williams wrote:
> >>>
> >>> One hack to prove this may be to explicitly disable write_same before
> >>> the array is assembled:
> >>>
> >>> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done
> >>>
> >>> If this works then maybe md needs to be tolerant of write_same
> >>> failures since the block layer will simply retry with zeroes.
> >>>
> >>
> >> Trying that (array is already assembled but is currently functional.)
> >> Let's hope it works.
> >>
> > 
> > If I'm reading things correctly that may still result in failure since
> > md will still pass the REQ_WRITE_SAME bios down to the the devices and
> > will receive BLK_PREP_KILL for its trouble.  md only notices that
> > write same is disabled on underlying devices at assembly time.
> > 
> 
> Hmmm... that means getting dracut/udev to supply this little mod, unless
> it can be fed as a kernel command-line option somehow.  Digging...
> 
> 	-hpa

You've probably worked around this by now, but Dan's suggestion can be
tweaked if you are willing to fail/remove/re-add one of the disks.  I
just verified the following:

  /sys/block/md125/queue/write_same_max_bytes: 524288

% mdadm --fail /dev/md125 /dev/sds1
% mdadm --remove /dev/md125 /dev/sds1
% echo 0 > /sys/class/scsi_disk/2\:0\:2\:0/max_write_same_blocks
% mdadm --add /dev/md125 /dev/sds1

  /sys/block/md125/queue/write_same_max_bytes: 0

That gets raid10.c to invoke disk_stack_limits (via raid10_add_disk)
which will recalculate write_same_max_bytes for the MD.

Regards,

-- Joe

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-11 16:47         ` Joe Lawrence
@ 2013-06-11 17:12           ` H. Peter Anvin
  0 siblings, 0 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-11 17:12 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Dan Williams, linux-raid

On 06/11/2013 09:47 AM, Joe Lawrence wrote:
> 
> You've probably worked around this by now, but Dan's suggestion can be
> tweaked if you are willing to fail/remove/re-add one of the disks.  I
> just verified the following:
> 
>   /sys/block/md125/queue/write_same_max_bytes: 524288
> 
> % mdadm --fail /dev/md125 /dev/sds1
> % mdadm --remove /dev/md125 /dev/sds1
> % echo 0 > /sys/class/scsi_disk/2\:0\:2\:0/max_write_same_blocks
> % mdadm --add /dev/md125 /dev/sds1
> 
>   /sys/block/md125/queue/write_same_max_bytes: 0
> 
> That gets raid10.c to invoke disk_stack_limits (via raid10_add_disk)
> which will recalculate write_same_max_bytes for the MD.
> 

No thanks.  I just hacked the kernel directly.

	-hpa



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-03  3:57 RAID-10 keeps aborting H. Peter Anvin
  2013-06-03  4:05 ` H. Peter Anvin
  2013-06-03  5:47 ` Dan Williams
@ 2013-06-11 21:50 ` Joe Lawrence
  2013-06-11 21:53   ` H. Peter Anvin
  2 siblings, 1 reply; 122+ messages in thread
From: Joe Lawrence @ 2013-06-11 21:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-raid, Dan Williams, Martin K. Petersen, Bernd Schubert

I'll be away from the office the rest of the week for the Red Hat Summit,
but a few things tested today:

*** Test 1 (repeated for both RAID1 and RAID10)

Setup: RAID1/10, a single disk fails WRITE SAME with
       [sense_key,asc,ascq]: [0x05,0x20,0x00]

Result: the RAID md_personality error_handler was invoked for the disk
        that kicked back Illegal Request.  That disk was marked
	faulty and write intent bitmaps started tracking chunks.  WRITE
	SAME to the other disks succeeded so MD didn't fail the parent
	bio and therefore blkdev_issue_write_same was considered
	successful.  

        When the failed component disk is re-added to the RAID, the
        write_same_max_bytes are recalculated and set to 0.  Resync with
        write intent bitmap properly mirrored the disk.

This behavior seems reasonable, though one could argue that
write_same_max_bytes should be set to 0 for the MD device once any WRITE
SAME fails.  Better yet, the block layer could fall back to plain WRITEs
without dropping the disk.


*** Test 2

Setup: RAID1, all member disks fail WRITE SAME with
       [sense_key,asc,ascq]: [0x05,0x20,0x00].

Result: the first disk is kicked out of the RAID, but the second remains
	active.  The block layer detects an error from
	blkdev_issue_write_same, emits "mdXXX: WRITE SAME failed.
	Manually zeroing" messages, but never prevents future WRITE
        SAMEs from hitting this MD device.


*** Test 3

Setup: RAID10, all member disks fail WRITE SAME with
       [sense_key,asc,ascq]: [0x05,0x20,0x00].

Result: all component disks are marked as faulty and lost page writes
	occur on the MD device!  blkdev_issue_write_same fails and "mdXXX:
	WRITE SAME failed.  Manually zeroing" appears, but it's too late;
	there aren't any disks left standing in the RAID.


I think the results of Tests 2 and 3 confirm that RAID10 definitely
needs to be fixed...  Disabling WRITE SAME for RAID10 for now (3.10 +
stable) would be the safest thing to do until this is sorted out properly.

-- Joe

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-11 21:50 ` RAID-10 keeps aborting Joe Lawrence
@ 2013-06-11 21:53   ` H. Peter Anvin
  0 siblings, 0 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-11 21:53 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-raid, Dan Williams, Martin K. Petersen, Bernd Schubert

Exactly.

Joe Lawrence <joe.lawrence@stratus.com> wrote:

>I'll away from the office the rest of the week for the Red Hat Summit,
>but a few things tested today:
>
>*** Test 1 (repeated for both RAID1 and RAID10)
>
>Setup: RAID1/10, a single disk fails WRITE SAME with
>       [sense_key,asc,ascq]: [0x05,0x20,0x00]
>
>Result: the RAID md_personality error_handler was invoked for the disk
>        that kicked back Illegal Request.  That disk was marked
>	faulty and write intent bitmaps started tracking chunks.  WRITE
>	SAME to the other disks succeeded so MD didn't fail the parent
>	bio and therefore blkdev_issue_write_same was considered
>	successful.  
>
>        When the failed component disk is re-added to the RAID, the
>       write_same_max_bytes are recalculated and set to 0.  Resync with
>        write intent bitmap properly mirrored the disk.
>
>This behavior seems reasonable, though one could argue that
>write_same_max_bytes should be set to 0 for the MD device once any
>WRITE
>SAME fails.  Better yet, the block layer could fall back to plain
>WRITEs
>without dropping the disk.
>
>
>*** Test 2
>
>Setup: RAID1, all member disks fail WRITE SAME with
>       [sense_key,asc,ascq]: [0x05,0x20,0x00].
>
>Result: the first disk is kicked out of the RAID, but the second
>remains
>	active.  The block layer detected an error from
>	blkdev_issue_write_same, emits "mdXXX: WRITE SAME failed.
>	Manually zeroing" messages, but never prevents future WRITE
>        SAMEs from hitting this MD device.
>
>
>*** Test 3
>
>Setup: RAID10, all member disks fail WRITE SAME with
>       [sense_key,asc,ascq]: [0x05,0x20,0x00].
>
>Result: all component disks are marked as faulty and lost page writes
>	occur on MD device!  blkdev_issue_write_same fails and "mdXXX:
>	WRITE SAME failed.  Manually zeroing" appear, but it's too late,
>	there aren't any disks left standing in the RAID.
>
>
>I think the results of Tests 2 and 3 confirm that RAID10 definitely
>needs to be fixed...  Disabling WRITE SAME for RAID10 for now (3.10 +
>stable) would be safest thing to do until this is sorted out properly.
>
>-- Joe

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-10 14:15                                                       ` Joe Lawrence
@ 2013-06-12  3:15                                                         ` NeilBrown
  2013-06-12  4:07                                                           ` H. Peter Anvin
  2013-06-13  2:45                                                           ` Joe Lawrence
  0 siblings, 2 replies; 122+ messages in thread
From: NeilBrown @ 2013-06-12  3:15 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2465 bytes --]

On Mon, 10 Jun 2013 10:15:05 -0400 Joe Lawrence <joe.lawrence@stratus.com>
wrote:

> A few outstanding questions that I have, maybe Neil or someone more
> familiar with the code could answer.
> 
> Q1 - Is mddev->chunk_sectors is always zero for RAID1?

Not always.  But often.  It is largely ignored.  I think the only effect is
to round the size of the device down to a multiple of the chunk size.

> 
> Q2 - I noticed handle_write_finished calls narrow_write_error to try and
>      potentially avoid failing an entire device.  In my tests,
>      narrow_write_error never succeeded as rdev->badblocks.shift = -1.

Yes.  You would need a newer mdadm (from my git tree) to get badblocks.shift
to something else.

> 
>      I think this part of the bad block list code Neil has been working
>      on.  I don't suppose this is the proper place for MD to reset
>      write_same_max_bytes to disable future WRITE SAME and handling the
>      individual writes here instead of the block layer?

If a drive reports that WRITE SAME works, but it doesn't, then I'm not sure
that I can be happy about working with that drive.
If a drive has some quirky behaviour wrt WRITE SAME, then that should be
handled in some place where 'quirks' are handled - certainly not in md.


I've applied that patch below - thanks.

NeilBrown


> 
> Regards,
> 
> -- Joe
> 
> 
> From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001
> From: Joe Lawrence <joe.lawrence@stratus.com>
> Date: Fri, 7 Jun 2013 15:25:54 -0400
> Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors
> 
> MD RAID1 chunk_sectors will always be zero, unlike RAID0, so RAID1 does
> not need to worry about limiting the write same sectors in that regard.
> Let disk_stack_limits choose the minimum of the RAID1 components write
> same values.
> 
> Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
> ---
>  drivers/md/raid1.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index fd86b37..3dc9ad6 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2821,9 +2821,6 @@ static int run(struct mddev *mddev)
>  	if (IS_ERR(conf))
>  		return PTR_ERR(conf);
>  
> -	if (mddev->queue)
> -		blk_queue_max_write_same_sectors(mddev->queue,
> -						 mddev->chunk_sectors);
>  	rdev_for_each(rdev, mddev) {
>  		if (!mddev->gendisk)
>  			continue;


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12  3:15                                                         ` NeilBrown
@ 2013-06-12  4:07                                                           ` H. Peter Anvin
  2013-06-12  6:29                                                             ` Bernd Schubert
  2013-06-12 14:25                                                             ` Martin K. Petersen
  2013-06-13  2:45                                                           ` Joe Lawrence
  1 sibling, 2 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-12  4:07 UTC (permalink / raw)
  To: NeilBrown; +Cc: Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid

On 06/11/2013 08:15 PM, NeilBrown wrote:
> If a drive reports that WRITE SAME works, but it doesn't, then I'm
> not sure that I can be happy about working with that drive.

Seriously... we have that kind of problem all over the place with all
kinds of hardware.  Falling back is sensible... the problem here is
*where* that needs to happen... the block layer already does it, apparently.

> If a drive has some quirky behaviour wrt WRITE SAME, then that
> should be handled in some place where 'quirks' are handled -
> certainly not in md.

The problem here is that you don't find out ahead of time.

Now, if I understand correctly, the issue at hand is that the reporting
here was actually a Linux bug related to SATA drives behind a SAS
controller.  Martin, am I right?

	-hpa



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12  4:07                                                           ` H. Peter Anvin
@ 2013-06-12  6:29                                                             ` Bernd Schubert
  2013-06-12 10:22                                                               ` Joe Lawrence
  2013-06-12 14:28                                                               ` Martin K. Petersen
  2013-06-12 14:25                                                             ` Martin K. Petersen
  1 sibling, 2 replies; 122+ messages in thread
From: Bernd Schubert @ 2013-06-12  6:29 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: NeilBrown, Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid

On 06/12/2013 06:07 AM, H. Peter Anvin wrote:
> On 06/11/2013 08:15 PM, NeilBrown wrote:
>> If a drive reports that WRITE SAME works, but it doesn't, then I'm
>> not sure that I can be happy about working with that drive.
> 
> Seriously... we have that kind of problems all over the place with all
> kinds of hardware.  Falling back is sensible... the problem here is
> *where* that needs to happen... the block layer already does, apparently.
> 
>> If a drive has some quirky behaviour wrt WRITE SAME, then that
>> should be handled in some place where 'quirks' are handled -
>> certainly not in md.
> 
> The problem here is that you don't find out ahead of time.
> 
> Now, if I understand the issue at hand correctly is that the reporting
> here was actually a Linux bug related to SATA drives behind a SAS
> controller.  Martin, am I right?

Martin, please correct me if I'm wrong, but I think the code
optimistically enabled WRITE_SAME for any drive, except those on a
SATA (libata) controller. So it was not the drive that reported it can
do WRITE_SAME; the SCSI midlayer did. Martin's patch should improve
that (I still need to test it on our hardware), but I'm not sure
whether some hardware will still fall through.


Cheers,
Bernd


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12  6:29                                                             ` Bernd Schubert
@ 2013-06-12 10:22                                                               ` Joe Lawrence
  2013-06-12 14:28                                                               ` Martin K. Petersen
  1 sibling, 0 replies; 122+ messages in thread
From: Joe Lawrence @ 2013-06-12 10:22 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: H. Peter Anvin, NeilBrown, Joe Lawrence, Martin K. Petersen,
	Dan Williams, linux-raid

On Wed, 12 Jun 2013, Bernd Schubert wrote:

> On 06/12/2013 06:07 AM, H. Peter Anvin wrote:
> > On 06/11/2013 08:15 PM, NeilBrown wrote:
> >> If a drive reports that WRITE SAME works, but it doesn't, then I'm
> >> not sure that I can be happy about working with that drive.
> > 
> > Seriously... we have that kind of problems all over the place with all
> > kinds of hardware.  Falling back is sensible... the problem here is
> > *where* that needs to happen... the block layer already does, apparently.
> > 
> >> If a drive has some quirky behaviour wrt WRITE SAME, then that
> >> should be handled in some place where 'quirks' are handled -
> >> certainly not in md.
> > 
> > The problem here is that you don't find out ahead of time.
> > 
> > Now, if I understand the issue at hand correctly is that the reporting
> > here was actually a Linux bug related to SATA drives behind a SAS
> > controller.  Martin, am I right?
> 
> Martin, please correct me if I'm wrong, but I think the code
> optimistically enabled WRITE_SAME for any drive, except of those on a
> sata (libata) controller. So not the drive reported that it can do
> WRITE_SAME, but scsi-midlayer did that. Martins patch should improve
> that (I still need to test it on our hardware), but I'm not sure if
> there won't be some hardware falling through.

I'm worried about other unsupported HW, too.  Last night I started writing 
a patch to set the raid1,10 max write same sectors to 0 for inclusion in 
3.10 + stable... I'd like to include a mention of Martin's patch/the SATA 
drives in the commit log.  Thanks so much for hunting this down.

-- Joe

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12  4:07                                                           ` H. Peter Anvin
  2013-06-12  6:29                                                             ` Bernd Schubert
@ 2013-06-12 14:25                                                             ` Martin K. Petersen
  2013-06-12 14:29                                                               ` H. Peter Anvin
  1 sibling, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-12 14:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: NeilBrown, Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

>> If a drive has some quirky behaviour wrt WRITE SAME, then that should
>> be handled in some place where 'quirks' are handled - certainly not
>> in md.

hpa> The problem here is that you don't find out ahead of time.

hpa> Now, if I understand the issue at hand correctly is that the
hpa> reporting here was actually a Linux bug related to SATA drives
hpa> behind a SAS controller.  Martin, am I right?

Support for WRITE SAME is harder for us to detect. With discard we have
a set of device-reported bits we can use as triggers, not so for WRITE
SAME. And since it is a destructive command we cannot simply issue one
at device discovery time to see whether it works.

Technically there's nothing that prevents a SAS controller's SCSI-ATA
Translation to handle WRITE SAME. The patch I posted simply adds another
heuristic. Namely that if we can see that the drive behind the SAS
controller is of the ATA persuasion we will not attempt to issue WRITE
SAME unless the controller explicitly advertises WRITE SAME support
using REPORT SUPPORTED OPERATION CODES.

Sadly we cannot exclusively rely on RSOC when deciding whether WRITE
SAME is supported or not for devices in general. 95% of the WRITE
SAME-capable devices out there do not support RSOC :(

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12  6:29                                                             ` Bernd Schubert
  2013-06-12 10:22                                                               ` Joe Lawrence
@ 2013-06-12 14:28                                                               ` Martin K. Petersen
  1 sibling, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-12 14:28 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: H. Peter Anvin, NeilBrown, Joe Lawrence, Martin K. Petersen,
	Dan Williams, linux-raid

>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes:

Bernd> please correct me if I'm wrong, but I think the code
Bernd> optimistically enabled WRITE_SAME for any drive, except of those
Bernd> on a sata (libata) controller. 

Correct. WRITE SAME has been used extensively by RAID manufacturers for
years. So pretty much any SCSI-class drive supports it.

The headache we ran into in this case was SATA drives behind a SAS
controller.

Bernd> Martins patch should improve that (I still need to test it on our
Bernd> hardware), 

Please do! James wants a tested-by from someone that's not me before he
pushes the patch to Linus.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12 14:25                                                             ` Martin K. Petersen
@ 2013-06-12 14:29                                                               ` H. Peter Anvin
  2013-06-12 14:34                                                                 ` Martin K. Petersen
  0 siblings, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-12 14:29 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: NeilBrown, Joe Lawrence, Dan Williams, linux-raid

On 06/12/2013 07:25 AM, Martin K. Petersen wrote:
>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
> 
>>> If a drive has some quirky behaviour wrt WRITE SAME, then that should
>>> be handled in some place where 'quirks' are handled - certainly not
>>> in md.
> 
> hpa> The problem here is that you don't find out ahead of time.
> 
> hpa> Now, if I understand the issue at hand correctly is that the
> hpa> reporting here was actually a Linux bug related to SATA drives
> hpa> behind a SAS controller.  Martin, am I right?
> 
> Support for WRITE SAME is harder for us to detect. With discard we have
> a set of device-reported bits we can use as triggers, not so for WRITE
> SAME. And since it is a destructive command we can not simply issue one
> at device discovery time to try whether it works.
> 
> Technically there's nothing that prevents a SAS controller's SCSI-ATA
> Translation to handle WRITE SAME. The patch I posted simply adds another
> heuristic. Namely that if we can see that the drive behind the SAS
> controller is of the ATA persuasion we will not attempt to issue WRITE
> SAME unless the controller explicitly advertises WRITE SAME support
> using REPORT SUPPORTED OPERATION CODES.
> 
> Sadly we can not exclusively rely on RSOC when deciding whether WRITE
> SAME is supported or not for devices in general. 95% of the WRITE
> SAME-capable devices out there do not support RSOC :(
> 

The second question is if we should disable WRITE SAME for raid1/10
(what about raid0?) for 3.10/stable or if your patch really is
sufficient... "just adds another heuristic" makes me nervous.

	-hpa


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12 14:29                                                               ` H. Peter Anvin
@ 2013-06-12 14:34                                                                 ` Martin K. Petersen
  2013-06-12 14:37                                                                   ` H. Peter Anvin
  2013-06-12 14:45                                                                   ` H. Peter Anvin
  0 siblings, 2 replies; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-12 14:34 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Martin K. Petersen, NeilBrown, Joe Lawrence, Dan Williams, linux-raid

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

hpa> The second question is if we should disable WRITE SAME for raid1/10
hpa> (what about raid0?) for 3.10/stable or if your patch really is
hpa> sufficient... "just adds another heuristic" makes me nervous.

I think we should disable 1+10 in stable until we get the recovery
scenario sorted out.

I don't believe there are any problems with raid0.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12 14:34                                                                 ` Martin K. Petersen
@ 2013-06-12 14:37                                                                   ` H. Peter Anvin
  2013-06-12 14:45                                                                   ` H. Peter Anvin
  1 sibling, 0 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-12 14:37 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: NeilBrown, Joe Lawrence, Dan Williams, linux-raid

On 06/12/2013 07:34 AM, Martin K. Petersen wrote:
> 
> I don't believe there are any problems with raid0.
> 

I was wondering if a WRITE SAME sent to raid0 would cause the array to
get stopped or panicked.

	-hpa


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
       [not found]                                                   ` <-0700")>
  2013-06-05 19:29                                                     ` Martin K. Petersen
@ 2013-06-12 14:43                                                     ` Martin K. Petersen
  2020-04-01  2:29                                                     ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Martin K. Petersen
  2020-05-12 16:01                                                     ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Martin K. Petersen
  3 siblings, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-12 14:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Martin K. Petersen, NeilBrown, Joe Lawrence, Dan Williams, linux-raid

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

>> I don't believe there are any problems with raid0.

hpa> I was wondering if a WRITE SAME sent to raid0 would cause the array
hpa> to get stopped or panicked.

raid0 doesn't do its own endio processing so an error will just bubble
up to the generic code in blk-lib.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12 14:34                                                                 ` Martin K. Petersen
  2013-06-12 14:37                                                                   ` H. Peter Anvin
@ 2013-06-12 14:45                                                                   ` H. Peter Anvin
       [not found]                                                                     ` <5AA430FFE4486C448003201AC83BC85E0360CE3F@EXHQ.corp.stratus.com>
  2013-06-13  3:10                                                                     ` NeilBrown
  1 sibling, 2 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-12 14:45 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: NeilBrown, Joe Lawrence, Dan Williams, linux-raid

[-- Attachment #1: Type: text/plain, Size: 504 bytes --]

On 06/12/2013 07:34 AM, Martin K. Petersen wrote:
>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
> 
> hpa> The second question is if we should disable WRITE SAME for raid1/10
> hpa> (what about raid0?) for 3.10/stable or if your patch really is
> hpa> sufficient... "just adds another heuristic" makes me nervous.
> 
> I think we should disable 1+10 in stable until we get the recovery
> scenario sorted out.
> 
> I don't believe there are any problems with raid0.
> 

How does this look?

	-hpa


[-- Attachment #2: 0001-raid1-10-Disable-WRITE-SAME-until-a-recovery-strateg.patch --]
[-- Type: text/x-patch, Size: 2452 bytes --]

From ac28be1574a6187f4f26bd75217059bf17b13560 Mon Sep 17 00:00:00 2001
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Wed, 12 Jun 2013 07:37:43 -0700
Subject: [PATCH] raid1,10: Disable WRITE SAME until a recovery strategy is in
 place

There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME.  This currently happens for SATA drives behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.

After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will treat the
failure as fatal and mark the drive failed.  This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.

However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.

Until the failure scenario has been sorted out, disable WRITE SAME for
raid1 and raid10.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 drivers/md/raid1.c  | 4 ++--
 drivers/md/raid10.c | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 5595118..914ca0a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2780,8 +2780,8 @@ static int run(struct mddev *mddev)
 		return PTR_ERR(conf);
 
 	if (mddev->queue)
-		blk_queue_max_write_same_sectors(mddev->queue,
-						 mddev->chunk_sectors);
+		blk_queue_max_write_same_sectors(mddev->queue, 0);
+
 	rdev_for_each(rdev, mddev) {
 		if (!mddev->gendisk)
 			continue;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 59d4daa..807ace8 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3609,8 +3609,7 @@ static int run(struct mddev *mddev)
 	if (mddev->queue) {
 		blk_queue_max_discard_sectors(mddev->queue,
 					      mddev->chunk_sectors);
-		blk_queue_max_write_same_sectors(mddev->queue,
-						 mddev->chunk_sectors);
+		blk_queue_max_write_same_sectors(mddev->queue, 0);
 		blk_queue_io_min(mddev->queue, chunk_size);
 		if (conf->geo.raid_disks % conf->geo.near_copies)
 			blk_queue_io_opt(mddev->queue, chunk_size * conf->geo.raid_disks);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
       [not found]                                                                       ` <5AA430FFE4486C448003201AC83BC85E0360CE3F@EXHQ.corp.stratus.com>
@ 2013-06-12 15:58                                                                         ` H. Peter Anvin
  0 siblings, 0 replies; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-12 15:58 UTC (permalink / raw)
  To: Lawrence, Joe; +Cc: Martin K. Petersen, NeilBrown, Dan Williams, linux-raid

On 06/12/2013 08:55 AM, Lawrence, Joe wrote:
> This looks exactly like what I started last night, only with a more detailed commit msg.  Would it be worth mentioning that block will only retry if MD fails the master bio?  (I.e., if one of the mirrored components succeeds the WRITE SAME.)

Probably not... the big deal here is that it isn't a viable solution due
to the write bitmap fence.

	-hpa



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] scsi: Check if the device support WRITE_SAME_10
  2013-06-07  2:15                             ` Martin K. Petersen
@ 2013-06-12 19:34                               ` Bernd Schubert
  0 siblings, 0 replies; 122+ messages in thread
From: Bernd Schubert @ 2013-06-12 19:34 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Joe Lawrence, H. Peter Anvin, Dan Williams, linux-raid, linux-scsi

On 06/07/2013 04:15 AM, Martin K. Petersen wrote:
>>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes:
>
>>> max_t(unsigned long, max, SD_MAX_WS10_BLOCKS);
>
> Bernd> Max? Not min_t()?
>
> Brain fart. Updated patch with a few other adjustments.
>
> I have tested this on a couple of JBODs with a mishmash of SATA and SAS
> drives, including a few specimens that report MAX WRITE SAME BLOCKS.
>

Thanks for the update!

I'm far too long at work again, but I managed to test it now and it 
works fine for the ancient areca controller (ARC-1260) of this test-lab. 
So you may add

Tested-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>

and

Reviewed-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>

(although the latter probably does not count much for linux-scsi).


Cheers,
Bernd

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12  3:15                                                         ` NeilBrown
  2013-06-12  4:07                                                           ` H. Peter Anvin
@ 2013-06-13  2:45                                                           ` Joe Lawrence
  2013-06-13  3:11                                                             ` NeilBrown
  1 sibling, 1 reply; 122+ messages in thread
From: Joe Lawrence @ 2013-06-13  2:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: Joe Lawrence, Martin K. Petersen, H. Peter Anvin, Dan Williams,
	linux-raid

On Wed, 12 Jun 2013, NeilBrown wrote:

> I've applied that patch below - thanks.
> 
> NeilBrown
> > 
> > From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001
> > From: Joe Lawrence <joe.lawrence@stratus.com>
> > Date: Fri, 7 Jun 2013 15:25:54 -0400
> > Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors
> > 
[snip patch]

Hi Neil -- this patch was only to test out what RAID1 would do if the 
blk_queue_max_write_same_sectors were set to enable WRITE SAME.  As HPA 
and Martin point out, we should be disabling WRITE SAME for raid1/10 in 
3.10 / stable for now.  Sorry for the confusion.

-- Joe

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-12 14:45                                                                   ` H. Peter Anvin
       [not found]                                                                     ` <5AA430FFE4486C448003201AC83BC85E0360CE3F@EXHQ.corp.stratus.com>
@ 2013-06-13  3:10                                                                     ` NeilBrown
  2013-06-13  3:13                                                                       ` H. Peter Anvin
  2013-06-13 21:40                                                                       ` Martin K. Petersen
  1 sibling, 2 replies; 122+ messages in thread
From: NeilBrown @ 2013-06-13  3:10 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid

[-- Attachment #1: Type: text/plain, Size: 960 bytes --]

On Wed, 12 Jun 2013 07:45:16 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote:

> On 06/12/2013 07:34 AM, Martin K. Petersen wrote:
> >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
> > 
> > hpa> The second question is if we should disable WRITE SAME for raid1/10
> > hpa> (what about raid0?) for 3.10/stable or if your patch really is
> > hpa> sufficient... "just adds another heuristic" makes me nervous.
> > 
> > I think we should disable 1+10 in stable until we get the recovery
> > scenario sorted out.
> > 
> > I don't believe there are any problems with raid0.
> > 
> 
> How does this look?
> 
> 	-hpa
> 

Promising - thanks.

However should we do the same thing in raid5.c too?
As far as I can tell, the default set by blk_set_stacking_limits() (which md
calls) is to allow WRITE_SAME if all underlying devices do.
But I'm pretty sure raid5 will do the wrong thing with a WRITE_SAME request.

??

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-13  2:45                                                           ` Joe Lawrence
@ 2013-06-13  3:11                                                             ` NeilBrown
  0 siblings, 0 replies; 122+ messages in thread
From: NeilBrown @ 2013-06-13  3:11 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1036 bytes --]

On Wed, 12 Jun 2013 22:45:08 -0400 (EDT) Joe Lawrence
<joe.lawrence@stratus.com> wrote:

> On Wed, 12 Jun 2013, NeilBrown wrote:
> 
> > I've applied that patch below - thanks.
> > 
> > NeilBrown
> > > 
> > > From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001
> > > From: Joe Lawrence <joe.lawrence@stratus.com>
> > > Date: Fri, 7 Jun 2013 15:25:54 -0400
> > > Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors
> > > 
> [snip patch]
> 
> Hi Neil -- this patch was only to test out what RAID1 would do if the 
> blk_queue_max_write_same_sectors were set to enable WRITE SAME.  As HPA 
> and Martin point out, we should be disabling WRITE SAME for raid1/10 in 
> 3.10 / stable for now.  Sorry for the confusion.
> 
> -- Joe

Ahh - thanks.

the code your patch removes is clearly wrong as chunk size doesn't mean
anything useful for raid1, but as you say we need to actually disable it, so
may as well do that at the same time as removing the error.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-13  3:10                                                                     ` NeilBrown
@ 2013-06-13  3:13                                                                       ` H. Peter Anvin
  2013-06-13  3:31                                                                         ` NeilBrown
  2013-06-13 21:40                                                                       ` Martin K. Petersen
  1 sibling, 1 reply; 122+ messages in thread
From: H. Peter Anvin @ 2013-06-13  3:13 UTC (permalink / raw)
  To: NeilBrown; +Cc: Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid

On 06/12/2013 08:10 PM, NeilBrown wrote:
> 
> Promising - thanks.
> 
> However should we do the same thing in raid5.c too? As far as I can
> tell, the default set by blk_set_stacking_limits() (which md calls)
> > is to allow WRITE_SAME if all underlying devices do. But I'm
> pretty sure raid5 will do the wrong thing with a WRITE_SAME
> request.
> 

Yes, if raid5 also bounces the array if the WRITE SAME request fails
at the device we need to do the same thing there.

	-hpa



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-13  3:13                                                                       ` H. Peter Anvin
@ 2013-06-13  3:31                                                                         ` NeilBrown
  0 siblings, 0 replies; 122+ messages in thread
From: NeilBrown @ 2013-06-13  3:31 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid

[-- Attachment #1: Type: text/plain, Size: 681 bytes --]

On Wed, 12 Jun 2013 20:13:19 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote:

> On 06/12/2013 08:10 PM, NeilBrown wrote:
> > 
> > Promising - thanks.
> > 
> > However should we do the same thing in raid5.c too? As far as I can
> > tell, the default set by blk_set_stacking_limits() (which md calls)
> > is to allow WRITE_SAME if all underlying devices do. But I'm
> > pretty sure raid5 will do the wrong thing with a WRITE_SAME
> > request.
> > 
> 
> Yes, if raid5 also bounces the array if the WRITE SAME request fails
> at the device we need to do the same thing there.
> 
> 	-hpa
> 

Thanks. I've modified the patch and tagged it for -stable.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: RAID-10 keeps aborting
  2013-06-13  3:10                                                                     ` NeilBrown
  2013-06-13  3:13                                                                       ` H. Peter Anvin
@ 2013-06-13 21:40                                                                       ` Martin K. Petersen
  1 sibling, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2013-06-13 21:40 UTC (permalink / raw)
  To: NeilBrown
  Cc: H. Peter Anvin, Martin K. Petersen, Joe Lawrence, Dan Williams,
	linux-raid

>>>>> "Neil" == NeilBrown  <neilb@suse.de> writes:

Neil> But I'm pretty sure raid5 will do the wrong thing with a
Neil> WRITE_SAME request.

Yeah, you should set: 

	mddev->queue->max_write_same_blocks = 0;

before disk_stack_limits() is called. Just like it's done with
discard_zeroes_data.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH] scsi_dh_alua: Remove stale variables
@ 2015-12-03  6:57 Hannes Reinecke
       [not found] ` <(Hannes>
                   ` (2 more replies)
  0 siblings, 3 replies; 122+ messages in thread
From: Hannes Reinecke @ 2015-12-03  6:57 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, James Bottomley, linux-scsi, Hannes Reinecke

With commit 83ea0e5e3501 these variables became obsolete,
but weren't removed.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index f100cbb..5a328bf 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -320,8 +320,6 @@ static int alua_check_tpgs(struct scsi_device *sdev)
  */
 static int alua_check_vpd(struct scsi_device *sdev, struct alua_dh_data *h)
 {
-	unsigned char *d;
-	unsigned char __rcu *vpd_pg83;
 	int rel_port = -1, group_id;
 
 	group_id = scsi_vpd_tpg_id(sdev, &rel_port);
-- 
1.8.5.6


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH] scsi_dh_alua: Remove stale variables
  2015-12-03  6:57 [PATCH] scsi_dh_alua: Remove stale variables Hannes Reinecke
       [not found] ` <(Hannes>
@ 2015-12-03  9:23 ` Johannes Thumshirn
  2015-12-03 16:43 ` Christoph Hellwig
  2 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2015-12-03  9:23 UTC (permalink / raw)
  To: Hannes Reinecke, Martin K. Petersen
  Cc: Christoph Hellwig, James Bottomley, linux-scsi

On Thu, 2015-12-03 at 07:57 +0100, Hannes Reinecke wrote:
> With commit 83ea0e5e3501 these variables became obsolete,
> but weren't removed.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/scsi/device_handler/scsi_dh_alua.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c
> b/drivers/scsi/device_handler/scsi_dh_alua.c
> index f100cbb..5a328bf 100644
> --- a/drivers/scsi/device_handler/scsi_dh_alua.c
> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c
> @@ -320,8 +320,6 @@ static int alua_check_tpgs(struct scsi_device *sdev)
>   */
>  static int alua_check_vpd(struct scsi_device *sdev, struct alua_dh_data *h)
>  {
> -	unsigned char *d;
> -	unsigned char __rcu *vpd_pg83;
>  	int rel_port = -1, group_id;
>  
>  	group_id = scsi_vpd_tpg_id(sdev, &rel_port);

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] scsi_dh_alua: Remove stale variables
  2015-12-03  6:57 [PATCH] scsi_dh_alua: Remove stale variables Hannes Reinecke
       [not found] ` <(Hannes>
  2015-12-03  9:23 ` Johannes Thumshirn
@ 2015-12-03 16:43 ` Christoph Hellwig
  2 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2015-12-03 16:43 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Martin K. Petersen, Christoph Hellwig, James Bottomley, linux-scsi

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] scsi_dh_alua: Remove stale variables
       [not found]                                                   ` <+0100")>
@ 2015-12-08  1:12                                                     ` Martin K. Petersen
  2020-12-09  2:17                                                     ` [PATCH 0/2] two UFS changes Martin K. Petersen
  1 sibling, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2015-12-08  1:12 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Martin K. Petersen, Christoph Hellwig, James Bottomley, linux-scsi

>>>>> "Hannes" == Hannes Reinecke <hare@suse.de> writes:

Hannes> With commit 83ea0e5e3501 these variables became obsolete, but
Hannes> weren't removed.

Applied to 4.5/scsi-queue.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* linux-next: Signed-off-by missing for commit in the scsi-fixes tree
@ 2018-11-06 19:48 Stephen Rothwell
       [not found] ` <(Stephen>
  0 siblings, 1 reply; 122+ messages in thread
From: Stephen Rothwell @ 2018-11-06 19:48 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Linux-Next Mailing List, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 211 bytes --]

Hi Martin,

Commit

  85ee0a7b2d53 ("Revert "scsi: ufs: Disable blk-mq for now"")

is missing a Signed-off-by from its author and committer.

Reverts are commits, too.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: linux-next: Signed-off-by missing for commit in the scsi-fixes tree
       [not found]                                                   ` <+1100")>
@ 2018-11-07  1:52                                                     ` Martin K. Petersen
  2020-04-03  3:45                                                     ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Martin K. Petersen
  1 sibling, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2018-11-07  1:52 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Martin K. Petersen, Linux-Next Mailing List, Linux Kernel Mailing List


Stephen,

> Commit
>
>   85ee0a7b2d53 ("Revert "scsi: ufs: Disable blk-mq for now"")
>
> is missing a Signed-off-by from its author and committer.

Fixed up, thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
@ 2020-03-29 17:47 Chaitanya Kulkarni
       [not found] ` <(Chaitanya>
                   ` (6 more replies)
  0 siblings, 7 replies; 122+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-29 17:47 UTC (permalink / raw)
  To: hch, martin.petersen
  Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
	minwoo.im.dev, chaitanya.kulkarni, damien.lemoal, andrea.parri,
	hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
	houtao1, asml.silence, linux-block, linux-ext4

[-- Attachment #1: Type: text/plain; charset=y, Size: 12778 bytes --]

Hi,

This patch-series is based on the original RFC patch series:-
https://www.spinics.net/lists/linux-block/msg47933.html.

I've designed a rough testcase based on the information present
in the mailing list archive for the original RFC; it may need
some corrections from the author.

If anyone is interested, test results are at the end of this patch.

Following is the original cover-letter :-

Information about contiguous extent placement may be useful
for some block devices. Distributed network filesystems, for
example, which provide a block device interface, may use this
information for better block placement across the nodes in their
cluster, and thus for better performance. Block devices which map
a file on another filesystem (loop) may request an extent of the
same length on the underlying filesystem for less fragmentation
and for batching allocation requests. Hypervisors like QEMU may
also use this information to optimize cluster allocations.

This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
to be used for forwarding a user's fallocate(0) requests into
block device internals. It is rather similar to the existing
REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES, etc. The corresponding
exported primitive is called blkdev_issue_assign_range().
See [1/3] for the details.

Patch [2/3] teaches the loop driver to handle REQ_OP_ASSIGN_RANGE
requests by calling fallocate(0).

Patch [3/3] makes ext4 notify the block device about fallocate(0).

Here is a simple test I did:
https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856

I attached a file on ext4 to a loop device, then created an ext4
partition on the loop device and started the test in that
partition. Direct-io is enabled on the loop device.

The test fallocates a 4G file and writes from some offset with a
given step, then chooses another offset and repeats. After the
test, all the blocks in the file have been written.

The results show that batching extent-assignment requests improves
performance:

Before patchset: real ~ 1min 27sec
After patchset:  real ~ 1min 16sec (18% better)

An ordinary fallocate() before the writes improves performance
by batching the allocation requests. These results show that the
same holds when the extent information is forwarded to the
underlying filesystem.

Regards,
Chaitanya

Changes from RFC:-

1. Add missing plumbing for REQ_OP_ASSIGN_RANGE similar to write-zeroes.
2. Add a prep patch to create a helper to submit payloadless bios.
3. Design testcases around the description present in the
   cover-letter.

Chaitanya Kulkarni (1):
  block: create payloadless issue bio helper

Kirill Tkhai (3):
  block: Add support for REQ_OP_ASSIGN_RANGE
  loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
  ext4: Notify block device about alloc-assigned blk

 block/blk-core.c          |   5 ++
 block/blk-lib.c           | 115 +++++++++++++++++++++++++++++++-------
 block/blk-merge.c         |  21 +++++++
 block/blk-settings.c      |  19 +++++++
 block/blk-zoned.c         |   1 +
 block/bounce.c            |   1 +
 drivers/block/loop.c      |   5 ++
 fs/ext4/ext4.h            |   2 +
 fs/ext4/extents.c         |  12 +++-
 include/linux/bio.h       |   9 ++-
 include/linux/blk_types.h |   2 +
 include/linux/blkdev.h    |  34 +++++++++++
 12 files changed, 201 insertions(+), 25 deletions(-)

1. Setup :-
-----------
# git log --oneline -5 
c64a4c781915 (HEAD -> req-op-assign-range) ext4: Notify block device about alloc-assigned blk
000cbc6720a4 loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
89ceed8cac80 block: Add support for REQ_OP_ASSIGN_RANGE
a798743e87e7 block: create payloadless issue bio helper
b53df2e7442c (tag: block-5.6-2020-03-13) block: Fix partition support for host aware zoned block devices

# cat /proc/kallsyms | grep -i blkdev_issue_assign_range
ffffffffa3264a80 T blkdev_issue_assign_range
ffffffffa4027184 r __ksymtab_blkdev_issue_assign_range
ffffffffa40524be r __kstrtabns_blkdev_issue_assign_range
ffffffffa405a8eb r __kstrtab_blkdev_issue_assign_range

2. Test program, will be moved to blktest once code is upstream :-
-----------------
#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>

#define BLOCK_SIZE 4096
#define STEP (BLOCK_SIZE * 16)
#define SIZE (1024 * 1024 * 1024ULL)

int main(int argc, char *argv[])
{
	int fd, step, ret = 0;
	unsigned long i;
	void *buf;

	if (posix_memalign(&buf, BLOCK_SIZE, SIZE)) {
		perror("alloc");
		exit(1);
	}

	fd = open("/mnt/loop0/file.img", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	if (ftruncate(fd, SIZE)) {
		perror("ftruncate");
		exit(1);
	}

	ret = fallocate(fd, 0, 0, SIZE);
	if (ret) {
		perror("fallocate");
		exit(1);
	}
	
	for (step = STEP - BLOCK_SIZE; step >= 0; step -= BLOCK_SIZE) {
		printf("step=%u\n", step);
		for (i = step; i < SIZE; i += STEP) {
			errno = 0;
			if (pwrite(fd, buf, BLOCK_SIZE, i) != BLOCK_SIZE) {
				perror("pwrite");
				exit(1);
			}
		}

		if (fsync(fd)) {
			perror("fsync");
			exit(1);
		}
	}
	return 0;
}

3. Test script, will be moved to blktests once code is upstream :-
------------------------------------------------------------------
# cat req_op_assign_test.sh 
#!/bin/bash -x

NULLB_FILE="/mnt/backend/data"
NULLB_MNT="/mnt/backend"
LOOP_MNT="/mnt/loop0"

delete_loop()
{
	umount ${LOOP_MNT}
	losetup -D
	sleep 3
}

delete_nullb()
{
	umount ${NULLB_MNT}
	echo 0 > config/nullb/nullb0/power
	rmdir config/nullb/nullb0
	sleep 3
}

unload_modules()
{
	rmmod drivers/block/loop.ko
	rmmod fs/ext4/ext4.ko
	rmmod drivers/block/null_blk.ko
	lsmod | grep -e ext4 -e loop -e null_blk
}

unload()
{
	delete_loop
	delete_nullb
	unload_modules
}

load_ext4()
{
	make -j $(nproc) M=fs/ext4 modules
	local src=fs/ext4/
	local dest=/lib/modules/`uname -r`/kernel/fs/ext4
	\cp ${src}/ext4.ko ${dest}/

	modprobe mbcache
	modprobe jbd2
	sleep 1
	insmod fs/ext4/ext4.ko
	sleep 1
}

load_nullb()
{
	local src=drivers/block/
	local dest=/lib/modules/`uname -r`/kernel/drivers/block
	\cp ${src}/null_blk.ko ${dest}/

	modprobe null_blk nr_devices=0
	sleep 1

	mkdir config/nullb/nullb0
	tree config/nullb/nullb0

	echo 1 > config/nullb/nullb0/memory_backed
	echo 512 > config/nullb/nullb0/blocksize 

	# 20 GB
	echo 20480 > config/nullb/nullb0/size 
	echo 1 > config/nullb/nullb0/power
	sleep 2
	IDX=`cat config/nullb/nullb0/index`
	lsblk | grep null${IDX}
	sleep 1

	mkfs.ext4 /dev/nullb0 
	mount /dev/nullb0 ${NULLB_MNT}
	sleep 1
	mount | grep nullb

	# 10 GB
	dd if=/dev/zero of=${NULLB_FILE} count=2621440 bs=4096
}

load_loop()
{
	local src=drivers/block/
	local dest=/lib/modules/`uname -r`/kernel/drivers/block
	\cp ${src}/loop.ko ${dest}/

	insmod drivers/block/loop.ko max_loop=1
	sleep 3
	/root/util-linux/losetup --direct-io=off /dev/loop0 ${NULLB_FILE}
	sleep 3
	/root/util-linux/losetup
	ls -l /dev/loop*
	dmesg -c 
	mkfs.ext4 /dev/loop0
	mount /dev/loop0 ${LOOP_MNT}
	mount | grep loop0
}

load()
{
	make -j $(nproc) M=drivers/block modules

	load_ext4
	load_nullb
	load_loop
	sleep 1
	sync
	sync
	sync
}

unload
load
time ./test

4. Test Results :-
------------------

# ./req_op_assign_test.sh 
+ NULLB_FILE=/mnt/backend/data
+ NULLB_MNT=/mnt/backend
+ LOOP_MNT=/mnt/loop0
+ unload
+ delete_loop
+ umount /mnt/loop0
+ losetup -D
+ sleep 3
+ delete_nullb
+ umount /mnt/backend
+ echo 1
+ rmdir config/nullb/nullb0
+ sleep 3
+ unload_modules
+ rmmod drivers/block/loop.ko
+ rmmod fs/ext4/ext4.ko
+ rmmod drivers/block/null_blk.ko
+ lsmod
+ grep -e ext4 -e loop -e null_blk
+ load
++ nproc
+ make -j 32 M=drivers/block modules
  CC [M]  drivers/block/loop.o
  MODPOST 11 modules
  CC [M]  drivers/block/loop.mod.o
  LD [M]  drivers/block/loop.ko
+ load_ext4
++ nproc
+ make -j 32 M=fs/ext4 modules
  CC [M]  fs/ext4/balloc.o
  CC [M]  fs/ext4/bitmap.o
  CC [M]  fs/ext4/block_validity.o
  CC [M]  fs/ext4/dir.o
  CC [M]  fs/ext4/ext4_jbd2.o
  CC [M]  fs/ext4/extents.o
  CC [M]  fs/ext4/extents_status.o
  CC [M]  fs/ext4/file.o
  CC [M]  fs/ext4/fsmap.o
  CC [M]  fs/ext4/fsync.o
  CC [M]  fs/ext4/hash.o
  CC [M]  fs/ext4/ialloc.o
  CC [M]  fs/ext4/indirect.o
  CC [M]  fs/ext4/inline.o
  CC [M]  fs/ext4/inode.o
  CC [M]  fs/ext4/ioctl.o
  CC [M]  fs/ext4/mballoc.o
  CC [M]  fs/ext4/migrate.o
  CC [M]  fs/ext4/mmp.o
  CC [M]  fs/ext4/move_extent.o
  CC [M]  fs/ext4/namei.o
  CC [M]  fs/ext4/page-io.o
  CC [M]  fs/ext4/readpage.o
  CC [M]  fs/ext4/resize.o
  CC [M]  fs/ext4/super.o
  CC [M]  fs/ext4/symlink.o
  CC [M]  fs/ext4/sysfs.o
  CC [M]  fs/ext4/xattr.o
  CC [M]  fs/ext4/xattr_trusted.o
  CC [M]  fs/ext4/xattr_user.o
  CC [M]  fs/ext4/acl.o
  CC [M]  fs/ext4/xattr_security.o
  LD [M]  fs/ext4/ext4.o
  MODPOST 1 modules
  LD [M]  fs/ext4/ext4.ko
+ local src=fs/ext4/
++ uname -r
+ local dest=/lib/modules/5.6.0-rc3lbk+/kernel/fs/ext4
+ cp fs/ext4//ext4.ko /lib/modules/5.6.0-rc3lbk+/kernel/fs/ext4/
+ modprobe mbcache
+ modprobe jbd2
+ sleep 1
+ insmod fs/ext4/ext4.ko
+ sleep 1
+ load_nullb
+ local src=drivers/block/
++ uname -r
+ local dest=/lib/modules/5.6.0-rc3lbk+/kernel/drivers/block
+ cp drivers/block//null_blk.ko /lib/modules/5.6.0-rc3lbk+/kernel/drivers/block/
+ modprobe null_blk nr_devices=0
+ sleep 1
+ mkdir config/nullb/nullb0
+ tree config/nullb/nullb0
config/nullb/nullb0
├── badblocks
├── blocking
├── blocksize
├── cache_size
├── completion_nsec
├── discard
├── home_node
├── hw_queue_depth
├── index
├── irqmode
├── mbps
├── memory_backed
├── power
├── queue_mode
├── size
├── submit_queues
├── use_per_node_hctx
├── zoned
├── zone_nr_conv
└── zone_size

0 directories, 20 files
+ echo 1
+ echo 512
+ echo 20480
+ echo 1
+ sleep 2
++ cat config/nullb/nullb0/index
+ IDX=0
+ lsblk
+ grep null0
+ sleep 1
+ mkfs.ext4 /dev/nullb0
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
1310720 inodes, 5242880 blocks
262144 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2153775104
160 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done   

+ mount /dev/nullb0 /mnt/backend
+ sleep 1
+ mount
+ grep nullb
/dev/nullb0 on /mnt/backend type ext4 (rw,relatime,seclabel)
+ dd if=/dev/zero of=/mnt/backend/data count=2621440 bs=4096
2621440+0 records in
2621440+0 records out
10737418240 bytes (11 GB) copied, 27.4579 s, 391 MB/s
+ load_loop
+ local src=drivers/block/
++ uname -r
+ local dest=/lib/modules/5.6.0-rc3lbk+/kernel/drivers/block
+ cp drivers/block//loop.ko /lib/modules/5.6.0-rc3lbk+/kernel/drivers/block/
+ insmod drivers/block/loop.ko max_loop=1
+ sleep 3
+ /root/util-linux/losetup --direct-io=off /dev/loop0 /mnt/backend/data
+ sleep 3
+ /root/util-linux/losetup
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE         DIO LOG-SEC
/dev/loop0         0      0         0  0 /mnt/backend/data   0     512
+ ls -l /dev/loop0 /dev/loop-control
brw-rw----. 1 root disk  7,   0 Mar 29 10:28 /dev/loop0
crw-rw----. 1 root disk 10, 237 Mar 29 10:28 /dev/loop-control
+ dmesg -c
[42963.967060] null_blk: module loaded
[42968.419481] EXT4-fs (nullb0): mounted filesystem with ordered data mode. Opts: (null)
[42996.928141] loop: module loaded
+ mkfs.ext4 /dev/loop0
mke2fs 1.42.9 (28-Dec-2013)
Discarding device blocks: done                            
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
655360 inodes, 2621440 blocks
131072 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2151677952
80 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

+ mount /dev/loop0 /mnt/loop0
+ mount
+ grep loop0
/dev/loop0 on /mnt/loop0 type ext4 (rw,relatime,seclabel)
+ sleep 1
+ sync
+ sync
+ sync
+ ./test
step=61440
step=57344
step=53248
step=49152
step=45056
step=40960
step=36864
step=32768
step=28672
step=24576
step=20480
step=16384
step=12288
step=8192
step=4096
step=0

real	9m34.472s
user	0m0.062s
sys	0m5.783s

-- 
2.22.0


^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 1/4] block: create payloadless issue bio helper
  2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
       [not found] ` <(Chaitanya>
@ 2020-03-29 17:47 ` Chaitanya Kulkarni
  2020-03-29 17:47 ` [PATCH 2/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 122+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-29 17:47 UTC (permalink / raw)
  To: hch, martin.petersen
  Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
	minwoo.im.dev, chaitanya.kulkarni, damien.lemoal, andrea.parri,
	hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
	houtao1, asml.silence, linux-block, linux-ext4

This is a prep patch that adds a helper to submit a payloadless bio
with all the required arguments. This avoids code repetition in
blk-lib.c so that new payloadless ops can reuse it.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 block/blk-lib.c | 51 +++++++++++++++++++++++++++++--------------------
 1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..8e53e393703c 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -209,13 +209,40 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
 
+static void __blkdev_issue_payloadless(struct block_device *bdev,
+		unsigned op, sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+		struct bio **biop, unsigned bio_opf, unsigned int max_sectors)
+{
+	struct bio *bio = *biop;
+
+	while (nr_sects) {
+		bio = blk_next_bio(bio, 0, gfp_mask);
+		bio->bi_iter.bi_sector = sector;
+		bio_set_dev(bio, bdev);
+		bio->bi_opf = op;
+		bio->bi_opf |= bio_opf;
+
+		if (nr_sects > max_sectors) {
+			bio->bi_iter.bi_size = max_sectors << 9;
+			nr_sects -= max_sectors;
+			sector += max_sectors;
+		} else {
+			bio->bi_iter.bi_size = nr_sects << 9;
+			nr_sects = 0;
+		}
+		cond_resched();
+	}
+
+	*biop = bio;
+}
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
 {
-	struct bio *bio = *biop;
 	unsigned int max_write_zeroes_sectors;
 	struct request_queue *q = bdev_get_queue(bdev);
+	unsigned int unmap = (flags & BLKDEV_ZERO_NOUNMAP) ? REQ_NOUNMAP : 0;
 
 	if (!q)
 		return -ENXIO;
@@ -229,26 +256,8 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 	if (max_write_zeroes_sectors == 0)
 		return -EOPNOTSUPP;
 
-	while (nr_sects) {
-		bio = blk_next_bio(bio, 0, gfp_mask);
-		bio->bi_iter.bi_sector = sector;
-		bio_set_dev(bio, bdev);
-		bio->bi_opf = REQ_OP_WRITE_ZEROES;
-		if (flags & BLKDEV_ZERO_NOUNMAP)
-			bio->bi_opf |= REQ_NOUNMAP;
-
-		if (nr_sects > max_write_zeroes_sectors) {
-			bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
-			nr_sects -= max_write_zeroes_sectors;
-			sector += max_write_zeroes_sectors;
-		} else {
-			bio->bi_iter.bi_size = nr_sects << 9;
-			nr_sects = 0;
-		}
-		cond_resched();
-	}
-
-	*biop = bio;
+	__blkdev_issue_payloadless(bdev, REQ_OP_WRITE_ZEROES, sector, nr_sects,
+		       gfp_mask, biop, unmap, max_write_zeroes_sectors);
 	return 0;
 }
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH 2/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
       [not found] ` <(Chaitanya>
  2020-03-29 17:47 ` [PATCH 1/4] block: create payloadless issue bio helper Chaitanya Kulkarni
@ 2020-03-29 17:47 ` Chaitanya Kulkarni
  2020-03-29 17:47 ` [PATCH 3/4] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Chaitanya Kulkarni
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 122+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-29 17:47 UTC (permalink / raw)
  To: hch, martin.petersen
  Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
	minwoo.im.dev, chaitanya.kulkarni, damien.lemoal, andrea.parri,
	hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
	houtao1, asml.silence, linux-block, linux-ext4

From: Kirill Tkhai <ktkhai@virtuozzo.com>

This operation allows a filesystem to notify a device that a range of
sectors was allocated as a single extent, and that the device should
try its best to reflect that (keep the range as a single hunk in its
internals, or represent it as a minimal set of hunks). Put directly,
the operation forwards fallocate(0) requests to the entity on which
the device is based.

This may be useful for some distributed network filesystems that
provide a block device interface, to optimize their block placement
over the cluster nodes.

Block devices that map a file (like loop) are also users of this,
since it allows them to allocate more contiguous extents and to batch
block allocation requests. In addition, hypervisors like QEMU may use
it for better block placement.

This patch adds a new blkdev_issue_assign_range() primitive, which is
rather similar to the existing blkdev_issue_{*} API. A new queue
limit, max_assign_range_sectors, is added as well.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 block/blk-core.c          |  5 +++
 block/blk-lib.c           | 64 +++++++++++++++++++++++++++++++++++++++
 block/blk-merge.c         | 21 +++++++++++++
 block/blk-settings.c      | 19 ++++++++++++
 block/blk-zoned.c         |  1 +
 block/bounce.c            |  1 +
 include/linux/bio.h       |  9 ++++--
 include/linux/blk_types.h |  2 ++
 include/linux/blkdev.h    | 34 +++++++++++++++++++++
 9 files changed, 153 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 60dc9552ef8d..25165fa8fe46 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -137,6 +137,7 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_FINISH),
 	REQ_OP_NAME(WRITE_SAME),
 	REQ_OP_NAME(WRITE_ZEROES),
+	REQ_OP_NAME(ASSIGN_RANGE),
 	REQ_OP_NAME(SCSI_IN),
 	REQ_OP_NAME(SCSI_OUT),
 	REQ_OP_NAME(DRV_IN),
@@ -952,6 +953,10 @@ generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_zeroes_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ASSIGN_RANGE:
+		if (!q->limits.max_assign_range_sectors)
+			goto not_supported;
+		break;
 	default:
 		break;
 	}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 8e53e393703c..16dc9dbf6c79 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -414,3 +414,67 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return ret;
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+static int __blkdev_issue_assign_range(struct block_device *bdev,
+		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+		struct bio **biop)
+{
+	unsigned int max_assign_range_sectors;
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (!q)
+		return -ENXIO;
+
+	if (bdev_read_only(bdev))
+		return -EPERM;
+
+	max_assign_range_sectors = bdev_assign_range_sectors(bdev);
+
+	if (max_assign_range_sectors == 0)
+		return -EOPNOTSUPP;
+
+	__blkdev_issue_payloadless(bdev, REQ_OP_ASSIGN_RANGE, sector, nr_sects,
+			gfp_mask, biop, 0, max_assign_range_sectors);
+	return 0;
+}
+
+/**
+ * __blkdev_issue_assign_range - generate number of assign range bios
+ * @bdev:	blockdev to issue
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to write
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @biop:	pointer to anchor bio
+ *
+ * Description:
+ *  Assign a block range for batched allocation requests. Useful in stacking
+ *  block device on the top of the file system.
+ *
+ */
+int blkdev_issue_assign_range(struct block_device *bdev, sector_t sector,
+			sector_t nr_sects, gfp_t gfp_mask)
+{
+	int ret = 0;
+	sector_t bs_mask;
+	struct blk_plug plug;
+	struct bio *bio = NULL;
+
+	if (bdev_assign_range_sectors(bdev) == 0)
+		return 0;
+
+	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+	if ((sector | nr_sects) & bs_mask)
+		return -EINVAL;
+
+	blk_start_plug(&plug);
+	ret = __blkdev_issue_assign_range(bdev, sector, nr_sects,
+					  gfp_mask, &bio);
+	if (ret == 0 && bio) {
+		ret = submit_bio_wait(bio);
+		bio_put(bio);
+	}
+	blk_finish_plug(&plug);
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_assign_range);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1534ed736363..441d1620de03 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -116,6 +116,22 @@ static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_assign_range_split(struct request_queue *q,
+					      struct bio *bio,
+					      struct bio_set *bs,
+					      unsigned *nsegs)
+{
+	*nsegs = 0;
+
+	if (!q->limits.max_assign_range_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_assign_range_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_assign_range_sectors, GFP_NOIO, bs);
+}
+
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
 					    struct bio *bio,
 					    struct bio_set *bs,
@@ -308,6 +324,10 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 		split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
 				nr_segs);
 		break;
+	case REQ_OP_ASSIGN_RANGE:
+		split = blk_bio_assign_range_split(q, *bio, &q->bio_split,
+				nr_segs);
+		break;
 	case REQ_OP_WRITE_SAME:
 		split = blk_bio_write_same_split(q, *bio, &q->bio_split,
 				nr_segs);
@@ -386,6 +406,7 @@ unsigned int blk_recalc_rq_segments(struct request *rq)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ASSIGN_RANGE:
 		return 0;
 	case REQ_OP_WRITE_SAME:
 		return 1;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c8eda2e7b91e..6beee0585580 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_assign_range_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_assign_range_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -257,6 +259,21 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_assign_range_sectors - set max sectors for a single
+ *                                      assign_range
+ *
+ * @q:  the request queue for the device
+ * @max_assign_range_sectors: maximum number of sectors to assign range per
+ *                            command
+ **/
+void blk_queue_max_assign_range_sectors(struct request_queue *q,
+		unsigned int max_assign_range_sectors)
+{
+	q->limits.max_assign_range_sectors = max_assign_range_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_assign_range_sectors);
+
 /**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
@@ -506,6 +523,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_assign_range_sectors = min(t->max_assign_range_sectors,
+				    b->max_assign_range_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 05741c6f618b..14b1fbed40f6 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -41,6 +41,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
 
 	switch (req_op(rq)) {
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ASSIGN_RANGE:
 	case REQ_OP_WRITE_SAME:
 	case REQ_OP_WRITE:
 		return blk_rq_zone_is_seq(rq);
diff --git a/block/bounce.c b/block/bounce.c
index f8ed677a1bf7..0eeb20b290ec 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -257,6 +257,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ASSIGN_RANGE:
 		break;
 	case REQ_OP_WRITE_SAME:
 		bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 853d92ceee64..8617abfc6f78 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -64,7 +64,8 @@ static inline bool bio_has_data(struct bio *bio)
 	    bio->bi_iter.bi_size &&
 	    bio_op(bio) != REQ_OP_DISCARD &&
 	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
-	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
+	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+	    bio_op(bio) != REQ_OP_ASSIGN_RANGE)
 		return true;
 
 	return false;
@@ -75,7 +76,8 @@ static inline bool bio_no_advance_iter(struct bio *bio)
 	return bio_op(bio) == REQ_OP_DISCARD ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
 	       bio_op(bio) == REQ_OP_WRITE_SAME ||
-	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
+	       bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+	       bio_op(bio) == REQ_OP_ASSIGN_RANGE;
 }
 
 static inline bool bio_mergeable(struct bio *bio)
@@ -178,7 +180,7 @@ static inline unsigned bio_segments(struct bio *bio)
 	struct bvec_iter iter;
 
 	/*
-	 * We special case discard/write same/write zeroes, because they
+	 * We special case discard/write same/write zeroes/assign range, because
 	 * interpret bi_size differently:
 	 */
 
@@ -186,6 +188,7 @@ static inline unsigned bio_segments(struct bio *bio)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ASSIGN_RANGE:
 		return 0;
 	case REQ_OP_WRITE_SAME:
 		return 1;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..bef450026044 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -296,6 +296,8 @@ enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* Assign a sector range */
+	REQ_OP_ASSIGN_RANGE	= 15,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f629d40c645c..3a63c14e2cbc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_assign_range_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -747,6 +748,9 @@ static inline bool rq_mergeable(struct request *rq)
 	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
 		return false;
 
+	if (req_op(rq) == REQ_OP_ASSIGN_RANGE)
+		return false;
+
 	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
 		return false;
 	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
@@ -1000,6 +1004,10 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
 		return q->limits.max_write_zeroes_sectors;
 
+	if (unlikely(op == REQ_OP_ASSIGN_RANGE))
+		return min(q->limits.max_assign_range_sectors,
+			   UINT_MAX >> SECTOR_SHIFT);
+
 	return q->limits.max_sectors;
 }
 
@@ -1077,6 +1085,8 @@ extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
+extern void blk_queue_max_assign_range_sectors(struct request_queue *q,
+		unsigned int max_assign_range_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
@@ -1246,6 +1256,20 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
 				    gfp_mask, 0);
 }
 
+extern int blkdev_issue_assign_range(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask);
+
+static inline int sb_issue_assign_range(struct super_block *sb, sector_t block,
+		sector_t nr_blocks, gfp_t gfp_mask)
+{
+	return blkdev_issue_assign_range(sb->s_bdev,
+					 block << (sb->s_blocksize_bits -
+						   SECTOR_SHIFT),
+					 nr_blocks << (sb->s_blocksize_bits -
+						       SECTOR_SHIFT),
+					 gfp_mask);
+}
+
 extern int blk_verify_command(unsigned char *cmd, fmode_t mode);
 
 enum blk_default_limits {
@@ -1427,6 +1451,16 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 	return 0;
 }
 
+static inline unsigned int bdev_assign_range_sectors(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return q->limits.max_assign_range_sectors;
+
+	return 0;
+}
+
 static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH 3/4] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
  2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
                   ` (2 preceding siblings ...)
  2020-03-29 17:47 ` [PATCH 2/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
@ 2020-03-29 17:47 ` Chaitanya Kulkarni
  2020-03-29 17:47 ` [PATCH 4/4] ext4: Notify block device about alloc-assigned blk Chaitanya Kulkarni
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 122+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-29 17:47 UTC (permalink / raw)
  To: hch, martin.petersen
  Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
	minwoo.im.dev, chaitanya.kulkarni, damien.lemoal, andrea.parri,
	hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
	houtao1, asml.silence, linux-block, linux-ext4, Kirill Tkhai

From: Kirill Tkhai <ktkhai@virtuozzo.com>

Send a fallocate(0) request to the underlying filesystem after the
upper filesystem has sent a REQ_OP_ASSIGN_RANGE request to the block
device.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
[Use blk_queue_max_assign_range_sectors() from newly updated previous
 patch.]
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 drivers/block/loop.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 739b372a5112..0a28db66c485 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -609,6 +609,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 				FALLOC_FL_PUNCH_HOLE);
 	case REQ_OP_DISCARD:
 		return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
+	case REQ_OP_ASSIGN_RANGE:
+		return lo_fallocate(lo, rq, pos, 0);
 	case REQ_OP_WRITE:
 		if (lo->transfer)
 			return lo_write_transfer(lo, rq, pos);
@@ -876,6 +878,7 @@ static void loop_config_discard(struct loop_device *lo)
 		q->limits.discard_granularity = 0;
 		q->limits.discard_alignment = 0;
 		blk_queue_max_discard_sectors(q, 0);
+		blk_queue_max_assign_range_sectors(q, 0);
 		blk_queue_max_write_zeroes_sectors(q, 0);
 		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
 		return;
@@ -886,6 +889,7 @@ static void loop_config_discard(struct loop_device *lo)
 
 	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
 	blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
+	blk_queue_max_assign_range_sectors(q, UINT_MAX >> 9);
 	blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
 }
 
@@ -1917,6 +1921,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case REQ_OP_FLUSH:
 	case REQ_OP_DISCARD:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ASSIGN_RANGE:
 		cmd->use_aio = false;
 		break;
 	default:
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH 4/4] ext4: Notify block device about alloc-assigned blk
  2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
                   ` (3 preceding siblings ...)
  2020-03-29 17:47 ` [PATCH 3/4] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Chaitanya Kulkarni
@ 2020-03-29 17:47 ` Chaitanya Kulkarni
  2020-04-01  6:22 ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Konstantin Khlebnikov
  2020-04-02 22:41 ` Dave Chinner
  6 siblings, 0 replies; 122+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-29 17:47 UTC (permalink / raw)
  To: hch, martin.petersen
  Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
	minwoo.im.dev, chaitanya.kulkarni, damien.lemoal, andrea.parri,
	hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
	houtao1, asml.silence, linux-block, linux-ext4, Kirill Tkhai

From: Kirill Tkhai <ktkhai@virtuozzo.com>

Call sb_issue_assign_range() after an extent range has been allocated
on user request. Hopefully this helps the block device maintain its
internals in the best way, where applicable.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 fs/ext4/ext4.h    |  2 ++
 fs/ext4/extents.c | 12 +++++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 61b37a052052..0d0fa9904147 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -622,6 +622,8 @@ enum {
 	 * allows jbd2 to avoid submitting data before commit. */
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 
+#define EXT4_GET_BLOCKS_SUBMIT_ALLOC		0x0800
+
 /*
  * The bit position of these flags must not overlap with any of the
  * EXT4_GET_BLOCKS_*.  They are used by ext4_find_extent(),
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 954013d6076b..598b700c4d4c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4449,6 +4449,14 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		ar.len = allocated;
 
 got_allocated_blocks:
+	if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) &&
+			inode->i_fop->fallocate) {
+		err = sb_issue_assign_range(inode->i_sb, newblock,
+				EXT4_C2B(sbi, allocated_clusters), GFP_NOFS);
+		if (err)
+			goto free_on_err;
+	}
+
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock + offset);
 	newex.ee_len = cpu_to_le16(ar.len);
@@ -4466,6 +4474,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		err = ext4_ext_insert_extent(handle, inode, &path,
 					     &newex, flags);
 
+free_on_err:
 	if (err && free_on_err) {
 		int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
 			EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
@@ -4733,7 +4742,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 			goto out_mutex;
 	}
 
-	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
+	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
+		EXT4_GET_BLOCKS_SUBMIT_ALLOC;
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
       [not found]                                                   ` <-0700")>
  2013-06-05 19:29                                                     ` Martin K. Petersen
  2013-06-12 14:43                                                     ` Martin K. Petersen
@ 2020-04-01  2:29                                                     ` Martin K. Petersen
  2020-04-01  4:53                                                       ` Chaitanya Kulkarni
  2020-05-12 16:01                                                     ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Martin K. Petersen
  3 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2020-04-01  2:29 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: hch, martin.petersen, darrick.wong, axboe, tytso, adilger.kernel,
	ming.lei, jthumshirn, minwoo.im.dev, damien.lemoal, andrea.parri,
	hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
	houtao1, asml.silence, linux-block, linux-ext4


Chaitanya,

> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
> to be used for forwarding user's fallocate(0) requests into
> block device internals.

s/assign_range/allocate/g

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-04-01  2:29                                                     ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Martin K. Petersen
@ 2020-04-01  4:53                                                       ` Chaitanya Kulkarni
  0 siblings, 0 replies; 122+ messages in thread
From: Chaitanya Kulkarni @ 2020-04-01  4:53 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: hch, darrick.wong, axboe, tytso, adilger.kernel, ming.lei,
	jthumshirn, minwoo.im.dev, Damien Le Moal, andrea.parri, hare,
	tj, hannes, khlebnikov, Ajay Joshi, bvanassche, arnd, houtao1,
	asml.silence, linux-block, linux-ext4

On 3/31/20 7:31 PM, Martin K. Petersen wrote:
> Chaitanya,
>
>> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
>> to be used for forwarding user's fallocate(0) requests into
>> block device internals.
> s/assign_range/allocate/g
>
Okay, will send out the V2.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
                   ` (4 preceding siblings ...)
  2020-03-29 17:47 ` [PATCH 4/4] ext4: Notify block device about alloc-assigned blk Chaitanya Kulkarni
@ 2020-04-01  6:22 ` Konstantin Khlebnikov
  2020-04-02  2:29   ` Martin K. Petersen
  2020-04-02 22:41 ` Dave Chinner
  6 siblings, 1 reply; 122+ messages in thread
From: Konstantin Khlebnikov @ 2020-04-01  6:22 UTC (permalink / raw)
  To: Chaitanya Kulkarni, hch, martin.petersen
  Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, hannes,
	ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
	linux-ext4

On 29/03/2020 20.47, Chaitanya Kulkarni wrote:
> Hi,
> 
> This patch-series is based on the original RFC patch series:-
> https://www.spinics.net/lists/linux-block/msg47933.html.
> 
> I've designed a rough testcase based on the information present
> in the mailing list archive for the original RFC; it may need
> some corrections from the author.
> 
> If anyone is interested, test results are at the end of this patch.
> 
> Following is the original cover-letter :-
> 
> Information about continuous extent placement may be useful
> for some block devices. Say, distributed network filesystems,
> which provide block device interface, may use this information
> for better blocks placement over the nodes in their cluster,
> and for better performance. Block devices, which map a file
> on another filesystem (loop), may request the same length extent
> on the underlying filesystem for less fragmentation and for batching
> allocation requests. Also, hypervisors like QEMU may use this
> information for optimization of cluster allocations.
> 
> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
> to be used for forwarding user's fallocate(0) requests into
> block device internals. It is rather similar to the existing
> REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES, etc. The corresponding
> exported primitive is called blkdev_issue_assign_range().

What are the exact semantics of that?

May/must it preserve present data, may/must it discard it, or may it fill the range with random garbage?

Obviously I prefer the weakest one: may discard data, may return garbage, may do nothing.

I.e. the lower layer could reuse blocks without zeroing; for encrypted storage this is even safe.

So this works as a third type of discard, in addition to REQ_OP_DISCARD and REQ_OP_SECURE_ERASE.

> See [1/3] for the details.
> 
> Patch [2/3] teaches loop driver to handle REQ_OP_ASSIGN_RANGE
> requests by calling fallocate(0).
> 
> Patch [3/3] makes ext4 to notify a block device about fallocate(0).
> 
> Here is a simple test I did:
> https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856
> 
> I attached a file on ext4 to loop. Then, created ext4 partition
> on loop device and started the test in the partition. Direct-io
> is enabled on loop.
> 
> The test fallocates 4G file and writes from some offset with
> given step, then it chooses another offset and repeats. After
> the test all the blocks in the file become written.
> 
> The results shows that batching extents-assigning requests improves
> the performance:
> 
> Before patchset: real ~ 1min 27sec
> After patchset:  real ~ 1min 16sec (~13% better)
> 
> Ordinary fallocate() before writes improves performance by
> batching the requests. These results show that the same holds
> when forwarding extent information to the underlying filesystem.
> 
> Regards,
> Chaitanya
> 
> Changes from RFC:-
> 
> 1. Add missing plumbing for REQ_OP_ASSIGN_RANGE similar to write-zeores.
> 2. Add a prep patch to create a helper to submit payloadless bios.
> 3. Design a testcases around the description present in the
>     cover-letter.
> 
> Chaitanya Kulkarni (1):
>    block: create payloadless issue bio helper
> 
> Kirill Tkhai (3):
>    block: Add support for REQ_OP_ASSIGN_RANGE
>    loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
>    ext4: Notify block device about alloc-assigned blk
> 
>   block/blk-core.c          |   5 ++
>   block/blk-lib.c           | 115 +++++++++++++++++++++++++++++++-------
>   block/blk-merge.c         |  21 +++++++
>   block/blk-settings.c      |  19 +++++++
>   block/blk-zoned.c         |   1 +
>   block/bounce.c            |   1 +
>   drivers/block/loop.c      |   5 ++
>   fs/ext4/ext4.h            |   2 +
>   fs/ext4/extents.c         |  12 +++-
>   include/linux/bio.h       |   9 ++-
>   include/linux/blk_types.h |   2 +
>   include/linux/blkdev.h    |  34 +++++++++++
>   12 files changed, 201 insertions(+), 25 deletions(-)
> 
> 1. Setup :-
> -----------
> # git log --oneline -5
> c64a4c781915 (HEAD -> req-op-assign-range) ext4: Notify block device about alloc-assigned blk
> 000cbc6720a4 loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
> 89ceed8cac80 block: Add support for REQ_OP_ASSIGN_RANGE
> a798743e87e7 block: create payloadless issue bio helper
> b53df2e7442c (tag: block-5.6-2020-03-13) block: Fix partition support for host aware zoned block devices
> 
> # cat /proc/kallsyms | grep -i blkdev_issue_assign_range
> ffffffffa3264a80 T blkdev_issue_assign_range
> ffffffffa4027184 r __ksymtab_blkdev_issue_assign_range
> ffffffffa40524be r __kstrtabns_blkdev_issue_assign_range
> ffffffffa405a8eb r __kstrtab_blkdev_issue_assign_range
> 
> 2. Test program, will be moved to blktest once code is upstream :-
> -----------------
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <errno.h>
> 
> #define BLOCK_SIZE 4096
> #define STEP (BLOCK_SIZE * 16)
> #define SIZE (1024 * 1024 * 1024ULL)
> 
> int main(int argc, char *argv[])
> {
> 	int fd, step, ret = 0;
> 	unsigned long i;
> 	void *buf;
> 
> 	if (posix_memalign(&buf, BLOCK_SIZE, SIZE)) {
> 		perror("alloc");
> 		exit(1);
> 	}
> 
> 	fd = open("/mnt/loop0/file.img", O_RDWR | O_CREAT | O_DIRECT, 0644);
> 	if (fd < 0) {
> 		perror("open");
> 		exit(1);
> 	}
> 
> 	if (ftruncate(fd, SIZE)) {
> 		perror("ftruncate");
> 		exit(1);
> 	}
> 
> 	ret = fallocate(fd, 0, 0, SIZE);
> 	if (ret) {
> 		perror("fallocate");
> 		exit(1);
> 	}
> 	
> 	for (step = STEP - BLOCK_SIZE; step >= 0; step -= BLOCK_SIZE) {
> 		printf("step=%u\n", step);
> 		for (i = step; i < SIZE; i += STEP) {
> 			errno = 0;
> 			if (pwrite(fd, buf, BLOCK_SIZE, i) != BLOCK_SIZE) {
> 				perror("pwrite");
> 				exit(1);
> 			}
> 		}
> 
> 		if (fsync(fd)) {
> 			perror("fsync");
> 			exit(1);
> 		}
> 	}
> 	return 0;
> }
> 
> 3. Test script, will be moved to blktests once code is upstream :-
> ------------------------------------------------------------------
> # cat req_op_assign_test.sh
> #!/bin/bash -x
> 
> NULLB_FILE="/mnt/backend/data"
> NULLB_MNT="/mnt/backend"
> LOOP_MNT="/mnt/loop0"
> 
> delete_loop()
> {
> 	umount ${LOOP_MNT}
> 	losetup -D
> 	sleep 3
> }
> 
> delete_nullb()
> {
> 	umount ${NULLB_MNT}
> 	echo 1 > config/nullb/nullb0/power
> 	rmdir config/nullb/nullb0
> 	sleep 3
> }
> 
> unload_modules()
> {
> 	rmmod drivers/block/loop.ko
> 	rmmod fs/ext4/ext4.ko
> 	rmmod drivers/block/null_blk.ko
> 	lsmod | grep -e ext4 -e loop -e null_blk
> }
> 
> unload()
> {
> 	delete_loop
> 	delete_nullb
> 	unload_modules
> }
> 
> load_ext4()
> {
> 	make -j $(nproc) M=fs/ext4 modules
> 	local src=fs/ext4/
> 	local dest=/lib/modules/`uname -r`/kernel/fs/ext4
> 	\cp ${src}/ext4.ko ${dest}/
> 
> 	modprobe mbcache
> 	modprobe jbd2
> 	sleep 1
> 	insmod fs/ext4/ext4.ko
> 	sleep 1
> }
> 
> load_nullb()
> {
> 	local src=drivers/block/
> 	local dest=/lib/modules/`uname -r`/kernel/drivers/block
> 	\cp ${src}/null_blk.ko ${dest}/
> 
> 	modprobe null_blk nr_devices=0
> 	sleep 1
> 
> 	mkdir config/nullb/nullb0
> 	tree config/nullb/nullb0
> 
> 	echo 1 > config/nullb/nullb0/memory_backed
> 	echo 512 > config/nullb/nullb0/blocksize
> 
> 	# 20 GB
> 	echo 20480 > config/nullb/nullb0/size
> 	echo 1 > config/nullb/nullb0/power
> 	sleep 2
> 	IDX=`cat config/nullb/nullb0/index`
> 	lsblk | grep null${IDX}
> 	sleep 1
> 
> 	mkfs.ext4 /dev/nullb0
> 	mount /dev/nullb0 ${NULLB_MNT}
> 	sleep 1
> 	mount | grep nullb
> 
> 	# 10 GB
> 	dd if=/dev/zero of=${NULLB_FILE} count=2621440 bs=4096
> }
> 
> load_loop()
> {
> 	local src=drivers/block/
> 	local dest=/lib/modules/`uname -r`/kernel/drivers/block
> 	\cp ${src}/loop.ko ${dest}/
> 
> 	insmod drivers/block/loop.ko max_loop=1
> 	sleep 3
> 	/root/util-linux/losetup --direct-io=off /dev/loop0 ${NULLB_FILE}
> 	sleep 3
> 	/root/util-linux/losetup
> 	ls -l /dev/loop*
> 	dmesg -c
> 	mkfs.ext4 /dev/loop0
> 	mount /dev/loop0 ${LOOP_MNT}
> 	mount | grep loop0
> }
> 
> load()
> {
> 	make -j $(nproc) M=drivers/block modules
> 
> 	load_ext4
> 	load_nullb
> 	load_loop
> 	sleep 1
> 	sync
> 	sync
> 	sync
> }
> 
> unload
> load
> time ./test
> 
> 4. Test Results :-
> ------------------
> 
> # ./req_op_assign_test.sh
> + NULLB_FILE=/mnt/backend/data
> + NULLB_MNT=/mnt/backend
> + LOOP_MNT=/mnt/loop0
> + unload
> + delete_loop
> + umount /mnt/loop0
> + losetup -D
> + sleep 3
> + delete_nullb
> + umount /mnt/backend
> + echo 1
> + rmdir config/nullb/nullb0
> + sleep 3
> + unload_modules
> + rmmod drivers/block/loop.ko
> + rmmod fs/ext4/ext4.ko
> + rmmod drivers/block/null_blk.ko
> + lsmod
> + grep -e ext4 -e loop -e null_blk
> + load
> ++ nproc
> + make -j 32 M=drivers/block modules
>    CC [M]  drivers/block/loop.o
>    MODPOST 11 modules
>    CC [M]  drivers/block/loop.mod.o
>    LD [M]  drivers/block/loop.ko
> + load_ext4
> ++ nproc
> + make -j 32 M=fs/ext4 modules
>    CC [M]  fs/ext4/balloc.o
>    CC [M]  fs/ext4/bitmap.o
>    CC [M]  fs/ext4/block_validity.o
>    CC [M]  fs/ext4/dir.o
>    CC [M]  fs/ext4/ext4_jbd2.o
>    CC [M]  fs/ext4/extents.o
>    CC [M]  fs/ext4/extents_status.o
>    CC [M]  fs/ext4/file.o
>    CC [M]  fs/ext4/fsmap.o
>    CC [M]  fs/ext4/fsync.o
>    CC [M]  fs/ext4/hash.o
>    CC [M]  fs/ext4/ialloc.o
>    CC [M]  fs/ext4/indirect.o
>    CC [M]  fs/ext4/inline.o
>    CC [M]  fs/ext4/inode.o
>    CC [M]  fs/ext4/ioctl.o
>    CC [M]  fs/ext4/mballoc.o
>    CC [M]  fs/ext4/migrate.o
>    CC [M]  fs/ext4/mmp.o
>    CC [M]  fs/ext4/move_extent.o
>    CC [M]  fs/ext4/namei.o
>    CC [M]  fs/ext4/page-io.o
>    CC [M]  fs/ext4/readpage.o
>    CC [M]  fs/ext4/resize.o
>    CC [M]  fs/ext4/super.o
>    CC [M]  fs/ext4/symlink.o
>    CC [M]  fs/ext4/sysfs.o
>    CC [M]  fs/ext4/xattr.o
>    CC [M]  fs/ext4/xattr_trusted.o
>    CC [M]  fs/ext4/xattr_user.o
>    CC [M]  fs/ext4/acl.o
>    CC [M]  fs/ext4/xattr_security.o
>    LD [M]  fs/ext4/ext4.o
>    MODPOST 1 modules
>    LD [M]  fs/ext4/ext4.ko
> + local src=fs/ext4/
> ++ uname -r
> + local dest=/lib/modules/5.6.0-rc3lbk+/kernel/fs/ext4
> + cp fs/ext4//ext4.ko /lib/modules/5.6.0-rc3lbk+/kernel/fs/ext4/
> + modprobe mbcache
> + modprobe jbd2
> + sleep 1
> + insmod fs/ext4/ext4.ko
> + sleep 1
> + load_nullb
> + local src=drivers/block/
> ++ uname -r
> + local dest=/lib/modules/5.6.0-rc3lbk+/kernel/drivers/block
> + cp drivers/block//null_blk.ko /lib/modules/5.6.0-rc3lbk+/kernel/drivers/block/
> + modprobe null_blk nr_devices=0
> + sleep 1
> + mkdir config/nullb/nullb0
> + tree config/nullb/nullb0
> config/nullb/nullb0
> ├── badblocks
> ├── blocking
> ├── blocksize
> ├── cache_size
> ├── completion_nsec
> ├── discard
> ├── home_node
> ├── hw_queue_depth
> ├── index
> ├── irqmode
> ├── mbps
> ├── memory_backed
> ├── power
> ├── queue_mode
> ├── size
> ├── submit_queues
> ├── use_per_node_hctx
> ├── zoned
> ├── zone_nr_conv
> └── zone_size
> 
> 0 directories, 20 files
> + echo 1
> + echo 512
> + echo 20480
> + echo 1
> + sleep 2
> ++ cat config/nullb/nullb0/index
> + IDX=0
> + lsblk
> + grep null0
> + sleep 1
> + mkfs.ext4 /dev/nullb0
> mke2fs 1.42.9 (28-Dec-2013)
> Filesystem label=
> OS type: Linux
> Block size=4096 (log=2)
> Fragment size=4096 (log=2)
> Stride=0 blocks, Stripe width=0 blocks
> 1310720 inodes, 5242880 blocks
> 262144 blocks (5.00%) reserved for the super user
> First data block=0
> Maximum filesystem blocks=2153775104
> 160 block groups
> 32768 blocks per group, 32768 fragments per group
> 8192 inodes per group
> Superblock backups stored on blocks:
> 	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 	4096000
> 
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (32768 blocks): done
> Writing superblocks and filesystem accounting information: done
> 
> + mount /dev/nullb0 /mnt/backend
> + sleep 1
> + mount
> + grep nullb
> /dev/nullb0 on /mnt/backend type ext4 (rw,relatime,seclabel)
> + dd if=/dev/zero of=/mnt/backend/data count=2621440 bs=4096
> 2621440+0 records in
> 2621440+0 records out
> 10737418240 bytes (11 GB) copied, 27.4579 s, 391 MB/s
> + load_loop
> + local src=drivers/block/
> ++ uname -r
> + local dest=/lib/modules/5.6.0-rc3lbk+/kernel/drivers/block
> + cp drivers/block//loop.ko /lib/modules/5.6.0-rc3lbk+/kernel/drivers/block/
> + insmod drivers/block/loop.ko max_loop=1
> + sleep 3
> + /root/util-linux/losetup --direct-io=off /dev/loop0 /mnt/backend/data
> + sleep 3
> + /root/util-linux/losetup
> NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE         DIO LOG-SEC
> /dev/loop0         0      0         0  0 /mnt/backend/data   0     512
> + ls -l /dev/loop0 /dev/loop-control
> brw-rw----. 1 root disk  7,   0 Mar 29 10:28 /dev/loop0
> crw-rw----. 1 root disk 10, 237 Mar 29 10:28 /dev/loop-control
> + dmesg -c
> [42963.967060] null_blk: module loaded
> [42968.419481] EXT4-fs (nullb0): mounted filesystem with ordered data mode. Opts: (null)
> [42996.928141] loop: module loaded
> + mkfs.ext4 /dev/loop0
> mke2fs 1.42.9 (28-Dec-2013)
> Discarding device blocks: done
> Filesystem label=
> OS type: Linux
> Block size=4096 (log=2)
> Fragment size=4096 (log=2)
> Stride=0 blocks, Stripe width=0 blocks
> 655360 inodes, 2621440 blocks
> 131072 blocks (5.00%) reserved for the super user
> First data block=0
> Maximum filesystem blocks=2151677952
> 80 block groups
> 32768 blocks per group, 32768 fragments per group
> 8192 inodes per group
> Superblock backups stored on blocks:
> 	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
> 
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (32768 blocks): done
> Writing superblocks and filesystem accounting information: done
> 
> + mount /dev/loop0 /mnt/loop0
> + mount
> + grep loop0
> /dev/loop0 on /mnt/loop0 type ext4 (rw,relatime,seclabel)
> + sleep 1
> + sync
> + sync
> + sync
> + ./test
> step=61440
> step=57344
> step=53248
> step=49152
> step=45056
> step=40960
> step=36864
> step=32768
> step=28672
> step=24576
> step=20480
> step=16384
> step=12288
> step=8192
> step=4096
> step=0
> 
> real	9m34.472s
> user	0m0.062s
> sys	0m5.783s
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-04-01  6:22 ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Konstantin Khlebnikov
@ 2020-04-02  2:29   ` Martin K. Petersen
  2020-04-02  9:49     ` Konstantin Khlebnikov
  0 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2020-04-02  2:29 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Chaitanya Kulkarni, hch, martin.petersen, darrick.wong, axboe,
	tytso, adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, hannes, ajay.joshi,
	bvanassche, arnd, houtao1, asml.silence, linux-block, linux-ext4


Konstantin,

>> The corresponding exported primitive is called
>> blkdev_issue_assign_range().
>
> What exact semantics of that?

REQ_OP_ALLOCATE will be used to compel a device to allocate a block
range. What a given block contains after successful allocation is
undefined (depends on the device implementation).

For block allocation with deterministic zeroing, one must keep using
REQ_OP_WRITE_ZEROES with the NOUNMAP flag set.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-04-02  2:29   ` Martin K. Petersen
@ 2020-04-02  9:49     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 122+ messages in thread
From: Konstantin Khlebnikov @ 2020-04-02  9:49 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Chaitanya Kulkarni, hch, darrick.wong, axboe, tytso,
	adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, hannes, ajay.joshi,
	bvanassche, arnd, houtao1, asml.silence, linux-block, linux-ext4

On 02/04/2020 05.29, Martin K. Petersen wrote:
> 
> Konstantin,
> 
>>> The corresponding exported primitive is called
>>> blkdev_issue_assign_range().
>>
>> What exact semantics of that?
> 
> REQ_OP_ALLOCATE will be used to compel a device to allocate a block
> range. What a given block contains after successful allocation is
> undefined (depends on the device implementation).

Ok. Then REQ_OP_ALLOCATE should be accounted as a discard rather than a write.
That's decided by the helper op_is_discard(), which is used only for statistics.
It seems REQ_OP_SECURE_ERASE should also be accounted this way.

> 
> For block allocation with deterministic zeroing, one must keep using
> REQ_OP_WRITE_ZEROES with the NOUNMAP flag set.
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
                   ` (5 preceding siblings ...)
  2020-04-01  6:22 ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Konstantin Khlebnikov
@ 2020-04-02 22:41 ` Dave Chinner
  2020-04-03  1:34   ` Martin K. Petersen
  6 siblings, 1 reply; 122+ messages in thread
From: Dave Chinner @ 2020-04-02 22:41 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: hch, martin.petersen, darrick.wong, axboe, tytso, adilger.kernel,
	ming.lei, jthumshirn, minwoo.im.dev, damien.lemoal, andrea.parri,
	hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
	houtao1, asml.silence, linux-block, linux-ext4

On Sun, Mar 29, 2020 at 10:47:10AM -0700, Chaitanya Kulkarni wrote:
> Hi,
> 
> This patch-series is based on the original RFC patch series:-
> https://www.spinics.net/lists/linux-block/msg47933.html.
> 
> I've designed a rough testcase based on the information present
> in the mailing list archive for the original RFC; it may need
> some corrections from the author.
> 
> If anyone is interested, test results are at the end of this patch.
> 
> Following is the original cover-letter :-
> 
> Information about continuous extent placement may be useful
> for some block devices. Say, distributed network filesystems,
> which provide block device interface, may use this information
> for better blocks placement over the nodes in their cluster,
> and for better performance. Block devices, which map a file
> on another filesystem (loop), may request the same length extent
> on the underlying filesystem for less fragmentation and for batching
> allocation requests. Also, hypervisors like QEMU may use this
> information for optimization of cluster allocations.
> 
> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
> to be used for forwarding user's fallocate(0) requests into
> block device internals. It is rather similar to the existing
> REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES, etc. The corresponding
> exported primitive is called blkdev_issue_assign_range().
> See [1/3] for the details.
> 
> Patch [2/3] teaches loop driver to handle REQ_OP_ASSIGN_RANGE
> requests by calling fallocate(0).
> 
> Patch [3/3] makes ext4 to notify a block device about fallocate(0).

Ok, so ext4 has a very limited max allocation size for an extent, so
I expect this won't cause huge latency problems. However, what
happens when we use XFS, have a 64kB block size, and fallocate() is
allocating disk space in contiguous 100GB extents and passing those
down to the block device?

How does this get split by dm devices? Are raid stripes going to
dice this into separate stripe unit sized bios, so instead of single
large requests we end up with hundreds or thousands of tiny
allocation requests being issued?

I know that for the loop device, it is going to serialise all IO to
the backing file while fallocate is run on it. Hence if you have
concurrent IO running, any REQ_OP_ASSIGN_RANGE is going to cause a
significant, measurable latency hit to all those IOs in flight.

How are we expecting hardware to behave here? Is this a queued
command in the scsi/nvme/sata protocols? Or is this, for the moment,
just a special snowflake that we can't actually use in production
because the hardware just can't handle what we throw at it?

IOWs, what sort of latency issues is this operation going to cause
on real hardware? Is this going to be like discard? i.e. where we
end up not using it at all because so few devices actually handle
the massive stream of operations the filesystem will end up sending
the device(s) in the course of normal operations?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-04-02 22:41 ` Dave Chinner
@ 2020-04-03  1:34   ` Martin K. Petersen
  2020-04-03  2:57     ` Dave Chinner
  0 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2020-04-03  1:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chaitanya Kulkarni, hch, martin.petersen, darrick.wong, axboe,
	tytso, adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, hannes, khlebnikov,
	ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
	linux-ext4


Hi Dave!

> Ok, so ext4 has a very limited max allocation size for an extent, so
> I expect this won't cause huge latency problems. However, what
> happens when we use XFS, have a 64kB block size, and fallocate() is
> allocating disk space in contiguous 100GB extents and passing those
> down to the block device?

Depends on the device.

> How does this get split by dm devices? Are raid stripes going to dice
> this into separate stripe unit sized bios, so instead of single large
> requests we end up with hundreds or thousands of tiny allocation
> requests being issued?

There is nothing special about this operation. It needs to be handled
the same way as all other splits. I.e. ideally coalesced at the bottom
of the stack so we can issue larger, contiguous commands to the
hardware.

> How are we expecting hardware to behave here? Is this a queued
> command in the scsi/nvme/sata protocols? Or is this, for the moment,
> just a special snowflake that we can't actually use in production
> because the hardware just can't handle what we throw at it?

For now it's SCSI and queued. Only found in high-end thinly provisioned
storage arrays and not in your average SSD.

The performance expectation for REQ_OP_ALLOCATE is that it is faster
than a write to the same block range since the device potentially needs
to do less work. I.e. the device simply needs to decrement the free
space and mark the LBAs reserved in a map. It doesn't need to write all
the blocks to zero them. If you want zeroed blocks, use
REQ_OP_WRITE_ZEROES.

> IOWs, what sort of latency issues is this operation going to cause
> on real hardware? Is this going to be like discard? i.e. where we
> end up not using it at all because so few devices actually handle
> the massive stream of operations the filesystem will end up sending
> the device(s) in the course of normal operations?

The intended use case, from a SCSI perspective, is that on a thinly
provisioned device you can use this operation to preallocate blocks so
that future writes to the LBAs in question will not fail due to the
device being out of space. I.e. you would use this to pin down block
ranges where you can not tolerate write failures. The advantage over
writing the blocks individually is that dedup won't apply and that the
device doesn't actually have to go write all the individual blocks.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-04-03  1:34   ` Martin K. Petersen
@ 2020-04-03  2:57     ` Dave Chinner
       [not found]       ` <(Dave>
  0 siblings, 1 reply; 122+ messages in thread
From: Dave Chinner @ 2020-04-03  2:57 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Chaitanya Kulkarni, hch, darrick.wong, axboe, tytso,
	adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, hannes, khlebnikov,
	ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
	linux-ext4

On Thu, Apr 02, 2020 at 09:34:43PM -0400, Martin K. Petersen wrote:
> 
> Hi Dave!
> 
> > Ok, so ext4 has a very limited max allocation size for an extent, so
> > I expect this won't cause huge latency problems. However, what
> > happens when we use XFS, have a 64kB block size, and fallocate() is
> > allocating disk space in contiguous 100GB extents and passing those
> > down to the block device?
> 
> Depends on the device.

Great. :(

> > How does this get split by dm devices? Are raid stripes going to dice
> > this into separate stripe unit sized bios, so instead of single large
> > requests we end up with hundreds or thousands of tiny allocation
> > requests being issued?
> 
> There is nothing special about this operation. It needs to be handled
> the same way as all other splits. I.e. ideally coalesced at the bottom
> of the stack so we can issue larger, contiguous commands to the
> hardware.
> 
> > How are we expecting hardware to behave here? Is this a queued
> > command in the scsi/nvme/sata protocols? Or is this, for the moment,
> > just a special snowflake that we can't actually use in production
> > because the hardware just can't handle what we throw at it?
> 
> For now it's SCSI and queued. Only found in high-end thinly provisioned
> storage arrays and not in your average SSD.

So it's a special snowflake :)

> The performance expectation for REQ_OP_ALLOCATE is that it is faster
> than a write to the same block range since the device potentially needs
> to do less work. I.e. the device simply needs to decrement the free
> space and mark the LBAs reserved in a map. It doesn't need to write all
> the blocks to zero them. If you want zeroed blocks, use
> REQ_OP_WRITE_ZEROES.

I suspect that the implications of wiring filesystems directly up to
this haven't been thought through entirely....

> > IOWs, what sort of latency issues is this operation going to cause
> > on real hardware? Is this going to be like discard? i.e. where we
> > end up not using it at all because so few devices actually handle
> > the massive stream of operations the filesystem will end up sending
> > the device(s) in the course of normal operations?
> 
> The intended use case, from a SCSI perspective, is that on a thinly
> provisioned device you can use this operation to preallocate blocks so
> that future writes to the LBAs in question will not fail due to the
> device being out of space. I.e. you would use this to pin down block
> ranges where you can not tolerate write failures. The advantage over
> writing the blocks individually is that dedup won't apply and that the
> device doesn't actually have to go write all the individual blocks.

.... because when backed by thinp storage, plumbing user level
fallocate() straight through from the filesystem introduces a
trivial, user level storage DOS vector....

i.e. a user can just fallocate a bunch of files and, because the
filesystem can do that instantly, can also run the back end array
out of space almost instantly. Storage admins are going to love
this!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
       [not found]                                                   ` <+1100")>
  2018-11-07  1:52                                                     ` linux-next: Signed-off-by missing for commit in the scsi-fixes tree Martin K. Petersen
@ 2020-04-03  3:45                                                     ` Martin K. Petersen
  2020-04-07  2:27                                                       ` Dave Chinner
  1 sibling, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2020-04-03  3:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, Chaitanya Kulkarni, hch, darrick.wong, axboe,
	tytso, adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, hannes, khlebnikov,
	ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
	linux-ext4


Dave,

> .... because when backed by thinp storage, plumbing user level
> fallocate() straight through from the filesystem introduces a
> trivial, user level storage DOS vector....
>
> i.e. a user can just fallocate a bunch of files and, because the
> filesystem can do that instantly, can also run the back end array
> out of space almost instantly. Storage admins are going to love
> this!

In the standards space, the allocation concept was mainly aimed at
protecting filesystem internals against out-of-space conditions on
devices that dedup identical blocks and where simply zeroing the blocks
therefore is ineffective.

So far we have mainly been talking about fallocate on block devices. How
XFS decides to enforce space allocation policy and potentially leverage
this plumbing is entirely up to you.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-04-03  3:45                                                     ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Martin K. Petersen
@ 2020-04-07  2:27                                                       ` Dave Chinner
  2020-04-08  4:10                                                         ` Martin K. Petersen
  0 siblings, 1 reply; 122+ messages in thread
From: Dave Chinner @ 2020-04-07  2:27 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Chaitanya Kulkarni, hch, darrick.wong, axboe, tytso,
	adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, hannes, khlebnikov,
	ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
	linux-ext4

On Thu, Apr 02, 2020 at 08:45:30PM -0700, Martin K. Petersen wrote:
> 
> Dave,
> 
> > .... because when backed by thinp storage, plumbing user level
> > fallocate() straight through from the filesystem introduces a
> > trivial, user level storage DOS vector....
> >
> > i.e. a user can just fallocate a bunch of files and, because the
> > filesystem can do that instantly, can also run the back end array
> > out of space almost instantly. Storage admins are going to love
> > this!
> 
> In the standards space, the allocation concept was mainly aimed at
> protecting filesystem internals against out-of-space conditions on
> devices that dedup identical blocks and where simply zeroing the blocks
> therefore is ineffective.

Um, so we're supposed to use space allocation before overwriting
existing metadata in the filesystem? So that the underlying storage
can reserve space for it before we write it? Which would mean we
have to issue a space allocation before we dirty the metadata, which
means before we dirty any metadata in a transaction. Which means
we'll basically have to redesign the filesystems from the ground up,
yes?

> So far we have mainly been talking about fallocate on block devices.

You might be talking about filesystem metadata and block devices,
but this patchset ends up connecting ext4's user data fallocate() to
the block device, thereby allowing users to reserve space directly
in the underlying block device and directly exposing this issue to
userspace.

I can only go on what is presented to me in patches - this patchset
has nothing to do with filesystem metadata or with preventing ENOSPC
issues for internal filesystem updates.

XFS is no different to ext4 or btrfs here - the filesystem doesn't
matter because all of them can fallocate() terabytes of space in
a second or two these days....

> How XFS decides to enforce space allocation policy and potentially
> leverage this plumbing is entirely up to you.

Do I understand this correctly? i.e. that it is the filesystem's
responsibility to prevent users from preallocating more space than
exists in an underlying storage pool that has been intentionally
hidden from the filesystem so it can be underprovisioned?

IOWs, I'm struggling to understand exactly how the "standards space"
think filesystems are supposed to be using this feature whilst also
preventing unprivileged exhaustion of an underprovisioned storage
pool they know nothing about.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-04-07  2:27                                                       ` Dave Chinner
@ 2020-04-08  4:10                                                         ` Martin K. Petersen
  2020-04-19 22:36                                                           ` Dave Chinner
  0 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2020-04-08  4:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, Chaitanya Kulkarni, hch, darrick.wong, axboe,
	tytso, adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, hannes, khlebnikov,
	ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
	linux-ext4


Hi Dave!

>> In the standards space, the allocation concept was mainly aimed at
>> protecting filesystem internals against out-of-space conditions on
>> devices that dedup identical blocks and where simply zeroing the blocks
>> therefore is ineffective.

> Um, so we're supposed to use space allocation before overwriting
> existing metadata in the filesystem?

Not before overwriting, no. Once you have allocated an LBA it remains
allocated until you discard it.

> So that the underlying storage can reserve space for it before we
> write it? Which would mean we have to issue a space allocation before
> we dirty the metadata, which means before we dirty any metadata in a
> transaction. Which means we'll basically have to redesign the
> filesystems from the ground up, yes?

My understanding is that this facility was aimed at filesystems that do
not dynamically allocate metadata. The intent was that mkfs would
preallocate the metadata LBA ranges, not the filesystem. For filesystems
that allocate metadata dynamically, then yes, an additional step is
required if you want to pin the LBAs.

> You might be talking about filesystem metadata and block devices,
> but this patchset ends up connecting ext4's user data fallocate() to
> the block device, thereby allowing users to reserve space directly
> in the underlying block device and directly exposing this issue to
> userspace.

I missed that Chaitanya's repost of this series included the ext4 patch.
Sorry!

>> How XFS decides to enforce space allocation policy and potentially
>> leverage this plumbing is entirely up to you.
>
> Do I understand this correctly? i.e. that it is the filesystem's
> responsibility to prevent users from preallocating more space than
> exists in an underlying storage pool that has been intentionally
> hidden from the filesystem so it can be underprovisioned?

No. But as an administrative policy it is useful to prevent runaway
applications from writing a petabyte of random garbage to media. My
point was that it is up to you and the other filesystem developers to
decide how you want to leverage the low-level allocation capability and
how you want to provide it to processes. And whether CAP_SYS_ADMIN,
ulimit, or something else is the appropriate policy interface for this.

In terms of thin provisioning and space management there are various
thresholds that may be reported by the device. In past discussions there
hasn't been much interest in getting these exposed. It is also unclear
to me whether it is actually beneficial to send low space warnings to
hundreds or thousands of hosts attached to an array. In many cases the
individual server admins are not even the right audience. The most
common notification mechanism is a message to the storage array admin
saying "click here to buy more disk".

If you feel there is merit in having the kernel emit the threshold
warnings you could use as a feedback mechanism, I can absolutely look
into that.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-04-08  4:10                                                         ` Martin K. Petersen
@ 2020-04-19 22:36                                                           ` Dave Chinner
  2020-04-23  0:40                                                             ` Martin K. Petersen
  0 siblings, 1 reply; 122+ messages in thread
From: Dave Chinner @ 2020-04-19 22:36 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Chaitanya Kulkarni, hch, darrick.wong, axboe, tytso,
	adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, hannes, khlebnikov,
	ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
	linux-ext4

On Wed, Apr 08, 2020 at 12:10:12AM -0400, Martin K. Petersen wrote:
> 
> Hi Dave!
> 
> >> In the standards space, the allocation concept was mainly aimed at
> >> protecting filesystem internals against out-of-space conditions on
> >> devices that dedup identical blocks and where simply zeroing the blocks
> >> therefore is ineffective.
> 
> > Um, so we're supposed to use space allocation before overwriting
> > existing metadata in the filesystem?
> 
> Not before overwriting, no. Once you have allocated an LBA it remains
> allocated until you discard it.

That is not a consistent argument. If the data has been deduped and
we overwrite, the storage array has to allocate new physical space
for an overwrite to an existing LBA. i.e. deduped data has multiple
LBAs pointing to the same physical storage. Any overwrite of an LBA
that maps to multiply referenced physical storage requires the
storage array to allocate new physical space for that overwrite.

i.e. allocation is not determined by whether the LBA has been
written to, "pinned" or not - it's whether the act of writing to
that LBA requires the storage to allocate new space to allow the
write to proceed.

That's my point here - one particular shared data overwrite case is
being special cased by preallocation (avoiding dedupe of zero filled
data) to prevent ENOSPC, ignoring all the other cases where we
overwrite shared non-zero data and will also require new physical
space for the new data. In all those cases, the storage has to take
the same action - allocation on overwrite - and so all of them are
susceptible to ENOSPC.
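(A toy model of that point, not any real array's logic: once several LBAs reference one deduped physical block, overwriting any of them forces a copy-on-write allocation, exactly like overwriting "pinned" preallocated-but-deduped space would. All names here are made up for illustration.)

```c
#include <assert.h>
#include <stdbool.h>

enum { NR_LBA = 4, NR_PHYS = 8 };

struct toy_store {
	int map[NR_LBA];     /* LBA -> physical block index */
	int refcnt[NR_PHYS]; /* references per physical block */
	int next_free;       /* next unallocated physical block */
};

/* Returns true if the overwrite had to allocate new physical space. */
static bool overwrite_lba(struct toy_store *s, int lba)
{
	int phys = s->map[lba];

	if (s->refcnt[phys] > 1) {             /* shared: must CoW */
		s->refcnt[phys]--;
		s->map[lba] = s->next_free++;  /* may hit ENOSPC for real */
		s->refcnt[s->map[lba]] = 1;
		return true;
	}
	return false;                          /* exclusive: rewrite in place */
}
```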

> > So that the underlying storage can reserve space for it before we
> > write it? Which would mean we have to issue a space allocation before
> > we dirty the metadata, which means before we dirty any metadata in a
> > transaction. Which means we'll basically have to redesign the
> > filesystems from the ground up, yes?
> 
> My understanding is that this facility was aimed at filesystems that do
> not dynamically allocate metadata. The intent was that mkfs would
> preallocate the metadata LBA ranges, not the filesystem. For filesystems
> that allocate metadata dynamically, then yes, an additional step is
> required if you want to pin the LBAs.

Ok, so you are confirming what I thought: it's almost completely
useless to us.

i.e. this requires issuing IO to "reserve" space whilst preserving
data before every metadata object goes from clean to dirty in
memory.  But the problem with that is we don't know how much
metadata we are going to dirty in any specific operation. Worse is
that we don't know exactly *what* metadata we will modify until we
walk structures and do lookups, which often happen after we've
dirtied other structures. An ENOSPC from a space reservation at that
point is fatal to the filesystem anyway, so there's no point in even
trying to do this.  Like I said, functionality like this cannot be
retrofitted to existing filesystems.

IOWs, this is pretty much useless functionality for the filesystem
layer, and if the only use is for some mythical filesystem with
completely static metadata then the standards space really jumped
the shark on this one....

> > You might be talking about filesystem metadata and block devices,
> > but this patchset ends up connecting ext4's user data fallocate() to
> > the block device, thereby allowing users to reserve space directly
> > in the underlying block device and directly exposing this issue to
> > userspace.
> 
> I missed that Chaitanya's repost of this series included the ext4 patch.
> Sorry!
> 
> >> How XFS decides to enforce space allocation policy and potentially
> >> leverage this plumbing is entirely up to you.
> >
> > Do I understand this correctly? i.e. that it is the filesystem's
> > responsibility to prevent users from preallocating more space than
> > exists in an underlying storage pool that has been intentionally
> > hidden from the filesystem so it can be underprovisioned?
> 
> No. But as an administrative policy it is useful to prevent runaway
> applications from writing a petabyte of random garbage to media. My
> point was that it is up to you and the other filesystem developers to
> decide how you want to leverage the low-level allocation capability and
> how you want to provide it to processes. And whether CAP_SYS_ADMIN,
> ulimit, or something else is the appropriate policy interface for this.

My cynical translation: the storage standards space hasn't given
any thought to how it can be used and/or administered in the real
world. Pass the buck - let the filesystem people work that out.

What I'm hearing is that this wasn't designed for typical filesystem
use, it wasn't designed for typical user application use, and how to
prevent abuse wasn't thought about at all.

That sounds like a big fat NACK to me....

> In terms of thin provisioning and space management there are various
> thresholds that may be reported by the device. In past discussions there
> hasn't been much interest in getting these exposed. It is also unclear
> to me whether it is actually beneficial to send low space warnings to
> hundreds or thousands of hosts attached to an array. In many cases the
> individual server admins are not even the right audience. The most
> common notification mechanism is a message to the storage array admin
> saying "click here to buy more disk".

Notifications are not relevant to preallocation functionality at all.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
  2020-04-19 22:36                                                           ` Dave Chinner
@ 2020-04-23  0:40                                                             ` Martin K. Petersen
  0 siblings, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2020-04-23  0:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, Chaitanya Kulkarni, hch, darrick.wong, axboe,
	tytso, adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, hannes, khlebnikov,
	ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
	linux-ext4


Dave,

>> Not before overwriting, no. Once you have allocated an LBA it remains
>> allocated until you discard it.

> Ok, so you are confirming what I thought: it's almost completely
> useless to us.
>
> i.e. this requires issuing IO to "reserve" space whilst preserving
> data before every metadata object goes from clean to dirty in memory.

You can only reserve the space prior to writing a block for the first
time. Once an LBA has been written ("Mapped" in the SCSI state machine),
it remains allocated until it is explicitly deallocated (via a
discard/Unmap operation).

This part of the SCSI spec was written eons ago under the assumption
that when a physical resource backing a given LBA had been established,
you could write the block over and over without having to allocate new
space.

This used to be true, but obviously the introduction of de-duplication
blew a major hole in that. I have been perusing the spec over and over
trying to understand how block provisioning state transitions are
defined when dedup is in the picture. However, much is left unexplained.

As a result, I reached out to various folks. Including the people who
worked on this feature in the standards way back. And the response that
I get from them is that allocation operation got irreparably broken when
support for de-duplication was added to the spec. Nobody attempted to
fix the state transitions since most vendors only cared about
deallocation. Consequently specifying the exact behavior of the
allocation operation in the context of dedup fell by the wayside.

The recommendation I got was that we should not rely on this feature
despite it being advertised as supported by the storage. I looked at
whether it was feasible to support it on non-dedup devices only, but it
does not look like it's worthwhile to pursue. And as a result there is
no need for block layer allocation operation to have parity with
SCSI. Although we may want to keep NVMe in mind when defining the
semantics.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices
@ 2020-05-12  8:55 Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
                   ` (11 more replies)
  0 siblings, 12 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn

The upcoming NVMe ZNS Specification will define a new type of write
command for zoned block devices, zone append.

When writing to a zoned block device using zone append, the start
sector of the write points at the start LBA of the zone to write to.
Upon completion the block device responds with the position at which
the data has been placed in the zone. From a high-level perspective
this resembles a filesystem's block allocator: the user writes to a
file and the filesystem takes care of data placement on the device.

In order to fully exploit the new zone append command in file-systems and
other interfaces above the block layer, we choose to emulate zone append
in SCSI and null_blk. This way we can have a single write path for both
file-systems and other interfaces above the block-layer, like io_uring on
zoned block devices, without having to care too much about the underlying
characteristics of the device itself.

The emulation works by caching each zone's write pointer, so a zone
append issued to the disk can be translated into a regular write at
the write pointer position. The zone append's start LBA is used to
derive the zone number for the lookup in the zone write pointer offset
cache, and the cached offset is then added to that LBA to get the
actual position to write the data. In SCSI we then turn the REQ_OP_ZONE_APPEND request into a
WRITE(16) command. Upon successful completion of the WRITE(16), the cache
will be updated to the new write pointer location and the written sector
will be noted in the request. On error the cache entry will be marked as
invalid and on the next write an update of the write pointer will be
scheduled, before issuing the actual write.

In order to reduce memory consumption, the only cached item is the offset
of the write pointer from the start of the zone, everything else can be
calculated. On an example drive with 52156 zones, the additional memory
consumption of the cache is thus 52156 * 4 = 208624 Bytes or 51 4k Byte
pages. The performance impact is negligible for a spinning drive.
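(The translation described above is simple arithmetic; a hypothetical, simplified restatement — the real lookup lives in sd_zbc.c and works on struct scsi_disk, and emulated_append_pos is invented for this sketch:)

```c
#include <assert.h>

typedef unsigned long long sector_t;

/*
 * The only cached state is each zone's write pointer offset, in sectors
 * from the zone start (one u32 per zone, hence 52156 * 4 = 208624 bytes
 * on the example drive); everything else derives from zone geometry.
 */
static sector_t emulated_append_pos(sector_t bio_sector,
				    sector_t zone_sectors,
				    const unsigned int *wp_ofst_cache)
{
	unsigned int zno = bio_sector / zone_sectors; /* zone number */
	sector_t zone_start = (sector_t)zno * zone_sectors;

	/* Translate the zone append into a plain write at the cached WP. */
	return zone_start + wp_ofst_cache[zno];
}
```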

For null_blk the emulation is way simpler, as null_blk's zoned block
device emulation support already caches the write pointer position, so we
only need to report the position back to the upper layers. Additional
caching is not needed here.

Furthermore we have converted zonefs to use REQ_OP_ZONE_APPEND for synchronous
direct I/Os. Asynchronous I/O still uses the normal path via iomap.

Performance testing with zonefs sync writes on a 14 TB SMR drive and nullblk
shows good results. On the SMR drive we're not regressing (the performance
improvement is within noise), on nullblk we could drastically improve specific
workloads:

* nullblk:

Single Thread Multiple Zones
				kIOPS	MiB/s	MB/s	% delta
mq-deadline REQ_OP_WRITE	10.1	631	662
mq-deadline REQ_OP_ZONE_APPEND	13.2	828	868	+31.12
none REQ_OP_ZONE_APPEND		15.6	978	1026	+54.98


Multiple Threads Multiple Zones
				kIOPS	MiB/s	MB/s	% delta
mq-deadline REQ_OP_WRITE	10.2	640	671
mq-deadline REQ_OP_ZONE_APPEND	10.4	650	681	+1.49
none REQ_OP_ZONE_APPEND		16.9	1058	1109	+65.28

* 14 TB SMR drive

Single Thread Multiple Zones
				IOPS	MiB/s	MB/s	% delta
mq-deadline REQ_OP_WRITE	797	49.9	52.3
mq-deadline REQ_OP_ZONE_APPEND	806	50.4	52.9	+1.15

Multiple Threads Multiple Zones
				IOPS	MiB/s	MB/s	% delta
mq-deadline REQ_OP_WRITE	745	46.6	48.9
mq-deadline REQ_OP_ZONE_APPEND	768	48	50.3	+2.86

The %-delta is against the baseline of REQ_OP_WRITE using mq-deadline as I/O
scheduler.

The series is based on Jens' for-5.8/block branch with HEAD:
ae979182ebb3 ("bdi: fix up for "remove the name field in struct backing_dev_info"")

As Christoph asked for a branch I pushed it to a git repo at:
git://git.kernel.org/pub/scm/linux/kernel/git/jth/linux.git zone-append.v11
https://git.kernel.org/pub/scm/linux/kernel/git/jth/linux.git/log/?h=zone-append.v11

Changes to v10:
- Added Reviews from Hannes
- Added Performance Numbers to cover letter

Changes to v9:
- Renamed zone_wp_ofst to zone_wp_offset (Hannes/Martin)
- Collected Reviews
- Dropped already merged patches

Changes to v8:
- Added kerneldoc for bio_add_hw_page (Hannes)
- Simplified calculation of zone-boundary cross checking (Bart)
- Added safety nets for max_append_sectors setting
- Added Reviews from Hannes
- Added Damien's Ack on the zonefs change

Changes to v7:
- Rebased on Jens' for-5.8/block
- Fixed up stray whitespace change (Bart)
- Added Reviews from Bart and Christoph

Changes to v6:
- Added Daniel's Reviewed-by's
- Addressed Christoph's comment on whitespace changes in 4/11
- Renamed driver_cb in 6/11
- Fixed lines over 80 characters in 8/11
- Damien simplified sd_zbc_revalidate_zones() in 8/11

Changes to v5:
- Added patch to fix the memleak on failed scsi command setup
- Added prep patch from Christoph for bio_add_hw_page
- Added Christoph's suggestions for adding append pages to bios
- Fixed compile warning with !CONFIG_BLK_DEV_ZONED
- Damien re-worked revalidate zone
- Added Christoph's suggestions for rescanning write pointers to update cache

Changes to v4:
- Added page merging for zone-append bios (Christoph)
- Removed different locking schemes for zone management operations (Christoph)
- Changed wp_ofst assignment from blk_revalidate_zones (Christoph)
- Smaller nitpicks (Christoph)
- Documented my changes to Keith's patch so it's clear where I messed up so he
  doesn't get blamed
- Added Damien as a Co-developer to the sd emulation patch as he wrote as much
  code for it as I did (if not more)

Changes since v3:
- Remove impact of zone-append from bio_full() and bio_add_page()
  fast-path (Christoph)
- All of the zone write pointer offset caching is handled in SCSI now
  (Christoph) 
- Drop null_blk patches that Damien sent separately (Christoph)
- Use EXPORT_SYMBOL_GPL for new exports (Christoph)	

Changes since v2:
- Remove iomap implementation and directly issue zone-appends from within
  zonefs (Christoph)
- Drop already merged patch
- Rebase onto new for-next branch

Changes since v1:
- Too much to mention, treat as a completely new series.


Christoph Hellwig (1):
  block: rename __bio_add_pc_page to bio_add_hw_page

Damien Le Moal (2):
  block: Modify revalidate zones
  null_blk: Support REQ_OP_ZONE_APPEND

Johannes Thumshirn (6):
  block: provide fallbacks for blk_queue_zone_is_seq and
    blk_queue_zone_no
  block: introduce blk_req_zone_write_trylock
  scsi: sd_zbc: factor out sanity checks for zoned commands
  scsi: sd_zbc: emulate ZONE_APPEND commands
  block: export bio_release_pages and bio_iov_iter_get_pages
  zonefs: use REQ_OP_ZONE_APPEND for sync DIO

Keith Busch (1):
  block: Introduce REQ_OP_ZONE_APPEND

 block/bio.c                    | 129 ++++++++---
 block/blk-core.c               |  52 +++++
 block/blk-map.c                |   5 +-
 block/blk-mq.c                 |  27 +++
 block/blk-settings.c           |  31 +++
 block/blk-sysfs.c              |  13 ++
 block/blk-zoned.c              |  23 +-
 block/blk.h                    |   4 +-
 drivers/block/null_blk_zoned.c |  37 ++-
 drivers/scsi/scsi_lib.c        |   1 +
 drivers/scsi/sd.c              |  16 +-
 drivers/scsi/sd.h              |  43 +++-
 drivers/scsi/sd_zbc.c          | 399 ++++++++++++++++++++++++++++++---
 fs/zonefs/super.c              |  80 ++++++-
 include/linux/blk_types.h      |  14 ++
 include/linux/blkdev.h         |  25 ++-
 16 files changed, 807 insertions(+), 92 deletions(-)

-- 
2.24.1


^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH v11 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
@ 2020-05-12  8:55 ` Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 02/10] block: rename __bio_add_pc_page to bio_add_hw_page Johannes Thumshirn
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn,
	Christoph Hellwig, Bart Van Assche, Hannes Reinecke

blk_queue_zone_is_seq() and blk_queue_zone_no() have not been called with
CONFIG_BLK_DEV_ZONED disabled until now.

The introduction of REQ_OP_ZONE_APPEND will change this, so we need to
provide noop fallbacks for the !CONFIG_BLK_DEV_ZONED case.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 include/linux/blkdev.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f00bd4042295..91c6e413bf6b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -721,6 +721,16 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 {
 	return 0;
 }
+static inline bool blk_queue_zone_is_seq(struct request_queue *q,
+					 sector_t sector)
+{
+	return false;
+}
+static inline unsigned int blk_queue_zone_no(struct request_queue *q,
+					     sector_t sector)
+{
+	return 0;
+}
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static inline bool rq_is_sync(struct request *rq)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v11 02/10] block: rename __bio_add_pc_page to bio_add_hw_page
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
@ 2020-05-12  8:55 ` Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 03/10] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Christoph Hellwig,
	Johannes Thumshirn, Daniel Wagner, Hannes Reinecke

From: Christoph Hellwig <hch@lst.de>

Rename __bio_add_pc_page() to bio_add_hw_page() and explicitly pass in a
max_sectors argument.

This max_sectors argument can be used to specify constraints from the
hardware.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[ jth: rebased and made public for blk-map.c ]
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/bio.c     | 65 ++++++++++++++++++++++++++++++-------------------
 block/blk-map.c |  5 ++--
 block/blk.h     |  4 +--
 3 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 21cbaa6a1c20..aad0a6dad4f9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -748,9 +748,14 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
 	return true;
 }
 
-static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio,
-		struct page *page, unsigned len, unsigned offset,
-		bool *same_page)
+/*
+ * Try to merge a page into a segment, while obeying the hardware segment
+ * size limit.  This is not for normal read/write bios, but for passthrough
+ * or Zone Append operations that we can't split.
+ */
+static bool bio_try_merge_hw_seg(struct request_queue *q, struct bio *bio,
+				 struct page *page, unsigned len,
+				 unsigned offset, bool *same_page)
 {
 	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 	unsigned long mask = queue_segment_boundary(q);
@@ -765,38 +770,32 @@ static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio,
 }
 
 /**
- *	__bio_add_pc_page	- attempt to add page to passthrough bio
- *	@q: the target queue
- *	@bio: destination bio
- *	@page: page to add
- *	@len: vec entry length
- *	@offset: vec entry offset
- *	@same_page: return if the merge happen inside the same page
- *
- *	Attempt to add a page to the bio_vec maplist. This can fail for a
- *	number of reasons, such as the bio being full or target block device
- *	limitations. The target block device must allow bio's up to PAGE_SIZE,
- *	so it is always possible to add a single page to an empty bio.
+ * bio_add_hw_page - attempt to add a page to a bio with hw constraints
+ * @q: the target queue
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ * @max_sectors: maximum number of sectors that can be added
+ * @same_page: return if the segment has been merged inside the same page
  *
- *	This should only be used by passthrough bios.
+ * Add a page to a bio while respecting the hardware max_sectors, max_segment
+ * and gap limitations.
  */
-int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
+int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
-		bool *same_page)
+		unsigned int max_sectors, bool *same_page)
 {
 	struct bio_vec *bvec;
 
-	/*
-	 * cloned bio must not modify vec list
-	 */
-	if (unlikely(bio_flagged(bio, BIO_CLONED)))
+	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return 0;
 
-	if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
+	if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
 		return 0;
 
 	if (bio->bi_vcnt > 0) {
-		if (bio_try_merge_pc_page(q, bio, page, len, offset, same_page))
+		if (bio_try_merge_hw_seg(q, bio, page, len, offset, same_page))
 			return len;
 
 		/*
@@ -823,11 +822,27 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	return len;
 }
 
+/**
+ * bio_add_pc_page	- attempt to add page to passthrough bio
+ * @q: the target queue
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist. This can fail for a
+ * number of reasons, such as the bio being full or target block device
+ * limitations. The target block device must allow bio's up to PAGE_SIZE,
+ * so it is always possible to add a single page to an empty bio.
+ *
+ * This should only be used by passthrough bios.
+ */
 int bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset)
 {
 	bool same_page = false;
-	return __bio_add_pc_page(q, bio, page, len, offset, &same_page);
+	return bio_add_hw_page(q, bio, page, len, offset,
+			queue_max_hw_sectors(q), &same_page);
 }
 EXPORT_SYMBOL(bio_add_pc_page);
 
diff --git a/block/blk-map.c b/block/blk-map.c
index b6fa343fea9f..e3e4ac48db45 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -257,6 +257,7 @@ static struct bio *bio_copy_user_iov(struct request_queue *q,
 static struct bio *bio_map_user_iov(struct request_queue *q,
 		struct iov_iter *iter, gfp_t gfp_mask)
 {
+	unsigned int max_sectors = queue_max_hw_sectors(q);
 	int j;
 	struct bio *bio;
 	int ret;
@@ -294,8 +295,8 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
 				if (n > bytes)
 					n = bytes;
 
-				if (!__bio_add_pc_page(q, bio, page, n, offs,
-						&same_page)) {
+				if (!bio_add_hw_page(q, bio, page, n, offs,
+						     max_sectors, &same_page)) {
 					if (same_page)
 						put_page(page);
 					break;
diff --git a/block/blk.h b/block/blk.h
index 73bd3b1c6938..1ae3279df712 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -453,8 +453,8 @@ static inline void part_nr_sects_write(struct hd_struct *part, sector_t size)
 
 struct request_queue *__blk_alloc_queue(int node_id);
 
-int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
+int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
-		bool *same_page);
+		unsigned int max_sectors, bool *same_page);
 
 #endif /* BLK_INTERNAL_H */
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v11 03/10] block: Introduce REQ_OP_ZONE_APPEND
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 02/10] block: rename __bio_add_pc_page to bio_add_hw_page Johannes Thumshirn
@ 2020-05-12  8:55 ` Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 04/10] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn,
	Christoph Hellwig, Hannes Reinecke

From: Keith Busch <kbusch@kernel.org>

Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
block device. This is a no-merge write operation.

A zone append write BIO must:
* Target a zoned block device
* Have a sector position indicating the start sector of the target zone
* The target zone must be a sequential write zone
* The BIO must not cross a zone boundary
* The BIO must not be split, to ensure that a single range of LBAs
  is written with a single command.

Implement these checks in generic_make_request_checks() using the
helper function blk_check_zone_append(). To avoid write append BIO
splitting, introduce the new max_zone_append_sectors queue limit
attribute and ensure that a BIO size is always lower than this limit.
Export this new limit through sysfs and check these limits in bio_full().
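(The checks listed above boil down to a few comparisons; a hypothetical stand-alone restatement for illustration — the real helper, blk_check_zone_append(), operates on struct bio and the request queue, and zone_append_ok is invented here:)

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long sector_t;

/* Validate a zone append: aligned to a zone start, within one zone,
 * and small enough that it will never need to be split. */
static bool zone_append_ok(sector_t start, sector_t nr_sectors,
			   sector_t zone_sectors,
			   sector_t max_zone_append_sectors)
{
	if (!max_zone_append_sectors)       /* device lacks append support */
		return false;
	if (start % zone_sectors)           /* must target a zone's start LBA */
		return false;
	if (nr_sectors > zone_sectors)      /* must not cross a zone boundary */
		return false;
	if (nr_sectors > max_zone_append_sectors) /* must never be split */
		return false;
	return true;
}
```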

Also when an LLDD can't dispatch a request to a specific zone, it
will return BLK_STS_ZONE_RESOURCE, indicating this request needs to
be delayed, e.g. because the zone it will be dispatched to is still
write-locked. If this happens, set the request aside in a local list
and continue trying to dispatch requests such as READ requests or
WRITE/ZONE_APPEND requests targeting other zones. This way we can
still keep a high queue depth without starving other requests even if
one request can't be served due to zone write-locking.
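The set-aside-and-continue behavior above can be sketched as a userspace model
(illustrative only; the names dispatch, STS_ZONE_RESOURCE and the array-based
"list" are invented stand-ins for the blk-mq dispatch loop and zone_list):

```c
#include <assert.h>
#include <stddef.h>

enum sts { STS_OK, STS_ZONE_RESOURCE };

/* Model of the dispatch loop: requests that hit a zone resource
 * conflict are parked for a later retry, while dispatch continues
 * for the remaining requests so the queue depth stays high. */
static int dispatch(const enum sts *list, size_t n,
		    size_t *retry, size_t *nr_retry)
{
	int queued = 0;

	*nr_retry = 0;
	for (size_t i = 0; i < n; i++) {
		if (list[i] == STS_ZONE_RESOURCE) {
			retry[(*nr_retry)++] = i; /* park on "zone_list" */
			continue;                 /* keep draining the list */
		}
		queued++;                         /* dispatched to the driver */
	}
	return queued;
}
```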

Finally, make sure that the bio sector position indicates the actual
write position as indicated by the device on completion.

Signed-off-by: Keith Busch <kbusch@kernel.org>
[ jth: added zone-append specific add_page and merge_page helpers ]
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 block/bio.c               | 62 ++++++++++++++++++++++++++++++++++++---
 block/blk-core.c          | 52 ++++++++++++++++++++++++++++++++
 block/blk-mq.c            | 27 +++++++++++++++++
 block/blk-settings.c      | 31 ++++++++++++++++++++
 block/blk-sysfs.c         | 13 ++++++++
 drivers/scsi/scsi_lib.c   |  1 +
 include/linux/blk_types.h | 14 +++++++++
 include/linux/blkdev.h    | 11 +++++++
 8 files changed, 207 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index aad0a6dad4f9..3aa3c4ce2e5e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1025,6 +1025,50 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	return 0;
 }
 
+static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
+	struct request_queue *q = bio->bi_disk->queue;
+	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
+	struct page **pages = (struct page **)bv;
+	ssize_t size, left;
+	unsigned len, i;
+	size_t offset;
+
+	if (WARN_ON_ONCE(!max_append_sectors))
+		return 0;
+
+	/*
+	 * Move page array up in the allocated memory for the bio vecs as far as
+	 * possible so that we can start filling biovecs from the beginning
+	 * without overwriting the temporary page array.
+	 */
+	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
+	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
+
+	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	if (unlikely(size <= 0))
+		return size ? size : -EFAULT;
+
+	for (left = size, i = 0; left > 0; left -= len, i++) {
+		struct page *page = pages[i];
+		bool same_page = false;
+
+		len = min_t(size_t, PAGE_SIZE - offset, left);
+		if (bio_add_hw_page(q, bio, page, len, offset,
+				max_append_sectors, &same_page) != len)
+			return -EINVAL;
+		if (same_page)
+			put_page(page);
+		offset = 0;
+	}
+
+	iov_iter_advance(iter, size);
+	return 0;
+}
+
 /**
  * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
@@ -1054,10 +1098,16 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return -EINVAL;
 
 	do {
-		if (is_bvec)
-			ret = __bio_iov_bvec_add_pages(bio, iter);
-		else
-			ret = __bio_iov_iter_get_pages(bio, iter);
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			if (WARN_ON_ONCE(is_bvec))
+				return -EINVAL;
+			ret = __bio_iov_append_get_pages(bio, iter);
+		} else {
+			if (is_bvec)
+				ret = __bio_iov_bvec_add_pages(bio, iter);
+			else
+				ret = __bio_iov_iter_get_pages(bio, iter);
+		}
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
 	if (is_bvec)
@@ -1460,6 +1510,10 @@ struct bio *bio_split(struct bio *bio, int sectors,
 	BUG_ON(sectors <= 0);
 	BUG_ON(sectors >= bio_sectors(bio));
 
+	/* Zone append commands cannot be split */
+	if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
+		return NULL;
+
 	split = bio_clone_fast(bio, gfp, bs);
 	if (!split)
 		return NULL;
diff --git a/block/blk-core.c b/block/blk-core.c
index 538cbc725620..0f95fdfd3827 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -135,6 +135,7 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_OPEN),
 	REQ_OP_NAME(ZONE_CLOSE),
 	REQ_OP_NAME(ZONE_FINISH),
+	REQ_OP_NAME(ZONE_APPEND),
 	REQ_OP_NAME(WRITE_SAME),
 	REQ_OP_NAME(WRITE_ZEROES),
 	REQ_OP_NAME(SCSI_IN),
@@ -240,6 +241,17 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 
 	bio_advance(bio, nbytes);
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND && error == BLK_STS_OK) {
+		/*
+		 * Partial zone append completions cannot be supported as the
+		 * BIO fragments may end up not being written sequentially.
+		 */
+		if (bio->bi_iter.bi_size)
+			bio->bi_status = BLK_STS_IOERR;
+		else
+			bio->bi_iter.bi_sector = rq->__sector;
+	}
+
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
 		bio_endio(bio);
@@ -887,6 +899,41 @@ static inline int blk_partition_remap(struct bio *bio)
 	return ret;
 }
 
+/*
+ * Check write append to a zoned block device.
+ */
+static inline blk_status_t blk_check_zone_append(struct request_queue *q,
+						 struct bio *bio)
+{
+	sector_t pos = bio->bi_iter.bi_sector;
+	int nr_sectors = bio_sectors(bio);
+
+	/* Only applicable to zoned block devices */
+	if (!blk_queue_is_zoned(q))
+		return BLK_STS_NOTSUPP;
+
+	/* The bio sector must point to the start of a sequential zone */
+	if (pos & (blk_queue_zone_sectors(q) - 1) ||
+	    !blk_queue_zone_is_seq(q, pos))
+		return BLK_STS_IOERR;
+
+	/*
+	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
+	 * split and could result in non-contiguous sectors being written in
+	 * different zones.
+	 */
+	if (nr_sectors > q->limits.chunk_sectors)
+		return BLK_STS_IOERR;
+
+	/* Make sure the BIO is small enough and will not get split */
+	if (nr_sectors > q->limits.max_zone_append_sectors)
+		return BLK_STS_IOERR;
+
+	bio->bi_opf |= REQ_NOMERGE;
+
+	return BLK_STS_OK;
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -959,6 +1006,11 @@ generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_same_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ZONE_APPEND:
+		status = blk_check_zone_append(q, bio);
+		if (status != BLK_STS_OK)
+			goto end_io;
+		break;
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_OPEN:
 	case REQ_OP_ZONE_CLOSE:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index bc34d6b572b6..d743f0af5fb8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1183,6 +1183,19 @@ static void blk_mq_handle_dev_resource(struct request *rq,
 	__blk_mq_requeue_request(rq);
 }
 
+static void blk_mq_handle_zone_resource(struct request *rq,
+					struct list_head *zone_list)
+{
+	/*
+	 * If we end up here it is because we cannot dispatch a request to a
+	 * specific zone due to LLD level zone-write locking or other zone
+	 * related resource not being available. In this case, set the request
+	 * aside in zone_list for retrying it later.
+	 */
+	list_add(&rq->queuelist, zone_list);
+	__blk_mq_requeue_request(rq);
+}
+
 /*
  * Returns true if we did some work AND can potentially do more.
  */
@@ -1195,6 +1208,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
 	bool no_budget_avail = false;
+	LIST_HEAD(zone_list);
 
 	if (list_empty(list))
 		return false;
@@ -1256,6 +1270,16 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_mq_handle_dev_resource(rq, list);
 			break;
+		} else if (ret == BLK_STS_ZONE_RESOURCE) {
+			/*
+			 * Move the request to zone_list and keep going through
+			 * the dispatch list to find more requests the drive can
+			 * accept.
+			 */
+			blk_mq_handle_zone_resource(rq, &zone_list);
+			if (list_empty(list))
+				break;
+			continue;
 		}
 
 		if (unlikely(ret != BLK_STS_OK)) {
@@ -1267,6 +1291,9 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		queued++;
 	} while (!list_empty(list));
 
+	if (!list_empty(&zone_list))
+		list_splice_tail_init(&zone_list, list);
+
 	hctx->dispatched[queued_to_index(queued)]++;
 
 	/*
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 2ab1967b9716..9a2c23cd9700 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_zone_append_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_zone_append_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -221,6 +223,33 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
+ * @q:  the request queue for the device
+ * @max_zone_append_sectors: maximum number of sectors to write per command
+ **/
+void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors)
+{
+	unsigned int max_sectors;
+
+	if (WARN_ON(!blk_queue_is_zoned(q)))
+		return;
+
+	max_sectors = min(q->limits.max_hw_sectors, max_zone_append_sectors);
+	max_sectors = min(q->limits.chunk_sectors, max_sectors);
+
+	/*
+	 * Signal eventual driver bugs resulting in the max_zone_append sectors limit
+	 * being 0 due to a 0 argument, the chunk_sectors limit (zone size) not set,
+	 * or the max_hw_sectors limit not set.
+	 */
+	WARN_ON(!max_sectors);
+
+	q->limits.max_zone_append_sectors = max_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_zone_append_sectors);
+
 /**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
@@ -470,6 +499,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
+					b->max_zone_append_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..02643e149d5e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -218,6 +218,13 @@ static ssize_t queue_write_zeroes_max_show(struct request_queue *q, char *page)
 		(unsigned long long)q->limits.max_write_zeroes_sectors << 9);
 }
 
+static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page)
+{
+	unsigned long long max_sectors = q->limits.max_zone_append_sectors;
+
+	return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT);
+}
+
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -639,6 +646,11 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 	.show = queue_write_zeroes_max_show,
 };
 
+static struct queue_sysfs_entry queue_zone_append_max_entry = {
+	.attr = {.name = "zone_append_max_bytes", .mode = 0444 },
+	.show = queue_zone_append_max_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
@@ -749,6 +761,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
+	&queue_zone_append_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_zoned_entry.attr,
 	&queue_nr_zones_entry.attr,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f0cb26b3da6a..af00e4a3f006 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1712,6 +1712,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_ZONE_RESOURCE:
 		if (atomic_read(&sdev->device_busy) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 90895d594e64..b90dca1fa430 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -63,6 +63,18 @@ typedef u8 __bitwise blk_status_t;
  */
 #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
 
+/*
+ * BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if zone
+ * related resources are unavailable, but the driver can guarantee the queue
+ * will be rerun in the future once the resources become available again.
+ *
+ * This is different from BLK_STS_DEV_RESOURCE in that it explicitly references
+ * a zone specific resource and IO to a different zone on the same device could
+ * still be served. Examples of that are zones that are write-locked, but a read
+ * to the same zone could be served.
+ */
+#define BLK_STS_ZONE_RESOURCE	((__force blk_status_t)14)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
@@ -296,6 +308,8 @@ enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* write data at the current zone write pointer */
+	REQ_OP_ZONE_APPEND	= 13,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 91c6e413bf6b..158641fbc7cd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -331,6 +331,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -749,6 +750,9 @@ static inline bool rq_mergeable(struct request *rq)
 	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
 		return false;
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		return false;
+
 	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
 		return false;
 	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
@@ -1083,6 +1087,8 @@ extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
+extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
@@ -1300,6 +1306,11 @@ static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 	return q->limits.max_segment_size;
 }
 
+static inline unsigned int queue_max_zone_append_sectors(const struct request_queue *q)
+{
+	return q->limits.max_zone_append_sectors;
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	int retval = 512;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v11 04/10] block: introduce blk_req_zone_write_trylock
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2020-05-12  8:55 ` [PATCH v11 03/10] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-05-12  8:55 ` Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 05/10] block: Modify revalidate zones Johannes Thumshirn
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn,
	Christoph Hellwig, Hannes Reinecke

Introduce blk_req_zone_write_trylock(), which either grabs the write
lock for a sequential zone or returns false if the zone is already
locked.
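The trylock pattern on a per-zone bitmap can be modelled in userspace with C11
atomics (a sketch with invented names; the kernel uses test_and_set_bit() on
q->seq_zones_wlock, not stdatomic):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal model of a per-zone write-lock bitmap: trylock atomically
 * sets the zone's bit and fails without blocking if it was set. */
static atomic_ulong zone_wlock; /* one bit per zone, zones 0..63 */

static bool zone_write_trylock(unsigned int zno)
{
	unsigned long bit = 1UL << zno;

	/* fetch_or returns the previous value: if the bit was already
	 * set, someone else holds the zone write lock. */
	return !(atomic_fetch_or(&zone_wlock, bit) & bit);
}

static void zone_write_unlock(unsigned int zno)
{
	atomic_fetch_and(&zone_wlock, ~(1UL << zno));
}
```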

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 block/blk-zoned.c      | 14 ++++++++++++++
 include/linux/blkdev.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index f87956e0dcaf..c822cfa7a102 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -82,6 +82,20 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
 
+bool blk_req_zone_write_trylock(struct request *rq)
+{
+	unsigned int zno = blk_rq_zone_no(rq);
+
+	if (test_and_set_bit(zno, rq->q->seq_zones_wlock))
+		return false;
+
+	WARN_ON_ONCE(rq->rq_flags & RQF_ZONE_WRITE_LOCKED);
+	rq->rq_flags |= RQF_ZONE_WRITE_LOCKED;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(blk_req_zone_write_trylock);
+
 void __blk_req_zone_write_lock(struct request *rq)
 {
 	if (WARN_ON_ONCE(test_and_set_bit(blk_rq_zone_no(rq),
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 158641fbc7cd..d6e6ce3dc656 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1737,6 +1737,7 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 
 #ifdef CONFIG_BLK_DEV_ZONED
 bool blk_req_needs_zone_write_lock(struct request *rq);
+bool blk_req_zone_write_trylock(struct request *rq);
 void __blk_req_zone_write_lock(struct request *rq);
 void __blk_req_zone_write_unlock(struct request *rq);
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v11 05/10] block: Modify revalidate zones
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (3 preceding siblings ...)
  2020-05-12  8:55 ` [PATCH v11 04/10] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
@ 2020-05-12  8:55 ` Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 06/10] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Damien Le Moal,
	Johannes Thumshirn, Christoph Hellwig, Hannes Reinecke

From: Damien Le Moal <damien.lemoal@wdc.com>

Modify the interface of blk_revalidate_disk_zones() to add an optional
driver callback function that a driver can use to extend processing
done during zone revalidation. The callback, if defined, is executed
with the device request queue frozen, after all zones have been
inspected.
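The optional-callback shape of the changed interface can be sketched as a
userspace model (struct disk and the function names here are invented
placeholders, not the kernel types):

```c
#include <assert.h>
#include <stddef.h>

/* Model of the optional revalidation callback: it runs only after
 * all zones have been inspected, and only if the caller supplied one
 * (a NULL callback means "no driver-specific processing"). */
struct disk {
	int nr_zones;
	int driver_ready;
};

static int revalidate_zones(struct disk *d,
			    void (*update_driver_data)(struct disk *d))
{
	d->nr_zones = 8;	/* stand-in for the zone inspection result */
	if (update_driver_data)	/* optional driver hook */
		update_driver_data(d);
	return 0;
}

static void my_update(struct disk *d)
{
	d->driver_ready = 1;
}
```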

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 block/blk-zoned.c              | 9 ++++++++-
 drivers/block/null_blk_zoned.c | 2 +-
 include/linux/blkdev.h         | 3 ++-
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index c822cfa7a102..23831fa8701d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -471,14 +471,19 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 /**
  * blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
  * @disk:	Target disk
+ * @update_driver_data:	Callback to update driver data on the frozen disk
  *
  * Helper function for low-level device drivers to (re) allocate and initialize
  * a disk request queue zone bitmaps. This functions should normally be called
  * within the disk ->revalidate method for blk-mq based drivers.  For BIO based
  * drivers only q->nr_zones needs to be updated so that the sysfs exposed value
  * is correct.
+ * If the @update_driver_data callback function is not NULL, the callback is
+ * executed with the device request queue frozen after all zones have been
+ * checked.
  */
-int blk_revalidate_disk_zones(struct gendisk *disk)
+int blk_revalidate_disk_zones(struct gendisk *disk,
+			      void (*update_driver_data)(struct gendisk *disk))
 {
 	struct request_queue *q = disk->queue;
 	struct blk_revalidate_zone_args args = {
@@ -512,6 +517,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 		q->nr_zones = args.nr_zones;
 		swap(q->seq_zones_wlock, args.seq_zones_wlock);
 		swap(q->conv_zones_bitmap, args.conv_zones_bitmap);
+		if (update_driver_data)
+			update_driver_data(disk);
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index 9e4bcdad1a80..46641df2e58e 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -73,7 +73,7 @@ int null_register_zoned_dev(struct nullb *nullb)
 	struct request_queue *q = nullb->q;
 
 	if (queue_is_mq(q))
-		return blk_revalidate_disk_zones(nullb->disk);
+		return blk_revalidate_disk_zones(nullb->disk, NULL);
 
 	blk_queue_chunk_sectors(q, nullb->dev->zone_size_sects);
 	q->nr_zones = blkdev_nr_zones(nullb->disk);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d6e6ce3dc656..fd405dac8eb0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -357,7 +357,8 @@ unsigned int blkdev_nr_zones(struct gendisk *disk);
 extern int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 			    sector_t sectors, sector_t nr_sectors,
 			    gfp_t gfp_mask);
-extern int blk_revalidate_disk_zones(struct gendisk *disk);
+int blk_revalidate_disk_zones(struct gendisk *disk,
+			      void (*update_driver_data)(struct gendisk *disk));
 
 extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
 				     unsigned int cmd, unsigned long arg);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v11 06/10] scsi: sd_zbc: factor out sanity checks for zoned commands
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (4 preceding siblings ...)
  2020-05-12  8:55 ` [PATCH v11 05/10] block: Modify revalidate zones Johannes Thumshirn
@ 2020-05-12  8:55 ` Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn,
	Christoph Hellwig, Bart Van Assche, Hannes Reinecke

Factor out the sanity checks for zoned commands from
sd_zbc_setup_zone_mgmt_cmnd().

This will help with the introduction of an emulated ZONE_APPEND command.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/scsi/sd_zbc.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index f45c22b09726..ee156fbf3780 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -209,6 +209,26 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 	return ret;
 }
 
+static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	sector_t sector = blk_rq_pos(rq);
+
+	if (!sd_is_zoned(sdkp))
+		/* Not a zoned device */
+		return BLK_STS_IOERR;
+
+	if (sdkp->device->changed)
+		return BLK_STS_IOERR;
+
+	if (sector & (sd_zbc_zone_sectors(sdkp) - 1))
+		/* Unaligned request */
+		return BLK_STS_IOERR;
+
+	return BLK_STS_OK;
+}
+
 /**
  * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
  *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
@@ -223,20 +243,14 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 					 unsigned char op, bool all)
 {
 	struct request *rq = cmd->request;
-	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t sector = blk_rq_pos(rq);
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t block = sectors_to_logical(sdkp->device, sector);
+	blk_status_t ret;
 
-	if (!sd_is_zoned(sdkp))
-		/* Not a zoned device */
-		return BLK_STS_IOERR;
-
-	if (sdkp->device->changed)
-		return BLK_STS_IOERR;
-
-	if (sector & (sd_zbc_zone_sectors(sdkp) - 1))
-		/* Unaligned request */
-		return BLK_STS_IOERR;
+	ret = sd_zbc_cmnd_checks(cmd);
+	if (ret != BLK_STS_OK)
+		return ret;
 
 	cmd->cmd_len = 16;
 	memset(cmd->cmnd, 0, cmd->cmd_len);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v11 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (5 preceding siblings ...)
  2020-05-12  8:55 ` [PATCH v11 06/10] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
@ 2020-05-12  8:55 ` Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 08/10] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn,
	Christoph Hellwig, Hannes Reinecke

Emulate ZONE_APPEND for SCSI disks using a regular WRITE(16) command
with a start LBA set to the target zone write pointer position.

In order to always know the write pointer position of a sequential write
zone, the write pointer of all zones is tracked using an array of 32-bit
zone write pointer offsets attached to the scsi disk structure. Each
entry of the array indicates a zone's write pointer position relative to
the zone start sector. The write pointer offsets are maintained in sync
with the device as follows:
1) the write pointer offset of a zone is reset to 0 when a
   REQ_OP_ZONE_RESET command completes.
2) the write pointer offset of a zone is set to the zone size when a
   REQ_OP_ZONE_FINISH command completes.
3) the write pointer offset of a zone is incremented by the number of
   512B sectors written when a write, write same or a zone append
   command completes.
4) the write pointer offset of all zones is reset to 0 when a
   REQ_OP_ZONE_RESET_ALL command completes.
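The four completion rules above amount to a small state machine per zone. A
userspace sketch (names and sizes invented; locking omitted, whereas the driver
protects the array with a spinlock):

```c
#include <assert.h>
#include <stdint.h>

#define NR_ZONES  4
#define ZONE_SIZE 128u	/* 512B sectors per zone, illustrative */

static uint32_t wp_ofst[NR_ZONES]; /* write pointer offset per zone */

/* Rule 1: RESET completion zeroes the zone's offset. */
static void zone_reset(unsigned int z)		{ wp_ofst[z] = 0; }
/* Rule 2: FINISH completion sets it to the zone size. */
static void zone_finish(unsigned int z)		{ wp_ofst[z] = ZONE_SIZE; }
/* Rule 3: write/write-same/append completion advances it. */
static void zone_write_done(unsigned int z, uint32_t nr_sectors)
{
	wp_ofst[z] += nr_sectors;
}
/* Rule 4: RESET_ALL completion zeroes every zone's offset. */
static void zone_reset_all(void)
{
	for (unsigned int z = 0; z < NR_ZONES; z++)
		wp_ofst[z] = 0;
}
```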

Since the block layer does not write-lock zones for zone append
commands, to ensure a sequential ordering of the regular write commands
used for the emulation, the target zone of a zone append command is
locked when sd_zbc_prepare_zone_append() is called from
sd_setup_read_write_cmnd(). If the zone write lock cannot be obtained
(e.g. a zone append is in flight or a regular write has already locked
the zone), dispatching of the zone append command is delayed by
returning BLK_STS_ZONE_RESOURCE.

To avoid the need for write locking all zones for REQ_OP_ZONE_RESET_ALL
requests, use a spinlock to protect accesses and modifications of the
zone write pointer offsets. This spinlock is initialized from sd_probe()
using the new function sd_zbc_init().

Co-developed-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/sd.c     |  16 +-
 drivers/scsi/sd.h     |  43 ++++-
 drivers/scsi/sd_zbc.c | 363 +++++++++++++++++++++++++++++++++++++++---
 3 files changed, 392 insertions(+), 30 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a793cb08d025..7b0383e42b4c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1206,6 +1206,12 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 		}
 	}
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
+		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
+		if (ret)
+			return ret;
+	}
+
 	fua = rq->cmd_flags & REQ_FUA ? 0x8 : 0;
 	dix = scsi_prot_sg_count(cmd);
 	dif = scsi_host_dif_capable(cmd->device->host, sdkp->protection_type);
@@ -1287,6 +1293,7 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd)
 		return sd_setup_flush_cmnd(cmd);
 	case REQ_OP_READ:
 	case REQ_OP_WRITE:
+	case REQ_OP_ZONE_APPEND:
 		return sd_setup_read_write_cmnd(cmd);
 	case REQ_OP_ZONE_RESET:
 		return sd_zbc_setup_zone_mgmt_cmnd(cmd, ZO_RESET_WRITE_POINTER,
@@ -2055,7 +2062,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 
  out:
 	if (sd_is_zoned(sdkp))
-		sd_zbc_complete(SCpnt, good_bytes, &sshdr);
+		good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr);
 
 	SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, SCpnt,
 					   "sd_done: completed %d of %d bytes\n",
@@ -3372,6 +3379,10 @@ static int sd_probe(struct device *dev)
 	sdkp->first_scan = 1;
 	sdkp->max_medium_access_timeouts = SD_MAX_MEDIUM_TIMEOUTS;
 
+	error = sd_zbc_init_disk(sdkp);
+	if (error)
+		goto out_free_index;
+
 	sd_revalidate_disk(gd);
 
 	gd->flags = GENHD_FL_EXT_DEVT;
@@ -3409,6 +3420,7 @@ static int sd_probe(struct device *dev)
  out_put:
 	put_disk(gd);
  out_free:
+	sd_zbc_release_disk(sdkp);
 	kfree(sdkp);
  out:
 	scsi_autopm_put_device(sdp);
@@ -3485,6 +3497,8 @@ static void scsi_disk_release(struct device *dev)
 	put_disk(disk);
 	put_device(&sdkp->device->sdev_gendev);
 
+	sd_zbc_release_disk(sdkp);
+
 	kfree(sdkp);
 }
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 50fff0bf8c8e..3a74f4b45134 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -79,6 +79,12 @@ struct scsi_disk {
 	u32		zones_optimal_open;
 	u32		zones_optimal_nonseq;
 	u32		zones_max_open;
+	u32		*zones_wp_offset;
+	spinlock_t	zones_wp_offset_lock;
+	u32		*rev_wp_offset;
+	struct mutex	rev_mutex;
+	struct work_struct zone_wp_offset_work;
+	char		*zone_wp_update_buf;
 #endif
 	atomic_t	openers;
 	sector_t	capacity;	/* size in logical blocks */
@@ -207,17 +213,35 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
 
 #ifdef CONFIG_BLK_DEV_ZONED
 
+int sd_zbc_init_disk(struct scsi_disk *sdkp);
+void sd_zbc_release_disk(struct scsi_disk *sdkp);
 extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
 extern void sd_zbc_print_zones(struct scsi_disk *sdkp);
 blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 					 unsigned char op, bool all);
-extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
-			    struct scsi_sense_hdr *sshdr);
+unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+			     struct scsi_sense_hdr *sshdr);
 int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data);
 
+blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
+				        unsigned int nr_blocks);
+
 #else /* CONFIG_BLK_DEV_ZONED */
 
+static inline int sd_zbc_init(void)
+{
+	return 0;
+}
+
+static inline int sd_zbc_init_disk(struct scsi_disk *sdkp)
+{
+	return 0;
+}
+
+static inline void sd_zbc_exit(void) {}
+static inline void sd_zbc_release_disk(struct scsi_disk *sdkp) {}
+
 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
 				    unsigned char *buf)
 {
@@ -233,9 +257,18 @@ static inline blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 	return BLK_STS_TARGET;
 }
 
-static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
-				   unsigned int good_bytes,
-				   struct scsi_sense_hdr *sshdr) {}
+static inline unsigned int sd_zbc_complete(struct scsi_cmnd *cmd,
+			unsigned int good_bytes, struct scsi_sense_hdr *sshdr)
+{
+	return 0;
+}
+
+static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
+						      sector_t *lba,
+						      unsigned int nr_blocks)
+{
+	return BLK_STS_TARGET;
+}
 
 #define sd_zbc_report_zones NULL
 
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index ee156fbf3780..bb87fbba2a09 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -11,6 +11,7 @@
 #include <linux/blkdev.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/mm.h>
+#include <linux/mutex.h>
 
 #include <asm/unaligned.h>
 
@@ -19,11 +20,36 @@
 
 #include "sd.h"
 
+static unsigned int sd_zbc_get_zone_wp_offset(struct blk_zone *zone)
+{
+	if (zone->type == ZBC_ZONE_TYPE_CONV)
+		return 0;
+
+	switch (zone->cond) {
+	case BLK_ZONE_COND_IMP_OPEN:
+	case BLK_ZONE_COND_EXP_OPEN:
+	case BLK_ZONE_COND_CLOSED:
+		return zone->wp - zone->start;
+	case BLK_ZONE_COND_FULL:
+		return zone->len;
+	case BLK_ZONE_COND_EMPTY:
+	case BLK_ZONE_COND_OFFLINE:
+	case BLK_ZONE_COND_READONLY:
+	default:
+		/*
+		 * Offline and read-only zones do not have a valid
+		 * write pointer. Use 0, as for an empty zone.
+		 */
+		return 0;
+	}
+}
+
 static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
 			       unsigned int idx, report_zones_cb cb, void *data)
 {
 	struct scsi_device *sdp = sdkp->device;
 	struct blk_zone zone = { 0 };
+	int ret;
 
 	zone.type = buf[0] & 0x0f;
 	zone.cond = (buf[1] >> 4) & 0xf;
@@ -39,7 +65,14 @@ static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
 	    zone.cond == ZBC_ZONE_COND_FULL)
 		zone.wp = zone.start + zone.len;
 
-	return cb(&zone, idx, data);
+	ret = cb(&zone, idx, data);
+	if (ret)
+		return ret;
+
+	if (sdkp->rev_wp_offset)
+		sdkp->rev_wp_offset[idx] = sd_zbc_get_zone_wp_offset(&zone);
+
+	return 0;
 }
 
 /**
@@ -229,6 +262,116 @@ static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
 	return BLK_STS_OK;
 }
 
+#define SD_ZBC_INVALID_WP_OFST	(~0u)
+#define SD_ZBC_UPDATING_WP_OFST	(SD_ZBC_INVALID_WP_OFST - 1)
+
+static int sd_zbc_update_wp_offset_cb(struct blk_zone *zone, unsigned int idx,
+				    void *data)
+{
+	struct scsi_disk *sdkp = data;
+
+	lockdep_assert_held(&sdkp->zones_wp_offset_lock);
+
+	sdkp->zones_wp_offset[idx] = sd_zbc_get_zone_wp_offset(zone);
+
+	return 0;
+}
+
+static void sd_zbc_update_wp_offset_workfn(struct work_struct *work)
+{
+	struct scsi_disk *sdkp;
+	unsigned int zno;
+	int ret;
+
+	sdkp = container_of(work, struct scsi_disk, zone_wp_offset_work);
+
+	spin_lock_bh(&sdkp->zones_wp_offset_lock);
+	for (zno = 0; zno < sdkp->nr_zones; zno++) {
+		if (sdkp->zones_wp_offset[zno] != SD_ZBC_UPDATING_WP_OFST)
+			continue;
+
+		spin_unlock_bh(&sdkp->zones_wp_offset_lock);
+		ret = sd_zbc_do_report_zones(sdkp, sdkp->zone_wp_update_buf,
+					     SD_BUF_SIZE,
+					     zno * sdkp->zone_blocks, true);
+		spin_lock_bh(&sdkp->zones_wp_offset_lock);
+		if (!ret)
+			sd_zbc_parse_report(sdkp, sdkp->zone_wp_update_buf + 64,
+					    zno, sd_zbc_update_wp_offset_cb,
+					    sdkp);
+	}
+	spin_unlock_bh(&sdkp->zones_wp_offset_lock);
+
+	scsi_device_put(sdkp->device);
+}
+
+/**
+ * sd_zbc_prepare_zone_append() - Prepare an emulated ZONE_APPEND command.
+ * @cmd: the command to setup
+ * @lba: the LBA to patch
+ * @nr_blocks: the number of LBAs to be written
+ *
+ * Called from sd_setup_read_write_cmnd() for REQ_OP_ZONE_APPEND.
+ * sd_zbc_prepare_zone_append() handles the necessary zone write locking and
+ * patching of the LBA for an emulated ZONE_APPEND command.
+ *
+ * In case the cached write pointer offset is %SD_ZBC_INVALID_WP_OFST it will
+ * schedule a REPORT ZONES command and return BLK_STS_DEV_RESOURCE so that
+ * the command is requeued.
+ */
+blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
+					unsigned int nr_blocks)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	unsigned int wp_offset, zno = blk_rq_zone_no(rq);
+	blk_status_t ret;
+
+	ret = sd_zbc_cmnd_checks(cmd);
+	if (ret != BLK_STS_OK)
+		return ret;
+
+	if (!blk_rq_zone_is_seq(rq))
+		return BLK_STS_IOERR;
+
+	/* Unlock of the write lock will happen in sd_zbc_complete() */
+	if (!blk_req_zone_write_trylock(rq))
+		return BLK_STS_ZONE_RESOURCE;
+
+	spin_lock_bh(&sdkp->zones_wp_offset_lock);
+	wp_offset = sdkp->zones_wp_offset[zno];
+	switch (wp_offset) {
+	case SD_ZBC_INVALID_WP_OFST:
+		/*
+		 * We are about to schedule work to update a zone write pointer
+		 * offset, which will cause the zone append command to be
+		 * requeued. So make sure that the scsi device does not go away
+		 * while the work is being processed.
+		 */
+		if (scsi_device_get(sdkp->device)) {
+			ret = BLK_STS_IOERR;
+			break;
+		}
+		sdkp->zones_wp_offset[zno] = SD_ZBC_UPDATING_WP_OFST;
+		schedule_work(&sdkp->zone_wp_offset_work);
+		fallthrough;
+	case SD_ZBC_UPDATING_WP_OFST:
+		ret = BLK_STS_DEV_RESOURCE;
+		break;
+	default:
+		wp_offset = sectors_to_logical(sdkp->device, wp_offset);
+		if (wp_offset + nr_blocks > sdkp->zone_blocks) {
+			ret = BLK_STS_IOERR;
+			break;
+		}
+
+		*lba += wp_offset;
+	}
+	spin_unlock_bh(&sdkp->zones_wp_offset_lock);
+	if (ret)
+		blk_req_zone_write_unlock(rq);
+	return ret;
+}
+
 /**
  * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
  *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
@@ -269,16 +412,105 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 	return BLK_STS_OK;
 }
 
+static bool sd_zbc_need_zone_wp_update(struct request *rq)
+{
+	switch (req_op(rq)) {
+	case REQ_OP_ZONE_APPEND:
+	case REQ_OP_ZONE_FINISH:
+	case REQ_OP_ZONE_RESET:
+	case REQ_OP_ZONE_RESET_ALL:
+		return true;
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_WRITE_SAME:
+		return blk_rq_zone_is_seq(rq);
+	default:
+		return false;
+	}
+}
+
+/**
+ * sd_zbc_zone_wp_update - Update cached zone write pointer upon cmd completion
+ * @cmd: Completed command
+ * @good_bytes: Command reply bytes
+ *
+ * Called from sd_zbc_complete() to handle the update of the cached zone write
+ * pointer value in case an update is needed.
+ */
+static unsigned int sd_zbc_zone_wp_update(struct scsi_cmnd *cmd,
+					  unsigned int good_bytes)
+{
+	int result = cmd->result;
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	unsigned int zno = blk_rq_zone_no(rq);
+	enum req_opf op = req_op(rq);
+
+	/*
+	 * If we got an error for a command that needs updating the write
+	 * pointer offset cache, we must mark the zone wp offset entry as
+	 * invalid to force an update from disk the next time a zone append
+	 * command is issued.
+	 */
+	spin_lock_bh(&sdkp->zones_wp_offset_lock);
+
+	if (result && op != REQ_OP_ZONE_RESET_ALL) {
+		if (op == REQ_OP_ZONE_APPEND) {
+			/* Force complete completion (no retry) */
+			good_bytes = 0;
+			scsi_set_resid(cmd, blk_rq_bytes(rq));
+		}
+
+		/*
+		 * Force an update of the zone write pointer offset on
+		 * the next zone append access.
+		 */
+		if (sdkp->zones_wp_offset[zno] != SD_ZBC_UPDATING_WP_OFST)
+			sdkp->zones_wp_offset[zno] = SD_ZBC_INVALID_WP_OFST;
+		goto unlock_wp_offset;
+	}
+
+	switch (op) {
+	case REQ_OP_ZONE_APPEND:
+		rq->__sector += sdkp->zones_wp_offset[zno];
+		fallthrough;
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE:
+		if (sdkp->zones_wp_offset[zno] < sd_zbc_zone_sectors(sdkp))
+			sdkp->zones_wp_offset[zno] +=
+						good_bytes >> SECTOR_SHIFT;
+		break;
+	case REQ_OP_ZONE_RESET:
+		sdkp->zones_wp_offset[zno] = 0;
+		break;
+	case REQ_OP_ZONE_FINISH:
+		sdkp->zones_wp_offset[zno] = sd_zbc_zone_sectors(sdkp);
+		break;
+	case REQ_OP_ZONE_RESET_ALL:
+		memset(sdkp->zones_wp_offset, 0,
+		       sdkp->nr_zones * sizeof(unsigned int));
+		break;
+	default:
+		break;
+	}
+
+unlock_wp_offset:
+	spin_unlock_bh(&sdkp->zones_wp_offset_lock);
+
+	return good_bytes;
+}
+
 /**
  * sd_zbc_complete - ZBC command post processing.
  * @cmd: Completed command
  * @good_bytes: Command reply bytes
  * @sshdr: command sense header
  *
- * Called from sd_done(). Process report zones reply and handle reset zone
- * and write commands errors.
+ * Called from sd_done() to handle zone command errors and updates to the
+ * device queue zone write pointer offset cache.
  */
-void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		     struct scsi_sense_hdr *sshdr)
 {
 	int result = cmd->result;
@@ -294,7 +526,13 @@ void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		 * so be quiet about the error.
 		 */
 		rq->rq_flags |= RQF_QUIET;
-	}
+	} else if (sd_zbc_need_zone_wp_update(rq))
+		good_bytes = sd_zbc_zone_wp_update(cmd, good_bytes);
+
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		blk_req_zone_write_unlock(rq);
+
+	return good_bytes;
 }
 
 /**
@@ -396,11 +634,67 @@ static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf,
 	return 0;
 }
 
+static void sd_zbc_revalidate_zones_cb(struct gendisk *disk)
+{
+	struct scsi_disk *sdkp = scsi_disk(disk);
+
+	swap(sdkp->zones_wp_offset, sdkp->rev_wp_offset);
+}
+
+static int sd_zbc_revalidate_zones(struct scsi_disk *sdkp,
+				   u32 zone_blocks,
+				   unsigned int nr_zones)
+{
+	struct gendisk *disk = sdkp->disk;
+	int ret = 0;
+
+	/*
+	 * Make sure revalidate zones are serialized to ensure exclusive
+	 * updates of the scsi disk data.
+	 */
+	mutex_lock(&sdkp->rev_mutex);
+
+	/*
+	 * Revalidate the disk zones to update the device request queue zone
+	 * bitmaps and the zone write pointer offset array. Do this only once
+	 * the device capacity is set on the second revalidate execution for
+	 * disk scan or if something changed when executing a normal revalidate.
+	 */
+	if (sdkp->first_scan) {
+		sdkp->zone_blocks = zone_blocks;
+		sdkp->nr_zones = nr_zones;
+		goto unlock;
+	}
+
+	if (sdkp->zone_blocks == zone_blocks &&
+	    sdkp->nr_zones == nr_zones &&
+	    disk->queue->nr_zones == nr_zones)
+		goto unlock;
+
+	sdkp->rev_wp_offset = kvcalloc(nr_zones, sizeof(u32), GFP_NOIO);
+	if (!sdkp->rev_wp_offset) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	ret = blk_revalidate_disk_zones(disk, sd_zbc_revalidate_zones_cb);
+
+	kvfree(sdkp->rev_wp_offset);
+	sdkp->rev_wp_offset = NULL;
+
+unlock:
+	mutex_unlock(&sdkp->rev_mutex);
+
+	return ret;
+}
+
 int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 {
 	struct gendisk *disk = sdkp->disk;
+	struct request_queue *q = disk->queue;
 	unsigned int nr_zones;
 	u32 zone_blocks = 0;
+	u32 max_append;
 	int ret;
 
 	if (!sd_is_zoned(sdkp))
@@ -421,35 +715,31 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 		goto err;
 
 	/* The drive satisfies the kernel restrictions: set it up */
-	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, sdkp->disk->queue);
-	blk_queue_required_elevator_features(sdkp->disk->queue,
-					     ELEVATOR_F_ZBD_SEQ_WRITE);
+	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
+	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
 	nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);
 
 	/* READ16/WRITE16 is mandatory for ZBC disks */
 	sdkp->device->use_16_for_rw = 1;
 	sdkp->device->use_10_for_rw = 0;
 
+	ret = sd_zbc_revalidate_zones(sdkp, zone_blocks, nr_zones);
+	if (ret)
+		goto err;
+
 	/*
-	 * Revalidate the disk zone bitmaps once the block device capacity is
-	 * set on the second revalidate execution during disk scan and if
-	 * something changed when executing a normal revalidate.
+	 * On the first scan 'chunk_sectors' isn't set up yet, so calling
+	 * blk_queue_max_zone_append_sectors() will result in a WARN(). Defer
+	 * this setting to the second scan.
 	 */
-	if (sdkp->first_scan) {
-		sdkp->zone_blocks = zone_blocks;
-		sdkp->nr_zones = nr_zones;
+	if (sdkp->first_scan)
 		return 0;
-	}
 
-	if (sdkp->zone_blocks != zone_blocks ||
-	    sdkp->nr_zones != nr_zones ||
-	    disk->queue->nr_zones != nr_zones) {
-		ret = blk_revalidate_disk_zones(disk);
-		if (ret != 0)
-			goto err;
-		sdkp->zone_blocks = zone_blocks;
-		sdkp->nr_zones = nr_zones;
-	}
+	max_append = min_t(u32, logical_to_sectors(sdkp->device, zone_blocks),
+			   q->limits.max_segments << (PAGE_SHIFT - 9));
+	max_append = min_t(u32, max_append, queue_max_hw_sectors(q));
+
+	blk_queue_max_zone_append_sectors(q, max_append);
 
 	return 0;
 
@@ -475,3 +765,28 @@ void sd_zbc_print_zones(struct scsi_disk *sdkp)
 			  sdkp->nr_zones,
 			  sdkp->zone_blocks);
 }
+
+int sd_zbc_init_disk(struct scsi_disk *sdkp)
+{
+	if (!sd_is_zoned(sdkp))
+		return 0;
+
+	sdkp->zones_wp_offset = NULL;
+	spin_lock_init(&sdkp->zones_wp_offset_lock);
+	sdkp->rev_wp_offset = NULL;
+	mutex_init(&sdkp->rev_mutex);
+	INIT_WORK(&sdkp->zone_wp_offset_work, sd_zbc_update_wp_offset_workfn);
+	sdkp->zone_wp_update_buf = kzalloc(SD_BUF_SIZE, GFP_KERNEL);
+	if (!sdkp->zone_wp_update_buf)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void sd_zbc_release_disk(struct scsi_disk *sdkp)
+{
+	kvfree(sdkp->zones_wp_offset);
+	sdkp->zones_wp_offset = NULL;
+	kfree(sdkp->zone_wp_update_buf);
+	sdkp->zone_wp_update_buf = NULL;
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread
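For readers following the emulation logic above, the per-zone write pointer offset cache that sd_zbc_prepare_zone_append() consults and sd_zbc_zone_wp_update() maintains can be modeled in a few lines of user-space C. This is an illustrative sketch only: the sentinel values mirror the patch's SD_ZBC_INVALID_WP_OFST/SD_ZBC_UPDATING_WP_OFST, but `struct zdev`, `prep_append`, `complete_append` and the simplified status codes are invented names, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Sentinels matching the patch; everything else is invented for illustration. */
#define WP_INVALID	(~0u)
#define WP_UPDATING	(WP_INVALID - 1u)

enum status { STS_OK, STS_DEV_RESOURCE, STS_IOERR };

struct zdev {
	uint32_t zone_blocks;	/* zone size in logical blocks */
	uint32_t wp_ofst[8];	/* cached write pointer offsets, one per zone */
};

/* Like sd_zbc_prepare_zone_append(): patch *lba with the cached offset. */
static enum status prep_append(struct zdev *d, unsigned int zno,
			       uint64_t *lba, uint32_t nr_blocks)
{
	uint32_t ofst = d->wp_ofst[zno];

	if (ofst == WP_INVALID) {
		/* The real driver schedules a REPORT ZONES refresh here. */
		d->wp_ofst[zno] = WP_UPDATING;
		return STS_DEV_RESOURCE;	/* command gets requeued */
	}
	if (ofst == WP_UPDATING)
		return STS_DEV_RESOURCE;	/* refresh still in flight */
	if (ofst + nr_blocks > d->zone_blocks)
		return STS_IOERR;		/* would cross the zone end */

	*lba += ofst;
	return STS_OK;
}

/* Like sd_zbc_zone_wp_update(): advance on success, invalidate on error. */
static void complete_append(struct zdev *d, unsigned int zno,
			    uint32_t nr_blocks, int error)
{
	if (error)
		d->wp_ofst[zno] = WP_INVALID;
	else if (d->wp_ofst[zno] < d->zone_blocks)
		d->wp_ofst[zno] += nr_blocks;
}
```

Successive appends to the same zone then land at consecutive offsets, while any error forces the next append to wait for a cache refresh — the BLK_STS_DEV_RESOURCE requeue behaviour the changelog describes.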

* [PATCH v11 08/10] null_blk: Support REQ_OP_ZONE_APPEND
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (6 preceding siblings ...)
  2020-05-12  8:55 ` [PATCH v11 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
@ 2020-05-12  8:55 ` Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 09/10] block: export bio_release_pages and bio_iov_iter_get_pages Johannes Thumshirn
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Damien Le Moal,
	Hannes Reinecke, Christoph Hellwig, Johannes Thumshirn

From: Damien Le Moal <damien.lemoal@wdc.com>

Support REQ_OP_ZONE_APPEND requests for null_blk devices with zoned
mode enabled. Use the internally tracked zone write pointer position
as the actual write position and return it using the command request
__sector field in the case of an mq device and using the command BIO
sector in the case of a BIO device.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/block/null_blk_zoned.c | 37 ++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index 46641df2e58e..9c19f747f394 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -70,13 +70,20 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 
 int null_register_zoned_dev(struct nullb *nullb)
 {
+	struct nullb_device *dev = nullb->dev;
 	struct request_queue *q = nullb->q;
 
-	if (queue_is_mq(q))
-		return blk_revalidate_disk_zones(nullb->disk, NULL);
+	if (queue_is_mq(q)) {
+		int ret = blk_revalidate_disk_zones(nullb->disk, NULL);
+
+		if (ret)
+			return ret;
+	} else {
+		blk_queue_chunk_sectors(q, dev->zone_size_sects);
+		q->nr_zones = blkdev_nr_zones(nullb->disk);
+	}
 
-	blk_queue_chunk_sectors(q, nullb->dev->zone_size_sects);
-	q->nr_zones = blkdev_nr_zones(nullb->disk);
+	blk_queue_max_zone_append_sectors(q, dev->zone_size_sects);
 
 	return 0;
 }
@@ -138,7 +145,7 @@ size_t null_zone_valid_read_len(struct nullb *nullb,
 }
 
 static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
-		     unsigned int nr_sectors)
+				    unsigned int nr_sectors, bool append)
 {
 	struct nullb_device *dev = cmd->nq->dev;
 	unsigned int zno = null_zone_no(dev, sector);
@@ -158,9 +165,21 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 	case BLK_ZONE_COND_IMP_OPEN:
 	case BLK_ZONE_COND_EXP_OPEN:
 	case BLK_ZONE_COND_CLOSED:
-		/* Writes must be at the write pointer position */
-		if (sector != zone->wp)
+		/*
+		 * Regular writes must be at the write pointer position.
+		 * Zone append writes are automatically issued at the write
+		 * pointer and the position returned using the request or BIO
+		 * sector.
+		 */
+		if (append) {
+			sector = zone->wp;
+			if (cmd->bio)
+				cmd->bio->bi_iter.bi_sector = sector;
+			else
+				cmd->rq->__sector = sector;
+		} else if (sector != zone->wp) {
 			return BLK_STS_IOERR;
+		}
 
 		if (zone->cond != BLK_ZONE_COND_EXP_OPEN)
 			zone->cond = BLK_ZONE_COND_IMP_OPEN;
@@ -242,7 +261,9 @@ blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_opf op,
 {
 	switch (op) {
 	case REQ_OP_WRITE:
-		return null_zone_write(cmd, sector, nr_sectors);
+		return null_zone_write(cmd, sector, nr_sectors, false);
+	case REQ_OP_ZONE_APPEND:
+		return null_zone_write(cmd, sector, nr_sectors, true);
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_RESET_ALL:
 	case REQ_OP_ZONE_OPEN:
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread
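The behavioural difference this patch introduces between a regular write and a zone append can be condensed into a short user-space sketch. Hedged heavily: `toy_zone_write` and its int error convention are invented for illustration; the real null_zone_write() returns blk_status_t and also handles zone conditions and full zones:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for the zone state null_blk tracks internally. */
struct toy_zone {
	uint64_t start;		/* first sector of the zone */
	uint64_t len;		/* zone length in sectors */
	uint64_t wp;		/* current write pointer */
};

/*
 * Condensed model of null_zone_write(): regular writes must land exactly
 * on the write pointer; append writes are redirected to it, and the
 * chosen position is reported back through *pos (the driver stores it
 * in rq->__sector or bio->bi_iter.bi_sector). Returns 0 or -1.
 */
static int toy_zone_write(struct toy_zone *z, uint64_t sector,
			  uint32_t nr_sectors, bool append, uint64_t *pos)
{
	if (append)
		sector = z->wp;		/* device picks the position */
	else if (sector != z->wp)
		return -1;		/* unaligned regular write */

	if (sector + nr_sectors > z->start + z->len)
		return -1;		/* would exceed the zone */

	*pos = sector;
	z->wp = sector + nr_sectors;
	return 0;
}
```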

* [PATCH v11 09/10] block: export bio_release_pages and bio_iov_iter_get_pages
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (7 preceding siblings ...)
  2020-05-12  8:55 ` [PATCH v11 08/10] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-05-12  8:55 ` Johannes Thumshirn
  2020-05-12  8:55 ` [PATCH v11 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO Johannes Thumshirn
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn,
	Hannes Reinecke

Export bio_release_pages and bio_iov_iter_get_pages, so they can be used
from modular code.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/bio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 3aa3c4ce2e5e..e4c46e2bd5ba 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -951,6 +951,7 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
 		put_page(bvec->bv_page);
 	}
 }
+EXPORT_SYMBOL_GPL(bio_release_pages);
 
 static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1114,6 +1115,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		bio_set_flag(bio, BIO_NO_PAGE_REF);
 	return bio->bi_vcnt ? 0 : ret;
 }
+EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
 static void submit_bio_wait_endio(struct bio *bio)
 {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v11 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (8 preceding siblings ...)
  2020-05-12  8:55 ` [PATCH v11 09/10] block: export bio_release_pages and bio_iov_iter_get_pages Johannes Thumshirn
@ 2020-05-12  8:55 ` Johannes Thumshirn
  2020-05-12 13:17 ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
  2020-05-13  2:37 ` Jens Axboe
  11 siblings, 0 replies; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12  8:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn,
	Damien Le Moal

Synchronous direct I/O to a sequential write only zone can be issued using
the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple
BIOs can potentially result in reordering, we cannot support asynchronous
IO via this interface.

We also can only dispatch up to queue_max_zone_append_sectors() via the
new zone-append method and have to return a short write back to user-space
in case an IO larger than queue_max_zone_append_sectors() has been issued.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 fs/zonefs/super.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 72 insertions(+), 8 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 3ce9829a6936..0bf7009f50a2 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -20,6 +20,7 @@
 #include <linux/mman.h>
 #include <linux/sched/mm.h>
 #include <linux/crc32.h>
+#include <linux/task_io_accounting_ops.h>
 
 #include "zonefs.h"
 
@@ -596,6 +597,61 @@ static const struct iomap_dio_ops zonefs_write_dio_ops = {
 	.end_io			= zonefs_file_write_dio_end_io,
 };
 
+static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	unsigned int max;
+	struct bio *bio;
+	ssize_t size;
+	int nr_pages;
+	ssize_t ret;
+
+	nr_pages = iov_iter_npages(from, BIO_MAX_PAGES);
+	if (!nr_pages)
+		return 0;
+
+	max = queue_max_zone_append_sectors(bdev_get_queue(bdev));
+	max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
+	iov_iter_truncate(from, max);
+
+	bio = bio_alloc_bioset(GFP_NOFS, nr_pages, &fs_bio_set);
+	if (!bio)
+		return -ENOMEM;
+
+	bio_set_dev(bio, bdev);
+	bio->bi_iter.bi_sector = zi->i_zsector;
+	bio->bi_write_hint = iocb->ki_hint;
+	bio->bi_ioprio = iocb->ki_ioprio;
+	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
+	if (iocb->ki_flags & IOCB_DSYNC)
+		bio->bi_opf |= REQ_FUA;
+
+	ret = bio_iov_iter_get_pages(bio, from);
+	if (unlikely(ret)) {
+		bio_io_error(bio);
+		return ret;
+	}
+	size = bio->bi_iter.bi_size;
+	task_io_account_write(ret);
+
+	if (iocb->ki_flags & IOCB_HIPRI)
+		bio_set_polled(bio, iocb);
+
+	ret = submit_bio_wait(bio);
+
+	bio_put(bio);
+
+	zonefs_file_write_dio_end_io(iocb, size, ret, 0);
+	if (ret >= 0) {
+		iocb->ki_pos += size;
+		return size;
+	}
+
+	return ret;
+}
+
 /*
  * Handle direct writes. For sequential zone files, this is the only possible
  * write path. For these files, check that the user is issuing writes
@@ -611,6 +667,8 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
+	bool sync = is_sync_kiocb(iocb);
+	bool append = false;
 	size_t count;
 	ssize_t ret;
 
@@ -619,7 +677,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	 * as this can cause write reordering (e.g. the first aio gets EAGAIN
 	 * on the inode lock but the second goes through but is now unaligned).
 	 */
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !is_sync_kiocb(iocb) &&
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !sync &&
 	    (iocb->ki_flags & IOCB_NOWAIT))
 		return -EOPNOTSUPP;
 
@@ -643,16 +701,22 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	/* Enforce sequential writes (append only) in sequential zones */
-	mutex_lock(&zi->i_truncate_mutex);
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && iocb->ki_pos != zi->i_wpoffset) {
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
+		mutex_lock(&zi->i_truncate_mutex);
+		if (iocb->ki_pos != zi->i_wpoffset) {
+			mutex_unlock(&zi->i_truncate_mutex);
+			ret = -EINVAL;
+			goto inode_unlock;
+		}
 		mutex_unlock(&zi->i_truncate_mutex);
-		ret = -EINVAL;
-		goto inode_unlock;
+		append = sync;
 	}
-	mutex_unlock(&zi->i_truncate_mutex);
 
-	ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
-			   &zonefs_write_dio_ops, is_sync_kiocb(iocb));
+	if (append)
+		ret = zonefs_file_dio_append(iocb, from);
+	else
+		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
+				   &zonefs_write_dio_ops, sync);
 	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
 	    (ret > 0 || ret == -EIOCBQUEUED)) {
 		if (ret > 0)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread
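The size cap described in the changelog — truncating the iov_iter to queue_max_zone_append_sectors() aligned down to the filesystem block size, so oversized I/O comes back as a short write — reduces to a small helper. A sketch under the assumption of a power-of-two block size; `zone_append_max_bytes` is an invented name, as zonefs computes this inline in zonefs_file_dio_append():

```c
#include <stdint.h>

#define SECTOR_SHIFT 9

/*
 * Convert the queue's zone-append limit from 512-byte sectors to bytes
 * and align it down to the (power-of-two) filesystem block size. Any
 * larger I/O is truncated to this, yielding a short write to user-space.
 */
static uint64_t zone_append_max_bytes(uint64_t max_append_sectors,
				      uint64_t blocksize)
{
	uint64_t max = max_append_sectors << SECTOR_SHIFT;

	return max & ~(blocksize - 1);	/* ALIGN_DOWN() */
}
```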

* Re: [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (9 preceding siblings ...)
  2020-05-12  8:55 ` [PATCH v11 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO Johannes Thumshirn
@ 2020-05-12 13:17 ` Christoph Hellwig
       [not found]   ` <(Christoph>
  2020-05-13  2:37 ` Jens Axboe
  11 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2020-05-12 13:17 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org

The whole series looks good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

I hope we can get this in 5.8 to help with the btrfs work in 5.9.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices
       [not found]                                                   ` <-0700")>
                                                                       ` (2 preceding siblings ...)
  2020-04-01  2:29                                                     ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Martin K. Petersen
@ 2020-05-12 16:01                                                     ` Martin K. Petersen
  2020-05-12 16:04                                                       ` Christoph Hellwig
  3 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2020-05-12 16:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Johannes Thumshirn, Jens Axboe, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org


Christoph,

> The whole series looks good to me:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> I hope we can get this in 5.8 to help with the btrfs in 5.9.

Yep, I think this looks good.

I suspect this series is going to clash with my sd revalidate surgery. I
may have to stick that in a postmerge branch based on Jens' tree.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices
  2020-05-12 16:01                                                     ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Martin K. Petersen
@ 2020-05-12 16:04                                                       ` Christoph Hellwig
  2020-05-12 16:12                                                         ` Martin K. Petersen
  0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2020-05-12 16:04 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, Johannes Thumshirn, Jens Axboe, linux-block,
	Damien Le Moal, Keith Busch, linux-scsi @ vger . kernel . org,
	linux-fsdevel @ vger . kernel . org

On Tue, May 12, 2020 at 09:01:18AM -0700, Martin K. Petersen wrote:
> I suspect this series going to clash with my sd revalidate surgery. I
> may have to stick that in a postmerge branch based on Jens' tree.

Where is that series?  I don't remember any changes in that area.

> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering
---end quoted text---

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices
  2020-05-12 16:04                                                       ` Christoph Hellwig
@ 2020-05-12 16:12                                                         ` Martin K. Petersen
  2020-05-12 16:18                                                           ` Johannes Thumshirn
  0 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2020-05-12 16:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, Johannes Thumshirn, Jens Axboe, linux-block,
	Damien Le Moal, Keith Busch, linux-scsi @ vger . kernel . org,
	linux-fsdevel @ vger . kernel . org


Christoph,

> Where is that series?  I don't remember any changes in that area.

Haven't posted it yet. Still working on a few patches that address
validating reported values for devices that change parameters in flight.

The first part of the series is here:

https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=5.8/sd-revalidate

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices
  2020-05-12 16:12                                                         ` Martin K. Petersen
@ 2020-05-12 16:18                                                           ` Johannes Thumshirn
  2020-05-12 16:24                                                             ` Martin K. Petersen
  0 siblings, 1 reply; 122+ messages in thread
From: Johannes Thumshirn @ 2020-05-12 16:18 UTC (permalink / raw)
  To: Martin K. Petersen, hch
  Cc: Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org,
	linux-fsdevel @ vger . kernel . org

On 12/05/2020 18:15, Martin K. Petersen wrote:
> 
> Christoph,
> 
>> Where is that series?  I don't remember any changes in that area.
> 
> Haven't posted it yet. Still working on a few patches that address
> validating reported values for devices that change parameters in flight.
> 
> The first part of the series is here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=5.8/sd-revalidate
> 

Just did a quick skim in the gitweb and I can't see anything that will clash. So I think we're good.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices
  2020-05-12 16:18                                                           ` Johannes Thumshirn
@ 2020-05-12 16:24                                                             ` Martin K. Petersen
  0 siblings, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2020-05-12 16:24 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Martin K. Petersen, hch, Jens Axboe, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	linux-fsdevel @ vger . kernel . org


Johannes,

> Just did a quick skim in the gitweb and I can't see anything that will
> clash. So I think we're good.

The most intrusive pieces are impending. But no worries! Staging the
merge is my problem. Your series is good to go as far as I'm concerned.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices
  2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (10 preceding siblings ...)
  2020-05-12 13:17 ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
@ 2020-05-13  2:37 ` Jens Axboe
  11 siblings, 0 replies; 122+ messages in thread
From: Jens Axboe @ 2020-05-13  2:37 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On 5/12/20 2:55 AM, Johannes Thumshirn wrote:
> The upcoming NVMe ZNS Specification will define a new type of write
> command for zoned block devices, zone append.
> 
> When writing to a zoned block device using zone append, the start
> sector of the write points at the start LBA of the zone to write to.
> Upon completion the block device will respond with the position the data
> has been placed at in the zone. From a high-level perspective this can be
> seen like a file system's block allocator, where the user writes to a
> file and the file system takes care of the data placement on the device.
> 
> In order to fully exploit the new zone append command in file-systems and
> other interfaces above the block layer, we choose to emulate zone append
> in SCSI and null_blk. This way we can have a single write path for both
> file-systems and other interfaces above the block-layer, like io_uring on
> zoned block devices, without having to care too much about the underlying
> characteristics of the device itself.
> 
> The emulation works by providing a cache of each zone's write pointer, so
> zone append issued to the disk can be translated to a write with a
> starting LBA of the write pointer. This LBA is used as input zone number
> for the write pointer lookup in the zone write pointer offset cache and
> the cached offset is then added to the LBA to get the actual position to
> write the data. In SCSI we then turn the REQ_OP_ZONE_APPEND request into a
> WRITE(16) command. Upon successful completion of the WRITE(16), the cache
> will be updated to the new write pointer location and the written sector
> will be noted in the request. On error the cache entry will be marked as
> invalid and on the next write an update of the write pointer will be
> scheduled, before issuing the actual write.
> 
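[Editor's note: a minimal userspace sketch of the cached write-pointer-offset translation described above. The zone geometry and all names (emulate_zone_append, wp_ofst, WP_INVALID) are hypothetical illustrations, not the actual kernel code.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: the per-zone cache holds only the write pointer's
 * offset from the zone start; everything else is calculated. */
#define ZONE_SECTORS 65536u   /* example zone size in 512-byte sectors */
#define NR_ZONES     4u
#define WP_INVALID   UINT32_MAX

static uint32_t wp_ofst[NR_ZONES]; /* cached write pointer offsets */

/* Translate a zone append aimed at a zone's start LBA into the absolute
 * LBA a WRITE(16) would target; advance the cached offset on success.
 * Returns -1 when the cache entry is invalid or the zone would overflow. */
static int64_t emulate_zone_append(uint64_t zone_start_lba, uint32_t nr_sects)
{
	uint32_t zno = (uint32_t)(zone_start_lba / ZONE_SECTORS);

	if (zno >= NR_ZONES || wp_ofst[zno] == WP_INVALID)
		return -1; /* real code schedules a write pointer update */
	if (wp_ofst[zno] + nr_sects > ZONE_SECTORS)
		return -1; /* append would cross the zone boundary */

	uint64_t lba = zone_start_lba + wp_ofst[zno];
	wp_ofst[zno] += nr_sects; /* mimic cache update on completion */
	return (int64_t)lba;
}
```

Two back-to-back appends to the same zone land at offsets 0 and 8, mirroring how the written sector is reported back to the caller.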
> In order to reduce memory consumption, the only cached item is the offset
> of the write pointer from the start of the zone, everything else can be
> calculated. On an example drive with 52156 zones, the additional memory
> consumption of the cache is thus 52156 * 4 = 208624 Bytes or 51 4k Byte
> pages. The performance impact is negligible for a spinning drive.
> 
> For null_blk the emulation is way simpler, as null_blk's zoned block
> device emulation support already caches the write pointer position, so we
> only need to report the position back to the upper layers. Additional
> caching is not needed here.
> 
> Furthermore we have converted zonefs to use ZONE_APPEND for synchronous
> direct I/Os. Asynchronous I/O still uses the normal path via iomap.
> 
> Performance testing with zonefs sync writes on a 14 TB SMR drive and nullblk
> shows good results. On the SMR drive we're not regressing (the performance
> improvement is within noise), on nullblk we could drastically improve specific
> workloads:
> 
> * nullblk:
> 
> Single Thread Multiple Zones
> 				kIOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	10.1	631	662
> mq-deadline REQ_OP_ZONE_APPEND	13.2	828	868	+31.12
> none REQ_OP_ZONE_APPEND		15.6	978	1026	+54.98
> 
> 
> Multiple Threads Multiple Zones
> 				kIOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	10.2	640	671
> mq-deadline REQ_OP_ZONE_APPEND	10.4	650	681	+1.49
> none REQ_OP_ZONE_APPEND		16.9	1058	1109	+65.28
> 
> * 14 TB SMR drive
> 
> Single Thread Multiple Zones
> 				IOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	797	49.9	52.3
> mq-deadline REQ_OP_ZONE_APPEND	806	50.4	52.9	+1.15
> 
> Multiple Threads Multiple Zones
> 				kIOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	745	46.6	48.9
> mq-deadline REQ_OP_ZONE_APPEND	768	48	50.3	+2.86
> 
> The %-delta is against the baseline of REQ_OP_WRITE using mq-deadline as I/O
> scheduler.
> 
> The series is based on Jens' for-5.8/block branch with HEAD:
> ae979182ebb3 ("bdi: fix up for "remove the name field in struct backing_dev_info"")

Applied for 5.8, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 0/2] two generic block layer fixes for 5.9
@ 2020-07-13 12:35 Coly Li
  2020-07-13 12:35 ` [PATCH 1/2] block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers Coly Li
  2020-07-13 12:35 ` [PATCH 2/2] block: improve discard bio alignment in __blkdev_issue_discard() Coly Li
  0 siblings, 2 replies; 122+ messages in thread
From: Coly Li @ 2020-07-13 12:35 UTC (permalink / raw)
  To: axboe, linux-block; +Cc: Coly Li

Hi Jens,

These two patches have been posted for a while, and have been reviewed by
several other developers. Could you please take them for Linux v5.9?

Thanks in advance.

Coly Li
---

Coly Li (2):
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd
    numbers
  block: improve discard bio alignment in __blkdev_issue_discard()

 block/blk-lib.c           | 25 +++++++++++++++++++++++--
 block/blk.h               | 14 ++++++++++++++
 include/linux/blk_types.h |  8 ++++----
 3 files changed, 41 insertions(+), 6 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 1/2] block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  2020-07-13 12:35 [PATCH 0/2] two generic block layer fixes for 5.9 Coly Li
@ 2020-07-13 12:35 ` Coly Li
  2020-07-13 23:12   ` Damien Le Moal
  2020-07-13 12:35 ` [PATCH 2/2] block: improve discard bio alignment in __blkdev_issue_discard() Coly Li
  1 sibling, 1 reply; 122+ messages in thread
From: Coly Li @ 2020-07-13 12:35 UTC (permalink / raw)
  To: axboe, linux-block
  Cc: Coly Li, Damien Le Moal, Chaitanya Kulkarni, Christoph Hellwig,
	Hannes Reinecke, Jens Axboe, Johannes Thumshirn, Keith Busch,
	Shaun Tancheff

Currently REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are defined as
even numbers 6 and 8, such zone reset bios are treated as READ bios by
bio_data_dir(), which is obviously misleading.

The macro bio_data_dir() is defined in include/linux/bio.h as,
 55 #define bio_data_dir(bio) \
 56         (op_is_write(bio_op(bio)) ? WRITE : READ)

And op_is_write() is defined in include/linux/blk_types.h as,
397 static inline bool op_is_write(unsigned int op)
398 {
399         return (op & 1);
400 }

The convention of op_is_write() is that when there is data transfer, the
op code should be an odd number and is treated as a write op.
bio_data_dir() treats a bio's direction as READ if op_is_write() reports
false, and WRITE if op_is_write() reports true.
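[Editor's note: the parity convention can be checked with a tiny standalone model; op_is_write() below mirrors the kernel snippet quoted above, and the op values are the ones discussed in this patch.]

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the kernel's op_is_write() quoted above:
 * odd op codes are treated as writes by bio_data_dir(). */
static bool op_is_write(unsigned int op)
{
	return (op & 1);
}
```

With the old values, op_is_write(6) and op_is_write(8) are false, so zone resets looked like READs; with the new values 15 and 17 they are classified as writes, like REQ_OP_DISCARD (3) and REQ_OP_WRITE_ZEROES (9).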

Because REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are even numbers,
although they don't transfer data, reporting them as READ bios by
bio_data_dir() is misleading and might be wrong. These two commands
reset the write pointers of the target zones, and all content after the
reset write pointer becomes invalid and inaccessible, so obviously they
are not READ bios by any means.

This patch changes REQ_OP_ZONE_RESET from 6 to 15, and changes
REQ_OP_ZONE_RESET_ALL from 8 to 17. Now bios with these two op code
can be treated as WRITE by bio_data_dir(). Although they don't transfer
data, now we keep them consistent with REQ_OP_DISCARD and
REQ_OP_WRITE_ZEROES with the ituition that they change on-media content
and should be WRITE request.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@fb.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
---
 include/linux/blk_types.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ccb895f911b1..447b46a0accf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -300,12 +300,8 @@ enum req_opf {
 	REQ_OP_DISCARD		= 3,
 	/* securely erase sectors */
 	REQ_OP_SECURE_ERASE	= 5,
-	/* reset a zone write pointer */
-	REQ_OP_ZONE_RESET	= 6,
 	/* write the same sector many times */
 	REQ_OP_WRITE_SAME	= 7,
-	/* reset all the zone present on the device */
-	REQ_OP_ZONE_RESET_ALL	= 8,
 	/* write the zero filled sector many times */
 	REQ_OP_WRITE_ZEROES	= 9,
 	/* Open a zone */
@@ -316,6 +312,10 @@ enum req_opf {
 	REQ_OP_ZONE_FINISH	= 12,
 	/* write data at the current zone write pointer */
 	REQ_OP_ZONE_APPEND	= 13,
+	/* reset a zone write pointer */
+	REQ_OP_ZONE_RESET	= 15,
+	/* reset all the zone present on the device */
+	REQ_OP_ZONE_RESET_ALL	= 17,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH 2/2] block: improve discard bio alignment in __blkdev_issue_discard()
  2020-07-13 12:35 [PATCH 0/2] two generic block layer fixes for 5.9 Coly Li
  2020-07-13 12:35 ` [PATCH 1/2] block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers Coly Li
@ 2020-07-13 12:35 ` Coly Li
       [not found]   ` <(Coly>
  1 sibling, 1 reply; 122+ messages in thread
From: Coly Li @ 2020-07-13 12:35 UTC (permalink / raw)
  To: axboe, linux-block
  Cc: Coly Li, Acshai Manoj, Hannes Reinecke, Ming Lei, Xiao Ni,
	Bart Van Assche, Christoph Hellwig, Enzo Matsumiya

This patch improves discard bio splitting for address and size alignment
in __blkdev_issue_discard(). The aligned discard bios may help the
underlying device controller perform better discard and internal garbage
collection, and avoid unnecessary internal fragmentation.

The current discard bio split algorithm in __blkdev_issue_discard() may
leave non-discarded fragments on the device even when the discard bio LBA
and size are both aligned to the device's discard granularity size.

Here are the example steps to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
  with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the virtual disk shows up as
  /dev/sdb, fill data into the first 50GB by,
        # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
        # blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
        # sg_get_lba_status /dev/sdb -m 1048 --lba=0
  descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
  descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
  descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

Although the discard bio starts at LBA 0 and has a size of 50<<30 bytes,
both perfectly aligned to the discard granularity, the above list shows
that many unexpected 1MB (2048-sector) internal fragments exist.

The problem is in __blkdev_issue_discard(): an improper split algorithm
produces bio sizes which are not aligned to the discard granularity.

 25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
 27                 struct bio **biop)
 28 {
 29         struct request_queue *q = bdev_get_queue(bdev);
   [snipped]
 56
 57         while (nr_sects) {
 58                 sector_t req_sects = min_t(sector_t, nr_sects,
 59                                 bio_allowed_max_sectors(q));
 60
 61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 62
 63                 bio = blk_next_bio(bio, 0, gfp_mask);
 64                 bio->bi_iter.bi_sector = sector;
 65                 bio_set_dev(bio, bdev);
 66                 bio_set_op_attrs(bio, op, 0);
 67
 68                 bio->bi_iter.bi_size = req_sects << 9;
 69                 sector += req_sects;
 70                 nr_sects -= req_sects;
   [snipped]
 79         }
 80
 81         *biop = bio;
 82         return 0;
 83 }
 84 EXPORT_SYMBOL(__blkdev_issue_discard);

At lines 58-59, to discard a 50GB range, req_sects is set to the return
value of bio_allowed_max_sectors(q), which is 8388607 sectors. In the
above case the discard granularity is 2048 sectors, so although the start
LBA and discard length are aligned to the discard granularity, req_sects
never has a chance to be aligned to it. This is why there are still-mapped
2048-sector fragments in every 4 or 8 GB range.

If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UINT_MAX, then all subsequent split bios inside the device
driver are (almost) aligned to the discard_granularity of the device
queue, and the still-mapped 2048-sector fragments will disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
value which is aligned to q->limits.discard_granularity and closest to
UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with this
new routine to decide a more proper split bio length.

But we still need to handle the situation when the discard start LBA is
not aligned to q->limits.discard_granularity; otherwise, even when the
length is aligned, the current code may still leave a 2048-sector
fragment in every 4GB range. Therefore, to calculate req_sects, the
start LBA of the discard range is checked first: if it is not aligned to
the discard granularity, the first split location makes sure the
following bio has bi_sector aligned to the discard granularity. Then
there won't be still-mapped fragments in the middle of the discard range.

The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after a discard with the
same command line mentioned previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

We can see there is no 2048-sector segment anymore; everything is clean.

Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
---
 block/blk-lib.c | 25 +++++++++++++++++++++++--
 block/blk.h     | 14 ++++++++++++++
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..7bffdee63a20 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -55,8 +55,29 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		return -EINVAL;
 
 	while (nr_sects) {
-		sector_t req_sects = min_t(sector_t, nr_sects,
-				bio_allowed_max_sectors(q));
+		sector_t granularity_aligned_lba;
+		sector_t req_sects;
+
+		granularity_aligned_lba = round_up(sector,
+				q->limits.discard_granularity >> SECTOR_SHIFT);
+
+		/*
+		 * Check whether the discard bio starts at a discard_granularity
+		 * aligned LBA,
+		 * - If no: set (granularity_aligned_lba - sector) to bi_size of
+		 *   the first split bio, then the second bio will start at a
+		 *   discard_granularity aligned LBA.
+		 * - If yes: use bio_aligned_discard_max_sectors() as the max
+		 *   possible bi_size of the first split bio. Then when this bio
+		 *   is split in the device driver, the split bios are very
+		 *   likely aligned to discard_granularity of the device's queue.
+		 */
+		if (granularity_aligned_lba == sector)
+			req_sects = min_t(sector_t, nr_sects,
+					  bio_aligned_discard_max_sectors(q));
+		else
+			req_sects = min_t(sector_t, nr_sects,
+					  granularity_aligned_lba - sector);
 
 		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 
diff --git a/block/blk.h b/block/blk.h
index b5d1f0fc6547..a80738581f84 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -281,6 +281,20 @@ static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
 	return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
 }
 
+/*
+ * The max bio size which is aligned to q->limits.discard_granularity. This
+ * is a hint for splitting large discard bios in the generic block layer; then
+ * if the device driver needs to split the discard bio into smaller ones, their
+ * bi_size can very probably and easily be aligned to discard_granularity of
+ * the device's queue.
+ */
+static inline unsigned int bio_aligned_discard_max_sectors(
+					struct request_queue *q)
+{
+	return round_down(UINT_MAX, q->limits.discard_granularity) >>
+			SECTOR_SHIFT;
+}
+
 /*
  * Internal io_context interface
  */
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH 2/2] block: improve discard bio alignment in __blkdev_issue_discard()
       [not found]                                                   ` <+0800")>
@ 2020-07-13 16:47                                                     ` Martin K. Petersen
  2020-07-13 17:50                                                       ` Coly Li
  0 siblings, 1 reply; 122+ messages in thread
From: Martin K. Petersen @ 2020-07-13 16:47 UTC (permalink / raw)
  To: Coly Li
  Cc: axboe, linux-block, Acshai Manoj, Hannes Reinecke, Ming Lei,
	Xiao Ni, Bart Van Assche, Christoph Hellwig, Enzo Matsumiya


Hi Coly!

> This patch improves discard bio splitting for address and size alignment
> in __blkdev_issue_discard(). The aligned discard bios may help the
> underlying device controller perform better discard and internal garbage
> collection, and avoid unnecessary internal fragmentation.

If the aim is to guarantee that all discard requests, except for head
and tail, are aligned multiples of the discard_granularity, you also
need to take the discard_alignment queue limit and the partition offset
into consideration.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 2/2] block: improve discard bio alignment in __blkdev_issue_discard()
  2020-07-13 16:47                                                     ` [PATCH 2/2] block: improve discard bio alignment in __blkdev_issue_discard() Martin K. Petersen
@ 2020-07-13 17:50                                                       ` Coly Li
  0 siblings, 0 replies; 122+ messages in thread
From: Coly Li @ 2020-07-13 17:50 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: axboe, linux-block, Acshai Manoj, Hannes Reinecke, Ming Lei,
	Xiao Ni, Bart Van Assche, Christoph Hellwig, Enzo Matsumiya

On 2020/7/14 00:47, Martin K. Petersen wrote:
> 
> Hi Coly!
> 
>> This patch improves discard bio splitting for address and size alignment
>> in __blkdev_issue_discard(). The aligned discard bios may help the
>> underlying device controller perform better discard and internal garbage
>> collection, and avoid unnecessary internal fragmentation.
> 

Hi Martin,

> If the aim is to guarantee that all discard requests, except for head
> and tail, are aligned multiples of the discard_granularity, you also
> need to take the discard_alignment queue limit and the partition offset
> into consideration.
> 

The discard_granularity was considered, and my thought is:
discard_alignment is normally a multiple of the discard granularity, so
if the discard bio bi_sector and bi_size are aligned to the discard
granularity, then when the underlying driver splits its discard bio by
discard_alignment, the split bio bi_sector and bi_size can still be
aligned to the discard granularity. Another reason I don't handle
discard_alignment in __blkdev_issue_discard() is performance: handling
the discard_alignment bio split in the driver's split loop lets
__blkdev_issue_discard() call blk_next_bio() much less often, and is
more friendly to the memory and cache footprint.

For the partition offset, my original idea was to suggest that the
partition or dm target start at an offset of 2048 sectors. But your
suggestion sounds better: by taking the partition or target offset into
consideration, even users who misconfigured the partition offset may
also gain the benefit of discard alignment.

Let me try to improve a v3 patch to handle the partition offset too.

Thanks for the cool idea :-)

Coly Li


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 1/2] block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  2020-07-13 12:35 ` [PATCH 1/2] block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers Coly Li
@ 2020-07-13 23:12   ` Damien Le Moal
  0 siblings, 0 replies; 122+ messages in thread
From: Damien Le Moal @ 2020-07-13 23:12 UTC (permalink / raw)
  To: Coly Li, axboe, linux-block
  Cc: Chaitanya Kulkarni, Christoph Hellwig, Hannes Reinecke,
	Jens Axboe, Johannes Thumshirn, Keith Busch, Shaun Tancheff

On 2020/07/13 21:35, Coly Li wrote:
> Currently REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are defined as
> even numbers 6 and 8, such zone reset bios are treated as READ bios by
> bio_data_dir(), which is obviously misleading.
> 
> The macro bio_data_dir() is defined in include/linux/bio.h as,
>  55 #define bio_data_dir(bio) \
>  56         (op_is_write(bio_op(bio)) ? WRITE : READ)
> 
> And op_is_write() is defined in include/linux/blk_types.h as,
> 397 static inline bool op_is_write(unsigned int op)
> 398 {
> 399         return (op & 1);
> 400 }
> 
> The convention of op_is_write() is that when there is data transfer, the
> op code should be an odd number and is treated as a write op.
> bio_data_dir() treats a bio's direction as READ if op_is_write() reports
> false, and WRITE if op_is_write() reports true.
> 
> Because REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are even numbers,
> although they don't transfer data, reporting them as READ bios by
> bio_data_dir() is misleading and might be wrong. These two commands
> reset the write pointers of the target zones, and all content after the
> reset write pointer becomes invalid and inaccessible, so obviously they
> are not READ bios by any means.
> 
> This patch changes REQ_OP_ZONE_RESET from 6 to 15, and changes
> REQ_OP_ZONE_RESET_ALL from 8 to 17. Now bios with these two op code
> can be treated as WRITE by bio_data_dir(). Although they don't transfer
> data, now we keep them consistent with REQ_OP_DISCARD and
> REQ_OP_WRITE_ZEROES with the ituition that they change on-media content

s/ituition/assumption

> and should be WRITE request.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
> ---
>  include/linux/blk_types.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index ccb895f911b1..447b46a0accf 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -300,12 +300,8 @@ enum req_opf {
>  	REQ_OP_DISCARD		= 3,
>  	/* securely erase sectors */
>  	REQ_OP_SECURE_ERASE	= 5,
> -	/* reset a zone write pointer */
> -	REQ_OP_ZONE_RESET	= 6,
>  	/* write the same sector many times */
>  	REQ_OP_WRITE_SAME	= 7,
> -	/* reset all the zone present on the device */
> -	REQ_OP_ZONE_RESET_ALL	= 8,
>  	/* write the zero filled sector many times */
>  	REQ_OP_WRITE_ZEROES	= 9,
>  	/* Open a zone */
> @@ -316,6 +312,10 @@ enum req_opf {
>  	REQ_OP_ZONE_FINISH	= 12,
>  	/* write data at the current zone write pointer */
>  	REQ_OP_ZONE_APPEND	= 13,
> +	/* reset a zone write pointer */
> +	REQ_OP_ZONE_RESET	= 15,
> +	/* reset all the zone present on the device */
> +	REQ_OP_ZONE_RESET_ALL	= 17,
>  
>  	/* SCSI passthrough using struct scsi_request */
>  	REQ_OP_SCSI_IN		= 32,
> 

Looks good to me. Zone reset is very similar to a discard, albeit stronger (zone
reset is not a hint). So defining these ops as having the same data dir makes sense.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH] lpfc: Correct null ndlp reference on routine exit
@ 2020-11-30 18:12 James Smart
       [not found] ` <(James>
  2020-12-08  4:52 ` Martin K. Petersen
  0 siblings, 2 replies; 122+ messages in thread
From: James Smart @ 2020-11-30 18:12 UTC (permalink / raw)
  To: linux-scsi; +Cc: James Smart, Dan Carpenter

[-- Attachment #1: Type: text/plain, Size: 1470 bytes --]

smatch correctly called out a logic error with accessing a pointer after
checking it for null.
 drivers/scsi/lpfc/lpfc_els.c:2043 lpfc_cmpl_els_plogi()
 error: we previously assumed 'ndlp' could be null (see line 1942)

Adjust the exit point to avoid the trace printf ndlp reference. A trace
entry was already generated when the ndlp was checked for null.

Fixes: 4430f7fd09ec ("scsi: lpfc: Rework locations of ndlp reference taking")
Signed-off-by: James Smart <james.smart@broadcom.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 drivers/scsi/lpfc/lpfc_els.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/lpfc/lpfc_els.c b/drivers/scsi/lpfc/lpfc_els.c
index fd5c581cd67a..96c087b8b474 100644
--- a/drivers/scsi/lpfc/lpfc_els.c
+++ b/drivers/scsi/lpfc/lpfc_els.c
@@ -1946,7 +1946,7 @@ lpfc_cmpl_els_plogi(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
 				 irsp->un.elsreq64.remoteID,
 				 irsp->ulpStatus, irsp->un.ulpWord[4],
 				 irsp->ulpIoTag);
-		goto out;
+		goto out_freeiocb;
 	}
 
 	/* Since ndlp can be freed in the disc state machine, note if this node
@@ -2042,6 +2042,7 @@ lpfc_cmpl_els_plogi(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
 			      "PLOGI Cmpl PUT:     did:x%x refcnt %d",
 			      ndlp->nlp_DID, kref_read(&ndlp->kref), 0);
 
+out_freeiocb:
 	/* Release the reference on the original I/O request. */
 	free_ndlp = (struct lpfc_nodelist *)cmdiocb->context1;
 
-- 
2.26.2


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4163 bytes --]

^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH] lpfc: Correct null ndlp reference on routine exit
       [not found]                                                   ` <-0800")>
@ 2020-12-01  5:19                                                     ` Martin K. Petersen
  0 siblings, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2020-12-01  5:19 UTC (permalink / raw)
  To: James Smart; +Cc: linux-scsi, Dan Carpenter


James,

> smatch correctly called out a logic error with accessing a pointer after
> checking it for null.
>  drivers/scsi/lpfc/lpfc_els.c:2043 lpfc_cmpl_els_plogi()
>  error: we previously assumed 'ndlp' could be null (see line 1942)
>
> Adjust the exit point to avoid the trace printf ndlp reference. A trace
> entry was already generated when the ndlp was checked for null.

Applied to 5.11/scsi-staging, thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 0/2] two UFS changes
@ 2020-12-07 19:01 ` Bean Huo
       [not found]   ` <(Bean>
                     ` (3 more replies)
  0 siblings, 4 replies; 122+ messages in thread
From: Bean Huo @ 2020-12-07 19:01 UTC (permalink / raw)
  To: alim.akhtar, avri.altman, asutoshd, jejb, martin.petersen,
	stanley.chu, beanhuo, bvanassche, tomas.winkler, cang
  Cc: linux-scsi, linux-kernel

From: Bean Huo <beanhuo@micron.com>



Bean Huo (2):
  scsi: ufs: Remove an unused macro definition POWER_DESC_MAX_SIZE
  scsi: ufs: Fix wrong print message in dev_err()

 drivers/scsi/ufs/ufs.h    | 1 -
 drivers/scsi/ufs/ufshcd.c | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 1/2] scsi: ufs: Remove an unused macro definition POWER_DESC_MAX_SIZE
  2020-12-07 19:01 ` [PATCH 0/2] two UFS changes Bean Huo
       [not found]   ` <(Bean>
@ 2020-12-07 19:01   ` Bean Huo
  2020-12-08  7:52     ` Avri Altman
  2020-12-07 19:01   ` [PATCH 2/2] scsi: ufs: Fix wrong print message in dev_err() Bean Huo
  2020-12-08  2:57   ` [PATCH 0/2] two UFS changes Alim Akhtar
  3 siblings, 1 reply; 122+ messages in thread
From: Bean Huo @ 2020-12-07 19:01 UTC (permalink / raw)
  To: alim.akhtar, avri.altman, asutoshd, jejb, martin.petersen,
	stanley.chu, beanhuo, bvanassche, tomas.winkler, cang
  Cc: linux-scsi, linux-kernel

From: Bean Huo <beanhuo@micron.com>

POWER_DESC_MAX_SIZE is unused; remove it.

Signed-off-by: Bean Huo <beanhuo@micron.com>
---
 drivers/scsi/ufs/ufs.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/scsi/ufs/ufs.h b/drivers/scsi/ufs/ufs.h
index 311d5f7a024d..527ba5c00097 100644
--- a/drivers/scsi/ufs/ufs.h
+++ b/drivers/scsi/ufs/ufs.h
@@ -330,7 +330,6 @@ enum {
 	UFS_DEV_WRITE_BOOSTER_SUP	= BIT(8),
 };
 
-#define POWER_DESC_MAX_SIZE			0x62
 #define POWER_DESC_MAX_ACTV_ICC_LVLS		16
 
 /* Attribute  bActiveICCLevel parameter bit masks definitions */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH 2/2] scsi: ufs: Fix wrong print message in dev_err()
  2020-12-07 19:01 ` [PATCH 0/2] two UFS changes Bean Huo
       [not found]   ` <(Bean>
  2020-12-07 19:01   ` [PATCH 1/2] scsi: ufs: Remove an unused macro definition POWER_DESC_MAX_SIZE Bean Huo
@ 2020-12-07 19:01   ` Bean Huo
  2020-12-08  7:53     ` Avri Altman
  2020-12-08  2:57   ` [PATCH 0/2] two UFS changes Alim Akhtar
  3 siblings, 1 reply; 122+ messages in thread
From: Bean Huo @ 2020-12-07 19:01 UTC (permalink / raw)
  To: alim.akhtar, avri.altman, asutoshd, jejb, martin.petersen,
	stanley.chu, beanhuo, bvanassche, tomas.winkler, cang
  Cc: linux-scsi, linux-kernel

From: Bean Huo <beanhuo@micron.com>

Change dev_err() print message from "dme-reset" to "dme_enable" in function
ufshcd_dme_enable().

Signed-off-by: Bean Huo <beanhuo@micron.com>
---
 drivers/scsi/ufs/ufshcd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 5b2219e44743..f8f5eddad506 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -3645,7 +3645,7 @@ static int ufshcd_dme_enable(struct ufs_hba *hba)
 	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
 	if (ret)
 		dev_err(hba->dev,
-			"dme-reset: error code %d\n", ret);
+			"dme-enable: error code %d\n", ret);
 
 	return ret;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* RE: [PATCH 0/2] two UFS changes
  2020-12-07 19:01 ` [PATCH 0/2] two UFS changes Bean Huo
                     ` (2 preceding siblings ...)
  2020-12-07 19:01   ` [PATCH 2/2] scsi: ufs: Fix wrong print message in dev_err() Bean Huo
@ 2020-12-08  2:57   ` Alim Akhtar
  3 siblings, 0 replies; 122+ messages in thread
From: Alim Akhtar @ 2020-12-08  2:57 UTC (permalink / raw)
  To: 'Bean Huo',
	avri.altman, asutoshd, jejb, martin.petersen, stanley.chu,
	beanhuo, bvanassche, tomas.winkler, cang
  Cc: linux-scsi, linux-kernel

Hi Bean,

> -----Original Message-----
> From: Bean Huo <huobean@gmail.com>
> Sent: 08 December 2020 00:32
> To: alim.akhtar@samsung.com; avri.altman@wdc.com;
> asutoshd@codeaurora.org; jejb@linux.ibm.com;
> martin.petersen@oracle.com; stanley.chu@mediatek.com;
> beanhuo@micron.com; bvanassche@acm.org; tomas.winkler@intel.com;
> cang@codeaurora.org
> Cc: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH 0/2] two UFS changes
> 
> From: Bean Huo <beanhuo@micron.com>
> 
> 
> 
> Bean Huo (2):
>   scsi: ufs: Remove an unused macro definition POWER_DESC_MAX_SIZE
>   scsi: ufs: Fix wrong print message in dev_err()
> 
>  drivers/scsi/ufs/ufs.h    | 1 -
>  drivers/scsi/ufs/ufshcd.c | 2 +-
>  2 files changed, 1 insertion(+), 2 deletions(-)
> 
Thanks!
Acked-by: Alim Akhtar <alim.akhtar@samsung.com>

> --
> 2.17.1



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] lpfc: Correct null ndlp reference on routine exit
  2020-11-30 18:12 [PATCH] lpfc: Correct null ndlp reference on routine exit James Smart
@ 2020-12-08  4:52 ` Martin K. Petersen
  1 sibling, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2020-12-08  4:52 UTC (permalink / raw)
  To: James Smart, linux-scsi; +Cc: Martin K . Petersen, Dan Carpenter

On Mon, 30 Nov 2020 10:12:26 -0800, James Smart wrote:

> smatch correctly flagged a logic error: the pointer is dereferenced after
> it has already been checked for null.
>  drivers/scsi/lpfc/lpfc_els.c:2043 lpfc_cmpl_els_plogi()
>  error: we previously assumed 'ndlp' could be null (see line 1942)
> 
> Adjust the exit point to avoid the trace printf ndlp reference. A trace
> entry was already generated when the ndlp was checked for null.

Applied to 5.11/scsi-queue, thanks!

[1/1] lpfc: Correct null ndlp reference on routine exit
      https://git.kernel.org/mkp/scsi/c/9d8de441db26

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

* RE: [PATCH 1/2] scsi: ufs: Remove an unused macro definition POWER_DESC_MAX_SIZE
  2020-12-07 19:01   ` [PATCH 1/2] scsi: ufs: Remove an unused macro definition POWER_DESC_MAX_SIZE Bean Huo
@ 2020-12-08  7:52     ` Avri Altman
  0 siblings, 0 replies; 122+ messages in thread
From: Avri Altman @ 2020-12-08  7:52 UTC (permalink / raw)
  To: Bean Huo, alim.akhtar, asutoshd, jejb, martin.petersen,
	stanley.chu, beanhuo, bvanassche, tomas.winkler, cang
  Cc: linux-scsi, linux-kernel

 
> From: Bean Huo <beanhuo@micron.com>
> 
> POWER_DESC_MAX_SIZE has no users; remove it.
> 
> Signed-off-by: Bean Huo <beanhuo@micron.com>
Acked-by: Avri Altman <avri.altman@wdc.com>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* RE: [PATCH 2/2] scsi: ufs: Fix wrong print message in dev_err()
  2020-12-07 19:01   ` [PATCH 2/2] scsi: ufs: Fix wrong print message in dev_err() Bean Huo
@ 2020-12-08  7:53     ` Avri Altman
  0 siblings, 0 replies; 122+ messages in thread
From: Avri Altman @ 2020-12-08  7:53 UTC (permalink / raw)
  To: Bean Huo, alim.akhtar, asutoshd, jejb, martin.petersen,
	stanley.chu, beanhuo, bvanassche, tomas.winkler, cang
  Cc: linux-scsi, linux-kernel

> From: Bean Huo <beanhuo@micron.com>
> 
> Change the dev_err() print message in ufshcd_dme_enable() from "dme-reset"
> to "dme-enable".
> 
> Signed-off-by: Bean Huo <beanhuo@micron.com>
Acked-by: Avri Altman <avri.altman@wdc.com>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 0/2] two UFS changes
  2015-12-08  1:12                                                     ` [PATCH] scsi_dh_alua: Remove stale variables Martin K. Petersen
@ 2020-12-09  2:17                                                     ` Martin K. Petersen
  1 sibling, 0 replies; 122+ messages in thread
From: Martin K. Petersen @ 2020-12-09  2:17 UTC (permalink / raw)
  To: Bean Huo
  Cc: alim.akhtar, avri.altman, asutoshd, jejb, martin.petersen,
	stanley.chu, beanhuo, bvanassche, tomas.winkler, cang,
	linux-scsi, linux-kernel


Bean,

> Bean Huo (2):
>   scsi: ufs: Remove an unused macro definition POWER_DESC_MAX_SIZE
>   scsi: ufs: Fix wrong print message in dev_err()

Applied to 5.11/scsi-staging, thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 122+ messages in thread

end of thread, other threads:[~2020-12-09  2:20 UTC | newest]

Thread overview: 122+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20201207190149epcas5p2d877f4e3f6d31548d97f9b486d243a05@epcas5p2.samsung.com>
2020-12-07 19:01 ` [PATCH 0/2] two UFS changes Bean Huo
2020-12-07 19:01   ` [PATCH 1/2] scsi: ufs: Remove an unused macro definition POWER_DESC_MAX_SIZE Bean Huo
2020-12-08  7:52     ` Avri Altman
2020-12-07 19:01   ` [PATCH 2/2] scsi: ufs: Fix wrong print message in dev_err() Bean Huo
2020-12-08  7:53     ` Avri Altman
2020-12-08  2:57   ` [PATCH 0/2] two UFS changes Alim Akhtar
2020-11-30 18:12 [PATCH] lpfc: Correct null ndlp reference on routine exit James Smart
2020-12-08  4:52 ` Martin K. Petersen
  -- strict thread matches above, loose matches on Subject: below --
2020-07-13 12:35 [PATCH 0/2] two generic block layer fixes for 5.9 Coly Li
2020-07-13 12:35 ` [PATCH 1/2] block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers Coly Li
2020-07-13 23:12   ` Damien Le Moal
2020-07-13 12:35 ` [PATCH 2/2] block: improve discard bio alignment in __blkdev_issue_discard() Coly Li
2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 02/10] block: rename __bio_add_pc_page to bio_add_hw_page Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 03/10] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 04/10] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 05/10] block: Modify revalidate zones Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 06/10] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 08/10] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 09/10] block: export bio_release_pages and bio_iov_iter_get_pages Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO Johannes Thumshirn
2020-05-12 13:17 ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
2020-05-13  2:37 ` Jens Axboe
2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 1/4] block: create payloadless issue bio helper Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 2/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 3/4] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 4/4] ext4: Notify block device about alloc-assigned blk Chaitanya Kulkarni
2020-04-01  6:22 ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Konstantin Khlebnikov
2020-04-02  2:29   ` Martin K. Petersen
2020-04-02  9:49     ` Konstantin Khlebnikov
2020-04-02 22:41 ` Dave Chinner
2020-04-03  1:34   ` Martin K. Petersen
2020-04-03  2:57     ` Dave Chinner
2018-11-06 19:48 linux-next: Signed-off-by missing for commit in the scsi-fixes tree Stephen Rothwell
2015-12-03  6:57 [PATCH] scsi_dh_alua: Remove stale variables Hannes Reinecke
2015-12-03  9:23 ` Johannes Thumshirn
2015-12-03 16:43 ` Christoph Hellwig
2013-06-03  3:57 RAID-10 keeps aborting H. Peter Anvin
2013-06-03  4:05 ` H. Peter Anvin
2013-06-03  5:47 ` Dan Williams
2013-06-03  6:06   ` H. Peter Anvin
2013-06-03  6:14     ` Dan Williams
2013-06-03  6:30       ` H. Peter Anvin
2013-06-03 14:39       ` H. Peter Anvin
2013-06-11 16:47         ` Joe Lawrence
2013-06-11 17:12           ` H. Peter Anvin
2013-06-03 15:47       ` H. Peter Anvin
2013-06-03 16:09         ` Joe Lawrence
2013-06-03 17:22         ` Dan Williams
2013-06-03 17:40           ` H. Peter Anvin
2013-06-03 18:35             ` Martin K. Petersen
2013-06-03 18:38               ` H. Peter Anvin
2013-06-03 18:40               ` H. Peter Anvin
2013-06-03 22:20                 ` H. Peter Anvin
2013-06-03 22:34                   ` H. Peter Anvin
2013-06-04 15:56                     ` Martin K. Petersen
2013-06-03 23:19               ` H. Peter Anvin
2013-06-04 15:39                 ` Joe Lawrence
2013-06-04 15:46                   ` H. Peter Anvin
2013-06-04 15:54                     ` Martin K. Petersen
2013-06-05 10:02                   ` Bernd Schubert
2013-06-05 11:38                     ` Bernd Schubert
2013-06-05 12:53                       ` [PATCH] scsi: Check if the device support WRITE_SAME_10 Bernd Schubert
2013-06-05 19:14                         ` Martin K. Petersen
2013-06-05 20:09                           ` Bernd Schubert
2013-06-07  2:15                             ` Martin K. Petersen
2013-06-12 19:34                               ` Bernd Schubert
2013-06-05 19:11                       ` RAID-10 keeps aborting Martin K. Petersen
2013-06-04 17:36               ` Dan Williams
2013-06-04 17:54                 ` Martin K. Petersen
2013-06-04 17:57                   ` H. Peter Anvin
2013-06-04 18:04                     ` Martin K. Petersen
2013-06-04 18:32                       ` Dan Williams
2013-06-04 18:38                         ` H. Peter Anvin
2013-06-04 18:56                           ` Dan Williams
2013-06-05  2:39                             ` H. Peter Anvin
2018-11-07  1:52                                                     ` linux-next: Signed-off-by missing for commit in the scsi-fixes tree Martin K. Petersen
2020-04-03  3:45                                                     ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Martin K. Petersen
2020-04-07  2:27                                                       ` Dave Chinner
2020-04-08  4:10                                                         ` Martin K. Petersen
2020-04-19 22:36                                                           ` Dave Chinner
2020-04-23  0:40                                                             ` Martin K. Petersen
2013-06-07  2:19                                                     ` RAID-10 keeps aborting Martin K. Petersen
2013-06-10 14:15                                                       ` Joe Lawrence
2013-06-12  3:15                                                         ` NeilBrown
2013-06-12  4:07                                                           ` H. Peter Anvin
2013-06-12  6:29                                                             ` Bernd Schubert
2013-06-12 10:22                                                               ` Joe Lawrence
2013-06-12 14:28                                                               ` Martin K. Petersen
2013-06-12 14:25                                                             ` Martin K. Petersen
2013-06-12 14:29                                                               ` H. Peter Anvin
2013-06-12 14:34                                                                 ` Martin K. Petersen
2013-06-12 14:37                                                                   ` H. Peter Anvin
2013-06-12 14:45                                                                   ` H. Peter Anvin
     [not found]                                                                       ` <5AA430FFE4486C448003201AC83BC85E0360CE3F@EXHQ.corp.stratus.com>
2013-06-12 15:58                                                                         ` H. Peter Anvin
2013-06-13  3:10                                                                     ` NeilBrown
2013-06-13  3:13                                                                       ` H. Peter Anvin
2013-06-13  3:31                                                                         ` NeilBrown
2013-06-13 21:40                                                                       ` Martin K. Petersen
2013-06-13  2:45                                                           ` Joe Lawrence
2013-06-13  3:11                                                             ` NeilBrown
2013-06-05 19:29                                                     ` Martin K. Petersen
2013-06-06 18:27                                                       ` Joe Lawrence
2013-06-06 18:36                                                         ` H. Peter Anvin
2013-06-12 14:43                                                     ` Martin K. Petersen
2020-04-01  2:29                                                     ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Martin K. Petersen
2020-04-01  4:53                                                       ` Chaitanya Kulkarni
2020-05-12 16:01                                                     ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Martin K. Petersen
2020-05-12 16:04                                                       ` Christoph Hellwig
2020-05-12 16:12                                                         ` Martin K. Petersen
2020-05-12 16:18                                                           ` Johannes Thumshirn
2020-05-12 16:24                                                             ` Martin K. Petersen
2020-07-13 16:47                                                     ` [PATCH 2/2] block: improve discard bio alignment in __blkdev_issue_discard() Martin K. Petersen
2020-07-13 17:50                                                       ` Coly Li
2020-12-01  5:19                                                     ` [PATCH] lpfc: Correct null ndlp reference on routine exit Martin K. Petersen
2015-12-08  1:12                                                     ` [PATCH] scsi_dh_alua: Remove stale variables Martin K. Petersen
2020-12-09  2:17                                                     ` [PATCH 0/2] two UFS changes Martin K. Petersen
2013-06-11 21:50 ` RAID-10 keeps aborting Joe Lawrence
2013-06-11 21:53   ` H. Peter Anvin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.