All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
@ 2010-07-06  7:01 FUJITA Tomonori
  2010-07-06 21:31 ` Mike Snitzer
  2010-07-07 16:39 ` [PATCH] " Christoph Hellwig
  0 siblings, 2 replies; 18+ messages in thread
From: FUJITA Tomonori @ 2010-07-06  7:01 UTC (permalink / raw)
  To: linux-scsi; +Cc: James.Bottomley, hch, snitzer, axboe

I confirmed that mkfs.xfs worked with Intel X25-M (trim) and
scsi_debug (write same and unmap).

REQ_TYPE_FS should give the same scsi_cmnd struct as REQ_TYPE_BLOCK_PC.

This can be applied to block's for-2.6.36.

The git tree is also available:

git://git.kernel.org/pub/scm/linux/kernel/git/tomo/linux-2.6-misc.git fs-discard

=
From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Subject: [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC

The block layer (file systems) sends discard requests as REQ_TYPE_FS
(the role of REQ_TYPE_FS is that setting up commands and interpreting
the results). But SCSI-ml treats discard requests as
REQ_TYPE_BLOCK_PC.

scsi-ml can handle discard requests as REQ_TYPE_FS
easily. scsi_setup_discard_cmnd() sets up struct request and the bio
nicely. Only remaining issue is that discard requests can't be
completed partially so we need to modify sd_done.

This conversion also fixes the problem that discard requests aren't
retried when possible (e.g. UNIT ATTENTION).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 drivers/scsi/sd.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0994ab6..5f59601 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -433,7 +433,6 @@ static int scsi_setup_discard_cmnd(struct scsi_device *sdp, struct request *rq)
 		nr_sectors >>= 3;
 	}
 
-	rq->cmd_type = REQ_TYPE_BLOCK_PC;
 	rq->timeout = SD_TIMEOUT;
 
 	memset(rq->cmd, 0, rq->cmd_len);
@@ -1185,6 +1184,12 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 	int sense_valid = 0;
 	int sense_deferred = 0;
 
+	if (SCpnt->request->cmd_flags & REQ_DISCARD) {
+		if (!result)
+			scsi_set_resid(SCpnt, 0);
+		return good_bytes;
+	}
+
 	if (result) {
 		sense_valid = scsi_command_normalize_sense(SCpnt, &sshdr);
 		if (sense_valid)
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-06  7:01 [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC FUJITA Tomonori
@ 2010-07-06 21:31 ` Mike Snitzer
  2010-07-06 23:40   ` Douglas Gilbert
  2010-07-07 16:39 ` [PATCH] " Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2010-07-06 21:31 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: linux-scsi, James.Bottomley, hch, axboe

On Tue, Jul 06 2010 at  3:01am -0400,
FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote:

> I confirmed that mkfs.xfs worked with Intel X25-M (trim) and
> scsi_debug (write same and unmap).
> 
> REQ_TYPE_FS should give the same scsi_cmnd struct as REQ_TYPE_BLOCK_PC.
> 
> This can be applied to block's for-2.6.36.
> 
> The git tree is also available:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/tomo/linux-2.6-misc.git fs-discard
> 
> =
> From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> Subject: [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
> 
> The block layer (file systems) sends discard requests as REQ_TYPE_FS
> (the role of REQ_TYPE_FS is that setting up commands and interpreting
> the results). But SCSI-ml treats discard requests as
> REQ_TYPE_BLOCK_PC.
> 
> scsi-ml can handle discard requests as REQ_TYPE_FS
> easily. scsi_setup_discard_cmnd() sets up struct request and the bio
> nicely. Only remaining issue is that discard requests can't be
> completed partially so we need to modify sd_done.
> 
> This conversion also fixes the problem that discard requests aren't
> retried when possible (e.g. UNIT ATTENTION).
> 
> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>

Unfortunately this patch causes 'mkfs.ext4 -F /dev/sda' to fail against
a device whose discard support is implemented using WRITE SAME 16 w/
discard bit set.  This is with recent e2fsprogs that issues BLKDISCARD
ioctl at start of mkfs:

sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Illegal Request [current] 
Info fld=0x0
sd 2:0:0:0: [sda] Add. Sense: Parameter value invalid
sd 2:0:0:0: [sda] CDB: Write same(16): 93 08 00 00 00 00 00 00 00 00 00 7f ff ff 00 00
end_request: I/O error, dev sda, sector 0

If I revert your patch I don't see this.

Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-06 21:31 ` Mike Snitzer
@ 2010-07-06 23:40   ` Douglas Gilbert
  2010-07-07  0:47     ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Douglas Gilbert @ 2010-07-06 23:40 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: FUJITA Tomonori, linux-scsi, James.Bottomley, hch, axboe

On 10-07-06 05:31 PM, Mike Snitzer wrote:
> On Tue, Jul 06 2010 at  3:01am -0400,
> FUJITA Tomonori<fujita.tomonori@lab.ntt.co.jp>  wrote:
>
>> I confirmed that mkfs.xfs worked with Intel X25-M (trim) and
>> scsi_debug (write same and unmap).
>>
>> REQ_TYPE_FS should give the same scsi_cmnd struct as REQ_TYPE_BLOCK_PC.
>>
>> This can be applied to block's for-2.6.36.
>>
>> The git tree is also available:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tomo/linux-2.6-misc.git fs-discard
>>
>> =
>> From: FUJITA Tomonori<fujita.tomonori@lab.ntt.co.jp>
>> Subject: [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
>>
>> The block layer (file systems) sends discard requests as REQ_TYPE_FS
>> (the role of REQ_TYPE_FS is that setting up commands and interpreting
>> the results). But SCSI-ml treats discard requests as
>> REQ_TYPE_BLOCK_PC.
>>
>> scsi-ml can handle discard requests as REQ_TYPE_FS
>> easily. scsi_setup_discard_cmnd() sets up struct request and the bio
>> nicely. Only remaining issue is that discard requests can't be
>> completed partially so we need to modify sd_done.
>>
>> This conversion also fixes the problem that discard requests aren't
>> retried when possible (e.g. UNIT ATTENTION).
>>
>> Signed-off-by: FUJITA Tomonori<fujita.tomonori@lab.ntt.co.jp>
>
> Unfortunately this patch causes 'mkfs.ext4 -F /dev/sda' to fail against
> a device whose discard support is implemented using WRITE SAME 16 w/
> discard bit set.  This is with recent e2fsprogs that issues BLKDISCARD
> ioctl at start of mkfs:
>
> sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 2:0:0:0: [sda] Sense Key : Illegal Request [current]
> Info fld=0x0
> sd 2:0:0:0: [sda] Add. Sense: Parameter value invalid
> sd 2:0:0:0: [sda] CDB: Write same(16): 93 08 00 00 00 00 00 00 00 00 00 7f ff ff 00 00
> end_request: I/O error, dev sda, sector 0

That is 0x7fffff (over 8 million) blocks (4 GB) being unmapped
in one operation! That may exceed the "maximum unmap lba
count" field in the Block Limits VPD page.
The latest SBC draft (sbc3r22.pdf) says that field applies to
the SCSI UNMAP command and does not mention the WRITE SAME (16)
command but that is probably an oversight.

One also wonders how long that would take if permitted and what
is an appropriate command timeout.

Also the additional sense of "Parameter value invalid" (26h,2h)
is not mentioned in the latest drafts of SPC or SBC, apart from
being defined. So it might be a good one for an implementer to
pull out of the bag for special occasions. And the SBC draft
does say what is the correct additional sense in this case.


On a slightly related matter, when the target device is ATA
(e.g. a SSD with an SATA interface) then the SAT used can
influence whether WRITE SAME (16) gets translated into DATA
SET MANAGEMENT with the TRIM bit set. libata does it in recent
kernels but the LSI SAS controller that I use has its own SAT
which does not translate WRITE SAME. [It might with more
recent firmware.]

Doug Gilbert



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-06 23:40   ` Douglas Gilbert
@ 2010-07-07  0:47     ` Mike Snitzer
  2010-07-07  1:39       ` Martin K. Petersen
                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Mike Snitzer @ 2010-07-07  0:47 UTC (permalink / raw)
  To: Douglas Gilbert; +Cc: FUJITA Tomonori, linux-scsi, James.Bottomley, hch, axboe

On Tue, Jul 06 2010 at  7:40pm -0400,
Douglas Gilbert <dgilbert@interlog.com> wrote:

> On 10-07-06 05:31 PM, Mike Snitzer wrote:
> >On Tue, Jul 06 2010 at  3:01am -0400,
> >FUJITA Tomonori<fujita.tomonori@lab.ntt.co.jp>  wrote:
> >
> >>I confirmed that mkfs.xfs worked with Intel X25-M (trim) and
> >>scsi_debug (write same and unmap).
> >>
> >>REQ_TYPE_FS should give the same scsi_cmnd struct as REQ_TYPE_BLOCK_PC.
> >>
> >>This can be applied to block's for-2.6.36.
> >>
> >>The git tree is also available:
> >>
> >>git://git.kernel.org/pub/scm/linux/kernel/git/tomo/linux-2.6-misc.git fs-discard
> >>
> >>=
> >>From: FUJITA Tomonori<fujita.tomonori@lab.ntt.co.jp>
> >>Subject: [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
> >>
> >>The block layer (file systems) sends discard requests as REQ_TYPE_FS
> >>(the role of REQ_TYPE_FS is that setting up commands and interpreting
> >>the results). But SCSI-ml treats discard requests as
> >>REQ_TYPE_BLOCK_PC.
> >>
> >>scsi-ml can handle discard requests as REQ_TYPE_FS
> >>easily. scsi_setup_discard_cmnd() sets up struct request and the bio
> >>nicely. Only remaining issue is that discard requests can't be
> >>completed partially so we need to modify sd_done.
> >>
> >>This conversion also fixes the problem that discard requests aren't
> >>retried when possible (e.g. UNIT ATTENTION).
> >>
> >>Signed-off-by: FUJITA Tomonori<fujita.tomonori@lab.ntt.co.jp>
> >
> >Unfortunately this patch causes 'mkfs.ext4 -F /dev/sda' to fail against
> >a device whose discard support is implemented using WRITE SAME 16 w/
> >discard bit set.  This is with recent e2fsprogs that issues BLKDISCARD
> >ioctl at start of mkfs:
> >
> >sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> >sd 2:0:0:0: [sda] Sense Key : Illegal Request [current]
> >Info fld=0x0
> >sd 2:0:0:0: [sda] Add. Sense: Parameter value invalid
> >sd 2:0:0:0: [sda] CDB: Write same(16): 93 08 00 00 00 00 00 00 00 00 00 7f ff ff 00 00
> >end_request: I/O error, dev sda, sector 0
> 
> That is 0x7fffff (over 8 million) blocks (4 GB) being unmapped
> in one operation! That may exceed the "maximum unmap lba
> count" field in the Block Limits VPD page.
> The latest SBC draft (sbc3r22.pdf) says that field applies to
> the SCSI UNMAP command and does not mention the WRITE SAME (16)
> command but that is probably an oversight.

# sg_inq -p 0xb0 /dev/sda
VPD INQUIRY: Block limits page (SBC)
  Optimal transfer length granularity: 8 blocks
  Maximum transfer length: 8388607 blocks
  Optimal transfer length: 128 blocks
  Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
  Maximum unmap LBA count: 0
  Maximum unmap block descriptor count: 0
# cat /sys/block/sda/queue/discard_granularity 
512
# cat /sys/block/sda/queue/discard_max_bytes 
4294966784

I'll look to understand why 'discard_max_bytes' is so large for this LUN
despite the standard Block limits VPD page not reflecting this.

Here is a SCSI trace with Tomo's patch REQ_TYPE_FS applied:

           blkid-1425  [001] 1272477.814205: scsi_dispatch_cmd_start: host_no=2 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 cmnd=(WRITE_SAME_16 lba=0 txlen=8388607 protect=0 unmap=1 raw=93 08 00 00 00 00 00 00 00 00 00 7f ff ff 00 00)
          <idle>-0     [000] 1272477.815199: scsi_dispatch_cmd_done: host_no=2 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 cmnd=(WRITE_SAME_16 lba=0 txlen=8388607 protect=0 unmap=1 raw=93 08 00 00 00 00 00 00 00 00 00 7f ff ff 00 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_CHECK_CONDITION)

and without:

          <idle>-0     [001] 1272933.144045: scsi_dispatch_cmd_start: host_no=2 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 cmnd=(WRITE_SAME_16 lba=0 txlen=8388607 protect=0 unmap=1 raw=93 08 00 00 00 00 00 00 00 00 00 7f ff ff 00 00)
          <idle>-0     [000] 1272933.144726: scsi_dispatch_cmd_done: host_no=2 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 cmnd=(WRITE_SAME_16 lba=0 txlen=8388607 protect=0 unmap=1 raw=93 08 00 00 00 00 00 00 00 00 00 7f ff ff 00 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_CHECK_CONDITION)

So it seems the transition away from BLOCK_PC to REQ_TYPE_FS has enabled
us to actually know about malformed SCSI requests without special SCSI
tracing.

This appears to be a welcomed side-effect of using REQ_TYPE_FS.

Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-07  0:47     ` Mike Snitzer
@ 2010-07-07  1:39       ` Martin K. Petersen
  2010-07-07  2:19         ` Mike Snitzer
  2010-07-07  3:35         ` Douglas Gilbert
  2010-07-07  4:06       ` FUJITA Tomonori
  2010-07-07  4:07       ` James Bottomley
  2 siblings, 2 replies; 18+ messages in thread
From: Martin K. Petersen @ 2010-07-07  1:39 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Douglas Gilbert, FUJITA Tomonori, linux-scsi, James.Bottomley,
	hch, axboe

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

>> That is 0x7fffff (over 8 million) blocks (4 GB) being unmapped in one
>> operation! That may exceed the "maximum unmap lba count" field in the
>> Block Limits VPD page.  The latest SBC draft (sbc3r22.pdf) says that
>> field applies to the SCSI UNMAP command and does not mention the
>> WRITE SAME (16) command but that is probably an oversight.

Maximum Unmap LBA Count > 0 (in combination with the descriptor count)
are what indicate that the device server supports UNMAP.

You could argue, then, that a Maximum Unmap LBA Count > 0 but a Maximum
Unmap Descriptor Count of 0 would provide means to indicate the maximum
range for WRITE SAME.  But the T10 people I have talked to all agree
that the LBA count for WRITE SAME is gated by the command's LBA count
and nothing else.  So no special casing for when the UNMAP bit is set.
I.e. the max for WRITE SAME(16) is 32-bits times logical_block_size.


Mike> # cat /sys/block/sda/queue/discard_granularity
Mike> 512
Mike> # cat /sys/block/sda/queue/discard_max_bytes
Mike> 4294966784

Mike> I'll look to understand why 'discard_max_bytes' is so large for
Mike> this LUN despite the standard Block limits VPD page not reflecting
Mike> this.

discard_max_bytes is 0xFFFFFFFF for WRITE SAME(16).

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-07  1:39       ` Martin K. Petersen
@ 2010-07-07  2:19         ` Mike Snitzer
  2010-07-07  3:35         ` Douglas Gilbert
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2010-07-07  2:19 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Douglas Gilbert, FUJITA Tomonori, linux-scsi, James.Bottomley,
	hch, axboe

On Tue, Jul 06 2010 at  9:39pm -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> >> That is 0x7fffff (over 8 million) blocks (4 GB) being unmapped in one
> >> operation! That may exceed the "maximum unmap lba count" field in the
> >> Block Limits VPD page.  The latest SBC draft (sbc3r22.pdf) says that
> >> field applies to the SCSI UNMAP command and does not mention the
> >> WRITE SAME (16) command but that is probably an oversight.
> 
> Maximum Unmap LBA Count > 0 (in combination with the descriptor count)
> are what indicate that the device server supports UNMAP.
> 
> You could argue, then, that a Maximum Unmap LBA Count > 0 but a Maximum
> Unmap Descriptor Count of 0 would provide means to indicate the maximum
> range for WRITE SAME.  But the T10 people I have talked to all agree
> that the LBA count for WRITE SAME is gated by the command's LBA count
> and nothing else.  So no special casing for when the UNMAP bit is set.
> I.e. the max for WRITE SAME(16) is 32-bits times logical_block_size.

Ah, yes completely missed that fact that we're talking WRITE SAME(16)
yet I was looking at UNMAP limits.

> 
> Mike> # cat /sys/block/sda/queue/discard_granularity
> Mike> 512
> Mike> # cat /sys/block/sda/queue/discard_max_bytes
> Mike> 4294966784
> 
> Mike> I'll look to understand why 'discard_max_bytes' is so large for
> Mike> this LUN despite the standard Block limits VPD page not reflecting
> Mike> this.
> 
> discard_max_bytes is 0xFFFFFFFF for WRITE SAME(16).

Seems discard_max_bytes = 0x7fffff * 512 = 4294966784

So mkfs.ext4's BLKDISCARD ioctl first issues a discard of
discard_max_bytes (via blkdev_issue_discard's loop) and it fails. 

Question becomes: why did the request fail if the LUN supports WRITE
SAME(16) and its max is 0xFFFFFFFF?

Likely a question for the vendor but I'd first like to make sure it is
reasonable to expect this command to succeed.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-07  1:39       ` Martin K. Petersen
  2010-07-07  2:19         ` Mike Snitzer
@ 2010-07-07  3:35         ` Douglas Gilbert
  2010-07-08 19:11           ` Mike Snitzer
  2010-07-09 16:22           ` Martin K. Petersen
  1 sibling, 2 replies; 18+ messages in thread
From: Douglas Gilbert @ 2010-07-07  3:35 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Mike Snitzer, FUJITA Tomonori, linux-scsi, James.Bottomley, hch, axboe

On 10-07-06 09:39 PM, Martin K. Petersen wrote:
>>>>>> "Mike" == Mike Snitzer<snitzer@redhat.com>  writes:
>
>>> That is 0x7fffff (over 8 million) blocks (4 GB) being unmapped in one
>>> operation! That may exceed the "maximum unmap lba count" field in the
>>> Block Limits VPD page.  The latest SBC draft (sbc3r22.pdf) says that
>>> field applies to the SCSI UNMAP command and does not mention the
>>> WRITE SAME (16) command but that is probably an oversight.
>
> Maximum Unmap LBA Count>  0 (in combination with the descriptor count)
> are what indicate that the device server supports UNMAP.

That has been superseded by the TPU and TPWS bits
in the Thin provisioning VPD page (B2h) in sbc3r22.
TPU and TPWS indicate support for the UNMAP and WRITE
SAME (16) with UNMAP bit ** commands respectively.

> You could argue, then, that a Maximum Unmap LBA Count>  0 but a Maximum
> Unmap Descriptor Count of 0 would provide means to indicate the maximum
> range for WRITE SAME.  But the T10 people I have talked to all agree
> that the LBA count for WRITE SAME is gated by the command's LBA count
> and nothing else.  So no special casing for when the UNMAP bit is set.
> I.e. the max for WRITE SAME(16) is 32-bits times logical_block_size.

I think sbc3r22 is just flaky in that area and will
be cleaned up soon. As the words stand now, in the
Block limits VPD page "maximum unmap lba count" only
applies to the UNMAP command while "optimal unmap
granularity" applies to both the UNMAP command and
the WRITE SAME(16) command. Inconsistent.
And "maximum unmap lba count"==0 implying no UNMAP
command is pointless given the TPU bit.

> Mike>  # cat /sys/block/sda/queue/discard_granularity
> Mike>  512
> Mike>  # cat /sys/block/sda/queue/discard_max_bytes
> Mike>  4294966784
>
> Mike>  I'll look to understand why 'discard_max_bytes' is so large for
> Mike>  this LUN despite the standard Block limits VPD page not reflecting
> Mike>  this.
>
> discard_max_bytes is 0xFFFFFFFF for WRITE SAME(16).

FORMAT UNIT has several associated mechanisms (e.g
IMMED bit and REQUEST SENSE polling) that let it
run for a long time. WRITE SAME has no such mechanisms.
There was a proposal put to t10 to place an upper limit
on WRITE SAME's lba count but I think that has been
dropped. IMO if we want to give large block counts to
UNMAP or WRITE SAME in the absence of guidance from the
block limits VPD page, then we need to cope with
device saying "nope".

Whatever device Mike has it seems to be failing the
WRITE SAME(16) command due to the huge lba block count.
Does the device work with a smaller lba block count?
For example:
    sg_write_same --unmap --lba 0 --num 1024 /dev/sda



** WRITE SAME (32) also has an UNMAP bit.

Doug Gilbert


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-07  0:47     ` Mike Snitzer
  2010-07-07  1:39       ` Martin K. Petersen
@ 2010-07-07  4:06       ` FUJITA Tomonori
  2010-07-07  4:07       ` James Bottomley
  2 siblings, 0 replies; 18+ messages in thread
From: FUJITA Tomonori @ 2010-07-07  4:06 UTC (permalink / raw)
  To: snitzer
  Cc: dgilbert, fujita.tomonori, linux-scsi, James.Bottomley, hch, axboe

On Tue, 6 Jul 2010 20:47:48 -0400
Mike Snitzer <snitzer@redhat.com> wrote:

> So it seems the transition away from BLOCK_PC to REQ_TYPE_FS has enabled
> us to actually know about malformed SCSI requests without special SCSI
> tracing.
>
> This appears to be a welcomed side-effect of using REQ_TYPE_FS.

scsi-ml just ignores the results for BLOCK_PC since it assumes that
the upper layers want to handle. The block layer and file systems
can't handle the results of discard so discard should be REQ_TYPE_FS.

With REQ_TYPE_FS conversion, we might see more errors of discard that
we have ignored so far. However, the errors aren't due to REQ_TYPE_FS
conversion. REQ_TYPE_FS conversion doesn't lead to any changes to
scsi_cmnd struct.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-07  0:47     ` Mike Snitzer
  2010-07-07  1:39       ` Martin K. Petersen
  2010-07-07  4:06       ` FUJITA Tomonori
@ 2010-07-07  4:07       ` James Bottomley
  2 siblings, 0 replies; 18+ messages in thread
From: James Bottomley @ 2010-07-07  4:07 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Douglas Gilbert, FUJITA Tomonori, linux-scsi, hch, axboe

On Tue, 2010-07-06 at 20:47 -0400, Mike Snitzer wrote:
> On Tue, Jul 06 2010 at  7:40pm -0400,
> Douglas Gilbert <dgilbert@interlog.com> wrote:
> 
> > On 10-07-06 05:31 PM, Mike Snitzer wrote:
> > >On Tue, Jul 06 2010 at  3:01am -0400,
> > >FUJITA Tomonori<fujita.tomonori@lab.ntt.co.jp>  wrote:
> > >
> > >>I confirmed that mkfs.xfs worked with Intel X25-M (trim) and
> > >>scsi_debug (write same and unmap).
> > >>
> > >>REQ_TYPE_FS should give the same scsi_cmnd struct as REQ_TYPE_BLOCK_PC.
> > >>
> > >>This can be applied to block's for-2.6.36.
> > >>
> > >>The git tree is also available:
> > >>
> > >>git://git.kernel.org/pub/scm/linux/kernel/git/tomo/linux-2.6-misc.git fs-discard
> > >>
> > >>=
> > >>From: FUJITA Tomonori<fujita.tomonori@lab.ntt.co.jp>
> > >>Subject: [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
> > >>
> > >>The block layer (file systems) sends discard requests as REQ_TYPE_FS
> > >>(the role of REQ_TYPE_FS is that setting up commands and interpreting
> > >>the results). But SCSI-ml treats discard requests as
> > >>REQ_TYPE_BLOCK_PC.
> > >>
> > >>scsi-ml can handle discard requests as REQ_TYPE_FS
> > >>easily. scsi_setup_discard_cmnd() sets up struct request and the bio
> > >>nicely. Only remaining issue is that discard requests can't be
> > >>completed partially so we need to modify sd_done.
> > >>
> > >>This conversion also fixes the problem that discard requests aren't
> > >>retried when possible (e.g. UNIT ATTENTION).
> > >>
> > >>Signed-off-by: FUJITA Tomonori<fujita.tomonori@lab.ntt.co.jp>
> > >
> > >Unfortunately this patch causes 'mkfs.ext4 -F /dev/sda' to fail against
> > >a device whose discard support is implemented using WRITE SAME 16 w/
> > >discard bit set.  This is with recent e2fsprogs that issues BLKDISCARD
> > >ioctl at start of mkfs:
> > >
> > >sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> > >sd 2:0:0:0: [sda] Sense Key : Illegal Request [current]
> > >Info fld=0x0
> > >sd 2:0:0:0: [sda] Add. Sense: Parameter value invalid
> > >sd 2:0:0:0: [sda] CDB: Write same(16): 93 08 00 00 00 00 00 00 00 00 00 7f ff ff 00 00
> > >end_request: I/O error, dev sda, sector 0
> > 
> > That is 0x7fffff (over 8 million) blocks (4 GB) being unmapped
> > in one operation! That may exceed the "maximum unmap lba
> > count" field in the Block Limits VPD page.
> > The latest SBC draft (sbc3r22.pdf) says that field applies to
> > the SCSI UNMAP command and does not mention the WRITE SAME (16)
> > command but that is probably an oversight.
> 
> # sg_inq -p 0xb0 /dev/sda
> VPD INQUIRY: Block limits page (SBC)
>   Optimal transfer length granularity: 8 blocks
>   Maximum transfer length: 8388607 blocks

I'll lay reasonable odds this is the problem.  The standard doesn't
mention WRITE SAME for this transfer length limit, but nor does it
mention WRITE LONG ... I bet they both fail when over this.

James



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-06  7:01 [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC FUJITA Tomonori
  2010-07-06 21:31 ` Mike Snitzer
@ 2010-07-07 16:39 ` Christoph Hellwig
  2010-07-08  0:40   ` FUJITA Tomonori
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2010-07-07 16:39 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: linux-scsi, James.Bottomley, hch, snitzer, axboe

Works for me with the OCZ SSD,


Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-07 16:39 ` [PATCH] " Christoph Hellwig
@ 2010-07-08  0:40   ` FUJITA Tomonori
  2010-07-08 14:35     ` James Bottomley
  2010-07-09  3:55     ` Christoph Hellwig
  0 siblings, 2 replies; 18+ messages in thread
From: FUJITA Tomonori @ 2010-07-08  0:40 UTC (permalink / raw)
  To: hch; +Cc: fujita.tomonori, linux-scsi, James.Bottomley, snitzer, axboe

On Wed, 7 Jul 2010 18:39:02 +0200
Christoph Hellwig <hch@lst.de> wrote:

> Works for me with the OCZ SSD,
> 
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks for testing!

This is purely scsi stuff but depends on block's for-2.6.36. Jens, can
you add this to your for-2.6.36?

James, you are going to create a post-merge tree?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-08  0:40   ` FUJITA Tomonori
@ 2010-07-08 14:35     ` James Bottomley
  2010-07-09  3:55     ` Christoph Hellwig
  1 sibling, 0 replies; 18+ messages in thread
From: James Bottomley @ 2010-07-08 14:35 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: hch, linux-scsi, snitzer, axboe

On Thu, 2010-07-08 at 09:40 +0900, FUJITA Tomonori wrote:
> On Wed, 7 Jul 2010 18:39:02 +0200
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > Works for me with the OCZ SSD,
> > 
> > 
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Thanks for testing!
> 
> This is purely scsi stuff but depends on block's for-2.6.36. Jens, can
> you add this to your for-2.6.36?
> 
> James, you are going to create a post-merge tree?

Well, I'm open to the idea, since it seems we need to clean up some of
the error handling a bit more ... and we could certainly do with
cleaning up the patch history, since it's rather confusing.

However, I don't quite see how to do it cleanly given the intertwining
of block and SCSI on this ... if you can suggest a good separation
point, I'd be game.

James



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-07  3:35         ` Douglas Gilbert
@ 2010-07-08 19:11           ` Mike Snitzer
  2010-07-09 16:27             ` Martin K. Petersen
  2010-07-09 16:22           ` Martin K. Petersen
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2010-07-08 19:11 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: Martin K. Petersen, FUJITA Tomonori, linux-scsi, James.Bottomley,
	hch, axboe

On Tue, Jul 06 2010 at 11:35pm -0400,
Douglas Gilbert <dgilbert@interlog.com> wrote:

> On 10-07-06 09:39 PM, Martin K. Petersen wrote:
> >>>>>>"Mike" == Mike Snitzer<snitzer@redhat.com>  writes:

> >Mike>  # cat /sys/block/sda/queue/discard_granularity
> >Mike>  512
> >Mike>  # cat /sys/block/sda/queue/discard_max_bytes
> >Mike>  4294966784
> >
> >Mike>  I'll look to understand why 'discard_max_bytes' is so large for
> >Mike>  this LUN despite the standard Block limits VPD page not reflecting
> >Mike>  this.
> >
> >discard_max_bytes is 0xFFFFFFFF for WRITE SAME(16).
> 
> FORMAT UNIT has several associated mechanisms (e.g
> IMMED bit and REQUEST SENSE polling) that let it
> run for a long time. WRITE SAME has no such mechanisms.
> There was a proposal put to t10 to place an upper limit
> on WRITE SAME's lba count but I think that has been
> dropped. IMO if we want to give large block counts to
> UNMAP or WRITE SAME in the absence of guidance from the
> block limits VPD page, then we need to cope with
> device saying "nope".
> 
> Whatever device Mike has it seems to be failing the
> WRITE SAME(16) command due to the huge lba block count.
> Does the device work with a smaller lba block count?
> For example:
>    sg_write_same --unmap --lba 0 --num 1024 /dev/sda

Yes, and even large requests that have 4K granularity work.

Turns out that this LUN has a 4K granularity requirement (will fail the
WRITE SAME if the granularity requirements are not met).

4294966784 % 4096 = 3584

So we need to see why Linux actually has discard_max_bytes = 4294966784
rather than the full 0xFFFFFFFF we initialize in read_capacity_16:

q->limits.max_discard_sectors = 0xffffffff;

By bet is on blkdev_issue_discard:
  unsigned int max_discard_sectors =
         min(q->limits.max_discard_sectors, UINT_MAX >> 9);

Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-08  0:40   ` FUJITA Tomonori
  2010-07-08 14:35     ` James Bottomley
@ 2010-07-09  3:55     ` Christoph Hellwig
  2010-07-09  4:42       ` FUJITA Tomonori
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2010-07-09  3:55 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: hch, linux-scsi, James.Bottomley, snitzer, axboe

On Thu, Jul 08, 2010 at 09:40:12AM +0900, FUJITA Tomonori wrote:
> Thanks for testing!
> 
> This is purely scsi stuff but depends on block's for-2.6.36. Jens, can
> you add this to your for-2.6.36?
> 
> James, you are going to create a post-merge tree?

Can we get this in through the block tree for now?  It's pretty simple
and keeps all the things to be tested in one place.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-09  3:55     ` Christoph Hellwig
@ 2010-07-09  4:42       ` FUJITA Tomonori
  0 siblings, 0 replies; 18+ messages in thread
From: FUJITA Tomonori @ 2010-07-09  4:42 UTC (permalink / raw)
  To: hch; +Cc: fujita.tomonori, linux-scsi, James.Bottomley, snitzer, axboe

On Fri, 9 Jul 2010 05:55:31 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Jul 08, 2010 at 09:40:12AM +0900, FUJITA Tomonori wrote:
> > Thanks for testing!
> > 
> > This is purely scsi stuff but depends on block's for-2.6.36. Jens, can
> > you add this to your for-2.6.36?
> > 
> > James, you are going to create a post-merge tree?
> 
> Can we get this in through the block tree for now?  It's pretty simple
> and keeps all the things to be tested in one place.

I prefer that too, merging the discard changes and REQ_TYPE_FS
conversion via the block. That could be a reasonable separation
point. However, I suspect that merging all the core scsi changes (that
is, except for driver updates) via the block tree in the next merge
window might be easier.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-07  3:35         ` Douglas Gilbert
  2010-07-08 19:11           ` Mike Snitzer
@ 2010-07-09 16:22           ` Martin K. Petersen
  1 sibling, 0 replies; 18+ messages in thread
From: Martin K. Petersen @ 2010-07-09 16:22 UTC (permalink / raw)
  To: dgilbert
  Cc: Martin K. Petersen, Mike Snitzer, FUJITA Tomonori, linux-scsi,
	James.Bottomley, hch, axboe

>>>>> "Doug" == Douglas Gilbert <dgilbert@interlog.com> writes:

>> Maximum Unmap LBA Count> 0 (in combination with the descriptor count)
>> are what indicate that the device server supports UNMAP.

Doug> That has been superseded by the TPU and TPWS bits in the Thin
Doug> provisioning VPD page (B2h) in sbc3r22.  TPU and TPWS indicate
Doug> support for the UNMAP and WRITE SAME (16) with UNMAP bit **
Doug> commands respectively.

*sigh* this is getting more and more convoluted.  I really wish T10 had
been able to agree on a single approach instead of 2.


Doug> I think sbc3r22 is just flaky in that area and will be cleaned up
Doug> soon. As the words stand now, in the Block limits VPD page
Doug> "maximum unmap lba count" only applies to the UNMAP command while
Doug> "optimal unmap granularity" applies to both the UNMAP command and
Doug> the WRITE SAME(16) command. Inconsistent.  And "maximum unmap lba
Doug> count"==0 implying no UNMAP command is pointless given the TPU
Doug> bit.

I'll ask around and see what the plans are in T10 when I get back from
vacation next week.  And I'll try to adjust our heuristics accordingly.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-08 19:11           ` Mike Snitzer
@ 2010-07-09 16:27             ` Martin K. Petersen
  2010-07-09 18:06               ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2010-07-09 16:27 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Douglas Gilbert, Martin K. Petersen, FUJITA Tomonori, linux-scsi,
	James.Bottomley, hch, axboe

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> Turns out that this LUN has a 4K granularity requirement (will
Mike> fail the WRITE SAME if the granularity requirements are not met).

That's lame.  The devices I've been working with just ignore the
portions of the block range that are not aligned / do not constitute
entire blocks.

I've had the following patch kicking around in my tree for a while.  It
does not yet handle offset discard alignment, though.

-- 
Martin K. Petersen	Oracle Linux Engineering

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 2e9c6bd..60632ed 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -420,12 +420,22 @@ static int sd_prepare_discard(struct request *rq)
 {
 	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	struct bio *bio = rq->bio;
-	sector_t sector = bio->bi_sector;
-	unsigned int num = bio_sectors(bio);
+	sector_t lba;
+	unsigned int xfer_length, logical_block_shift;
+	unsigned int discard_shift, discard_mask;
+
+	logical_block_shift = blksize_bits(sdkp->device->sector_size) - 9;
+	lba = bio->bi_sector >> logical_block_shift;
+	xfer_length = bio_sectors(bio) >> logical_block_shift;
+
+	discard_shift = blksize_bits(rq->q->limits.discard_granularity) - 9;
+
+	if (discard_shift) {
+		discard_mask = (1 << discard_shift) - 1;
+		sector_t end_lba = (lba + xfer_length) & ~discard_mask;
 
-	if (sdkp->device->sector_size == 4096) {
-		sector >>= 3;
-		num >>= 3;
+		lba = (lba + discard_mask) & ~discard_mask;
+		xfer_length = end_lba - lba;
 	}
 
 	rq->cmd_type = REQ_TYPE_BLOCK_PC;
@@ -445,15 +455,15 @@ static int sd_prepare_discard(struct request *rq)
 
 		put_unaligned_be16(6 + 16, &buf[0]);
 		put_unaligned_be16(16, &buf[2]);
-		put_unaligned_be64(sector, &buf[8]);
-		put_unaligned_be32(num, &buf[16]);
+		put_unaligned_be64(lba, &buf[8]);
+		put_unaligned_be32(xfer_length, &buf[16]);
 
 		kunmap_atomic(buf, KM_USER0);
 	} else {
 		rq->cmd[0] = WRITE_SAME_16;
 		rq->cmd[1] = 0x8; /* UNMAP */
-		put_unaligned_be64(sector, &rq->cmd[2]);
-		put_unaligned_be32(num, &rq->cmd[10]);
+		put_unaligned_be64(lba, &rq->cmd[2]);
+		put_unaligned_be32(xfer_length, &rq->cmd[10]);
 		rq->cmd_len = 16;
 	}
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC
  2010-07-09 16:27             ` Martin K. Petersen
@ 2010-07-09 18:06               ` Mike Snitzer
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2010-07-09 18:06 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Douglas Gilbert, FUJITA Tomonori, linux-scsi, James.Bottomley,
	hch, axboe

On Fri, Jul 09 2010 at 12:27pm -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> Turns out that this LUN has a 4K granularity requirement (will
> Mike> fail the WRITE SAME if the granularity requirements are not met).
> 
> That's lame.  The devices I've been working with just ignore the
> portions of the block range that are not aligned / do not constitute
> entire blocks.

What is really lame is this LUN also doesn't have the correct
discard_granularity!

# cat /sys/block/sda/queue/discard_granularity 
512
 
> I've had the following patch kicking around in my tree for a while.  It
> does not yet handle offset discard alignment, though.

OK, won't help us for this LUN but thanks.

I'm left wondering whether the higher level that is discarding a
particular large region (say blkdev_issue_discard's loop) should also be
trained about discard alignment too.  That way the discard requests are
submitted properly aligned and they aren't reduced (leaving holes, like
your patch could do) -- except for the last discard in a sequence, which
could be reduced if its granularity is wrong.

No idea if the above will make sense to others ;)

Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-07-09 18:06 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-06  7:01 [PATCH] scsi: convert discard to REQ_TYPE_FS instead of REQ_TYPE_BLOCK_PC FUJITA Tomonori
2010-07-06 21:31 ` Mike Snitzer
2010-07-06 23:40   ` Douglas Gilbert
2010-07-07  0:47     ` Mike Snitzer
2010-07-07  1:39       ` Martin K. Petersen
2010-07-07  2:19         ` Mike Snitzer
2010-07-07  3:35         ` Douglas Gilbert
2010-07-08 19:11           ` Mike Snitzer
2010-07-09 16:27             ` Martin K. Petersen
2010-07-09 18:06               ` Mike Snitzer
2010-07-09 16:22           ` Martin K. Petersen
2010-07-07  4:06       ` FUJITA Tomonori
2010-07-07  4:07       ` James Bottomley
2010-07-07 16:39 ` [PATCH] " Christoph Hellwig
2010-07-08  0:40   ` FUJITA Tomonori
2010-07-08 14:35     ` James Bottomley
2010-07-09  3:55     ` Christoph Hellwig
2010-07-09  4:42       ` FUJITA Tomonori

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.