All of lore.kernel.org
 help / color / mirror / Atom feed
* Getting TRIM working
@ 2009-03-03 19:07 Matthew Wilcox
  2009-03-04  9:20 ` Boaz Harrosh
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2009-03-03 19:07 UTC (permalink / raw)
  To: linux-ide; +Cc: linux-scsi


I'm having trouble getting Linux to spit out an appropriate TRIM
command, and I've hit a bit of a wall, so I thought I'd appeal for help.

The symptom is that we send out only 24 bytes and then stop rather than
sending out a full 512 byte "sector".

Here's what goes on, as far as I understand it.

We call 'blkdiscard /dev/sda 0 1' which is supposed to TRIM just LBA 0.

We create a bio in blk_ioctl_discard() which sets bi_size to 512 (1 << 9).
This is necessary to let the elevators know how many bytes we're going
to discard, so it can do merging correctly.

Then, in sd_discard_fn(), we do this:

        bio->bi_size = 0;
        if (bio_add_pc_page(q, bio, page, 24, 0) < 24) {
                __free_page(page);
                return -ENOMEM;
        }

This is where the 24 comes from -- it's the length of the data transmitted
to the drive for an unmap of a single range.

Now in ata_scsi_unmap_xlat() we need to turn this SCSI UNMAP command
into an ATA TRIM command.  So I do this:

        size = ALIGN(i * 8, 512);
        memset(buffer + i * 8, 0, size - i * 8);
        old_size = bio_iovec(bio)->bv_len;
printk("before: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
                req->data_len, old_size);
        if (size > old_size) {
                bio_add_pc_page(req->q, bio, bio_page(bio),
                                        size - old_size, old_size);
                req->data_len = size;
        }
printk("after: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
                req->data_len, bio_iovec(bio)->bv_len);

Now req->data_len, bio->bi_size and bio_iovec(bio)->bv_len are all 512.
Yet the AHCI driver still spits out 24 bytes and then stops (which hangs
the drive).  What am I missing?

The current git tree can be found at
http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090212

I tried a few things yesterday with access to the ATA bus analyser,
and the results of that are not found in the git tree.  This patch on
top of that git tree makes things better (before yesterday, I didn't
know about bv_len, for example):

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 591d7a1..dea97e6 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1714,9 +1714,10 @@ nothing_to_do:
 static unsigned int ata_scsi_unmap_xlat(struct ata_queued_cmd *qc)
 {
 	struct request *req = qc->scsicmd->request;
-	char *buffer = bio_data(req->bio);
+	struct bio *bio = req->bio;
+	char *buffer = bio_data(bio);
 	struct ata_taskfile *tf = &qc->tf;
-	unsigned size, i = 0;
+	unsigned old_size, size, i = 0;
 
 	/*
 	 * Ignore what SCSI has written to the buffer.  Will make it easier
@@ -1727,6 +1728,16 @@ static unsigned int ata_scsi_unmap_xlat(struct ata_queued_cmd *qc)
 					req->sector, req->nr_sectors);
 	size = ALIGN(i * 8, 512);
 	memset(buffer + i * 8, 0, size - i * 8);
+	old_size = bio_iovec(bio)->bv_len;
+printk("before: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
+		req->data_len, old_size);
+	if (size > old_size) {
+		bio_add_pc_page(req->q, bio, bio_page(bio),
+					size - old_size, old_size);
+		req->data_len = size;
+	}
+printk("after: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
+		req->data_len, bio_iovec(bio)->bv_len);
 
 	qc->flags |= ATA_QCFLAG_IO;
 	qc->nbytes = size;
@@ -1741,8 +1752,6 @@ static unsigned int ata_scsi_unmap_xlat(struct ata_queued_cmd *qc)
 			ATA_TFLAG_LBA48 | ATA_TFLAG_LBA;
 	tf->device = ATA_LBA;
 
-	req->bio->bi_size = req->data_len = size;
-
 	return 0;
 }
 
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d7539d5..2d5fbd1 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -375,11 +375,22 @@ static int sd_discard_fn(struct request_queue *q, struct request *rq,
 	struct page *page = alloc_page(GFP_KERNEL);
 	if (!page)
 		return -ENOMEM;
+
+	/*
+	 * Earlier, we set bi_size to the number of bytes to be discarded so
+	 * that the elevators understood what was going on.  The bi_size
+	 * has been copied to req->data_len so we don't need to be
+	 * concerned with that any more.  Zero it out so it accurately
+	 * reflects the length of the data we're now sending to the drive.
+	 */
+	bio->bi_size = 0;
 	if (bio_add_pc_page(q, bio, page, 24, 0) < 24) {
 		__free_page(page);
 		return -ENOMEM;
 	}
 
+printk("bio->bi_size = %d\n", bio->bi_size);
+
 	unmap = bio_data(bio);
 	memset(unmap, 0, 12);
 	put_unaligned_be64(bio->bi_sector, unmap + 12);

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-03 19:07 Getting TRIM working Matthew Wilcox
@ 2009-03-04  9:20 ` Boaz Harrosh
  2009-03-06 19:16   ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Boaz Harrosh @ 2009-03-04  9:20 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-scsi

Matthew Wilcox wrote:
> I'm having trouble getting Linux to spit out an appropriate TRIM
> command, and I've hit a bit of a wall, so I thought I'd appeal for help.
> 
> The symptom is that we send out only 24 bytes and then stop rather than
> sending out a full 512 byte "sector".
> 
> Here's what goes on, as far as I understand it.
> 
> We call 'blkdiscard /dev/sda 0 1' which is supposed to TRIM just LBA 0.
> 
> We create a bio in blk_ioctl_discard() which sets bi_size to 512 (1 << 9).
> This is necessary to let the elevators know how many bytes we're going
> to discard, so it can do merging correctly.
> 
> Then, in sd_discard_fn(), we do this:
> 
>         bio->bi_size = 0;
>         if (bio_add_pc_page(q, bio, page, 24, 0) < 24) {
>                 __free_page(page);
>                 return -ENOMEM;
>         }
> 
> This is where the 24 comes from -- it's the length of the data transmitted
> to the drive for an unmap of a single range.
> 
> Now in ata_scsi_unmap_xlat() we need to turn this SCSI UNMAP command
> into an ATA TRIM command.  So I do this:
> 
>         size = ALIGN(i * 8, 512);
>         memset(buffer + i * 8, 0, size - i * 8);
>         old_size = bio_iovec(bio)->bv_len;
> printk("before: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
>                 req->data_len, old_size);
>         if (size > old_size) {
>                 bio_add_pc_page(req->q, bio, bio_page(bio),
>                                         size - old_size, old_size);
>                 req->data_len = size;
>         }
> printk("after: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
>                 req->data_len, bio_iovec(bio)->bv_len);
> 
> Now req->data_len, bio->bi_size and bio_iovec(bio)->bv_len are all 512.
> Yet the AHCI driver still spits out 24 bytes and then stops (which hangs
> the drive).  What am I missing?
> 
> The current git tree can be found at
> http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090212
> 
> I tried a few things yesterday with access to the ATA bus analyser,
> and the results of that are not found in the git tree.  This patch on
> top of that git tree makes things better (before yesterday, I didn't
> know about bv_len, for example):
> 

I have not inspected the code so please forgive me if I'm talking none-sense.

What about the length embedded in the CDB, which is usually derived from
scsi_bufflen(), or other places that look at scsi_bufflen() and not at
request && it's bios. The later might be bigger then scsi's in split commands
but the drivers should only consume scsi_bufflen() bytes.

Just a shot in the dark
Boaz
 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-04  9:20 ` Boaz Harrosh
@ 2009-03-06 19:16   ` Matthew Wilcox
  2009-03-08 10:28     ` Boaz Harrosh
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2009-03-06 19:16 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-ide, linux-scsi

On Wed, Mar 04, 2009 at 11:20:27AM +0200, Boaz Harrosh wrote:
> Matthew Wilcox wrote:
> >         size = ALIGN(i * 8, 512);
> >         memset(buffer + i * 8, 0, size - i * 8);
> >         old_size = bio_iovec(bio)->bv_len;
> > printk("before: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
> >                 req->data_len, old_size);
> >         if (size > old_size) {
> >                 bio_add_pc_page(req->q, bio, bio_page(bio),
> >                                         size - old_size, old_size);
> >                 req->data_len = size;
> >         }
> > printk("after: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
> >                 req->data_len, bio_iovec(bio)->bv_len);
> > 
> > Now req->data_len, bio->bi_size and bio_iovec(bio)->bv_len are all 512.
> > Yet the AHCI driver still spits out 24 bytes and then stops (which hangs
> > the drive).  What am I missing?
> 
> What about the length embedded in the CDB, which is usually derived from
> scsi_bufflen(), or other places that look at scsi_bufflen() and not at
> request && it's bios. The later might be bigger then scsi's in split commands
> but the drivers should only consume scsi_bufflen() bytes.

A fine idea, completely true ... I fixed it like this:

+       old_size = bio_iovec(bio)->bv_len;
+printk("before: bi_size %d, data_len %d, bv_len %d sdb length %d\n",
+               bio->bi_size, req->data_len, old_size, scmd->sdb.length);
+       if (size > old_size) {
+               bio_add_pc_page(req->q, bio, bio_page(bio),
+                                       size - old_size, old_size);
+       }
+       scmd->sdb.length = req->data_len = size;
+printk("after: bi_size %d, data_len %d, bv_len %d sdb length %d\n",
+               bio->bi_size, req->data_len, bio_iovec(bio)->bv_len,
+               scmd->sdb.length);

and it howed sdb.length being 24 before, and 512 after.

And the damn thing still spit out 24 bytes onto the bus and stopped.

To prove where the bug is, I lied to SCSI.  I changed this:

-       if (bio_add_pc_page(q, bio, page, 24, 0) < 24) {
+       if (bio_add_pc_page(q, bio, page, 512, 0) < 512) {

and we spat out a 512 byte sector to the disc, which accepted it and
erased the trimmed sector.  Yay.

So we can go back to looking for a *fifth* place where we store the
length of the data we're transferring.  I'm not convinced this says good
things about our storage stack that we have so many places where we
store the length.  There's more than this of course, because there's
ATA's qc->nbytes, and tf->nsect+hob_nsect, but I already set those
correctly.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-06 19:16   ` Matthew Wilcox
@ 2009-03-08 10:28     ` Boaz Harrosh
  2009-03-08 16:54       ` Matthew Wilcox
  2009-03-08 21:24       ` James Bottomley
  0 siblings, 2 replies; 15+ messages in thread
From: Boaz Harrosh @ 2009-03-08 10:28 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-scsi, Tejun Heo

Matthew Wilcox wrote:
> On Wed, Mar 04, 2009 at 11:20:27AM +0200, Boaz Harrosh wrote:
>> Matthew Wilcox wrote:
>>>         size = ALIGN(i * 8, 512);
>>>         memset(buffer + i * 8, 0, size - i * 8);
>>>         old_size = bio_iovec(bio)->bv_len;
>>> printk("before: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
>>>                 req->data_len, old_size);
>>>         if (size > old_size) {
>>>                 bio_add_pc_page(req->q, bio, bio_page(bio),
>>>                                         size - old_size, old_size);
>>>                 req->data_len = size;
>>>         }
>>> printk("after: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
>>>                 req->data_len, bio_iovec(bio)->bv_len);
>>>
>>> Now req->data_len, bio->bi_size and bio_iovec(bio)->bv_len are all 512.
>>> Yet the AHCI driver still spits out 24 bytes and then stops (which hangs
>>> the drive).  What am I missing?
>> What about the length embedded in the CDB, which is usually derived from
>> scsi_bufflen(), or other places that look at scsi_bufflen() and not at
>> request && it's bios. The later might be bigger then scsi's in split commands
>> but the drivers should only consume scsi_bufflen() bytes.
> 
> A fine idea, completely true ... I fixed it like this:
> 
> +       old_size = bio_iovec(bio)->bv_len;
> +printk("before: bi_size %d, data_len %d, bv_len %d sdb length %d\n",
> +               bio->bi_size, req->data_len, old_size, scmd->sdb.length);
> +       if (size > old_size) {
> +               bio_add_pc_page(req->q, bio, bio_page(bio),
> +                                       size - old_size, old_size);
> +       }
> +       scmd->sdb.length = req->data_len = size;
> +printk("after: bi_size %d, data_len %d, bv_len %d sdb length %d\n",
> +               bio->bi_size, req->data_len, bio_iovec(bio)->bv_len,
> +               scmd->sdb.length);
> 
> and it howed sdb.length being 24 before, and 512 after.
> 
> And the damn thing still spit out 24 bytes onto the bus and stopped.
> 
> To prove where the bug is, I lied to SCSI.  I changed this:
> 
> -       if (bio_add_pc_page(q, bio, page, 24, 0) < 24) {
> +       if (bio_add_pc_page(q, bio, page, 512, 0) < 512) {
> 
> and we spat out a 512 byte sector to the disc, which accepted it and
> erased the trimmed sector.  Yay.
> 
> So we can go back to looking for a *fifth* place where we store the
> length of the data we're transferring.  I'm not convinced this says good
> things about our storage stack that we have so many places where we
> store the length.  There's more than this of course, because there's
> ATA's qc->nbytes, and tf->nsect+hob_nsect, but I already set those
> correctly.
> 

That's because you are doing it at the wrong level at the wrong stage.
1. block-level submits a request
2. sd/sr or what ever ULD prepares a scsi_cmnd out of request.
   Request's sizes are only a recommendation. ULD or scsi-ml may
   prepare a smaller command then request. Once command is prepared
   request is disregarded, you can bang on it all you want code will
   not care about it one bit.
3. LLD executes the scsi-command (Not the block-request)
4. scsi-ml completes command's bytes, at this stage the request might
   not be over and, and a reminder is re-prepared so the request can
   be complete.

The code above scmd->sdb.length = req->data_len = size; is not allowed
and can cause data leaks.

You should ping Tejun, block-layer(1) and ATA-LLD(3) has a way to communicate
alignments and drain buffers that expose some other possible lenght's to ata.

And to your question the missing length above is probably encoded inside the
submitted CDB. (scsi_cmnd->cmnd). When you change the length before
stage (2) it works.

I think you should be using the drain mechanisms built into ata 

Boaz

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-08 10:28     ` Boaz Harrosh
@ 2009-03-08 16:54       ` Matthew Wilcox
  2009-03-08 17:38         ` Boaz Harrosh
  2009-03-08 21:24       ` James Bottomley
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2009-03-08 16:54 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-ide, linux-scsi, Tejun Heo, Jeff Garzik

On Sun, Mar 08, 2009 at 12:28:27PM +0200, Boaz Harrosh wrote:
> That's because you are doing it at the wrong level at the wrong stage.
> 1. block-level submits a request
> 2. sd/sr or what ever ULD prepares a scsi_cmnd out of request.
>    Request's sizes are only a recommendation. ULD or scsi-ml may
>    prepare a smaller command then request. Once command is prepared
>    request is disregarded, you can bang on it all you want code will
>    not care about it one bit.

I may well be changing more than I need to, but it would be foolish of
me to make the assumption at this point that nothing is looking at the
request.

> 3. LLD executes the scsi-command (Not the block-request)

This is true, *but* some of the lengths in the block request still end
up getting used, for example bv_len is used by blk_rq_map_sg() which is
called by the LLD.

> 4. scsi-ml completes command's bytes, at this stage the request might
>    not be over and, and a reminder is re-prepared so the request can
>    be complete.
> 
> The code above scmd->sdb.length = req->data_len = size; is not allowed
> and can cause data leaks.

Simply not true.  We are *changing the amount of data we wish to
transfer*.  SCSI would have sent down 24 bytes of data.  ATA needs to
send down 512 bytes of data.

> You should ping Tejun, block-layer(1) and ATA-LLD(3) has a way to communicate
> alignments and drain buffers that expose some other possible lenght's to ata.
> 
> And to your question the missing length above is probably encoded inside the
> submitted CDB. (scsi_cmnd->cmnd). When you change the length before
> stage (2) it works.

No, that's not it.  This works (in sd.c):

        if (bio_add_pc_page(q, bio, page, 512, 0) < 512) {
[...]
        rq->cmd_type = REQ_TYPE_BLOCK_PC;
	rq->cmd_len = 10;
	rq->cmd[0] = UNMAP;
	put_unaligned_be16(24, rq->cmd + 7);

So the 24 in the cmd gets ignored.

> I think you should be using the drain mechanisms built into ata 

drain seems only applicable to ATAPI, not to ATA.  The comment says:

 *      ATAPI commands which transfer variable length data to host
 *      might overflow due to application error or hardare bug.  This
 *      function checks whether overflow should be drained and ignored
 *      for @request.

This isn't the case with discard/UNMAP/TRIM.  We know exactly how much
data we want to send.  The problem is that I don't know how to update all
the required places to change the amount of data being sent.  I don't
see any other ATA command which needs to do this, so this is breaking
new ground for libata-scsi.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-08 16:54       ` Matthew Wilcox
@ 2009-03-08 17:38         ` Boaz Harrosh
  0 siblings, 0 replies; 15+ messages in thread
From: Boaz Harrosh @ 2009-03-08 17:38 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-scsi, Tejun Heo, Jeff Garzik

Matthew Wilcox wrote:
> On Sun, Mar 08, 2009 at 12:28:27PM +0200, Boaz Harrosh wrote:
>> That's because you are doing it at the wrong level at the wrong stage.
>> 1. block-level submits a request
>> 2. sd/sr or what ever ULD prepares a scsi_cmnd out of request.
>>    Request's sizes are only a recommendation. ULD or scsi-ml may
>>    prepare a smaller command then request. Once command is prepared
>>    request is disregarded, you can bang on it all you want code will
>>    not care about it one bit.
> 
> I may well be changing more than I need to, but it would be foolish of
> me to make the assumption at this point that nothing is looking at the
> request.
> 
>> 3. LLD executes the scsi-command (Not the block-request)
> 
> This is true, *but* some of the lengths in the block request still end
> up getting used, for example bv_len is used by blk_rq_map_sg() which is
> called by the LLD.
> 

I must be running but... What? blk_rq_map_sg() is called by scsi-ml in prep_command
when allocating and setting up SGs (scsi_sglist())

If called by LLD then it's the first I've herd of, and some assumptions at scsi-ml
surly break.

This is exactly stage (2) the call to blk_rq_map_sg(), from there on request is
ignored.

>> 4. scsi-ml completes command's bytes, at this stage the request might
>>    not be over and, and a reminder is re-prepared so the request can
>>    be complete.
>>
>> The code above scmd->sdb.length = req->data_len = size; is not allowed
>> and can cause data leaks.
> 
> Simply not true.  We are *changing the amount of data we wish to
> transfer*.  SCSI would have sent down 24 bytes of data.  ATA needs to
> send down 512 bytes of data.
> 

That's what you want but you can't do that after stage 2.

>> You should ping Tejun, block-layer(1) and ATA-LLD(3) has a way to communicate
>> alignments and drain buffers that expose some other possible lenght's to ata.
>>
>> And to your question the missing length above is probably encoded inside the
>> submitted CDB. (scsi_cmnd->cmnd). When you change the length before
>> stage (2) it works.
> 
> No, that's not it.  This works (in sd.c):
> 
>         if (bio_add_pc_page(q, bio, page, 512, 0) < 512) {
> [...]
>         rq->cmd_type = REQ_TYPE_BLOCK_PC;
> 	rq->cmd_len = 10;
> 	rq->cmd[0] = UNMAP;
> 	put_unaligned_be16(24, rq->cmd + 7);
> 
> So the 24 in the cmd gets ignored.
> 
>> I think you should be using the drain mechanisms built into ata 
> 
> drain seems only applicable to ATAPI, not to ATA.  The comment says:
> 
>  *      ATAPI commands which transfer variable length data to host
>  *      might overflow due to application error or hardare bug.  This
>  *      function checks whether overflow should be drained and ignored
>  *      for @request.
> 
> This isn't the case with discard/UNMAP/TRIM.  We know exactly how much
> data we want to send.  The problem is that I don't know how to update all
> the required places to change the amount of data being sent.  I don't
> see any other ATA command which needs to do this, so this is breaking
> new ground for libata-scsi.
> 

You must do the change before/during stage 2 above. (length get encoded inside
scsi_sglist()).

Sorry I must be running. If you have a public git with this stuff
I can have a look tomorrow, or send a .diff I'll apply locally and
inspect it.

bye
Boaz


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-08 10:28     ` Boaz Harrosh
  2009-03-08 16:54       ` Matthew Wilcox
@ 2009-03-08 21:24       ` James Bottomley
  2009-03-08 21:32         ` James Bottomley
  1 sibling, 1 reply; 15+ messages in thread
From: James Bottomley @ 2009-03-08 21:24 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Matthew Wilcox, linux-ide, linux-scsi, Tejun Heo

On Sun, 2009-03-08 at 12:28 +0200, Boaz Harrosh wrote:
> Matthew Wilcox wrote:
> > On Wed, Mar 04, 2009 at 11:20:27AM +0200, Boaz Harrosh wrote:
> >> Matthew Wilcox wrote:
> >>>         size = ALIGN(i * 8, 512);
> >>>         memset(buffer + i * 8, 0, size - i * 8);
> >>>         old_size = bio_iovec(bio)->bv_len;
> >>> printk("before: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
> >>>                 req->data_len, old_size);
> >>>         if (size > old_size) {
> >>>                 bio_add_pc_page(req->q, bio, bio_page(bio),
> >>>                                         size - old_size, old_size);
> >>>                 req->data_len = size;
> >>>         }
> >>> printk("after: bi_size %d, data_len %d, bv_len %d\n", bio->bi_size,
> >>>                 req->data_len, bio_iovec(bio)->bv_len);
> >>>
> >>> Now req->data_len, bio->bi_size and bio_iovec(bio)->bv_len are all 512.
> >>> Yet the AHCI driver still spits out 24 bytes and then stops (which hangs
> >>> the drive).  What am I missing?
> >> What about the length embedded in the CDB, which is usually derived from
> >> scsi_bufflen(), or other places that look at scsi_bufflen() and not at
> >> request && it's bios. The later might be bigger then scsi's in split commands
> >> but the drivers should only consume scsi_bufflen() bytes.
> > 
> > A fine idea, completely true ... I fixed it like this:
> > 
> > +       old_size = bio_iovec(bio)->bv_len;
> > +printk("before: bi_size %d, data_len %d, bv_len %d sdb length %d\n",
> > +               bio->bi_size, req->data_len, old_size, scmd->sdb.length);
> > +       if (size > old_size) {
> > +               bio_add_pc_page(req->q, bio, bio_page(bio),
> > +                                       size - old_size, old_size);
> > +       }
> > +       scmd->sdb.length = req->data_len = size;
> > +printk("after: bi_size %d, data_len %d, bv_len %d sdb length %d\n",
> > +               bio->bi_size, req->data_len, bio_iovec(bio)->bv_len,
> > +               scmd->sdb.length);
> > 
> > and it howed sdb.length being 24 before, and 512 after.
> > 
> > And the damn thing still spit out 24 bytes onto the bus and stopped.
> > 
> > To prove where the bug is, I lied to SCSI.  I changed this:
> > 
> > -       if (bio_add_pc_page(q, bio, page, 24, 0) < 24) {
> > +       if (bio_add_pc_page(q, bio, page, 512, 0) < 512) {
> > 
> > and we spat out a 512 byte sector to the disc, which accepted it and
> > erased the trimmed sector.  Yay.
> > 
> > So we can go back to looking for a *fifth* place where we store the
> > length of the data we're transferring.  I'm not convinced this says good
> > things about our storage stack that we have so many places where we
> > store the length.  There's more than this of course, because there's
> > ATA's qc->nbytes, and tf->nsect+hob_nsect, but I already set those
> > correctly.
> > 
> 
> That's because you are doing it at the wrong level at the wrong stage.
> 1. block-level submits a request
> 2. sd/sr or what ever ULD prepares a scsi_cmnd out of request.
>    Request's sizes are only a recommendation. ULD or scsi-ml may
>    prepare a smaller command then request. Once command is prepared
>    request is disregarded, you can bang on it all you want code will
>    not care about it one bit.
> 3. LLD executes the scsi-command (Not the block-request)
> 4. scsi-ml completes command's bytes, at this stage the request might
>    not be over and, and a reminder is re-prepared so the request can
>    be complete.
> 
> The code above scmd->sdb.length = req->data_len = size; is not allowed
> and can cause data leaks.
> 
> You should ping Tejun, block-layer(1) and ATA-LLD(3) has a way to communicate
> alignments and drain buffers that expose some other possible lenght's to ata.
> 
> And to your question the missing length above is probably encoded inside the
> submitted CDB. (scsi_cmnd->cmnd). When you change the length before
> stage (2) it works.
> 
> I think you should be using the drain mechanisms built into ata 

OK, so I think you correctly identified the problem;  I don't quite
think you've identified the solution because draining is all about
disposing of excess data, in particular we only tend to have one drain
area per queue (and there could be multiple outstanding discards).

The problem in the prepare is you need to set up a command with data.
What I'm not quite clear on is why blk_rq_map_kern() on a kmalloc'd
buffer can't be used.  You'd end up with a dual bio request (one for the
discard, one for the data), but they should tear down correctly using
the separate bio teardowns and pass correctly into blk_rq_map_sg() which
was the original point.

James



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-08 21:24       ` James Bottomley
@ 2009-03-08 21:32         ` James Bottomley
  2009-03-09  8:36           ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: James Bottomley @ 2009-03-08 21:32 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Matthew Wilcox, linux-ide, linux-scsi, Tejun Heo

On Sun, 2009-03-08 at 16:24 -0500, James Bottomley wrote:
> The problem in the prepare is you need to set up a command with data.
> What I'm not quite clear on is why blk_rq_map_kern() on a kmalloc'd
> buffer can't be used.  You'd end up with a dual bio request (one for the
> discard, one for the data), but they should tear down correctly using
> the separate bio teardowns and pass correctly into blk_rq_map_sg() which
> was the original point.

Actually, found the reason, blk_rq_map_kern will blast the original bio
from the request.  You could fix this by chaining it back again at the
beginning.  If that works, we could just wrap it into a block API to
prevent users from having to muck with bios.

James



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-08 21:32         ` James Bottomley
@ 2009-03-09  8:36           ` Matthew Wilcox
  2009-03-09 13:52             ` Douglas Gilbert
  2009-03-09 14:04             ` James Bottomley
  0 siblings, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2009-03-09  8:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: Boaz Harrosh, linux-ide, linux-scsi, Tejun Heo, Jeff Garzik

On Sun, Mar 08, 2009 at 04:32:36PM -0500, James Bottomley wrote:
> Actually, found the reason, blk_rq_map_kern will blast the original bio
> from the request.  You could fix this by chaining it back again at the
> beginning.  If that works, we could just wrap it into a block API to
> prevent users from having to muck with bios.

How about constructing the TRIM entirely within libata?  I won't be able
to test this patch until Oregon wakes up, but is this acceptable?

Advantages:
 - Don't need to wait for T10 to finish designing UNMAP
 - Uses well-tested ATA_16 passthrough layer
 - Changing the UNMAP implementation to do multiple ranges won't break TRIM
 - Will be easier to adapt to a future separation of scsi and libata

Disadvantages:
 - UNMAP gets less testing

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index b9747fa..edcf9db 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1029,6 +1029,55 @@ static void ata_gen_ata_sense(struct ata_queued_cmd *qc)
 	desc[11] = block;
 }
 
+static int ata_discard_fn(struct request_queue *q, struct request *req,
+							struct bio *bio)
+{
+	char *trim;
+	unsigned i, size;
+	struct page *page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+	/*
+	 * Earlier, we set bi_size to the number of bytes to be discarded so
+	 * that the elevators understood what was going on.  The bi_size
+	 * has been copied to req->data_len so we don't need to be
+	 * concerned with that any more.  Zero it out so it accurately
+	 * reflects the length of the data we're now sending to the drive.
+	 */
+	bio->bi_size = 0;
+
+	trim = bio_data(bio);
+	i = ata_set_lba_range_entries(trim, PAGE_SIZE / 8,
+					bio->bi_sector, bio_sectors(bio));
+	size = ALIGN(i * 8, 512);
+	memset(trim + i * 8, 0, size - i * 8);
+	if (bio_add_pc_page(q, bio, page, size, 0) < size) {
+		__free_page(page);
+		return -ENOMEM;
+	}
+
+	req->cmd_type = REQ_TYPE_BLOCK_PC;
+	req->cmd_len = 16;
+	req->cmd[0] = ATA_16;
+	req->cmd[1] = (ATA_PROT_DMA << 1) | 1;	/* 48-bit, protocol */
+	req->cmd[2] = 0x6;			/* length, direction */
+	req->cmd[3] = 0;			/* feature high */
+	req->cmd[4] = ATA_DSM_TRIM;		/* feature low */
+	req->cmd[5] = (size / 512) >> 8;	/* nsect high */
+	req->cmd[6] = size / 512;		/* nsect low */
+	req->cmd[7] = 0;			/* lba */
+	req->cmd[8] = 0;			/* lba */
+	req->cmd[9] = 0;			/* lba */
+	req->cmd[10] = 0;			/* lba */
+	req->cmd[11] = 0;			/* lba */
+	req->cmd[12] = 0;			/* lba */
+	req->cmd[13] = ATA_LBA;			/* device */
+	req->cmd[14] = ATA_CMD_DSM;		/* command */
+	req->cmd[15] = 0;			/* control */
+
+	return 0;
+}
+
 static void ata_scsi_sdev_config(struct scsi_device *sdev)
 {
 	sdev->use_10_for_rw = 1;
@@ -1077,6 +1126,9 @@ static int ata_scsi_dev_config(struct scsi_device *sdev,
 	/* configure max sectors */
 	blk_queue_max_sectors(sdev->request_queue, dev->max_sectors);
 
+	if (ata_id_has_trim(dev->id))
+		blk_queue_set_discard(sdev->request_queue, ata_discard_fn);
+
 	if (dev->class == ATA_DEV_ATAPI) {
 		struct request_queue *q = sdev->request_queue;
 		void *buf;
@@ -2385,6 +2437,9 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		/* sector size */
 		rbuf[10] = ATA_SECT_SIZE >> 8;
 		rbuf[11] = ATA_SECT_SIZE & 0xff;
+
+		if (ata_id_has_trim(args->id))
+			rbuf[14] = 0x80;
 	}
 
 	return 0;

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-09  8:36           ` Matthew Wilcox
@ 2009-03-09 13:52             ` Douglas Gilbert
  2009-03-09 14:03               ` INCITS Matthew Wilcox
  2009-03-09 14:08               ` Getting TRIM working James Bottomley
  2009-03-09 14:04             ` James Bottomley
  1 sibling, 2 replies; 15+ messages in thread
From: Douglas Gilbert @ 2009-03-09 13:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Boaz Harrosh, linux-ide, linux-scsi, Tejun Heo,
	Jeff Garzik

Matthew Wilcox wrote:
> On Sun, Mar 08, 2009 at 04:32:36PM -0500, James Bottomley wrote:
>> Actually, found the reason, blk_rq_map_kern will blast the original bio
>> from the request.  You could fix this by chaining it back again at the
>> beginning.  If that works, we could just wrap it into a block API to
>> prevent users from having to muck with bios.
> 
> How about constructing the TRIM entirely within libata?  I won't be able
> to test this patch until Oregon wakes up, but is this acceptable?
> 
> Advantages:
>  - Don't need to wait for T10 to finish designing UNMAP

The spc4r18.pdf and sbc3r18.pdf drafts were released recently
at t10.org. Both include thin provisioning items (e.g. see
UNMAP and WRITE SAME(16) in sbc3r18.pdf). So the design and
implementation of thin provisioning is pretty well done.

If you attempt to download these drafts you will be
challenged for a t10 login, select guest and supply a name
(e.g. "incits" is the name of the organization responsible
for this). I give "Linux" as my company.

Doug Gilbert

>  - Uses well-tested ATA_16 passthrough layer
>  - Changing the UNMAP implementation to do multiple ranges won't break TRIM
>  - Will be easier to adapt to a future separation of scsi and libata

[snip]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* INCITS
  2009-03-09 13:52             ` Douglas Gilbert
@ 2009-03-09 14:03               ` Matthew Wilcox
  2009-03-09 14:08               ` Getting TRIM working James Bottomley
  1 sibling, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2009-03-09 14:03 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: James Bottomley, Boaz Harrosh, linux-ide, linux-scsi, Tejun Heo,
	Jeff Garzik

On Mon, Mar 09, 2009 at 09:52:35AM -0400, Douglas Gilbert wrote:
> If you attempt to download these drafts you will be
> challenged for a t10 login, select guest and supply a name
> (e.g. "incits" is the name of the organization responsible
> for this). I give "Linux" as my company.

I'm not really interested in playing INCITS' stupid games.  T10 should
have the balls to tell them to go to hell, and find themselves a new
organisational home.  I'm sure there's any number of nonprofits who'd be
glad to act as an umbrella for them.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-09  8:36           ` Matthew Wilcox
  2009-03-09 13:52             ` Douglas Gilbert
@ 2009-03-09 14:04             ` James Bottomley
  2009-03-09 14:14               ` Matthew Wilcox
  1 sibling, 1 reply; 15+ messages in thread
From: James Bottomley @ 2009-03-09 14:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boaz Harrosh, linux-ide, linux-scsi, Tejun Heo, Jeff Garzik

On Mon, 2009-03-09 at 02:36 -0600, Matthew Wilcox wrote:
> On Sun, Mar 08, 2009 at 04:32:36PM -0500, James Bottomley wrote:
> > Actually, found the reason, blk_rq_map_kern will blast the original bio
> > from the request.  You could fix this by chaining it back again at the
> > beginning.  If that works, we could just wrap it into a block API to
> > prevent users from having to muck with bios.
> 
> How about constructing the TRIM entirely within libata?  I won't be able
> to test this patch until Oregon wakes up, but is this acceptable?
> 
> Advantages:
>  - Don't need to wait for T10 to finish designing UNMAP
>  - Uses well-tested ATA_16 passthrough layer
>  - Changing the UNMAP implementation to do multiple ranges won't break TRIM
>  - Will be easier to adapt to a future separation of scsi and libata
> 
> Disadvantages:
>  - UNMAP gets less testing

Also disadvantages: UNMAP will not work unless trim is pulled back out
of libata again ... which makes this look a bit like a temporary hack.
If we're going to get this working, we might as well do the job
properly.  Even if we don't quite know what UNMAP looks like, all we
need is for the SAT to work on our approximation.

As I've said before, I don't like the bio based approach.  I think you
leak a page on every trim command doing it (why?  because the kernel
mapped bios have special bio->end_io routines to collect and free the
pages, you have no such routine).

I think this is also a reflection of the fact that our discard structure
isn't quite complete in practice.  We need some sort of tear down to
allow us to free the allocated parameter list for the command.

James



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-09 13:52             ` Douglas Gilbert
  2009-03-09 14:03               ` INCITS Matthew Wilcox
@ 2009-03-09 14:08               ` James Bottomley
  1 sibling, 0 replies; 15+ messages in thread
From: James Bottomley @ 2009-03-09 14:08 UTC (permalink / raw)
  To: dgilbert
  Cc: Matthew Wilcox, Boaz Harrosh, linux-ide, linux-scsi, Tejun Heo,
	Jeff Garzik

On Mon, 2009-03-09 at 09:52 -0400, Douglas Gilbert wrote:
> If you attempt to download these drafts you will be
> challenged for a t10 login, select guest and supply a name
> (e.g. "incits" is the name of the organization responsible
> for this). I give "Linux" as my company.

Just be thankful you can still download the minutes and draft proposals.
You can no longer download the actual draft standards ....

James



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-09 14:04             ` James Bottomley
@ 2009-03-09 14:14               ` Matthew Wilcox
  2009-03-09 15:17                 ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2009-03-09 14:14 UTC (permalink / raw)
  To: James Bottomley
  Cc: Boaz Harrosh, linux-ide, linux-scsi, Tejun Heo, Jeff Garzik

On Mon, Mar 09, 2009 at 02:04:55PM +0000, James Bottomley wrote:
> > Disadvantages:
> >  - UNMAP gets less testing
> 
> Also disadvantages: UNMAP will not work unless trim is pulled back out
> of libata again ... which makes this look a bit like a temporary hack.

Huh?  sd.c will continue to have sd_discard_fn() which will work fine
for UNMAP.  This will give us a different path for ATA discs and SCSI
discs, as far as discard functions go.

> As I've said before, I don't like the bio based approach.  I think you
> leak a page on every trim command doing it (why?  because the kernel
> mapped bios have special bio->end_io routines to collect and free the
> pages, you have no such routine).

I added that in an earlier patch:

http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=commitdiff;h=f6ebf54cb8045a0cab19d73e26a07945c29d1394;hp=1223b373ae0283221b46e70955eb825679c5a6cb

I'm all in favour of a less messy approach than using bios directly, if
there is one.  I don't see what it might look like yet.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Getting TRIM working
  2009-03-09 14:14               ` Matthew Wilcox
@ 2009-03-09 15:17                 ` Matthew Wilcox
  0 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2009-03-09 15:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: Boaz Harrosh, linux-ide, linux-scsi, Tejun Heo, Jeff Garzik,
	David Woodhouse

On Mon, Mar 09, 2009 at 08:14:51AM -0600, Matthew Wilcox wrote:
> I added that in an earlier patch:
> 
> http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=commitdiff;h=f6ebf54cb8045a0cab19d73e26a07945c29d1394;hp=1223b373ae0283221b46e70955eb825679c5a6cb

I've pushed out a new version of the trim patchset:

http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090309

Now that the ATA bits don't rely on the SCSI bits, I've rearranged the
SCSI bits to be last (since we seem to be closer to having consensus
and shipping devices using TRIM than we do UNMAP).

After I've had a chance to do some testing with these patches, I'd like to
see the first five patches go upstream during the upcoming merge window,
probably through Jeff's IDE tree.

I'd also like the sixth and seventh patches to go upstream during that
merge window, since we also need Read Capacity 16 for 4k sector support,
but that is a different topic for a different thread.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2009-03-09 15:17 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-03 19:07 Getting TRIM working Matthew Wilcox
2009-03-04  9:20 ` Boaz Harrosh
2009-03-06 19:16   ` Matthew Wilcox
2009-03-08 10:28     ` Boaz Harrosh
2009-03-08 16:54       ` Matthew Wilcox
2009-03-08 17:38         ` Boaz Harrosh
2009-03-08 21:24       ` James Bottomley
2009-03-08 21:32         ` James Bottomley
2009-03-09  8:36           ` Matthew Wilcox
2009-03-09 13:52             ` Douglas Gilbert
2009-03-09 14:03               ` INCITS Matthew Wilcox
2009-03-09 14:08               ` Getting TRIM working James Bottomley
2009-03-09 14:04             ` James Bottomley
2009-03-09 14:14               ` Matthew Wilcox
2009-03-09 15:17                 ` Matthew Wilcox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.