All of lore.kernel.org
 help / color / mirror / Atom feed
* configurable discard parameters
@ 2015-06-20 17:12 Tom Yan
  2015-06-21  0:20 ` Martin K. Petersen
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Yan @ 2015-06-20 17:12 UTC (permalink / raw)
  To: linux-scsi, linux-ide

[-- Attachment #1: Type: text/plain, Size: 2716 bytes --]

Today I check if blkdiscard really does a full device trim/wipe for my
Intel 530 SSD (240gb) with hexdump. I end up found that it fail to do
so because it report garbage info on its block limits VPD.

[tom@localhost ~]$ sudo sg_vpd -p 0xb0 /dev/sda
Block limits VPD page (SBC):
  Write same non-zero (WSNZ): 0
  Maximum compare and write length: 0 blocks
  Optimal transfer length granularity: 1 blocks
  Maximum transfer length: 0 blocks
  Optimal transfer length: 0 blocks
  Maximum prefetch length: 0 blocks
  Maximum unmap LBA count: 0
  Maximum unmap block descriptor count: 0
  Optimal unmap granularity: 1
  Unmap granularity alignment valid: 0
  Unmap granularity alignment: 0
  Maximum write same length: 0x3fffc0 blocks
  Maximum atomic transfer length: 0
  Atomic alignment: 0
  Atomic transfer length granularity: 0

The fact is, in each iteration, for this drive, blkdiscard can only
trim a maximum of 65528 sectors, which is the largest multiple of 8
sectors, which is the minimum possible. (In fact 65535 sectors seems
to be some sort of limit of ATA TRIM commands. It's also the maximum
of WRITE SAME (10).)

With `blkdiscard`, I can still workaround this with the "--step"
option, but for discard mount options, I don't think I have any way to
deal with it. Although it's basically the reponsibility of Intel (Well
I tried to find a way to complain to them but even their website is
borked as always), but still I would like to know why the kernel
doesn't allow "discard_granularity" and "discard_max_bytes" to be
configurable for users. Also wanna share about this in case nobody
ever noticed.


Another case is, I have a SanDisk Extreme USB (32gb, w/o UASP), which
is sort of a SATA SSD with USB bridge. I can basically TRIM the drive
with hdparm (although there's a minor issue:
https://sourceforge.net/p/hdparm/bugs/63/), but I can't do it with
blkdiscard. At first I thought it was only because its VPD doesn't
provide limits of discard:

[tom@localhost ~]$ sudo sg_vpd -p 0xb0 /dev/sdc
Block limits VPD page (SBC):
  Write same non-zero (WSNZ): 0
  Maximum compare and write length: 0 blocks
  Optimal transfer length granularity: 0 blocks
  Maximum transfer length: 8388607 blocks
  Optimal transfer length: 8388607 blocks
  Maximum prefetch length: 0 blocks

but then I try to use the attached patch to force some value for it,
blkdiscard still gives me "I/O error" instead of "NOT SUPPROTED". So
what's the requirement of the current discard way in the kernel? Does
it work differently on SATA drives than on USB drives because of the
controller (like involvement of libata)? Does it require capability of
one of the three scsi unmap ways? Or did I just miss some other dirty
hacking bits?

[-- Attachment #2: discard.patch --]
[-- Type: text/x-patch, Size: 1570 bytes --]

diff --git a/drivers/scsi/sd.c~ b/drivers/scsi/sd.c
index a661d33..a9cdfb6 100644
--- a/drivers/scsi/sd.c~
+++ b/drivers/scsi/sd.c
@@ -99,6 +99,7 @@ MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC);
 #endif
 
 static void sd_config_discard(struct scsi_disk *, unsigned int);
+static void sd_config_discard_novpd(struct scsi_disk *);
 static void sd_config_write_same(struct scsi_disk *);
 static int  sd_revalidate_disk(struct gendisk *);
 static void sd_unlock_native_capacity(struct gendisk *disk);
@@ -700,6 +701,22 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
 
+static void sd_config_discard_novpd(struct scsi_disk *sdkp)
+{
+	struct request_queue *q = sdkp->disk->queue;
+	unsigned int logical_block_size = sdkp->device->sector_size;
+	unsigned int max_blocks = (u32)SD_MAX_WS16_BLOCKS;
+
+	q->limits.discard_zeroes_data = 0;
+	q->limits.discard_alignment = 0;
+	q->limits.discard_granularity = sdkp->physical_block_size;
+
+	sdkp->provisioning_mode = SD_LBP_WS16;
+
+	q->limits.max_discard_sectors = max_blocks * (logical_block_size >> 9);
+	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+}
+
 /**
  * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device
  * @sdp: scsi device to operate one
@@ -2776,6 +2793,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
 			sd_read_block_limits(sdkp);
 			sd_read_block_characteristics(sdkp);
 		}
+		else {
+			sd_config_discard_novpd(sdkp);
+		}
 
 		sd_read_write_protect_flag(sdkp, buffer);
 		sd_read_cache_type(sdkp, buffer);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-20 17:12 configurable discard parameters Tom Yan
@ 2015-06-21  0:20 ` Martin K. Petersen
  2015-06-21  7:03   ` Tom Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2015-06-21  0:20 UTC (permalink / raw)
  To: Tom Yan; +Cc: linux-scsi, linux-ide

>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom> Today I check if blkdiscard really does a full device trim/wipe for
Tom> my Intel 530 SSD (240gb) with hexdump. I end up found that it fail
Tom> to do so because it report garbage info on its block limits VPD.

It is a SATA-attached drive, it has no block limits VPD. What you are
seeing is information prepared by libata's SATL.

Tom> The fact is, in each iteration, for this drive, blkdiscard can only
Tom> trim a maximum of 65528 sectors, which is the largest multiple of 8
Tom> sectors, which is the minimum possible.

  Maximum write same length: 0x3fffc0 blocks

That's 2GB-e per request for a drive with 512-byte logical blocks.

Tom> but still I would like to know why the kernel doesn't allow
Tom> "discard_granularity" and "discard_max_bytes" to be configurable
Tom> for users.

Because if the vendor got these trivial values wrong there is little to
no chance that they implemented discard correctly in their firmware.

-- 
Martin K. Petersen	Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-21  0:20 ` Martin K. Petersen
@ 2015-06-21  7:03   ` Tom Yan
  2015-06-21  8:05     ` Tom Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Yan @ 2015-06-21  7:03 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide

On 21 June 2015 at 08:20, Martin K. Petersen <martin.petersen@oracle.com> wrote:
> It is a SATA-attached drive, it has no block limits VPD. What you are
> seeing is information prepared by libata's SATL.
>
> Because if the vendor got these trivial values wrong there is little to
> no chance that they implemented discard correctly in their firmware.

I don't get it. So there's a chance that the VPDs is not purely from
the hardware? Then how can I differentiate them? But then you said
"the vendor got these trivial values wrong", so were you talking about
this drive or just for real SCSI drives?
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-21  7:03   ` Tom Yan
@ 2015-06-21  8:05     ` Tom Yan
  2015-06-21 12:36       ` Tom Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Yan @ 2015-06-21  8:05 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide

By the way do you think it could be a bug of libata's SATL anyway?
Like perhaps it should break a single unmap request to multiple ATA
commands? I am not totally sure about it but it looks like there's a
limit of addressed blocks in a single ATA DSM/TRIM command (4 bytes,
which is 65535).

On 21 June 2015 at 15:03, Tom Yan <tom.ty89@gmail.com> wrote:
> On 21 June 2015 at 08:20, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>> It is a SATA-attached drive, it has no block limits VPD. What you are
>> seeing is information prepared by libata's SATL.
>>
>> Because if the vendor got these trivial values wrong there is little to
>> no chance that they implemented discard correctly in their firmware.
>
> I don't get it. So there's a chance that the VPDs is not purely from
> the hardware? Then how can I differentiate them? But then you said
> "the vendor got these trivial values wrong", so were you talking about
> this drive or just for real SCSI drives?
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-21  8:05     ` Tom Yan
@ 2015-06-21 12:36       ` Tom Yan
  2015-06-21 20:30         ` Tom Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Yan @ 2015-06-21 12:36 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-scsi.c?id=18f0f97850059303ed73b1f02084f55ca330a80c

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-scsi.c?id=e78db4dfb1355a895f7ea50133b702b55b8ed184

So I think I found the answer myself. But I don't see the rationale
here at all. So values (arbitrary?) of granularity and maximum has
been hardcoded in the kernel for ALL devices which make use of libata,
and the parameter cannot be changed by user in runtime? So does it
expect all devices to be of the same spec? Or is it just like a
draft/example which people can only consider it as "may work"?

On 21 June 2015 at 16:05, Tom Yan <tom.ty89@gmail.com> wrote:
> By the way do you think it could be a bug of libata's SATL anyway?
> Like perhaps it should break a single unmap request to multiple ATA
> commands? I am not totally sure about it but it looks like there's a
> limit of addressed blocks in a single ATA DSM/TRIM command (4 bytes,
> which is 65535).
>
> On 21 June 2015 at 15:03, Tom Yan <tom.ty89@gmail.com> wrote:
>> On 21 June 2015 at 08:20, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>>> It is a SATA-attached drive, it has no block limits VPD. What you are
>>> seeing is information prepared by libata's SATL.
>>>
>>> Because if the vendor got these trivial values wrong there is little to
>>> no chance that they implemented discard correctly in their firmware.
>>
>> I don't get it. So there's a chance that the VPDs is not purely from
>> the hardware? Then how can I differentiate them? But then you said
>> "the vendor got these trivial values wrong", so were you talking about
>> this drive or just for real SCSI drives?
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-21 12:36       ` Tom Yan
@ 2015-06-21 20:30         ` Tom Yan
  2015-06-22 20:57           ` Martin K. Petersen
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Yan @ 2015-06-21 20:30 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide

After rechecking the result of blkdiscard (without specifying --step)
with hexdump, I see that the single ioctl is actually broken into
multiple TRIM commands/ranges by 65535 sector. I wonder if it will be
further tuned by ANY KIND of granularity though. Though it will
probably only be a problem for device TRIM, and for that you can
probably always rely on blkdiscard with --step (and avoid device TRIM
with things like mkfs.btrfs) so I guess it won't be a big problem. And
it will be fine for filesystem TRIM anyway I guess.

So it seems that "Maximum write same length: 0x3fffc0 blocks" doesn't
really matter (In fact when I ran `sg_write_same -U --lba=0
--num=0x7fffffff`, the command also trim the first 65528 sectors of
the drive). However, I am curious what is the meaning of "Our current
TRIM payload is a single sector that can accommodate 64 * 65535 blocks
being unmapped. Report this value in the Block Limits Maximum Unmap
LBA count field." in the commit message. What does the value actually
mean/affect? Could it cause trouble to certain drives?

By the way I think I got the answer for my USB TRIM question.
Basically for USB->SATA drives the SATL is implemented in the bridges,
so it must be able to "SAT" one of the three SCSI unmap methods to ATA
TRIM, just like libata does, so that it can handle whatever requested
through the scsi disk driver, otherwise one can only TRIM it through
SCSI's ATA pass-through commands if the bridge supports them, e.g.
hdparm. So my dirty hacking was silly.

P.S.:

[tom@localhost ~]$ sudo sg_vpd -p ai /dev/sda
ATA information VPD page:
  SAT Vendor identification: linux
  SAT Product identification: libata
  SAT Product revision level: 3.00

So it actually tells. And I hope that the kernel wouldn't "falsify"
anything for devices which do provide some VPD(s). : \

Another P.S.: I made a mistake in one of my last mails. 65535 is two
bytes, not four bytes. : /

On 21 June 2015 at 20:36, Tom Yan <tom.ty89@gmail.com> wrote:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-scsi.c?id=18f0f97850059303ed73b1f02084f55ca330a80c
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-scsi.c?id=e78db4dfb1355a895f7ea50133b702b55b8ed184
>
> So I think I found the answer myself. But I don't see the rationale
> here at all. So values (arbitrary?) of granularity and maximum has
> been hardcoded in the kernel for ALL devices which make use of libata,
> and the parameter cannot be changed by user in runtime? So does it
> expect all devices to be of the same spec? Or is it just like a
> draft/example which people can only consider it as "may work"?
>
> On 21 June 2015 at 16:05, Tom Yan <tom.ty89@gmail.com> wrote:
>> By the way do you think it could be a bug of libata's SATL anyway?
>> Like perhaps it should break a single unmap request to multiple ATA
>> commands? I am not totally sure about it but it looks like there's a
>> limit of addressed blocks in a single ATA DSM/TRIM command (4 bytes,
>> which is 65535).
>>
>> On 21 June 2015 at 15:03, Tom Yan <tom.ty89@gmail.com> wrote:
>>> On 21 June 2015 at 08:20, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>>>> It is a SATA-attached drive, it has no block limits VPD. What you are
>>>> seeing is information prepared by libata's SATL.
>>>>
>>>> Because if the vendor got these trivial values wrong there is little to
>>>> no chance that they implemented discard correctly in their firmware.
>>>
>>> I don't get it. So there's a chance that the VPDs is not purely from
>>> the hardware? Then how can I differentiate them? But then you said
>>> "the vendor got these trivial values wrong", so were you talking about
>>> this drive or just for real SCSI drives?
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-21 20:30         ` Tom Yan
@ 2015-06-22 20:57           ` Martin K. Petersen
  2015-06-23 14:16             ` Tom Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2015-06-22 20:57 UTC (permalink / raw)
  To: Tom Yan; +Cc: Martin K. Petersen, linux-scsi, linux-ide

>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom> However, I am curious what is the meaning of "Our current TRIM
Tom> payload is a single sector that can accommodate 64 * 65535 blocks
Tom> being unmapped. Report this value in the Block Limits Maximum Unmap
Tom> LBA count field." in the commit message. What does the value
Tom> actually mean/affect? Could it cause trouble to certain drives?

DSM Trim takes a payload of <block nr, block count> ranges. The block
count is constrained to 65535 by virtue of being 16 bits. We can fit 64
8-byte range descriptors in a single 512-byte sector payload. That's why
we set the limit for a single, contiguous discard request to
0x3fffc0. That results in 2GB minus change for a single command.

Tom> By the way I think I got the answer for my USB TRIM question.
Tom> Basically for USB->SATA drives the SATL is implemented in the
Tom> bridges, so it must be able to "SAT" one of the three SCSI unmap
Tom> methods to ATA TRIM, just like libata does,

Yep. I have set to see one that gets that right, though. There are even
many enterprise SAS/RAID controllers that don't support UNMAP-TRIM
translation.

Linux' libata was one of the first implementations of SCSI-ATA
translation for TRIM. Initially we bent the rules for UNMAP a bit. When
the standards caught up we switched to WRITE SAME.

Tom> So it actually tells. And I hope that the kernel wouldn't "falsify"
Tom> anything for devices which do provide some VPD(s). : \

SATA doesn't have VPDs.

-- 
Martin K. Petersen	Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-22 20:57           ` Martin K. Petersen
@ 2015-06-23 14:16             ` Tom Yan
  2015-06-23 15:36               ` Martin K. Petersen
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Yan @ 2015-06-23 14:16 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide

Still, I wonder if the kernel should allow the user to configure the
real granularity and send ATA commands that are rounded off by it
(which works just like --step in blkdiscard). For example,

(64 * 65535) % 8 = 0

but

65535 % 8 = 7

So the block count limit for the ATA commands would be 65535 - (65535
% real_gran) per range.


Also, by experimenting on my SanDisk Extreme USB with hdparm TRIM and
hexdump, I can see that the maximum block/sector allowed to be
discarded per ATA command is 65536, inspite of the payload size. I
don't know whether the USB bridging or the way hdparm does TRIM
matters, but it seems that some devices can't really handle limit like
0x3fffc0 blocks.

So I still think that the kernel should allow users to configure
limits that can make libata send reliable ATA TRIM commands.

On 23 June 2015 at 04:57, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>
> Tom> So it actually tells. And I hope that the kernel wouldn't "falsify"
> Tom> anything for devices which do provide some VPD(s). : \
>
> SATA doesn't have VPDs.
>

I know now, that's why I think it's "alright" for libata to do so. I
just don't know whether the kernel would tamper with them for devices
like USB or real SCSI drives.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-23 14:16             ` Tom Yan
@ 2015-06-23 15:36               ` Martin K. Petersen
  2015-06-23 16:41                 ` Tom Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2015-06-23 15:36 UTC (permalink / raw)
  To: Tom Yan; +Cc: Martin K. Petersen, linux-scsi, linux-ide

>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom> I don't know whether the USB bridging or the way hdparm does TRIM
Tom> matters, but it seems that some devices can't really handle limit
Tom> like 0x3fffc0 blocks.

The 0x3fffc0 limit is for SATA devices connected through Linux' libata
SCSI ATA translation. This number corresponds to the biggest range we
can discard while maintaining a 1:1 mapping between the SCSI command and
the ATA ditto.

If you connect the SATA drive to a USB/UAS bridge you are entirely at
the bridge's whim when it comes to translation of WRITE SAME(10/16) w/
UNMAP or the UNMAP command to DSM TRIM. libata is not involved and
neither is the 0x3fffc0 limit (although chances are that the drive has
the same limit internally). Even when using hdparm the bridge device
still needs to correctly translate the ATA PASSTHROUGH commands.

Tom> So I still think that the kernel should allow users to configure
Tom> limits that can make libata send reliable ATA TRIM commands.

You haven't given us any reason to. We are not aware of any ATA drives
that put constraints on the range block count.

Again, if you want to help please focus on your bridge devices and what,
if anything, can be done to uniquely identify them and override any
incorrect values they might report to the SCSI stack. Up to and
including us disabling discard entirely on these devices if their
translation isn't verifiable and bullet proof.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-23 15:36               ` Martin K. Petersen
@ 2015-06-23 16:41                 ` Tom Yan
  2015-06-23 17:03                   ` Martin K. Petersen
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Yan @ 2015-06-23 16:41 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide

On 23 June 2015 at 23:36, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>
> You haven't given us any reason to. We are not aware of any ATA drives
> that put constraints on the range block count.
>

What I have been doing is trying to show you example of those
constraints. When I talked about the block count limit of the SanDisk
Extreme USB, I am not asking you to fix anything for the drive I HAVE
(because I can only use hdparm to TRIM it anyway), but to show you
that some devices MIGHT have such limit. I just can't be 100% sure of
it, even though from what I see in the different results on payload
size and range sector count, I don't think the bridge is intervening
at all (but just the limit of the SATA controller behind).

But for my Intel 530 SSD, the granularity problem is obvious and solid
enough because it's connected directly with SATA. Currently what the
kernel does is assume all devices support 1 sector granularity. And
the problem doesn't even lie on the current "hardcoded" granularity in
libata, because blkdev_issue_discard() only does "mod check" against
the granularity on the maximum sector counts, so even if it's allowed
to be configured, it wouldn't really help. And I have yet to see
anything which actually does "mod check" against ANY granularity on
the sector count per range. This is what the example in my last mail
was about.

And the only info I have ever saw about TRIM block limit of a SATA
drive is in `hdparm -I`. For my Intel SSD, which actually has a lower
limit of 8 sectors, it shows:
Data Set Management TRIM supported (limit 1 block)
while for the SanDisk Extreme USB, which actually has a lower limit of
256 sectors, it shows:
Data Set Management TRIM supported (limit 8 blocks)
Inspite of the accuracy, I don't see the kernel actually read this limit.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-23 16:41                 ` Tom Yan
@ 2015-06-23 17:03                   ` Martin K. Petersen
  2015-06-23 17:24                     ` Tom Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2015-06-23 17:03 UTC (permalink / raw)
  To: Tom Yan; +Cc: Martin K. Petersen, linux-scsi, linux-ide

>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom> Currently what the kernel does is assume all devices support 1
Tom> sector granularity.

The ATA Command Set does not allow for any other granularity than 1
sector.

Bigger granularity reporting is supported in SCSI SBC to allow for
thinly provisioned disk arrays with a bigger allocation unit. It is
common for disk arrays to provision LUN space in units of 1MB or more.

Tom> For my Intel SSD, which actually has a lower limit of 8 sectors, it
Tom> shows: Data Set Management TRIM supported (limit 1 block) while for
Tom> the SanDisk Extreme USB, which actually has a lower limit of 256
Tom> sectors, it shows: Data Set Management TRIM supported (limit 8
Tom> blocks)

It don't know what these "lower limits" you are talking about are.

What hdparm tells you is that DSM TRIM on the Intel drive supports 1
block (512 bytes) of range payload data. Whereas the SanDisk drive
supports a full 4KB page of range payload data (8 * 512 bytes).

We did try to enable payloads bigger than 512 bytes back in the day but
most drives that reported supporting the bigger payload actually didn't
and all hell broke loose. So we reverted to a single sector payload for
libata.

I still have the payload patch in my SSD test branch and regularly test
with it. But there are still drives that fail in mysterious ways with it
in place and so far I haven't felt compelled to maintain yet another
blacklist.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-23 17:03                   ` Martin K. Petersen
@ 2015-06-23 17:24                     ` Tom Yan
  2015-06-23 18:26                       ` Martin K. Petersen
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Yan @ 2015-06-23 17:24 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide

On 24 June 2015 at 01:03, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>
> It don't know what these "lower limits" you are talking about are.
>

[tom@localhost ~]$ sudo shred -v -n 1 /dev/sda3

[tom@localhost ~]$ sudo blkdiscard -l 512 /dev/sda3
[tom@localhost ~]$ sudo hexdump -n 4096 /dev/sda3 | head
0000000 58c7 f3b0 0f97 2a9a 2da8 7c44 11f5 7852
0000010 4f1d d19c 682b a3d8 6be9 61df 4eb3 a4ee
0000020 f28b f2f6 ef36 c244 f1b0 eadd 5ad0 c5ec
0000030 54c4 abb4 f9c3 7ca0 d807 63ad 81cd 45e6
0000040 2b30 709c 550a 5e52 6928 4468 78c9 f671
0000050 eef1 dfa7 2a6d 3150 64f8 f9a2 240d bbf4
0000060 2cff 02e0 c893 16a7 67c2 8cd9 af76 71b6
0000070 4740 83cc 5b15 3db0 64b2 1a6d 3f40 5a98
0000080 09d0 035c b5ae 2cd2 1850 ea70 7a09 0491
0000090 6b27 9d81 9d74 a087 afd9 e75a 85ea adf6

[tom@localhost ~]$ sudo blkdiscard -l 3584 /dev/sda3
[tom@localhost ~]$ sudo hexdump -n 4096 /dev/sda3 | head
0000000 58c7 f3b0 0f97 2a9a 2da8 7c44 11f5 7852
0000010 4f1d d19c 682b a3d8 6be9 61df 4eb3 a4ee
0000020 f28b f2f6 ef36 c244 f1b0 eadd 5ad0 c5ec
0000030 54c4 abb4 f9c3 7ca0 d807 63ad 81cd 45e6
0000040 2b30 709c 550a 5e52 6928 4468 78c9 f671
0000050 eef1 dfa7 2a6d 3150 64f8 f9a2 240d bbf4
0000060 2cff 02e0 c893 16a7 67c2 8cd9 af76 71b6
0000070 4740 83cc 5b15 3db0 64b2 1a6d 3f40 5a98
0000080 09d0 035c b5ae 2cd2 1850 ea70 7a09 0491
0000090 6b27 9d81 9d74 a087 afd9 e75a 85ea adf6

[tom@localhost ~]$ sudo blkdiscard -l 4096 /dev/sda3
[tom@localhost ~]$ sudo hexdump -n 4096 /dev/sda3 | head
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000

So when libata issue ATA commands with ranges of 65535 sectors, only
65535-(65535%8) = 65528 sectors are discarded, yet it wouldn't know
about that so the next range will start from LBA 65536. I can
workaround this by specifying --step in blkdiscard, but I think the
kernel should have a param configurable for general.

>
> What hdparm tells you is that DSM TRIM on the Intel drive supports 1
> block (512 bytes) of range payload data. Whereas the SanDisk drive
> supports a full 4KB page of range payload data (8 * 512 bytes).
>

Oh so they are not reporting garbage. That's good to know. I guess
that's why hdparm actually splits the range list I provided by
512-range per ATA command (which works) for the SanDisk drive.

>
> We did try to enable payloads bigger than 512 bytes back in the day but
> most drives that reported supporting the bigger payload actually didn't
> and all hell broke loose. So we reverted to a single sector payload for
> libata.
>
> I still have the payload patch in my SSD test branch and regularly test
> with it. But there are still drives that fail in mysterious ways with it
> in place and so far I haven't felt compelled to maintain yet another
> blacklist.
>

For now I have no problem with the single sector payload thing.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-23 17:24                     ` Tom Yan
@ 2015-06-23 18:26                       ` Martin K. Petersen
  2015-06-23 21:25                         ` Tom Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2015-06-23 18:26 UTC (permalink / raw)
  To: Tom Yan; +Cc: Martin K. Petersen, linux-scsi, linux-ide

>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom> So when libata issue ATA commands with ranges of 65535 sectors,
Tom> only 65535-(65535%8) = 65528 sectors are discarded,

That's unfortunate but TRIM is advisory so the drive is free to ignore
all or parts of the request.

What happens if you discard sectors 0-6 and then sector 7?

Tom> I can workaround this by specifying --step in blkdiscard, but I
Tom> think the kernel should have a param configurable for general.

This is on the Intel 530? What does the drive report in
/sys/block/sdN/queue/discard_zeroes_data?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-23 18:26                       ` Martin K. Petersen
@ 2015-06-23 21:25                         ` Tom Yan
  2015-06-24  2:55                           ` Martin K. Petersen
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Yan @ 2015-06-23 21:25 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide

[-- Attachment #1: Type: text/plain, Size: 1608 bytes --]

First of all let me add another "statistic" about the issue:

[tom@localhost ~]$ sudo shred -n 1 /dev/sda3
[tom@localhost ~]$ sudo blkdiscard /dev/sda3
[tom@localhost ~]$ sudo hexdump /dev/sda3 | wc -l
310635
[tom@localhost ~]$ sudo hexdump /dev/sda3 | pcregrep -M '0000 0000
0000 0000 0000 0000 0000 0000\n\*' | wc -l
2410

total untrimmed sectors: (310635 - 1 - 2410) / 32 = 9632
total ranges: (43 GiB * (1024 ^ 3) / 512) / 65535 ~= 1376
average untrimmed sectors per range: (9632 / 1376) = 7 = (65535 % 8)

Also, FWIW:

[tom@localhost ~]$ sudo shred -n 1 /dev/sda3
[tom@localhost ~]$ let step=(65535-65535%8)*512
[tom@localhost ~]$ echo $step
33550336
[tom@localhost ~]$ sudo blkdiscard -p "$step" /dev/sda3
[tom@localhost ~]$ sudo hexdump /dev/sda3
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
ac0000000

On 24 June 2015 at 02:26, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>
> What happens if you discard sectors 0-6 and then sector 7?
>

[tom@localhost ~]$ sudo shred -n 1 /dev/sda3
[tom@localhost ~]$ sudo blkdiscard -l 3584 /dev/sda3
[tom@localhost ~]$ sudo blkdiscard -o 3584 -l 512 /dev/sda3
[tom@localhost ~]$ sudo hexdump -n 4096 /dev/sda3
0000000 f06d 8365 5e1b 616c 7362 4d61 2182 02fb
...
0000ff0 54ef 9579 51bc 9042 115a 375e c28f 4dcc
0001000

>
> This is on the Intel 530? What does the drive report in
> /sys/block/sdN/queue/discard_zeroes_data?
>

Yes. It reports 0. In `hdparm -I`:
       *    Data Set Management TRIM supported (limit 1 block)
       *    Deterministic read data after TRIM

You may also want to check out and compare the two attached test case files.

[-- Attachment #2: sandisk_test --]
[-- Type: application/octet-stream, Size: 3176 bytes --]

[tom@localhost ~]$ sudo shred -n 1 /dev/sdc1
[tom@localhost ~]$ sudo hdparm --please-destroy-my-drive --trim-sector-ranges 256:1 /dev/sdc

/dev/sdc:
trimming 1 sectors from 1 ranges
succeeded
[tom@localhost ~]$ sudo hexdump /dev/sdc1 | head -n 3
0000000 2c1c 1ef7 b9ed bcc3 9bce cf70 3849 c601
0000010 63bb 6423 204f 95af e414 64c6 48c9 6a35
0000020 6914 f4c2 be6e 9c93 d561 9bf4 c76b 0bce
[tom@localhost ~]$ sudo shred -n 1 /dev/sdc1
[tom@localhost ~]$ sudo hdparm --please-destroy-my-drive --trim-sector-ranges 256:128 /dev/sdc

/dev/sdc:
trimming 128 sectors from 1 ranges
succeeded
[tom@localhost ~]$ sudo hexdump /dev/sdc1 | head -n 3
0000000 de94 1b97 e60f 416c 6dce e138 a5a6 7cb9
0000010 5412 9f00 d573 d9c1 6fdf e4df 65ad 1e26
0000020 e6fd 24dd 0907 9471 b5f4 3b67 e7cf c165
[tom@localhost ~]$ sudo shred -n 1 /dev/sdc1
[tom@localhost ~]$ sudo hdparm --please-destroy-my-drive --trim-sector-ranges 256:256 /dev/sdc

/dev/sdc:
trimming 256 sectors from 1 ranges
succeeded
[tom@localhost ~]$ sudo hexdump /dev/sdc1 | head -n 3
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000 8fdb 7a20 c3ee 9e29 c325 29b2 c323 29dd
[tom@localhost ~]$ sudo shred -n 1 /dev/sdc1
[tom@localhost ~]$ for (( i=256; i<=256; i+=1 )); do echo "$i:1"; done | sudo hdparm --please-destroy-my-drive --trim-sector-ranges-stdin /dev/sdc

/dev/sdc:
trimming 1 sectors from 1 ranges
succeeded
[tom@localhost ~]$ sudo hexdump /dev/sdc1 | head -n 3
0000000 ae7d d079 e9c1 4f5c 3c83 8989 c1b7 3f5a
0000010 8c41 5197 66a6 cfee 1d2f 2468 0c95 32b9
0000020 4679 fdea 4b33 b86d 34c7 6721 a921 f35c
[tom@localhost ~]$ sudo shred -n 1 /dev/sdc1
[tom@localhost ~]$ for (( i=256; i<=383; i+=1 )); do echo "$i:1"; done | sudo hdparm --please-destroy-my-drive --trim-sector-ranges-stdin /dev/sdc

/dev/sdc:
trimming 128 sectors from 128 ranges
succeeded
[tom@localhost ~]$ sudo hexdump /dev/sdc1 | head -n 3
0000000 6067 e2c3 8fe2 475c 2cf7 b7d1 5c9f e7df
0000010 9f12 1a09 8c15 d54d 0b9a 6973 c3a1 2bc1
0000020 7e30 4b1c e427 7608 9f1f 8d5a f390 ff1a
[tom@localhost ~]$ sudo shred -n 1 /dev/sdc1
[tom@localhost ~]$ for (( i=256; i<=511; i+=1 )); do echo "$i:1"; done | sudo hdparm --please-destroy-my-drive --trim-sector-ranges-stdin /dev/sdc

/dev/sdc:
trimming 256 sectors from 256 ranges
succeeded
[tom@localhost ~]$ sudo hexdump /dev/sdc1 | head -n 3
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000 e90c 9dd1 9f7c 0504 afc3 24ec e07d d93a
[tom@localhost ~]$ sudo shred -n 1 /dev/sdc1
[tom@localhost ~]$ sudo hdparm --please-destroy-my-drive --trim-sector-ranges 256:65535 65791:65025 /dev/sdc

/dev/sdc:
trimming 130560 sectors from 2 ranges
succeeded
[tom@localhost ~]$ sudo hexdump /dev/sdc1 | head -n 3
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1fe0000 f634 ce74 02e9 43a0 164d 7388 ea26 6767
[tom@localhost ~]$ sudo shred -n 1 /dev/sdc1
[tom@localhost ~]$ sudo hdparm --please-destroy-my-drive --trim-sector-ranges 256:65280 65536:65280 /dev/sdc

/dev/sdc:
trimming 130560 sectors from 2 ranges
succeeded
[tom@localhost ~]$ sudo hexdump /dev/sdc1 | head -n 3
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
3fc0000 62f4 c0a3 2789 ec72 f555 994b 449f 9ff2
[tom@localhost ~]$ 

[-- Attachment #3: intel_test --]
[-- Type: application/octet-stream, Size: 1339 bytes --]

[tom@localhost ~]$ sudo sh -c 'for (( i=512; i<=4096; i+=512 )); do blkdiscard -p "$i" -l 4096 /dev/sda3; sleep 5; echo "${i}:"; hexdump /dev/sda3 | head -n 3; done'
512:
0000000 8537 38ae 0ed2 1442 36c0 fab6 0bbf 9b96
0000010 3d5e 138f 8890 3af4 644e 04a6 062b 0625
0000020 9f62 e65c cb8d 03c2 36df 37c7 999c bd16
1024:
0000000 8537 38ae 0ed2 1442 36c0 fab6 0bbf 9b96
0000010 3d5e 138f 8890 3af4 644e 04a6 062b 0625
0000020 9f62 e65c cb8d 03c2 36df 37c7 999c bd16
1536:
0000000 8537 38ae 0ed2 1442 36c0 fab6 0bbf 9b96
0000010 3d5e 138f 8890 3af4 644e 04a6 062b 0625
0000020 9f62 e65c cb8d 03c2 36df 37c7 999c bd16
2048:
0000000 8537 38ae 0ed2 1442 36c0 fab6 0bbf 9b96
0000010 3d5e 138f 8890 3af4 644e 04a6 062b 0625
0000020 9f62 e65c cb8d 03c2 36df 37c7 999c bd16
2560:
0000000 8537 38ae 0ed2 1442 36c0 fab6 0bbf 9b96
0000010 3d5e 138f 8890 3af4 644e 04a6 062b 0625
0000020 9f62 e65c cb8d 03c2 36df 37c7 999c bd16
3072:
0000000 8537 38ae 0ed2 1442 36c0 fab6 0bbf 9b96
0000010 3d5e 138f 8890 3af4 644e 04a6 062b 0625
0000020 9f62 e65c cb8d 03c2 36df 37c7 999c bd16
3584:
0000000 8537 38ae 0ed2 1442 36c0 fab6 0bbf 9b96
0000010 3d5e 138f 8890 3af4 644e 04a6 062b 0625
0000020 9f62 e65c cb8d 03c2 36df 37c7 999c bd16
4096:
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 2196 b091 70e4 24eb 3dab 2834 a1cf 249a
[tom@localhost ~]$ 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-23 21:25                         ` Tom Yan
@ 2015-06-24  2:55                           ` Martin K. Petersen
  2015-06-24 12:46                             ` Tom Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2015-06-24  2:55 UTC (permalink / raw)
  To: Tom Yan; +Cc: Martin K. Petersen, linux-scsi, linux-ide

>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom> total untrimmed sectors: (310635 - 1 - 2410) / 32 = 9632 total
Tom> ranges: (43 GiB * (1024 ^ 3) / 512) / 65535 ~= 1376 average
Tom> untrimmed sectors per range: (9632 / 1376) = 7 = (65535 % 8)

Every type of drive has its own internal restrictions. Unless the drive
guarantees zero after TRIM it is perfectly normal for heads, tails or
random pieces in-between to be left untouched.

I am surprised that the common case of contiguous range entries was not
handled properly by your drive. Most of them deal with that just fine
(regardless of their internal granularity and alignment constraints). It
is normally only the partial block tracking between command invocations
that causes grief.

In any case. Unless discard_zeroes_data is 1 (and that requires
guarantees above and beyond what can be discovered via the ATA Command
Set) all bets are off wrt. dependable behavior.

The code below is an untested proof of concept to see what it would take
to alleviate your particular issue. I am, however, not at all convinced
that introducing such a change is worth the risk. I know of at least a
couple of devices that would blow up as a result of this patch...

-- 
Martin K. Petersen	Oracle Linux Engineering

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 3131adcc1f87..9c270a303f0b 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2238,8 +2238,24 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
 	 * with the unmap bit set.
 	 */
 	if (ata_id_has_trim(args->id)) {
-		put_unaligned_be64(65535 * 512 / 8, &rbuf[36]);
-		put_unaligned_be32(1, &rbuf[28]);
+		/*
+		 * If the drive supports reliable zero after trim we set
+		 * the granularity to 1 and the blocks per range to
+		 * 0xffff. Otherwise we set set the granularity to 8 and
+		 * restrict the blocks per range to 0xfff8. This is done
+		 * to accommodate older drives which prefer 4K-alignment.
+		 */
+
+		if (ata_id_has_zero_after_trim(args->id) &&
+		    args->dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+			put_unaligned_be64(ATA_MAX_TRIM_RANGE * 512 / 8,
+					   &rbuf[36]);
+			put_unaligned_be32(1, &rbuf[28]);
+		} else {
+			put_unaligned_be64(ATA_ALIGNED_TRIM_RANGE * 512 / 8,
+					   &rbuf[36]);
+			put_unaligned_be32(8, &rbuf[28]);
+		}
 	}
 
 	return 0;
@@ -3168,7 +3184,14 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 		goto invalid_fld;
 
 	buf = page_address(sg_page(scsi_sglist(scmd)));
-	size = ata_set_lba_range_entries(buf, 512, block, n_block);
+
+	if (ata_id_has_zero_after_trim(dev->id) &&
+	    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM)
+		size = ata_set_lba_range_entries(buf, 512, block, n_block,
+						 ATA_MAX_TRIM_RANGE);
+	else
+		size = ata_set_lba_range_entries(buf, 512, block, n_block,
+						 ATA_ALIGNED_TRIM_RANGE);
 
 	if (ata_ncq_enabled(dev) && ata_fpdma_dsm_supported(dev)) {
 		/* Newer devices support queued TRIM commands */
diff --git a/include/linux/ata.h b/include/linux/ata.h
index fed36418dd1c..8a17fa22cdbe 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -47,6 +47,8 @@ enum {
 	ATA_MAX_SECTORS		= 256,
 	ATA_MAX_SECTORS_LBA48	= 65535,/* TODO: 65536? */
 	ATA_MAX_SECTORS_TAPE	= 65535,
+	ATA_MAX_TRIM_RANGE	= 0xffff,
+	ATA_ALIGNED_TRIM_RANGE	= 0xfff8,
 
 	ATA_ID_WORDS		= 256,
 	ATA_ID_CONFIG		= 0,
@@ -1012,19 +1014,20 @@ static inline void ata_id_to_hd_driveid(u16 *id)
  * TO NV CACHE PINNED SET.
  */
 static inline unsigned ata_set_lba_range_entries(void *_buffer,
-		unsigned buf_size, u64 sector, unsigned long count)
+		unsigned buf_size, u64 sector, unsigned long count,
+		u16 bpe)
 {
 	__le64 *buffer = _buffer;
 	unsigned i = 0, used_bytes;
 
 	while (i < buf_size / 8 ) { /* 6-byte LBA + 2-byte range per entry */
 		u64 entry = sector |
-			((u64)(count > 0xffff ? 0xffff : count) << 48);
+			((u64)(count > bpe ? bpe : count) << 48);
 		buffer[i++] = __cpu_to_le64(entry);
-		if (count <= 0xffff)
+		if (count <= bpe)
 			break;
-		count -= 0xffff;
-		sector += 0xffff;
+		count -= bpe;
+		sector += bpe;
 	}
 
 	used_bytes = ALIGN(i * 8, 512);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-24  2:55                           ` Martin K. Petersen
@ 2015-06-24 12:46                             ` Tom Yan
  2015-06-25  1:15                               ` Martin K. Petersen
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Yan @ 2015-06-24 12:46 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide

To be honest I don't think this patch should get in as well. First of
all it wouldn't really solve problem for all devices which have
different limits. Secondly I doubt it's related to Deterministic Zero
AT ALL. (For one the SanDisk drive I have shows Deterministic Zero,
still it behaves similarly as the Intel drive.) IHMO adding "switch"
base on characteristics which is uncertain to be / not at all relevant
is the worst thing to do ever. (Yeah I recall how Intel "follows"
HDMI/DP color range spec senselessly again.)

The only patching which would really mean something is to allow user
to configure blocks per range and ranges per command, so that for
users can tune kernel TRIM per device if they really want to, while
leaving the "safe default" intact.

However I am really curious how the drives "blow up" on less blocks
per range. Isn't that even more of a firmware bug than the problem I
have? If they blow up even with single range with less then 65535
blocks addressed, wouldn't they blow up anytime if they do filesystem
TRIM?

On 24 June 2015 at 10:55, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
>
> Tom> total untrimmed sectors: (310635 - 1 - 2410) / 32 = 9632 total
> Tom> ranges: (43 GiB * (1024 ^ 3) / 512) / 65535 ~= 1376 average
> Tom> untrimmed sectors per range: (9632 / 1376) = 7 = (65535 % 8)
>
> Every type of drive has its own internal restrictions. Unless the drive
> guarantees zero after TRIM it is perfectly normal for heads, tails or
> random pieces in-between to be left untouched.
>
> I am surprised that the common case of contiguous range entries was not
> handled properly by your drive. Most of them deal with that just fine
> (regardless of their internal granularity and alignment constraints). It
> is normally only the partial block tracking between command invocations
> that causes grief.
>
> In any case. Unless discard_zeroes_data is 1 (and that requires
> guarantees above and beyond what can be discovered via the ATA Command
> Set) all bets are off wrt. dependable behavior.
>
> The code below is an untested proof of concept to see what it would take
> to alleviate your particular issue. I am, however, not at all convinced
> that introducing such a change is worth the risk. I know of at least a
> couple of devices that would blow up as a result of this patch...
>
> --
> Martin K. Petersen      Oracle Linux Engineering
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 3131adcc1f87..9c270a303f0b 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -2238,8 +2238,24 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
>          * with the unmap bit set.
>          */
>         if (ata_id_has_trim(args->id)) {
> -               put_unaligned_be64(65535 * 512 / 8, &rbuf[36]);
> -               put_unaligned_be32(1, &rbuf[28]);
> +               /*
> +                * If the drive supports reliable zero after trim we set
> +                * the granularity to 1 and the blocks per range to
> +                * 0xffff. Otherwise we set set the granularity to 8 and
> +                * restrict the blocks per range to 0xfff8. This is done
> +                * to accommodate older drives which prefer 4K-alignment.
> +                */
> +
> +               if (ata_id_has_zero_after_trim(args->id) &&
> +                   args->dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
> +                       put_unaligned_be64(ATA_MAX_TRIM_RANGE * 512 / 8,
> +                                          &rbuf[36]);
> +                       put_unaligned_be32(1, &rbuf[28]);
> +               } else {
> +                       put_unaligned_be64(ATA_ALIGNED_TRIM_RANGE * 512 / 8,
> +                                          &rbuf[36]);
> +                       put_unaligned_be32(8, &rbuf[28]);
> +               }
>         }
>
>         return 0;
> @@ -3168,7 +3184,14 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
>                 goto invalid_fld;
>
>         buf = page_address(sg_page(scsi_sglist(scmd)));
> -       size = ata_set_lba_range_entries(buf, 512, block, n_block);
> +
> +       if (ata_id_has_zero_after_trim(dev->id) &&
> +           dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM)
> +               size = ata_set_lba_range_entries(buf, 512, block, n_block,
> +                                                ATA_MAX_TRIM_RANGE);
> +       else
> +               size = ata_set_lba_range_entries(buf, 512, block, n_block,
> +                                                ATA_ALIGNED_TRIM_RANGE);
>
>         if (ata_ncq_enabled(dev) && ata_fpdma_dsm_supported(dev)) {
>                 /* Newer devices support queued TRIM commands */
> diff --git a/include/linux/ata.h b/include/linux/ata.h
> index fed36418dd1c..8a17fa22cdbe 100644
> --- a/include/linux/ata.h
> +++ b/include/linux/ata.h
> @@ -47,6 +47,8 @@ enum {
>         ATA_MAX_SECTORS         = 256,
>         ATA_MAX_SECTORS_LBA48   = 65535,/* TODO: 65536? */
>         ATA_MAX_SECTORS_TAPE    = 65535,
> +       ATA_MAX_TRIM_RANGE      = 0xffff,
> +       ATA_ALIGNED_TRIM_RANGE  = 0xfff8,
>
>         ATA_ID_WORDS            = 256,
>         ATA_ID_CONFIG           = 0,
> @@ -1012,19 +1014,20 @@ static inline void ata_id_to_hd_driveid(u16 *id)
>   * TO NV CACHE PINNED SET.
>   */
>  static inline unsigned ata_set_lba_range_entries(void *_buffer,
> -               unsigned buf_size, u64 sector, unsigned long count)
> +               unsigned buf_size, u64 sector, unsigned long count,
> +               u16 bpe)
>  {
>         __le64 *buffer = _buffer;
>         unsigned i = 0, used_bytes;
>
>         while (i < buf_size / 8 ) { /* 6-byte LBA + 2-byte range per entry */
>                 u64 entry = sector |
> -                       ((u64)(count > 0xffff ? 0xffff : count) << 48);
> +                       ((u64)(count > bpe ? bpe : count) << 48);
>                 buffer[i++] = __cpu_to_le64(entry);
> -               if (count <= 0xffff)
> +               if (count <= bpe)
>                         break;
> -               count -= 0xffff;
> -               sector += 0xffff;
> +               count -= bpe;
> +               sector += bpe;
>         }
>
>         used_bytes = ALIGN(i * 8, 512);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-24 12:46                             ` Tom Yan
@ 2015-06-25  1:15                               ` Martin K. Petersen
  2015-06-26  7:05                                 ` Tom Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2015-06-25  1:15 UTC (permalink / raw)
  To: Tom Yan; +Cc: Martin K. Petersen, linux-scsi, linux-ide

>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom> First of all it wouldn't really solve problem for all devices which
Tom> have different limits.

Only by virtue of us generally aligning on large power of two
boundaries.

The only reason I am entertaining this in the first place is that I have
one drive that behaves in a way similar to yours. And if we can make
things slightly better (but not perfect) for several drives without
causing any regressions then that's worth exploring.

Tom> Secondly I doubt it's related to Deterministic Zero AT ALL. (For
Tom> one the SanDisk drive I have shows Deterministic Zero, still it
Tom> behaves similarly as the Intel drive.)

The drive reporting deterministic zero is not enough. It needs to be
explicitly whitelisted before we report discard_zeroes_data=1.

Tom> The only patching which would really mean something is to allow
Tom> user to configure blocks per range and ranges per command, so that
Tom> for users can tune kernel TRIM per device if they really want to,
Tom> while leaving the "safe default" intact.

I'll think about it.

Tom> However I am really curious how the drives "blow up" on less blocks
Tom> per range. Isn't that even more of a firmware bug than the problem
Tom> I have?

I have several older drives that expect a single contiguous LBA
range. They don't handle multiple discontiguous ranges at all.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: configurable discard parameters
  2015-06-25  1:15                               ` Martin K. Petersen
@ 2015-06-26  7:05                                 ` Tom Yan
  0 siblings, 0 replies; 18+ messages in thread
From: Tom Yan @ 2015-06-26  7:05 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide

On 25 June 2015 at 09:15, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>
> The drive reporting deterministic zero is not enough. It needs to be
> explicitly whitelisted before we report discard_zeroes_data=1.
>

Not sure what you mean here. I just meant I don't think the "lower
limit" thing I am experiencing is related to the "type" of TRIM, at
least not for any Deterministic TRIM. (Although I don't really know
what exactly does Deterministic Read Data means. Perhaps it means
vendor would make sure the LBA after TRIM reads either zero or a
specific pattern of data?)

>
> I have several older drives that expect a single contiguous LBA
> range. They don't handle multiple discontiguous ranges at all.
>

Well discontiguous is somewhat inaccurate here. I guess what you mean
is they simply don't handle the next range until the previous one is
used up. For example, "0:65528, 65528:65528" are definitely
contiguous, but not fulfiling the requirement of your drives; while
"0:65535 131072:65535" are definitely not contiguous, but your drives
take them fine.

By the way what are the consequences? Data loss? System hang?

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-06-26  7:05 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-20 17:12 configurable discard parameters Tom Yan
2015-06-21  0:20 ` Martin K. Petersen
2015-06-21  7:03   ` Tom Yan
2015-06-21  8:05     ` Tom Yan
2015-06-21 12:36       ` Tom Yan
2015-06-21 20:30         ` Tom Yan
2015-06-22 20:57           ` Martin K. Petersen
2015-06-23 14:16             ` Tom Yan
2015-06-23 15:36               ` Martin K. Petersen
2015-06-23 16:41                 ` Tom Yan
2015-06-23 17:03                   ` Martin K. Petersen
2015-06-23 17:24                     ` Tom Yan
2015-06-23 18:26                       ` Martin K. Petersen
2015-06-23 21:25                         ` Tom Yan
2015-06-24  2:55                           ` Martin K. Petersen
2015-06-24 12:46                             ` Tom Yan
2015-06-25  1:15                               ` Martin K. Petersen
2015-06-26  7:05                                 ` Tom Yan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.