* [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
@ 2014-12-04  2:44 Martin K. Petersen
  2014-12-04  3:02 ` Phillip Susi
  2014-12-04 17:06 ` Tejun Heo
  0 siblings, 2 replies; 35+ messages in thread
From: Martin K. Petersen @ 2014-12-04  2:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide


As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
Zero After Trim) flags in the ATA Command Set are unreliable in the
sense that they only define what happens if the device successfully
executed the DSM TRIM command. TRIM is only advisory, however, and the
device is free to silently ignore all or parts of the request.

In practice this renders the DRAT and RZAT flags completely useless,
and because the results are unpredictable we decided to disable discard
in MD for 3.18 to avoid the risk of data corruption.

Hardware vendors in the real world obviously need better guarantees than
what the standards bodies provide. Unfortunately those guarantees are
encoded in product requirements documents rather than somewhere we can
key off of them programmatically. So we are compelled to disable
discard_zeroes_data for all devices unless we explicitly have data to
support whitelisting them.

This patch whitelists SSDs from a few of the main vendors. None of the
whitelists are based on written guarantees. They are purely based on
empirical evidence collected from internal and external users that have
tested or qualified these drives in RAID deployments.

The whitelist is only meant as a starting point and is by no means
comprehensive:

   - All intel SSD models except for 510
   - Micron M5*
   - Samsung SSDs
   - Seagate SSDs

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/ata/libata-core.c |   18 ++++++++++++++----
 drivers/ata/libata-scsi.c |   10 ++++++----
 include/linux/libata.h    |    1 +
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c5ba15af87d3..f41f24a8bc21 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4225,10 +4225,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "PIONEER DVD-RW  DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
 
 	/* devices that don't properly handle queued TRIM commands */
-	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+	{ "Micron_M5?0*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
+						ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Crucial_CT???M5?0SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+
+	/*
+	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
+	 * SSDs that provide reliable zero after TRIM.
+	 */
+	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
+	{ "INTEL*SSD*", 		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
 
 	/*
 	 * Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 0586f66d70fa..a19479282b65 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2515,13 +2515,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[15] = lowest_aligned;
 
 		if (ata_id_has_trim(args->id)) {
-			rbuf[14] |= 0x80; /* TPE */
+			rbuf[14] |= 0x80; /* LBPME */
 
-			if (ata_id_has_zero_after_trim(args->id))
-				rbuf[14] |= 0x40; /* TPRZ */
+			if (ata_id_has_zero_after_trim(args->id) &&
+			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+				ata_dev_info(dev, "Enabling discard_zeroes_data\n");
+				rbuf[14] |= 0x40; /* LBPRZ */
+			}
 		}
 	}
-
 	return 0;
 }
 
diff --git a/include/linux/libata.h b/include/linux/libata.h
index bd5fefeaf548..45ac825b8366 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -421,6 +421,7 @@ enum {
 	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
 	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
 	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
+	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
 
 	 /* DMA mask for user DMA control: User visible values; DO NOT
 	    renumber */


-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-04  2:44 [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
@ 2014-12-04  3:02 ` Phillip Susi
  2014-12-04  3:24   ` Martin K. Petersen
  2014-12-04 17:06 ` Tejun Heo
  1 sibling, 1 reply; 35+ messages in thread
From: Phillip Susi @ 2014-12-04  3:02 UTC (permalink / raw)
  To: Martin K. Petersen, Tejun Heo; +Cc: linux-ide


On 12/03/2014 09:44 PM, Martin K. Petersen wrote:
> This patch whitelists SSDs from a few of the main vendors. None of
> the whitelists are based on written guarantees. They are purely
> based on empirical evidence collected from internal and external
> users that have tested or qualified these drives in RAID
> deployments.

Without anything in writing from the vendor, such a white list amounts
to nothing more than wishful thinking.  You can't really test for it
and even if it *appears* to be so, the drive is free to change its
behavior any time.




* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-04  3:02 ` Phillip Susi
@ 2014-12-04  3:24   ` Martin K. Petersen
  2014-12-04  3:28     ` Phillip Susi
  0 siblings, 1 reply; 35+ messages in thread
From: Martin K. Petersen @ 2014-12-04  3:24 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Martin K. Petersen, Tejun Heo, linux-ide

>>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes:

Phillip> On 12/03/2014 09:44 PM, Martin K. Petersen wrote:
>> This patch whitelists SSDs from a few of the main vendors. None of
>> the whitelists are based on written guarantees. They are purely based
>> on empirical evidence collected from internal and external users that
>> have tested or qualified these drives in RAID deployments.

Phillip> Without anything in writing from the vendor, such a white list
Phillip> amounts to nothing more than wishful thinking.  You can't
Phillip> really test for it and even if it *appears* to be so, the drive
Phillip> is free to change its behavior any time.

Well. It's a bit stronger than wishful thinking given that OEM product
requirements documents and supply contracts make it so. But it is not
something we can get in writing.

It is, however, a step up from the blanket whitelist we've had so far
where we assumed that it worked for every drive that had DRAT/RZAT set.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-04  3:24   ` Martin K. Petersen
@ 2014-12-04  3:28     ` Phillip Susi
  2014-12-04  3:35       ` Martin K. Petersen
  0 siblings, 1 reply; 35+ messages in thread
From: Phillip Susi @ 2014-12-04  3:28 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Tejun Heo, linux-ide


On 12/03/2014 10:24 PM, Martin K. Petersen wrote:
> Well. It's a bit stronger than wishful thinking given that OEM
> product requirements documents and supply contracts make it so. But
> it is not something we can get in writing.

OEM product requirements documents and supply contracts sound like
forms of writing to me.



* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-04  3:28     ` Phillip Susi
@ 2014-12-04  3:35       ` Martin K. Petersen
  2014-12-04  4:40         ` Phillip Susi
  2014-12-04 21:49         ` One Thousand Gnomes
  0 siblings, 2 replies; 35+ messages in thread
From: Martin K. Petersen @ 2014-12-04  3:35 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Martin K. Petersen, Tejun Heo, linux-ide

>>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes:

Phillip> On 12/03/2014 10:24 PM, Martin K. Petersen wrote:
>> Well. It's a bit stronger than wishful thinking given that OEM
>> product requirements documents and supply contracts make it so. But
>> it is not something we can get in writing.

Phillip> OEM product requirements documents and supply contracts sound
Phillip> like forms of writing to me.

Except they are not forms of writing that we as a community have access
to.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-04  3:35       ` Martin K. Petersen
@ 2014-12-04  4:40         ` Phillip Susi
  2014-12-05  1:53           ` Martin K. Petersen
  2014-12-04 21:49         ` One Thousand Gnomes
  1 sibling, 1 reply; 35+ messages in thread
From: Phillip Susi @ 2014-12-04  4:40 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Tejun Heo, linux-ide


On 12/03/2014 10:35 PM, Martin K. Petersen wrote:
> Phillip> OEM product requirements documents and supply contracts
> sound Phillip> like forms of writing to me.
> 
> Except they are not forms of writing that we as a community have
> access to.

So your assertion is that you have seen it in writing, so we should
all assume the drives will adhere to that, but the writing is private
and cannot be verified, and the manufacturers cannot be held to it if
they choose not to abide by it in the general case - yet we should
assume that if they are bound by private contracts, they would
perform the same way with publicly sold models that claim to have
the same model and revision?

I'm not saying a hard hell no, but this certainly makes me
uncomfortable.  I'd much rather see the manufacturers put it in
writing that yes, this make and model will perform this way even
though it is not strictly required by ATA8.  What would be better
still is if the bloody ATA standard got a clue and said that if the
drive claims to zero after TRIM, the TRIM command becomes mandatory
instead of advisory.



* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-04  2:44 [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
  2014-12-04  3:02 ` Phillip Susi
@ 2014-12-04 17:06 ` Tejun Heo
  2014-12-05  2:13   ` Martin K. Petersen
  1 sibling, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2014-12-04 17:06 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-ide

Hello, Martin.

On Wed, Dec 03, 2014 at 09:44:46PM -0500, Martin K. Petersen wrote:
> 
> As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
> Zero After Trim) flags in the ATA Command Set are unreliable in the
> sense that they only define what happens if the device successfully
> executed the DSM TRIM command. TRIM is only advisory, however, and the
> device is free to silently ignore all or parts of the request.
> 
> In practice this renders the DRAT and RZAT flags completely useless,
> and because the results are unpredictable we decided to disable discard
> in MD for 3.18 to avoid the risk of data corruption.
> 
> Hardware vendors in the real world obviously need better guarantees than
> what the standards bodies provide. Unfortunately those guarantees are
> encoded in product requirements documents rather than somewhere we can
> key off of them programmatically. So we are compelled to disable
> discard_zeroes_data for all devices unless we explicitly have data to
> support whitelisting them.
> 
> This patch whitelists SSDs from a few of the main vendors. None of the
> whitelists are based on written guarantees. They are purely based on
> empirical evidence collected from internal and external users that have
> tested or qualified these drives in RAID deployments.
> 
> The whitelist is only meant as a starting point and is by no means
> comprehensive:
> 
>    - All intel SSD models except for 510
>    - Micron M5*
>    - Samsung SSDs
>    - Seagate SSDs

Generally, I'm extremely skeptical about whitelists.  It's very
difficult to keep them meaningfully up-to-date and they often just end
up bit rotting after the initial flurry of updates, especially given
how little this particular feature would be visible to most users.

If there's something on the horizon which would solve the
identification problem and we only have to worry about the current
batch of devices, whitelisting can be useful but otherwise I'm not
sure this is a good idea.

If there currently is no way of properly indicating this feature,
let's please disable it unconditionally.  If this is absolutely
necessary in certain cases (is it?), we can add libata.force flags or
sysfs knobs to force enable it for users who care enough about it and
know what they're doing.

Thanks.

-- 
tejun


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-04  3:35       ` Martin K. Petersen
  2014-12-04  4:40         ` Phillip Susi
@ 2014-12-04 21:49         ` One Thousand Gnomes
  2014-12-05  2:46           ` Martin K. Petersen
  1 sibling, 1 reply; 35+ messages in thread
From: One Thousand Gnomes @ 2014-12-04 21:49 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Phillip Susi, Tejun Heo, linux-ide

On Wed, 03 Dec 2014 22:35:59 -0500
"Martin K. Petersen" <martin.petersen@oracle.com> wrote:

> >>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes:
> 
> Phillip> On 12/03/2014 10:24 PM, Martin K. Petersen wrote:
> >> Well. It's a bit stronger than wishful thinking given that OEM
> >> product requirements documents and supply contracts make it so. But
> >> it is not something we can get in writing.
> 
> Phillip> OEM product requirements documents and supply contracts sound
> Phillip> like forms of writing to me.
> 
> Except they are not forms of writing that we as a community have access
> to.

That's true of many things we rely upon.

I assume it's something vaguely of the form

Vendor A has agreed with Oracle that drive identity X will do this and
meet some certification so Oracle can approve it for databases

Oracle has agreed to break vendor A's legs if it doesn't

Oracle would rather like that the kernel just knows about this drive as
being good, based on the above.

In which case IMHO it's good enough. "Do not p*ss off a major customer"
is probably more watertight than a lot of signed paper things 8)

Alan


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-04  4:40         ` Phillip Susi
@ 2014-12-05  1:53           ` Martin K. Petersen
  0 siblings, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2014-12-05  1:53 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Martin K. Petersen, Tejun Heo, linux-ide

>>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes:

Phillip> So your assertion is that you have seen it in writing, 

Actually I haven't. I have talked to a bunch of people that have done
RAID qualification on SSDs. I also gleaned what I could from a few
vendor lists of supported drive configurations.

The good news is that the practice of each OEM having their own custom
drive firmware build is not very common for SSDs. And most of the
results I looked at were collected using off-the-shelf drives.

Phillip> I'm not saying a hard hell no, but this certainly makes me
Phillip> uncomfortable.  

You do realize that my patch *restricts* the drives we enable
discard_zeroes_data on, right? Instead of relying solely on the drive's
own reporting we now also require empirical evidence that they do the
right thing.

Phillip> I'd much rather see the manufacturers put it in writing that
Phillip> yes, this make and model will perform this way even though it
Phillip> is not strictly required by ATA8.

Wishful thinking.

The sad reality is that standards are deliberately written to be
vague, and any hard guarantees are part of product requirements
documents. That's not specific to SSDs in any way. That's true for
pretty much any piece of hardware.

Phillip> What would be better still is if the bloody ATA standard got a
Phillip> clue and said that if the drive claims to zero after TRIM, the
Phillip> TRIM command becomes mandatory instead of advisory.

I agree that the standards efforts in this department are a total train
wreck.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-04 17:06 ` Tejun Heo
@ 2014-12-05  2:13   ` Martin K. Petersen
  2014-12-05 14:51     ` Tejun Heo
  0 siblings, 1 reply; 35+ messages in thread
From: Martin K. Petersen @ 2014-12-05  2:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Martin K. Petersen, linux-ide

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> Generally, I'm extremely skeptical about whitelists.

Me too. Unfortunately, in this case the blacklist is essentially
unbounded whereas the whitelist is small.

Tejun> If there's something on the horizon which would solve the
Tejun> identification problem and we only have to worry about the
Tejun> current batch of devices, whitelisting can be useful but
Tejun> otherwise I'm not sure this is a good idea.

There isn't :( The only saving grace is that SSDs are gravitating
towards NVMe and other non-ATA interfaces.

Tejun> It's very difficult to keep them meaningfully up-to-date and
Tejun> they often just end up bit rotting after the initial flurry of
Tejun> updates,

I'm well aware that this adds another truckload of ongoing pain to my
plate. Several of us discussed this issue at conferences this fall and
the consensus was that whitelisting is the only way to go about it.

I've already tightened up things in SCSI so we now prefer WRITE SAME
which does give hard guarantees unlike UNMAP. But because we use WRITE
SAME in the libata SATL it is imperative that we change our internal
flagging to be equally restrictive.
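
For reference, the SCSI-level distinction is that WRITE SAME(16) with
the UNMAP bit must leave the blocks reading back as the supplied (zero)
payload whether or not the device deallocates anything, whereas UNMAP,
like ATA TRIM, is advisory. A sketch of the CDB layout (illustrative
only, not from this patch):

/* Illustrative only: build a WRITE SAME(16) CDB with the UNMAP bit. */
#include <stdint.h>
#include <string.h>

static void build_write_same16_unmap(uint8_t cdb[16], uint64_t lba,
				     uint32_t nblocks)
{
	memset(cdb, 0, 16);
	cdb[0] = 0x93;			/* WRITE SAME(16) opcode */
	cdb[1] = 1 << 3;		/* UNMAP bit */
	for (int i = 0; i < 8; i++)	/* LBA, big-endian, bytes 2-9 */
		cdb[2 + i] = (uint8_t)(lba >> (56 - 8 * i));
	for (int i = 0; i < 4; i++)	/* count, big-endian, bytes 10-13 */
		cdb[10 + i] = (uint8_t)(nblocks >> (24 - 8 * i));
}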

Tejun> If there currently is no way of properly indicating this feature,
Tejun> let's please disable it unconditionally.

Well, we recently disabled discard support in MD RAID5/6 to avoid
corruption. There currently is a manual override and that may be good
enough.

I just feel bad about disabling the feature for the many existing users
(and there are quite a few) that are using well-behaved drives in their
RAID deployments. And the filesystem folks have been begging for the
zeroout discard variant that I posted a few weeks ago. So the users are
there. I'm just trying to accommodate them the best I can given the lame
spec.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-04 21:49         ` One Thousand Gnomes
@ 2014-12-05  2:46           ` Martin K. Petersen
  0 siblings, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2014-12-05  2:46 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Martin K. Petersen, Phillip Susi, Tejun Heo, linux-ide

>>>>> "Alan" == One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> writes:

Alan> I assume it's something vaguely of the form

Alan> Vendor A has agreed with Oracle that drive identity X will do this
Alan> and meet some certification so Oracle can approve it for databases

Actually, we don't trust devices to zero anything for us :)

Alan> Oracle has agreed to break vendor A's legs if it doesn't

Alan> Oracle would rather like that the kernel just knows about this
Alan> drive as being good, based on the above.

We're generally not dealing with this kind of consumer/SMB hardware at
Oracle. So s/Oracle/$OTHER_OEM/, but yeah.

My interest in this is purely from the upstream discard lackey
perspective...

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-05  2:13   ` Martin K. Petersen
@ 2014-12-05 14:51     ` Tejun Heo
  2014-12-10  4:09       ` James Bottomley
  0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2014-12-05 14:51 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-ide

Hello,

On Thu, Dec 04, 2014 at 09:13:22PM -0500, Martin K. Petersen wrote:
> Tejun> If there's something on the horizon which would solve the
> Tejun> identification problem and we only have to worry about the
> Tejun> current batch of devices, whitelisting can be useful but
> Tejun> otherwise I'm not sure this is a good idea.
> 
> There isn't :( The only saving grace is that SSDs are gravitating
> towards NVMe and other non-ATA interfaces.

Yeap, that's true.  This whole thing is on the way out.

> I've already tightened up things in SCSI so we now prefer WRITE SAME
> which does give hard guarantees unlike UNMAP. But because we use WRITE
> SAME in the libata SATL it is imperative that we change our internal
> flagging to be equally restrictive.

Can you please elaborate on this in the changelog and add comments
explaining why ATA_HORKAGE_ZERO_AFTER_TRIM is necessary and how it's
used?

> I just feel bad about disabling the feature for the many existing users
> (and there are quite a few) that are using well-behaved drives in their
> RAID deployments. And the filesystem folks have been begging for the
> zeroout discard variant that I posted a few weeks ago. So the users are
> there. I'm just trying to accommodate them the best I can given the lame
> spec.

Can you please explain further the practical gains of using trims
which guarantee zeroing?

Thanks.

-- 
tejun


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-05 14:51     ` Tejun Heo
@ 2014-12-10  4:09       ` James Bottomley
  2014-12-10 14:29         ` Tejun Heo
  2014-12-10 15:43         ` Tim Small
  0 siblings, 2 replies; 35+ messages in thread
From: James Bottomley @ 2014-12-10  4:09 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Martin K. Petersen, linux-ide

On Fri, 2014-12-05 at 09:51 -0500, Tejun Heo wrote:
> > I just feel bad about disabling the feature for the many existing users
> > (and there are quite a few) that are using well-behaved drives in their
> > RAID deployments. And the filesystem folks have been begging for the
> > zeroout discard variant that I posted a few weeks ago. So the users are
> > there. I'm just trying to accommodate them the best I can given the lame
> > spec.
> 
> Can you please explain further the practical gains of using trims

It's for RAID devices.  We'd like to trim RAID devices simply by sending
the trim down to every component.  If we actually have to translate trim
to a write it would damage performance (and defeat the purpose of
helping the SSD FTL reuse blocks).  However, RAID requires that the
redundancy verification of the array components matches; otherwise a
verify of the RAID fails.  This means we have to have a guarantee of what
the verify read of a trimmed block of a RAID component will return.  So
for RAID-1, we just need both trimmed components to return the same data
(we don't actually care what it is, just that it be mirrored); for
RAID-5 we need zeros on every trimmed component, because zero^zero =
zero.

Conversely, drives that return random junk after a trim cause
verification failures, so we just elect not to transmit trim down to
them from the RAID layer.
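
To make the arithmetic concrete, here is a toy sketch (illustrative
only, not kernel code) of the RAID-5 case:

/* Toy model: XOR parity over a fully trimmed RAID-5 stripe. */
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint8_t d0 = 0, d1 = 0, d2 = 0;	/* trimmed components read zero */
	uint8_t parity = d0 ^ d1 ^ d2;	/* parity of the trimmed stripe */

	assert(parity == 0);		/* zero^zero = zero: verify passes */

	d1 = 0x5a;			/* one drive returns junk instead */
	assert((d0 ^ d1 ^ d2) != 0);	/* the syndrome check now fails */
	return 0;
}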

James




* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-10  4:09       ` James Bottomley
@ 2014-12-10 14:29         ` Tejun Heo
  2014-12-10 20:34           ` James Bottomley
  2014-12-10 21:02           ` Martin K. Petersen
  2014-12-10 15:43         ` Tim Small
  1 sibling, 2 replies; 35+ messages in thread
From: Tejun Heo @ 2014-12-10 14:29 UTC (permalink / raw)
  To: James Bottomley; +Cc: Martin K. Petersen, linux-ide

Hello,

On Wed, Dec 10, 2014 at 07:09:38AM +0300, James Bottomley wrote:
> Conversely, drives that return random junk after a trim cause
> verification failures, so we just elect not to transmit trim down to
> them from the RAID layer.

I see.  Thanks for the explanation.  I suppose the current raid
implementations' metadata handling on partially built devices isn't
developed enough to simply mark those trimmed blocks as unsynced?  If
raid consistency truly is the only reason for this, that approach
seems way more fruitful to me than playing this optional feature game
with hardware vendors which so often leads to eventual abysmal
outcomes.

Thanks.

-- 
tejun


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-10  4:09       ` James Bottomley
  2014-12-10 14:29         ` Tejun Heo
@ 2014-12-10 15:43         ` Tim Small
  1 sibling, 0 replies; 35+ messages in thread
From: Tim Small @ 2014-12-10 15:43 UTC (permalink / raw)
  To: James Bottomley, Tejun Heo; +Cc: Martin K. Petersen, linux-ide


On 10/12/14 04:09, James Bottomley wrote:

> RAID requires that the redundancy verification of the array
> components matches; otherwise a verify of the RAID fails.  This means
> we have to have a guarantee of what the verify read of a trimmed block
> of a RAID component will return. So for RAID-1, we just need both
> trimmed components to return the same data (we don't actually care
> what it is, just that it be mirrored);


FWIW, Unless I'm out-of-date, md RAID-1 and RAID-10 currently DON'T
guarantee that the data is always mirrored.  The md(4) man page from
http://git.neil.brown.name/?p=mdadm.git;a=blob;f=md.4 says:

> However on RAID1 and RAID10 it is possible for software issues to 
> cause a mismatch to be reported.  This does not necessarily mean
> that the data on the array is corrupted.  It could simply be that
> the system does not care what is stored on that part of the array -
> it is unused space.
> 
> The most likely cause for an unexpected mismatch on RAID1 or RAID10 
> occurs if a swap partition or swap file is stored on the array.
> 
> When the swap subsystem wants to write a page of memory out, it
> flags the page as 'clean' in the memory manager and requests the
> swap device to write it out.  It is quite possible that the memory
> will be changed while the write-out is happening.  In that case the
> 'clean' flag will be found to be clear when the write completes and
> so the swap subsystem will simply forget that the swapout had been 
> attempted, and will possibly choose a different page to write out.
> 
> If the swap device was on RAID1 (or RAID10), then the data is sent 
> from memory to a device twice (or more depending on the number of 
> devices in the array).  Thus it is possible that the memory gets 
> changed  between the times it is sent, so different data can be 
> written to the different devices in the array.  This will be
> detected by check as a mismatch.  However it does not reflect any
> corruption as the block where this mismatch occurs is being treated
> by the swap system as being empty, and the data will never be read
> from that block.
> 
> It is conceivable for a similar situation to occur on non-swap
> files, though it is less likely.



In my experience these inconsistencies are very common in data written
by certain applications (e.g. DBMS) on md RAID1 and RAID10 devices, and
effectively make md verify useless for them.  This is a shame -
particularly with things like FTLs where (at least anecdotally) data
scrambling seems to be a reasonably frequent occurrence.

So unreliable read-zero-on-TRIM isn't going to break the md RAID-1 /
RAID-10 use-case significantly, because verify hasn't worked on those
forever anyway.

Of course, a non-deterministic return-zero-on-trim will still cause
problems with md RAID5/6, and with other users such as dm RAID too I'd
guess.

Tim.


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-10 14:29         ` Tejun Heo
@ 2014-12-10 20:34           ` James Bottomley
  2014-12-10 21:02           ` Martin K. Petersen
  1 sibling, 0 replies; 35+ messages in thread
From: James Bottomley @ 2014-12-10 20:34 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Martin K. Petersen, linux-ide

On Wed, 2014-12-10 at 09:29 -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Dec 10, 2014 at 07:09:38AM +0300, James Bottomley wrote:
> > Conversely, drives that return random junk after a trim cause
> > verification failures, so we just elect not to transmit trim down to
> > them from the RAID layer.
> 
> I see.  Thanks for the explanation.  I suppose the current raid
> implementations' metadata handling on partially built devices isn't
> developed enough to simply mark those trimmed blocks as unsynced?

We do have a log, which could be used for RAID-1 for this purpose, but
it doesn't seem to be much used in practice.  It takes extra space which
most admins don't account for.

For RAID-2+ or erasure codes this won't work because a bad block read
corrupts the stripe: the really subtle failure here is that you trim a
stripe and then partially write it.  The RMW you do for parity will
compute an incorrect partial syndrome, because it's based on junk rather
than on the syndrome of the other blocks, and the corruption wouldn't be
detected until you get an actual disk failure (meaning everything will
look fine until that crucial day you need your data protection mechanism
to work).  We could cope with this with an even more subtle logging
mechanism, where we only trim whole stripes, log the trimmed stripes,
and insist on full instead of partial writes so we get back to a known
syndrome, but that's introducing a lot of subtlety into the logging
code ...
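
A toy sketch of that failure mode (again, illustrative arithmetic only,
not the md code):

/* RMW parity update over a trimmed block that reads back junk. */
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint8_t d0 = 0x11, d1 = 0x22;	/* stripe contents at trim time */
	uint8_t parity = d0 ^ d1;
	uint8_t junk = 0x77;		/* trimmed d0 now reads junk */
	uint8_t new_d0 = 0x33;		/* partial write to d0 */

	/* RMW: parity ^= old_data ^ new_data, but "old_data" is junk */
	parity = parity ^ junk ^ new_d0;

	/* The stripe is now silently inconsistent: rebuilding d1 from
	 * parity after a disk failure would return garbage. */
	assert(parity != (new_d0 ^ d1));
	return 0;
}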

James




* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-10 14:29         ` Tejun Heo
  2014-12-10 20:34           ` James Bottomley
@ 2014-12-10 21:02           ` Martin K. Petersen
  2014-12-12  8:35             ` Ming Lei
  2015-01-05 16:28             ` Tejun Heo
  1 sibling, 2 replies; 35+ messages in thread
From: Martin K. Petersen @ 2014-12-10 21:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: James Bottomley, Martin K. Petersen, linux-ide

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> If raid consistency truly is the only reason for this, that
Tejun> approach seems way more fruitful to me than playing this optional
Tejun> feature game with hardware vendors which so often leads to
Tejun> eventual abysmal outcomes.

The other use case is the filesystem one where it is common to zero
block ranges for bitmaps, etc. In many workloads there is a
significant win to trimming over writing out many blocks of zeroes.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-10 21:02           ` Martin K. Petersen
@ 2014-12-12  8:35             ` Ming Lei
  2015-01-05 16:28             ` Tejun Heo
  1 sibling, 0 replies; 35+ messages in thread
From: Ming Lei @ 2014-12-12  8:35 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Tejun Heo, James Bottomley, linux-ide

On Thu, Dec 11, 2014 at 5:02 AM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
>
> Tejun> If raid consistency truly is the only reason for this, that
> Tejun> approach seems way more fruitful to me than playing this optional
> Tejun> feature game with hardware vendors which so often leads to
> Tejun> eventual abysmal outcomes.
>
> The other use case is the filesystem one where it is common to zero
> block ranges for bitmaps, etc. In many workloads there is a
> significant win to trimming over writing out many blocks of zeroes.

Yes, looks like mkfs is using the feature.

Thanks,
Ming Lei


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-10 21:02           ` Martin K. Petersen
  2014-12-12  8:35             ` Ming Lei
@ 2015-01-05 16:28             ` Tejun Heo
  2015-01-07  0:05               ` Martin K. Petersen
  1 sibling, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2015-01-05 16:28 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: James Bottomley, linux-ide

Hello, Martin.

On Wed, Dec 10, 2014 at 04:02:32PM -0500, Martin K. Petersen wrote:
> >>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
> 
> Tejun> If raid consistency truly is the only reason for this, that
> Tejun> approach seems way more fruitful to me than playing this optional
> Tejun> feature game with hardware vendors which so often leads to
> Tejun> eventual abysmal outcomes.
> 
> The other use case is the filesystem one where it is common to zero
> block ranges for bitmaps, etc. In many workloads there is a
> significant win to trimming over writing out many blocks of zeroes.

Isn't that kinda niche and specialized tho?  Besides, in this use
case, the kernel is essentially just serving as the whitelist and
nothing else.  Userland would have to issue TRIMs directly.  I'm not
against this patch being merged but this really looks like an
ultimately futile effort to me.

Thanks.

-- 
tejun


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-05 16:28             ` Tejun Heo
@ 2015-01-07  0:05               ` Martin K. Petersen
  2015-01-07  2:54                 ` Tejun Heo
  0 siblings, 1 reply; 35+ messages in thread
From: Martin K. Petersen @ 2015-01-07  0:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

>> The other use case is the filesystem one where it is common to zero
>> block ranges for bitmaps, etc. In many workloads there is a
>> significant win to trimming over writing out many blocks of zeroes.

Tejun> Isn't that kinda niche and specialized tho?  

I don't think so. There are two reasons for zeroing block ranges:

 1) To ensure they contain zeroes on subsequent reads

 2) To preallocate them or anchor them down on thin provisioned devices

The filesystem folks have specifically asked to be able to make that
distinction. Hence the patch that changes blkdev_issue_zeroout().

You really don't want to write out gobs and gobs of zeroes and cause
unnecessary flash wear if all you care about is the blocks being in a
deterministic state.
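
Roughly, the calling convention that falls out of this (a sketch,
assuming the blkdev_issue_zeroout() change referenced above lands as
posted, i.e. with a trailing bool discard argument; fs_zero_range() is
a made-up wrapper):

/* Sketch: zero a range, letting the block layer use TRIM when the
 * device is whitelisted (discard_zeroes_data), else write zeroes. */
#include <linux/blkdev.h>

static int fs_zero_range(struct block_device *bdev, sector_t sector,
			 sector_t nr_sects)
{
	/* discard=true: a discard may satisfy the request, but only
	 * when it deterministically zeroes; either way the range
	 * reads back as zeroes afterwards. */
	return blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_NOFS, true);
}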

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07  0:05               ` Martin K. Petersen
@ 2015-01-07  2:54                 ` Tejun Heo
  2015-01-07  4:15                   ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2015-01-07  2:54 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: James Bottomley, linux-ide, linux-fsdevel

Hello, Martin.

On Tue, Jan 06, 2015 at 07:05:40PM -0500, Martin K. Petersen wrote:
> Tejun> Isn't that kinda niche and specialized tho?  
> 
> I don't think so. There are two reasons for zeroing block ranges:
> 
>  1) To ensure they contain zeroes on subsequent reads
> 
>  2) To preallocate them or anchor them down on thin provisioned devices
> 
> The filesystem folks have specifically asked to be able to make that
> distinction. Hence the patch that changes blkdev_issue_zeroout().
> 
> You really don't want to write out gobs and gobs of zeroes and cause
> unnecessary flash wear if all you care about is the blocks being in a
> deterministic state.

I think I'm still missing something.  Are there enough cases where
filesystems want to write out zeroes during operation?  Earlier in the
thread, it was mentioned that this is currently mostly useful for
raids which need the blocks actually cleared for checksum consistency,
which basically means that raid metadata handling isn't (yet) capable
of just marking those (parts of) stripes as unused.  If a filesystem
wants to read back zeros from data blocks, wouldn't it be just marking
the matching index as such?  And if you take out the zeroing case,
trims are just trims and whether they return 0 afterwards or not is
irrelevant.

There sure can be use cases where zeroing fast and securely makes a
noticeable difference but the cases put forth till this point seem
relatively weak.  I mean, after all, requiring trim to zero the blocks
is essentially pushing down that amount of metadata management to the
device - the device would do exactly the same thing.  Pushing it down
the layers can definitely be beneficial especially when there's no
agreed-upon metadata on the medium (so, mkfs time), but it seems kinda
superfluous during normal operation.  What am I missing?

Thanks.

-- 
tejun


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07  2:54                 ` Tejun Heo
@ 2015-01-07  4:15                   ` Dave Chinner
  2015-01-07 15:26                     ` Tejun Heo
  2015-01-08  4:05                     ` Phillip Susi
  0 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2015-01-07  4:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel

On Tue, Jan 06, 2015 at 09:54:55PM -0500, Tejun Heo wrote:
> Hello, Martin.
> 
> On Tue, Jan 06, 2015 at 07:05:40PM -0500, Martin K. Petersen wrote:
> > Tejun> Isn't that kinda niche and specialized tho?  
> > 
> > I don't think so. There are two reasons for zeroing block ranges:
> > 
> >  1) To ensure they contain zeroes on subsequent reads
> > 
> >  2) To preallocate them or anchor them down on thin provisioned devices
> > 
> > The filesystem folks have specifically asked to be able to make that
> > distinction. Hence the patch that changes blkdev_issue_zeroout().
> > 
> > You really don't want to write out gobs and gobs of zeroes and cause
> > unnecessary flash wear if all you care about is the blocks being in a
> > deterministic state.
> 
> I think I'm still missing something.  Are there enough cases where
> filesystems want to write out zeroes during operation?

IMO, yes.

w.r.t. thinp devices, we need to be able to guarantee that
preallocated regions in the filesystem are actually backed by real
blocks in the thinp device so we don't get ENOSPC from the thinp
device. No filesystems do this yet because we don't have a mechanism
for telling the lower layers "preallocate these blocks to zero".

The biggest issue is that we currently have no easy way to say
"these blocks need to contain zeros, but we aren't actually using
them yet". i.e. the filesystem code assumes that they contain zeros
(e.g. in ext4 inode tables because mkfs used to zero them) if they
haven't been used, so when it reads them it detects that
initialisation is needed because the blocks are empty....

FWIW, some filesystems need these regions to actually contain
zeros because they can't track unwritten extents (e.g.
gfs2). Having sb_issue_zeroout() just do the right thing enables us
to efficiently zero the regions they are preallocating...
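
As a userspace-visible sketch of the request being discussed (assuming
the FALLOC_FL_ZERO_RANGE mode that went into 3.15):

/* Ask the filesystem for a range that must read back as zeroes.
 * On a device with reliable zero-after-TRIM the filesystem could
 * satisfy this with a discard instead of writing zeroes. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2)
		return 1;

	int fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, 1 << 20) < 0)
		perror("fallocate");

	close(fd);
	return 0;
}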

> Earlier in the
> thread, it was mentioned that this is currently mostly useful for
> raids which need the blocks actually cleared for checksum consistency,
> which basically means that raid metadata handling isn't (yet) capable
> of just marking those (parts of) stripes as unused.  If a filesystem
> wants to read back zeros from data blocks, wouldn't it be just marking
> the matching index as such?

Not all filesystems can do this for user data (see gfs2 case above)
and no linux filesystem tracks whether free space contains zeros or
stale data. Hence if we want blocks to be zeroed on disk, we
currently have to write zeros to them and hence they get pinned in
devices as "used space" even though they may never get used again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07  4:15                   ` Dave Chinner
@ 2015-01-07 15:26                     ` Tejun Heo
  2015-01-08 14:28                       ` Martin K. Petersen
  2015-01-08 14:29                       ` Martin K. Petersen
  2015-01-08  4:05                     ` Phillip Susi
  1 sibling, 2 replies; 35+ messages in thread
From: Tejun Heo @ 2015-01-07 15:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel

Hello, Dave.

On Wed, Jan 07, 2015 at 03:15:37PM +1100, Dave Chinner wrote:
> w.r.t. thinp devices, we need to be able to guarantee that
> preallocated regions in the filesystem are actually backed by real
> blocks in the thinp device so we don't get ENOSPC from the thinp
> device. No filesystems do this yet because we don't have a mechanism
> for telling the lower layers "preallocate these blocks to zero".
> 
> The biggest issue is that we currently have no easy way to say
> "these blocks need to contain zeros, but we aren't actually using
> them yet". i.e. the filesystem code assumes that they contain zeros
> (e.g. in ext4 inode tables because mkfs used to zero them) if they
> haven't been used, so when it reads them it detects that
> initialisation is needed because the blocks are empty....
> 
> FWIW, some filesystems need these regions to actually contain
> zeros because they can't track unwritten extents (e.g.
> gfs2). Having sb_issue_zeroout() just do the right thing enables us
> to efficiently zero the regions they are preallocating...
> 
> > Earlier in the
> > thread, it was mentioned that this is currently mostly useful for
> > raids which need the blocks actually cleared for checksum consistency,
> > which basically means that raid metadata handling isn't (yet) capable
> > of just marking those (parts of) stripes as unused.  If a filesystem
> > wants to read back zeros from data blocks, wouldn't it be just marking
> > the matching index as such?
> 
> Not all filesystems can do this for user data (see gfs2 case above)
> and no linux filesystem tracks whether free space contains zeros or
> stale data. Hence if we want blocks to be zeroed on disk, we
> currently have to write zeros to them and hence they get pinned in
> devices as "used space" even though they may never get used again.

Okay, I'll take it that this benefits generic enough use cases from
the filesystem POV.  As long as that's the case, it prolly has a
reasonable chance of actually being widely and properly implemented by
hw vendors.

It's easy to complain about mainstream hardware but IMHO the market is
actually pretty efficient at shooting down extra cruft which doesn't
really matter and only exists to increase the number of feature
checkboxes.  Hopefully, this has actual benefits and won't end up that
way.

Martin, do you have a newer version or shall I apply the original one?

Thanks.

-- 
tejun


* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07  4:15                   ` Dave Chinner
  2015-01-07 15:26                     ` Tejun Heo
@ 2015-01-08  4:05                     ` Phillip Susi
  2015-01-08  4:58                       ` Andreas Dilger
  1 sibling, 1 reply; 35+ messages in thread
From: Phillip Susi @ 2015-01-08  4:05 UTC (permalink / raw)
  To: Dave Chinner, Tejun Heo
  Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel


On 01/06/2015 11:15 PM, Dave Chinner wrote:
> (e.g. in ext4 inode tables because mkfs used to zero them) if they 
> haven't been used, so when it reads them it detects that 
> initialisation is needed because the blocks are empty....

No, it knows that the inode table needs to be initialized because there
is a flag in the group descriptor that says this inode table is still
uninitialized.  It never reads the blocks to see if they are full of
zeros.  mke2fs sets the flag when it does not initialize the table
with zeros, either by direct writes (which it doesn't do if
lazy_itable_init is true, which it defaults to these days), or by
discarding the blocks when the device claims to support deterministic
discard that zeros.



* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08  4:05                     ` Phillip Susi
@ 2015-01-08  4:58                       ` Andreas Dilger
  2015-01-08 14:09                         ` Phillip Susi
  0 siblings, 1 reply; 35+ messages in thread
From: Andreas Dilger @ 2015-01-08  4:58 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Dave Chinner, Tejun Heo, Martin Petersen, James Bottomley,
	linux-ide, linux-fsdevel

On Jan 7, 2015, at 9:05 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 01/06/2015 11:15 PM, Dave Chinner wrote:
>> (e.g. in ext4 inode tables because mkfs used to zero them) if they 
>> haven't been used, so when it reads them it detects that 
>> initialisation is needed because the blocks are empty....
> 
> No, it knows that the inode table needs to be initialized because
> there is a flag in the group descriptor that says this inode table is
> still uninitialized.  It never reads the blocks to see if they are
> full of zeros.  mke2fs sets the flag when it does not initialize the
> table with zeros, either by direct writes (which it doesn't do if
> lazy_itable_init is true, which it defaults to these days), or by
> discarding the blocks when the device claims to support deterministic
> discard that zeros.

That is only partially correct.  While it is true that mke2fs sets the
UNINIT flag at format time, the "lazy" part of that means that a kernel
thread still does the zeroing of the inode table blocks, but after the
filesystem is mounted, for any group that does not have the ZEROED
flag set.  After that point, the "UNINIT" flag is an optimization to
avoid reading the bitmap and unused blocks from disk during allocation.

This is needed in case the group descriptor or inode bitmap is corrupted,
and e2fsck needs to scan the inode table for in-use inodes.  We don't
want it to find old inodes from before the filesystem was formatted.

The ext4_init_inode_table() calls sb_issue_zeroout->blkdev_issue_zeroout(),
so if the underlying storage supported deterministic zeroing of the
underlying storage, this could be handled very efficiently.
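
A condensed sketch of that path (heavily simplified, not the literal
fs/ext4/ialloc.c source; locking and error handling omitted):

/* Lazy-init thread zeroing one group's inode table, simplified. */
static int zero_itable_sketch(struct super_block *sb, ext4_group_t group)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
	ext4_fsblk_t blk = ext4_inode_table(sb, gdp); /* first itable block */

	/* sb_issue_zeroout() -> blkdev_issue_zeroout(): with LBPRZ set
	 * (i.e. a whitelisted drive) the zeroing could be satisfied by
	 * TRIM rather than by a stream of zero writes. */
	return sb_issue_zeroout(sb, blk, sbi->s_itb_per_group, GFP_NOFS);
}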

Cheers, Andreas







* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08  4:58                       ` Andreas Dilger
@ 2015-01-08 14:09                         ` Phillip Susi
  2015-01-08 22:31                           ` Andreas Dilger
  0 siblings, 1 reply; 35+ messages in thread
From: Phillip Susi @ 2015-01-08 14:09 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Dave Chinner, Tejun Heo, Martin Petersen, James Bottomley,
	linux-ide, linux-fsdevel


On 1/7/2015 11:58 PM, Andreas Dilger wrote:
>> No, it knows that the inode table needs to be initialized because
>> there is a flag in the group descriptor that says this inode table
>> is still uninitialized.  It never reads the blocks to see if they
>> are full of zeros.  mke2fs sets the flag when it does not initialize
>> the table with zeros, either by direct writes (which it doesn't do
>> if lazy_itable_init is true, which it defaults to these days), or by
>> discarding the blocks when the device claims to support
>> deterministic discard that zeros.
> 
> That is only partially correct.  While it is true that mke2fs sets
> the UNINIT flag at format time, the "lazy" part of that means that a
> kernel thread still does the zeroing of the inode table blocks, but
> after the filesystem is mounted, for any group that does not have
> the ZEROED flag set.  After that point, the "UNINIT" flag is an
> optimization to avoid reading the bitmap and unused blocks from
> disk during allocation.

That is pretty much what I said, except that I was pointing out that
it does not *read* first to see if the disk is already zeroed, as that
would be a waste of time.  It just writes out the zeros for block
groups that still have the uninit flag set.

> This is needed in case the group descriptor or inode bitmap is
> corrupted, and e2fsck needs to scan the inode table for in-use
> inodes.  We don't want it to find old inodes from before the
> filesystem was formatted.
> 
> The ext4_init_inode_table() calls
> sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying
> storage supported deterministic zeroing of the underlying storage,
> this could be handled very efficiently.

Again, that's pretty much what I said.



* [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07 15:26                     ` Tejun Heo
@ 2015-01-08 14:28                       ` Martin K. Petersen
  2015-01-08 15:11                         ` Tejun Heo
  2015-01-08 15:58                         ` Tim Small
  2015-01-08 14:29                       ` Martin K. Petersen
  1 sibling, 2 replies; 35+ messages in thread
From: Martin K. Petersen @ 2015-01-08 14:28 UTC (permalink / raw)
  To: tj; +Cc: linux-ide, linux-fsdevel, Martin K. Petersen

As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
Zero After Trim) flags in the ATA Command Set are unreliable in the
sense that they only define what happens if the device successfully
executed the DSM TRIM command. TRIM is only advisory, however, and the
device is free to silently ignore all or parts of the request.

In practice this renders the DRAT and RZAT flags completely useless,
and because the results are unpredictable we decided to disable discard
in MD for 3.18 to avoid the risk of data corruption.

Hardware vendors in the real world obviously need better guarantees than
what the standards bodies provide. Unfortunately those guarantees are
encoded in product requirements documents rather than somewhere we can
key off of them programmatically. So we are compelled to disable
discard_zeroes_data for all devices unless we explicitly have data to
support whitelisting them.

This patch whitelists SSDs from a few of the main vendors. None of the
whitelists are based on written guarantees. They are purely based on
empirical evidence collected from internal and external users that have
tested or qualified these drives in RAID deployments.

The whitelist is only meant as a starting point and is by no means
comprehensive:

   - All intel SSD models except for 510
   - Micron M5?0/M600
   - Samsung SSDs
   - Seagate SSDs

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/ata/libata-core.c | 18 ++++++++++++++----
 drivers/ata/libata-scsi.c | 10 ++++++----
 include/linux/libata.h    |  1 +
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 5c84fb5c3372..7c03401dc5d8 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4233,10 +4233,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "PIONEER DVD-RW  DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
 
 	/* devices that don't properly handle queued TRIM commands */
-	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+	{ "Micron_M[56]*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
+						ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Crucial_CT*SSD*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+
+	/*
+	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
+	 * SSDs that provide reliable zero after TRIM.
+	 */
+	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
+	{ "INTEL*SSD*", 		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
 
 	/*
 	 * Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 7659d6468303..280729325ebd 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2532,13 +2532,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[15] = lowest_aligned;
 
 		if (ata_id_has_trim(args->id)) {
-			rbuf[14] |= 0x80; /* TPE */
+			rbuf[14] |= 0x80; /* LBPME */
 
-			if (ata_id_has_zero_after_trim(args->id))
-				rbuf[14] |= 0x40; /* TPRZ */
+			if (ata_id_has_zero_after_trim(args->id) &&
+			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+				ata_dev_info(dev, "Enabling discard_zeroes_data\n");
+				rbuf[14] |= 0x40; /* LBPRZ */
+			}
 		}
 	}
-
 	return 0;
 }
 
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2d182413b1db..f2b440e44fd7 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -422,6 +422,7 @@ enum {
 	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
 	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
 	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
+	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
 
 	 /* DMA mask for user DMA control: User visible values; DO NOT
 	    renumber */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07 15:26                     ` Tejun Heo
  2015-01-08 14:28                       ` Martin K. Petersen
@ 2015-01-08 14:29                       ` Martin K. Petersen
  1 sibling, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2015-01-08 14:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Dave Chinner, Martin K. Petersen, James Bottomley, linux-ide,
	linux-fsdevel

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> do you have a newer version or shall I apply the original one?

Just sent an updated patch that catches the latest Micron/Crucial drives
as well.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 14:28                       ` Martin K. Petersen
@ 2015-01-08 15:11                         ` Tejun Heo
  2015-01-08 15:34                           ` Martin K. Petersen
  2015-01-08 15:58                         ` Tim Small
  1 sibling, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2015-01-08 15:11 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-ide, linux-fsdevel

Hello, Martin.

The patch looks okay to me.  Just one nit.

On Thu, Jan 08, 2015 at 09:28:31AM -0500, Martin K. Petersen wrote:
> @@ -4233,10 +4233,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
>  	{ "PIONEER DVD-RW  DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
>  
>  	/* devices that don't properly handle queued TRIM commands */
> -	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> +	{ "Micron_M[56]*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
> +						ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "Crucial_CT*SSD*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> +
> +	/*
> +	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
> +	 * SSDs that provide reliable zero after TRIM.
> +	 */

Can you please be more detailed on the above explanation?  It's not
like most people would be familiar with acronyms like DRAT/RZAT and
what the surrounding situation is like to require this sort of
whitelisting.

> +	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
> +	{ "INTEL*SSD*", 		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },

I think the above two can use a bit more verbose explanation too.

> +	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 15:11                         ` Tejun Heo
@ 2015-01-08 15:34                           ` Martin K. Petersen
  2015-01-08 15:36                             ` Tejun Heo
  0 siblings, 1 reply; 35+ messages in thread
From: Martin K. Petersen @ 2015-01-08 15:34 UTC (permalink / raw)
  To: tj; +Cc: linux-ide, linux-fsdevel, Martin K. Petersen

As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
Zero After Trim) flags in the ATA Command Set are unreliable in the
sense that they only define what happens if the device successfully
executed the DSM TRIM command. TRIM is only advisory, however, and the
device is free to silently ignore all or parts of the request.

In practice this renders the DRAT and RZAT flags completely useless,
and because the results are unpredictable we decided to disable discard
in MD for 3.18 to avoid the risk of data corruption.

Hardware vendors in the real world obviously need better guarantees than
what the standards bodies provide. Unfortunately those guarantees are
encoded in product requirements documents rather than somewhere we can
key off of them programmatically. So we are compelled to disable
discard_zeroes_data for all devices unless we explicitly have data to
support whitelisting them.

This patch whitelists SSDs from a few of the main vendors. None of the
whitelists are based on written guarantees. They are purely based on
empirical evidence collected from internal and external users that have
tested or qualified these drives in RAID deployments.

The whitelist is only meant as a starting point and is by no means
comprehensive:

   - All intel SSD models except for 510
   - Micron M5?0/M600
   - Samsung SSDs
   - Seagate SSDs

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/ata/libata-core.c | 31 +++++++++++++++++++++++++++----
 drivers/ata/libata-scsi.c | 10 ++++++----
 include/linux/libata.h    |  1 +
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 5c84fb5c3372..23c2ae03a7ab 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4233,10 +4233,33 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "PIONEER DVD-RW  DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
 
 	/* devices that don't properly handle queued TRIM commands */
-	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+	{ "Micron_M[56]*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
+						ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Crucial_CT*SSD*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+
+	/*
+	 * As defined, the DRAT (Deterministic Read After Trim) and RZAT
+	 * (Return Zero After Trim) flags in the ATA Command Set are
+	 * unreliable in the sense that they only define what happens if
+	 * the device successfully executed the DSM TRIM command. TRIM
+	 * is only advisory, however, and the device is free to silently
+	 * ignore all or parts of the request.
+	 *
+	 * Whitelist drives that are known to reliably return zeroes
+	 * after TRIM.
+	 */
+
+	/*
+	 * The intel 510 drive has buggy DRAT/RZAT. Explicitly exclude
+	 * that model before whitelisting all other intel SSDs.
+	 */
+	{ "INTEL*SSDSC2MH*",		NULL,	0, },
+
+	{ "INTEL*SSD*", 		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
 
 	/*
 	 * Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 7659d6468303..280729325ebd 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2532,13 +2532,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[15] = lowest_aligned;
 
 		if (ata_id_has_trim(args->id)) {
-			rbuf[14] |= 0x80; /* TPE */
+			rbuf[14] |= 0x80; /* LBPME */
 
-			if (ata_id_has_zero_after_trim(args->id))
-				rbuf[14] |= 0x40; /* TPRZ */
+			if (ata_id_has_zero_after_trim(args->id) &&
+			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+				ata_dev_info(dev, "Enabling discard_zeroes_data\n");
+				rbuf[14] |= 0x40; /* LBPRZ */
+			}
 		}
 	}
-
 	return 0;
 }
 
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2d182413b1db..f2b440e44fd7 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -422,6 +422,7 @@ enum {
 	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
 	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
 	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
+	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
 
 	 /* DMA mask for user DMA control: User visible values; DO NOT
 	    renumber */
-- 
1.9.3
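
The libata-scsi.c hunk is what ultimately surfaces the guarantee: byte
14 of the READ CAPACITY(16) payload carries the LBPME and LBPRZ bits
that the SCSI disk layer keys discard_zeroes_data off of. A small
decoding sketch, with the SBC-3 bit positions restated as assumptions:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct lbp_bits {
	bool lbpme;	/* logical block provisioning management enabled */
	bool lbprz;	/* logical block provisioning read zeros */
};

static struct lbp_bits decode_readcap16_byte14(uint8_t b)
{
	struct lbp_bits r = { .lbpme = b & 0x80, .lbprz = b & 0x40 };
	return r;
}

int main(void)
{
	/* 0xc0 is what a whitelisted, TRIM-capable drive should report */
	struct lbp_bits r = decode_readcap16_byte14(0xc0);

	printf("LBPME=%d LBPRZ=%d\n", r.lbpme, r.lbprz);
	return 0;
}

On a whitelisted drive the kernel now also logs "Enabling
discard_zeroes_data", and the result is exported to userspace via
/sys/block/<dev>/queue/discard_zeroes_data.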


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 15:34                           ` Martin K. Petersen
@ 2015-01-08 15:36                             ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2015-01-08 15:36 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-ide, linux-fsdevel

On Thu, Jan 08, 2015 at 10:34:27AM -0500, Martin K. Petersen wrote:
> As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
> Zero After Trim) flags in the ATA Command Set are unreliable in the
> sense that they only define what happens if the device successfully
> executed the DSM TRIM command. TRIM is only advisory, however, and the
> device is free to silently ignore all or parts of the request.
> 
> In practice this renders the DRAT and RZAT flags completely useless,
> and because the results are unpredictable we decided to disable discard
> in MD for 3.18 to avoid the risk of data corruption.
> 
> Hardware vendors in the real world obviously need better guarantees than
> what the standards bodies provide. Unfortunately those guarantees are
> encoded in product requirements documents rather than somewhere we can
> key off of them programmatically. So we are compelled to disable
> discard_zeroes_data for all devices unless we explicitly have data to
> support whitelisting them.
> 
> This patch whitelists SSDs from a few of the main vendors. None of the
> whitelists are based on written guarantees. They are purely based on
> empirical evidence collected from internal and external users that have
> tested or qualified these drives in RAID deployments.
> 
> The whitelist is only meant as a starting point and is by no means
> comprehensive:
> 
>    - All intel SSD models except for 510
>    - Micron M5?0/M600
>    - Samsung SSDs
>    - Seagate SSDs
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Applied to libata/for-3.19-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 14:28                       ` Martin K. Petersen
  2015-01-08 15:11                         ` Tejun Heo
@ 2015-01-08 15:58                         ` Tim Small
  2015-01-09 20:52                           ` Martin K. Petersen
  1 sibling, 1 reply; 35+ messages in thread
From: Tim Small @ 2015-01-08 15:58 UTC (permalink / raw)
  To: Martin K. Petersen, tj; +Cc: linux-ide, linux-fsdevel

On 08/01/15 14:28, Martin K. Petersen wrote:


> +	{ "Micron_M[56]*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
> +						ATA_HORKAGE_ZERO_AFTER_TRIM, },


A minor quibble - but all the other HORKAGE flags are used to identify
things which are broken or non-functional about a device (and this is
implicit with the use of the word 'horkage' IMO).

However, this proposed flag is trying to imply behaviour which is an
"extra feature" (beyond the requirements of the spec).  So the intention
is to whitelist using a mechanism which has otherwise only been used
to blacklist.

I know it's difficult coming up with something which isn't too
wordy/verbose, but IMO any of:

ATA_HORKAGE_TRIM_ALWAYS_ZEROS

ATA_TRIM_ALWAYS_ZEROS

ATA_RELIABLE_ZERO_AFTER_TRIM


would seem clearer to me, because as the patch currently stands,
"ATA_HORKAGE_ZERO_AFTER_TRIM" implies to me "zero after trim is broken
on this device".

Particularly with the Micron excerpt quoted above, I initially parsed
this as "don't issue tagged TRIM commands to this device, and don't
assume it'll read zeros after TRIM either".


Tim.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 14:09                         ` Phillip Susi
@ 2015-01-08 22:31                           ` Andreas Dilger
  0 siblings, 0 replies; 35+ messages in thread
From: Andreas Dilger @ 2015-01-08 22:31 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Dave Chinner, Tejun Heo, Martin Petersen, James Bottomley,
	linux-ide, linux-fsdevel

On Jan 8, 2015, at 7:09 AM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 1/7/2015 11:58 PM, Andreas Dilger wrote:
>>> No, it knows that the inode table needs to be initialized because
>>> there is a flag in the group descriptor that says this inode table
>>> is still uninitialized.  It never reads the blocks to see if they
>>> are full of zeros.  mke2fs sets the flag when it does not
>>> initialize the table with zeros, either by direct writes (which it
>>> doesn't do if lazy_itable_init is true, which it defaults to these
>>> days), or by discarding the blocks when the device claims to
>>> support deterministic discard that zeros.
>> 
>> That is only partially correct.  While it is true that mke2fs sets
>> the UNINIT flag at format time, the "lazy" part of that means a
>> kernel thread still does the zeroing of the inode table blocks, but
>> after the filesystem is mounted, for any group that does not have
>> the ZEROED flag set.  After that point, the "UNINIT" flag is an
>> optimization to avoid reading the bitmap and unused blocks from
>> disk during allocation.
> 
> That is pretty much what I said, except that I was pointing out that
> it does not *read* first to see if the disk is already zeroed, as that
> would be a waste of time.  It just writes out the zeros for block
> groups that still have the uninit flag set.

Sorry, I didn't get that from my reading, so I thought I'd clarify.
I'd actually proposed that the ext4_init_inode_table() thread start
by reading the itable blocks first, check them for zeroes, and only
switch over to writing if it finds any non-zero data in the blocks.

I think that would be a net win in some cases, and only a tiny bit
of overhead (a single read) if it turns out to be wrong.
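
To make that concrete, a userspace-style sketch of the read-then-zero
idea (hypothetical fd/offset interface, not the actual
ext4_init_inode_table() code):

#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Returns true iff the whole buffer is zero; assumes len > 0. */
static bool buf_is_zero(const unsigned char *buf, size_t len)
{
	return buf[0] == 0 && !memcmp(buf, buf + 1, len - 1);
}

/* Zero a region only if a read shows it is not already zeroed. */
static int zero_if_needed(int fd, off_t off, unsigned char *buf, size_t len)
{
	if (pread(fd, buf, len, off) != (ssize_t)len)
		return -1;
	if (buf_is_zero(buf, len))
		return 0;		/* already clean, skip the write */
	memset(buf, 0, len);
	return pwrite(fd, buf, len, off) == (ssize_t)len ? 0 : -1;
}

If the device honored RZAT for those ranges, the read finds zeroes and
the write is skipped entirely.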

>> This is needed in case the group descriptor or inode bitmap is
>> corrupted, and e2fsck needs to scan the inode table for in-use
>> inodes.  We don't want it to find old inodes from before the
>> filesystem was formatted.
>> 
>> The ext4_init_inode_table() calls
>> sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying
>> storage supported deterministic zeroing of the underlying storage,
>> this could be handled very efficiently.
> 
> Again, that's pretty much what I said.


Cheers, Andreas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 15:58                         ` Tim Small
@ 2015-01-09 20:52                           ` Martin K. Petersen
  2015-01-09 21:39                             ` Tejun Heo
  0 siblings, 1 reply; 35+ messages in thread
From: Martin K. Petersen @ 2015-01-09 20:52 UTC (permalink / raw)
  To: Tim Small; +Cc: Martin K. Petersen, tj, linux-ide, linux-fsdevel

>>>>> "Tim" == Tim Small <tim@seoss.co.uk> writes:

Tim> would seem clearer to me, because as the patch currently stands,
Tim> "ATA_HORKAGE_ZERO_AFTER_TRIM" implies to me "zero after trim is
Tim> broken on this device".

In SCSI we often use the term "quirk" regardless of whether we enable or
disable a feature. So "horkage" didn't really bother me much in this
context. But I'm happy to submit a name change patch if Tejun thinks
it's a valid concern.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-09 20:52                           ` Martin K. Petersen
@ 2015-01-09 21:39                             ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2015-01-09 21:39 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Tim Small, linux-ide, linux-fsdevel

On Fri, Jan 09, 2015 at 03:52:21PM -0500, Martin K. Petersen wrote:
> >>>>> "Tim" == Tim Small <tim@seoss.co.uk> writes:
> 
> Tim> would seem clearer to me, because as the patch currently stands,
> Tim> "ATA_HORKAGE_ZERO_AFTER_TRIM" implies to me "zero after trim is
> Tim> broken on this device".
> 
> In SCSI we often use the term "quirk" regardless of whether we enable or
> disable a feature. So "horkage" didn't really bother me much in this
> context. But I'm happy to submit a name change patch if Tejun thinks
> it's a valid concern.

I think it's fine.  It's not like its polarity is confusing - nobody
would read it as meaning the opposite.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2015-01-09 21:39 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-04  2:44 [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
2014-12-04  3:02 ` Phillip Susi
2014-12-04  3:24   ` Martin K. Petersen
2014-12-04  3:28     ` Phillip Susi
2014-12-04  3:35       ` Martin K. Petersen
2014-12-04  4:40         ` Phillip Susi
2014-12-05  1:53           ` Martin K. Petersen
2014-12-04 21:49         ` One Thousand Gnomes
2014-12-05  2:46           ` Martin K. Petersen
2014-12-04 17:06 ` Tejun Heo
2014-12-05  2:13   ` Martin K. Petersen
2014-12-05 14:51     ` Tejun Heo
2014-12-10  4:09       ` James Bottomley
2014-12-10 14:29         ` Tejun Heo
2014-12-10 20:34           ` James Bottomley
2014-12-10 21:02           ` Martin K. Petersen
2014-12-12  8:35             ` Ming Lei
2015-01-05 16:28             ` Tejun Heo
2015-01-07  0:05               ` Martin K. Petersen
2015-01-07  2:54                 ` Tejun Heo
2015-01-07  4:15                   ` Dave Chinner
2015-01-07 15:26                     ` Tejun Heo
2015-01-08 14:28                       ` Martin K. Petersen
2015-01-08 15:11                         ` Tejun Heo
2015-01-08 15:34                           ` Martin K. Petersen
2015-01-08 15:36                             ` Tejun Heo
2015-01-08 15:58                         ` Tim Small
2015-01-09 20:52                           ` Martin K. Petersen
2015-01-09 21:39                             ` Tejun Heo
2015-01-08 14:29                       ` Martin K. Petersen
2015-01-08  4:05                     ` Phillip Susi
2015-01-08  4:58                       ` Andreas Dilger
2015-01-08 14:09                         ` Phillip Susi
2015-01-08 22:31                           ` Andreas Dilger
2014-12-10 15:43         ` Tim Small
