All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] Discard update for 3.19
@ 2014-11-07  5:08 Martin K. Petersen
  2014-11-07  5:08 ` [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-07  5:08 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-fsdevel, neilb

This update is mainly motivated by an attempt to make the
discard_zeroes_data reporting more accurate. As we have discussed
several times in the past we are stuck with pretty weak guarantees from
the T10/T13 standards. And as a result we feel compelled to tighten up
the scenarios under which we advertise discard_zeroes_data since several
applications and subsystems depend on it being accurate.

The first patch is the most controversial. It disables
discard_zeroes_data for libata devices unless they explicitly have been
whitelisted. I had hoped to have a more comprehensive list of drives but
I didn't have much luck in procuring the identify strings that would
allow me to generate the whitelist matching patterns. I could use some
help here.

The second patch tweaks the SCSI disk driver to prefer WRITE SAME w/
UNMAP instead of the UNMAP command since the former has deterministic
behavior.

The lack of a hard discard_zeroes_data guarantees has also prevented us
from having a variant of blkdev_issue_zeroout() that discards if
possible. The last patch in this series will add such a call that the
filesystems and virt block drivers can use to clear and deprovision
block ranges.

-- 
Martin K. Petersen	Oracle Linux Engineering


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-11-07  5:08 [RFC] Discard update for 3.19 Martin K. Petersen
@ 2014-11-07  5:08 ` Martin K. Petersen
  2014-11-07  8:24   ` Christoph Hellwig
                     ` (2 more replies)
  2014-11-07  5:08 ` [PATCH 2/3] sd: Disable discard_zeroes_data for UNMAP Martin K. Petersen
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-07  5:08 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-fsdevel, neilb; +Cc: Martin K. Petersen

As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
Zero After Trim) flags in the ATA Command Set are unreliable in the
sense that they only define what happens if the device successfully
executed the DSM TRIM command. TRIM is only advisory, however, and the
device is free to silently ignore all or parts of the request.

In practice this renders the DRAT and RZAT flags completely useless and
because the results are unpredictable we decided to disable discard in
MD for 3.18 to avoid the risk of data corruption.

Hardware vendors in the real world obviously need better guarantees than
what the standards bodies provide. Unfortuntely those guarantees are
encoded in product requirements documents rather than somewhere we can
key off of them programatically. So we are compelled to disabling
discard_zeroes_data for all devices unless we explicitly have data to
support whitelisting them.

This patch whitelists SSDs from a few of the main vendors. None of the
whitelists are based on written guarantees. They are purely based on
empirical evidence collected from internal and external users that have
tested or qualified these drives in RAID deployments.

The whitelist is only meant as a starting point and is by no means
comprehensive:

   - All intel SSD models except for 510
   - Micron M5*
   - Samsung SSDs
   - Seagate SSDs

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/ata/libata-core.c | 18 ++++++++++++++----
 drivers/ata/libata-scsi.c | 10 ++++++----
 include/linux/libata.h    |  1 +
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c5ba15af87d3..f41f24a8bc21 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4225,10 +4225,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "PIONEER DVD-RW  DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
 
 	/* devices that don't properly handle queued TRIM commands */
-	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+	{ "Micron_M5?0*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
+						ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Crucial_CT???M5?0SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+
+	/*
+	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
+	 * SSDs that provide reliable zero after TRIM.
+	 */
+	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
+	{ "INTEL*SSD*", 		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
 
 	/*
 	 * Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 0586f66d70fa..deaa6e34ed4d 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2515,13 +2515,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[15] = lowest_aligned;
 
 		if (ata_id_has_trim(args->id)) {
-			rbuf[14] |= 0x80; /* TPE */
+			rbuf[14] |= 0x80; /* LBPME */
 
-			if (ata_id_has_zero_after_trim(args->id))
-				rbuf[14] |= 0x40; /* TPRZ */
+			if (ata_id_has_zero_after_trim(args->id) &&
+			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+				ata_dev_warn(dev, "Enabling discard_zeroes_data\n");
+				rbuf[14] |= 0x40; /* LBPRZ */
+			}
 		}
 	}
-
 	return 0;
 }
 
diff --git a/include/linux/libata.h b/include/linux/libata.h
index bd5fefeaf548..45ac825b8366 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -421,6 +421,7 @@ enum {
 	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
 	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
 	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
+	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
 
 	 /* DMA mask for user DMA control: User visible values; DO NOT
 	    renumber */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/3] sd: Disable discard_zeroes_data for UNMAP
  2014-11-07  5:08 [RFC] Discard update for 3.19 Martin K. Petersen
  2014-11-07  5:08 ` [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
@ 2014-11-07  5:08 ` Martin K. Petersen
  2014-11-07  8:25   ` Christoph Hellwig
  2014-11-10 23:43   ` Paolo Bonzini
  2014-11-07  5:08 ` [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function Martin K. Petersen
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-07  5:08 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-fsdevel, neilb; +Cc: Martin K. Petersen

The T10 SBC UNMAP command does not provide any hard guarantees that
blocks will return zeroes on a subsequent READ. This is due to the fact
that the device server is free to silently ignore all or parts of the
request.

The only way to ensure that a block consistently returns zeroes after
being unmapped is to use WRITE SAME with the UNMAP bit set. Should the
device be unable to unmap one or more blocks described by the command it
is required to manually write zeroes to them.

Until now we have preferred UNMAP over the WRITE SAME variants to
accommodate thinly provisioned devices that predated the final SBC-3
spec. This patch changes the heuristic so that we favor WRITE SAME(16)
or (10) over UNMAP if these commands are marked as supported in the
Logical Block Provisioning VPD page.

The patch also disables discard_zeroes_data for devices operating in
UNMAP mode.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/scsi/sd.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b041eca8955d..95bfb7bfbb9d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -656,7 +656,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 	unsigned int logical_block_size = sdkp->device->sector_size;
 	unsigned int max_blocks = 0;
 
-	q->limits.discard_zeroes_data = sdkp->lbprz;
+	q->limits.discard_zeroes_data = 0;
 	q->limits.discard_alignment = sdkp->unmap_alignment *
 		logical_block_size;
 	q->limits.discard_granularity =
@@ -680,11 +680,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 	case SD_LBP_WS16:
 		max_blocks = min_not_zero(sdkp->max_ws_blocks,
 					  (u32)SD_MAX_WS16_BLOCKS);
+		q->limits.discard_zeroes_data = sdkp->lbprz;
 		break;
 
 	case SD_LBP_WS10:
 		max_blocks = min_not_zero(sdkp->max_ws_blocks,
 					  (u32)SD_MAX_WS10_BLOCKS);
+		q->limits.discard_zeroes_data = sdkp->lbprz;
 		break;
 
 	case SD_LBP_ZERO:
@@ -2622,12 +2624,12 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 
 		} else {	/* LBP VPD page tells us what to use */
 
-			if (sdkp->lbpu && sdkp->max_unmap_blocks)
-				sd_config_discard(sdkp, SD_LBP_UNMAP);
-			else if (sdkp->lbpws)
+			if (sdkp->lbpws)
 				sd_config_discard(sdkp, SD_LBP_WS16);
 			else if (sdkp->lbpws10)
 				sd_config_discard(sdkp, SD_LBP_WS10);
+			else if (sdkp->lbpu && sdkp->max_unmap_blocks)
+				sd_config_discard(sdkp, SD_LBP_UNMAP);
 			else
 				sd_config_discard(sdkp, SD_LBP_DISABLE);
 		}
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function
  2014-11-07  5:08 [RFC] Discard update for 3.19 Martin K. Petersen
  2014-11-07  5:08 ` [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
  2014-11-07  5:08 ` [PATCH 2/3] sd: Disable discard_zeroes_data for UNMAP Martin K. Petersen
@ 2014-11-07  5:08 ` Martin K. Petersen
  2014-11-07  8:26   ` Christoph Hellwig
  2014-11-11  0:04   ` Darrick J. Wong
  2014-11-07 12:09 ` [RFC] Discard update for 3.19 Bernd Schubert
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-07  5:08 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-fsdevel, neilb; +Cc: Martin K. Petersen

blkdev_issue_discard() will zero a given block range on disk. This is
done by way of either WRITE SAME or regular WRITE. I.e. the blocks on
disk will be written and thus provisioned.

There are use cases where the desired behavior is to zero the blocks but
unprovision them if possible. The blocks must deterministically contain
zeroes when they are subsequently read back.

This patch introduces a blkdev_issue_zeroout_discard() call that
provides this functionality. If a block device guarantees
discard_zeroes_data the new function will use discard to clear the block
range. If the device does not support discard_zeroes_data or if the
discard request fails we will fall back to blkdev_issue_zeroout() to
ensure predictable results.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 block/blk-lib.c        | 44 ++++++++++++++++++++++++++++++++++++++++++--
 include/linux/blkdev.h |  2 ++
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 8411be3c19d3..2ffec6a01c71 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -278,14 +278,18 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 }
 
 /**
- * blkdev_issue_zeroout - zero-fill a block range
+ * blkdev_issue_zeroout - zero-fill and provision a block range
  * @bdev:	blockdev to write
  * @sector:	start sector
  * @nr_sects:	number of sectors to write
  * @gfp_mask:	memory allocation flags (for bio_alloc)
  *
  * Description:
- *  Generate and issue number of bios with zerofiled pages.
+ *  Zero-fill a block range. The blocks will be provisioned
+ *  (allocated/anchored) and are guaranteed to return zeroes when read
+ *  back. This function will attempt to use WRITE SAME to optimize the
+ *  process if the block device supports it. Otherwise it will fall back
+ *  to zeroing the blocks using regular WRITE calls.
  */
 
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
@@ -305,3 +309,39 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_issue_zeroout_discard - zero-fill and attempt to discard block range
+ * @bdev:	blockdev to write
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to write
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *  Zero-fill a block range. In contrast to blkdev_issue_zeroout() this
+ *  function will attempt to deprovision (deallocate/discard) the blocks
+ *  in question. It will only do so if the underlying device guarantees
+ *  that subsequent READ operations to the block range in question will
+ *  return zeroes. If the device does not provide hard guarantees or if
+ *  the DISCARD attempt should fail the block range will be explicitly
+ *  zeroed using blkdev_issue_zeroout().
+ */
+
+int blkdev_issue_zeroout_discard(struct block_device *bdev, sector_t sector,
+				 sector_t nr_sects, gfp_t gfp_mask)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (blk_queue_discard(q) && q->limits.discard_zeroes_data) {
+		unsigned char bdn[BDEVNAME_SIZE];
+
+		if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0))
+			return 0;
+
+		bdevname(bdev, bdn);
+		pr_err("%s: DISCARD failed. Manually zeroing.\n", bdn);
+	}
+
+	return blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
+}
+EXPORT_SYMBOL(blkdev_issue_zeroout_discard);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aac0f9ea952a..078b6e5f488a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1164,6 +1164,8 @@ extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			sector_t nr_sects, gfp_t gfp_mask);
+extern int blkdev_issue_zeroout_discard(struct block_device *bdev,
+			sector_t sector, sector_t nr_sects, gfp_t gfp_mask);
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
 		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
 {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-11-07  5:08 ` [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
@ 2014-11-07  8:24   ` Christoph Hellwig
  2014-11-07 15:37     ` Martin K. Petersen
  2014-12-05 16:45   ` Paolo Bonzini
  2014-12-05 22:58   ` Elliott, Robert (Server Storage)
  2 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2014-11-07  8:24 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide, linux-fsdevel, neilb

On Fri, Nov 07, 2014 at 12:08:12AM -0500, Martin K. Petersen wrote:
>  		if (ata_id_has_trim(args->id)) {
> -			rbuf[14] |= 0x80; /* TPE */
> +			rbuf[14] |= 0x80; /* LBPME */
>  
> -			if (ata_id_has_zero_after_trim(args->id))
> -				rbuf[14] |= 0x40; /* TPRZ */
> +			if (ata_id_has_zero_after_trim(args->id) &&
> +			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
> +				ata_dev_warn(dev, "Enabling discard_zeroes_data\n");

I think this should _info, not _warn.

Otherwise looks good to me,

Reviewed-by: Christoph Hellwig <hch@lst.de>

It would be nice if there was a way to trigger the flag from userspace,
so that we don't need to rebuild the kernel to add a whitelist entry.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] sd: Disable discard_zeroes_data for UNMAP
  2014-11-07  5:08 ` [PATCH 2/3] sd: Disable discard_zeroes_data for UNMAP Martin K. Petersen
@ 2014-11-07  8:25   ` Christoph Hellwig
  2014-11-10 23:43   ` Paolo Bonzini
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2014-11-07  8:25 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide, linux-fsdevel, neilb

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function
  2014-11-07  5:08 ` [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function Martin K. Petersen
@ 2014-11-07  8:26   ` Christoph Hellwig
  2014-11-07 15:42     ` Martin K. Petersen
  2014-11-11  0:04   ` Darrick J. Wong
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2014-11-07  8:26 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide, linux-fsdevel, neilb

On Fri, Nov 07, 2014 at 12:08:14AM -0500, Martin K. Petersen wrote:
> blkdev_issue_discard() will zero a given block range on disk. This is
> done by way of either WRITE SAME or regular WRITE. I.e. the blocks on
> disk will be written and thus provisioned.
> 
> There are use cases where the desired behavior is to zero the blocks but
> unprovision them if possible. The blocks must deterministically contain
> zeroes when they are subsequently read back.
> 
> This patch introduces a blkdev_issue_zeroout_discard() call that
> provides this functionality. If a block device guarantees
> discard_zeroes_data the new function will use discard to clear the block
> range. If the device does not support discard_zeroes_data or if the
> discard request fails we will fall back to blkdev_issue_zeroout() to
> ensure predictable results.

I'm not a fan of adding another function here and would prefer a flag,
but it looks correct, so:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] Discard update for 3.19
  2014-11-07  5:08 [RFC] Discard update for 3.19 Martin K. Petersen
                   ` (2 preceding siblings ...)
  2014-11-07  5:08 ` [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function Martin K. Petersen
@ 2014-11-07 12:09 ` Bernd Schubert
  2014-11-07 12:11 ` Bernd Schubert
  2014-11-10 14:19 ` Christoph Hellwig
  5 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2014-11-07 12:09 UTC (permalink / raw)
  To: linux-fsdevel

> The second patch tweaks the SCSI disk driver to prefer WRITE SAME w/
> UNMAP instead of the UNMAP command since the former has deterministic
> behavior.

I have seen data corruption on an Intel SSD 510 after sending WRITE SAME
UNMAP. As far as I remember, these disks ignore writes after sending
that command. Unmap worked fine, though.

Cheers,
Bernd



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] Discard update for 3.19
  2014-11-07  5:08 [RFC] Discard update for 3.19 Martin K. Petersen
                   ` (3 preceding siblings ...)
  2014-11-07 12:09 ` [RFC] Discard update for 3.19 Bernd Schubert
@ 2014-11-07 12:11 ` Bernd Schubert
  2014-11-07 15:46     ` Martin K. Petersen
  2014-11-10 14:19 ` Christoph Hellwig
  5 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2014-11-07 12:11 UTC (permalink / raw)
  To: Martin K. Petersen, linux-scsi, linux-ide, linux-fsdevel, neilb

> The second patch tweaks the SCSI disk driver to prefer WRITE SAME w/
> UNMAP instead of the UNMAP command since the former has deterministic
> behavior.

I have seen data corruption on an Intel SSD 510 after sending WRITE SAME
UNMAP. As far as I remember, these disks ignore writes after sending
that command. Unmap worked fine, though. So possibly there is another
blacklisting required.

Cheers,
Bernd


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-11-07  8:24   ` Christoph Hellwig
@ 2014-11-07 15:37     ` Martin K. Petersen
  0 siblings, 0 replies; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-07 15:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, linux-scsi, linux-ide, linux-fsdevel, neilb

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

>> + ata_dev_warn(dev, "Enabling discard_zeroes_data\n");

Christoph> I think this should _info, not _warn.

Fixed.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function
  2014-11-07  8:26   ` Christoph Hellwig
@ 2014-11-07 15:42     ` Martin K. Petersen
  2014-11-07 16:20       ` Theodore Ts'o
  0 siblings, 1 reply; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-07 15:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, linux-scsi, linux-ide, linux-fsdevel, neilb,
	Theodore Ts'o

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

Christoph> I'm not a fan of adding another function here and would
Christoph> prefer a flag, but it looks correct, 

That was my original approach too but I didn't want to stomp over all
the existing callers. Although there only are few.

Ted: Which would you prefer?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] Discard update for 3.19
  2014-11-07 12:11 ` Bernd Schubert
@ 2014-11-07 15:46     ` Martin K. Petersen
  0 siblings, 0 replies; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-07 15:46 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Martin K. Petersen, linux-scsi, linux-ide, linux-fsdevel, neilb

>>>>> "Bernd" == Bernd Schubert <bschubert@ddn.com> writes:

Bernd,

Bernd> I have seen data corruption on an Intel SSD 510 after sending
Bernd> WRITE SAME UNMAP. As far as I remember, these disks ignore writes
Bernd> after sending that command. Unmap worked fine, though. So
Bernd> possibly there is another blacklisting required.

Well, the 510 is a SATA drive and doesn't implement either command so
that would be an LSI SATL problem. I recall that we had some problems
with the LSI firmware blacklisting DRAT/RZAT on that particular drive so
I did the same in my libata patch.

I don't have any bugs open with WRITE SAME or UNMAP on recent LSI
firmware, though. We use it extensively here.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] Discard update for 3.19
@ 2014-11-07 15:46     ` Martin K. Petersen
  0 siblings, 0 replies; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-07 15:46 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Martin K. Petersen, linux-scsi, linux-ide, linux-fsdevel, neilb

>>>>> "Bernd" == Bernd Schubert <bschubert@ddn.com> writes:

Bernd,

Bernd> I have seen data corruption on an Intel SSD 510 after sending
Bernd> WRITE SAME UNMAP. As far as I remember, these disks ignore writes
Bernd> after sending that command. Unmap worked fine, though. So
Bernd> possibly there is another blacklisting required.

Well, the 510 is a SATA drive and doesn't implement either command so
that would be an LSI SATL problem. I recall that we had some problems
with the LSI firmware blacklisting DRAT/RZAT on that particular drive so
I did the same in my libata patch.

I don't have any bugs open with WRITE SAME or UNMAP on recent LSI
firmware, though. We use it extensively here.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function
  2014-11-07 15:42     ` Martin K. Petersen
@ 2014-11-07 16:20       ` Theodore Ts'o
  2014-11-07 16:27         ` Martin K. Petersen
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2014-11-07 16:20 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-fsdevel, neilb

On Fri, Nov 07, 2014 at 10:42:24AM -0500, Martin K. Petersen wrote:
> >>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:
> 
> Christoph> I'm not a fan of adding another function here and would
> Christoph> prefer a flag, but it looks correct, 
> 
> That was my original approach too but I didn't want to stomp over all
> the existing callers. Although there only are few.
> 
> Ted: Which would you prefer?

There are *very* few users of blkdev_issue_zeroout(), and aside for a
single drbd, they are all in the block layer.  It would only start
affecting ext4 when we plumb that flag through to sb_issue_zeroout
(which your patch doesn't currently do), at which point it will affect
4 call sites in ext4, and a call site in gfs2 and hpfs2.

So I'd be in favor of adding a flag to to blkdev_issue_zeroout(), and
I would have a slight preference for also modifying sb_issue_zeroout
so the flag gets plumbed all the way through to the fs-level callers.

Cheers,

							- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function
  2014-11-07 16:20       ` Theodore Ts'o
@ 2014-11-07 16:27         ` Martin K. Petersen
  2014-11-14 20:22           ` Martin K. Petersen
  0 siblings, 1 reply; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-07 16:27 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Martin K. Petersen, Christoph Hellwig, linux-scsi, linux-ide,
	linux-fsdevel, neilb

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted> So I'd be in favor of adding a flag to to blkdev_issue_zeroout(),
Ted> and I would have a slight preference for also modifying
Ted> sb_issue_zeroout so the flag gets plumbed all the way through to
Ted> the fs-level callers.

OK. I'll do that, then. I always liked the flag better. I was just
trying to minimize the impact.

What would you prefer as the default for the ext4 use case? To allocate
or to discard?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] Discard update for 3.19
  2014-11-07  5:08 [RFC] Discard update for 3.19 Martin K. Petersen
                   ` (4 preceding siblings ...)
  2014-11-07 12:11 ` Bernd Schubert
@ 2014-11-10 14:19 ` Christoph Hellwig
  2014-11-12 18:40   ` Ewan Milne
  5 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2014-11-10 14:19 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide, linux-fsdevel, neilb

Looks like there is no real dependency between these patches, so we
might take on each through the libata, scsi and block trees.

Can I get another review for the sd patch, please?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] sd: Disable discard_zeroes_data for UNMAP
  2014-11-07  5:08 ` [PATCH 2/3] sd: Disable discard_zeroes_data for UNMAP Martin K. Petersen
  2014-11-07  8:25   ` Christoph Hellwig
@ 2014-11-10 23:43   ` Paolo Bonzini
  1 sibling, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2014-11-10 23:43 UTC (permalink / raw)
  To: Martin K. Petersen, linux-scsi, linux-ide, linux-fsdevel, neilb

On 07/11/2014 06:08, Martin K. Petersen wrote:
> The T10 SBC UNMAP command does not provide any hard guarantees that
> blocks will return zeroes on a subsequent READ. This is due to the fact
> that the device server is free to silently ignore all or parts of the
> request.
> 
> The only way to ensure that a block consistently returns zeroes after
> being unmapped is to use WRITE SAME with the UNMAP bit set. Should the
> device be unable to unmap one or more blocks described by the command it
> is required to manually write zeroes to them.
> 
> Until now we have preferred UNMAP over the WRITE SAME variants to
> accommodate thinly provisioned devices that predated the final SBC-3
> spec. This patch changes the heuristic so that we favor WRITE SAME(16)
> or (10) over UNMAP if these commands are marked as supported in the
> Logical Block Provisioning VPD page.
> 
> The patch also disables discard_zeroes_data for devices operating in
> UNMAP mode.
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> ---
>  drivers/scsi/sd.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index b041eca8955d..95bfb7bfbb9d 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -656,7 +656,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
>  	unsigned int logical_block_size = sdkp->device->sector_size;
>  	unsigned int max_blocks = 0;
>  
> -	q->limits.discard_zeroes_data = sdkp->lbprz;
> +	q->limits.discard_zeroes_data = 0;
>  	q->limits.discard_alignment = sdkp->unmap_alignment *
>  		logical_block_size;
>  	q->limits.discard_granularity =
> @@ -680,11 +680,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
>  	case SD_LBP_WS16:
>  		max_blocks = min_not_zero(sdkp->max_ws_blocks,
>  					  (u32)SD_MAX_WS16_BLOCKS);
> +		q->limits.discard_zeroes_data = sdkp->lbprz;
>  		break;
>  
>  	case SD_LBP_WS10:
>  		max_blocks = min_not_zero(sdkp->max_ws_blocks,
>  					  (u32)SD_MAX_WS10_BLOCKS);
> +		q->limits.discard_zeroes_data = sdkp->lbprz;
>  		break;
>  
>  	case SD_LBP_ZERO:
> @@ -2622,12 +2624,12 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
>  
>  		} else {	/* LBP VPD page tells us what to use */
>  
> -			if (sdkp->lbpu && sdkp->max_unmap_blocks)
> -				sd_config_discard(sdkp, SD_LBP_UNMAP);
> -			else if (sdkp->lbpws)
> +			if (sdkp->lbpws)
>  				sd_config_discard(sdkp, SD_LBP_WS16);
>  			else if (sdkp->lbpws10)
>  				sd_config_discard(sdkp, SD_LBP_WS10);
> +			else if (sdkp->lbpu && sdkp->max_unmap_blocks)
> +				sd_config_discard(sdkp, SD_LBP_UNMAP);
>  			else
>  				sd_config_discard(sdkp, SD_LBP_DISABLE);
>  		}
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function
  2014-11-07  5:08 ` [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function Martin K. Petersen
  2014-11-07  8:26   ` Christoph Hellwig
@ 2014-11-11  0:04   ` Darrick J. Wong
  2014-11-11  2:33     ` Martin K. Petersen
  2014-11-17 19:28     ` [PATCH] block: create ioctl to discard-or-zeroout a range of blocks Darrick J. Wong
  1 sibling, 2 replies; 29+ messages in thread
From: Darrick J. Wong @ 2014-11-11  0:04 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide, linux-fsdevel, neilb, linux-api

On Fri, Nov 07, 2014 at 12:08:14AM -0500, Martin K. Petersen wrote:
> blkdev_issue_discard() will zero a given block range on disk. This is
> done by way of either WRITE SAME or regular WRITE. I.e. the blocks on
> disk will be written and thus provisioned.
> 
> There are use cases where the desired behavior is to zero the blocks but
> unprovision them if possible. The blocks must deterministically contain
> zeroes when they are subsequently read back.
> 
> This patch introduces a blkdev_issue_zeroout_discard() call that
> provides this functionality. If a block device guarantees
> discard_zeroes_data the new function will use discard to clear the block
> range. If the device does not support discard_zeroes_data or if the
> discard request fails we will fall back to blkdev_issue_zeroout() to
> ensure predictable results.

Can this be plumbed into a BLK* ioctl too?  I'll write a patch, if this is ok
with everyone:

struct blkzeroout_t {
	__u64 start;
	__u64 end;
	__u32 flags;
};
#define BLKZEROOUT_DISCARD_OK	1

#define BLKZEROOUT_V2		_IOR(0x12, 127, sizeof(struct blkzeroout_t))

...and make it zap the page cache per earlier discussion.  This seems to be a
good fit with what we've been discussing for mke2fs.

--D

> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> ---
>  block/blk-lib.c        | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  include/linux/blkdev.h |  2 ++
>  2 files changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 8411be3c19d3..2ffec6a01c71 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -278,14 +278,18 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>  }
>  
>  /**
> - * blkdev_issue_zeroout - zero-fill a block range
> + * blkdev_issue_zeroout - zero-fill and provision a block range
>   * @bdev:	blockdev to write
>   * @sector:	start sector
>   * @nr_sects:	number of sectors to write
>   * @gfp_mask:	memory allocation flags (for bio_alloc)
>   *
>   * Description:
> - *  Generate and issue number of bios with zerofiled pages.
> + *  Zero-fill a block range. The blocks will be provisioned
> + *  (allocated/anchored) and are guaranteed to return zeroes when read
> + *  back. This function will attempt to use WRITE SAME to optimize the
> + *  process if the block device supports it. Otherwise it will fall back
> + *  to zeroing the blocks using regular WRITE calls.
>   */
>  
>  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> @@ -305,3 +309,39 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>  	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
>  }
>  EXPORT_SYMBOL(blkdev_issue_zeroout);
> +
> +/**
> + * blkdev_issue_zeroout_discard - zero-fill and attempt to discard block range
> + * @bdev:	blockdev to write
> + * @sector:	start sector
> + * @nr_sects:	number of sectors to write
> + * @gfp_mask:	memory allocation flags (for bio_alloc)
> + *
> + * Description:
> + *  Zero-fill a block range. In contrast to blkdev_issue_zeroout() this
> + *  function will attempt to deprovision (deallocate/discard) the blocks
> + *  in question. It will only do so if the underlying device guarantees
> + *  that subsequent READ operations to the block range in question will
> + *  return zeroes. If the device does not provide hard guarantees or if
> + *  the DISCARD attempt should fail the block range will be explicitly
> + *  zeroed using blkdev_issue_zeroout().
> + */
> +
> +int blkdev_issue_zeroout_discard(struct block_device *bdev, sector_t sector,
> +				 sector_t nr_sects, gfp_t gfp_mask)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (blk_queue_discard(q) && q->limits.discard_zeroes_data) {
> +		unsigned char bdn[BDEVNAME_SIZE];
> +
> +		if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0))
> +			return 0;
> +
> +		bdevname(bdev, bdn);
> +		pr_err("%s: DISCARD failed. Manually zeroing.\n", bdn);
> +	}
> +
> +	return blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
> +}
> +EXPORT_SYMBOL(blkdev_issue_zeroout_discard);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index aac0f9ea952a..078b6e5f488a 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1164,6 +1164,8 @@ extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
>  extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>  			sector_t nr_sects, gfp_t gfp_mask);
> +extern int blkdev_issue_zeroout_discard(struct block_device *bdev,
> +			sector_t sector, sector_t nr_sects, gfp_t gfp_mask);
>  static inline int sb_issue_discard(struct super_block *sb, sector_t block,
>  		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
>  {
> -- 
> 1.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function
  2014-11-11  0:04   ` Darrick J. Wong
@ 2014-11-11  2:33     ` Martin K. Petersen
  2014-11-17 19:28     ` [PATCH] block: create ioctl to discard-or-zeroout a range of blocks Darrick J. Wong
  1 sibling, 0 replies; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-11  2:33 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Martin K. Petersen, linux-scsi, linux-ide, linux-fsdevel, neilb,
	linux-api

>>>>> "Darrick" == Darrick J Wong <darrick.wong@oracle.com> writes:

Darrick> Can this be plumbed into a BLK* ioctl too?  I'll write a patch,
Darrick> if this is ok with everyone:

Darrick> ...and make it zap the page cache per earlier discussion.  This
Darrick> seems to be a good fit with what we've been discussing for
Darrick> mke2fs.

That sounds good to me. I'll get the updated patch out tomorrow.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] Discard update for 3.19
  2014-11-10 14:19 ` Christoph Hellwig
@ 2014-11-12 18:40   ` Ewan Milne
  2014-11-12 19:41     ` Martin K. Petersen
  0 siblings, 1 reply; 29+ messages in thread
From: Ewan Milne @ 2014-11-12 18:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, linux-scsi, linux-ide, linux-fsdevel, neilb

On Mon, 2014-11-10 at 06:19 -0800, Christoph Hellwig wrote:
> Looks like there is no real dependency between these patches, so we
> might take on each through the libata, scsi and block trees.
> 
> Can I get another review for the sd patch, please?
> 

The changes in [PATCH 2/3] sd: Disable discard_zeroes_data for UNMAP
look fine to me.  I was wondering, though, if the changes to add the
"devices_handle_discard_safely" module parameter to the MD raid drivers
are really needed if this is fixed.  (It's great to be able to disable
the use of discard if desired, but how is an administrator actually
supposed to know if the devices they have *really* work properly?)
Maybe the default value of this parameter should be changed, or the
parameter should be changed to have an inverse sense, i.e. "disable
use of discard"...

-Ewan



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] Discard update for 3.19
  2014-11-12 18:40   ` Ewan Milne
@ 2014-11-12 19:41     ` Martin K. Petersen
  0 siblings, 0 replies; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-12 19:41 UTC (permalink / raw)
  To: emilne
  Cc: Christoph Hellwig, Martin K. Petersen, linux-scsi, linux-ide,
	linux-fsdevel, neilb

>>>>> "Ewan" == Ewan Milne <emilne@redhat.com> writes:

Ewan> The changes in [PATCH 2/3] sd: Disable discard_zeroes_data for
Ewan> UNMAP look fine to me.  I was wondering, though, if the changes to
Ewan> add the "devices_handle_discard_safely" module parameter to the MD
Ewan> raid drivers are really needed if this is fixed.

I was expecting Neil to either revert or toggle the default once this
change was in place.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function
  2014-11-07 16:27         ` Martin K. Petersen
@ 2014-11-14 20:22           ` Martin K. Petersen
  2014-11-17 18:53             ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: Martin K. Petersen @ 2014-11-14 20:22 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Theodore Ts'o, Christoph Hellwig, linux-scsi, linux-ide,
	linux-fsdevel, neilb

>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:

Martin> What would you prefer as the default for the ext4 use case? To
Martin> allocate or to discard?

I didn't get a preference for whether sb_issue_zeroout() should discard
or allocate.

But here's an updated patch 3...

commit eb23c9e71e08b7f467cbc36990a1a01a94a7b959
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date:   Thu Nov 6 14:36:05 2014 -0500

    block: Add discard flag to blkdev_issue_zeroout() function
    
    blkdev_issue_discard() will zero a given block range. This is done by
    way of explicit writing, thus provisioning or allocating the blocks on
    disk.
    
    There are use cases where the desired behavior is to zero the blocks but
    unprovision them if possible. The blocks must deterministically contain
    zeroes when they are subsequently read back.
    
    This patch adds a flag to blkdev_issue_zeroout() that provides this
    variant. If the discard flag is set and a block device guarantees
    discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
    the device does not support discard_zeroes_data or if the discard
    request fails we will fall back to first REQ_WRITE_SAME and then a
    regular REQ_WRITE.
    
    Also update the callers of blkdev_issue_zero() to reflect the new flag
    and make sb_issue_zeroout() prefer the discard approach.
    
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 8411be3c19d3..715e948f58a4 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -283,23 +283,45 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
  * @sector:	start sector
  * @nr_sects:	number of sectors to write
  * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @discard:	whether to discard the block range
  *
  * Description:
- *  Generate and issue number of bios with zerofiled pages.
+
+ *  Zero-fill a block range.  If the discard flag is set and the block
+ *  device guarantees that subsequent READ operations to the block range
+ *  in question will return zeroes, the blocks will be discarded. Should
+ *  the discard request fail, if the discard flag is not set, or if
+ *  discard_zeroes_data is not supported, this function will resort to
+ *  zeroing the blocks manually, thus provisioning (allocating,
+ *  anchoring) them. If the block device supports the WRITE SAME command
+ *  blkdev_issue_zeroout() will use it to optimize the process of
+ *  clearing the block range. Otherwise the zeroing will be performed
+ *  using regular WRITE calls.
  */
 
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
-			 sector_t nr_sects, gfp_t gfp_mask)
+			 sector_t nr_sects, gfp_t gfp_mask, bool discard)
 {
+	struct request_queue *q = bdev_get_queue(bdev);
+	unsigned char bdn[BDEVNAME_SIZE];
+
+	if (discard && blk_queue_discard(q) && q->limits.discard_zeroes_data) {
+
+		if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0))
+			return 0;
+
+		bdevname(bdev, bdn);
+		pr_warn("%s: DISCARD failed. Manually zeroing.\n", bdn);
+	}
+
 	if (bdev_write_same(bdev)) {
-		unsigned char bdn[BDEVNAME_SIZE];
 
 		if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
 					     ZERO_PAGE(0)))
 			return 0;
 
 		bdevname(bdev, bdn);
-		pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
+		pr_warn("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
 	}
 
 	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
diff --git a/block/ioctl.c b/block/ioctl.c
index 6c7bf903742f..7d8befde2aca 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -198,7 +198,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start,
 	if (start + len > (i_size_read(bdev->bd_inode) >> 9))
 		return -EINVAL;
 
-	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL);
+	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 6960fb064731..ee5b9611c51c 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1388,7 +1388,7 @@ int drbd_submit_peer_request(struct drbd_device *device,
 		list_add_tail(&peer_req->w.list, &device->active_ee);
 		spin_unlock_irq(&device->resource->req_lock);
 		if (blkdev_issue_zeroout(device->ldev->backing_bdev,
-			sector, data_size >> 9, GFP_NOIO))
+			sector, data_size >> 9, GFP_NOIO, false))
 			peer_req->flags |= EE_WAS_ERROR;
 		drbd_endio_write_sec_final(peer_req);
 		return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aac0f9ea952a..f3ad69d8de45 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1163,7 +1163,7 @@ extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
-			sector_t nr_sects, gfp_t gfp_mask);
+		sector_t nr_sects, gfp_t gfp_mask, bool discard);
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
 		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
 {
@@ -1177,7 +1177,7 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
 	return blkdev_issue_zeroout(sb->s_bdev,
 				    block << (sb->s_blocksize_bits - 9),
 				    nr_blocks << (sb->s_blocksize_bits - 9),
-				    gfp_mask);
+				    gfp_mask, true);
 }
 
 extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function
  2014-11-14 20:22           ` Martin K. Petersen
@ 2014-11-17 18:53             ` Darrick J. Wong
  0 siblings, 0 replies; 29+ messages in thread
From: Darrick J. Wong @ 2014-11-17 18:53 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Theodore Ts'o, Christoph Hellwig, linux-scsi, linux-ide,
	linux-fsdevel, neilb

On Fri, Nov 14, 2014 at 03:22:05PM -0500, Martin K. Petersen wrote:
> >>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:
> 
> Martin> What would you prefer as the default for the ext4 use case? To
> Martin> allocate or to discard?
> 
> I didn't get a preference for whether sb_issue_zeroout() should discard
> or allocate.

In the discussions I've had on the ext4 list, we seem to be leaning towards
discard and falling back to allocate if necessary.

--D

> 
> But here's an updated patch 3...
> 
> commit eb23c9e71e08b7f467cbc36990a1a01a94a7b959
> Author: Martin K. Petersen <martin.petersen@oracle.com>
> Date:   Thu Nov 6 14:36:05 2014 -0500
> 
>     block: Add discard flag to blkdev_issue_zeroout() function
>     
>     blkdev_issue_discard() will zero a given block range. This is done by
>     way of explicit writing, thus provisioning or allocating the blocks on
>     disk.
>     
>     There are use cases where the desired behavior is to zero the blocks but
>     unprovision them if possible. The blocks must deterministically contain
>     zeroes when they are subsequently read back.
>     
>     This patch adds a flag to blkdev_issue_zeroout() that provides this
>     variant. If the discard flag is set and a block device guarantees
>     discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
>     the device does not support discard_zeroes_data or if the discard
>     request fails we will fall back to first REQ_WRITE_SAME and then a
>     regular REQ_WRITE.
>     
>     Also update the callers of blkdev_issue_zero() to reflect the new flag
>     and make sb_issue_zeroout() prefer the discard approach.
>     
>     Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 8411be3c19d3..715e948f58a4 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -283,23 +283,45 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>   * @sector:	start sector
>   * @nr_sects:	number of sectors to write
>   * @gfp_mask:	memory allocation flags (for bio_alloc)
> + * @discard:	whether to discard the block range
>   *
>   * Description:
> - *  Generate and issue number of bios with zerofiled pages.
> +
> + *  Zero-fill a block range.  If the discard flag is set and the block
> + *  device guarantees that subsequent READ operations to the block range
> + *  in question will return zeroes, the blocks will be discarded. Should
> + *  the discard request fail, if the discard flag is not set, or if
> + *  discard_zeroes_data is not supported, this function will resort to
> + *  zeroing the blocks manually, thus provisioning (allocating,
> + *  anchoring) them. If the block device supports the WRITE SAME command
> + *  blkdev_issue_zeroout() will use it to optimize the process of
> + *  clearing the block range. Otherwise the zeroing will be performed
> + *  using regular WRITE calls.
>   */
>  
>  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> -			 sector_t nr_sects, gfp_t gfp_mask)
> +			 sector_t nr_sects, gfp_t gfp_mask, bool discard)
>  {
> +	struct request_queue *q = bdev_get_queue(bdev);
> +	unsigned char bdn[BDEVNAME_SIZE];
> +
> +	if (discard && blk_queue_discard(q) && q->limits.discard_zeroes_data) {
> +
> +		if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0))
> +			return 0;
> +
> +		bdevname(bdev, bdn);
> +		pr_warn("%s: DISCARD failed. Manually zeroing.\n", bdn);
> +	}
> +
>  	if (bdev_write_same(bdev)) {
> -		unsigned char bdn[BDEVNAME_SIZE];
>  
>  		if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
>  					     ZERO_PAGE(0)))
>  			return 0;
>  
>  		bdevname(bdev, bdn);
> -		pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
> +		pr_warn("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
>  	}
>  
>  	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 6c7bf903742f..7d8befde2aca 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -198,7 +198,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start,
>  	if (start + len > (i_size_read(bdev->bd_inode) >> 9))
>  		return -EINVAL;
>  
> -	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL);
> +	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
>  }
>  
>  static int put_ushort(unsigned long arg, unsigned short val)
> diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
> index 6960fb064731..ee5b9611c51c 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -1388,7 +1388,7 @@ int drbd_submit_peer_request(struct drbd_device *device,
>  		list_add_tail(&peer_req->w.list, &device->active_ee);
>  		spin_unlock_irq(&device->resource->req_lock);
>  		if (blkdev_issue_zeroout(device->ldev->backing_bdev,
> -			sector, data_size >> 9, GFP_NOIO))
> +			sector, data_size >> 9, GFP_NOIO, false))
>  			peer_req->flags |= EE_WAS_ERROR;
>  		drbd_endio_write_sec_final(peer_req);
>  		return 0;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index aac0f9ea952a..f3ad69d8de45 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1163,7 +1163,7 @@ extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
>  extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> -			sector_t nr_sects, gfp_t gfp_mask);
> +		sector_t nr_sects, gfp_t gfp_mask, bool discard);
>  static inline int sb_issue_discard(struct super_block *sb, sector_t block,
>  		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
>  {
> @@ -1177,7 +1177,7 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
>  	return blkdev_issue_zeroout(sb->s_bdev,
>  				    block << (sb->s_blocksize_bits - 9),
>  				    nr_blocks << (sb->s_blocksize_bits - 9),
> -				    gfp_mask);
> +				    gfp_mask, true);
>  }
>  
>  extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
  2014-11-11  0:04   ` Darrick J. Wong
  2014-11-11  2:33     ` Martin K. Petersen
@ 2014-11-17 19:28     ` Darrick J. Wong
  1 sibling, 0 replies; 29+ messages in thread
From: Darrick J. Wong @ 2014-11-17 19:28 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide, linux-fsdevel, neilb, linux-api

Create a new ioctl to expose the block layer's newfound ability to
issue either a zeroing discard, a WRITE SAME with a zero page, or a
regular write with the zero page.  This BLKZEROOUT2 ioctl takes
{start, length, flags} as parameters.  So far, the only flag available
is to enable the zeroing discard part -- without it, the call invokes
the old BLKZEROOUT behavior.  start and length have the same meaning
as in BLKZEROOUT.

Furthermore, because BLKZEROOUT2 issues commands directly to the
storage device, we must invalidate the page cache (as a regular
O_DIRECT write would do) to avoid returning stale cache contents at a
later time.

This patch depends on mkp's earlier patch "block: Introduce
blkdev_issue_zeroout_discard() function".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 block/ioctl.c           |   45 ++++++++++++++++++++++++++++++++++++++-------
 include/uapi/linux/fs.h |    7 +++++++
 2 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 7d8befd..ff623d5 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -186,19 +186,39 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 }
 
 static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start,
-			     uint64_t len)
+			     uint64_t len, uint32_t flags)
 {
+	int ret;
+	struct address_space *mapping;
+	uint64_t end = start + len - 1;
+
+	if (flags & ~BLKZEROOUT2_DISCARD_OK)
+		return -EINVAL;
 	if (start & 511)
 		return -EINVAL;
 	if (len & 511)
 		return -EINVAL;
-	start >>= 9;
-	len >>= 9;
-
-	if (start + len > (i_size_read(bdev->bd_inode) >> 9))
+	if (end >= i_size_read(bdev->bd_inode))
 		return -EINVAL;
 
-	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
+	/* Invalidate the page cache, including dirty pages */
+	mapping = bdev->bd_inode->i_mapping;
+	truncate_inode_pages_range(mapping, start, end);
+
+	ret = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+				   flags & BLKZEROOUT2_DISCARD_OK);
+	if (ret)
+		goto out;
+
+	/*
+	 * Invalidate again; if someone wandered in and dirtied a page,
+	 * the caller will be given -EBUSY.
+	 */
+	ret = invalidate_inode_pages2_range(mapping,
+					    start >> PAGE_CACHE_SHIFT,
+					    end >> PAGE_CACHE_SHIFT);
+out:
+	return ret;
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
@@ -326,7 +346,18 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 		if (copy_from_user(range, (void __user *)arg, sizeof(range)))
 			return -EFAULT;
 
-		return blk_ioctl_zeroout(bdev, range[0], range[1]);
+		return blk_ioctl_zeroout(bdev, range[0], range[1], 0);
+	}
+	case BLKZEROOUT2: {
+		struct blkzeroout2 p;
+
+		if (!(mode & FMODE_WRITE))
+			return -EBADF;
+
+		if (copy_from_user(&p, (void __user *)arg, sizeof(p)))
+			return -EFAULT;
+
+		return blk_ioctl_zeroout(bdev, p.start, p.length, p.flags);
 	}
 
 	case HDIO_GETGEO: {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..54d24ea 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -150,6 +150,13 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+struct blkzeroout2 {
+	__u64 start;
+	__u64 length;
+	__u32 flags;
+};
+#define BLKZEROOUT2_DISCARD_OK	1
+#define BLKZEROOUT2 _IOR(0x12, 127, struct blkzeroout2)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-11-07  5:08 ` [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
  2014-11-07  8:24   ` Christoph Hellwig
@ 2014-12-05 16:45   ` Paolo Bonzini
  2014-12-05 22:58   ` Elliott, Robert (Server Storage)
  2 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2014-12-05 16:45 UTC (permalink / raw)
  To: Martin K. Petersen, linux-scsi, linux-ide, linux-fsdevel, neilb



On 07/11/2014 06:08, Martin K. Petersen wrote:
> The whitelist is only meant as a starting point and is by no means
> comprehensive:
> 
>    - All intel SSD models except for 510
>    - Micron M5*
>    - Samsung SSDs
>    - Seagate SSDs
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> ---
>  drivers/ata/libata-core.c | 18 ++++++++++++++----
>  drivers/ata/libata-scsi.c | 10 ++++++----
>  include/linux/libata.h    |  1 +
>  3 files changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index c5ba15af87d3..f41f24a8bc21 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -4225,10 +4225,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
>  	{ "PIONEER DVD-RW  DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
>  
>  	/* devices that don't properly handle queued TRIM commands */
> -	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> +	{ "Micron_M5?0*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
> +						ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "Crucial_CT???M5?0SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },

I have a Crucial_CT256MX1 (i.e. MX100) and it does reliably zero.

BTW. it's the same hardware as the M550, so probably the same set of
quirks should apply to both.

Paolo

> +
> +	/*
> +	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
> +	 * SSDs that provide reliable zero after TRIM.
> +	 */
> +	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
> +	{ "INTEL*SSD*", 		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
>  
>  	/*
>  	 * Some WD SATA-I drives spin up and down erratically when the link
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 0586f66d70fa..deaa6e34ed4d 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -2515,13 +2515,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
>  		rbuf[15] = lowest_aligned;
>  
>  		if (ata_id_has_trim(args->id)) {
> -			rbuf[14] |= 0x80; /* TPE */
> +			rbuf[14] |= 0x80; /* LBPME */
>  
> -			if (ata_id_has_zero_after_trim(args->id))
> -				rbuf[14] |= 0x40; /* TPRZ */
> +			if (ata_id_has_zero_after_trim(args->id) &&
> +			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
> +				ata_dev_warn(dev, "Enabling discard_zeroes_data\n");
> +				rbuf[14] |= 0x40; /* LBPRZ */
> +			}
>  		}
>  	}
> -
>  	return 0;
>  }
>  
> diff --git a/include/linux/libata.h b/include/linux/libata.h
> index bd5fefeaf548..45ac825b8366 100644
> --- a/include/linux/libata.h
> +++ b/include/linux/libata.h
> @@ -421,6 +421,7 @@ enum {
>  	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
>  	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
>  	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
> +	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
>  
>  	 /* DMA mask for user DMA control: User visible values; DO NOT
>  	    renumber */
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-11-07  5:08 ` [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
  2014-11-07  8:24   ` Christoph Hellwig
  2014-12-05 16:45   ` Paolo Bonzini
@ 2014-12-05 22:58   ` Elliott, Robert (Server Storage)
  2014-12-08 15:15     ` Theodore Ts'o
  2 siblings, 1 reply; 29+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-12-05 22:58 UTC (permalink / raw)
  To: Martin K. Petersen, Paolo Bonzini, linux-scsi, linux-ide,
	linux-fsdevel, neilb



> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Martin K. Petersen
> Sent: Thursday, 06 November, 2014 11:08 PM
> To: linux-scsi@vger.kernel.org; linux-ide@vger.kernel.org; linux-
> fsdevel@vger.kernel.org; neilb@suse.de
> Cc: Martin K. Petersen
> Subject: [PATCH 1/3] libata: Whitelist SSDs that are known to properly return
> zeroes after TRIM
> 
> As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
> Zero After Trim) flags in the ATA Command Set are unreliable in the
> sense that they only define what happens if the device successfully
> executed the DSM TRIM command. TRIM is only advisory, however, and the
> device is free to silently ignore all or parts of the request.
> 
> In practice this renders the DRAT and RZAT flags completely useless and
> because the results are unpredictable we decided to disable discard in
> MD for 3.18 to avoid the risk of data corruption.
> 
> Hardware vendors in the real world obviously need better guarantees than
> what the standards bodies provide. Unfortuntely those guarantees are
> encoded in product requirements documents rather than somewhere we can
> key off of them programatically. So we are compelled to disabling
> discard_zeroes_data for all devices unless we explicitly have data to
> support whitelisting them.
> 
> This patch whitelists SSDs from a few of the main vendors. None of the
> whitelists are based on written guarantees. They are purely based on
> empirical evidence collected from internal and external users that have
> tested or qualified these drives in RAID deployments.
> 
> The whitelist is only meant as a starting point and is by no means
> comprehensive:
> 
>    - All intel SSD models except for 510
>    - Micron M5*
>    - Samsung SSDs
>    - Seagate SSDs
> 

That description and Paolo's reply:
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Paolo Bonzini
> Sent: Friday, 05 December, 2014 10:45 AM
...
> I have a Crucial_CT256MX1 (i.e. MX100) and it does reliably zero.

make me concerned about this whitelist approach.

I think you need a manufacturer assertion that this is indeed
the design intent; you cannot be certain based on observation
from outside.

Since the SCSI and ATA standards allow ignoring the hint, it 
might be honored most of the time, but ignored in some rare
cases (e.g., drive firmware has a malloc() failure that only
happens when the drive is handling an overtemperature
condition and six other problems at the same time).

Maybe there should be two levels:
* vendor asserts the drive is designed to always honor the hint
* community observes the drive always seems to honor the hint

and a sysfs flag for users to select the level at which 
they feel safe.

A user running 3 replicas of the data in different sites 
might be more trusting than a user for which this is the 
only copy of the data.

---
Rob Elliott    HP Server Storage





^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-05 22:58   ` Elliott, Robert (Server Storage)
@ 2014-12-08 15:15     ` Theodore Ts'o
  2014-12-08 15:28       ` James Bottomley
  2014-12-08 22:59       ` One Thousand Gnomes
  0 siblings, 2 replies; 29+ messages in thread
From: Theodore Ts'o @ 2014-12-08 15:15 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Martin K. Petersen, Paolo Bonzini, linux-scsi, linux-ide,
	linux-fsdevel, neilb

On Fri, Dec 05, 2014 at 10:58:09PM +0000, Elliott, Robert (Server Storage) wrote:
> > I have a Crucial_CT256MX1 (i.e. MX100) and it does reliably zero.
> 
> make me concerned about this whitelist approach.
> 
> I think you need a manufacturer assertion that this is indeed
> the design intent; you cannot be certain based on observation
> from outside.

How is this different from a manufacturer assertion that they follow a
SCSI or ATA standard?  There have been cases in the distant past
(fortunately) of disk manufacturers that ignored a CACHE FLUSH command
just to get higher Winbench scores.  Does that mean we can't trust
them to do anything right?

What I'd suggest instead is that if a vendor states this on a spec
sheet --- more than just an e-mail assertion --- so they can be sued
if they knowingly misrepresent their product, that we take their word
at it.  Of course, there will be bugs, which is why we have
blacklists, or why we can remove them from the list if it turns out
there are edge conditions where it appears the disk doesn't quite do
the right thing.

After all, we generally take the manufacturer's word that air bags
will work as claimed, even if potentially 11 million of them are
currently subject to recall.  And do we think that "the community"
would necessarily be more suited than the vendors and the manufacturer
to figure out whether or not air bags or drives are working as
desired?

That being said, if someone wants to create a open source program
which stress tests SSD's to look for cases where it is dropping a
requested discard, that would certainly be a good thing to do...

Cheers,

						 - Ted
						 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-08 15:15     ` Theodore Ts'o
@ 2014-12-08 15:28       ` James Bottomley
  2014-12-08 22:59       ` One Thousand Gnomes
  1 sibling, 0 replies; 29+ messages in thread
From: James Bottomley @ 2014-12-08 15:28 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Elliott, Robert (Server Storage),
	Martin K. Petersen, Paolo Bonzini, linux-scsi, linux-ide,
	linux-fsdevel, neilb

On Mon, 2014-12-08 at 10:15 -0500, Theodore Ts'o wrote:
> On Fri, Dec 05, 2014 at 10:58:09PM +0000, Elliott, Robert (Server Storage) wrote:
> > > I have a Crucial_CT256MX1 (i.e. MX100) and it does reliably zero.
> > 
> > make me concerned about this whitelist approach.
> > 
> > I think you need a manufacturer assertion that this is indeed
> > the design intent; you cannot be certain based on observation
> > from outside.
> 
> How is this different from a manufacturer assertion that they follow a
> SCSI or ATA standard?  There have been cases in the distant past
> (fortunately) of disk manufacturers that ignored a CACHE FLUSH command
> just to get higher Winbench scores.  Does that mean we can't trust
> them to do anything right?

That answer depends on device type manufacturer.  USB devices, hell no.
ATA devices, maybe and SCSI devices usually.

The main problem is usually testing.  Consumer devices like USB and
(s)ATA rarely get tested on anything but windows.  USB devices tend to
supply their own driver, so they're on the "we fix it in the driver"
model which is why they bite us so badly. (S)ATA usually comply, but
they only test what windows exercises, so if windows doesn't do it,
chances are it never got tested.  SCSI devices still tend to be tested
in legacy UNIX environments, which are as diverse as we are.

> What I'd suggest instead is that if a vendor states this on a spec
> sheet --- more than just an e-mail assertion --- so they can be sued
> if they knowingly misrepresent their product, that we take their word
> at it.  Of course, there will be bugs, which is why we have
> blacklists, or why we can remove them from the list if it turns out
> there are edge conditions where it appears the disk doesn't quite do
> the right thing.
> 
> After all, we generally take the manufacturer's word that air bags
> will work as claimed, even if potentially 11 million of them are
> currently subject to recall.  And do we think that "the community"
> would necessarily be more suited than the vendors and the manufacturer
> to figure out whether or not air bags or drives are working as
> desired?
> 
> That being said, if someone wants to create a open source program
> which stress tests SSD's to look for cases where it is dropping a
> requested discard, that would certainly be a good thing to do...

The purpose of DRAT and RZAT is to enable disk arrays deterministically
to use TRIM/Unmap so arrays know what happens to stripes on discard.
Arrays are being built of mostly SATA technology these days, so some
manufacturers have retargetted to arrays and consumer technology (and
are testing the array cases).  However, windows doesn't use either
feature, so manufacturers not targetting arrays will never test this
feature.  Hence, in this case, I think a whitelist does make sense.

James



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2014-12-08 15:15     ` Theodore Ts'o
  2014-12-08 15:28       ` James Bottomley
@ 2014-12-08 22:59       ` One Thousand Gnomes
  1 sibling, 0 replies; 29+ messages in thread
From: One Thousand Gnomes @ 2014-12-08 22:59 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Elliott, Robert (Server Storage),
	Martin K. Petersen, Paolo Bonzini, linux-scsi, linux-ide,
	linux-fsdevel, neilb

On Mon, 8 Dec 2014 10:15:59 -0500
"Theodore Ts'o" <tytso@mit.edu> wrote:

> On Fri, Dec 05, 2014 at 10:58:09PM +0000, Elliott, Robert (Server Storage) wrote:
> > > I have a Crucial_CT256MX1 (i.e. MX100) and it does reliably zero.
> > 
> > make me concerned about this whitelist approach.
> > 
> > I think you need a manufacturer assertion that this is indeed
> > the design intent; you cannot be certain based on observation
> > from outside.
> 
> How is this different from a manufacturer assertion that they follow a
> SCSI or ATA standard?  There have been cases in the distant past
> (fortunately) of disk manufacturers that ignored a CACHE FLUSH command
> just to get higher Winbench scores.  Does that mean we can't trust
> them to do anything right?

At the time they never promised to honour cache flush. The reason it was
became mandatory in the specification was in part so that the vendors
could all force each other to play fair. If its "optional" then it's
tough..., if they say they meet the standard it's class action 8)

If this is a promise then it ought to be good 

James:  "USB devices tend to supply their own driver"

has not been true for some years now. Microsoft provide an in-box driver
and vendors have the choice of using that or certifying their own via
WHQL, which is a bit like choosing between free ice cream and banging
your head against a plank cover in nails.

Alan

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2014-12-08 22:59 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-07  5:08 [RFC] Discard update for 3.19 Martin K. Petersen
2014-11-07  5:08 ` [PATCH 1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
2014-11-07  8:24   ` Christoph Hellwig
2014-11-07 15:37     ` Martin K. Petersen
2014-12-05 16:45   ` Paolo Bonzini
2014-12-05 22:58   ` Elliott, Robert (Server Storage)
2014-12-08 15:15     ` Theodore Ts'o
2014-12-08 15:28       ` James Bottomley
2014-12-08 22:59       ` One Thousand Gnomes
2014-11-07  5:08 ` [PATCH 2/3] sd: Disable discard_zeroes_data for UNMAP Martin K. Petersen
2014-11-07  8:25   ` Christoph Hellwig
2014-11-10 23:43   ` Paolo Bonzini
2014-11-07  5:08 ` [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function Martin K. Petersen
2014-11-07  8:26   ` Christoph Hellwig
2014-11-07 15:42     ` Martin K. Petersen
2014-11-07 16:20       ` Theodore Ts'o
2014-11-07 16:27         ` Martin K. Petersen
2014-11-14 20:22           ` Martin K. Petersen
2014-11-17 18:53             ` Darrick J. Wong
2014-11-11  0:04   ` Darrick J. Wong
2014-11-11  2:33     ` Martin K. Petersen
2014-11-17 19:28     ` [PATCH] block: create ioctl to discard-or-zeroout a range of blocks Darrick J. Wong
2014-11-07 12:09 ` [RFC] Discard update for 3.19 Bernd Schubert
2014-11-07 12:11 ` Bernd Schubert
2014-11-07 15:46   ` Martin K. Petersen
2014-11-07 15:46     ` Martin K. Petersen
2014-11-10 14:19 ` Christoph Hellwig
2014-11-12 18:40   ` Ewan Milne
2014-11-12 19:41     ` Martin K. Petersen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.