All of lore.kernel.org
 help / color / mirror / Atom feed
* Thin provisioning fixes
@ 2009-11-21  2:45 Martin K. Petersen
  2009-11-21  2:45 ` [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed Martin K. Petersen
                   ` (9 more replies)
  0 siblings, 10 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  2:45 UTC (permalink / raw)
  To: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide

A few updates to the thin provisioning bits aimed at 2.6.33:

 - Block layer support for reporting whether reading previously
   discarded blocks returns zeroes or not.  Jens, this applies on
   top of the patch that you have already queued.

 - Updated SCSI disk thin provisioning support:

   - Reports TPRZ via the new block layer flag

   - Moved some of the discard detection to read_capacity_16 to
     accommodate TP devices that don't support Unmap and the
     block limits VPD

   James, this replaces my existing post-merge sd patch.

 - libata:

   - Translate DRAT and ZRAT to TPRZ (zeroed blocks after trim)

   - Report the maximum discard sectors in the block limits VPD

   - Fix a couple of problems in ata_set_lba_range_entries() that
     caused us to issue garbled trims


Interested parties can grab the tp branch from:

git://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux-2.6-mkp.git

--
Martin K. Petersen      Oracle Linux Engineering






^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
@ 2009-11-21  2:45 ` Martin K. Petersen
  2009-11-21 10:13   ` Christoph Hellwig
  2009-11-21 12:50   ` Ric Wheeler
  2009-11-21  2:45 ` Martin K. Petersen
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  2:45 UTC (permalink / raw)
  To: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide
  Cc: Martin K. Petersen

The discard ioctl is used by mkfs utilities to clear a block device
prior to putting metadata down.  However, not all devices return zeroed
blocks after a discard.  Some drives return stale data, potentially
containing old superblocks.  It is therefore important to know whether
discarded blocks are properly zeroed.

Both ATA and SCSI drives have configuration bits that indicate whether
zeroes are returned after a discard operation.  Implement a block level
interface that allows this information to be bubbled up the stack.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 block/blk-settings.c   |    2 ++
 block/blk-sysfs.c      |   11 +++++++++++
 include/linux/blkdev.h |    1 +
 3 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 7f986ca..1027e30 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -100,6 +100,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->discard_granularity = 0;
 	lim->discard_alignment = 0;
 	lim->discard_misaligned = 0;
+	lim->discard_zeroes_data = -1;
 	lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
 	lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT);
 	lim->alignment_offset = 0;
@@ -543,6 +544,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 
 	t->io_min = max(t->io_min, b->io_min);
 	t->no_cluster |= b->no_cluster;
+	t->discard_zeroes_data &= b->discard_zeroes_data;
 
 	/* Bottom device offset aligned? */
 	if (offset &&
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3147145..1f1d2b6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -136,6 +136,11 @@ static ssize_t queue_discard_max_show(struct request_queue *q, char *page)
 	return queue_var_show(q->limits.max_discard_sectors << 9, page);
 }
 
+static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->limits.discard_zeroes_data == 1 ? 1 : 0, page);
+}
+
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -313,6 +318,11 @@ static struct queue_sysfs_entry queue_discard_max_entry = {
 	.show = queue_discard_max_show,
 };
 
+static struct queue_sysfs_entry queue_discard_zeroes_data_entry = {
+	.attr = {.name = "discard_zeroes_data", .mode = S_IRUGO },
+	.show = queue_discard_zeroes_data_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_nonrot_show,
@@ -350,6 +360,7 @@ static struct attribute *default_attrs[] = {
 	&queue_io_opt_entry.attr,
 	&queue_discard_granularity_entry.attr,
 	&queue_discard_max_entry.attr,
+	&queue_discard_zeroes_data_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3b67221..e605945 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -322,6 +322,7 @@ struct queue_limits {
 	unsigned char		misaligned;
 	unsigned char		discard_misaligned;
 	unsigned char		no_cluster;
+	signed char		discard_zeroes_data;
 };
 
 struct request_queue
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
  2009-11-21  2:45 ` [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed Martin K. Petersen
@ 2009-11-21  2:45 ` Martin K. Petersen
  2009-11-21  2:45 ` [PATCH 2/4] sd: WRITE SAME(16) / UNMAP support Martin K. Petersen
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  2:45 UTC (permalink / raw)
  To: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide
  Cc: Martin K. Petersen

The discard ioctl is used by mkfs utilities to clear a block device
prior to putting metadata down.  However, not all devices return zeroed
blocks after a discard.  Some drives return stale data, potentially
containing old superblocks.  It is therefore important to know whether
discarded blocks are properly zeroed.

Both ATA and SCSI drives have configuration bits that indicate whether
zeroes are returned after a discard operation.  Implement a block level
interface that allows this information to be bubbled up the stack.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 block/blk-settings.c   |    2 ++
 block/blk-sysfs.c      |   11 +++++++++++
 include/linux/blkdev.h |    1 +
 3 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 7f986ca..1027e30 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -100,6 +100,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->discard_granularity = 0;
 	lim->discard_alignment = 0;
 	lim->discard_misaligned = 0;
+	lim->discard_zeroes_data = -1;
 	lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
 	lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT);
 	lim->alignment_offset = 0;
@@ -543,6 +544,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 
 	t->io_min = max(t->io_min, b->io_min);
 	t->no_cluster |= b->no_cluster;
+	t->discard_zeroes_data &= b->discard_zeroes_data;
 
 	/* Bottom device offset aligned? */
 	if (offset &&
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3147145..1f1d2b6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -136,6 +136,11 @@ static ssize_t queue_discard_max_show(struct request_queue *q, char *page)
 	return queue_var_show(q->limits.max_discard_sectors << 9, page);
 }
 
+static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->limits.discard_zeroes_data == 1 ? 1 : 0, page);
+}
+
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -313,6 +318,11 @@ static struct queue_sysfs_entry queue_discard_max_entry = {
 	.show = queue_discard_max_show,
 };
 
+static struct queue_sysfs_entry queue_discard_zeroes_data_entry = {
+	.attr = {.name = "discard_zeroes_data", .mode = S_IRUGO },
+	.show = queue_discard_zeroes_data_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_nonrot_show,
@@ -350,6 +360,7 @@ static struct attribute *default_attrs[] = {
 	&queue_io_opt_entry.attr,
 	&queue_discard_granularity_entry.attr,
 	&queue_discard_max_entry.attr,
+	&queue_discard_zeroes_data_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3b67221..e605945 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -322,6 +322,7 @@ struct queue_limits {
 	unsigned char		misaligned;
 	unsigned char		discard_misaligned;
 	unsigned char		no_cluster;
+	signed char		discard_zeroes_data;
 };
 
 struct request_queue
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/4] sd: WRITE SAME(16) / UNMAP support
  2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
                   ` (2 preceding siblings ...)
  2009-11-21  2:45 ` [PATCH 2/4] sd: WRITE SAME(16) / UNMAP support Martin K. Petersen
@ 2009-11-21  2:45 ` Martin K. Petersen
  2009-11-21  2:45 ` [PATCH 3/4] libata: Report zeroed read after Trim and max discard size Martin K. Petersen
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  2:45 UTC (permalink / raw)
  To: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide
  Cc: Martin K. Petersen

Implement a prepare discard function that sends either WRITE SAME(16) or
UNMAP(10) depending on parameters indicated by the device in the block
limits VPD.

Extract unmap constraints and report them to the block layer.

Based in part by a patch by Christoph Hellwig <hch@lst.de>.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/scsi/sd.c |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/scsi/sd.h |    2 +
 2 files changed, 107 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 9093c72..65538e1 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -264,6 +264,15 @@ sd_show_app_tag_own(struct device *dev, struct device_attribute *attr,
 	return snprintf(buf, 20, "%u\n", sdkp->ATO);
 }
 
+static ssize_t
+sd_show_thin_provisioning(struct device *dev, struct device_attribute *attr,
+			  char *buf)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+	return snprintf(buf, 20, "%u\n", sdkp->thin_provisioning);
+}
+
 static struct device_attribute sd_disk_attrs[] = {
 	__ATTR(cache_type, S_IRUGO|S_IWUSR, sd_show_cache_type,
 	       sd_store_cache_type),
@@ -274,6 +283,7 @@ static struct device_attribute sd_disk_attrs[] = {
 	       sd_store_manage_start_stop),
 	__ATTR(protection_type, S_IRUGO, sd_show_protection_type, NULL),
 	__ATTR(app_tag_own, S_IRUGO, sd_show_app_tag_own, NULL),
+	__ATTR(thin_provisioning, S_IRUGO, sd_show_thin_provisioning, NULL),
 	__ATTR_NULL,
 };
 
@@ -399,6 +409,57 @@ static void sd_prot_op(struct scsi_cmnd *scmd, unsigned int dif)
 }
 
 /**
+ * sd_prepare_discard - unmap blocks on thinly provisioned device
+ * @rq: Request to prepare
+ *
+ * Will issue either UNMAP or WRITE SAME(16) depending on preference
+ * indicated by target device.
+ **/
+static int sd_prepare_discard(struct request *rq)
+{
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	struct bio *bio = rq->bio;
+	sector_t sector = bio->bi_sector;
+	unsigned int num = bio_sectors(bio);
+
+	if (sdkp->device->sector_size == 4096) {
+		sector >>= 3;
+		num >>= 3;
+	}
+
+	rq->cmd_type = REQ_TYPE_BLOCK_PC;
+	rq->timeout = SD_TIMEOUT;
+
+	memset(rq->cmd, 0, rq->cmd_len);
+
+	if (sdkp->unmap) {
+		char *buf = kmap_atomic(bio_page(bio), KM_USER0);
+
+		rq->cmd[0] = UNMAP;
+		rq->cmd[8] = 24;
+		rq->cmd_len = 10;
+
+		/* Ensure that data length matches payload */
+		rq->__data_len = bio->bi_size = bio->bi_io_vec->bv_len = 24;
+
+		put_unaligned_be16(6 + 16, &buf[0]);
+		put_unaligned_be16(16, &buf[2]);
+		put_unaligned_be64(sector, &buf[8]);
+		put_unaligned_be32(num, &buf[16]);
+
+		kunmap_atomic(buf, KM_USER0);
+	} else {
+		rq->cmd[0] = WRITE_SAME_16;
+		rq->cmd[1] = 0x8; /* UNMAP */
+		put_unaligned_be64(sector, &rq->cmd[2]);
+		put_unaligned_be32(num, &rq->cmd[10]);
+		rq->cmd_len = 16;
+	}
+
+	return BLKPREP_OK;
+}
+
+/**
  *	sd_init_command - build a scsi (read or write) command from
  *	information in the request structure.
  *	@SCpnt: pointer to mid-level's per scsi command structure that
@@ -418,6 +479,13 @@ static int sd_prep_fn(struct request_queue *q, struct request *rq)
 	int ret, host_dif;
 	unsigned char protect;
 
+	/*
+	 * Discard request come in as REQ_TYPE_FS but we turn them into
+	 * block PC requests to make life easier.
+	 */
+	if (blk_discard_rq(rq))
+		ret = sd_prepare_discard(rq);
+
 	if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
 		ret = scsi_setup_blk_pc_cmnd(sdp, rq);
 		goto out;
@@ -1432,6 +1500,19 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp,
 		sd_printk(KERN_NOTICE, sdkp,
 			  "physical block alignment offset: %u\n", alignment);
 
+	if (buffer[14] & 0x80) { /* TPE */
+		struct request_queue *q = sdp->request_queue;
+
+		sdkp->thin_provisioning = 1;
+		q->limits.discard_granularity = sdkp->hw_sector_size;
+		q->limits.max_discard_sectors = 0xffffffff;
+
+		if (buffer[14] & 0x40) /* TPRZ */
+			q->limits.discard_zeroes_data = 1;
+
+		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+	}
+
 	sdkp->capacity = lba + 1;
 	return sector_size;
 }
@@ -1863,6 +1944,7 @@ void sd_read_app_tag_own(struct scsi_disk *sdkp, unsigned char *buffer)
  */
 static void sd_read_block_limits(struct scsi_disk *sdkp)
 {
+	struct request_queue *q = sdkp->disk->queue;
 	unsigned int sector_sz = sdkp->device->sector_size;
 	char *buffer;
 
@@ -1877,6 +1959,29 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 	blk_queue_io_opt(sdkp->disk->queue,
 			 get_unaligned_be32(&buffer[12]) * sector_sz);
 
+	/* Thin provisioning enabled and page length indicates TP support */
+	if (sdkp->thin_provisioning && buffer[3] == 0x3c) {
+		unsigned int lba_count, desc_count, granularity;
+
+		lba_count = get_unaligned_be32(&buffer[20]);
+		desc_count = get_unaligned_be32(&buffer[24]);
+
+		if (lba_count && desc_count) {
+			sdkp->unmap = 1;
+			q->limits.max_discard_sectors =
+				lba_count * sector_sz >> 9;
+		}
+
+		granularity = get_unaligned_be32(&buffer[28]);
+
+		if (granularity)
+			q->limits.discard_granularity = granularity * sector_sz;
+
+		if (buffer[32] & 0x80)
+			q->limits.discard_alignment =
+				get_unaligned_be32(&buffer[32]) & ~(1 << 31);
+	}
+
 	kfree(buffer);
 }
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index e374804..43d3caf 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -60,6 +60,8 @@ struct scsi_disk {
 	unsigned	RCD : 1;	/* state of disk RCD bit, unused */
 	unsigned	DPOFUA : 1;	/* state of disk DPOFUA bit */
 	unsigned	first_scan : 1;
+	unsigned	thin_provisioning : 1;
+	unsigned	unmap : 1;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)
 
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/4] sd: WRITE SAME(16) / UNMAP support
  2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
  2009-11-21  2:45 ` [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed Martin K. Petersen
  2009-11-21  2:45 ` Martin K. Petersen
@ 2009-11-21  2:45 ` Martin K. Petersen
  2009-11-21  2:45 ` Martin K. Petersen
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  2:45 UTC (permalink / raw)
  To: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide
  Cc: Martin K. Petersen

Implement a prepare discard function that sends either WRITE SAME(16) or
UNMAP(10) depending on parameters indicated by the device in the block
limits VPD.

Extract unmap constraints and report them to the block layer.

Based in part by a patch by Christoph Hellwig <hch@lst.de>.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/scsi/sd.c |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/scsi/sd.h |    2 +
 2 files changed, 107 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 9093c72..65538e1 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -264,6 +264,15 @@ sd_show_app_tag_own(struct device *dev, struct device_attribute *attr,
 	return snprintf(buf, 20, "%u\n", sdkp->ATO);
 }
 
+static ssize_t
+sd_show_thin_provisioning(struct device *dev, struct device_attribute *attr,
+			  char *buf)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+	return snprintf(buf, 20, "%u\n", sdkp->thin_provisioning);
+}
+
 static struct device_attribute sd_disk_attrs[] = {
 	__ATTR(cache_type, S_IRUGO|S_IWUSR, sd_show_cache_type,
 	       sd_store_cache_type),
@@ -274,6 +283,7 @@ static struct device_attribute sd_disk_attrs[] = {
 	       sd_store_manage_start_stop),
 	__ATTR(protection_type, S_IRUGO, sd_show_protection_type, NULL),
 	__ATTR(app_tag_own, S_IRUGO, sd_show_app_tag_own, NULL),
+	__ATTR(thin_provisioning, S_IRUGO, sd_show_thin_provisioning, NULL),
 	__ATTR_NULL,
 };
 
@@ -399,6 +409,57 @@ static void sd_prot_op(struct scsi_cmnd *scmd, unsigned int dif)
 }
 
 /**
+ * sd_prepare_discard - unmap blocks on thinly provisioned device
+ * @rq: Request to prepare
+ *
+ * Will issue either UNMAP or WRITE SAME(16) depending on preference
+ * indicated by target device.
+ **/
+static int sd_prepare_discard(struct request *rq)
+{
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	struct bio *bio = rq->bio;
+	sector_t sector = bio->bi_sector;
+	unsigned int num = bio_sectors(bio);
+
+	if (sdkp->device->sector_size == 4096) {
+		sector >>= 3;
+		num >>= 3;
+	}
+
+	rq->cmd_type = REQ_TYPE_BLOCK_PC;
+	rq->timeout = SD_TIMEOUT;
+
+	memset(rq->cmd, 0, rq->cmd_len);
+
+	if (sdkp->unmap) {
+		char *buf = kmap_atomic(bio_page(bio), KM_USER0);
+
+		rq->cmd[0] = UNMAP;
+		rq->cmd[8] = 24;
+		rq->cmd_len = 10;
+
+		/* Ensure that data length matches payload */
+		rq->__data_len = bio->bi_size = bio->bi_io_vec->bv_len = 24;
+
+		put_unaligned_be16(6 + 16, &buf[0]);
+		put_unaligned_be16(16, &buf[2]);
+		put_unaligned_be64(sector, &buf[8]);
+		put_unaligned_be32(num, &buf[16]);
+
+		kunmap_atomic(buf, KM_USER0);
+	} else {
+		rq->cmd[0] = WRITE_SAME_16;
+		rq->cmd[1] = 0x8; /* UNMAP */
+		put_unaligned_be64(sector, &rq->cmd[2]);
+		put_unaligned_be32(num, &rq->cmd[10]);
+		rq->cmd_len = 16;
+	}
+
+	return BLKPREP_OK;
+}
+
+/**
  *	sd_init_command - build a scsi (read or write) command from
  *	information in the request structure.
  *	@SCpnt: pointer to mid-level's per scsi command structure that
@@ -418,6 +479,13 @@ static int sd_prep_fn(struct request_queue *q, struct request *rq)
 	int ret, host_dif;
 	unsigned char protect;
 
+	/*
+	 * Discard request come in as REQ_TYPE_FS but we turn them into
+	 * block PC requests to make life easier.
+	 */
+	if (blk_discard_rq(rq))
+		ret = sd_prepare_discard(rq);
+
 	if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
 		ret = scsi_setup_blk_pc_cmnd(sdp, rq);
 		goto out;
@@ -1432,6 +1500,19 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp,
 		sd_printk(KERN_NOTICE, sdkp,
 			  "physical block alignment offset: %u\n", alignment);
 
+	if (buffer[14] & 0x80) { /* TPE */
+		struct request_queue *q = sdp->request_queue;
+
+		sdkp->thin_provisioning = 1;
+		q->limits.discard_granularity = sdkp->hw_sector_size;
+		q->limits.max_discard_sectors = 0xffffffff;
+
+		if (buffer[14] & 0x40) /* TPRZ */
+			q->limits.discard_zeroes_data = 1;
+
+		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+	}
+
 	sdkp->capacity = lba + 1;
 	return sector_size;
 }
@@ -1863,6 +1944,7 @@ void sd_read_app_tag_own(struct scsi_disk *sdkp, unsigned char *buffer)
  */
 static void sd_read_block_limits(struct scsi_disk *sdkp)
 {
+	struct request_queue *q = sdkp->disk->queue;
 	unsigned int sector_sz = sdkp->device->sector_size;
 	char *buffer;
 
@@ -1877,6 +1959,29 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 	blk_queue_io_opt(sdkp->disk->queue,
 			 get_unaligned_be32(&buffer[12]) * sector_sz);
 
+	/* Thin provisioning enabled and page length indicates TP support */
+	if (sdkp->thin_provisioning && buffer[3] == 0x3c) {
+		unsigned int lba_count, desc_count, granularity;
+
+		lba_count = get_unaligned_be32(&buffer[20]);
+		desc_count = get_unaligned_be32(&buffer[24]);
+
+		if (lba_count && desc_count) {
+			sdkp->unmap = 1;
+			q->limits.max_discard_sectors =
+				lba_count * sector_sz >> 9;
+		}
+
+		granularity = get_unaligned_be32(&buffer[28]);
+
+		if (granularity)
+			q->limits.discard_granularity = granularity * sector_sz;
+
+		if (buffer[32] & 0x80)
+			q->limits.discard_alignment =
+				get_unaligned_be32(&buffer[32]) & ~(1 << 31);
+	}
+
 	kfree(buffer);
 }
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index e374804..43d3caf 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -60,6 +60,8 @@ struct scsi_disk {
 	unsigned	RCD : 1;	/* state of disk RCD bit, unused */
 	unsigned	DPOFUA : 1;	/* state of disk DPOFUA bit */
 	unsigned	first_scan : 1;
+	unsigned	thin_provisioning : 1;
+	unsigned	unmap : 1;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)
 
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/4] libata: Report zeroed read after Trim and max discard size
  2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
                   ` (4 preceding siblings ...)
  2009-11-21  2:45 ` [PATCH 3/4] libata: Report zeroed read after Trim and max discard size Martin K. Petersen
@ 2009-11-21  2:45 ` Martin K. Petersen
  2009-11-21 10:49   ` Christoph Hellwig
  2009-11-21  2:45 ` [PATCH 4/4] libata: Fix garbled Trim payload Martin K. Petersen
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  2:45 UTC (permalink / raw)
  To: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide
  Cc: Martin K. Petersen

Our current Trim payload is a single sector that can accommodate 64 *
65535 blocks being unmapped.  Report this value in the Block Limits
Maximum Unmap LBA count field.

If a storage device supports Trim and the DRAT and RZAT bits are set,
report TPRZ=1 in Read Capacity(16).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/ata/libata-scsi.c |   12 +++++++++---
 include/linux/ata.h       |   10 ++++++++++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e0995c4..08d4ab7 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2116,8 +2116,10 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
 	 * that we support some form of unmap - in thise case via WRITE SAME
 	 * with the unmap bit set.
 	 */
-	if (ata_id_has_trim(args->id))
+	if (ata_id_has_trim(args->id)) {
+		put_unaligned_be32(65535 * 512 / 8, &rbuf[20]);
 		put_unaligned_be32(1, &rbuf[28]);
+	}
 
 	return 0;
 }
@@ -2412,8 +2414,12 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[14] = (lowest_aligned >> 8) & 0x3f;
 		rbuf[15] = lowest_aligned;
 
-		if (ata_id_has_trim(args->id))
-			rbuf[14] |= 0x80;
+		if (ata_id_has_trim(args->id)) {
+			rbuf[14] |= 0x80; /* TPE */
+
+			if (ata_id_has_zero_after_trim(args->id))
+				rbuf[14] |= 0x40; /* TPRZ */
+		}
 	}
 
 	return 0;
diff --git a/include/linux/ata.h b/include/linux/ata.h
index e2595e8..b18b2bb 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -75,6 +75,7 @@ enum {
 	ATA_ID_EIDE_DMA_TIME	= 66,
 	ATA_ID_EIDE_PIO		= 67,
 	ATA_ID_EIDE_PIO_IORDY	= 68,
+	ATA_ID_ADDITIONAL_SUPP	= 69,
 	ATA_ID_QUEUE_DEPTH	= 75,
 	ATA_ID_MAJOR_VER	= 80,
 	ATA_ID_COMMAND_SET_1	= 82,
@@ -816,6 +817,15 @@ static inline int ata_id_has_trim(const u16 *id)
 	return 0;
 }
 
+static inline int ata_id_has_zero_after_trim(const u16 *id)
+{
+	/* DSM supported, deterministic read, and read zero after trim set */
+	if (ata_id_has_trim(id) && id[ATA_ID_ADDITIONAL_SUPP] & 0x4020)
+		return 1;
+
+	return 0;
+}
+
 static inline int ata_id_current_chs_valid(const u16 *id)
 {
 	/* For ATA-1 devices, if the INITIALIZE DEVICE PARAMETERS command
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/4] libata: Report zeroed read after Trim and max discard size
  2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
                   ` (3 preceding siblings ...)
  2009-11-21  2:45 ` Martin K. Petersen
@ 2009-11-21  2:45 ` Martin K. Petersen
  2009-11-21  2:45 ` Martin K. Petersen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  2:45 UTC (permalink / raw)
  To: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide
  Cc: Martin K. Petersen

Our current Trim payload is a single sector that can accommodate 64 *
65535 blocks being unmapped.  Report this value in the Block Limits
Maximum Unmap LBA count field.

If a storage device supports Trim and the DRAT and RZAT bits are set,
report TPRZ=1 in Read Capacity(16).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/ata/libata-scsi.c |   12 +++++++++---
 include/linux/ata.h       |   10 ++++++++++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e0995c4..08d4ab7 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2116,8 +2116,10 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
 	 * that we support some form of unmap - in thise case via WRITE SAME
 	 * with the unmap bit set.
 	 */
-	if (ata_id_has_trim(args->id))
+	if (ata_id_has_trim(args->id)) {
+		put_unaligned_be32(65535 * 512 / 8, &rbuf[20]);
 		put_unaligned_be32(1, &rbuf[28]);
+	}
 
 	return 0;
 }
@@ -2412,8 +2414,12 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[14] = (lowest_aligned >> 8) & 0x3f;
 		rbuf[15] = lowest_aligned;
 
-		if (ata_id_has_trim(args->id))
-			rbuf[14] |= 0x80;
+		if (ata_id_has_trim(args->id)) {
+			rbuf[14] |= 0x80; /* TPE */
+
+			if (ata_id_has_zero_after_trim(args->id))
+				rbuf[14] |= 0x40; /* TPRZ */
+		}
 	}
 
 	return 0;
diff --git a/include/linux/ata.h b/include/linux/ata.h
index e2595e8..b18b2bb 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -75,6 +75,7 @@ enum {
 	ATA_ID_EIDE_DMA_TIME	= 66,
 	ATA_ID_EIDE_PIO		= 67,
 	ATA_ID_EIDE_PIO_IORDY	= 68,
+	ATA_ID_ADDITIONAL_SUPP	= 69,
 	ATA_ID_QUEUE_DEPTH	= 75,
 	ATA_ID_MAJOR_VER	= 80,
 	ATA_ID_COMMAND_SET_1	= 82,
@@ -816,6 +817,15 @@ static inline int ata_id_has_trim(const u16 *id)
 	return 0;
 }
 
+static inline int ata_id_has_zero_after_trim(const u16 *id)
+{
+	/* DSM supported, deterministic read, and read zero after trim set */
+	if (ata_id_has_trim(id) && id[ATA_ID_ADDITIONAL_SUPP] & 0x4020)
+		return 1;
+
+	return 0;
+}
+
 static inline int ata_id_current_chs_valid(const u16 *id)
 {
 	/* For ATA-1 devices, if the INITIALIZE DEVICE PARAMETERS command
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 4/4] libata: Fix garbled Trim payload
  2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
                   ` (5 preceding siblings ...)
  2009-11-21  2:45 ` Martin K. Petersen
@ 2009-11-21  2:45 ` Martin K. Petersen
  2009-11-21 10:47   ` Christoph Hellwig
  2009-11-21  2:45 ` Martin K. Petersen
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  2:45 UTC (permalink / raw)
  To: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide
  Cc: Martin K. Petersen

ata_set_lba_range_entries confused indexes into a u64 buffer with byte
offsets.  Fix this up and document the return value.

Update the libata write same translation to reflect this.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/ata/libata-scsi.c |    8 ++++----
 include/linux/ata.h       |    9 +++++----
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 08d4ab7..0c496f8 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2950,7 +2950,7 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 	const u8 *cdb = scmd->cmnd;
 	u64 block;
 	u32 n_block;
-	u32 size;
+	u32 trim_sectors;
 	void *buf;
 
 	/* we may not issue DMA commands if no DMA mode is set */
@@ -2973,13 +2973,13 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 		goto invalid_fld;
 
 	buf = page_address(sg_page(scsi_sglist(scmd)));
-	size = ata_set_lba_range_entries(buf, 512 / 8, block, n_block);
+	trim_sectors = ata_set_lba_range_entries(buf, 512 / 8, block, n_block);
 
 	tf->protocol = ATA_PROT_DMA;
 	tf->hob_feature = 0;
 	tf->feature = ATA_DSM_TRIM;
-	tf->hob_nsect = (size / 512) >> 8;
-	tf->nsect = size / 512;
+	tf->hob_nsect = trim_sectors >> 8;
+	tf->nsect = trim_sectors;
 	tf->command = ATA_CMD_DSM;
 	tf->flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE | ATA_TFLAG_LBA48 |
 		     ATA_TFLAG_WRITE;
diff --git a/include/linux/ata.h b/include/linux/ata.h
index b18b2bb..150c9dc 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -983,7 +983,8 @@ static inline void ata_id_to_hd_driveid(u16 *id)
 /*
  * Write up to 'max' LBA Range Entries to the buffer that will cover the
  * extent from sector to sector + count.  This is used for TRIM and for
- * ADD LBA(S) TO NV CACHE PINNED SET.
+ * ADD LBA(S) TO NV CACHE PINNED SET.  Returns number of 512-byte
+ * sectors required to encode the range.
  */
 static inline unsigned ata_set_lba_range_entries(void *_buffer, unsigned max,
 						u64 sector, unsigned long count)
@@ -1001,9 +1002,9 @@ static inline unsigned ata_set_lba_range_entries(void *_buffer, unsigned max,
 		sector += 0xffff;
 	}
 
-	max = ALIGN(i * 8, 512);
-	memset(buffer + i, 0, max - i * 8);
-	return max;
+	memset(&buffer[i], 0, (max - i) * 8);
+
+	return ALIGN(i * 8, 512);
 }
 
 static inline int is_multi_taskfile(struct ata_taskfile *tf)
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 4/4] libata: Fix garbled Trim payload
  2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
                   ` (6 preceding siblings ...)
  2009-11-21  2:45 ` [PATCH 4/4] libata: Fix garbled Trim payload Martin K. Petersen
@ 2009-11-21  2:45 ` Martin K. Petersen
  2009-11-21  4:56 ` Thin provisioning fixes Eric Sandeen
  2009-11-26 10:59 ` Christoph Hellwig
  9 siblings, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  2:45 UTC (permalink / raw)
  To: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide
  Cc: Martin K. Petersen

ata_set_lba_range_entries confused indexes into a u64 buffer with byte
offsets.  Fix this up and document the return value.

Update the libata write same translation to reflect this.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/ata/libata-scsi.c |    8 ++++----
 include/linux/ata.h       |    9 +++++----
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 08d4ab7..0c496f8 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2950,7 +2950,7 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 	const u8 *cdb = scmd->cmnd;
 	u64 block;
 	u32 n_block;
-	u32 size;
+	u32 trim_sectors;
 	void *buf;
 
 	/* we may not issue DMA commands if no DMA mode is set */
@@ -2973,13 +2973,13 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 		goto invalid_fld;
 
 	buf = page_address(sg_page(scsi_sglist(scmd)));
-	size = ata_set_lba_range_entries(buf, 512 / 8, block, n_block);
+	trim_sectors = ata_set_lba_range_entries(buf, 512 / 8, block, n_block);
 
 	tf->protocol = ATA_PROT_DMA;
 	tf->hob_feature = 0;
 	tf->feature = ATA_DSM_TRIM;
-	tf->hob_nsect = (size / 512) >> 8;
-	tf->nsect = size / 512;
+	tf->hob_nsect = trim_sectors >> 8;
+	tf->nsect = trim_sectors;
 	tf->command = ATA_CMD_DSM;
 	tf->flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE | ATA_TFLAG_LBA48 |
 		     ATA_TFLAG_WRITE;
diff --git a/include/linux/ata.h b/include/linux/ata.h
index b18b2bb..150c9dc 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -983,7 +983,8 @@ static inline void ata_id_to_hd_driveid(u16 *id)
 /*
  * Write up to 'max' LBA Range Entries to the buffer that will cover the
  * extent from sector to sector + count.  This is used for TRIM and for
- * ADD LBA(S) TO NV CACHE PINNED SET.
+ * ADD LBA(S) TO NV CACHE PINNED SET.  Returns number of 512-byte
+ * sectors required to encode the range.
  */
 static inline unsigned ata_set_lba_range_entries(void *_buffer, unsigned max,
 						u64 sector, unsigned long count)
@@ -1001,9 +1002,9 @@ static inline unsigned ata_set_lba_range_entries(void *_buffer, unsigned max,
 		sector += 0xffff;
 	}
 
-	max = ALIGN(i * 8, 512);
-	memset(buffer + i, 0, max - i * 8);
-	return max;
+	memset(&buffer[i], 0, (max - i) * 8);
+
+	return ALIGN(i * 8, 512);
 }
 
 static inline int is_multi_taskfile(struct ata_taskfile *tf)
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: Thin provisioning fixes
  2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
                   ` (7 preceding siblings ...)
  2009-11-21  2:45 ` Martin K. Petersen
@ 2009-11-21  4:56 ` Eric Sandeen
  2009-11-21  6:08   ` Martin K. Petersen
  2009-11-21  6:55   ` Martin K. Petersen
  2009-11-26 10:59 ` Christoph Hellwig
  9 siblings, 2 replies; 35+ messages in thread
From: Eric Sandeen @ 2009-11-21  4:56 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: jens.axboe, james.bottomley, hch, willy, jgarzik, rwheeler,
	linux-ide, linux-scsi

Martin K. Petersen wrote:
> A few updates to the thin provisioning bits aimed at 2.6.33:
> 
>  - Block layer support for reporting whether reading previously
>    discarded blocks returns zeroes or not.  Jens, this applies on
>    top of the patch that you have already queued.
> 
>  - Updated SCSI disk thin provisioning support:
> 
>    - Reports TPRZ via the new block layer flag
> 
>    - Moved some of the discard detection to read_capacity_16 to
>      accommodate TP devices that don't support Unmap and the
>      block limits VPD
> 
>    James, this replaces my existing post-merge sd patch.
> 
>  - libata:
> 
>    - Translate DRAT and ZRAT to TPRZ (zeroed blocks after trim)
> 
>    - Report the maximum discard sectors in the block limits VPD
> 
>    - Fix a couple of problems in ata_set_lba_range_entries() that
>      caused us to issue garbled trims
> 
> 
> Interested parties can grab the tp branch from:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux-2.6-mkp.git

I pulled that tonight ...

Doing a test on an OCZ ssd, by patterning 16GB and then discarding
it via a single ioctl call, ends in an unhappy drive:

ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata3.00: cmd 06/01:00:00:00:00/00:02:00:00:00/a0 tag 0 dma 4294966784 out
         res 40/00:18:18:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
ata3.00: status: { DRDY }
ata3: hard resetting link
ata3: SATA link down (SStatus 0 SControl 300)
ata3: hard resetting link
ata3: SATA link down (SStatus 0 SControl 300)
ata3: limiting SATA link speed to 1.5 Gbps
ata3: hard resetting link
ata3: SATA link down (SStatus 0 SControl 310)
ata3.00: disabled
ata3: EH complete
ata3.00: detaching (SCSI 2:0:0:0)
sd 2:0:0:0: [sdb] Stopping disk
sd 2:0:0:0: [sdb] START_STOP FAILED
sd 2:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

... and then it goes away ...

-Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Thin provisioning fixes
  2009-11-21  4:56 ` Thin provisioning fixes Eric Sandeen
@ 2009-11-21  6:08   ` Martin K. Petersen
  2009-11-21  6:55   ` Martin K. Petersen
  1 sibling, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  6:08 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Martin K. Petersen, jens.axboe, james.bottomley, hch, willy,
	jgarzik, rwheeler, linux-ide, linux-scsi

>>>>> "Eric" == Eric Sandeen <sandeen@redhat.com> writes:

>> - Fix a couple of problems in ata_set_lba_range_entries() that
>> caused us to issue garbled trims

>> git://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux-2.6-mkp.git

Eric> I pulled that tonight ...

Eric> Doing a test on an OCZ ssd, by patterning 16GB and then discarding
Eric> it via a single ioctl call, ends in an unhappy drive:

Interesting.

Did it trim the missing pieces or are you still seeing holes?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Thin provisioning fixes
  2009-11-21  4:56 ` Thin provisioning fixes Eric Sandeen
  2009-11-21  6:08   ` Martin K. Petersen
@ 2009-11-21  6:55   ` Martin K. Petersen
  1 sibling, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21  6:55 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Martin K. Petersen, jens.axboe, james.bottomley, hch, willy,
	jgarzik, rwheeler, linux-ide, linux-scsi

>>>>> "Eric" == Eric Sandeen <sandeen@redhat.com> writes:

Eric> Doing a test on an OCZ ssd, by patterning 16GB and then discarding
Eric> it via a single ioctl call, ends in an unhappy drive:

Nothing to see here, move along.  It was a bug in my userland
instrumentation and not in ata_set_lba_range_entries().  Sorry, Matthew.

So just drop this last patch.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-21  2:45 ` [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed Martin K. Petersen
@ 2009-11-21 10:13   ` Christoph Hellwig
  2009-11-21 19:58     ` Matthew Wilcox
  2009-11-21 12:50   ` Ric Wheeler
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2009-11-21 10:13 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide, linux-scsi

On Fri, Nov 20, 2009 at 09:45:21PM -0500, Martin K. Petersen wrote:
> The discard ioctl is used by mkfs utilities to clear a block device
> prior to putting metadata down.  However, not all devices return zeroed
> blocks after a discard.  Some drives return stale data, potentially
> containing old superblocks.  It is therefore important to know whether
> discarded blocks are properly zeroed.

At least for mkfs.xfs we make sure to still zero the important areas
after the TRIM ioctl anyway.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/4] libata: Fix garbled Trim payload
  2009-11-21  2:45 ` [PATCH 4/4] libata: Fix garbled Trim payload Martin K. Petersen
@ 2009-11-21 10:47   ` Christoph Hellwig
  2009-11-21 19:50     ` Martin K. Petersen
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2009-11-21 10:47 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide, linux-scsi

On Fri, Nov 20, 2009 at 09:45:24PM -0500, Martin K. Petersen wrote:
> ata_set_lba_range_entries confused indexes into a u64 buffer with byte
> offsets.  Fix this up and document the return value.
> 
> Update the libata write same translation to reflect this.

Once we fix up this beast we should also change the current max argument
to encode the buffer size, not the maximum number of entries which is a
rather confusing interface.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/4] libata: Report zeroed read after Trim and max discard size
  2009-11-21  2:45 ` Martin K. Petersen
@ 2009-11-21 10:49   ` Christoph Hellwig
  2009-11-21 20:16     ` Martin K. Petersen
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2009-11-21 10:49 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide, linux-scsi

On Fri, Nov 20, 2009 at 09:45:23PM -0500, Martin K. Petersen wrote:
>  	 * with the unmap bit set.
>  	 */
> -	if (ata_id_has_trim(args->id))
> +	if (ata_id_has_trim(args->id)) {
> +		put_unaligned_be32(65535 * 512 / 8, &rbuf[20]);
>  		put_unaligned_be32(1, &rbuf[28]);

My reading of SPC is that the max unmap size only makes sense for
devices supporting UNMAP, while the SATL for now only supports WRITE
SAME with the unmap bit.

> -			rbuf[14] |= 0x80;
> +		if (ata_id_has_trim(args->id)) {
> +			rbuf[14] |= 0x80; /* TPE */
> +
> +			if (ata_id_has_zero_after_trim(args->id))
> +				rbuf[14] |= 0x40; /* TPRZ */

Looks good.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-21  2:45 ` [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed Martin K. Petersen
  2009-11-21 10:13   ` Christoph Hellwig
@ 2009-11-21 12:50   ` Ric Wheeler
  2009-11-21 20:17     ` Martin K. Petersen
  1 sibling, 1 reply; 35+ messages in thread
From: Ric Wheeler @ 2009-11-21 12:50 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	linux-ide, linux-scsi

On 11/20/2009 09:45 PM, Martin K. Petersen wrote:
> The discard ioctl is used by mkfs utilities to clear a block device
> prior to putting metadata down.  However, not all devices return zeroed
> blocks after a discard.  Some drives return stale data, potentially
> containing old superblocks.  It is therefore important to know whether
> discarded blocks are properly zeroed.
>
> Both ATA and SCSI drives have configuration bits that indicate whether
> zeroes are returned after a discard operation.  Implement a block level
> interface that allows this information to be bubbled up the stack.
>
> Signed-off-by: Martin K. Petersen<martin.petersen@oracle.com>
> ---
>   block/blk-settings.c   |    2 ++
>   block/blk-sysfs.c      |   11 +++++++++++
>   include/linux/blkdev.h |    1 +
>   3 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 7f986ca..1027e30 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -100,6 +100,7 @@ void blk_set_default_limits(struct queue_limits *lim)
>   	lim->discard_granularity = 0;
>   	lim->discard_alignment = 0;
>   	lim->discard_misaligned = 0;
> +	lim->discard_zeroes_data = -1;
>    


Did you mean to set this to -1 ?

ric


>   	lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
>   	lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY>>  PAGE_SHIFT);
>   	lim->alignment_offset = 0;
> @@ -543,6 +544,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>
>   	t->io_min = max(t->io_min, b->io_min);
>   	t->no_cluster |= b->no_cluster;
> +	t->discard_zeroes_data&= b->discard_zeroes_data;
>
>   	/* Bottom device offset aligned? */
>   	if (offset&&
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 3147145..1f1d2b6 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -136,6 +136,11 @@ static ssize_t queue_discard_max_show(struct request_queue *q, char *page)
>   	return queue_var_show(q->limits.max_discard_sectors<<  9, page);
>   }
>
> +static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *page)
> +{
> +	return queue_var_show(q->limits.discard_zeroes_data == 1 ? 1 : 0, page);
> +}
> +
>   static ssize_t
>   queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
>   {
> @@ -313,6 +318,11 @@ static struct queue_sysfs_entry queue_discard_max_entry = {
>   	.show = queue_discard_max_show,
>   };
>
> +static struct queue_sysfs_entry queue_discard_zeroes_data_entry = {
> +	.attr = {.name = "discard_zeroes_data", .mode = S_IRUGO },
> +	.show = queue_discard_zeroes_data_show,
> +};
> +
>   static struct queue_sysfs_entry queue_nonrot_entry = {
>   	.attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR },
>   	.show = queue_nonrot_show,
> @@ -350,6 +360,7 @@ static struct attribute *default_attrs[] = {
>   	&queue_io_opt_entry.attr,
>   	&queue_discard_granularity_entry.attr,
>   	&queue_discard_max_entry.attr,
> +	&queue_discard_zeroes_data_entry.attr,
>   	&queue_nonrot_entry.attr,
>   	&queue_nomerges_entry.attr,
>   	&queue_rq_affinity_entry.attr,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 3b67221..e605945 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -322,6 +322,7 @@ struct queue_limits {
>   	unsigned char		misaligned;
>   	unsigned char		discard_misaligned;
>   	unsigned char		no_cluster;
> +	signed char		discard_zeroes_data;
>   };
>
>   struct request_queue
>    


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/4] libata: Fix garbled Trim payload
  2009-11-21 10:47   ` Christoph Hellwig
@ 2009-11-21 19:50     ` Martin K. Petersen
  0 siblings, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21 19:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, jens.axboe, james.bottomley, willy, jgarzik,
	sandeen, rwheeler, linux-ide, linux-scsi

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

Christoph> On Fri, Nov 20, 2009 at 09:45:24PM -0500, Martin K. Petersen wrote:
>> ata_set_lba_range_entries confused indexes into a u64 buffer with
>> byte offsets.  Fix this up and document the return value.
>> 
>> Update the libata write same translation to reflect this.

Christoph> Once we fix up this beast we should also change the current
Christoph> max argument to encode the buffer size, not the maximum
Christoph> number of entries which is a rather confusing interface.

What threw me off yesterday was that `max' starts out being a cap on the
number of 8-byte entries in the buffer. And then towards the end it
suddenly turns into a number-of-bytes entity as a result of the ALIGN
macro.

I agree that we should fix this up to be consistent. I'll do that.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-21 10:13   ` Christoph Hellwig
@ 2009-11-21 19:58     ` Matthew Wilcox
  2009-11-22  2:43       ` Mark Lord
  2009-11-23 17:05       ` Christoph Hellwig
  0 siblings, 2 replies; 35+ messages in thread
From: Matthew Wilcox @ 2009-11-21 19:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, jens.axboe, james.bottomley, willy, jgarzik,
	sandeen, rwheeler, linux-ide, linux-scsi

On Sat, Nov 21, 2009 at 05:13:56AM -0500, Christoph Hellwig wrote:
> On Fri, Nov 20, 2009 at 09:45:21PM -0500, Martin K. Petersen wrote:
> > The discard ioctl is used by mkfs utilities to clear a block device
> > prior to putting metadata down.  However, not all devices return zeroed
> > blocks after a discard.  Some drives return stale data, potentially
> > containing old superblocks.  It is therefore important to know whether
> > discarded blocks are properly zeroed.
> 
> At least for mkfs.xfs we make sure to still zero the important areas
> after the TRIM ioctl anyway.

Could you change that to zero _before_ the TRIM?

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/4] libata: Report zeroed read after Trim and max discard size
  2009-11-21 10:49   ` Christoph Hellwig
@ 2009-11-21 20:16     ` Martin K. Petersen
  2009-11-24 14:35       ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21 20:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, jens.axboe, james.bottomley, willy, jgarzik,
	sandeen, rwheeler, linux-ide, linux-scsi

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

Christoph> On Fri, Nov 20, 2009 at 09:45:23PM -0500, Martin K. Petersen wrote:
>> * with the unmap bit set.
>> */
>> - if (ata_id_has_trim(args->id))
>> + if (ata_id_has_trim(args->id)) {
>> + put_unaligned_be32(65535 * 512 / 8, &rbuf[20]);
>> put_unaligned_be32(1, &rbuf[28]);

Christoph> My reading of SPC is that the max unmap size only makes sense
Christoph> for devices supporting UNMAP, while the SATL for now only
Christoph> supports WRITE SAME with the unmap bit.

I was trying to help Eric figure out why his drive pooped on big Trim
requests.  For WRITE SAME the limit is inherent in the arguments,
whereas our SATL implementation is limited by the 512-byte WRITE SAME
payload.  So I needed a way to convey this up the stack.

Since you already return a B0 VPD page I thought it would be a
convenient place to communicate the max without having to tweak the
queue limits directly from within libata.

You are right that I'm relying on fuzziness in SBC which requires both
the max LBA count and the descriptor count to be specified for UNMAP.

So what I can do is this:

                if (lba_count) {
                        q->limits.max_discard_sectors =
                                lba_count * sector_sz >> 9;

                        if (desc_count)
                                sdkp->unmap = 1;
                }

That way we don't impose limits on "normal" WRITE SAME devices.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-21 12:50   ` Ric Wheeler
@ 2009-11-21 20:17     ` Martin K. Petersen
  0 siblings, 0 replies; 35+ messages in thread
From: Martin K. Petersen @ 2009-11-21 20:17 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Martin K. Petersen, jens.axboe, james.bottomley, hch, willy,
	jgarzik, sandeen, linux-ide, linux-scsi

>>>>> "Ric" == Ric Wheeler <rwheeler@redhat.com> writes:

>> + lim->discard_zeroes_data = -1;

Ric> Did you mean to set this to -1 ?

Yep, it's a "not set yet" value for the stacking to work properly in
this case.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-21 19:58     ` Matthew Wilcox
@ 2009-11-22  2:43       ` Mark Lord
  2009-11-23 16:37         ` Ric Wheeler
  2009-11-23 17:05       ` Christoph Hellwig
  1 sibling, 1 reply; 35+ messages in thread
From: Mark Lord @ 2009-11-22  2:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Martin K. Petersen, jens.axboe,
	james.bottomley, willy, jgarzik, sandeen, rwheeler, linux-ide,
	linux-scsi

Matthew Wilcox wrote:
> On Sat, Nov 21, 2009 at 05:13:56AM -0500, Christoph Hellwig wrote:
>> On Fri, Nov 20, 2009 at 09:45:21PM -0500, Martin K. Petersen wrote:
>>> The discard ioctl is used by mkfs utilities to clear a block device
>>> prior to putting metadata down.  However, not all devices return zeroed
>>> blocks after a discard.  Some drives return stale data, potentially
>>> containing old superblocks.  It is therefore important to know whether
>>> discarded blocks are properly zeroed.
>> At least for mkfs.xfs we make sure to still zero the important areas
>> after the TRIM ioctl anyway.
> 
> Could you change that to zero _before_ the TRIM?
..


Hopefully not for the drives that *don't* guarantee zeros after TRIM.
Eg. most existing ones, including all of the Indilinx chipset drives.

Cheers


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-22  2:43       ` Mark Lord
@ 2009-11-23 16:37         ` Ric Wheeler
  2009-11-23 16:54           ` Greg Freemyer
  0 siblings, 1 reply; 35+ messages in thread
From: Ric Wheeler @ 2009-11-23 16:37 UTC (permalink / raw)
  To: Mark Lord
  Cc: Matthew Wilcox, Christoph Hellwig, Martin K. Petersen,
	jens.axboe, james.bottomley, willy, jgarzik, sandeen, rwheeler,
	linux-ide, linux-scsi

On 11/21/2009 09:43 PM, Mark Lord wrote:
> Matthew Wilcox wrote:
>> On Sat, Nov 21, 2009 at 05:13:56AM -0500, Christoph Hellwig wrote:
>>> On Fri, Nov 20, 2009 at 09:45:21PM -0500, Martin K. Petersen wrote:
>>>> The discard ioctl is used by mkfs utilities to clear a block device
>>>> prior to putting metadata down.  However, not all devices return 
>>>> zeroed
>>>> blocks after a discard.  Some drives return stale data, potentially
>>>> containing old superblocks.  It is therefore important to know whether
>>>> discarded blocks are properly zeroed.
>>> At least for mkfs.xfs we make sure to still zero the important areas
>>> after the TRIM ioctl anyway.
>>
>> Could you change that to zero _before_ the TRIM?
> ..
>
>
> Hopefully not for the drives that *don't* guarantee zeros after TRIM.
> Eg. most existing ones, including all of the Indilinx chipset drives.
>
> Cheers
>

Note that writing to a block after a discard (write or trim) changes the 
state of that block back to allocated (mapped) in the target. Basically, 
that would seem to make the trim redundant and irrelevant.

You can zero before the discard but that can still fail is the device 
returns garbage for a trim'ed block I think,

Ric


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-23 16:37         ` Ric Wheeler
@ 2009-11-23 16:54           ` Greg Freemyer
  2009-11-23 17:02             ` Ric Wheeler
  0 siblings, 1 reply; 35+ messages in thread
From: Greg Freemyer @ 2009-11-23 16:54 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Mark Lord, Matthew Wilcox, Christoph Hellwig, Martin K. Petersen,
	jens.axboe, james.bottomley, willy, jgarzik, sandeen, linux-ide,
	linux-scsi

On Mon, Nov 23, 2009 at 11:37 AM, Ric Wheeler <rwheeler@redhat.com> wrote:
> On 11/21/2009 09:43 PM, Mark Lord wrote:
>>
>> Matthew Wilcox wrote:
>>>
>>> On Sat, Nov 21, 2009 at 05:13:56AM -0500, Christoph Hellwig wrote:
>>>>
>>>> On Fri, Nov 20, 2009 at 09:45:21PM -0500, Martin K. Petersen wrote:
>>>>>
>>>>> The discard ioctl is used by mkfs utilities to clear a block device
>>>>> prior to putting metadata down.  However, not all devices return zeroed
>>>>> blocks after a discard.  Some drives return stale data, potentially
>>>>> containing old superblocks.  It is therefore important to know whether
>>>>> discarded blocks are properly zeroed.
>>>>
>>>> At least for mkfs.xfs we make sure to still zero the important areas
>>>> after the TRIM ioctl anyway.
>>>
>>> Could you change that to zero _before_ the TRIM?
>>
>> ..
>>
>>
>> Hopefully not for the drives that *don't* guarantee zeros after TRIM.
>> Eg. most existing ones, including all of the Indilinx chipset drives.
>>
>> Cheers
>>
>
> Note that writing to a block after a discard (write or trim) changes the
> state of that block back to allocated (mapped) in the target. Basically,
> that would seem to make the trim redundant and irrelevant.
>
> You can zero before the discard but that can still fail is the device
> returns garbage for a trim'ed block I think,
>
> Ric

I understood the mkfs process to be:

- discard / trim 100% of the block device
- write zeros to the relatively small group of blocks that the
filesystem assumes will be zero'd.

Seems like the only reasonable way to handle it.

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-23 16:54           ` Greg Freemyer
@ 2009-11-23 17:02             ` Ric Wheeler
  2009-11-23 17:03               ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Ric Wheeler @ 2009-11-23 17:02 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Mark Lord, Matthew Wilcox, Christoph Hellwig, Martin K. Petersen,
	jens.axboe, james.bottomley, willy, jgarzik, sandeen, linux-ide,
	linux-scsi

On 11/23/2009 11:54 AM, Greg Freemyer wrote:
> On Mon, Nov 23, 2009 at 11:37 AM, Ric Wheeler<rwheeler@redhat.com>  wrote:
>> On 11/21/2009 09:43 PM, Mark Lord wrote:
>>>
>>> Matthew Wilcox wrote:
>>>>
>>>> On Sat, Nov 21, 2009 at 05:13:56AM -0500, Christoph Hellwig wrote:
>>>>>
>>>>> On Fri, Nov 20, 2009 at 09:45:21PM -0500, Martin K. Petersen wrote:
>>>>>>
>>>>>> The discard ioctl is used by mkfs utilities to clear a block device
>>>>>> prior to putting metadata down.  However, not all devices return zeroed
>>>>>> blocks after a discard.  Some drives return stale data, potentially
>>>>>> containing old superblocks.  It is therefore important to know whether
>>>>>> discarded blocks are properly zeroed.
>>>>>
>>>>> At least for mkfs.xfs we make sure to still zero the important areas
>>>>> after the TRIM ioctl anyway.
>>>>
>>>> Could you change that to zero _before_ the TRIM?
>>>
>>> ..
>>>
>>>
>>> Hopefully not for the drives that *don't* guarantee zeros after TRIM.
>>> Eg. most existing ones, including all of the Indilinx chipset drives.
>>>
>>> Cheers
>>>
>>
>> Note that writing to a block after a discard (write or trim) changes the
>> state of that block back to allocated (mapped) in the target. Basically,
>> that would seem to make the trim redundant and irrelevant.
>>
>> You can zero before the discard but that can still fail is the device
>> returns garbage for a trim'ed block I think,
>>
>> Ric
>
> I understood the mkfs process to be:
>
> - discard / trim 100% of the block device
> - write zeros to the relatively small group of blocks that the
> filesystem assumes will be zero'd.
>
> Seems like the only reasonable way to handle it.
>
> Greg

I agree - the discard of the whole device is a good idea.

I just want to make clear that "discard block X; write block X with zeroed data" 
undoes the discard in general :-)

ric


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-23 17:02             ` Ric Wheeler
@ 2009-11-23 17:03               ` Christoph Hellwig
  2009-11-23 17:50                 ` Eric Sandeen
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2009-11-23 17:03 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Greg Freemyer, Mark Lord, Matthew Wilcox, Christoph Hellwig,
	Martin K. Petersen, jens.axboe, james.bottomley, willy, jgarzik,
	sandeen, linux-ide, linux-scsi

On Mon, Nov 23, 2009 at 12:02:48PM -0500, Ric Wheeler wrote:
> I agree - the discard of the whole device is a good idea.
>
> I just want to make clear that "discard block X; write block X with 
> zeroed data" undoes the discard in general :-)

We can skip the writing of zeroes if we know the device returns zeroed
blocks after a trim.  Martin's patch exports that information to
userspace, and once we have a nice enough interface (e.g. blkid or an
ioctl) we can actually use it in mkfs to optimize the writing of zeroes
away.  Raw growling in sysfs is a bit too nasty to add it to mkfs for
those few blocks IMHO.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-21 19:58     ` Matthew Wilcox
  2009-11-22  2:43       ` Mark Lord
@ 2009-11-23 17:05       ` Christoph Hellwig
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2009-11-23 17:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Martin K. Petersen, jens.axboe,
	james.bottomley, willy, jgarzik, sandeen, rwheeler, linux-ide,
	linux-scsi

On Sat, Nov 21, 2009 at 12:58:03PM -0700, Matthew Wilcox wrote:
> On Sat, Nov 21, 2009 at 05:13:56AM -0500, Christoph Hellwig wrote:
> > On Fri, Nov 20, 2009 at 09:45:21PM -0500, Martin K. Petersen wrote:
> > > The discard ioctl is used by mkfs utilities to clear a block device
> > > prior to putting metadata down.  However, not all devices return zeroed
> > > blocks after a discard.  Some drives return stale data, potentially
> > > containing old superblocks.  It is therefore important to know whether
> > > discarded blocks are properly zeroed.
> > 
> > At least for mkfs.xfs we make sure to still zero the important areas
> > after the TRIM ioctl anyway.
> 
> Could you change that to zero _before_ the TRIM?

I could easily, but it would be a very bad idea.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed
  2009-11-23 17:03               ` Christoph Hellwig
@ 2009-11-23 17:50                 ` Eric Sandeen
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Sandeen @ 2009-11-23 17:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ric Wheeler, Greg Freemyer, Mark Lord, Matthew Wilcox,
	Martin K. Petersen, jens.axboe, james.bottomley, willy, jgarzik,
	linux-ide, linux-scsi

Christoph Hellwig wrote:
> On Mon, Nov 23, 2009 at 12:02:48PM -0500, Ric Wheeler wrote:
>> I agree - the discard of the whole device is a good idea.
>>
>> I just want to make clear that "discard block X; write block X with 
>> zeroed data" undoes the discard in general :-)
> 
> We can skip the writing of zeroes if we know the device returns zeroed
> blocks after a trim.  Martin's patch exports that information to
> userspace, and once we have a nice enough interface (e.g. blkid or an
> ioctl) we can actually use it in mkfs to optimize the writing of zeroes
> away.  Raw growling in sysfs is a bit too nasty to add it to mkfs for
> those few blocks IMHO.
> 

well, in testing, I've found that not all devices which claim to return
0s post-trim, do.

For example one ssd I have, if I pattern 1M of 0xaf, then trim from
512k to 1M, and read it back (all IO done direct) I get:

00000000  af af af af af af af af  af af af af af af af af
*
00080000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
*
000a9000  af af af af af af af af  af af af af af af af af
*
00100000

with scraps of data left over at 0xA9000.

So for the very few blocks that we zero at mkfs (thinking xfs here,
anyway) I'd really rather just zero them post-trim to be safe, at least
for now.

-Eric


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/4] libata: Report zeroed read after Trim and max discard size
  2009-11-21 20:16     ` Martin K. Petersen
@ 2009-11-24 14:35       ` Christoph Hellwig
  2009-11-24 15:20         ` Mark Lord
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2009-11-24 14:35 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, jens.axboe, james.bottomley, willy, jgarzik,
	sandeen, rwheeler, linux-ide, linux-scsi

On Sat, Nov 21, 2009 at 03:16:05PM -0500, Martin K. Petersen wrote:
> I was trying to help Eric figure out why his drive pooped on big Trim
> requests.  For WRITE SAME the limit is inherent in the arguments,
> whereas our SATL implementation is limited by the 512-byte WRITE SAME
> payload.  So I needed a way to convey this up the stack.
> 
> Since you already return a B0 VPD page I thought it would be a
> convenient place to communicate the max without having to tweak the
> queue limits directly from within libata.
> 
> You are right that I'm relying on fuzziness in SBC which requires both
> the max LBA count and the descriptor count to be specified for UNMAP.

I think the better way is to make sure we can support any TRIM that
can be sent down.  Given that TRIM is not NCQ-capable we can just
allocate one buffer for the TRIM ranges per TRIM capable device.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/4] libata: Report zeroed read after Trim and max discard size
  2009-11-24 14:35       ` Christoph Hellwig
@ 2009-11-24 15:20         ` Mark Lord
  2009-11-24 15:21           ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Mark Lord @ 2009-11-24 15:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, jens.axboe, james.bottomley, willy, jgarzik,
	sandeen, rwheeler, linux-ide, linux-scsi

Christoph Hellwig wrote:
> On Sat, Nov 21, 2009 at 03:16:05PM -0500, Martin K. Petersen wrote:
>> I was trying to help Eric figure out why his drive pooped on big Trim
>> requests.  For WRITE SAME the limit is inherent in the arguments,
>> whereas our SATL implementation is limited by the 512-byte WRITE SAME
>> payload.  So I needed a way to convey this up the stack.
>>
>> Since you already return a B0 VPD page I thought it would be a
>> convenient place to communicate the max without having to tweak the
>> queue limits directly from within libata.
>>
>> You are right that I'm relying on fuzziness in SBC which requires both
>> the max LBA count and the descriptor count to be specified for UNMAP.
> 
> I think the better way is to make sure we can support any TRIM that
> can be sent down.  Given that TRIM is not NCQ-capable we can just
> allocate one buffer for the TRIM ranges per TRIM capable device.
..

Good approach.

I suppose that buffer would be only 512 bytes long, per device?
That might be a bit restrictive, as TRIM can handle much larger
requests, and some drives (Indinlinx-based at least) prefer large
TRIM lists at present.

On the other hand, the Marvell chipsets cannot handle more than a
single sector of data without first fixing the driver to work around
chipset bugs.  That's probably unique to sata_mv, though.

Cheers

Mark Lord


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/4] libata: Report zeroed read after Trim and max discard size
  2009-11-24 15:20         ` Mark Lord
@ 2009-11-24 15:21           ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2009-11-24 15:21 UTC (permalink / raw)
  To: Mark Lord
  Cc: Christoph Hellwig, Martin K. Petersen, jens.axboe,
	james.bottomley, willy, jgarzik, sandeen, rwheeler, linux-ide,
	linux-scsi

On Tue, Nov 24, 2009 at 10:20:28AM -0500, Mark Lord wrote:
> I suppose that buffer would be only 512 bytes long, per device?
> That might be a bit restrictive, as TRIM can handle much larger
> requests, and some drives (Indinlinx-based at least) prefer large
> TRIM lists at present.

Allocating a buffer per device would allow it to be much larger.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Thin provisioning fixes
  2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
                   ` (8 preceding siblings ...)
  2009-11-21  4:56 ` Thin provisioning fixes Eric Sandeen
@ 2009-11-26 10:59 ` Christoph Hellwig
  2009-11-26 11:01   ` Jens Axboe
  9 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2009-11-26 10:59 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: jens.axboe, james.bottomley, hch, willy, jgarzik, sandeen,
	rwheeler, linux-ide, linux-scsi

What's the current state of this?

patch 4 is if I see this correctly withdrawn.  patch 1 could go into the
block tree as is, and patch 3 into the libata tree, and patch 2 would
replace the older one in the scsi post merge tree.

Jens, Jeff, James?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Thin provisioning fixes
  2009-11-26 10:59 ` Christoph Hellwig
@ 2009-11-26 11:01   ` Jens Axboe
  2009-11-26 11:05     ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-11-26 11:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, james.bottomley, willy, jgarzik, sandeen,
	rwheeler, linux-ide, linux-scsi

On Thu, Nov 26 2009, Christoph Hellwig wrote:
> What's the current state of this?
> 
> patch 4 is if I see this correctly withdrawn.  patch 1 could go into the
> block tree as is, and patch 3 into the libata tree, and patch 2 would
> replace the older one in the scsi post merge tree.
> 
> Jens, Jeff, James?

I was waiting for the discussion to settle and a repost. Let me know if
I should just pick up the current #1.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Thin provisioning fixes
  2009-11-26 11:01   ` Jens Axboe
@ 2009-11-26 11:05     ` Christoph Hellwig
  2009-11-26 15:13       ` Mark Lord
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2009-11-26 11:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Martin K. Petersen, james.bottomley, willy,
	jgarzik, sandeen, rwheeler, linux-ide, linux-scsi

On Thu, Nov 26, 2009 at 12:01:57PM +0100, Jens Axboe wrote:
> I was waiting for the discussion to settle and a repost. Let me know if
> I should just pick up the current #1.

As far as I'm concerned the discussion was just about the example use
case Martin mentioned.  Exposing wether a device zeroes discarded blocks
is a good thing in general.

Note that we might have to add blacklists for this in libata pretty soon
as the OCZ Vertex seems to get it wrong, but that's material for another
patch not touching the block layer..


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Thin provisioning fixes
  2009-11-26 11:05     ` Christoph Hellwig
@ 2009-11-26 15:13       ` Mark Lord
  2009-11-26 15:14         ` Mark Lord
  0 siblings, 1 reply; 35+ messages in thread
From: Mark Lord @ 2009-11-26 15:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, james.bottomley, willy, jgarzik,
	sandeen, rwheeler, linux-ide, linux-scsi

Christoph Hellwig wrote:
>
> Note that we might have to add blacklists for this in libata pretty soon
> as the OCZ Vertex seems to get it wrong, but that's material for another
> patch not touching the block layer..
..

With the current (latest) firmware on the Vertex,
it does not claim "read zeros after trim".
Which is correct.

How does this get something wrong?

Thanks

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Thin provisioning fixes
  2009-11-26 15:13       ` Mark Lord
@ 2009-11-26 15:14         ` Mark Lord
  0 siblings, 0 replies; 35+ messages in thread
From: Mark Lord @ 2009-11-26 15:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, james.bottomley, willy, jgarzik,
	sandeen, rwheeler, linux-ide, linux-scsi

Mark Lord wrote:
> Christoph Hellwig wrote:
>>
>> Note that we might have to add blacklists for this in libata pretty soon
>> as the OCZ Vertex seems to get it wrong, but that's material for another
>> patch not touching the block layer..
> ..
> 
> With the current (latest) firmware on the Vertex,
> it does not claim "read zeros after trim".
> Which is correct.
> 
> How does this get something wrong?
..

Oh.. answering my own question here..

All of the earlier firmwares *did* get it wrong,
so I suppose those could/should be X-listed.

Cheers!

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2009-11-26 15:14 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-21  2:45 Thin provisioning fixes Martin K. Petersen
2009-11-21  2:45 ` [PATCH 1/4] block: Allow devices to indicate whether discarded blocks are zeroed Martin K. Petersen
2009-11-21 10:13   ` Christoph Hellwig
2009-11-21 19:58     ` Matthew Wilcox
2009-11-22  2:43       ` Mark Lord
2009-11-23 16:37         ` Ric Wheeler
2009-11-23 16:54           ` Greg Freemyer
2009-11-23 17:02             ` Ric Wheeler
2009-11-23 17:03               ` Christoph Hellwig
2009-11-23 17:50                 ` Eric Sandeen
2009-11-23 17:05       ` Christoph Hellwig
2009-11-21 12:50   ` Ric Wheeler
2009-11-21 20:17     ` Martin K. Petersen
2009-11-21  2:45 ` Martin K. Petersen
2009-11-21  2:45 ` [PATCH 2/4] sd: WRITE SAME(16) / UNMAP support Martin K. Petersen
2009-11-21  2:45 ` Martin K. Petersen
2009-11-21  2:45 ` [PATCH 3/4] libata: Report zeroed read after Trim and max discard size Martin K. Petersen
2009-11-21  2:45 ` Martin K. Petersen
2009-11-21 10:49   ` Christoph Hellwig
2009-11-21 20:16     ` Martin K. Petersen
2009-11-24 14:35       ` Christoph Hellwig
2009-11-24 15:20         ` Mark Lord
2009-11-24 15:21           ` Christoph Hellwig
2009-11-21  2:45 ` [PATCH 4/4] libata: Fix garbled Trim payload Martin K. Petersen
2009-11-21 10:47   ` Christoph Hellwig
2009-11-21 19:50     ` Martin K. Petersen
2009-11-21  2:45 ` Martin K. Petersen
2009-11-21  4:56 ` Thin provisioning fixes Eric Sandeen
2009-11-21  6:08   ` Martin K. Petersen
2009-11-21  6:55   ` Martin K. Petersen
2009-11-26 10:59 ` Christoph Hellwig
2009-11-26 11:01   ` Jens Axboe
2009-11-26 11:05     ` Christoph Hellwig
2009-11-26 15:13       ` Mark Lord
2009-11-26 15:14         ` Mark Lord

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.