All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] discard support revisited
@ 2009-08-29 23:03 Christoph Hellwig
  2009-08-29 23:03 ` [PATCH 1/7] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads Christoph Hellwig
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-29 23:03 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-kernel; +Cc: liml, jens.axboe, matthew, dwmw2

I've started working on discard support a little bit more, and I think
it's time to get an overview over the state we're in right now.  This
patchswt is what I needed to get the batched discard support in XFS
properly working.  Many patches are just forward ported from Matthew's
SSD tree and I hope quilt mail didn't lose the attributions.

There are however a few significant changes:

 - blkdev_issue_discard is now a lot more generic - it grows a flags
   argument to control if we issue a barrier discard request or not
   or if we wait or not.  That is needed for the batched discard support
   in XFS which wants to wait just like the ioctl interface.
 - I have implemented support for sending WRITE SAME requests with the
   unmap bit set in sd.  This has been tested with a qemu-based backed
   only so far, but we'll get some real array coverage soon.
 - there is a new patch to allow larger discard requests.
 - the last patch is just my old batched discard patch for xfs ported
   to work ontop of this series.

I would really love to see some progress on this in the 2.6.32 circle.
We should at least get the block layer bits in that allow implementing
a somewhat useful discard function.  I would also love to see the
actual scsi and libata implementations in so that we can start playing
around with it.  But given the speed up the current TRIM implementations
and the expectations for WRITE SAME we should make sure the exact
TRIM tracking is not actually enabled anywhere by default for now.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/7] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
  2009-08-29 23:03 [PATCH 0/7] discard support revisited Christoph Hellwig
@ 2009-08-29 23:03 ` Christoph Hellwig
  2009-09-03 17:52   ` David Woodhouse
  2009-08-29 23:03 ` [PATCH 2/7] block: use blkdev_issue_discard in blk_ioctl_discard Christoph Hellwig
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-29 23:03 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-kernel
  Cc: liml, jens.axboe, matthew, dwmw2, David Woodhouse, Matthew Wilcox

[-- Attachment #1: discard-mark-requests-as-write --]
[-- Type: text/plain, Size: 1039 bytes --]

From: David Woodhouse <dwmw2@infradead.org>

The commands are conceptually writes, and in the case of IDE and SCSI
commands actually are writes.  They were only reads because we thought
that would interact better with the elevators.  Now the elevators know
about discard requests, that advantage no longer exists.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..28f8e5d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -161,8 +161,8 @@ struct inodes_stat_t {
  * These aren't really reads or writes, they pass down information about
  * parts of device that are now unused by the file system.
  */
-#define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)
-#define DISCARD_BARRIER ((1 << BIO_RW_DISCARD) | (1 << BIO_RW_BARRIER))
+#define DISCARD_NOBARRIER (WRITE | (1 << BIO_RW_DISCARD))
+#define DISCARD_BARRIER (DISCARD_NOBARRIER | (1 << BIO_RW_BARRIER))
 
 #define SEL_IN		1
 #define SEL_OUT		2

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/7] block: use blkdev_issue_discard in blk_ioctl_discard
  2009-08-29 23:03 [PATCH 0/7] discard support revisited Christoph Hellwig
  2009-08-29 23:03 ` [PATCH 1/7] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads Christoph Hellwig
@ 2009-08-29 23:03 ` Christoph Hellwig
  2009-09-01  9:06   ` Steven Whitehouse
  2009-09-11 22:26   ` Christoph Hellwig
  2009-08-29 23:03 ` [PATCH 3/7] block: discard may need to allocate pages Christoph Hellwig
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-29 23:03 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-kernel; +Cc: liml, jens.axboe, matthew, dwmw2

[-- Attachment #1: discard-unify-code --]
[-- Type: text/plain, Size: 7946 bytes --]

blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
the only difference between the two is that blkdev_issue_discard needs to
send a barrier discard request and blk_ioctl_discard a non-barrier one,
and blk_ioctl_discard needs to wait on the request.  To facilitates this
add a flags argument to blkdev_issue_discard to control both aspects of the
behaviour.  This will be very useful later on for using the waiting
funcitonality for other callers.

Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.


Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2009-08-29 16:49:43.067370900 -0300
+++ linux-2.6/block/blk-barrier.c	2009-08-29 17:43:30.407344330 -0300
@@ -348,6 +348,9 @@ static void blkdev_discard_end_io(struct
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	}
 
+	if (bio->bi_private)
+		complete(bio->bi_private);
+
 	bio_put(bio);
 }
 
@@ -357,21 +360,20 @@ static void blkdev_discard_end_io(struct
  * @sector:	start sector
  * @nr_sects:	number of sectors to discard
  * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @flags:	DISCARD_FL_* flags to control behaviour
  *
  * Description:
- *    Issue a discard request for the sectors in question. Does not wait.
+ *    Issue a discard request for the sectors in question.
  */
-int blkdev_issue_discard(struct block_device *bdev,
-			 sector_t sector, sector_t nr_sects, gfp_t gfp_mask)
+int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, int flags)
 {
-	struct request_queue *q;
-	struct bio *bio;
+	DECLARE_COMPLETION_ONSTACK(wait);
+	struct request_queue *q = bdev_get_queue(bdev);
+	int type = flags & DISCARD_FL_BARRIER ?
+		DISCARD_BARRIER : DISCARD_NOBARRIER;
 	int ret = 0;
 
-	if (bdev->bd_disk == NULL)
-		return -ENXIO;
-
-	q = bdev_get_queue(bdev);
 	if (!q)
 		return -ENXIO;
 
@@ -379,12 +381,14 @@ int blkdev_issue_discard(struct block_de
 		return -EOPNOTSUPP;
 
 	while (nr_sects && !ret) {
-		bio = bio_alloc(gfp_mask, 0);
+		struct bio *bio = bio_alloc(gfp_mask, 0);
 		if (!bio)
 			return -ENOMEM;
 
 		bio->bi_end_io = blkdev_discard_end_io;
 		bio->bi_bdev = bdev;
+		if (flags & DISCARD_FL_WAIT)
+			bio->bi_private = &wait;
 
 		bio->bi_sector = sector;
 
@@ -396,10 +400,13 @@ int blkdev_issue_discard(struct block_de
 			bio->bi_size = nr_sects << 9;
 			nr_sects = 0;
 		}
+
 		bio_get(bio);
-		submit_bio(DISCARD_BARRIER, bio);
+		submit_bio(type, bio);
+
+		if (flags & DISCARD_FL_WAIT)
+			wait_for_completion(&wait);
 
-		/* Check if it failed immediately */
 		if (bio_flagged(bio, BIO_EOPNOTSUPP))
 			ret = -EOPNOTSUPP;
 		else if (!bio_flagged(bio, BIO_UPTODATE))
Index: linux-2.6/block/ioctl.c
===================================================================
--- linux-2.6.orig/block/ioctl.c	2009-08-29 16:49:43.075370172 -0300
+++ linux-2.6/block/ioctl.c	2009-08-29 16:49:52.743341374 -0300
@@ -112,22 +112,9 @@ static int blkdev_reread_part(struct blo
 	return res;
 }
 
-static void blk_ioc_discard_endio(struct bio *bio, int err)
-{
-	if (err) {
-		if (err == -EOPNOTSUPP)
-			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
-		clear_bit(BIO_UPTODATE, &bio->bi_flags);
-	}
-	complete(bio->bi_private);
-}
-
 static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 			     uint64_t len)
 {
-	struct request_queue *q = bdev_get_queue(bdev);
-	int ret = 0;
-
 	if (start & 511)
 		return -EINVAL;
 	if (len & 511)
@@ -137,40 +124,8 @@ static int blk_ioctl_discard(struct bloc
 
 	if (start + len > (bdev->bd_inode->i_size >> 9))
 		return -EINVAL;
-
-	if (!q->prepare_discard_fn)
-		return -EOPNOTSUPP;
-
-	while (len && !ret) {
-		DECLARE_COMPLETION_ONSTACK(wait);
-		struct bio *bio;
-
-		bio = bio_alloc(GFP_KERNEL, 0);
-
-		bio->bi_end_io = blk_ioc_discard_endio;
-		bio->bi_bdev = bdev;
-		bio->bi_private = &wait;
-		bio->bi_sector = start;
-
-		if (len > queue_max_hw_sectors(q)) {
-			bio->bi_size = queue_max_hw_sectors(q) << 9;
-			len -= queue_max_hw_sectors(q);
-			start += queue_max_hw_sectors(q);
-		} else {
-			bio->bi_size = len << 9;
-			len = 0;
-		}
-		submit_bio(DISCARD_NOBARRIER, bio);
-
-		wait_for_completion(&wait);
-
-		if (bio_flagged(bio, BIO_EOPNOTSUPP))
-			ret = -EOPNOTSUPP;
-		else if (!bio_flagged(bio, BIO_UPTODATE))
-			ret = -EIO;
-		bio_put(bio);
-	}
-	return ret;
+	return blkdev_issue_discard(bdev, start, len, GFP_KERNEL,
+				    DISCARD_FL_WAIT);
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
Index: linux-2.6/fs/btrfs/extent-tree.c
===================================================================
--- linux-2.6.orig/fs/btrfs/extent-tree.c	2009-08-29 16:49:43.079341766 -0300
+++ linux-2.6/fs/btrfs/extent-tree.c	2009-08-29 16:49:52.743341374 -0300
@@ -1511,7 +1511,8 @@ static int remove_extent_backref(struct 
 static void btrfs_issue_discard(struct block_device *bdev,
 				u64 start, u64 len)
 {
-	blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL);
+	blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
+			     DISCARD_FL_BARRIER);
 }
 #endif
 
Index: linux-2.6/fs/gfs2/rgrp.c
===================================================================
--- linux-2.6.orig/fs/gfs2/rgrp.c	2009-08-29 16:49:43.139342340 -0300
+++ linux-2.6/fs/gfs2/rgrp.c	2009-08-29 16:49:52.747373241 -0300
@@ -857,7 +857,8 @@ static void gfs2_rgrp_send_discards(stru
 					goto start_new_extent;
 				if ((start + nr_sects) != blk) {
 					rv = blkdev_issue_discard(bdev, start,
-							    nr_sects, GFP_NOFS);
+							    nr_sects, GFP_NOFS,
+							    DISCARD_FL_BARRIER);
 					if (rv)
 						goto fail;
 					nr_sects = 0;
@@ -871,7 +872,8 @@ start_new_extent:
 		}
 	}
 	if (nr_sects) {
-		rv = blkdev_issue_discard(bdev, start, nr_sects, GFP_NOFS);
+		rv = blkdev_issue_discard(bdev, start, nr_sects, GFP_NOFS,
+					 DISCARD_FL_BARRIER);
 		if (rv)
 			goto fail;
 	}
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2009-08-29 16:49:43.147343148 -0300
+++ linux-2.6/include/linux/blkdev.h	2009-08-29 17:43:19.767342397 -0300
@@ -977,15 +977,18 @@ static inline struct request *blk_map_qu
 }
 
 extern int blkdev_issue_flush(struct block_device *, sector_t *);
-extern int blkdev_issue_discard(struct block_device *,
-				sector_t sector, sector_t nr_sects, gfp_t);
+#define DISCARD_FL_WAIT		0x01	/* wait for completion */
+#define DISCARD_FL_BARRIER	0x02	/* issue DISCARD_BARRIER request */
+extern int blkdev_issue_discard(struct block_device *, sector_t sector,
+		sector_t nr_sects, gfp_t, int flags);
 
 static inline int sb_issue_discard(struct super_block *sb,
 				   sector_t block, sector_t nr_blocks)
 {
 	block <<= (sb->s_blocksize_bits - 9);
 	nr_blocks <<= (sb->s_blocksize_bits - 9);
-	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL);
+	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
+				    DISCARD_FL_BARRIER);
 }
 
 extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c	2009-08-29 16:49:43.151342120 -0300
+++ linux-2.6/mm/swapfile.c	2009-08-29 16:49:52.755377749 -0300
@@ -161,7 +161,8 @@ static int discard_swap(struct swap_info
 		}
 
 		err = blkdev_issue_discard(si->bdev, start_block,
-						nr_blocks, GFP_KERNEL);
+						nr_blocks, GFP_KERNEL,
+						DISCARD_FL_BARRIER);
 		if (err)
 			break;
 
@@ -200,7 +201,8 @@ static void discard_swap_cluster(struct 
 			start_block <<= PAGE_SHIFT - 9;
 			nr_blocks <<= PAGE_SHIFT - 9;
 			if (blkdev_issue_discard(si->bdev, start_block,
-							nr_blocks, GFP_NOIO))
+							nr_blocks, GFP_NOIO,
+							DISCARD_FL_BARRIER))
 				break;
 		}
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 3/7] block: discard may need to allocate pages
  2009-08-29 23:03 [PATCH 0/7] discard support revisited Christoph Hellwig
  2009-08-29 23:03 ` [PATCH 1/7] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads Christoph Hellwig
  2009-08-29 23:03 ` [PATCH 2/7] block: use blkdev_issue_discard in blk_ioctl_discard Christoph Hellwig
@ 2009-08-29 23:03 ` Christoph Hellwig
  2009-08-29 23:03 ` [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit Christoph Hellwig
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-29 23:03 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-kernel
  Cc: liml, jens.axboe, matthew, dwmw2, Matthew Wilcox

[-- Attachment #1: discard-add-payload --]
[-- Type: text/plain, Size: 3050 bytes --]

SCSI and ATA both need to send data to the device.  In order to do this,
the BIO must be allocated with room for a page to be added, and the bio
needs to be passed to the discard prep function.  We also need to free
the page attached to the BIO before we free it.

init_request_from_bio() is not currently called from a context which
forbids sleeping, and to make sure it stays that was (so we don't have
to use GFP_ATOMIC), add a might_sleep() to it.

[hch: moved the might_sleep into the discard section, it can trigger
 easily for normal requests]

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>

Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2009-08-29 19:19:47.351880578 -0300
+++ linux-2.6/block/blk-barrier.c	2009-08-29 19:19:58.087575976 -0300
@@ -383,7 +383,7 @@ int blkdev_issue_discard(struct block_de
 		return -EOPNOTSUPP;
 
 	while (nr_sects && !ret) {
-		struct bio *bio = bio_alloc(gfp_mask, 0);
+		struct bio *bio = bio_alloc(gfp_mask, 1);
 		if (!bio)
 			return -ENOMEM;
 
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c	2009-08-29 18:16:52.427841223 -0300
+++ linux-2.6/block/blk-core.c	2009-08-29 19:19:28.639880676 -0300
@@ -1124,10 +1124,11 @@ void init_request_from_bio(struct reques
 		req->cmd_flags |= REQ_FAILFAST_DRIVER;
 
 	if (unlikely(bio_discard(bio))) {
+		might_sleep();
 		req->cmd_flags |= REQ_DISCARD;
 		if (bio_barrier(bio))
 			req->cmd_flags |= REQ_SOFTBARRIER;
-		req->q->prepare_discard_fn(req->q, req);
+		req->q->prepare_discard_fn(req->q, req, bio);
 	} else if (unlikely(bio_barrier(bio)))
 		req->cmd_flags |= REQ_HARDBARRIER;
 
Index: linux-2.6/drivers/mtd/mtd_blkdevs.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtd_blkdevs.c	2009-08-29 18:16:52.639841570 -0300
+++ linux-2.6/drivers/mtd/mtd_blkdevs.c	2009-08-29 18:17:22.799373769 -0300
@@ -33,7 +33,7 @@ struct mtd_blkcore_priv {
 };
 
 static int blktrans_discard_request(struct request_queue *q,
-				    struct request *req)
+				    struct request *req, struct bio *bio)
 {
 	req->cmd_type = REQ_TYPE_LINUX_BLOCK;
 	req->cmd[0] = REQ_LB_OP_DISCARD;
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2009-08-29 18:16:52.899841155 -0300
+++ linux-2.6/include/linux/blkdev.h	2009-08-29 18:17:22.803373020 -0300
@@ -255,7 +255,8 @@ typedef void (request_fn_proc) (struct r
 typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);
 typedef int (prep_rq_fn) (struct request_queue *, struct request *);
 typedef void (unplug_fn) (struct request_queue *);
-typedef int (prepare_discard_fn) (struct request_queue *, struct request *);
+typedef int (prepare_discard_fn) (struct request_queue *, struct request *,
+				  struct bio *bio);
 
 struct bio_vec;
 struct bvec_merge_data {

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit
  2009-08-29 23:03 [PATCH 0/7] discard support revisited Christoph Hellwig
                   ` (2 preceding siblings ...)
  2009-08-29 23:03 ` [PATCH 3/7] block: discard may need to allocate pages Christoph Hellwig
@ 2009-08-29 23:03 ` Christoph Hellwig
  2009-08-30  0:43   ` Douglas Gilbert
  2009-08-30 11:12   ` Sergei Shtylyov
  2009-08-29 23:03 ` [PATCH 5/7] libata: Add support for TRIM Christoph Hellwig
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-29 23:03 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-kernel; +Cc: liml, jens.axboe, matthew, dwmw2

[-- Attachment #1: discard-add-scsi-write-same-support --]
[-- Type: text/plain, Size: 3855 bytes --]

Add a prepare_discard function to sd that sends a WRITE SAME request with
the unmap bit set to the device if it advertises thin provisioning support.


Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/drivers/scsi/sd.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sd.c	2009-08-29 19:19:36.067371669 -0300
+++ linux-2.6/drivers/scsi/sd.c	2009-08-29 19:26:20.723754241 -0300
@@ -911,6 +911,50 @@ static void sd_prepare_flush(struct requ
 	rq->cmd_len = 10;
 }
 
+static int sd_prepare_discard(struct request_queue *q, struct request *rq,
+		struct bio *bio)
+{
+	struct scsi_device *sdp = q->queuedata;
+	struct page *page = alloc_page(GFP_KERNEL);
+
+	if (!page)
+		return -ENOMEM;
+
+	rq->cmd_type = REQ_TYPE_BLOCK_PC;
+	rq->timeout = SD_TIMEOUT;
+	rq->cmd[0] = WRITE_SAME_16;
+	rq->cmd[1] = 0x8; /* UNMAP bit */
+	rq->cmd[2] = sizeof(bio->bi_sector) > 4 ?
+			(unsigned char) (bio->bi_sector >> 56) & 0xff : 0;
+	rq->cmd[3] = sizeof(bio->bi_sector) > 4 ?
+			(unsigned char) (bio->bi_sector >> 48) & 0xff : 0;
+	rq->cmd[4] = sizeof(bio->bi_sector) > 4 ?
+			(unsigned char) (bio->bi_sector >> 40) & 0xff : 0;
+	rq->cmd[5] = sizeof(bio->bi_sector) > 4 ?
+			(unsigned char) (bio->bi_sector >> 32) & 0xff : 0;
+	rq->cmd[6] = (unsigned char) (bio->bi_sector >> 24) & 0xff;
+	rq->cmd[7] = (unsigned char) (bio->bi_sector >> 16) & 0xff;
+	rq->cmd[8] = (unsigned char) (bio->bi_sector >> 8) & 0xff;
+	rq->cmd[9] = (unsigned char) bio->bi_sector & 0xff;
+	rq->cmd[10] = (unsigned char) (bio_sectors(bio) >> 24) & 0xff;
+	rq->cmd[11] = (unsigned char) (bio_sectors(bio) >> 16) & 0xff;
+	rq->cmd[12] = (unsigned char) (bio_sectors(bio) >> 8) & 0xff;
+	rq->cmd[13] = (unsigned char) bio_sectors(bio) & 0xff;
+	rq->cmd[14] = 0;
+	rq->cmd[15] = 0;
+	rq->cmd_len = 16;
+
+	printk(KERN_INFO "umap, lba = 0x%lld, len = %d\n",
+	       bio->bi_sector, bio_sectors(bio));
+
+	bio->bi_size = 0;
+	if (bio_add_pc_page(q, bio, page, sdp->sector_size, 0) <
+			sdp->sector_size)
+		return -EIO;
+
+	return 0;
+}
+
 static void sd_rescan(struct device *dev)
 {
 	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
@@ -1369,6 +1413,9 @@ static int read_capacity_16(struct scsi_
 		sd_printk(KERN_NOTICE, sdkp,
 			  "physical block alignment offset: %u\n", alignment);
 
+	if (buffer[14] & 0x80)
+		sdkp->thin_provisioning = 1;
+
 	sdkp->capacity = lba + 1;
 	return sector_size;
 }
@@ -1915,6 +1962,8 @@ static int sd_revalidate_disk(struct gen
 		ordered = QUEUE_ORDERED_DRAIN;
 
 	blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush);
+	if (sdkp->thin_provisioning && !sdp->request_queue->prepare_discard_fn)
+		blk_queue_set_discard(sdkp->disk->queue, sd_prepare_discard);
 
 	set_capacity(disk, sdkp->capacity);
 	kfree(buffer);
Index: linux-2.6/include/scsi/scsi.h
===================================================================
--- linux-2.6.orig/include/scsi/scsi.h	2009-08-29 19:19:36.079340649 -0300
+++ linux-2.6/include/scsi/scsi.h	2009-08-29 19:20:01.995378150 -0300
@@ -122,6 +122,8 @@ struct scsi_cmnd;
 #define READ_16               0x88
 #define WRITE_16              0x8a
 #define VERIFY_16	      0x8f
+#define WRITE_SAME_16	      0x93
+
 #define SERVICE_ACTION_IN     0x9e
 /* values for service action in */
 #define	SAI_READ_CAPACITY_16  0x10
Index: linux-2.6/drivers/scsi/sd.h
===================================================================
--- linux-2.6.orig/drivers/scsi/sd.h	2009-08-29 19:19:36.071341377 -0300
+++ linux-2.6/drivers/scsi/sd.h	2009-08-29 19:20:01.995378150 -0300
@@ -55,6 +55,7 @@ struct scsi_disk {
 	unsigned	RCD : 1;	/* state of disk RCD bit, unused */
 	unsigned	DPOFUA : 1;	/* state of disk DPOFUA bit */
 	unsigned	first_scan : 1;
+	unsigned	thin_provisioning : 1;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)
 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 5/7] libata: Add support for TRIM
  2009-08-29 23:03 [PATCH 0/7] discard support revisited Christoph Hellwig
                   ` (3 preceding siblings ...)
  2009-08-29 23:03 ` [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit Christoph Hellwig
@ 2009-08-29 23:03 ` Christoph Hellwig
  2009-08-29 23:03 ` [PATCH 6/7] block: allow large discard requests Christoph Hellwig
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-29 23:03 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-kernel
  Cc: liml, jens.axboe, matthew, dwmw2, Matthew Wilcox

[-- Attachment #1: discard-add-libata-trim-support --]
[-- Type: text/plain, Size: 2262 bytes --]

From: Matthew Wilcox <willy@linux.intel.com>

libata: Add support for TRIM

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---

Index: linux-2.6/drivers/ata/libata-scsi.c
===================================================================
--- linux-2.6.orig/drivers/ata/libata-scsi.c	2009-08-29 15:53:44.851841524 -0300
+++ linux-2.6/drivers/ata/libata-scsi.c	2009-08-29 19:25:16.319344322 -0300
@@ -1051,6 +1051,46 @@ static void ata_gen_ata_sense(struct ata
 	desc[11] = block;
 }
 
+static int ata_discard_fn(struct request_queue *q, struct request *req,
+							struct bio *bio)
+{
+	unsigned size;
+	struct page *page = alloc_page(GFP_KERNEL);
+	if (!page)
+		goto error;
+
+	size = ata_set_lba_range_entries(page_address(page), PAGE_SIZE / 8,
+					bio->bi_sector, bio_sectors(bio));
+	bio->bi_size = 0;
+	if (bio_add_pc_page(q, bio, page, size, 0) < size)
+		goto free_page;
+
+	req->cmd_type = REQ_TYPE_BLOCK_PC;
+	req->cmd_len = 16;
+	req->cmd[0] = ATA_16;
+	req->cmd[1] = (6 << 1) | 1;		/* dma, 48-bit */
+	req->cmd[2] = 0x6;			/* length, direction */
+	req->cmd[3] = 0;			/* feature high */
+	req->cmd[4] = ATA_DSM_TRIM;		/* feature low */
+	req->cmd[5] = (size / 512) >> 8;	/* nsect high */
+	req->cmd[6] = size / 512;		/* nsect low */
+	req->cmd[7] = 0;			/* lba */
+	req->cmd[8] = 0;			/* lba */
+	req->cmd[9] = 0;			/* lba */
+	req->cmd[10] = 0;			/* lba */
+	req->cmd[11] = 0;			/* lba */
+	req->cmd[12] = 0;			/* lba */
+	req->cmd[13] = ATA_LBA;			/* device */
+	req->cmd[14] = ATA_CMD_DSM;		/* command */
+	req->cmd[15] = 0;			/* control */
+
+	return 0;
+ free_page:
+	__free_page(page);
+ error:
+	return -ENOMEM;
+}
+
 static void ata_scsi_sdev_config(struct scsi_device *sdev)
 {
 	sdev->use_10_for_rw = 1;
@@ -1099,6 +1139,9 @@ static int ata_scsi_dev_config(struct sc
 	/* configure max sectors */
 	blk_queue_max_sectors(sdev->request_queue, dev->max_sectors);
 
+	if (ata_id_has_trim(dev->id))
+		blk_queue_set_discard(sdev->request_queue, ata_discard_fn);
+
 	if (dev->class == ATA_DEV_ATAPI) {
 		struct request_queue *q = sdev->request_queue;
 		void *buf;
@@ -2431,6 +2474,8 @@ static unsigned int ata_scsiop_read_cap(
 		rbuf[15] = lowest_aligned;
 	}
 
+	if (ata_id_has_trim(args->id))
+		rbuf[14] = 0x80;
 	return 0;
 }
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 6/7] block: allow large discard requests
  2009-08-29 23:03 [PATCH 0/7] discard support revisited Christoph Hellwig
                   ` (4 preceding siblings ...)
  2009-08-29 23:03 ` [PATCH 5/7] libata: Add support for TRIM Christoph Hellwig
@ 2009-08-29 23:03 ` Christoph Hellwig
  2009-08-30  2:49   ` Mark Lord
  2009-08-29 23:03 ` [PATCH 7/7] xfs: add batches discard support Christoph Hellwig
  2009-08-29 23:37 ` [PATCH 0/7] discard support revisited Matthew Wilcox
  7 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-29 23:03 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-kernel; +Cc: liml, jens.axboe, matthew, dwmw2

[-- Attachment #1: discard-allow-large-requests --]
[-- Type: text/plain, Size: 6435 bytes --]

Currently we set the bio size to the byte equivalent of the blocks to
be trimmed when submitting the initial DISCARD ioctl.  That means it
is subject to the max_hw_sectors limitation of the HBA which is
much lower than the size of a DISCARD request we can support.
Add a separate max_discard_bytes tunable to limit the size for discard
requests.  Currently for all implementations it's all we can fit into
the 32 bit variable that holds the bio size, although it could be much
larger if we had a way to pass it through the block layer as a number
of sectors.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c	2009-08-29 19:31:29.891344258 -0300
+++ linux-2.6/block/blk-core.c	2009-08-29 19:31:56.891344537 -0300
@@ -1431,7 +1431,8 @@ static inline void __generic_make_reques
 			goto end_io;
 		}
 
-		if (unlikely(nr_sectors > queue_max_hw_sectors(q))) {
+		if (unlikely(!bio_discard(bio) &&
+			     nr_sectors > queue_max_hw_sectors(q))) {
 			printk(KERN_ERR "bio too big device %s (%u > %u)\n",
 			       bdevname(bio->bi_bdev, b),
 			       bio_sectors(bio),
Index: linux-2.6/block/blk-settings.c
===================================================================
--- linux-2.6.orig/block/blk-settings.c	2009-08-29 19:31:29.911344985 -0300
+++ linux-2.6/block/blk-settings.c	2009-08-29 19:31:56.895345046 -0300
@@ -35,8 +35,9 @@ EXPORT_SYMBOL(blk_queue_prep_rq);
 
 /**
  * blk_queue_set_discard - set a discard_sectors function for queue
- * @q:		queue
- * @dfn:	prepare_discard function
+ * @q:			queue
+ * @dfn:		prepare_discard function
+ * @max_discard:	maximum number of bytes for a single discard request
  *
  * It's possible for a queue to register a discard callback which is used
  * to transform a discard request into the appropriate type for the
@@ -44,8 +45,10 @@ EXPORT_SYMBOL(blk_queue_prep_rq);
  * with %EOPNOTSUPP.
  *
  */
-void blk_queue_set_discard(struct request_queue *q, prepare_discard_fn *dfn)
+void blk_queue_set_discard(struct request_queue *q, prepare_discard_fn *dfn,
+		unsigned int max_discard)
 {
+	q->limits.max_discard_bytes = max_discard;
 	q->prepare_discard_fn = dfn;
 }
 EXPORT_SYMBOL(blk_queue_set_discard);
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2009-08-29 19:31:29.951344623 -0300
+++ linux-2.6/include/linux/blkdev.h	2009-08-29 19:31:56.895345046 -0300
@@ -308,6 +308,7 @@ struct queue_limits {
 	unsigned int		alignment_offset;
 	unsigned int		io_min;
 	unsigned int		io_opt;
+	unsigned int		max_discard_bytes;
 
 	unsigned short		logical_block_size;
 	unsigned short		max_hw_segments;
@@ -935,7 +936,8 @@ extern void blk_queue_merge_bvec(struct 
 extern void blk_queue_dma_alignment(struct request_queue *, int);
 extern void blk_queue_update_dma_alignment(struct request_queue *, int);
 extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
-extern void blk_queue_set_discard(struct request_queue *, prepare_discard_fn *);
+extern void blk_queue_set_discard(struct request_queue *, prepare_discard_fn *,
+		unsigned int max_discard);
 extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2009-08-29 19:31:29.931344315 -0300
+++ linux-2.6/block/blk-barrier.c	2009-08-29 19:31:56.899380824 -0300
@@ -384,6 +384,8 @@ int blkdev_issue_discard(struct block_de
 
 	while (nr_sects && !ret) {
 		struct bio *bio = bio_alloc(gfp_mask, 1);
+		unsigned int max_discard_sectors;
+
 		if (!bio)
 			return -ENOMEM;
 
@@ -394,10 +396,11 @@ int blkdev_issue_discard(struct block_de
 
 		bio->bi_sector = sector;
 
-		if (nr_sects > queue_max_hw_sectors(q)) {
-			bio->bi_size = queue_max_hw_sectors(q) << 9;
-			nr_sects -= queue_max_hw_sectors(q);
-			sector += queue_max_hw_sectors(q);
+		max_discard_sectors = q->limits.max_discard_bytes >> 9;
+		if (nr_sects > max_discard_sectors) {
+			bio->bi_size = q->limits.max_discard_bytes;
+			nr_sects -= max_discard_sectors;
+			sector += max_discard_sectors;
 		} else {
 			bio->bi_size = nr_sects << 9;
 			nr_sects = 0;
Index: linux-2.6/drivers/mtd/mtd_blkdevs.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtd_blkdevs.c	2009-08-29 19:31:29.971356036 -0300
+++ linux-2.6/drivers/mtd/mtd_blkdevs.c	2009-08-29 19:31:56.903369320 -0300
@@ -381,7 +381,7 @@ int register_mtd_blktrans(struct mtd_blk
 	blk_queue_logical_block_size(tr->blkcore_priv->rq, tr->blksize);
 	if (tr->discard)
 		blk_queue_set_discard(tr->blkcore_priv->rq,
-				      blktrans_discard_request);
+				      blktrans_discard_request, 0xffffffff);
 
 	tr->blkshift = ffs(tr->blksize) - 1;
 
Index: linux-2.6/drivers/scsi/sd.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sd.c	2009-08-29 19:31:29.983344641 -0300
+++ linux-2.6/drivers/scsi/sd.c	2009-08-29 19:31:56.907380585 -0300
@@ -1963,7 +1963,8 @@ static int sd_revalidate_disk(struct gen
 
 	blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush);
 	if (sdkp->thin_provisioning && !sdp->request_queue->prepare_discard_fn)
-		blk_queue_set_discard(sdkp->disk->queue, sd_prepare_discard);
+		blk_queue_set_discard(sdkp->disk->queue, sd_prepare_discard,
+				      0xffffffff);
 
 	set_capacity(disk, sdkp->capacity);
 	kfree(buffer);
Index: linux-2.6/drivers/ata/libata-scsi.c
===================================================================
--- linux-2.6.orig/drivers/ata/libata-scsi.c	2009-08-29 19:32:05.391879870 -0300
+++ linux-2.6/drivers/ata/libata-scsi.c	2009-08-29 19:32:16.895376612 -0300
@@ -1140,7 +1140,8 @@ static int ata_scsi_dev_config(struct sc
 	blk_queue_max_sectors(sdev->request_queue, dev->max_sectors);
 
 	if (ata_id_has_trim(dev->id))
-		blk_queue_set_discard(sdev->request_queue, ata_discard_fn);
+		blk_queue_set_discard(sdev->request_queue, ata_discard_fn,
+				     0xffffffff);
 
 	if (dev->class == ATA_DEV_ATAPI) {
 		struct request_queue *q = sdev->request_queue;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 7/7] xfs: add batches discard support
  2009-08-29 23:03 [PATCH 0/7] discard support revisited Christoph Hellwig
                   ` (5 preceding siblings ...)
  2009-08-29 23:03 ` [PATCH 6/7] block: allow large discard requests Christoph Hellwig
@ 2009-08-29 23:03 ` Christoph Hellwig
  2009-08-29 23:37 ` [PATCH 0/7] discard support revisited Matthew Wilcox
  7 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-29 23:03 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-kernel; +Cc: liml, jens.axboe, matthew, dwmw2

[-- Attachment #1: xfs-add-trim-support-2 --]
[-- Type: text/plain, Size: 6248 bytes --]

Add support for discarding all currently unused space by an ioctl.  Only
intended as demonstration and not for merging.

Use the following small tool to exercise it:


#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define XFS_IOC_TRIM                 _IOR ('X', 126, uint32_t)


int main(int argc, char **argv)
{
	int minsize = 4096;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, XFS_IOC_TRIM, &minsize)) {
		if (errno == EOPNOTSUPP)
			fprintf(stderr, "TRIM not supported\n");
		else
			perror("XFS_IOC_TRIM");
		return 1;
	}

	return 0;
}


Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/xfs/linux-2.6/xfs_ioctl.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl.c	2009-08-29 15:53:27.319844716 -0300
+++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl.c	2009-08-29 16:51:56.271867967 -0300
@@ -1274,6 +1274,31 @@ xfs_ioc_getbmapx(
 	return 0;
 }
 
+STATIC int
+xfs_ioc_trim(
+	struct xfs_mount	*mp,
+	__uint32_t		*argp)
+{
+	xfs_agnumber_t		agno;
+	int			error = 0;
+	__uint32_t		minlen;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (get_user(minlen, argp))
+		return -EFAULT;
+
+	down_read(&mp->m_peraglock);
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		error = -xfs_trim_extents(mp, agno, minlen);
+		if (error)
+			break;
+	}
+	up_read(&mp->m_peraglock);
+
+	return error;
+}
+
 /*
  * Note: some of the ioctl's return positive numbers as a
  * byte count indicating success, such as readlink_by_handle.
@@ -1523,6 +1548,9 @@ xfs_file_ioctl(
 		error = xfs_errortag_clearall(mp, 1);
 		return -error;
 
+	case XFS_IOC_TRIM:
+		return xfs_ioc_trim(mp, arg);
+
 	default:
 		return -ENOTTY;
 	}
Index: linux-2.6/fs/xfs/xfs_alloc.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_alloc.c	2009-08-29 15:53:27.355845733 -0300
+++ linux-2.6/fs/xfs/xfs_alloc.c	2009-08-29 16:59:20.451343922 -0300
@@ -2609,6 +2609,96 @@ error0:
 	return error;
 }
 
+STATIC int
+xfs_trim_extent(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		fbno,
+	xfs_extlen_t		flen)
+{
+	xfs_daddr_t		blkno = XFS_AGB_TO_DADDR(mp, agno, fbno);
+	sector_t		nblks = XFS_FSB_TO_BB(mp, flen);
+	int			error;
+
+	xfs_fs_cmn_err(CE_NOTE, mp, "discarding sectors [0x%llx-0x%llx]",
+			blkno, nblks);
+
+	error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, blkno, nblks,
+				      GFP_NOFS, DISCARD_FL_WAIT);
+	if (error && error != EOPNOTSUPP)
+		xfs_fs_cmn_err(CE_NOTE, mp, "discard failed, error %d", error);
+	return error;
+}
+
+/*
+ * Notify the underlying block device about our free extent map.
+ *
+ * This walks all free extents above a minimum threshold and notifies the
+ * underlying device that these blocks are unused.  That information is
+ * useful for SSDs or thinly provisioned storage in high end arrays or
+ * virtualization scenarios.
+ */
+int
+xfs_trim_extents(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		minlen)	/* minimum extent size to bother */
+{
+	struct xfs_btree_cur	*cur;	/* cursor for the by-block btree */
+	struct xfs_buf		*agbp;	/* AGF buffer pointer */
+	xfs_agblock_t		bno;	/* block the for next search */
+	xfs_agblock_t		fbno;	/* start block of found extent */
+	xfs_extlen_t		flen;	/* length of found extent */
+	int			error;
+	int			i;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+
+	bno = 0;
+	for (;;) {
+		cur = xfs_allocbt_init_cursor(mp, NULL, agbp, agno,
+					      XFS_BTNUM_BNO);
+
+		error = xfs_alloc_lookup_ge(cur, bno, minlen, &i);
+		if (error)
+			goto error0;
+		if (!i) {
+			/*
+			 * No more free extents found: done.
+			 */
+			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+			break;
+		}
+
+		error = xfs_alloc_get_rec(cur, &fbno, &flen, &i);
+		if (error)
+			goto error0;
+		XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
+
+		/*
+		 * Pass if the freespace extent isn't long enough to bother.
+		 */
+		if (flen >= minlen) {
+			error = xfs_trim_extent(mp, agno, fbno, flen);
+			if (error) {
+				xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+				break;
+			}
+		}
+
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+		bno = fbno + flen;
+	}
+
+out:
+	xfs_buf_relse(agbp);
+	return error;
+error0:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	goto out;
+}
 
 /*
  * AG Busy list management
Index: linux-2.6/fs/xfs/xfs_alloc.h
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_alloc.h	2009-08-29 15:53:27.371844485 -0300
+++ linux-2.6/fs/xfs/xfs_alloc.h	2009-08-29 16:51:56.271867967 -0300
@@ -215,4 +215,7 @@ xfs_free_extent(
 	xfs_fsblock_t	bno,	/* starting block number of extent */
 	xfs_extlen_t	len);	/* length of extent */
 
+int xfs_trim_extents(struct xfs_mount *mp, xfs_agnumber_t agno,
+	xfs_extlen_t minlen);
+
 #endif	/* __XFS_ALLOC_H__ */
Index: linux-2.6/fs/xfs/xfs_fs.h
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_fs.h	2009-08-29 15:53:27.391844445 -0300
+++ linux-2.6/fs/xfs/xfs_fs.h	2009-08-29 16:51:56.279865211 -0300
@@ -475,6 +475,7 @@ typedef struct xfs_handle {
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 124, struct xfs_fsop_geom)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
+#define XFS_IOC_TRIM		     _IOR ('X', 126, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
Index: linux-2.6/fs/xfs/linux-2.6/xfs_ioctl32.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl32.c	2009-08-29 15:53:27.339845024 -0300
+++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl32.c	2009-08-29 16:51:56.283864672 -0300
@@ -563,6 +563,7 @@ xfs_file_compat_ioctl(
 	case XFS_IOC_GOINGDOWN:
 	case XFS_IOC_ERROR_INJECTION:
 	case XFS_IOC_ERROR_CLEARALL:
+	case XFS_IOC_TRIM:
 		return xfs_file_ioctl(filp, cmd, p);
 #ifndef BROKEN_X86_ALIGNMENT
 	/* These are handled fine if no alignment issues */

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] discard support revisited
  2009-08-29 23:03 [PATCH 0/7] discard support revisited Christoph Hellwig
                   ` (6 preceding siblings ...)
  2009-08-29 23:03 ` [PATCH 7/7] xfs: add batches discard support Christoph Hellwig
@ 2009-08-29 23:37 ` Matthew Wilcox
  2009-08-30  2:15   ` Christoph Hellwig
  7 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2009-08-29 23:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, linux-ide, linux-kernel, liml, jens.axboe, dwmw2

On Sat, Aug 29, 2009 at 07:03:32PM -0400, Christoph Hellwig wrote:
> I've started working on discard support a little bit more, and I think
> it's time to get an overview over the state we're in right now.  This
> patchswt is what I needed to get the batched discard support in XFS
> properly working.  Many patches are just forward ported from Matthew's
> SSD tree and I hope quilt mail didn't lose the attributions.

Funny, I was working on forward-porting my patches this morning.

>  - blkdev_issue_discard is now a lot more generic - it grows a flags
>    argument to control if we issue a barrier discard request or not
>    or if we wait or not.  That is needed for the batched discard support
>    in XFS which wants to wait just like the ioctl interface.

Seems fine.

>  - I have implemented support for sending WRITE SAME requests with the
>    unmap bit set in sd.  This has been tested with a qemu-based backed
>    only so far, but we'll get some real array coverage soon.

I think we're going to need to figure out whether we should be sending
UNMAP or WRITE SAME ... probably need to dive back into the T10 poostorm
to see what's going on.

>  - there is a new patch to allow larger discard requests.
>  - the last patch is just my old batched discard patch for xfs ported
>    to work ontop of this series.
> 
> I would really love to see some progress on this in the 2.6.32 circle.
> We should at least get the block layer bits in that allow implementing
> a somewhat useful discard function.  I would also love to see the
> actual scsi and libata implementations in so that we can start playing
> around with it.  But given the speed up the current TRIM implementations
> and the expectations for WRITE SAME we should make sure the exact
> TRIM tracking is not actually enabled anywhere by default for now.

Jens had some objections to the block layer bits last time I posted
these.  I forget what they were now (this would have been around May
2nd, I think).  What I've done instead in my current patchset (which
undoubtedly has bugs because it isn't tested, because I'm not supposed
to be working on the weekends) is to make sd_prep_fn() call a new method
in the scsi_host_template.  That should translate the discard request
into a BLOCK_PC ATA_16 command, and we'll all be happy.

It goes a little something like this:
http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090829

Right now, the test tool is telling me 'Operation not supported', and
I haven't tried to figure out why yet.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit
  2009-08-29 23:03 ` [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit Christoph Hellwig
@ 2009-08-30  0:43   ` Douglas Gilbert
  2009-08-30  1:05     ` Christoph Hellwig
  2009-08-30 11:12   ` Sergei Shtylyov
  1 sibling, 1 reply; 30+ messages in thread
From: Douglas Gilbert @ 2009-08-30  0:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, linux-ide, linux-kernel, liml, jens.axboe, matthew, dwmw2

Christoph Hellwig wrote:
> Add a prepare_discard function to sd that sends a WRITE SAME request with
> the unmap bit set to the device if it advertises thin provisioning support.
> 
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: linux-2.6/drivers/scsi/sd.c
> ===================================================================
> --- linux-2.6.orig/drivers/scsi/sd.c	2009-08-29 19:19:36.067371669 -0300
> +++ linux-2.6/drivers/scsi/sd.c	2009-08-29 19:26:20.723754241 -0300
> @@ -911,6 +911,50 @@ static void sd_prepare_flush(struct requ
>  	rq->cmd_len = 10;
>  }
>  
> +static int sd_prepare_discard(struct request_queue *q, struct request *rq,
> +		struct bio *bio)
> +{
> +	struct scsi_device *sdp = q->queuedata;
> +	struct page *page = alloc_page(GFP_KERNEL);
> +
> +	if (!page)
> +		return -ENOMEM;
> +
> +	rq->cmd_type = REQ_TYPE_BLOCK_PC;
> +	rq->timeout = SD_TIMEOUT;
> +	rq->cmd[0] = WRITE_SAME_16;
> +	rq->cmd[1] = 0x8; /* UNMAP bit */
> +	rq->cmd[2] = sizeof(bio->bi_sector) > 4 ?
> +			(unsigned char) (bio->bi_sector >> 56) & 0xff : 0;
> +	rq->cmd[3] = sizeof(bio->bi_sector) > 4 ?
> +			(unsigned char) (bio->bi_sector >> 48) & 0xff : 0;
> +	rq->cmd[4] = sizeof(bio->bi_sector) > 4 ?
> +			(unsigned char) (bio->bi_sector >> 40) & 0xff : 0;
> +	rq->cmd[5] = sizeof(bio->bi_sector) > 4 ?
> +			(unsigned char) (bio->bi_sector >> 32) & 0xff : 0;
> +	rq->cmd[6] = (unsigned char) (bio->bi_sector >> 24) & 0xff;
> +	rq->cmd[7] = (unsigned char) (bio->bi_sector >> 16) & 0xff;
> +	rq->cmd[8] = (unsigned char) (bio->bi_sector >> 8) & 0xff;
> +	rq->cmd[9] = (unsigned char) bio->bi_sector & 0xff;
> +	rq->cmd[10] = (unsigned char) (bio_sectors(bio) >> 24) & 0xff;
> +	rq->cmd[11] = (unsigned char) (bio_sectors(bio) >> 16) & 0xff;
> +	rq->cmd[12] = (unsigned char) (bio_sectors(bio) >> 8) & 0xff;
> +	rq->cmd[13] = (unsigned char) bio_sectors(bio) & 0xff;
> +	rq->cmd[14] = 0;
> +	rq->cmd[15] = 0;
> +	rq->cmd_len = 16;
> +
> +	printk(KERN_INFO "umap, lba = 0x%lld, len = %d\n",
> +	       bio->bi_sector, bio_sectors(bio));
> +
> +	bio->bi_size = 0;
> +	if (bio_add_pc_page(q, bio, page, sdp->sector_size, 0) <
> +			sdp->sector_size)
> +		return -EIO;
> +
> +	return 0;
> +}
> +
>  static void sd_rescan(struct device *dev)
>  {
>  	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
> @@ -1369,6 +1413,9 @@ static int read_capacity_16(struct scsi_
>  		sd_printk(KERN_NOTICE, sdkp,
>  			  "physical block alignment offset: %u\n", alignment);
>  
> +	if (buffer[14] & 0x80)
> +		sdkp->thin_provisioning = 1;
> +

So you are checking the TPE bit (Thin Provisioning Enabled) but
not the TPRZ bit (Thin Provisioning Read Zeros). Shouldn't
there also be a sdkp->thin_provisioning_read_zeros bit defined
in struct scsi_disk?

Is your application well defined when TPRZ==0 ?

Doug Gilbert




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit
  2009-08-30  0:43   ` Douglas Gilbert
@ 2009-08-30  1:05     ` Christoph Hellwig
  2009-08-30  2:43       ` Douglas Gilbert
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-30  1:05 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-kernel, liml,
	jens.axboe, matthew, dwmw2

On Sat, Aug 29, 2009 at 08:43:27PM -0400, Douglas Gilbert wrote:
>>  +	if (buffer[14] & 0x80)
>> +		sdkp->thin_provisioning = 1;
>> +
>
> So you are checking the TPE bit (Thin Provisioning Enabled) but
> not the TPRZ bit (Thin Provisioning Read Zeros). Shouldn't
> there also be a sdkp->thin_provisioning_read_zeros bit defined
> in struct scsi_disk?
>
> Is your application well defined when TPRZ==0 ?

Filesystems do not care if these blocks are in a defined state, as
they must never return the content of uninitilized blocks to userspace.
Now if we do want to support discard through raid arrays we might start
to care, and will have check the TPRZ bit.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] discard support revisited
  2009-08-29 23:37 ` [PATCH 0/7] discard support revisited Matthew Wilcox
@ 2009-08-30  2:15   ` Christoph Hellwig
  2009-08-30  3:03     ` Matthew Wilcox
  2009-08-30 20:17     ` James Bottomley
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-30  2:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-kernel, liml,
	jens.axboe, dwmw2

On Sat, Aug 29, 2009 at 05:37:19PM -0600, Matthew Wilcox wrote:
> >  - I have implemented support for sending WRITE SAME requests with the
> >    unmap bit set in sd.  This has been tested with a qemu-based backed
> >    only so far, but we'll get some real array coverage soon.
> 
> I think we're going to need to figure out whether we should be sending
> UNMAP or WRITE SAME ... probably need to dive back into the T10 poostorm
> to see what's going on.

Good question.  Latest I had heard was that at least one array vendor
prefers the WRITE SAME.  To me it looks like the much saner interface
for the OS, so unless there are arrays that strongly prefer UNMAP or
we need to make use of the multiple extends feature in it I'd go with
WRITE SAME as first choice.

> > I would really love to see some progress on this in the 2.6.32 circle.
> > We should at least get the block layer bits in that allow implementing
> > a somewhat useful discard function.  I would also love to see the
> > actual scsi and libata implementations in so that we can start playing
> > around with it.  But given the speed up the current TRIM implementations
> > and the expectations for WRITE SAME we should make sure the exact
> > TRIM tracking is not actually enabled anywhere by default for now.
> 
> Jens had some objections to the block layer bits last time I posted
> these.  I forget what they were now (this would have been around May
> 2nd, I think).  What I've done instead in my current patchset (which
> undoubtedly has bugs because it isn't tested, because I'm not supposed
> to be working on the weekends) is to make sd_prep_fn() call a new method
> in the scsi_host_template.  That should translate the discard request
> into a BLOCK_PC ATA_16 command, and we'll all be happy.
> 
> It goes a little something like this:
> http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090829
> 
> Right now, the test tool is telling me 'Operation not supported', and
> I haven't tried to figure out why yet.

Queue flag and handling the discard in the prep function is much better
than the prepare function, yes.  I don't like the prep_fn callout to the
host a lot.  If we go with WRITE SAME as prefered discard option for
scsi translating it to TRIM should be relatively easy, it uses the same
LBA/length encoding as the regular WRITE_16, except that the payload is
just a single sector.  That should be not too hard to implement in the
SAT layer.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit
  2009-08-30  1:05     ` Christoph Hellwig
@ 2009-08-30  2:43       ` Douglas Gilbert
  2009-08-30  2:48         ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Douglas Gilbert @ 2009-08-30  2:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, linux-ide, linux-kernel, liml, jens.axboe, matthew, dwmw2

Christoph Hellwig wrote:
> On Sat, Aug 29, 2009 at 08:43:27PM -0400, Douglas Gilbert wrote:
>>>  +	if (buffer[14] & 0x80)
>>> +		sdkp->thin_provisioning = 1;
>>> +
>> So you are checking the TPE bit (Thin Provisioning Enabled) but
>> not the TPRZ bit (Thin Provisioning Read Zeros). Shouldn't
>> there also be a sdkp->thin_provisioning_read_zeros bit defined
>> in struct scsi_disk?
>>
>> Is your application well defined when TPRZ==0 ?
> 
> Filesystems do not care if these blocks are in a defined state, as
> they must never return the content of uninitilized blocks to userspace.
> Now if we do want to support discard through raid arrays we might start
> to care, and will have check the TPRZ bit.

Another reason to note the TPRZ bit is that if it is 1 then
the data given to WRITE SAME (16 and 32) must be a logical
block of zeros for the UNMAP bit to be honoured (sbc3r19.pdf
section 4.6.3.2 last paragraph).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit
  2009-08-30  2:43       ` Douglas Gilbert
@ 2009-08-30  2:48         ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-30  2:48 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: linux-scsi, linux-ide, linux-kernel, liml, jens.axboe, matthew, dwmw2

On Sat, Aug 29, 2009 at 10:43:44PM -0400, Douglas Gilbert wrote:
>> Filesystems do not care if these blocks are in a defined state, as
>> they must never return the content of uninitilized blocks to userspace.
>> Now if we do want to support discard through raid arrays we might start
>> to care, and will have check the TPRZ bit.
>
> Another reason to note the TPRZ bit is that if it is 1 then
> the data given to WRITE SAME (16 and 32) must be a logical
> block of zeros for the UNMAP bit to be honoured (sbc3r19.pdf
> section 4.6.3.2 last paragraph).

Ah, good hint.  I did in fact send down ZERO_PAGE(0) in an earlier
version, but the bio completion handler wasn't too happy with that
anymore after adding Willy's patch to free the page there.  I'll
either need to tweak it to make that conditional or zero the page
we allocated.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] block: allow large discard requests
  2009-08-29 23:03 ` [PATCH 6/7] block: allow large discard requests Christoph Hellwig
@ 2009-08-30  2:49   ` Mark Lord
  2009-08-30  2:50     ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Mark Lord @ 2009-08-30  2:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, linux-ide, linux-kernel, jens.axboe, matthew, dwmw2

Pardon my cursory review, but I don't see where this
patch set allows for more than a single extent/range
of sectors per TRIM command.

If it still does not group many extents into a single TRIM,
then it will be far to slow to be useful at all for SATA.

But if it does.. my apologies.  :)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] block: allow large discard requests
  2009-08-30  2:49   ` Mark Lord
@ 2009-08-30  2:50     ` Matthew Wilcox
  2009-08-30  2:52       ` Mark Lord
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2009-08-30  2:50 UTC (permalink / raw)
  To: Mark Lord
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-kernel,
	jens.axboe, dwmw2

On Sat, Aug 29, 2009 at 10:49:02PM -0400, Mark Lord wrote:
> Pardon my cursory review, but I don't see where this
> patch set allows for more than a single extent/range
> of sectors per TRIM command.

It doesn't.

> If it still does not group many extents into a single TRIM,
> then it will be far to slow to be useful at all for SATA.

That depends on the firmware.  I do have plans to group the extents,
but let's get something working before making it quick.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] block: allow large discard requests
  2009-08-30  2:50     ` Matthew Wilcox
@ 2009-08-30  2:52       ` Mark Lord
  2009-08-30  2:56         ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Mark Lord @ 2009-08-30  2:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-kernel,
	jens.axboe, dwmw2

Matthew Wilcox wrote:
> On Sat, Aug 29, 2009 at 10:49:02PM -0400, Mark Lord wrote:
>> Pardon my cursory review, but I don't see where this
>> patch set allows for more than a single extent/range
>> of sectors per TRIM command.
> 
> It doesn't.
> 
>> If it still does not group many extents into a single TRIM,
>> then it will be far to slow to be useful at all for SATA.
> 
> That depends on the firmware.  I do have plans to group the extents,
> but let's get something working before making it quick.
..

In this situation, it's more a case of making it *not sucky slow*
than getting to the point where we think about making it "quick".

One extent at a time is glacial.

Cheers


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] block: allow large discard requests
  2009-08-30  2:52       ` Mark Lord
@ 2009-08-30  2:56         ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-30  2:56 UTC (permalink / raw)
  To: Mark Lord
  Cc: Matthew Wilcox, Christoph Hellwig, linux-scsi, linux-ide,
	linux-kernel, jens.axboe, dwmw2

On Sat, Aug 29, 2009 at 10:52:59PM -0400, Mark Lord wrote:
>> That depends on the firmware.  I do have plans to group the extents,
>> but let's get something working before making it quick.
> ..
>
> In this situation, it's more a case of making it *not sucky slow*
> than getting to the point where we think about making it "quick".
>
> One extent at a time is glacial.

Really depends on the use case.  With XFS you will only have very few
free space extents per AG, so it should work reasonably well for the
batched "offline" discard in my last patch.  And for SCSI arrays that
can actually process other I/O during that time the WRITE SAME is in
progress it should not be that bad.

But so far this is all a lot of talk of course until I can actually
work against real hardware.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] discard support revisited
  2009-08-30  2:15   ` Christoph Hellwig
@ 2009-08-30  3:03     ` Matthew Wilcox
  2009-08-30 20:17     ` James Bottomley
  1 sibling, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2009-08-30  3:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, linux-ide, linux-kernel, liml, jens.axboe, dwmw2

On Sat, Aug 29, 2009 at 10:15:34PM -0400, Christoph Hellwig wrote:
> On Sat, Aug 29, 2009 at 05:37:19PM -0600, Matthew Wilcox wrote:
> > I think we're going to need to figure out whether we should be sending
> > UNMAP or WRITE SAME ... probably need to dive back into the T10 poostorm
> > to see what's going on.
> 
> Good question.  Latest I had heard was that at least one array vendor
> prefers the WRITE SAME.  To me it looks like the much saner interface
> for the OS, so unless there are arrays that strongly prefer UNMAP or
> we need to make use of the multiple extends feature in it I'd go with
> WRITE SAME as first choice.

I think we're going to see a split in array vendors, tbh.  Many were
very upset at the thought of taking out multiple extents from the UNMAP
command.  Which I suggested, because frankly it's insane.

> > Jens had some objections to the block layer bits last time I posted
> > these.  I forget what they were now (this would have been around May
> > 2nd, I think).  What I've done instead in my current patchset (which
> > undoubtedly has bugs because it isn't tested, because I'm not supposed
> > to be working on the weekends) is to make sd_prep_fn() call a new method
> > in the scsi_host_template.  That should translate the discard request
> > into a BLOCK_PC ATA_16 command, and we'll all be happy.
> > 
> > It goes a little something like this:
> > http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090829
> > 
> > Right now, the test tool is telling me 'Operation not supported', and
> > I haven't tried to figure out why yet.
> 
> Queue flag and handling the discard in the prep function is much better
> than the prepare function, yes.  I don't like the prep_fn callout to the
> host a lot.

No, but I think we can make it more palatable.  Look at the ugly USB
hack for accessing near the end of the disc that we have in sd_prep_fn
right now.  If we can push that into the USB driver, I think that'll make
everybody happier.

This also gives us an interesting opportunity to experiment with
translating read/write commands directly into ATA_16 commands rather than
going through the SCSI translation first.  That should save a few cycles.

> If we go with WRITE SAME as prefered discard option for
> scsi translating it to TRIM should be relatively easy, it uses the same
> LBA/length encoding as the regular WRITE_16, except that the payload is
> just a single sector.  That should be not too hard to implement in the
> SAT layer.

It should avoid the difficulty in translating the command size, true.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit
  2009-08-29 23:03 ` [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit Christoph Hellwig
  2009-08-30  0:43   ` Douglas Gilbert
@ 2009-08-30 11:12   ` Sergei Shtylyov
  2009-08-30 17:14     ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Sergei Shtylyov @ 2009-08-30 11:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, linux-ide, linux-kernel, liml, jens.axboe, matthew, dwmw2

Hello.

Christoph Hellwig wrote:

> Add a prepare_discard function to sd that sends a WRITE SAME request with
> the unmap bit set to the device if it advertises thin provisioning support.
>
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
>
> Index: linux-2.6/drivers/scsi/sd.c
> ===================================================================
> --- linux-2.6.orig/drivers/scsi/sd.c	2009-08-29 19:19:36.067371669 -0300
> +++ linux-2.6/drivers/scsi/sd.c	2009-08-29 19:26:20.723754241 -0300
> @@ -911,6 +911,50 @@ static void sd_prepare_flush(struct requ
>  	rq->cmd_len = 10;
>  }
>  
> +static int sd_prepare_discard(struct request_queue *q, struct request *rq,
> +		struct bio *bio)
> +{
> +	struct scsi_device *sdp = q->queuedata;
> +	struct page *page = alloc_page(GFP_KERNEL);
> +
> +	if (!page)
> +		return -ENOMEM;
> +
> +	rq->cmd_type = REQ_TYPE_BLOCK_PC;
> +	rq->timeout = SD_TIMEOUT;
> +	rq->cmd[0] = WRITE_SAME_16;
> +	rq->cmd[1] = 0x8; /* UNMAP bit */
> +	rq->cmd[2] = sizeof(bio->bi_sector) > 4 ?
> +			(unsigned char) (bio->bi_sector >> 56) & 0xff : 0;
> +	rq->cmd[3] = sizeof(bio->bi_sector) > 4 ?
> +			(unsigned char) (bio->bi_sector >> 48) & 0xff : 0;
> +	rq->cmd[4] = sizeof(bio->bi_sector) > 4 ?
> +			(unsigned char) (bio->bi_sector >> 40) & 0xff : 0;
> +	rq->cmd[5] = sizeof(bio->bi_sector) > 4 ?
> +			(unsigned char) (bio->bi_sector >> 32) & 0xff : 0;
> +	rq->cmd[6] = (unsigned char) (bio->bi_sector >> 24) & 0xff;
> +	rq->cmd[7] = (unsigned char) (bio->bi_sector >> 16) & 0xff;
> +	rq->cmd[8] = (unsigned char) (bio->bi_sector >> 8) & 0xff;
> +	rq->cmd[9] = (unsigned char) bio->bi_sector & 0xff;
> +	rq->cmd[10] = (unsigned char) (bio_sectors(bio) >> 24) & 0xff;
> +	rq->cmd[11] = (unsigned char) (bio_sectors(bio) >> 16) & 0xff;
> +	rq->cmd[12] = (unsigned char) (bio_sectors(bio) >> 8) & 0xff;
> +	rq->cmd[13] = (unsigned char) bio_sectors(bio) & 0xff;
> +	rq->cmd[14] = 0;
> +	rq->cmd[15] = 0;
> +	rq->cmd_len = 16;
> +
> +	printk(KERN_INFO "umap, lba = 0x%lld, len = %d\n",
>   

   So, is it hex or decimal? :-)

> +	       bio->bi_sector, bio_sectors(bio));
>   

   Since bio->bi_sector can either be 4 or 8 bytes, you need a cast to 
unsigned long long here.

MBR, Sergei

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit
  2009-08-30 11:12   ` Sergei Shtylyov
@ 2009-08-30 17:14     ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-30 17:14 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-kernel, liml,
	jens.axboe, matthew, dwmw2

On Sun, Aug 30, 2009 at 03:12:09PM +0400, Sergei Shtylyov wrote:
>> +	printk(KERN_INFO "umap, lba = 0x%lld, len = %d\n",
>>   
>
>   So, is it hex or decimal? :-)
>
>> +	       bio->bi_sector, bio_sectors(bio));
>>   
>
>   Since bio->bi_sector can either be 4 or 8 bytes, you need a cast to  
> unsigned long long here.

Indeed.  But as this is justa debug printk that slipped through I will
just remove it completely in the next iteration.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] discard support revisited
  2009-08-30  2:15   ` Christoph Hellwig
  2009-08-30  3:03     ` Matthew Wilcox
@ 2009-08-30 20:17     ` James Bottomley
  2009-08-30 21:42       ` Matthew Wilcox
  2009-08-30 22:48       ` Christoph Hellwig
  1 sibling, 2 replies; 30+ messages in thread
From: James Bottomley @ 2009-08-30 20:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-scsi, linux-ide, linux-kernel, liml,
	jens.axboe, dwmw2

On Sat, 2009-08-29 at 22:15 -0400, Christoph Hellwig wrote:
> On Sat, Aug 29, 2009 at 05:37:19PM -0600, Matthew Wilcox wrote:
> > >  - I have implemented support for sending WRITE SAME requests with the
> > >    unmap bit set in sd.  This has been tested with a qemu-based backed
> > >    only so far, but we'll get some real array coverage soon.
> > 
> > I think we're going to need to figure out whether we should be sending
> > UNMAP or WRITE SAME ... probably need to dive back into the T10 poostorm
> > to see what's going on.

Actually, I've been keeping an eye on that

> Good question.  Latest I had heard was that at least one array vendor
> prefers the WRITE SAME.  To me it looks like the much saner interface
> for the OS, so unless there are arrays that strongly prefer UNMAP or
> we need to make use of the multiple extends feature in it I'd go with
> WRITE SAME as first choice.

So, since their respective names are on the proposals, it's no real
secret that EMC are pushing WRITE_SAME and Netapp UNMAP, but they are
both working together on this.  I've already communicated to T10 via
intermediaries that we'd like only a single implementation for this,
please.  However, failing that, the current situation where we know from
an inquiry that the array supports thin provisioning, but don't know
whether it supports WRITE_SAME or UNMAP until we get a command failure
is unacceptable.

If we could get some good solid implementation evidence that WRITE_SAME
is much easier for an OS than UNMAP, that might help with the T10
deliberations.

> > > I would really love to see some progress on this in the 2.6.32 circle.
> > > We should at least get the block layer bits in that allow implementing
> > > a somewhat useful discard function.  I would also love to see the
> > > actual scsi and libata implementations in so that we can start playing
> > > around with it.  But given the speed up the current TRIM implementations
> > > and the expectations for WRITE SAME we should make sure the exact
> > > TRIM tracking is not actually enabled anywhere by default for now.
> > 
> > Jens had some objections to the block layer bits last time I posted
> > these.  I forget what they were now (this would have been around May
> > 2nd, I think).  What I've done instead in my current patchset (which
> > undoubtedly has bugs because it isn't tested, because I'm not supposed
> > to be working on the weekends) is to make sd_prep_fn() call a new method
> > in the scsi_host_template.  That should translate the discard request
> > into a BLOCK_PC ATA_16 command, and we'll all be happy.
> > 
> > It goes a little something like this:
> > http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090829
> > 
> > Right now, the test tool is telling me 'Operation not supported', and
> > I haven't tried to figure out why yet.
> 
> Queue flag and handling the discard in the prep function is much better
> than the prepare function, yes.  I don't like the prep_fn callout to the
> host a lot.

Me neither.  I'm sort of OK with a transformed operation, callout for
the ULD, but I really don't see why a disk only function should go
through the host template.

>  If we go with WRITE SAME as prefered discard option for
> scsi translating it to TRIM should be relatively easy, it uses the same
> LBA/length encoding as the regular WRITE_16, except that the payload is
> just a single sector.  That should be not too hard to implement in the
> SAT layer.

So the last ATA data set management with TRIM proposal I saw had a set
of discontiguous ranges, very like UNMAP.  It's certainly possible to do
the transformation, you just have to drop the sector buffer and add one
for the ranges (then reverse it in the back translation for the
completion) but it's not pretty.

James



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] discard support revisited
  2009-08-30 20:17     ` James Bottomley
@ 2009-08-30 21:42       ` Matthew Wilcox
  2009-08-30 22:48       ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2009-08-30 21:42 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-kernel, liml,
	jens.axboe, dwmw2

On Sun, Aug 30, 2009 at 03:17:19PM -0500, James Bottomley wrote:
> > > Jens had some objections to the block layer bits last time I posted
> > > these.  I forget what they were now (this would have been around May
> > > 2nd, I think).  What I've done instead in my current patchset (which
> > > undoubtedly has bugs because it isn't tested, because I'm not supposed
> > > to be working on the weekends) is to make sd_prep_fn() call a new method
> > > in the scsi_host_template.  That should translate the discard request
> > > into a BLOCK_PC ATA_16 command, and we'll all be happy.
> > > 
> > > It goes a little something like this:
> > > http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090829
> > 
> > Queue flag and handling the discard in the prep function is much better
> > than the prepare function, yes.  I don't like the prep_fn callout to the
> > host a lot.
> 
> Me neither.  I'm sort of OK with a transformed operation, callout for
> the ULD, but I really don't see why a disk only function should go
> through the host template.

I'm fine with putting the function pointer elsewhere ... the host template
seemed like a good place because the two things I see it being useful
for (USB and ATA) are a per-host thing rather than a per-device thing.
Would you like to see it in the scsi_disk instead?  Maybe the scsi_device?

> So the last ATA data set management with TRIM proposal I saw had a set
> of discontiguous ranges, very like UNMAP.  It's certainly possible to do
> the transformation, you just have to drop the sector buffer and add one
> for the ranges (then reverse it in the back translation for the
> completion) but it's not pretty.

Not pretty, and also has some practical problems, in that I cannot figure
out where Linux stores the length.  Even after I put on a new page,
it would only send the first 24 bytes (length of the UNMAP payload)
rather than the 512 bytes of the TRIM payload.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] discard support revisited
  2009-08-30 20:17     ` James Bottomley
  2009-08-30 21:42       ` Matthew Wilcox
@ 2009-08-30 22:48       ` Christoph Hellwig
  2009-09-02 19:46         ` Matthew Wilcox
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2009-08-30 22:48 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Matthew Wilcox, linux-scsi, linux-ide,
	linux-kernel, liml, jens.axboe, dwmw2

On Sun, Aug 30, 2009 at 03:17:19PM -0500, James Bottomley wrote:
> > Good question.  Latest I had heard was that at least one array vendor
> > prefers the WRITE SAME.  To me it looks like the much saner interface
> > for the OS, so unless there are arrays that strongly prefer UNMAP or
> > we need to make use of the multiple extends feature in it I'd go with
> > WRITE SAME as first choice.
> 
> So, since their respective names are on the proposals, it's no real
> secret that EMC are pushing WRITE_SAME and Netapp UNMAP, but they are
> both working together on this.  I've already communicated to T10 via
> intermediaries that we'd like only a single implementation for this,
> please.  However, failing that, the current situation where we know from
> an inquiry that the array supports thin provisioning, but don't know
> whether it supports WRITE_SAME or UNMAP until we get a command failure
> is unacceptable.
> 
> If we could get some good solid implementation evidence that WRITE_SAME
> is much easier for an OS than UNMAP, that might help with the T10
> deliberations.

As I've recently worked on all sides of the discard battle (filesystem
support, initiator support, and target support) here are my notes:


 - WRITE_SAME is extremly nice to implement for both the initiator and
   target.  It has the LBA and len exactly in the same place as normal
   16 byte commands, the payload length is fixed to one block, which
   we can allocate once and zero so that we don't even need any memory
   allocations for this command in the initiator.
 - UNMAP is a pain to implement in both initiator and target.  Not
   actuall having the LBA/len information in the cdb but in the payload
   is at least a minor incovenience in the initator, and quite annoying
   in the target as we now need to process payload data in the fastpath,
   which we otherwise only do for slow path CDBs.  This will be
   especially bad for split kernel/user target implementations.

Now the weird design of UNMAP of course has a rather (besides some
apparent pissing contest at NetApp about who can't come with the worst
possible protocol specifications, whose results can be seen in NFSv4
and iSer), and that is that it allows dicarding of multiple
discontinguous ranges.  Doing so is really bad for the filesystem as
it requires it to track multiple outstanding discard requests, which
requires locking, and book keeping to make sure we do not re-use these
blocks before they are discarded.

And at least for my target design it does not provide any measureable
benefits at all, the discard operations are mapped to a hole punch
ioctl on a filesystem, which has a constant basic overhead for each
region punched (synchronous transaction commit) and a small linear
cost per extent removed.  The only benefit of the multiple rangs unmap
would be a saving of protocol roundtrips.

Now that is interestingly actually a downside at least for my still
rather dumb target implementation with a typical Linux filesystem
workload on the initiator side.  If we actually do a lot different unmap
operations in a single unmap command it can start to take significant
amounts of time, and do to Linux waiting for queue drains frequently
due to the barrier implementations we will end up waiting for the unmap
command.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] block: use blkdev_issue_discard in blk_ioctl_discard
  2009-08-29 23:03 ` [PATCH 2/7] block: use blkdev_issue_discard in blk_ioctl_discard Christoph Hellwig
@ 2009-09-01  9:06   ` Steven Whitehouse
  2009-09-11 22:26   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Steven Whitehouse @ 2009-09-01  9:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, linux-ide, linux-kernel, liml, jens.axboe, matthew, dwmw2

Hi,

For the GFS2 bits:

Acked-by: Steven Whitehouse <swhiteho@redhat.com>

Steve.

On Sat, 2009-08-29 at 19:03 -0400, Christoph Hellwig wrote:
> plain text document attachment (discard-unify-code)
> blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
> the only difference between the two is that blkdev_issue_discard needs to
> send a barrier discard request and blk_ioctl_discard a non-barrier one,
> and blk_ioctl_discard needs to wait on the request.  To facilitates this
> add a flags argument to blkdev_issue_discard to control both aspects of the
> behaviour.  This will be very useful later on for using the waiting
> funcitonality for other callers.
> 
> Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.
> 
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: linux-2.6/block/blk-barrier.c
> ===================================================================
> --- linux-2.6.orig/block/blk-barrier.c	2009-08-29 16:49:43.067370900 -0300
> +++ linux-2.6/block/blk-barrier.c	2009-08-29 17:43:30.407344330 -0300
> @@ -348,6 +348,9 @@ static void blkdev_discard_end_io(struct
>  		clear_bit(BIO_UPTODATE, &bio->bi_flags);
>  	}
>  
> +	if (bio->bi_private)
> +		complete(bio->bi_private);
> +
>  	bio_put(bio);
>  }
>  
> @@ -357,21 +360,20 @@ static void blkdev_discard_end_io(struct
>   * @sector:	start sector
>   * @nr_sects:	number of sectors to discard
>   * @gfp_mask:	memory allocation flags (for bio_alloc)
> + * @flags:	DISCARD_FL_* flags to control behaviour
>   *
>   * Description:
> - *    Issue a discard request for the sectors in question. Does not wait.
> + *    Issue a discard request for the sectors in question.
>   */
> -int blkdev_issue_discard(struct block_device *bdev,
> -			 sector_t sector, sector_t nr_sects, gfp_t gfp_mask)
> +int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, gfp_t gfp_mask, int flags)
>  {
> -	struct request_queue *q;
> -	struct bio *bio;
> +	DECLARE_COMPLETION_ONSTACK(wait);
> +	struct request_queue *q = bdev_get_queue(bdev);
> +	int type = flags & DISCARD_FL_BARRIER ?
> +		DISCARD_BARRIER : DISCARD_NOBARRIER;
>  	int ret = 0;
>  
> -	if (bdev->bd_disk == NULL)
> -		return -ENXIO;
> -
> -	q = bdev_get_queue(bdev);
>  	if (!q)
>  		return -ENXIO;
>  
> @@ -379,12 +381,14 @@ int blkdev_issue_discard(struct block_de
>  		return -EOPNOTSUPP;
>  
>  	while (nr_sects && !ret) {
> -		bio = bio_alloc(gfp_mask, 0);
> +		struct bio *bio = bio_alloc(gfp_mask, 0);
>  		if (!bio)
>  			return -ENOMEM;
>  
>  		bio->bi_end_io = blkdev_discard_end_io;
>  		bio->bi_bdev = bdev;
> +		if (flags & DISCARD_FL_WAIT)
> +			bio->bi_private = &wait;
>  
>  		bio->bi_sector = sector;
>  
> @@ -396,10 +400,13 @@ int blkdev_issue_discard(struct block_de
>  			bio->bi_size = nr_sects << 9;
>  			nr_sects = 0;
>  		}
> +
>  		bio_get(bio);
> -		submit_bio(DISCARD_BARRIER, bio);
> +		submit_bio(type, bio);
> +
> +		if (flags & DISCARD_FL_WAIT)
> +			wait_for_completion(&wait);
>  
> -		/* Check if it failed immediately */
>  		if (bio_flagged(bio, BIO_EOPNOTSUPP))
>  			ret = -EOPNOTSUPP;
>  		else if (!bio_flagged(bio, BIO_UPTODATE))
> Index: linux-2.6/block/ioctl.c
> ===================================================================
> --- linux-2.6.orig/block/ioctl.c	2009-08-29 16:49:43.075370172 -0300
> +++ linux-2.6/block/ioctl.c	2009-08-29 16:49:52.743341374 -0300
> @@ -112,22 +112,9 @@ static int blkdev_reread_part(struct blo
>  	return res;
>  }
>  
> -static void blk_ioc_discard_endio(struct bio *bio, int err)
> -{
> -	if (err) {
> -		if (err == -EOPNOTSUPP)
> -			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
> -		clear_bit(BIO_UPTODATE, &bio->bi_flags);
> -	}
> -	complete(bio->bi_private);
> -}
> -
>  static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
>  			     uint64_t len)
>  {
> -	struct request_queue *q = bdev_get_queue(bdev);
> -	int ret = 0;
> -
>  	if (start & 511)
>  		return -EINVAL;
>  	if (len & 511)
> @@ -137,40 +124,8 @@ static int blk_ioctl_discard(struct bloc
>  
>  	if (start + len > (bdev->bd_inode->i_size >> 9))
>  		return -EINVAL;
> -
> -	if (!q->prepare_discard_fn)
> -		return -EOPNOTSUPP;
> -
> -	while (len && !ret) {
> -		DECLARE_COMPLETION_ONSTACK(wait);
> -		struct bio *bio;
> -
> -		bio = bio_alloc(GFP_KERNEL, 0);
> -
> -		bio->bi_end_io = blk_ioc_discard_endio;
> -		bio->bi_bdev = bdev;
> -		bio->bi_private = &wait;
> -		bio->bi_sector = start;
> -
> -		if (len > queue_max_hw_sectors(q)) {
> -			bio->bi_size = queue_max_hw_sectors(q) << 9;
> -			len -= queue_max_hw_sectors(q);
> -			start += queue_max_hw_sectors(q);
> -		} else {
> -			bio->bi_size = len << 9;
> -			len = 0;
> -		}
> -		submit_bio(DISCARD_NOBARRIER, bio);
> -
> -		wait_for_completion(&wait);
> -
> -		if (bio_flagged(bio, BIO_EOPNOTSUPP))
> -			ret = -EOPNOTSUPP;
> -		else if (!bio_flagged(bio, BIO_UPTODATE))
> -			ret = -EIO;
> -		bio_put(bio);
> -	}
> -	return ret;
> +	return blkdev_issue_discard(bdev, start, len, GFP_KERNEL,
> +				    DISCARD_FL_WAIT);
>  }
>  
>  static int put_ushort(unsigned long arg, unsigned short val)
> Index: linux-2.6/fs/btrfs/extent-tree.c
> ===================================================================
> --- linux-2.6.orig/fs/btrfs/extent-tree.c	2009-08-29 16:49:43.079341766 -0300
> +++ linux-2.6/fs/btrfs/extent-tree.c	2009-08-29 16:49:52.743341374 -0300
> @@ -1511,7 +1511,8 @@ static int remove_extent_backref(struct 
>  static void btrfs_issue_discard(struct block_device *bdev,
>  				u64 start, u64 len)
>  {
> -	blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL);
> +	blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
> +			     DISCARD_FL_BARRIER);
>  }
>  #endif
>  
> Index: linux-2.6/fs/gfs2/rgrp.c
> ===================================================================
> --- linux-2.6.orig/fs/gfs2/rgrp.c	2009-08-29 16:49:43.139342340 -0300
> +++ linux-2.6/fs/gfs2/rgrp.c	2009-08-29 16:49:52.747373241 -0300
> @@ -857,7 +857,8 @@ static void gfs2_rgrp_send_discards(stru
>  					goto start_new_extent;
>  				if ((start + nr_sects) != blk) {
>  					rv = blkdev_issue_discard(bdev, start,
> -							    nr_sects, GFP_NOFS);
> +							    nr_sects, GFP_NOFS,
> +							    DISCARD_FL_BARRIER);
>  					if (rv)
>  						goto fail;
>  					nr_sects = 0;
> @@ -871,7 +872,8 @@ start_new_extent:
>  		}
>  	}
>  	if (nr_sects) {
> -		rv = blkdev_issue_discard(bdev, start, nr_sects, GFP_NOFS);
> +		rv = blkdev_issue_discard(bdev, start, nr_sects, GFP_NOFS,
> +					 DISCARD_FL_BARRIER);
>  		if (rv)
>  			goto fail;
>  	}
> Index: linux-2.6/include/linux/blkdev.h
> ===================================================================
> --- linux-2.6.orig/include/linux/blkdev.h	2009-08-29 16:49:43.147343148 -0300
> +++ linux-2.6/include/linux/blkdev.h	2009-08-29 17:43:19.767342397 -0300
> @@ -977,15 +977,18 @@ static inline struct request *blk_map_qu
>  }
>  
>  extern int blkdev_issue_flush(struct block_device *, sector_t *);
> -extern int blkdev_issue_discard(struct block_device *,
> -				sector_t sector, sector_t nr_sects, gfp_t);
> +#define DISCARD_FL_WAIT		0x01	/* wait for completion */
> +#define DISCARD_FL_BARRIER	0x02	/* issue DISCARD_BARRIER request */
> +extern int blkdev_issue_discard(struct block_device *, sector_t sector,
> +		sector_t nr_sects, gfp_t, int flags);
>  
>  static inline int sb_issue_discard(struct super_block *sb,
>  				   sector_t block, sector_t nr_blocks)
>  {
>  	block <<= (sb->s_blocksize_bits - 9);
>  	nr_blocks <<= (sb->s_blocksize_bits - 9);
> -	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL);
> +	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
> +				    DISCARD_FL_BARRIER);
>  }
>  
>  extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
> Index: linux-2.6/mm/swapfile.c
> ===================================================================
> --- linux-2.6.orig/mm/swapfile.c	2009-08-29 16:49:43.151342120 -0300
> +++ linux-2.6/mm/swapfile.c	2009-08-29 16:49:52.755377749 -0300
> @@ -161,7 +161,8 @@ static int discard_swap(struct swap_info
>  		}
>  
>  		err = blkdev_issue_discard(si->bdev, start_block,
> -						nr_blocks, GFP_KERNEL);
> +						nr_blocks, GFP_KERNEL,
> +						DISCARD_FL_BARRIER);
>  		if (err)
>  			break;
>  
> @@ -200,7 +201,8 @@ static void discard_swap_cluster(struct 
>  			start_block <<= PAGE_SHIFT - 9;
>  			nr_blocks <<= PAGE_SHIFT - 9;
>  			if (blkdev_issue_discard(si->bdev, start_block,
> -							nr_blocks, GFP_NOIO))
> +							nr_blocks, GFP_NOIO,
> +							DISCARD_FL_BARRIER))
>  				break;
>  		}
>  
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] discard support revisited
  2009-08-30 22:48       ` Christoph Hellwig
@ 2009-09-02 19:46         ` Matthew Wilcox
  0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2009-09-02 19:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, linux-scsi, linux-ide, linux-kernel, liml,
	jens.axboe, dwmw2

On Sun, Aug 30, 2009 at 06:48:29PM -0400, Christoph Hellwig wrote:
> As I've recently worked on all sides of the discard battle (filesystem
> support, initiator support, and target support) here are my notes:
> 
>  - WRITE_SAME is extremly nice to implement for both the initiator and
>    target.  It has the LBA and len exactly in the same place as normal
>    16 byte commands, the payload length is fixed to one block, which
>    we can allocate once and zero so that we don't even need any memory
>    allocations for this command in the initiator.
>  - UNMAP is a pain to implement in both initiator and target.  Not
>    actuall having the LBA/len information in the cdb but in the payload
>    is at least a minor incovenience in the initator, and quite annoying
>    in the target as we now need to process payload data in the fastpath,
>    which we otherwise only do for slow path CDBs.  This will be
>    especially bad for split kernel/user target implementations.
> 
> Now the weird design of UNMAP of course has a rather (besides some
> apparent pissing contest at NetApp about who can't come with the worst
> possible protocol specifications, whose results can be seen in NFSv4
> and iSer), and that is that it allows dicarding of multiple
> discontinguous ranges.

This sentence no object ;-)

> Doing so is really bad for the filesystem as
> it requires it to track multiple outstanding discard requests, which
> requires locking, and book keeping to make sure we do not re-use these
> blocks before they are discarded.

Yeah, but we need to do that for TRIM anyway.  While we're doing it for
TRIM, we might as well do it for UNMAP.

> And at least for my target design it does not provide any measureable
> benefits at all, the discard operations are mapped to a hole punch
> ioctl on a filesystem, which has a constant basic overhead for each
> region punched (synchronous transaction commit) and a small linear
> cost per extent removed.  The only benefit of the multiple rangs unmap
> would be a saving of protocol roundtrips.

Sure, but you've got a relatively sane underpinning.  NAND is pretty
insane, and the aggregation can actually go a long way to helping with
some of the problems.

> Now that is interestingly actually a downside at least for my still
> rather dumb target implementation with a typical Linux filesystem
> workload on the initiator side.  If we actually do a lot different unmap
> operations in a single unmap command it can start to take significant
> amounts of time, and do to Linux waiting for queue drains frequently
> due to the barrier implementations we will end up waiting for the unmap
> command.

OK, but that's because you've implemented a single-range ioctl.  If we
had an ioctl which let you discard multiple ranges, it would actually
be faster (due to the barriers) than implementing a WRITE SAME.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/7] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
  2009-08-29 23:03 ` [PATCH 1/7] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads Christoph Hellwig
@ 2009-09-03 17:52   ` David Woodhouse
  2009-09-03 17:56     ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: David Woodhouse @ 2009-09-03 17:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, linux-ide, linux-kernel, liml, jens.axboe, matthew,
	Matthew Wilcox

On Sat, 2009-08-29 at 19:03 -0400, Christoph Hellwig wrote:
> 
> The commands are conceptually writes, and in the case of IDE and SCSI
> commands actually are writes.  They were only reads because we thought
> that would interact better with the elevators.  Now the elevators know
> about discard requests, that advantage no longer exists.

Can you drop the final sentence of that? It isn't true, and I never said
it.

s/.  Now.*/, but that isn't necessary, and making them writes makes it
easier for the low-level IDE and SCSI code to cope with the fact that
the command has to be sent with a payload./


The elevators _still_ don't know about discards, and will still let
reads and writes (and discards, which are just a special case of writes)
to the same sector all cross each other on the queue unless there's some
external factor to prevent it.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/7] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
  2009-09-03 17:52   ` David Woodhouse
@ 2009-09-03 17:56     ` Matthew Wilcox
  0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2009-09-03 17:56 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-kernel, liml,
	jens.axboe, Matthew Wilcox

On Thu, Sep 03, 2009 at 06:52:20PM +0100, David Woodhouse wrote:
> On Sat, 2009-08-29 at 19:03 -0400, Christoph Hellwig wrote:
> > 
> > The commands are conceptually writes, and in the case of IDE and SCSI
> > commands actually are writes.  They were only reads because we thought
> > that would interact better with the elevators.  Now the elevators know
> > about discard requests, that advantage no longer exists.
> 
> Can you drop the final sentence of that? It isn't true, and I never said
> it.

No, but you wouldn't give me a changelog entry, so I had to make
something up.

> s/.  Now.*/, but that isn't necessary, and making them writes makes it
> easier for the low-level IDE and SCSI code to cope with the fact that
> the command has to be sent with a payload./

Thanks.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] block: use blkdev_issue_discard in blk_ioctl_discard
  2009-08-29 23:03 ` [PATCH 2/7] block: use blkdev_issue_discard in blk_ioctl_discard Christoph Hellwig
  2009-09-01  9:06   ` Steven Whitehouse
@ 2009-09-11 22:26   ` Christoph Hellwig
  2009-09-14  9:40     ` Jens Axboe
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2009-09-11 22:26 UTC (permalink / raw)
  To: linux-scsi, linux-ide, linux-kernel; +Cc: liml, jens.axboe, matthew, dwmw2

Jens, can we put this one and possible Dave's treat discard requests
as writes patch in early for 2.6.32?  That'll make the discard patch
queue quite a bit smaller and will allow easier experimenting.

On Sat, Aug 29, 2009 at 07:03:34PM -0400, Christoph Hellwig wrote:
> blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
> the only difference between the two is that blkdev_issue_discard needs to
> send a barrier discard request and blk_ioctl_discard a non-barrier one,
> and blk_ioctl_discard needs to wait on the request.  To facilitates this
> add a flags argument to blkdev_issue_discard to control both aspects of the
> behaviour.  This will be very useful later on for using the waiting
> funcitonality for other callers.
> 
> Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.
> 
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: linux-2.6/block/blk-barrier.c
> ===================================================================
> --- linux-2.6.orig/block/blk-barrier.c	2009-08-29 16:49:43.067370900 -0300
> +++ linux-2.6/block/blk-barrier.c	2009-08-29 17:43:30.407344330 -0300
> @@ -348,6 +348,9 @@ static void blkdev_discard_end_io(struct
>  		clear_bit(BIO_UPTODATE, &bio->bi_flags);
>  	}
>  
> +	if (bio->bi_private)
> +		complete(bio->bi_private);
> +
>  	bio_put(bio);
>  }
>  
> @@ -357,21 +360,20 @@ static void blkdev_discard_end_io(struct
>   * @sector:	start sector
>   * @nr_sects:	number of sectors to discard
>   * @gfp_mask:	memory allocation flags (for bio_alloc)
> + * @flags:	DISCARD_FL_* flags to control behaviour
>   *
>   * Description:
> - *    Issue a discard request for the sectors in question. Does not wait.
> + *    Issue a discard request for the sectors in question.
>   */
> -int blkdev_issue_discard(struct block_device *bdev,
> -			 sector_t sector, sector_t nr_sects, gfp_t gfp_mask)
> +int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, gfp_t gfp_mask, int flags)
>  {
> -	struct request_queue *q;
> -	struct bio *bio;
> +	DECLARE_COMPLETION_ONSTACK(wait);
> +	struct request_queue *q = bdev_get_queue(bdev);
> +	int type = flags & DISCARD_FL_BARRIER ?
> +		DISCARD_BARRIER : DISCARD_NOBARRIER;
>  	int ret = 0;
>  
> -	if (bdev->bd_disk == NULL)
> -		return -ENXIO;
> -
> -	q = bdev_get_queue(bdev);
>  	if (!q)
>  		return -ENXIO;
>  
> @@ -379,12 +381,14 @@ int blkdev_issue_discard(struct block_de
>  		return -EOPNOTSUPP;
>  
>  	while (nr_sects && !ret) {
> -		bio = bio_alloc(gfp_mask, 0);
> +		struct bio *bio = bio_alloc(gfp_mask, 0);
>  		if (!bio)
>  			return -ENOMEM;
>  
>  		bio->bi_end_io = blkdev_discard_end_io;
>  		bio->bi_bdev = bdev;
> +		if (flags & DISCARD_FL_WAIT)
> +			bio->bi_private = &wait;
>  
>  		bio->bi_sector = sector;
>  
> @@ -396,10 +400,13 @@ int blkdev_issue_discard(struct block_de
>  			bio->bi_size = nr_sects << 9;
>  			nr_sects = 0;
>  		}
> +
>  		bio_get(bio);
> -		submit_bio(DISCARD_BARRIER, bio);
> +		submit_bio(type, bio);
> +
> +		if (flags & DISCARD_FL_WAIT)
> +			wait_for_completion(&wait);
>  
> -		/* Check if it failed immediately */
>  		if (bio_flagged(bio, BIO_EOPNOTSUPP))
>  			ret = -EOPNOTSUPP;
>  		else if (!bio_flagged(bio, BIO_UPTODATE))
> Index: linux-2.6/block/ioctl.c
> ===================================================================
> --- linux-2.6.orig/block/ioctl.c	2009-08-29 16:49:43.075370172 -0300
> +++ linux-2.6/block/ioctl.c	2009-08-29 16:49:52.743341374 -0300
> @@ -112,22 +112,9 @@ static int blkdev_reread_part(struct blo
>  	return res;
>  }
>  
> -static void blk_ioc_discard_endio(struct bio *bio, int err)
> -{
> -	if (err) {
> -		if (err == -EOPNOTSUPP)
> -			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
> -		clear_bit(BIO_UPTODATE, &bio->bi_flags);
> -	}
> -	complete(bio->bi_private);
> -}
> -
>  static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
>  			     uint64_t len)
>  {
> -	struct request_queue *q = bdev_get_queue(bdev);
> -	int ret = 0;
> -
>  	if (start & 511)
>  		return -EINVAL;
>  	if (len & 511)
> @@ -137,40 +124,8 @@ static int blk_ioctl_discard(struct bloc
>  
>  	if (start + len > (bdev->bd_inode->i_size >> 9))
>  		return -EINVAL;
> -
> -	if (!q->prepare_discard_fn)
> -		return -EOPNOTSUPP;
> -
> -	while (len && !ret) {
> -		DECLARE_COMPLETION_ONSTACK(wait);
> -		struct bio *bio;
> -
> -		bio = bio_alloc(GFP_KERNEL, 0);
> -
> -		bio->bi_end_io = blk_ioc_discard_endio;
> -		bio->bi_bdev = bdev;
> -		bio->bi_private = &wait;
> -		bio->bi_sector = start;
> -
> -		if (len > queue_max_hw_sectors(q)) {
> -			bio->bi_size = queue_max_hw_sectors(q) << 9;
> -			len -= queue_max_hw_sectors(q);
> -			start += queue_max_hw_sectors(q);
> -		} else {
> -			bio->bi_size = len << 9;
> -			len = 0;
> -		}
> -		submit_bio(DISCARD_NOBARRIER, bio);
> -
> -		wait_for_completion(&wait);
> -
> -		if (bio_flagged(bio, BIO_EOPNOTSUPP))
> -			ret = -EOPNOTSUPP;
> -		else if (!bio_flagged(bio, BIO_UPTODATE))
> -			ret = -EIO;
> -		bio_put(bio);
> -	}
> -	return ret;
> +	return blkdev_issue_discard(bdev, start, len, GFP_KERNEL,
> +				    DISCARD_FL_WAIT);
>  }
>  
>  static int put_ushort(unsigned long arg, unsigned short val)
> Index: linux-2.6/fs/btrfs/extent-tree.c
> ===================================================================
> --- linux-2.6.orig/fs/btrfs/extent-tree.c	2009-08-29 16:49:43.079341766 -0300
> +++ linux-2.6/fs/btrfs/extent-tree.c	2009-08-29 16:49:52.743341374 -0300
> @@ -1511,7 +1511,8 @@ static int remove_extent_backref(struct 
>  static void btrfs_issue_discard(struct block_device *bdev,
>  				u64 start, u64 len)
>  {
> -	blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL);
> +	blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
> +			     DISCARD_FL_BARRIER);
>  }
>  #endif
>  
> Index: linux-2.6/fs/gfs2/rgrp.c
> ===================================================================
> --- linux-2.6.orig/fs/gfs2/rgrp.c	2009-08-29 16:49:43.139342340 -0300
> +++ linux-2.6/fs/gfs2/rgrp.c	2009-08-29 16:49:52.747373241 -0300
> @@ -857,7 +857,8 @@ static void gfs2_rgrp_send_discards(stru
>  					goto start_new_extent;
>  				if ((start + nr_sects) != blk) {
>  					rv = blkdev_issue_discard(bdev, start,
> -							    nr_sects, GFP_NOFS);
> +							    nr_sects, GFP_NOFS,
> +							    DISCARD_FL_BARRIER);
>  					if (rv)
>  						goto fail;
>  					nr_sects = 0;
> @@ -871,7 +872,8 @@ start_new_extent:
>  		}
>  	}
>  	if (nr_sects) {
> -		rv = blkdev_issue_discard(bdev, start, nr_sects, GFP_NOFS);
> +		rv = blkdev_issue_discard(bdev, start, nr_sects, GFP_NOFS,
> +					 DISCARD_FL_BARRIER);
>  		if (rv)
>  			goto fail;
>  	}
> Index: linux-2.6/include/linux/blkdev.h
> ===================================================================
> --- linux-2.6.orig/include/linux/blkdev.h	2009-08-29 16:49:43.147343148 -0300
> +++ linux-2.6/include/linux/blkdev.h	2009-08-29 17:43:19.767342397 -0300
> @@ -977,15 +977,18 @@ static inline struct request *blk_map_qu
>  }
>  
>  extern int blkdev_issue_flush(struct block_device *, sector_t *);
> -extern int blkdev_issue_discard(struct block_device *,
> -				sector_t sector, sector_t nr_sects, gfp_t);
> +#define DISCARD_FL_WAIT		0x01	/* wait for completion */
> +#define DISCARD_FL_BARRIER	0x02	/* issue DISCARD_BARRIER request */
> +extern int blkdev_issue_discard(struct block_device *, sector_t sector,
> +		sector_t nr_sects, gfp_t, int flags);
>  
>  static inline int sb_issue_discard(struct super_block *sb,
>  				   sector_t block, sector_t nr_blocks)
>  {
>  	block <<= (sb->s_blocksize_bits - 9);
>  	nr_blocks <<= (sb->s_blocksize_bits - 9);
> -	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL);
> +	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
> +				    DISCARD_FL_BARRIER);
>  }
>  
>  extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
> Index: linux-2.6/mm/swapfile.c
> ===================================================================
> --- linux-2.6.orig/mm/swapfile.c	2009-08-29 16:49:43.151342120 -0300
> +++ linux-2.6/mm/swapfile.c	2009-08-29 16:49:52.755377749 -0300
> @@ -161,7 +161,8 @@ static int discard_swap(struct swap_info
>  		}
>  
>  		err = blkdev_issue_discard(si->bdev, start_block,
> -						nr_blocks, GFP_KERNEL);
> +						nr_blocks, GFP_KERNEL,
> +						DISCARD_FL_BARRIER);
>  		if (err)
>  			break;
>  
> @@ -200,7 +201,8 @@ static void discard_swap_cluster(struct 
>  			start_block <<= PAGE_SHIFT - 9;
>  			nr_blocks <<= PAGE_SHIFT - 9;
>  			if (blkdev_issue_discard(si->bdev, start_block,
> -							nr_blocks, GFP_NOIO))
> +							nr_blocks, GFP_NOIO,
> +							DISCARD_FL_BARRIER))
>  				break;
>  		}
>  
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
---end quoted text---

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] block: use blkdev_issue_discard in blk_ioctl_discard
  2009-09-11 22:26   ` Christoph Hellwig
@ 2009-09-14  9:40     ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2009-09-14  9:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, linux-ide, linux-kernel, liml, matthew, dwmw2

On Fri, Sep 11 2009, Christoph Hellwig wrote:
> Jens, can we put this one and possible Dave's treat discard requests
> as writes patch in early for 2.6.32?  That'll make the discard patch
> queue quite a bit smaller and will allow easier experimenting.

Looks fine from a review, I'll include it in the for-2.6.32 branch. I
plan on asking for a pull of that tomorrow.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2009-09-14  9:40 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-29 23:03 [PATCH 0/7] discard support revisited Christoph Hellwig
2009-08-29 23:03 ` [PATCH 1/7] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads Christoph Hellwig
2009-09-03 17:52   ` David Woodhouse
2009-09-03 17:56     ` Matthew Wilcox
2009-08-29 23:03 ` [PATCH 2/7] block: use blkdev_issue_discard in blk_ioctl_discard Christoph Hellwig
2009-09-01  9:06   ` Steven Whitehouse
2009-09-11 22:26   ` Christoph Hellwig
2009-09-14  9:40     ` Jens Axboe
2009-08-29 23:03 ` [PATCH 3/7] block: discard may need to allocate pages Christoph Hellwig
2009-08-29 23:03 ` [PATCH 4/7] sd: add support for WRITE SAME (16) with unmap bit Christoph Hellwig
2009-08-30  0:43   ` Douglas Gilbert
2009-08-30  1:05     ` Christoph Hellwig
2009-08-30  2:43       ` Douglas Gilbert
2009-08-30  2:48         ` Christoph Hellwig
2009-08-30 11:12   ` Sergei Shtylyov
2009-08-30 17:14     ` Christoph Hellwig
2009-08-29 23:03 ` [PATCH 5/7] libata: Add support for TRIM Christoph Hellwig
2009-08-29 23:03 ` [PATCH 6/7] block: allow large discard requests Christoph Hellwig
2009-08-30  2:49   ` Mark Lord
2009-08-30  2:50     ` Matthew Wilcox
2009-08-30  2:52       ` Mark Lord
2009-08-30  2:56         ` Christoph Hellwig
2009-08-29 23:03 ` [PATCH 7/7] xfs: add batches discard support Christoph Hellwig
2009-08-29 23:37 ` [PATCH 0/7] discard support revisited Matthew Wilcox
2009-08-30  2:15   ` Christoph Hellwig
2009-08-30  3:03     ` Matthew Wilcox
2009-08-30 20:17     ` James Bottomley
2009-08-30 21:42       ` Matthew Wilcox
2009-08-30 22:48       ` Christoph Hellwig
2009-09-02 19:46         ` Matthew Wilcox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.