linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/5] Block: Discard may need to allocate pages
@ 2009-04-02 14:37 Matthew Wilcox
  2009-04-02 14:37 ` [PATCH 2/5] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads Matthew Wilcox
                   ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-02 14:37 UTC (permalink / raw)
  To: linux-ide, linux-kernel, jgarzik; +Cc: Matthew Wilcox, Matthew Wilcox

From: Matthew Wilcox <matthew@wil.cx>

SCSI and ATA both need to send data to the device.  In order to do this,
the BIO must be allocated with room for a page to be added, and the bio
needs to be passed to the discard prep function.  We also need to free
the page attached to the BIO before we free it.

init_request_from_bio() is not currently called from a context which
forbids sleeping, and to make sure it stays that way (so we don't have
to use GFP_ATOMIC), add a might_sleep() to it.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 block/blk-barrier.c       |    4 +++-
 block/blk-core.c          |    4 +++-
 block/ioctl.c             |    4 +++-
 drivers/mtd/mtd_blkdevs.c |    2 +-
 include/linux/blkdev.h    |    3 ++-
 5 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f7dae57..82a3035 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -356,6 +356,8 @@ static void blkdev_discard_end_io(struct bio *bio, int err)
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	}
 
+	if (bio_has_data(bio))
+		__free_page(bio_page(bio));
 	bio_put(bio);
 }
 
@@ -387,7 +389,7 @@ int blkdev_issue_discard(struct block_device *bdev,
 		return -EOPNOTSUPP;
 
 	while (nr_sects && !ret) {
-		bio = bio_alloc(gfp_mask, 0);
+		bio = bio_alloc(gfp_mask, 1);
 		if (!bio)
 			return -ENOMEM;
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 996ed90..7899761 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1095,6 +1095,8 @@ EXPORT_SYMBOL(blk_put_request);
 
 void init_request_from_bio(struct request *req, struct bio *bio)
 {
+	might_sleep();
+
 	req->cpu = bio->bi_comp_cpu;
 	req->cmd_type = REQ_TYPE_FS;
 
@@ -1118,7 +1120,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
 		req->cmd_flags |= REQ_DISCARD;
 		if (bio_barrier(bio))
 			req->cmd_flags |= REQ_SOFTBARRIER;
-		req->q->prepare_discard_fn(req->q, req);
+		req->q->prepare_discard_fn(req->q, req, bio);
 	} else if (unlikely(bio_barrier(bio)))
 		req->cmd_flags |= (REQ_HARDBARRIER | REQ_NOMERGE);
 
diff --git a/block/ioctl.c b/block/ioctl.c
index 0f22e62..088a9ba 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -145,7 +145,7 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 		DECLARE_COMPLETION_ONSTACK(wait);
 		struct bio *bio;
 
-		bio = bio_alloc(GFP_KERNEL, 0);
+		bio = bio_alloc(GFP_KERNEL, 1);
 		if (!bio)
 			return -ENOMEM;
 
@@ -170,6 +170,8 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 			ret = -EOPNOTSUPP;
 		else if (!bio_flagged(bio, BIO_UPTODATE))
 			ret = -EIO;
+		if (bio_has_data(bio))
+			__free_page(bio_page(bio));
 		bio_put(bio);
 	}
 	return ret;
diff --git a/drivers/mtd/mtd_blkdevs.c b/drivers/mtd/mtd_blkdevs.c
index 1409f01..2b6ed4b 100644
--- a/drivers/mtd/mtd_blkdevs.c
+++ b/drivers/mtd/mtd_blkdevs.c
@@ -33,7 +33,7 @@ struct mtd_blkcore_priv {
 };
 
 static int blktrans_discard_request(struct request_queue *q,
-				    struct request *req)
+				    struct request *req, struct bio *bio)
 {
 	req->cmd_type = REQ_TYPE_LINUX_BLOCK;
 	req->cmd[0] = REQ_LB_OP_DISCARD;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 465d6ba..9d9bd7b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -260,7 +260,8 @@ typedef void (request_fn_proc) (struct request_queue *q);
 typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);
 typedef int (prep_rq_fn) (struct request_queue *, struct request *);
 typedef void (unplug_fn) (struct request_queue *);
-typedef int (prepare_discard_fn) (struct request_queue *, struct request *);
+typedef int (prepare_discard_fn) (struct request_queue *, struct request *,
+							struct bio *bio);
 
 struct bio_vec;
 struct bvec_merge_data {
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 2/5] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
  2009-04-02 14:37 [PATCH 1/5] Block: Discard may need to allocate pages Matthew Wilcox
@ 2009-04-02 14:37 ` Matthew Wilcox
  2009-04-02 14:37   ` [PATCH 3/5] ata: Add TRIM infrastructure Matthew Wilcox
  2009-04-05 12:28 ` [PATCH 1/5] Block: Discard may need to allocate pages Boaz Harrosh
  2009-04-17 21:23 ` [PATCH 1/5] Block: Discard may need to allocate pages Mark Lord
  2 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-02 14:37 UTC (permalink / raw)
  To: linux-ide, linux-kernel, jgarzik
  Cc: David Woodhouse, David Woodhouse, Matthew Wilcox

From: David Woodhouse <dwmw2@infradead.org>

The commands are conceptually writes, and in the case of IDE and SCSI
commands actually are writes.  They were only reads because we thought
that would interact better with the elevators.  Now the elevators know
about discard requests, that advantage no longer exists.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 include/linux/fs.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 61211ad..e5dc992 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -98,8 +98,8 @@ struct inodes_stat_t {
 #define WRITE_SYNC	(WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
 #define SWRITE_SYNC	(SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
 #define WRITE_BARRIER	(WRITE | (1 << BIO_RW_BARRIER))
-#define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)
-#define DISCARD_BARRIER ((1 << BIO_RW_DISCARD) | (1 << BIO_RW_BARRIER))
+#define DISCARD_NOBARRIER (WRITE | (1 << BIO_RW_DISCARD))
+#define DISCARD_BARRIER (DISCARD_NOBARRIER | (1 << BIO_RW_BARRIER))
 
 #define SEL_IN		1
 #define SEL_OUT		2
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 3/5] ata: Add TRIM infrastructure
  2009-04-02 14:37 ` [PATCH 2/5] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads Matthew Wilcox
@ 2009-04-02 14:37   ` Matthew Wilcox
  2009-04-02 14:37     ` [PATCH 4/5] ide: Add support for TRIM Matthew Wilcox
                       ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-02 14:37 UTC (permalink / raw)
  To: linux-ide, linux-kernel, jgarzik; +Cc: Matthew Wilcox, David Woodhouse

This is common code shared between the IDE and libata implementations

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 include/linux/ata.h |   41 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/include/linux/ata.h b/include/linux/ata.h
index 6617c9f..cb79b7a 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -29,6 +29,8 @@
 #ifndef __LINUX_ATA_H__
 #define __LINUX_ATA_H__
 
+#include <linux/kernel.h>
+#include <linux/string.h>
 #include <linux/types.h>
 #include <asm/byteorder.h>
 
@@ -91,6 +93,7 @@ enum {
 	ATA_ID_CFA_POWER	= 160,
 	ATA_ID_CFA_KEY_MGMT	= 162,
 	ATA_ID_CFA_MODES	= 163,
+	ATA_ID_DATA_SET_MGMT	= 169,
 	ATA_ID_ROT_SPEED	= 217,
 	ATA_ID_PIO4		= (1 << 1),
 
@@ -248,6 +251,7 @@ enum {
 	ATA_CMD_SMART		= 0xB0,
 	ATA_CMD_MEDIA_LOCK	= 0xDE,
 	ATA_CMD_MEDIA_UNLOCK	= 0xDF,
+	ATA_CMD_DSM		= 0x06,
 	/* marked obsolete in the ATA/ATAPI-7 spec */
 	ATA_CMD_RESTORE		= 0x10,
 
@@ -321,6 +325,9 @@ enum {
 	ATA_SMART_READ_VALUES	= 0xD0,
 	ATA_SMART_READ_THRESHOLDS = 0xD1,
 
+	/* feature values for Data Set Management */
+	ATA_DSM_TRIM		= 0x01,
+
 	/* password used in LBA Mid / LBA High for executing SMART commands */
 	ATA_SMART_LBAM_PASS	= 0x4F,
 	ATA_SMART_LBAH_PASS	= 0xC2,
@@ -723,6 +730,14 @@ static inline int ata_id_has_unload(const u16 *id)
 	return 0;
 }
 
+static inline int ata_id_has_trim(const u16 *id)
+{
+	if (ata_id_major_version(id) >= 7 &&
+	    (id[ATA_ID_DATA_SET_MGMT] & 1))
+		return 1;
+	return 0;
+}
+
 static inline int ata_id_current_chs_valid(const u16 *id)
 {
 	/* For ATA-1 devices, if the INITIALIZE DEVICE PARAMETERS command
@@ -863,6 +878,32 @@ static inline void ata_id_to_hd_driveid(u16 *id)
 #endif
 }
 
+/*
+ * Write up to 'max' LBA Range Entries to the buffer that will cover the
+ * extent from sector to sector + count.  This is used for TRIM and for
+ * ADD LBA(S) TO NV CACHE PINNED SET.
+ */
+static inline unsigned ata_set_lba_range_entries(void *_buffer, unsigned max,
+						u64 sector, unsigned long count)
+{
+	__le64 *buffer = _buffer;
+	unsigned i = 0;
+
+	while (i < max) {
+		u64 entry = sector |
+			((u64)(count > 0xffff ? 0xffff : count) << 48);
+		buffer[i++] = __cpu_to_le64(entry);
+		if (count <= 0xffff)
+			break;
+		count -= 0xffff;
+		sector += 0xffff;
+	}
+
+	max = ALIGN(i * 8, 512);
+	memset(buffer + i, 0, max - i * 8);
+	return max;
+}
+
 static inline int is_multi_taskfile(struct ata_taskfile *tf)
 {
 	return (tf->command == ATA_CMD_READ_MULTI) ||
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 4/5] ide: Add support for TRIM
  2009-04-02 14:37   ` [PATCH 3/5] ata: Add TRIM infrastructure Matthew Wilcox
@ 2009-04-02 14:37     ` Matthew Wilcox
  2009-04-02 14:37       ` [PATCH 5/5] libata: " Matthew Wilcox
                         ` (3 more replies)
  2009-04-02 15:55     ` [PATCH 3/5] ata: Add TRIM infrastructure Sergei Shtylyov
  2009-04-07  0:02     ` Jeff Garzik
  2 siblings, 4 replies; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-02 14:37 UTC (permalink / raw)
  To: linux-ide, linux-kernel, jgarzik; +Cc: David Woodhouse, Matthew Wilcox

From: David Woodhouse <David.Woodhouse@intel.com>

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
[modified to reduce amount of special casing needed for discard]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 drivers/ide/ide-disk.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
index c998cf8..ed34bd3 100644
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -404,6 +404,52 @@ static void idedisk_prepare_flush(struct request_queue *q, struct request *rq)
 	rq->special = cmd;
 }
 
+static int idedisk_prepare_discard(struct request_queue *q, struct request *rq,
+							struct bio *bio)
+{
+	ide_task_t *task;
+	unsigned size;
+	struct page *page = alloc_page(GFP_KERNEL);
+	if (!page)
+		goto error;
+
+	/* FIXME: map struct ide_taskfile on rq->cmd[] */
+	task = kzalloc(sizeof(*task), GFP_KERNEL);
+	if (!task)
+		goto free_page;
+
+	size = ata_set_lba_range_entries(page_address(page), PAGE_SIZE,
+					bio->bi_sector, bio_sectors(bio));
+	bio->bi_size = 0;
+	if (bio_add_pc_page(q, bio, page, size, 0) < size)
+		goto free_task;
+
+	task->tf.command = ATA_CMD_DSM;
+	task->tf.feature = ATA_DSM_TRIM;
+	task->tf.hob_feature = 0x00;
+	task->tf.nsect = size / 512;
+	task->tf.hob_nsect = (size / 512) >> 8;
+
+	task->tf_flags   = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
+			   IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |
+			   IDE_TFLAG_DYN;
+	task->data_phase = TASKFILE_OUT_DMA;
+
+	rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
+	rq->special = task;
+	rq->nr_sectors = size / 512;
+
+	return 0;
+
+ free_task:
+	kfree(task);
+ free_page:
+	__free_page(page);
+ error:
+	return -ENOMEM;
+}
+
+
 ide_devset_get(multcount, mult_count);
 
 /*
@@ -521,6 +567,9 @@ static int set_wcache(ide_drive_t *drive, int arg)
 
 	update_ordered(drive);
 
+	if (ata_id_has_trim(drive->id))
+		blk_queue_set_discard(drive->queue, idedisk_prepare_discard);
+
 	return err;
 }
 
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 5/5] libata: Add support for TRIM
  2009-04-02 14:37     ` [PATCH 4/5] ide: Add support for TRIM Matthew Wilcox
@ 2009-04-02 14:37       ` Matthew Wilcox
  2009-04-02 17:20         ` Mark Lord
  2009-04-16 20:25         ` Mark Lord
  2009-04-02 15:58       ` [PATCH 4/5] ide: " Sergei Shtylyov
                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-02 14:37 UTC (permalink / raw)
  To: linux-ide, linux-kernel, jgarzik; +Cc: Matthew Wilcox

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 drivers/ata/libata-scsi.c |   46 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index b9747fa..d4c8b8b 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1029,6 +1029,46 @@ static void ata_gen_ata_sense(struct ata_queued_cmd *qc)
 	desc[11] = block;
 }
 
+static int ata_discard_fn(struct request_queue *q, struct request *req,
+							struct bio *bio)
+{
+	unsigned size;
+	struct page *page = alloc_page(GFP_KERNEL);
+	if (!page)
+		goto error;
+
+	size = ata_set_lba_range_entries(page_address(page), PAGE_SIZE / 8,
+					bio->bi_sector, bio_sectors(bio));
+	bio->bi_size = 0;
+	if (bio_add_pc_page(q, bio, page, size, 0) < size)
+		goto free_page;
+
+	req->cmd_type = REQ_TYPE_BLOCK_PC;
+	req->cmd_len = 16;
+	req->cmd[0] = ATA_16;
+	req->cmd[1] = (6 << 1) | 1;		/* dma, 48-bit */
+	req->cmd[2] = 0x6;			/* length, direction */
+	req->cmd[3] = 0;			/* feature high */
+	req->cmd[4] = ATA_DSM_TRIM;		/* feature low */
+	req->cmd[5] = (size / 512) >> 8;	/* nsect high */
+	req->cmd[6] = size / 512;		/* nsect low */
+	req->cmd[7] = 0;			/* lba */
+	req->cmd[8] = 0;			/* lba */
+	req->cmd[9] = 0;			/* lba */
+	req->cmd[10] = 0;			/* lba */
+	req->cmd[11] = 0;			/* lba */
+	req->cmd[12] = 0;			/* lba */
+	req->cmd[13] = ATA_LBA;			/* device */
+	req->cmd[14] = ATA_CMD_DSM;		/* command */
+	req->cmd[15] = 0;			/* control */
+
+	return 0;
+ free_page:
+	__free_page(page);
+ error:
+	return -ENOMEM;
+}
+
 static void ata_scsi_sdev_config(struct scsi_device *sdev)
 {
 	sdev->use_10_for_rw = 1;
@@ -1077,6 +1117,9 @@ static int ata_scsi_dev_config(struct scsi_device *sdev,
 	/* configure max sectors */
 	blk_queue_max_sectors(sdev->request_queue, dev->max_sectors);
 
+	if (ata_id_has_trim(dev->id))
+		blk_queue_set_discard(sdev->request_queue, ata_discard_fn);
+
 	if (dev->class == ATA_DEV_ATAPI) {
 		struct request_queue *q = sdev->request_queue;
 		void *buf;
@@ -2385,6 +2428,9 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		/* sector size */
 		rbuf[10] = ATA_SECT_SIZE >> 8;
 		rbuf[11] = ATA_SECT_SIZE & 0xff;
+
+		if (ata_id_has_trim(args->id))
+			rbuf[14] = 0x80;
 	}
 
 	return 0;
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/5] ata: Add TRIM infrastructure
  2009-04-02 14:37   ` [PATCH 3/5] ata: Add TRIM infrastructure Matthew Wilcox
  2009-04-02 14:37     ` [PATCH 4/5] ide: Add support for TRIM Matthew Wilcox
@ 2009-04-02 15:55     ` Sergei Shtylyov
  2009-04-02 16:18       ` Matthew Wilcox
  2009-04-07  0:02     ` Jeff Garzik
  2 siblings, 1 reply; 59+ messages in thread
From: Sergei Shtylyov @ 2009-04-02 15:55 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse

Matthew Wilcox wrote:

> This is common code shared between the IDE and libata implementations

> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>

[...]

> diff --git a/include/linux/ata.h b/include/linux/ata.h
> index 6617c9f..cb79b7a 100644
> --- a/include/linux/ata.h
> +++ b/include/linux/ata.h
> @@ -29,6 +29,8 @@
>  #ifndef __LINUX_ATA_H__
>  #define __LINUX_ATA_H__
>  
> +#include <linux/kernel.h>
> +#include <linux/string.h>
>  #include <linux/types.h>
>  #include <asm/byteorder.h>
>  
> @@ -91,6 +93,7 @@ enum {
>  	ATA_ID_CFA_POWER	= 160,
>  	ATA_ID_CFA_KEY_MGMT	= 162,
>  	ATA_ID_CFA_MODES	= 163,
> +	ATA_ID_DATA_SET_MGMT	= 169,
>  	ATA_ID_ROT_SPEED	= 217,
>  	ATA_ID_PIO4		= (1 << 1),
>  
> @@ -248,6 +251,7 @@ enum {
>  	ATA_CMD_SMART		= 0xB0,
>  	ATA_CMD_MEDIA_LOCK	= 0xDE,
>  	ATA_CMD_MEDIA_UNLOCK	= 0xDF,
> +	ATA_CMD_DSM		= 0x06,
>  	/* marked obsolete in the ATA/ATAPI-7 spec */
>  	ATA_CMD_RESTORE		= 0x10,
>  
> @@ -321,6 +325,9 @@ enum {
>  	ATA_SMART_READ_VALUES	= 0xD0,
>  	ATA_SMART_READ_THRESHOLDS = 0xD1,
>  
> +	/* feature values for Data Set Management */
> +	ATA_DSM_TRIM		= 0x01,
> +
>  	/* password used in LBA Mid / LBA High for executing SMART commands */
>  	ATA_SMART_LBAM_PASS	= 0x4F,
>  	ATA_SMART_LBAH_PASS	= 0xC2,

    I wonder where to read about the word 169 and commnand 0x06... what's it 
all about?

WBR, Sergei

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-02 14:37     ` [PATCH 4/5] ide: Add support for TRIM Matthew Wilcox
  2009-04-02 14:37       ` [PATCH 5/5] libata: " Matthew Wilcox
@ 2009-04-02 15:58       ` Sergei Shtylyov
  2009-04-02 16:28         ` Matthew Wilcox
  2009-04-07 17:20       ` Jeff Garzik
  2009-04-08 14:33       ` Mark Lord
  3 siblings, 1 reply; 59+ messages in thread
From: Sergei Shtylyov @ 2009-04-02 15:58 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse

Hello.

Matthew Wilcox wrote:

> From: David Woodhouse <David.Woodhouse@intel.com>

> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
> [modified to reduce amount of special casing needed for discard]
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>



> diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
> index c998cf8..ed34bd3 100644
> --- a/drivers/ide/ide-disk.c
> +++ b/drivers/ide/ide-disk.c
> @@ -404,6 +404,52 @@ static void idedisk_prepare_flush(struct request_queue *q, struct request *rq)
>  	rq->special = cmd;
>  }
>  
> +static int idedisk_prepare_discard(struct request_queue *q, struct request *rq,
> +							struct bio *bio)

    Weird indentation.

> +{
> +	ide_task_t *task;

    This patch is already obsolete as 'ide_task_t' is gone. Use 'struct 
ide_cmd' instead.

> +	unsigned size;
> +	struct page *page = alloc_page(GFP_KERNEL);

    Missing empty line after the declaration block...

> +	if (!page)
> +		goto error;

    Unneeded goto -- why not just return?

> +
> +	/* FIXME: map struct ide_taskfile on rq->cmd[] */
> +	task = kzalloc(sizeof(*task), GFP_KERNEL);
> +	if (!task)
> +		goto free_page;
> +
> +	size = ata_set_lba_range_entries(page_address(page), PAGE_SIZE,
> +					bio->bi_sector, bio_sectors(bio));
> +	bio->bi_size = 0;
> +	if (bio_add_pc_page(q, bio, page, size, 0) < size)
> +		goto free_task;
> +
> +	task->tf.command = ATA_CMD_DSM;
> +	task->tf.feature = ATA_DSM_TRIM;
> +	task->tf.hob_feature = 0x00;
> +	task->tf.nsect = size / 512;
> +	task->tf.hob_nsect = (size / 512) >> 8;
> +
> +	task->tf_flags   = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
> +			   IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |

    The last 3 flags are going to be obsoleted too...

MBR, Sergei

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/5] ata: Add TRIM infrastructure
  2009-04-02 15:55     ` [PATCH 3/5] ata: Add TRIM infrastructure Sergei Shtylyov
@ 2009-04-02 16:18       ` Matthew Wilcox
  2009-04-02 16:32         ` Sergei Shtylyov
  0 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-02 16:18 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse

On Thu, Apr 02, 2009 at 07:55:39PM +0400, Sergei Shtylyov wrote:
>    I wonder where to read about the word 169 and commnand 0x06... what's 
> it all about?

It's in d2015r1-ATAATAPI_Command_Set_-_2_ACS-2.pdf which can be found on
the t13 website.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-02 15:58       ` [PATCH 4/5] ide: " Sergei Shtylyov
@ 2009-04-02 16:28         ` Matthew Wilcox
  2009-04-02 16:38           ` Sergei Shtylyov
  0 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-02 16:28 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse

On Thu, Apr 02, 2009 at 07:58:59PM +0400, Sergei Shtylyov wrote:
>>  +static int idedisk_prepare_discard(struct request_queue *q, struct 
>> request *rq,
>> +							struct bio *bio)
>
>    Weird indentation.

Not at all.  New line, so tab the next line over as far as it will go
without falling off the 80-column limit.  It's fairly common.

To quote Documentation/CodingStyle:

  Statements longer than 80 columns will be broken into sensible chunks.
  Descendants are always substantially shorter than the parent and are placed
  substantially to the right. The same applies to function headers with a long
  argument list. Long strings are as well broken into shorter strings.  The
  only exception to this is where exceeding 80 columns significantly increases
  readability and does not hide information.

>> +{
>> +	ide_task_t *task;
>
>    This patch is already obsolete as 'ide_task_t' is gone. Use 'struct  
> ide_cmd' instead.

Thanks, fixed.

>> +	unsigned size;
>> +	struct page *page = alloc_page(GFP_KERNEL);
>
>    Missing empty line after the declaration block...

Empty line not necessary.

>> +	if (!page)
>> +		goto error;
>
>    Unneeded goto -- why not just return?

Because it then looks the same as the other two cases where we can error
out.

>> +	task->tf_flags   = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
>> +			   IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |
>
>    The last 3 flags are going to be obsoleted too...

So if I remove them today, the command will still work?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/5] ata: Add TRIM infrastructure
  2009-04-02 16:18       ` Matthew Wilcox
@ 2009-04-02 16:32         ` Sergei Shtylyov
  2009-04-02 16:47           ` Matthew Wilcox
  0 siblings, 1 reply; 59+ messages in thread
From: Sergei Shtylyov @ 2009-04-02 16:32 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse

Matthew Wilcox wrote:

>>   I wonder where to read about the word 169 and commnand 0x06... what's 
>>it all about?

> It's in d2015r1-ATAATAPI_Command_Set_-_2_ACS-2.pdf which can be found on
> the t13 website.

    But since neither ATA/PI-7 nor ATA/PI-8 mention neither the word nor 
command, why your validity check requires only ATA/PI-7+ compliance?

MBR, Sergei

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-02 16:28         ` Matthew Wilcox
@ 2009-04-02 16:38           ` Sergei Shtylyov
  2009-04-02 16:51             ` Matthew Wilcox
  0 siblings, 1 reply; 59+ messages in thread
From: Sergei Shtylyov @ 2009-04-02 16:38 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse

Hello.

Matthew Wilcox wrote:

>>> +static int idedisk_prepare_discard(struct request_queue *q, struct 
>>>request *rq,
>>>+							struct bio *bio)

>>   Weird indentation.

> Not at all.  New line, so tab the next line over as far as it will go
> without falling off the 80-column limit.  It's fairly common.

> To quote Documentation/CodingStyle:

>   Statements longer than 80 columns will be broken into sensible chunks.
>   Descendants are always substantially shorter than the parent and are placed
>   substantially to the right. The same applies to function headers with a long
>   argument list. Long strings are as well broken into shorter strings.  The
>   only exception to this is where exceeding 80 columns significantly increases
>   readability and does not hide information.

    That's all clear, it's just that the second line was overly indented, at 
least to my taste. :-)

>>>+{
>>>+	ide_task_t *task;

>>   This patch is already obsolete as 'ide_task_t' is gone. Use 'struct  
>>ide_cmd' instead.

> Thanks, fixed.

>>>+	unsigned size;
>>>+	struct page *page = alloc_page(GFP_KERNEL);

>>   Missing empty line after the declaration block...

> Empty line not necessary.

    Usually it's there.

>>>+	task->tf_flags   = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
>>>+			   IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |

>>   The last 3 flags are going to be obsoleted too...

> So if I remove them today, the command will still work?

     s/obsoleted/renamed and moved to another other field/ -- I'm going to 
submit a patchset refactoring 'struct ide_cmd and 'struct ide-taskfile' at last...

MBR, Sergei

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/5] ata: Add TRIM infrastructure
  2009-04-02 16:32         ` Sergei Shtylyov
@ 2009-04-02 16:47           ` Matthew Wilcox
  0 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-02 16:47 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse

On Thu, Apr 02, 2009 at 08:32:56PM +0400, Sergei Shtylyov wrote:
> Matthew Wilcox wrote:
>
>>>   I wonder where to read about the word 169 and commnand 0x06... 
>>> what's it all about?
>
>> It's in d2015r1-ATAATAPI_Command_Set_-_2_ACS-2.pdf which can be found on
>> the t13 website.
>
>    But since neither ATA/PI-7 nor ATA/PI-8 mention neither the word nor  
> command, why your validity check requires only ATA/PI-7+ compliance?

Because there are ATA-7 devices which implement this feature.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-02 16:38           ` Sergei Shtylyov
@ 2009-04-02 16:51             ` Matthew Wilcox
  2009-04-02 19:37               ` Bartlomiej Zolnierkiewicz
  0 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-02 16:51 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse,
	Bartlomiej Zolnierkiewicz

On Thu, Apr 02, 2009 at 08:38:03PM +0400, Sergei Shtylyov wrote:
> Hello.
>>>> +	task->tf_flags   = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
>>>> +			   IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |
>
>>>   The last 3 flags are going to be obsoleted too...
>
>> So if I remove them today, the command will still work?
>
>     s/obsoleted/renamed and moved to another other field/ -- I'm going to 
> submit a patchset refactoring 'struct ide_cmd and 'struct ide-taskfile' 
> at last...

Since I'm coding to today's kernel, not to a patch which doesn't exist
yet, taking them out won't work very well.

I've not been paying much attention to drivers/ide, so I've no idea
whether the following patch works.  It does at least compile:

+++ b/drivers/ide/ide-disk.c
@@ -407,7 +407,7 @@ static void idedisk_prepare_flush(struct request_queue *q, s
 static int idedisk_prepare_discard(struct request_queue *q, struct request *rq,
                                                        struct bio *bio)
 {
-       ide_task_t *task;
+       struct ide_cmd *task;
        unsigned size;
        struct page *page = alloc_page(GFP_KERNEL);
        if (!page)
@@ -432,8 +432,8 @@ static int idedisk_prepare_discard(struct request_queue *q, 
 
        task->tf_flags   = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
                           IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |
-                          IDE_TFLAG_DYN;
-       task->data_phase = TASKFILE_OUT_DMA;
+                          IDE_TFLAG_DYN | IDE_TFLAG_WRITE;
+       task->protocol = ATA_PROT_DMA;
 
        rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
        rq->special = task;

If I've understood 0dfb991c6943c810175376b58d1c29cfe532541b correctly,
this should be equivalent.  Bart?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 5/5] libata: Add support for TRIM
  2009-04-02 14:37       ` [PATCH 5/5] libata: " Matthew Wilcox
@ 2009-04-02 17:20         ` Mark Lord
  2009-04-02 17:55           ` Matthew Wilcox
  2009-04-16 20:25         ` Mark Lord
  1 sibling, 1 reply; 59+ messages in thread
From: Mark Lord @ 2009-04-02 17:20 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik

Matthew Wilcox wrote:
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
>  drivers/ata/libata-scsi.c |   46 +++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 46 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index b9747fa..d4c8b8b 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -1029,6 +1029,46 @@ static void ata_gen_ata_sense(struct ata_queued_cmd *qc)
>  	desc[11] = block;
>  }
>  
> +static int ata_discard_fn(struct request_queue *q, struct request *req,
> +							struct bio *bio)
> +{
> +	unsigned size;
> +	struct page *page = alloc_page(GFP_KERNEL);
> +	if (!page)
> +		goto error;
> +
> +	size = ata_set_lba_range_entries(page_address(page), PAGE_SIZE / 8,
> +					bio->bi_sector, bio_sectors(bio));
> +	bio->bi_size = 0;
> +	if (bio_add_pc_page(q, bio, page, size, 0) < size)
> +		goto free_page;
> +
> +	req->cmd_type = REQ_TYPE_BLOCK_PC;
> +	req->cmd_len = 16;
> +	req->cmd[0] = ATA_16;
> +	req->cmd[1] = (6 << 1) | 1;		/* dma, 48-bit */
> +	req->cmd[2] = 0x6;			/* length, direction */
> +	req->cmd[3] = 0;			/* feature high */
> +	req->cmd[4] = ATA_DSM_TRIM;		/* feature low */
> +	req->cmd[5] = (size / 512) >> 8;	/* nsect high */
> +	req->cmd[6] = size / 512;		/* nsect low */
> +	req->cmd[7] = 0;			/* lba */
> +	req->cmd[8] = 0;			/* lba */
> +	req->cmd[9] = 0;			/* lba */
> +	req->cmd[10] = 0;			/* lba */
> +	req->cmd[11] = 0;			/* lba */
> +	req->cmd[12] = 0;			/* lba */
> +	req->cmd[13] = ATA_LBA;			/* device */
> +	req->cmd[14] = ATA_CMD_DSM;		/* command */
> +	req->cmd[15] = 0;			/* control */
> +
> +	return 0;
> + free_page:
> +	__free_page(page);
> + error:
> +	return -ENOMEM;
> +}
> +
>  static void ata_scsi_sdev_config(struct scsi_device *sdev)
>  {
>  	sdev->use_10_for_rw = 1;
> @@ -1077,6 +1117,9 @@ static int ata_scsi_dev_config(struct scsi_device *sdev,
>  	/* configure max sectors */
>  	blk_queue_max_sectors(sdev->request_queue, dev->max_sectors);
>  
> +	if (ata_id_has_trim(dev->id))
> +		blk_queue_set_discard(sdev->request_queue, ata_discard_fn);
> +
>  	if (dev->class == ATA_DEV_ATAPI) {
>  		struct request_queue *q = sdev->request_queue;
>  		void *buf;
> @@ -2385,6 +2428,9 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
>  		/* sector size */
>  		rbuf[10] = ATA_SECT_SIZE >> 8;
>  		rbuf[11] = ATA_SECT_SIZE & 0xff;
> +
> +		if (ata_id_has_trim(args->id))
> +			rbuf[14] = 0x80;
>  	}
>  
>  	return 0;
..

Now we just need an implementation of ata_discard_fn()
that issues the CFA ERASE SECTORS command instead of TRIM
when speaking to a CFA device.

Ah, life is good!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 5/5] libata: Add support for TRIM
  2009-04-02 17:20         ` Mark Lord
@ 2009-04-02 17:55           ` Matthew Wilcox
  0 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-02 17:55 UTC (permalink / raw)
  To: Mark Lord; +Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse

On Thu, Apr 02, 2009 at 01:20:01PM -0400, Mark Lord wrote:
> Now we just need an implementation of ata_discard_fn()
> that issues the CFA ERASE SECTORS command instead of TRIM
> when speaking to a CFA device.

Yes, should be quite straightforward.  The only minor wrinkle I can see
is that CFA ERASE SECTORS only allows you to erase up to 256 sectors at a
time.  I think Dave's infrastructure for Discard lets us handle that ...

> Ah, life is good!

Isn't it though?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-02 16:51             ` Matthew Wilcox
@ 2009-04-02 19:37               ` Bartlomiej Zolnierkiewicz
  2009-04-07 21:38                 ` Bartlomiej Zolnierkiewicz
  0 siblings, 1 reply; 59+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2009-04-02 19:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sergei Shtylyov, linux-ide, linux-kernel, jgarzik, David Woodhouse

On Thursday 02 April 2009, Matthew Wilcox wrote:
> On Thu, Apr 02, 2009 at 08:38:03PM +0400, Sergei Shtylyov wrote:
> > Hello.
> >>>> +	task->tf_flags   = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
> >>>> +			   IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |
> >
> >>>   The last 3 flags are going to be obsoleted too...
> >
> >> So if I remove them today, the command will still work?
> >
> >     s/obsoleted/renamed and moved to another other field/ -- I'm going to 
> > submit a patchset refactoring 'struct ide_cmd and 'struct ide-taskfile' 
> > at last...
> 
> Since I'm coding to today's kernel, not to a patch which doesn't exist
> yet, taking them out won't work very well.
> 
> I've not been paying much attention to drivers/ide, so I've no idea
> whether the following patch works.  It does at least compile:
> 
> +++ b/drivers/ide/ide-disk.c
> @@ -407,7 +407,7 @@ static void idedisk_prepare_flush(struct request_queue *q, s
>  static int idedisk_prepare_discard(struct request_queue *q, struct request *rq,
>                                                         struct bio *bio)
>  {
> -       ide_task_t *task;
> +       struct ide_cmd *task;
>         unsigned size;
>         struct page *page = alloc_page(GFP_KERNEL);
>         if (!page)
> @@ -432,8 +432,8 @@ static int idedisk_prepare_discard(struct request_queue *q, 
>  
>         task->tf_flags   = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
>                            IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |
> -                          IDE_TFLAG_DYN;
> -       task->data_phase = TASKFILE_OUT_DMA;
> +                          IDE_TFLAG_DYN | IDE_TFLAG_WRITE;
> +       task->protocol = ATA_PROT_DMA;
>  
>         rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
>         rq->special = task;
> 
> If I've understood 0dfb991c6943c810175376b58d1c29cfe532541b correctly,
> this should be equivalent.  Bart?

Yep.  The final patch looks fine overall (thanks for adding TRIM support
also to drivers/ide), with the small exception for this chunk:

@@ -521,6 +567,9 @@ static int set_wcache(ide_drive_t *drive, int arg)
 
        update_ordered(drive);
 
+       if (ata_id_has_trim(drive->id))
+               blk_queue_set_discard(drive->queue, idedisk_prepare_discard);
+
        return err;
 }

as it would fit better somewhere in ide_disk_setup() ISO set_wcache()...

[ When it comes to CodingStyle issues I don't care that much though I think
  that applying Sergei's suggestions will make the patch more consistent with
  the rest of drivers/ide code... ]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/5] Block: Discard may need to allocate pages
  2009-04-02 14:37 [PATCH 1/5] Block: Discard may need to allocate pages Matthew Wilcox
  2009-04-02 14:37 ` [PATCH 2/5] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads Matthew Wilcox
@ 2009-04-05 12:28 ` Boaz Harrosh
  2009-04-06 20:34   ` Matthew Wilcox
  2009-05-03  6:11   ` Matthew Wilcox
  2009-04-17 21:23 ` [PATCH 1/5] Block: Discard may need to allocate pages Mark Lord
  2 siblings, 2 replies; 59+ messages in thread
From: Boaz Harrosh @ 2009-04-05 12:28 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik, Matthew Wilcox

On 04/02/2009 05:37 PM, Matthew Wilcox wrote:
> From: Matthew Wilcox <matthew@wil.cx>
> 
> SCSI and ATA both need to send data to the device.  In order to do this,
> the BIO must be allocated with room for a page to be added, and the bio
> needs to be passed to the discard prep function.  We also need to free
> the page attached to the BIO before we free it.
> 
> init_request_from_bio() is not currently called from a context which
> forbids sleeping, and to make sure it stays that way (so we don't have
> to use GFP_ATOMIC), add a might_sleep() to it.
> 

I understand you have inherited this code, but I think it is a bit of a mess
and you are only adding to the it.

> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
>  block/blk-barrier.c       |    4 +++-
>  block/blk-core.c          |    4 +++-
>  block/ioctl.c             |    4 +++-
>  drivers/mtd/mtd_blkdevs.c |    2 +-
>  include/linux/blkdev.h    |    3 ++-
>  5 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-barrier.c b/block/blk-barrier.c
> index f7dae57..82a3035 100644
> --- a/block/blk-barrier.c
> +++ b/block/blk-barrier.c
> @@ -356,6 +356,8 @@ static void blkdev_discard_end_io(struct bio *bio, int err)
>  		clear_bit(BIO_UPTODATE, &bio->bi_flags);
>  	}
>  
> +	if (bio_has_data(bio))
> +		__free_page(bio_page(bio));

Page freed which was allocated by the LLD

>  	bio_put(bio);

OK bio was allocated by user code but shouldn't

>  }
>  
> @@ -387,7 +389,7 @@ int blkdev_issue_discard(struct block_device *bdev,
>  		return -EOPNOTSUPP;
>  
>  	while (nr_sects && !ret) {
> -		bio = bio_alloc(gfp_mask, 0);
> +		bio = bio_alloc(gfp_mask, 1);

blkdev_issue_discard() and blk_ioctl_discard() has half a page
of common (and changing) code, could be done to use a common
helper that sets policy about bio allocation sizes and such.

Just my $0.017

>  		if (!bio)
>  			return -ENOMEM;
>  
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 996ed90..7899761 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1095,6 +1095,8 @@ EXPORT_SYMBOL(blk_put_request);
>  
>  void init_request_from_bio(struct request *req, struct bio *bio)
>  {
> +	might_sleep();
> +
>  	req->cpu = bio->bi_comp_cpu;
>  	req->cmd_type = REQ_TYPE_FS;
>  
> @@ -1118,7 +1120,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
>  		req->cmd_flags |= REQ_DISCARD;
>  		if (bio_barrier(bio))
>  			req->cmd_flags |= REQ_SOFTBARRIER;
> -		req->q->prepare_discard_fn(req->q, req);
> +		req->q->prepare_discard_fn(req->q, req, bio);

Allocation of bio page could be done commonly here.
The prepare_discard_fn() is made to return the needed size. It is not as if we actually
give the driver a choice about the allocation.

So now we allocate the page and free it at the same level.
And we do it only in one place.

Same common code in [PATCH 4/5] and [PATCH 4/5] is done once, here.

>  	} else if (unlikely(bio_barrier(bio)))
>  		req->cmd_flags |= (REQ_HARDBARRIER | REQ_NOMERGE);
>  
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 0f22e62..088a9ba 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -145,7 +145,7 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
>  		DECLARE_COMPLETION_ONSTACK(wait);
>  		struct bio *bio;
>  
> -		bio = bio_alloc(GFP_KERNEL, 0);
> +		bio = bio_alloc(GFP_KERNEL, 1);

This is deja vu, don't you think ;)

>  		if (!bio)
>  			return -ENOMEM;
>  
> @@ -170,6 +170,8 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
>  			ret = -EOPNOTSUPP;
>  		else if (!bio_flagged(bio, BIO_UPTODATE))
>  			ret = -EIO;
> +		if (bio_has_data(bio))
> +			__free_page(bio_page(bio));
>  		bio_put(bio);
>  	}
>  	return ret;
> diff --git a/drivers/mtd/mtd_blkdevs.c b/drivers/mtd/mtd_blkdevs.c
> index 1409f01..2b6ed4b 100644
> --- a/drivers/mtd/mtd_blkdevs.c
> +++ b/drivers/mtd/mtd_blkdevs.c
> @@ -33,7 +33,7 @@ struct mtd_blkcore_priv {
>  };
>  
>  static int blktrans_discard_request(struct request_queue *q,
> -				    struct request *req)
> +				    struct request *req, struct bio *bio)
>  {
>  	req->cmd_type = REQ_TYPE_LINUX_BLOCK;
>  	req->cmd[0] = REQ_LB_OP_DISCARD;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 465d6ba..9d9bd7b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -260,7 +260,8 @@ typedef void (request_fn_proc) (struct request_queue *q);
>  typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);
>  typedef int (prep_rq_fn) (struct request_queue *, struct request *);
>  typedef void (unplug_fn) (struct request_queue *);
> -typedef int (prepare_discard_fn) (struct request_queue *, struct request *);
> +typedef int (prepare_discard_fn) (struct request_queue *, struct request *,
> +							struct bio *bio);
>  
>  struct bio_vec;
>  struct bvec_merge_data {

I have one question:

At [PATCH 4/5] and [PATCH 4/5] you do:
+	struct page *page = alloc_page(GFP_KERNEL);

does that zero the alloced page? since if I understand correctly this page
will go on the wire, a SW target on the other size could snoop random Kernel
memory, is that allowed? OK I might be totally clueless here.

Have a good day
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/5] Block: Discard may need to allocate pages
  2009-04-05 12:28 ` [PATCH 1/5] Block: Discard may need to allocate pages Boaz Harrosh
@ 2009-04-06 20:34   ` Matthew Wilcox
  2009-05-03  6:11   ` Matthew Wilcox
  1 sibling, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-06 20:34 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-ide, linux-kernel, jgarzik, Matthew Wilcox

On Sun, Apr 05, 2009 at 03:28:07PM +0300, Boaz Harrosh wrote:
> > +	if (bio_has_data(bio))
> > +		__free_page(bio_page(bio));
> 
> Page freed which was allocated by the LLD

Not by the LLD.  By the ULD.

> blkdev_issue_discard() and blk_ioctl_discard() has half a page
> of common (and changing) code, could be done to use a common
> helper that sets policy about bio allocation sizes and such.

Sure, could be done.

> > -		req->q->prepare_discard_fn(req->q, req);
> > +		req->q->prepare_discard_fn(req->q, req, bio);
> 
> Allocation of bio page could be done commonly here.

Not all prepare_discard_fn() implementations need or want a bio page.
The CFA ERASE SECTORS command is a non-data command, for example.

> The prepare_discard_fn() is made to return the needed size. It is not as if we actually
> give the driver a choice about the allocation.

Umm ... prepare_discard_fn() needs to fill in the page.  I don't
understand what code you propose here.

> > -		bio = bio_alloc(GFP_KERNEL, 0);
> > +		bio = bio_alloc(GFP_KERNEL, 1);
> 
> This is deja vu, don't you think ;)

Yep.

> I have one question:
> 
> At [PATCH 4/5] and [PATCH 4/5] you do:
> +	struct page *page = alloc_page(GFP_KERNEL);
> 
> does that zero the alloced page? since if I understand correctly this page
> will go on the wire, a SW target on the other size could snoop random Kernel
> memory, is that allowed? OK I might be totally clueless here.

The prepare_discard_fn() is responsible for not leaking data.  The ATA
one does it via memset() to a 512 byte boundary.  The SCSI one
initialises the 24 bytes that it sends on the wire.  I don't think
either implementation leaks uninitialised data.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/5] ata: Add TRIM infrastructure
  2009-04-02 14:37   ` [PATCH 3/5] ata: Add TRIM infrastructure Matthew Wilcox
  2009-04-02 14:37     ` [PATCH 4/5] ide: Add support for TRIM Matthew Wilcox
  2009-04-02 15:55     ` [PATCH 3/5] ata: Add TRIM infrastructure Sergei Shtylyov
@ 2009-04-07  0:02     ` Jeff Garzik
  2 siblings, 0 replies; 59+ messages in thread
From: Jeff Garzik @ 2009-04-07  0:02 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse

Matthew Wilcox wrote:
> This is common code shared between the IDE and libata implementations
> 
> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
>  include/linux/ata.h |   41 +++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 41 insertions(+), 0 deletions(-)

applied



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-02 14:37     ` [PATCH 4/5] ide: Add support for TRIM Matthew Wilcox
  2009-04-02 14:37       ` [PATCH 5/5] libata: " Matthew Wilcox
  2009-04-02 15:58       ` [PATCH 4/5] ide: " Sergei Shtylyov
@ 2009-04-07 17:20       ` Jeff Garzik
  2009-04-07 17:57         ` Mark Lord
  2009-04-08 14:33       ` Mark Lord
  3 siblings, 1 reply; 59+ messages in thread
From: Jeff Garzik @ 2009-04-07 17:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-ide, linux-kernel, Bartlomiej Zolnierkiewicz, David Woodhouse

Matthew Wilcox wrote:
> From: David Woodhouse <David.Woodhouse@intel.com>
> 
> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
> [modified to reduce amount of special casing needed for discard]
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
>  drivers/ide/ide-disk.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 49 insertions(+), 0 deletions(-)

OK, the linux/ata.h portion of TRIM support is now upstream.  This can 
go via Bart's tree.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-07 17:20       ` Jeff Garzik
@ 2009-04-07 17:57         ` Mark Lord
  2009-04-07 18:10           ` Markus Trippelsdorf
  0 siblings, 1 reply; 59+ messages in thread
From: Mark Lord @ 2009-04-07 17:57 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, linux-ide, linux-kernel,
	Bartlomiej Zolnierkiewicz, David Woodhouse

Are there any commercially available SSDs that have this feature yet?
Anyone know which ones?

Thanks

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-07 17:57         ` Mark Lord
@ 2009-04-07 18:10           ` Markus Trippelsdorf
  2009-04-07 19:58             ` Mark Lord
  0 siblings, 1 reply; 59+ messages in thread
From: Markus Trippelsdorf @ 2009-04-07 18:10 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jeff Garzik, Matthew Wilcox, linux-ide, linux-kernel,
	Bartlomiej Zolnierkiewicz, David Woodhouse

On Tue, Apr 07, 2009 at 01:57:54PM -0400, Mark Lord wrote:
> Are there any commercially available SSDs that have this feature yet?
> Anyone know which ones?

OCZ will release a new firmware for their Vertex line in the coming
days, which will feature ATA TRIM support.
And since they get their firmware directly from Indilinx, all SSDs
with this controller will be TRIM ready in the future.

http://www.ocztechnologyforum.com/forum/showthread.php?t=54373

-- 
Markus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-07 18:10           ` Markus Trippelsdorf
@ 2009-04-07 19:58             ` Mark Lord
  2009-04-08  7:14               ` Markus Trippelsdorf
  0 siblings, 1 reply; 59+ messages in thread
From: Mark Lord @ 2009-04-07 19:58 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Mark Lord, Jeff Garzik, Matthew Wilcox, linux-ide, linux-kernel,
	Bartlomiej Zolnierkiewicz, David Woodhouse

Markus Trippelsdorf wrote:
> On Tue, Apr 07, 2009 at 01:57:54PM -0400, Mark Lord wrote:
>> Are there any commercially available SSDs that have this feature yet?
>> Anyone know which ones?
> 
> OCZ will release a new firmware for their Vertex line in the coming
> days, which will feature ATA TRIM support.
> And since they get their firmware directly from Indilinx, all SSDs
> with this controller will be TRIM ready in the future.
> 
> http://www.ocztechnologyforum.com/forum/showthread.php?t=54373
..

Mmm.. sounds like they really don't understand "standards" there.
All of that nonsense about waiting until their proprietary "trim utility"
is working on Windows-7 before looking at a Linux implementation.

WTF?  So long as they implement the required ATA opcode (correctly),
then it should "just work" fine on Linux.  But that Windows-centred
discussion there really doesn't sound encouraging.

But hey, if they do get their act together, then the Vertex series
are supposedly the second-best consumer SSDs available these days,
after the highly touted/priced Intel ones.  :)

Cheers


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-02 19:37               ` Bartlomiej Zolnierkiewicz
@ 2009-04-07 21:38                 ` Bartlomiej Zolnierkiewicz
  2009-04-07 22:15                   ` Matthew Wilcox
  0 siblings, 1 reply; 59+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2009-04-07 21:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sergei Shtylyov, linux-ide, linux-kernel, jgarzik, David Woodhouse

On Thursday 02 April 2009 21:37:35 Bartlomiej Zolnierkiewicz wrote:
> On Thursday 02 April 2009, Matthew Wilcox wrote:
> > On Thu, Apr 02, 2009 at 08:38:03PM +0400, Sergei Shtylyov wrote:
> > > Hello.
> > >>>> +	task->tf_flags   = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
> > >>>> +			   IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |
> > >
> > >>>   The last 3 flags are going to be obsoleted too...
> > >
> > >> So if I remove them today, the command will still work?
> > >
> > >     s/obsoleted/renamed and moved to another other field/ -- I'm going to 
> > > submit a patchset refactoring 'struct ide_cmd and 'struct ide-taskfile' 
> > > at last...
> > 
> > Since I'm coding to today's kernel, not to a patch which doesn't exist
> > yet, taking them out won't work very well.
> > 
> > I've not been paying much attention to drivers/ide, so I've no idea
> > whether the following patch works.  It does at least compile:
> > 
> > +++ b/drivers/ide/ide-disk.c
> > @@ -407,7 +407,7 @@ static void idedisk_prepare_flush(struct request_queue *q, s
> >  static int idedisk_prepare_discard(struct request_queue *q, struct request *rq,
> >                                                         struct bio *bio)
> >  {
> > -       ide_task_t *task;
> > +       struct ide_cmd *task;
> >         unsigned size;
> >         struct page *page = alloc_page(GFP_KERNEL);
> >         if (!page)
> > @@ -432,8 +432,8 @@ static int idedisk_prepare_discard(struct request_queue *q, 
> >  
> >         task->tf_flags   = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
> >                            IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |
> > -                          IDE_TFLAG_DYN;
> > -       task->data_phase = TASKFILE_OUT_DMA;
> > +                          IDE_TFLAG_DYN | IDE_TFLAG_WRITE;
> > +       task->protocol = ATA_PROT_DMA;
> >  
> >         rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
> >         rq->special = task;
> > 
> > If I've understood 0dfb991c6943c810175376b58d1c29cfe532541b correctly,
> > this should be equivalent.  Bart?
> 
> Yep.  The final patch looks fine overall (thanks for adding TRIM support
> also to drivers/ide), with the small exception for this chunk:
> 
> @@ -521,6 +567,9 @@ static int set_wcache(ide_drive_t *drive, int arg)
>  
>         update_ordered(drive);
>  
> +       if (ata_id_has_trim(drive->id))
> +               blk_queue_set_discard(drive->queue, idedisk_prepare_discard);
> +
>         return err;
>  }
> 
> as it would fit better somewhere in ide_disk_setup() ISO set_wcache()...
> 
> [ When it comes to CodingStyle issues I don't care that much though I think
>   that applying Sergei's suggestions will make the patch more consistent with
>   the rest of drivers/ide code... ]

I would like to see this merged for -rc1 so I tried to put it altogether
and then integrate it over recent IDE and block changes...

I ended up with the patch below which I'll push to Linus in a few minutes.

Once it hits Linus' tree please test that it still works. :)

Thanks.

From: David Woodhouse <David.Woodhouse@intel.com>
Subject: [PATCH 4/5] ide: Add support for TRIM

Bart: Since this is a new feature and hasn't seen much testing with
production devices it is not enabled by default yet (requires use of
"ide_core.trim=1" kernel parameter).

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
[modified to reduce amount of special casing needed for discard]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jeff Garzik <jgarzik@redhat.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
[bart: apply follow-up changes from Matthew]
[bart: apply CodingStyle fixups per Sergei's suggestion]
[bart: move blk_queue_set_discard() call to ide_disk_setup()]
[bart: port over recent IDE changes]
[bart: remove "bio" argument from idedisk_prepare_discard()]
[bart: s/PAGE_SIZE/PAGE_SIZE + 8/ + set ATA_LBA bit in tf.device]
[bart: add "ide_core.trim=" kernel parameter]
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
---
 drivers/ide/ide-disk.c |   53 +++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/ide/ide.c      |    6 +++++
 2 files changed, 59 insertions(+)

Index: b/drivers/ide/ide-disk.c
===================================================================
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -415,6 +415,54 @@ static void idedisk_prepare_flush(struct
 	rq->special = cmd;
 }
 
+static int idedisk_prepare_discard(struct request_queue *q, struct request *rq)
+{
+	struct ide_cmd *cmd;
+	struct page *page;
+	struct bio *bio = rq->bio;
+	unsigned int size;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		goto error;
+
+	cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+	if (!cmd)
+		goto free_page;
+
+	size = ata_set_lba_range_entries(page_address(page), PAGE_SIZE / 8,
+					 bio->bi_sector, bio_sectors(bio));
+	bio->bi_size = 0;
+
+	if (bio_add_pc_page(q, bio, page, size, 0) < size)
+		goto free_task;
+
+	cmd->tf.command = ATA_CMD_DSM;
+	cmd->tf.feature = ATA_DSM_TRIM;
+	cmd->tf.nsect   = size / 512;
+	cmd->hob.nsect  = (size / 512) >> 8;
+	cmd->tf.device  = ATA_LBA;
+
+	cmd->valid.out.tf  = IDE_VALID_OUT_TF | IDE_VALID_DEVICE;
+	cmd->valid.out.hob = IDE_VALID_OUT_HOB;
+
+	cmd->tf_flags = IDE_TFLAG_LBA48 | IDE_TFLAG_WRITE | IDE_TFLAG_DYN;
+	cmd->protocol = ATA_PROT_DMA;
+
+	rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
+	rq->special = cmd;
+	rq->nr_sectors = size / 512;
+
+	return 0;
+
+ free_task:
+	kfree(cmd);
+ free_page:
+	__free_page(page);
+ error:
+	return -ENOMEM;
+}
+
 ide_devset_get(multcount, mult_count);
 
 /*
@@ -606,6 +654,8 @@ static int ide_disk_check(ide_drive_t *d
 	return 1;
 }
 
+extern int ide_trim;
+
 static void ide_disk_setup(ide_drive_t *drive)
 {
 	struct ide_disk_obj *idkp = drive->driver_data;
@@ -693,6 +743,9 @@ static void ide_disk_setup(ide_drive_t *
 
 	set_wcache(drive, 1);
 
+	if (ata_id_has_trim(id) && ide_trim)
+		blk_queue_set_discard(q, idedisk_prepare_discard);
+
 	if ((drive->dev_flags & IDE_DFLAG_LBA) == 0 &&
 	    (drive->head == 0 || drive->head > 16)) {
 		printk(KERN_ERR "%s: invalid geometry: %d physical heads?\n",
Index: b/drivers/ide/ide.c
===================================================================
--- a/drivers/ide/ide.c
+++ b/drivers/ide/ide.c
@@ -178,6 +178,12 @@ EXPORT_SYMBOL_GPL(ide_pci_clk);
 module_param_named(pci_clock, ide_pci_clk, int, 0);
 MODULE_PARM_DESC(pci_clock, "PCI bus clock frequency (in MHz)");
 
+int ide_trim = 0;
+EXPORT_SYMBOL_GPL(ide_trim);
+
+module_param_named(trim, ide_trim, int, 0);
+MODULE_PARM_DESC(trim, "TRIM support (0=off, 1=on)");
+
 static int ide_set_dev_param_mask(const char *s, struct kernel_param *kp)
 {
 	int a, b, i, j = 1;



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-07 21:38                 ` Bartlomiej Zolnierkiewicz
@ 2009-04-07 22:15                   ` Matthew Wilcox
  2009-04-07 22:26                     ` Jeff Garzik
  2009-04-07 22:35                     ` Bartlomiej Zolnierkiewicz
  0 siblings, 2 replies; 59+ messages in thread
From: Matthew Wilcox @ 2009-04-07 22:15 UTC (permalink / raw)
  To: Bartlomiej Zolnierkiewicz
  Cc: Sergei Shtylyov, linux-ide, linux-kernel, jgarzik, David Woodhouse

On Tue, Apr 07, 2009 at 11:38:59PM +0200, Bartlomiej Zolnierkiewicz wrote:
> I would like to see this merged for -rc1 so I tried to put it altogether
> and then integrate it over recent IDE and block changes...

You're missing 1/5 and 2/5.  This patch will crash horribly if you use
it with a drive which supports TRIM.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-07 22:15                   ` Matthew Wilcox
@ 2009-04-07 22:26                     ` Jeff Garzik
  2009-04-07 22:35                     ` Bartlomiej Zolnierkiewicz
  1 sibling, 0 replies; 59+ messages in thread
From: Jeff Garzik @ 2009-04-07 22:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Bartlomiej Zolnierkiewicz, Sergei Shtylyov, linux-ide,
	linux-kernel, jgarzik, David Woodhouse

Matthew Wilcox wrote:
> On Tue, Apr 07, 2009 at 11:38:59PM +0200, Bartlomiej Zolnierkiewicz wrote:
>> I would like to see this merged for -rc1 so I tried to put it altogether
>> and then integrate it over recent IDE and block changes...
> 
> You're missing 1/5 and 2/5.  This patch will crash horribly if you use
> it with a drive which supports TRIM.

For things that need to hit along with the block layer, Bart and I could 
Ack the IDE+libata patches, and Jens could merge all of them.

Or just poke Jens to review and merge the block stuff.  Then the rest 
can go at any time.

	Jeff




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-07 22:15                   ` Matthew Wilcox
  2009-04-07 22:26                     ` Jeff Garzik
@ 2009-04-07 22:35                     ` Bartlomiej Zolnierkiewicz
  1 sibling, 0 replies; 59+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2009-04-07 22:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sergei Shtylyov, linux-ide, linux-kernel, jgarzik, David Woodhouse

On Wednesday 08 April 2009 00:15:54 Matthew Wilcox wrote:
> On Tue, Apr 07, 2009 at 11:38:59PM +0200, Bartlomiej Zolnierkiewicz wrote:
> > I would like to see this merged for -rc1 so I tried to put it altogether
> > and then integrate it over recent IDE and block changes...
> 
> You're missing 1/5 and 2/5.  This patch will crash horribly if you use
> it with a drive which supports TRIM.

Heh, I knew something was out of place...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-07 19:58             ` Mark Lord
@ 2009-04-08  7:14               ` Markus Trippelsdorf
  2009-04-08 14:25                 ` Mark Lord
  0 siblings, 1 reply; 59+ messages in thread
From: Markus Trippelsdorf @ 2009-04-08  7:14 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jeff Garzik, Matthew Wilcox, linux-ide, linux-kernel,
	Bartlomiej Zolnierkiewicz, David Woodhouse

On Tue, Apr 07, 2009 at 03:58:41PM -0400, Mark Lord wrote:
> Markus Trippelsdorf wrote:
> > On Tue, Apr 07, 2009 at 01:57:54PM -0400, Mark Lord wrote:
> >> Are there any commercially available SSDs that have this feature yet?
> >> Anyone know which ones?
> > 
> > OCZ will release a new firmware for their Vertex line in the coming
> > days, which will feature ATA TRIM support.
> > And since they get their firmware directly from Indilinx, all SSDs
> > with this controller will be TRIM ready in the future.
> > 
> > http://www.ocztechnologyforum.com/forum/showthread.php?t=54373
> ..
> 
> Mmm.. sounds like they really don't understand "standards" there.
> All of that nonsense about waiting until their proprietary "trim utility"
> is working on Windows-7 before looking at a Linux implementation.
> 
> WTF?  So long as they implement the required ATA opcode (correctly),
> then it should "just work" fine on Linux.  But that Windows-centred
> discussion there really doesn't sound encouraging.
> 
> But hey, if they do get their act together, then the Vertex series
> are supposedly the second-best consumer SSDs available these days,
> after the highly touted/priced Intel ones.  :)

The new firmware is out and looks promising:

########## Begin SSD TRIM output ##########
word21 bit10 = 1: support for the DATA SET MANAGEMENT is changeable.
word169 bit0 = 1: device supports the Trim bit of DATA SET MANAGEMENT command.
  word69 bit14 = 1: Trim function of the DATA SET MANAGEMENT command supports determinate behavior.
########## End SSD TRIM output ##########

This is the output from a modified hdparm, that checks the relevant
bits. 

(I didn't run hdparm on my machine, because I'm waiting for a flasher that
doesn't require XP. I just copied the output from:
http://www.ocztechnologyforum.com/forum/showpost.php?p=371277&postcount=28 )

-- 
Markus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-08  7:14               ` Markus Trippelsdorf
@ 2009-04-08 14:25                 ` Mark Lord
  0 siblings, 0 replies; 59+ messages in thread
From: Mark Lord @ 2009-04-08 14:25 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Mark Lord, Jeff Garzik, Matthew Wilcox, linux-ide, linux-kernel,
	Bartlomiej Zolnierkiewicz, David Woodhouse

Markus Trippelsdorf wrote:
> ..
> The new firmware is out and looks promising:
> 
> ########## Begin SSD TRIM output ##########
> word21 bit10 = 1: support for the DATA SET MANAGEMENT is changeable.
> word169 bit0 = 1: device supports the Trim bit of DATA SET MANAGEMENT command.
>   word69 bit14 = 1: Trim function of the DATA SET MANAGEMENT command supports determinate behavior.
> ########## End SSD TRIM output ##########
> 
> This is the output from a modified hdparm, that checks the relevant bits. 
..

I'll release hdparm-9.14 shortly, with reporting for the TRIM feature in the -I output.

Cheers

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-02 14:37     ` [PATCH 4/5] ide: Add support for TRIM Matthew Wilcox
                         ` (2 preceding siblings ...)
  2009-04-07 17:20       ` Jeff Garzik
@ 2009-04-08 14:33       ` Mark Lord
  2009-04-08 14:44         ` Dongjun Shin
  2009-04-08 14:59         ` Jeff Garzik
  3 siblings, 2 replies; 59+ messages in thread
From: Mark Lord @ 2009-04-08 14:33 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik, David Woodhouse

Matthew Wilcox wrote:
> From: David Woodhouse <David.Woodhouse@intel.com>
> 
> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
> [modified to reduce amount of special casing needed for discard]
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
..
> +	task->tf.command = ATA_CMD_DSM;
> +	task->tf.feature = ATA_DSM_TRIM;
> +	task->tf.hob_feature = 0x00;
> +	task->tf.nsect = size / 512;
> +	task->tf.hob_nsect = (size / 512) >> 8;
..

Matthew (or others),

Could you perhaps explain what the purpose of the data portion
of this command is for?  The draft ATA spec I have here has mangled
text in the description -- like it's missing a crucial sentence there.

So it's not obvious exactly why this command even needs data to
be transfered.

Thanks

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-08 14:33       ` Mark Lord
@ 2009-04-08 14:44         ` Dongjun Shin
  2009-04-08 14:59         ` Jeff Garzik
  1 sibling, 0 replies; 59+ messages in thread
From: Dongjun Shin @ 2009-04-08 14:44 UTC (permalink / raw)
  To: Mark Lord
  Cc: Matthew Wilcox, linux-ide, linux-kernel, jgarzik, David Woodhouse

On Wed, Apr 8, 2009 at 11:33 PM, Mark Lord <liml@rtr.ca> wrote:
>
> Matthew Wilcox wrote:
>>
>> From: David Woodhouse <David.Woodhouse@intel.com>
>>
>> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
>> [modified to reduce amount of special casing needed for discard]
>> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
>> ---
>
> ..
>>
>> +       task->tf.command = ATA_CMD_DSM;
>> +       task->tf.feature = ATA_DSM_TRIM;
>> +       task->tf.hob_feature = 0x00;
>> +       task->tf.nsect = size / 512;
>> +       task->tf.hob_nsect = (size / 512) >> 8;
>
> ..
>
> Matthew (or others),
>
> Could you perhaps explain what the purpose of the data portion
> of this command is for?  The draft ATA spec I have here has mangled
> text in the description -- like it's missing a crucial sentence there.
>
> So it's not obvious exactly why this command even needs data to
> be transfered.

It's for transfering list of LBA ranges which composed of
LBA offset (48bit) and length of data to be trimmed (16bit).

--
Dongjun

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-08 14:33       ` Mark Lord
  2009-04-08 14:44         ` Dongjun Shin
@ 2009-04-08 14:59         ` Jeff Garzik
  2009-04-08 15:50           ` Mark Lord
  1 sibling, 1 reply; 59+ messages in thread
From: Jeff Garzik @ 2009-04-08 14:59 UTC (permalink / raw)
  To: Mark Lord
  Cc: Matthew Wilcox, linux-ide, linux-kernel, jgarzik, David Woodhouse

Mark Lord wrote:
> Matthew Wilcox wrote:
>> From: David Woodhouse <David.Woodhouse@intel.com>
>>
>> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
>> [modified to reduce amount of special casing needed for discard]
>> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
>> ---
> ..
>> +    task->tf.command = ATA_CMD_DSM;
>> +    task->tf.feature = ATA_DSM_TRIM;
>> +    task->tf.hob_feature = 0x00;
>> +    task->tf.nsect = size / 512;
>> +    task->tf.hob_nsect = (size / 512) >> 8;
> ..
> 
> Matthew (or others),
> 
> Could you perhaps explain what the purpose of the data portion
> of this command is for?  The draft ATA spec I have here has mangled
> text in the description -- like it's missing a crucial sentence there.
> 
> So it's not obvious exactly why this command even needs data to
> be transfered.

It's a list of LBA ranges, like the NV-cache stuff recently added.

	Jeff




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/5] ide: Add support for TRIM
  2009-04-08 14:59         ` Jeff Garzik
@ 2009-04-08 15:50           ` Mark Lord
  0 siblings, 0 replies; 59+ messages in thread
From: Mark Lord @ 2009-04-08 15:50 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, linux-ide, linux-kernel, jgarzik, David Woodhouse

Jeff Garzik wrote:
> Mark Lord wrote:
..
>> So it's not obvious exactly why this command even needs data to
>> be transfered.
> 
> It's a list of LBA ranges, like the NV-cache stuff recently added.
..

Ack.  Thanks.  I found the LBA range description (intact) in the drafts now.

-ml

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 5/5] libata: Add support for TRIM
  2009-04-02 14:37       ` [PATCH 5/5] libata: " Matthew Wilcox
  2009-04-02 17:20         ` Mark Lord
@ 2009-04-16 20:25         ` Mark Lord
  2009-04-17 19:44           ` Mark Lord
  1 sibling, 1 reply; 59+ messages in thread
From: Mark Lord @ 2009-04-16 20:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik

Matthew Wilcox wrote:
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
>  drivers/ata/libata-scsi.c |   46 +++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 46 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index b9747fa..d4c8b8b 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
...

It works.  :)

I'm using TRIM on a 120GB OCZ Vertex SSD here now,
and instrumented ata_discard_fn() to log what it does.
Seems to work perfectly, and the sector ranges for TRIM
match those given by "hdparm --fibmap" for the files I've tried.

Does this also TRIM metadata (eg. block bitmaps and the like) ?

Thanks Matthew (and others).

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 5/5] libata: Add support for TRIM
  2009-04-16 20:25         ` Mark Lord
@ 2009-04-17 19:44           ` Mark Lord
  0 siblings, 0 replies; 59+ messages in thread
From: Mark Lord @ 2009-04-17 19:44 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik

Mark Lord wrote:
> Matthew Wilcox wrote:
>> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
>> ---
>>  drivers/ata/libata-scsi.c |   46 
>> +++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 46 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>> index b9747fa..d4c8b8b 100644
>> --- a/drivers/ata/libata-scsi.c
>> +++ b/drivers/ata/libata-scsi.c
> ...
> 
> It works.  :)
> 
> I'm using TRIM on a 120GB OCZ Vertex SSD here now,
..

I take that back now.  The command is apparently failing,
but we don't log the failure, and nor do we stop attempting
to continue to use it on future TRIMs.

I dug deeper into the kernel side, when my own attempts in hdparm
all failed, both using SG_IO directly and using the BLKDISCARD ioctl().

The commands are simply rejected by the drive, with status=0x51, err=0x04.

Everything *looks* okay on the Linux side, so I'm guessing that the OCZ
drive uses a vendor-unique opcode or protocol for it.  They do have it
in there for a windows tool, but no further info than that.

-ml

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/5] Block: Discard may need to allocate pages
  2009-04-02 14:37 [PATCH 1/5] Block: Discard may need to allocate pages Matthew Wilcox
  2009-04-02 14:37 ` [PATCH 2/5] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads Matthew Wilcox
  2009-04-05 12:28 ` [PATCH 1/5] Block: Discard may need to allocate pages Boaz Harrosh
@ 2009-04-17 21:23 ` Mark Lord
  2 siblings, 0 replies; 59+ messages in thread
From: Mark Lord @ 2009-04-17 21:23 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-ide, linux-kernel, jgarzik, Matthew Wilcox

Matthew Wilcox wrote:
> From: Matthew Wilcox <matthew@wil.cx>
> 
> SCSI and ATA both need to send data to the device.  In order to do this,
> the BIO must be allocated with room for a page to be added, and the bio
> needs to be passed to the discard prep function.  We also need to free
> the page attached to the BIO before we free it.
> 
> init_request_from_bio() is not currently called from a context which
> forbids sleeping, and to make sure it stays that way (so we don't have
> to use GFP_ATOMIC), add a might_sleep() to it.
> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
...
> @@ -1118,7 +1120,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
>  		req->cmd_flags |= REQ_DISCARD;
>  		if (bio_barrier(bio))
>  			req->cmd_flags |= REQ_SOFTBARRIER;
> -		req->q->prepare_discard_fn(req->q, req);
> +		req->q->prepare_discard_fn(req->q, req, bio);
>  	} else if (unlikely(bio_barrier(bio)))
>  		req->cmd_flags |= (REQ_HARDBARRIER | REQ_NOMERGE);
>  
..

Matthew, note that prepare_discardfn() may fail,
and return an error result.  The above code does not
check for this or handle it very well.

Cheers

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/5] Block: Discard may need to allocate pages
  2009-04-05 12:28 ` [PATCH 1/5] Block: Discard may need to allocate pages Boaz Harrosh
  2009-04-06 20:34   ` Matthew Wilcox
@ 2009-05-03  6:11   ` Matthew Wilcox
  2009-05-03  7:16     ` New TRIM/UNMAP tree published (2009-05-02) Matthew Wilcox
  1 sibling, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2009-05-03  6:11 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-ide, linux-kernel, jgarzik, Matthew Wilcox


[I thought I replied to this, but I don't see an indication that I did]

On Sun, Apr 05, 2009 at 03:28:07PM +0300, Boaz Harrosh wrote:
> > +++ b/block/blk-barrier.c
> > @@ -356,6 +356,8 @@ static void blkdev_discard_end_io(struct bio *bio, int err)
> >  		clear_bit(BIO_UPTODATE, &bio->bi_flags);
> >  	}
> >  
> > +	if (bio_has_data(bio))
> > +		__free_page(bio_page(bio));
> 
> Page freed which was allocated by the LLD

It wasn't allocated by the LLD.  It was allocated by the ULD.

> >  	bio_put(bio);
> 
> OK bio was allocated by user code but shouldn't

?  Are you saying the bio should be allocated by each driver
implementing a discard operation?

> >  	while (nr_sects && !ret) {
> > -		bio = bio_alloc(gfp_mask, 0);
> > +		bio = bio_alloc(gfp_mask, 1);
> 
> blkdev_issue_discard() and blk_ioctl_discard() has half a page
> of common (and changing) code, could be done to use a common
> helper that sets policy about bio allocation sizes and such.
> 
> Just my $0.017

Yes, that works nicely.  Thanks for the suggestion.

> > @@ -1118,7 +1120,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
> >  		req->cmd_flags |= REQ_DISCARD;
> >  		if (bio_barrier(bio))
> >  			req->cmd_flags |= REQ_SOFTBARRIER;
> > -		req->q->prepare_discard_fn(req->q, req);
> > +		req->q->prepare_discard_fn(req->q, req, bio);
> 
> Allocation of bio page could be done commonly here.
> The prepare_discard_fn() is made to return the needed size. It is not as if we actually
> give the driver a choice about the allocation.

Not all drivers need to allocate a page.  Some drivers may need
to allocate more than one page, depending on how large the range is.
And the driver can't just return the page size it needs here -- it needs
to fill in the contents of the page too.

I suppose we could do something fairly disgusting like:

		for (;;) {
			struct page *page;
			needed = req->q->prepare_discard_fn(req->q, req, bio);
			if (!needed)
				break;
			page = alloc_page(GFP_KERNEL);
			if (bio_add_pc_page(q, bio, page, PAGE_SIZE, 0) < PAGE_SIZE)
				goto fail;
		}

Then the driver can return 0 => success, anything else => allocate more
ram, try again.

> I have one question:
> 
> At [PATCH 4/5] and [PATCH 4/5] you do:
> +	struct page *page = alloc_page(GFP_KERNEL);
> 
> does that zero the alloced page? since if I understand correctly this page
> will go on the wire, a SW target on the other size could snoop random Kernel
> memory, is that allowed? OK I might be totally clueless here.

alloc_page doesn't zero the page.

scsi only sends out 24 bytes of that page on the wire, and it initialises
all 24 bytes.  ide/ata send multiples of 512 bytes on the wire, and
they're careful to zero any of the space they're not using.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03  6:11   ` Matthew Wilcox
@ 2009-05-03  7:16     ` Matthew Wilcox
  2009-05-03 13:07       ` Hugh Dickins
  0 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2009-05-03  7:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boaz Harrosh, linux-ide, linux-kernel, jgarzik, linux-scsi,
	Jens Axboe, Bartlomiej Zolnierkiewicz

On Sat, May 02, 2009 at 11:11:50PM -0700, Matthew Wilcox wrote:
> > blkdev_issue_discard() and blk_ioctl_discard() has half a page
> > of common (and changing) code, could be done to use a common
> > helper that sets policy about bio allocation sizes and such.
> > 
> > Just my $0.017
> 
> Yes, that works nicely.  Thanks for the suggestion.

I've pushed out a new git tree:

http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090502

Changes since the last version:

 - Based on 2.6.30-rc4.
 - Dropped the three patches which were already merged.
 - Shuffled 'Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of
   reads' to the front of the queue since it's not controversial and could
   be merged earlier.
 - Inserted 'Unify blk_ioctl_discard and blkdev_issue_discard' as patch two.
 - Updated 'ide: Add support for TRIM' to compile with new IDE code.

Still to do:
 - Understand all the changes Bart made to the IDE code; I didn't make all
   the changes that he did.  I just made it compile for now.
 - Dave had some objections to the description of 'Make DISCARD_BARRIER and
   DISCARD_NOBARRIER writes instead of reads', but didn't propose replacement
   text.
 - Figure out what to do about memory allocation.
 - Handle error returns from discard_prep.
 - Test the new code still works.
 - Delve into the latest SCSI spec and see if anything changed in UNMAP

Given the extensive length of the todo list, I shan't send out the mailbomb.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03  7:16     ` New TRIM/UNMAP tree published (2009-05-02) Matthew Wilcox
@ 2009-05-03 13:07       ` Hugh Dickins
  2009-05-03 14:48         ` Matthew Wilcox
  0 siblings, 1 reply; 59+ messages in thread
From: Hugh Dickins @ 2009-05-03 13:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matthew Wilcox, Boaz Harrosh, linux-ide, linux-kernel,
	Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz,
	Mark Lord

On Sun, 3 May 2009, Matthew Wilcox wrote:
> On Sat, May 02, 2009 at 11:11:50PM -0700, Matthew Wilcox wrote:
> > > blkdev_issue_discard() and blk_ioctl_discard() has half a page
> > > of common (and changing) code, could be done to use a common
> > > helper that sets policy about bio allocation sizes and such.
> > > 
> > > Just my $0.017
> > 
> > Yes, that works nicely.  Thanks for the suggestion.
> 
> I've pushed out a new git tree:
> 
> http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090502

I'm glad to see this coming alive again, thank you.

I notice OCZ hope to have updated Vertex firmware mid-May,
such that their TRIM support matches up with (y)ours.

One comment below...

> 
> Changes since the last version:
> 
>  - Based on 2.6.30-rc4.
>  - Dropped the three patches which were already merged.
>  - Shuffled 'Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of
>    reads' to the front of the queue since it's not controversial and could
>    be merged earlier.
>  - Inserted 'Unify blk_ioctl_discard and blkdev_issue_discard' as patch two.
>  - Updated 'ide: Add support for TRIM' to compile with new IDE code.
> 
> Still to do:
>  - Understand all the changes Bart made to the IDE code; I didn't make all
>    the changes that he did.  I just made it compile for now.
>  - Dave had some objections to the description of 'Make DISCARD_BARRIER and
>    DISCARD_NOBARRIER writes instead of reads', but didn't propose replacement
>    text.
>  - Figure out what to do about memory allocation.

You have GFP_KERNEL allocations in your discard_fn()s.  I believe that
allocations down at that level will need to be GFP_NOIO - the bio mempool
might be empty by this point, you don't want memory allocation to get into
submitting even more I/O to satisfy your discard_fn().  Does it need its
own mempool of reserve pages?  My guess is that you can get away without
that, since a failed discard, though regrettable, loses no data.

Hugh

>  - Handle error returns from discard_prep.
>  - Test the new code still works.
>  - Delve into the latest SCSI spec and see if anything changed in UNMAP
> 
> Given the extensive length of the todo list, I shan't send out the mailbomb.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 13:07       ` Hugh Dickins
@ 2009-05-03 14:48         ` Matthew Wilcox
  2009-05-03 15:02           ` Boaz Harrosh
  2009-05-03 15:05           ` Hugh Dickins
  0 siblings, 2 replies; 59+ messages in thread
From: Matthew Wilcox @ 2009-05-03 14:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Matthew Wilcox, Boaz Harrosh, linux-ide, linux-kernel,
	Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz,
	Mark Lord

On Sun, May 03, 2009 at 02:07:32PM +0100, Hugh Dickins wrote:
> I notice OCZ hope to have updated Vertex firmware mid-May,
> such that their TRIM support matches up with (y)ours.

Excellent.

> You have GFP_KERNEL allocations in your discard_fn()s.  I believe that
> allocations down at that level will need to be GFP_NOIO - the bio mempool
> might be empty by this point, you don't want memory allocation to get into
> submitting even more I/O to satisfy your discard_fn().  Does it need its
> own mempool of reserve pages?  My guess is that you can get away without
> that, since a failed discard, though regrettable, loses no data.

You make a good point.  There's definitely still work to be done around
error handling.  OTOH, we could always do what IDE does ...

static void idedisk_prepare_flush(struct request_queue *q, struct request *rq)
{
        ide_drive_t *drive = q->queuedata;
        struct ide_cmd *cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);

        /* FIXME: map struct ide_taskfile on rq->cmd[] */
        BUG_ON(cmd == NULL);



-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 14:48         ` Matthew Wilcox
@ 2009-05-03 15:02           ` Boaz Harrosh
  2009-05-03 15:42             ` Matthew Wilcox
  2009-05-03 15:05           ` Hugh Dickins
  1 sibling, 1 reply; 59+ messages in thread
From: Boaz Harrosh @ 2009-05-03 15:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel,
	Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz,
	Mark Lord

On 05/03/2009 05:48 PM, Matthew Wilcox wrote:
> On Sun, May 03, 2009 at 02:07:32PM +0100, Hugh Dickins wrote:
>> I notice OCZ hope to have updated Vertex firmware mid-May,
>> such that their TRIM support matches up with (y)ours.
> 
> Excellent.
> 
>> You have GFP_KERNEL allocations in your discard_fn()s.  I believe that
>> allocations down at that level will need to be GFP_NOIO - the bio mempool
>> might be empty by this point, you don't want memory allocation to get into
>> submitting even more I/O to satisfy your discard_fn().  Does it need its
>> own mempool of reserve pages?  My guess is that you can get away without
>> that, since a failed discard, though regrettable, loses no data.
> 
> You make a good point.  There's definitely still work to be done around
> error handling.  OTOH, we could always do what IDE does ...
> 
> static void idedisk_prepare_flush(struct request_queue *q, struct request *rq)
> {
>         ide_drive_t *drive = q->queuedata;
>         struct ide_cmd *cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
> 
>         /* FIXME: map struct ide_taskfile on rq->cmd[] */
>         BUG_ON(cmd == NULL);
> 
> 
> 

Out from the frying pan and into the fire. ;-)

I agree with Hugh. The allocation is done at, too-low in the food chain.
(And that free of buffer at upper layer allocated by lower layer).

I think you need to separate the: "does lld need buffer, what size"
from the "here is buffer prepare", so upper layer that can sleep does
sleep.

In all other buffer needing operations the allocation is done before
submission of request, No?

Just my $0.017
Thanks
Boaz

BTW:
 I've not have time to review second round will "TODO"

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 14:48         ` Matthew Wilcox
  2009-05-03 15:02           ` Boaz Harrosh
@ 2009-05-03 15:05           ` Hugh Dickins
  1 sibling, 0 replies; 59+ messages in thread
From: Hugh Dickins @ 2009-05-03 15:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matthew Wilcox, Boaz Harrosh, linux-ide, linux-kernel,
	Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz,
	Mark Lord

On Sun, 3 May 2009, Matthew Wilcox wrote:
> 
> You make a good point.  There's definitely still work to be done around
> error handling.  OTOH, we could always do what IDE does ...
> 
> static void idedisk_prepare_flush(struct request_queue *q, struct request *rq)
> {
>         ide_drive_t *drive = q->queuedata;
>         struct ide_cmd *cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
> 
>         /* FIXME: map struct ide_taskfile on rq->cmd[] */
>         BUG_ON(cmd == NULL);

Oooh, you're winding me up.  Certainly not.  Off with the Top Of your Head!

Hugh

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 15:02           ` Boaz Harrosh
@ 2009-05-03 15:42             ` Matthew Wilcox
  2009-05-03 16:34               ` Boaz Harrosh
  2009-05-03 18:48               ` Bartlomiej Zolnierkiewicz
  0 siblings, 2 replies; 59+ messages in thread
From: Matthew Wilcox @ 2009-05-03 15:42 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel,
	Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz,
	Mark Lord

On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote:
> I agree with Hugh. The allocation is done at, too-low in the food chain.
> (And that free of buffer at upper layer allocated by lower layer).
> 
> I think you need to separate the: "does lld need buffer, what size"
> from the "here is buffer prepare", so upper layer that can sleep does
> sleep.

So you want two function pointers in the request queue relating to discard?

> In all other buffer needing operations the allocation is done before
> submission of request, No?

It's not true for the flush request (the example I quoted).  Obviously,
the solution adopted here by IDE is Bad and Wrong ...

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 15:42             ` Matthew Wilcox
@ 2009-05-03 16:34               ` Boaz Harrosh
  2009-05-03 18:34                 ` Jeff Garzik
  2009-05-03 18:48               ` Bartlomiej Zolnierkiewicz
  1 sibling, 1 reply; 59+ messages in thread
From: Boaz Harrosh @ 2009-05-03 16:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel,
	Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz,
	Mark Lord

On 05/03/2009 06:42 PM, Matthew Wilcox wrote:
> On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote:
>> I agree with Hugh. The allocation is done at, too-low in the food chain.
>> (And that free of buffer at upper layer allocated by lower layer).
>>
>> I think you need to separate the: "does lld need buffer, what size"
>> from the "here is buffer prepare", so upper layer that can sleep does
>> sleep.
> 
> So you want two function pointers in the request queue relating to discard?
> 

OK I don't know what I want, I guess. ;-)

I'm not a block-device export but from the small osdblk device I maintain
it looks like osdblk_prepare_flush which is set into:
    blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, osdblk_prepare_flush);

does some internal structure setup, but the actual flush command is only executed
later in the global osdblk_rq_fn which is set into:
    blk_init_queue(osdblk_rq_fn, &osdev->lock);

But I'm not even sure that prepare_flush is called in a better context then
queue_fn, and what does it means to let block devices take care of another
new command type at queue_fn.

I guess it comes back to Jeff Garzik's comment about not having a central
place to ask the request what we need to do.

But I do hate that allocation is done by driver and free by mid-layer,
so yes two vectors, request_queue is allocated once per device it's not
that bad. And later when Jeff's comment is addressed it can be removed.

>> In all other buffer needing operations the allocation is done before
>> submission of request, No?
> 
> It's not true for the flush request (the example I quoted).  Obviously,
> the solution adopted here by IDE is Bad and Wrong ...
> 

So now I don't understand, is flush executed by the queue_fn or by the
prepare_flush, or am I seeing two type of flushes?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 16:34               ` Boaz Harrosh
@ 2009-05-03 18:34                 ` Jeff Garzik
  2009-05-03 18:40                   ` Jeff Garzik
  2009-05-03 21:48                   ` Matthew Wilcox
  0 siblings, 2 replies; 59+ messages in thread
From: Jeff Garzik @ 2009-05-03 18:34 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Hugh Dickins, Matthew Wilcox, linux-ide,
	linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe,
	Bartlomiej Zolnierkiewicz, Mark Lord

Boaz Harrosh wrote:
> On 05/03/2009 06:42 PM, Matthew Wilcox wrote:
>> On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote:
>>> I agree with Hugh. The allocation is done at, too-low in the food chain.
>>> (And that free of buffer at upper layer allocated by lower layer).
>>>
>>> I think you need to separate the: "does lld need buffer, what size"
>>> from the "here is buffer prepare", so upper layer that can sleep does
>>> sleep.
>> So you want two function pointers in the request queue relating to discard?
>>
> 
> OK I don't know what I want, I guess. ;-)
> 
> I'm not a block-device export but from the small osdblk device I maintain
> it looks like osdblk_prepare_flush which is set into:
>     blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, osdblk_prepare_flush);
> 
> does some internal structure setup, but the actual flush command is only executed
> later in the global osdblk_rq_fn which is set into:
>     blk_init_queue(osdblk_rq_fn, &osdev->lock);
> 
> But I'm not even sure that prepare_flush is called in a better context then
> queue_fn, and what does it means to let block devices take care of another
> new command type at queue_fn.
> 
> I guess it comes back to Jeff Garzik's comment about not having a central
> place to ask the request what we need to do.
> 
> But I do hate that allocation is done by driver and free by mid-layer,
> so yes two vectors, request_queue is allocated once per device it's not
> that bad. And later when Jeff's comment is addressed it can be removed.

May I presume you are referring to the following osdblk.c comment?

                 /* deduce our operation (read, write, flush) */
                 /* I wish the block layer simplified
		 * cmd_type/cmd_flags/cmd[]
                  * into a clearly defined set of RPC commands:
                  * read, write, flush, scsi command, power mgmt req,
                  * driver-specific, etc.
                  */

Yes, the task of figuring out -what to do- in the queue's request 
function is quite complex, and discard makes it even more so.

The API makes life difficult -- you have to pass temporary info to 
yourself in ->prepare_flush_fn() and ->prepare_discard_fn(), and the 
overall sum is a bewildering collection of opcodes, flags, and internal 
driver notes to itself.

Add to this yet another prep function, ->prep_rq_fn()

It definitely sucks, especially with regards to failed atomic 
allocations...   but I think fixing this quite a big more than Matthew 
probably willing to tackle ;-)

My ideal block layer interface would be a lot more opcode-based, e.g.

(1) create REQ_TYPE_DISCARD

(2) determine at init if queue (a) supports explicit DISCARD and/or (b) 
supports DISCARD flag passed with READ or WRITE

(3) when creating a discard request, use block helpers w/ queue-specific 
knowledge to create either
	(a) one request, REQ_TYPE_FS, with discard flag or
	(b) two requests, REQ_TYPE_FS followed by REQ_TYPE_DISCARD

(4) blkdev_issue_discard() would function like an empty barrier, and 
unconditionally create REQ_TYPE_DISCARD.


This type of setup would require NO prepare_discard command, as all 
knowledge would be passed directly to ->prep_rq_fn() and ->request_fn()


And to tangent a bit...  I feel barriers should be handled in exactly 
the same way.  Create REQ_TYPE_FLUSH, which would be issued for above 
examples #2a and #4, if the queue is setup that way.

All this MINIMIZES the amount of information a driver must "pass to 
itself", by utilizing existing ->prep_fn_rq() and ->request_fn() pathways.

	Jeff




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 18:34                 ` Jeff Garzik
@ 2009-05-03 18:40                   ` Jeff Garzik
  2009-05-03 19:04                     ` James Bottomley
  2009-05-03 21:48                   ` Matthew Wilcox
  1 sibling, 1 reply; 59+ messages in thread
From: Jeff Garzik @ 2009-05-03 18:40 UTC (permalink / raw)
  To: Matthew Wilcox, Jens Axboe
  Cc: Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide,
	linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord

Jeff Garzik wrote:
> (2) determine at init if queue (a) supports explicit DISCARD and/or (b) 
> supports DISCARD flag passed with READ or WRITE


As an aside -- does any existing command set support case #b, above?

AFAICT, ATA, SCSI and NVMHCI all have a single, explicit hardware 
command to discard/deallocate unused sectors.

Therefore, creating REQ_TYPE_DISCARD seems to eliminate any need for new 
hook ->prepare_discard().

This provides a 1:1 correspondence between hardware and struct request, 
most closely matching the setup of known hardware.

	Jeff




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 15:42             ` Matthew Wilcox
  2009-05-03 16:34               ` Boaz Harrosh
@ 2009-05-03 18:48               ` Bartlomiej Zolnierkiewicz
  1 sibling, 0 replies; 59+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2009-05-03 18:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide,
	linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Mark Lord

On Sunday 03 May 2009 17:42:16 Matthew Wilcox wrote:
> On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote:
> > I agree with Hugh. The allocation is done at, too-low in the food chain.
> > (And that free of buffer at upper layer allocated by lower layer).
> > 
> > I think you need to separate the: "does lld need buffer, what size"
> > from the "here is buffer prepare", so upper layer that can sleep does
> > sleep.
> 
> So you want two function pointers in the request queue relating to discard?
> 
> > In all other buffer needing operations the allocation is done before
> > submission of request, No?
> 
> It's not true for the flush request (the example I quoted).  Obviously,
> the solution adopted here by IDE is Bad and Wrong ...

When speaking about "the solution" do you mean that quick & ugly hack
that I needed to fix the unexpected regression during late -rc cycle? ;)

The proper solution would involve adding 'struct ide_cmd flush_cmd' to
ide_drive_t but I never got to verify/inquiry that is OK with the block
layer (if there can be only one flush rq outstanding or if we can have
both q->pre_flush_rq and q->post_flush_rq queued at the same time)...

Thanks,
Bart

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 18:40                   ` Jeff Garzik
@ 2009-05-03 19:04                     ` James Bottomley
  2009-05-03 19:20                       ` Jeff Garzik
  0 siblings, 1 reply; 59+ messages in thread
From: James Bottomley @ 2009-05-03 19:04 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins,
	Matthew Wilcox, linux-ide, linux-kernel, linux-scsi,
	Bartlomiej Zolnierkiewicz, Mark Lord

On Sun, 2009-05-03 at 14:40 -0400, Jeff Garzik wrote:
> Jeff Garzik wrote:
> > (2) determine at init if queue (a) supports explicit DISCARD and/or (b) 
> > supports DISCARD flag passed with READ or WRITE
> 
> 
> As an aside -- does any existing command set support case #b, above?

Not to my knowledge.

I think discard was modelled on barrier (which can be associated with
data or stand on its own).

> AFAICT, ATA, SCSI and NVMHCI all have a single, explicit hardware 
> command to discard/deallocate unused sectors.
> 
> Therefore, creating REQ_TYPE_DISCARD seems to eliminate any need for new 
> hook ->prepare_discard().

Well, yes and no ... in the SCSI implementation of either UNMAP or WRITE
SAME, we still need a data buffer to store either the extents or the
actual same data.

> This provides a 1:1 correspondence between hardware and struct request, 
> most closely matching the setup of known hardware.

Agreed ... but we still have to allocate the adjunct buffer
somewhere ....

James



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 19:04                     ` James Bottomley
@ 2009-05-03 19:20                       ` Jeff Garzik
  2009-05-03 19:37                         ` James Bottomley
  2009-05-03 19:47                         ` James Bottomley
  0 siblings, 2 replies; 59+ messages in thread
From: Jeff Garzik @ 2009-05-03 19:20 UTC (permalink / raw)
  To: James Bottomley
  Cc: Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins,
	Matthew Wilcox, linux-ide, linux-kernel, linux-scsi,
	Bartlomiej Zolnierkiewicz, Mark Lord

James Bottomley wrote:
> On Sun, 2009-05-03 at 14:40 -0400, Jeff Garzik wrote:
>> Jeff Garzik wrote:
>>> (2) determine at init if queue (a) supports explicit DISCARD and/or (b) 
>>> supports DISCARD flag passed with READ or WRITE
>>
>> As an aside -- does any existing command set support case #b, above?
> 
> Not to my knowledge.
> 
> I think discard was modelled on barrier (which can be associated with
> data or stand on its own).

Yeah -- I am thinking we can reduce complexity if no real hardware 
supports "associated with data."


>> AFAICT, ATA, SCSI and NVMHCI all have a single, explicit hardware 
>> command to discard/deallocate unused sectors.
>>
>> Therefore, creating REQ_TYPE_DISCARD seems to eliminate any need for new 
>> hook ->prepare_discard().
> 
> Well, yes and no ... in the SCSI implementation of either UNMAP or WRITE
> SAME, we still need a data buffer to store either the extents or the
> actual same data.

Sure, UNMAP's extents would quite naturally travel with the 
REQ_TYPE_DISCARD struct request, and can be set up at struct request 
allocation time.  In all of ATA, SCSI and NVMHCI, they appear to want 
the extents.

Is WRITE SAME associated with this current DISCARD work, or is that just 
a similar example?  I'm unfamiliar with its issues...


>> This provides a 1:1 correspondence between hardware and struct request, 
>> most closely matching the setup of known hardware.
> 
> Agreed ... but we still have to allocate the adjunct buffer
> somewhere ....

Agreed.  I merely argue ->prep_rq_fn() would be a better place -- it can 
at least return a useful return value -- than ->prepare_discard(), which 
would be the natural disposition for a REQ_TYPE_DISCARD -- or 
->request_fn() itself, if a block driver so chooses.

The extents would be an argument to REQ_TYPE_DISCARD, and should be 
associated with struct request somehow, by struct request's creator. 
The extent info needs to be in some generic form, because all of 
ATA|SCSI|NVMHCI seem to want it.


[tangent...]

Does make you wonder if a ->init_rq_fn() would be helpful, one that 
could perform gfp_t allocations rather than GFP_ATOMIC?  The idea being 
to call ->init_rq_fn() almost immediately after creation of struct 
request, by the struct request creator.

I obviously have not thought in depth about this, but it does seem that 
init_rq_fn(), called earlier in struct request lifetime, could eliminate 
the need for ->prepare_flush, ->prepare_discard, and perhaps could be a 
better place for some of the ->prep_rq_fn logic.

The creator of struct request generally has more freedom to sleep, and 
it seems logical to give low-level drivers a "fill in LLD-specific info" 
hook BEFORE the request is ever added to a request_queue.

Who knows...

	Jeff





^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 19:20                       ` Jeff Garzik
@ 2009-05-03 19:37                         ` James Bottomley
  2009-05-04 14:03                           ` Douglas Gilbert
  2009-05-03 19:47                         ` James Bottomley
  1 sibling, 1 reply; 59+ messages in thread
From: James Bottomley @ 2009-05-03 19:37 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins,
	Matthew Wilcox, linux-ide, linux-kernel, linux-scsi,
	Bartlomiej Zolnierkiewicz, Mark Lord

On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote:
> Is WRITE SAME associated with this current DISCARD work, or is that just 
> a similar example?  I'm unfamiliar with its issues...

It's an adjunct body of work.  T10 apparently ratified both UNMAP and
the WRITE SAME extensions.  What WRITE SAME does is write the same data
block to multiple contiguous locations as specified in the CDB.  What
the thin provisioning update did for it is allow you to specify a flag
saying I want these sectors unmapped.   The perceived benefit of WRITE
SAME is that you specify (with the write same data ... presumably all
zeros) what an unmapped sector will return if it's ever read from again,
which was a big argument in the UNMAP case.

James



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 19:20                       ` Jeff Garzik
  2009-05-03 19:37                         ` James Bottomley
@ 2009-05-03 19:47                         ` James Bottomley
  2009-05-03 22:47                           ` Jeff Garzik
  1 sibling, 1 reply; 59+ messages in thread
From: James Bottomley @ 2009-05-03 19:47 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins,
	Matthew Wilcox, linux-ide, linux-kernel, linux-scsi,
	Bartlomiej Zolnierkiewicz, Mark Lord

On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote:
> [tangent...]
> 
> Does make you wonder if a ->init_rq_fn() would be helpful, one that 
> could perform gfp_t allocations rather than GFP_ATOMIC?  The idea being 
> to call ->init_rq_fn() almost immediately after creation of struct 
> request, by the struct request creator.

Isn't that what the current prep_fn actually is?

> I obviously have not thought in depth about this, but it does seem that 
> init_rq_fn(), called earlier in struct request lifetime, could eliminate 
> the need for ->prepare_flush, ->prepare_discard, and perhaps could be a 
> better place for some of the ->prep_rq_fn logic.

It's hard to see how ... prep_rq_fn is already called pretty early ...
almost as soon as the elevator has decided to spit out the request

> The creator of struct request generally has more freedom to sleep, and 
> it seems logical to give low-level drivers a "fill in LLD-specific info" 
> hook BEFORE the request is ever added to a request_queue.

Unfortunately it's not really possible to find a sleeping context in
there:  The elevators have to operate from the current
elv_next_request() context, which, in most drivers can either be user or
interrupt.

The way the block layer is designed is to pull allocations up the stack
much closer to the process (usually at the bio creation point) because
that allows the elevators to operate even in memory starved conditions.
If we pushed the allocation down into the request level, we'd need some
type of threading (bad for performance) and the request processing would
stall when some GFP_KERNEL allocation went out to lunch finding memory.

The ideal for REQ_TYPE_DISCARD seems to be to force a page allocation
tied to a bio when it's issued at the top.  That way everyone has enough
memory when it comes down the stack (both extents and WRITE SAME sector
will fit into a page ... although only just for WRITE SAME on 4k
sectors).

James



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 18:34                 ` Jeff Garzik
  2009-05-03 18:40                   ` Jeff Garzik
@ 2009-05-03 21:48                   ` Matthew Wilcox
  2009-05-03 22:54                     ` Jeff Garzik
  1 sibling, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2009-05-03 21:48 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide,
	linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe,
	Bartlomiej Zolnierkiewicz, Mark Lord, David Woodhouse

On Sun, May 03, 2009 at 02:34:35PM -0400, Jeff Garzik wrote:
> Yes, the task of figuring out -what to do- in the queue's request  
> function is quite complex, and discard makes it even more so.
>
> The API makes life difficult -- you have to pass temporary info to  
> yourself in ->prepare_flush_fn() and ->prepare_discard_fn(), and the  
> overall sum is a bewildering collection of opcodes, flags, and internal  
> driver notes to itself.
>
> Add to this yet another prep function, ->prep_rq_fn()
>
> It definitely sucks, especially with regards to failed atomic  
> allocations...   but I think fixing this quite a big more than Matthew  
> probably willing to tackle ;-)

I'm completely confused by the block layer API, to be honest.  Trying to
deduce how to add a new feature at this stage is hard (compare it to
adding the reflink operation to the VFS ...).  I'm definitely willing to
tackle changing the block device API, but it may take a while.

> My ideal block layer interface would be a lot more opcode-based, e.g.
>
> (1) create REQ_TYPE_DISCARD
>
> (2) determine at init if queue (a) supports explicit DISCARD and/or (b)  
> supports DISCARD flag passed with READ or WRITE
>
> (3) when creating a discard request, use block helpers w/ queue-specific  
> knowledge to create either
> 	(a) one request, REQ_TYPE_FS, with discard flag or
> 	(b) two requests, REQ_TYPE_FS followed by REQ_TYPE_DISCARD

I'm not sure we need option 3b.

> (4) blkdev_issue_discard() would function like an empty barrier, and  
> unconditionally create REQ_TYPE_DISCARD.

I can certainly prototype a replacement for discard_prep_fn along these lines.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 19:47                         ` James Bottomley
@ 2009-05-03 22:47                           ` Jeff Garzik
  2009-05-04 15:28                             ` Boaz Harrosh
  0 siblings, 1 reply; 59+ messages in thread
From: Jeff Garzik @ 2009-05-03 22:47 UTC (permalink / raw)
  To: James Bottomley
  Cc: Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins,
	Matthew Wilcox, linux-ide, linux-kernel, linux-scsi,
	Bartlomiej Zolnierkiewicz, Mark Lord

James Bottomley wrote:
> On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote:
>> [tangent...]
>>
>> Does make you wonder if a ->init_rq_fn() would be helpful, one that 
>> could perform gfp_t allocations rather than GFP_ATOMIC?  The idea being 
>> to call ->init_rq_fn() almost immediately after creation of struct 
>> request, by the struct request creator.
> 
> Isn't that what the current prep_fn actually is?

> It's hard to see how ... prep_rq_fn is already called pretty early ...
> almost as soon as the elevator has decided to spit out the request

prep_rq_fn is called very, very late -- from elv_next_request(), which 
is called from ->request_fn.

As quoted above, I'm talking about something closer to -creation- time, 
not -execution- time.


>> The creator of struct request generally has more freedom to sleep, and 
>> it seems logical to give low-level drivers a "fill in LLD-specific info" 
>> hook BEFORE the request is ever added to a request_queue.
> 
> Unfortunately it's not really possible to find a sleeping context in
> there:  The elevators have to operate from the current
> elv_next_request() context, which, in most drivers can either be user or
> interrupt.

Sure, because that's further down the pipeline at the request execution 
stage.  Think more like make_request time...

> The ideal for REQ_TYPE_DISCARD seems to be to force a page allocation
> tied to a bio when it's issued at the top.  That way everyone has enough
> memory when it comes down the stack (both extents and WRITE SAME sector
> will fit into a page ... although only just for WRITE SAME on 4k
> sectors).

Makes sense...

	Jeff



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 21:48                   ` Matthew Wilcox
@ 2009-05-03 22:54                     ` Jeff Garzik
  0 siblings, 0 replies; 59+ messages in thread
From: Jeff Garzik @ 2009-05-03 22:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide,
	linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe,
	Bartlomiej Zolnierkiewicz, Mark Lord, David Woodhouse

Matthew Wilcox wrote:
>> (3) when creating a discard request, use block helpers w/ queue-specific  
>> knowledge to create either
>> 	(a) one request, REQ_TYPE_FS, with discard flag or
>> 	(b) two requests, REQ_TYPE_FS followed by REQ_TYPE_DISCARD
> 
> I'm not sure we need option 3b.


Well -- it is a hard requirement to map 1:1 struct request with the 
underlying hardware device's command set.

If a device command set lacks a READ-and-DISCARD operation, then you 
_must_ create two struct request.  Otherwise you break block layer 
tagging and other 1:1-based assumptions that exist today.	

Thus, given that all of ATA|SCSI|NVMHCI have a DISCARD operation 
distinct from other commands...

option 3b is the overwhelming common case.

	Jeff




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 19:37                         ` James Bottomley
@ 2009-05-04 14:03                           ` Douglas Gilbert
  2009-05-04 14:40                             ` James Bottomley
  0 siblings, 1 reply; 59+ messages in thread
From: Douglas Gilbert @ 2009-05-04 14:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jeff Garzik, Matthew Wilcox, Jens Axboe, Boaz Harrosh,
	Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel,
	linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord

James Bottomley wrote:
> On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote:
>> Is WRITE SAME associated with this current DISCARD work, or is that just 
>> a similar example?  I'm unfamiliar with its issues...
> 
> It's an adjunct body of work.  T10 apparently ratified both UNMAP and
> the WRITE SAME extensions.  What WRITE SAME does is write the same data
> block to multiple contiguous locations as specified in the CDB.  What
> the thin provisioning update did for it is allow you to specify a flag
> saying I want these sectors unmapped.   The perceived benefit of WRITE
> SAME is that you specify (with the write same data ... presumably all
> zeros) what an unmapped sector will return if it's ever read from again,
> which was a big argument in the UNMAP case.

James,
Your presumption is correct. For the UNMAP bit to be honoured
in the SCSI WRITE SAME command, the user data part of the
data-out buffer needs to be all zeros, and, if present,
the protection data part of the data-out buffer needs
to be all 0xff_s (i.e. 8 bytes of 0xff). Otherwise the
UNMAP bit in WRITE SAME command is ignored and it does a
"normal" WRITE SAME.

My $0.02's worth was a suggestion to report an error if the
UNMAP bit was given to WRITE SAME and the data-out
buffer did not comply with the above pattern. Alternatively
the data-out buffer could just be ignored. The author
of the WRITE SAME "unmap" facility duly noted my observations
and rejected them :-) The wording in sbc3r18.pdf for WRITE SAME
is contorted so there will be changes. And t10 is still
having teleconferences about thin provisioning so there may be
non-trivial changes in the near future.

Doug Gilbert

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-04 14:03                           ` Douglas Gilbert
@ 2009-05-04 14:40                             ` James Bottomley
  2009-05-04 15:11                               ` Douglas Gilbert
  0 siblings, 1 reply; 59+ messages in thread
From: James Bottomley @ 2009-05-04 14:40 UTC (permalink / raw)
  To: dgilbert
  Cc: Jeff Garzik, Matthew Wilcox, Jens Axboe, Boaz Harrosh,
	Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel,
	linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord

On Mon, 2009-05-04 at 16:03 +0200, Douglas Gilbert wrote:
> James Bottomley wrote:
> > On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote:
> >> Is WRITE SAME associated with this current DISCARD work, or is that just 
> >> a similar example?  I'm unfamiliar with its issues...
> > 
> > It's an adjunct body of work.  T10 apparently ratified both UNMAP and
> > the WRITE SAME extensions.  What WRITE SAME does is write the same data
> > block to multiple contiguous locations as specified in the CDB.  What
> > the thin provisioning update did for it is allow you to specify a flag
> > saying I want these sectors unmapped.   The perceived benefit of WRITE
> > SAME is that you specify (with the write same data ... presumably all
> > zeros) what an unmapped sector will return if it's ever read from again,
> > which was a big argument in the UNMAP case.
> 
> James,
> Your presumption is correct. For the UNMAP bit to be honoured
> in the SCSI WRITE SAME command, the user data part of the
> data-out buffer needs to be all zeros, and, if present,
> the protection data part of the data-out buffer needs
> to be all 0xff_s (i.e. 8 bytes of 0xff). Otherwise the
> UNMAP bit in WRITE SAME command is ignored and it does a
> "normal" WRITE SAME.
> 
> My $0.02's worth was a suggestion to report an error if the
> UNMAP bit was given to WRITE SAME and the data-out
> buffer did not comply with the above pattern. Alternatively
> the data-out buffer could just be ignored. The author
> of the WRITE SAME "unmap" facility duly noted my observations
> and rejected them :-) The wording in sbc3r18.pdf for WRITE SAME
> is contorted so there will be changes. And t10 is still
> having teleconferences about thin provisioning so there may be
> non-trivial changes in the near future.

Actually, I'd just like something far more basic: forcing a thin
provisioned array to support all of the three possible mechanisms.  It's
going to be a real mess trying to work out for any given array do you
support UNMAP or WRITE SAME(16) or WRITE SAME(32)?  We can only do this
currently by trying the commands ... then we have to have support for
all three built into sd just in case ... and we get precisely the same
functionality in each case ...

James



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-04 14:40                             ` James Bottomley
@ 2009-05-04 15:11                               ` Douglas Gilbert
  2009-05-04 15:23                                 ` James Bottomley
  0 siblings, 1 reply; 59+ messages in thread
From: Douglas Gilbert @ 2009-05-04 15:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jeff Garzik, Matthew Wilcox, Jens Axboe, Boaz Harrosh,
	Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel,
	linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord

James Bottomley wrote:
> On Mon, 2009-05-04 at 16:03 +0200, Douglas Gilbert wrote:
>> James Bottomley wrote:
>>> On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote:
>>>> Is WRITE SAME associated with this current DISCARD work, or is that just 
>>>> a similar example?  I'm unfamiliar with its issues...
>>> It's an adjunct body of work.  T10 apparently ratified both UNMAP and
>>> the WRITE SAME extensions.  What WRITE SAME does is write the same data
>>> block to multiple contiguous locations as specified in the CDB.  What
>>> the thin provisioning update did for it is allow you to specify a flag
>>> saying I want these sectors unmapped.   The perceived benefit of WRITE
>>> SAME is that you specify (with the write same data ... presumably all
>>> zeros) what an unmapped sector will return if it's ever read from again,
>>> which was a big argument in the UNMAP case.
>> James,
>> Your presumption is correct. For the UNMAP bit to be honoured
>> in the SCSI WRITE SAME command, the user data part of the
>> data-out buffer needs to be all zeros, and, if present,
>> the protection data part of the data-out buffer needs
>> to be all 0xff_s (i.e. 8 bytes of 0xff). Otherwise the
>> UNMAP bit in WRITE SAME command is ignored and it does a
>> "normal" WRITE SAME.
>>
>> My $0.02's worth was a suggestion to report an error if the
>> UNMAP bit was given to WRITE SAME and the data-out
>> buffer did not comply with the above pattern. Alternatively
>> the data-out buffer could just be ignored. The author
>> of the WRITE SAME "unmap" facility duly noted my observations
>> and rejected them :-) The wording in sbc3r18.pdf for WRITE SAME
>> is contorted so there will be changes. And t10 is still
>> having teleconferences about thin provisioning so there may be
>> non-trivial changes in the near future.
> 
> Actually, I'd just like something far more basic: forcing a thin
> provisioned array to support all of the three possible mechanisms.  It's
> going to be a real mess trying to work out for any given array do you
> support UNMAP or WRITE SAME(16) or WRITE SAME(32)?  We can only do this
> currently by trying the commands ... then we have to have support for
> all three built into sd just in case ... and we get precisely the same
> functionality in each case ...

James,
Another aspect, especially if a large amount of storage
is to be trimmed, is how long will it take? This relates
to the timeout value we should associate with such an
invocation. The FORMAT UNIT and START STOP UNIT commands
have an IMMED bit, but not WRITE SAME.

Speaking of FORMAT UNIT, some words were added into sbc3r18
that suggest a FORMAT UNIT command could be interpreted as
unmap/trim the whole disk.

Doug Gilbert


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-04 15:11                               ` Douglas Gilbert
@ 2009-05-04 15:23                                 ` James Bottomley
  0 siblings, 0 replies; 59+ messages in thread
From: James Bottomley @ 2009-05-04 15:23 UTC (permalink / raw)
  To: dgilbert
  Cc: Jeff Garzik, Matthew Wilcox, Jens Axboe, Boaz Harrosh,
	Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel,
	linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord

On Mon, 2009-05-04 at 17:11 +0200, Douglas Gilbert wrote:
> Another aspect, especially if a large amount of storage
> is to be trimmed, is how long will it take? This relates
> to the timeout value we should associate with such an
> invocation. The FORMAT UNIT and START STOP UNIT commands
> have an IMMED bit, but not WRITE SAME.

I was assuming it would be more or less instantaneous ... if it's not
then the plan of trying to keep thin provisioning happy just by sending
down a maximal trim from the filesystem is going to run into trouble.

> Speaking of FORMAT UNIT, some words were added into sbc3r18
> that suggest a FORMAT UNIT command could be interpreted as
> unmap/trim the whole disk.

Seems reasonable ... if you format the device it could be argued you're
not relying on the data contents any more ...

James



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: New TRIM/UNMAP tree published (2009-05-02)
  2009-05-03 22:47                           ` Jeff Garzik
@ 2009-05-04 15:28                             ` Boaz Harrosh
  0 siblings, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2009-05-04 15:28 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: James Bottomley, Matthew Wilcox, Jens Axboe, Hugh Dickins,
	Matthew Wilcox, linux-ide, linux-kernel, linux-scsi,
	Bartlomiej Zolnierkiewicz, Mark Lord

On 05/04/2009 01:47 AM, Jeff Garzik wrote:
> James Bottomley wrote:
>> The ideal for REQ_TYPE_DISCARD seems to be to force a page allocation
>> tied to a bio when it's issued at the top.  That way everyone has enough
>> memory when it comes down the stack (both extents and WRITE SAME sector
>> will fit into a page ... although only just for WRITE SAME on 4k
>> sectors).
> 
> Makes sense...
> 
> 	Jeff
> 

I second that, let the allocator of the request supply a buffer and free it
on return/done. Perhaps with a general helper that does all that.

Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2009-05-04 15:28 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-02 14:37 [PATCH 1/5] Block: Discard may need to allocate pages Matthew Wilcox
2009-04-02 14:37 ` [PATCH 2/5] Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads Matthew Wilcox
2009-04-02 14:37   ` [PATCH 3/5] ata: Add TRIM infrastructure Matthew Wilcox
2009-04-02 14:37     ` [PATCH 4/5] ide: Add support for TRIM Matthew Wilcox
2009-04-02 14:37       ` [PATCH 5/5] libata: " Matthew Wilcox
2009-04-02 17:20         ` Mark Lord
2009-04-02 17:55           ` Matthew Wilcox
2009-04-16 20:25         ` Mark Lord
2009-04-17 19:44           ` Mark Lord
2009-04-02 15:58       ` [PATCH 4/5] ide: " Sergei Shtylyov
2009-04-02 16:28         ` Matthew Wilcox
2009-04-02 16:38           ` Sergei Shtylyov
2009-04-02 16:51             ` Matthew Wilcox
2009-04-02 19:37               ` Bartlomiej Zolnierkiewicz
2009-04-07 21:38                 ` Bartlomiej Zolnierkiewicz
2009-04-07 22:15                   ` Matthew Wilcox
2009-04-07 22:26                     ` Jeff Garzik
2009-04-07 22:35                     ` Bartlomiej Zolnierkiewicz
2009-04-07 17:20       ` Jeff Garzik
2009-04-07 17:57         ` Mark Lord
2009-04-07 18:10           ` Markus Trippelsdorf
2009-04-07 19:58             ` Mark Lord
2009-04-08  7:14               ` Markus Trippelsdorf
2009-04-08 14:25                 ` Mark Lord
2009-04-08 14:33       ` Mark Lord
2009-04-08 14:44         ` Dongjun Shin
2009-04-08 14:59         ` Jeff Garzik
2009-04-08 15:50           ` Mark Lord
2009-04-02 15:55     ` [PATCH 3/5] ata: Add TRIM infrastructure Sergei Shtylyov
2009-04-02 16:18       ` Matthew Wilcox
2009-04-02 16:32         ` Sergei Shtylyov
2009-04-02 16:47           ` Matthew Wilcox
2009-04-07  0:02     ` Jeff Garzik
2009-04-05 12:28 ` [PATCH 1/5] Block: Discard may need to allocate pages Boaz Harrosh
2009-04-06 20:34   ` Matthew Wilcox
2009-05-03  6:11   ` Matthew Wilcox
2009-05-03  7:16     ` New TRIM/UNMAP tree published (2009-05-02) Matthew Wilcox
2009-05-03 13:07       ` Hugh Dickins
2009-05-03 14:48         ` Matthew Wilcox
2009-05-03 15:02           ` Boaz Harrosh
2009-05-03 15:42             ` Matthew Wilcox
2009-05-03 16:34               ` Boaz Harrosh
2009-05-03 18:34                 ` Jeff Garzik
2009-05-03 18:40                   ` Jeff Garzik
2009-05-03 19:04                     ` James Bottomley
2009-05-03 19:20                       ` Jeff Garzik
2009-05-03 19:37                         ` James Bottomley
2009-05-04 14:03                           ` Douglas Gilbert
2009-05-04 14:40                             ` James Bottomley
2009-05-04 15:11                               ` Douglas Gilbert
2009-05-04 15:23                                 ` James Bottomley
2009-05-03 19:47                         ` James Bottomley
2009-05-03 22:47                           ` Jeff Garzik
2009-05-04 15:28                             ` Boaz Harrosh
2009-05-03 21:48                   ` Matthew Wilcox
2009-05-03 22:54                     ` Jeff Garzik
2009-05-03 18:48               ` Bartlomiej Zolnierkiewicz
2009-05-03 15:05           ` Hugh Dickins
2009-04-17 21:23 ` [PATCH 1/5] Block: Discard may need to allocate pages Mark Lord

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).