* always use REQ_OP_WRITE_ZEROES for zeroing offload V2
@ 2017-04-05 17:20 ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:20 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
supported by the block layer, and switches existing implementations
of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
removes incorrect discard_zeroes_data, and also switches WRITE SAME
based zeroing in SCSI to this new method.

The series is against the block for-next tree.

A git tree is also available at:

    git://git.infradead.org/users/hch/block.git discard-rework.2

Gitweb:

    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework.2

Changes since V2:
 - various spelling fixes
 - various reviews captured
 - two new patches from Martin at the end

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 01/27] sd: split sd_setup_discard_cmnd
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:20   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:20 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Split sd_setup_discard_cmnd into one function per provisioning type.  While
this creates some very slight duplication of boilerplate code, it keeps the
code modular for additions of new provisioning types, and for reusing the
write same functions for the upcoming SCSI implementation of the Write Zeroes
operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/scsi/sd.c | 153 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 84 insertions(+), 69 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fcfeddc79331..b853f91fb3da 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -701,93 +701,97 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
 
-/**
- * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device
- * @sdp: scsi device to operate on
- * @rq: Request to prepare
- *
- * Will issue either UNMAP or WRITE SAME(16) depending on preference
- * indicated by target device.
- **/
-static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
+static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
 {
-	struct request *rq = cmd->request;
 	struct scsi_device *sdp = cmd->device;
-	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
-	sector_t sector = blk_rq_pos(rq);
-	unsigned int nr_sectors = blk_rq_sectors(rq);
-	unsigned int len;
-	int ret;
+	struct request *rq = cmd->request;
+	u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+	u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+	unsigned int data_len = 24;
 	char *buf;
-	struct page *page;
 
-	sector >>= ilog2(sdp->sector_size) - 9;
-	nr_sectors >>= ilog2(sdp->sector_size) - 9;
-
-	page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
-	if (!page)
+	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+	if (!rq->special_vec.bv_page)
 		return BLKPREP_DEFER;
+	rq->special_vec.bv_offset = 0;
+	rq->special_vec.bv_len = data_len;
+	rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
 
-	switch (sdkp->provisioning_mode) {
-	case SD_LBP_UNMAP:
-		buf = page_address(page);
-
-		cmd->cmd_len = 10;
-		cmd->cmnd[0] = UNMAP;
-		cmd->cmnd[8] = 24;
-
-		put_unaligned_be16(6 + 16, &buf[0]);
-		put_unaligned_be16(16, &buf[2]);
-		put_unaligned_be64(sector, &buf[8]);
-		put_unaligned_be32(nr_sectors, &buf[16]);
+	cmd->cmd_len = 10;
+	cmd->cmnd[0] = UNMAP;
+	cmd->cmnd[8] = 24;
 
-		len = 24;
-		break;
+	buf = page_address(rq->special_vec.bv_page);
+	put_unaligned_be16(6 + 16, &buf[0]);
+	put_unaligned_be16(16, &buf[2]);
+	put_unaligned_be64(sector, &buf[8]);
+	put_unaligned_be32(nr_sectors, &buf[16]);
 
-	case SD_LBP_WS16:
-		cmd->cmd_len = 16;
-		cmd->cmnd[0] = WRITE_SAME_16;
-		cmd->cmnd[1] = 0x8; /* UNMAP */
-		put_unaligned_be64(sector, &cmd->cmnd[2]);
-		put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
+	cmd->allowed = SD_MAX_RETRIES;
+	cmd->transfersize = data_len;
+	rq->timeout = SD_TIMEOUT;
+	scsi_req(rq)->resid_len = data_len;
 
-		len = sdkp->device->sector_size;
-		break;
+	return scsi_init_io(cmd);
+}
 
-	case SD_LBP_WS10:
-	case SD_LBP_ZERO:
-		cmd->cmd_len = 10;
-		cmd->cmnd[0] = WRITE_SAME;
-		if (sdkp->provisioning_mode == SD_LBP_WS10)
-			cmd->cmnd[1] = 0x8; /* UNMAP */
-		put_unaligned_be32(sector, &cmd->cmnd[2]);
-		put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
+static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
+{
+	struct scsi_device *sdp = cmd->device;
+	struct request *rq = cmd->request;
+	u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+	u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+	u32 data_len = sdp->sector_size;
 
-		len = sdkp->device->sector_size;
-		break;
+	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+	if (!rq->special_vec.bv_page)
+		return BLKPREP_DEFER;
+	rq->special_vec.bv_offset = 0;
+	rq->special_vec.bv_len = data_len;
+	rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
 
-	default:
-		ret = BLKPREP_INVALID;
-		goto out;
-	}
+	cmd->cmd_len = 16;
+	cmd->cmnd[0] = WRITE_SAME_16;
+	cmd->cmnd[1] = 0x8; /* UNMAP */
+	put_unaligned_be64(sector, &cmd->cmnd[2]);
+	put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
 
+	cmd->allowed = SD_MAX_RETRIES;
+	cmd->transfersize = data_len;
 	rq->timeout = SD_TIMEOUT;
+	scsi_req(rq)->resid_len = data_len;
 
-	cmd->transfersize = len;
-	cmd->allowed = SD_MAX_RETRIES;
+	return scsi_init_io(cmd);
+}
 
-	rq->special_vec.bv_page = page;
-	rq->special_vec.bv_offset = 0;
-	rq->special_vec.bv_len = len;
+static int sd_setup_write_same10_cmnd(struct scsi_cmnd *cmd, bool unmap)
+{
+	struct scsi_device *sdp = cmd->device;
+	struct request *rq = cmd->request;
+	u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+	u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+	u32 data_len = sdp->sector_size;
 
+	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+	if (!rq->special_vec.bv_page)
+		return BLKPREP_DEFER;
+	rq->special_vec.bv_offset = 0;
+	rq->special_vec.bv_len = data_len;
 	rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
-	scsi_req(rq)->resid_len = len;
 
-	ret = scsi_init_io(cmd);
-out:
-	if (ret != BLKPREP_OK)
-		__free_page(page);
-	return ret;
+	cmd->cmd_len = 10;
+	cmd->cmnd[0] = WRITE_SAME;
+	if (unmap)
+		cmd->cmnd[1] = 0x8; /* UNMAP */
+	put_unaligned_be32(sector, &cmd->cmnd[2]);
+	put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
+
+	cmd->allowed = SD_MAX_RETRIES;
+	cmd->transfersize = data_len;
+	rq->timeout = SD_TIMEOUT;
+	scsi_req(rq)->resid_len = data_len;
+
+	return scsi_init_io(cmd);
 }
 
 static void sd_config_write_same(struct scsi_disk *sdkp)
@@ -1155,7 +1159,18 @@ static int sd_init_command(struct scsi_cmnd *cmd)
 
 	switch (req_op(rq)) {
 	case REQ_OP_DISCARD:
-		return sd_setup_discard_cmnd(cmd);
+		switch (scsi_disk(rq->rq_disk)->provisioning_mode) {
+		case SD_LBP_UNMAP:
+			return sd_setup_unmap_cmnd(cmd);
+		case SD_LBP_WS16:
+			return sd_setup_write_same16_cmnd(cmd);
+		case SD_LBP_WS10:
+			return sd_setup_write_same10_cmnd(cmd, true);
+		case SD_LBP_ZERO:
+			return sd_setup_write_same10_cmnd(cmd, false);
+		default:
+			return BLKPREP_INVALID;
+		}
 	case REQ_OP_WRITE_SAME:
 		return sd_setup_write_same_cmnd(cmd);
 	case REQ_OP_FLUSH:
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread
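
Editor's sketch for the UNMAP path in the patch above: a minimal userspace
illustration of the 24-byte parameter list that sd_setup_unmap_cmnd fills in,
and of the shift that converts 512-byte block-layer sectors into device
logical blocks.  The helpers put_be16/put_be32/put_be64, ilog2_u32,
to_logical_blocks and build_unmap_payload are hypothetical stand-ins written
for this note; they are not kernel functions, and the kernel code itself uses
put_unaligned_be*() and ilog2().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-ins for the kernel's put_unaligned_be*(). */
static void put_be16(uint16_t v, uint8_t *p)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

static void put_be32(uint32_t v, uint8_t *p)
{
	put_be16((uint16_t)(v >> 16), p);
	put_be16((uint16_t)(v & 0xffff), p + 2);
}

static void put_be64(uint64_t v, uint8_t *p)
{
	put_be32((uint32_t)(v >> 32), p);
	put_be32((uint32_t)(v & 0xffffffffu), p + 4);
}

/* floor(log2(v)) for v > 0; illustrative replacement for ilog2(). */
static unsigned int ilog2_u32(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Convert a count of 512-byte sectors into device logical blocks. */
static uint64_t to_logical_blocks(uint64_t sectors_512, uint32_t sector_size)
{
	/* Assumption of this sketch: power-of-two size of at least 512. */
	assert(sector_size >= 512 && (sector_size & (sector_size - 1)) == 0);
	return sectors_512 >> (ilog2_u32(sector_size) - 9);
}

/* Fill the same fields the patch writes into the zeroed payload page. */
static void build_unmap_payload(uint8_t buf[24], uint64_t lba,
				uint32_t nr_blocks)
{
	memset(buf, 0, 24);
	put_be16(6 + 16, &buf[0]);	/* bytes following this length field */
	put_be16(16, &buf[2]);		/* one 16-byte block descriptor */
	put_be64(lba, &buf[8]);		/* descriptor: starting LBA */
	put_be32(nr_blocks, &buf[16]);	/* descriptor: number of blocks */
}
```

With a 4096-byte logical block size the shift is 3, so a request starting at
512-byte sector 2048 and covering 16 sectors becomes LBA 256 with 2 blocks.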

* [PATCH 02/27] block: renumber REQ_OP_WRITE_ZEROES
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Make life easy for implementations that need to send a data buffer
to the device (e.g. SCSI) by numbering it as a data out command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 include/linux/blk_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 67bcf8a5326e..4eae30bfbfca 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -168,7 +168,7 @@ enum req_opf {
 	/* write the same sector many times */
 	REQ_OP_WRITE_SAME	= 7,
 	/* write the zero filled sector many times */
-	REQ_OP_WRITE_ZEROES	= 8,
+	REQ_OP_WRITE_ZEROES	= 9,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread
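
Editor's sketch of why the renumbering above matters, assuming the block
layer convention that the least significant bit of the operation number
encodes the transfer direction (set means data goes to the device).  The
enum and the helper sketch_op_is_data_out are hypothetical names used only
here; the numeric values 7, 8 and 9 are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the diff; the names are local to this sketch. */
enum sketch_op {
	SKETCH_OP_WRITE_SAME		= 7,
	SKETCH_OP_WRITE_ZEROES_OLD	= 8,
	SKETCH_OP_WRITE_ZEROES_NEW	= 9,
};

/* Assumed convention: an odd operation number means "data out". */
static bool sketch_op_is_data_out(unsigned int op)
{
	return (op & 1) != 0;
}

/* The new number must classify as data out, like WRITE SAME does. */
static_assert((SKETCH_OP_WRITE_ZEROES_NEW & 1) == 1,
	      "renumbered op must have the direction bit set");
```

Under that assumption, the old value 8 would have been treated as carrying
no outgoing data, while 9 lets a driver attach a zeroed payload, as the SCSI
WRITE SAME based implementation later in the series does.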

* [PATCH 03/27] block: implement splitting of REQ_OP_WRITE_ZEROES bios
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Copy and paste the REQ_OP_WRITE_SAME code to prepare for implementations
that limit the write zeroes size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 block/blk-merge.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2afa262425d1..3990ae406341 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -54,6 +54,20 @@ static struct bio *blk_bio_discard_split(struct request_queue *q,
 	return bio_split(bio, split_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
+		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
+{
+	*nsegs = 1;
+
+	if (!q->limits.max_write_zeroes_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+}
+
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
 					    struct bio *bio,
 					    struct bio_set *bs,
@@ -200,8 +214,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
 		break;
 	case REQ_OP_WRITE_ZEROES:
-		split = NULL;
-		nsegs = (*bio)->bi_phys_segments;
+		split = blk_bio_write_zeroes_split(q, *bio, bs, &nsegs);
 		break;
 	case REQ_OP_WRITE_SAME:
 		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread
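
Editor's sketch of the split decision in blk_bio_write_zeroes_split above,
reduced to plain integers.  write_zeroes_split_point and
count_write_zeroes_pieces are hypothetical names for this note; the real
function returns a struct bio from bio_split(), or NULL when no split is
needed, and the loop below only models the repeated splitting of the
remainder.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the size in sectors of the front piece to split off, or 0 when
 * the bio can be issued whole (the cases where the patch returns NULL).
 */
static uint32_t write_zeroes_split_point(uint32_t bio_sectors,
					 uint32_t max_write_zeroes_sectors)
{
	if (!max_write_zeroes_sectors)
		return 0;	/* no limit configured on the queue */
	if (bio_sectors <= max_write_zeroes_sectors)
		return 0;	/* already fits in one command */
	return max_write_zeroes_sectors;
}

/* Count how many pieces result when the remainder is split repeatedly. */
static uint32_t count_write_zeroes_pieces(uint32_t bio_sectors,
					  uint32_t max_write_zeroes_sectors)
{
	uint32_t pieces = 1;
	uint32_t front;

	while ((front = write_zeroes_split_point(bio_sectors,
					max_write_zeroes_sectors)) != 0) {
		assert(front < bio_sectors);	/* guarantees progress */
		bio_sectors -= front;
		pieces++;
	}
	return pieces;
}
```

For example, a 1000-sector request against a limit of 256 sectors yields
pieces of 256, 256, 256 and 232 sectors.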

* [PATCH 04/27] sd: implement REQ_OP_WRITE_ZEROES
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/scsi/sd.c     | 31 ++++++++++++++++++++++++++-----
 drivers/scsi/sd_zbc.c |  1 +
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b853f91fb3da..d8d9c0bdd93c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -735,7 +735,7 @@ static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
 	return scsi_init_io(cmd);
 }
 
-static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
+static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd, bool unmap)
 {
 	struct scsi_device *sdp = cmd->device;
 	struct request *rq = cmd->request;
@@ -752,13 +752,14 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
 
 	cmd->cmd_len = 16;
 	cmd->cmnd[0] = WRITE_SAME_16;
-	cmd->cmnd[1] = 0x8; /* UNMAP */
+	if (unmap)
+		cmd->cmnd[1] = 0x8; /* UNMAP */
 	put_unaligned_be64(sector, &cmd->cmnd[2]);
 	put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
 
 	cmd->allowed = SD_MAX_RETRIES;
 	cmd->transfersize = data_len;
-	rq->timeout = SD_TIMEOUT;
+	rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT;
 	scsi_req(rq)->resid_len = data_len;
 
 	return scsi_init_io(cmd);
@@ -788,12 +789,27 @@ static int sd_setup_write_same10_cmnd(struct scsi_cmnd *cmd, bool unmap)
 
 	cmd->allowed = SD_MAX_RETRIES;
 	cmd->transfersize = data_len;
-	rq->timeout = SD_TIMEOUT;
+	rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT;
 	scsi_req(rq)->resid_len = data_len;
 
 	return scsi_init_io(cmd);
 }
 
+static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_device *sdp = cmd->device;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+	u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+
+	if (sdp->no_write_same)
+		return BLKPREP_INVALID;
+	if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff)
+		return sd_setup_write_same16_cmnd(cmd, false);
+	return sd_setup_write_same10_cmnd(cmd, false);
+}
+
 static void sd_config_write_same(struct scsi_disk *sdkp)
 {
 	struct request_queue *q = sdkp->disk->queue;
@@ -823,6 +839,8 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
 out:
 	blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
 					 (logical_block_size >> 9));
+	blk_queue_max_write_zeroes_sectors(q, sdkp->max_ws_blocks *
+					 (logical_block_size >> 9));
 }
 
 /**
@@ -1163,7 +1181,7 @@ static int sd_init_command(struct scsi_cmnd *cmd)
 		case SD_LBP_UNMAP:
 			return sd_setup_unmap_cmnd(cmd);
 		case SD_LBP_WS16:
-			return sd_setup_write_same16_cmnd(cmd);
+			return sd_setup_write_same16_cmnd(cmd, true);
 		case SD_LBP_WS10:
 			return sd_setup_write_same10_cmnd(cmd, true);
 		case SD_LBP_ZERO:
@@ -1171,6 +1189,8 @@ static int sd_init_command(struct scsi_cmnd *cmd)
 		default:
 			return BLKPREP_INVALID;
 		}
+	case REQ_OP_WRITE_ZEROES:
+		return sd_setup_write_zeroes_cmnd(cmd);
 	case REQ_OP_WRITE_SAME:
 		return sd_setup_write_same_cmnd(cmd);
 	case REQ_OP_FLUSH:
@@ -1810,6 +1830,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 
 	switch (req_op(req)) {
 	case REQ_OP_DISCARD:
+	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE_SAME:
 	case REQ_OP_ZONE_RESET:
 		if (!result) {
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 92620c8ea8ad..1994f7799fce 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -329,6 +329,7 @@ void sd_zbc_complete(struct scsi_cmnd *cmd,
 
 	switch (req_op(rq)) {
 	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE_SAME:
 	case REQ_OP_ZONE_RESET:
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread
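
Editor's sketch of the command selection in sd_setup_write_zeroes_cmnd
above.  WRITE SAME(10) carries a 32-bit LBA and a 16-bit block count, so
anything larger has to use WRITE SAME(16).  The enum ws_variant, the two
limit macros and pick_write_same are hypothetical names for this note; the
boolean parameters stand in for sdp->no_write_same and sdkp->ws16.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WS10_MAX_LBA	0xffffffffULL	/* 32-bit LBA field */
#define WS10_MAX_BLOCKS	0xffffU		/* 16-bit block count field */

static_assert(WS10_MAX_BLOCKS == UINT16_MAX,
	      "WRITE SAME(10) block count is a 16-bit field");

enum ws_variant { WS_INVALID, WS_10, WS_16 };

/* Mirror the order of the checks in the patch. */
static enum ws_variant pick_write_same(bool no_write_same, bool ws16,
				       uint64_t lba, uint32_t nr_blocks)
{
	if (no_write_same)
		return WS_INVALID;	/* patch returns BLKPREP_INVALID */
	if (ws16 || lba > WS10_MAX_LBA || nr_blocks > WS10_MAX_BLOCKS)
		return WS_16;
	return WS_10;
}
```

In both cases the patch passes unmap = false, so the UNMAP bit stays clear
and the longer SD_WRITE_SAME_TIMEOUT is used.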

* [PATCH 05/27] md: support REQ_OP_WRITE_ZEROES
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/linear.c    | 1 +
 drivers/md/md.h        | 7 +++++++
 drivers/md/multipath.c | 1 +
 drivers/md/raid0.c     | 2 ++
 drivers/md/raid1.c     | 4 +++-
 drivers/md/raid10.c    | 1 +
 drivers/md/raid5.c     | 1 +
 7 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 3e38e0207a3e..377a8a3672e3 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -293,6 +293,7 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
 						      split, disk_devt(mddev->gendisk),
 						      bio_sector);
 			mddev_check_writesame(mddev, split);
+			mddev_check_write_zeroes(mddev, split);
 			generic_make_request(split);
 		}
 	} while (split != bio);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index dde8ecb760c8..1e76d64ce180 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -709,4 +709,11 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
 	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
 		mddev->queue->limits.max_write_same_sectors = 0;
 }
+
+static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio *bio)
+{
+	if (bio_op(bio) == REQ_OP_WRITE_ZEROES &&
+	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_zeroes_sectors)
+		mddev->queue->limits.max_write_zeroes_sectors = 0;
+}
 #endif /* _MD_MD_H */
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 79a12b59250b..e95d521d93e9 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -139,6 +139,7 @@ static void multipath_make_request(struct mddev *mddev, struct bio * bio)
 	mp_bh->bio.bi_end_io = multipath_end_request;
 	mp_bh->bio.bi_private = mp_bh;
 	mddev_check_writesame(mddev, &mp_bh->bio);
+	mddev_check_write_zeroes(mddev, &mp_bh->bio);
 	generic_make_request(&mp_bh->bio);
 	return;
 }
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 93347ca7c7a6..ce7a6a56cf73 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -383,6 +383,7 @@ static int raid0_run(struct mddev *mddev)
 
 		blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
 		blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
+		blk_queue_max_write_zeroes_sectors(mddev->queue, mddev->chunk_sectors);
 		blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
 
 		blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
@@ -504,6 +505,7 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 						      split, disk_devt(mddev->gendisk),
 						      bio_sector);
 			mddev_check_writesame(mddev, split);
+			mddev_check_write_zeroes(mddev, split);
 			generic_make_request(split);
 		}
 	} while (split != bio);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a34f58772022..b59cc100320a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3177,8 +3177,10 @@ static int raid1_run(struct mddev *mddev)
 	if (IS_ERR(conf))
 		return PTR_ERR(conf);
 
-	if (mddev->queue)
+	if (mddev->queue) {
 		blk_queue_max_write_same_sectors(mddev->queue, 0);
+		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
+	}
 
 	rdev_for_each(rdev, mddev) {
 		if (!mddev->gendisk)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e89a8d78a9ed..28ec3a93acee 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3749,6 +3749,7 @@ static int raid10_run(struct mddev *mddev)
 		blk_queue_max_discard_sectors(mddev->queue,
 					      mddev->chunk_sectors);
 		blk_queue_max_write_same_sectors(mddev->queue, 0);
+		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
 		blk_queue_io_min(mddev->queue, chunk_size);
 		if (conf->geo.raid_disks % conf->geo.near_copies)
 			blk_queue_io_opt(mddev->queue, chunk_size * conf->geo.raid_disks);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ed5cd705b985..8cf1f86dcd05 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7272,6 +7272,7 @@ static int raid5_run(struct mddev *mddev)
 		mddev->queue->limits.discard_zeroes_data = 0;
 
 		blk_queue_max_write_same_sectors(mddev->queue, 0);
+		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
 
 		rdev_for_each(rdev, mddev) {
 			disk_stack_limits(mddev->gendisk, rdev->bdev,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread
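
The md.h helper added in the patch above can be modelled outside the kernel.
What follows is only a sketch: `struct limits`, `enum op` and
`check_write_zeroes` are hypothetical stand-ins, not kernel types or
functions.  It shows the one rule the helper applies: when a WRITE ZEROES bio
is headed for a member device whose limit is 0, the array stops advertising
the offload.

```c
/* Hypothetical stand-ins for the kernel's bio op and queue limits. */
enum op { OP_WRITE, OP_WRITE_SAME, OP_WRITE_ZEROES };

struct limits {
	unsigned int max_write_zeroes_sectors;
};

/*
 * Same condition as mddev_check_write_zeroes() in the patch: only a
 * WRITE ZEROES request aimed at a member with a zero limit clears the
 * limit on the array.  Other ops and capable members leave it alone.
 */
static void check_write_zeroes(struct limits *array,
			       const struct limits *member, enum op op)
{
	if (op == OP_WRITE_ZEROES && !member->max_write_zeroes_sectors)
		array->max_write_zeroes_sectors = 0;
}
```

Note that the limit is only ever lowered.  Nothing in this check restores it
once a member has turned out not to support the operation.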

* [PATCH 06/27] dm io: discards don't take a payload
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Fix up do_region to not allocate a bio_vec for discards.  We got rid of
the caller-allocated discard payload years ago.

Obviously this wasn't actually harmful given how long it's been
there, but it's still good to avoid the pointless allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/dm-io.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 03940bf36f6c..b808cbe22678 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -328,11 +328,17 @@ static void do_region(int op, int op_flags, unsigned region,
 		/*
 		 * Allocate a suitably sized-bio.
 		 */
-		if ((op == REQ_OP_DISCARD) || (op == REQ_OP_WRITE_SAME))
+		switch (op) {
+		case REQ_OP_DISCARD:
+			num_bvecs = 0;
+			break;
+		case REQ_OP_WRITE_SAME:
 			num_bvecs = 1;
-		else
+			break;
+		default:
 			num_bvecs = min_t(int, BIO_MAX_PAGES,
 					  dm_sector_div_up(remaining, (PAGE_SIZE >> SECTOR_SHIFT)));
+		}
 
 		bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios);
 		bio->bi_iter.bi_sector = where->sector + (where->count - remaining);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread
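
The bio_vec sizing rule that the do_region change above introduces can be
sketched as a standalone function.  The names and constants here are
assumptions made for illustration: `num_bvecs_for` is hypothetical, and the
model assumes 4096-byte pages, 512-byte sectors and a cap of 256 pages per
bio rather than the kernel's own PAGE_SIZE, SECTOR_SHIFT and BIO_MAX_PAGES.

```c
#include <stdint.h>

/* Hypothetical stand-in for the request op. */
enum op { OP_READ, OP_WRITE, OP_DISCARD, OP_WRITE_SAME };

#define MODEL_PAGE_SECTORS  8u   /* assumed: 4096-byte page / 512-byte sector */
#define MODEL_BIO_MAX_PAGES 256u /* assumed cap on pages per bio */

/*
 * Discards carry no payload, WRITE SAME carries exactly one block, and
 * everything else needs one bio_vec per page, rounded up and capped.
 */
static unsigned int num_bvecs_for(enum op op, uint64_t remaining_sectors)
{
	uint64_t pages;

	switch (op) {
	case OP_DISCARD:
		return 0;
	case OP_WRITE_SAME:
		return 1;
	default:
		pages = (remaining_sectors + MODEL_PAGE_SECTORS - 1) /
			MODEL_PAGE_SECTORS;
		return pages < MODEL_BIO_MAX_PAGES ?
			(unsigned int)pages : MODEL_BIO_MAX_PAGES;
	}
}
```

With these assumed constants, a 9-sector write needs two bio_vecs, while a
discard of any size needs none.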

* [PATCH 07/27] dm: support REQ_OP_WRITE_ZEROES
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/dm-core.h          |  1 +
 drivers/md/dm-io.c            |  8 ++++++--
 drivers/md/dm-linear.c        |  1 +
 drivers/md/dm-mpath.c         |  1 +
 drivers/md/dm-rq.c            | 11 ++++++++---
 drivers/md/dm-stripe.c        |  2 ++
 drivers/md/dm-table.c         | 30 ++++++++++++++++++++++++++++++
 drivers/md/dm.c               | 31 ++++++++++++++++++++++++++++---
 include/linux/device-mapper.h |  6 ++++++
 9 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 136fda3ff9e5..fea5bd52ada8 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -132,6 +132,7 @@ void dm_init_md_queue(struct mapped_device *md);
 void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
+void disable_write_zeroes(struct mapped_device *md);
 
 static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
 {
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index b808cbe22678..3702e502466d 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -312,9 +312,12 @@ static void do_region(int op, int op_flags, unsigned region,
 	 */
 	if (op == REQ_OP_DISCARD)
 		special_cmd_max_sectors = q->limits.max_discard_sectors;
+	else if (op == REQ_OP_WRITE_ZEROES)
+		special_cmd_max_sectors = q->limits.max_write_zeroes_sectors;
 	else if (op == REQ_OP_WRITE_SAME)
 		special_cmd_max_sectors = q->limits.max_write_same_sectors;
-	if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_SAME) &&
+	if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES ||
+	     op == REQ_OP_WRITE_SAME) &&
 	    special_cmd_max_sectors == 0) {
 		dec_count(io, region, -EOPNOTSUPP);
 		return;
@@ -330,6 +333,7 @@ static void do_region(int op, int op_flags, unsigned region,
 		 */
 		switch (op) {
 		case REQ_OP_DISCARD:
+		case REQ_OP_WRITE_ZEROES:
 			num_bvecs = 0;
 			break;
 		case REQ_OP_WRITE_SAME:
@@ -347,7 +351,7 @@ static void do_region(int op, int op_flags, unsigned region,
 		bio_set_op_attrs(bio, op, op_flags);
 		store_io_and_region_in_bio(bio, io, region);
 
-		if (op == REQ_OP_DISCARD) {
+		if (op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES) {
 			num_sectors = min_t(sector_t, special_cmd_max_sectors, remaining);
 			bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
 			remaining -= num_sectors;
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..e17fd44ceef5 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -59,6 +59,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->num_flush_bios = 1;
 	ti->num_discard_bios = 1;
 	ti->num_write_same_bios = 1;
+	ti->num_write_zeroes_bios = 1;
 	ti->private = lc;
 	return 0;
 
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 7f223dbed49f..ab55955ed704 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1103,6 +1103,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	ti->num_flush_bios = 1;
 	ti->num_discard_bios = 1;
 	ti->num_write_same_bios = 1;
+	ti->num_write_zeroes_bios = 1;
 	if (m->queue_mode == DM_TYPE_BIO_BASED)
 		ti->per_io_data_size = multipath_per_bio_data_size();
 	else
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6886bf160fb2..a789bf035621 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -298,9 +298,14 @@ static void dm_done(struct request *clone, int error, bool mapped)
 			r = rq_end_io(tio->ti, clone, error, &tio->info);
 	}
 
-	if (unlikely(r == -EREMOTEIO && (req_op(clone) == REQ_OP_WRITE_SAME) &&
-		     !clone->q->limits.max_write_same_sectors))
-		disable_write_same(tio->md);
+	if (unlikely(r == -EREMOTEIO)) {
+		if (req_op(clone) == REQ_OP_WRITE_SAME &&
+		    !clone->q->limits.max_write_same_sectors)
+			disable_write_same(tio->md);
+		if (req_op(clone) == REQ_OP_WRITE_ZEROES &&
+		    !clone->q->limits.max_write_zeroes_sectors)
+			disable_write_zeroes(tio->md);
+	}
 
 	if (r <= 0)
 		/* The target wants to complete the I/O */
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 28193a57bf47..5ef49c121d99 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -169,6 +169,7 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->num_flush_bios = stripes;
 	ti->num_discard_bios = stripes;
 	ti->num_write_same_bios = stripes;
+	ti->num_write_zeroes_bios = stripes;
 
 	sc->chunk_size = chunk_size;
 	if (chunk_size & (chunk_size - 1))
@@ -293,6 +294,7 @@ static int stripe_map(struct dm_target *ti, struct bio *bio)
 		return DM_MAPIO_REMAPPED;
 	}
 	if (unlikely(bio_op(bio) == REQ_OP_DISCARD) ||
+	    unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES) ||
 	    unlikely(bio_op(bio) == REQ_OP_WRITE_SAME)) {
 		target_bio_nr = dm_bio_get_target_bio_nr(bio);
 		BUG_ON(target_bio_nr >= sc->stripes);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 3ad16d9c9d5a..5cd665c91ead 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1533,6 +1533,34 @@ static bool dm_table_supports_write_same(struct dm_table *t)
 	return true;
 }
 
+static int device_not_write_zeroes_capable(struct dm_target *ti, struct dm_dev *dev,
+					   sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return q && !q->limits.max_write_zeroes_sectors;
+}
+
+static bool dm_table_supports_write_zeroes(struct dm_table *t)
+{
+	struct dm_target *ti;
+	unsigned i = 0;
+
+	while (i < dm_table_get_num_targets(t)) {
+		ti = dm_table_get_target(t, i++);
+
+		if (!ti->num_write_zeroes_bios)
+			return false;
+
+		if (!ti->type->iterate_devices ||
+		    ti->type->iterate_devices(ti, device_not_write_zeroes_capable, NULL))
+			return false;
+	}
+
+	return true;
+}
+
+
 static int device_discard_capable(struct dm_target *ti, struct dm_dev *dev,
 				  sector_t start, sector_t len, void *data)
 {
@@ -1603,6 +1631,8 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 
 	if (!dm_table_supports_write_same(t))
 		q->limits.max_write_same_sectors = 0;
+	if (!dm_table_supports_write_zeroes(t))
+		q->limits.max_write_zeroes_sectors = 0;
 
 	if (dm_table_all_devices_attribute(t, queue_supports_sg_merge))
 		queue_flag_clear_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..e8226359c8f7 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -825,6 +825,14 @@ void disable_write_same(struct mapped_device *md)
 	limits->max_write_same_sectors = 0;
 }
 
+void disable_write_zeroes(struct mapped_device *md)
+{
+	struct queue_limits *limits = dm_get_queue_limits(md);
+
+	/* device doesn't really support WRITE ZEROES, disable it */
+	limits->max_write_zeroes_sectors = 0;
+}
+
 static void clone_endio(struct bio *bio)
 {
 	int error = bio->bi_error;
@@ -851,9 +859,14 @@ static void clone_endio(struct bio *bio)
 		}
 	}
 
-	if (unlikely(r == -EREMOTEIO && (bio_op(bio) == REQ_OP_WRITE_SAME) &&
-		     !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors))
-		disable_write_same(md);
+	if (unlikely(r == -EREMOTEIO)) {
+		if (bio_op(bio) == REQ_OP_WRITE_SAME &&
+		    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
+			disable_write_same(md);
+		if (bio_op(bio) == REQ_OP_WRITE_ZEROES &&
+		    !bdev_get_queue(bio->bi_bdev)->limits.max_write_zeroes_sectors)
+			disable_write_zeroes(md);
+	}
 
 	free_tio(tio);
 	dec_pending(io, error);
@@ -1202,6 +1215,11 @@ static unsigned get_num_write_same_bios(struct dm_target *ti)
 	return ti->num_write_same_bios;
 }
 
+static unsigned get_num_write_zeroes_bios(struct dm_target *ti)
+{
+	return ti->num_write_zeroes_bios;
+}
+
 typedef bool (*is_split_required_fn)(struct dm_target *ti);
 
 static bool is_split_required_for_discard(struct dm_target *ti)
@@ -1256,6 +1274,11 @@ static int __send_write_same(struct clone_info *ci)
 	return __send_changing_extent_only(ci, get_num_write_same_bios, NULL);
 }
 
+static int __send_write_zeroes(struct clone_info *ci)
+{
+	return __send_changing_extent_only(ci, get_num_write_zeroes_bios, NULL);
+}
+
 /*
  * Select the correct strategy for processing a non-flush bio.
  */
@@ -1270,6 +1293,8 @@ static int __split_and_process_non_flush(struct clone_info *ci)
 		return __send_discard(ci);
 	else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
 		return __send_write_same(ci);
+	else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES))
+		return __send_write_zeroes(ci);
 
 	ti = dm_table_find_target(ci->map, ci->sector);
 	if (!dm_target_is_valid(ti))
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a7e6903866fd..3829bee2302a 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -255,6 +255,12 @@ struct dm_target {
 	unsigned num_write_same_bios;
 
 	/*
+	 * The number of WRITE ZEROES bios that will be submitted to the target.
+	 * The bio number can be accessed with dm_bio_get_target_bio_nr.
+	 */
+	unsigned num_write_zeroes_bios;
+
+	/*
 	 * The minimum number of extra bytes allocated in each io for the
 	 * target to use.
 	 */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread
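
The capability check added to dm-table.c above boils down to "every target
opted in, and every underlying device has a non-zero limit".  A rough
userspace model of that rule follows.  `struct model_target` and
`table_supports_write_zeroes` are hypothetical names, and the model leaves
out one case the patch handles: a target type without an iterate_devices
method is also treated as unsupported.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dm target and its underlying devices. */
struct model_target {
	unsigned int num_write_zeroes_bios;	/* 0: target did not opt in */
	const unsigned int *dev_limits;		/* per-device max_write_zeroes_sectors */
	size_t num_devs;
};

/*
 * The table supports WRITE ZEROES only if no target and no device
 * objects.  One zero anywhere disables the offload for the whole table.
 */
static bool table_supports_write_zeroes(const struct model_target *targets,
					size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!targets[i].num_write_zeroes_bios)
			return false;

		for (size_t d = 0; d < targets[i].num_devs; d++)
			if (!targets[i].dev_limits[d])
				return false;
	}

	return true;
}
```

When this check fails, dm_table_set_restrictions in the patch sets
max_write_zeroes_sectors to 0 on the mapped device's queue, so callers see
the offload as unavailable.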

* [PATCH 08/27] dm kcopyd: switch to use REQ_OP_WRITE_ZEROES
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

It seems like the code currently passes whatever it was using for writes
to WRITE SAME.  Just switch it to WRITE ZEROES, although that doesn't
need any payload.

Untested, and I'm confused by the code.  Maybe someone who understands
it better than me can help.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/dm-kcopyd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 9e9d04cb7d51..f85846741d50 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -733,11 +733,11 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
 		job->pages = &zero_page_list;
 
 		/*
-		 * Use WRITE SAME to optimize zeroing if all dests support it.
+		 * Use WRITE ZEROES to optimize zeroing if all dests support it.
 		 */
-		job->rw = REQ_OP_WRITE_SAME;
+		job->rw = REQ_OP_WRITE_ZEROES;
 		for (i = 0; i < job->num_dests; i++)
-			if (!bdev_write_same(job->dests[i].bdev)) {
+			if (!bdev_write_zeroes_sectors(job->dests[i].bdev)) {
 				job->rw = WRITE;
 				break;
 			}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread
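
Note on the kcopyd hunk above: the selection rule is "use the offload only
if every destination supports it, otherwise fall back to a plain write".
A minimal userspace sketch of that rule follows. The names sketch_op,
sketch_dest and sketch_pick_zero_op are hypothetical stand-ins, and the
sketch assumes that bdev_write_zeroes_sectors() reports a nonzero sector
limit exactly when the device supports the offload.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel op codes (not the real values). */
enum sketch_op { SKETCH_WRITE, SKETCH_WRITE_ZEROES };

/* Hypothetical destination: a zero limit means "no WRITE ZEROES offload". */
struct sketch_dest {
	unsigned int max_write_zeroes_sectors;
};

/* Start from the offload and drop to a plain write on the first
 * destination that cannot do it, mirroring the loop in the hunk. */
enum sketch_op sketch_pick_zero_op(const struct sketch_dest *dests,
				   size_t num_dests)
{
	size_t i;

	assert(dests != NULL || num_dests == 0);
	for (i = 0; i < num_dests; i++)
		if (!dests[i].max_write_zeroes_sectors)
			return SKETCH_WRITE;
	return SKETCH_WRITE_ZEROES;
}
```

With two destinations that both report a nonzero limit the sketch picks
SKETCH_WRITE_ZEROES; a single destination with a zero limit is enough to
force SKETCH_WRITE for the whole job.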

* [PATCH 09/27] block: stop using blkdev_issue_write_same for zeroing
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

We'll always use the WRITE ZEROES code for zeroing now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 block/blk-lib.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e5b853f2b8a2..2a8d638544a7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -364,10 +364,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			return 0;
 	}
 
-	if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
-			ZERO_PAGE(0)))
-		return 0;
-
 	blk_start_plug(&plug);
 	ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
 			&bio, discard);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 10/27] block: add a flags argument to (__)blkdev_issue_zeroout
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Turn the existing discard flag into a new BLKDEV_ZERO_NOUNMAP flag with
similar (but inverted) semantics, without referring to discard.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 block/blk-lib.c                    | 31 ++++++++++++++-----------------
 block/ioctl.c                      |  2 +-
 drivers/block/drbd/drbd_receiver.c |  9 ++++++---
 drivers/nvme/target/io-cmd.c       |  2 +-
 fs/block_dev.c                     |  2 +-
 fs/dax.c                           |  2 +-
 fs/xfs/xfs_bmap_util.c             |  2 +-
 include/linux/blkdev.h             | 16 ++++++++++------
 8 files changed, 35 insertions(+), 31 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2a8d638544a7..f9f24ec69c27 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -282,14 +282,18 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
  * @nr_sects:	number of sectors to write
  * @gfp_mask:	memory allocation flags (for bio_alloc)
  * @biop:	pointer to anchor bio
- * @discard:	discard flag
+ * @flags:	controls detailed behavior
  *
  * Description:
- *  Generate and issue number of bios with zerofiled pages.
+ *  Zero-fill a block range, either using hardware offload or by explicitly
+ *  writing zeroes to the device.
+ *
+ *  If a device is using logical block provisioning, the underlying space will
+ *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
  */
 int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-		bool discard)
+		unsigned flags)
 {
 	int ret;
 	int bi_size = 0;
@@ -337,28 +341,21 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout);
  * @sector:	start sector
  * @nr_sects:	number of sectors to write
  * @gfp_mask:	memory allocation flags (for bio_alloc)
- * @discard:	whether to discard the block range
+ * @flags:	controls detailed behavior
  *
  * Description:
- *  Zero-fill a block range.  If the discard flag is set and the block
- *  device guarantees that subsequent READ operations to the block range
- *  in question will return zeroes, the blocks will be discarded. Should
- *  the discard request fail, if the discard flag is not set, or if
- *  discard_zeroes_data is not supported, this function will resort to
- *  zeroing the blocks manually, thus provisioning (allocating,
- *  anchoring) them. If the block device supports WRITE ZEROES or WRITE SAME
- *  command(s), blkdev_issue_zeroout() will use it to optimize the process of
- *  clearing the block range. Otherwise the zeroing will be performed
- *  using regular WRITE calls.
+ *  Zero-fill a block range, either using hardware offload or by explicitly
+ *  writing zeroes to the device.  See __blkdev_issue_zeroout() for the
+ *  valid values for %flags.
  */
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
-			 sector_t nr_sects, gfp_t gfp_mask, bool discard)
+		sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
 {
 	int ret;
 	struct bio *bio = NULL;
 	struct blk_plug plug;
 
-	if (discard) {
+	if (!(flags & BLKDEV_ZERO_NOUNMAP)) {
 		if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
 				BLKDEV_DISCARD_ZERO))
 			return 0;
@@ -366,7 +363,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 
 	blk_start_plug(&plug);
 	ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
-			&bio, discard);
+			&bio, flags);
 	if (ret == 0 && bio) {
 		ret = submit_bio_wait(bio);
 		bio_put(bio);
diff --git a/block/ioctl.c b/block/ioctl.c
index 7b88820b93d9..8ea00a41be01 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -255,7 +255,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
 	truncate_inode_pages_range(mapping, start, end);
 
 	return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
-				    false);
+			BLKDEV_ZERO_NOUNMAP);
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index aa6bf9692eff..dc9a6dcd431c 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1499,19 +1499,22 @@ int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, u
 		tmp = start + granularity - sector_div(tmp, granularity);
 
 		nr = tmp - start;
-		err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO, 0);
+		err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO,
+				BLKDEV_ZERO_NOUNMAP);
 		nr_sectors -= nr;
 		start = tmp;
 	}
 	while (nr_sectors >= granularity) {
 		nr = min_t(sector_t, nr_sectors, max_discard_sectors);
-		err |= blkdev_issue_discard(bdev, start, nr, GFP_NOIO, 0);
+		err |= blkdev_issue_discard(bdev, start, nr, GFP_NOIO,
+				BLKDEV_ZERO_NOUNMAP);
 		nr_sectors -= nr;
 		start += nr;
 	}
  zero_out:
 	if (nr_sectors) {
-		err |= blkdev_issue_zeroout(bdev, start, nr_sectors, GFP_NOIO, 0);
+		err |= blkdev_issue_zeroout(bdev, start, nr_sectors, GFP_NOIO,
+				BLKDEV_ZERO_NOUNMAP);
 	}
 	return err != 0;
 }
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index 27623f2bfe6b..de266cc99397 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -184,7 +184,7 @@ static void nvmet_execute_write_zeroes(struct nvmet_req *req)
 		(req->ns->blksize_shift - 9)) + 1;
 
 	if (__blkdev_issue_zeroout(req->ns->bdev, sector, nr_sector,
-				GFP_KERNEL, &bio, true))
+				GFP_KERNEL, &bio, 0))
 		status = NVME_SC_INTERNAL | NVME_SC_DNR;
 
 	if (bio) {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index f2d59f143ef4..2f704c3a816f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2105,7 +2105,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	case FALLOC_FL_ZERO_RANGE:
 	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
 		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
-					    GFP_KERNEL, false);
+					    GFP_KERNEL, BLKDEV_ZERO_NOUNMAP);
 		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
 		/* Only punch if the device can do zeroing discard. */
diff --git a/fs/dax.c b/fs/dax.c
index de622d4282a6..2bfbcd726047 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -982,7 +982,7 @@ int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
 		sector_t start_sector = dax.sector + (offset >> 9);
 
 		return blkdev_issue_zeroout(bdev, start_sector,
-				length >> 9, GFP_NOFS, true);
+				length >> 9, GFP_NOFS, 0);
 	} else {
 		if (dax_map_atomic(bdev, &dax) < 0)
 			return PTR_ERR(dax.addr);
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 8b75dcea5966..142bbbe06114 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -81,7 +81,7 @@ xfs_zero_extent(
 	return blkdev_issue_zeroout(xfs_find_bdev_for_inode(VFS_I(ip)),
 		block << (mp->m_super->s_blocksize_bits - 9),
 		count_fsb << (mp->m_super->s_blocksize_bits - 9),
-		GFP_NOFS, true);
+		GFP_NOFS, 0);
 }
 
 int
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a2dc6b390d48..e7513ce3dbde 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1337,23 +1337,27 @@ static inline struct request *blk_map_queue_find_tag(struct blk_queue_tag *bqt,
 	return bqt->tag_index[tag];
 }
 
+extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *);
+extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
 
 #define BLKDEV_DISCARD_SECURE	(1 << 0)	/* issue a secure erase */
 #define BLKDEV_DISCARD_ZERO	(1 << 1)	/* must reliably zero data */
 
-extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *);
 extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
 extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, int flags,
 		struct bio **biop);
-extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
-		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
+
+#define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
+
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-		bool discard);
+		unsigned flags);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
-		sector_t nr_sects, gfp_t gfp_mask, bool discard);
+		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
+
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
 		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
 {
@@ -1367,7 +1371,7 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
 	return blkdev_issue_zeroout(sb->s_bdev,
 				    block << (sb->s_blocksize_bits - 9),
 				    nr_blocks << (sb->s_blocksize_bits - 9),
-				    gfp_mask, true);
+				    gfp_mask, 0);
 }
 
 extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread
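
Note on the caller conversions above: the new flag has the opposite
polarity of the old bool, so callers that passed true now pass 0 and
callers that passed false (or 0) now pass BLKDEV_ZERO_NOUNMAP. A small
userspace sketch of that mapping follows. The flag value is copied from
the blkdev.h hunk; the sketch_* helpers are hypothetical and exist only
for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* value as defined in the patch */

/* Hypothetical helper: translate the old "discard" bool to new flags. */
unsigned sketch_discard_to_flags(bool discard)
{
	return discard ? 0 : BLKDEV_ZERO_NOUNMAP;
}

/* Hypothetical helper mirroring `if (!(flags & BLKDEV_ZERO_NOUNMAP))`. */
bool sketch_may_unmap(unsigned flags)
{
	return !(flags & BLKDEV_ZERO_NOUNMAP);
}

/* The translation preserves the old behaviour in both directions. */
void sketch_selfcheck(void)
{
	assert(sketch_may_unmap(sketch_discard_to_flags(true)));
	assert(!sketch_may_unmap(sketch_discard_to_flags(false)));
}
```

The inverted polarity makes "may unmap" the default for new callers that
pass 0, which matches the xfs, dax and sb_issue_zeroout conversions in
the diff.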

* [PATCH 11/27] block: add a REQ_NOUNMAP flag for REQ_OP_WRITE_ZEROES
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

If this flag is set, devices capable of logical block provisioning
should keep the zeroed blocks anchored; if it is not set, they should
release space for the zeroed blocks if possible.

Also remove an out-of-sync kerneldoc comment for a static function
that would have become even more out of date with this change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 block/blk-lib.c           | 19 +++++--------------
 include/linux/blk_types.h |  6 ++++++
 2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index f9f24ec69c27..2f6d2cb2e1a2 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -226,20 +226,9 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
 
-/**
- * __blkdev_issue_write_zeroes - generate number of bios with WRITE ZEROES
- * @bdev:	blockdev to issue
- * @sector:	start sector
- * @nr_sects:	number of sectors to write
- * @gfp_mask:	memory allocation flags (for bio_alloc)
- * @biop:	pointer to anchor bio
- *
- * Description:
- *  Generate and issue number of bios(REQ_OP_WRITE_ZEROES) with zerofiled pages.
- */
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
-		struct bio **biop)
+		struct bio **biop, unsigned flags)
 {
 	struct bio *bio = *biop;
 	unsigned int max_write_zeroes_sectors;
@@ -258,7 +247,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		bio = next_bio(bio, 0, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		bio->bi_bdev = bdev;
-		bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0);
+		bio->bi_opf = REQ_OP_WRITE_ZEROES;
+		if (flags & BLKDEV_ZERO_NOUNMAP)
+			bio->bi_opf |= REQ_NOUNMAP;
 
 		if (nr_sects > max_write_zeroes_sectors) {
 			bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
@@ -306,7 +297,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		return -EINVAL;
 
 	ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
-			biop);
+			biop, flags);
 	if (ret == 0 || (ret && ret != -EOPNOTSUPP))
 		goto out;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4eae30bfbfca..8eaa7dca7057 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -195,6 +195,10 @@ enum req_flag_bits {
 	__REQ_PREFLUSH,		/* request for cache flush */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
+
+	/* command specific flags for REQ_OP_WRITE_ZEROES: */
+	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
+
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -212,6 +216,8 @@ enum req_flag_bits {
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 
+#define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
+
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread
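
Note on the blk-lib.c hunk above: the caller-visible BLKDEV_ZERO_NOUNMAP
flag is translated into a per-bio REQ_NOUNMAP modifier on top of the
WRITE ZEROES operation. The userspace sketch below shows only that
translation. The SKETCH_* constants are hypothetical placeholders; the
real REQ_OP_WRITE_ZEROES and REQ_NOUNMAP values come from
include/linux/blk_types.h and are not reproduced here.

```c
#include <assert.h>

#define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* value as defined in this series */

/* Hypothetical placeholder encodings, chosen only so the bits are distinct. */
#define SKETCH_REQ_OP_WRITE_ZEROES	8u
#define SKETCH_REQ_NOUNMAP		(1u << 20)

/* Build the op/flags word for one zeroing bio, as the hunk does for bi_opf. */
unsigned int sketch_write_zeroes_opf(unsigned flags)
{
	unsigned int opf = SKETCH_REQ_OP_WRITE_ZEROES;

	if (flags & BLKDEV_ZERO_NOUNMAP)
		opf |= SKETCH_REQ_NOUNMAP;
	return opf;
}

/* Without the flag the op is unmodified; with it the modifier bit is set. */
void sketch_selfcheck(void)
{
	assert(sketch_write_zeroes_opf(0) == SKETCH_REQ_OP_WRITE_ZEROES);
	assert(sketch_write_zeroes_opf(BLKDEV_ZERO_NOUNMAP) & SKETCH_REQ_NOUNMAP);
}
```

A driver that sees the modifier bit is expected to keep the zeroed blocks
provisioned; without it the driver may deallocate them, per the comment
added to enum req_flag_bits.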

* [PATCH 12/27] block: add a new BLKDEV_ZERO_NOFALLBACK flag
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if
the caller doesn't want them.

Also clean up the convoluted check for the return condition that this
new flag is added to.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 block/blk-lib.c        | 5 ++++-
 include/linux/blkdev.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2f6d2cb2e1a2..2f882e22890b 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -281,6 +281,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
  *
  *  If a device is using logical block provisioning, the underlying space will
  *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
+ *
+ *  If %flags contains BLKDEV_ZERO_NOFALLBACK, the function will return
+ *  -EOPNOTSUPP if no explicit hardware offload for zeroing is provided.
  */
 int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
@@ -298,7 +301,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 
 	ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
 			biop, flags);
-	if (ret == 0 || (ret && ret != -EOPNOTSUPP))
+	if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK))
 		goto out;
 
 	ret = 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e7513ce3dbde..a5055d760661 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1351,6 +1351,7 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		struct bio **biop);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
+#define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 13/27] block_dev: use blkdev_issue_zeroout for hole punches
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

This gets us support for efficient zeroing that is not based on discard
(e.g. NVMe) and prepares for removing the discard_zeroes_data flag.

Also remove a pointless discard support check, which blkdev_issue_discard
already performs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 fs/block_dev.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)
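
Note for readers: a small sketch of how the two hole-punch modes map to
block-layer calls after this patch.  The FALLOC_FL_* constants come from
<linux/falloc.h>; enum punch_op and punch_mode_to_op() are hypothetical
names for illustration, and modes outside the hunk below are lumped into
PUNCH_OTHER rather than modelled.

```c
#include <linux/falloc.h>

/* Hypothetical labels for the block-layer call each mode ends up in. */
enum punch_op {
	PUNCH_ZEROOUT_NOFALLBACK,	/* blkdev_issue_zeroout(), NOFALLBACK */
	PUNCH_DISCARD,			/* blkdev_issue_discard() */
	PUNCH_OTHER,			/* modes not touched by this patch */
};

static enum punch_op punch_mode_to_op(int mode)
{
	switch (mode) {
	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
		/* Reads must return zeroes, so a plain discard is not enough. */
		return PUNCH_ZEROOUT_NOFALLBACK;
	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE |
	     FALLOC_FL_NO_HIDE_STALE:
		/* Caller accepts stale data, so a discard suffices. */
		return PUNCH_DISCARD;
	default:
		return PUNCH_OTHER;
	}
}
```

The first case is where BLKDEV_ZERO_NOFALLBACK matters: a device without a
zeroing offload returns -EOPNOTSUPP instead of having zero pages written to
it.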

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2f704c3a816f..e405d8e58e31 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2069,7 +2069,6 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 			     loff_t len)
 {
 	struct block_device *bdev = I_BDEV(bdev_file_inode(file));
-	struct request_queue *q = bdev_get_queue(bdev);
 	struct address_space *mapping;
 	loff_t end = start + len - 1;
 	loff_t isize;
@@ -2108,15 +2107,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 					    GFP_KERNEL, BLKDEV_ZERO_NOUNMAP);
 		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
-		/* Only punch if the device can do zeroing discard. */
-		if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
-			return -EOPNOTSUPP;
-		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
-					     GFP_KERNEL, 0);
+		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+					     GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
 		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
-		if (!blk_queue_discard(q))
-			return -EOPNOTSUPP;
 		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, 0);
 		break;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 14/27] sd: implement unmapping Write Zeroes
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Try to use a WRITE SAME variant with the UNMAP bit set if the device
supports it and the caller allows it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/scsi/sd.c | 9 +++++++++
 1 file changed, 9 insertions(+)
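
Note for readers: a standalone sketch of the decision the new block makes.
The enums are hypothetical stand-ins for sdkp->provisioning_mode and for
the command that ends up being built; only the branching mirrors the hunk
below.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the provisioning modes named in the patch. */
enum lbp_mode { LBP_OTHER, LBP_WS16, LBP_WS10 };

enum zero_cmd {
	CMD_WS16_UNMAP,	/* WRITE SAME(16) with the UNMAP bit set */
	CMD_WS10_UNMAP,	/* WRITE SAME(10) with the UNMAP bit set */
	CMD_PLAIN,	/* fall through to the existing non-unmapping path */
};

/* nounmap models REQ_NOUNMAP being set on the request. */
static enum zero_cmd pick_zero_cmd(bool nounmap, enum lbp_mode mode)
{
	if (!nounmap) {
		switch (mode) {
		case LBP_WS16:
			return CMD_WS16_UNMAP;
		case LBP_WS10:
			return CMD_WS10_UNMAP;
		default:
			break;
		}
	}
	return CMD_PLAIN;
}
```

A request carrying REQ_NOUNMAP never takes the unmapping path, whatever the
provisioning mode is.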

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d8d9c0bdd93c..001593ed0444 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -803,6 +803,15 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd)
 	u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
 	u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
 
+	if (!(rq->cmd_flags & REQ_NOUNMAP)) {
+		switch (sdkp->provisioning_mode) {
+		case SD_LBP_WS16:
+			return sd_setup_write_same16_cmnd(cmd, true);
+		case SD_LBP_WS10:
+			return sd_setup_write_same10_cmnd(cmd, true);
+		}
+	}
+
 	if (sdp->no_write_same)
 		return BLKPREP_INVALID;
 	if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 15/27] nvme: implement REQ_OP_WRITE_ZEROES
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

This does not implement the real NVMe Write Zeroes command yet; it only
gets rid of the discard abuse for zeroing.  Also rename the quirk flag to
be a bit more self-explanatory.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/nvme/host/core.c | 10 +++++-----
 drivers/nvme/host/nvme.h |  6 +++---
 drivers/nvme/host/pci.c  |  6 +++---
 3 files changed, 11 insertions(+), 11 deletions(-)
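
Note for readers: a sketch of the limit that nvme_config_discard() ends up
advertising.  The quirk bit position matches the nvme.h hunk below;
max_write_zeroes_sectors() is a hypothetical helper, and the statement that
a zero limit keeps the block layer from issuing REQ_OP_WRITE_ZEROES is my
reading of the block layer, not something this patch spells out.

```c
#include <limits.h>

/* Same bit as NVME_QUIRK_DEALLOCATE_ZEROES in the patch. */
#define QUIRK_DEALLOCATE_ZEROES	(1u << 2)

/*
 * Hypothetical helper: the Write Zeroes limit a namespace would advertise.
 * Zero is assumed to mean "no zeroing offload available".
 */
static unsigned int max_write_zeroes_sectors(unsigned long quirks)
{
	return (quirks & QUIRK_DEALLOCATE_ZEROES) ? UINT_MAX : 0;
}
```

Only controllers carrying the quirk get a non-zero limit, which is why the
REQ_OP_WRITE_ZEROES case can safely share nvme_setup_discard() for now.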

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3c908e1bc903..26d5129a640a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -358,6 +358,8 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
 	case REQ_OP_FLUSH:
 		nvme_setup_flush(ns, cmd);
 		break;
+	case REQ_OP_WRITE_ZEROES:
+		/* currently only aliased to deallocate for a few ctrls: */
 	case REQ_OP_DISCARD:
 		ret = nvme_setup_discard(ns, req, cmd);
 		break;
@@ -923,16 +925,14 @@ static void nvme_config_discard(struct nvme_ns *ns)
 	BUILD_BUG_ON(PAGE_SIZE / sizeof(struct nvme_dsm_range) <
 			NVME_DSM_MAX_RANGES);
 
-	if (ctrl->quirks & NVME_QUIRK_DISCARD_ZEROES)
-		ns->queue->limits.discard_zeroes_data = 1;
-	else
-		ns->queue->limits.discard_zeroes_data = 0;
-
 	ns->queue->limits.discard_alignment = logical_block_size;
 	ns->queue->limits.discard_granularity = logical_block_size;
 	blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
 	blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+
+	if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
+		blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX);
 }
 
 static int nvme_revalidate_ns(struct nvme_ns *ns, struct nvme_id_ns **id)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 227f281482db..f903726eeb68 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -68,10 +68,10 @@ enum nvme_quirks {
 	NVME_QUIRK_IDENTIFY_CNS			= (1 << 1),
 
 	/*
-	 * The controller deterministically returns O's on reads to discarded
-	 * logical blocks.
+	 * The controller deterministically returns 0's on reads to
+	 * logical blocks that deallocate was called on.
 	 */
-	NVME_QUIRK_DISCARD_ZEROES		= (1 << 2),
+	NVME_QUIRK_DEALLOCATE_ZEROES		= (1 << 2),
 
 	/*
 	 * The controller needs a delay before starts checking the device
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9e686a67d93b..cb530a6bef3f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2113,13 +2113,13 @@ static const struct pci_error_handlers nvme_err_handler = {
 static const struct pci_device_id nvme_id_table[] = {
 	{ PCI_VDEVICE(INTEL, 0x0953),
 		.driver_data = NVME_QUIRK_STRIPE_SIZE |
-				NVME_QUIRK_DISCARD_ZEROES, },
+				NVME_QUIRK_DEALLOCATE_ZEROES, },
 	{ PCI_VDEVICE(INTEL, 0x0a53),
 		.driver_data = NVME_QUIRK_STRIPE_SIZE |
-				NVME_QUIRK_DISCARD_ZEROES, },
+				NVME_QUIRK_DEALLOCATE_ZEROES, },
 	{ PCI_VDEVICE(INTEL, 0x0a54),
 		.driver_data = NVME_QUIRK_STRIPE_SIZE |
-				NVME_QUIRK_DISCARD_ZEROES, },
+				NVME_QUIRK_DEALLOCATE_ZEROES, },
 	{ PCI_VDEVICE(INTEL, 0x5845),	/* Qemu emulated controller */
 		.driver_data = NVME_QUIRK_IDENTIFY_CNS, },
 	{ PCI_DEVICE(0x1c58, 0x0003),	/* HGST adapter */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 16/27] zram: implement REQ_OP_WRITE_ZEROES
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Write Zeroes is just the same as discard if the logical block size equals
the system page size, so only advertise it in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/block/zram/zram_drv.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)
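
Note for readers: a sketch of why the block size condition matters.  It
assumes, as the comment in zram_add() suggests, that the discard path only
clears whole pages and skips partial ones.  bytes_cleared() is a
hypothetical model, not the driver code, and it expects offset to be
smaller than page_size.

```c
#include <stddef.h>

/*
 * Hypothetical model: bytes a page-granular discard actually clears for a
 * request of len bytes that starts offset bytes into a page.  Partial
 * pages at either end are skipped.
 */
static size_t bytes_cleared(size_t page_size, size_t offset, size_t len)
{
	if (offset) {
		size_t head = page_size - offset;	/* partial first page */

		if (len <= head)
			return 0;
		len -= head;
	}
	return len - (len % page_size);		/* whole pages only */
}
```

With 4096-byte blocks on 4096-byte pages every block-aligned request is
cleared in full.  With 4096-byte blocks on 65536-byte pages a single-block
request clears nothing, so reads would not be guaranteed to return zeroes.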

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index dceb5edd1e54..1710b06f04a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -829,10 +829,14 @@ static void __zram_make_request(struct zram *zram, struct bio *bio)
 	offset = (bio->bi_iter.bi_sector &
 		  (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
 
-	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+	switch (bio_op(bio)) {
+	case REQ_OP_DISCARD:
+	case REQ_OP_WRITE_ZEROES:
 		zram_bio_discard(zram, index, offset, bio);
 		bio_endio(bio);
 		return;
+	default:
+		break;
 	}
 
 	bio_for_each_segment(bvec, bio, iter) {
@@ -1192,6 +1196,8 @@ static int zram_add(void)
 	zram->disk->queue->limits.max_sectors = SECTORS_PER_PAGE;
 	zram->disk->queue->limits.chunk_sectors = 0;
 	blk_queue_max_discard_sectors(zram->disk->queue, UINT_MAX);
+	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
+
 	/*
 	 * zram_bio_discard() will clear all logical blocks if logical block
 	 * size is identical with physical block size(PAGE_SIZE). But if it is
@@ -1201,10 +1207,7 @@ static int zram_add(void)
 	 * zeroed.
 	 */
 	if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
-		zram->disk->queue->limits.discard_zeroes_data = 1;
-	else
-		zram->disk->queue->limits.discard_zeroes_data = 0;
-	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
+		blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
 	add_disk(zram->disk);
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 17/27] loop: implement REQ_OP_WRITE_ZEROES
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

It's identical to discard as hole punches will always leave us with
zeroes on reads.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/block/loop.c | 4 ++++
 1 file changed, 4 insertions(+)
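
Note for readers: a userspace sketch of the property this patch relies on,
namely that a range punched out of a regular file reads back as zeroes.
The scratch path under /tmp and the helper name are assumptions made for
the example, and filesystems without hole punch support will report an
error instead.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a non-zero pattern to a scratch file, punch it out and read it
 * back.  Returns 1 if the punched range reads as zeroes, 0 if it does not,
 * or a negative errno value (e.g. -EOPNOTSUPP without hole punch support).
 */
static int punched_range_reads_zero(void)
{
	char path[] = "/tmp/punch-XXXXXX";	/* scratch file, assumption */
	char buf[4096], zero[4096] = { 0 };
	int fd = mkstemp(path);
	int ret;

	if (fd < 0)
		return -errno;
	unlink(path);				/* file lives until close() */

	memset(buf, 0xaa, sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf) ||
	    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, sizeof(buf)) < 0 ||
	    pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		ret = errno ? -errno : -EIO;
	else
		ret = memcmp(buf, zero, sizeof(buf)) == 0;

	close(fd);
	return ret;
}
```

The loop driver issues the same kind of hole punch against its backing
file, which is why discard and Write Zeroes can share lo_discard().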

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc981f34e017..3bb04c1a4ba1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -528,6 +528,7 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 	case REQ_OP_FLUSH:
 		return lo_req_flush(lo, rq);
 	case REQ_OP_DISCARD:
+	case REQ_OP_WRITE_ZEROES:
 		return lo_discard(lo, rq, pos);
 	case REQ_OP_WRITE:
 		if (lo->transfer)
@@ -826,6 +827,7 @@ static void loop_config_discard(struct loop_device *lo)
 		q->limits.discard_granularity = 0;
 		q->limits.discard_alignment = 0;
 		blk_queue_max_discard_sectors(q, 0);
+		blk_queue_max_write_zeroes_sectors(q, 0);
 		q->limits.discard_zeroes_data = 0;
 		queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
 		return;
@@ -834,6 +836,7 @@ static void loop_config_discard(struct loop_device *lo)
 	q->limits.discard_granularity = inode->i_sb->s_blocksize;
 	q->limits.discard_alignment = 0;
 	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+	blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
 	q->limits.discard_zeroes_data = 1;
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
@@ -1660,6 +1663,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	switch (req_op(cmd->rq)) {
 	case REQ_OP_FLUSH:
 	case REQ_OP_DISCARD:
+	case REQ_OP_WRITE_ZEROES:
 		cmd->use_aio = false;
 		break;
 	default:
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 18/27] brd: remove discard support
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

It's just an in-driver reimplementation of writing zeroes to the pages,
which fails if the discards aren't page aligned.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/block/brd.c | 54 -----------------------------------------------------
 1 file changed, 54 deletions(-)
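
Note for readers: a standalone restatement of the alignment test that the
removed discard path applied in brd_make_request().  The helper name is
hypothetical, and page_size is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT	9

/*
 * Hypothetical restatement of the removed check: a discard was only
 * accepted when it started on a page boundary and covered whole pages.
 */
static bool brd_discard_was_accepted(uint64_t sector, size_t bytes,
				     size_t page_size)
{
	uint64_t sectors_per_page = page_size >> SECTOR_SHIFT;

	return !(sector & (sectors_per_page - 1)) &&
	       !(bytes & (page_size - 1));
}
```

Anything that fails this test took the io_error path, which is the failure
mode the commit message refers to.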

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..4ec84d504780 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -134,28 +134,6 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 	return page;
 }
 
-static void brd_free_page(struct brd_device *brd, sector_t sector)
-{
-	struct page *page;
-	pgoff_t idx;
-
-	spin_lock(&brd->brd_lock);
-	idx = sector >> PAGE_SECTORS_SHIFT;
-	page = radix_tree_delete(&brd->brd_pages, idx);
-	spin_unlock(&brd->brd_lock);
-	if (page)
-		__free_page(page);
-}
-
-static void brd_zero_page(struct brd_device *brd, sector_t sector)
-{
-	struct page *page;
-
-	page = brd_lookup_page(brd, sector);
-	if (page)
-		clear_highpage(page);
-}
-
 /*
  * Free all backing store pages and radix tree. This must only be called when
  * there are no other users of the device.
@@ -212,24 +190,6 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n)
 	return 0;
 }
 
-static void discard_from_brd(struct brd_device *brd,
-			sector_t sector, size_t n)
-{
-	while (n >= PAGE_SIZE) {
-		/*
-		 * Don't want to actually discard pages here because
-		 * re-allocating the pages can result in writeback
-		 * deadlocks under heavy load.
-		 */
-		if (0)
-			brd_free_page(brd, sector);
-		else
-			brd_zero_page(brd, sector);
-		sector += PAGE_SIZE >> SECTOR_SHIFT;
-		n -= PAGE_SIZE;
-	}
-}
-
 /*
  * Copy n bytes from src to the brd starting at sector. Does not sleep.
  */
@@ -338,14 +298,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
 	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk))
 		goto io_error;
 
-	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
-		if (sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1) ||
-		    bio->bi_iter.bi_size & ~PAGE_MASK)
-			goto io_error;
-		discard_from_brd(brd, sector, bio->bi_iter.bi_size);
-		goto out;
-	}
-
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 		int err;
@@ -357,7 +309,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
 		sector += len >> SECTOR_SHIFT;
 	}
 
-out:
 	bio_endio(bio);
 	return BLK_QC_T_NONE;
 io_error:
@@ -464,11 +415,6 @@ static struct brd_device *brd_alloc(int i)
 	 *  is harmless)
 	 */
 	blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE);
-
-	brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
-	blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
-	brd->brd_queue->limits.discard_zeroes_data = 1;
-	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
 #endif
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 19/27] rbd: remove the discard_zeroes_data flag
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

rbd only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writes of zeroes.
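
To illustrate the alignment argument, here is a minimal userspace sketch
(not kernel code) that splits a zeroing request into an unaligned head, a
body of whole granules that a discard could cover, and an unaligned tail.
The struct, the function name and the 8192-sector granularity used in the
note below are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: all values are counts of 512-byte sectors. */
struct zero_split {
	uint64_t head;	/* sectors before the first granule boundary */
	uint64_t body;	/* whole granules a discard could cover */
	uint64_t tail;	/* sectors left after the last whole granule */
};

/*
 * Split the range [start, start + nr) against a discard granularity.
 * Only the body could be handed to a discard; head and tail would need
 * explicit zero writes.
 */
static struct zero_split split_for_discard(uint64_t start, uint64_t nr,
					    uint64_t gran)
{
	struct zero_split s = { 0, 0, 0 };
	uint64_t misalign;

	assert(gran > 0);

	misalign = start % gran;
	if (misalign) {
		s.head = gran - misalign;
		if (s.head > nr)
			s.head = nr;
	}
	s.body = ((nr - s.head) / gran) * gran;
	s.tail = nr - s.head - s.body;
	return s;
}
```

With a granularity of 8192 sectors (4 MiB, chosen only as an example), a
16-sector request starting at sector 8 has an empty body, so nothing could
be discarded and the whole range is written as zeroes.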

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/block/rbd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index f24ade333e0c..089ac4179919 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4380,7 +4380,6 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 	q->limits.discard_granularity = segment_size;
 	q->limits.discard_alignment = segment_size;
 	blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE);
-	q->limits.discard_zeroes_data = 1;
 
 	if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
 		q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 20/27] rsxx: remove the discard_zeroes_data flag
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

rsxx only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writes of zeroes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/block/rsxx/dev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index f81d70b39d10..9c566364ac9c 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -300,7 +300,6 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
 						RSXX_HW_BLK_SIZE >> 9);
 		card->queue->limits.discard_granularity = RSXX_HW_BLK_SIZE;
 		card->queue->limits.discard_alignment   = RSXX_HW_BLK_SIZE;
-		card->queue->limits.discard_zeroes_data = 1;
 	}
 
 	card->queue->queuedata = card;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 21/27] mmc: remove the discard_zeroes_data flag
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

mmc only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writes of zeroes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/mmc/core/queue.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 493eb10ce580..4c54ad34e17a 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -167,8 +167,6 @@ static void mmc_queue_setup_discard(struct request_queue *q,
 
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 	blk_queue_max_discard_sectors(q, max_discard);
-	if (card->erased_byte == 0 && !mmc_can_discard(card))
-		q->limits.discard_zeroes_data = 1;
 	q->limits.discard_granularity = card->pref_erase << 9;
 	/* granularity must not be greater than max. discard */
 	if (card->pref_erase > max_discard)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 22/27] block: stop using discards for zeroing
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

Now that we have REQ_OP_WRITE_ZEROES implemented for all devices that
support efficient zeroing, we can remove the call to blkdev_issue_discard.
This means we only have two ways of zeroing left and can simplify the
code.
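
The two remaining paths, and the retry described in the comment added
below, can be modelled in userspace roughly as follows.  This is only a
sketch: struct fake_dev, fake_issue_zeroout() and zeroout_with_retry() are
hypothetical names, and the assumption that a failed offload clears the
advertised support (so that the retry takes the fallback path) belongs to
this model rather than being taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MODEL_SECTOR_SIZE 512

/* Hypothetical in-memory device, used only for this sketch. */
struct fake_dev {
	uint8_t *data;
	uint64_t nr_sectors;
	int advertises_offload;	/* driver claims zeroing offload support */
	int offload_works;	/* device actually honours the command */
	int explicit_writes;	/* number of times the fallback path ran */
};

static int fake_issue_zeroout(struct fake_dev *d, uint64_t sector, uint64_t nr)
{
	assert(d && d->data);

	if (sector > d->nr_sectors || nr > d->nr_sectors - sector)
		return -EINVAL;

	if (d->advertises_offload) {
		if (!d->offload_works) {
			/* Model assumption: stop advertising after a failure. */
			d->advertises_offload = 0;
			return -EOPNOTSUPP;
		}
		/* Stands in for the offloaded zeroing command. */
		memset(d->data + sector * MODEL_SECTOR_SIZE, 0,
		       nr * MODEL_SECTOR_SIZE);
		return 0;
	}

	/* Fallback: explicitly write zeroes. */
	memset(d->data + sector * MODEL_SECTOR_SIZE, 0, nr * MODEL_SECTOR_SIZE);
	d->explicit_writes++;
	return 0;
}

/* Caller-side pattern: retry once when the offload turned out unsupported. */
static int zeroout_with_retry(struct fake_dev *d, uint64_t sector, uint64_t nr)
{
	int ret = fake_issue_zeroout(d, sector, nr);

	if (ret == -EOPNOTSUPP)
		ret = fake_issue_zeroout(d, sector, nr);
	return ret;
}
```

In this model the first call on a device that advertises but does not
implement the offload returns -EOPNOTSUPP and leaves the data untouched;
the second call succeeds through the explicit-write path.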

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 block/blk-lib.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2f882e22890b..b0c6c4bcf441 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -279,6 +279,12 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
  *  Zero-fill a block range, either using hardware offload or by explicitly
  *  writing zeroes to the device.
  *
+ *  Note that this function may fail with -EOPNOTSUPP if the driver signals
+ *  zeroing offload support, but the device fails to process the command (for
+ *  some devices there is no non-destructive way to verify whether this
+ *  operation is actually supported).  In this case the caller should retry
+ *  the call to blkdev_issue_zeroout() and the fallback path will be used.
+ *
  *  If a device is using logical block provisioning, the underlying space will
  *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
  *
@@ -349,12 +355,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	struct bio *bio = NULL;
 	struct blk_plug plug;
 
-	if (!(flags & BLKDEV_ZERO_NOUNMAP)) {
-		if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
-				BLKDEV_DISCARD_ZERO))
-			return 0;
-	}
-
 	blk_start_plug(&plug);
 	ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
 			&bio, flags);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

drbd always wants its discard wire operations to zero the blocks, so
use blkdev_issue_zeroout without the BLKDEV_ZERO_NOUNMAP flag (so that
unmapping is allowed) instead of reinventing it poorly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/block/drbd/drbd_debugfs.c  |   3 --
 drivers/block/drbd/drbd_int.h      |   6 ---
 drivers/block/drbd/drbd_receiver.c | 102 ++-----------------------------------
 drivers/block/drbd/drbd_req.c      |   6 +--
 4 files changed, 7 insertions(+), 110 deletions(-)

diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index de5c3ee8a790..494837e59f23 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -236,9 +236,6 @@ static void seq_print_peer_request_flags(struct seq_file *m, struct drbd_peer_re
 	seq_print_rq_state_bit(m, f & EE_CALL_AL_COMPLETE_IO, &sep, "in-AL");
 	seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, &sep, "C");
 	seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, &sep, "set-in-sync");
-
-	if (f & EE_IS_TRIM)
-		__seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, &sep, "zero-out", "trim");
 	seq_print_rq_state_bit(m, f & EE_WRITE_SAME, &sep, "write-same");
 	seq_putc(m, '\n');
 }
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 724d1c50fc52..d5da45bb03a6 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -437,9 +437,6 @@ enum {
 
 	/* is this a TRIM aka REQ_DISCARD? */
 	__EE_IS_TRIM,
-	/* our lower level cannot handle trim,
-	 * and we want to fall back to zeroout instead */
-	__EE_IS_TRIM_USE_ZEROOUT,
 
 	/* In case a barrier failed,
 	 * we need to resubmit without the barrier flag. */
@@ -482,7 +479,6 @@ enum {
 #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
 #define EE_MAY_SET_IN_SYNC     (1<<__EE_MAY_SET_IN_SYNC)
 #define EE_IS_TRIM             (1<<__EE_IS_TRIM)
-#define EE_IS_TRIM_USE_ZEROOUT (1<<__EE_IS_TRIM_USE_ZEROOUT)
 #define EE_RESUBMITTED         (1<<__EE_RESUBMITTED)
 #define EE_WAS_ERROR           (1<<__EE_WAS_ERROR)
 #define EE_HAS_DIGEST          (1<<__EE_HAS_DIGEST)
@@ -1561,8 +1557,6 @@ extern void start_resync_timer_fn(unsigned long data);
 extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);
 
 /* drbd_receiver.c */
-extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
-		sector_t start, unsigned int nr_sectors, bool discard);
 extern int drbd_receiver(struct drbd_thread *thi);
 extern int drbd_ack_receiver(struct drbd_thread *thi);
 extern void drbd_send_ping_wf(struct work_struct *ws);
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index dc9a6dcd431c..bc1d296581f9 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1448,108 +1448,14 @@ void drbd_bump_write_ordering(struct drbd_resource *resource, struct drbd_backin
 		drbd_info(resource, "Method to ensure write ordering: %s\n", write_ordering_str[resource->write_ordering]);
 }
 
-/*
- * We *may* ignore the discard-zeroes-data setting, if so configured.
- *
- * Assumption is that it "discard_zeroes_data=0" is only because the backend
- * may ignore partial unaligned discards.
- *
- * LVM/DM thin as of at least
- *   LVM version:     2.02.115(2)-RHEL7 (2015-01-28)
- *   Library version: 1.02.93-RHEL7 (2015-01-28)
- *   Driver version:  4.29.0
- * still behaves this way.
- *
- * For unaligned (wrt. alignment and granularity) or too small discards,
- * we zero-out the initial (and/or) trailing unaligned partial chunks,
- * but discard all the aligned full chunks.
- *
- * At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
- */
-int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, unsigned int nr_sectors, bool discard)
-{
-	struct block_device *bdev = device->ldev->backing_bdev;
-	struct request_queue *q = bdev_get_queue(bdev);
-	sector_t tmp, nr;
-	unsigned int max_discard_sectors, granularity;
-	int alignment;
-	int err = 0;
-
-	if (!discard)
-		goto zero_out;
-
-	/* Zero-sector (unknown) and one-sector granularities are the same.  */
-	granularity = max(q->limits.discard_granularity >> 9, 1U);
-	alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
-
-	max_discard_sectors = min(q->limits.max_discard_sectors, (1U << 22));
-	max_discard_sectors -= max_discard_sectors % granularity;
-	if (unlikely(!max_discard_sectors))
-		goto zero_out;
-
-	if (nr_sectors < granularity)
-		goto zero_out;
-
-	tmp = start;
-	if (sector_div(tmp, granularity) != alignment) {
-		if (nr_sectors < 2*granularity)
-			goto zero_out;
-		/* start + gran - (start + gran - align) % gran */
-		tmp = start + granularity - alignment;
-		tmp = start + granularity - sector_div(tmp, granularity);
-
-		nr = tmp - start;
-		err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO,
-				BLKDEV_ZERO_NOUNMAP);
-		nr_sectors -= nr;
-		start = tmp;
-	}
-	while (nr_sectors >= granularity) {
-		nr = min_t(sector_t, nr_sectors, max_discard_sectors);
-		err |= blkdev_issue_discard(bdev, start, nr, GFP_NOIO,
-				BLKDEV_ZERO_NOUNMAP);
-		nr_sectors -= nr;
-		start += nr;
-	}
- zero_out:
-	if (nr_sectors) {
-		err |= blkdev_issue_zeroout(bdev, start, nr_sectors, GFP_NOIO,
-				BLKDEV_ZERO_NOUNMAP);
-	}
-	return err != 0;
-}
-
-static bool can_do_reliable_discards(struct drbd_device *device)
-{
-	struct request_queue *q = bdev_get_queue(device->ldev->backing_bdev);
-	struct disk_conf *dc;
-	bool can_do;
-
-	if (!blk_queue_discard(q))
-		return false;
-
-	if (q->limits.discard_zeroes_data)
-		return true;
-
-	rcu_read_lock();
-	dc = rcu_dereference(device->ldev->disk_conf);
-	can_do = dc->discard_zeroes_if_aligned;
-	rcu_read_unlock();
-	return can_do;
-}
-
 static void drbd_issue_peer_discard(struct drbd_device *device, struct drbd_peer_request *peer_req)
 {
-	/* If the backend cannot discard, or does not guarantee
-	 * read-back zeroes in discarded ranges, we fall back to
-	 * zero-out.  Unless configuration specifically requested
-	 * otherwise. */
-	if (!can_do_reliable_discards(device))
-		peer_req->flags |= EE_IS_TRIM_USE_ZEROOUT;
+	struct block_device *bdev = device->ldev->backing_bdev;
 
-	if (drbd_issue_discard_or_zero_out(device, peer_req->i.sector,
-	    peer_req->i.size >> 9, !(peer_req->flags & EE_IS_TRIM_USE_ZEROOUT)))
+	if (blkdev_issue_zeroout(bdev, peer_req->i.sector, peer_req->i.size >> 9,
+			GFP_NOIO, 0))
 		peer_req->flags |= EE_WAS_ERROR;
+
 	drbd_endio_write_sec_final(peer_req);
 }
 
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 652114ae1a8a..6da9ea8c48b6 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1148,10 +1148,10 @@ static int drbd_process_write_request(struct drbd_request *req)
 
 static void drbd_process_discard_req(struct drbd_request *req)
 {
-	int err = drbd_issue_discard_or_zero_out(req->device,
-				req->i.sector, req->i.size >> 9, true);
+	struct block_device *bdev = req->device->ldev->backing_bdev;
 
-	if (err)
+	if (blkdev_issue_zeroout(bdev, req->i.sector, req->i.size >> 9,
+			GFP_NOIO, 0))
 		req->private_bio->bi_error = -EIO;
 	bio_endio(req->private_bio);
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 24/27] drbd: implement REQ_OP_WRITE_ZEROES
  2017-04-05 17:20 ` Christoph Hellwig
@ 2017-04-05 17:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-raid, dm-devel, linux-scsi, drbd-dev

It seems like DRBD assumes its on-the-wire TRIM request always zeroes data.
Use that fact to implement REQ_OP_WRITE_ZEROES.
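
As a rough illustration of the resulting mapping, the following userspace
model mirrors what the hunks below do: both discard and write-zeroes
requests are sent with the discard wire flag, and the receiver turns that
flag into a write-zeroes operation.  The enum, the MODEL_* constants and
the function names are hypothetical stand-ins, not the real kernel
definitions or values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the block-layer operations. */
enum model_op {
	MODEL_OP_WRITE,
	MODEL_OP_DISCARD,
	MODEL_OP_WRITE_SAME,
	MODEL_OP_WRITE_ZEROES,
};

/* Hypothetical stand-ins for the wire flags; values are arbitrary. */
#define MODEL_DP_DISCARD	(1u << 0)
#define MODEL_DP_WSAME		(1u << 1)

/* Sender side: discard and write-zeroes share one wire flag. */
static uint32_t model_op_to_wire(enum model_op op)
{
	switch (op) {
	case MODEL_OP_DISCARD:
	case MODEL_OP_WRITE_ZEROES:
		return MODEL_DP_DISCARD;
	case MODEL_OP_WRITE_SAME:
		return MODEL_DP_WSAME;
	default:
		return 0;
	}
}

/* Receiver side: the discard wire flag now means "zero this range". */
static enum model_op model_wire_to_op(uint32_t dpf)
{
	if (dpf & MODEL_DP_DISCARD)
		return MODEL_OP_WRITE_ZEROES;
	return MODEL_OP_WRITE;
}

/* Sanity check of the property described above. */
static void model_check_roundtrip(void)
{
	assert(model_wire_to_op(model_op_to_wire(MODEL_OP_DISCARD)) ==
	       MODEL_OP_WRITE_ZEROES);
	assert(model_wire_to_op(model_op_to_wire(MODEL_OP_WRITE_ZEROES)) ==
	       MODEL_OP_WRITE_ZEROES);
	assert(model_wire_to_op(model_op_to_wire(MODEL_OP_WRITE)) ==
	       MODEL_OP_WRITE);
}
```

In this model a discard submitted on the sending node is executed as a
zeroing operation on the receiving node, which matches the assumption
stated above that the wire TRIM always zeroes data.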

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/block/drbd/drbd_main.c     | 3 ++-
 drivers/block/drbd/drbd_nl.c       | 2 ++
 drivers/block/drbd/drbd_receiver.c | 6 +++---
 drivers/block/drbd/drbd_req.c      | 7 +++++--
 drivers/block/drbd/drbd_worker.c   | 4 +++-
 5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cbd04ee..8e62d9f65510 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1668,7 +1668,8 @@ static u32 bio_flags_to_wire(struct drbd_connection *connection,
 			(bio->bi_opf & REQ_FUA ? DP_FUA : 0) |
 			(bio->bi_opf & REQ_PREFLUSH ? DP_FLUSH : 0) |
 			(bio_op(bio) == REQ_OP_WRITE_SAME ? DP_WSAME : 0) |
-			(bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0);
+			(bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0) |
+			(bio_op(bio) == REQ_OP_WRITE_ZEROES ? DP_DISCARD : 0);
 	else
 		return bio->bi_opf & REQ_SYNC ? DP_RW_SYNC : 0;
 }
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 908c704e20aa..e4516d3b971d 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1217,10 +1217,12 @@ static void decide_on_discard_support(struct drbd_device *device,
 		blk_queue_discard_granularity(q, 512);
 		q->limits.max_discard_sectors = drbd_max_discard_sectors(connection);
 		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+		q->limits.max_write_zeroes_sectors = drbd_max_discard_sectors(connection);
 	} else {
 		queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
 		blk_queue_discard_granularity(q, 0);
 		q->limits.max_discard_sectors = 0;
+		q->limits.max_write_zeroes_sectors = 0;
 	}
 }
 
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index bc1d296581f9..1b0a2be24f39 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -2285,7 +2285,7 @@ static unsigned long wire_flags_to_bio_flags(u32 dpf)
 static unsigned long wire_flags_to_bio_op(u32 dpf)
 {
 	if (dpf & DP_DISCARD)
-		return REQ_OP_DISCARD;
+		return REQ_OP_WRITE_ZEROES;
 	else
 		return REQ_OP_WRITE;
 }
@@ -2476,7 +2476,7 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *
 	op_flags = wire_flags_to_bio_flags(dp_flags);
 	if (pi->cmd == P_TRIM) {
 		D_ASSERT(peer_device, peer_req->i.size > 0);
-		D_ASSERT(peer_device, op == REQ_OP_DISCARD);
+		D_ASSERT(peer_device, op == REQ_OP_WRITE_ZEROES);
 		D_ASSERT(peer_device, peer_req->pages == NULL);
 	} else if (peer_req->pages == NULL) {
 		D_ASSERT(device, peer_req->i.size == 0);
@@ -4789,7 +4789,7 @@ static int receive_rs_deallocated(struct drbd_connection *connection, struct pac
 
 	if (get_ldev(device)) {
 		struct drbd_peer_request *peer_req;
-		const int op = REQ_OP_DISCARD;
+		const int op = REQ_OP_WRITE_ZEROES;
 
 		peer_req = drbd_alloc_peer_req(peer_device, ID_SYNCER, sector,
 					       size, 0, GFP_NOIO);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 6da9ea8c48b6..b5730e17b455 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -59,6 +59,7 @@ static struct drbd_request *drbd_req_new(struct drbd_device *device, struct bio
 	drbd_req_make_private_bio(req, bio_src);
 	req->rq_state = (bio_data_dir(bio_src) == WRITE ? RQ_WRITE : 0)
 		      | (bio_op(bio_src) == REQ_OP_WRITE_SAME ? RQ_WSAME : 0)
+		      | (bio_op(bio_src) == REQ_OP_WRITE_ZEROES ? RQ_UNMAP : 0)
 		      | (bio_op(bio_src) == REQ_OP_DISCARD ? RQ_UNMAP : 0);
 	req->device = device;
 	req->master_bio = bio_src;
@@ -1180,7 +1181,8 @@ drbd_submit_req_private_bio(struct drbd_request *req)
 	if (get_ldev(device)) {
 		if (drbd_insert_fault(device, type))
 			bio_io_error(bio);
-		else if (bio_op(bio) == REQ_OP_DISCARD)
+		else if (bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+			 bio_op(bio) == REQ_OP_DISCARD)
 			drbd_process_discard_req(req);
 		else
 			generic_make_request(bio);
@@ -1234,7 +1236,8 @@ drbd_request_prepare(struct drbd_device *device, struct bio *bio, unsigned long
 	_drbd_start_io_acct(device, req);
 
 	/* process discards always from our submitter thread */
-	if (bio_op(bio) & REQ_OP_DISCARD)
+	if (bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+	    bio_op(bio) == REQ_OP_DISCARD)
 		goto queue_for_submitter_thread;
 
 	if (rw == WRITE && req->private_bio && req->i.size
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 3bff33f21435..1afcb4e02d8d 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -174,7 +174,8 @@ void drbd_peer_request_endio(struct bio *bio)
 	struct drbd_peer_request *peer_req = bio->bi_private;
 	struct drbd_device *device = peer_req->peer_device->device;
 	bool is_write = bio_data_dir(bio) == WRITE;
-	bool is_discard = !!(bio_op(bio) == REQ_OP_DISCARD);
+	bool is_discard = bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+			  bio_op(bio) == REQ_OP_DISCARD;
 
 	if (bio->bi_error && __ratelimit(&drbd_ratelimit_state))
 		drbd_warn(device, "%s: error=%d s=%llus\n",
@@ -249,6 +250,7 @@ void drbd_request_endio(struct bio *bio)
 	/* to avoid recursion in __req_mod */
 	if (unlikely(bio->bi_error)) {
 		switch (bio_op(bio)) {
+		case REQ_OP_WRITE_ZEROES:
 		case REQ_OP_DISCARD:
 			if (bio->bi_error == -EOPNOTSUPP)
 				what = DISCARD_COMPLETED_NOTSUPP;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-04-05 17:20 ` Christoph Hellwig
                   ` (24 preceding siblings ...)
  (?)
@ 2017-04-05 17:21 ` Christoph Hellwig
  2017-05-01 20:45     ` Bart Van Assche
  -1 siblings, 1 reply; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid

Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere, we can
kill the discard_zeroes_data hack.
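
For userspace the visible effect is that BLKDISCARDZEROES and the
discard_zeroes_data sysfs file always report 0, so a caller that needs
zeroed blocks has to request zeroing explicitly rather than infer it from
discard behaviour. A minimal sketch of that, assuming a Linux system that
provides BLKZEROOUT in <linux/fs.h>; the helper name zero_range() is
hypothetical and not part of this series:

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKZEROOUT */

/*
 * Hypothetical helper: zero the byte range [start, start + len) of the
 * block device open on fd.  The kernel rejects values that are not
 * multiples of 512.  Returns 0 on success, -1 with errno set on failure.
 */
static int zero_range(int fd, uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };

	/* Unlike BLKDISCARD, BLKZEROOUT guarantees zeroes on read-back. */
	return ioctl(fd, BLKZEROOUT, range);
}
```

A caller would open the block device for writing and pass the descriptor
in; on an invalid descriptor the call fails with -1 like any other ioctl.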

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 Documentation/ABI/testing/sysfs-block | 10 ++-----
 Documentation/block/queue-sysfs.txt   |  5 ----
 block/blk-lib.c                       |  7 +----
 block/blk-settings.c                  |  3 ---
 block/blk-sysfs.c                     |  2 +-
 block/compat_ioctl.c                  |  2 +-
 block/ioctl.c                         |  2 +-
 drivers/block/drbd/drbd_main.c        |  2 --
 drivers/block/drbd/drbd_nl.c          |  7 +----
 drivers/block/loop.c                  |  2 --
 drivers/block/mtip32xx/mtip32xx.c     |  1 -
 drivers/block/nbd.c                   |  1 -
 drivers/md/dm-cache-target.c          |  1 -
 drivers/md/dm-crypt.c                 |  1 -
 drivers/md/dm-raid.c                  |  6 ++---
 drivers/md/dm-raid1.c                 |  1 -
 drivers/md/dm-table.c                 | 19 -------------
 drivers/md/dm-thin.c                  |  2 --
 drivers/md/raid5.c                    | 50 +++++++++++------------------------
 drivers/scsi/sd.c                     |  5 ----
 drivers/target/target_core_device.c   |  2 +-
 include/linux/blkdev.h                | 15 -----------
 include/linux/device-mapper.h         |  5 ----
 23 files changed, 27 insertions(+), 124 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 2da04ce6aeef..dea212db9df3 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -213,14 +213,8 @@ What:		/sys/block/<disk>/queue/discard_zeroes_data
 Date:		May 2011
 Contact:	Martin K. Petersen <martin.petersen@oracle.com>
 Description:
-		Devices that support discard functionality may return
-		stale or random data when a previously discarded block
-		is read back. This can cause problems if the filesystem
-		expects discarded blocks to be explicitly cleared. If a
-		device reports that it deterministically returns zeroes
-		when a discarded area is read the discard_zeroes_data
-		parameter will be set to one. Otherwise it will be 0 and
-		the result of reading a discarded area is undefined.
+		Will always return 0.  Don't rely on any specific behavior
+		for discards, and don't read this file.
 
 What:		/sys/block/<disk>/queue/write_same_max_bytes
 Date:		January 2012
diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index b7f6bdc96d73..2c1e67058fd3 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -43,11 +43,6 @@ large discards are issued, setting this value lower will make Linux issue
 smaller discards and potentially help reduce latencies induced by large
 discard operations.
 
-discard_zeroes_data (RO)
-------------------------
-When read, this file will show if the discarded block are zeroed by the
-device or not. If its value is '1' the blocks are zeroed otherwise not.
-
 hw_sector_size (RO)
 -------------------
 This is the hardware sector size of the device, in bytes.
diff --git a/block/blk-lib.c b/block/blk-lib.c
index b0c6c4bcf441..e8caecd71688 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -37,17 +37,12 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		return -ENXIO;
 
 	if (flags & BLKDEV_DISCARD_SECURE) {
-		if (flags & BLKDEV_DISCARD_ZERO)
-			return -EOPNOTSUPP;
 		if (!blk_queue_secure_erase(q))
 			return -EOPNOTSUPP;
 		op = REQ_OP_SECURE_ERASE;
 	} else {
 		if (!blk_queue_discard(q))
 			return -EOPNOTSUPP;
-		if ((flags & BLKDEV_DISCARD_ZERO) &&
-		    !q->limits.discard_zeroes_data)
-			return -EOPNOTSUPP;
 		op = REQ_OP_DISCARD;
 	}
 
@@ -126,7 +121,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 			&bio);
 	if (!ret && bio) {
 		ret = submit_bio_wait(bio);
-		if (ret == -EOPNOTSUPP && !(flags & BLKDEV_DISCARD_ZERO))
+		if (ret == -EOPNOTSUPP)
 			ret = 0;
 		bio_put(bio);
 	}
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 1e7174ffc9d4..4fa81ed383ca 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -103,7 +103,6 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->discard_granularity = 0;
 	lim->discard_alignment = 0;
 	lim->discard_misaligned = 0;
-	lim->discard_zeroes_data = 0;
 	lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
 	lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT);
 	lim->alignment_offset = 0;
@@ -127,7 +126,6 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	blk_set_default_limits(lim);
 
 	/* Inherit limits from component devices */
-	lim->discard_zeroes_data = 1;
 	lim->max_segments = USHRT_MAX;
 	lim->max_discard_segments = 1;
 	lim->max_hw_sectors = UINT_MAX;
@@ -609,7 +607,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
 
 	t->cluster &= b->cluster;
-	t->discard_zeroes_data &= b->discard_zeroes_data;
 
 	/* Physical block size a multiple of the logical block size? */
 	if (t->physical_block_size & (t->logical_block_size - 1)) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 45854266e398..b65ce3c65ae8 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -208,7 +208,7 @@ static ssize_t queue_discard_max_store(struct request_queue *q,
 
 static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *page)
 {
-	return queue_var_show(queue_discard_zeroes_data(q), page);
+	return queue_var_show(0, page);
 }
 
 static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
index 570021a0dc1c..04325b81c2b4 100644
--- a/block/compat_ioctl.c
+++ b/block/compat_ioctl.c
@@ -685,7 +685,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	case BLKALIGNOFF:
 		return compat_put_int(arg, bdev_alignment_offset(bdev));
 	case BLKDISCARDZEROES:
-		return compat_put_uint(arg, bdev_discard_zeroes_data(bdev));
+		return compat_put_uint(arg, 0);
 	case BLKFLSBUF:
 	case BLKROSET:
 	case BLKDISCARD:
diff --git a/block/ioctl.c b/block/ioctl.c
index 8ea00a41be01..0de02ee67eed 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -547,7 +547,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	case BLKALIGNOFF:
 		return put_int(arg, bdev_alignment_offset(bdev));
 	case BLKDISCARDZEROES:
-		return put_uint(arg, bdev_discard_zeroes_data(bdev));
+		return put_uint(arg, 0);
 	case BLKSECTGET:
 		max_sectors = min_t(unsigned int, USHRT_MAX,
 				    queue_max_sectors(bdev_get_queue(bdev)));
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 8e62d9f65510..84455c365f57 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -931,7 +931,6 @@ void assign_p_sizes_qlim(struct drbd_device *device, struct p_sizes *p, struct r
 		p->qlim->io_min = cpu_to_be32(queue_io_min(q));
 		p->qlim->io_opt = cpu_to_be32(queue_io_opt(q));
 		p->qlim->discard_enabled = blk_queue_discard(q);
-		p->qlim->discard_zeroes_data = queue_discard_zeroes_data(q);
 		p->qlim->write_same_capable = !!q->limits.max_write_same_sectors;
 	} else {
 		q = device->rq_queue;
@@ -941,7 +940,6 @@ void assign_p_sizes_qlim(struct drbd_device *device, struct p_sizes *p, struct r
 		p->qlim->io_min = cpu_to_be32(queue_io_min(q));
 		p->qlim->io_opt = cpu_to_be32(queue_io_opt(q));
 		p->qlim->discard_enabled = 0;
-		p->qlim->discard_zeroes_data = 0;
 		p->qlim->write_same_capable = 0;
 	}
 }
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index e4516d3b971d..02255a0d68b9 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1199,10 +1199,6 @@ static void decide_on_discard_support(struct drbd_device *device,
 	struct drbd_connection *connection = first_peer_device(device)->connection;
 	bool can_do = b ? blk_queue_discard(b) : true;
 
-	if (can_do && b && !b->limits.discard_zeroes_data && !discard_zeroes_if_aligned) {
-		can_do = false;
-		drbd_info(device, "discard_zeroes_data=0 and discard_zeroes_if_aligned=no: disabling discards\n");
-	}
 	if (can_do && connection->cstate >= C_CONNECTED && !(connection->agreed_features & DRBD_FF_TRIM)) {
 		can_do = false;
 		drbd_info(connection, "peer DRBD too old, does not support TRIM: disabling discards\n");
@@ -1484,8 +1480,7 @@ static void sanitize_disk_conf(struct drbd_device *device, struct disk_conf *dis
 	if (disk_conf->al_extents > drbd_al_extents_max(nbc))
 		disk_conf->al_extents = drbd_al_extents_max(nbc);
 
-	if (!blk_queue_discard(q)
-	    || (!q->limits.discard_zeroes_data && !disk_conf->discard_zeroes_if_aligned)) {
+	if (!blk_queue_discard(q)) {
 		if (disk_conf->rs_discard_granularity) {
 			disk_conf->rs_discard_granularity = 0; /* disable feature */
 			drbd_info(device, "rs_discard_granularity feature disabled\n");
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3bb04c1a4ba1..3081d83d2ea3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -828,7 +828,6 @@ static void loop_config_discard(struct loop_device *lo)
 		q->limits.discard_alignment = 0;
 		blk_queue_max_discard_sectors(q, 0);
 		blk_queue_max_write_zeroes_sectors(q, 0);
-		q->limits.discard_zeroes_data = 0;
 		queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
 		return;
 	}
@@ -837,7 +836,6 @@ static void loop_config_discard(struct loop_device *lo)
 	q->limits.discard_alignment = 0;
 	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
 	blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
-	q->limits.discard_zeroes_data = 1;
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
 
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 30076e7753bc..05e3e664ea1b 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -4025,7 +4025,6 @@ static int mtip_block_initialize(struct driver_data *dd)
 		dd->queue->limits.discard_granularity = 4096;
 		blk_queue_max_discard_sectors(dd->queue,
 			MTIP_MAX_TRIM_ENTRY_LEN * MTIP_MAX_TRIM_ENTRIES);
-		dd->queue->limits.discard_zeroes_data = 0;
 	}
 
 	/* Set the capacity of the device in 512 byte sectors. */
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 03ae72985c79..b02f2362fdf7 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1110,7 +1110,6 @@ static int nbd_dev_add(int index)
 	queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, disk->queue);
 	disk->queue->limits.discard_granularity = 512;
 	blk_queue_max_discard_sectors(disk->queue, UINT_MAX);
-	disk->queue->limits.discard_zeroes_data = 0;
 	blk_queue_max_hw_sectors(disk->queue, 65536);
 	disk->queue->limits.max_sectors = 256;
 
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 9c689b34e6e7..975922c8f231 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -2773,7 +2773,6 @@ static int cache_create(struct cache_args *ca, struct cache **result)
 
 	ti->num_discard_bios = 1;
 	ti->discards_supported = true;
-	ti->discard_zeroes_data_unsupported = true;
 	ti->split_discard_bios = false;
 
 	cache->features = ca->features;
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 389a3637ffcc..ef1d836bd81b 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2030,7 +2030,6 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	wake_up_process(cc->write_thread);
 
 	ti->num_flush_bios = 1;
-	ti->discard_zeroes_data_unsupported = true;
 
 	return 0;
 
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index f8564d63982f..468f1380de1d 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -2813,7 +2813,9 @@ static void configure_discard_support(struct raid_set *rs)
 	/* Assume discards not supported until after checks below. */
 	ti->discards_supported = false;
 
-	/* RAID level 4,5,6 require discard_zeroes_data for data integrity! */
+	/*
+	 * XXX: RAID level 4,5,6 require zeroing for safety.
+	 */
 	raid456 = (rs->md.level == 4 || rs->md.level == 5 || rs->md.level == 6);
 
 	for (i = 0; i < rs->raid_disks; i++) {
@@ -2827,8 +2829,6 @@ static void configure_discard_support(struct raid_set *rs)
 			return;
 
 		if (raid456) {
-			if (!q->limits.discard_zeroes_data)
-				return;
 			if (!devices_handle_discard_safely) {
 				DMERR("raid456 discard support disabled due to discard_zeroes_data uncertainty.");
 				DMERR("Set dm-raid.devices_handle_discard_safely=Y to override.");
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index 2ddc2d20e62d..a95cbb80fb34 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -1124,7 +1124,6 @@ static int mirror_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->num_flush_bios = 1;
 	ti->num_discard_bios = 1;
 	ti->per_io_data_size = sizeof(struct dm_raid1_bio_record);
-	ti->discard_zeroes_data_unsupported = true;
 
 	ms->kmirrord_wq = alloc_workqueue("kmirrord", WQ_MEM_RECLAIM, 0);
 	if (!ms->kmirrord_wq) {
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 5cd665c91ead..958275aca008 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1449,22 +1449,6 @@ static bool dm_table_supports_flush(struct dm_table *t, unsigned long flush)
 	return false;
 }
 
-static bool dm_table_discard_zeroes_data(struct dm_table *t)
-{
-	struct dm_target *ti;
-	unsigned i = 0;
-
-	/* Ensure that all targets supports discard_zeroes_data. */
-	while (i < dm_table_get_num_targets(t)) {
-		ti = dm_table_get_target(t, i++);
-
-		if (ti->discard_zeroes_data_unsupported)
-			return false;
-	}
-
-	return true;
-}
-
 static int device_is_nonrot(struct dm_target *ti, struct dm_dev *dev,
 			    sector_t start, sector_t len, void *data)
 {
@@ -1620,9 +1604,6 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	}
 	blk_queue_write_cache(q, wc, fua);
 
-	if (!dm_table_discard_zeroes_data(t))
-		q->limits.discard_zeroes_data = 0;
-
 	/* Ensure that all underlying devices are non-rotational. */
 	if (dm_table_all_devices_attribute(t, device_is_nonrot))
 		queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 2b266a2b5035..a5f1916f621a 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3263,7 +3263,6 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	 * them down to the data device.  The thin device's discard
 	 * processing will cause mappings to be removed from the btree.
 	 */
-	ti->discard_zeroes_data_unsupported = true;
 	if (pf.discard_enabled && pf.discard_passdown) {
 		ti->num_discard_bios = 1;
 
@@ -4119,7 +4118,6 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	ti->per_io_data_size = sizeof(struct dm_thin_endio_hook);
 
 	/* In case the pool supports discards, pass them on. */
-	ti->discard_zeroes_data_unsupported = true;
 	if (tc->pool->pf.discard_enabled) {
 		ti->discards_supported = true;
 		ti->num_discard_bios = 1;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8cf1f86dcd05..d6ae8d22d461 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7229,7 +7229,6 @@ static int raid5_run(struct mddev *mddev)
 
 	if (mddev->queue) {
 		int chunk_size;
-		bool discard_supported = true;
 		/* read-ahead size must cover two whole stripes, which
 		 * is 2 * (datadisks) * chunksize where 'n' is the
 		 * number of raid devices
@@ -7265,12 +7264,6 @@ static int raid5_run(struct mddev *mddev)
 		blk_queue_max_discard_sectors(mddev->queue,
 					      0xfffe * STRIPE_SECTORS);
 
-		/*
-		 * unaligned part of discard request will be ignored, so can't
-		 * guarantee discard_zeroes_data
-		 */
-		mddev->queue->limits.discard_zeroes_data = 0;
-
 		blk_queue_max_write_same_sectors(mddev->queue, 0);
 		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
 
@@ -7279,35 +7272,24 @@ static int raid5_run(struct mddev *mddev)
 					  rdev->data_offset << 9);
 			disk_stack_limits(mddev->gendisk, rdev->bdev,
 					  rdev->new_data_offset << 9);
-			/*
-			 * discard_zeroes_data is required, otherwise data
-			 * could be lost. Consider a scenario: discard a stripe
-			 * (the stripe could be inconsistent if
-			 * discard_zeroes_data is 0); write one disk of the
-			 * stripe (the stripe could be inconsistent again
-			 * depending on which disks are used to calculate
-			 * parity); the disk is broken; The stripe data of this
-			 * disk is lost.
-			 */
-			if (!blk_queue_discard(bdev_get_queue(rdev->bdev)) ||
-			    !bdev_get_queue(rdev->bdev)->
-						limits.discard_zeroes_data)
-				discard_supported = false;
-			/* Unfortunately, discard_zeroes_data is not currently
-			 * a guarantee - just a hint.  So we only allow DISCARD
-			 * if the sysadmin has confirmed that only safe devices
-			 * are in use by setting a module parameter.
-			 */
-			if (!devices_handle_discard_safely) {
-				if (discard_supported) {
-					pr_info("md/raid456: discard support disabled due to uncertainty.\n");
-					pr_info("Set raid456.devices_handle_discard_safely=Y to override.\n");
-				}
-				discard_supported = false;
-			}
 		}
 
-		if (discard_supported &&
+		/*
+		 * zeroing is required, otherwise data
+		 * could be lost. Consider a scenario: discard a stripe
+		 * (the stripe could be inconsistent if
+		 * discard_zeroes_data is 0); write one disk of the
+		 * stripe (the stripe could be inconsistent again
+		 * depending on which disks are used to calculate
+		 * parity); the disk is broken; The stripe data of this
+		 * disk is lost.
+		 *
+		 * We only allow DISCARD if the sysadmin has confirmed that
+		 * only safe devices are in use by setting a module parameter.
+		 * A better idea might be to turn DISCARD into WRITE_ZEROES
+		 * requests, as that is required to be safe.
+		 */
+		if (devices_handle_discard_safely &&
 		    mddev->queue->limits.max_discard_sectors >= (stripe >> 9) &&
 		    mddev->queue->limits.discard_granularity >= stripe)
 			queue_flag_set_unlocked(QUEUE_FLAG_DISCARD,
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 001593ed0444..bcb0cb020fd2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -644,8 +644,6 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 	unsigned int logical_block_size = sdkp->device->sector_size;
 	unsigned int max_blocks = 0;
 
-	q->limits.discard_zeroes_data = 0;
-
 	/*
 	 * When LBPRZ is reported, discard alignment and granularity
 	 * must be fixed to the logical block size. Otherwise the block
@@ -681,19 +679,16 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 	case SD_LBP_WS16:
 		max_blocks = min_not_zero(sdkp->max_ws_blocks,
 					  (u32)SD_MAX_WS16_BLOCKS);
-		q->limits.discard_zeroes_data = sdkp->lbprz;
 		break;
 
 	case SD_LBP_WS10:
 		max_blocks = min_not_zero(sdkp->max_ws_blocks,
 					  (u32)SD_MAX_WS10_BLOCKS);
-		q->limits.discard_zeroes_data = sdkp->lbprz;
 		break;
 
 	case SD_LBP_ZERO:
 		max_blocks = min_not_zero(sdkp->max_ws_blocks,
 					  (u32)SD_MAX_WS10_BLOCKS);
-		q->limits.discard_zeroes_data = 1;
 		break;
 	}
 
diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
index c754ae33bf7b..d2f089cfa9ae 100644
--- a/drivers/target/target_core_device.c
+++ b/drivers/target/target_core_device.c
@@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
 	attrib->unmap_granularity = q->limits.discard_granularity / block_size;
 	attrib->unmap_granularity_alignment = q->limits.discard_alignment /
 								block_size;
-	attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
+	attrib->unmap_zeroes_data = 0;
 	return true;
 }
 EXPORT_SYMBOL(target_configure_unmap_from_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a5055d760661..d5d9dd72418a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -339,7 +339,6 @@ struct queue_limits {
 	unsigned char		misaligned;
 	unsigned char		discard_misaligned;
 	unsigned char		cluster;
-	unsigned char		discard_zeroes_data;
 	unsigned char		raid_partial_stripes_expensive;
 	enum blk_zoned_model	zoned;
 };
@@ -1342,7 +1341,6 @@ extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
 
 #define BLKDEV_DISCARD_SECURE	(1 << 0)	/* issue a secure erase */
-#define BLKDEV_DISCARD_ZERO	(1 << 1)	/* must reliably zero data */
 
 extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
@@ -1542,19 +1540,6 @@ static inline int bdev_discard_alignment(struct block_device *bdev)
 	return q->limits.discard_alignment;
 }
 
-static inline unsigned int queue_discard_zeroes_data(struct request_queue *q)
-{
-	if (q->limits.max_discard_sectors && q->limits.discard_zeroes_data == 1)
-		return 1;
-
-	return 0;
-}
-
-static inline unsigned int bdev_discard_zeroes_data(struct block_device *bdev)
-{
-	return queue_discard_zeroes_data(bdev_get_queue(bdev));
-}
-
 static inline unsigned int bdev_write_same(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 3829bee2302a..c7ea33e38fb9 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -296,11 +296,6 @@ struct dm_target {
 	 * on max_io_len boundary.
 	 */
 	bool split_discard_bios:1;
-
-	/*
-	 * Set if this target does not return zeroes on discarded blocks.
-	 */
-	bool discard_zeroes_data_unsupported:1;
 };
 
 /* Each target can link one of these into the table */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 26/27] scsi: sd: Separate zeroout and discard command choices
  2017-04-05 17:20 ` Christoph Hellwig
                   ` (25 preceding siblings ...)
  (?)
@ 2017-04-05 17:21 ` Christoph Hellwig
  2017-04-06  6:17     ` Hannes Reinecke
  2017-04-19 14:56   ` Paolo Bonzini
  -1 siblings, 2 replies; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid

From: "Martin K. Petersen" <martin.petersen@oracle.com>

Now that zeroout and discards are distinct operations we need to
separate the policy of choosing the appropriate command. Create a
zeroing_mode which can be one of:

write:			Zeroout assist not present, use regular WRITE
writesame:		Allow WRITE SAME(10/16) with a zeroed payload
writesame_16_unmap:	Allow WRITE SAME(16) with UNMAP
writesame_10_unmap:	Allow WRITE SAME(10) with UNMAP

The last two are conditional on the device being thin provisioned with
LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively.

Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And
if none of the _unmap variants are supported, regular WRITE SAME will be
used if the device supports it.

The zeroing_mode is exported in sysfs and the detected mode for a given
device can be overridden using the string constants above.

With this change in place we can now issue WRITE SAME(16) with UNMAP set
for block zeroing applications that require hard guarantees and
logical_block_size granularity. And at the same time use the UNMAP
command with the device's preferred granularity and alignment for discard
operations.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 drivers/scsi/sd.h |  8 ++++++++
 2 files changed, 61 insertions(+), 3 deletions(-)
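
As an illustration only (not part of the patch), the policy described in the
commit message can be sketched in user-space C. The detection order restates
the hunk added to sd_config_write_same() below; the struct and function names
are hypothetical stand-ins for the scsi_disk fields and the request flag.

```c
#include <stdbool.h>

/* Mirrors the SD_ZERO_* values this patch adds to sd.h. */
enum zero_mode {
	ZERO_WRITE = 0,		/* plain WRITE with a zeroed buffer */
	ZERO_WS,		/* WRITE SAME, no UNMAP bit */
	ZERO_WS16_UNMAP,	/* WRITE SAME(16) with UNMAP */
	ZERO_WS10_UNMAP,	/* WRITE SAME(10) with UNMAP */
};

/* Hypothetical stand-in for the scsi_disk fields the policy reads. */
struct zero_caps {
	bool lbprz;
	bool lbpws;
	bool lbpws10;
	unsigned int max_ws_blocks;
};

/* Same order of checks as the hunk in sd_config_write_same(). */
enum zero_mode pick_zero_mode(const struct zero_caps *c)
{
	if (c->lbprz && c->lbpws)
		return ZERO_WS16_UNMAP;
	if (c->lbprz && c->lbpws10)
		return ZERO_WS10_UNMAP;
	if (c->max_ws_blocks)
		return ZERO_WS;
	return ZERO_WRITE;
}

/* The UNMAP bit is only set for an _unmap mode and when NOUNMAP is clear. */
bool zero_sets_unmap_bit(enum zero_mode mode, bool req_nounmap)
{
	if (req_nounmap)
		return false;
	return mode == ZERO_WS16_UNMAP || mode == ZERO_WS10_UNMAP;
}
```

Under these assumptions, a device reporting LBPRZ=1 and LBPWS=1 ends up in the
WRITE SAME(16) with UNMAP mode, but a request carrying REQ_NOUNMAP still goes
out without the UNMAP bit.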

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bcb0cb020fd2..acf9d17b05d8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -418,6 +418,46 @@ provisioning_mode_store(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RW(provisioning_mode);
 
+static const char *zeroing_mode[] = {
+	[SD_ZERO_WRITE]		= "write",
+	[SD_ZERO_WS]		= "writesame",
+	[SD_ZERO_WS16_UNMAP]	= "writesame_16_unmap",
+	[SD_ZERO_WS10_UNMAP]	= "writesame_10_unmap",
+};
+
+static ssize_t
+zeroing_mode_show(struct device *dev, struct device_attribute *attr,
+		  char *buf)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+	return snprintf(buf, 20, "%s\n", zeroing_mode[sdkp->zeroing_mode]);
+}
+
+static ssize_t
+zeroing_mode_store(struct device *dev, struct device_attribute *attr,
+		   const char *buf, size_t count)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (!strncmp(buf, zeroing_mode[SD_ZERO_WRITE], 20))
+		sdkp->zeroing_mode = SD_ZERO_WRITE;
+	else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS], 20))
+		sdkp->zeroing_mode = SD_ZERO_WS;
+	else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS16_UNMAP], 20))
+		sdkp->zeroing_mode = SD_ZERO_WS16_UNMAP;
+	else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS10_UNMAP], 20))
+		sdkp->zeroing_mode = SD_ZERO_WS10_UNMAP;
+	else
+		return -EINVAL;
+
+	return count;
+}
+static DEVICE_ATTR_RW(zeroing_mode);
+
 static ssize_t
 max_medium_access_timeouts_show(struct device *dev,
 				struct device_attribute *attr, char *buf)
@@ -496,6 +536,7 @@ static struct attribute *sd_disk_attrs[] = {
 	&dev_attr_app_tag_own.attr,
 	&dev_attr_thin_provisioning.attr,
 	&dev_attr_provisioning_mode.attr,
+	&dev_attr_zeroing_mode.attr,
 	&dev_attr_max_write_same_blocks.attr,
 	&dev_attr_max_medium_access_timeouts.attr,
 	NULL,
@@ -799,10 +840,10 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd)
 	u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
 
 	if (!(rq->cmd_flags & REQ_NOUNMAP)) {
-		switch (sdkp->provisioning_mode) {
-		case SD_LBP_WS16:
+		switch (sdkp->zeroing_mode) {
+		case SD_ZERO_WS16_UNMAP:
 			return sd_setup_write_same16_cmnd(cmd, true);
-		case SD_LBP_WS10:
+		case SD_ZERO_WS10_UNMAP:
 			return sd_setup_write_same10_cmnd(cmd, true);
 		}
 	}
@@ -840,6 +881,15 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
 		sdkp->max_ws_blocks = 0;
 	}
 
+	if (sdkp->lbprz && sdkp->lbpws)
+		sdkp->zeroing_mode = SD_ZERO_WS16_UNMAP;
+	else if (sdkp->lbprz && sdkp->lbpws10)
+		sdkp->zeroing_mode = SD_ZERO_WS10_UNMAP;
+	else if (sdkp->max_ws_blocks)
+		sdkp->zeroing_mode = SD_ZERO_WS;
+	else
+		sdkp->zeroing_mode = SD_ZERO_WRITE;
+
 out:
 	blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
 					 (logical_block_size >> 9));
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 4dac35e96a75..a2c4b5c35379 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -59,6 +59,13 @@ enum {
 	SD_LBP_DISABLE,		/* Discard disabled due to failed cmd */
 };
 
+enum {
+	SD_ZERO_WRITE = 0,	/* Use WRITE(10/16) command */
+	SD_ZERO_WS,		/* Use WRITE SAME(10/16) command */
+	SD_ZERO_WS16_UNMAP,	/* Use WRITE SAME(16) with UNMAP */
+	SD_ZERO_WS10_UNMAP,	/* Use WRITE SAME(10) with UNMAP */
+};
+
 struct scsi_disk {
 	struct scsi_driver *driver;	/* always &sd_template */
 	struct scsi_device *device;
@@ -89,6 +96,7 @@ struct scsi_disk {
 	u8		write_prot;
 	u8		protection_type;/* Data Integrity Field */
 	u8		provisioning_mode;
+	u8		zeroing_mode;
 	unsigned	ATO : 1;	/* state of disk ATO bit */
 	unsigned	cache_override : 1; /* temp override of WCE,RCD */
 	unsigned	WCE : 1;	/* state of disk WCE bit */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 27/27] scsi: sd: Remove LBPRZ dependency for discards
  2017-04-05 17:20 ` Christoph Hellwig
                   ` (26 preceding siblings ...)
  (?)
@ 2017-04-05 17:21 ` Christoph Hellwig
  2017-04-06  6:18     ` Hannes Reinecke
  -1 siblings, 1 reply; 91+ messages in thread
From: Christoph Hellwig @ 2017-04-05 17:21 UTC (permalink / raw)
  To: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid

From: "Martin K. Petersen" <martin.petersen@oracle.com>

Separating discards and zeroout operations allows us to remove the LBPRZ
block zeroing constraints from discards and honor the device preferences
for UNMAP commands.

If supported by the device, we'll also choose UNMAP over one of the
WRITE SAME variants for discards.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)
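
As an illustration only (not part of the patch), the arithmetic that now
applies unconditionally in sd_config_discard() can be sketched in user-space
C. The struct and function names are hypothetical; the two formulas are the
ones in the added lines of the diff below.

```c
/* Hypothetical container for the two queue limits the patch sets. */
struct discard_limits {
	unsigned int alignment;		/* bytes */
	unsigned int granularity;	/* bytes */
};

/*
 * unmap_alignment and unmap_granularity are in logical blocks, as
 * reported by the device; the block sizes are in bytes.
 */
struct discard_limits compute_discard_limits(unsigned int logical_block_size,
					     unsigned int physical_block_size,
					     unsigned int unmap_alignment,
					     unsigned int unmap_granularity)
{
	struct discard_limits l;
	unsigned int gran = unmap_granularity * logical_block_size;

	l.alignment = unmap_alignment * logical_block_size;
	/* max(physical_block_size, unmap_granularity * logical_block_size) */
	l.granularity = physical_block_size > gran ? physical_block_size : gran;
	return l;
}
```

For example, with 512-byte logical blocks, 4096-byte physical blocks and an
unmap granularity of 2048 blocks, the discard granularity comes out at 1 MiB;
a device reporting an unmap granularity of 0 falls back to the physical block
size.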

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index acf9d17b05d8..8cf34a8e3eea 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -685,24 +685,11 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 	unsigned int logical_block_size = sdkp->device->sector_size;
 	unsigned int max_blocks = 0;
 
-	/*
-	 * When LBPRZ is reported, discard alignment and granularity
-	 * must be fixed to the logical block size. Otherwise the block
-	 * layer will drop misaligned portions of the request which can
-	 * lead to data corruption. If LBPRZ is not set, we honor the
-	 * device preference.
-	 */
-	if (sdkp->lbprz) {
-		q->limits.discard_alignment = 0;
-		q->limits.discard_granularity = logical_block_size;
-	} else {
-		q->limits.discard_alignment = sdkp->unmap_alignment *
-			logical_block_size;
-		q->limits.discard_granularity =
-			max(sdkp->physical_block_size,
-			    sdkp->unmap_granularity * logical_block_size);
-	}
-
+	q->limits.discard_alignment =
+		sdkp->unmap_alignment * logical_block_size;
+	q->limits.discard_granularity =
+		max(sdkp->physical_block_size,
+		    sdkp->unmap_granularity * logical_block_size);
 	sdkp->provisioning_mode = mode;
 
 	switch (mode) {
@@ -2842,7 +2829,7 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 				sd_config_discard(sdkp, SD_LBP_WS16);
 
 		} else {	/* LBP VPD page tells us what to use */
-			if (sdkp->lbpu && sdkp->max_unmap_blocks && !sdkp->lbprz)
+			if (sdkp->lbpu && sdkp->max_unmap_blocks)
 				sd_config_discard(sdkp, SD_LBP_UNMAP);
 			else if (sdkp->lbpws)
 				sd_config_discard(sdkp, SD_LBP_WS16);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH 26/27] scsi: sd: Separate zeroout and discard command choices
  2017-04-05 17:21 ` [PATCH 26/27] scsi: sd: Separate zeroout and discard command choices Christoph Hellwig
@ 2017-04-06  6:17     ` Hannes Reinecke
  2017-04-19 14:56   ` Paolo Bonzini
  1 sibling, 0 replies; 91+ messages in thread
From: Hannes Reinecke @ 2017-04-06  6:17 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid

On 04/05/2017 07:21 PM, Christoph Hellwig wrote:
> From: "Martin K. Petersen" <martin.petersen@oracle.com>
> 
> Now that zeroout and discards are distinct operations we need to
> separate the policy of choosing the appropriate command. Create a
> zeroing_mode which can be one of:
> 
> write:			Zeroout assist not present, use regular WRITE
> writesame:		Allow WRITE SAME(10/16) with a zeroed payload
> writesame_16_unmap:	Allow WRITE SAME(16) with UNMAP
> writesame_10_unmap:	Allow WRITE SAME(10) with UNMAP
> 
> The last two are conditional on the device being thin provisioned with
> LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively.
> 
> Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And
> if none of the _unmap variants are supported, regular WRITE SAME will be
> used if the device supports it.
> 
> The zeroing_mode is exported in sysfs and the detected mode for a given
> device can be overridden using the string constants above.
> 
> With this change in place we can now issue WRITE SAME(16) with UNMAP set
> for block zeroing applications that require hard guarantees and
> logical_block_size granularity. And at the same time use the UNMAP
> command with the device's preferred granularity and alignment for discard
> operations.
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/sd.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  drivers/scsi/sd.h |  8 ++++++++
>  2 files changed, 61 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 27/27] scsi: sd: Remove LBPRZ dependency for discards
  2017-04-05 17:21 ` [PATCH 27/27] scsi: sd: Remove LBPRZ dependency for discards Christoph Hellwig
@ 2017-04-06  6:18     ` Hannes Reinecke
  0 siblings, 0 replies; 91+ messages in thread
From: Hannes Reinecke @ 2017-04-06  6:18 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid

On 04/05/2017 07:21 PM, Christoph Hellwig wrote:
> From: "Martin K. Petersen" <martin.petersen@oracle.com>
> 
> Separating discards and zeroout operations allows us to remove the LBPRZ
> block zeroing constraints from discards and honor the device preferences
> for UNMAP commands.
> 
> If supported by the device, we'll also choose UNMAP over one of the
> WRITE SAME variants for discards.
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/sd.c | 25 ++++++-------------------
>  1 file changed, 6 insertions(+), 19 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: always use REQ_OP_WRITE_ZEROES for zeroing offload V2
  2017-04-05 17:20 ` Christoph Hellwig
                   ` (27 preceding siblings ...)
  (?)
@ 2017-04-08 17:26 ` Jens Axboe
  -1 siblings, 0 replies; 91+ messages in thread
From: Jens Axboe @ 2017-04-08 17:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg, linux-block, linux-scsi, drbd-dev, dm-devel,
	linux-raid

On Wed, Apr 05 2017, Christoph Hellwig wrote:
> This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> supported by the block layer, and switches existing implementations
> of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> removes incorrect discard_zeroes_data, and also switches WRITE SAME
> based zeroing in SCSI to this new method.
> 
> The series is against the block for-next tree.

Added for 4.12, thanks Christoph.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 26/27] scsi: sd: Separate zeroout and discard command choices
  2017-04-05 17:21 ` [PATCH 26/27] scsi: sd: Separate zeroout and discard command choices Christoph Hellwig
  2017-04-06  6:17     ` Hannes Reinecke
@ 2017-04-19 14:56   ` Paolo Bonzini
       [not found]     ` <58c3d6a6-924e-cc86-1907-a9fd02a39c0e-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 91+ messages in thread
From: Paolo Bonzini @ 2017-04-19 14:56 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid



On 05/04/2017 19:21, Christoph Hellwig wrote:
> +static ssize_t
> +zeroing_mode_store(struct device *dev, struct device_attribute *attr,
> +		   const char *buf, size_t count)
> +{
> +	struct scsi_disk *sdkp = to_scsi_disk(dev);
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EACCES;
> +
> +
> +	if (!strncmp(buf, zeroing_mode[SD_ZERO_WRITE], 20))
> +		sdkp->zeroing_mode = SD_ZERO_WRITE;
> +	else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS], 20))
> +		sdkp->zeroing_mode = SD_ZERO_WS;
> +	else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS16_UNMAP], 20))
> +		sdkp->zeroing_mode = SD_ZERO_WS16_UNMAP;
> +	else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS10_UNMAP], 20))
> +		sdkp->zeroing_mode = SD_ZERO_WS10_UNMAP;

Should this be conditional on lbprz, lbpws, lbpws10 and max_ws_blocks?

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 26/27] scsi: sd: Separate zeroout and discard command choices
  2017-04-19 14:56   ` Paolo Bonzini
@ 2017-04-20  1:34         ` Martin K. Petersen
  0 siblings, 0 replies; 91+ messages in thread
From: Martin K. Petersen @ 2017-04-20  1:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, linux-raid-u79uwXL29TY76Z2rM5mHXA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA,
	shli-DgEjT+Ai2ygdnm+yROfE0A, Christoph Hellwig,
	agk-H+wXaHxf7aLQT0dZR+AlfA, drbd-dev-cunTk1MwBs8qoQakbn7OcQ


Paolo,

> Should this be conditional on lbprz, lbpws, lbpws10 and max_ws_blocks?

It is intentional that things can be overridden from userland for
devices that report the "wrong" thing. We do the same for discard so
people can set up udev rules.
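
As an illustration only, such an override could also be applied from a small
user-space C helper instead of a udev rule. The helper name is hypothetical,
and the attribute location is an assumption (the other sd_disk_attrs entries
appear under /sys/class/scsi_disk/<H:C:T:L>/). The string is written without
a trailing newline because the strncmp() comparison in the patch, as posted,
would not match one.

```c
#include <stdio.h>
#include <string.h>

/* The four strings accepted by zeroing_mode_store() in this patch. */
static const char *const zeroing_modes[] = {
	"write", "writesame", "writesame_16_unmap", "writesame_10_unmap",
};

/*
 * Hypothetical helper: write one of the mode strings to the given
 * attribute file. Returns 0 on success, -1 on an unknown mode or on
 * an I/O error.
 */
int set_zeroing_mode(const char *attr_path, const char *mode)
{
	size_t i, n = sizeof(zeroing_modes) / sizeof(zeroing_modes[0]);
	FILE *f;
	int rc;

	/* Reject anything that is not one of the four known names. */
	for (i = 0; i < n; i++)
		if (strcmp(mode, zeroing_modes[i]) == 0)
			break;
	if (i == n)
		return -1;

	f = fopen(attr_path, "w");
	if (!f)
		return -1;

	/* Write the bare name, with no newline appended. */
	rc = (fputs(mode, f) == EOF) ? -1 : 0;
	if (fclose(f) != 0)
		rc = -1;
	return rc;
}
```

Writing the real attribute needs CAP_SYS_ADMIN, as checked at the top of
zeroing_mode_store().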

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-04-05 17:21 ` [PATCH 25/27] block: remove the discard_zeroes_data flag Christoph Hellwig
@ 2017-05-01 20:45     ` Bart Van Assche
  0 siblings, 0 replies; 91+ messages in thread
From: Bart Van Assche @ 2017-05-01 20:45 UTC (permalink / raw)
  To: agk, lars.ellenberg, snitzer, hch, martin.petersen,
	philipp.reisner, axboe, shli
  Cc: linux-scsi, dm-devel, drbd-dev, linux-block, target-devel, linux-raid

On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote:
> Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
> kill this hack.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> [ ... ]
> diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
> index c754ae33bf7b..d2f089cfa9ae 100644
> --- a/drivers/target/target_core_device.c
> +++ b/drivers/target/target_core_device.c
> @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
>  	attrib->unmap_granularity = q->limits.discard_granularity / block_size;
>  	attrib->unmap_granularity_alignment = q->limits.discard_alignment /
>  								block_size;
> -	attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
> +	attrib->unmap_zeroes_data = 0;
>  	return true;
>  }
>  EXPORT_SYMBOL(target_configure_unmap_from_queue);

Hello Christoph,

Sorry that I hadn't noticed this before but I think that this patch
introduces a significant performance regression for LIO users. Before
this patch the LBPRZ flag was reported correctly to initiator systems
through the thin provisioning VPD. With this patch applied that flag
will always be reported as zero, forcing initiators to submit WRITE
commands with zeroed data buffers instead of submitting the SCSI UNMAP
command to block devices for which discard_zeroes_data was set. From
target_core_spc.c:

/* Thin Provisioning VPD */
static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char *buf)
{
	[ ... ]
	/*
	 * The unmap_zeroes_data set means that the underlying device supports
	 * REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies
	 * the SBC requirements for LBPRZ, meaning that a subsequent read
	 * will return zeroes after an UNMAP or WRITE SAME (16) to an LBA
	 * See sbc4r36 6.6.4.
	 */
	if (((dev->dev_attrib.emulate_tpu != 0) ||
	     (dev->dev_attrib.emulate_tpws != 0)) &&
	     (dev->dev_attrib.unmap_zeroes_data != 0))
		buf[5] |= 0x04;
	[ ... ]
}
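
As an illustration only, the condition in the excerpt above can be isolated
in user-space C to show why hard-wiring unmap_zeroes_data to 0 keeps the bit
clear. The function and macro names are hypothetical; the bit value and byte
offset are taken from the excerpt.

```c
#include <stdbool.h>
#include <stdint.h>

#define VPD_B2_LBPRZ 0x04	/* bit OR-ed into buf[5] in the excerpt */

/*
 * Set the LBPRZ bit only when UNMAP or WRITE SAME emulation is enabled
 * and unmapped blocks are known to read back as zeroes.
 */
void fill_lbprz(uint8_t *buf, bool emulate_tpu, bool emulate_tpws,
		bool unmap_zeroes_data)
{
	if ((emulate_tpu || emulate_tpws) && unmap_zeroes_data)
		buf[5] |= VPD_B2_LBPRZ;
}
```

With unmap_zeroes_data fixed at false, no combination of emulate_tpu and
emulate_tpws sets the bit, which is the behavior reported here.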

Please advise how to proceed.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-01 20:45     ` Bart Van Assche
@ 2017-05-02  6:43         ` Nicholas A. Bellinger
  -1 siblings, 0 replies; 91+ messages in thread
From: Nicholas A. Bellinger @ 2017-05-02  6:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, linux-raid-u79uwXL29TY76Z2rM5mHXA,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA,
	shli-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	agk-H+wXaHxf7aLQT0dZR+AlfA, drbd-dev-cunTk1MwBs8qoQakbn7OcQ

On Mon, 2017-05-01 at 20:45 +0000, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote:
> > Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
> > kill this hack.
> > 
> > Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> > Reviewed-by: Martin K. Petersen <martin.petersen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> > Reviewed-by: Hannes Reinecke <hare-IBi9RG/b67k@public.gmane.org>
> > [ ... ]
> > diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
> > index c754ae33bf7b..d2f089cfa9ae 100644
> > --- a/drivers/target/target_core_device.c
> > +++ b/drivers/target/target_core_device.c
> > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
> >  	attrib->unmap_granularity = q->limits.discard_granularity / block_size;
> >  	attrib->unmap_granularity_alignment = q->limits.discard_alignment /
> >  								block_size;
> > -	attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
> > +	attrib->unmap_zeroes_data = 0;
> >  	return true;
> >  }
> >  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> 
> Hello Christoph,
> 
> Sorry that I hadn't noticed this before but I think that this patch
> introduces a significant performance regression for LIO users. Before
> this patch the LBPRZ flag was reported correctly to initiator systems
> through the thin provisioning VPD. With this patch applied that flag
> will always be reported as zero, forcing initiators to submit WRITE
> commands with zeroed data buffers instead of submitting the SCSI UNMAP
> command to block devices for which discard_zeroes_data was set. From
> target_core_spc.c:
> 
> /* Thin Provisioning VPD */
> static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char *buf)
> {
> 	[ ... ]
> 	/*
> 	 * The unmap_zeroes_data set means that the underlying device supports
> 	 * REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies
> 	 * the SBC requirements for LBPRZ, meaning that a subsequent read
> 	 * will return zeroes after an UNMAP or WRITE SAME (16) to an LBA
> 	 * See sbc4r36 6.6.4.
> 	 */
> 	if (((dev->dev_attrib.emulate_tpu != 0) ||
> 	     (dev->dev_attrib.emulate_tpws != 0)) &&
> 	     (dev->dev_attrib.unmap_zeroes_data != 0))
> 		buf[5] |= 0x04;
> 	[ ... ]
> }
> 

According to sd_config_discard(), it's SD_LBP_WS16, SD_LBP_WS10 and
SD_LBP_ZERO that were ever setting unmap_zeroes_data = 1 to begin with.
For UNMAP, q->limits.discard_zeroes_data was never set.

That said, it's pretty much implied that supporting DISCARD means
subsequent READs return zeros, so target_configure_unmap_from_queue()
should be setting attrib->unmap_zeroes_data = 1, or just dropping it
altogether.

Post -rc1, I'll push a patch to do the latter.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
@ 2017-05-02  6:43         ` Nicholas A. Bellinger
  0 siblings, 0 replies; 91+ messages in thread
From: Nicholas A. Bellinger @ 2017-05-02  6:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: agk, lars.ellenberg, snitzer, hch, martin.petersen,
	philipp.reisner, axboe, shli, linux-scsi, dm-devel, drbd-dev,
	linux-block, target-devel, linux-raid

On Mon, 2017-05-01 at 20:45 +0000, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote:
> > Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
> > kill this hack.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> > Reviewed-by: Hannes Reinecke <hare@suse.com>
> > [ ... ]
> > diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
> > index c754ae33bf7b..d2f089cfa9ae 100644
> > --- a/drivers/target/target_core_device.c
> > +++ b/drivers/target/target_core_device.c
> > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
> >  	attrib->unmap_granularity = q->limits.discard_granularity / block_size;
> >  	attrib->unmap_granularity_alignment = q->limits.discard_alignment /
> >  								block_size;
> > -	attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
> > +	attrib->unmap_zeroes_data = 0;
> >  	return true;
> >  }
> >  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> 
> Hello Christoph,
> 
> Sorry that I hadn't noticed this before but I think that this patch
> introduces a significant performance regressions for LIO users. Before
> this patch the LBPRZ flag was reported correctly to initiator systems
> through the thin provisioning VPD. With this patch applied that flag
> will always be reported as zero, forcing initiators to submit WRITE
> commands with zeroed data buffers instead of submitting the SCSI UNMAP
> command to block devices for which discard_zeroes_data was set. From
> target_core_spc.c:
> 
> /* Thin Provisioning VPD */
> static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char *buf)
> {
> 	[ ... ]
> 	/*
> 	 * The unmap_zeroes_data set means that the underlying device supports
> 	 * REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies
> 	 * the SBC requirements for LBPRZ, meaning that a subsequent read
> 	 * will return zeroes after an UNMAP or WRITE SAME (16) to an LBA
> 	 * See sbc4r36 6.6.4.
> 	 */
> 	if (((dev->dev_attrib.emulate_tpu != 0) ||
> 	     (dev->dev_attrib.emulate_tpws != 0)) &&
> 	     (dev->dev_attrib.unmap_zeroes_data != 0))
> 		buf[5] |= 0x04;
> 	[ ... ]
> }
> 

According to sd_config_discard(), it's SD_LBP_WS16, SD_LBP_WS10 and
SD_LBP_ZERO that where ever setting unmap_zeros_data = 1 to begin with.
For UNMAP, q->limits.discard_zeroes_data was never set.

That said, it's pretty much implied that supporting DISCARD means
subsequent READs return zeros, so target_configure_unmap_from_queue()
should be setting attrib->unmap_zeroes_data = 1, or just dropping it
altogether.

Post -rc1, I'll push a patch to do the latter.
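
A minimal sketch of the LBPRZ check quoted above, reduced to a standalone
function so the effect of hard-coding unmap_zeroes_data = 0 is easy to see.
The names attrib_model and vpd_b2_lbprz_bits are made up for illustration and
are not kernel symbols; only the condition and the 0x04 bit in buf[5] are
taken from the quoted spc_emulate_evpd_b2() excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the three se_dev_attrib fields quoted above. */
struct attrib_model {
	int emulate_tpu;        /* UNMAP emulation enabled by the user */
	int emulate_tpws;       /* WRITE SAME with UNMAP emulation enabled */
	int unmap_zeroes_data;  /* backend promises zeroes after an unmap */
};

#define VPD_B2_LBPRZ 0x04       /* bit OR-ed into buf[5] by the quoted code */

/* Mirrors the quoted condition and returns the bits to OR into buf[5]. */
static uint8_t vpd_b2_lbprz_bits(const struct attrib_model *a)
{
	assert(a != NULL);
	if ((a->emulate_tpu != 0 || a->emulate_tpws != 0) &&
	    a->unmap_zeroes_data != 0)
		return VPD_B2_LBPRZ;
	return 0;
}
```

With unmap_zeroes_data forced to 0 this function can never return 0x04,
whatever emulate_tpu and emulate_tpws are set to, which is the behavior
change Bart describes.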

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-02  6:43         ` Nicholas A. Bellinger
@ 2017-05-02  7:16             ` Nicholas A. Bellinger
  -1 siblings, 0 replies; 91+ messages in thread
From: Nicholas A. Bellinger @ 2017-05-02  7:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, linux-raid-u79uwXL29TY76Z2rM5mHXA,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA,
	shli-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	agk-H+wXaHxf7aLQT0dZR+AlfA, drbd-dev-cunTk1MwBs8qoQakbn7OcQ

On Mon, 2017-05-01 at 23:43 -0700, Nicholas A. Bellinger wrote:
> On Mon, 2017-05-01 at 20:45 +0000, Bart Van Assche wrote:
> > On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote:
> > > Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
> > > kill this hack.
> > > 
> > > Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> > > Reviewed-by: Martin K. Petersen <martin.petersen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> > > Reviewed-by: Hannes Reinecke <hare-IBi9RG/b67k@public.gmane.org>
> > > [ ... ]
> > > diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
> > > index c754ae33bf7b..d2f089cfa9ae 100644
> > > --- a/drivers/target/target_core_device.c
> > > +++ b/drivers/target/target_core_device.c
> > > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
> > >  	attrib->unmap_granularity = q->limits.discard_granularity / block_size;
> > >  	attrib->unmap_granularity_alignment = q->limits.discard_alignment /
> > >  								block_size;
> > > -	attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
> > > +	attrib->unmap_zeroes_data = 0;
> > >  	return true;
> > >  }
> > >  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> > 
> > Hello Christoph,
> > 
> > Sorry that I hadn't noticed this before but I think that this patch
> > > introduces a significant performance regression for LIO users. Before
> > this patch the LBPRZ flag was reported correctly to initiator systems
> > through the thin provisioning VPD. With this patch applied that flag
> > will always be reported as zero, forcing initiators to submit WRITE
> > commands with zeroed data buffers instead of submitting the SCSI UNMAP
> > command to block devices for which discard_zeroes_data was set. From
> > target_core_spc.c:
> > 
> > /* Thin Provisioning VPD */
> > static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char *buf)
> > {
> > 	[ ... ]
> > 	/*
> > 	 * The unmap_zeroes_data set means that the underlying device supports
> > 	 * REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies
> > 	 * the SBC requirements for LBPRZ, meaning that a subsequent read
> > 	 * will return zeroes after an UNMAP or WRITE SAME (16) to an LBA
> > 	 * See sbc4r36 6.6.4.
> > 	 */
> > 	if (((dev->dev_attrib.emulate_tpu != 0) ||
> > 	     (dev->dev_attrib.emulate_tpws != 0)) &&
> > 	     (dev->dev_attrib.unmap_zeroes_data != 0))
> > 		buf[5] |= 0x04;
> > 	[ ... ]
> > }
> > 
> 
> According to sd_config_discard(), it's SD_LBP_WS16, SD_LBP_WS10 and
> SD_LBP_ZERO that were ever setting unmap_zeroes_data = 1 to begin with.
> For UNMAP, q->limits.discard_zeroes_data was never set.
> 
> That said, it's pretty much implied that supporting DISCARD means
> subsequent READs return zeros, so target_configure_unmap_from_queue()
> should be setting attrib->unmap_zeroes_data = 1, or just dropping it
> altogether.
> 
> Post -rc1, I'll push a patch to do the latter.
> 

Or, another option is to use bdev_write_zeroes_sectors() to determine when
dev_attrib->unmap_zeroes_data should be set.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-02  7:16             ` Nicholas A. Bellinger
@ 2017-05-02  7:23                 ` hch
  -1 siblings, 0 replies; 91+ messages in thread
From: hch-jcswGhMUV9g @ 2017-05-02  7:23 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, linux-raid-u79uwXL29TY76Z2rM5mHXA,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA,
	shli-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	agk-H+wXaHxf7aLQT0dZR+AlfA, drbd-dev-cunTk1MwBs8qoQakbn7OcQ

On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote:
> Or, another option is to use bdev_write_zeroes_sectors() to determine when
> dev_attrib->unmap_zeroes_data should be set.

Yes, that in combination with your patch to use bdev_write_zeroes_sectors
for zeroing from write same seems like the right fix.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-02  7:23                 ` hch
@ 2017-05-03  3:33                   ` Nicholas A. Bellinger
  -1 siblings, 0 replies; 91+ messages in thread
From: Nicholas A. Bellinger @ 2017-05-03  3:33 UTC (permalink / raw)
  To: hch
  Cc: Bart Van Assche, agk, lars.ellenberg, snitzer, martin.petersen,
	philipp.reisner, axboe, shli, linux-scsi, dm-devel, drbd-dev,
	linux-block, target-devel, linux-raid

On Tue, 2017-05-02 at 09:23 +0200, hch@lst.de wrote:
> On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote:
> > Or, another option is to use bdev_write_zeroes_sectors() to determine when
> > dev_attrib->unmap_zeroes_data should be set.
> 
> Yes, that in combination with your patch to use bdev_write_zeroes_sectors
> for zeroing from write same seems like the right fix.

The larger target/iblock conversion patch looks like post v4.12 material
at this point, so to avoid breakage wrt existing LBPRZ behavior, I'll
plan to push the following patch post -rc1.

diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
index d2f089c..e7caf78 100644
--- a/drivers/target/target_core_device.c
+++ b/drivers/target/target_core_device.c
@@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
        attrib->unmap_granularity = q->limits.discard_granularity / block_size;
        attrib->unmap_granularity_alignment = q->limits.discard_alignment /
                                                                block_size;
-       attrib->unmap_zeroes_data = 0;
+       attrib->unmap_zeroes_data = (q->limits.max_write_zeroes_sectors);
        return true;
 }
 EXPORT_SYMBOL(target_configure_unmap_from_queue);

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-03  3:33                   ` Nicholas A. Bellinger
  (?)
@ 2017-05-03 14:33                   ` Mike Snitzer
  2017-05-05  3:10                       ` Nicholas A. Bellinger
  -1 siblings, 1 reply; 91+ messages in thread
From: Mike Snitzer @ 2017-05-03 14:33 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: hch, Bart Van Assche, agk, lars.ellenberg, martin.petersen,
	philipp.reisner, axboe, shli, linux-scsi, dm-devel, drbd-dev,
	linux-block, target-devel, linux-raid

On Tue, May 02 2017 at 11:33pm -0400,
Nicholas A. Bellinger <nab@linux-iscsi.org> wrote:

> On Tue, 2017-05-02 at 09:23 +0200, hch@lst.de wrote:
> > On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote:
> > > Or, another option is to use bdev_write_zeroes_sectors() to determine when
> > > dev_attrib->unmap_zeroes_data should be set.
> > 
> > Yes, that in combination with your patch to use bdev_write_zeroes_sectors
> > for zeroing from write same seems like the right fix.
> 
> The larger target/iblock conversion patch looks like post v4.12 material
> > at this point, so to avoid breakage wrt existing LBPRZ behavior, I'll
> plan to push the following patch post -rc1.
> 
> diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
> index d2f089c..e7caf78 100644
> --- a/drivers/target/target_core_device.c
> +++ b/drivers/target/target_core_device.c
> @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
>         attrib->unmap_granularity = q->limits.discard_granularity / block_size;
>         attrib->unmap_granularity_alignment = q->limits.discard_alignment /
>                                                                 block_size;
> -       attrib->unmap_zeroes_data = 0;
> +       attrib->unmap_zeroes_data = (q->limits.max_write_zeroes_sectors);
>         return true;
>  }
>  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> 

Completely a nit, but why the extra parentheses?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-03 14:33                   ` Mike Snitzer
@ 2017-05-05  3:10                       ` Nicholas A. Bellinger
  0 siblings, 0 replies; 91+ messages in thread
From: Nicholas A. Bellinger @ 2017-05-05  3:10 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: hch, Bart Van Assche, agk, lars.ellenberg, martin.petersen,
	philipp.reisner, axboe, shli, linux-scsi, dm-devel, drbd-dev,
	linux-block, target-devel, linux-raid

On Wed, 2017-05-03 at 10:33 -0400, Mike Snitzer wrote:
> On Tue, May 02 2017 at 11:33pm -0400,
> Nicholas A. Bellinger <nab@linux-iscsi.org> wrote:
> 
> > On Tue, 2017-05-02 at 09:23 +0200, hch@lst.de wrote:
> > > On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote:
> > > > Or, another option is to use bdev_write_zeroes_sectors() to determine when
> > > > dev_attrib->unmap_zeroes_data should be set.
> > > 
> > > Yes, that in combination with your patch to use bdev_write_zeroes_sectors
> > > for zeroing from write same seems like the right fix.
> > 
> > The larger target/iblock conversion patch looks like post v4.12 material
> > > at this point, so to avoid breakage wrt existing LBPRZ behavior, I'll
> > plan to push the following patch post -rc1.
> > 
> > diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
> > index d2f089c..e7caf78 100644
> > --- a/drivers/target/target_core_device.c
> > +++ b/drivers/target/target_core_device.c
> > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
> >         attrib->unmap_granularity = q->limits.discard_granularity / block_size;
> >         attrib->unmap_granularity_alignment = q->limits.discard_alignment /
> >                                                                 block_size;
> > -       attrib->unmap_zeroes_data = 0;
> > +       attrib->unmap_zeroes_data = (q->limits.max_write_zeroes_sectors);
> >         return true;
> >  }
> >  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> > 
> 
> > Completely a nit, but why the extra parentheses?

dev_attrib->unmap_zeroes_data is only compared as a bool.
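
To illustrate the "compared as a bool" point: assuming the field is a plain
integer that consumers only ever test against zero (as the quoted
spc_emulate_evpd_b2() check does), storing the raw sector limit and storing a
normalized 0/1 value behave the same, and the parentheses change nothing.
All function names below are hypothetical.

```c
#include <assert.h>

/* What the proposed hunk stores: the limit itself (parentheses are a no-op). */
static unsigned int flag_raw(unsigned int max_write_zeroes_sectors)
{
	return (max_write_zeroes_sectors);
}

/* Explicitly normalized alternative: always 0 or 1. */
static unsigned int flag_normalized(unsigned int max_write_zeroes_sectors)
{
	return !!max_write_zeroes_sectors;
}

/* Consumer-side test, in the style of the quoted "!= 0" check. */
static int lbprz_enabled(unsigned int flag)
{
	return flag != 0;
}

/* Both encodings agree for any consumer that only tests against zero. */
static void check_equivalent(unsigned int sectors)
{
	assert(lbprz_enabled(flag_raw(sectors)) ==
	       lbprz_enabled(flag_normalized(sectors)));
}
```

The two encodings only diverge if some consumer compares the field with 1 or
copies it into a narrower type, which is where the explicit !! form would be
the safer choice.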

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-03  3:33                   ` Nicholas A. Bellinger
@ 2017-05-07  9:22                       ` hch
  -1 siblings, 0 replies; 91+ messages in thread
From: hch-jcswGhMUV9g @ 2017-05-07  9:22 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, linux-raid-u79uwXL29TY76Z2rM5mHXA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA,
	shli-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	agk-H+wXaHxf7aLQT0dZR+AlfA, drbd-dev-cunTk1MwBs8qoQakbn7OcQ

On Tue, May 02, 2017 at 08:33:15PM -0700, Nicholas A. Bellinger wrote:
> The larger target/iblock conversion patch looks like post v4.12 material
> at this point, so to avoid breakage wrt existing LBPRZ behavior, I'll
> plan to push the following patch post -rc1.

I don't think this is safe.  If you want to do the above you also
need to ensure ->execute_unmap always zeroes the data.  For actual
files in the file backend we should be all fine, but for the block
device case [1] and iblock we'd need to use blkdev_issue_zeroout
instead of blkdev_issue_discard when unmap_zeroes_data is set.

[1] which btw already seems broken as it doesn't invalidate cached
data when issuing a discard.
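
A rough user-space model of the requirement described above, under the
assumption that a discard is only a hint (modeled here as leaving the data
untouched) while a zeroout guarantees zeroes. The model_* names are invented
for this sketch; in the kernel the two branches would correspond to
blkdev_issue_zeroout() and blkdev_issue_discard().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BLOCKS 8

/* Toy device: one byte stands in for one logical block. */
struct model_dev {
	unsigned char blocks[MODEL_BLOCKS];
};

/* Discard is a hint only; contents afterwards are unspecified (here: unchanged). */
static void model_discard(struct model_dev *d, size_t lba, size_t nr)
{
	(void)d;
	(void)lba;
	(void)nr;
}

/* Zeroout guarantees that subsequent reads of the range return zeroes. */
static void model_zeroout(struct model_dev *d, size_t lba, size_t nr)
{
	assert(lba <= MODEL_BLOCKS && nr <= MODEL_BLOCKS - lba);
	memset(&d->blocks[lba], 0, nr);
}

/* If LBPRZ is advertised, unmap must really zero; otherwise a discard will do. */
static void model_execute_unmap(struct model_dev *d, int unmap_zeroes_data,
				size_t lba, size_t nr)
{
	if (unmap_zeroes_data)
		model_zeroout(d, lba, nr);
	else
		model_discard(d, lba, nr);
}
```

The point of the model is that advertising LBPRZ while still taking the
discard branch leaves stale data readable, which is the unsafe combination.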

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-07  9:22                       ` hch
@ 2017-05-09  6:46                           ` Nicholas A. Bellinger
  -1 siblings, 0 replies; 91+ messages in thread
From: Nicholas A. Bellinger @ 2017-05-09  6:46 UTC (permalink / raw)
  To: hch-jcswGhMUV9g
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, linux-raid-u79uwXL29TY76Z2rM5mHXA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA,
	shli-DgEjT+Ai2ygdnm+yROfE0A, agk-H+wXaHxf7aLQT0dZR+AlfA,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ

On Sun, 2017-05-07 at 11:22 +0200, hch-jcswGhMUV9g@public.gmane.org wrote:
> On Tue, May 02, 2017 at 08:33:15PM -0700, Nicholas A. Bellinger wrote:
> > The larger target/iblock conversion patch looks like post v4.12 material
> > at this point, so to avoid breakage wrt existing LBPRZ behavior, I'll
> > plan to push the following patch post -rc1.
> 
> > I don't think this is safe.  If you want to do the above you also
> need to ensure ->execute_unmap always zeroes the data.  For actual
> files in the file backend we should be all fine, but for the block
> device case [1] and iblock we'd need to use blkdev_issue_zeroout
> instead of blkdev_issue_discard when unmap_zeroes_data is set.
> 
> [1] which btw already seems broken as it doesn't invalidate cached
> data when issuing a discard.

Mmm, for [1] that would appear to be true, but after a deeper look at
existing code I don't think this is the case.

The reason is that the target backend attributes emulate_tpu and
emulate_tpws are strictly user-configurable, and aren't automatically
set based upon the underlying IBLOCK block_device support for either
one.

According to pre-v4.12-rc1 code, q->limits.discard_zeroes_data was only
enabled by drivers/scsi/sd.c:sd_config_discard() for
sdkp->provisioning_mode WRITE_SAME with LBPRZ = 1 or explicit ZERO, and
for NVMe devices that set NVME_QUIRK_DISCARD_ZEROES, i.e. only for
real DISCARD + ZERO support.

In your changes to v4.12-rc1, this logic to signal real DISCARD + zero
support for SCSI and NVMe via q->limits.max_write_zeroes_sectors has not
changed.

So AFAICT, regardless of whether the user sets emulate_tpu or emulate_tpws
for an IBLOCK backend, SCSI host code will have to choose sdkp->zeroing_mode
WRITE_SAME with LBPRZ or explicit ZERO, and NVMe host code will have to
choose a ctrl NVME_QUIRK_DEALLOCATE_ZEROES before
q->limits.max_write_zeroes_sectors != 0 is propagated up to target code,
and LBPRZ = 1 is signaled via READ_CAPACITY_16 and EVPD = 0xb2.

That said, simply propagating up q->limits.max_write_zeroes_sectors as
dev_attrib->unmap_zeroes_data following existing code still looks like
the right thing to do.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-09  6:46                           ` Nicholas A. Bellinger
  (?)
@ 2017-05-10 14:06                           ` hch
       [not found]                             ` <20170510140627.GA23759-jcswGhMUV9g@public.gmane.org>
  -1 siblings, 1 reply; 91+ messages in thread
From: hch @ 2017-05-10 14:06 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: hch, Bart Van Assche, agk, lars.ellenberg, snitzer,
	martin.petersen, philipp.reisner, axboe, shli, linux-scsi,
	dm-devel, drbd-dev, linux-block, target-devel, linux-raid

On Mon, May 08, 2017 at 11:46:14PM -0700, Nicholas A. Bellinger wrote:
> That said, simply propagating up q->limits.max_write_zeroes_sectors as
> dev_attrib->unmap_zeroes_data following existing code still looks like
> the right thing to do.

It is not.  Martin has decoupled write same/zeroes support from discard
support.  Any device will claim to support it initially, and we'll
only clear the flag if a Write Same command fails.

So even if LBPRZ is not set you can trivially get into a situation
where discard is supported through UNMAP, and you'll incorrectly
set LBPRZ and will cause data corruption.
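
A small model of why a nonzero write-zeroes limit cannot stand in for LBPRZ,
assuming (as described above) that the limit starts out optimistic and is
only cleared once a zeroing command fails. The struct and function names are
hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy view of the state involved; not a kernel structure. */
struct model_queue {
	unsigned int max_write_zeroes_sectors; /* optimistic until a command fails */
	int discard_via_unmap;                 /* discards are sent as UNMAP */
	int device_lbprz;                      /* what the device itself reports */
};

/* The inference made by the proposed hunk: nonzero limit means advertise LBPRZ. */
static int proposed_unmap_zeroes_data(const struct model_queue *q)
{
	assert(q != NULL);
	return q->max_write_zeroes_sectors != 0;
}

/* True when LBPRZ would be advertised although UNMAP gives no zeroing guarantee. */
static int advertises_false_lbprz(const struct model_queue *q)
{
	return proposed_unmap_zeroes_data(q) &&
	       q->discard_via_unmap && !q->device_lbprz;
}

/* Model of the fallback: the limit is cleared once a zeroing command fails. */
static void model_write_zeroes_failed(struct model_queue *q)
{
	q->max_write_zeroes_sectors = 0;
}
```

Until the first failure clears the limit, a device without LBPRZ that
discards through UNMAP would be reported to initiators as zeroing on unmap.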

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-10 14:06                           ` hch
@ 2017-05-11  4:50                                 ` Nicholas A. Bellinger
  0 siblings, 0 replies; 91+ messages in thread
From: Nicholas A. Bellinger @ 2017-05-11  4:50 UTC (permalink / raw)
  To: hch-jcswGhMUV9g
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, linux-raid-u79uwXL29TY76Z2rM5mHXA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA,
	shli-DgEjT+Ai2ygdnm+yROfE0A, agk-H+wXaHxf7aLQT0dZR+AlfA,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ

On Wed, 2017-05-10 at 16:06 +0200, hch-jcswGhMUV9g@public.gmane.org wrote:
> On Mon, May 08, 2017 at 11:46:14PM -0700, Nicholas A. Bellinger wrote:
> > That said, simply propagating up q->limits.max_write_zeroes_sectors as
> > dev_attrib->unmap_zeroes_data following existing code still looks like
> > the right thing to do.
> 
> It is not.  Martin has decoupled write same/zeroes support from discard
> support.  Any device will claim to support it initially, and we'll
> only clear the flag if a Write Same command fails.
> 
> So even if LBPRZ is not set you can trivially get into a situation
> where discard is supported through UNMAP, and you'll incorrectly
> set LBPRZ and will cause data corruption.

In that case, there are two choices.

1) Expose a block_device or request_queue bit to signal 'real LBPRZ'
support up to IBLOCK, in order to maintain SCSI target feature
compatibility.

2) Or drop the LBPRZ bit usage for IBLOCK altogether.

Since I happen to support a block driver that has 'real LBPRZ'
support for all discards, I'd prefer the latter so this doesn't have to
be carried out-of-tree.

So what are the options for this post-v4.12?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-11  4:50                                 ` Nicholas A. Bellinger
@ 2017-05-11  6:26                                     ` hch
  -1 siblings, 0 replies; 91+ messages in thread
From: hch @ 2017-05-11  6:26 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: axboe, linux-raid, martin.petersen, snitzer, philipp.reisner,
	linux-block, dm-devel, target-devel, linux-scsi, Bart Van Assche,
	lars.ellenberg, shli, hch, agk, drbd-dev

On Wed, May 10, 2017 at 09:50:35PM -0700, Nicholas A. Bellinger wrote:
> 1) Expose a block_device or request_queue bit to signal 'real LBPRZ'
> support up to IBLOCK, in order to maintain SCSI target feature
> compatibility.

No way.  If you want to zero, use REQ_OP_WRITE_ZEROES.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/27] block: remove the discard_zeroes_data flag
  2017-05-11  6:26                                     ` hch
@ 2017-05-11  6:36                                         ` Nicholas A. Bellinger
  -1 siblings, 0 replies; 91+ messages in thread
From: Nicholas A. Bellinger @ 2017-05-11  6:36 UTC (permalink / raw)
  To: hch
  Cc: axboe, linux-raid, martin.petersen, snitzer, philipp.reisner,
	linux-block, dm-devel, target-devel, linux-scsi, Bart Van Assche,
	lars.ellenberg, shli, agk, drbd-dev

On Thu, 2017-05-11 at 08:26 +0200, hch@lst.de wrote:
> On Wed, May 10, 2017 at 09:50:35PM -0700, Nicholas A. Bellinger wrote:
> > 1) Expose a block_device or request_queue bit to signal 'real LBPRZ'
> > support up to IBLOCK, in order to maintain SCSI target feature
> > compatibility.
> 
> No way.  If you want to zero, use REQ_OP_WRITE_ZEROES.

Yes, I understand that part and it's what the earlier conversion of
IBLOCK to use blkdev_issue_zeroout() already does.

Once the blkdev_issue_zeroout() conversion is in place then LBPRZ can
always be set to one for IBLOCK using blkdev_issue_zeroout().

The point is this is not in -rc1, which as-is breaks LBPRZ compat for a
release.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [Drbd-dev] [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout
  2017-04-05 17:21   ` Christoph Hellwig
  (?)
@ 2018-01-13  0:46   ` Eric Wheeler
  2018-01-15 12:46       ` Lars Ellenberg
  -1 siblings, 1 reply; 91+ messages in thread
From: Eric Wheeler @ 2018-01-13  0:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg, linux-block, linux-raid, dm-devel, linux-scsi,
	drbd-dev

Hello All,

We just noticed that discards to DRBD devices backed by dm-thin devices 
are fully allocating the thin blocks.

This behavior does not exist before 
ee472d83 block: add a flags argument to (__)blkdev_issue_zeroout

The problem exists somewhere between
[working] c20cfc27 block: stop using blkdev_issue_write_same for zeroing
  and
[broken]  45c21793 drbd: implement REQ_OP_WRITE_ZEROES

Note that c20cfc27 works as expected, but 45c21793 discards blocks 
being zeroed on the dm-thin backing device. All commits between those two 
produce the following error:

blkdiscard: /dev/drbd/by-res/test: BLKDISCARD ioctl failed: Input/output error

Also note that issuing a blkdiscard to the backing device directly 
discards as you would expect. This is just a problem when sending discards 
through DRBD.

Is there an easy way to solve this in the short term, even if the ultimate 
fix is more involved?

Thank you for your help!

-Eric

--
Eric Wheeler

On Wed, 5 Apr 2017, Christoph Hellwig wrote:

> drbd always wants its discard wire operations to zero the blocks, so
> use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
> reinventing it poorly.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> ---
>  drivers/block/drbd/drbd_debugfs.c  |   3 --
>  drivers/block/drbd/drbd_int.h      |   6 ---
>  drivers/block/drbd/drbd_receiver.c | 102 ++-----------------------------------
>  drivers/block/drbd/drbd_req.c      |   6 +--
>  4 files changed, 7 insertions(+), 110 deletions(-)
> 
> diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
> index de5c3ee8a790..494837e59f23 100644
> --- a/drivers/block/drbd/drbd_debugfs.c
> +++ b/drivers/block/drbd/drbd_debugfs.c
> @@ -236,9 +236,6 @@ static void seq_print_peer_request_flags(struct seq_file *m, struct drbd_peer_re
>  	seq_print_rq_state_bit(m, f & EE_CALL_AL_COMPLETE_IO, &sep, "in-AL");
>  	seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, &sep, "C");
>  	seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, &sep, "set-in-sync");
> -
> -	if (f & EE_IS_TRIM)
> -		__seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, &sep, "zero-out", "trim");
>  	seq_print_rq_state_bit(m, f & EE_WRITE_SAME, &sep, "write-same");
>  	seq_putc(m, '\n');
>  }
> diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
> index 724d1c50fc52..d5da45bb03a6 100644
> --- a/drivers/block/drbd/drbd_int.h
> +++ b/drivers/block/drbd/drbd_int.h
> @@ -437,9 +437,6 @@ enum {
>  
>  	/* is this a TRIM aka REQ_DISCARD? */
>  	__EE_IS_TRIM,
> -	/* our lower level cannot handle trim,
> -	 * and we want to fall back to zeroout instead */
> -	__EE_IS_TRIM_USE_ZEROOUT,
>  
>  	/* In case a barrier failed,
>  	 * we need to resubmit without the barrier flag. */
> @@ -482,7 +479,6 @@ enum {
>  #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
>  #define EE_MAY_SET_IN_SYNC     (1<<__EE_MAY_SET_IN_SYNC)
>  #define EE_IS_TRIM             (1<<__EE_IS_TRIM)
> -#define EE_IS_TRIM_USE_ZEROOUT (1<<__EE_IS_TRIM_USE_ZEROOUT)
>  #define EE_RESUBMITTED         (1<<__EE_RESUBMITTED)
>  #define EE_WAS_ERROR           (1<<__EE_WAS_ERROR)
>  #define EE_HAS_DIGEST          (1<<__EE_HAS_DIGEST)
> @@ -1561,8 +1557,6 @@ extern void start_resync_timer_fn(unsigned long data);
>  extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);
>  
>  /* drbd_receiver.c */
> -extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
> -		sector_t start, unsigned int nr_sectors, bool discard);
>  extern int drbd_receiver(struct drbd_thread *thi);
>  extern int drbd_ack_receiver(struct drbd_thread *thi);
>  extern void drbd_send_ping_wf(struct work_struct *ws);
> diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
> index dc9a6dcd431c..bc1d296581f9 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -1448,108 +1448,14 @@ void drbd_bump_write_ordering(struct drbd_resource *resource, struct drbd_backin
>  		drbd_info(resource, "Method to ensure write ordering: %s\n", write_ordering_str[resource->write_ordering]);
>  }
>  
> -/*
> - * We *may* ignore the discard-zeroes-data setting, if so configured.
> - *
> - * Assumption is that it "discard_zeroes_data=0" is only because the backend
> - * may ignore partial unaligned discards.
> - *
> - * LVM/DM thin as of at least
> - *   LVM version:     2.02.115(2)-RHEL7 (2015-01-28)
> - *   Library version: 1.02.93-RHEL7 (2015-01-28)
> - *   Driver version:  4.29.0
> - * still behaves this way.
> - *
> - * For unaligned (wrt. alignment and granularity) or too small discards,
> - * we zero-out the initial (and/or) trailing unaligned partial chunks,
> - * but discard all the aligned full chunks.
> - *
> - * At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
> - */
> -int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, unsigned int nr_sectors, bool discard)
> -{
> -	struct block_device *bdev = device->ldev->backing_bdev;
> -	struct request_queue *q = bdev_get_queue(bdev);
> -	sector_t tmp, nr;
> -	unsigned int max_discard_sectors, granularity;
> -	int alignment;
> -	int err = 0;
> -
> -	if (!discard)
> -		goto zero_out;
> -
> -	/* Zero-sector (unknown) and one-sector granularities are the same.  */
> -	granularity = max(q->limits.discard_granularity >> 9, 1U);
> -	alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
> -
> -	max_discard_sectors = min(q->limits.max_discard_sectors, (1U << 22));
> -	max_discard_sectors -= max_discard_sectors % granularity;
> -	if (unlikely(!max_discard_sectors))
> -		goto zero_out;
> -
> -	if (nr_sectors < granularity)
> -		goto zero_out;
> -
> -	tmp = start;
> -	if (sector_div(tmp, granularity) != alignment) {
> -		if (nr_sectors < 2*granularity)
> -			goto zero_out;
> -		/* start + gran - (start + gran - align) % gran */
> -		tmp = start + granularity - alignment;
> -		tmp = start + granularity - sector_div(tmp, granularity);
> -
> -		nr = tmp - start;
> -		err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO,
> -				BLKDEV_ZERO_NOUNMAP);
> -		nr_sectors -= nr;
> -		start = tmp;
> -	}
> -	while (nr_sectors >= granularity) {
> -		nr = min_t(sector_t, nr_sectors, max_discard_sectors);
> -		err |= blkdev_issue_discard(bdev, start, nr, GFP_NOIO,
> -				BLKDEV_ZERO_NOUNMAP);
> -		nr_sectors -= nr;
> -		start += nr;
> -	}
> - zero_out:
> -	if (nr_sectors) {
> -		err |= blkdev_issue_zeroout(bdev, start, nr_sectors, GFP_NOIO,
> -				BLKDEV_ZERO_NOUNMAP);
> -	}
> -	return err != 0;
> -}
> -
> -static bool can_do_reliable_discards(struct drbd_device *device)
> -{
> -	struct request_queue *q = bdev_get_queue(device->ldev->backing_bdev);
> -	struct disk_conf *dc;
> -	bool can_do;
> -
> -	if (!blk_queue_discard(q))
> -		return false;
> -
> -	if (q->limits.discard_zeroes_data)
> -		return true;
> -
> -	rcu_read_lock();
> -	dc = rcu_dereference(device->ldev->disk_conf);
> -	can_do = dc->discard_zeroes_if_aligned;
> -	rcu_read_unlock();
> -	return can_do;
> -}
> -
>  static void drbd_issue_peer_discard(struct drbd_device *device, struct drbd_peer_request *peer_req)
>  {
> -	/* If the backend cannot discard, or does not guarantee
> -	 * read-back zeroes in discarded ranges, we fall back to
> -	 * zero-out.  Unless configuration specifically requested
> -	 * otherwise. */
> -	if (!can_do_reliable_discards(device))
> -		peer_req->flags |= EE_IS_TRIM_USE_ZEROOUT;
> +	struct block_device *bdev = device->ldev->backing_bdev;
>  
> -	if (drbd_issue_discard_or_zero_out(device, peer_req->i.sector,
> -	    peer_req->i.size >> 9, !(peer_req->flags & EE_IS_TRIM_USE_ZEROOUT)))
> +	if (blkdev_issue_zeroout(bdev, peer_req->i.sector, peer_req->i.size >> 9,
> +			GFP_NOIO, 0))
>  		peer_req->flags |= EE_WAS_ERROR;
> +
>  	drbd_endio_write_sec_final(peer_req);
>  }
>  
> diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
> index 652114ae1a8a..6da9ea8c48b6 100644
> --- a/drivers/block/drbd/drbd_req.c
> +++ b/drivers/block/drbd/drbd_req.c
> @@ -1148,10 +1148,10 @@ static int drbd_process_write_request(struct drbd_request *req)
>  
>  static void drbd_process_discard_req(struct drbd_request *req)
>  {
> -	int err = drbd_issue_discard_or_zero_out(req->device,
> -				req->i.sector, req->i.size >> 9, true);
> +	struct block_device *bdev = req->device->ldev->backing_bdev;
>  
> -	if (err)
> +	if (blkdev_issue_zeroout(bdev, req->i.sector, req->i.size >> 9,
> +			GFP_NOIO, 0))
>  		req->private_bio->bi_error = -EIO;
>  	bio_endio(req->private_bio);
>  }
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [Drbd-dev] [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout
  2018-01-13  0:46   ` [Drbd-dev] " Eric Wheeler
@ 2018-01-15 12:46       ` Lars Ellenberg
  0 siblings, 0 replies; 91+ messages in thread
From: Lars Ellenberg @ 2018-01-15 12:46 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: axboe, linux-raid, martin.petersen, snitzer, philipp.reisner,
	linux-block, dm-devel, linux-scsi, shli, Christoph Hellwig, agk,
	drbd-dev

On Sat, Jan 13, 2018 at 12:46:40AM +0000, Eric Wheeler wrote:
> Hello All,
> 
> We just noticed that discards to DRBD devices backed by dm-thin devices 
> are fully allocating the thin blocks.
> 
> This behavior does not exist before 
> ee472d83 block: add a flags argument to (__)blkdev_issue_zeroout
> 
> The problem exists somewhere between
> [working] c20cfc27 block: stop using blkdev_issue_write_same for zeroing
>   and
> [broken]  45c21793 drbd: implement REQ_OP_WRITE_ZEROES
> 
> Note that c20cfc27 works as expected, but 45c21793 discards blocks 
> being zeroed on the dm-thin backing device. All commits between those two 
> produce the following error:
> 
> blkdiscard: /dev/drbd/by-res/test: BLKDISCARD ioctl failed: Input/output error
> 
> Also note that issuing a blkdiscard to the backing device directly 
> discards as you would expect. This is just a problem when sending discards 
> through DRBD.
> 
> Is there an easy way to solve this in the short term, even if the ultimate 
> fix is more involved?

> On Wed, 5 Apr 2017, Christoph Hellwig wrote:
> 

commit 0dbed96a3cc9786bc4814dab98a7218753bde934
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Apr 5 19:21:21 2017 +0200

    drbd: make intelligent use of blkdev_issue_zeroout

> > drbd always wants its discard wire operations to zero the blocks, so
> > use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
> > reinventing it poorly.

> > -/*
> > - * We *may* ignore the discard-zeroes-data setting, if so configured.
> > - *
> > - * Assumption is that it "discard_zeroes_data=0" is only because the backend
> > - * may ignore partial unaligned discards.
> > - *
> > - * LVM/DM thin as of at least
> > - *   LVM version:     2.02.115(2)-RHEL7 (2015-01-28)
> > - *   Library version: 1.02.93-RHEL7 (2015-01-28)
> > - *   Driver version:  4.29.0
> > - * still behaves this way.
> > - *
> > - * For unaligned (wrt. alignment and granularity) or too small discards,
> > - * we zero-out the initial (and/or) trailing unaligned partial chunks,
> > - * but discard all the aligned full chunks.
> > - *
> > - * At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
> > - */
> > -int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, unsigned int nr_sectors, bool discard)


As I understood it,
blkdev_issue_zeroout() was supposed to "always try to unmap"
(deprovision) the relevant region and zero out any unaligned
head or tail, just like my workaround above was doing.

And that device mapper thin was "about to" learn this, "soon",
or maybe block core would do the equivalent of my workaround
described above.

But it then did not.

See also:
https://www.redhat.com/archives/dm-devel/2017-March/msg00213.html
https://www.redhat.com/archives/dm-devel/2017-March/msg00226.html

I then did not follow this closely enough anymore,
and I missed that with a recent enough kernel,
discard on DRBD on dm-thin would fully allocate.

In our out-of-tree module, we had to keep the older code for
compat reasons, anyways. I will just re-enable our zeroout
workaround there again.

In tree, either dm-thin learns to do REQ_OP_WRITE_ZEROES "properly",
so the result in this scenario is what we expect:

  _: unprovisioned, not allocated, returns zero on read anyways
  *: provisioned, some arbitrary data
  0: explicitly zeroed:

  |gran|ular|ity |    |    |    |
  |****|****|____|****|
     to|-be-|zero|ed
  |**00|____|____|00**|

(leave unallocated blocks alone,
 de-allocate full blocks just like with discard,
 explicitly zero unaligned head and tail)

Or DRBD will have to resurrect that reinvented zeroout again,
with exactly those semantics. I did reinvent it for a reason ;)
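
The semantics in the diagram above can be sketched in plain userspace C.
This is only an illustration of the splitting arithmetic, not the DRBD or
block-layer code: struct range, struct zero_plan and plan_zeroout() are
hypothetical names, sector_t is a stand-in typedef, and the size limits
(such as max_discard_sectors) that the removed helper handled are left out.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;   /* userspace stand-in for the kernel type */

/* One contiguous run of sectors; len == 0 means "nothing to do". */
struct range {
	sector_t start;
	sector_t len;
};

/* Hypothetical result type: what to zero explicitly, what to discard. */
struct zero_plan {
	struct range head;     /* unaligned head: write zeroes */
	struct range middle;   /* whole granules: de-allocate */
	struct range tail;     /* unaligned tail: write zeroes */
};

static struct zero_plan plan_zeroout(sector_t start, sector_t nr,
				     sector_t gran, sector_t align)
{
	/* Default: no full granule is covered, so zero the whole range. */
	struct zero_plan p = { { start, nr }, { 0, 0 }, { 0, 0 } };
	sector_t end = start + nr;
	sector_t first, last;

	assert(gran > 0 && align < gran);

	/* First sector >= start that sits on a granule boundary. */
	first = start + (align + gran - start % gran) % gran;
	if (first >= end)
		return p;

	/* Last boundary <= end, counted in whole granules from 'first'. */
	last = first + (end - first) / gran * gran;
	if (last == first)
		return p;

	p.head = (struct range){ start, first - start };
	p.middle = (struct range){ first, last - first };
	p.tail = (struct range){ last, end - last };
	return p;
}
```

With a granularity of 4 and a request covering sectors 2..13, this yields
a 2-sector head to zero, an 8-sector middle to discard, and a 2-sector
tail to zero, which is the |**00|____|____|00**| picture.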

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout
  2018-01-15 12:46       ` Lars Ellenberg
@ 2018-01-15 15:07           ` Mike Snitzer
  -1 siblings, 0 replies; 91+ messages in thread
From: Mike Snitzer @ 2018-01-15 15:07 UTC (permalink / raw)
  To: Eric Wheeler, Christoph Hellwig, axboe, martin.petersen, agk,
	shli, philipp.reisner, linux-block, linux-raid, dm-devel,
	linux-scsi, drbd-dev
  Cc: ejt

On Mon, Jan 15 2018 at  7:46am -0500,
Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
 
> As I understood it,
> blkdev_issue_zeroout() was supposed to "always try to unmap",
> deprovision, the relevant region, and zero-out any unaligned
> head or tail, just like my workaround above was doing.
> 
> And that device mapper thin was "about to" learn this, "soon",
> or maybe block core would do the equivalent of my workaround
> described above.
> 
> But it then did not.
> 
> See also:
> https://www.redhat.com/archives/dm-devel/2017-March/msg00213.html
> https://www.redhat.com/archives/dm-devel/2017-March/msg00226.html

Right, now that you mention it, it is starting to ring a bell (especially
after I read your 2nd dm-devel archive URL above).

> I then did not follow this closely enough anymore,
> and I missed that with a recent enough kernel,
> discard on DRBD on dm-thin would fully allocate.
> 
> In our out-of-tree module, we had to keep the older code for
> compat reasons, anyways. I will just re-enable our zeroout
> workaround there again.
> 
> In tree, either dm-thin learns to do REQ_OP_WRITE_ZEROES "properly",
> so the result in this scenario is what we expect:
> 
>   _: unprovisioned, not allocated, returns zero on read anyways
>   *: provisioned, some arbitrary data
>   0: explicitly zeroed:
> 
>   |gran|ular|ity |    |    |    |
>   |****|****|____|****|
>      to|-be-|zero|ed
>   |**00|____|____|00**|
> 
> (leave unallocated blocks alone,
>  de-allocate full blocks just like with discard,
>  explicitly zero unaligned head and tail)

"de-allocate full blocks just like with discard" is an interesting take
on what it means for dm-thin to handle REQ_OP_WRITE_ZEROES "properly".

> Or DRBD will have to resurrect that reinvented zeroout again,
> with exactly those semantics. I did reinvent it for a reason ;)

Yeah, I now recall dropping that line of development because it
became "hard" (or at least harder than originally thought).

Don't people use REQ_OP_WRITE_ZEROES to initialize a portion of the
disk?  E.g. zeroing superblocks, metadata areas, or whatever?

If we just discarded the logical extent and then a user did a partial
write to the block, areas that a user might expect to be zeroed wouldn't
be (at least in the case of dm-thinp if "skip_block_zeroing" is
enabled).  And yes if discard passdown is enabled and the device's
discard implementation does "discard_zeroes_data" then it'd be
fine.. but there are a lot of things that need to line up for drbd's
REQ_OP_WRITE_ZEROES to "just work" (as it expects).

(now I'm just echoing the kinds of concerns I had in that 2nd dm-devel
post above).
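
A toy userspace model may make the partial-write concern concrete.  It is
a sketch under loud assumptions, not dm-thin code: struct thin_block and
the thin_*() helpers are made-up names, a "block" is four one-character
sectors, and the pool block handed out on re-allocation is assumed to
still hold its old contents.

```c
#include <stdbool.h>
#include <string.h>

#define BLOCK_SECTORS 4

/* Toy thin block: a mapping flag plus the pool block behind it. */
struct thin_block {
	bool mapped;
	char pool[BLOCK_SECTORS];   /* one char per sector */
};

/* Discard only drops the mapping; the pool contents are left alone. */
static void thin_discard(struct thin_block *b)
{
	b->mapped = false;
}

/* Unmapped blocks read back as zeroes. */
static char thin_read(const struct thin_block *b, int sector)
{
	return b->mapped ? b->pool[sector] : '0';
}

/* A write to an unmapped block allocates it, zeroing only if asked to. */
static void thin_write(struct thin_block *b, int sector, char v,
		       bool skip_block_zeroing)
{
	if (!b->mapped) {
		if (!skip_block_zeroing)
			memset(b->pool, '0', BLOCK_SECTORS);
		b->mapped = true;
	}
	b->pool[sector] = v;
}
```

In this model a discarded block reads as zeroes until the first partial
write; with skip_block_zeroing set, the sectors that were not written then
expose whatever the pool block held before, which is not what a caller of
REQ_OP_WRITE_ZEROES expects.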

This post from mkp is interesting:
https://www.redhat.com/archives/dm-devel/2017-March/msg00228.html

Specifically:
"You don't have a way to mark those blocks as being full of zeroes
without actually writing them?

Note that the fallback to a zeroout command is to do a regular write. So
if DM doesn't zero the blocks, the block layer is going to do it."

No, dm-thinp doesn't have an easy way to mark an allocated block as
containing zeroes (without actually zeroing).  I toyed with adding that
but then realized that even if we had it, it'd still require block
zeroing to be enabled.  But block zeroing is done at allocation time.  So
we'd need to interpret the "this block is zeroes" flag to mean "on first
write or read to this block it needs to first zero it".  Fugly to say
the least...

I've been quite busy with other things but I can revisit all this with
Joe Thornber and see what we come up with after a 2nd discussion.

But sadly, in general, this is a low priority for me, so you might do
well to reintroduce your drbd workaround.. sorry about that :(

Mike

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout
@ 2018-01-15 15:07           ` Mike Snitzer
  0 siblings, 0 replies; 91+ messages in thread
From: Mike Snitzer @ 2018-01-15 15:07 UTC (permalink / raw)
  To: Eric Wheeler, Christoph Hellwig, axboe, martin.petersen, agk,
	shli, philipp.reisner, linux-block, linux-raid, dm-devel,
	linux-scsi, drbd-dev
  Cc: ejt

On Mon, Jan 15 2018 at  7:46am -0500,
Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
 
> As I understood it,
> blkdev_issue_zeroout() was supposed to "always try to unmap",
> deprovision, the relevant region, and zero-out any unaligned
> head or tail, just like my work around above was doing.
> 
> And that device mapper thin was "about to" learn this, "soon",
> or maybe block core would do the equivalent of my workaround
> described above.
> 
> But it then did not.
> 
> See also:
> https://www.redhat.com/archives/dm-devel/2017-March/msg00213.html
> https://www.redhat.com/archives/dm-devel/2017-March/msg00226.html

Right, now that you mention it it is starting to ring a bell (especially
after I read your 2nd dm-devel archive url above).

> I then did not follow this closely enough anymore,
> and I missed that with recent enough kernel,
> discard on DRBD on dm-thin would fully allocate.
> 
> In our out-of-tree module, we had to keep the older code for
> compat reasons, anyways. I will just re-enable our zeroout
> workaround there again.
> 
> In tree, either dm-thin learns to do REQ_OP_WRITE_ZEROES "properly",
> so the result in this scenario is what we expect:
> 
>   _: unprovisioned, not allocated, returns zero on read anyways
>   *: provisioned, some arbitrary data
>   0: explicitly zeroed:
> 
>   |gran|ular|ity |    |    |    |
>   |****|****|____|****|
>      to|-be-|zero|ed
>   |**00|____|____|00**|
> 
> (leave unallocated blocks alone,
>  de-allocate full blocks just like with discard,
>  explicitly zero unaligned head and tail)

"de-allocate full blocks just like with discard" is an interesting take
what it means for dm-thin to handle REQ_OP_WRITE_ZEROES "properly".

> Or DRBD will have to resurrect that reinvented zeroout again,
> with exactly those semantics. I did reinvent it for a reason ;)

Yeah, I now recall dropping that line of development because it
became "hard" (or at least harder than originally thought).

Don't people use REQ_OP_WRITE_ZEROES to initialize a portion of the
disk?  E.g. zeroing superblocks, metadata areas, or whatever?

If we just discarded the logical extent and then a user did a partial
write to the block, areas that a user might expect to be zeroed wouldn't
be (at least in the case of dm-thinp if "skip_block_zeroing" is
enabled).  And yes if discard passdown is enabled and the device's
discard implementation does "discard_zeroes_data" then it'd be
fine.. but there are a lot of things that need to line up for drbd's
REQ_OP_WRITE_ZEROES to "just work" (as it expects).

(now I'm just echoing the kinds of concerns I had in that 2nd dm-devel
post above).

This post from mkp is interesting:
https://www.redhat.com/archives/dm-devel/2017-March/msg00228.html

Specifically:
"You don't have a way to mark those blocks as being full of zeroes
without actually writing them?

Note that the fallback to a zeroout command is to do a regular write. So
if DM doesn't zero the blocks, the block layer is going to it."

No, dm-thinp doesn't have an easy way to mark an allocated block as
containing zeroes (without actually zeroing).  I toyed with adding that
but then realized that even if we had it it'd still require block
zeroing be enabled.  But block zeroing is done at allocation time.  So
we'd need to interpret the "this block is zeroes" flag to mean "on first
write or read to this block it needs to first zero it".  Fugly to say
the least...

I've been quite busy with other things but I can revisit all this with
Joe Thornber and see what we come up with after a 2nd discussion.

But sadly, in general, this is a low priority for me, so you might do
well to reintroduce your drbd workaround.. sorry about that :(

Mike


* Re: [Drbd-dev] [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout
  2018-01-15 15:07           ` Mike Snitzer
  (?)
@ 2018-01-16  8:55           ` Lars Ellenberg
  -1 siblings, 0 replies; 91+ messages in thread
From: Lars Ellenberg @ 2018-01-16  8:55 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Eric Wheeler, Christoph Hellwig, axboe, martin.petersen, agk,
	shli, philipp.reisner, linux-block, linux-raid, dm-devel,
	linux-scsi, drbd-dev, ejt

On Mon, Jan 15, 2018 at 10:07:38AM -0500, Mike Snitzer wrote:
> > See also:
> > https://www.redhat.com/archives/dm-devel/2017-March/msg00213.html
> > https://www.redhat.com/archives/dm-devel/2017-March/msg00226.html
> 
> Right, now that you mention it it is starting to ring a bell (especially
> after I read your 2nd dm-devel archive url above).

> > In tree, either dm-thin learns to do REQ_OP_WRITE_ZEROES "properly",
> > so the result in this scenario is what we expect:
> > 
> >   _: unprovisioned, not allocated, returns zero on read anyways
> >   *: provisioned, some arbitrary data
> >   0: explicitly zeroed:
> > 
> >   |gran|ular|ity |    |    |    |
> >   |****|****|____|****|
> >      to|-be-|zero|ed
> >   |**00|____|____|00**|
> > 
> > (leave unallocated blocks alone,
> >  de-allocate full blocks just like with discard,
> >  explicitly zero unaligned head and tail)
> 
> "de-allocate full blocks just like with discard" is an interesting take
> on what it means for dm-thin to handle REQ_OP_WRITE_ZEROES "properly".
> 
> > Or DRBD will have to resurrect that reinvented zeroout again,
> > with exactly those semantics. I did reinvent it for a reason ;)
> 
> Yeah, I now recall dropping that line of development because it
> became "hard" (or at least harder than originally thought).
> 
> Don't people use REQ_OP_WRITE_ZEROES to initialize a portion of the
> disk?  E.g. zeroing superblocks, metadata areas, or whatever?
> 
> If we just discarded the logical extent and then a user did a partial
> write to the block, areas that a user might expect to be zeroed wouldn't
> be (at least in the case of dm-thinp if "skip_block_zeroing" is
> enabled).


Oh-kay.
So "immediately after" such an operation
("zero-out head and tail and de-alloc full blocks")
a read to that area would return all zeros, as expected.
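
To make the intended split concrete, here is a minimal sketch in C of how
a zero-out range could be cut into an explicitly zeroed head, a run of
whole blocks to de-allocate, and an explicitly zeroed tail.  The struct
and function names are hypothetical; they are not DRBD, dm-thin or block
layer APIs.  Offsets, lengths and the granularity share one arbitrary
unit, and overflow is not handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: three sub-ranges of one zero-out request. */
struct zero_split {
	uint64_t head_start, head_len;	/* unaligned head: zero explicitly */
	uint64_t mid_start, mid_len;	/* whole blocks: de-allocate */
	uint64_t tail_start, tail_len;	/* unaligned tail: zero explicitly */
};

/* Split [start, start + len) at multiples of the allocation granularity. */
static struct zero_split split_zero_range(uint64_t start, uint64_t len,
					  uint64_t granularity)
{
	struct zero_split s = {0};
	uint64_t end, first, last;

	assert(granularity > 0);

	end = start + len;
	/* First block boundary at or after start. */
	first = (start + granularity - 1) / granularity * granularity;
	/* Last block boundary at or before end. */
	last = end / granularity * granularity;

	if (first >= last) {
		/* No whole block is covered: everything is zeroed explicitly. */
		s.head_start = start;
		s.head_len = len;
		return s;
	}

	s.head_start = start;
	s.head_len = first - start;
	s.mid_start = first;
	s.mid_len = last - first;
	s.tail_start = last;
	s.tail_len = end - last;
	return s;
}
```

With the numbers from the diagram (granularity 4, range starting at
offset 2 with length 12) this gives a head of 2 units at offset 2, 8 units
of whole blocks at offset 4, and a tail of 2 units at offset 12.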

But once you do a partial write of something to one of those
de-allocated blocks (and skip_block_zeroing is enabled,
which it likely is due to "performance"),
"magically" arbitrary old garbage data springs into existence
on the LBAs that just before read as zeros.
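
The effect can be reproduced in a toy model.  The sketch below is C with
a single virtual block and two physical blocks; every name in it is made
up for illustration and none of it is dm-thin code.  The caller picks the
physical block used for provisioning, which stands in for the pool's
allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCK_SIZE	4	/* bytes per block in this toy */
#define TOY_NR_PHYS	2	/* physical blocks in the toy pool */

struct toy_pool {
	char phys[TOY_NR_PHYS][TOY_BLOCK_SIZE];	/* may hold stale data */
	int map;		/* physical block backing the virtual block, -1 if none */
	bool skip_block_zeroing;
};

/* Read the virtual block; an unmapped block reads as zeros. */
static void toy_read(const struct toy_pool *p, char *out)
{
	if (p->map < 0)
		memset(out, 0, TOY_BLOCK_SIZE);
	else
		memcpy(out, p->phys[p->map], TOY_BLOCK_SIZE);
}

/* De-allocate: drop the mapping, leave the physical data untouched. */
static void toy_unmap(struct toy_pool *p)
{
	p->map = -1;
}

/* Partial write; provisions phys_block if the virtual block is unmapped. */
static void toy_write(struct toy_pool *p, int phys_block, int off,
		      const char *src, int len)
{
	assert(phys_block >= 0 && phys_block < TOY_NR_PHYS);
	assert(off >= 0 && len >= 0 && off + len <= TOY_BLOCK_SIZE);

	if (p->map < 0) {
		p->map = phys_block;
		/* Zero-on-allocation happens only if it is not skipped. */
		if (!p->skip_block_zeroing)
			memset(p->phys[p->map], 0, TOY_BLOCK_SIZE);
	}
	memcpy(p->phys[p->map] + off, src, len);
}
```

If phys[1] still holds "OLD!" from some earlier user and
skip_block_zeroing is set, then toy_unmap() followed by a one-byte write
of "X" at offset 0 into physical block 1 makes the block read back as
"XLD!", although it read as all zeros right after the unmap.  With
zeroing left enabled the same sequence reads back as "X" followed by
three zero bytes.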

Would that not break a lot of other things
(any read-modify-write of "upper layers")?
Would that not even be a serious "information leak"
(old garbage of other completely unrelated LVs leaking into this one)?

But thank you for that, I start to see the problem ;-)

> No, dm-thinp doesn't have an easy way to mark an allocated block as
> containing zeroes (without actually zeroing).  I toyed with adding that
> but then realized that even if we had it it'd still require block
> zeroing be enabled.  But block zeroing is done at allocation time.  So
> we'd need to interpret the "this block is zeroes" flag to mean "on first
> write or read to this block it needs to first zero it".  Fugly to say
> the least...


Maybe have a "known zeroed block" pool, allocate only from there,
and "lazy zero" unallocated blocks, add to the known-zero pool?
Fallback to zero-on-alloc if that known-zero-pool is depleted.

Easier said than done, I know.
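
A rough sketch of that idea in C, assuming a free block is either "dirty"
or "known zero" and that some background worker calls the lazy-zero step.
All names are hypothetical, and the sketch ignores metadata, locking and
crash consistency, which is presumably where the real difficulty lies.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ZP_BLOCK_SIZE	4	/* bytes per block in this toy */
#define ZP_NR_BLOCKS	4	/* blocks in the toy pool */

enum zp_state { ZP_FREE_DIRTY, ZP_FREE_ZEROED, ZP_ALLOCATED };

struct zpool {
	char data[ZP_NR_BLOCKS][ZP_BLOCK_SIZE];
	enum zp_state state[ZP_NR_BLOCKS];
};

/*
 * Background step: zero one dirty free block and move it to the
 * known-zero set.  Returns true if a block was zeroed.
 */
static bool zpool_lazy_zero_one(struct zpool *p)
{
	for (int i = 0; i < ZP_NR_BLOCKS; i++) {
		if (p->state[i] == ZP_FREE_DIRTY) {
			memset(p->data[i], 0, ZP_BLOCK_SIZE);
			p->state[i] = ZP_FREE_ZEROED;
			return true;
		}
	}
	return false;
}

/*
 * Allocate a block.  Prefer the known-zero set; if it is depleted, fall
 * back to zero-on-alloc of a dirty block.  Returns the block index, or
 * -1 if no block is free.
 */
static int zpool_alloc(struct zpool *p)
{
	for (int i = 0; i < ZP_NR_BLOCKS; i++) {
		if (p->state[i] == ZP_FREE_ZEROED) {
			p->state[i] = ZP_ALLOCATED;
			return i;
		}
	}
	for (int i = 0; i < ZP_NR_BLOCKS; i++) {
		if (p->state[i] == ZP_FREE_DIRTY) {
			memset(p->data[i], 0, ZP_BLOCK_SIZE);
			p->state[i] = ZP_ALLOCATED;
			return i;
		}
	}
	return -1;
}

/* Free a block; its old contents stay until it is zeroed again. */
static void zpool_free(struct zpool *p, int i)
{
	assert(i >= 0 && i < ZP_NR_BLOCKS);
	assert(p->state[i] == ZP_ALLOCATED);
	p->state[i] = ZP_FREE_DIRTY;
}
```

In this model an allocation never hands out a block with old contents:
the common case takes a pre-zeroed block without paying for the zeroing
in the write path, and only a depleted known-zero set pays the
zero-on-alloc cost.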

> But sadly, in general, this is a low priority for me, so you might do
> well to reintroduce your drbd workaround.. sorry about that :(

No problem.
I'll put that back in, and document that we strongly recommend
NOT enabling skip_block_zeroing in those setups.

Thanks,

    Lars


end of thread, other threads:[~2018-01-16  8:55 UTC | newest]

Thread overview: 91+ messages
2017-04-05 17:20 always use REQ_OP_WRITE_ZEROES for zeroing offload V2 Christoph Hellwig
2017-04-05 17:20 ` Christoph Hellwig
2017-04-05 17:20 ` [PATCH 01/27] sd: split sd_setup_discard_cmnd Christoph Hellwig
2017-04-05 17:20   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 02/27] block: renumber REQ_OP_WRITE_ZEROES Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 03/27] block: implement splitting of REQ_OP_WRITE_ZEROES bios Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 04/27] sd: implement REQ_OP_WRITE_ZEROES Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 05/27] md: support REQ_OP_WRITE_ZEROES Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 06/27] dm io: discards don't take a payload Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 07/27] dm: support REQ_OP_WRITE_ZEROES Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 08/27] dm kcopyd: switch to use REQ_OP_WRITE_ZEROES Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 09/27] block: stop using blkdev_issue_write_same for zeroing Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 10/27] block: add a flags argument to (__)blkdev_issue_zeroout Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 11/27] block: add a REQ_NOUNMAP flag for REQ_OP_WRITE_ZEROES Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 12/27] block: add a new BLKDEV_ZERO_NOFALLBACK flag Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 13/27] block_dev: use blkdev_issue_zerout for hole punches Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 14/27] sd: implement unmapping Write Zeroes Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 15/27] nvme: implement REQ_OP_WRITE_ZEROES Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 16/27] zram: " Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 17/27] loop: " Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 18/27] brd: remove discard support Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 19/27] rbd: remove the discard_zeroes_data flag Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 20/27] rsxx: " Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 21/27] mmc: " Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 22/27] block: stop using discards for zeroing Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2018-01-13  0:46   ` [Drbd-dev] " Eric Wheeler
2018-01-15 12:46     ` Lars Ellenberg
2018-01-15 12:46       ` Lars Ellenberg
     [not found]       ` <20180115124635.GA4107-w1SgEEioFePxa46PmUWvFg@public.gmane.org>
2018-01-15 15:07         ` Mike Snitzer
2018-01-15 15:07           ` Mike Snitzer
2018-01-16  8:55           ` [Drbd-dev] " Lars Ellenberg
2017-04-05 17:21 ` [PATCH 24/27] drbd: implement REQ_OP_WRITE_ZEROES Christoph Hellwig
2017-04-05 17:21   ` Christoph Hellwig
2017-04-05 17:21 ` [PATCH 25/27] block: remove the discard_zeroes_data flag Christoph Hellwig
2017-05-01 20:45   ` Bart Van Assche
2017-05-01 20:45     ` Bart Van Assche
     [not found]     ` <1493671519.2665.15.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-05-02  6:43       ` Nicholas A. Bellinger
2017-05-02  6:43         ` Nicholas A. Bellinger
     [not found]         ` <1493707425.23202.77.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2017-05-02  7:16           ` Nicholas A. Bellinger
2017-05-02  7:16             ` Nicholas A. Bellinger
     [not found]             ` <1493709373.23202.79.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2017-05-02  7:23               ` hch-jcswGhMUV9g
2017-05-02  7:23                 ` hch
2017-05-03  3:33                 ` Nicholas A. Bellinger
2017-05-03  3:33                   ` Nicholas A. Bellinger
2017-05-03 14:33                   ` Mike Snitzer
2017-05-05  3:10                     ` Nicholas A. Bellinger
2017-05-05  3:10                       ` Nicholas A. Bellinger
     [not found]                   ` <1493782395.23202.84.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2017-05-07  9:22                     ` hch-jcswGhMUV9g
2017-05-07  9:22                       ` hch
     [not found]                       ` <20170507092209.GA27370-jcswGhMUV9g@public.gmane.org>
2017-05-09  6:46                         ` Nicholas A. Bellinger
2017-05-09  6:46                           ` Nicholas A. Bellinger
2017-05-10 14:06                           ` hch
     [not found]                             ` <20170510140627.GA23759-jcswGhMUV9g@public.gmane.org>
2017-05-11  4:50                               ` Nicholas A. Bellinger
2017-05-11  4:50                                 ` Nicholas A. Bellinger
     [not found]                                 ` <1494478235.16894.115.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2017-05-11  6:26                                   ` hch-jcswGhMUV9g
2017-05-11  6:26                                     ` hch
     [not found]                                     ` <20170511062630.GA18517-jcswGhMUV9g@public.gmane.org>
2017-05-11  6:36                                       ` Nicholas A. Bellinger
2017-05-11  6:36                                         ` Nicholas A. Bellinger
2017-04-05 17:21 ` [PATCH 26/27] scsi: sd: Separate zeroout and discard command choices Christoph Hellwig
2017-04-06  6:17   ` Hannes Reinecke
2017-04-06  6:17     ` Hannes Reinecke
2017-04-19 14:56   ` Paolo Bonzini
     [not found]     ` <58c3d6a6-924e-cc86-1907-a9fd02a39c0e-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-04-20  1:34       ` Martin K. Petersen
2017-04-20  1:34         ` Martin K. Petersen
2017-04-05 17:21 ` [PATCH 27/27] scsi: sd: Remove LBPRZ dependency for discards Christoph Hellwig
2017-04-06  6:18   ` Hannes Reinecke
2017-04-06  6:18     ` Hannes Reinecke
2017-04-08 17:26 ` always use REQ_OP_WRITE_ZEROES for zeroing offload V2 Jens Axboe
