Linux-ext4 Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals
@ 2019-12-10 16:56 Kirill Tkhai
  2019-12-10 16:56 ` [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation Kirill Tkhai
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Kirill Tkhai @ 2019-12-10 16:56 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, ktkhai, andrea.parri, hare, tj,
	ajay.joshi, sagi, dsterba, chaitanya.kulkarni, bvanassche,
	dhowells, asml.silence

Information about continuous extent placement may be useful
for some block devices. Say, distributed network filesystems,
which provide block device interface, may use this information
for better blocks placement over the nodes in their cluster,
and for better performance. Block devices, which map a file
on another filesystem (loop), may request the same length extent
on underlining filesystem for less fragmentation and for batching
allocation requests. Also, hypervisors like QEMU may use this
information for optimization of cluster allocations.

This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
to be used for forwarding user's fallocate(0) requests into
block device internals. It rather similar to existing
REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES, etc. The corresponding
exported primitive is called blkdev_issue_assign_range().
See [1/3] for the details.

Patch [2/3] teaches loop driver to handle REQ_OP_ASSIGN_RANGE
requests by calling fallocate(0).

Patch [3/3] makes ext4 to notify a block device about fallocate(0).

Here is a simple test I did:
https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856

I attached a file on ext4 to loop. Then, created ext4 partition
on loop device and started the test in the partition. Direct-io
is enabled on loop.

The test fallocates 4G file and writes from some offset with
given step, then it chooses another offset and repeats. After
the test all the blocks in the file become written.

The results shows that batching extents-assigning requests improves
the performance:

Before patchset: real ~ 1min 27sec
After patchset:  real ~ 1min 16sec (18% better)

Ordinary fallocate() before writes improves the performance
by batching the requests. These results just show, the same
is in case of forwarding extents information to underlining
filesystem.
---

Kirill Tkhai (3):
      block: Add support for REQ_OP_ASSIGN_RANGE operation
      loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
      ext4: Notify block device about fallocate(0)-assigned blocks


 block/blk-core.c          |    4 +++
 block/blk-lib.c           |   70 +++++++++++++++++++++++++++++++++++++++++++++
 block/blk-merge.c         |   21 ++++++++++++++
 block/bounce.c            |    1 +
 drivers/block/loop.c      |    5 +++
 fs/ext4/ext4.h            |    1 +
 fs/ext4/extents.c         |   11 ++++++-
 include/linux/bio.h       |    3 ++
 include/linux/blk_types.h |    2 +
 include/linux/blkdev.h    |   29 +++++++++++++++++++
 10 files changed, 145 insertions(+), 2 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
@ 2019-12-10 16:56 ` Kirill Tkhai
  2019-12-19  3:03   ` Martin K. Petersen
  2019-12-10 16:56 ` [PATCH RFC 2/3] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Kirill Tkhai
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Kirill Tkhai @ 2019-12-10 16:56 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, ktkhai, andrea.parri, hare, tj,
	ajay.joshi, sagi, dsterba, chaitanya.kulkarni, bvanassche,
	dhowells, asml.silence

This operation allows to notify a device about the fact,
that some sectors range was choosen by a filesystem
as a single extent, and the device should try its best
to reflect that (keep the range as a single hunk in its
internals, or represent the range as minimal set of hunks).
Speaking directly, the operation is for forwarding
fallocate(0) requests into an essence, on which the device
is based.

This may be useful for some distributed network filesystems,
providing block device interface, for optimization of their
blocks placement over the cluster nodes.

Also, block devices mapping a file (like loop) are users
of that, since this allows to allocate more continuous
extents and since this batches blocks allocation requests.
In addition, hypervisors like QEMU may use this for better
blocks placement.

The patch adds a new blkdev_issue_assign_range() primitive,
which is rather similar to existing blkdev_issue_{*} api.
Also, a new queue limit.max_assign_range_sectors is added.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 block/blk-core.c          |    4 +++
 block/blk-lib.c           |   70 +++++++++++++++++++++++++++++++++++++++++++++
 block/blk-merge.c         |   21 ++++++++++++++
 block/bounce.c            |    1 +
 include/linux/bio.h       |    3 ++
 include/linux/blk_types.h |    2 +
 include/linux/blkdev.h    |   29 +++++++++++++++++++
 7 files changed, 130 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8d4db6e74496..060cc0ea1246 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -978,6 +978,10 @@ generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_zeroes_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ASSIGN_RANGE:
+		if (!q->limits.max_assign_range_sectors)
+			goto not_supported;
+		break;
 	default:
 		break;
 	}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..fbf780d3ea32 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -252,6 +252,46 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 	return 0;
 }
 
+static int __blkdev_issue_assign_range(struct block_device *bdev,
+		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+		struct bio **biop, unsigned flags)
+{
+	struct bio *bio = *biop;
+	unsigned int max_sectors;
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (!q)
+		return -ENXIO;
+
+	if (bdev_read_only(bdev))
+		return -EPERM;
+
+	max_sectors = bdev_assign_range_sectors(bdev);
+
+	if (max_sectors == 0)
+		return -EOPNOTSUPP;
+
+	while (nr_sects) {
+		bio = blk_next_bio(bio, 0, gfp_mask);
+		bio->bi_iter.bi_sector = sector;
+		bio_set_dev(bio, bdev);
+		bio->bi_opf = REQ_OP_ASSIGN_RANGE;
+
+		if (nr_sects > max_sectors) {
+			bio->bi_iter.bi_size = max_sectors << 9;
+			nr_sects -= max_sectors;
+			sector += max_sectors;
+		} else {
+			bio->bi_iter.bi_size = nr_sects << 9;
+			nr_sects = 0;
+		}
+		cond_resched();
+	}
+
+	*biop = bio;
+	return 0;
+}
+
 /*
  * Convert a number of 512B sectors to a number of pages.
  * The result is limited to a number of pages that can fit into a BIO.
@@ -405,3 +445,33 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return ret;
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+int blkdev_issue_assign_range(struct block_device *bdev, sector_t sector,
+			sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
+{
+	int ret = 0;
+	sector_t bs_mask;
+	struct bio *bio;
+	struct blk_plug plug;
+
+	if (bdev_assign_range_sectors(bdev) == 0)
+		return 0;
+
+	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+	if ((sector | nr_sects) & bs_mask)
+		return -EINVAL;
+
+	bio = NULL;
+	blk_start_plug(&plug);
+
+	ret = __blkdev_issue_assign_range(bdev, sector, nr_sects,
+					  gfp_mask, &bio, flags);
+	if (ret == 0 && bio) {
+		ret = submit_bio_wait(bio);
+		bio_put(bio);
+	}
+	blk_finish_plug(&plug);
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_assign_range);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index d783bdc4559b..b2ae8b5acd72 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -102,6 +102,22 @@ static struct bio *blk_bio_discard_split(struct request_queue *q,
 	return bio_split(bio, split_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_assign_range_split(struct request_queue *q,
+					      struct bio *bio,
+					      struct bio_set *bs,
+					      unsigned *nsegs)
+{
+	*nsegs = 1;
+
+	if (!q->limits.max_assign_range_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_assign_range_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_assign_range_sectors, GFP_NOIO, bs);
+}
+
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
@@ -300,6 +316,10 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 	case REQ_OP_SECURE_ERASE:
 		split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
 		break;
+	case REQ_OP_ASSIGN_RANGE:
+		split = blk_bio_assign_range_split(q, *bio, &q->bio_split,
+				nr_segs);
+		break;
 	case REQ_OP_WRITE_ZEROES:
 		split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
 				nr_segs);
@@ -382,6 +402,7 @@ unsigned int blk_recalc_rq_segments(struct request *rq)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ASSIGN_RANGE:
 		return 0;
 	case REQ_OP_WRITE_SAME:
 		return 1;
diff --git a/block/bounce.c b/block/bounce.c
index f8ed677a1bf7..017bedba7b23 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -255,6 +255,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
+	case REQ_OP_ASSIGN_RANGE:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
 		break;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3cdb84cdc488..cf235c997e45 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,6 +63,7 @@ static inline bool bio_has_data(struct bio *bio)
 	if (bio &&
 	    bio->bi_iter.bi_size &&
 	    bio_op(bio) != REQ_OP_DISCARD &&
+	    bio_op(bio) != REQ_OP_ASSIGN_RANGE &&
 	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
 	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
 		return true;
@@ -73,6 +74,7 @@ static inline bool bio_has_data(struct bio *bio)
 static inline bool bio_no_advance_iter(struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
+	       bio_op(bio) == REQ_OP_ASSIGN_RANGE ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
 	       bio_op(bio) == REQ_OP_WRITE_SAME ||
 	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
@@ -184,6 +186,7 @@ static inline unsigned bio_segments(struct bio *bio)
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
+	case REQ_OP_ASSIGN_RANGE:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
 		return 0;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..f03dcf25c831 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -296,6 +296,8 @@ enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* assign sector range */
+	REQ_OP_ASSIGN_RANGE	= 15,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3cd1853dbdac..9af70120fe57 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_assign_range_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -995,6 +996,10 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
 
+	if (unlikely(op == REQ_OP_ASSIGN_RANGE))
+		return min(q->limits.max_assign_range_sectors,
+			   UINT_MAX >> SECTOR_SHIFT);
+
 	if (unlikely(op == REQ_OP_WRITE_SAME))
 		return q->limits.max_write_same_sectors;
 
@@ -1028,6 +1033,7 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
+	    req_op(rq) == REQ_OP_ASSIGN_RANGE ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
 		return blk_queue_get_max_sectors(q, req_op(rq));
 
@@ -1225,6 +1231,8 @@ extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		unsigned flags);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
+extern int blkdev_issue_assign_range(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
 
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
 		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
@@ -1247,6 +1255,17 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
 				    gfp_mask, 0);
 }
 
+static inline int sb_issue_assign_range(struct super_block *sb, sector_t block,
+		sector_t nr_blocks, gfp_t gfp_mask)
+{
+	return blkdev_issue_assign_range(sb->s_bdev,
+					 block << (sb->s_blocksize_bits -
+						   SECTOR_SHIFT),
+					 nr_blocks << (sb->s_blocksize_bits -
+						       SECTOR_SHIFT),
+					 gfp_mask, 0);
+}
+
 extern int blk_verify_command(unsigned char *cmd, fmode_t mode);
 
 enum blk_default_limits {
@@ -1428,6 +1447,16 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 	return 0;
 }
 
+static inline unsigned int bdev_assign_range_sectors(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return q->limits.max_assign_range_sectors;
+
+	return 0;
+}
+
 static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH RFC 2/3] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
  2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
  2019-12-10 16:56 ` [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation Kirill Tkhai
@ 2019-12-10 16:56 ` Kirill Tkhai
  2019-12-10 16:56 ` [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks Kirill Tkhai
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Kirill Tkhai @ 2019-12-10 16:56 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, ktkhai, andrea.parri, hare, tj,
	ajay.joshi, sagi, dsterba, chaitanya.kulkarni, bvanassche,
	dhowells, asml.silence

Send fallocate(0) request into underlining filesystem
after upper filesystem sent REQ_OP_ASSIGN_RANGE request
to block device.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 drivers/block/loop.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 739b372a5112..d99d9193de7a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -609,6 +609,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 				FALLOC_FL_PUNCH_HOLE);
 	case REQ_OP_DISCARD:
 		return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
+	case REQ_OP_ASSIGN_RANGE:
+		return lo_fallocate(lo, rq, pos, 0);
 	case REQ_OP_WRITE:
 		if (lo->transfer)
 			return lo_write_transfer(lo, rq, pos);
@@ -875,6 +877,7 @@ static void loop_config_discard(struct loop_device *lo)
 	    lo->lo_encrypt_key_size) {
 		q->limits.discard_granularity = 0;
 		q->limits.discard_alignment = 0;
+		q->limits.max_assign_range_sectors = 0;
 		blk_queue_max_discard_sectors(q, 0);
 		blk_queue_max_write_zeroes_sectors(q, 0);
 		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
@@ -883,6 +886,7 @@ static void loop_config_discard(struct loop_device *lo)
 
 	q->limits.discard_granularity = inode->i_sb->s_blocksize;
 	q->limits.discard_alignment = 0;
+	q->limits.max_assign_range_sectors = UINT_MAX >> 9;
 
 	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
 	blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
@@ -1917,6 +1921,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case REQ_OP_FLUSH:
 	case REQ_OP_DISCARD:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ASSIGN_RANGE:
 		cmd->use_aio = false;
 		break;
 	default:



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks
  2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
  2019-12-10 16:56 ` [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation Kirill Tkhai
  2019-12-10 16:56 ` [PATCH RFC 2/3] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Kirill Tkhai
@ 2019-12-10 16:56 ` Kirill Tkhai
  2019-12-11 12:55   ` [PATCH RFC v2 " Kirill Tkhai
  2019-12-11  7:42 ` [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Chaitanya Kulkarni
  2019-12-17 14:16 ` Kirill Tkhai
  4 siblings, 1 reply; 21+ messages in thread
From: Kirill Tkhai @ 2019-12-10 16:56 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, ktkhai, andrea.parri, hare, tj,
	ajay.joshi, sagi, dsterba, chaitanya.kulkarni, bvanassche,
	dhowells, asml.silence

Call sb_issue_assign_range() after extent range was allocated
on user request. Hopeful, this helps block device to maintain
its internals in the best way, if this is appliable.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/ext4/ext4.h    |    1 +
 fs/ext4/extents.c |   11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f8578caba40d..fe2263c00c0e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -622,6 +622,7 @@ enum {
 	 * allows jbd2 to avoid submitting data before commit. */
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 
+#define EXT4_GET_BLOCKS_SUBMIT_ALLOC		0x0800
 /*
  * The bit position of these flags must not overlap with any of the
  * EXT4_GET_BLOCKS_*.  They are used by ext4_find_extent(),
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0e8708b77da6..5f4fc660cbb1 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4490,6 +4490,13 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		ar.len = allocated;
 
 got_allocated_blocks:
+	if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
+		err = sb_issue_assign_range(inode->i_sb, newblock,
+			EXT4_C2B(sbi, allocated_clusters), GFP_NOFS);
+		if (err)
+			goto free_on_err;
+	}
+
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock + offset);
 	newex.ee_len = cpu_to_le16(ar.len);
@@ -4506,7 +4513,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	if (!err)
 		err = ext4_ext_insert_extent(handle, inode, &path,
 					     &newex, flags);
-
+free_on_err:
 	if (err && free_on_err) {
 		int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
 			EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
@@ -4926,7 +4933,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	lblk = offset >> blkbits;
 
 	max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
-	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
+	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT | EXT4_GET_BLOCKS_SUBMIT_ALLOC;
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals
  2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
                   ` (2 preceding siblings ...)
  2019-12-10 16:56 ` [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks Kirill Tkhai
@ 2019-12-11  7:42 ` Chaitanya Kulkarni
  2019-12-11  8:50   ` Kirill Tkhai
  2019-12-17 14:16 ` Kirill Tkhai
  4 siblings, 1 reply; 21+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-11  7:42 UTC (permalink / raw)
  To: Kirill Tkhai, linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, Damien Le Moal, andrea.parri, hare, tj,
	Ajay Joshi, sagi, dsterba, bvanassche, dhowells, asml.silence

> Here is a simple test I did:
> https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856
>

Somehow I'm not able to open this link, can you please share results
in plain text on the email ?



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals
  2019-12-11  7:42 ` [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Chaitanya Kulkarni
@ 2019-12-11  8:50   ` Kirill Tkhai
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill Tkhai @ 2019-12-11  8:50 UTC (permalink / raw)
  To: Chaitanya Kulkarni, linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, Damien Le Moal, andrea.parri, hare, tj,
	Ajay Joshi, sagi, dsterba, bvanassche, dhowells, asml.silence

On 11.12.2019 10:42, Chaitanya Kulkarni wrote:
>> Here is a simple test I did:
>> https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856
>>
> 
> Somehow I'm not able to open this link, can you please share results
> in plain text on the email ?

#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>

#define BLOCK_SIZE 4096
#define STEP (BLOCK_SIZE * 16)
#define SIZE (4ULL * 1024 * 1024 * 1024)

int main(int argc)
{
	int fd, step, ret = 0;
	unsigned long i;
	void *buf;

	if (posix_memalign(&buf, BLOCK_SIZE, SIZE)) {
		perror("alloc");
		exit(1);
	}

	fd = open("file2.img", O_RDWR|O_CREAT|O_DIRECT);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	if (ftruncate(fd, SIZE)) {
		perror("ftruncate");
		exit(1);
	}

	ret = fallocate(fd, 0, 0, SIZE);
	if (ret) {
		perror("fallocate");
		exit(1);
	}
	
	for (step = STEP - BLOCK_SIZE; step >= 0; step -= BLOCK_SIZE) {
		printf("step=%u\n", step);
		for (i = step; i < SIZE; i += STEP) {
			errno = 0;
			if (pwrite(fd, buf, BLOCK_SIZE, i) != BLOCK_SIZE) {
				perror("pwrite");
				exit(1);
			}
		}

		if (fsync(fd)) {
			perror("fsync");
			exit(1);
		}
	}

	return 0;
}



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH RFC v2 3/3] ext4: Notify block device about fallocate(0)-assigned blocks
  2019-12-10 16:56 ` [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks Kirill Tkhai
@ 2019-12-11 12:55   ` " Kirill Tkhai
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill Tkhai @ 2019-12-11 12:55 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence

[I missed debug hunk appeared in v1. Please, see correct patch below]

Call sb_issue_assign_range() after extent range was allocated
on user request. Hopeful, this helps block device to maintain
its internals in the best way, if this is appliable.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/ext4/ext4.h    |    1 +
 fs/ext4/extents.c |   11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f8578caba40d..fe2263c00c0e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -622,6 +622,7 @@ enum {
 	 * allows jbd2 to avoid submitting data before commit. */
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 
+#define EXT4_GET_BLOCKS_SUBMIT_ALLOC		0x0800
 /*
  * The bit position of these flags must not overlap with any of the
  * EXT4_GET_BLOCKS_*.  They are used by ext4_find_extent(),
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0e8708b77da6..68335e1d6893 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4490,6 +4490,13 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		ar.len = allocated;
 
 got_allocated_blocks:
+	if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC)) {
+		err = sb_issue_assign_range(inode->i_sb, newblock,
+			EXT4_C2B(sbi, allocated_clusters), GFP_NOFS);
+		if (err)
+			goto free_on_err;
+	}
+
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock + offset);
 	newex.ee_len = cpu_to_le16(ar.len);
@@ -4506,7 +4513,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	if (!err)
 		err = ext4_ext_insert_extent(handle, inode, &path,
 					     &newex, flags);
-
+free_on_err:
 	if (err && free_on_err) {
 		int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
 			EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
@@ -4926,7 +4933,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	lblk = offset >> blkbits;
 
 	max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
-	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
+	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT | EXT4_GET_BLOCKS_SUBMIT_ALLOC;
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals
  2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
                   ` (3 preceding siblings ...)
  2019-12-11  7:42 ` [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Chaitanya Kulkarni
@ 2019-12-17 14:16 ` Kirill Tkhai
  4 siblings, 0 replies; 21+ messages in thread
From: Kirill Tkhai @ 2019-12-17 14:16 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4, axboe
  Cc: tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence

Hi!

Any comments on this?

Thanks

On 10.12.2019 19:56, Kirill Tkhai wrote:
> Information about continuous extent placement may be useful
> for some block devices. Say, distributed network filesystems,
> which provide block device interface, may use this information
> for better blocks placement over the nodes in their cluster,
> and for better performance. Block devices, which map a file
> on another filesystem (loop), may request the same length extent
> on underlining filesystem for less fragmentation and for batching
> allocation requests. Also, hypervisors like QEMU may use this
> information for optimization of cluster allocations.
> 
> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
> to be used for forwarding user's fallocate(0) requests into
> block device internals. It rather similar to existing
> REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES, etc. The corresponding
> exported primitive is called blkdev_issue_assign_range().
> See [1/3] for the details.
> 
> Patch [2/3] teaches loop driver to handle REQ_OP_ASSIGN_RANGE
> requests by calling fallocate(0).
> 
> Patch [3/3] makes ext4 to notify a block device about fallocate(0).
> 
> Here is a simple test I did:
> https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856
> 
> I attached a file on ext4 to loop. Then, created ext4 partition
> on loop device and started the test in the partition. Direct-io
> is enabled on loop.
> 
> The test fallocates 4G file and writes from some offset with
> given step, then it chooses another offset and repeats. After
> the test all the blocks in the file become written.
> 
> The results shows that batching extents-assigning requests improves
> the performance:
> 
> Before patchset: real ~ 1min 27sec
> After patchset:  real ~ 1min 16sec (18% better)
> 
> Ordinary fallocate() before writes improves the performance
> by batching the requests. These results just show, the same
> is in case of forwarding extents information to underlining
> filesystem.
> ---
> 
> Kirill Tkhai (3):
>       block: Add support for REQ_OP_ASSIGN_RANGE operation
>       loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
>       ext4: Notify block device about fallocate(0)-assigned blocks
> 
> 
>  block/blk-core.c          |    4 +++
>  block/blk-lib.c           |   70 +++++++++++++++++++++++++++++++++++++++++++++
>  block/blk-merge.c         |   21 ++++++++++++++
>  block/bounce.c            |    1 +
>  drivers/block/loop.c      |    5 +++
>  fs/ext4/ext4.h            |    1 +
>  fs/ext4/extents.c         |   11 ++++++-
>  include/linux/bio.h       |    3 ++
>  include/linux/blk_types.h |    2 +
>  include/linux/blkdev.h    |   29 +++++++++++++++++++
>  10 files changed, 145 insertions(+), 2 deletions(-)
> 
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-10 16:56 ` [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation Kirill Tkhai
@ 2019-12-19  3:03   ` Martin K. Petersen
  2019-12-19 11:07     ` Kirill Tkhai
  0 siblings, 1 reply; 21+ messages in thread
From: Martin K. Petersen @ 2019-12-19  3:03 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-block, linux-kernel, linux-ext4, axboe, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence


Hi Kirill!

> The patch adds a new blkdev_issue_assign_range() primitive, which is
> rather similar to existing blkdev_issue_{*} api.  Also, a new queue
> limit.max_assign_range_sectors is added.

I am not so keen on the assign_range name. What's wrong with "allocate"?

But why introduce a completely new operation? Isn't this essentially a
write zeroes with BLKDEV_ZERO_NOUNMAP flag set?

If the zeroing aspect is perceived to be a problem we could add a
BLKDEV_ZERO_ALLOCATE flag (or BLKDEV_ZERO_ANCHOR since that's the
terminology used in SCSI).

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-19  3:03   ` Martin K. Petersen
@ 2019-12-19 11:07     ` Kirill Tkhai
  2019-12-19 22:03       ` Chaitanya Kulkarni
  2019-12-19 22:37       ` Martin K. Petersen
  0 siblings, 2 replies; 21+ messages in thread
From: Kirill Tkhai @ 2019-12-19 11:07 UTC (permalink / raw)
  To: Martin K. Petersen, axboe
  Cc: linux-block, linux-kernel, linux-ext4, tytso, adilger.kernel,
	ming.lei, osandov, jthumshirn, minwoo.im.dev, damien.lemoal,
	andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence

Hi, Martin!

On 19.12.2019 06:03, Martin K. Petersen wrote:
> 
> Hi Kirill!
> 
>> The patch adds a new blkdev_issue_assign_range() primitive, which is
>> rather similar to existing blkdev_issue_{*} api.  Also, a new queue
>> limit.max_assign_range_sectors is added.
> 
> I am not so keen on the assign_range name. What's wrong with "allocate"?

REQ_OP_ALLOCATE_RANGE seemed for me as looking very long for the reviewers.
And I found that there is no an abbreviation of operations name in enum req_opf,
so REQ_OP_ALLOC_RANGE won't look good. Thus, I found a replacement.

But in case of REQ_OP_ALLOCATE_RANGE length is OK for people, there is no
a problem to choose it.

> But why introduce a completely new operation? Isn't this essentially a
> write zeroes with BLKDEV_ZERO_NOUNMAP flag set?
> 
> If the zeroing aspect is perceived to be a problem we could add a
> BLKDEV_ZERO_ALLOCATE flag (or BLKDEV_ZERO_ANCHOR since that's the
> terminology used in SCSI).

Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
So, do I understand right that we should the below two?:

1)Introduce a new flag BLKDEV_ZERO_ALLOCATE for blkdev_issue_write_zeroes().
2)Introduce a new flag REQ_NOZERO in enum req_opf.

Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
which does not write zeroes sometimes? Maybe we should rename
blkdev_issue_write_zeroes() in some more generic name?

Thanks,
Kirill


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-19 11:07     ` Kirill Tkhai
@ 2019-12-19 22:03       ` Chaitanya Kulkarni
  2019-12-19 22:37       ` Martin K. Petersen
  1 sibling, 0 replies; 21+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-19 22:03 UTC (permalink / raw)
  To: Kirill Tkhai, Martin K. Petersen
  Cc: axboe, linux-block, linux-kernel, linux-ext4, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	Damien Le Moal, andrea.parri, hare, tj, Ajay Joshi, sagi,
	dsterba, bvanassche, dhowells, asml.silence

> 1)Introduce a new flag BLKDEV_ZERO_ALLOCATE for blkdev_issue_write_zeroes().
> 2)Introduce a new flag REQ_NOZERO in enum req_opf.
>
> Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
> which does not write zeroes sometimes? Maybe we should rename
> blkdev_issue_write_zeroes() in some more generic name?

Yes it will be confusing, I can see that the code for 
__blkdev_issue_write_zeroes() & __blkdev_issue_assign_range()
is very similar except op = REQ_OP_WRITE_ZEROES and
op = REQ_OP_ASSIGN_RANGE (and some minor details).

What we need is a prep patch which moves the payload less bio
code into the helper function which can accept op as an argument
based on that it will branch out and execute right code path.

If we decide to get this in then I'll be happy to create required
prep patch.
>
> Thanks,
> Kirill
>
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-19 11:07     ` Kirill Tkhai
  2019-12-19 22:03       ` Chaitanya Kulkarni
@ 2019-12-19 22:37       ` Martin K. Petersen
  2019-12-20  1:53         ` Darrick J. Wong
  2019-12-20 11:55         ` Kirill Tkhai
  1 sibling, 2 replies; 21+ messages in thread
From: Martin K. Petersen @ 2019-12-19 22:37 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Martin K. Petersen, axboe, linux-block, linux-kernel, linux-ext4,
	tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence


Kirill,

> Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
> So, do I understand right that we should the below two?:
>
> 1) Introduce a new flag BLKDEV_ZERO_ALLOCATE for
> blkdev_issue_write_zeroes().

> 2) Introduce a new flag REQ_NOZERO in enum req_opf.

Something like that. If zeroing is a problem for you.

Right now we offer the following semantics:

	Deallocate, no zeroing (discard)

	Optionally deallocate, zeroing (zeroout)

	Allocate, zeroing (zeroout + NOUNMAP)

Some devices also implement a fourth option which would be:

	Anchor: Allocate, no zeroing

> Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
> which does not write zeroes sometimes? Maybe we should rename
> blkdev_issue_write_zeroes() in some more generic name?

Maybe. The naming is what it is for hysterical raisins and reflects how
things are implemented in the storage protocols. I wouldn't worry too
much about that. We can rename things if need be but we shouldn't plumb
an essentially identical operation through the block stack just to
expose a different name at the top.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-19 22:37       ` Martin K. Petersen
@ 2019-12-20  1:53         ` Darrick J. Wong
  2019-12-20  2:22           ` Martin K. Petersen
  2019-12-20 11:55         ` Kirill Tkhai
  1 sibling, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2019-12-20  1:53 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Kirill Tkhai, axboe, linux-block, linux-kernel, linux-ext4,
	tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence

On Thu, Dec 19, 2019 at 05:37:47PM -0500, Martin K. Petersen wrote:
> 
> Kirill,
> 
> > Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
> > So, do I understand right that we should the below two?:
> >
> > 1) Introduce a new flag BLKDEV_ZERO_ALLOCATE for
> > blkdev_issue_write_zeroes().
> 
> > 2) Introduce a new flag REQ_NOZERO in enum req_opf.
> 
> Something like that. If zeroing is a problem for you.
> 
> Right now we offer the following semantics:
> 
> 	Deallocate, no zeroing (discard)
> 
> 	Optionally deallocate, zeroing (zeroout)
> 
> 	Allocate, zeroing (zeroout + NOUNMAP)
> 
> Some devices also implement a fourth option which would be:
> 
> 	Anchor: Allocate, no zeroing

What happens if you anchor and then try to read the results?  IO error?

(Yeah, I'm being lazy and not digging through SBC-3, sorry...)

--D

> > Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
> > which does not write zeroes sometimes? Maybe we should rename
> > blkdev_issue_write_zeroes() in some more generic name?
> 
> Maybe. The naming is what it is for hysterical raisins and reflects how
> things are implemented in the storage protocols. I wouldn't worry too
> much about that. We can rename things if need be but we shouldn't plumb
> an essentially identical operation through the block stack just to
> expose a different name at the top.
> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-20  1:53         ` Darrick J. Wong
@ 2019-12-20  2:22           ` Martin K. Petersen
  0 siblings, 0 replies; 21+ messages in thread
From: Martin K. Petersen @ 2019-12-20  2:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Martin K. Petersen, Kirill Tkhai, axboe, linux-block,
	linux-kernel, linux-ext4, tytso, adilger.kernel, ming.lei,
	osandov, jthumshirn, minwoo.im.dev, damien.lemoal, andrea.parri,
	hare, tj, ajay.joshi, sagi, dsterba, chaitanya.kulkarni,
	bvanassche, dhowells, asml.silence


Darrick,

[Anchor]

> What happens if you anchor and then try to read the results?  IO
> error?

Depends on the device. Typically you get the same results as UNMAP.

Anchoring is essentially a way to preallocate blocks on a thinly
provisioned device. If you then run out of actual storage capacity you
won't get a write failure writing to those LBAs. It is intended to
protect block ranges that do not tolerate write failures (journals,
metadata).

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-19 22:37       ` Martin K. Petersen
  2019-12-20  1:53         ` Darrick J. Wong
@ 2019-12-20 11:55         ` Kirill Tkhai
  2019-12-21 18:54           ` Martin K. Petersen
  1 sibling, 1 reply; 21+ messages in thread
From: Kirill Tkhai @ 2019-12-20 11:55 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: axboe, linux-block, linux-kernel, linux-ext4, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence

Hi, Martin,

On 20.12.2019 01:37, Martin K. Petersen wrote:
> 
> Kirill,
> 
>> Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
>> So, do I understand right that we should the below two?:
>>
>> 1) Introduce a new flag BLKDEV_ZERO_ALLOCATE for
>> blkdev_issue_write_zeroes().
> 
>> 2) Introduce a new flag REQ_NOZERO in enum req_opf.
> 
> Something like that. If zeroing is a problem for you.

My intention is to use this in fs allocators to notify virtual block devices
about allocated blocks (like in patch [3/3]). Filesystems allocators know about
written and unwritten extents, and they don't need a zeroing of allocated blocks.

Since a block range allocation action is less complicated (and faster), than
operation of allocation + zeroing of allocated blocks (at least for some devices),
we just choose it as the fastest. This is the reason we avoid zeroing.

> Right now we offer the following semantics:
> 
> 	Deallocate, no zeroing (discard)
> 
> 	Optionally deallocate, zeroing (zeroout)
> 
> 	Allocate, zeroing (zeroout + NOUNMAP)
> 
> Some devices also implement a fourth option which would be:
> 
> 	Anchor: Allocate, no zeroing
> 
>> Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
>> which does not write zeroes sometimes? Maybe we should rename
>> blkdev_issue_write_zeroes() in some more generic name?
> 
> Maybe. The naming is what it is for hysterical raisins and reflects how
> things are implemented in the storage protocols. I wouldn't worry too
> much about that. We can rename things if need be but we shouldn't plumb
> an essentially identical operation through the block stack just to
> expose a different name at the top.

Not introduction a new operation is a good thing. Especially, since we don't
need a specific max_xxx_xxx_sectors != max_write_zeroes_sectors for it.

I'll rework the patch in this way (it seems it will become pretty small
after that).

One more thing to discuss. The new REQ_NOZERO flag won't be supported
by many block devices (their number will be even less, than number of
REQ_OP_WRITE_ZEROES supporters). Will this be a good thing, in case
of we will be completing BLKDEV_ZERO_ALLOCATE bios in __blkdev_issue_write_zeroes()
before splitting? I mean introduction of some flag in struct request_queue::limits.
Completion of them with -EOPNOTSUPP in block devices drivers looks
suboptimal for me.

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-20 11:55         ` Kirill Tkhai
@ 2019-12-21 18:54           ` Martin K. Petersen
  2019-12-23  8:51             ` Kirill Tkhai
  0 siblings, 1 reply; 21+ messages in thread
From: Martin K. Petersen @ 2019-12-21 18:54 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Martin K. Petersen, axboe, linux-block, linux-kernel, linux-ext4,
	tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence


Kirill,

> One more thing to discuss. The new REQ_NOZERO flag won't be supported
> by many block devices (their number will be even less, than number of
> REQ_OP_WRITE_ZEROES supporters). Will this be a good thing, in case of
> we will be completing BLKDEV_ZERO_ALLOCATE bios in
> __blkdev_issue_write_zeroes() before splitting? I mean introduction of
> some flag in struct request_queue::limits.  Completion of them with
> -EOPNOTSUPP in block devices drivers looks suboptimal for me.

We already have the NOFALLBACK flag to let the user make that decision.

If that flag is not specified, and I receive an allocate request for a
SCSI device that does not support ANCHOR, my expectation would be that I
would do a regular write same.

If it's a filesystem that is the recipient of the operation and not a
SCSI device, how to react would depend on how the filesystem handles
unwritten extents, etc.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-21 18:54           ` Martin K. Petersen
@ 2019-12-23  8:51             ` Kirill Tkhai
  2020-01-07  3:24               ` Martin K. Petersen
  0 siblings, 1 reply; 21+ messages in thread
From: Kirill Tkhai @ 2019-12-23  8:51 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: axboe, linux-block, linux-kernel, linux-ext4, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence

On 21.12.2019 21:54, Martin K. Petersen wrote:
> 
> Kirill,
> 
>> One more thing to discuss. The new REQ_NOZERO flag won't be supported
>> by many block devices (their number will be even less, than number of
>> REQ_OP_WRITE_ZEROES supporters). Will this be a good thing, in case of
>> we will be completing BLKDEV_ZERO_ALLOCATE bios in
>> __blkdev_issue_write_zeroes() before splitting? I mean introduction of
>> some flag in struct request_queue::limits.  Completion of them with
>> -EOPNOTSUPP in block devices drivers looks suboptimal for me.
> 
> We already have the NOFALLBACK flag to let the user make that decision.
> 
> If that flag is not specified, and I receive an allocate request for a
> SCSI device that does not support ANCHOR, my expectation would be that I
> would do a regular write same.
>
> If it's a filesystem that is the recipient of the operation and not a
> SCSI device, how to react would depend on how the filesystem handles
> unwritten extents, etc.

Ok, this case is clear for me, thanks.

But I also worry about NOFALLBACK case. There are possible block devices,
which support write zeroes, but they can't allocate blocks (block allocation
are just not appliable for them, say, these are all ordinary hdd).

Let's say, a user called fallocate(), and filesystem allocated range of blocks.
Then filesystem propagates the range to block device, and calls zeroout:

blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
		     GFP_NOIO, BLKDEV_ZERO_ALLOCATE|BLKDEV_ZERO_NOFALLBACK);

This case filesystem does not want zeroing blocks, it just wants to send a hint
to block device. So, in case of block device supports allocation, everything is OK.

But won't it be a good thing to return EOPNOTSUPP right from __blkdev_issue_write_zeroes()
in case of block device can't allocate blocks (q->limits.write_zeroes_can_allocate
in the patch below)? Here is just a way to underline block devices, which support
write zeroes, but allocation of blocks is meant nothing for them (wasting of time).

What do you think about the below?

Thanks
---
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..524b47905fd5 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -214,7 +214,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		struct bio **biop, unsigned flags)
 {
 	struct bio *bio = *biop;
-	unsigned int max_write_zeroes_sectors;
+	unsigned int max_write_zeroes_sectors, req_flags = 0;
 	struct request_queue *q = bdev_get_queue(bdev);
 
 	if (!q)
@@ -229,13 +229,19 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 	if (max_write_zeroes_sectors == 0)
 		return -EOPNOTSUPP;
 
+	if (flags & BLKDEV_ZERO_NOUNMAP)
+		req_flags |= REQ_NOUNMAP;
+	if (flags & BLKDEV_ZERO_ALLOCATE) {
+		if (!q->limits.write_zeroes_can_allocate)
+			return -EOPNOTSUPP;
+		req_flags |= REQ_NOZERO|REQ_NOUNMAP;
+	}
+
 	while (nr_sects) {
 		bio = blk_next_bio(bio, 0, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
-		bio->bi_opf = REQ_OP_WRITE_ZEROES;
-		if (flags & BLKDEV_ZERO_NOUNMAP)
-			bio->bi_opf |= REQ_NOUNMAP;
+		bio->bi_opf = REQ_OP_WRITE_ZEROES | req_flags;
 
 		if (nr_sects > max_write_zeroes_sectors) {
 			bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..1ffef894b3bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2122,6 +2122,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
 		break;
+	case FALLOC_FL_KEEP_SIZE:
+		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+			GFP_KERNEL, BLKDEV_ZERO_ALLOCATE | BLKDEV_ZERO_NOFALLBACK);
+		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
 		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, 0);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..9ed166860099 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -335,7 +335,9 @@ enum req_flag_bits {
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
-
+	__REQ_NOZERO,		/* only notify about allocated blocks,
+				 * and do not actual zero them
+				 */
 	__REQ_HIPRI,
 
 	/* for driver use */
@@ -362,6 +364,7 @@ enum req_flag_bits {
 #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
+#define REQ_NOZERO		(1ULL << __REQ_NOZERO)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c45779f00cbd..9e3cd3394dd6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -347,6 +347,7 @@ struct queue_limits {
 	unsigned char		misaligned;
 	unsigned char		discard_misaligned;
 	unsigned char		raid_partial_stripes_expensive;
+	bool			write_zeroes_can_allocate;
 	enum blk_zoned_model	zoned;
 };
 
@@ -1219,6 +1220,7 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
+#define BLKDEV_ZERO_ALLOCATE	(1 << 2)  /* allocate range of blocks */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-23  8:51             ` Kirill Tkhai
@ 2020-01-07  3:24               ` Martin K. Petersen
  2020-01-07 13:59                 ` Kirill Tkhai
  0 siblings, 1 reply; 21+ messages in thread
From: Martin K. Petersen @ 2020-01-07  3:24 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Martin K. Petersen, axboe, linux-block, linux-kernel, linux-ext4,
	tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence


Kirill,

Sorry, the holiday break got in the way.

> But I also worry about NOFALLBACK case. There are possible block
> devices, which support write zeroes, but they can't allocate blocks
> (block allocation are just not appliable for them, say, these are all
> ordinary hdd).

Correct. We shouldn't go down this path unless a device is thinly
provisioned (i.e. max_discard_sectors > 0).

> But won't it be a good thing to return EOPNOTSUPP right from
> __blkdev_issue_write_zeroes() in case of block device can't allocate
> blocks (q->limits.write_zeroes_can_allocate in the patch below)? Here
> is just a way to underline block devices, which support write zeroes,
> but allocation of blocks is meant nothing for them (wasting of time).

I don't like "write_zeroes_can_allocate" because that makes assumptions
about WRITE ZEROES being the command of choice. I suggest we call it
"max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
emphasis on the semantic operation and not the plumbing.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2020-01-07  3:24               ` Martin K. Petersen
@ 2020-01-07 13:59                 ` Kirill Tkhai
  2020-01-08  2:49                   ` Martin K. Petersen
  0 siblings, 1 reply; 21+ messages in thread
From: Kirill Tkhai @ 2020-01-07 13:59 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: axboe, linux-block, linux-kernel, linux-ext4, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence

On 07.01.2020 06:24, Martin K. Petersen wrote:
> 
> Kirill,
> 
> Sorry, the holiday break got in the way.
> 
>> But I also worry about NOFALLBACK case. There are possible block
>> devices, which support write zeroes, but they can't allocate blocks
>> (block allocation are just not appliable for them, say, these are all
>> ordinary hdd).
> 
> Correct. We shouldn't go down this path unless a device is thinly
> provisioned (i.e. max_discard_sectors > 0).

(I assumed it is a typo, and you mean max_allocate_sectors like bellow).
 
>> But won't it be a good thing to return EOPNOTSUPP right from
>> __blkdev_issue_write_zeroes() in case of block device can't allocate
>> blocks (q->limits.write_zeroes_can_allocate in the patch below)? Here
>> is just a way to underline block devices, which support write zeroes,
>> but allocation of blocks is meant nothing for them (wasting of time).
> 
> I don't like "write_zeroes_can_allocate" because that makes assumptions
> about WRITE ZEROES being the command of choice. I suggest we call it
> "max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
> emphasis on the semantic operation and not the plumbing.
 
Hm. Do you mean "bool max_allocate_sectors" or "unsigned int max_allocate_sectors"?
In the second case we should make all the q->limits.max_write_zeroes_sectors
dereferencing as switches like the below (this is a partial patch and only several
of places are converted to switches as examples):

diff --git a/block/blk-core.c b/block/blk-core.c
index 50a5de025d5e..4c45417838f7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -978,7 +978,8 @@ generic_make_request_checks(struct bio *bio)
 			goto not_supported;
 		break;
 	case REQ_OP_WRITE_ZEROES:
-		if (!q->limits.max_write_zeroes_sectors)
+		if (!blk_queue_get_max_write_zeroes_sectors(q,
+				    bio->bi_opf & REQ_NOZERO))
 			goto not_supported;
 		break;
 	default:
@@ -1250,10 +1251,10 @@ EXPORT_SYMBOL(submit_bio);
 static int blk_cloned_rq_check_limits(struct request_queue *q,
 				      struct request *rq)
 {
-	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) {
+	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, rq->cmd_flags)) {
 		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
 			__func__, blk_rq_sectors(rq),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+			blk_queue_get_max_sectors(q, rq->cmd_flags));
 		return -EIO;
 	}
 
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 347782a24a35..0cdf7d0386c8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -105,15 +105,22 @@ static struct bio *blk_bio_discard_split(struct request_queue *q,
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
+	unsigned int max_sectors;
+
+	if (bio->bi_opf & REQ_NOZERO)
+		max_sectors = q->limits.max_allocate_sectors;
+	else
+		max_sectors = q->limits.max_write_zeroes_sectors;
+
 	*nsegs = 0;
 
-	if (!q->limits.max_write_zeroes_sectors)
+	if (!max_sectors)
 		return NULL;
 
-	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+	if (bio_sectors(bio) <= max_sectors)
 		return NULL;
 
-	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+	return bio_split(bio, max_sectors, GFP_NOIO, bs);
 }
 
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5f6dcc7a47bd..3f68a60cb196 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -506,6 +506,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_allocate_sectors = min(t->max_allocate_sectors,
+					b->max_allocate_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9e3cd3394dd6..6219604a0c12 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_allocate_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -989,9 +989,19 @@ static inline struct bio_vec req_bvec(struct request *rq)
 	return mp_bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter);
 }
 
+static inline unsigned int blk_queue_get_max_write_zeroes_sectors(
+			struct request_queue *q, bool allocate)
+{
+	if (allocate)
+		return q->limits.max_allocate_sectors;
+	return q->limits.max_write_zeroes_sectors;
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
-						     int op)
+						     unsigned int op_flags)
 {
+	int op = op_flags & REQ_OP_MASK;
+
 	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
@@ -1000,7 +1010,8 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return q->limits.max_write_same_sectors;
 
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
-		return q->limits.max_write_zeroes_sectors;
+		return blk_queue_get_max_write_zeroes_sectors(q,
+					op_flags & REQ_NOZERO);
 
 	return q->limits.max_sectors;
 }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2020-01-07 13:59                 ` Kirill Tkhai
@ 2020-01-08  2:49                   ` Martin K. Petersen
  2020-01-09  9:43                     ` Kirill Tkhai
  0 siblings, 1 reply; 21+ messages in thread
From: Martin K. Petersen @ 2020-01-08  2:49 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Martin K. Petersen, axboe\, linux-block\, linux-kernel\,
	linux-ext4\, tytso\, adilger.kernel\, ming.lei\, osandov\,
	jthumshirn\, minwoo.im.dev\, damien.lemoal\, andrea.parri\, hare\,
	tj\, ajay.joshi\, sagi\, dsterba\, chaitanya.kulkarni\,
	bvanassche\, dhowells\, asml.silence\


Kirill,

>> Correct. We shouldn't go down this path unless a device is thinly
>> provisioned (i.e. max_discard_sectors > 0).
>
> (I assumed it is a typo, and you mean max_allocate_sectors like bellow).

No, this was in the context of not having an explicit queue limit for
allocation. If a device does not have max_discard_sectors > 0 then it is
not thinly provisioned and therefore attempting allocation makes no
sense.

>> I don't like "write_zeroes_can_allocate" because that makes assumptions
>> about WRITE ZEROES being the command of choice. I suggest we call it
>> "max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
>> emphasis on the semantic operation and not the plumbing.
>  
> Hm. Do you mean "bool max_allocate_sectors" or "unsigned int max_allocate_sectors"?

unsigned int. At least for SCSI we could have a device which would use
UNMAP for discards and WRITE SAME for allocates. And therefore the range
limit could be different for the two operations. Sadly.

I have a patch in the pipeline which deals with some problems in this
department because some devices have a split brain wrt. their discard
limits.

> In the second case we should make all the
> q->limits.max_write_zeroes_sectors dereferencing as switches like the
> below (this is a partial patch and only several of places are
> converted to switches as examples):

Something like that, yes.

This is getting a bit messy :( However, I am not sure that scattering
REQ_OP_ALLOCATE all over the I/O stack is particularly attractive
either.

Both REQ_OP_DISCARD and REQ_OP_WRITE_SAME come with some storage
protocol baggage that forces us to have special handling all over the
stack. But REQ_OP_WRITE_ZEROES is fairly clean and simple and, except
for the potentially different block count limit, an allocate operation
would be a carbon copy of the plumbing for write zeroes. A lot of
duplication.

So even through I'm increasingly torn on whether introducing separate
REQ_OP_ALLOCATE plumbing throughout the stack or having a REQ_ALLOCATE
flag for REQ_OP_WRITE_ZEROES is best, I still think I'm leaning towards
the latter. That will also make it easier for me in the SCSI disk
driver.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2020-01-08  2:49                   ` Martin K. Petersen
@ 2020-01-09  9:43                     ` Kirill Tkhai
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill Tkhai @ 2020-01-09  9:43 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: axboe, linux-block, linux-kernel, linux-ext4, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence

On 08.01.2020 05:49, Martin K. Petersen wrote:
> 
> Kirill,
> 
>>> Correct. We shouldn't go down this path unless a device is thinly
>>> provisioned (i.e. max_discard_sectors > 0).
>>
>> (I assumed it is a typo, and you mean max_allocate_sectors like bellow).
> 
> No, this was in the context of not having an explicit queue limit for
> allocation. If a device does not have max_discard_sectors > 0 then it is
> not thinly provisioned and therefore attempting allocation makes no
> sense.
> 
>>> I don't like "write_zeroes_can_allocate" because that makes assumptions
>>> about WRITE ZEROES being the command of choice. I suggest we call it
>>> "max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
>>> emphasis on the semantic operation and not the plumbing.
>>  
>> Hm. Do you mean "bool max_allocate_sectors" or "unsigned int max_allocate_sectors"?
> 
> unsigned int. At least for SCSI we could have a device which would use
> UNMAP for discards and WRITE SAME for allocates. And therefore the range
> limit could be different for the two operations. Sadly.
> 
> I have a patch in the pipeline which deals with some problems in this
> department because some devices have a split brain wrt. their discard
> limits.
> 
>> In the second case we should make all the
>> q->limits.max_write_zeroes_sectors dereferencing as switches like the
>> below (this is a partial patch and only several of places are
>> converted to switches as examples):
> 
> Something like that, yes.
> 
> This is getting a bit messy :( However, I am not sure that scattering
> REQ_OP_ALLOCATE all over the I/O stack is particularly attractive
> either.

It looks like changing the second argument of blk_queue_get_max_write_zeroes_sectors()
gives almost clean code. All REQ_NOZERO checks become moved to helper.

> Both REQ_OP_DISCARD and REQ_OP_WRITE_SAME come with some storage
> protocol baggage that forces us to have special handling all over the
> stack. But REQ_OP_WRITE_ZEROES is fairly clean and simple and, except
> for the potentially different block count limit, an allocate operation
> would be a carbon copy of the plumbing for write zeroes. A lot of
> duplication.
> 
> So even through I'm increasingly torn on whether introducing separate
> REQ_OP_ALLOCATE plumbing throughout the stack or having a REQ_ALLOCATE
> flag for REQ_OP_WRITE_ZEROES is best, I still think I'm leaning towards
> the latter. That will also make it easier for me in the SCSI disk
> driver.

Yeah, this sounds good. Also, it looks the REQ_NOZERO patch is really
simpler, it is smaller even.

Below is complete patch for block core (I've excluded hunks for drivers/*,
since they require some cleanup patches for preparations). Please, let
me know whether I understood you correct and everything is OK. Thanks!

[PATCH] block: Add support for REQ_NOZERO

This adds support for REQ_NOZERO extension of REQ_OP_WRITE_ZEROES
operation, which encourages a block device driver to just allocate
blocks (or mark them allocated) instead of actual blocks zeroing.
REQ_NOZERO is aimed to be used for network filesystems providing
a block device interface. Also, block devices, which map a file
on other filesystem (like loop), may use this for less fragmentation
and batching fallocate() requests. Hypervisors like QEMU may
introduce optimizations of clusters allocations based on this.

BLKDEV_ZERO_ALLOCATE is a new corresponding flag for
blkdev_issue_zeroout().

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 block/blk-core.c          |    6 +++---
 block/blk-lib.c           |   17 ++++++++++-------
 block/blk-merge.c         |    9 ++++++---
 block/blk-settings.c      |    4 ++++
 fs/block_dev.c            |    4 ++++
 include/linux/blk_types.h |    5 ++++-
 include/linux/blkdev.h    |   32 ++++++++++++++++++++++++--------
 7 files changed, 55 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 50a5de025d5e..2edcd55624f1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -978,7 +978,7 @@ generic_make_request_checks(struct bio *bio)
 			goto not_supported;
 		break;
 	case REQ_OP_WRITE_ZEROES:
-		if (!q->limits.max_write_zeroes_sectors)
+		if (!blk_queue_get_max_write_zeroes_sectors(q, bio->bi_opf))
 			goto not_supported;
 		break;
 	default:
@@ -1250,10 +1250,10 @@ EXPORT_SYMBOL(submit_bio);
 static int blk_cloned_rq_check_limits(struct request_queue *q,
 				      struct request *rq)
 {
-	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) {
+	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, rq->cmd_flags)) {
 		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
 			__func__, blk_rq_sectors(rq),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+			blk_queue_get_max_sectors(q, rq->cmd_flags));
 		return -EIO;
 	}
 
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..3e80279eb029 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -214,7 +214,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		struct bio **biop, unsigned flags)
 {
 	struct bio *bio = *biop;
-	unsigned int max_write_zeroes_sectors;
+	unsigned int max_write_zeroes_sectors, req_flags = 0;
 	struct request_queue *q = bdev_get_queue(bdev);
 
 	if (!q)
@@ -224,18 +224,21 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		return -EPERM;
 
 	/* Ensure that max_write_zeroes_sectors doesn't overflow bi_size */
-	max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev);
+	max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev, flags);
 
 	if (max_write_zeroes_sectors == 0)
 		return -EOPNOTSUPP;
 
+	if (flags & BLKDEV_ZERO_NOUNMAP)
+		req_flags |= REQ_NOUNMAP;
+	if (flags & BLKDEV_ZERO_ALLOCATE)
+		req_flags |= REQ_NOZERO|REQ_NOUNMAP;
+
 	while (nr_sects) {
 		bio = blk_next_bio(bio, 0, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
-		bio->bi_opf = REQ_OP_WRITE_ZEROES;
-		if (flags & BLKDEV_ZERO_NOUNMAP)
-			bio->bi_opf |= REQ_NOUNMAP;
+		bio->bi_opf = REQ_OP_WRITE_ZEROES | req_flags;
 
 		if (nr_sects > max_write_zeroes_sectors) {
 			bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
@@ -362,7 +365,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	sector_t bs_mask;
 	struct bio *bio;
 	struct blk_plug plug;
-	bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev);
+	bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev, flags);
 
 	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
 	if ((sector | nr_sects) & bs_mask)
@@ -391,7 +394,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			try_write_zeroes = false;
 			goto retry;
 		}
-		if (!bdev_write_zeroes_sectors(bdev)) {
+		if (!bdev_write_zeroes_sectors(bdev, flags)) {
 			/*
 			 * Zeroing offload support was indicated, but the
 			 * device reported ILLEGAL REQUEST (for some devices
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 347782a24a35..e3ce4b87bbaa 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -105,15 +105,18 @@ static struct bio *blk_bio_discard_split(struct request_queue *q,
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
+	unsigned int max_sectors;
+
+	max_sectors = blk_queue_get_max_write_zeroes_sectors(q, bio->bi_opf);
 	*nsegs = 0;
 
-	if (!q->limits.max_write_zeroes_sectors)
+	if (!max_sectors)
 		return NULL;
 
-	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+	if (bio_sectors(bio) <= max_sectors)
 		return NULL;
 
-	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+	return bio_split(bio, max_sectors, GFP_NOIO, bs);
 }
 
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5f6dcc7a47bd..09a2eeacfadd 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_allocate_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_allocate_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -506,6 +508,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_allocate_sectors = min(t->max_allocate_sectors,
+					b->max_allocate_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..1ffef894b3bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2122,6 +2122,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
 		break;
+	case FALLOC_FL_KEEP_SIZE:
+		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+			GFP_KERNEL, BLKDEV_ZERO_ALLOCATE | BLKDEV_ZERO_NOFALLBACK);
+		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
 		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, 0);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..9ed166860099 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -335,7 +335,9 @@ enum req_flag_bits {
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
-
+	__REQ_NOZERO,		/* only notify about allocated blocks,
+				 * and do not actual zero them
+				 */
 	__REQ_HIPRI,
 
 	/* for driver use */
@@ -362,6 +364,7 @@ enum req_flag_bits {
 #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
+#define REQ_NOZERO		(1ULL << __REQ_NOZERO)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c45779f00cbd..7e388be625af 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_allocate_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -988,9 +989,19 @@ static inline struct bio_vec req_bvec(struct request *rq)
 	return mp_bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter);
 }
 
+static inline unsigned int blk_queue_get_max_write_zeroes_sectors(
+		struct request_queue *q, unsigned int op_flags)
+{
+	if (op_flags & REQ_NOZERO)
+		return q->limits.max_allocate_sectors;
+	return q->limits.max_write_zeroes_sectors;
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
-						     int op)
+						     unsigned int op_flags)
 {
+	int op = op_flags & REQ_OP_MASK;
+
 	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
@@ -999,7 +1010,7 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return q->limits.max_write_same_sectors;
 
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
-		return q->limits.max_write_zeroes_sectors;
+		return blk_queue_get_max_write_zeroes_sectors(q, op_flags);
 
 	return q->limits.max_sectors;
 }
@@ -1029,10 +1040,10 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
-		return blk_queue_get_max_sectors(q, req_op(rq));
+		return blk_queue_get_max_sectors(q, rq->cmd_flags);
 
 	return min(blk_max_size_offset(q, offset),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+			blk_queue_get_max_sectors(q, rq->cmd_flags));
 }
 
 static inline unsigned int blk_rq_count_bios(struct request *rq)
@@ -1219,6 +1230,7 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
+#define BLKDEV_ZERO_ALLOCATE	(1 << 2)  /* allocate range of blocks */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
@@ -1418,14 +1430,18 @@ static inline unsigned int bdev_write_same(struct block_device *bdev)
 	return 0;
 }
 
-static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
+static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev,
+						     unsigned int flags)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
 
-	if (q)
+	if (!q)
+		return 0;
+
+	if (flags & BLKDEV_ZERO_ALLOCATE)
+		return q->limits.max_allocate_sectors;
+	else
 		return q->limits.max_write_zeroes_sectors;
-
-	return 0;
 }
 
 static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)

 


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, back to index

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
2019-12-10 16:56 ` [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation Kirill Tkhai
2019-12-19  3:03   ` Martin K. Petersen
2019-12-19 11:07     ` Kirill Tkhai
2019-12-19 22:03       ` Chaitanya Kulkarni
2019-12-19 22:37       ` Martin K. Petersen
2019-12-20  1:53         ` Darrick J. Wong
2019-12-20  2:22           ` Martin K. Petersen
2019-12-20 11:55         ` Kirill Tkhai
2019-12-21 18:54           ` Martin K. Petersen
2019-12-23  8:51             ` Kirill Tkhai
2020-01-07  3:24               ` Martin K. Petersen
2020-01-07 13:59                 ` Kirill Tkhai
2020-01-08  2:49                   ` Martin K. Petersen
2020-01-09  9:43                     ` Kirill Tkhai
2019-12-10 16:56 ` [PATCH RFC 2/3] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Kirill Tkhai
2019-12-10 16:56 ` [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks Kirill Tkhai
2019-12-11 12:55   ` [PATCH RFC v2 " Kirill Tkhai
2019-12-11  7:42 ` [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Chaitanya Kulkarni
2019-12-11  8:50   ` Kirill Tkhai
2019-12-17 14:16 ` Kirill Tkhai

Linux-ext4 Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-ext4/0 linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ https://lore.kernel.org/linux-ext4 \
		linux-ext4@vger.kernel.org
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-ext4


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git