* [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals
@ 2019-12-10 16:56 Kirill Tkhai
  2019-12-10 16:56 ` [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation Kirill Tkhai
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Kirill Tkhai @ 2019-12-10 16:56 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, ktkhai, andrea.parri, hare, tj,
	ajay.joshi, sagi, dsterba, chaitanya.kulkarni, bvanassche,
	dhowells, asml.silence

Information about contiguous extent placement may be useful
for some block devices. Say, distributed network filesystems,
which provide a block device interface, may use this information
for better block placement over the nodes in their cluster,
and for better performance. Block devices that map a file
on another filesystem (loop) may request an extent of the same
length on the underlying filesystem, to reduce fragmentation and
to batch allocation requests. Also, hypervisors like QEMU may
use this information to optimize cluster allocations.

This patchset introduces REQ_OP_ASSIGN_RANGE, which will be used
to forward a user's fallocate(0) requests into the block device
internals. It is rather similar to the existing REQ_OP_DISCARD,
REQ_OP_WRITE_ZEROES, etc. The corresponding exported primitive
is called blkdev_issue_assign_range(). See [1/3] for the details.
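
For illustration, here is a minimal sketch of the filesystem side
(block and nr_blocks are hypothetical filesystem-block values; the
sb_issue_assign_range() helper added in [1/3] wraps exactly this
conversion):

	/* Convert fs blocks to 512-byte sectors and notify the device.
	 * Devices without support keep the limit at zero, in which case
	 * the call returns 0, so no capability check is needed here. */
	err = blkdev_issue_assign_range(sb->s_bdev,
			block << (sb->s_blocksize_bits - SECTOR_SHIFT),
			nr_blocks << (sb->s_blocksize_bits - SECTOR_SHIFT),
			GFP_NOFS, 0);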

Patch [2/3] teaches loop driver to handle REQ_OP_ASSIGN_RANGE
requests by calling fallocate(0).

Patch [3/3] makes ext4 notify the block device about fallocate(0).

Here is a simple test I did:
https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856

I attached a file on ext4 to loop. Then I created an ext4
partition on the loop device and ran the test in that partition.
Direct I/O is enabled on loop.

The test fallocates a 4G file and writes from some offset with
a given step, then it chooses another offset and repeats. After
the test, all blocks in the file have been written.

The results show that batching extent-assigning requests improves
the performance:

Before patchset: real ~ 1min 27sec
After patchset:  real ~ 1min 16sec (~14% better)

An ordinary fallocate() before writes improves performance
by batching the requests. These results show that the same
holds when extent information is forwarded to the underlying
filesystem.
---

Kirill Tkhai (3):
      block: Add support for REQ_OP_ASSIGN_RANGE operation
      loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
      ext4: Notify block device about fallocate(0)-assigned blocks


 block/blk-core.c          |    4 +++
 block/blk-lib.c           |   70 +++++++++++++++++++++++++++++++++++++++++++++
 block/blk-merge.c         |   21 ++++++++++++++
 block/bounce.c            |    1 +
 drivers/block/loop.c      |    5 +++
 fs/ext4/ext4.h            |    1 +
 fs/ext4/extents.c         |   11 ++++++-
 include/linux/bio.h       |    3 ++
 include/linux/blk_types.h |    2 +
 include/linux/blkdev.h    |   29 +++++++++++++++++++
 10 files changed, 145 insertions(+), 2 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
@ 2019-12-10 16:56 ` Kirill Tkhai
  2019-12-19  3:03   ` Martin K. Petersen
  2019-12-10 16:56 ` [PATCH RFC 2/3] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Kirill Tkhai
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Kirill Tkhai @ 2019-12-10 16:56 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, ktkhai, andrea.parri, hare, tj,
	ajay.joshi, sagi, dsterba, chaitanya.kulkarni, bvanassche,
	dhowells, asml.silence

This operation notifies a device that some sector range was
chosen by a filesystem as a single extent, and that the device
should try its best to reflect that (keep the range as a single
hunk in its internals, or represent the range as a minimal set
of hunks). Put directly, the operation forwards fallocate(0)
requests to whatever entity the device is based on.

This may be useful for distributed network filesystems that
provide a block device interface, to optimize their block
placement over the cluster nodes.

Block devices mapping a file (like loop) are also users of
this, since it allows allocating more contiguous extents and
batches block allocation requests. In addition, hypervisors
like QEMU may use this for better block placement.

The patch adds a new blkdev_issue_assign_range() primitive,
which is rather similar to the existing blkdev_issue_*() API.
A new queue limit, max_assign_range_sectors, is also added.
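
A driver opts in by setting the new limit on its request queue;
a zero limit (the default) causes the operation to be rejected
as not supported in generic_make_request_checks(). For example,
this is what the loop patch in [2/3] does:

	/* Advertise REQ_OP_ASSIGN_RANGE support, accepting up to
	 * UINT_MAX >> 9 sectors per request */
	q->limits.max_assign_range_sectors = UINT_MAX >> 9;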

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 block/blk-core.c          |    4 +++
 block/blk-lib.c           |   70 +++++++++++++++++++++++++++++++++++++++++++++
 block/blk-merge.c         |   21 ++++++++++++++
 block/bounce.c            |    1 +
 include/linux/bio.h       |    3 ++
 include/linux/blk_types.h |    2 +
 include/linux/blkdev.h    |   29 +++++++++++++++++++
 7 files changed, 130 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8d4db6e74496..060cc0ea1246 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -978,6 +978,10 @@ generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_zeroes_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ASSIGN_RANGE:
+		if (!q->limits.max_assign_range_sectors)
+			goto not_supported;
+		break;
 	default:
 		break;
 	}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..fbf780d3ea32 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -252,6 +252,46 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 	return 0;
 }
 
+static int __blkdev_issue_assign_range(struct block_device *bdev,
+		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+		struct bio **biop, unsigned flags)
+{
+	struct bio *bio = *biop;
+	unsigned int max_sectors;
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (!q)
+		return -ENXIO;
+
+	if (bdev_read_only(bdev))
+		return -EPERM;
+
+	max_sectors = bdev_assign_range_sectors(bdev);
+
+	if (max_sectors == 0)
+		return -EOPNOTSUPP;
+
+	while (nr_sects) {
+		bio = blk_next_bio(bio, 0, gfp_mask);
+		bio->bi_iter.bi_sector = sector;
+		bio_set_dev(bio, bdev);
+		bio->bi_opf = REQ_OP_ASSIGN_RANGE;
+
+		if (nr_sects > max_sectors) {
+			bio->bi_iter.bi_size = max_sectors << 9;
+			nr_sects -= max_sectors;
+			sector += max_sectors;
+		} else {
+			bio->bi_iter.bi_size = nr_sects << 9;
+			nr_sects = 0;
+		}
+		cond_resched();
+	}
+
+	*biop = bio;
+	return 0;
+}
+
 /*
  * Convert a number of 512B sectors to a number of pages.
  * The result is limited to a number of pages that can fit into a BIO.
@@ -405,3 +445,33 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return ret;
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+int blkdev_issue_assign_range(struct block_device *bdev, sector_t sector,
+			sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
+{
+	int ret = 0;
+	sector_t bs_mask;
+	struct bio *bio;
+	struct blk_plug plug;
+
+	if (bdev_assign_range_sectors(bdev) == 0)
+		return 0;
+
+	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+	if ((sector | nr_sects) & bs_mask)
+		return -EINVAL;
+
+	bio = NULL;
+	blk_start_plug(&plug);
+
+	ret = __blkdev_issue_assign_range(bdev, sector, nr_sects,
+					  gfp_mask, &bio, flags);
+	if (ret == 0 && bio) {
+		ret = submit_bio_wait(bio);
+		bio_put(bio);
+	}
+	blk_finish_plug(&plug);
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_assign_range);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index d783bdc4559b..b2ae8b5acd72 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -102,6 +102,22 @@ static struct bio *blk_bio_discard_split(struct request_queue *q,
 	return bio_split(bio, split_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_assign_range_split(struct request_queue *q,
+					      struct bio *bio,
+					      struct bio_set *bs,
+					      unsigned *nsegs)
+{
+	*nsegs = 1;
+
+	if (!q->limits.max_assign_range_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_assign_range_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_assign_range_sectors, GFP_NOIO, bs);
+}
+
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
@@ -300,6 +316,10 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 	case REQ_OP_SECURE_ERASE:
 		split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
 		break;
+	case REQ_OP_ASSIGN_RANGE:
+		split = blk_bio_assign_range_split(q, *bio, &q->bio_split,
+				nr_segs);
+		break;
 	case REQ_OP_WRITE_ZEROES:
 		split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
 				nr_segs);
@@ -382,6 +402,7 @@ unsigned int blk_recalc_rq_segments(struct request *rq)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ASSIGN_RANGE:
 		return 0;
 	case REQ_OP_WRITE_SAME:
 		return 1;
diff --git a/block/bounce.c b/block/bounce.c
index f8ed677a1bf7..017bedba7b23 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -255,6 +255,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
+	case REQ_OP_ASSIGN_RANGE:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
 		break;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3cdb84cdc488..cf235c997e45 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,6 +63,7 @@ static inline bool bio_has_data(struct bio *bio)
 	if (bio &&
 	    bio->bi_iter.bi_size &&
 	    bio_op(bio) != REQ_OP_DISCARD &&
+	    bio_op(bio) != REQ_OP_ASSIGN_RANGE &&
 	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
 	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
 		return true;
@@ -73,6 +74,7 @@ static inline bool bio_has_data(struct bio *bio)
 static inline bool bio_no_advance_iter(struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
+	       bio_op(bio) == REQ_OP_ASSIGN_RANGE ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
 	       bio_op(bio) == REQ_OP_WRITE_SAME ||
 	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
@@ -184,6 +186,7 @@ static inline unsigned bio_segments(struct bio *bio)
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
+	case REQ_OP_ASSIGN_RANGE:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
 		return 0;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..f03dcf25c831 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -296,6 +296,8 @@ enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* assign sector range */
+	REQ_OP_ASSIGN_RANGE	= 15,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3cd1853dbdac..9af70120fe57 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_assign_range_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -995,6 +996,10 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
 
+	if (unlikely(op == REQ_OP_ASSIGN_RANGE))
+		return min(q->limits.max_assign_range_sectors,
+			   UINT_MAX >> SECTOR_SHIFT);
+
 	if (unlikely(op == REQ_OP_WRITE_SAME))
 		return q->limits.max_write_same_sectors;
 
@@ -1028,6 +1033,7 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
+	    req_op(rq) == REQ_OP_ASSIGN_RANGE ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
 		return blk_queue_get_max_sectors(q, req_op(rq));
 
@@ -1225,6 +1231,8 @@ extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		unsigned flags);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
+extern int blkdev_issue_assign_range(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
 
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
 		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
@@ -1247,6 +1255,17 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
 				    gfp_mask, 0);
 }
 
+static inline int sb_issue_assign_range(struct super_block *sb, sector_t block,
+		sector_t nr_blocks, gfp_t gfp_mask)
+{
+	return blkdev_issue_assign_range(sb->s_bdev,
+					 block << (sb->s_blocksize_bits -
+						   SECTOR_SHIFT),
+					 nr_blocks << (sb->s_blocksize_bits -
+						       SECTOR_SHIFT),
+					 gfp_mask, 0);
+}
+
 extern int blk_verify_command(unsigned char *cmd, fmode_t mode);
 
 enum blk_default_limits {
@@ -1428,6 +1447,16 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 	return 0;
 }
 
+static inline unsigned int bdev_assign_range_sectors(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return q->limits.max_assign_range_sectors;
+
+	return 0;
+}
+
 static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 2/3] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
  2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
  2019-12-10 16:56 ` [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation Kirill Tkhai
@ 2019-12-10 16:56 ` Kirill Tkhai
  2019-12-10 16:56 ` [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks Kirill Tkhai
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Kirill Tkhai @ 2019-12-10 16:56 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, ktkhai, andrea.parri, hare, tj,
	ajay.joshi, sagi, dsterba, chaitanya.kulkarni, bvanassche,
	dhowells, asml.silence

Send a fallocate(0) request to the underlying filesystem
after the upper filesystem has sent a REQ_OP_ASSIGN_RANGE
request to the block device.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 drivers/block/loop.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 739b372a5112..d99d9193de7a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -609,6 +609,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 				FALLOC_FL_PUNCH_HOLE);
 	case REQ_OP_DISCARD:
 		return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
+	case REQ_OP_ASSIGN_RANGE:
+		return lo_fallocate(lo, rq, pos, 0);
 	case REQ_OP_WRITE:
 		if (lo->transfer)
 			return lo_write_transfer(lo, rq, pos);
@@ -875,6 +877,7 @@ static void loop_config_discard(struct loop_device *lo)
 	    lo->lo_encrypt_key_size) {
 		q->limits.discard_granularity = 0;
 		q->limits.discard_alignment = 0;
+		q->limits.max_assign_range_sectors = 0;
 		blk_queue_max_discard_sectors(q, 0);
 		blk_queue_max_write_zeroes_sectors(q, 0);
 		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
@@ -883,6 +886,7 @@ static void loop_config_discard(struct loop_device *lo)
 
 	q->limits.discard_granularity = inode->i_sb->s_blocksize;
 	q->limits.discard_alignment = 0;
+	q->limits.max_assign_range_sectors = UINT_MAX >> 9;
 
 	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
 	blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
@@ -1917,6 +1921,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case REQ_OP_FLUSH:
 	case REQ_OP_DISCARD:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ASSIGN_RANGE:
 		cmd->use_aio = false;
 		break;
 	default:



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks
  2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
  2019-12-10 16:56 ` [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation Kirill Tkhai
  2019-12-10 16:56 ` [PATCH RFC 2/3] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Kirill Tkhai
@ 2019-12-10 16:56 ` Kirill Tkhai
  2019-12-11  1:02   ` kbuild test robot
                     ` (2 more replies)
  2019-12-11  7:42 ` [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Chaitanya Kulkarni
  2019-12-17 14:16 ` Kirill Tkhai
  4 siblings, 3 replies; 23+ messages in thread
From: Kirill Tkhai @ 2019-12-10 16:56 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, ktkhai, andrea.parri, hare, tj,
	ajay.joshi, sagi, dsterba, chaitanya.kulkarni, bvanassche,
	dhowells, asml.silence

Call sb_issue_assign_range() after an extent range has been
allocated on user request. Hopefully, this helps the block
device maintain its internals in the best way, where applicable.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/ext4/ext4.h    |    1 +
 fs/ext4/extents.c |   11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f8578caba40d..fe2263c00c0e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -622,6 +622,7 @@ enum {
 	 * allows jbd2 to avoid submitting data before commit. */
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 
+#define EXT4_GET_BLOCKS_SUBMIT_ALLOC		0x0800
 /*
  * The bit position of these flags must not overlap with any of the
  * EXT4_GET_BLOCKS_*.  They are used by ext4_find_extent(),
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0e8708b77da6..5f4fc660cbb1 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4490,6 +4490,13 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		ar.len = allocated;
 
 got_allocated_blocks:
+	if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
+		err = sb_issue_assign_range(inode->i_sb, newblock,
+			EXT4_C2B(sbi, allocated_clusters), GFP_NOFS);
+		if (err)
+			goto free_on_err;
+	}
+
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock + offset);
 	newex.ee_len = cpu_to_le16(ar.len);
@@ -4506,7 +4513,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	if (!err)
 		err = ext4_ext_insert_extent(handle, inode, &path,
 					     &newex, flags);
-
+free_on_err:
 	if (err && free_on_err) {
 		int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
 			EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
@@ -4926,7 +4933,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	lblk = offset >> blkbits;
 
 	max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
-	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
+	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT | EXT4_GET_BLOCKS_SUBMIT_ALLOC;
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks
  2019-12-10 16:56 ` [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks Kirill Tkhai
@ 2019-12-11  1:02   ` kbuild test robot
  2019-12-11 12:55   ` [PATCH RFC v2 " Kirill Tkhai
  2019-12-15 15:35   ` [PATCH RFC " kbuild test robot
  2 siblings, 0 replies; 23+ messages in thread
From: kbuild test robot @ 2019-12-11  1:02 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 14696 bytes --]

Hi Kirill,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on block/for-next]
[also build test ERROR on linus/master v5.5-rc1 next-20191210]
[cannot apply to ext4/dev]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Kirill-Tkhai/block-ext4-Introduce-REQ_OP_ASSIGN_RANGE-to-reflect-extents-allocation-in-block-device-internals/20191211-073400
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: riscv-defconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=riscv 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   fs/ext4/extents.c: In function 'ext4_ext_map_blocks':
>> fs/ext4/extents.c:4493:51: error: 'struct ext4_sb_info' has no member named 'fallocate'
     if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
                                                      ^~

vim +4493 fs/ext4/extents.c

  4261	
  4262	
  4263	/*
  4264	 * Block allocation/map/preallocation routine for extents based files
  4265	 *
  4266	 *
  4267	 * Need to be called with
  4268	 * down_read(&EXT4_I(inode)->i_data_sem) if not allocating file system block
  4269	 * (ie, create is zero). Otherwise down_write(&EXT4_I(inode)->i_data_sem)
  4270	 *
  4271	 * return > 0, number of of blocks already mapped/allocated
  4272	 *          if create == 0 and these are pre-allocated blocks
  4273	 *          	buffer head is unmapped
  4274	 *          otherwise blocks are mapped
  4275	 *
  4276	 * return = 0, if plain look up failed (blocks have not been allocated)
  4277	 *          buffer head is unmapped
  4278	 *
  4279	 * return < 0, error case.
  4280	 */
  4281	int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
  4282				struct ext4_map_blocks *map, int flags)
  4283	{
  4284		struct ext4_ext_path *path = NULL;
  4285		struct ext4_extent newex, *ex, *ex2;
  4286		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
  4287		ext4_fsblk_t newblock = 0;
  4288		int free_on_err = 0, err = 0, depth, ret;
  4289		unsigned int allocated = 0, offset = 0;
  4290		unsigned int allocated_clusters = 0;
  4291		struct ext4_allocation_request ar;
  4292		ext4_lblk_t cluster_offset;
  4293		bool map_from_cluster = false;
  4294	
  4295		ext_debug("blocks %u/%u requested for inode %lu\n",
  4296			  map->m_lblk, map->m_len, inode->i_ino);
  4297		trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
  4298	
  4299		/* find extent for this block */
  4300		path = ext4_find_extent(inode, map->m_lblk, NULL, 0);
  4301		if (IS_ERR(path)) {
  4302			err = PTR_ERR(path);
  4303			path = NULL;
  4304			goto out2;
  4305		}
  4306	
  4307		depth = ext_depth(inode);
  4308	
  4309		/*
  4310		 * consistent leaf must not be empty;
  4311		 * this situation is possible, though, _during_ tree modification;
  4312		 * this is why assert can't be put in ext4_find_extent()
  4313		 */
  4314		if (unlikely(path[depth].p_ext == NULL && depth != 0)) {
  4315			EXT4_ERROR_INODE(inode, "bad extent address "
  4316					 "lblock: %lu, depth: %d pblock %lld",
  4317					 (unsigned long) map->m_lblk, depth,
  4318					 path[depth].p_block);
  4319			err = -EFSCORRUPTED;
  4320			goto out2;
  4321		}
  4322	
  4323		ex = path[depth].p_ext;
  4324		if (ex) {
  4325			ext4_lblk_t ee_block = le32_to_cpu(ex->ee_block);
  4326			ext4_fsblk_t ee_start = ext4_ext_pblock(ex);
  4327			unsigned short ee_len;
  4328	
  4329	
  4330			/*
  4331			 * unwritten extents are treated as holes, except that
  4332			 * we split out initialized portions during a write.
  4333			 */
  4334			ee_len = ext4_ext_get_actual_len(ex);
  4335	
  4336			trace_ext4_ext_show_extent(inode, ee_block, ee_start, ee_len);
  4337	
  4338			/* if found extent covers block, simply return it */
  4339			if (in_range(map->m_lblk, ee_block, ee_len)) {
  4340				newblock = map->m_lblk - ee_block + ee_start;
  4341				/* number of remaining blocks in the extent */
  4342				allocated = ee_len - (map->m_lblk - ee_block);
  4343				ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
  4344					  ee_block, ee_len, newblock);
  4345	
  4346				/*
  4347				 * If the extent is initialized check whether the
  4348				 * caller wants to convert it to unwritten.
  4349				 */
  4350				if ((!ext4_ext_is_unwritten(ex)) &&
  4351				    (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) {
  4352					allocated = convert_initialized_extent(
  4353							handle, inode, map, &path,
  4354							allocated);
  4355					goto out2;
  4356				} else if (!ext4_ext_is_unwritten(ex))
  4357					goto out;
  4358	
  4359				ret = ext4_ext_handle_unwritten_extents(
  4360					handle, inode, map, &path, flags,
  4361					allocated, newblock);
  4362				if (ret < 0)
  4363					err = ret;
  4364				else
  4365					allocated = ret;
  4366				goto out2;
  4367			}
  4368		}
  4369	
  4370		/*
  4371		 * requested block isn't allocated yet;
  4372		 * we couldn't try to create block if create flag is zero
  4373		 */
  4374		if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
  4375			ext4_lblk_t hole_start, hole_len;
  4376	
  4377			hole_start = map->m_lblk;
  4378			hole_len = ext4_ext_determine_hole(inode, path, &hole_start);
  4379			/*
  4380			 * put just found gap into cache to speed up
  4381			 * subsequent requests
  4382			 */
  4383			ext4_ext_put_gap_in_cache(inode, hole_start, hole_len);
  4384	
  4385			/* Update hole_len to reflect hole size after map->m_lblk */
  4386			if (hole_start != map->m_lblk)
  4387				hole_len -= map->m_lblk - hole_start;
  4388			map->m_pblk = 0;
  4389			map->m_len = min_t(unsigned int, map->m_len, hole_len);
  4390	
  4391			goto out2;
  4392		}
  4393	
  4394		/*
  4395		 * Okay, we need to do block allocation.
  4396		 */
  4397		newex.ee_block = cpu_to_le32(map->m_lblk);
  4398		cluster_offset = EXT4_LBLK_COFF(sbi, map->m_lblk);
  4399	
  4400		/*
  4401		 * If we are doing bigalloc, check to see if the extent returned
  4402		 * by ext4_find_extent() implies a cluster we can use.
  4403		 */
  4404		if (cluster_offset && ex &&
  4405		    get_implied_cluster_alloc(inode->i_sb, map, ex, path)) {
  4406			ar.len = allocated = map->m_len;
  4407			newblock = map->m_pblk;
  4408			map_from_cluster = true;
  4409			goto got_allocated_blocks;
  4410		}
  4411	
  4412		/* find neighbour allocated blocks */
  4413		ar.lleft = map->m_lblk;
  4414		err = ext4_ext_search_left(inode, path, &ar.lleft, &ar.pleft);
  4415		if (err)
  4416			goto out2;
  4417		ar.lright = map->m_lblk;
  4418		ex2 = NULL;
  4419		err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2);
  4420		if (err)
  4421			goto out2;
  4422	
  4423		/* Check if the extent after searching to the right implies a
  4424		 * cluster we can use. */
  4425		if ((sbi->s_cluster_ratio > 1) && ex2 &&
  4426		    get_implied_cluster_alloc(inode->i_sb, map, ex2, path)) {
  4427			ar.len = allocated = map->m_len;
  4428			newblock = map->m_pblk;
  4429			map_from_cluster = true;
  4430			goto got_allocated_blocks;
  4431		}
  4432	
  4433		/*
  4434		 * See if request is beyond maximum number of blocks we can have in
  4435		 * a single extent. For an initialized extent this limit is
  4436		 * EXT_INIT_MAX_LEN and for an unwritten extent this limit is
  4437		 * EXT_UNWRITTEN_MAX_LEN.
  4438		 */
  4439		if (map->m_len > EXT_INIT_MAX_LEN &&
  4440		    !(flags & EXT4_GET_BLOCKS_UNWRIT_EXT))
  4441			map->m_len = EXT_INIT_MAX_LEN;
  4442		else if (map->m_len > EXT_UNWRITTEN_MAX_LEN &&
  4443			 (flags & EXT4_GET_BLOCKS_UNWRIT_EXT))
  4444			map->m_len = EXT_UNWRITTEN_MAX_LEN;
  4445	
  4446		/* Check if we can really insert (m_lblk)::(m_lblk + m_len) extent */
  4447		newex.ee_len = cpu_to_le16(map->m_len);
  4448		err = ext4_ext_check_overlap(sbi, inode, &newex, path);
  4449		if (err)
  4450			allocated = ext4_ext_get_actual_len(&newex);
  4451		else
  4452			allocated = map->m_len;
  4453	
  4454		/* allocate new block */
  4455		ar.inode = inode;
  4456		ar.goal = ext4_ext_find_goal(inode, path, map->m_lblk);
  4457		ar.logical = map->m_lblk;
  4458		/*
  4459		 * We calculate the offset from the beginning of the cluster
  4460		 * for the logical block number, since when we allocate a
  4461		 * physical cluster, the physical block should start at the
  4462		 * same offset from the beginning of the cluster.  This is
  4463		 * needed so that future calls to get_implied_cluster_alloc()
  4464		 * work correctly.
  4465		 */
  4466		offset = EXT4_LBLK_COFF(sbi, map->m_lblk);
  4467		ar.len = EXT4_NUM_B2C(sbi, offset+allocated);
  4468		ar.goal -= offset;
  4469		ar.logical -= offset;
  4470		if (S_ISREG(inode->i_mode))
  4471			ar.flags = EXT4_MB_HINT_DATA;
  4472		else
  4473			/* disable in-core preallocation for non-regular files */
  4474			ar.flags = 0;
  4475		if (flags & EXT4_GET_BLOCKS_NO_NORMALIZE)
  4476			ar.flags |= EXT4_MB_HINT_NOPREALLOC;
  4477		if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
  4478			ar.flags |= EXT4_MB_DELALLOC_RESERVED;
  4479		if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
  4480			ar.flags |= EXT4_MB_USE_RESERVED;
  4481		newblock = ext4_mb_new_blocks(handle, &ar, &err);
  4482		if (!newblock)
  4483			goto out2;
  4484		ext_debug("allocate new block: goal %llu, found %llu/%u\n",
  4485			  ar.goal, newblock, allocated);
  4486		free_on_err = 1;
  4487		allocated_clusters = ar.len;
  4488		ar.len = EXT4_C2B(sbi, ar.len) - offset;
  4489		if (ar.len > allocated)
  4490			ar.len = allocated;
  4491	
  4492	got_allocated_blocks:
> 4493		if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
  4494			err = sb_issue_assign_range(inode->i_sb, newblock,
  4495				EXT4_C2B(sbi, allocated_clusters), GFP_NOFS);
  4496			if (err)
  4497				goto free_on_err;
  4498		}
  4499	
  4500		/* try to insert new extent into found leaf and return */
  4501		ext4_ext_store_pblock(&newex, newblock + offset);
  4502		newex.ee_len = cpu_to_le16(ar.len);
  4503		/* Mark unwritten */
  4504		if (flags & EXT4_GET_BLOCKS_UNWRIT_EXT){
  4505			ext4_ext_mark_unwritten(&newex);
  4506			map->m_flags |= EXT4_MAP_UNWRITTEN;
  4507		}
  4508	
  4509		err = 0;
  4510		if ((flags & EXT4_GET_BLOCKS_KEEP_SIZE) == 0)
  4511			err = check_eofblocks_fl(handle, inode, map->m_lblk,
  4512						 path, ar.len);
  4513		if (!err)
  4514			err = ext4_ext_insert_extent(handle, inode, &path,
  4515						     &newex, flags);
  4516	free_on_err:
  4517		if (err && free_on_err) {
  4518			int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
  4519				EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
  4520			/* free data blocks we just allocated */
  4521			/* not a good idea to call discard here directly,
  4522			 * but otherwise we'd need to call it every free() */
  4523			ext4_discard_preallocations(inode);
  4524			ext4_free_blocks(handle, inode, NULL, newblock,
  4525					 EXT4_C2B(sbi, allocated_clusters), fb_flags);
  4526			goto out2;
  4527		}
  4528	
  4529		/* previous routine could use block we allocated */
  4530		newblock = ext4_ext_pblock(&newex);
  4531		allocated = ext4_ext_get_actual_len(&newex);
  4532		if (allocated > map->m_len)
  4533			allocated = map->m_len;
  4534		map->m_flags |= EXT4_MAP_NEW;
  4535	
  4536		/*
  4537		 * Reduce the reserved cluster count to reflect successful deferred
  4538		 * allocation of delayed allocated clusters or direct allocation of
  4539		 * clusters discovered to be delayed allocated.  Once allocated, a
  4540		 * cluster is not included in the reserved count.
  4541		 */
  4542		if (test_opt(inode->i_sb, DELALLOC) && !map_from_cluster) {
  4543			if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
  4544				/*
  4545				 * When allocating delayed allocated clusters, simply
  4546				 * reduce the reserved cluster count and claim quota
  4547				 */
  4548				ext4_da_update_reserve_space(inode, allocated_clusters,
  4549								1);
  4550			} else {
  4551				ext4_lblk_t lblk, len;
  4552				unsigned int n;
  4553	
  4554				/*
  4555				 * When allocating non-delayed allocated clusters
  4556				 * (from fallocate, filemap, DIO, or clusters
  4557				 * allocated when delalloc has been disabled by
  4558				 * ext4_nonda_switch), reduce the reserved cluster
  4559				 * count by the number of allocated clusters that
  4560				 * have previously been delayed allocated.  Quota
  4561				 * has been claimed by ext4_mb_new_blocks() above,
  4562				 * so release the quota reservations made for any
  4563				 * previously delayed allocated clusters.
  4564				 */
  4565				lblk = EXT4_LBLK_CMASK(sbi, map->m_lblk);
  4566				len = allocated_clusters << sbi->s_cluster_bits;
  4567				n = ext4_es_delayed_clu(inode, lblk, len);
  4568				if (n > 0)
  4569					ext4_da_update_reserve_space(inode, (int) n, 0);
  4570			}
  4571		}
  4572	
  4573		/*
  4574		 * Cache the extent and update transaction to commit on fdatasync only
  4575		 * when it is _not_ an unwritten extent.
  4576		 */
  4577		if ((flags & EXT4_GET_BLOCKS_UNWRIT_EXT) == 0)
  4578			ext4_update_inode_fsync_trans(handle, inode, 1);
  4579		else
  4580			ext4_update_inode_fsync_trans(handle, inode, 0);
  4581	out:
  4582		if (allocated > map->m_len)
  4583			allocated = map->m_len;
  4584		ext4_ext_show_leaf(inode, path);
  4585		map->m_flags |= EXT4_MAP_MAPPED;
  4586		map->m_pblk = newblock;
  4587		map->m_len = allocated;
  4588	out2:
  4589		ext4_ext_drop_refs(path);
  4590		kfree(path);
  4591	
  4592		trace_ext4_ext_map_blocks_exit(inode, flags, map,
  4593					       err ? err : allocated);
  4594		return err ? err : allocated;
  4595	}
  4596	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org Intel Corporation

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 18497 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals
  2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
                   ` (2 preceding siblings ...)
  2019-12-10 16:56 ` [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks Kirill Tkhai
@ 2019-12-11  7:42 ` Chaitanya Kulkarni
  2019-12-11  8:50   ` Kirill Tkhai
  2019-12-17 14:16 ` Kirill Tkhai
  4 siblings, 1 reply; 23+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-11  7:42 UTC (permalink / raw)
  To: Kirill Tkhai, linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, Damien Le Moal, andrea.parri, hare, tj,
	Ajay Joshi, sagi, dsterba, bvanassche, dhowells, asml.silence

> Here is a simple test I did:
> https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856
>

Somehow I'm not able to open this link. Can you please share the
results in plain text in the email?



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals
  2019-12-11  7:42 ` [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Chaitanya Kulkarni
@ 2019-12-11  8:50   ` Kirill Tkhai
  0 siblings, 0 replies; 23+ messages in thread
From: Kirill Tkhai @ 2019-12-11  8:50 UTC (permalink / raw)
  To: Chaitanya Kulkarni, linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, Damien Le Moal, andrea.parri, hare, tj,
	Ajay Joshi, sagi, dsterba, bvanassche, dhowells, asml.silence

On 11.12.2019 10:42, Chaitanya Kulkarni wrote:
>> Here is a simple test I did:
>> https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856
>>
> 
> Somehow I'm not able to open this link, can you please share results
> in plain text on the email ?

#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>

#define BLOCK_SIZE 4096
#define STEP (BLOCK_SIZE * 16)
#define SIZE (4ULL * 1024 * 1024 * 1024)

int main(void)
{
	int fd, step, ret = 0;
	unsigned long i;
	void *buf;

	if (posix_memalign(&buf, BLOCK_SIZE, SIZE)) {
		/* posix_memalign() returns the error code instead of
		 * setting errno, so perror() would be misleading here */
		fprintf(stderr, "alloc failed\n");
		exit(1);
	}

	fd = open("file2.img", O_RDWR|O_CREAT|O_DIRECT, 0644); /* O_CREAT needs a mode */
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	if (ftruncate(fd, SIZE)) {
		perror("ftruncate");
		exit(1);
	}

	ret = fallocate(fd, 0, 0, SIZE);
	if (ret) {
		perror("fallocate");
		exit(1);
	}
	
	for (step = STEP - BLOCK_SIZE; step >= 0; step -= BLOCK_SIZE) {
		printf("step=%u\n", step);
		for (i = step; i < SIZE; i += STEP) {
			errno = 0;
			if (pwrite(fd, buf, BLOCK_SIZE, i) != BLOCK_SIZE) {
				perror("pwrite");
				exit(1);
			}
		}

		if (fsync(fd)) {
			perror("fsync");
			exit(1);
		}
	}

	return 0;
}
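
The program is self-contained (_GNU_SOURCE is defined at the top
for O_DIRECT), so it builds with a plain gcc invocation; the "real"
times in the cover letter come from running it against a file in
the loop-backed ext4 partition.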



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC v2 3/3] ext4: Notify block device about fallocate(0)-assigned blocks
  2019-12-10 16:56 ` [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks Kirill Tkhai
  2019-12-11  1:02   ` kbuild test robot
@ 2019-12-11 12:55   ` Kirill Tkhai
  2019-12-15 15:35   ` [PATCH RFC " kbuild test robot
  2 siblings, 0 replies; 23+ messages in thread
From: Kirill Tkhai @ 2019-12-11 12:55 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4
  Cc: axboe, tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence

[I missed a debug hunk that slipped into v1. Please see the corrected patch below.]

Call sb_issue_assign_range() after an extent range has been
allocated on user request. Hopefully, this helps the block
device maintain its internals in the best way, where applicable.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/ext4/ext4.h    |    1 +
 fs/ext4/extents.c |   11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f8578caba40d..fe2263c00c0e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -622,6 +622,7 @@ enum {
 	 * allows jbd2 to avoid submitting data before commit. */
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 
+#define EXT4_GET_BLOCKS_SUBMIT_ALLOC		0x0800
 /*
  * The bit position of these flags must not overlap with any of the
  * EXT4_GET_BLOCKS_*.  They are used by ext4_find_extent(),
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0e8708b77da6..68335e1d6893 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4490,6 +4490,13 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		ar.len = allocated;
 
 got_allocated_blocks:
+	if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC)) {
+		err = sb_issue_assign_range(inode->i_sb, newblock,
+			EXT4_C2B(sbi, allocated_clusters), GFP_NOFS);
+		if (err)
+			goto free_on_err;
+	}
+
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock + offset);
 	newex.ee_len = cpu_to_le16(ar.len);
@@ -4506,7 +4513,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	if (!err)
 		err = ext4_ext_insert_extent(handle, inode, &path,
 					     &newex, flags);
-
+free_on_err:
 	if (err && free_on_err) {
 		int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
 			EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
@@ -4926,7 +4933,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	lblk = offset >> blkbits;
 
 	max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
-	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
+	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT | EXT4_GET_BLOCKS_SUBMIT_ALLOC;
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks
  2019-12-10 16:56 ` [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks Kirill Tkhai
  2019-12-11  1:02   ` kbuild test robot
  2019-12-11 12:55   ` [PATCH RFC v2 " Kirill Tkhai
@ 2019-12-15 15:35   ` kbuild test robot
  2 siblings, 0 replies; 23+ messages in thread
From: kbuild test robot @ 2019-12-15 15:35 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 16440 bytes --]

Hi Kirill,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on block/for-next]
[also build test WARNING on linus/master v5.5-rc1 next-20191213]
[cannot apply to ext4/dev]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Kirill-Tkhai/block-ext4-Introduce-REQ_OP_ASSIGN_RANGE-to-reflect-extents-allocation-in-block-device-internals/20191211-073400
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: i386-randconfig-b002-20191213 (attached as .config)
compiler: gcc-7 (Debian 7.5.0-1) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from include/linux/export.h:42:0,
                    from include/linux/linkage.h:7,
                    from include/linux/fs.h:5,
                    from fs//ext4/extents.c:20:
   fs//ext4/extents.c: In function 'ext4_ext_map_blocks':
   fs//ext4/extents.c:4493:51: error: 'struct ext4_sb_info' has no member named 'fallocate'
     if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
                                                      ^
   include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
    #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                       ^~~~
>> fs//ext4/extents.c:4493:2: note: in expansion of macro 'if'
     if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
     ^~
   fs//ext4/extents.c:4493:51: error: 'struct ext4_sb_info' has no member named 'fallocate'
     if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
                                                      ^
   include/linux/compiler.h:58:61: note: in definition of macro '__trace_if_var'
    #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                                ^~~~
>> fs//ext4/extents.c:4493:2: note: in expansion of macro 'if'
     if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
     ^~
   fs//ext4/extents.c:4493:51: error: 'struct ext4_sb_info' has no member named 'fallocate'
     if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
                                                      ^
   include/linux/compiler.h:69:3: note: in definition of macro '__trace_if_value'
     (cond) ?     \
      ^~~~
   include/linux/compiler.h:56:28: note: in expansion of macro '__trace_if_var'
    #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                               ^~~~~~~~~~~~~~
>> fs//ext4/extents.c:4493:2: note: in expansion of macro 'if'
     if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
     ^~

vim +/if +4493 fs//ext4/extents.c

  4261	
  4262	
  4263	/*
  4264	 * Block allocation/map/preallocation routine for extents based files
  4265	 *
  4266	 *
  4267	 * Need to be called with
  4268	 * down_read(&EXT4_I(inode)->i_data_sem) if not allocating file system block
  4269	 * (ie, create is zero). Otherwise down_write(&EXT4_I(inode)->i_data_sem)
  4270	 *
  4271	 * return > 0, number of of blocks already mapped/allocated
  4272	 *          if create == 0 and these are pre-allocated blocks
  4273	 *          	buffer head is unmapped
  4274	 *          otherwise blocks are mapped
  4275	 *
  4276	 * return = 0, if plain look up failed (blocks have not been allocated)
  4277	 *          buffer head is unmapped
  4278	 *
  4279	 * return < 0, error case.
  4280	 */
  4281	int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
  4282				struct ext4_map_blocks *map, int flags)
  4283	{
  4284		struct ext4_ext_path *path = NULL;
  4285		struct ext4_extent newex, *ex, *ex2;
  4286		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
  4287		ext4_fsblk_t newblock = 0;
  4288		int free_on_err = 0, err = 0, depth, ret;
  4289		unsigned int allocated = 0, offset = 0;
  4290		unsigned int allocated_clusters = 0;
  4291		struct ext4_allocation_request ar;
  4292		ext4_lblk_t cluster_offset;
  4293		bool map_from_cluster = false;
  4294	
  4295		ext_debug("blocks %u/%u requested for inode %lu\n",
  4296			  map->m_lblk, map->m_len, inode->i_ino);
  4297		trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
  4298	
  4299		/* find extent for this block */
  4300		path = ext4_find_extent(inode, map->m_lblk, NULL, 0);
  4301		if (IS_ERR(path)) {
  4302			err = PTR_ERR(path);
  4303			path = NULL;
  4304			goto out2;
  4305		}
  4306	
  4307		depth = ext_depth(inode);
  4308	
  4309		/*
  4310		 * consistent leaf must not be empty;
  4311		 * this situation is possible, though, _during_ tree modification;
  4312		 * this is why assert can't be put in ext4_find_extent()
  4313		 */
  4314		if (unlikely(path[depth].p_ext == NULL && depth != 0)) {
  4315			EXT4_ERROR_INODE(inode, "bad extent address "
  4316					 "lblock: %lu, depth: %d pblock %lld",
  4317					 (unsigned long) map->m_lblk, depth,
  4318					 path[depth].p_block);
  4319			err = -EFSCORRUPTED;
  4320			goto out2;
  4321		}
  4322	
  4323		ex = path[depth].p_ext;
  4324		if (ex) {
  4325			ext4_lblk_t ee_block = le32_to_cpu(ex->ee_block);
  4326			ext4_fsblk_t ee_start = ext4_ext_pblock(ex);
  4327			unsigned short ee_len;
  4328	
  4329	
  4330			/*
  4331			 * unwritten extents are treated as holes, except that
  4332			 * we split out initialized portions during a write.
  4333			 */
  4334			ee_len = ext4_ext_get_actual_len(ex);
  4335	
  4336			trace_ext4_ext_show_extent(inode, ee_block, ee_start, ee_len);
  4337	
  4338			/* if found extent covers block, simply return it */
  4339			if (in_range(map->m_lblk, ee_block, ee_len)) {
  4340				newblock = map->m_lblk - ee_block + ee_start;
  4341				/* number of remaining blocks in the extent */
  4342				allocated = ee_len - (map->m_lblk - ee_block);
  4343				ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
  4344					  ee_block, ee_len, newblock);
  4345	
  4346				/*
  4347				 * If the extent is initialized check whether the
  4348				 * caller wants to convert it to unwritten.
  4349				 */
  4350				if ((!ext4_ext_is_unwritten(ex)) &&
  4351				    (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) {
  4352					allocated = convert_initialized_extent(
  4353							handle, inode, map, &path,
  4354							allocated);
  4355					goto out2;
  4356				} else if (!ext4_ext_is_unwritten(ex))
  4357					goto out;
  4358	
  4359				ret = ext4_ext_handle_unwritten_extents(
  4360					handle, inode, map, &path, flags,
  4361					allocated, newblock);
  4362				if (ret < 0)
  4363					err = ret;
  4364				else
  4365					allocated = ret;
  4366				goto out2;
  4367			}
  4368		}
  4369	
  4370		/*
  4371		 * requested block isn't allocated yet;
  4372		 * we couldn't try to create block if create flag is zero
  4373		 */
  4374		if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
  4375			ext4_lblk_t hole_start, hole_len;
  4376	
  4377			hole_start = map->m_lblk;
  4378			hole_len = ext4_ext_determine_hole(inode, path, &hole_start);
  4379			/*
  4380			 * put just found gap into cache to speed up
  4381			 * subsequent requests
  4382			 */
  4383			ext4_ext_put_gap_in_cache(inode, hole_start, hole_len);
  4384	
  4385			/* Update hole_len to reflect hole size after map->m_lblk */
  4386			if (hole_start != map->m_lblk)
  4387				hole_len -= map->m_lblk - hole_start;
  4388			map->m_pblk = 0;
  4389			map->m_len = min_t(unsigned int, map->m_len, hole_len);
  4390	
  4391			goto out2;
  4392		}
  4393	
  4394		/*
  4395		 * Okay, we need to do block allocation.
  4396		 */
  4397		newex.ee_block = cpu_to_le32(map->m_lblk);
  4398		cluster_offset = EXT4_LBLK_COFF(sbi, map->m_lblk);
  4399	
  4400		/*
  4401		 * If we are doing bigalloc, check to see if the extent returned
  4402		 * by ext4_find_extent() implies a cluster we can use.
  4403		 */
  4404		if (cluster_offset && ex &&
  4405		    get_implied_cluster_alloc(inode->i_sb, map, ex, path)) {
  4406			ar.len = allocated = map->m_len;
  4407			newblock = map->m_pblk;
  4408			map_from_cluster = true;
  4409			goto got_allocated_blocks;
  4410		}
  4411	
  4412		/* find neighbour allocated blocks */
  4413		ar.lleft = map->m_lblk;
  4414		err = ext4_ext_search_left(inode, path, &ar.lleft, &ar.pleft);
  4415		if (err)
  4416			goto out2;
  4417		ar.lright = map->m_lblk;
  4418		ex2 = NULL;
  4419		err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2);
  4420		if (err)
  4421			goto out2;
  4422	
  4423		/* Check if the extent after searching to the right implies a
  4424		 * cluster we can use. */
  4425		if ((sbi->s_cluster_ratio > 1) && ex2 &&
  4426		    get_implied_cluster_alloc(inode->i_sb, map, ex2, path)) {
  4427			ar.len = allocated = map->m_len;
  4428			newblock = map->m_pblk;
  4429			map_from_cluster = true;
  4430			goto got_allocated_blocks;
  4431		}
  4432	
  4433		/*
  4434		 * See if request is beyond maximum number of blocks we can have in
  4435		 * a single extent. For an initialized extent this limit is
  4436		 * EXT_INIT_MAX_LEN and for an unwritten extent this limit is
  4437		 * EXT_UNWRITTEN_MAX_LEN.
  4438		 */
  4439		if (map->m_len > EXT_INIT_MAX_LEN &&
  4440		    !(flags & EXT4_GET_BLOCKS_UNWRIT_EXT))
  4441			map->m_len = EXT_INIT_MAX_LEN;
  4442		else if (map->m_len > EXT_UNWRITTEN_MAX_LEN &&
  4443			 (flags & EXT4_GET_BLOCKS_UNWRIT_EXT))
  4444			map->m_len = EXT_UNWRITTEN_MAX_LEN;
  4445	
  4446		/* Check if we can really insert (m_lblk)::(m_lblk + m_len) extent */
  4447		newex.ee_len = cpu_to_le16(map->m_len);
  4448		err = ext4_ext_check_overlap(sbi, inode, &newex, path);
  4449		if (err)
  4450			allocated = ext4_ext_get_actual_len(&newex);
  4451		else
  4452			allocated = map->m_len;
  4453	
  4454		/* allocate new block */
  4455		ar.inode = inode;
  4456		ar.goal = ext4_ext_find_goal(inode, path, map->m_lblk);
  4457		ar.logical = map->m_lblk;
  4458		/*
  4459		 * We calculate the offset from the beginning of the cluster
  4460		 * for the logical block number, since when we allocate a
  4461		 * physical cluster, the physical block should start at the
  4462		 * same offset from the beginning of the cluster.  This is
  4463		 * needed so that future calls to get_implied_cluster_alloc()
  4464		 * work correctly.
  4465		 */
  4466		offset = EXT4_LBLK_COFF(sbi, map->m_lblk);
  4467		ar.len = EXT4_NUM_B2C(sbi, offset+allocated);
  4468		ar.goal -= offset;
  4469		ar.logical -= offset;
  4470		if (S_ISREG(inode->i_mode))
  4471			ar.flags = EXT4_MB_HINT_DATA;
  4472		else
  4473			/* disable in-core preallocation for non-regular files */
  4474			ar.flags = 0;
  4475		if (flags & EXT4_GET_BLOCKS_NO_NORMALIZE)
  4476			ar.flags |= EXT4_MB_HINT_NOPREALLOC;
  4477		if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
  4478			ar.flags |= EXT4_MB_DELALLOC_RESERVED;
  4479		if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
  4480			ar.flags |= EXT4_MB_USE_RESERVED;
  4481		newblock = ext4_mb_new_blocks(handle, &ar, &err);
  4482		if (!newblock)
  4483			goto out2;
  4484		ext_debug("allocate new block: goal %llu, found %llu/%u\n",
  4485			  ar.goal, newblock, allocated);
  4486		free_on_err = 1;
  4487		allocated_clusters = ar.len;
  4488		ar.len = EXT4_C2B(sbi, ar.len) - offset;
  4489		if (ar.len > allocated)
  4490			ar.len = allocated;
  4491	
  4492	got_allocated_blocks:
> 4493		if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) && sbi->fallocate) {
  4494			err = sb_issue_assign_range(inode->i_sb, newblock,
  4495				EXT4_C2B(sbi, allocated_clusters), GFP_NOFS);
  4496			if (err)
  4497				goto free_on_err;
  4498		}
  4499	
  4500		/* try to insert new extent into found leaf and return */
  4501		ext4_ext_store_pblock(&newex, newblock + offset);
  4502		newex.ee_len = cpu_to_le16(ar.len);
  4503		/* Mark unwritten */
  4504		if (flags & EXT4_GET_BLOCKS_UNWRIT_EXT){
  4505			ext4_ext_mark_unwritten(&newex);
  4506			map->m_flags |= EXT4_MAP_UNWRITTEN;
  4507		}
  4508	
  4509		err = 0;
  4510		if ((flags & EXT4_GET_BLOCKS_KEEP_SIZE) == 0)
  4511			err = check_eofblocks_fl(handle, inode, map->m_lblk,
  4512						 path, ar.len);
  4513		if (!err)
  4514			err = ext4_ext_insert_extent(handle, inode, &path,
  4515						     &newex, flags);
  4516	free_on_err:
  4517		if (err && free_on_err) {
  4518			int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
  4519				EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
  4520			/* free data blocks we just allocated */
  4521			/* not a good idea to call discard here directly,
  4522			 * but otherwise we'd need to call it every free() */
  4523			ext4_discard_preallocations(inode);
  4524			ext4_free_blocks(handle, inode, NULL, newblock,
  4525					 EXT4_C2B(sbi, allocated_clusters), fb_flags);
  4526			goto out2;
  4527		}
  4528	
  4529		/* previous routine could use block we allocated */
  4530		newblock = ext4_ext_pblock(&newex);
  4531		allocated = ext4_ext_get_actual_len(&newex);
  4532		if (allocated > map->m_len)
  4533			allocated = map->m_len;
  4534		map->m_flags |= EXT4_MAP_NEW;
  4535	
  4536		/*
  4537		 * Reduce the reserved cluster count to reflect successful deferred
  4538		 * allocation of delayed allocated clusters or direct allocation of
  4539		 * clusters discovered to be delayed allocated.  Once allocated, a
  4540		 * cluster is not included in the reserved count.
  4541		 */
  4542		if (test_opt(inode->i_sb, DELALLOC) && !map_from_cluster) {
  4543			if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
  4544				/*
  4545				 * When allocating delayed allocated clusters, simply
  4546				 * reduce the reserved cluster count and claim quota
  4547				 */
  4548				ext4_da_update_reserve_space(inode, allocated_clusters,
  4549								1);
  4550			} else {
  4551				ext4_lblk_t lblk, len;
  4552				unsigned int n;
  4553	
  4554				/*
  4555				 * When allocating non-delayed allocated clusters
  4556				 * (from fallocate, filemap, DIO, or clusters
  4557				 * allocated when delalloc has been disabled by
  4558				 * ext4_nonda_switch), reduce the reserved cluster
  4559				 * count by the number of allocated clusters that
  4560				 * have previously been delayed allocated.  Quota
  4561				 * has been claimed by ext4_mb_new_blocks() above,
  4562				 * so release the quota reservations made for any
  4563				 * previously delayed allocated clusters.
  4564				 */
  4565				lblk = EXT4_LBLK_CMASK(sbi, map->m_lblk);
  4566				len = allocated_clusters << sbi->s_cluster_bits;
  4567				n = ext4_es_delayed_clu(inode, lblk, len);
  4568				if (n > 0)
  4569					ext4_da_update_reserve_space(inode, (int) n, 0);
  4570			}
  4571		}
  4572	
  4573		/*
  4574		 * Cache the extent and update transaction to commit on fdatasync only
  4575		 * when it is _not_ an unwritten extent.
  4576		 */
  4577		if ((flags & EXT4_GET_BLOCKS_UNWRIT_EXT) == 0)
  4578			ext4_update_inode_fsync_trans(handle, inode, 1);
  4579		else
  4580			ext4_update_inode_fsync_trans(handle, inode, 0);
  4581	out:
  4582		if (allocated > map->m_len)
  4583			allocated = map->m_len;
  4584		ext4_ext_show_leaf(inode, path);
  4585		map->m_flags |= EXT4_MAP_MAPPED;
  4586		map->m_pblk = newblock;
  4587		map->m_len = allocated;
  4588	out2:
  4589		ext4_ext_drop_refs(path);
  4590		kfree(path);
  4591	
  4592		trace_ext4_ext_map_blocks_exit(inode, flags, map,
  4593					       err ? err : allocated);
  4594		return err ? err : allocated;
  4595	}
  4596	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org Intel Corporation

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 34732 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals
  2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
                   ` (3 preceding siblings ...)
  2019-12-11  7:42 ` [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Chaitanya Kulkarni
@ 2019-12-17 14:16 ` Kirill Tkhai
  4 siblings, 0 replies; 23+ messages in thread
From: Kirill Tkhai @ 2019-12-17 14:16 UTC (permalink / raw)
  To: linux-block, linux-kernel, linux-ext4, axboe
  Cc: tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence

Hi!

Any comments on this?

Thanks

On 10.12.2019 19:56, Kirill Tkhai wrote:
> Information about continuous extent placement may be useful
> for some block devices. Say, distributed network filesystems,
> which provide block device interface, may use this information
> for better blocks placement over the nodes in their cluster,
> and for better performance. Block devices, which map a file
> on another filesystem (loop), may request the same length extent
> on underlining filesystem for less fragmentation and for batching
> allocation requests. Also, hypervisors like QEMU may use this
> information for optimization of cluster allocations.
> 
> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
> to be used for forwarding user's fallocate(0) requests into
> block device internals. It rather similar to existing
> REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES, etc. The corresponding
> exported primitive is called blkdev_issue_assign_range().
> See [1/3] for the details.
> 
> Patch [2/3] teaches loop driver to handle REQ_OP_ASSIGN_RANGE
> requests by calling fallocate(0).
> 
> Patch [3/3] makes ext4 to notify a block device about fallocate(0).
> 
> Here is a simple test I did:
> https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856
> 
> I attached a file on ext4 to loop. Then, created ext4 partition
> on loop device and started the test in the partition. Direct-io
> is enabled on loop.
> 
> The test fallocates 4G file and writes from some offset with
> given step, then it chooses another offset and repeats. After
> the test all the blocks in the file become written.
> 
> The results shows that batching extents-assigning requests improves
> the performance:
> 
> Before patchset: real ~ 1min 27sec
> After patchset:  real ~ 1min 16sec (18% better)
> 
> Ordinary fallocate() before writes improves the performance
> by batching the requests. These results just show, the same
> is in case of forwarding extents information to underlining
> filesystem.
> ---
> 
> Kirill Tkhai (3):
>       block: Add support for REQ_OP_ASSIGN_RANGE operation
>       loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
>       ext4: Notify block device about fallocate(0)-assigned blocks
> 
> 
>  block/blk-core.c          |    4 +++
>  block/blk-lib.c           |   70 +++++++++++++++++++++++++++++++++++++++++++++
>  block/blk-merge.c         |   21 ++++++++++++++
>  block/bounce.c            |    1 +
>  drivers/block/loop.c      |    5 +++
>  fs/ext4/ext4.h            |    1 +
>  fs/ext4/extents.c         |   11 ++++++-
>  include/linux/bio.h       |    3 ++
>  include/linux/blk_types.h |    2 +
>  include/linux/blkdev.h    |   29 +++++++++++++++++++
>  10 files changed, 145 insertions(+), 2 deletions(-)
> 
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-10 16:56 ` [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation Kirill Tkhai
@ 2019-12-19  3:03   ` Martin K. Petersen
  2019-12-19 11:07     ` Kirill Tkhai
  0 siblings, 1 reply; 23+ messages in thread
From: Martin K. Petersen @ 2019-12-19  3:03 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-block, linux-kernel, linux-ext4, axboe, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence


Hi Kirill!

> The patch adds a new blkdev_issue_assign_range() primitive, which is
> rather similar to existing blkdev_issue_{*} api.  Also, a new queue
> limit.max_assign_range_sectors is added.

I am not so keen on the assign_range name. What's wrong with "allocate"?

But why introduce a completely new operation? Isn't this essentially a
write zeroes with BLKDEV_ZERO_NOUNMAP flag set?

If the zeroing aspect is perceived to be a problem we could add a
BLKDEV_ZERO_ALLOCATE flag (or BLKDEV_ZERO_ANCHOR since that's the
terminology used in SCSI).

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-19  3:03   ` Martin K. Petersen
@ 2019-12-19 11:07     ` Kirill Tkhai
  2019-12-19 22:03       ` Chaitanya Kulkarni
  2019-12-19 22:37       ` Martin K. Petersen
  0 siblings, 2 replies; 23+ messages in thread
From: Kirill Tkhai @ 2019-12-19 11:07 UTC (permalink / raw)
  To: Martin K. Petersen, axboe
  Cc: linux-block, linux-kernel, linux-ext4, tytso, adilger.kernel,
	ming.lei, osandov, jthumshirn, minwoo.im.dev, damien.lemoal,
	andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence

Hi, Martin!

On 19.12.2019 06:03, Martin K. Petersen wrote:
> 
> Hi Kirill!
> 
>> The patch adds a new blkdev_issue_assign_range() primitive, which is
>> rather similar to existing blkdev_issue_{*} api.  Also, a new queue
>> limit.max_assign_range_sectors is added.
> 
> I am not so keen on the assign_range name. What's wrong with "allocate"?

REQ_OP_ALLOCATE_RANGE seemed to me to look very long for the reviewers.
And I found that there are no abbreviated operation names in enum req_opf,
so REQ_OP_ALLOC_RANGE wouldn't look good. Thus, I found a replacement.

But if the length of REQ_OP_ALLOCATE_RANGE is OK for people, there is
no problem choosing it.

> But why introduce a completely new operation? Isn't this essentially a
> write zeroes with BLKDEV_ZERO_NOUNMAP flag set?
> 
> If the zeroing aspect is perceived to be a problem we could add a
> BLKDEV_ZERO_ALLOCATE flag (or BLKDEV_ZERO_ANCHOR since that's the
> terminology used in SCSI).

Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
So, do I understand correctly that we should do the two things below?

1) Introduce a new flag BLKDEV_ZERO_ALLOCATE for blkdev_issue_write_zeroes().
2) Introduce a new flag REQ_NOZERO in enum req_opf.

Won't it confuse a reader that blkdev_issue_write_zeroes() sometimes
does not write zeroes? Maybe we should rename blkdev_issue_write_zeroes()
to some more generic name?

Thanks,
Kirill


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-19 11:07     ` Kirill Tkhai
@ 2019-12-19 22:03       ` Chaitanya Kulkarni
  2019-12-19 22:37       ` Martin K. Petersen
  1 sibling, 0 replies; 23+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-19 22:03 UTC (permalink / raw)
  To: Kirill Tkhai, Martin K. Petersen
  Cc: axboe, linux-block, linux-kernel, linux-ext4, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	Damien Le Moal, andrea.parri, hare, tj, Ajay Joshi, sagi,
	dsterba, bvanassche, dhowells, asml.silence

> 1)Introduce a new flag BLKDEV_ZERO_ALLOCATE for blkdev_issue_write_zeroes().
> 2)Introduce a new flag REQ_NOZERO in enum req_opf.
>
> Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
> which does not write zeroes sometimes? Maybe we should rename
> blkdev_issue_write_zeroes() in some more generic name?

Yes, it will be confusing. I can see that the code for
__blkdev_issue_write_zeroes() & __blkdev_issue_assign_range()
is very similar except for op = REQ_OP_WRITE_ZEROES vs.
op = REQ_OP_ASSIGN_RANGE (and some minor details).

What we need is a prep patch which moves the payload-less bio
code into a helper function that accepts the op as an argument
and, based on that, branches out and executes the right code path.

If we decide to get this in, then I'll be happy to create the
required prep patch.
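
A rough sketch of that prep-patch idea (the helper name
__blkdev_issue_payloadless() is hypothetical, and the loop simply
mirrors the one already used by __blkdev_issue_write_zeroes()):

static int __blkdev_issue_payloadless(struct block_device *bdev,
		unsigned int opf, sector_t sector, sector_t nr_sects,
		gfp_t gfp_mask, struct bio **biop, unsigned int max_sectors)
{
	struct bio *bio = *biop;

	while (nr_sects) {
		/* chain a fresh bio with no data pages attached */
		bio = blk_next_bio(bio, 0, gfp_mask);
		bio->bi_iter.bi_sector = sector;
		bio_set_dev(bio, bdev);
		/* e.g. REQ_OP_WRITE_ZEROES or REQ_OP_ASSIGN_RANGE */
		bio->bi_opf = opf;

		if (nr_sects > max_sectors) {
			bio->bi_iter.bi_size = max_sectors << 9;
			nr_sects -= max_sectors;
			sector += max_sectors;
		} else {
			bio->bi_iter.bi_size = nr_sects << 9;
			nr_sects = 0;
		}
		cond_resched();
	}

	*biop = bio;
	return 0;
}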
>
> Thanks,
> Kirill
>
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-19 11:07     ` Kirill Tkhai
  2019-12-19 22:03       ` Chaitanya Kulkarni
@ 2019-12-19 22:37       ` Martin K. Petersen
  2019-12-20  1:53         ` Darrick J. Wong
  2019-12-20 11:55         ` Kirill Tkhai
  1 sibling, 2 replies; 23+ messages in thread
From: Martin K. Petersen @ 2019-12-19 22:37 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Martin K. Petersen, axboe, linux-block, linux-kernel, linux-ext4,
	tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence


Kirill,

> Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
> So, do I understand correctly that we should do the two things below?
>
> 1) Introduce a new flag BLKDEV_ZERO_ALLOCATE for
> blkdev_issue_write_zeroes().

> 2) Introduce a new flag REQ_NOZERO in enum req_opf.

Something like that. If zeroing is a problem for you.

Right now we offer the following semantics:

	Deallocate, no zeroing (discard)

	Optionally deallocate, zeroing (zeroout)

	Allocate, zeroing (zeroout + NOUNMAP)

Some devices also implement a fourth option which would be:

	Anchor: Allocate, no zeroing
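
For concreteness, here is roughly how those four cases map onto the
existing helpers (a sketch only; BLKDEV_ZERO_ALLOCATE is the flag
proposed above and does not exist yet):

	static void semantics_examples(struct block_device *bdev,
				       sector_t sector, sector_t nr_sects)
	{
		/* deallocate, no zeroing */
		blkdev_issue_discard(bdev, sector, nr_sects, GFP_NOFS, 0);
		/* zeroing, optionally deallocate */
		blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_NOFS, 0);
		/* allocate, zeroing */
		blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_NOFS,
				     BLKDEV_ZERO_NOUNMAP);
		/* anchor: allocate, no zeroing (proposed) */
		blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_NOFS,
				     BLKDEV_ZERO_ALLOCATE);
	}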

> Won't it confuse a reader that blkdev_issue_write_zeroes() sometimes
> does not write zeroes? Maybe we should rename blkdev_issue_write_zeroes()
> to some more generic name?

Maybe. The naming is what it is for hysterical raisins and reflects how
things are implemented in the storage protocols. I wouldn't worry too
much about that. We can rename things if need be but we shouldn't plumb
an essentially identical operation through the block stack just to
expose a different name at the top.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-19 22:37       ` Martin K. Petersen
@ 2019-12-20  1:53         ` Darrick J. Wong
  2019-12-20  2:22           ` Martin K. Petersen
  2019-12-20 11:55         ` Kirill Tkhai
  1 sibling, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2019-12-20  1:53 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Kirill Tkhai, axboe, linux-block, linux-kernel, linux-ext4,
	tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence

On Thu, Dec 19, 2019 at 05:37:47PM -0500, Martin K. Petersen wrote:
> 
> Kirill,
> 
> > Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
> > So, do I understand correctly that we should do the two things below?
> >
> > 1) Introduce a new flag BLKDEV_ZERO_ALLOCATE for
> > blkdev_issue_write_zeroes().
> 
> > 2) Introduce a new flag REQ_NOZERO in enum req_opf.
> 
> Something like that. If zeroing is a problem for you.
> 
> Right now we offer the following semantics:
> 
> 	Deallocate, no zeroing (discard)
> 
> 	Optionally deallocate, zeroing (zeroout)
> 
> 	Allocate, zeroing (zeroout + NOUNMAP)
> 
> Some devices also implement a fourth option which would be:
> 
> 	Anchor: Allocate, no zeroing

What happens if you anchor and then try to read the results?  IO error?

(Yeah, I'm being lazy and not digging through SBC-3, sorry...)

--D

> > Won't it confuse a reader that blkdev_issue_write_zeroes() sometimes
> > does not write zeroes? Maybe we should rename blkdev_issue_write_zeroes()
> > to some more generic name?
> 
> Maybe. The naming is what it is for hysterical raisins and reflects how
> things are implemented in the storage protocols. I wouldn't worry too
> much about that. We can rename things if need be but we shouldn't plumb
> an essentially identical operation through the block stack just to
> expose a different name at the top.
> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-20  1:53         ` Darrick J. Wong
@ 2019-12-20  2:22           ` Martin K. Petersen
  0 siblings, 0 replies; 23+ messages in thread
From: Martin K. Petersen @ 2019-12-20  2:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Martin K. Petersen, Kirill Tkhai, axboe, linux-block,
	linux-kernel, linux-ext4, tytso, adilger.kernel, ming.lei,
	osandov, jthumshirn, minwoo.im.dev, damien.lemoal, andrea.parri,
	hare, tj, ajay.joshi, sagi, dsterba, chaitanya.kulkarni,
	bvanassche, dhowells, asml.silence


Darrick,

[Anchor]

> What happens if you anchor and then try to read the results?  IO
> error?

Depends on the device. Typically you get the same results as UNMAP.

Anchoring is essentially a way to preallocate blocks on a thinly
provisioned device. If you then run out of actual storage capacity you
won't get a write failure writing to those LBAs. It is intended to
protect block ranges that do not tolerate write failures (journals,
metadata).
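
As a concrete userspace illustration of that use case: assuming the
FALLOC_FL_KEEP_SIZE wiring for block devices that is proposed later in
this thread, an application could anchor the LBA range its journal will
occupy, so that later writes there cannot fail for lack of backing
store. A minimal sketch:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Preallocate (anchor) [offset, offset + len) on a thin block device. */
static int anchor_range(const char *dev, off_t offset, off_t len)
{
	int fd = open(dev, O_RDWR);
	int ret;

	if (fd < 0)
		return -1;
	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len);
	close(fd);
	return ret;
}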

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-19 22:37       ` Martin K. Petersen
  2019-12-20  1:53         ` Darrick J. Wong
@ 2019-12-20 11:55         ` Kirill Tkhai
  2019-12-21 18:54           ` Martin K. Petersen
  1 sibling, 1 reply; 23+ messages in thread
From: Kirill Tkhai @ 2019-12-20 11:55 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: axboe, linux-block, linux-kernel, linux-ext4, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence

Hi, Martin,

On 20.12.2019 01:37, Martin K. Petersen wrote:
> 
> Kirill,
> 
>> Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
>> So, do I understand correctly that we should do the two things below?
>>
>> 1) Introduce a new flag BLKDEV_ZERO_ALLOCATE for
>> blkdev_issue_write_zeroes().
> 
>> 2) Introduce a new flag REQ_NOZERO in enum req_opf.
> 
> Something like that. If zeroing is a problem for you.

My intention is to use this in fs allocators to notify virtual block devices
about allocated blocks (like in patch [3/3]). Filesystem allocators know about
written and unwritten extents, and they don't need zeroing of allocated blocks.

Since a block range allocation is less complicated (and faster) than
allocation plus zeroing of the allocated blocks (at least for some devices),
we simply choose the faster one. This is the reason we avoid zeroing.

> Right now we offer the following semantics:
> 
> 	Deallocate, no zeroing (discard)
> 
> 	Optionally deallocate, zeroing (zeroout)
> 
> 	Allocate, zeroing (zeroout + NOUNMAP)
> 
> Some devices also implement a fourth option which would be:
> 
> 	Anchor: Allocate, no zeroing
> 
>> Won't it confuse a reader that blkdev_issue_write_zeroes() sometimes
>> does not write zeroes? Maybe we should rename blkdev_issue_write_zeroes()
>> to some more generic name?
> 
> Maybe. The naming is what it is for hysterical raisins and reflects how
> things are implemented in the storage protocols. I wouldn't worry too
> much about that. We can rename things if need be but we shouldn't plumb
> an essentially identical operation through the block stack just to
> expose a different name at the top.

Not introducing a new operation is a good thing, especially since we don't
need a specific max_xxx_xxx_sectors != max_write_zeroes_sectors for it.

I'll rework the patch in this way (it seems it will become pretty small
after that).

One more thing to discuss. The new REQ_NOZERO flag won't be supported
by many block devices (their number will be even smaller than the number
of REQ_OP_WRITE_ZEROES supporters). Would it be a good thing to complete
BLKDEV_ZERO_ALLOCATE bios in __blkdev_issue_write_zeroes() before
splitting? I mean introducing some flag in struct request_queue::limits.
Completing them with -EOPNOTSUPP in block device drivers looks
suboptimal to me.

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-20 11:55         ` Kirill Tkhai
@ 2019-12-21 18:54           ` Martin K. Petersen
  2019-12-23  8:51             ` Kirill Tkhai
  0 siblings, 1 reply; 23+ messages in thread
From: Martin K. Petersen @ 2019-12-21 18:54 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Martin K. Petersen, axboe, linux-block, linux-kernel, linux-ext4,
	tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence


Kirill,

> One more thing to discuss. The new REQ_NOZERO flag won't be supported
> by many block devices (their number will be even smaller than the
> number of REQ_OP_WRITE_ZEROES supporters). Would it be a good thing to
> complete BLKDEV_ZERO_ALLOCATE bios in __blkdev_issue_write_zeroes()
> before splitting? I mean introducing some flag in struct
> request_queue::limits. Completing them with -EOPNOTSUPP in block
> device drivers looks suboptimal to me.

We already have the NOFALLBACK flag to let the user make that decision.

If that flag is not specified, and I receive an allocate request for a
SCSI device that does not support ANCHOR, my expectation would be that I
would do a regular write same.

If it's a filesystem that is the recipient of the operation and not a
SCSI device, how to react would depend on how the filesystem handles
unwritten extents, etc.
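
In other words, the fallback decision in the allocate path would look
roughly like this (a sketch only; supports_anchor stands in for
whatever capability check the driver ends up exposing):

	/* inside the allocate path, before issuing the bio (sketch): */
	if (!supports_anchor) {
		if (flags & BLKDEV_ZERO_NOFALLBACK)
			return -EOPNOTSUPP;	/* caller forbade the fallback */
		/* otherwise fall back to an actual zeroing write (WRITE SAME) */
		flags &= ~BLKDEV_ZERO_ALLOCATE;
	}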

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-21 18:54           ` Martin K. Petersen
@ 2019-12-23  8:51             ` Kirill Tkhai
  2020-01-07  3:24               ` Martin K. Petersen
  0 siblings, 1 reply; 23+ messages in thread
From: Kirill Tkhai @ 2019-12-23  8:51 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: axboe, linux-block, linux-kernel, linux-ext4, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence

On 21.12.2019 21:54, Martin K. Petersen wrote:
> 
> Kirill,
> 
>> One more thing to discuss. The new REQ_NOZERO flag won't be supported
>> by many block devices (their number will be even smaller than the
>> number of REQ_OP_WRITE_ZEROES supporters). Would it be a good thing to
>> complete BLKDEV_ZERO_ALLOCATE bios in __blkdev_issue_write_zeroes()
>> before splitting? I mean introducing some flag in struct
>> request_queue::limits. Completing them with -EOPNOTSUPP in block
>> device drivers looks suboptimal to me.
> 
> We already have the NOFALLBACK flag to let the user make that decision.
> 
> If that flag is not specified, and I receive an allocate request for a
> SCSI device that does not support ANCHOR, my expectation would be that I
> would do a regular write same.
>
> If it's a filesystem that is the recipient of the operation and not a
> SCSI device, how to react would depend on how the filesystem handles
> unwritten extents, etc.

OK, this case is clear to me, thanks.

But I also worry about the NOFALLBACK case. There are block devices
which support write zeroes but can't allocate blocks (block allocation
is simply not applicable to them; say, all ordinary HDDs).

Let's say a user called fallocate(), and the filesystem allocated a range of
blocks. Then the filesystem propagates the range to the block device and
calls zeroout:

blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
		     GFP_NOIO, BLKDEV_ZERO_ALLOCATE|BLKDEV_ZERO_NOFALLBACK);

In this case the filesystem does not want the blocks zeroed; it just wants to
send a hint to the block device. So, if the block device supports allocation,
everything is OK.

But wouldn't it be a good thing to return -EOPNOTSUPP right from
__blkdev_issue_write_zeroes() when the block device can't allocate blocks
(q->limits.write_zeroes_can_allocate in the patch below)? This is just a way
to mark block devices which support write zeroes but for which block
allocation means nothing (a waste of time).

What do you think about the below?

Thanks
---
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..524b47905fd5 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -214,7 +214,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		struct bio **biop, unsigned flags)
 {
 	struct bio *bio = *biop;
-	unsigned int max_write_zeroes_sectors;
+	unsigned int max_write_zeroes_sectors, req_flags = 0;
 	struct request_queue *q = bdev_get_queue(bdev);
 
 	if (!q)
@@ -229,13 +229,19 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 	if (max_write_zeroes_sectors == 0)
 		return -EOPNOTSUPP;
 
+	if (flags & BLKDEV_ZERO_NOUNMAP)
+		req_flags |= REQ_NOUNMAP;
+	if (flags & BLKDEV_ZERO_ALLOCATE) {
+		if (!q->limits.write_zeroes_can_allocate)
+			return -EOPNOTSUPP;
+		req_flags |= REQ_NOZERO|REQ_NOUNMAP;
+	}
+
 	while (nr_sects) {
 		bio = blk_next_bio(bio, 0, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
-		bio->bi_opf = REQ_OP_WRITE_ZEROES;
-		if (flags & BLKDEV_ZERO_NOUNMAP)
-			bio->bi_opf |= REQ_NOUNMAP;
+		bio->bi_opf = REQ_OP_WRITE_ZEROES | req_flags;
 
 		if (nr_sects > max_write_zeroes_sectors) {
 			bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..1ffef894b3bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2122,6 +2122,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
 		break;
+	case FALLOC_FL_KEEP_SIZE:
+		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+			GFP_KERNEL, BLKDEV_ZERO_ALLOCATE | BLKDEV_ZERO_NOFALLBACK);
+		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
 		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, 0);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..9ed166860099 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -335,7 +335,9 @@ enum req_flag_bits {
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
-
+	__REQ_NOZERO,		/* only notify about allocated blocks,
+				 * and do not actually zero them
+				 */
 	__REQ_HIPRI,
 
 	/* for driver use */
@@ -362,6 +364,7 @@ enum req_flag_bits {
 #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
+#define REQ_NOZERO		(1ULL << __REQ_NOZERO)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c45779f00cbd..9e3cd3394dd6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -347,6 +347,7 @@ struct queue_limits {
 	unsigned char		misaligned;
 	unsigned char		discard_misaligned;
 	unsigned char		raid_partial_stripes_expensive;
+	bool			write_zeroes_can_allocate;
 	enum blk_zoned_model	zoned;
 };
 
@@ -1219,6 +1220,7 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
+#define BLKDEV_ZERO_ALLOCATE	(1 << 2)  /* allocate range of blocks */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2019-12-23  8:51             ` Kirill Tkhai
@ 2020-01-07  3:24               ` Martin K. Petersen
  2020-01-07 13:59                 ` Kirill Tkhai
  0 siblings, 1 reply; 23+ messages in thread
From: Martin K. Petersen @ 2020-01-07  3:24 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Martin K. Petersen, axboe, linux-block, linux-kernel, linux-ext4,
	tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence


Kirill,

Sorry, the holiday break got in the way.

> But I also worry about the NOFALLBACK case. There are block devices
> which support write zeroes but can't allocate blocks (block allocation
> is simply not applicable to them; say, all ordinary HDDs).

Correct. We shouldn't go down this path unless a device is thinly
provisioned (i.e. max_discard_sectors > 0).

> But wouldn't it be a good thing to return -EOPNOTSUPP right from
> __blkdev_issue_write_zeroes() when the block device can't allocate
> blocks (q->limits.write_zeroes_can_allocate in the patch below)? This
> is just a way to mark block devices which support write zeroes but for
> which block allocation means nothing (a waste of time).

I don't like "write_zeroes_can_allocate" because that makes assumptions
about WRITE ZEROES being the command of choice. I suggest we call it
"max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
emphasis on the semantic operation and not the plumbing.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2020-01-07  3:24               ` Martin K. Petersen
@ 2020-01-07 13:59                 ` Kirill Tkhai
  2020-01-08  2:49                   ` Martin K. Petersen
  0 siblings, 1 reply; 23+ messages in thread
From: Kirill Tkhai @ 2020-01-07 13:59 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: axboe, linux-block, linux-kernel, linux-ext4, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence

On 07.01.2020 06:24, Martin K. Petersen wrote:
> 
> Kirill,
> 
> Sorry, the holiday break got in the way.
> 
>> But I also worry about the NOFALLBACK case. There are block devices
>> which support write zeroes but can't allocate blocks (block allocation
>> is simply not applicable to them; say, all ordinary HDDs).
> 
> Correct. We shouldn't go down this path unless a device is thinly
> provisioned (i.e. max_discard_sectors > 0).

(I assumed it is a typo, and you mean max_allocate_sectors like below).
 
>> But wouldn't it be a good thing to return -EOPNOTSUPP right from
>> __blkdev_issue_write_zeroes() when the block device can't allocate
>> blocks (q->limits.write_zeroes_can_allocate in the patch below)? This
>> is just a way to mark block devices which support write zeroes but for
>> which block allocation means nothing (a waste of time).
> 
> I don't like "write_zeroes_can_allocate" because that makes assumptions
> about WRITE ZEROES being the command of choice. I suggest we call it
> "max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
> emphasis on the semantic operation and not the plumbing.
 
Hm. Do you mean "bool max_allocate_sectors" or "unsigned int max_allocate_sectors"?
In the second case we should convert all the q->limits.max_write_zeroes_sectors
dereferences into switches like below (this is a partial patch; only several
places are converted to switches as examples):

diff --git a/block/blk-core.c b/block/blk-core.c
index 50a5de025d5e..4c45417838f7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -978,7 +978,8 @@ generic_make_request_checks(struct bio *bio)
 			goto not_supported;
 		break;
 	case REQ_OP_WRITE_ZEROES:
-		if (!q->limits.max_write_zeroes_sectors)
+		if (!blk_queue_get_max_write_zeroes_sectors(q,
+				    bio->bi_opf & REQ_NOZERO))
 			goto not_supported;
 		break;
 	default:
@@ -1250,10 +1251,10 @@ EXPORT_SYMBOL(submit_bio);
 static int blk_cloned_rq_check_limits(struct request_queue *q,
 				      struct request *rq)
 {
-	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) {
+	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, rq->cmd_flags)) {
 		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
 			__func__, blk_rq_sectors(rq),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+			blk_queue_get_max_sectors(q, rq->cmd_flags));
 		return -EIO;
 	}
 
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 347782a24a35..0cdf7d0386c8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -105,15 +105,22 @@ static struct bio *blk_bio_discard_split(struct request_queue *q,
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
+	unsigned int max_sectors;
+
+	if (bio->bi_opf & REQ_NOZERO)
+		max_sectors = q->limits.max_allocate_sectors;
+	else
+		max_sectors = q->limits.max_write_zeroes_sectors;
+
 	*nsegs = 0;
 
-	if (!q->limits.max_write_zeroes_sectors)
+	if (!max_sectors)
 		return NULL;
 
-	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+	if (bio_sectors(bio) <= max_sectors)
 		return NULL;
 
-	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+	return bio_split(bio, max_sectors, GFP_NOIO, bs);
 }
 
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5f6dcc7a47bd..3f68a60cb196 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -506,6 +506,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_allocate_sectors = min(t->max_allocate_sectors,
+					b->max_allocate_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9e3cd3394dd6..6219604a0c12 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_allocate_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -989,9 +989,19 @@ static inline struct bio_vec req_bvec(struct request *rq)
 	return mp_bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter);
 }
 
+static inline unsigned int blk_queue_get_max_write_zeroes_sectors(
+			struct request_queue *q, bool allocate)
+{
+	if (allocate)
+		return q->limits.max_allocate_sectors;
+	return q->limits.max_write_zeroes_sectors;
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
-						     int op)
+						     unsigned int op_flags)
 {
+	int op = op_flags & REQ_OP_MASK;
+
 	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
@@ -1000,7 +1010,8 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return q->limits.max_write_same_sectors;
 
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
-		return q->limits.max_write_zeroes_sectors;
+		return blk_queue_get_max_write_zeroes_sectors(q,
+					op_flags & REQ_NOZERO);
 
 	return q->limits.max_sectors;
 }

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2020-01-07 13:59                 ` Kirill Tkhai
@ 2020-01-08  2:49                   ` Martin K. Petersen
  2020-01-09  9:43                     ` Kirill Tkhai
  0 siblings, 1 reply; 23+ messages in thread
From: Martin K. Petersen @ 2020-01-08  2:49 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Martin K. Petersen, axboe, linux-block, linux-kernel, linux-ext4,
	tytso, adilger.kernel, ming.lei, osandov, jthumshirn,
	minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, ajay.joshi,
	sagi, dsterba, chaitanya.kulkarni, bvanassche, dhowells,
	asml.silence


Kirill,

>> Correct. We shouldn't go down this path unless a device is thinly
>> provisioned (i.e. max_discard_sectors > 0).
>
> (I assumed it is a typo, and you mean max_allocate_sectors like below).

No, this was in the context of not having an explicit queue limit for
allocation. If a device does not have max_discard_sectors > 0 then it is
not thinly provisioned and therefore attempting allocation makes no
sense.
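
So a driver would only advertise the capability on thin devices,
something like this sketch (max_allocate_sectors is the queue limit
suggested in this thread, not yet an existing field):

	/* thin provisioning implies a discard capability (sketch): */
	if (q->limits.max_discard_sectors > 0)
		q->limits.max_allocate_sectors =
				q->limits.max_write_zeroes_sectors;
	else
		q->limits.max_allocate_sectors = 0;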

>> I don't like "write_zeroes_can_allocate" because that makes assumptions
>> about WRITE ZEROES being the command of choice. I suggest we call it
>> "max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
>> emphasis on the semantic operation and not the plumbing.
>  
> Hm. Do you mean "bool max_allocate_sectors" or "unsigned int max_allocate_sectors"?

unsigned int. At least for SCSI we could have a device which would use
UNMAP for discards and WRITE SAME for allocates. And therefore the range
limit could be different for the two operations. Sadly.

I have a patch in the pipeline which deals with some problems in this
department because some devices have a split brain wrt. their discard
limits.

> In the second case we should convert all the
> q->limits.max_write_zeroes_sectors dereferences into switches like
> below (this is a partial patch; only several places are converted to
> switches as examples):

Something like that, yes.

This is getting a bit messy :( However, I am not sure that scattering
REQ_OP_ALLOCATE all over the I/O stack is particularly attractive
either.

Both REQ_OP_DISCARD and REQ_OP_WRITE_SAME come with some storage
protocol baggage that forces us to have special handling all over the
stack. But REQ_OP_WRITE_ZEROES is fairly clean and simple and, except
for the potentially different block count limit, an allocate operation
would be a carbon copy of the plumbing for write zeroes. A lot of
duplication.

So even though I'm increasingly torn on whether introducing separate
REQ_OP_ALLOCATE plumbing throughout the stack or having a REQ_ALLOCATE
flag for REQ_OP_WRITE_ZEROES is best, I still think I'm leaning towards
the latter. That will also make it easier for me in the SCSI disk
driver.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation
  2020-01-08  2:49                   ` Martin K. Petersen
@ 2020-01-09  9:43                     ` Kirill Tkhai
  0 siblings, 0 replies; 23+ messages in thread
From: Kirill Tkhai @ 2020-01-09  9:43 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: axboe, linux-block, linux-kernel, linux-ext4, tytso,
	adilger.kernel, ming.lei, osandov, jthumshirn, minwoo.im.dev,
	damien.lemoal, andrea.parri, hare, tj, ajay.joshi, sagi, dsterba,
	chaitanya.kulkarni, bvanassche, dhowells, asml.silence

On 08.01.2020 05:49, Martin K. Petersen wrote:
> 
> Kirill,
> 
>>> Correct. We shouldn't go down this path unless a device is thinly
>>> provisioned (i.e. max_discard_sectors > 0).
>>
>> (I assumed it is a typo, and you mean max_allocate_sectors like below).
> 
> No, this was in the context of not having an explicit queue limit for
> allocation. If a device does not have max_discard_sectors > 0 then it is
> not thinly provisioned and therefore attempting allocation makes no
> sense.
> 
>>> I don't like "write_zeroes_can_allocate" because that makes assumptions
>>> about WRITE ZEROES being the command of choice. I suggest we call it
>>> "max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
>>> emphasis on the semantic operation and not the plumbing.
>>  
>> Hm. Do you mean "bool max_allocate_sectors" or "unsigned int max_allocate_sectors"?
> 
> unsigned int. At least for SCSI we could have a device which would use
> UNMAP for discards and WRITE SAME for allocates. And therefore the range
> limit could be different for the two operations. Sadly.
> 
> I have a patch in the pipeline which deals with some problems in this
> department because some devices have a split brain wrt. their discard
> limits.
> 
>> In the second case we should convert all the
>> q->limits.max_write_zeroes_sectors dereferences into switches like
>> below (this is a partial patch; only several places are converted to
>> switches as examples):
> 
> Something like that, yes.
> 
> This is getting a bit messy :( However, I am not sure that scattering
> REQ_OP_ALLOCATE all over the I/O stack is particularly attractive
> either.

It looks like changing the second argument of blk_queue_get_max_write_zeroes_sectors()
gives almost clean code. All the REQ_NOZERO checks move into the helper.

> Both REQ_OP_DISCARD and REQ_OP_WRITE_SAME come with some storage
> protocol baggage that forces us to have special handling all over the
> stack. But REQ_OP_WRITE_ZEROES is fairly clean and simple and, except
> for the potentially different block count limit, an allocate operation
> would be a carbon copy of the plumbing for write zeroes. A lot of
> duplication.
> 
> So even though I'm increasingly torn on whether introducing separate
> REQ_OP_ALLOCATE plumbing throughout the stack or having a REQ_ALLOCATE
> flag for REQ_OP_WRITE_ZEROES is best, I still think I'm leaning towards
> the latter. That will also make it easier for me in the SCSI disk
> driver.

Yeah, this sounds good. Also, it looks like the REQ_NOZERO patch is really
simpler; it is even smaller.

Below is the complete patch for the block core (I've excluded hunks for
drivers/*, since they require some preparatory cleanup patches). Please let
me know whether I understood you correctly and everything is OK. Thanks!

[PATCH] block: Add support for REQ_NOZERO

This adds support for the REQ_NOZERO extension of the REQ_OP_WRITE_ZEROES
operation, which encourages a block device driver to just allocate
blocks (or mark them allocated) instead of actually zeroing them.
REQ_NOZERO is aimed at network filesystems providing
a block device interface. Also, block devices which map a file
on another filesystem (like loop) may use this for less fragmentation
and for batching fallocate() requests. Hypervisors like QEMU may
introduce optimizations of cluster allocations based on this.

BLKDEV_ZERO_ALLOCATE is a new corresponding flag for
blkdev_issue_zeroout().

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 block/blk-core.c          |    6 +++---
 block/blk-lib.c           |   17 ++++++++++-------
 block/blk-merge.c         |    9 ++++++---
 block/blk-settings.c      |    4 ++++
 fs/block_dev.c            |    4 ++++
 include/linux/blk_types.h |    5 ++++-
 include/linux/blkdev.h    |   32 ++++++++++++++++++++++++--------
 7 files changed, 55 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 50a5de025d5e..2edcd55624f1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -978,7 +978,7 @@ generic_make_request_checks(struct bio *bio)
 			goto not_supported;
 		break;
 	case REQ_OP_WRITE_ZEROES:
-		if (!q->limits.max_write_zeroes_sectors)
+		if (!blk_queue_get_max_write_zeroes_sectors(q, bio->bi_opf))
 			goto not_supported;
 		break;
 	default:
@@ -1250,10 +1250,10 @@ EXPORT_SYMBOL(submit_bio);
 static int blk_cloned_rq_check_limits(struct request_queue *q,
 				      struct request *rq)
 {
-	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) {
+	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, rq->cmd_flags)) {
 		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
 			__func__, blk_rq_sectors(rq),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+			blk_queue_get_max_sectors(q, rq->cmd_flags));
 		return -EIO;
 	}
 
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..3e80279eb029 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -214,7 +214,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		struct bio **biop, unsigned flags)
 {
 	struct bio *bio = *biop;
-	unsigned int max_write_zeroes_sectors;
+	unsigned int max_write_zeroes_sectors, req_flags = 0;
 	struct request_queue *q = bdev_get_queue(bdev);
 
 	if (!q)
@@ -224,18 +224,21 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		return -EPERM;
 
 	/* Ensure that max_write_zeroes_sectors doesn't overflow bi_size */
-	max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev);
+	max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev, flags);
 
 	if (max_write_zeroes_sectors == 0)
 		return -EOPNOTSUPP;
 
+	if (flags & BLKDEV_ZERO_NOUNMAP)
+		req_flags |= REQ_NOUNMAP;
+	if (flags & BLKDEV_ZERO_ALLOCATE)
+		req_flags |= REQ_NOZERO|REQ_NOUNMAP;
+
 	while (nr_sects) {
 		bio = blk_next_bio(bio, 0, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
-		bio->bi_opf = REQ_OP_WRITE_ZEROES;
-		if (flags & BLKDEV_ZERO_NOUNMAP)
-			bio->bi_opf |= REQ_NOUNMAP;
+		bio->bi_opf = REQ_OP_WRITE_ZEROES | req_flags;
 
 		if (nr_sects > max_write_zeroes_sectors) {
 			bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
@@ -362,7 +365,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	sector_t bs_mask;
 	struct bio *bio;
 	struct blk_plug plug;
-	bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev);
+	bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev, flags);
 
 	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
 	if ((sector | nr_sects) & bs_mask)
@@ -391,7 +394,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			try_write_zeroes = false;
 			goto retry;
 		}
-		if (!bdev_write_zeroes_sectors(bdev)) {
+		if (!bdev_write_zeroes_sectors(bdev, flags)) {
 			/*
 			 * Zeroing offload support was indicated, but the
 			 * device reported ILLEGAL REQUEST (for some devices
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 347782a24a35..e3ce4b87bbaa 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -105,15 +105,18 @@ static struct bio *blk_bio_discard_split(struct request_queue *q,
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
+	unsigned int max_sectors;
+
+	max_sectors = blk_queue_get_max_write_zeroes_sectors(q, bio->bi_opf);
 	*nsegs = 0;
 
-	if (!q->limits.max_write_zeroes_sectors)
+	if (!max_sectors)
 		return NULL;
 
-	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+	if (bio_sectors(bio) <= max_sectors)
 		return NULL;
 
-	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+	return bio_split(bio, max_sectors, GFP_NOIO, bs);
 }
 
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5f6dcc7a47bd..09a2eeacfadd 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_allocate_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_allocate_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -506,6 +508,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_allocate_sectors = min(t->max_allocate_sectors,
+					b->max_allocate_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..1ffef894b3bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2122,6 +2122,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
 		break;
+	case FALLOC_FL_KEEP_SIZE:
+		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+			GFP_KERNEL, BLKDEV_ZERO_ALLOCATE | BLKDEV_ZERO_NOFALLBACK);
+		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
 		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, 0);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..9ed166860099 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -335,7 +335,9 @@ enum req_flag_bits {
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
-
+	__REQ_NOZERO,		/* only notify about allocated blocks,
+				 * and do not actually zero them
+				 */
 	__REQ_HIPRI,
 
 	/* for driver use */
@@ -362,6 +364,7 @@ enum req_flag_bits {
 #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
+#define REQ_NOZERO		(1ULL << __REQ_NOZERO)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c45779f00cbd..7e388be625af 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_allocate_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -988,9 +989,19 @@ static inline struct bio_vec req_bvec(struct request *rq)
 	return mp_bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter);
 }
 
+static inline unsigned int blk_queue_get_max_write_zeroes_sectors(
+		struct request_queue *q, unsigned int op_flags)
+{
+	if (op_flags & REQ_NOZERO)
+		return q->limits.max_allocate_sectors;
+	return q->limits.max_write_zeroes_sectors;
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
-						     int op)
+						     unsigned int op_flags)
 {
+	int op = op_flags & REQ_OP_MASK;
+
 	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
@@ -999,7 +1010,7 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return q->limits.max_write_same_sectors;
 
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
-		return q->limits.max_write_zeroes_sectors;
+		return blk_queue_get_max_write_zeroes_sectors(q, op_flags);
 
 	return q->limits.max_sectors;
 }
@@ -1029,10 +1040,10 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
-		return blk_queue_get_max_sectors(q, req_op(rq));
+		return blk_queue_get_max_sectors(q, rq->cmd_flags);
 
 	return min(blk_max_size_offset(q, offset),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+			blk_queue_get_max_sectors(q, rq->cmd_flags));
 }
 
 static inline unsigned int blk_rq_count_bios(struct request *rq)
@@ -1219,6 +1230,7 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
+#define BLKDEV_ZERO_ALLOCATE	(1 << 2)  /* allocate range of blocks */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
@@ -1418,14 +1430,18 @@ static inline unsigned int bdev_write_same(struct block_device *bdev)
 	return 0;
 }
 
-static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
+static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev,
+						     unsigned int flags)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
 
-	if (q)
+	if (!q)
+		return 0;
+
+	if (flags & BLKDEV_ZERO_ALLOCATE)
+		return q->limits.max_allocate_sectors;
+	else
 		return q->limits.max_write_zeroes_sectors;
-
-	return 0;
 }
 
 static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
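
For reference, a filesystem-side caller of this would look something
like the sketch below (fs_hint_allocated() is a hypothetical wrapper in
the spirit of patch [3/3], not part of the posted patch):

static int fs_hint_allocated(struct super_block *sb, sector_t block,
			     sector_t nr_blocks)
{
	/* hint the device about newly allocated blocks, without zeroing */
	return blkdev_issue_zeroout(sb->s_bdev,
			block << (sb->s_blocksize_bits - 9),
			nr_blocks << (sb->s_blocksize_bits - 9),
			GFP_NOFS,
			BLKDEV_ZERO_ALLOCATE | BLKDEV_ZERO_NOFALLBACK);
}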

 


^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-01-09  9:45 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-10 16:56 [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Kirill Tkhai
2019-12-10 16:56 ` [PATCH RFC 1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation Kirill Tkhai
2019-12-19  3:03   ` Martin K. Petersen
2019-12-19 11:07     ` Kirill Tkhai
2019-12-19 22:03       ` Chaitanya Kulkarni
2019-12-19 22:37       ` Martin K. Petersen
2019-12-20  1:53         ` Darrick J. Wong
2019-12-20  2:22           ` Martin K. Petersen
2019-12-20 11:55         ` Kirill Tkhai
2019-12-21 18:54           ` Martin K. Petersen
2019-12-23  8:51             ` Kirill Tkhai
2020-01-07  3:24               ` Martin K. Petersen
2020-01-07 13:59                 ` Kirill Tkhai
2020-01-08  2:49                   ` Martin K. Petersen
2020-01-09  9:43                     ` Kirill Tkhai
2019-12-10 16:56 ` [PATCH RFC 2/3] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Kirill Tkhai
2019-12-10 16:56 ` [PATCH RFC 3/3] ext4: Notify block device about fallocate(0)-assigned blocks Kirill Tkhai
2019-12-11  1:02   ` kbuild test robot
2019-12-11 12:55   ` [PATCH RFC v2 " Kirill Tkhai
2019-12-15 15:35   ` [PATCH RFC " kbuild test robot
2019-12-11  7:42 ` [PATCH RFC 0/3] block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals Chaitanya Kulkarni
2019-12-11  8:50   ` Kirill Tkhai
2019-12-17 14:16 ` Kirill Tkhai
