* [PATCH v2 0/8] Introduce provisioning primitives for thinly provisioned storage
From: Sarthak Kukreti @ 2022-12-29  8:12 UTC
  To: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel
  Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
	Theodore Ts'o, Andreas Dilger, Bart Van Assche, Daniil Lunev,
	Darrick J. Wong

Hi,

This patch series adds a mechanism to pass through provision requests on
stacked thinly provisioned storage devices/filesystems.

The Linux kernel provides several mechanisms to set up thinly provisioned
block storage abstractions (e.g. dm-thin, loop devices over sparse files),
either directly as block devices or backing storage for filesystems. Currently,
short of writing data to either the device or filesystem, there is no way for
users to pre-allocate space for use in such storage setups. Consider the
following use-cases:

1) Suspend-to-disk and resume from a dm-thin device: In order to ensure that
   the underlying thinpool metadata is not modified during the suspend
   mechanism, the dm-thin device needs to be fully provisioned.
2) If a filesystem uses a loop device over a sparse file, fallocate() on the
   filesystem will allocate blocks for files but the underlying sparse file
   will remain sparse, with no space reserved for those blocks.
3) Another example is a virtual machine using a sparse file/dm-thin as a storage
   device; by default, allocations within the VM boundaries will not affect
   the host.
4) Several storage standards support mechanisms for thin provisioning on
   real hardware devices. For example:
   a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin provisioning:
      "When the THINP bit in the NSFEAT field of the Identify Namespace data
       structure is set to ‘1’, the controller ... shall track the number of
       allocated blocks in the Namespace Utilization field"
   b. The SCSI Block Commands - 4 (SBC-4) reference has a section on "Thin
      provisioned logical units",
   c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".

In all the above situations, the only way to pre-allocate space today is to
issue writes (or use WRITE_ZEROES/WRITE_SAME). However, that does not
scale well with larger pre-allocation sizes.
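
The gap between logical size and reserved space that motivates the use-cases
above can be observed from userspace. The following is a minimal sketch, not
part of the series: it creates a temporary file with ftruncate() only and
compares st_size against st_blocks. The helper name, the struct and the /tmp
path are illustrative assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Logical size versus space actually reserved for a file. */
struct file_usage {
	long long logical_bytes;   /* st_size */
	long long allocated_bytes; /* st_blocks, counted in 512-byte units */
};

/*
 * Create an unlinked temporary file (path is an assumption) and extend it
 * to 'size' bytes with ftruncate() only. No data is written, so on
 * filesystems that support sparse files few or no blocks are allocated.
 * Returns 0 on success, -1 on error.
 */
static int sparse_file_usage(long long size, struct file_usage *out)
{
	char path[] = "/tmp/sparse-demo-XXXXXX";
	struct stat st;
	int fd;

	assert(out);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path); /* the file goes away once fd is closed */

	if (ftruncate(fd, size) < 0 || fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	out->logical_bytes = st.st_size;
	out->allocated_bytes = (long long)st.st_blocks * 512;
	close(fd);
	return 0;
}
```

Calling sparse_file_usage(64LL << 20, &u) on a filesystem with sparse file
support should report a 64M logical size with far fewer allocated bytes. A
loop device backed by such a file looks fully sized to the filesystem on top
of it while the host has reserved almost nothing.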

This patchset introduces primitives to support block-level provisioning
requests across filesystems and block devices (note: the term 'provisioning'
is used to avoid overloading the terms 'allocation'/'pre-allocation').
This allows fallocate() and file creation requests to reserve space across
stacked layers of block devices and filesystems. Currently, the patchset covers
a prototype on the device-mapper targets, loop device and ext4, but the same
mechanism can be extended to other filesystems/block devices, as well as to
the devices in 4 a-c.

Patch 1 introduces REQ_OP_PROVISION as a new request type.
The provision request acts like the inverse of a discard request; instead
of notifying lower layers that the block range will no longer be used, provision
acts as a request to lower layers to provision disk space for the given block
range. Real hardware storage devices will currently disable the provisioning
capability but for the standards listed in 4a.-c., REQ_OP_PROVISION can be
overloaded for use as the provisioning primitive for future devices.
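
How a device "disables" the capability can be modeled in userspace. Patch 1
rejects REQ_OP_PROVISION in submit_bio_noacct() when max_provision_sectors is
zero, the same way it already treats max_write_zeroes_sectors. The sketch
below mirrors only that check; the model_* names are hypothetical stand-ins
and not kernel definitions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the request ops and queue limits. */
enum model_op { MODEL_OP_WRITE_ZEROES, MODEL_OP_PROVISION };

struct model_limits {
	unsigned int max_write_zeroes_sectors;
	unsigned int max_provision_sectors;
};

/*
 * Returns 0 if the op may be dispatched to the queue, or -EOPNOTSUPP if
 * the queue advertises a zero limit for it.
 */
static int model_check_op(const struct model_limits *lim, enum model_op op)
{
	assert(lim);
	switch (op) {
	case MODEL_OP_WRITE_ZEROES:
		return lim->max_write_zeroes_sectors ? 0 : -EOPNOTSUPP;
	case MODEL_OP_PROVISION:
		return lim->max_provision_sectors ? 0 : -EOPNOTSUPP;
	}
	return -EINVAL;
}
```

In this model a queue that keeps the default limit of 0 rejects provision
requests, while one that sets a non-zero limit accepts them.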

Patch 2 implements REQ_OP_PROVISION handling for some of the device-mapper
targets. This additionally adds support for pre-allocating space for thinly
provisioned logical volumes via fallocate().

Patch 3 introduces an fallocate() mode (FALLOC_FL_PROVISION) that sends a
provision request to the underlying block device (and beyond). This acts as the
primary mechanism for file provisioning and also disambiguates the notions of
virtual and true disk space allocation for thinly provisioned storage
devices/filesystems. With patch 3, the 'default' fallocate() mode is preserved
to perform preallocation at the current allocation layer and 'provision' mode
adds the capability to punch through the allocations to the underlying thinly
provisioned storage layers. For regular filesystems, both allocation modes
are equivalent.
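
The difference between the two modes can be illustrated with a toy model of a
storage stack. This is a sketch of the semantics described above, not kernel
code: the layer struct, model_fallocate() and MODEL_FL_PROVISION are
hypothetical, and the flag value is deliberately not the real
FALLOC_FL_PROVISION value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag for this model only; NOT the value used by the series. */
#define MODEL_FL_PROVISION 0x1

/* One layer in a storage stack, e.g. ext4 on top of dm-thin. */
struct model_layer {
	const char *name;
	unsigned long long allocated; /* bytes reserved at this layer */
	struct model_layer *lower;    /* next layer down, or NULL */
};

/*
 * Default mode reserves space only at the layer that receives the call.
 * Provision mode additionally forwards the request to every lower layer.
 */
static void model_fallocate(struct model_layer *top, int mode,
			    unsigned long long len)
{
	assert(top);
	top->allocated += len;
	if (!(mode & MODEL_FL_PROVISION))
		return;
	for (struct model_layer *l = top->lower; l; l = l->lower)
		l->allocated += len;
}
```

With a two-layer stack, a default-mode call changes only the top layer's
counter, while a provision-mode call changes both. With a single layer the
two modes behave the same, which matches the statement that both modes are
equivalent on regular filesystems.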

Patch 4 wires up the loop device handling of REQ_OP_PROVISION.

Patches 5-7 cover a prototype implementation for ext4, which includes wiring up
the fallocate() implementation, introducing a filesystem level option (called
'provision') to control the default allocation behaviour and, finally, a
file-level override to retain current handling, even on filesystems mounted with
'provision'. These options allow users of stacked filesystems to flexibly take
advantage of provisioning.

Testing:
--------
- Tested on a VM running a 6.2 kernel.
- The following performance measurements were collected with fallocate(2)
patched to add support for FALLOC_FL_PROVISION via a command line option
`-p/--provision`.

- Preallocation of dm-thin devices:
As expected, avoiding the need to zero out thinly-provisioned block devices to
preallocate space speeds up the provisioning operation significantly:

The following was tested on a dm-thin device set up on top of a dm-thinp with
skip_block_zeroing=true.
A) Zeroout was measured using `fallocate -z ...`
B) Provision was measured using `fallocate -p ...`.

Size    Time (s)    A         B
512M    real        1.093     0.034
        user        0         0
        sys         0.022     0.01
1G      real        2.182     0.048
        user        0         0.01
        sys         0.022     0
2G      real        4.344     0.082
        user        0         0.01
        sys         0.036     0
4G      real        8.679     0.153
        user        0         0.01
        sys         0.073     0
8G      real        17.777    0.318
        user        0         0.01
        sys         0.144     0
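
As a quick reading aid, the "real" times in the table above translate into a
speedup of roughly 32x to 57x for provision over zeroout. The sketch below
only does that division; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* "real" times in seconds, copied from the table above. */
struct sample {
	const char *size;
	double zeroout_s;   /* column A */
	double provision_s; /* column B */
};

static const struct sample samples[] = {
	{ "512M",  1.093, 0.034 },
	{ "1G",    2.182, 0.048 },
	{ "2G",    4.344, 0.082 },
	{ "4G",    8.679, 0.153 },
	{ "8G",   17.777, 0.318 },
};

/* Ratio of zeroout time to provision time for one row. */
static double speedup(const struct sample *s)
{
	assert(s && s->provision_s > 0.0);
	return s->zeroout_s / s->provision_s;
}
```

The ratio grows with size up to 4G and is about the same at 8G, which is
consistent with zeroout time scaling with the amount of data written.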

- Preallocation of files on filesystems
Since fallocate() with FALLOC_FL_PROVISION can now pass down through
filesystems/block devices, this results in an expected slowdown in fallocate()
calls if the provision request is sent to the underlying layers.

The measurements were taken using fallocate() on ext4 filesystems set up with
the following opts/block devices:
A) ext4 filesystem mounted with 'noprovision'
B) ext4 filesystem mounted with 'provision' on a dm-thin device.
C) ext4 filesystem mounted with 'provision' on a loop device with a sparse
   backing file on the filesystem in (B).

Size    Time (s)    A         B         C
512M    real        0.011     0.036     0.041
        user        0.02      0.03      0.002
        sys         0         0         0
1G      real        0.011     0.055     0.064
        user        0         0         0.03
        sys         0.003     0.004     0
2G      real        0.011     0.109     0.117
        user        0         0         0.004
        sys         0.003     0.006     0
4G      real        0.011     0.224     0.231
        user        0         0         0.006
        sys         0.004     0.012     0
8G      real        0.017     0.426     0.527
        user        0         0         0.013
        sys         0.009     0.024     0

As expected, the additional provision requests slow down fallocate() calls,
and the degree of slowdown depends on the number of layers that the provision
request passes through as well as the complexity of allocation on those
layers.

TODOs:
------
- Add xfstests to validate that provisioning results in allocation.

Changelog:

V2:
- Fix stacked limit handling.
- Enable provision request handling in dm-snapshot.
- Don't call truncate_bdev_range if blkdev_fallocate() is called with
  FALLOC_FL_PROVISION.
- Clarify semantics of FALLOC_FL_PROVISION and why it needs to be a separate flag
  (as opposed to overloading mode == 0).
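
For reference, the stacked limit handling in patch 1 starts a stacking device
at UINT_MAX in blk_set_stacking_limits() and folds in each bottom device with
min_not_zero() in blk_stack_limits(). The sketch below is a userspace model of
that combination; the model_* helpers are hypothetical names.

```c
#include <assert.h>
#include <limits.h>

/* Userspace equivalent of the kernel's min_not_zero() for unsigned int. */
static unsigned int model_min_not_zero(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/*
 * Fold the provision limits of n bottom devices into a stacking default
 * of UINT_MAX, as blk_set_stacking_limits() sets it in patch 1.
 */
static unsigned int model_stack_provision_limit(const unsigned int *bottom,
						int n)
{
	unsigned int top = UINT_MAX;

	assert(bottom || n == 0);
	for (int i = 0; i < n; i++)
		top = model_min_not_zero(top, bottom[i]);
	return top;
}
```

In this model a bottom device with a zero limit does not lower the result, so
the helper on its own never clears the limit of a stacked device. Whether a
stacked target ends up advertising support therefore depends on what the
target itself sets.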

* [PATCH v2 1/7] block: Introduce provisioning primitives
From: Sarthak Kukreti @ 2022-12-29  8:12 UTC
  To: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel
  Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
	Theodore Ts'o, Andreas Dilger, Bart Van Assche, Daniil Lunev,
	Darrick J. Wong

Introduce block request REQ_OP_PROVISION. The intent of this request
is to ask the underlying storage to preallocate disk space for the given
block range. Block devices that support this capability will export
a provision limit within their request queues.

Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
 block/blk-core.c          |  5 ++++
 block/blk-lib.c           | 53 +++++++++++++++++++++++++++++++++++++++
 block/blk-merge.c         | 18 +++++++++++++
 block/blk-settings.c      | 19 ++++++++++++++
 block/blk-sysfs.c         |  8 ++++++
 block/bounce.c            |  1 +
 include/linux/bio.h       |  6 +++--
 include/linux/blk_types.h |  5 +++-
 include/linux/blkdev.h    | 16 ++++++++++++
 9 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9321767470dc..30bcabc7dc01 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -123,6 +123,7 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(WRITE_ZEROES),
 	REQ_OP_NAME(DRV_IN),
 	REQ_OP_NAME(DRV_OUT),
+	REQ_OP_NAME(PROVISION),
 };
 #undef REQ_OP_NAME
 
@@ -785,6 +786,10 @@ void submit_bio_noacct(struct bio *bio)
 		if (!q->limits.max_write_zeroes_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_PROVISION:
+		if (!q->limits.max_provision_sectors)
+			goto not_supported;
+		break;
 	default:
 		break;
 	}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..647b6451660b 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -343,3 +343,56 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 	return ret;
 }
 EXPORT_SYMBOL(blkdev_issue_secure_erase);
+
+/**
+ * blkdev_issue_provision - provision a block range
+ * @bdev:	blockdev to provision
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to provision
+ * @gfp:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *  Issues a provision request to the block device for the range of sectors.
+ *  For thinly provisioned block devices, this acts as a signal for the
+ *  underlying storage pool to allocate space for this block range.
+ */
+int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp)
+{
+	sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+	unsigned int max_sectors = bdev_max_provision_sectors(bdev);
+	struct bio *bio = NULL;
+	struct blk_plug plug;
+	int ret = 0;
+
+	if (max_sectors == 0)
+		return -EOPNOTSUPP;
+	if ((sector | nr_sects) & bs_mask)
+		return -EINVAL;
+	if (bdev_read_only(bdev))
+		return -EPERM;
+
+	blk_start_plug(&plug);
+	for (;;) {
+		unsigned int req_sects = min_t(sector_t, nr_sects, max_sectors);
+
+		bio = blk_next_bio(bio, bdev, 0, REQ_OP_PROVISION, gfp);
+		bio->bi_iter.bi_sector = sector;
+		bio->bi_iter.bi_size = req_sects << SECTOR_SHIFT;
+
+		sector += req_sects;
+		nr_sects -= req_sects;
+		if (!nr_sects) {
+			ret = submit_bio_wait(bio);
+			if (ret == -EOPNOTSUPP)
+				ret = 0;
+			bio_put(bio);
+			break;
+		}
+		cond_resched();
+	}
+	blk_finish_plug(&plug);
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_provision);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 35a8f75cc45d..3ab35bb2a333 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -158,6 +158,21 @@ static struct bio *bio_split_write_zeroes(struct bio *bio,
 	return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *bio_split_provision(struct bio *bio,
+					const struct queue_limits *lim,
+					unsigned *nsegs, struct bio_set *bs)
+{
+	*nsegs = 0;
+
+	if (!lim->max_provision_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= lim->max_provision_sectors)
+		return NULL;
+
+	return bio_split(bio, lim->max_provision_sectors, GFP_NOIO, bs);
+}
+
 /*
  * Return the maximum number of sectors from the start of a bio that may be
  * submitted as a single request to a block device. If enough sectors remain,
@@ -355,6 +370,9 @@ struct bio *__bio_split_to_limits(struct bio *bio,
 	case REQ_OP_WRITE_ZEROES:
 		split = bio_split_write_zeroes(bio, lim, nr_segs, bs);
 		break;
+	case REQ_OP_PROVISION:
+		split = bio_split_provision(bio, lim, nr_segs, bs);
+		break;
 	default:
 		split = bio_split_rw(bio, lim, nr_segs, bs,
 				get_max_io_size(bio, lim) << SECTOR_SHIFT);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0477c4d527fe..57d88204fbbe 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -58,6 +58,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->zoned = BLK_ZONED_NONE;
 	lim->zone_write_granularity = 0;
 	lim->dma_alignment = 511;
+	lim->max_provision_sectors = 0;
 }
 
 /**
@@ -81,6 +82,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
 	lim->max_zone_append_sectors = UINT_MAX;
+	lim->max_provision_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -202,6 +204,20 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_provision_sectors - set max sectors for a single provision
+ *
+ * @q:  the request queue for the device
+ * @max_provision_sectors: maximum number of sectors to provision per command
+ **/
+
+void blk_queue_max_provision_sectors(struct request_queue *q,
+		unsigned int max_provision_sectors)
+{
+	q->limits.max_provision_sectors = max_provision_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_provision_sectors);
+
 /**
  * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
  * @q:  the request queue for the device
@@ -572,6 +588,9 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_segment_size = min_not_zero(t->max_segment_size,
 					   b->max_segment_size);
 
+	t->max_provision_sectors = min_not_zero(t->max_provision_sectors,
+						b->max_provision_sectors);
+
 	t->misaligned |= b->misaligned;
 
 	alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 93d9e9c9a6ea..2e678417b302 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -131,6 +131,12 @@ static ssize_t queue_max_discard_segments_show(struct request_queue *q,
 	return queue_var_show(queue_max_discard_segments(q), page);
 }
 
+static ssize_t queue_max_provision_sectors_show(struct request_queue *q,
+		char *page)
+{
+	return queue_var_show(queue_max_provision_sectors(q), page);
+}
+
 static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->limits.max_integrity_segments, page);
@@ -589,6 +595,7 @@ QUEUE_RO_ENTRY(queue_io_min, "minimum_io_size");
 QUEUE_RO_ENTRY(queue_io_opt, "optimal_io_size");
 
 QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments");
+QUEUE_RO_ENTRY(queue_max_provision_sectors, "max_provision_sectors");
 QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity");
 QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes");
 QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes");
@@ -638,6 +645,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_max_sectors_entry.attr,
 	&queue_max_segments_entry.attr,
 	&queue_max_discard_segments_entry.attr,
+	&queue_max_provision_sectors_entry.attr,
 	&queue_max_integrity_segments_entry.attr,
 	&queue_max_segment_size_entry.attr,
 	&elv_iosched_entry.attr,
diff --git a/block/bounce.c b/block/bounce.c
index 7cfcb242f9a1..ab9d8723ae64 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -176,6 +176,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_PROVISION:
 		break;
 	default:
 		bio_for_each_segment(bv, bio_src, iter)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 22078a28d7cb..5025af105b7c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -55,7 +55,8 @@ static inline bool bio_has_data(struct bio *bio)
 	    bio->bi_iter.bi_size &&
 	    bio_op(bio) != REQ_OP_DISCARD &&
 	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
-	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
+	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+	    bio_op(bio) != REQ_OP_PROVISION)
 		return true;
 
 	return false;
@@ -65,7 +66,8 @@ static inline bool bio_no_advance_iter(const struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
-	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
+	       bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+	       bio_op(bio) == REQ_OP_PROVISION;
 }
 
 static inline void *bio_data(struct bio *bio)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..27bdf88f541c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -385,7 +385,10 @@ enum req_op {
 	REQ_OP_DRV_IN		= (__force blk_opf_t)34,
 	REQ_OP_DRV_OUT		= (__force blk_opf_t)35,
 
-	REQ_OP_LAST		= (__force blk_opf_t)36,
+	/* request device to provision block */
+	REQ_OP_PROVISION	= (__force blk_opf_t)37,
+
+	REQ_OP_LAST		= (__force blk_opf_t)38,
 };
 
 enum req_flag_bits {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 301cf1cf4f2f..f1abc7b43e25 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -302,6 +302,7 @@ struct queue_limits {
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
+	unsigned int		max_provision_sectors;
 
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;
@@ -918,6 +919,8 @@ extern void blk_queue_max_discard_sectors(struct request_queue *q,
 		unsigned int max_discard_sectors);
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
+extern void blk_queue_max_provision_sectors(struct request_queue *q,
+		unsigned int max_provision_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
 		unsigned int max_zone_append_sectors);
@@ -1057,6 +1060,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp);
 
+extern int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask);
+
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
 
@@ -1135,6 +1141,11 @@ static inline unsigned short queue_max_discard_segments(const struct request_que
 	return q->limits.max_discard_segments;
 }
 
+static inline unsigned int queue_max_provision_sectors(const struct request_queue *q)
+{
+	return q->limits.max_provision_sectors;
+}
+
 static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 {
 	return q->limits.max_segment_size;
@@ -1271,6 +1282,11 @@ static inline bool bdev_nowait(struct block_device *bdev)
 	return test_bit(QUEUE_FLAG_NOWAIT, &bdev_get_queue(bdev)->queue_flags);
 }
 
+static inline unsigned int bdev_max_provision_sectors(struct block_device *bdev)
+{
+	return bdev_get_queue(bdev)->limits.max_provision_sectors;
+}
+
 static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
-- 
2.37.3
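
The chunking loop in blkdev_issue_provision() above can be modeled in
userspace as follows. This is a sketch under stated assumptions: it replaces
bio submission with a hypothetical callback, and unlike the patch it rejects
an empty range up front. All names here are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback standing in for building and chaining one bio. */
typedef void (*chunk_fn)(uint64_t sector, uint32_t nr_sects, void *ctx);

/*
 * Split [sector, sector + nr_sects) into chunks of at most max_sectors and
 * call emit() once per chunk. Returns the number of chunks, or -1 if the
 * device limit is zero or the range is empty.
 */
static int model_split_provision(uint64_t sector, uint64_t nr_sects,
				 uint32_t max_sectors, chunk_fn emit,
				 void *ctx)
{
	int chunks = 0;

	if (max_sectors == 0 || nr_sects == 0)
		return -1;

	while (nr_sects) {
		uint32_t req = nr_sects < max_sectors ?
			       (uint32_t)nr_sects : max_sectors;

		assert(req > 0);
		if (emit)
			emit(sector, req, ctx);
		sector += req;
		nr_sects -= req;
		chunks++;
	}
	return chunks;
}

/* Example callback: accumulate the total number of sectors seen. */
static void sum_sectors(uint64_t sector, uint32_t nr_sects, void *ctx)
{
	(void)sector;
	*(uint64_t *)ctx += nr_sects;
}
```

For example, a 10000-sector range with a 4096-sector limit yields three
chunks of 4096, 4096 and 1808 sectors. The kernel loop additionally converts
each chunk to bytes with req_sects << SECTOR_SHIFT, so a real limit has to
be small enough for that product to fit in the 32-bit bi_size field.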


+{
+	return queue_var_show(queue_max_provision_sectors(q), page);
+}
+
 static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->limits.max_integrity_segments, page);
@@ -589,6 +595,7 @@ QUEUE_RO_ENTRY(queue_io_min, "minimum_io_size");
 QUEUE_RO_ENTRY(queue_io_opt, "optimal_io_size");
 
 QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments");
+QUEUE_RO_ENTRY(queue_max_provision_sectors, "max_provision_sectors");
 QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity");
 QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes");
 QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes");
@@ -638,6 +645,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_max_sectors_entry.attr,
 	&queue_max_segments_entry.attr,
 	&queue_max_discard_segments_entry.attr,
+	&queue_max_provision_sectors_entry.attr,
 	&queue_max_integrity_segments_entry.attr,
 	&queue_max_segment_size_entry.attr,
 	&elv_iosched_entry.attr,
diff --git a/block/bounce.c b/block/bounce.c
index 7cfcb242f9a1..ab9d8723ae64 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -176,6 +176,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_PROVISION:
 		break;
 	default:
 		bio_for_each_segment(bv, bio_src, iter)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 22078a28d7cb..5025af105b7c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -55,7 +55,8 @@ static inline bool bio_has_data(struct bio *bio)
 	    bio->bi_iter.bi_size &&
 	    bio_op(bio) != REQ_OP_DISCARD &&
 	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
-	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
+	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+	    bio_op(bio) != REQ_OP_PROVISION)
 		return true;
 
 	return false;
@@ -65,7 +66,8 @@ static inline bool bio_no_advance_iter(const struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
-	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
+	       bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+	       bio_op(bio) == REQ_OP_PROVISION;
 }
 
 static inline void *bio_data(struct bio *bio)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..27bdf88f541c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -385,7 +385,10 @@ enum req_op {
 	REQ_OP_DRV_IN		= (__force blk_opf_t)34,
 	REQ_OP_DRV_OUT		= (__force blk_opf_t)35,
 
-	REQ_OP_LAST		= (__force blk_opf_t)36,
+	/* request device to provision block */
+	REQ_OP_PROVISION	= (__force blk_opf_t)37,
+
+	REQ_OP_LAST		= (__force blk_opf_t)38,
 };
 
 enum req_flag_bits {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 301cf1cf4f2f..f1abc7b43e25 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -302,6 +302,7 @@ struct queue_limits {
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
+	unsigned int		max_provision_sectors;
 
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;
@@ -918,6 +919,8 @@ extern void blk_queue_max_discard_sectors(struct request_queue *q,
 		unsigned int max_discard_sectors);
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
+extern void blk_queue_max_provision_sectors(struct request_queue *q,
+		unsigned int max_provision_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
 		unsigned int max_zone_append_sectors);
@@ -1057,6 +1060,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp);
 
+extern int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask);
+
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
 
@@ -1135,6 +1141,11 @@ static inline unsigned short queue_max_discard_segments(const struct request_que
 	return q->limits.max_discard_segments;
 }
 
+static inline unsigned int queue_max_provision_sectors(const struct request_queue *q)
+{
+	return q->limits.max_provision_sectors;
+}
+
 static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 {
 	return q->limits.max_segment_size;
@@ -1271,6 +1282,11 @@ static inline bool bdev_nowait(struct block_device *bdev)
 	return test_bit(QUEUE_FLAG_NOWAIT, &bdev_get_queue(bdev)->queue_flags);
 }
 
+static inline unsigned int bdev_max_provision_sectors(struct block_device *bdev)
+{
+	return bdev_get_queue(bdev)->limits.max_provision_sectors;
+}
+
 static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
-- 
2.37.3

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


* [PATCH v2 2/7] dm: Add support for block provisioning
  2022-12-29  8:12 ` [dm-devel] " Sarthak Kukreti
@ 2022-12-29  8:12   ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2022-12-29  8:12 UTC (permalink / raw)
  To: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel
  Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
	Theodore Ts'o, Andreas Dilger, Bart Van Assche, Daniil Lunev,
	Darrick J. Wong

Add REQ_OP_PROVISION support to dm devices. By default, targets pass
the request through to the underlying device; dm-thin uses it to
provision blocks.
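
To make the table-level behaviour concrete, here is a minimal userspace
model (not kernel code) of how a dm table could decide whether to
advertise provision support. The model_* names are made up for
illustration, and the sketch assumes a device counts as capable when its
max_provision_sectors limit is non-zero.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct dm_target. */
struct model_target {
	/* Target handles provision itself (as dm-thin does). */
	bool provision_supported;
	/* max_provision_sectors of each underlying device. */
	const unsigned int *dev_max_provision_sectors;
	size_t num_devs;
};

/* A pass-through target is usable if any underlying device is capable. */
static bool model_target_can_provision(const struct model_target *t)
{
	if (t->provision_supported)
		return true;

	for (size_t i = 0; i < t->num_devs; i++)
		if (t->dev_max_provision_sectors[i] != 0)
			return true;

	return false;
}

/* The table advertises support if at least one target can provision. */
static bool model_table_supports_provision(const struct model_target *targets,
					   size_t num_targets)
{
	for (size_t i = 0; i < num_targets; i++)
		if (model_target_can_provision(&targets[i]))
			return true;

	return false;
}
```

In this model, a table holding only a pass-through target over devices
that all report 0 does not advertise support, while any table that
contains a target with provision_supported set does.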

Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
 drivers/md/dm-crypt.c         |  4 +-
 drivers/md/dm-linear.c        |  1 +
 drivers/md/dm-snap.c          |  7 +++
 drivers/md/dm-table.c         | 25 ++++++++++
 drivers/md/dm-thin.c          | 90 ++++++++++++++++++++++++++++++++++-
 drivers/md/dm.c               |  4 ++
 include/linux/device-mapper.h | 11 +++++
 7 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 2653516bcdef..7089a414c3d1 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -3081,6 +3081,8 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar
 	if (ret)
 		return ret;
 
+	ti->num_provision_bios = 1;
+
 	while (opt_params--) {
 		opt_string = dm_shift_arg(&as);
 		if (!opt_string) {
@@ -3384,7 +3386,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
 	 * - for REQ_OP_DISCARD caller must use flush if IO ordering matters
 	 */
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH ||
-	    bio_op(bio) == REQ_OP_DISCARD)) {
+	    bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_PROVISION)) {
 		bio_set_dev(bio, cc->dev->bdev);
 		if (bio_sectors(bio))
 			bio->bi_iter.bi_sector = cc->start +
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 3212ef6aa81b..1aa782149428 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -61,6 +61,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->num_discard_bios = 1;
 	ti->num_secure_erase_bios = 1;
 	ti->num_write_zeroes_bios = 1;
+	ti->num_provision_bios = 1;
 	ti->private = lc;
 	return 0;
 
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index d1c2f84d27e3..d4d2599e3620 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1357,6 +1357,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	if (s->discard_zeroes_cow)
 		ti->num_discard_bios = (s->discard_passdown_origin ? 2 : 1);
 	ti->per_io_data_size = sizeof(struct dm_snap_tracked_chunk);
+	ti->num_provision_bios = 1;
 
 	/* Add snapshot to the list of snapshots for this origin */
 	/* Exceptions aren't triggered till snapshot_resume() is called */
@@ -2001,6 +2002,11 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 	/* If the block is already remapped - use that, else remap it */
 	e = dm_lookup_exception(&s->complete, chunk);
 	if (e) {
+		if (unlikely(bio_op(bio) == REQ_OP_PROVISION)) {
+			bio_endio(bio);
+			r = DM_MAPIO_SUBMITTED;
+			goto out_unlock;
+		}
 		remap_exception(s, e, bio, chunk);
 		if (unlikely(bio_op(bio) == REQ_OP_DISCARD) &&
 		    io_overlaps_chunk(s, bio)) {
@@ -2414,6 +2420,7 @@ static void snapshot_io_hints(struct dm_target *ti, struct queue_limits *limits)
 		/* All discards are split on chunk_size boundary */
 		limits->discard_granularity = snap->store->chunk_size;
 		limits->max_discard_sectors = snap->store->chunk_size;
+		limits->max_provision_sectors = snap->store->chunk_size;
 
 		up_read(&_origins_lock);
 	}
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 8541d5688f3a..35f8d670935e 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1853,6 +1853,26 @@ static bool dm_table_supports_write_zeroes(struct dm_table *t)
 	return true;
 }
 
+static int device_provision_capable(struct dm_target *ti, struct dm_dev *dev,
+				    sector_t start, sector_t len, void *data)
+{
+	return bdev_max_provision_sectors(dev->bdev) != 0;
+}
+
+static bool dm_table_supports_provision(struct dm_table *t)
+{
+	for (unsigned int i = 0; i < t->num_targets; i++) {
+		struct dm_target *ti = dm_table_get_target(t, i);
+
+		if (ti->provision_supported ||
+		    (ti->type->iterate_devices &&
+		    ti->type->iterate_devices(ti, device_provision_capable, NULL)))
+			return true;
+	}
+
+	return false;
+}
+
 static int device_not_nowait_capable(struct dm_target *ti, struct dm_dev *dev,
 				     sector_t start, sector_t len, void *data)
 {
@@ -1987,6 +2007,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (!dm_table_supports_write_zeroes(t))
 		q->limits.max_write_zeroes_sectors = 0;
 
+	if (dm_table_supports_provision(t))
+		blk_queue_max_provision_sectors(q, UINT_MAX >> 9);
+	else
+		q->limits.max_provision_sectors = 0;
+
 	dm_table_verify_integrity(t);
 
 	/*
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 64cfcf46881d..ab3f1abfabaf 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1012,6 +1012,14 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
 		goto out;
 	}
 
+	/* For provision requests, return once the prepared block has been inserted
+	 * into the mapping btree.
+	 */
+	if (bio && bio_op(bio) == REQ_OP_PROVISION) {
+		bio_endio(bio);
+		goto out;
+	}
+
 	/*
 	 * Release any bios held while the block was being provisioned.
 	 * If we are processing a write bio that completely covers the block,
@@ -1239,7 +1247,7 @@ static int io_overlaps_block(struct pool *pool, struct bio *bio)
 
 static int io_overwrites_block(struct pool *pool, struct bio *bio)
 {
-	return (bio_data_dir(bio) == WRITE) &&
+	return (bio_data_dir(bio) == WRITE) && bio_op(bio) != REQ_OP_PROVISION &&
 		io_overlaps_block(pool, bio);
 }
 
@@ -1388,6 +1396,10 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
 	m->data_block = data_block;
 	m->cell = cell;
 
+	/* Provision requests are chained on the original bio. */
+	if (bio && bio_op(bio) == REQ_OP_PROVISION)
+		m->bio = bio;
+
 	/*
 	 * If the whole block of data is being overwritten or we are not
 	 * zeroing pre-existing data, we can issue the bio immediately.
@@ -1980,6 +1992,70 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
 	}
 }
 
+static void process_provision_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+{
+	int r;
+	struct pool *pool = tc->pool;
+	struct bio *bio = cell->holder;
+	dm_block_t begin, end;
+	struct dm_thin_lookup_result lookup_result;
+
+	if (tc->requeue_mode) {
+		cell_requeue(pool, cell);
+		return;
+	}
+
+	get_bio_block_range(tc, bio, &begin, &end);
+
+	while (begin != end) {
+		r = ensure_next_mapping(pool);
+		if (r)
+			/* we did our best */
+			return;
+
+		r = dm_thin_find_block(tc->td, begin, 1, &lookup_result);
+		switch (r) {
+		case 0:
+			begin++;
+			break;
+		case -ENODATA:
+			bio_inc_remaining(bio);
+			provision_block(tc, bio, begin, cell);
+			begin++;
+			break;
+		default:
+			DMERR_LIMIT(
+				"%s: dm_thin_find_block() failed: error = %d",
+				__func__, r);
+			cell_defer_no_holder(tc, cell);
+			bio_io_error(bio);
+			begin++;
+			break;
+		}
+	}
+	bio_endio(bio);
+	cell_defer_no_holder(tc, cell);
+}
+
+static void process_provision_bio(struct thin_c *tc, struct bio *bio)
+{
+	dm_block_t begin, end;
+	struct dm_cell_key virt_key;
+	struct dm_bio_prison_cell *virt_cell;
+
+	get_bio_block_range(tc, bio, &begin, &end);
+	if (begin == end) {
+		bio_endio(bio);
+		return;
+	}
+
+	build_key(tc->td, VIRTUAL, begin, end, &virt_key);
+	if (bio_detain(tc->pool, &virt_key, bio, &virt_cell))
+		return;
+
+	process_provision_cell(tc, virt_cell);
+}
+
 static void process_bio(struct thin_c *tc, struct bio *bio)
 {
 	struct pool *pool = tc->pool;
@@ -2200,6 +2276,8 @@ static void process_thin_deferred_bios(struct thin_c *tc)
 
 		if (bio_op(bio) == REQ_OP_DISCARD)
 			pool->process_discard(tc, bio);
+		else if (bio_op(bio) == REQ_OP_PROVISION)
+			process_provision_bio(tc, bio);
 		else
 			pool->process_bio(tc, bio);
 
@@ -2716,7 +2794,8 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
 		return DM_MAPIO_SUBMITTED;
 	}
 
-	if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
+	if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
+	    bio_op(bio) == REQ_OP_PROVISION) {
 		thin_defer_bio_with_throttle(tc, bio);
 		return DM_MAPIO_SUBMITTED;
 	}
@@ -3355,6 +3434,8 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	pt->low_water_blocks = low_water_blocks;
 	pt->adjusted_pf = pt->requested_pf = pf;
 	ti->num_flush_bios = 1;
+	ti->num_provision_bios = 1;
+	ti->provision_supported = true;
 
 	/*
 	 * Only need to enable discards if the pool should pass
@@ -4053,6 +4134,7 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
 		blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
 	}
 
+
 	/*
 	 * pt->adjusted_pf is a staging area for the actual features to use.
 	 * They get transferred to the live pool in bind_control_target()
@@ -4243,6 +4325,9 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
 		ti->num_discard_bios = 1;
 	}
 
+	ti->num_provision_bios = 1;
+	ti->provision_supported = true;
+
 	mutex_unlock(&dm_thin_pool_table.mutex);
 
 	spin_lock_irq(&tc->pool->lock);
@@ -4457,6 +4542,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
 
 	limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
 	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
+	limits->max_provision_sectors = 2048 * 1024 * 16; /* 16G */
 }
 
 static struct target_type thin_target = {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e1ea3a7bd9d9..4d19bae9da4a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1587,6 +1587,7 @@ static bool is_abnormal_io(struct bio *bio)
 		case REQ_OP_DISCARD:
 		case REQ_OP_SECURE_ERASE:
 		case REQ_OP_WRITE_ZEROES:
+		case REQ_OP_PROVISION:
 			return true;
 		default:
 			break;
@@ -1611,6 +1612,9 @@ static blk_status_t __process_abnormal_io(struct clone_info *ci,
 	case REQ_OP_WRITE_ZEROES:
 		num_bios = ti->num_write_zeroes_bios;
 		break;
+	case REQ_OP_PROVISION:
+		num_bios = ti->num_provision_bios;
+		break;
 	default:
 		break;
 	}
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 04c6acf7faaa..b4d97d5d75b8 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -333,6 +333,12 @@ struct dm_target {
 	 */
 	unsigned num_write_zeroes_bios;
 
+	/*
+	 * The number of PROVISION bios that will be submitted to the target.
+	 * The bio number can be accessed with dm_bio_get_target_bio_nr.
+	 */
+	unsigned num_provision_bios;
+
 	/*
 	 * The minimum number of extra bytes allocated in each io for the
 	 * target to use.
@@ -357,6 +363,11 @@ struct dm_target {
 	 */
 	bool discards_supported:1;
 
+	/* Set if this target needs to receive provision requests regardless of
+	 * whether or not its underlying devices have support.
+	 */
+	bool provision_supported:1;
+
 	/*
 	 * Set if we need to limit the number of in-flight bios when swapping.
 	 */
-- 
2.37.3



* [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2022-12-29  8:12 ` [dm-devel] " Sarthak Kukreti
@ 2022-12-29  8:12   ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2022-12-29  8:12 UTC (permalink / raw)
  To: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel
  Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
	Theodore Ts'o, Andreas Dilger, Bart Van Assche, Daniil Lunev,
	Darrick J. Wong

FALLOC_FL_PROVISION is a new fallocate() allocation mode that
sends a hint to (supported) thinly provisioned block devices to
allocate space for the given range of sectors via REQ_OP_PROVISION.

The man pages for both fallocate(2) and posix_fallocate(3) describe
the default allocation mode as:

```
The default operation (i.e., mode is zero) of fallocate()
allocates the disk space within the range specified by offset and len.
...
subsequent writes to bytes in the specified range are guaranteed
not to fail because of lack of disk space.
```

For thinly provisioned storage constructs (dm-thin, filesystems on sparse
files), the term 'disk space' is overloaded and can either mean the apparent
disk space in the filesystem/thin logical volume or the true disk
space that will be utilized on the underlying non-sparse allocation layer.

The use of a separate mode allows us to cleanly disambiguate whether
fallocate() causes allocation only at the current layer (default mode)
or whether it also propagates allocations to the underlying layers
(provision mode) for thinly provisioned filesystems and block devices.
For devices that do not support REQ_OP_PROVISION, the two modes are
equivalent. Given the performance cost of sending provision requests
to the underlying layers, keeping the default mode as-is preserves
existing behavior for current users.

Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
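A minimal userspace sketch of how a caller might use the new mode,
assuming a kernel with this series applied. FALLOC_FL_PROVISION and its
value 0x80 are taken from the uapi hunk of this patch; the helper name
provision_range() and its fallback policy are illustrative assumptions,
not part of the patch. Kernels without this series do not define the
flag and may reject the bit or give it a different meaning.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for fallocate() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Value taken from this patch; unpatched uapi headers do not define it. */
#ifndef FALLOC_FL_PROVISION
#define FALLOC_FL_PROVISION 0x80
#endif

/*
 * Hypothetical helper: ask the storage stack to provision
 * [offset, offset + len). If provision mode is not supported, fall back
 * to default-mode fallocate(), which allocates at the current layer only.
 *
 * Returns 1 if the provision request succeeded, 0 if only the default
 * allocation was done, and -1 with errno set on failure.
 */
static int provision_range(int fd, off_t offset, off_t len)
{
	if (fallocate(fd, FALLOC_FL_PROVISION, offset, len) == 0)
		return 1;
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return -1;
	return fallocate(fd, 0, offset, len) == 0 ? 0 : -1;
}
```

A caller that needs the stronger guarantee, such as the suspend-to-disk
case from the cover letter, would treat a return value of 0 as "not
provisioned" rather than as success.
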
 block/fops.c                | 15 +++++++++++----
 include/linux/falloc.h      |  3 ++-
 include/uapi/linux/falloc.h |  8 ++++++++
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 50d245e8c913..01bde561e1e2 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 #define	BLKDEV_FALLOC_FL_SUPPORTED					\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
-		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
+		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |	\
+		 FALLOC_FL_PROVISION)
 
 static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 			     loff_t len)
@@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	filemap_invalidate_lock(inode->i_mapping);
 
 	/* Invalidate the page cache, including dirty pages. */
-	error = truncate_bdev_range(bdev, file->f_mode, start, end);
-	if (error)
-		goto fail;
+	if (mode != FALLOC_FL_PROVISION) {
+		error = truncate_bdev_range(bdev, file->f_mode, start, end);
+		if (error)
+			goto fail;
+	}
 
 	switch (mode) {
 	case FALLOC_FL_ZERO_RANGE:
@@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
 					     len >> SECTOR_SHIFT, GFP_KERNEL);
 		break;
+	case FALLOC_FL_PROVISION:
+		error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
+					       len >> SECTOR_SHIFT, GFP_KERNEL);
+		break;
 	default:
 		error = -EOPNOTSUPP;
 	}
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index f3f0b97b1675..b9a40a61a59b 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -30,7 +30,8 @@ struct space_resv {
 					 FALLOC_FL_COLLAPSE_RANGE |	\
 					 FALLOC_FL_ZERO_RANGE |		\
 					 FALLOC_FL_INSERT_RANGE |	\
-					 FALLOC_FL_UNSHARE_RANGE)
+					 FALLOC_FL_UNSHARE_RANGE |	\
+					 FALLOC_FL_PROVISION)
 
 /* on ia32 l_start is on a 32-bit boundary */
 #if defined(CONFIG_X86_64)
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 51398fa57f6c..2d323d113eed 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -77,4 +77,12 @@
  */
 #define FALLOC_FL_UNSHARE_RANGE		0x40
 
+/*
+ * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
+ * blocks for the range/EOF.
+ *
+ * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
+ */
+#define FALLOC_FL_PROVISION		0x80
+
 #endif /* _UAPI_FALLOC_H_ */
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 4/7] loop: Add support for provision requests
  2022-12-29  8:12 ` [dm-devel] " Sarthak Kukreti
@ 2022-12-29  8:12   ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2022-12-29  8:12 UTC (permalink / raw)
  To: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel
  Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
	Theodore Ts'o, Andreas Dilger, Bart Van Assche, Daniil Lunev,
	Darrick J. Wong

Add support for provision requests to loop devices. A loop device
advertises provision support depending on whether its backing block
device or file can service the request. When it receives a provision
request, it forwards it to the backing store by calling fallocate() with
FALLOC_FL_PROVISION on the backing file.

Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
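The two decisions this patch makes can be modelled in plain C. This is a
simplified userspace sketch, not the driver code: struct backing_model
and both function names are hypothetical stand-ins for the checks done
in loop_config_provision() and lo_req_provision().

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

#define SECTOR_SHIFT 9

/* Hypothetical stand-in for what loop_config_provision() inspects. */
struct backing_model {
	bool is_blockdev;			/* S_ISBLK(inode->i_mode) */
	bool has_fallocate;			/* file->f_op->fallocate != NULL */
	unsigned int bdev_max_provision;	/* limit of a backing bdev */
};

/* Models the three-way choice made when the loop device is configured. */
static unsigned int model_max_provision_sectors(const struct backing_model *b)
{
	if (b->is_blockdev)
		return b->bdev_max_provision;	/* mirror the backing device */
	if (b->has_fallocate)
		return UINT_MAX >> SECTOR_SHIFT;	/* file that can fallocate */
	return 0;				/* provisioning disabled */
}

/*
 * Models the status mapping in lo_req_provision(): 0, -EINVAL and
 * -EOPNOTSUPP pass through unchanged, any other error becomes -EIO.
 */
static int model_provision_status(unsigned int max_sectors, int fallocate_ret)
{
	if (!max_sectors)
		return -EOPNOTSUPP;
	if (fallocate_ret && fallocate_ret != -EINVAL &&
	    fallocate_ret != -EOPNOTSUPP)
		return -EIO;
	return fallocate_ret;
}
```

One consequence of the mapping as written in the patch is that -ENOSPC
from the backing file reaches the submitter as -EIO.
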
 drivers/block/loop.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 1518a6423279..64ebb0d60c0e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -327,6 +327,24 @@ static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
 	return ret;
 }
 
+static int lo_req_provision(struct loop_device *lo, struct request *rq, loff_t pos)
+{
+	struct file *file = lo->lo_backing_file;
+	struct request_queue *q = lo->lo_queue;
+	int ret;
+
+	if (!q->limits.max_provision_sectors) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	ret = file->f_op->fallocate(file, FALLOC_FL_PROVISION, pos, blk_rq_bytes(rq));
+	if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
+		ret = -EIO;
+ out:
+	return ret;
+}
+
 static int lo_req_flush(struct loop_device *lo, struct request *rq)
 {
 	int ret = vfs_fsync(lo->lo_backing_file, 0);
@@ -488,6 +506,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 				FALLOC_FL_PUNCH_HOLE);
 	case REQ_OP_DISCARD:
 		return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
+	case REQ_OP_PROVISION:
+		return lo_req_provision(lo, rq, pos);
 	case REQ_OP_WRITE:
 		if (cmd->use_aio)
 			return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
@@ -754,6 +774,25 @@ static void loop_sysfs_exit(struct loop_device *lo)
 				   &loop_attribute_group);
 }
 
+static void loop_config_provision(struct loop_device *lo)
+{
+	struct file *file = lo->lo_backing_file;
+	struct inode *inode = file->f_mapping->host;
+
+	/*
+	 * If the backing device is a block device, mirror its provisioning
+	 * capability.
+	 */
+	if (S_ISBLK(inode->i_mode)) {
+		blk_queue_max_provision_sectors(lo->lo_queue,
+			bdev_max_provision_sectors(I_BDEV(inode)));
+	} else if (file->f_op->fallocate) {
+		blk_queue_max_provision_sectors(lo->lo_queue, UINT_MAX >> 9);
+	} else {
+		blk_queue_max_provision_sectors(lo->lo_queue, 0);
+	}
+}
+
 static void loop_config_discard(struct loop_device *lo)
 {
 	struct file *file = lo->lo_backing_file;
@@ -1092,6 +1131,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	blk_queue_io_min(lo->lo_queue, bsize);
 
 	loop_config_discard(lo);
+	loop_config_provision(lo);
 	loop_update_rotational(lo);
 	loop_update_dio(lo);
 	loop_sysfs_init(lo);
@@ -1304,6 +1344,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 	}
 
 	loop_config_discard(lo);
+	loop_config_provision(lo);
 
 	/* update dio if lo_offset or transfer is changed */
 	__loop_update_dio(lo, lo->use_dio);
@@ -1824,6 +1865,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case REQ_OP_FLUSH:
 	case REQ_OP_DISCARD:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_PROVISION:
 		cmd->use_aio = false;
 		break;
 	default:
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 5/7] ext4: Add support for FALLOC_FL_PROVISION
  2022-12-29  8:12 ` [dm-devel] " Sarthak Kukreti
@ 2022-12-29  8:12   ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2022-12-29  8:12 UTC (permalink / raw)
  To: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel
  Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
	Theodore Ts'o, Andreas Dilger, Bart Van Assche, Daniil Lunev,
	Darrick J. Wong

Once ext4 has mapped blocks for an fallocate() request made with
FALLOC_FL_PROVISION, send a REQ_OP_PROVISION request to the underlying
block device so that space is provisioned for the newly allocated extent
or indirect blocks.

fallocate() calls made with this flag are expected to be slower because
of the extra REQ_OP_PROVISION requests sent to the underlying storage.

Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
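sb_issue_provision() converts filesystem blocks to 512-byte sectors
before calling blkdev_issue_provision(). A small sketch of that
arithmetic, with the hypothetical name fs_blocks_to_sectors() standing
in for the two shifts in the helper:

```c
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/*
 * Convert a block number or block count in filesystem blocks to
 * 512-byte sectors. blocksize_bits is log2 of the block size, the
 * role sb->s_blocksize_bits plays in sb_issue_provision().
 */
static uint64_t fs_blocks_to_sectors(uint64_t blocks,
				     unsigned int blocksize_bits)
{
	return blocks << (blocksize_bits - SECTOR_SHIFT);
}
```

With 4 KiB blocks (blocksize_bits of 12) every block covers 8 sectors,
so a 10-block extent starting at block 100 becomes a provision request
for 80 sectors starting at sector 800.
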
 fs/ext4/ext4.h         |  2 ++
 fs/ext4/extents.c      | 15 ++++++++++++++-
 fs/ext4/indirect.c     |  9 +++++++++
 include/linux/blkdev.h | 11 +++++++++++
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 140e1eb300d1..49832e90b62f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -673,6 +673,8 @@ enum {
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 	/* Caller is in the atomic contex, find extent if it has been cached */
 #define EXT4_GET_BLOCKS_CACHED_NOWAIT		0x0800
+	/* Provision blocks on underlying storage */
+#define EXT4_GET_BLOCKS_PROVISION		0x1000
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9de1c9d1a13d..2e64a9211792 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4361,6 +4361,13 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		}
 	}
 
+	/* Attempt to provision blocks on underlying storage */
+	if (flags & EXT4_GET_BLOCKS_PROVISION) {
+		err = sb_issue_provision(inode->i_sb, pblk, ar.len, GFP_NOFS);
+		if (err)
+			goto out;
+	}
+
 	/*
 	 * Cache the extent and update transaction to commit on fdatasync only
 	 * when it is _not_ an unwritten extent.
@@ -4694,7 +4701,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	/* Return error if mode is not supported */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
 		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
-		     FALLOC_FL_INSERT_RANGE))
+		     FALLOC_FL_INSERT_RANGE | FALLOC_FL_PROVISION))
 		return -EOPNOTSUPP;
 
 	inode_lock(inode);
@@ -4754,6 +4761,12 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (ret)
 		goto out;
 
+	/* Ensure that preallocation provisions the blocks on the underlying
+	 * storage device.
+	 */
+	if (mode & FALLOC_FL_PROVISION)
+		flags |= EXT4_GET_BLOCKS_PROVISION;
+
 	ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, flags);
 	if (ret)
 		goto out;
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index c68bebe7ff4b..a8065aae7563 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -647,6 +647,15 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
 	if (err)
 		goto cleanup;
 
+	/* Attempt to provision blocks on underlying storage */
+	if (flags & EXT4_GET_BLOCKS_PROVISION) {
+		err = sb_issue_provision(inode->i_sb,
+					 le32_to_cpu(chain[depth-1].key),
+					 ar.len, GFP_NOFS);
+		if (err)
+			goto out;
+	}
+
 	map->m_flags |= EXT4_MAP_NEW;
 
 	ext4_update_inode_fsync_trans(handle, inode, 1);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f1abc7b43e25..b2e3244e9f3d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1093,6 +1093,17 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
 				    gfp_mask, 0);
 }
 
+static inline int sb_issue_provision(struct super_block *sb, sector_t block,
+		sector_t nr_blocks, gfp_t gfp_mask)
+{
+	return blkdev_issue_provision(sb->s_bdev,
+				      block << (sb->s_blocksize_bits -
+					      SECTOR_SHIFT),
+				      nr_blocks << (sb->s_blocksize_bits -
+						    SECTOR_SHIFT),
+				      gfp_mask);
+}
+
 static inline bool bdev_is_partition(struct block_device *bdev)
 {
 	return bdev->bd_partno;
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 6/7] ext4: Add mount option for provisioning blocks during allocations
  2022-12-29  8:12 ` [dm-devel] " Sarthak Kukreti
@ 2022-12-29  8:12   ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2022-12-29  8:12 UTC (permalink / raw)
  To: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel
  Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
	Theodore Ts'o, Andreas Dilger, Bart Van Assche, Daniil Lunev,
	Darrick J. Wong

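Add 'provision' and 'noprovision' mount options. With 'provision', ext4
provisions blocks on the underlying storage for every allocation made
through ext4_alloc_file_blocks(), as if FALLOC_FL_PROVISION had been
passed to fallocate(). This sets the default provisioning behavior for
all files within the filesystem.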

Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
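A simplified model of how the two options act on the mount flags,
assuming options are applied in the order given so that the last one
wins. apply_provision_opts() is a hypothetical name; the real parsing
goes through ext4_param_specs and the mount_opts table, and only the
EXT4_MOUNT2_PROVISION value is taken from this patch.

```c
#include <stdint.h>
#include <string.h>

#define EXT4_MOUNT2_PROVISION 0x00000100	/* value from this patch */

/*
 * Walk a comma-separated option string and apply "provision"
 * (MOPT_SET) and "noprovision" (MOPT_CLEAR) to mount_opt2.
 * All other options are ignored by this model.
 */
static uint32_t apply_provision_opts(uint32_t mount_opt2, const char *opts)
{
	while (*opts) {
		size_t n = strcspn(opts, ",");

		if (n == strlen("provision") &&
		    !strncmp(opts, "provision", n))
			mount_opt2 |= EXT4_MOUNT2_PROVISION;
		else if (n == strlen("noprovision") &&
			 !strncmp(opts, "noprovision", n))
			mount_opt2 &= ~(uint32_t)EXT4_MOUNT2_PROVISION;

		opts += n;
		if (*opts == ',')
			opts++;
	}
	return mount_opt2;
}
```
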
 fs/ext4/ext4.h    | 1 +
 fs/ext4/extents.c | 7 +++++++
 fs/ext4/super.c   | 7 +++++++
 3 files changed, 15 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 49832e90b62f..29cab2e2ea20 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1269,6 +1269,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT2_MB_OPTIMIZE_SCAN	0x00000080 /* Optimize group
 						    * scanning in mballoc
 						    */
+#define EXT4_MOUNT2_PROVISION		0x00000100 /* Provision while allocating file blocks */
 
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2e64a9211792..a73f44264fe2 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4441,6 +4441,13 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	unsigned int credits;
 	loff_t epos;
 
+	/*
+	 * Attempt to provision file blocks if the mount is mounted with
+	 * provision.
+	 */
+	if (test_opt2(inode->i_sb, PROVISION))
+		flags |= EXT4_GET_BLOCKS_PROVISION;
+
 	BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
 	map.m_lblk = offset;
 	map.m_len = len;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 260c1b3e3ef2..5bc376f6a6f0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1591,6 +1591,7 @@ enum {
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
 	Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
 	Opt_errors, Opt_data, Opt_data_err, Opt_jqfmt, Opt_dax_type,
+	Opt_provision, Opt_noprovision,
 #ifdef CONFIG_EXT4_DEBUG
 	Opt_fc_debug_max_replay, Opt_fc_debug_force
 #endif
@@ -1737,6 +1738,8 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
 	fsparam_flag	("reservation",		Opt_removed),	/* mount option from ext2/3 */
 	fsparam_flag	("noreservation",	Opt_removed),	/* mount option from ext2/3 */
 	fsparam_u32	("journal",		Opt_removed),	/* mount option from ext2/3 */
+	fsparam_flag	("provision",		Opt_provision),
+	fsparam_flag	("noprovision",		Opt_noprovision),
 	{}
 };
 
@@ -1826,6 +1829,8 @@ static const struct mount_opts {
 	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
 	{Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
 	 MOPT_SET},
+	{Opt_provision, EXT4_MOUNT2_PROVISION, MOPT_SET | MOPT_2},
+	{Opt_noprovision, EXT4_MOUNT2_PROVISION, MOPT_CLEAR | MOPT_2},
 #ifdef CONFIG_EXT4_DEBUG
 	{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
 	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
@@ -2977,6 +2982,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 		SEQ_OPTS_PUTS("dax=never");
 	} else if (test_opt2(sb, DAX_INODE)) {
 		SEQ_OPTS_PUTS("dax=inode");
+	} else if (test_opt2(sb, PROVISION)) {
+		SEQ_OPTS_PUTS("provision");
 	}
 
 	if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD &&
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 7/7] ext4: Add a per-file provision override xattr
  2022-12-29  8:12 ` [dm-devel] " Sarthak Kukreti
@ 2022-12-29  8:12   ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2022-12-29  8:12 UTC (permalink / raw)
  To: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel
  Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
	Theodore Ts'o, Andreas Dilger, Bart Van Assche, Daniil Lunev,
	Darrick J. Wong

Add a per-file provision override that lets selected files override the
per-mount setting for provisioning blocks on allocation.

This lets a filesystem mounted with 'provision' keep the current
fallocate() behavior for those files and reserve space only at the
filesystem level.

Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
 fs/ext4/extents.c | 32 ++++++++++++++++++++++++++++++++
 fs/ext4/xattr.h   |  1 +
 2 files changed, 33 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a73f44264fe2..9861115681b3 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4428,6 +4428,26 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode)
 	return err;
 }
 
+static int ext4_file_provision_support(struct inode *inode)
+{
+	char provision;
+	int ret =
+		ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED,
+			       EXT4_XATTR_NAME_PROVISION_POLICY, &provision, 1);
+
+	if (ret < 0)
+		return ret;
+
+	switch (provision) {
+	case 'y':
+		return 1;
+	case 'n':
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 				  ext4_lblk_t len, loff_t new_size,
 				  int flags)
@@ -4440,12 +4460,24 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	struct ext4_map_blocks map;
 	unsigned int credits;
 	loff_t epos;
+	bool provision = false;
+	int file_provision_override = -1;
 
 	/*
 	 * Attempt to provision file blocks if the mount is mounted with
 	 * provision.
 	 */
 	if (test_opt2(inode->i_sb, PROVISION))
+		provision = true;
+
+	/*
+	 * Use file-specific override, if available.
+	 */
+	file_provision_override = ext4_file_provision_support(inode);
+	if (file_provision_override >= 0)
+		provision &= file_provision_override;
+
+	if (provision)
 		flags |= EXT4_GET_BLOCKS_PROVISION;
 
 	BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 824faf0b15a8..69e97f853b0c 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -140,6 +140,7 @@ extern const struct xattr_handler ext4_xattr_security_handler;
 extern const struct xattr_handler ext4_xattr_hurd_handler;
 
 #define EXT4_XATTR_NAME_ENCRYPTION_CONTEXT "c"
+#define EXT4_XATTR_NAME_PROVISION_POLICY "provision"
 
 /*
  * The EXT4_STATE_NO_EXPAND is overloaded and used for two purposes.
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 46+ messages in thread
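
The way the per-mount option and the per-file xattr combine in ext4_alloc_file_blocks() above can be sketched in user-space C. This is only an illustration, assuming the xattr has already been read into a single byte. Both provision_policy_from_xattr() and should_provision() are hypothetical names that are not part of the patch, and the explicit check for an empty value is an addition of this sketch, since the patch does not appear to handle a zero-length value separately. Because the patch combines the two settings with `&=`, a 'y' override cannot enable provisioning on a mount without 'provision'; it can only leave it on or turn it off.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper mirroring ext4_file_provision_support().
 * 'len' is the xattr read result (negative errno on failure, otherwise
 * the number of bytes read) and 'value' is the first byte of the xattr.
 * Returns 1 for 'y', 0 for 'n', or a negative errno.
 */
static int provision_policy_from_xattr(ssize_t len, char value)
{
	if (len < 0)
		return (int)len;	/* e.g. -ENODATA: no override set */
	if (len == 0)
		return -EINVAL;		/* assumption of this sketch */

	switch (value) {
	case 'y':
		return 1;
	case 'n':
		return 0;
	default:
		return -EINVAL;
	}
}

/*
 * Hypothetical helper mirroring the logic added to
 * ext4_alloc_file_blocks(): start from the mount option, then AND in
 * the per-file override if one was read successfully.
 */
static bool should_provision(bool mount_provision, int file_override)
{
	bool provision = mount_provision;

	if (file_override >= 0)
		provision = provision && file_override;

	return provision;
}
```

With these helpers, a file carrying 'n' on a 'provision' mount is not provisioned, while a file with no override (a negative result such as -ENODATA) simply follows the mount option.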

* Re: [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2022-12-29  8:12   ` [dm-devel] " Sarthak Kukreti
@ 2023-01-04 16:39     ` Darrick J. Wong
  -1 siblings, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2023-01-04 16:39 UTC (permalink / raw)
  To: Sarthak Kukreti
  Cc: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel, Jens Axboe, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Alasdair Kergon, Mike Snitzer,
	Christoph Hellwig, Brian Foster, Theodore Ts'o,
	Andreas Dilger, Bart Van Assche, Daniil Lunev

On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> sends a hint to (supported) thinly provisioned block devices to
> allocate space for the given range of sectors via REQ_OP_PROVISION.
> 
> The man pages for both fallocate(2) and posix_fallocate(3) describe
> the default allocation mode as:
> 
> ```
> The default operation (i.e., mode is zero) of fallocate()
> allocates the disk space within the range specified by offset and len.
> ...
> subsequent writes to bytes in the specified range are guaranteed
> not to fail because of lack of disk space.
> ```
> 
> For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> files), the term 'disk space' is overloaded and can either mean the apparent
> disk space in the filesystem/thin logical volume or the true disk
> space that will be utilized on the underlying non-sparse allocation layer.
> 
> The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> causes allocation only at the current layer (default mode) or whether it propagates
> allocations to underlying layers (provision mode)

Why is it important to make this distinction?  The outcome of fallocate
is supposed to be that subsequent writes do not fail with ENOSPC.  In my
(fs developer) mind, REQ_OP_PROVISION is simply an extra step to be taken
after allocating file blocks.

If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
call into mode 0 fallocate, then the new functionality can be added (or
even backported) to existing kernels and customers can use it
immediately.  If you *do*, then you get to wait a few years for
developers to add it to their codebases only after enough enterprise
distros pick up a new kernel to make it worth their while.

> for thinly provisioned filesystems/
> block devices. For devices that do not support REQ_OP_PROVISION, both these
> allocation modes will be equivalent. Given the performance cost of sending provision
> requests to the underlying layers, keeping the default mode as-is allows users to
> preserve existing behavior.

How expensive is this expected to be?  Is this why you wanted a separate
mode flag?

--D

> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> ---
>  block/fops.c                | 15 +++++++++++----
>  include/linux/falloc.h      |  3 ++-
>  include/uapi/linux/falloc.h |  8 ++++++++
>  3 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/block/fops.c b/block/fops.c
> index 50d245e8c913..01bde561e1e2 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  
>  #define	BLKDEV_FALLOC_FL_SUPPORTED					\
>  		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
> -		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> +		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |	\
> +		 FALLOC_FL_PROVISION)
>  
>  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  			     loff_t len)
> @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  	filemap_invalidate_lock(inode->i_mapping);
>  
>  	/* Invalidate the page cache, including dirty pages. */
> -	error = truncate_bdev_range(bdev, file->f_mode, start, end);
> -	if (error)
> -		goto fail;
> +	if (mode != FALLOC_FL_PROVISION) {
> +		error = truncate_bdev_range(bdev, file->f_mode, start, end);
> +		if (error)
> +			goto fail;
> +	}
>  
>  	switch (mode) {
>  	case FALLOC_FL_ZERO_RANGE:
> @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  		error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
>  					     len >> SECTOR_SHIFT, GFP_KERNEL);
>  		break;
> +	case FALLOC_FL_PROVISION:
> +		error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> +					       len >> SECTOR_SHIFT, GFP_KERNEL);
> +		break;
>  	default:
>  		error = -EOPNOTSUPP;
>  	}
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index f3f0b97b1675..b9a40a61a59b 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -30,7 +30,8 @@ struct space_resv {
>  					 FALLOC_FL_COLLAPSE_RANGE |	\
>  					 FALLOC_FL_ZERO_RANGE |		\
>  					 FALLOC_FL_INSERT_RANGE |	\
> -					 FALLOC_FL_UNSHARE_RANGE)
> +					 FALLOC_FL_UNSHARE_RANGE |	\
> +					 FALLOC_FL_PROVISION)
>  
>  /* on ia32 l_start is on a 32-bit boundary */
>  #if defined(CONFIG_X86_64)
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 51398fa57f6c..2d323d113eed 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -77,4 +77,12 @@
>   */
>  #define FALLOC_FL_UNSHARE_RANGE		0x40
>  
> +/*
> + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> + * blocks for the range/EOF.
> + *
> + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> + */
> +#define FALLOC_FL_PROVISION		0x80
> +
>  #endif /* _UAPI_FALLOC_H_ */
> -- 
> 2.37.3
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread
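
From user space, the proposed mode would be requested like any other fallocate() mode. The following is a minimal sketch, assuming a kernel with this series applied. FALLOC_FL_PROVISION is not available from released uapi headers as far as this thread shows, so the value 0x80 is copied from the quoted patch; falloc_mode_for() and provision_range() are hypothetical helper names. On kernels without the series the bit may be rejected or may carry a different meaning, so treat this as illustrative only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for the fallocate() declaration in glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Value proposed in this patch; an assumption outside a patched kernel. */
#ifndef FALLOC_FL_PROVISION
#define FALLOC_FL_PROVISION 0x80
#endif

/*
 * Hypothetical helper: pick the fallocate() mode for a request.
 * A non-zero want_provision selects the proposed provision mode,
 * otherwise the default (mode 0) allocation is used.
 */
static int falloc_mode_for(int want_provision)
{
	return want_provision ? FALLOC_FL_PROVISION : 0;
}

/*
 * Hypothetical helper: allocate [offset, offset + len) in fd.
 * Returns 0 on success or a negative errno on failure.
 */
static int provision_range(int fd, off_t offset, off_t len, int want_provision)
{
	if (fallocate(fd, falloc_mode_for(want_provision), offset, len) < 0)
		return -errno;
	return 0;
}
```

A caller that cannot rely on the new mode could retry with want_provision set to 0 when provision_range() returns -EOPNOTSUPP, which keeps the existing mode 0 behavior discussed in this reply.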

* Re: [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2023-01-04 16:39     ` [dm-devel] " Darrick J. Wong
@ 2023-01-04 18:58       ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2023-01-04 18:58 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel, Jens Axboe, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Alasdair Kergon, Mike Snitzer,
	Christoph Hellwig, Brian Foster, Theodore Ts'o,
	Andreas Dilger, Bart Van Assche, Daniil Lunev

On Wed, Jan 4, 2023 at 8:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > sends a hint to (supported) thinly provisioned block devices to
> > allocate space for the given range of sectors via REQ_OP_PROVISION.
> >
> > The man pages for both fallocate(2) and posix_fallocate(3) describe
> > the default allocation mode as:
> >
> > ```
> > The default operation (i.e., mode is zero) of fallocate()
> > allocates the disk space within the range specified by offset and len.
> > ...
> > subsequent writes to bytes in the specified range are guaranteed
> > not to fail because of lack of disk space.
> > ```
> >
> > For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> > files), the term 'disk space' is overloaded and can either mean the apparent
> > disk space in the filesystem/thin logical volume or the true disk
> > space that will be utilized on the underlying non-sparse allocation layer.
> >
> > The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> > causes allocation only at the current layer (default mode) or whether it propagates
> > allocations to underlying layers (provision mode)
>
> Why is it important to make this distinction?  The outcome of fallocate
> is supposed to be that subsequent writes do not fail with ENOSPC.  In my
> (fs developer) mind, REQ_OP_PROVISION is simply an extra step to be taken
> after allocating file blocks.
>
Some use cases still benefit from keeping the default mode - e.g. virtual
machines running on massive storage pools that don't expect to hit the
storage limit anytime soon (like most cloud storage providers).
Essentially, if the 'no ENOSPC' guarantee is maintained via other means,
then REQ_OP_PROVISION adds latency that isn't needed (and cloud storage
providers don't need to set aside that extra space that may or may not
be used).

> If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
> call into mode 0 fallocate, then the new functionality can be added (or
> even backported) to existing kernels and customers can use it
> immediately.  If you *do*, then you get to wait a few years for
> developers to add it to their codebases only after enough enterprise
> distros pick up a new kernel to make it worth their while.
>
> > for thinly provisioned filesystems/
> > block devices. For devices that do not support REQ_OP_PROVISION, both these
> > allocation modes will be equivalent. Given the performance cost of sending provision
> > requests to the underlying layers, keeping the default mode as-is allows users to
> > preserve existing behavior.
>
> How expensive is this expected to be?  Is this why you wanted a separate
> mode flag?
>
Yes, the exact latency will depend on the stacked block devices and the
fragmentation at the allocation layers.

I did a quick test benchmarking fallocate() on:
A) an ext4 filesystem mounted with 'noprovision'.
B) an ext4 filesystem mounted with 'provision' on a dm-thin device.
C) an ext4 filesystem mounted with 'provision' on a loop device with a
   sparse backing file on the filesystem in (B).

I tested file sizes from 512M to 8G. The time taken for fallocate() in (A)
remains flat, as expected, at ~0.01-0.02s, but for (B) it scales from
0.03s to 0.4s and for (C) it scales from 0.04s to 0.52s (I captured the
exact time distribution in the cover letter:
https://marc.info/?l=linux-ext4&m=167230113520636&w=2).

+0.5s for an 8G fallocate doesn't sound like a lot, but I think
fragmentation and how the block device is layered can make this worse...

> --D
>
> > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > ---
> >  block/fops.c                | 15 +++++++++++----
> >  include/linux/falloc.h      |  3 ++-
> >  include/uapi/linux/falloc.h |  8 ++++++++
> >  3 files changed, 21 insertions(+), 5 deletions(-)
> >
> > diff --git a/block/fops.c b/block/fops.c
> > index 50d245e8c913..01bde561e1e2 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >
> >  #define      BLKDEV_FALLOC_FL_SUPPORTED                                      \
> >               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
> > -              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> > +              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |       \
> > +              FALLOC_FL_PROVISION)
> >
> >  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> >                            loff_t len)
> > @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> >       filemap_invalidate_lock(inode->i_mapping);
> >
> >       /* Invalidate the page cache, including dirty pages. */
> > -     error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > -     if (error)
> > -             goto fail;
> > +     if (mode != FALLOC_FL_PROVISION) {
> > +             error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > +             if (error)
> > +                     goto fail;
> > +     }
> >
> >       switch (mode) {
> >       case FALLOC_FL_ZERO_RANGE:
> > @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> >               error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> >                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> >               break;
> > +     case FALLOC_FL_PROVISION:
> > +             error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > +                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > +             break;
> >       default:
> >               error = -EOPNOTSUPP;
> >       }
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index f3f0b97b1675..b9a40a61a59b 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -30,7 +30,8 @@ struct space_resv {
> >                                        FALLOC_FL_COLLAPSE_RANGE |     \
> >                                        FALLOC_FL_ZERO_RANGE |         \
> >                                        FALLOC_FL_INSERT_RANGE |       \
> > -                                      FALLOC_FL_UNSHARE_RANGE)
> > +                                      FALLOC_FL_UNSHARE_RANGE |      \
> > +                                      FALLOC_FL_PROVISION)
> >
> >  /* on ia32 l_start is on a 32-bit boundary */
> >  #if defined(CONFIG_X86_64)
> > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > index 51398fa57f6c..2d323d113eed 100644
> > --- a/include/uapi/linux/falloc.h
> > +++ b/include/uapi/linux/falloc.h
> > @@ -77,4 +77,12 @@
> >   */
> >  #define FALLOC_FL_UNSHARE_RANGE              0x40
> >
> > +/*
> > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > + * blocks for the range/EOF.
> > + *
> > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > + */
> > +#define FALLOC_FL_PROVISION          0x80
> > +
> >  #endif /* _UAPI_FALLOC_H_ */
> > --
> > 2.37.3
> >

^ permalink raw reply	[flat|nested] 46+ messages in thread
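
A measurement like the one described in this reply could be taken with a small timing helper around fallocate(). This is a sketch under assumptions, not the benchmark actually used in the thread: time_fallocate() is a hypothetical name, and it times a single call from offset 0 using CLOCK_MONOTONIC.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for the fallocate() declaration in glibc */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical helper: time one fallocate() call covering [0, len) of fd.
 * Returns the elapsed wall-clock time in seconds, or -1.0 if reading the
 * clock or fallocate() itself fails (errno is left set by the failing
 * call).
 */
static double time_fallocate(int fd, int mode, off_t len)
{
	struct timespec start, end;

	if (clock_gettime(CLOCK_MONOTONIC, &start) < 0)
		return -1.0;
	if (fallocate(fd, mode, 0, len) < 0)
		return -1.0;
	if (clock_gettime(CLOCK_MONOTONIC, &end) < 0)
		return -1.0;

	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling it with mode 0 on a fresh file under each of the mounts (A), (B) and (C), for lengths from 512M to 8G, would follow the setup described above; the absolute numbers will depend on the device stack and its fragmentation.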

* Re: [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2023-01-04 16:39     ` [dm-devel] " Darrick J. Wong
@ 2023-01-04 21:22       ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2023-01-04 21:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel, Jens Axboe, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Alasdair Kergon, Mike Snitzer,
	Christoph Hellwig, Brian Foster, Theodore Ts'o,
	Andreas Dilger, Bart Van Assche, Daniil Lunev

(Resend; the text flow made the last reply unreadable)

On Wed, Jan 4, 2023 at 8:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > sends a hint to (supported) thinly provisioned block devices to
> > allocate space for the given range of sectors via REQ_OP_PROVISION.
> >
> > The man pages for both fallocate(2) and posix_fallocate(3) describe
> > the default allocation mode as:
> >
> > ```
> > The default operation (i.e., mode is zero) of fallocate()
> > allocates the disk space within the range specified by offset and len.
> > ...
> > subsequent writes to bytes in the specified range are guaranteed
> > not to fail because of lack of disk space.
> > ```
> >
> > For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> > files), the term 'disk space' is overloaded and can either mean the apparent
> > disk space in the filesystem/thin logical volume or the true disk
> > space that will be utilized on the underlying non-sparse allocation layer.
> >
> > The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> > causes allocation only at the current layer (default mode) or whether it propagates
> > allocations to underlying layers (provision mode)
>
> Why is it important to make this distinction?  The outcome of fallocate
> is supposed to be that subsequent writes do not fail with ENOSPC.  In my
> (fs developer) mind, REQ_OP_PROVISION simply an extra step to be taken
> after allocating file blocks.
>
Some use cases still benefit from keeping the default mode, e.g.
virtual machines running on massive storage pools that don't expect to
hit the storage limit anytime soon (like most cloud storage
providers). Essentially, if the 'no ENOSPC' guarantee is maintained
via other means, then REQ_OP_PROVISION adds latency that isn't needed
(and cloud storage providers don't need to set aside extra space
that may or may not be used).
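
To make the distinction concrete, here is a minimal userspace sketch
(not part of this series) of how a caller could ask for provision mode
and fall back to the default mode where the flag is rejected. The 0x80
value is the one proposed in this patch and is not defined by mainline
headers; preallocate(), falloc_fn and the two fake_* helpers are
hypothetical names used only for illustration. The fallocate call is
passed in as a function pointer so the fallback logic can be exercised
without a patched kernel.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Value proposed by this patch; not defined by mainline headers. */
#ifndef FALLOC_FL_PROVISION
#define FALLOC_FL_PROVISION 0x80
#endif

/* Same shape as fallocate(2), so the real syscall can be plugged in. */
typedef int (*falloc_fn)(int fd, int mode, off_t offset, off_t len);

/*
 * Try provision mode first; if the flag is rejected with EOPNOTSUPP,
 * fall back to the default mode. Other errors (ENOSPC, EIO, ...) are
 * not retried. Returns 0 on success, -1 with errno set on failure.
 * *provisioned reports whether the provision-mode call succeeded.
 */
int preallocate(falloc_fn falloc, int fd, off_t offset, off_t len,
		bool *provisioned)
{
	*provisioned = false;
	if (falloc(fd, FALLOC_FL_PROVISION, offset, len) == 0) {
		*provisioned = true;
		return 0;
	}
	if (errno != EOPNOTSUPP)
		return -1;
	return falloc(fd, 0, offset, len);
}

/*
 * Hypothetical stand-in for a kernel without this patch: any non-zero
 * mode is rejected with EOPNOTSUPP, mode 0 succeeds.
 */
int fake_unpatched_fallocate(int fd, int mode, off_t offset, off_t len)
{
	(void)fd; (void)offset; (void)len;
	if (mode != 0) {
		errno = EOPNOTSUPP;
		return -1;
	}
	return 0;
}

/* Hypothetical stand-in for a thin pool that has run out of space. */
int fake_full_pool_fallocate(int fd, int mode, off_t offset, off_t len)
{
	(void)fd; (void)mode; (void)offset; (void)len;
	errno = ENOSPC;
	return -1;
}
```

In real use the first argument would be fallocate(2) itself (declared in
<fcntl.h> with _GNU_SOURCE). Falling back only on EOPNOTSUPP keeps a
genuine ENOSPC from being masked by a mode 0 retry that could succeed at
the filesystem layer alone.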

> If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
> call into mode 0 fallocate, then the new functionality can be added (or
> even backported) to existing kernels and customers can use it
> immediately.  If you *do*, then you get to wait a few years for
> developers to add it to their codebases only after enough enterprise
> distros pick up a new kernel to make it worth their while.
>
> > for thinly provisioned filesystems/
> > block devices. For devices that do not support REQ_OP_PROVISION, both these
> > allocation modes will be equivalent. Given the performance cost of sending provision
> > requests to the underlying layers, keeping the default mode as-is allows users to
> > preserve existing behavior.
>
> How expensive is this expected to be?  Is this why you wanted a separate
> mode flag?
>
Yes, the exact latency will depend on the stacked block devices and
the fragmentation at the allocation layers.

I did a quick test benchmarking fallocate() with:
A) an ext4 filesystem mounted with 'noprovision'.
B) an ext4 filesystem mounted with 'provision' on a dm-thin device.
C) an ext4 filesystem mounted with 'provision' on a loop device with a
sparse backing file on the filesystem in (B).

I tested file sizes from 512M to 8G. The time taken for fallocate() in
(A) remains flat, as expected, at ~0.01-0.02s, but for (B) it scales
from 0.03s to 0.4s and for (C) from 0.04s to 0.52s (I captured the exact
time distribution in the cover letter:
https://marc.info/?l=linux-ext4&m=167230113520636&w=2).

+0.5s for an 8G fallocate doesn't sound like a lot, but I think
fragmentation and how the block device is layered can make this worse...

> --D
>
> > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > ---
> >  block/fops.c                | 15 +++++++++++----
> >  include/linux/falloc.h      |  3 ++-
> >  include/uapi/linux/falloc.h |  8 ++++++++
> >  3 files changed, 21 insertions(+), 5 deletions(-)
> >
> > diff --git a/block/fops.c b/block/fops.c
> > index 50d245e8c913..01bde561e1e2 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >
> >  #define      BLKDEV_FALLOC_FL_SUPPORTED                                      \
> >               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
> > -              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> > +              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |       \
> > +              FALLOC_FL_PROVISION)
> >
> >  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> >                            loff_t len)
> > @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> >       filemap_invalidate_lock(inode->i_mapping);
> >
> >       /* Invalidate the page cache, including dirty pages. */
> > -     error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > -     if (error)
> > -             goto fail;
> > +     if (mode != FALLOC_FL_PROVISION) {
> > +             error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > +             if (error)
> > +                     goto fail;
> > +     }
> >
> >       switch (mode) {
> >       case FALLOC_FL_ZERO_RANGE:
> > @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> >               error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> >                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> >               break;
> > +     case FALLOC_FL_PROVISION:
> > +             error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > +                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > +             break;
> >       default:
> >               error = -EOPNOTSUPP;
> >       }
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index f3f0b97b1675..b9a40a61a59b 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -30,7 +30,8 @@ struct space_resv {
> >                                        FALLOC_FL_COLLAPSE_RANGE |     \
> >                                        FALLOC_FL_ZERO_RANGE |         \
> >                                        FALLOC_FL_INSERT_RANGE |       \
> > -                                      FALLOC_FL_UNSHARE_RANGE)
> > +                                      FALLOC_FL_UNSHARE_RANGE |      \
> > +                                      FALLOC_FL_PROVISION)
> >
> >  /* on ia32 l_start is on a 32-bit boundary */
> >  #if defined(CONFIG_X86_64)
> > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > index 51398fa57f6c..2d323d113eed 100644
> > --- a/include/uapi/linux/falloc.h
> > +++ b/include/uapi/linux/falloc.h
> > @@ -77,4 +77,12 @@
> >   */
> >  #define FALLOC_FL_UNSHARE_RANGE              0x40
> >
> > +/*
> > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > + * blocks for the range/EOF.
> > + *
> > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > + */
> > +#define FALLOC_FL_PROVISION          0x80
> > +
> >  #endif /* _UAPI_FALLOC_H_ */
> > --
> > 2.37.3
> >

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dm-devel] [PATCH v2 2/7] dm: Add support for block provisioning
  2022-12-29  8:12   ` [dm-devel] " Sarthak Kukreti
@ 2023-01-05 14:43     ` Brian Foster
  -1 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2023-01-05 14:43 UTC (permalink / raw)
  To: Sarthak Kukreti
  Cc: Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Michael S. Tsirkin, sarthakkukreti, Darrick J. Wong, Jason Wang,
	Bart Van Assche, Mike Snitzer, linux-kernel, linux-block,
	dm-devel, Andreas Dilger, Daniil Lunev, Stefan Hajnoczi,
	linux-fsdevel, linux-ext4, Alasdair Kergon

On Thu, Dec 29, 2022 at 12:12:47AM -0800, Sarthak Kukreti wrote:
> Add support to dm devices for REQ_OP_PROVISION. The default mode
> is to pass through the request and dm-thin will utilize it to provision
> blocks.
> 
> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> ---
>  drivers/md/dm-crypt.c         |  4 +-
>  drivers/md/dm-linear.c        |  1 +
>  drivers/md/dm-snap.c          |  7 +++
>  drivers/md/dm-table.c         | 25 ++++++++++
>  drivers/md/dm-thin.c          | 90 ++++++++++++++++++++++++++++++++++-
>  drivers/md/dm.c               |  4 ++
>  include/linux/device-mapper.h | 11 +++++
>  7 files changed, 139 insertions(+), 3 deletions(-)
> 
...
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 64cfcf46881d..ab3f1abfabaf 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
...
> @@ -1980,6 +1992,70 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
>  	}
>  }
>  
> +static void process_provision_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> +{
> +	int r;
> +	struct pool *pool = tc->pool;
> +	struct bio *bio = cell->holder;
> +	dm_block_t begin, end;
> +	struct dm_thin_lookup_result lookup_result;
> +
> +	if (tc->requeue_mode) {
> +		cell_requeue(pool, cell);
> +		return;
> +	}
> +
> +	get_bio_block_range(tc, bio, &begin, &end);
> +
> +	while (begin != end) {
> +		r = ensure_next_mapping(pool);
> +		if (r)
> +			/* we did our best */
> +			return;
> +
> +		r = dm_thin_find_block(tc->td, begin, 1, &lookup_result);

Hi Sarthak,

I think we discussed this before, but remind me if/how we wanted to
handle the case where the thin blocks are shared? Would a provision op
carry enough information to distinguish an FALLOC_FL_UNSHARE_RANGE
request from upper layers, so that we could conditionally provision in
that case?

Brian
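
For reference, the per-block decision in the quoted loop can be modelled
in plain C as below. This is only a sketch: provision_decide() and the
enum are hypothetical names, `shared` stands for what the lookup result
would report, and `unshare_requested` stands for information that
REQ_OP_PROVISION does not carry in the patch as posted (which is the
question above). With unshare_requested fixed to false, the model
matches the quoted switch: mapped blocks are skipped whether or not they
are shared.

```c
#include <errno.h>
#include <stdbool.h>

enum provision_action {
	PROVISION_SKIP,          /* already mapped, nothing to do */
	PROVISION_ALLOCATE,      /* unmapped: allocate a new data block */
	PROVISION_BREAK_SHARING, /* mapped but shared: needs its own block */
	PROVISION_FAIL,          /* metadata lookup failed */
};

/*
 * Decide what to do with one thin block, given the lookup return code,
 * whether the existing mapping is shared, and whether the caller asked
 * for sharing to be broken.
 */
enum provision_action provision_decide(int lookup_err, bool shared,
				       bool unshare_requested)
{
	switch (lookup_err) {
	case 0:
		if (shared && unshare_requested)
			return PROVISION_BREAK_SHARING;
		return PROVISION_SKIP;
	case -ENODATA:
		return PROVISION_ALLOCATE;
	default:
		return PROVISION_FAIL;
	}
}
```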

> +		switch (r) {
> +		case 0:
> +			begin++;
> +			break;
> +		case -ENODATA:
> +			bio_inc_remaining(bio);
> +			provision_block(tc, bio, begin, cell);
> +			begin++;
> +			break;
> +		default:
> +			DMERR_LIMIT(
> +				"%s: dm_thin_find_block() failed: error = %d",
> +				__func__, r);
> +			cell_defer_no_holder(tc, cell);
> +			bio_io_error(bio);
> +			begin++;
> +			break;
> +		}
> +	}
> +	bio_endio(bio);
> +	cell_defer_no_holder(tc, cell);
> +}
> +
> +static void process_provision_bio(struct thin_c *tc, struct bio *bio)
> +{
> +	dm_block_t begin, end;
> +	struct dm_cell_key virt_key;
> +	struct dm_bio_prison_cell *virt_cell;
> +
> +	get_bio_block_range(tc, bio, &begin, &end);
> +	if (begin == end) {
> +		bio_endio(bio);
> +		return;
> +	}
> +
> +	build_key(tc->td, VIRTUAL, begin, end, &virt_key);
> +	if (bio_detain(tc->pool, &virt_key, bio, &virt_cell))
> +		return;
> +
> +	process_provision_cell(tc, virt_cell);
> +}
> +
>  static void process_bio(struct thin_c *tc, struct bio *bio)
>  {
>  	struct pool *pool = tc->pool;
> @@ -2200,6 +2276,8 @@ static void process_thin_deferred_bios(struct thin_c *tc)
>  
>  		if (bio_op(bio) == REQ_OP_DISCARD)
>  			pool->process_discard(tc, bio);
> +		else if (bio_op(bio) == REQ_OP_PROVISION)
> +			process_provision_bio(tc, bio);
>  		else
>  			pool->process_bio(tc, bio);
>  
> @@ -2716,7 +2794,8 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
>  		return DM_MAPIO_SUBMITTED;
>  	}
>  
> -	if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
> +	if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
> +	    bio_op(bio) == REQ_OP_PROVISION) {
>  		thin_defer_bio_with_throttle(tc, bio);
>  		return DM_MAPIO_SUBMITTED;
>  	}
> @@ -3355,6 +3434,8 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
>  	pt->low_water_blocks = low_water_blocks;
>  	pt->adjusted_pf = pt->requested_pf = pf;
>  	ti->num_flush_bios = 1;
> +	ti->num_provision_bios = 1;
> +	ti->provision_supported = true;
>  
>  	/*
>  	 * Only need to enable discards if the pool should pass
> @@ -4053,6 +4134,7 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
>  		blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
>  	}
>  
> +
>  	/*
>  	 * pt->adjusted_pf is a staging area for the actual features to use.
>  	 * They get transferred to the live pool in bind_control_target()
> @@ -4243,6 +4325,9 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
>  		ti->num_discard_bios = 1;
>  	}
>  
> +	ti->num_provision_bios = 1;
> +	ti->provision_supported = true;
> +
>  	mutex_unlock(&dm_thin_pool_table.mutex);
>  
>  	spin_lock_irq(&tc->pool->lock);
> @@ -4457,6 +4542,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
>  
>  	limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
>  	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> +	limits->max_provision_sectors = 2048 * 1024 * 16; /* 16G */
>  }
>  
>  static struct target_type thin_target = {
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index e1ea3a7bd9d9..4d19bae9da4a 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1587,6 +1587,7 @@ static bool is_abnormal_io(struct bio *bio)
>  		case REQ_OP_DISCARD:
>  		case REQ_OP_SECURE_ERASE:
>  		case REQ_OP_WRITE_ZEROES:
> +		case REQ_OP_PROVISION:
>  			return true;
>  		default:
>  			break;
> @@ -1611,6 +1612,9 @@ static blk_status_t __process_abnormal_io(struct clone_info *ci,
>  	case REQ_OP_WRITE_ZEROES:
>  		num_bios = ti->num_write_zeroes_bios;
>  		break;
> +	case REQ_OP_PROVISION:
> +		num_bios = ti->num_provision_bios;
> +		break;
>  	default:
>  		break;
>  	}
> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 04c6acf7faaa..b4d97d5d75b8 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -333,6 +333,12 @@ struct dm_target {
>  	 */
>  	unsigned num_write_zeroes_bios;
>  
> +	/*
> +	 * The number of PROVISION bios that will be submitted to the target.
> +	 * The bio number can be accessed with dm_bio_get_target_bio_nr.
> +	 */
> +	unsigned num_provision_bios;
> +
>  	/*
>  	 * The minimum number of extra bytes allocated in each io for the
>  	 * target to use.
> @@ -357,6 +363,11 @@ struct dm_target {
>  	 */
>  	bool discards_supported:1;
>  
> +	/* Set if this target needs to receive provision requests regardless of
> +	 * whether or not its underlying devices have support.
> +	 */
> +	bool provision_supported:1;
> +
>  	/*
>  	 * Set if we need to limit the number of in-flight bios when swapping.
>  	 */
> -- 
> 2.37.3
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dm-devel] [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2023-01-04 21:22       ` [dm-devel] " Sarthak Kukreti
@ 2023-01-05 14:46         ` Brian Foster
  -1 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2023-01-05 14:46 UTC (permalink / raw)
  To: Sarthak Kukreti
  Cc: Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Michael S. Tsirkin, sarthakkukreti, Darrick J. Wong, Jason Wang,
	Bart Van Assche, Mike Snitzer, linux-kernel, linux-block,
	dm-devel, Andreas Dilger, Daniil Lunev, Stefan Hajnoczi,
	linux-fsdevel, linux-ext4, Alasdair Kergon

On Wed, Jan 04, 2023 at 01:22:06PM -0800, Sarthak Kukreti wrote:
> (Resend; the text flow made the last reply unreadable)
> 
> On Wed, Jan 4, 2023 at 8:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > sends a hint to (supported) thinly provisioned block devices to
> > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > >
> > > The man pages for both fallocate(2) and posix_fallocate(3) describe
> > > the default allocation mode as:
> > >
> > > ```
> > > The default operation (i.e., mode is zero) of fallocate()
> > > allocates the disk space within the range specified by offset and len.
> > > ...
> > > subsequent writes to bytes in the specified range are guaranteed
> > > not to fail because of lack of disk space.
> > > ```
> > >
> > > For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> > > files), the term 'disk space' is overloaded and can either mean the apparent
> > > disk space in the filesystem/thin logical volume or the true disk
> > > space that will be utilized on the underlying non-sparse allocation layer.
> > >
> > > The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> > > causes allocation only at the current layer (default mode) or whether it propagates
> > > allocations to underlying layers (provision mode)
> >
> > Why is it important to make this distinction?  The outcome of fallocate
> > is supposed to be that subsequent writes do not fail with ENOSPC.  In my
> > (fs developer) mind, REQ_OP_PROVISION is simply an extra step to be taken
> > after allocating file blocks.
> >
> Some use cases still benefit from keeping the default mode - eg.
> virtual machines running on massive storage pools that don't expect to
> hit the storage limit anytime soon (like most cloud storage
> providers). Essentially, if the 'no ENOSPC' guarantee is maintained
> via other means, then REQ_OP_PROVISION adds latency that isn't needed
> (and cloud storage providers don't need to set aside that extra space
> that may or may not be used).
> 

What's the granularity that needs to be managed at? Do you really need
an fallocate command for this, or would one of the filesystem level
features you've already implemented for ext4 suffice?

I mostly agree with Darrick in that FALLOC_FL_PROVISION still feels a
bit wonky to me. I can see that there might be some legitimate use cases
for it, but I'm not convinced that it won't just end up being confusing
to many users. At the same time, I think the approach of unconditional
provision on falloc could eventually lead to complaints associated with
the performance impact or similar sorts of confusion. For example,
should an falloc of an already allocated range in the fs send a
provision or not? Should filesystems that don't otherwise support
UNSHARE_RANGE need to support it in order to support an unshare request
to COW'd blocks on an underlying block device?

I wonder if the smart thing to do here is separate out the question of a
new fallocate interface from the mechanism entirely. For example,
implement REQ_OP_PROVISION as you've already done, enable block layer
mode = 0 fallocate support (i.e. without FL_PROVISION, so whether a
request propagates from a loop device will be up to the backing fs),
implement the various fs features to support REQ_OP_PROVISION (i.e.,
mount option, file attr, etc.), then tack on FL_PROVISION + ext4 support at
the end as an RFC/prototype.

Even if we ultimately ended up with FL_PROVISION support, it might
actually make some sense to kick that can down the road a bit regardless
to give fs' a chance to implement basic REQ_OP_PROVISION support, get a
better understanding of how it works in practice, and then perhaps make
more informed decisions on things like sane defaults and/or how best to
expose it via fallocate. Thoughts?

Brian

> > If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
> > call into mode 0 fallocate, then the new functionality can be added (or
> > even backported) to existing kernels and customers can use it
> > immediately.  If you *do*, then you get to wait a few years for
> > developers to add it to their codebases only after enough enterprise
> > distros pick up a new kernel to make it worth their while.
> >
> > > for thinly provisioned filesystems/
> > > block devices. For devices that do not support REQ_OP_PROVISION, both these
> > > allocation modes will be equivalent. Given the performance cost of sending provision
> > > requests to the underlying layers, keeping the default mode as-is allows users to
> > > preserve existing behavior.
> >
> > How expensive is this expected to be?  Is this why you wanted a separate
> > mode flag?
> >
> Yes, the exact latency will depend on the stacked block devices and
> the fragmentation at the allocation layers.
> 
> I did a quick test for benchmarking fallocate() with an:
> A) ext4 filesystem mounted with 'noprovision'
> B) ext4 filesystem mounted with 'provision' on a dm-thin device.
> C) ext4 filesystem mounted with 'provision' on a loop device with a
> sparse backing file on the filesystem in (B).
> 
> I tested file sizes from 512M to 8G, time taken for fallocate() in (A)
> remains expectedly flat at ~0.01-0.02s, but for (B), it scales from
> 0.03-0.4s and for (C) it scales from 0.04s-0.52s (I captured the exact
> time distribution in the cover letter
> https://marc.info/?l=linux-ext4&m=167230113520636&w=2)
> 
> +0.5s for a 8G fallocate doesn't sound a lot but I think fragmentation
> and how the block device is layered can make this worse...
> 
> > --D
> >
> > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > ---
> > >  block/fops.c                | 15 +++++++++++----
> > >  include/linux/falloc.h      |  3 ++-
> > >  include/uapi/linux/falloc.h |  8 ++++++++
> > >  3 files changed, 21 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/block/fops.c b/block/fops.c
> > > index 50d245e8c913..01bde561e1e2 100644
> > > --- a/block/fops.c
> > > +++ b/block/fops.c
> > > @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > >
> > >  #define      BLKDEV_FALLOC_FL_SUPPORTED                                      \
> > >               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
> > > -              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> > > +              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |       \
> > > +              FALLOC_FL_PROVISION)
> > >
> > >  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > >                            loff_t len)
> > > @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > >       filemap_invalidate_lock(inode->i_mapping);
> > >
> > >       /* Invalidate the page cache, including dirty pages. */
> > > -     error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > -     if (error)
> > > -             goto fail;
> > > +     if (mode != FALLOC_FL_PROVISION) {
> > > +             error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > +             if (error)
> > > +                     goto fail;
> > > +     }
> > >
> > >       switch (mode) {
> > >       case FALLOC_FL_ZERO_RANGE:
> > > @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > >               error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > >                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > >               break;
> > > +     case FALLOC_FL_PROVISION:
> > > +             error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > +                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > +             break;
> > >       default:
> > >               error = -EOPNOTSUPP;
> > >       }
> > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > index f3f0b97b1675..b9a40a61a59b 100644
> > > --- a/include/linux/falloc.h
> > > +++ b/include/linux/falloc.h
> > > @@ -30,7 +30,8 @@ struct space_resv {
> > >                                        FALLOC_FL_COLLAPSE_RANGE |     \
> > >                                        FALLOC_FL_ZERO_RANGE |         \
> > >                                        FALLOC_FL_INSERT_RANGE |       \
> > > -                                      FALLOC_FL_UNSHARE_RANGE)
> > > +                                      FALLOC_FL_UNSHARE_RANGE |      \
> > > +                                      FALLOC_FL_PROVISION)
> > >
> > >  /* on ia32 l_start is on a 32-bit boundary */
> > >  #if defined(CONFIG_X86_64)
> > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > index 51398fa57f6c..2d323d113eed 100644
> > > --- a/include/uapi/linux/falloc.h
> > > +++ b/include/uapi/linux/falloc.h
> > > @@ -77,4 +77,12 @@
> > >   */
> > >  #define FALLOC_FL_UNSHARE_RANGE              0x40
> > >
> > > +/*
> > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > + * blocks for the range/EOF.
> > > + *
> > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > + */
> > > +#define FALLOC_FL_PROVISION          0x80
> > > +
> > >  #endif /* _UAPI_FALLOC_H_ */
> > > --
> > > 2.37.3
> > >
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2023-01-04 21:22       ` [dm-devel] " Sarthak Kukreti
@ 2023-01-05 15:49         ` Theodore Ts'o
  -1 siblings, 0 replies; 46+ messages in thread
From: Theodore Ts'o @ 2023-01-05 15:49 UTC (permalink / raw)
  To: Sarthak Kukreti
  Cc: Darrick J. Wong, sarthakkukreti, dm-devel, linux-block,
	linux-ext4, linux-kernel, linux-fsdevel, Jens Axboe,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Alasdair Kergon,
	Mike Snitzer, Christoph Hellwig, Brian Foster, Andreas Dilger,
	Bart Van Assche, Daniil Lunev

On Wed, Jan 04, 2023 at 01:22:06PM -0800, Sarthak Kukreti wrote:
> > How expensive is this expected to be?  Is this why you wanted a separate
> > mode flag?
>
> Yes, the exact latency will depend on the stacked block devices and
> the fragmentation at the allocation layers.
> 
> I did a quick test for benchmarking fallocate() with an:
> A) ext4 filesystem mounted with 'noprovision'
> B) ext4 filesystem mounted with 'provision' on a dm-thin device.
> C) ext4 filesystem mounted with 'provision' on a loop device with a
> sparse backing file on the filesystem in (B).
> 
> I tested file sizes from 512M to 8G, time taken for fallocate() in (A)
> remains expectedly flat at ~0.01-0.02s, but for (B), it scales from
> 0.03-0.4s and for (C) it scales from 0.04s-0.52s (I captured the exact
> time distribution in the cover letter
> https://marc.info/?l=linux-ext4&m=167230113520636&w=2)
> 
> +0.5s for a 8G fallocate doesn't sound a lot but I think fragmentation
> and how the block device is layered can make this worse...
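
For anyone who wants to reproduce numbers of this kind, a minimal timing
sketch follows. It is not the harness behind the figures quoted above; the
helper name time_fallocate() is made up for illustration, and it times a
single mode-0 fallocate() from offset 0 on whatever filesystem backs the
descriptor.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Time one mode-0 fallocate() of `len` bytes starting at offset 0.
 * Returns the elapsed wall-clock time in seconds, or a negative value
 * if either the clock or the fallocate() call fails.
 * (Illustrative helper, not part of the patch series.)
 */
double time_fallocate(int fd, off_t len)
{
	struct timespec t0, t1;

	assert(len > 0);

	if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
		return -1.0;
	if (fallocate(fd, 0, 0, len) != 0)
		return -1.0;
	if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
		return -1.0;

	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Running it against the same file size on each of the (A), (B) and (C)
setups would give one data point per configuration; a real comparison
would want repeated runs on freshly created files.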

If userspace uses fallocate(2) there are generally two reasons.
Either they **really** don't want to get the ENOSPC, in which case
noprovision will not give them what they want unless we modify their
source code to add this new FALLOC_FL_PROVISION flag --- which may not
be possible if it is provided in a binary-only format (for example,
proprietary databases shipped by companies beginning with the letters
'I' or 'O').
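
To make the source-change burden concrete, here is a rough sketch of what
an application would have to add. FALLOC_FL_PROVISION and its value 0x80
are taken from the patch quoted in this thread and are assumed not to be
in the headers you build against; preallocate() is a made-up helper name.
On a kernel without the patch the bit may be rejected or may mean
something else, so treat this as illustration only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical flag: value copied from the proposed uapi change. */
#ifndef FALLOC_FL_PROVISION
#define FALLOC_FL_PROVISION 0x80
#endif

/*
 * Ask for provision-mode allocation first; if the kernel or filesystem
 * reports EOPNOTSUPP, fall back to plain mode-0 fallocate().
 * Returns 0 on success, -1 with errno set on failure.
 * *provisioned is set to 1 only if the flagged call succeeded.
 */
int preallocate(int fd, off_t offset, off_t len, int *provisioned)
{
	assert(provisioned != NULL);
	*provisioned = 0;

	if (fallocate(fd, FALLOC_FL_PROVISION, offset, len) == 0) {
		*provisioned = 1;
		return 0;
	}
	if (errno != EOPNOTSUPP)
		return -1;

	/* Flag not understood here: keep the existing behaviour. */
	return fallocate(fd, 0, offset, len);
}
```

A binary-only application cannot grow this fallback, which is the case
the paragraph above is worried about.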

Or, they really care about avoiding fragmentation by giving a hint to
the file system that layout is important, and so **please** allocate
the space right away so that it is more likely that the space will be
laid out in a contiguous fashion.  Of course, the moment you use
thin-provisioning this goes out the window, since even if the space is
contiguous on the dm-thin layer, on the underlying storage layer it is
likely that things will be fragmented to a fare-thee-well, and either
(a) you have a vast amount of flash to try to mitigate the performance
hit of using thin-provisioning (example, hardware thin-provisioning
such as EMC storage arrays), or (b) you really don't care about
performance since space savings is what you're going for.

So.... because of the issue of changing the semantics of what
fallocate(2) will guarantee, unless programs are forced to change
their code to use this new FALLOC flag, I really am not very fond of
it.

I suspect that using a mount option (which should default to
"provision"; if you want to break user API expectations, it should
require a mount option for the system administrator to explicitly OK
such a change), is OK.

As far as the per-file mode --- I'm not convinced it's really
necessary.  In general, if you are using thin-provisioning, file systems
tend to be used explicitly for one purpose, so adding the complexity
of doing it on a per-file basis is probably not really needed.  That
being said, your existing prototype requires searching for the
extended attribute on every single file allocation, which is not a
great idea.  On a system with SELinux enabled, every file will have an
xattr block, and requiring that it be searched on every file
allocation would be unfortunate.  It would be better to check for the
xattr when the file is opened, and then set a flag in the struct
file.  However, it might be better to see if there is a real demand
for such a feature before adding it.
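
A toy userspace model of the check-at-open idea, purely to show the shape
of the change. None of these types are kernel structures; the toy_* names
and the "user.provision" attribute name are invented for this sketch and
are not taken from the prototype.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for an inode carrying a small list of xattr names. */
struct toy_inode {
	const char *xattr_names[4];
	int nr_xattrs;
	int xattr_lookups;	/* how many times the list was searched */
};

/* Toy stand-in for struct file, with the result cached at open. */
struct toy_file {
	struct toy_inode *inode;
	bool provision;
};

/* The expensive part: walk the xattrs looking for one name. */
static bool toy_has_xattr(struct toy_inode *inode, const char *name)
{
	inode->xattr_lookups++;
	for (int i = 0; i < inode->nr_xattrs; i++)
		if (strcmp(inode->xattr_names[i], name) == 0)
			return true;
	return false;
}

/* Search once, when the file is opened, and remember the answer. */
void toy_open(struct toy_file *file, struct toy_inode *inode)
{
	assert(file && inode);
	file->inode = inode;
	file->provision = toy_has_xattr(inode, "user.provision");
}

/* Allocation path: consult only the cached flag, never the xattrs. */
bool toy_alloc_wants_provision(const struct toy_file *file)
{
	return file->provision;
}
```

With this shape, a thousand allocations against one open file cost a
single xattr search instead of a thousand. The trade-off is that a change
to the attribute is not seen by files that are already open.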

						- Ted

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dm-devel] [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2023-01-05 14:46         ` Brian Foster
@ 2023-01-05 19:35           ` Darrick J. Wong
  -1 siblings, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2023-01-05 19:35 UTC (permalink / raw)
  To: Brian Foster
  Cc: Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Stefan Hajnoczi, Michael S. Tsirkin, sarthakkukreti, Jason Wang,
	Bart Van Assche, Mike Snitzer, linux-kernel, linux-block,
	dm-devel, Andreas Dilger, Daniil Lunev, Sarthak Kukreti,
	linux-fsdevel, linux-ext4, Alasdair Kergon

On Thu, Jan 05, 2023 at 09:46:06AM -0500, Brian Foster wrote:
> On Wed, Jan 04, 2023 at 01:22:06PM -0800, Sarthak Kukreti wrote:
> > (Resend; the text flow made the last reply unreadable)
> > 
> > On Wed, Jan 4, 2023 at 8:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> > > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > > sends a hint to (supported) thinly provisioned block devices to
> > > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > > >
> > > > The man pages for both fallocate(2) and posix_fallocate(3) describe
> > > > the default allocation mode as:
> > > >
> > > > ```
> > > > The default operation (i.e., mode is zero) of fallocate()
> > > > allocates the disk space within the range specified by offset and len.
> > > > ...
> > > > subsequent writes to bytes in the specified range are guaranteed
> > > > not to fail because of lack of disk space.
> > > > ```
> > > >
> > > > For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> > > > files), the term 'disk space' is overloaded and can either mean the apparent
> > > > disk space in the filesystem/thin logical volume or the true disk
> > > > space that will be utilized on the underlying non-sparse allocation layer.
> > > >
> > > > The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> > > > causes allocation only at the current layer (default mode) or whether it propagates
> > > > allocations to underlying layers (provision mode)
> > >
> > > Why is it important to make this distinction?  The outcome of fallocate
> > > is supposed to be that subsequent writes do not fail with ENOSPC.  In my
> > > > (fs developer) mind, REQ_OP_PROVISION is simply an extra step to be taken
> > > after allocating file blocks.
> > >
> > Some use cases still benefit from keeping the default mode - eg.
> > virtual machines running on massive storage pools that don't expect to
> > hit the storage limit anytime soon (like most cloud storage
> > providers). Essentially, if the 'no ENOSPC' guarantee is maintained
> > via other means, then REQ_OP_PROVISION adds latency that isn't needed
> > (and cloud storage providers don't need to set aside that extra space
> > that may or may not be used).
> > 
> 
> What's the granularity that needs to be managed at? Do you really need
> an fallocate command for this, or would one of the filesystem level
> features you've already implemented for ext4 suffice?
> 
> I mostly agree with Darrick in that FALLOC_FL_PROVISION still feels a
> bit wonky to me. I can see that there might be some legitimate use cases
> for it, but I'm not convinced that it won't just end up being confusing
> to many users. At the same time, I think the approach of unconditional
> provision on falloc could eventually lead to complaints associated with
> the performance impact or similar sorts of confusion. For example,
> should an falloc of an already allocated range in the fs send a
> provision or not?

For a user-initiated fallocate call, I think that's reasonable.

My first thought is to make the XFS allocator issue REQ_OP_PROVISION on
every allocation if the device supports it.  The fs has decided that
it's going to allocate and presumably write to some space, so the
underlying storage really ought to have some space ready.

But then it occurred to me -- what if the IO fails with ENOSPC?  Do we
keep going and hope for the best?  Or maybe we should undo the
allocation?  That could be tricky since we'd have to add a transaction
to undo the allocation, commit that, and then throw an error to the
upper layers.

Should the allocator instead find the space it wants and issue the
provisioning IO with the AGF locked, and try again somewhere else if the
IO returns ENOSPC?  If the space management IO takes forever, we've
pinned that AG for the duration, which is one of the not very nice
aspects of the XFS FITRIM implementation on crappy SSDs.

For a directio write, it's simple enough to throw that error back to
userspace.  I think the same applies to buffered writeback -- we'll
cancel the writeback and set AS_ENOSPC on the mapping.

But then, what about *metadata* allocation?  If those fail because the
provisioning encounters ENOSPC, we'll shut down the filesystem, which
isn't nice.  For XFS I guess we could reuse the existing metadata IO
error config knobs to make it retry for some amount of time until
(hopefully) the admin buys more storage.

Let's go with the simplest implementation (issue it with the free space
locked), and iterate from there.
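
Roughly, as a userspace toy and emphatically not XFS code: every toy_*
name below is invented, the "locked" flag only stands in for holding the
AGF, and giving each AG its own backing_blocks counter is a simplification
(a real thin pool is shared across the whole device).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy allocation group. */
struct toy_ag {
	bool locked;		/* stands in for the locked AGF */
	long free_blocks;	/* free space the filesystem believes it has */
	long backing_blocks;	/* space the thin device can still map here */
};

/* Stand-in for the provisioning IO sent to the block device. */
static int toy_provision(struct toy_ag *ag, long len)
{
	if (ag->backing_blocks < len)
		return -ENOSPC;
	ag->backing_blocks -= len;
	return 0;
}

/*
 * Pick an AG with enough free space, issue the provisioning request
 * while that AG is locked, and move on to the next AG if the device
 * answers ENOSPC. Returns the AG index used, or a negative errno.
 */
int toy_alloc(struct toy_ag ags[], int nr_ags, long len)
{
	for (int i = 0; i < nr_ags; i++) {
		struct toy_ag *ag = &ags[i];
		int error;

		if (ag->free_blocks < len)
			continue;

		assert(!ag->locked);
		ag->locked = true;	/* held across the IO */
		error = toy_provision(ag, len);
		if (!error)
			ag->free_blocks -= len;
		ag->locked = false;

		if (!error)
			return i;
		if (error != -ENOSPC)
			return error;
		/* ENOSPC from the device: try somewhere else. */
	}
	return -ENOSPC;
}
```

The cost mentioned above is visible in the sketch: the AG stays locked
for as long as toy_provision() takes, so a slow device stalls every other
allocation in that AG.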

> Should filesystems that don't otherwise support UNSHARE_RANGE need to
> support it in order to support an unshare request to COW'd blocks on
> an underlying block device?

Hmm.  Currently, fallocate'ing part of a file that's already mapped to
shared blocks is a nop.  That's technically an omission in the
implementation, since a subsequent write can fail during COW setup due
to insufficient space.  My memory about funshare is a bit murky since
it's been years now.

As I remember it, originally, I had allocate mode also calling unshare,
but Dave or someone pointed out that unsharing generates a flood of
dirty pagecache, and it would be a bit surprising that fallocate
suddenly takes a long time to run.  There also wasn't much precedent for
fallocate to unshare blocks, since btrfs doesn't do that:

# filefrag -v /mnt/[ab]
Filesystem type is: 9123683e
File size of /mnt/a is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     255:       3328..      3583:    256:             last,shared,eof
/mnt/a: 1 extent found
File size of /mnt/b is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     255:       3328..      3583:    256:             last,shared,eof
/mnt/b: 1 extent found

# xfs_io -c 'falloc 512k 36k' /mnt/b

# filefrag -v /mnt/[ab]
Filesystem type is: 9123683e
File size of /mnt/a is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     255:       3328..      3583:    256:             last,shared,eof
/mnt/a: 1 extent found
File size of /mnt/b is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     255:       3328..      3583:    256:             last,shared,eof
/mnt/b: 1 extent found

I took funshare out of the patchset entirely (minimum viable product,
yadda yadda) and a few months later, I think hch or someone asked for a
knob for userspace to get a file back to pure overwrite mode.  That's
where it's been ever since.

So to answer your question: fallocate mode 0 and REQ_OP_PROVISION
probably ought to be allocating the holes and unsharing existing shared
mappings.  However, we could also wriggle out of that by <cough>
claiming that fallocate has been consistent across filesystems in
leaving that wart for userspace to trip over. :/
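
In practice that wart means a userspace caller who wants the "later
writes won't hit ENOSPC" guarantee on a possibly-reflinked file has to
ask for the unshare explicitly.  A rough sketch, assuming the glibc
fallocate() wrapper; prealloc_and_unshare() is a made-up helper name,
and treating EOPNOTSUPP as "nothing to unshare" is my assumption, not
something any filesystem promises:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

#ifndef FALLOC_FL_UNSHARE_RANGE
#define FALLOC_FL_UNSHARE_RANGE 0x40	/* from include/uapi/linux/falloc.h */
#endif

/*
 * Hypothetical helper: allocate holes with mode 0, then ask the fs to
 * break any sharing in the same range.  Returns 0 or a negative errno.
 */
static int prealloc_and_unshare(int fd, off_t offset, off_t len)
{
	/* Mode 0 fills holes but leaves shared extents shared. */
	if (fallocate(fd, 0, offset, len) < 0)
		return -errno;

	/* Explicitly unshare, so a later overwrite needs no COW allocation. */
	if (fallocate(fd, FALLOC_FL_UNSHARE_RANGE, offset, len) < 0) {
		/* Assumption: a fs without unshare support has no shared blocks. */
		if (errno == EOPNOTSUPP)
			return 0;
		return -errno;
	}
	return 0;
}
```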

> I wonder if the smart thing to do here is separate out the question of a
> new fallocate interface from the mechanism entirely. For example,
> implement REQ_OP_PROVISION as you've already done, enable block layer
> mode = 0 fallocate support (i.e. without FL_PROVISION, so whether a
> request propagates from a loop device will be up to the backing fs),
> implement the various fs features to support REQ_OP_PROVISION (i.e.,
> mount option, file attr, etc.), then tack on FL_PROVISION + ext4 support at
> the end as an RFC/prototype.

Yeah.

> Even if we ultimately ended up with FL_PROVISION support, it might
> actually make some sense to kick that can down the road a bit regardless
> to give fs' a chance to implement basic REQ_OP_PROVISION support, get a
> better understanding of how it works in practice, and then perhaps make
> more informed decisions on things like sane defaults and/or how best to
> expose it via fallocate. Thoughts?

Agree. :)

--D

> 
> Brian
> 
> > > If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
> > > call into mode 0 fallocate, then the new functionality can be added (or
> > > even backported) to existing kernels and customers can use it
> > > immediately.  If you *do*, then you get to wait a few years for
> > > developers to add it to their codebases only after enough enterprise
> > > distros pick up a new kernel to make it worth their while.
> > >
> > > > for thinly provisioned filesystems/
> > > > block devices. For devices that do not support REQ_OP_PROVISION, both these
> > > > allocation modes will be equivalent. Given the performance cost of sending provision
> > > > requests to the underlying layers, keeping the default mode as-is allows users to
> > > > preserve existing behavior.
> > >
> > > How expensive is this expected to be?  Is this why you wanted a separate
> > > mode flag?
> > >
> > Yes, the exact latency will depend on the stacked block devices and
> > the fragmentation at the allocation layers.
> > 
> > I did a quick test for benchmarking fallocate() with an:
> > A) ext4 filesystem mounted with 'noprovision'
> > B) ext4 filesystem mounted with 'provision' on a dm-thin device.
> > C) ext4 filesystem mounted with 'provision' on a loop device with a
> > sparse backing file on the filesystem in (B).
> > 
> > I tested file sizes from 512M to 8G, time taken for fallocate() in (A)
> > remains expectedly flat at ~0.01-0.02s, but for (B), it scales from
> > 0.03-0.4s and for (C) it scales from 0.04s-0.52s (I captured the exact
> > time distribution in the cover letter
> > https://marc.info/?l=linux-ext4&m=167230113520636&w=2)
> > 
> > +0.5s for an 8G fallocate doesn't sound like a lot, but I think fragmentation
> > and how the block device is layered can make this worse...
> > 
> > > --D
> > >
> > > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > > ---
> > > >  block/fops.c                | 15 +++++++++++----
> > > >  include/linux/falloc.h      |  3 ++-
> > > >  include/uapi/linux/falloc.h |  8 ++++++++
> > > >  3 files changed, 21 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/block/fops.c b/block/fops.c
> > > > index 50d245e8c913..01bde561e1e2 100644
> > > > --- a/block/fops.c
> > > > +++ b/block/fops.c
> > > > @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > > >
> > > >  #define      BLKDEV_FALLOC_FL_SUPPORTED                                      \
> > > >               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
> > > > -              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> > > > +              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |       \
> > > > +              FALLOC_FL_PROVISION)
> > > >
> > > >  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > >                            loff_t len)
> > > > @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > >       filemap_invalidate_lock(inode->i_mapping);
> > > >
> > > >       /* Invalidate the page cache, including dirty pages. */
> > > > -     error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > -     if (error)
> > > > -             goto fail;
> > > > +     if (mode != FALLOC_FL_PROVISION) {
> > > > +             error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > +             if (error)
> > > > +                     goto fail;
> > > > +     }
> > > >
> > > >       switch (mode) {
> > > >       case FALLOC_FL_ZERO_RANGE:
> > > > @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > >               error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > > >                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > >               break;
> > > > +     case FALLOC_FL_PROVISION:
> > > > +             error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > > +                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > +             break;
> > > >       default:
> > > >               error = -EOPNOTSUPP;
> > > >       }
> > > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > > index f3f0b97b1675..b9a40a61a59b 100644
> > > > --- a/include/linux/falloc.h
> > > > +++ b/include/linux/falloc.h
> > > > @@ -30,7 +30,8 @@ struct space_resv {
> > > >                                        FALLOC_FL_COLLAPSE_RANGE |     \
> > > >                                        FALLOC_FL_ZERO_RANGE |         \
> > > >                                        FALLOC_FL_INSERT_RANGE |       \
> > > > -                                      FALLOC_FL_UNSHARE_RANGE)
> > > > +                                      FALLOC_FL_UNSHARE_RANGE |      \
> > > > +                                      FALLOC_FL_PROVISION)
> > > >
> > > >  /* on ia32 l_start is on a 32-bit boundary */
> > > >  #if defined(CONFIG_X86_64)
> > > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > > index 51398fa57f6c..2d323d113eed 100644
> > > > --- a/include/uapi/linux/falloc.h
> > > > +++ b/include/uapi/linux/falloc.h
> > > > @@ -77,4 +77,12 @@
> > > >   */
> > > >  #define FALLOC_FL_UNSHARE_RANGE              0x40
> > > >
> > > > +/*
> > > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > > + * blocks for the range/EOF.
> > > > + *
> > > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > > + */
> > > > +#define FALLOC_FL_PROVISION          0x80
> > > > +
> > > >  #endif /* _UAPI_FALLOC_H_ */
> > > > --
> > > > 2.37.3
> > > >
> > 
> 



* Re: [PATCH v2 6/7] ext4: Add mount option for provisioning blocks during allocations
  2022-12-29  8:12   ` [dm-devel] " Sarthak Kukreti
@ 2023-01-09 15:02     ` Brian Foster
  -1 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2023-01-09 15:02 UTC (permalink / raw)
  To: Sarthak Kukreti
  Cc: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel, Jens Axboe, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Alasdair Kergon, Mike Snitzer,
	Christoph Hellwig, Theodore Ts'o, Andreas Dilger,
	Bart Van Assche, Daniil Lunev, Darrick J. Wong

On Thu, Dec 29, 2022 at 12:12:51AM -0800, Sarthak Kukreti wrote:
> Add a mount option that sets the default provisioning mode for
> all files within the filesystem.
> 

There's not much description here to explain what a user should expect
from this mode. Should the user expect -ENOSPC from the lower layers
once out of space? What about files that are provisioned by the fs and
then freed? Should the user/admin know to run fstrim or also enable an
online discard mechanism to ensure freed filesystem blocks are returned
to the free pool in the block/dm layer in order to avoid premature fs
-ENOSPC conditions?
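
For reference, the fstrim side of that question comes down to a FITRIM
ioctl on the mounted filesystem.  A minimal sketch, assuming fd is open
on the mount point and the caller has the needed privileges;
trim_free_space() is a made-up helper name:

```c
#include <errno.h>
#include <limits.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FITRIM, struct fstrim_range */

/*
 * Hypothetical helper: ask the filesystem behind fd to discard all of
 * its free space, so a thin pool underneath can reclaim those blocks.
 * Returns the number of bytes trimmed, or a negative errno.
 */
static long long trim_free_space(int fd)
{
	struct fstrim_range range = {
		.start = 0,
		.len = ULLONG_MAX,	/* cover the whole filesystem */
		.minlen = 0,		/* no minimum extent size */
	};

	if (ioctl(fd, FITRIM, &range) < 0)
		return -errno;

	/* On success the kernel writes the trimmed byte count back to len. */
	return (long long)range.len;
}
```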

Also, what about dealing with block level snapshots? There is some
discussion on previous patches wrt expectations on how provision
might handle unsharing of cow'd blocks. If the fs only provisions on
initial allocation, then a subsequent snapshot means we run into the
same sort of ENOSPC problem on overwrites of already allocated blocks.
That also doesn't consider things like an internal log, which may have
been physically allocated (provisioned?) at mkfs time and yet is subject
to the same general problem.

So what is the higher level goal with this sort of mode? Is
provision-on-alloc sufficient to provide a practical benefit to users,
or should this perhaps consider other scenarios where a provision might
be warranted before submitting writes to a thinly provisioned device?

FWIW, it seems reasonable to me to introduce this without snapshot
support and work toward it later, but it should be made clear what is
being advertised in the meantime. Unless there's some nice way to
explicitly limit the scope of use, such as preventing snapshots or
something, the fs might want to consider this sort of feature
experimental until it is more fully functional.

Brian

> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> ---
>  fs/ext4/ext4.h    | 1 +
>  fs/ext4/extents.c | 7 +++++++
>  fs/ext4/super.c   | 7 +++++++
>  3 files changed, 15 insertions(+)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 49832e90b62f..29cab2e2ea20 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1269,6 +1269,7 @@ struct ext4_inode_info {
>  #define EXT4_MOUNT2_MB_OPTIMIZE_SCAN	0x00000080 /* Optimize group
>  						    * scanning in mballoc
>  						    */
> +#define EXT4_MOUNT2_PROVISION		0x00000100 /* Provision while allocating file blocks */
>  
>  #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
>  						~EXT4_MOUNT_##opt
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 2e64a9211792..a73f44264fe2 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4441,6 +4441,13 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
>  	unsigned int credits;
>  	loff_t epos;
>  
> +	/*
> +	 * Attempt to provision file blocks if the mount is mounted with
> +	 * provision.
> +	 */
> +	if (test_opt2(inode->i_sb, PROVISION))
> +		flags |= EXT4_GET_BLOCKS_PROVISION;
> +
>  	BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
>  	map.m_lblk = offset;
>  	map.m_len = len;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 260c1b3e3ef2..5bc376f6a6f0 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1591,6 +1591,7 @@ enum {
>  	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
>  	Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
>  	Opt_errors, Opt_data, Opt_data_err, Opt_jqfmt, Opt_dax_type,
> +	Opt_provision, Opt_noprovision,
>  #ifdef CONFIG_EXT4_DEBUG
>  	Opt_fc_debug_max_replay, Opt_fc_debug_force
>  #endif
> @@ -1737,6 +1738,8 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
>  	fsparam_flag	("reservation",		Opt_removed),	/* mount option from ext2/3 */
>  	fsparam_flag	("noreservation",	Opt_removed),	/* mount option from ext2/3 */
>  	fsparam_u32	("journal",		Opt_removed),	/* mount option from ext2/3 */
> +	fsparam_flag	("provision",		Opt_provision),
> +	fsparam_flag	("noprovision",		Opt_noprovision),
>  	{}
>  };
>  
> @@ -1826,6 +1829,8 @@ static const struct mount_opts {
>  	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
>  	{Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
>  	 MOPT_SET},
> +	{Opt_provision, EXT4_MOUNT2_PROVISION, MOPT_SET | MOPT_2},
> +	{Opt_noprovision, EXT4_MOUNT2_PROVISION, MOPT_CLEAR | MOPT_2},
>  #ifdef CONFIG_EXT4_DEBUG
>  	{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
>  	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
> @@ -2977,6 +2982,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
>  		SEQ_OPTS_PUTS("dax=never");
>  	} else if (test_opt2(sb, DAX_INODE)) {
>  		SEQ_OPTS_PUTS("dax=inode");
> +	} else if (test_opt2(sb, PROVISION)) {
> +		SEQ_OPTS_PUTS("provision");
>  	}
>  
>  	if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD &&
> -- 
> 2.37.3
> 



* Re: [dm-devel] [PATCH v2 6/7] ext4: Add mount option for provisioning blocks during allocations
@ 2023-01-09 15:02     ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2023-01-09 15:02 UTC (permalink / raw)
  To: Sarthak Kukreti
  Cc: Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Michael S. Tsirkin, sarthakkukreti, Darrick J. Wong, Jason Wang,
	Bart Van Assche, Mike Snitzer, linux-kernel, linux-block,
	dm-devel, Andreas Dilger, Daniil Lunev, Stefan Hajnoczi,
	linux-fsdevel, linux-ext4, Alasdair Kergon

On Thu, Dec 29, 2022 at 12:12:51AM -0800, Sarthak Kukreti wrote:
> Add a mount option that sets the default provisioning mode for
> all files within the filesystem.
> 

There's not much description here to explain what a user should expect
from this mode. Should the user expect -ENOSPC from the lower layers
once out of space? What about files that are provisioned by the fs and
then freed? Should the user/admin know to run fstrim or also enable an
online discard mechanism to ensure freed filesystem blocks are returned
to the free pool in the block/dm layer in order to avoid premature fs
-ENOSPC conditions?

Also, what about dealing with block level snapshots? There is some
discussion on previous patches wrt expectations on how provision
might handle unsharing of cow'd blocks. If the fs only provisions on
initial allocation, then a subsequent snapshot means we run into the
same sort of ENOSPC problem on overwrites of already allocated blocks.
That also doesn't consider things like an internal log, which may have
been physically allocated (provisioned?) at mkfs time and yet is subject
to the same general problem.

So what is the higher level goal with this sort of mode? Is
provision-on-alloc sufficient to provide a practical benefit to users,
or should this perhaps consider other scenarios where a provision might
be warranted before submitting writes to a thinly provisioned device?

FWIW, it seems reasonable to me to introduce this without snapshot
support and work toward it later, but it should be made clear what is
being advertised in the meantime. Unless there's some nice way to
explicitly limit the scope of use, such as preventing snapshots or
something, the fs might want to consider this sort of feature
experimental until it is more fully functional.

Brian

> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> ---
>  fs/ext4/ext4.h    | 1 +
>  fs/ext4/extents.c | 7 +++++++
>  fs/ext4/super.c   | 7 +++++++
>  3 files changed, 15 insertions(+)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 49832e90b62f..29cab2e2ea20 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1269,6 +1269,7 @@ struct ext4_inode_info {
>  #define EXT4_MOUNT2_MB_OPTIMIZE_SCAN	0x00000080 /* Optimize group
>  						    * scanning in mballoc
>  						    */
> +#define EXT4_MOUNT2_PROVISION		0x00000100 /* Provision while allocating file blocks */
>  
>  #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
>  						~EXT4_MOUNT_##opt
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 2e64a9211792..a73f44264fe2 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4441,6 +4441,13 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
>  	unsigned int credits;
>  	loff_t epos;
>  
> +	/*
> +	 * Attempt to provision file blocks if the mount is mounted with
> +	 * provision.
> +	 */
> +	if (test_opt2(inode->i_sb, PROVISION))
> +		flags |= EXT4_GET_BLOCKS_PROVISION;
> +
>  	BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
>  	map.m_lblk = offset;
>  	map.m_len = len;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 260c1b3e3ef2..5bc376f6a6f0 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1591,6 +1591,7 @@ enum {
>  	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
>  	Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
>  	Opt_errors, Opt_data, Opt_data_err, Opt_jqfmt, Opt_dax_type,
> +	Opt_provision, Opt_noprovision,
>  #ifdef CONFIG_EXT4_DEBUG
>  	Opt_fc_debug_max_replay, Opt_fc_debug_force
>  #endif
> @@ -1737,6 +1738,8 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
>  	fsparam_flag	("reservation",		Opt_removed),	/* mount option from ext2/3 */
>  	fsparam_flag	("noreservation",	Opt_removed),	/* mount option from ext2/3 */
>  	fsparam_u32	("journal",		Opt_removed),	/* mount option from ext2/3 */
> +	fsparam_flag	("provision",		Opt_provision),
> +	fsparam_flag	("noprovision",		Opt_noprovision),
>  	{}
>  };
>  
> @@ -1826,6 +1829,8 @@ static const struct mount_opts {
>  	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
>  	{Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
>  	 MOPT_SET},
> +	{Opt_provision, EXT4_MOUNT2_PROVISION, MOPT_SET | MOPT_2},
> +	{Opt_noprovision, EXT4_MOUNT2_PROVISION, MOPT_CLEAR | MOPT_2},
>  #ifdef CONFIG_EXT4_DEBUG
>  	{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
>  	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
> @@ -2977,6 +2982,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
>  		SEQ_OPTS_PUTS("dax=never");
>  	} else if (test_opt2(sb, DAX_INODE)) {
>  		SEQ_OPTS_PUTS("dax=inode");
> +	} else if (test_opt2(sb, PROVISION)) {
> +		SEQ_OPTS_PUTS("provision");
>  	}
>  
>  	if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD &&
> -- 
> 2.37.3
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



* Re: [dm-devel] [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2023-01-05 19:35           ` Darrick J. Wong
@ 2023-01-09 15:07             ` Brian Foster
  -1 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2023-01-09 15:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Stefan Hajnoczi, Michael S. Tsirkin, sarthakkukreti, Jason Wang,
	Bart Van Assche, Mike Snitzer, linux-kernel, linux-block,
	dm-devel, Andreas Dilger, Daniil Lunev, Sarthak Kukreti,
	linux-fsdevel, linux-ext4, Alasdair Kergon

On Thu, Jan 05, 2023 at 11:35:36AM -0800, Darrick J. Wong wrote:
> On Thu, Jan 05, 2023 at 09:46:06AM -0500, Brian Foster wrote:
> > On Wed, Jan 04, 2023 at 01:22:06PM -0800, Sarthak Kukreti wrote:
> > > (Resend; the text flow made the last reply unreadable)
> > > 
> > > On Wed, Jan 4, 2023 at 8:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> > > > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > > > sends a hint to (supported) thinly provisioned block devices to
> > > > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > > > >
> > > > > The man pages for both fallocate(2) and posix_fallocate(3) describe
> > > > > the default allocation mode as:
> > > > >
> > > > > ```
> > > > > The default operation (i.e., mode is zero) of fallocate()
> > > > > allocates the disk space within the range specified by offset and len.
> > > > > ...
> > > > > subsequent writes to bytes in the specified range are guaranteed
> > > > > not to fail because of lack of disk space.
> > > > > ```
> > > > >
> > > > > For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> > > > > files), the term 'disk space' is overloaded and can either mean the apparent
> > > > > disk space in the filesystem/thin logical volume or the true disk
> > > > > space that will be utilized on the underlying non-sparse allocation layer.
> > > > >
> > > > > The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> > > > > causes allocation only at the current layer (default mode) or whether it propagates
> > > > > allocations to underlying layers (provision mode)
> > > >
> > > > Why is it important to make this distinction?  The outcome of fallocate
> > > > is supposed to be that subsequent writes do not fail with ENOSPC.  In my
> > > > (fs developer) mind, REQ_OP_PROVISION is simply an extra step to be taken
> > > > after allocating file blocks.
> > > >
> > > Some use cases still benefit from keeping the default mode - eg.
> > > virtual machines running on massive storage pools that don't expect to
> > > hit the storage limit anytime soon (like most cloud storage
> > > providers). Essentially, if the 'no ENOSPC' guarantee is maintained
> > > via other means, then REQ_OP_PROVISION adds latency that isn't needed
> > > (and cloud storage providers don't need to set aside that extra space
> > > that may or may not be used).
> > > 
> > 
> > What's the granularity that needs to be managed at? Do you really need
> > an fallocate command for this, or would one of the filesystem level
> > features you've already implemented for ext4 suffice?
> > 
> > I mostly agree with Darrick in that FALLOC_FL_PROVISION still feels a
> > bit wonky to me. I can see that there might be some legitimate use cases
> > for it, but I'm not convinced that it won't just end up being confusing
> > to many users. At the same time, I think the approach of unconditional
> > provision on falloc could eventually lead to complaints associated with
> > the performance impact or similar sorts of confusion. For example,
> > should an falloc of an already allocated range in the fs send a
> > provision or not?
> 
> For a user-initiated fallocate call, I think that's reasonable.
> 

I think so as well, but that doesn't appear to be what the proposed
implementation for ext4 does. I'm not intimately familiar with ext4, but
it looks to me like it only provisions on initial allocation..?

> My first thought is to make the XFS allocator issue REQ_OP_PROVISION on
> every allocation if the device supports it.  The fs has decided that
> it's going to allocate and presumably write to some space, so the
> underlying storage really ought to have some space ready.
> 

That makes sense for a purely thin provisioned device, but runs into
issues with block layer snapshots (re: my comments on the provision
mount option patch). I wonder if it makes more sense to provision at
some point before submitting writes or dirtying pagecache. IIRC we had
prototyped something in XFS a while back that performed an analogous
dm-thin fallocate at the time an extent is mapped for writes. I'm not
sure what the performance impact of that would be or if there's a nice
way to optimize away the obvious side effect of spurious requests.

> But then it occurred to me -- what if the IO fails with ENOSPC?  Do we
> keep going and hope for the best?  Or maybe we should undo the
> allocation?  That could be tricky since we'd have to add a transaction
> to undo the allocation, commit that, and then throw an error to the
> upper layers.
> 

Yeah, that's a good question. IMO we should be able to use something
like this to improve the failure handling for fs' over thinly
provisioned storage with dangerously low free space. There's not much
point in just submitting writes in response to failed provisions in that
case, but perhaps there is some more incremental use case or benefit I'm
not aware of..?

The flipside of more reliably graceful error handling is there may be
more to the minimal solution than just firing off provisions on initial
allocation, unless you wanted to just rule out snapshots I guess. That
said, I think there's still potential opportunity for improvement. For
example, if a prototype did something like the following:

- Provision the log at opportunistic points (i.e., on mount, first
  transaction to a covered log, etc.) to guarantee log writes won't
  fail.
- Provision extents mapped for data writes before the write is allowed
  to proceed.
- Do something similar for metadata in AIL processing or some such,
  where each item must be provisioned before written back.
- Shut down the fs in response to any provision failure.

... obviously that comes with caveats, possibly bad performance, etc.,
but it would be interesting to see if that is sufficient to catch most
scenarios where a write would otherwise get to an out of space volume
causing it to become inactive. If that could be made to work well
enough, perhaps the fs shutdown step could be replaced with some kind of
in-core pause/freeze like mode where the admin has the opportunity to
either add more storage and continue or explicitly shutdown to save the
volume.

OTOH if that just doesn't work out, perhaps this can be combined with
other schemes to reliably prevent inactivation, such as the reservation
mechanism the dm guys had prototyped in the past. Of course that
potentially complicates the interface between the fs and dm-layer.

> Should the allocator instead find the space it wants and issue the
> provisioning IO with the AGF locked, and try again somewhere else if the
> IO returns ENOSPC?  If the space management IO takes forever, we've
> pinned that AG for the duration, which is one of the not very nice
> aspects of the XFS FITRIM implementation on crappy SSDs.
> 
> For a directio write, it's simple enough to throw that error back to
> userspace.  I think the same applies to buffered writeback -- we'll
> cancel the writeback and set AS_ENOSPC on the mapping.
> 
> But then, what about *metadata* allocation?  If those fail because the
> provisioning encounters ENOSPC, we'll shut down the filesystem, which
> isn't nice.  For XFS I guess we could reuse the existing metadata IO
> error config knobs to make it retry for some amount of time until
> (hopefully) the admin buys more storage.
> 
> Let's go with the simplest implementation (issue it with the free space
> locked), and iterate from there.
> 
> > Should filesystems that don't otherwise support UNSHARE_RANGE need to
> > support it in order to support an unshare request to COW'd blocks on
> > an underlying block device?
> 
> Hmm.  Currently, fallocate'ing part of a file that's already mapped to
> shared blocks is a nop.  That's technically an omission in the
> implementation, since a subsequent write can fail during COW setup due
> to insufficient space.  My memory about funshare is a bit murky since
> it's been years now.
> 
> As I remember it, originally, I had allocate mode also calling unshare,
> but Dave or someone pointed out that unsharing generates a flood of
> dirty pagecache, and it would be a bit surprising that fallocate
> suddenly takes a long time to run.  There also wasn't much precedent for
> fallocate to unshare blocks, since btrfs doesn't do that:
> 
> # filefrag -v /mnt/[ab]
> Filesystem type is: 9123683e
> File size of /mnt/a is 1048576 (256 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> /mnt/a: 1 extent found
> File size of /mnt/b is 1048576 (256 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> /mnt/b: 1 extent found
> 
> # xfs_io -c 'falloc 512k 36k' /mnt/b
> 
> # filefrag -v /mnt/[ab]
> Filesystem type is: 9123683e
> File size of /mnt/a is 1048576 (256 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> /mnt/a: 1 extent found
> File size of /mnt/b is 1048576 (256 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> /mnt/b: 1 extent found
> 
> I took funshare out of the patchset entirely (minimum viable product,
> yadda yadda) and a few months later, I think hch or someone asked for a
> knob for userspace to get a file back to pure overwrite mode.  That's
> where it's been ever since.
> 
> So to answer your question: fallocate mode 0 and REQ_OP_PROVISION
> probably ought to be allocating the holes and unsharing existing shared
> mappings.  However, we could also wriggle out of that by <cough>
> claiming that fallocate has been consistent across filesystems in
> leaving that wart for userspace to trip over. :/
> 

Thanks. That seems reasonable to me, but again isn't what the patches
appear to implement. ;P

I guess from the standpoint of an I/O command, it probably makes more
sense to unshare by default. Why else would one send the command
otherwise? The falloc api is what it is at this point, so the bdev folks
could always decide if/how to implement a non-unsharing variant if there
happens to be some reason to do that.
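
For reference, a userspace caller of the proposed mode might look roughly
like the sketch below. It assumes the FALLOC_FL_PROVISION value (0x80) from
this series, which is not a released UAPI flag; preallocate() is a
hypothetical helper name. On a kernel without the series that bit may be
rejected or may mean something else entirely, so this is only meaningful on
a kernel carrying these patches.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* exposes fallocate() in <fcntl.h> on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#ifndef FALLOC_FL_PROVISION
/* Assumption: value proposed in this patch series, not a released flag. */
#define FALLOC_FL_PROVISION 0x80
#endif

/*
 * Try a provisioning fallocate first and fall back to the default mode.
 * Returns 1 if the provision mode was accepted, 0 if only the default
 * mode succeeded, or a negative errno value on failure.
 */
static int preallocate(int fd, off_t offset, off_t len)
{
	assert(len > 0);

	if (fallocate(fd, FALLOC_FL_PROVISION, offset, len) == 0)
		return 1;

	/* Anything other than "mode not supported" is a real error. */
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return -errno;

	if (fallocate(fd, 0, offset, len) == 0)
		return 0;
	return -errno;
}
```

The need for this kind of fallback in every application is part of the cost
of a separate mode flag that Darrick points out further down the thread.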

Brian

> > I wonder if the smart thing to do here is separate out the question of a
> > new fallocate interface from the mechanism entirely. For example,
> > implement REQ_OP_PROVISION as you've already done, enable block layer
> > mode = 0 fallocate support (i.e. without FL_PROVISION, so whether a
> > request propagates from a loop device will be up to the backing fs),
> > implement the various fs features to support REQ_OP_PROVISION (i.e.,
> > mount option, file attr, etc.), then tack on FL_FALLOC + ext4 support at
> > the end as an RFC/prototype.
> 
> Yeah.
> 
> > Even if we ultimately ended up with FL_PROVISION support, it might
> > actually make some sense to kick that can down the road a bit regardless
> > to give fs' a chance to implement basic REQ_OP_PROVISION support, get a
> > better understanding of how it works in practice, and then perhaps make
> > more informed decisions on things like sane defaults and/or how best to
> > expose it via fallocate. Thoughts?
> 
> Agree. :)
> 
> --D
> 
> > 
> > Brian
> > 
> > > > If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
> > > > call into mode 0 fallocate, then the new functionality can be added (or
> > > > even backported) to existing kernels and customers can use it
> > > > immediately.  If you *do*, then you get to wait a few years for
> > > > developers to add it to their codebases only after enough enterprise
> > > > distros pick up a new kernel to make it worth their while.
> > > >
> > > > > for thinly provisioned filesystems/
> > > > > block devices. For devices that do not support REQ_OP_PROVISION, both these
> > > > > allocation modes will be equivalent. Given the performance cost of sending provision
> > > > > requests to the underlying layers, keeping the default mode as-is allows users to
> > > > > preserve existing behavior.
> > > >
> > > > How expensive is this expected to be?  Is this why you wanted a separate
> > > > mode flag?
> > > >
> > > Yes, the exact latency will depend on the stacked block devices and
> > > the fragmentation at the allocation layers.
> > > 
> > > I did a quick test for benchmarking fallocate() with an:
> > > A) ext4 filesystem mounted with 'noprovision'
> > > B) ext4 filesystem mounted with 'provision' on a dm-thin device.
> > > C) ext4 filesystem mounted with 'provision' on a loop device with a
> > > sparse backing file on the filesystem in (B).
> > > 
> > > I tested file sizes from 512M to 8G, time taken for fallocate() in (A)
> > > remains expectedly flat at ~0.01-0.02s, but for (B), it scales from
> > > 0.03-0.4s and for (C) it scales from 0.04s-0.52s (I captured the exact
> > > time distribution in the cover letter
> > > https://marc.info/?l=linux-ext4&m=167230113520636&w=2)
> > > 
> > > +0.5s for a 8G fallocate doesn't sound a lot but I think fragmentation
> > > and how the block device is layered can make this worse...
> > > 
> > > > --D
> > > >
> > > > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > > > ---
> > > > >  block/fops.c                | 15 +++++++++++----
> > > > >  include/linux/falloc.h      |  3 ++-
> > > > >  include/uapi/linux/falloc.h |  8 ++++++++
> > > > >  3 files changed, 21 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/block/fops.c b/block/fops.c
> > > > > index 50d245e8c913..01bde561e1e2 100644
> > > > > --- a/block/fops.c
> > > > > +++ b/block/fops.c
> > > > > @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > > > >
> > > > >  #define      BLKDEV_FALLOC_FL_SUPPORTED                                      \
> > > > >               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
> > > > > -              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> > > > > +              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |       \
> > > > > +              FALLOC_FL_PROVISION)
> > > > >
> > > > >  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > >                            loff_t len)
> > > > > @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > >       filemap_invalidate_lock(inode->i_mapping);
> > > > >
> > > > >       /* Invalidate the page cache, including dirty pages. */
> > > > > -     error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > > -     if (error)
> > > > > -             goto fail;
> > > > > +     if (mode != FALLOC_FL_PROVISION) {
> > > > > +             error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > > +             if (error)
> > > > > +                     goto fail;
> > > > > +     }
> > > > >
> > > > >       switch (mode) {
> > > > >       case FALLOC_FL_ZERO_RANGE:
> > > > > @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > >               error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > > > >                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > >               break;
> > > > > +     case FALLOC_FL_PROVISION:
> > > > > +             error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > > > +                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > > +             break;
> > > > >       default:
> > > > >               error = -EOPNOTSUPP;
> > > > >       }
> > > > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > > > index f3f0b97b1675..b9a40a61a59b 100644
> > > > > --- a/include/linux/falloc.h
> > > > > +++ b/include/linux/falloc.h
> > > > > @@ -30,7 +30,8 @@ struct space_resv {
> > > > >                                        FALLOC_FL_COLLAPSE_RANGE |     \
> > > > >                                        FALLOC_FL_ZERO_RANGE |         \
> > > > >                                        FALLOC_FL_INSERT_RANGE |       \
> > > > > -                                      FALLOC_FL_UNSHARE_RANGE)
> > > > > +                                      FALLOC_FL_UNSHARE_RANGE |      \
> > > > > +                                      FALLOC_FL_PROVISION)
> > > > >
> > > > >  /* on ia32 l_start is on a 32-bit boundary */
> > > > >  #if defined(CONFIG_X86_64)
> > > > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > > > index 51398fa57f6c..2d323d113eed 100644
> > > > > --- a/include/uapi/linux/falloc.h
> > > > > +++ b/include/uapi/linux/falloc.h
> > > > > @@ -77,4 +77,12 @@
> > > > >   */
> > > > >  #define FALLOC_FL_UNSHARE_RANGE              0x40
> > > > >
> > > > > +/*
> > > > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > > > + * blocks for the range/EOF.
> > > > > + *
> > > > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > > > + */
> > > > > +#define FALLOC_FL_PROVISION          0x80
> > > > > +
> > > > >  #endif /* _UAPI_FALLOC_H_ */
> > > > > --
> > > > > 2.37.3
> > > > >
> > > 
> > 
> 




* Re: [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
@ 2023-01-09 15:07             ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2023-01-09 15:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Sarthak Kukreti, sarthakkukreti, dm-devel, linux-block,
	linux-ext4, linux-kernel, linux-fsdevel, Jens Axboe,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Alasdair Kergon,
	Mike Snitzer, Christoph Hellwig, Theodore Ts'o,
	Andreas Dilger, Bart Van Assche, Daniil Lunev

On Thu, Jan 05, 2023 at 11:35:36AM -0800, Darrick J. Wong wrote:
> On Thu, Jan 05, 2023 at 09:46:06AM -0500, Brian Foster wrote:
> > On Wed, Jan 04, 2023 at 01:22:06PM -0800, Sarthak Kukreti wrote:
> > > (Resend; the text flow made the last reply unreadable)
> > > 
> > > On Wed, Jan 4, 2023 at 8:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> > > > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > > > sends a hint to (supported) thinly provisioned block devices to
> > > > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > > > >
> > > > > The man pages for both fallocate(2) and posix_fallocate(3) describe
> > > > > the default allocation mode as:
> > > > >
> > > > > ```
> > > > > The default operation (i.e., mode is zero) of fallocate()
> > > > > allocates the disk space within the range specified by offset and len.
> > > > > ...
> > > > > subsequent writes to bytes in the specified range are guaranteed
> > > > > not to fail because of lack of disk space.
> > > > > ```
> > > > >
> > > > > For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> > > > > files), the term 'disk space' is overloaded and can either mean the apparent
> > > > > disk space in the filesystem/thin logical volume or the true disk
> > > > > space that will be utilized on the underlying non-sparse allocation layer.
> > > > >
> > > > > The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> > > > > causes allocation only at the current layer (default mode) or whether it propagates
> > > > > allocations to underlying layers (provision mode)
> > > >
> > > > Why is it important to make this distinction?  The outcome of fallocate
> > > > is supposed to be that subsequent writes do not fail with ENOSPC.  In my
> > > > (fs developer) mind, REQ_OP_PROVISION is simply an extra step to be taken
> > > > after allocating file blocks.
> > > >
> > > Some use cases still benefit from keeping the default mode - eg.
> > > virtual machines running on massive storage pools that don't expect to
> > > hit the storage limit anytime soon (like most cloud storage
> > > providers). Essentially, if the 'no ENOSPC' guarantee is maintained
> > > via other means, then REQ_OP_PROVISION adds latency that isn't needed
> > > (and cloud storage providers don't need to set aside that extra space
> > > that may or may not be used).
> > > 
> > 
> > What's the granularity that needs to be managed at? Do you really need
> > an fallocate command for this, or would one of the filesystem level
> > features you've already implemented for ext4 suffice?
> > 
> > I mostly agree with Darrick in that FALLOC_FL_PROVISION still feels a
> > bit wonky to me. I can see that there might be some legitimate use cases
> > for it, but I'm not convinced that it won't just end up being confusing
> > to many users. At the same time, I think the approach of unconditional
> > provision on falloc could eventually lead to complaints associated with
> > the performance impact or similar sorts of confusion. For example,
> > should an falloc of an already allocated range in the fs send a
> > provision or not?
> 
> For a user-initiated fallocate call, I think that's reasonable.
> 

I think so as well, but that doesn't appear to be what the proposed
implementation for ext4 does. I'm not intimately familiar with ext4, but
it looks to me like it only provisions on initial allocation..?

> My first thought is to make the XFS allocator issue REQ_OP_PROVISION on
> every allocation if the device supports it.  The fs has decided that
> it's going to allocate and presumably write to some space, so the
> underlying storage really ought to have some space ready.
> 

That makes sense for a purely thin provisioned device, but runs into
issues with block layer snapshots (re: my comments on the provision
mount option patch). I wonder if it makes more sense to provision at
some point before submitting writes or dirtying pagecache. IIRC we had
prototyped something in XFS a while back that performed an analogous
dm-thin fallocate at the time an extent is mapped for writes. I'm not
sure what the performance impact of that would be or if there's a nice
way to optimize away the obvious side effect of spurious requests.

> But then it occurred to me -- what if the IO fails with ENOSPC?  Do we
> keep going and hope for the best?  Or maybe we should undo the
> allocation?  That could be tricky since we'd have to add a transaction
> to undo the allocation, commit that, and then throw an error to the
> upper layers.
> 

Yeah, that's a good question. IMO we should be able to use something
like this to improve the failure handling for fs' over thinly
provisioned storage with dangerously low free space. There's not much
point in just submitting writes in response to failed provisions in that
case, but perhaps there is some more incremental use case or benefit I'm
not aware of..?

The flipside of more reliably graceful error handling is there may be
more to the minimal solution than just firing off provisions on initial
allocation, unless you wanted to just rule out snapshots I guess. That
said, I think there's still potential opportunity for improvement. For
example, if a prototype did something like the following:

- Provision the log at opportunistic points (i.e., on mount, first
  transaction to a covered log, etc.) to guarantee log writes won't
  fail.
- Provision extents mapped for data writes before the write is allowed
  to proceed.
- Do something similar for metadata in AIL processing or some such,
  where each item must be provisioned before written back.
- Shut down the fs in response to any provision failure.

... obviously that comes with caveats, possibly bad performance, etc.,
but it would be interesting to see if that is sufficient to catch most
scenarios where a write would otherwise get to an out of space volume
causing it to become inactive. If that could be made to work well
enough, perhaps the fs shutdown step could be replaced with some kind of
in-core pause/freeze like mode where the admin has the opportunity to
either add more storage and continue or explicitly shutdown to save the
volume.

OTOH if that just doesn't work out, perhaps this can be combined with
other schemes to reliably prevent inactivation, such as the reservation
mechanism the dm guys had prototyped in the past. Of course that
potentially complicates the interface between the fs and dm-layer.

> Should the allocator instead find the space it wants and issue the
> provisioning IO with the AGF locked, and try again somewhere else if the
> IO returns ENOSPC?  If the space management IO takes forever, we've
> pinned that AG for the duration, which is one of the not very nice
> aspects of the XFS FITRIM implementation on crappy SSDs.
> 
> For a directio write, it's simple enough to throw that error back to
> userspace.  I think the same applies to buffered writeback -- we'll
> cancel the writeback and set AS_ENOSPC on the mapping.
> 
> But then, what about *metadata* allocation?  If those fail because the
> provisioning encounters ENOSPC, we'll shut down the filesystem, which
> isn't nice.  For XFS I guess we could reuse the existing metadata IO
> error config knobs to make it retry for some amount of time until
> (hopefully) the admin buys more storage.
> 
> Let's go with the simplest implementation (issue it with the free space
> locked), and iterate from there.
> 
> > Should filesystems that don't otherwise support UNSHARE_RANGE need to
> > support it in order to support an unshare request to COW'd blocks on
> > an underlying block device?
> 
> Hmm.  Currently, fallocate'ing part of a file that's already mapped to
> shared blocks is a nop.  That's technically an omission in the
> implementation, since a subsequent write can fail during COW setup due
> to insufficient space.  My memory about funshare is a bit murky since
> it's been years now.
> 
> As I remember it, originally, I had allocate mode also calling unshare,
> but Dave or someone pointed out that unsharing generates a flood of
> dirty pagecache, and it would be a bit surprising that fallocate
> suddenly takes a long time to run.  There also wasn't much precedent for
> fallocate to unshare blocks, since btrfs doesn't do that:
> 
> # filefrag -v /mnt/[ab]
> Filesystem type is: 9123683e
> File size of /mnt/a is 1048576 (256 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> /mnt/a: 1 extent found
> File size of /mnt/b is 1048576 (256 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> /mnt/b: 1 extent found
> 
> # xfs_io -c 'falloc 512k 36k' /mnt/b
> 
> # filefrag -v /mnt/[ab]
> Filesystem type is: 9123683e
> File size of /mnt/a is 1048576 (256 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> /mnt/a: 1 extent found
> File size of /mnt/b is 1048576 (256 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> /mnt/b: 1 extent found
> 
> I took funshare out of the patchset entirely (minimum viable product,
> yadda yadda) and a few months later, I think hch or someone asked for a
> knob for userspace to get a file back to pure overwrite mode.  That's
> where it's been ever since.
> 
> So to answer your question: fallocate mode 0 and REQ_OP_PROVISION
> probably ought to be allocating the holes and unsharing existing shared
> mappings.  However, we could also wriggle out of that by <cough>
> claiming that fallocate has been consistent across filesystems in
> leaving that wart for userspace to trip over. :/
> 

Thanks. That seems reasonable to me, but again isn't what the patches
appear to implement. ;P

I guess from the standpoint of an I/O command, it probably makes more
sense to unshare by default. Why else would one send the command
otherwise? The falloc api is what it is at this point, so the bdev folks
could always decide if/how to implement a non-unsharing variant if there
happens to be some reason to do that.

Brian

> > I wonder if the smart thing to do here is separate out the question of a
> > new fallocate interface from the mechanism entirely. For example,
> > implement REQ_OP_PROVISION as you've already done, enable block layer
> > mode = 0 fallocate support (i.e. without FL_PROVISION, so whether a
> > request propagates from a loop device will be up to the backing fs),
> > implement the various fs features to support REQ_OP_PROVISION (i.e.,
> > mount option, file attr, etc.), then tack on FL_FALLOC + ext4 support at
> > the end as an RFC/prototype.
> 
> Yeah.
> 
> > Even if we ultimately ended up with FL_PROVISION support, it might
> > actually make some sense to kick that can down the road a bit regardless
> > to give fs' a chance to implement basic REQ_OP_PROVISION support, get a
> > better understanding of how it works in practice, and then perhaps make
> > more informed decisions on things like sane defaults and/or how best to
> > expose it via fallocate. Thoughts?
> 
> Agree. :)
> 
> --D
> 
> > 
> > Brian
> > 
> > > > If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
> > > > call into mode 0 fallocate, then the new functionality can be added (or
> > > > even backported) to existing kernels and customers can use it
> > > > immediately.  If you *do*, then you get to wait a few years for
> > > > developers to add it to their codebases only after enough enterprise
> > > > distros pick up a new kernel to make it worth their while.
> > > >
> > > > > for thinly provisioned filesystems/
> > > > > block devices. For devices that do not support REQ_OP_PROVISION, both these
> > > > > allocation modes will be equivalent. Given the performance cost of sending provision
> > > > > requests to the underlying layers, keeping the default mode as-is allows users to
> > > > > preserve existing behavior.
> > > >
> > > > How expensive is this expected to be?  Is this why you wanted a separate
> > > > mode flag?
> > > >
> > > Yes, the exact latency will depend on the stacked block devices and
> > > the fragmentation at the allocation layers.
> > > 
> > > I did a quick benchmark of fallocate() with:
> > > A) ext4 filesystem mounted with 'noprovision'
> > > B) ext4 filesystem mounted with 'provision' on a dm-thin device.
> > > C) ext4 filesystem mounted with 'provision' on a loop device with a
> > > sparse backing file on the filesystem in (B).
> > > 
> > > I tested file sizes from 512M to 8G. Time taken for fallocate() in (A)
> > > remains flat, as expected, at ~0.01-0.02s, but for (B) it scales from
> > > 0.03s to 0.4s and for (C) it scales from 0.04s to 0.52s (I captured the
> > > exact time distribution in the cover letter:
> > > https://marc.info/?l=linux-ext4&m=167230113520636&w=2)
> > > 
> > > +0.5s for an 8G fallocate doesn't sound like a lot, but I think
> > > fragmentation and how the block device is layered can make this worse...
> > > 
> > > > --D
> > > >
> > > > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > > > ---
> > > > >  block/fops.c                | 15 +++++++++++----
> > > > >  include/linux/falloc.h      |  3 ++-
> > > > >  include/uapi/linux/falloc.h |  8 ++++++++
> > > > >  3 files changed, 21 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/block/fops.c b/block/fops.c
> > > > > index 50d245e8c913..01bde561e1e2 100644
> > > > > --- a/block/fops.c
> > > > > +++ b/block/fops.c
> > > > > @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > > > >
> > > > >  #define      BLKDEV_FALLOC_FL_SUPPORTED                                      \
> > > > >               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
> > > > > -              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> > > > > +              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |       \
> > > > > +              FALLOC_FL_PROVISION)
> > > > >
> > > > >  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > >                            loff_t len)
> > > > > @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > >       filemap_invalidate_lock(inode->i_mapping);
> > > > >
> > > > >       /* Invalidate the page cache, including dirty pages. */
> > > > > -     error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > > -     if (error)
> > > > > -             goto fail;
> > > > > +     if (mode != FALLOC_FL_PROVISION) {
> > > > > +             error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > > +             if (error)
> > > > > +                     goto fail;
> > > > > +     }
> > > > >
> > > > >       switch (mode) {
> > > > >       case FALLOC_FL_ZERO_RANGE:
> > > > > @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > >               error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > > > >                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > >               break;
> > > > > +     case FALLOC_FL_PROVISION:
> > > > > +             error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > > > +                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > > +             break;
> > > > >       default:
> > > > >               error = -EOPNOTSUPP;
> > > > >       }
> > > > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > > > index f3f0b97b1675..b9a40a61a59b 100644
> > > > > --- a/include/linux/falloc.h
> > > > > +++ b/include/linux/falloc.h
> > > > > @@ -30,7 +30,8 @@ struct space_resv {
> > > > >                                        FALLOC_FL_COLLAPSE_RANGE |     \
> > > > >                                        FALLOC_FL_ZERO_RANGE |         \
> > > > >                                        FALLOC_FL_INSERT_RANGE |       \
> > > > > -                                      FALLOC_FL_UNSHARE_RANGE)
> > > > > +                                      FALLOC_FL_UNSHARE_RANGE |      \
> > > > > +                                      FALLOC_FL_PROVISION)
> > > > >
> > > > >  /* on ia32 l_start is on a 32-bit boundary */
> > > > >  #if defined(CONFIG_X86_64)
> > > > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > > > index 51398fa57f6c..2d323d113eed 100644
> > > > > --- a/include/uapi/linux/falloc.h
> > > > > +++ b/include/uapi/linux/falloc.h
> > > > > @@ -77,4 +77,12 @@
> > > > >   */
> > > > >  #define FALLOC_FL_UNSHARE_RANGE              0x40
> > > > >
> > > > > +/*
> > > > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > > > + * blocks for the range/EOF.
> > > > > + *
> > > > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > > > + */
> > > > > +#define FALLOC_FL_PROVISION          0x80
> > > > > +
> > > > >  #endif /* _UAPI_FALLOC_H_ */
> > > > > --
> > > > > 2.37.3
> > > > >
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2023-01-05 14:46         ` Brian Foster
@ 2023-03-31  0:28           ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2023-03-31  0:28 UTC (permalink / raw)
  To: Brian Foster
  Cc: Darrick J. Wong, sarthakkukreti, dm-devel, linux-block,
	linux-ext4, linux-kernel, linux-fsdevel, Jens Axboe,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Alasdair Kergon,
	Mike Snitzer, Christoph Hellwig, Theodore Ts'o,
	Andreas Dilger, Bart Van Assche, Daniil Lunev

On Thu, Jan 5, 2023 at 6:45 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Wed, Jan 04, 2023 at 01:22:06PM -0800, Sarthak Kukreti wrote:
> > (Resend; the text flow made the last reply unreadable)
> >
> > On Wed, Jan 4, 2023 at 8:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> > > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > > sends a hint to (supported) thinly provisioned block devices to
> > > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > > >
> > > > The man pages for both fallocate(2) and posix_fallocate(3) describe
> > > > the default allocation mode as:
> > > >
> > > > ```
> > > > The default operation (i.e., mode is zero) of fallocate()
> > > > allocates the disk space within the range specified by offset and len.
> > > > ...
> > > > subsequent writes to bytes in the specified range are guaranteed
> > > > not to fail because of lack of disk space.
> > > > ```
> > > >
> > > > For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> > > > files), the term 'disk space' is overloaded and can either mean the apparent
> > > > disk space in the filesystem/thin logical volume or the true disk
> > > > space that will be utilized on the underlying non-sparse allocation layer.
> > > >
> > > > The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> > > > causes allocation only at the current layer (default mode) or whether it propagates
> > > > allocations to underlying layers (provision mode)
> > >
> > > Why is it important to make this distinction?  The outcome of fallocate
> > > is supposed to be that subsequent writes do not fail with ENOSPC.  In my
> > > (fs developer) mind, REQ_OP_PROVISION is simply an extra step to be taken
> > > after allocating file blocks.
> > >
> > Some use cases still benefit from keeping the default mode - eg.
> > virtual machines running on massive storage pools that don't expect to
> > hit the storage limit anytime soon (like most cloud storage
> > providers). Essentially, if the 'no ENOSPC' guarantee is maintained
> > via other means, then REQ_OP_PROVISION adds latency that isn't needed
> > (and cloud storage providers don't need to set aside that extra space
> > that may or may not be used).
> >
>
> What's the granularity that needs to be managed at? Do you really need
> an fallocate command for this, or would one of the filesystem level
> features you've already implemented for ext4 suffice?
>
I think I (belatedly) see the point now; the other mechanisms provide
enough flexibility to make a separate FALLOC_FL_PROVISION flag redundant
and confusing. I'll post the next series without the fallocate() flag.

> I mostly agree with Darrick in that FALLOC_FL_PROVISION still feels a
> bit wonky to me. I can see that there might be some legitimate use cases
> for it, but I'm not convinced that it won't just end up being confusing
> to many users. At the same time, I think the approach of unconditional
> provision on falloc could eventually lead to complaints associated with
> the performance impact or similar sorts of confusion. For example,
> should an falloc of an already allocated range in the fs send a
> provision or not?
>
It boils down to a) whether the underlying device supports
provisioning and b) whether the device is a snapshot. If either is
true, then we'd need to pass provision requests down to the last
layers of the stack. Filesystems might be able to amortize some of the
performance drop if they maintain a bit that tracks whether the extent
has been provisioned/written to; for such extents, we'd only send a
provision request iff the underlying device is a snapshot device. Or
we could make this a policy that's configurable by a mount option
(added details below).
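
One way to read the per-extent policy described above is the small decision
helper below. It is a sketch under stated assumptions, not code from the
series: struct extent_hint, its fields and should_send_provision() are all
hypothetical names, and the rule that nothing is sent when the lower device
cannot provision is my interpretation of the paragraph.

```c
#include <stdbool.h>

/* Hypothetical per-extent state; none of these names exist in the kernel. */
struct extent_hint {
	bool dev_supports_provision; /* lower device can act on a provision request */
	bool dev_is_snapshot;        /* lower device may share blocks with an origin */
	bool already_provisioned;    /* fs-tracked bit: extent was provisioned/written */
};

/* Decide whether a mode 0 fallocate should propagate a provision request. */
static bool should_send_provision(const struct extent_hint *h)
{
	/* Assumption: with no provisioning support below, sending is pointless. */
	if (!h->dev_supports_provision)
		return false;

	/* Extent has never been provisioned or written: always propagate. */
	if (!h->already_provisioned)
		return true;

	/* Already provisioned: only a snapshot device may still need a copy. */
	return h->dev_is_snapshot;
}
```

Tracking the already_provisioned bit is what would let a filesystem skip
the repeat cost on non-snapshot devices; a mount option could override the
result in either direction.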

In the current patch series, I took the simpler route of just
calling REQ_OP_PROVISION on the first fallocate() call. But as
everyone pointed out on the thread, that doesn't work as well for
previously allocated ranges.

> [Reflowed] Should filesystems that don't otherwise support
> UNSHARE_RANGE need to support it in order to support an unshare request
> to COW'd blocks on an underlying block device?
>
I think it would make sense to keep the UNSHARE_RANGE handling intact
and delegate the actual provisioning to the filesystem layer. Even if
the filesystem doesn't support unsharing, we could add a separate
mount option that will result in the filesystem sending
REQ_OP_PROVISION to the entire file range when fallocate() is called
with mode == 0.

> I wonder if the smart thing to do here is separate out the question of a
> new fallocate interface from the mechanism entirely. For example,
> implement REQ_OP_PROVISION as you've already done, enable block layer
> mode = 0 fallocate support (i.e. without FL_PROVISION, so whether a
> request propagates from a loop device will be up to the backing fs),
> implement the various fs features to support REQ_OP_PROVISION (i.e.,
> mount option, file attr, etc.), then tack on FL_PROVISION + ext4 support at
> the end as an RFC/prototype.
>
> Even if we ultimately ended up with FL_PROVISION support, it might
> actually make some sense to kick that can down the road a bit regardless
> to give fs' a chance to implement basic REQ_OP_PROVISION support, get a
> better understanding of how it works in practice, and then perhaps make
> more informed decisions on things like sane defaults and/or how best to
> expose it via fallocate. Thoughts?
>
That's fair (and thanks for the thorough feedback!). I'll split the
series and send out the REQ_OP_PROVISION parts shortly. As you,
Darrick and Ted have pointed out, the filesystem patches need a bit
more work.

Best
Sarthak



> Brian
>
> > > If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
> > > call into mode 0 fallocate, then the new functionality can be added (or
> > > even backported) to existing kernels and customers can use it
> > > immediately.  If you *do*, then you get to wait a few years for
> > > developers to add it to their codebases only after enough enterprise
> > > distros pick up a new kernel to make it worth their while.
> > >
> > > > for thinly provisioned filesystems/
> > > > block devices. For devices that do not support REQ_OP_PROVISION, both these
> > > > allocation modes will be equivalent. Given the performance cost of sending provision
> > > > requests to the underlying layers, keeping the default mode as-is allows users to
> > > > preserve existing behavior.
> > >
> > > How expensive is this expected to be?  Is this why you wanted a separate
> > > mode flag?
> > >
> > Yes, the exact latency will depend on the stacked block devices and
> > the fragmentation at the allocation layers.
> >
> > I did a quick benchmark of fallocate() with:
> > A) ext4 filesystem mounted with 'noprovision'
> > B) ext4 filesystem mounted with 'provision' on a dm-thin device.
> > C) ext4 filesystem mounted with 'provision' on a loop device with a
> > sparse backing file on the filesystem in (B).
> >
> > I tested file sizes from 512M to 8G. Time taken for fallocate() in (A)
> > remains flat, as expected, at ~0.01-0.02s, but for (B) it scales from
> > 0.03s to 0.4s and for (C) it scales from 0.04s to 0.52s (I captured the
> > exact time distribution in the cover letter:
> > https://marc.info/?l=linux-ext4&m=167230113520636&w=2)
> >
> > +0.5s for an 8G fallocate doesn't sound like a lot, but I think
> > fragmentation and how the block device is layered can make this worse...
> >
> > > --D
> > >
> > > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > > ---
> > > >  block/fops.c                | 15 +++++++++++----
> > > >  include/linux/falloc.h      |  3 ++-
> > > >  include/uapi/linux/falloc.h |  8 ++++++++
> > > >  3 files changed, 21 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/block/fops.c b/block/fops.c
> > > > index 50d245e8c913..01bde561e1e2 100644
> > > > --- a/block/fops.c
> > > > +++ b/block/fops.c
> > > > @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > > >
> > > >  #define      BLKDEV_FALLOC_FL_SUPPORTED                                      \
> > > >               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
> > > > -              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> > > > +              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |       \
> > > > +              FALLOC_FL_PROVISION)
> > > >
> > > >  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > >                            loff_t len)
> > > > @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > >       filemap_invalidate_lock(inode->i_mapping);
> > > >
> > > >       /* Invalidate the page cache, including dirty pages. */
> > > > -     error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > -     if (error)
> > > > -             goto fail;
> > > > +     if (mode != FALLOC_FL_PROVISION) {
> > > > +             error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > +             if (error)
> > > > +                     goto fail;
> > > > +     }
> > > >
> > > >       switch (mode) {
> > > >       case FALLOC_FL_ZERO_RANGE:
> > > > @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > >               error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > > >                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > >               break;
> > > > +     case FALLOC_FL_PROVISION:
> > > > +             error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > > +                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > +             break;
> > > >       default:
> > > >               error = -EOPNOTSUPP;
> > > >       }
> > > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > > index f3f0b97b1675..b9a40a61a59b 100644
> > > > --- a/include/linux/falloc.h
> > > > +++ b/include/linux/falloc.h
> > > > @@ -30,7 +30,8 @@ struct space_resv {
> > > >                                        FALLOC_FL_COLLAPSE_RANGE |     \
> > > >                                        FALLOC_FL_ZERO_RANGE |         \
> > > >                                        FALLOC_FL_INSERT_RANGE |       \
> > > > -                                      FALLOC_FL_UNSHARE_RANGE)
> > > > +                                      FALLOC_FL_UNSHARE_RANGE |      \
> > > > +                                      FALLOC_FL_PROVISION)
> > > >
> > > >  /* on ia32 l_start is on a 32-bit boundary */
> > > >  #if defined(CONFIG_X86_64)
> > > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > > index 51398fa57f6c..2d323d113eed 100644
> > > > --- a/include/uapi/linux/falloc.h
> > > > +++ b/include/uapi/linux/falloc.h
> > > > @@ -77,4 +77,12 @@
> > > >   */
> > > >  #define FALLOC_FL_UNSHARE_RANGE              0x40
> > > >
> > > > +/*
> > > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > > + * blocks for the range/EOF.
> > > > + *
> > > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > > + */
> > > > +#define FALLOC_FL_PROVISION          0x80
> > > > +
> > > >  #endif /* _UAPI_FALLOC_H_ */
> > > > --
> > > > 2.37.3
> > > >
> >
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2023-01-05 15:49         ` [dm-devel] " Theodore Ts'o
@ 2023-03-31  0:28           ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2023-03-31  0:28 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Darrick J. Wong, sarthakkukreti, dm-devel, linux-block,
	linux-ext4, linux-kernel, linux-fsdevel, Jens Axboe,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Alasdair Kergon,
	Mike Snitzer, Christoph Hellwig, Brian Foster, Andreas Dilger,
	Bart Van Assche, Daniil Lunev

On Thu, Jan 5, 2023 at 7:49 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Wed, Jan 04, 2023 at 01:22:06PM -0800, Sarthak Kukreti wrote:
> > > How expensive is this expected to be?  Is this why you wanted a separate
> > > mode flag?
> >
> > Yes, the exact latency will depend on the stacked block devices and
> > the fragmentation at the allocation layers.
> >
> > I did a quick test for benchmarking fallocate() with an:
> > A) ext4 filesystem mounted with 'noprovision'
> > B) ext4 filesystem mounted with 'provision' on a dm-thin device.
> > C) ext4 filesystem mounted with 'provision' on a loop device with a
> > sparse backing file on the filesystem in (B).
> >
> > I tested file sizes from 512M to 8G, time taken for fallocate() in (A)
> > remains expectedly flat at ~0.01-0.02s, but for (B), it scales from
> > 0.03-0.4s and for (C) it scales from 0.04s-0.52s (I captured the exact
> > time distribution in the cover letter
> > https://marc.info/?l=linux-ext4&m=167230113520636&w=2)
> >
> > +0.5s for an 8G fallocate doesn't sound like a lot, but I think
> > fragmentation and how the block device is layered can make this worse...
>
> If userspace uses fallocate(2) there are generally two reasons.
> Either they **really** don't want to get the NOSPC, in which case
> noprovision will not give them what they want unless we modify their
> source code to add this new FALLOC_FL_PROVISION flag --- which may not
> be possible if it is provided in a binary-only format (for example,
> proprietary databases shipped by companies beginning with the letters
> 'I' or 'O').
>
> Or, they really care about avoiding fragmentation by giving a hint to
> the file system that layout is important, and so **please** allocate
> the space right away so that it is more likely that the space will be
> laid out in a contiguous fashion.  Of course, the moment you use
> thin-provisioning this goes out the window, since even if the space is
> contiguous on the dm-thin layer, on the underlying storage layer it is
> likely that things will be fragmented to a fare-thee-well, and either
> (a) you have a vast amount of flash to try to mitigate the performance
> hit of using thin-provisioning (example, hardware thin-provisioning
> such as EMC storage arrays), or (b) you really don't care about
> performance since space savings is what you're going for.
>
> So.... because of the issue of changing the semantics of what
> fallocate(2) will guarantee, unless programs are forced to change
> their code to use this new FALLOC flag, I really am not very fond of
> it.
>
> I suspect that using a mount option (which should default to
> "provision"; if you want to break user API expectations, it should
> require a mount option for the system administrator to explicitly OK
> such a change), is OK.
>
Understood. I dropped the FALLOC flag from the series in v3; instead,
we now rely on the filesystem's mount option/policy.
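
With the flag gone, callers keep issuing a plain default-mode
preallocation and the mount policy decides whether it propagates. A
minimal userspace sketch for timing such a call, in the spirit of the
benchmark quoted above, might look like the following. It is not the
harness used for those numbers: time_prealloc() is a hypothetical
helper, and it assumes posix_fallocate(3), which on glibc tries
fallocate(fd, 0, ...) first and falls back to writing the blocks if the
filesystem lacks support (which would skew the timing).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the series): preallocate 'len' bytes
 * at offset 0 of 'path' and return the seconds spent in the call, or
 * -1.0 if the file cannot be opened or the preallocation fails.
 */
double time_prealloc(const char *path, off_t len)
{
	struct timespec t0, t1;
	int fd, err;

	assert(path != NULL && len > 0);

	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* posix_fallocate() returns 0 on success, an error number otherwise. */
	err = posix_fallocate(fd, 0, len);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);

	if (err != 0)
		return -1.0;

	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Running this against the same file size on filesystems mounted with
different policies would give a rough comparison; absolute numbers will
depend on the stacking and fragmentation discussed above.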

> As far as the per-file mode --- I'm not convinced it's really
> necessary.  In general, if you are using thin-provisioning, file systems
> tend to be used explicitly for one purpose, so adding the complexity
> of doing it on a per-file basis is probably not really needed.  That
> being said, your existing prototype requires searching for the
> extended attribute on every single file allocation, which is not a
> great idea.  On a system with SELinux enabled, every file will have an
> xattr block, and requiring that it be searched on every file
> allocation would be unfortunate.  It would be better to check for the
> xattr when the file is opened, and then set a flag in the struct
> file.  However, it might be better to see if there is a real demand
> for such a feature before adding it.
>
Thanks for the feedback! On ChromeOS, we still have filesystems shared
between applications, partly due to inertia of adoption. So, we have a
few cases where applications need to share the filesystem but with
differing provisioning policies.

One more idea that I've been exploring in this space, which builds on
the above file-based mechanism, is to use a 'provisioning disabled'
fallocated file to make the apparent free space in the thinly
provisioned filesystem match the space available in the thinpool. In
theory, this prevents userspace applications from writing much more
than what's available on the thinpool. In practice, it depends on the
responsiveness of the service that monitors and resizes this 'storage
balloon'.
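
As a rough illustration of the sizing arithmetic such a monitoring
service might use (a sketch under assumptions, not code from the
series): balloon_target_bytes() and its parameters are hypothetical
names, and the inputs are assumed to come from statfs() on the
filesystem and from the thinpool status.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the size the balloon file should have so
 * that the filesystem's apparent free space does not exceed the free
 * space left in the thinpool.
 *
 * fs_free      free bytes currently reported by the filesystem
 * balloon_cur  bytes currently held by the balloon file
 * pool_free    free bytes left in the thinpool
 */
uint64_t balloon_target_bytes(uint64_t fs_free, uint64_t balloon_cur,
			      uint64_t pool_free)
{
	/* Free space the filesystem would report with no balloon at all. */
	uint64_t fs_free_unballooned = fs_free + balloon_cur;

	if (fs_free_unballooned <= pool_free)
		return 0;	/* pool can back everything; deflate fully */

	/* Hide the part of the free space the pool cannot back. */
	return fs_free_unballooned - pool_free;
}
```

The service would then grow or shrink the balloon file to the returned
size; how quickly it reacts is exactly the responsiveness concern
mentioned above.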

Best
Sarthak

>                                                 - Ted

* Re: [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
  2023-01-09 15:07             ` Brian Foster
@ 2023-03-31  0:28               ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2023-03-31  0:28 UTC (permalink / raw)
  To: Brian Foster
  Cc: Darrick J. Wong, sarthakkukreti, dm-devel, linux-block,
	linux-ext4, linux-kernel, linux-fsdevel, Jens Axboe,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Alasdair Kergon,
	Mike Snitzer, Christoph Hellwig, Theodore Ts'o,
	Andreas Dilger, Bart Van Assche, Daniil Lunev

On Mon, Jan 9, 2023 at 7:06 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Jan 05, 2023 at 11:35:36AM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 05, 2023 at 09:46:06AM -0500, Brian Foster wrote:
> > > On Wed, Jan 04, 2023 at 01:22:06PM -0800, Sarthak Kukreti wrote:
> > > > (Resend; the text flow made the last reply unreadable)
> > > >
> > > > On Wed, Jan 4, 2023 at 8:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> > > > > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > > > > sends a hint to (supported) thinly provisioned block devices to
> > > > > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > > > > >
> > > > > > The man pages for both fallocate(2) and posix_fallocate(3) describe
> > > > > > the default allocation mode as:
> > > > > >
> > > > > > ```
> > > > > > The default operation (i.e., mode is zero) of fallocate()
> > > > > > allocates the disk space within the range specified by offset and len.
> > > > > > ...
> > > > > > subsequent writes to bytes in the specified range are guaranteed
> > > > > > not to fail because of lack of disk space.
> > > > > > ```
> > > > > >
> > > > > > For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> > > > > > files), the term 'disk space' is overloaded and can either mean the apparent
> > > > > > disk space in the filesystem/thin logical volume or the true disk
> > > > > > space that will be utilized on the underlying non-sparse allocation layer.
> > > > > >
> > > > > > The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> > > > > > causes allocation only at the current layer (default mode) or whether it propagates
> > > > > > allocations to underlying layers (provision mode)
> > > > >
> > > > > Why is it important to make this distinction?  The outcome of fallocate
> > > > > is supposed to be that subsequent writes do not fail with ENOSPC.  In my
> > > > > (fs developer) mind, REQ_OP_PROVISION is simply an extra step to be taken
> > > > > after allocating file blocks.
> > > > >
> > > > Some use cases still benefit from keeping the default mode - eg.
> > > > virtual machines running on massive storage pools that don't expect to
> > > > hit the storage limit anytime soon (like most cloud storage
> > > > providers). Essentially, if the 'no ENOSPC' guarantee is maintained
> > > > via other means, then REQ_OP_PROVISION adds latency that isn't needed
> > > > (and cloud storage providers don't need to set aside that extra space
> > > > that may or may not be used).
> > > >
> > >
> > > What's the granularity that needs to be managed at? Do you really need
> > > an fallocate command for this, or would one of the filesystem level
> > > features you've already implemented for ext4 suffice?
> > >
> > > I mostly agree with Darrick in that FALLOC_FL_PROVISION still feels a
> > > bit wonky to me. I can see that there might be some legitimate use cases
> > > for it, but I'm not convinced that it won't just end up being confusing
> > > to many users. At the same time, I think the approach of unconditional
> > > provision on falloc could eventually lead to complaints associated with
> > > the performance impact or similar sorts of confusion. For example,
> > > should an falloc of an already allocated range in the fs send a
> > > provision or not?
> >
> > For a user-initiated fallocate call, I think that's reasonable.
> >
>
> I think so as well, but that doesn't appear to be what the proposed
> implementation for ext4 does. I'm not intimately familiar with ext4, but
> it looks to me like it only provisions on initial allocation..?
>
That is correct. I think there are two parts/policies to it:

1) provision on first allocation: assuming that the filesystem is
never used on top of a block-level snapshot device, it might be
prudent to limit provision requests to just the first allocation.
2) always provision: for filesystems that are set up on top of a
snapshot device.

Would it make sense to split these behaviors out into mount-based
policies (provision, noprovision, provision_on_alloc)? That way, we
can amortize the cost of provisioning when no block-level snapshots
are involved, while still ensuring correctness when they are.
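
A minimal sketch of how such a three-way policy could be expressed,
assuming 'provision' means "always provision" and 'provision_on_alloc'
means "only on first allocation". The enum and function names are
hypothetical and do not exist in the series:

```c
#include <stdbool.h>

/* Hypothetical mount-based provisioning policies. */
enum provision_policy {
	POLICY_NOPROVISION,		/* never send provision requests */
	POLICY_PROVISION,		/* always provision (snapshot-safe) */
	POLICY_PROVISION_ON_ALLOC,	/* provision only newly allocated blocks */
};

/*
 * Decide whether an fallocate() of a range should send a provision
 * request to the underlying device. 'newly_allocated' is true if the
 * filesystem had to allocate blocks for the range.
 */
bool should_provision(enum provision_policy policy, bool newly_allocated)
{
	switch (policy) {
	case POLICY_PROVISION:
		return true;
	case POLICY_PROVISION_ON_ALLOC:
		return newly_allocated;
	case POLICY_NOPROVISION:
	default:
		return false;
	}
}
```

Under this reading, only POLICY_PROVISION re-provisions an already
allocated range, which is the case that matters on top of a snapshot
device.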

Best
Sarthak


> > My first thought is to make the XFS allocator issue REQ_OP_PROVISION on
> > every allocation if the device supports it.  The fs has decided that
> > it's going to allocate and presumably write to some space, so the
> > underlying storage really ought to have some space ready.
> >
>
> That makes sense for a purely thin provisioned device, but runs into
> issues with block layer snapshots (re: my comments on the provision
> mount option patch). I wonder if it makes more sense to provision at
> some point before submitting writes or dirtying pagecache. IIRC we had
> prototyped something in XFS a while back that performed an analogous
> dm-thin fallocate at the time an extent is mapped for writes. I'm not
> sure what the performance impact of that would be or if there's a nice
> way to optimize away the obvious side effect of spurious requests.
>
> > But then it occurred to me -- what if the IO fails with ENOSPC?  Do we
> > keep going and hope for the best?  Or maybe we should undo the
> > allocation?  That could be tricky since we'd have to add a transaction
> > to undo the allocation, commit that, and then throw an error to the
> > upper layers.
> >
>
> Yeah, that's a good question. IMO we should be able to use something
> like this to improve the failure handling for fs' over thinly
> provisioned storage with dangerously low free space. There's not much
> point in just submitting writes in response to failed provisions in that
> case, but perhaps there is some more incremental use case or benefit I'm
> not aware of..?
>
> The flipside of more reliably graceful error handling is there may be
> more to the minimal solution than just firing off provisions on initial
> allocation, unless you wanted to just rule out snapshots I guess. That
> said, I think there's still potential opportunity for improvement. For
> example, if a prototype did something like the following:
>
> - Provision the log at opportunistic points (i.e., on mount, first
>   transaction to a covered log, etc.) to guarantee log writes won't
>   fail.
> - Provision extents mapped for data writes before the write is allowed
>   to proceed.
> - Do something similar for metadata in AIL processing or some such,
>   where each item must be provisioned before written back.
> - Shutdown the fs in response to any provision failure.
>
> ... obviously that comes with caveats, possibly bad performance, etc.,
> but it would be interesting to see if that is sufficient to catch most
> scenarios where a write would otherwise get to an out of space volume
> causing it to become inactive. If that could be made to work well
> enough, perhaps the fs shutdown step could be replaced with some kind of
> in-core pause/freeze like mode where the admin has the opportunity to
> either add more storage and continue or explicitly shutdown to save the
> volume.
>
> OTOH if that just doesn't work out, perhaps this can be combined with
> other schemes to reliably prevent inactivation, such as the reservation
> mechanism the dm guys had prototyped in the past. Of course that
> potentially complicates the interface between the fs and dm-layer.
>
> > Should the allocator instead find the space it wants and issue the
> > provisioning IO with the AGF locked, and try again somewhere else if the
> > IO returns ENOSPC?  If the space management IO takes forever, we've
> > pinned that AG for the duration, which is one of the not very nice
> > aspects of the XFS FITRIM implementation on crappy SSDs.
> >
> > For a directio write, it's simple enough to throw that error back to
> > userspace.  I think the same applies to buffered writeback -- we'll
> > cancel the writeback and set AS_ENOSPC on the mapping.
> >
> > But then, what about *metadata* allocation?  If those fail because the
> > provisioning encounters ENOSPC, we'll shut down the filesystem, which
> > isn't nice.  For XFS I guess we could reuse the existing metadata IO
> > error config knobs to make it retry for some amount of time until
> > (hopefully) the admin buys more storage.
> >
> > Let's go with the simplest implementation (issue it with the free space
> > locked), and iterate from there.
> >
> > > Should filesystems that don't otherwise support UNSHARE_RANGE need to
> > > support it in order to support an unshare request to COW'd blocks on
> > > an underlying block device?
> >
> > Hmm.  Currently, fallocate'ing part of a file that's already mapped to
> > shared blocks is a nop.  That's technically an omission in the
> > implementation, since a subsequent write can fail during COW setup due
> > to insufficient space.  My memory about funshare is a bit murky since
> > it's been years now.
> >
> > As I remember it, originally, I had allocate mode also calling unshare,
> > but Dave or someone pointed out that unsharing generates a flood of
> > dirty pagecache, and it would be a bit surprising that fallocate
> > suddenly takes a long time to run.  There also wasn't much precedent for
> > fallocate to unshare blocks, since btrfs doesn't do that:
> >
> > # filefrag -v /mnt/[ab]
> > Filesystem type is: 9123683e
> > File size of /mnt/a is 1048576 (256 blocks of 4096 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> > /mnt/a: 1 extent found
> > File size of /mnt/b is 1048576 (256 blocks of 4096 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> > /mnt/b: 1 extent found
> >
> > # xfs_io -c 'falloc 512k 36k' /mnt/b
> >
> > # filefrag -v /mnt/[ab]
> > Filesystem type is: 9123683e
> > File size of /mnt/a is 1048576 (256 blocks of 4096 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> > /mnt/a: 1 extent found
> > File size of /mnt/b is 1048576 (256 blocks of 4096 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> > /mnt/b: 1 extent found
> >
> > I took funshare out of the patchset entirely (minimum viable product,
> > yadda yadda) and a few months later, I think hch or someone asked for a
> > knob for userspace to get a file back to pure overwrite mode.  That's
> > where it's been ever since.
> >
> > So to answer your question: fallocate mode 0 and REQ_OP_PROVISION
> > probably ought to be allocating the holes and unsharing existing shared
> > mappings.  However, we could also wriggle out of that by <cough>
> > claiming that fallocate has been consistent across filesystems in
> > leaving that wart for userspace to trip over. :/
> >
>
> Thanks. That seems reasonable to me, but again isn't what the patches
> appear to implement. ;P
>
> I guess from the standpoint of an I/O command, it probably makes more
> sense to unshare by default. Why else would one send the command
> otherwise? The falloc api is what it is at this point, so the bdev folks
> could always decide if/how to implement a non-unsharing variant if there
> happens to be some reason to do that.
>
> Brian
>
> > > I wonder if the smart thing to do here is separate out the question of a
> > > new fallocate interface from the mechanism entirely. For example,
> > > implement REQ_OP_PROVISION as you've already done, enable block layer
> > > mode = 0 fallocate support (i.e. without FL_PROVISION, so whether a
> > > request propagates from a loop device will be up to the backing fs),
> > > implement the various fs features to support REQ_OP_PROVISION (i.e.,
> > > mount option, file attr, etc.), then tack on FL_PROVISION + ext4 support at
> > > the end as an RFC/prototype.
> >
> > Yeah.
> >
> > > Even if we ultimately ended up with FL_PROVISION support, it might
> > > actually make some sense to kick that can down the road a bit regardless
> > > to give fs' a chance to implement basic REQ_OP_PROVISION support, get a
> > > better understanding of how it works in practice, and then perhaps make
> > > more informed decisions on things like sane defaults and/or how best to
> > > expose it via fallocate. Thoughts?
> >
> > Agree. :)
> >
> > --D
> >
> > >
> > > Brian
> > >
> > > > > If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
> > > > > call into mode 0 fallocate, then the new functionality can be added (or
> > > > > even backported) to existing kernels and customers can use it
> > > > > immediately.  If you *do*, then you get to wait a few years for
> > > > > developers to add it to their codebases only after enough enterprise
> > > > > distros pick up a new kernel to make it worth their while.
> > > > >
> > > > > > for thinly provisioned filesystems/
> > > > > > block devices. For devices that do not support REQ_OP_PROVISION, both these
> > > > > > allocation modes will be equivalent. Given the performance cost of sending provision
> > > > > > requests to the underlying layers, keeping the default mode as-is allows users to
> > > > > > preserve existing behavior.
> > > > >
> > > > > How expensive is this expected to be?  Is this why you wanted a separate
> > > > > mode flag?
> > > > >
> > > > Yes, the exact latency will depend on the stacked block devices and
> > > > the fragmentation at the allocation layers.
> > > >
> > > > I did a quick test for benchmarking fallocate() with an:
> > > > A) ext4 filesystem mounted with 'noprovision'
> > > > B) ext4 filesystem mounted with 'provision' on a dm-thin device.
> > > > C) ext4 filesystem mounted with 'provision' on a loop device with a
> > > > sparse backing file on the filesystem in (B).
> > > >
> > > > I tested file sizes from 512M to 8G, time taken for fallocate() in (A)
> > > > remains expectedly flat at ~0.01-0.02s, but for (B), it scales from
> > > > 0.03-0.4s and for (C) it scales from 0.04s-0.52s (I captured the exact
> > > > time distribution in the cover letter
> > > > https://marc.info/?l=linux-ext4&m=167230113520636&w=2)
> > > >
> > > > +0.5s for an 8G fallocate doesn't sound like a lot, but I think
> > > > fragmentation and how the block device is layered can make this worse...
> > > >
> > > > > --D
> > > > >
> > > > > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > > > > ---
> > > > > >  block/fops.c                | 15 +++++++++++----
> > > > > >  include/linux/falloc.h      |  3 ++-
> > > > > >  include/uapi/linux/falloc.h |  8 ++++++++
> > > > > >  3 files changed, 21 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/block/fops.c b/block/fops.c
> > > > > > index 50d245e8c913..01bde561e1e2 100644
> > > > > > --- a/block/fops.c
> > > > > > +++ b/block/fops.c
> > > > > > @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > > > > >
> > > > > >  #define      BLKDEV_FALLOC_FL_SUPPORTED                                      \
> > > > > >               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
> > > > > > -              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> > > > > > +              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |       \
> > > > > > +              FALLOC_FL_PROVISION)
> > > > > >
> > > > > >  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > > >                            loff_t len)
> > > > > > @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > > >       filemap_invalidate_lock(inode->i_mapping);
> > > > > >
> > > > > >       /* Invalidate the page cache, including dirty pages. */
> > > > > > -     error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > > > -     if (error)
> > > > > > -             goto fail;
> > > > > > +     if (mode != FALLOC_FL_PROVISION) {
> > > > > > +             error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > > > +             if (error)
> > > > > > +                     goto fail;
> > > > > > +     }
> > > > > >
> > > > > >       switch (mode) {
> > > > > >       case FALLOC_FL_ZERO_RANGE:
> > > > > > @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > > >               error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > > > > >                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > > >               break;
> > > > > > +     case FALLOC_FL_PROVISION:
> > > > > > +             error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > > > > +                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > > > +             break;
> > > > > >       default:
> > > > > >               error = -EOPNOTSUPP;
> > > > > >       }
> > > > > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > > > > index f3f0b97b1675..b9a40a61a59b 100644
> > > > > > --- a/include/linux/falloc.h
> > > > > > +++ b/include/linux/falloc.h
> > > > > > @@ -30,7 +30,8 @@ struct space_resv {
> > > > > >                                        FALLOC_FL_COLLAPSE_RANGE |     \
> > > > > >                                        FALLOC_FL_ZERO_RANGE |         \
> > > > > >                                        FALLOC_FL_INSERT_RANGE |       \
> > > > > > -                                      FALLOC_FL_UNSHARE_RANGE)
> > > > > > +                                      FALLOC_FL_UNSHARE_RANGE |      \
> > > > > > +                                      FALLOC_FL_PROVISION)
> > > > > >
> > > > > >  /* on ia32 l_start is on a 32-bit boundary */
> > > > > >  #if defined(CONFIG_X86_64)
> > > > > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > > > > index 51398fa57f6c..2d323d113eed 100644
> > > > > > --- a/include/uapi/linux/falloc.h
> > > > > > +++ b/include/uapi/linux/falloc.h
> > > > > > @@ -77,4 +77,12 @@
> > > > > >   */
> > > > > >  #define FALLOC_FL_UNSHARE_RANGE              0x40
> > > > > >
> > > > > > +/*
> > > > > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > > > > + * blocks for the range/EOF.
> > > > > > + *
> > > > > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > > > > + */
> > > > > > +#define FALLOC_FL_PROVISION          0x80
> > > > > > +
> > > > > >  #endif /* _UAPI_FALLOC_H_ */
> > > > > > --
> > > > > > 2.37.3
> > > > > >
> > > >
> > >
> >
>

* Re: [dm-devel] [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
@ 2023-03-31  0:28               ` Sarthak Kukreti
  0 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2023-03-31  0:28 UTC (permalink / raw)
  To: Brian Foster
  Cc: Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Michael S. Tsirkin, sarthakkukreti, Darrick J. Wong, Jason Wang,
	Bart Van Assche, Mike Snitzer, linux-kernel, linux-block,
	dm-devel, Andreas Dilger, Daniil Lunev, Stefan Hajnoczi,
	linux-fsdevel, linux-ext4, Alasdair Kergon

On Mon, Jan 9, 2023 at 7:06 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Jan 05, 2023 at 11:35:36AM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 05, 2023 at 09:46:06AM -0500, Brian Foster wrote:
> > > On Wed, Jan 04, 2023 at 01:22:06PM -0800, Sarthak Kukreti wrote:
> > > > (Resend; the text flow made the last reply unreadable)
> > > >
> > > > On Wed, Jan 4, 2023 at 8:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> > > > > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > > > > sends a hint to (supported) thinly provisioned block devices to
> > > > > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > > > > >
> > > > > > The man pages for both fallocate(2) and posix_fallocate(3) describe
> > > > > > the default allocation mode as:
> > > > > >
> > > > > > ```
> > > > > > The default operation (i.e., mode is zero) of fallocate()
> > > > > > allocates the disk space within the range specified by offset and len.
> > > > > > ...
> > > > > > subsequent writes to bytes in the specified range are guaranteed
> > > > > > not to fail because of lack of disk space.
> > > > > > ```
> > > > > >
> > > > > > For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> > > > > > files), the term 'disk space' is overloaded and can either mean the apparent
> > > > > > disk space in the filesystem/thin logical volume or the true disk
> > > > > > space that will be utilized on the underlying non-sparse allocation layer.
> > > > > >
> > > > > > The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> > > > > > causes allocation only at the current layer (default mode) or whether it propagates
> > > > > > allocations to underlying layers (provision mode)
> > > > >
> > > > > Why is it important to make this distinction?  The outcome of fallocate
> > > > > is supposed to be that subsequent writes do not fail with ENOSPC.  In my
> > > > > (fs developer) mind, REQ_OP_PROVISION is simply an extra step to be taken
> > > > > after allocating file blocks.
> > > > >
> > > > Some use cases still benefit from keeping the default mode - eg.
> > > > virtual machines running on massive storage pools that don't expect to
> > > > hit the storage limit anytime soon (like most cloud storage
> > > > providers). Essentially, if the 'no ENOSPC' guarantee is maintained
> > > > via other means, then REQ_OP_PROVISION adds latency that isn't needed
> > > > (and cloud storage providers don't need to set aside that extra space
> > > > that may or may not be used).
> > > >
> > >
> > > What's the granularity that needs to be managed at? Do you really need
> > > an fallocate command for this, or would one of the filesystem level
> > > features you've already implemented for ext4 suffice?
> > >
> > > I mostly agree with Darrick in that FALLOC_FL_PROVISION still feels a
> > > bit wonky to me. I can see that there might be some legitimate use cases
> > > for it, but I'm not convinced that it won't just end up being confusing
> > > to many users. At the same time, I think the approach of unconditional
> > > provision on falloc could eventually lead to complaints associated with
> > > the performance impact or similar sorts of confusion. For example,
> > > should an falloc of an already allocated range in the fs send a
> > > provision or not?
> >
> > For a user-initiated fallocate call, I think that's reasonable.
> >
>
> I think so as well, but that doesn't appear to be what the proposed
> implementation for ext4 does. I'm not intimately familiar with ext4, but
> it looks to me like it only provisions on initial allocation..?
>
That is correct. I think there are two parts/policies to it:

1) provision on first allocation: assuming that the filesystem is
never used on top of a block-level snapshot device, it might be
prudent to limit provision requests to just the first allocation.
2) always provision: for filesystems that are set up on top of a
snapshot device.

Would it make sense to split these behaviors out into mount-based
policies (provision, noprovision, provision_on_alloc)? That way, we
can amortize the cost of provisioning in the non-block snapshot world,
but at the same time ensure correctness.
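
As a rough sketch of how the three policies could differ (hypothetical
names only, not the ext4 patch itself; the enum and helper below are
assumptions made for illustration):

```c
#include <stdbool.h>

/* Hypothetical policy values mirroring the proposed mount options. */
enum provision_policy {
	PROVISION_NEVER,	/* "noprovision": never send provision requests */
	PROVISION_ON_ALLOC,	/* "provision_on_alloc": only for newly allocated blocks */
	PROVISION_ALWAYS,	/* "provision": every fallocate, even over allocated ranges */
};

/*
 * Decide whether an fallocate over a range should be followed by a
 * provision request to the underlying block device.
 * newly_allocated is true when the filesystem had to allocate blocks
 * for the range, false when the range was already allocated.
 */
static bool should_provision(enum provision_policy policy, bool newly_allocated)
{
	switch (policy) {
	case PROVISION_ALWAYS:
		return true;
	case PROVISION_ON_ALLOC:
		return newly_allocated;
	case PROVISION_NEVER:
	default:
		return false;
	}
}
```

With something like this, only the 'provision' policy would re-send
requests for ranges the fs already allocated, which is the case that
matters once a block-level snapshot may have made those blocks shared.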

Best
Sarthak


> > My first thought is to make the XFS allocator issue REQ_OP_PROVISION on
> > every allocation if the device supports it.  The fs has decided that
> > it's going to allocate and presumably write to some space, so the
> > underlying storage really ought to have some space ready.
> >
>
> That makes sense for a purely thin provisioned device, but runs into
> issues with block layer snapshots (re: my comments on the provision
> mount option patch). I wonder if it makes more sense to provision at
> some point before submitting writes or dirtying pagecache. IIRC we had
> prototyped something in XFS a while back that performed an analogous
> dm-thin fallocate at the time an extent is mapped for writes. I'm not
> sure what the performance impact of that would be or if there's a nice
> way to optimize away the obvious side effect of spurious requests.
>
> > But then it occurred to me -- what if the IO fails with ENOSPC?  Do we
> > keep going and hope for the best?  Or maybe we should undo the
> > allocation?  That could be tricky since we'd have to add a transaction
> > to undo the allocation, commit that, and then throw an error to the
> > upper layers.
> >
>
> Yeah, that's a good question. IMO we should be able to use something
> like this to improve the failure handling for fs' over thinly
> provisioned storage with dangerously low free space. There's not much
> point in just submitting writes in response to failed provisions in that
> case, but perhaps there is some more incremental use case or benefit I'm
> not aware of..?
>
> The flipside of more reliably graceful error handling is there may be
> more to the minimal solution than just firing off provisions on initial
> allocation, unless you wanted to just rule out snapshots I guess. That
> said, I think there's still potential opportunity for improvement. For
> example, if a prototype did something like the following:
>
> - Provision the log at opportunistic points (i.e., on mount, first
>   transaction to a covered log, etc.) to guarantee log writes won't
>   fail.
> - Provision extents mapped for data writes before the write is allowed
>   to proceed.
> - Do something similar for metadata in AIL processing or some such,
>   where each item must be provisioned before written back.
> - Shutdown the fs in response to any provision failure.
>
> ... obviously that comes with caveats, possibly bad performance, etc.,
> but it would be interesting to see if that is sufficient to catch most
> scenarios where a write would otherwise get to an out of space volume
> causing it to become inactive. If that could be made to work well
> enough, perhaps the fs shutdown step could be replaced with some kind of
> in-core pause/freeze like mode where the admin has the opportunity to
> either add more storage and continue or explicitly shutdown to save the
> volume.
>
> OTOH if that just doesn't work out, perhaps this can be combined with
> other schemes to reliably prevent inactivation, such as the reservation
> mechanism the dm guys had prototyped in the past. Of course that
> potentially complicates the interface between the fs and dm-layer.
>
> > Should the allocator instead find the space it wants and issue the
> > provisioning IO with the AGF locked, and try again somewhere else if the
> > IO returns ENOSPC?  If the space management IO takes forever, we've
> > pinned that AG for the duration, which is one of the not very nice
> > aspects of the XFS FITRIM implementation on crappy SSDs.
> >
> > For a directio write, it's simple enough to throw that error back to
> > userspace.  I think the same applies to buffered writeback -- we'll
> > cancel the writeback and set AS_ENOSPC on the mapping.
> >
> > But then, what about *metadata* allocation?  If those fail because the
> > provisioning encounters ENOSPC, we'll shut down the filesystem, which
> > isn't nice.  For XFS I guess we could reuse the existing metadata IO
> > error config knobs to make it retry for some amount of time until
> > (hopefully) the admin buys more storage.
> >
> > Let's go with the simplest implementation (issue it with the free space
> > locked), and iterate from there.
> >
> > > Should filesystems that don't otherwise support UNSHARE_RANGE need to
> > > support it in order to support an unshare request to COW'd blocks on
> > > an underlying block device?
> >
> > Hmm.  Currently, fallocate'ing part of a file that's already mapped to
> > shared blocks is a nop.  That's technically an omission in the
> > implementation, since a subsequent write can fail during COW setup due
> > to insufficient space.  My memory about funshare is a bit murky since
> > it's been years now.
> >
> > As I remember it, originally, I had allocate mode also calling unshare,
> > but Dave or someone pointed out that unsharing generates a flood of
> > dirty pagecache, and it would be a bit surprising that fallocate
> > suddenly takes a long time to run.  There also wasn't much precedent for
> > fallocate to unshare blocks, since btrfs doesn't do that:
> >
> > # filefrag -v /mnt/[ab]
> > Filesystem type is: 9123683e
> > File size of /mnt/a is 1048576 (256 blocks of 4096 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> > /mnt/a: 1 extent found
> > File size of /mnt/b is 1048576 (256 blocks of 4096 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> > /mnt/b: 1 extent found
> >
> > # xfs_io -c 'falloc 512k 36k' /mnt/b
> >
> > # filefrag -v /mnt/[ab]
> > Filesystem type is: 9123683e
> > File size of /mnt/a is 1048576 (256 blocks of 4096 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> > /mnt/a: 1 extent found
> > File size of /mnt/b is 1048576 (256 blocks of 4096 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..     255:       3328..      3583:    256:             last,shared,eof
> > /mnt/b: 1 extent found
> >
> > I took funshare out of the patchset entirely (minimum viable product,
> > yadda yadda) and a few months later, I think hch or someone asked for a
> > knob for userspace to get a file back to pure overwrite mode.  That's
> > where it's been ever since.
> >
> > So to answer your question: fallocate mode 0 and REQ_OP_PROVISION
> > probably ought to be allocating the holes and unsharing existing shared
> > mappings.  However, we could also wriggle out of that by <cough>
> > claiming that fallocate has been consistent across filesystems in
> > leaving that wart for userspace to trip over. :/
> >
>
> Thanks. That seems reasonable to me, but again isn't what the patches
> appear to implement. ;P
>
> I guess from the standpoint of an I/O command, it probably makes more
> sense to unshare by default. Why else would one send the command
> otherwise? The falloc api is what it is at this point, so the bdev folks
> could always decide if/how to implement a non-unsharing variant if there
> happens to be some reason to do that.
>
> Brian
>
> > > I wonder if the smart thing to do here is separate out the question of a
> > > new fallocate interface from the mechanism entirely. For example,
> > > implement REQ_OP_PROVISION as you've already done, enable block layer
> > > mode = 0 fallocate support (i.e. without FL_PROVISION, so whether a
> > > request propagates from a loop device will be up to the backing fs),
> > > implement the various fs features to support REQ_OP_PROVISION (i.e.,
> > > mount option, file attr, etc.), then tack on FL_FALLOC + ext4 support at
> > > the end as an RFC/prototype.
> >
> > Yeah.
> >
> > > Even if we ultimately ended up with FL_PROVISION support, it might
> > > actually make some sense to kick that can down the road a bit regardless
> > > to give fs' a chance to implement basic REQ_OP_PROVISION support, get a
> > > better understanding of how it works in practice, and then perhaps make
> > > more informed decisions on things like sane defaults and/or how best to
> > > expose it via fallocate. Thoughts?
> >
> > Agree. :)
> >
> > --D
> >
> > >
> > > Brian
> > >
> > > > > If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
> > > > > call into mode 0 fallocate, then the new functionality can be added (or
> > > > > even backported) to existing kernels and customers can use it
> > > > > immediately.  If you *do*, then you get to wait a few years for
> > > > > developers to add it to their codebases only after enough enterprise
> > > > > distros pick up a new kernel to make it worth their while.
> > > > >
> > > > > > for thinly provisioned filesystems/
> > > > > > block devices. For devices that do not support REQ_OP_PROVISION, both these
> > > > > > allocation modes will be equivalent. Given the performance cost of sending provision
> > > > > > requests to the underlying layers, keeping the default mode as-is allows users to
> > > > > > preserve existing behavior.
> > > > >
> > > > > How expensive is this expected to be?  Is this why you wanted a separate
> > > > > mode flag?
> > > > >
> > > > Yes, the exact latency will depend on the stacked block devices and
> > > > the fragmentation at the allocation layers.
> > > >
> > > > I did a quick test for benchmarking fallocate() with an:
> > > > A) ext4 filesystem mounted with 'noprovision'
> > > > B) ext4 filesystem mounted with 'provision' on a dm-thin device.
> > > > C) ext4 filesystem mounted with 'provision' on a loop device with a
> > > > sparse backing file on the filesystem in (B).
> > > >
> > > > I tested file sizes from 512M to 8G, time taken for fallocate() in (A)
> > > > remains expectedly flat at ~0.01-0.02s, but for (B), it scales from
> > > > 0.03-0.4s and for (C) it scales from 0.04s-0.52s (I captured the exact
> > > > time distribution in the cover letter
> > > > https://marc.info/?l=linux-ext4&m=167230113520636&w=2)
> > > >
> > > > +0.5s for an 8G fallocate doesn't sound like a lot but I think fragmentation
> > > > and how the block device is layered can make this worse...
> > > >
> > > > > --D
> > > > >
> > > > > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > > > > ---
> > > > > >  block/fops.c                | 15 +++++++++++----
> > > > > >  include/linux/falloc.h      |  3 ++-
> > > > > >  include/uapi/linux/falloc.h |  8 ++++++++
> > > > > >  3 files changed, 21 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/block/fops.c b/block/fops.c
> > > > > > index 50d245e8c913..01bde561e1e2 100644
> > > > > > --- a/block/fops.c
> > > > > > +++ b/block/fops.c
> > > > > > @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > > > > >
> > > > > >  #define      BLKDEV_FALLOC_FL_SUPPORTED                                      \
> > > > > >               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
> > > > > > -              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> > > > > > +              FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE |       \
> > > > > > +              FALLOC_FL_PROVISION)
> > > > > >
> > > > > >  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > > >                            loff_t len)
> > > > > > @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > > >       filemap_invalidate_lock(inode->i_mapping);
> > > > > >
> > > > > >       /* Invalidate the page cache, including dirty pages. */
> > > > > > -     error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > > > -     if (error)
> > > > > > -             goto fail;
> > > > > > +     if (mode != FALLOC_FL_PROVISION) {
> > > > > > +             error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > > > > > +             if (error)
> > > > > > +                     goto fail;
> > > > > > +     }
> > > > > >
> > > > > >       switch (mode) {
> > > > > >       case FALLOC_FL_ZERO_RANGE:
> > > > > > @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > > >               error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > > > > >                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > > >               break;
> > > > > > +     case FALLOC_FL_PROVISION:
> > > > > > +             error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > > > > +                                            len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > > > +             break;
> > > > > >       default:
> > > > > >               error = -EOPNOTSUPP;
> > > > > >       }
> > > > > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > > > > index f3f0b97b1675..b9a40a61a59b 100644
> > > > > > --- a/include/linux/falloc.h
> > > > > > +++ b/include/linux/falloc.h
> > > > > > @@ -30,7 +30,8 @@ struct space_resv {
> > > > > >                                        FALLOC_FL_COLLAPSE_RANGE |     \
> > > > > >                                        FALLOC_FL_ZERO_RANGE |         \
> > > > > >                                        FALLOC_FL_INSERT_RANGE |       \
> > > > > > -                                      FALLOC_FL_UNSHARE_RANGE)
> > > > > > +                                      FALLOC_FL_UNSHARE_RANGE |      \
> > > > > > +                                      FALLOC_FL_PROVISION)
> > > > > >
> > > > > >  /* on ia32 l_start is on a 32-bit boundary */
> > > > > >  #if defined(CONFIG_X86_64)
> > > > > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > > > > index 51398fa57f6c..2d323d113eed 100644
> > > > > > --- a/include/uapi/linux/falloc.h
> > > > > > +++ b/include/uapi/linux/falloc.h
> > > > > > @@ -77,4 +77,12 @@
> > > > > >   */
> > > > > >  #define FALLOC_FL_UNSHARE_RANGE              0x40
> > > > > >
> > > > > > +/*
> > > > > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > > > > + * blocks for the range/EOF.
> > > > > > + *
> > > > > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > > > > + */
> > > > > > +#define FALLOC_FL_PROVISION          0x80
> > > > > > +
> > > > > >  #endif /* _UAPI_FALLOC_H_ */
> > > > > > --
> > > > > > 2.37.3
> > > > > >
> > > >
> > >
> >
>

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 2/7] dm: Add support for block provisioning
  2023-01-05 14:43     ` Brian Foster
@ 2023-03-31  0:30       ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2023-03-31  0:30 UTC (permalink / raw)
  To: Brian Foster
  Cc: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel, Jens Axboe, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Alasdair Kergon, Mike Snitzer,
	Christoph Hellwig, Theodore Ts'o, Andreas Dilger,
	Bart Van Assche, Daniil Lunev, Darrick J. Wong

On Thu, Jan 5, 2023 at 6:42 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Dec 29, 2022 at 12:12:47AM -0800, Sarthak Kukreti wrote:
> > Add support to dm devices for REQ_OP_PROVISION. The default mode
> > is to pass through the request and dm-thin will utilize it to provision
> > blocks.
> >
> > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > ---
> >  drivers/md/dm-crypt.c         |  4 +-
> >  drivers/md/dm-linear.c        |  1 +
> >  drivers/md/dm-snap.c          |  7 +++
> >  drivers/md/dm-table.c         | 25 ++++++++++
> >  drivers/md/dm-thin.c          | 90 ++++++++++++++++++++++++++++++++++-
> >  drivers/md/dm.c               |  4 ++
> >  include/linux/device-mapper.h | 11 +++++
> >  7 files changed, 139 insertions(+), 3 deletions(-)
> >
> ...
> > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > index 64cfcf46881d..ab3f1abfabaf 100644
> > --- a/drivers/md/dm-thin.c
> > +++ b/drivers/md/dm-thin.c
> ...
> > @@ -1980,6 +1992,70 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> >       }
> >  }
> >
> > +static void process_provision_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> > +{
> > +     int r;
> > +     struct pool *pool = tc->pool;
> > +     struct bio *bio = cell->holder;
> > +     dm_block_t begin, end;
> > +     struct dm_thin_lookup_result lookup_result;
> > +
> > +     if (tc->requeue_mode) {
> > +             cell_requeue(pool, cell);
> > +             return;
> > +     }
> > +
> > +     get_bio_block_range(tc, bio, &begin, &end);
> > +
> > +     while (begin != end) {
> > +             r = ensure_next_mapping(pool);
> > +             if (r)
> > +                     /* we did our best */
> > +                     return;
> > +
> > +             r = dm_thin_find_block(tc->td, begin, 1, &lookup_result);
>
> Hi Sarthak,
>
> I think we discussed this before.. but remind me if/how we wanted to
> handle the case if the thin blocks are shared..? Would a provision op
> carry enough information to distinguish an FALLOC_FL_UNSHARE_RANGE
> request from upper layers to conditionally provision in that case?
>
I think that should depend on how the filesystem implements unsharing:
assuming that we use provision on first allocation, unsharing on xfs
should result in xfs calling REQ_OP_PROVISION on the newly allocated
blocks first. But for ext4, we'd fail UNSHARE_RANGE unless the
filesystem is mounted with 'provision' (rather than 'noprovision' or
'provision_on_alloc'), in which case we'd send REQ_OP_PROVISION.
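
To make the shared-block question concrete, here is a toy userspace
model (not dm-thin code; the block states, the unshare flag and the
helper name are all assumptions for illustration). Unmapped blocks
always get a mapping, while shared blocks only get a private copy when
the caller asks for unsharing:

```c
#include <stddef.h>

/* Toy block states; real dm-thin tracks mappings in pool metadata. */
enum blk_state { BLK_UNMAPPED, BLK_MAPPED, BLK_SHARED };

/*
 * Walk [begin, end) and provision blocks. Unmapped blocks always
 * consume a free pool block. Shared blocks are only broken out into a
 * private mapping when unshare is non-zero.
 * Returns the number of blocks that consumed new space, or -1 if the
 * pool ran out of free blocks part-way through.
 */
static long provision_range(enum blk_state *blocks, size_t begin, size_t end,
			    size_t *pool_free, int unshare)
{
	long allocated = 0;

	for (size_t i = begin; i < end; i++) {
		int needs_space = blocks[i] == BLK_UNMAPPED ||
				  (unshare && blocks[i] == BLK_SHARED);

		if (!needs_space)
			continue;
		if (*pool_free == 0)
			return -1;
		(*pool_free)--;
		blocks[i] = BLK_MAPPED;
		allocated++;
	}
	return allocated;
}
```

In this model the v2 behavior corresponds to unshare == 0: a block that
is mapped but shared is skipped, so a later write to it can still need
new space in the pool.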

Best
Sarthak


> Brian
>
> > +             switch (r) {
> > +             case 0:
> > +                     begin++;
> > +                     break;
> > +             case -ENODATA:
> > +                     bio_inc_remaining(bio);
> > +                     provision_block(tc, bio, begin, cell);
> > +                     begin++;
> > +                     break;
> > +             default:
> > +                     DMERR_LIMIT(
> > +                             "%s: dm_thin_find_block() failed: error = %d",
> > +                             __func__, r);
> > +                     cell_defer_no_holder(tc, cell);
> > +                     bio_io_error(bio);
> > +                     begin++;
> > +                     break;
> > +             }
> > +     }
> > +     bio_endio(bio);
> > +     cell_defer_no_holder(tc, cell);
> > +}
> > +
> > +static void process_provision_bio(struct thin_c *tc, struct bio *bio)
> > +{
> > +     dm_block_t begin, end;
> > +     struct dm_cell_key virt_key;
> > +     struct dm_bio_prison_cell *virt_cell;
> > +
> > +     get_bio_block_range(tc, bio, &begin, &end);
> > +     if (begin == end) {
> > +             bio_endio(bio);
> > +             return;
> > +     }
> > +
> > +     build_key(tc->td, VIRTUAL, begin, end, &virt_key);
> > +     if (bio_detain(tc->pool, &virt_key, bio, &virt_cell))
> > +             return;
> > +
> > +     process_provision_cell(tc, virt_cell);
> > +}
> > +
> >  static void process_bio(struct thin_c *tc, struct bio *bio)
> >  {
> >       struct pool *pool = tc->pool;
> > @@ -2200,6 +2276,8 @@ static void process_thin_deferred_bios(struct thin_c *tc)
> >
> >               if (bio_op(bio) == REQ_OP_DISCARD)
> >                       pool->process_discard(tc, bio);
> > +             else if (bio_op(bio) == REQ_OP_PROVISION)
> > +                     process_provision_bio(tc, bio);
> >               else
> >                       pool->process_bio(tc, bio);
> >
> > @@ -2716,7 +2794,8 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
> >               return DM_MAPIO_SUBMITTED;
> >       }
> >
> > -     if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
> > +     if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
> > +         bio_op(bio) == REQ_OP_PROVISION) {
> >               thin_defer_bio_with_throttle(tc, bio);
> >               return DM_MAPIO_SUBMITTED;
> >       }
> > @@ -3355,6 +3434,8 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
> >       pt->low_water_blocks = low_water_blocks;
> >       pt->adjusted_pf = pt->requested_pf = pf;
> >       ti->num_flush_bios = 1;
> > +     ti->num_provision_bios = 1;
> > +     ti->provision_supported = true;
> >
> >       /*
> >        * Only need to enable discards if the pool should pass
> > @@ -4053,6 +4134,7 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
> >               blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
> >       }
> >
> > +
> >       /*
> >        * pt->adjusted_pf is a staging area for the actual features to use.
> >        * They get transferred to the live pool in bind_control_target()
> > @@ -4243,6 +4325,9 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
> >               ti->num_discard_bios = 1;
> >       }
> >
> > +     ti->num_provision_bios = 1;
> > +     ti->provision_supported = true;
> > +
> >       mutex_unlock(&dm_thin_pool_table.mutex);
> >
> >       spin_lock_irq(&tc->pool->lock);
> > @@ -4457,6 +4542,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> >
> >       limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> >       limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> > +     limits->max_provision_sectors = 2048 * 1024 * 16; /* 16G */
> >  }
> >
> >  static struct target_type thin_target = {
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index e1ea3a7bd9d9..4d19bae9da4a 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1587,6 +1587,7 @@ static bool is_abnormal_io(struct bio *bio)
> >               case REQ_OP_DISCARD:
> >               case REQ_OP_SECURE_ERASE:
> >               case REQ_OP_WRITE_ZEROES:
> > +             case REQ_OP_PROVISION:
> >                       return true;
> >               default:
> >                       break;
> > @@ -1611,6 +1612,9 @@ static blk_status_t __process_abnormal_io(struct clone_info *ci,
> >       case REQ_OP_WRITE_ZEROES:
> >               num_bios = ti->num_write_zeroes_bios;
> >               break;
> > +     case REQ_OP_PROVISION:
> > +             num_bios = ti->num_provision_bios;
> > +             break;
> >       default:
> >               break;
> >       }
> > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> > index 04c6acf7faaa..b4d97d5d75b8 100644
> > --- a/include/linux/device-mapper.h
> > +++ b/include/linux/device-mapper.h
> > @@ -333,6 +333,12 @@ struct dm_target {
> >        */
> >       unsigned num_write_zeroes_bios;
> >
> > +     /*
> > +      * The number of PROVISION bios that will be submitted to the target.
> > +      * The bio number can be accessed with dm_bio_get_target_bio_nr.
> > +      */
> > +     unsigned num_provision_bios;
> > +
> >       /*
> >        * The minimum number of extra bytes allocated in each io for the
> >        * target to use.
> > @@ -357,6 +363,11 @@ struct dm_target {
> >        */
> >       bool discards_supported:1;
> >
> > +     /* Set if this target needs to receive provision requests regardless of
> > +      * whether or not its underlying devices have support.
> > +      */
> > +     bool provision_supported:1;
> > +
> >       /*
> >        * Set if we need to limit the number of in-flight bios when swapping.
> >        */
> > --
> > 2.37.3
> >
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dm-devel] [PATCH v2 2/7] dm: Add support for block provisioning
@ 2023-03-31  0:30       ` Sarthak Kukreti
  0 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2023-03-31  0:30 UTC (permalink / raw)
  To: Brian Foster
  Cc: Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Michael S. Tsirkin, sarthakkukreti, Darrick J. Wong, Jason Wang,
	Bart Van Assche, Mike Snitzer, linux-kernel, linux-block,
	dm-devel, Andreas Dilger, Daniil Lunev, Stefan Hajnoczi,
	linux-fsdevel, linux-ext4, Alasdair Kergon

On Thu, Jan 5, 2023 at 6:42 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Dec 29, 2022 at 12:12:47AM -0800, Sarthak Kukreti wrote:
> > Add support to dm devices for REQ_OP_PROVISION. The default mode
> > is to pass through the request and dm-thin will utilize it to provision
> > blocks.
> >
> > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > ---
> >  drivers/md/dm-crypt.c         |  4 +-
> >  drivers/md/dm-linear.c        |  1 +
> >  drivers/md/dm-snap.c          |  7 +++
> >  drivers/md/dm-table.c         | 25 ++++++++++
> >  drivers/md/dm-thin.c          | 90 ++++++++++++++++++++++++++++++++++-
> >  drivers/md/dm.c               |  4 ++
> >  include/linux/device-mapper.h | 11 +++++
> >  7 files changed, 139 insertions(+), 3 deletions(-)
> >
> ...
> > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > index 64cfcf46881d..ab3f1abfabaf 100644
> > --- a/drivers/md/dm-thin.c
> > +++ b/drivers/md/dm-thin.c
> ...
> > @@ -1980,6 +1992,70 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> >       }
> >  }
> >
> > +static void process_provision_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> > +{
> > +     int r;
> > +     struct pool *pool = tc->pool;
> > +     struct bio *bio = cell->holder;
> > +     dm_block_t begin, end;
> > +     struct dm_thin_lookup_result lookup_result;
> > +
> > +     if (tc->requeue_mode) {
> > +             cell_requeue(pool, cell);
> > +             return;
> > +     }
> > +
> > +     get_bio_block_range(tc, bio, &begin, &end);
> > +
> > +     while (begin != end) {
> > +             r = ensure_next_mapping(pool);
> > +             if (r)
> > +                     /* we did our best */
> > +                     return;
> > +
> > +             r = dm_thin_find_block(tc->td, begin, 1, &lookup_result);
>
> Hi Sarthak,
>
> I think we discussed this before.. but remind me if/how we wanted to
> handle the case if the thin blocks are shared..? Would a provision op
> carry enough information to distinguish an FALLOC_FL_UNSHARE_RANGE
> request from upper layers to conditionally provision in that case?
>
I think that should depend on how the filesystem implements unsharing:
assuming that we use provision on first allocation, unsharing on xfs
should result in xfs calling REQ_OP_PROVISION on the newly allocated
blocks first. But for ext4, we'd fail UNSHARE_RANGE unless provision
(instead of noprovision, provision_on_alloc), in which case, we'd send
REQ_OP_PROVISION.

Best
Sarthak


Sarthak

> Brian
>
> > +             switch (r) {
> > +             case 0:
> > +                     begin++;
> > +                     break;
> > +             case -ENODATA:
> > +                     bio_inc_remaining(bio);
> > +                     provision_block(tc, bio, begin, cell);
> > +                     begin++;
> > +                     break;
> > +             default:
> > +                     DMERR_LIMIT(
> > +                             "%s: dm_thin_find_block() failed: error = %d",
> > +                             __func__, r);
> > +                     cell_defer_no_holder(tc, cell);
> > +                     bio_io_error(bio);
> > +                     begin++;
> > +                     break;
> > +             }
> > +     }
> > +     bio_endio(bio);
> > +     cell_defer_no_holder(tc, cell);
> > +}
> > +
> > +static void process_provision_bio(struct thin_c *tc, struct bio *bio)
> > +{
> > +     dm_block_t begin, end;
> > +     struct dm_cell_key virt_key;
> > +     struct dm_bio_prison_cell *virt_cell;
> > +
> > +     get_bio_block_range(tc, bio, &begin, &end);
> > +     if (begin == end) {
> > +             bio_endio(bio);
> > +             return;
> > +     }
> > +
> > +     build_key(tc->td, VIRTUAL, begin, end, &virt_key);
> > +     if (bio_detain(tc->pool, &virt_key, bio, &virt_cell))
> > +             return;
> > +
> > +     process_provision_cell(tc, virt_cell);
> > +}
> > +
> >  static void process_bio(struct thin_c *tc, struct bio *bio)
> >  {
> >       struct pool *pool = tc->pool;
> > @@ -2200,6 +2276,8 @@ static void process_thin_deferred_bios(struct thin_c *tc)
> >
> >               if (bio_op(bio) == REQ_OP_DISCARD)
> >                       pool->process_discard(tc, bio);
> > +             else if (bio_op(bio) == REQ_OP_PROVISION)
> > +                     process_provision_bio(tc, bio);
> >               else
> >                       pool->process_bio(tc, bio);
> >
> > @@ -2716,7 +2794,8 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
> >               return DM_MAPIO_SUBMITTED;
> >       }
> >
> > -     if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
> > +     if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
> > +         bio_op(bio) == REQ_OP_PROVISION) {
> >               thin_defer_bio_with_throttle(tc, bio);
> >               return DM_MAPIO_SUBMITTED;
> >       }
> > @@ -3355,6 +3434,8 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
> >       pt->low_water_blocks = low_water_blocks;
> >       pt->adjusted_pf = pt->requested_pf = pf;
> >       ti->num_flush_bios = 1;
> > +     ti->num_provision_bios = 1;
> > +     ti->provision_supported = true;
> >
> >       /*
> >        * Only need to enable discards if the pool should pass
> > @@ -4053,6 +4134,7 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
> >               blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
> >       }
> >
> > +
> >       /*
> >        * pt->adjusted_pf is a staging area for the actual features to use.
> >        * They get transferred to the live pool in bind_control_target()
> > @@ -4243,6 +4325,9 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
> >               ti->num_discard_bios = 1;
> >       }
> >
> > +     ti->num_provision_bios = 1;
> > +     ti->provision_supported = true;
> > +
> >       mutex_unlock(&dm_thin_pool_table.mutex);
> >
> >       spin_lock_irq(&tc->pool->lock);
> > @@ -4457,6 +4542,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> >
> >       limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> >       limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> > +     limits->max_provision_sectors = 2048 * 1024 * 16; /* 16G */
> >  }
> >
> >  static struct target_type thin_target = {
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index e1ea3a7bd9d9..4d19bae9da4a 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1587,6 +1587,7 @@ static bool is_abnormal_io(struct bio *bio)
> >               case REQ_OP_DISCARD:
> >               case REQ_OP_SECURE_ERASE:
> >               case REQ_OP_WRITE_ZEROES:
> > +             case REQ_OP_PROVISION:
> >                       return true;
> >               default:
> >                       break;
> > @@ -1611,6 +1612,9 @@ static blk_status_t __process_abnormal_io(struct clone_info *ci,
> >       case REQ_OP_WRITE_ZEROES:
> >               num_bios = ti->num_write_zeroes_bios;
> >               break;
> > +     case REQ_OP_PROVISION:
> > +             num_bios = ti->num_provision_bios;
> > +             break;
> >       default:
> >               break;
> >       }
> > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> > index 04c6acf7faaa..b4d97d5d75b8 100644
> > --- a/include/linux/device-mapper.h
> > +++ b/include/linux/device-mapper.h
> > @@ -333,6 +333,12 @@ struct dm_target {
> >        */
> >       unsigned num_write_zeroes_bios;
> >
> > +     /*
> > +      * The number of PROVISION bios that will be submitted to the target.
> > +      * The bio number can be accessed with dm_bio_get_target_bio_nr.
> > +      */
> > +     unsigned num_provision_bios;
> > +
> >       /*
> >        * The minimum number of extra bytes allocated in each io for the
> >        * target to use.
> > @@ -357,6 +363,11 @@ struct dm_target {
> >        */
> >       bool discards_supported:1;
> >
> > +     /* Set if this target needs to receive provision requests regardless of
> > +      * whether or not its underlying devices have support.
> > +      */
> > +     bool provision_supported:1;
> > +
> >       /*
> >        * Set if we need to limit the number of in-flight bios when swapping.
> >        */
> > --
> > 2.37.3
> >
>


* Re: [PATCH v2 2/7] dm: Add support for block provisioning
  2023-03-31  0:30       ` [dm-devel] " Sarthak Kukreti
@ 2023-03-31 12:28         ` Brian Foster
  -1 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2023-03-31 12:28 UTC (permalink / raw)
  To: Sarthak Kukreti
  Cc: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel, Jens Axboe, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Alasdair Kergon, Mike Snitzer,
	Christoph Hellwig, Theodore Ts'o, Andreas Dilger,
	Bart Van Assche, Daniil Lunev, Darrick J. Wong

On Thu, Mar 30, 2023 at 05:30:22PM -0700, Sarthak Kukreti wrote:
> On Thu, Jan 5, 2023 at 6:42 AM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Thu, Dec 29, 2022 at 12:12:47AM -0800, Sarthak Kukreti wrote:
> > > Add support to dm devices for REQ_OP_PROVISION. The default mode
> > > is to pass through the request and dm-thin will utilize it to provision
> > > blocks.
> > >
> > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > ---
> > >  drivers/md/dm-crypt.c         |  4 +-
> > >  drivers/md/dm-linear.c        |  1 +
> > >  drivers/md/dm-snap.c          |  7 +++
> > >  drivers/md/dm-table.c         | 25 ++++++++++
> > >  drivers/md/dm-thin.c          | 90 ++++++++++++++++++++++++++++++++++-
> > >  drivers/md/dm.c               |  4 ++
> > >  include/linux/device-mapper.h | 11 +++++
> > >  7 files changed, 139 insertions(+), 3 deletions(-)
> > >
> > ...
> > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > > index 64cfcf46881d..ab3f1abfabaf 100644
> > > --- a/drivers/md/dm-thin.c
> > > +++ b/drivers/md/dm-thin.c
> > ...
> > > @@ -1980,6 +1992,70 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> > >       }
> > >  }
> > >
> > > +static void process_provision_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> > > +{
> > > +     int r;
> > > +     struct pool *pool = tc->pool;
> > > +     struct bio *bio = cell->holder;
> > > +     dm_block_t begin, end;
> > > +     struct dm_thin_lookup_result lookup_result;
> > > +
> > > +     if (tc->requeue_mode) {
> > > +             cell_requeue(pool, cell);
> > > +             return;
> > > +     }
> > > +
> > > +     get_bio_block_range(tc, bio, &begin, &end);
> > > +
> > > +     while (begin != end) {
> > > +             r = ensure_next_mapping(pool);
> > > +             if (r)
> > > +                     /* we did our best */
> > > +                     return;
> > > +
> > > +             r = dm_thin_find_block(tc->td, begin, 1, &lookup_result);
> >
> > Hi Sarthak,
> >
> > I think we discussed this before.. but remind me if/how we wanted to
> > handle the case if the thin blocks are shared..? Would a provision op
> > carry enough information to distinguish an FALLOC_FL_UNSHARE_RANGE
> > request from upper layers to conditionally provision in that case?
> >
> I think that should depend on how the filesystem implements unsharing:
> assuming that we use provision on first allocation, unsharing on xfs
> should result in xfs calling REQ_OP_PROVISION on the newly allocated
> blocks first. But for ext4, we'd fail UNSHARE_RANGE unless provision
> (instead of noprovision, provision_on_alloc), in which case, we'd send
> REQ_OP_PROVISION.
> 

I think my question was unclear. It doesn't have much to do with the
filesystem or its provision policy. Since dm-thin can share blocks
internally via snapshots, do you intend to support
FALLOC_FL_UNSHARE_RANGE via blkdev_fallocate() and REQ_OP_PROVISION?

If so, then presumably this wants an UNSHARE request flag to pair with
REQ_OP_PROVISION. Also, the dm-thin code above needs to check whether an
existing block it finds is shared and basically do whatever COW breaking
is necessary during the PROVISION request.

If not, why not? And what is the expected behavior if blkdev_fallocate() is
called with FL_UNSHARE_RANGE?
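
To make the "if so" case concrete, here is a rough userspace model of the
per-block decision. It is only a sketch under assumptions, not dm-thin code:
decide_provision(), the enum names and the unshare parameter (standing in
for a hypothetical UNSHARE request flag) are invented for illustration. The
only parts taken from the patch are that the lookup returns 0 for a mapped
block and -ENODATA for an unmapped one, and that a mapped block may turn out
to be shared.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcomes for one thin block covered by a provision request. */
enum provision_action {
	PROVISION_SKIP,      /* already mapped, nothing to do */
	PROVISION_ALLOCATE,  /* unmapped: allocate a new block */
	PROVISION_BREAK_COW, /* mapped but shared: copy to a private block */
	PROVISION_FAIL,      /* lookup failed: error the request */
};

/*
 * lookup_rc models the return value of the mapping lookup (0 if mapped,
 * -ENODATA if unmapped, another negative errno on failure). shared models
 * the "shared" bit of the lookup result and only matters when lookup_rc
 * is 0. unshare models a hypothetical UNSHARE request flag.
 */
static enum provision_action decide_provision(int lookup_rc, bool shared,
					      bool unshare)
{
	assert(lookup_rc <= 0);

	if (lookup_rc == -ENODATA)
		return PROVISION_ALLOCATE;
	if (lookup_rc != 0)
		return PROVISION_FAIL;
	if (shared && unshare)
		return PROVISION_BREAK_COW;
	return PROVISION_SKIP;
}
```

In this model a shared block is left alone unless the caller asked for
unsharing, which is the distinction a plain REQ_OP_PROVISION cannot express
on its own.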

Brian 

> Best
> Sarthak
> 
> 
> Sarthak
> 
> > Brian
> >
> > > +             switch (r) {
> > > +             case 0:
> > > +                     begin++;
> > > +                     break;
> > > +             case -ENODATA:
> > > +                     bio_inc_remaining(bio);
> > > +                     provision_block(tc, bio, begin, cell);
> > > +                     begin++;
> > > +                     break;
> > > +             default:
> > > +                     DMERR_LIMIT(
> > > +                             "%s: dm_thin_find_block() failed: error = %d",
> > > +                             __func__, r);
> > > +                     cell_defer_no_holder(tc, cell);
> > > +                     bio_io_error(bio);
> > > +                     begin++;
> > > +                     break;
> > > +             }
> > > +     }
> > > +     bio_endio(bio);
> > > +     cell_defer_no_holder(tc, cell);
> > > +}
> > > +
> > > +static void process_provision_bio(struct thin_c *tc, struct bio *bio)
> > > +{
> > > +     dm_block_t begin, end;
> > > +     struct dm_cell_key virt_key;
> > > +     struct dm_bio_prison_cell *virt_cell;
> > > +
> > > +     get_bio_block_range(tc, bio, &begin, &end);
> > > +     if (begin == end) {
> > > +             bio_endio(bio);
> > > +             return;
> > > +     }
> > > +
> > > +     build_key(tc->td, VIRTUAL, begin, end, &virt_key);
> > > +     if (bio_detain(tc->pool, &virt_key, bio, &virt_cell))
> > > +             return;
> > > +
> > > +     process_provision_cell(tc, virt_cell);
> > > +}
> > > +
> > >  static void process_bio(struct thin_c *tc, struct bio *bio)
> > >  {
> > >       struct pool *pool = tc->pool;
> > > @@ -2200,6 +2276,8 @@ static void process_thin_deferred_bios(struct thin_c *tc)
> > >
> > >               if (bio_op(bio) == REQ_OP_DISCARD)
> > >                       pool->process_discard(tc, bio);
> > > +             else if (bio_op(bio) == REQ_OP_PROVISION)
> > > +                     process_provision_bio(tc, bio);
> > >               else
> > >                       pool->process_bio(tc, bio);
> > >
> > > @@ -2716,7 +2794,8 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
> > >               return DM_MAPIO_SUBMITTED;
> > >       }
> > >
> > > -     if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
> > > +     if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
> > > +         bio_op(bio) == REQ_OP_PROVISION) {
> > >               thin_defer_bio_with_throttle(tc, bio);
> > >               return DM_MAPIO_SUBMITTED;
> > >       }
> > > @@ -3355,6 +3434,8 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
> > >       pt->low_water_blocks = low_water_blocks;
> > >       pt->adjusted_pf = pt->requested_pf = pf;
> > >       ti->num_flush_bios = 1;
> > > +     ti->num_provision_bios = 1;
> > > +     ti->provision_supported = true;
> > >
> > >       /*
> > >        * Only need to enable discards if the pool should pass
> > > @@ -4053,6 +4134,7 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > >               blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
> > >       }
> > >
> > > +
> > >       /*
> > >        * pt->adjusted_pf is a staging area for the actual features to use.
> > >        * They get transferred to the live pool in bind_control_target()
> > > @@ -4243,6 +4325,9 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
> > >               ti->num_discard_bios = 1;
> > >       }
> > >
> > > +     ti->num_provision_bios = 1;
> > > +     ti->provision_supported = true;
> > > +
> > >       mutex_unlock(&dm_thin_pool_table.mutex);
> > >
> > >       spin_lock_irq(&tc->pool->lock);
> > > @@ -4457,6 +4542,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > >
> > >       limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> > >       limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> > > +     limits->max_provision_sectors = 2048 * 1024 * 16; /* 16G */
> > >  }
> > >
> > >  static struct target_type thin_target = {
> > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > index e1ea3a7bd9d9..4d19bae9da4a 100644
> > > --- a/drivers/md/dm.c
> > > +++ b/drivers/md/dm.c
> > > @@ -1587,6 +1587,7 @@ static bool is_abnormal_io(struct bio *bio)
> > >               case REQ_OP_DISCARD:
> > >               case REQ_OP_SECURE_ERASE:
> > >               case REQ_OP_WRITE_ZEROES:
> > > +             case REQ_OP_PROVISION:
> > >                       return true;
> > >               default:
> > >                       break;
> > > @@ -1611,6 +1612,9 @@ static blk_status_t __process_abnormal_io(struct clone_info *ci,
> > >       case REQ_OP_WRITE_ZEROES:
> > >               num_bios = ti->num_write_zeroes_bios;
> > >               break;
> > > +     case REQ_OP_PROVISION:
> > > +             num_bios = ti->num_provision_bios;
> > > +             break;
> > >       default:
> > >               break;
> > >       }
> > > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> > > index 04c6acf7faaa..b4d97d5d75b8 100644
> > > --- a/include/linux/device-mapper.h
> > > +++ b/include/linux/device-mapper.h
> > > @@ -333,6 +333,12 @@ struct dm_target {
> > >        */
> > >       unsigned num_write_zeroes_bios;
> > >
> > > +     /*
> > > +      * The number of PROVISION bios that will be submitted to the target.
> > > +      * The bio number can be accessed with dm_bio_get_target_bio_nr.
> > > +      */
> > > +     unsigned num_provision_bios;
> > > +
> > >       /*
> > >        * The minimum number of extra bytes allocated in each io for the
> > >        * target to use.
> > > @@ -357,6 +363,11 @@ struct dm_target {
> > >        */
> > >       bool discards_supported:1;
> > >
> > > +     /* Set if this target needs to receive provision requests regardless of
> > > +      * whether or not its underlying devices have support.
> > > +      */
> > > +     bool provision_supported:1;
> > > +
> > >       /*
> > >        * Set if we need to limit the number of in-flight bios when swapping.
> > >        */
> > > --
> > > 2.37.3
> > >
> >
> 



* Re: [PATCH v2 2/7] dm: Add support for block provisioning
  2023-03-31 12:28         ` [dm-devel] " Brian Foster
@ 2023-04-03 22:57           ` Sarthak Kukreti
  -1 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2023-04-03 22:57 UTC (permalink / raw)
  To: Brian Foster
  Cc: sarthakkukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
	linux-fsdevel, Jens Axboe, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Alasdair Kergon, Mike Snitzer,
	Christoph Hellwig, Theodore Ts'o, Andreas Dilger,
	Bart Van Assche, Daniil Lunev, Darrick J. Wong

On Fri, Mar 31, 2023 at 5:26 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Mar 30, 2023 at 05:30:22PM -0700, Sarthak Kukreti wrote:
> > On Thu, Jan 5, 2023 at 6:42 AM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Thu, Dec 29, 2022 at 12:12:47AM -0800, Sarthak Kukreti wrote:
> > > > Add support to dm devices for REQ_OP_PROVISION. The default mode
> > > > is to pass through the request and dm-thin will utilize it to provision
> > > > blocks.
> > > >
> > > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > > ---
> > > >  drivers/md/dm-crypt.c         |  4 +-
> > > >  drivers/md/dm-linear.c        |  1 +
> > > >  drivers/md/dm-snap.c          |  7 +++
> > > >  drivers/md/dm-table.c         | 25 ++++++++++
> > > >  drivers/md/dm-thin.c          | 90 ++++++++++++++++++++++++++++++++++-
> > > >  drivers/md/dm.c               |  4 ++
> > > >  include/linux/device-mapper.h | 11 +++++
> > > >  7 files changed, 139 insertions(+), 3 deletions(-)
> > > >
> > > ...
> > > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > > > index 64cfcf46881d..ab3f1abfabaf 100644
> > > > --- a/drivers/md/dm-thin.c
> > > > +++ b/drivers/md/dm-thin.c
> > > ...
> > > > @@ -1980,6 +1992,70 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> > > >       }
> > > >  }
> > > >
> > > > +static void process_provision_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> > > > +{
> > > > +     int r;
> > > > +     struct pool *pool = tc->pool;
> > > > +     struct bio *bio = cell->holder;
> > > > +     dm_block_t begin, end;
> > > > +     struct dm_thin_lookup_result lookup_result;
> > > > +
> > > > +     if (tc->requeue_mode) {
> > > > +             cell_requeue(pool, cell);
> > > > +             return;
> > > > +     }
> > > > +
> > > > +     get_bio_block_range(tc, bio, &begin, &end);
> > > > +
> > > > +     while (begin != end) {
> > > > +             r = ensure_next_mapping(pool);
> > > > +             if (r)
> > > > +                     /* we did our best */
> > > > +                     return;
> > > > +
> > > > +             r = dm_thin_find_block(tc->td, begin, 1, &lookup_result);
> > >
> > > Hi Sarthak,
> > >
> > > I think we discussed this before.. but remind me if/how we wanted to
> > > handle the case if the thin blocks are shared..? Would a provision op
> > > carry enough information to distinguish an FALLOC_FL_UNSHARE_RANGE
> > > request from upper layers to conditionally provision in that case?
> > >
> > I think that should depend on how the filesystem implements unsharing:
> > assuming that we use provision on first allocation, unsharing on xfs
> > should result in xfs calling REQ_OP_PROVISION on the newly allocated
> > blocks first. But for ext4, we'd fail UNSHARE_RANGE unless provision
> > (instead of noprovision, provision_on_alloc), in which case, we'd send
> > REQ_OP_PROVISION.
> >
>
> I think my question was unclear... It doesn't necessarily have much to
> do with the filesystem or associated provision policy. Since dm-thin can
> share blocks internally via snapshots, do you intend to support
> FL_UNSHARE_RANGE via blkdev_fallocate() and REQ_OP_PROVISION?
>
> If so, then presumably this wants an UNSHARE request flag to pair with
> REQ_OP_PROVISION. Also, the dm-thin code above needs to check whether an
> existing block it finds is shared and basically do whatever COW breaking
> is necessary during the PROVISION request.
>
> If not, why? And what is expected behavior if blkdev_fallocate() is
> called with FL_UNSHARE_RANGE?
>
I think the handling of REQ_OP_PROVISION by each snapshot target is
kind-of implicit:

- snapshot-origin: do nothing
- snapshot: send REQ_OP_PROVISION to the COW device
- snapshot-merge: send REQ_OP_PROVISION to the origin.
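
The routing in the list above can be written down as a small table. This is
only an illustrative userspace sketch: the enum names and
provision_dest_for() are hypothetical and do not exist in dm-snap.

```c
#include <assert.h>

/* The three snapshot targets named in the list above. */
enum snap_target { SNAPSHOT_ORIGIN, SNAPSHOT, SNAPSHOT_MERGE };

/* Where a provision request ends up for a given target (model only). */
enum provision_dest { DEST_NONE, DEST_COW, DEST_ORIGIN };

static enum provision_dest provision_dest_for(enum snap_target t)
{
	switch (t) {
	case SNAPSHOT_ORIGIN:
		return DEST_NONE;	/* do nothing */
	case SNAPSHOT:
		return DEST_COW;	/* forward to the COW device */
	case SNAPSHOT_MERGE:
		return DEST_ORIGIN;	/* forward to the origin */
	}
	assert(0 && "unknown snapshot target");
	return DEST_NONE;
}
```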

From the thinpool's perspective, REQ_OP_PROVISION reuses the
provision_block() primitive to break sharing. (There's a bug in the
code below, as you pointed out: case 0 should also call
provision_block() if the lookup result shows that the block is
shared.)

So, I think the provision op would carry enough information to
conditionally provision and copy the block. Are there other cases
where UNSHARE_RANGE would be useful?
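
As a rough illustration of the intended behavior with that bug fixed, here
is a userspace model of the loop. It is a sketch, not the dm-thin
implementation: enum block_state, provision_range() and the free_blocks
counter are all hypothetical stand-ins for the pool metadata.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block state of a thin device in this model. */
enum block_state { BLOCK_UNMAPPED, BLOCK_MAPPED, BLOCK_SHARED };

/*
 * Provision blocks [begin, end) of a modeled thin device. Unmapped blocks
 * get a new block; shared blocks get a private copy (the "case 0" handling
 * that is missing below). free_blocks models the pool's remaining space.
 * Returns the number of pool blocks consumed and stops early if the pool
 * runs out.
 */
static size_t provision_range(enum block_state *dev, size_t begin, size_t end,
			      size_t *free_blocks)
{
	size_t used = 0;

	assert(begin <= end);

	for (size_t b = begin; b < end; b++) {
		if (dev[b] == BLOCK_MAPPED)
			continue;	/* exclusively owned: nothing to do */
		if (*free_blocks == 0)
			break;		/* pool exhausted */
		/* Unmapped or shared: consume a pool block and own it. */
		(*free_blocks)--;
		dev[b] = BLOCK_MAPPED;
		used++;
	}
	return used;
}
```

In this model both unmapped and shared blocks cost one pool block, which is
why no separate flag is needed if provisioning always breaks sharing.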

Best
Sarthak

> Brian
>
> > Best
> > Sarthak
> >
> >
> > Sarthak
> >
> > > Brian
> > >
> > > > +             switch (r) {
> > > > +             case 0:
> > > > +                     begin++;
> > > > +                     break;
> > > > +             case -ENODATA:
> > > > +                     bio_inc_remaining(bio);
> > > > +                     provision_block(tc, bio, begin, cell);
> > > > +                     begin++;
> > > > +                     break;
> > > > +             default:
> > > > +                     DMERR_LIMIT(
> > > > +                             "%s: dm_thin_find_block() failed: error = %d",
> > > > +                             __func__, r);
> > > > +                     cell_defer_no_holder(tc, cell);
> > > > +                     bio_io_error(bio);
> > > > +                     begin++;
> > > > +                     break;
> > > > +             }
> > > > +     }
> > > > +     bio_endio(bio);
> > > > +     cell_defer_no_holder(tc, cell);
> > > > +}
> > > > +
> > > > +static void process_provision_bio(struct thin_c *tc, struct bio *bio)
> > > > +{
> > > > +     dm_block_t begin, end;
> > > > +     struct dm_cell_key virt_key;
> > > > +     struct dm_bio_prison_cell *virt_cell;
> > > > +
> > > > +     get_bio_block_range(tc, bio, &begin, &end);
> > > > +     if (begin == end) {
> > > > +             bio_endio(bio);
> > > > +             return;
> > > > +     }
> > > > +
> > > > +     build_key(tc->td, VIRTUAL, begin, end, &virt_key);
> > > > +     if (bio_detain(tc->pool, &virt_key, bio, &virt_cell))
> > > > +             return;
> > > > +
> > > > +     process_provision_cell(tc, virt_cell);
> > > > +}
> > > > +
> > > >  static void process_bio(struct thin_c *tc, struct bio *bio)
> > > >  {
> > > >       struct pool *pool = tc->pool;
> > > > @@ -2200,6 +2276,8 @@ static void process_thin_deferred_bios(struct thin_c *tc)
> > > >
> > > >               if (bio_op(bio) == REQ_OP_DISCARD)
> > > >                       pool->process_discard(tc, bio);
> > > > +             else if (bio_op(bio) == REQ_OP_PROVISION)
> > > > +                     process_provision_bio(tc, bio);
> > > >               else
> > > >                       pool->process_bio(tc, bio);
> > > >
> > > > @@ -2716,7 +2794,8 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
> > > >               return DM_MAPIO_SUBMITTED;
> > > >       }
> > > >
> > > > -     if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
> > > > +     if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
> > > > +         bio_op(bio) == REQ_OP_PROVISION) {
> > > >               thin_defer_bio_with_throttle(tc, bio);
> > > >               return DM_MAPIO_SUBMITTED;
> > > >       }
> > > > @@ -3355,6 +3434,8 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
> > > >       pt->low_water_blocks = low_water_blocks;
> > > >       pt->adjusted_pf = pt->requested_pf = pf;
> > > >       ti->num_flush_bios = 1;
> > > > +     ti->num_provision_bios = 1;
> > > > +     ti->provision_supported = true;
> > > >
> > > >       /*
> > > >        * Only need to enable discards if the pool should pass
> > > > @@ -4053,6 +4134,7 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > > >               blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
> > > >       }
> > > >
> > > > +
> > > >       /*
> > > >        * pt->adjusted_pf is a staging area for the actual features to use.
> > > >        * They get transferred to the live pool in bind_control_target()
> > > > @@ -4243,6 +4325,9 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
> > > >               ti->num_discard_bios = 1;
> > > >       }
> > > >
> > > > +     ti->num_provision_bios = 1;
> > > > +     ti->provision_supported = true;
> > > > +
> > > >       mutex_unlock(&dm_thin_pool_table.mutex);
> > > >
> > > >       spin_lock_irq(&tc->pool->lock);
> > > > @@ -4457,6 +4542,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > > >
> > > >       limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> > > >       limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> > > > +     limits->max_provision_sectors = 2048 * 1024 * 16; /* 16G */
> > > >  }
> > > >
> > > >  static struct target_type thin_target = {
> > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > index e1ea3a7bd9d9..4d19bae9da4a 100644
> > > > --- a/drivers/md/dm.c
> > > > +++ b/drivers/md/dm.c
> > > > @@ -1587,6 +1587,7 @@ static bool is_abnormal_io(struct bio *bio)
> > > >               case REQ_OP_DISCARD:
> > > >               case REQ_OP_SECURE_ERASE:
> > > >               case REQ_OP_WRITE_ZEROES:
> > > > +             case REQ_OP_PROVISION:
> > > >                       return true;
> > > >               default:
> > > >                       break;
> > > > @@ -1611,6 +1612,9 @@ static blk_status_t __process_abnormal_io(struct clone_info *ci,
> > > >       case REQ_OP_WRITE_ZEROES:
> > > >               num_bios = ti->num_write_zeroes_bios;
> > > >               break;
> > > > +     case REQ_OP_PROVISION:
> > > > +             num_bios = ti->num_provision_bios;
> > > > +             break;
> > > >       default:
> > > >               break;
> > > >       }
> > > > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> > > > index 04c6acf7faaa..b4d97d5d75b8 100644
> > > > --- a/include/linux/device-mapper.h
> > > > +++ b/include/linux/device-mapper.h
> > > > @@ -333,6 +333,12 @@ struct dm_target {
> > > >        */
> > > >       unsigned num_write_zeroes_bios;
> > > >
> > > > +     /*
> > > > +      * The number of PROVISION bios that will be submitted to the target.
> > > > +      * The bio number can be accessed with dm_bio_get_target_bio_nr.
> > > > +      */
> > > > +     unsigned num_provision_bios;
> > > > +
> > > >       /*
> > > >        * The minimum number of extra bytes allocated in each io for the
> > > >        * target to use.
> > > > @@ -357,6 +363,11 @@ struct dm_target {
> > > >        */
> > > >       bool discards_supported:1;
> > > >
> > > > +     /* Set if this target needs to receive provision requests regardless of
> > > > +      * whether or not its underlying devices have support.
> > > > +      */
> > > > +     bool provision_supported:1;
> > > > +
> > > >       /*
> > > >        * Set if we need to limit the number of in-flight bios when swapping.
> > > >        */
> > > > --
> > > > 2.37.3
> > > >
> > >
> >
>


* Re: [dm-devel] [PATCH v2 2/7] dm: Add support for block provisioning
@ 2023-04-03 22:57           ` Sarthak Kukreti
  0 siblings, 0 replies; 46+ messages in thread
From: Sarthak Kukreti @ 2023-04-03 22:57 UTC (permalink / raw)
  To: Brian Foster
  Cc: Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Michael S. Tsirkin, sarthakkukreti, Darrick J. Wong, Jason Wang,
	Bart Van Assche, Mike Snitzer, linux-kernel, linux-block,
	dm-devel, Andreas Dilger, Daniil Lunev, Stefan Hajnoczi,
	linux-fsdevel, linux-ext4, Alasdair Kergon

On Fri, Mar 31, 2023 at 5:26 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Mar 30, 2023 at 05:30:22PM -0700, Sarthak Kukreti wrote:
> > On Thu, Jan 5, 2023 at 6:42 AM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Thu, Dec 29, 2022 at 12:12:47AM -0800, Sarthak Kukreti wrote:
> > > > Add support to dm devices for REQ_OP_PROVISION. The default mode
> > > > is to pass through the request and dm-thin will utilize it to provision
> > > > blocks.
> > > >
> > > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > > ---
> > > >  drivers/md/dm-crypt.c         |  4 +-
> > > >  drivers/md/dm-linear.c        |  1 +
> > > >  drivers/md/dm-snap.c          |  7 +++
> > > >  drivers/md/dm-table.c         | 25 ++++++++++
> > > >  drivers/md/dm-thin.c          | 90 ++++++++++++++++++++++++++++++++++-
> > > >  drivers/md/dm.c               |  4 ++
> > > >  include/linux/device-mapper.h | 11 +++++
> > > >  7 files changed, 139 insertions(+), 3 deletions(-)
> > > >
> > > ...
> > > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > > > index 64cfcf46881d..ab3f1abfabaf 100644
> > > > --- a/drivers/md/dm-thin.c
> > > > +++ b/drivers/md/dm-thin.c
> > > ...
> > > > @@ -1980,6 +1992,70 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> > > >       }
> > > >  }
> > > >
> > > > +static void process_provision_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
> > > > +{
> > > > +     int r;
> > > > +     struct pool *pool = tc->pool;
> > > > +     struct bio *bio = cell->holder;
> > > > +     dm_block_t begin, end;
> > > > +     struct dm_thin_lookup_result lookup_result;
> > > > +
> > > > +     if (tc->requeue_mode) {
> > > > +             cell_requeue(pool, cell);
> > > > +             return;
> > > > +     }
> > > > +
> > > > +     get_bio_block_range(tc, bio, &begin, &end);
> > > > +
> > > > +     while (begin != end) {
> > > > +             r = ensure_next_mapping(pool);
> > > > +             if (r)
> > > > +                     /* we did our best */
> > > > +                     return;
> > > > +
> > > > +             r = dm_thin_find_block(tc->td, begin, 1, &lookup_result);
> > >
> > > Hi Sarthak,
> > >
> > > I think we discussed this before.. but remind me if/how we wanted to
> > > handle the case if the thin blocks are shared..? Would a provision op
> > > carry enough information to distinguish an FALLOC_FL_UNSHARE_RANGE
> > > request from upper layers to conditionally provision in that case?
> > >
> > I think that should depend on how the filesystem implements unsharing:
> > assuming that we use provision on first allocation, unsharing on xfs
> > should result in xfs calling REQ_OP_PROVISION on the newly allocated
> > blocks first. But for ext4, we'd fail UNSHARE_RANGE unless provision
> > (instead of noprovision, provision_on_alloc), in which case, we'd send
> > REQ_OP_PROVISION.
> >
>
> I think my question was unclear... It doesn't necessarily have much to
> do with the filesystem or associated provision policy. Since dm-thin can
> share blocks internally via snapshots, do you intend to support
> FL_UNSHARE_RANGE via blkdev_fallocate() and REQ_OP_PROVISION?
>
> If so, then presumably this wants an UNSHARE request flag to pair with
> REQ_OP_PROVISION. Also, the dm-thin code above needs to check whether an
> existing block it finds is shared and basically do whatever COW breaking
> is necessary during the PROVISION request.
>
> If not, why? And what is expected behavior if blkdev_fallocate() is
> called with FL_UNSHARE_RANGE?
>
I think the handling of REQ_OP_PROVISION by each snapshot target is
mostly implicit:

- snapshot-origin: do nothing
- snapshot: send REQ_OP_PROVISION to the COW device
- snapshot-merge: send REQ_OP_PROVISION to the origin
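
As a rough illustration of the routing listed above, here is a small
user-space sketch. It is not the dm-snap code: the enum names and
provision_route() are made up for this example, and it only records
where a provision request would be forwarded.

```c
/* Hypothetical labels for the three dm-snap target types listed above. */
enum snap_target {
	SNAP_TARGET_ORIGIN,	/* snapshot-origin */
	SNAP_TARGET_SNAPSHOT,	/* snapshot */
	SNAP_TARGET_MERGE,	/* snapshot-merge */
};

/* Hypothetical labels for where a provision request is forwarded. */
enum provision_dest {
	PROVISION_NOWHERE,	/* nothing is forwarded */
	PROVISION_COW_DEVICE,
	PROVISION_ORIGIN_DEVICE,
};

/* Map a target type to the device that should see REQ_OP_PROVISION. */
static enum provision_dest provision_route(enum snap_target t)
{
	switch (t) {
	case SNAP_TARGET_SNAPSHOT:
		return PROVISION_COW_DEVICE;
	case SNAP_TARGET_MERGE:
		return PROVISION_ORIGIN_DEVICE;
	case SNAP_TARGET_ORIGIN:
	default:
		return PROVISION_NOWHERE;
	}
}
```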

From the thin pool's perspective, REQ_OP_PROVISION reuses the
provision_block() primitive to break sharing. (As you pointed out,
there's a bug in the code below: case 0 should also call
provision_block() if the lookup result shows that the block is
shared.)

So, I think the provision op would carry enough information to
conditionally provision and copy the block. Are there other cases
where UNSHARE_RANGE would be useful?
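
Roughly, the intended per-block decision looks like the user-space
model below. This is only a sketch under stated assumptions, not the
dm-thin code: struct model_block and model_provision_range() are
invented for illustration, and the "provision" step just flips flags
where the real code would call provision_block(). In the kernel the
shared check would presumably come from the lookup result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block state standing in for a thin device mapping. */
struct model_block {
	bool mapped;	/* a lookup would succeed (case 0 in the patch) */
	bool shared;	/* mapped, but also referenced by a snapshot */
};

/*
 * Walk [begin, end) and provision every block that is either unmapped
 * (the -ENODATA case) or mapped but shared (the missing part of case 0).
 * Returns the number of blocks that were provisioned.
 */
static size_t model_provision_range(struct model_block *blocks,
				    size_t begin, size_t end)
{
	size_t provisioned = 0;

	assert(begin <= end);

	for (; begin != end; begin++) {
		struct model_block *b = &blocks[begin];

		/* Already exclusively owned: nothing to do. */
		if (b->mapped && !b->shared)
			continue;

		/* Stand-in for provision_block(): give us a private block. */
		b->mapped = true;
		b->shared = false;
		provisioned++;
	}

	return provisioned;
}
```

With this model a second pass over the same range provisions nothing,
since every block is then mapped and unshared.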

Best
Sarthak

> Brian
>
> > Best
> > Sarthak
> >
> >
> > > Brian
> > >
> > > > +             switch (r) {
> > > > +             case 0:
> > > > +                     begin++;
> > > > +                     break;
> > > > +             case -ENODATA:
> > > > +                     bio_inc_remaining(bio);
> > > > +                     provision_block(tc, bio, begin, cell);
> > > > +                     begin++;
> > > > +                     break;
> > > > +             default:
> > > > +                     DMERR_LIMIT(
> > > > +                             "%s: dm_thin_find_block() failed: error = %d",
> > > > +                             __func__, r);
> > > > +                     cell_defer_no_holder(tc, cell);
> > > > +                     bio_io_error(bio);
> > > > +                     begin++;
> > > > +                     break;
> > > > +             }
> > > > +     }
> > > > +     bio_endio(bio);
> > > > +     cell_defer_no_holder(tc, cell);
> > > > +}
> > > > +
> > > > +static void process_provision_bio(struct thin_c *tc, struct bio *bio)
> > > > +{
> > > > +     dm_block_t begin, end;
> > > > +     struct dm_cell_key virt_key;
> > > > +     struct dm_bio_prison_cell *virt_cell;
> > > > +
> > > > +     get_bio_block_range(tc, bio, &begin, &end);
> > > > +     if (begin == end) {
> > > > +             bio_endio(bio);
> > > > +             return;
> > > > +     }
> > > > +
> > > > +     build_key(tc->td, VIRTUAL, begin, end, &virt_key);
> > > > +     if (bio_detain(tc->pool, &virt_key, bio, &virt_cell))
> > > > +             return;
> > > > +
> > > > +     process_provision_cell(tc, virt_cell);
> > > > +}
> > > > +
> > > >  static void process_bio(struct thin_c *tc, struct bio *bio)
> > > >  {
> > > >       struct pool *pool = tc->pool;
> > > > @@ -2200,6 +2276,8 @@ static void process_thin_deferred_bios(struct thin_c *tc)
> > > >
> > > >               if (bio_op(bio) == REQ_OP_DISCARD)
> > > >                       pool->process_discard(tc, bio);
> > > > +             else if (bio_op(bio) == REQ_OP_PROVISION)
> > > > +                     process_provision_bio(tc, bio);
> > > >               else
> > > >                       pool->process_bio(tc, bio);
> > > >
> > > > @@ -2716,7 +2794,8 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
> > > >               return DM_MAPIO_SUBMITTED;
> > > >       }
> > > >
> > > > -     if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
> > > > +     if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
> > > > +         bio_op(bio) == REQ_OP_PROVISION) {
> > > >               thin_defer_bio_with_throttle(tc, bio);
> > > >               return DM_MAPIO_SUBMITTED;
> > > >       }
> > > > @@ -3355,6 +3434,8 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
> > > >       pt->low_water_blocks = low_water_blocks;
> > > >       pt->adjusted_pf = pt->requested_pf = pf;
> > > >       ti->num_flush_bios = 1;
> > > > +     ti->num_provision_bios = 1;
> > > > +     ti->provision_supported = true;
> > > >
> > > >       /*
> > > >        * Only need to enable discards if the pool should pass
> > > > @@ -4053,6 +4134,7 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > > >               blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
> > > >       }
> > > >
> > > > +
> > > >       /*
> > > >        * pt->adjusted_pf is a staging area for the actual features to use.
> > > >        * They get transferred to the live pool in bind_control_target()
> > > > @@ -4243,6 +4325,9 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
> > > >               ti->num_discard_bios = 1;
> > > >       }
> > > >
> > > > +     ti->num_provision_bios = 1;
> > > > +     ti->provision_supported = true;
> > > > +
> > > >       mutex_unlock(&dm_thin_pool_table.mutex);
> > > >
> > > >       spin_lock_irq(&tc->pool->lock);
> > > > @@ -4457,6 +4542,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > > >
> > > >       limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> > > >       limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> > > > +     limits->max_provision_sectors = 2048 * 1024 * 16; /* 16G */
> > > >  }
> > > >
> > > >  static struct target_type thin_target = {
> > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > index e1ea3a7bd9d9..4d19bae9da4a 100644
> > > > --- a/drivers/md/dm.c
> > > > +++ b/drivers/md/dm.c
> > > > @@ -1587,6 +1587,7 @@ static bool is_abnormal_io(struct bio *bio)
> > > >               case REQ_OP_DISCARD:
> > > >               case REQ_OP_SECURE_ERASE:
> > > >               case REQ_OP_WRITE_ZEROES:
> > > > +             case REQ_OP_PROVISION:
> > > >                       return true;
> > > >               default:
> > > >                       break;
> > > > @@ -1611,6 +1612,9 @@ static blk_status_t __process_abnormal_io(struct clone_info *ci,
> > > >       case REQ_OP_WRITE_ZEROES:
> > > >               num_bios = ti->num_write_zeroes_bios;
> > > >               break;
> > > > +     case REQ_OP_PROVISION:
> > > > +             num_bios = ti->num_provision_bios;
> > > > +             break;
> > > >       default:
> > > >               break;
> > > >       }
> > > > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> > > > index 04c6acf7faaa..b4d97d5d75b8 100644
> > > > --- a/include/linux/device-mapper.h
> > > > +++ b/include/linux/device-mapper.h
> > > > @@ -333,6 +333,12 @@ struct dm_target {
> > > >        */
> > > >       unsigned num_write_zeroes_bios;
> > > >
> > > > +     /*
> > > > +      * The number of PROVISION bios that will be submitted to the target.
> > > > +      * The bio number can be accessed with dm_bio_get_target_bio_nr.
> > > > +      */
> > > > +     unsigned num_provision_bios;
> > > > +
> > > >       /*
> > > >        * The minimum number of extra bytes allocated in each io for the
> > > >        * target to use.
> > > > @@ -357,6 +363,11 @@ struct dm_target {
> > > >        */
> > > >       bool discards_supported:1;
> > > >
> > > > +     /* Set if this target needs to receive provision requests regardless of
> > > > +      * whether or not its underlying devices have support.
> > > > +      */
> > > > +     bool provision_supported:1;
> > > > +
> > > >       /*
> > > >        * Set if we need to limit the number of in-flight bios when swapping.
> > > >        */
> > > > --
> > > > 2.37.3
> > > >
> > >
> >
>


end of thread, other threads:[~2023-04-04  7:35 UTC | newest]

Thread overview: 46+ messages
2022-12-29  8:12 [PATCH v2 0/8] Introduce provisioning primitives for thinly provisioned storage Sarthak Kukreti
2022-12-29  8:12 ` [dm-devel] " Sarthak Kukreti
2022-12-29  8:12 ` [PATCH v2 1/7] block: Introduce provisioning primitives Sarthak Kukreti
2022-12-29  8:12   ` [dm-devel] " Sarthak Kukreti
2022-12-29  8:12 ` [PATCH v2 2/7] dm: Add support for block provisioning Sarthak Kukreti
2022-12-29  8:12   ` [dm-devel] " Sarthak Kukreti
2023-01-05 14:43   ` Brian Foster
2023-01-05 14:43     ` Brian Foster
2023-03-31  0:30     ` Sarthak Kukreti
2023-03-31  0:30       ` [dm-devel] " Sarthak Kukreti
2023-03-31 12:28       ` Brian Foster
2023-03-31 12:28         ` [dm-devel] " Brian Foster
2023-04-03 22:57         ` Sarthak Kukreti
2023-04-03 22:57           ` [dm-devel] " Sarthak Kukreti
2022-12-29  8:12 ` [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION Sarthak Kukreti
2022-12-29  8:12   ` [dm-devel] " Sarthak Kukreti
2023-01-04 16:39   ` Darrick J. Wong
2023-01-04 16:39     ` [dm-devel] " Darrick J. Wong
2023-01-04 18:58     ` Sarthak Kukreti
2023-01-04 18:58       ` [dm-devel] " Sarthak Kukreti
2023-01-04 21:22     ` Sarthak Kukreti
2023-01-04 21:22       ` [dm-devel] " Sarthak Kukreti
2023-01-05 14:46       ` Brian Foster
2023-01-05 14:46         ` Brian Foster
2023-01-05 19:35         ` [dm-devel] " Darrick J. Wong
2023-01-05 19:35           ` Darrick J. Wong
2023-01-09 15:07           ` [dm-devel] " Brian Foster
2023-01-09 15:07             ` Brian Foster
2023-03-31  0:28             ` Sarthak Kukreti
2023-03-31  0:28               ` [dm-devel] " Sarthak Kukreti
2023-03-31  0:28         ` Sarthak Kukreti
2023-03-31  0:28           ` [dm-devel] " Sarthak Kukreti
2023-01-05 15:49       ` Theodore Ts'o
2023-01-05 15:49         ` [dm-devel] " Theodore Ts'o
2023-03-31  0:28         ` Sarthak Kukreti
2023-03-31  0:28           ` [dm-devel] " Sarthak Kukreti
2022-12-29  8:12 ` [PATCH v2 4/7] loop: Add support for provision requests Sarthak Kukreti
2022-12-29  8:12   ` [dm-devel] " Sarthak Kukreti
2022-12-29  8:12 ` [PATCH v2 5/7] ext4: Add support for FALLOC_FL_PROVISION Sarthak Kukreti
2022-12-29  8:12   ` [dm-devel] " Sarthak Kukreti
2022-12-29  8:12 ` [PATCH v2 6/7] ext4: Add mount option for provisioning blocks during allocations Sarthak Kukreti
2022-12-29  8:12   ` [dm-devel] " Sarthak Kukreti
2023-01-09 15:02   ` Brian Foster
2023-01-09 15:02     ` [dm-devel] " Brian Foster
2022-12-29  8:12 ` [PATCH v2 7/7] ext4: Add a per-file provision override xattr Sarthak Kukreti
2022-12-29  8:12   ` [dm-devel] " Sarthak Kukreti
