* [PATCH v5 00/10] Implement copy offload support
       [not found] <CGME20221123061010epcas5p21cef9d23e4362b01f2b19d1117e1cdf5@epcas5p2.samsung.com>
@ 2022-11-23  5:58 ` Nitesh Shetty
       [not found]   ` <CGME20221123061014epcas5p150fd8add12fe6d09b63c56972818e6a2@epcas5p1.samsung.com>
                     ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty

This patch series covers the points discussed in the November 2021 virtual
call [LSF/MM/BPF TOPIC] Storage: Copy Offload [0].
We have covered the initially agreed requirements in this patchset, plus
additional features suggested by the community.
The patchset borrows Mikulas's token-based approach for the two-bdev
implementation.

This is on top of our previous patchset v4[1].

Overall series supports:
========================
	1. Driver
		- NVMe Copy command (single NS, TP 4065), including support
		in nvme-target (for block and file backend).

	2. Block layer
		- Block-generic copy (REQ_COPY flag), with an interface
		accommodating two block devices and multiple
		source/destination ranges
		- Emulation, when offload is natively absent
		- dm-linear support (for cases not requiring split)

	3. User-interface
		- new ioctl
		- copy_file_range for zonefs

	4. In-kernel user
		- dm-kcopyd
		- copy_file_range in zonefs

Testing
=======
	Copy offload can be tested on:
	a. QEMU: NVMe simple copy (TP 4065), by setting the nvme-ns
		parameters mssrl, mcl and msrc. For more info see [2].
	b. Fabrics loopback.
	c. zonefs copy_file_range (see the sample program below).

	Emulation can be tested on any device.

	A sample application that uses the new ioctl is included in the
	patch description; fio also supports copy offload [3].
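
	A minimal copy_file_range() test for case (c) could look like the
	sketch below; the mount point, file names and copy length are
	placeholders, and zonefs adds its own open-flag and alignment
	constraints on top of this:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* adjust paths to the actual zonefs mount and zone files */
	int in = open("/mnt/zonefs/seq/0", O_RDONLY);
	int out = open("/mnt/zonefs/seq/1", O_WRONLY);
	ssize_t ret = -1;

	if (in >= 0 && out >= 0) {
		/* NULL offsets: use and advance the file offsets */
		ret = copy_file_range(in, NULL, out, NULL, 1024 * 1024, 0);
		if (ret < 0)
			perror("copy_file_range");
		else
			printf("copied %zd bytes\n", ret);
	}
	if (in >= 0)
		close(in);
	if (out >= 0)
		close(out);
	return ret < 0;
}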

Performance
===========
	With the async design of copy emulation/offload, using fio [3]
	we were able to see the following improvements compared to
	userspace read and write on an NVMe-oF TCP setup:
	Setup1: Network Speed: 1000Mb/s
		Host PC: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
		Target PC: AMD Ryzen 9 5900X 12-Core Processor
		block-size 8k, range 1:
			635% improvement in IO BW (107 MiB/s to 787 MiB/s).
			Network utilisation drops from  97% to 14%.
		block-size 2M, range 16:
			2555% improvement in IO BW (100 MiB/s to 2655 MiB/s).
			Network utilisation drops from 89% to 0.62%.
	Setup2: Network Speed: 100Gb/s
		Server: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 72 cores
			(host and target have the same configuration)
		block-size 8k, range 1:
			6.5% improvement in IO BW (791 MiB/s to 843 MiB/s).
			Network utilisation drops from  6.75% to 0.14%.
		block-size 2M, range 16:
			15% improvement in IO BW (1027 MiB/s to 1183 MiB/s).
			Network utilisation drops from 8.42% to ~0%.
		block-size 8k, 8 ranges:
			18% drop in IO BW (from 798 MiB/s to 647 MiB/s)
			Network utilisation drops from 6.66% to 0.13%.

		At present we see a drop in performance for 8k and 16k block
		sizes with higher range counts (8, 16); this needs further
		investigation.
	Overall, in these tests, kernel copy emulation performs better than
	userspace read+write.

Zonefs copy_file_range
======================
	Sample tests are available in zonefs-tools [4]. Tests 0118 and 0119
	exercise basic CFR. We will raise a PR once this series is finalized.

Future Work
===========
	- nullblk: copy-offload emulation
	- generic copy file range (CFR):
		I went through this, but could not find a straightforward
		way to plug copy offload into all the cases. We are doing a
		detailed study and will address this in future versions.
	- loopback device copy offload support
	- upstream fio to use copy offload

	These are to be taken up after we reach consensus on the
	plumbing of current elements that are part of this series.


Additional links:
=================
	[0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
	[1] https://lore.kernel.org/lkml/20220426101241.30100-1-nj.shetty@samsung.com/
	[2] https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
	[3] https://github.com/vincentkfu/fio/tree/copyoffload
	[4] https://github.com/nitesh-shetty/zonefs-tools/tree/cfr

Changes since v4:
=================
	- make the offload and emulation design asynchronous (Hannes
	  Reinecke)
	- fabrics loopback support
	- sysfs naming improvements (Damien Le Moal)
	- use kfree() instead of kvfree() in cio_await_completion
	  (Damien Le Moal)
	- use ranges instead of rlist to represent range_entry (Damien
	  Le Moal)
	- change argument ordering in blk_copy_offload as suggested (Damien
	  Le Moal)
	- removed multiple copy limit and merged into only one limit
	  (Damien Le Moal)
	- wrap overly long lines (Damien Le Moal)
	- other naming improvements and cleanups (Damien Le Moal)
	- correctly format the code example in description (Damien Le
	  Moal)
	- mark blk_copy_offload as static (kernel test robot)
	
Changes since v3:
=================
	- added copy_file_range support for zonefs
	- added documentation about new sysfs entries
	- incorporated review comments on v3
	- minor fixes

Changes since v2:
=================
	- fixed possible race condition reported by Damien Le Moal
	- new sysfs controls as suggested by Damien Le Moal
	- fixed possible memory leak reported by Dan Carpenter, lkp
	- minor fixes

Nitesh Shetty (10):
  block: Introduce queue limits for copy-offload support
  block: Add copy offload support infrastructure
  block: add emulation for copy
  block: Introduce a new ioctl for copy
  nvme: add copy offload support
  nvmet: add copy command support for bdev and file ns
  dm: Add support for copy offload.
  dm: Enable copy offload for dm-linear target
  dm kcopyd: use copy offload support
  fs: add support for copy file range in zonefs

 Documentation/ABI/stable/sysfs-block |  36 ++
 block/blk-lib.c                      | 597 +++++++++++++++++++++++++++
 block/blk-map.c                      |   4 +-
 block/blk-settings.c                 |  24 ++
 block/blk-sysfs.c                    |  64 +++
 block/blk.h                          |   2 +
 block/ioctl.c                        |  36 ++
 drivers/md/dm-kcopyd.c               |  56 ++-
 drivers/md/dm-linear.c               |   1 +
 drivers/md/dm-table.c                |  42 ++
 drivers/md/dm.c                      |   7 +
 drivers/nvme/host/core.c             | 106 ++++-
 drivers/nvme/host/fc.c               |   5 +
 drivers/nvme/host/nvme.h             |   7 +
 drivers/nvme/host/pci.c              |  28 +-
 drivers/nvme/host/rdma.c             |   7 +
 drivers/nvme/host/tcp.c              |  16 +
 drivers/nvme/host/trace.c            |  19 +
 drivers/nvme/target/admin-cmd.c      |   9 +-
 drivers/nvme/target/io-cmd-bdev.c    |  79 ++++
 drivers/nvme/target/io-cmd-file.c    |  51 +++
 drivers/nvme/target/loop.c           |   6 +
 drivers/nvme/target/nvmet.h          |   2 +
 fs/zonefs/super.c                    | 179 ++++++++
 include/linux/blk_types.h            |  44 ++
 include/linux/blkdev.h               |  18 +
 include/linux/device-mapper.h        |   5 +
 include/linux/nvme.h                 |  43 +-
 include/uapi/linux/fs.h              |  27 ++
 29 files changed, 1502 insertions(+), 18 deletions(-)


base-commit: e4cd8d3ff7f9efeb97330e5e9b99eeb2a68f5cf9
-- 
2.35.1.500.gb896f729e2



* [PATCH v5 01/10] block: Introduce queue limits for copy-offload support
       [not found]   ` <CGME20221123061014epcas5p150fd8add12fe6d09b63c56972818e6a2@epcas5p1.samsung.com>
@ 2022-11-23  5:58     ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty,
	Hannes Reinecke

Add device limits as sysfs entries:
        - copy_offload (RW)
        - copy_max_bytes (RW)
        - copy_max_bytes_hw (RO)

The above limits help to split the copy payload in the block layer.
copy_offload: selects copy offload (1) or emulation (0).
copy_max_bytes: maximum total length of a copy in a single payload.
copy_max_bytes_hw: reflects the maximum limit supported by the device.
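
A rough sketch (not part of this patch) of how a low-level driver could
wire up these limits with the helper added here; the function name and the
4 MiB limit are made up for illustration:

#include <linux/blkdev.h>

/* hypothetical driver helper: advertise a 4 MiB copy limit */
static void foo_set_copy_limits(struct request_queue *q)
{
	/* sets both max_copy_sectors_hw and max_copy_sectors */
	blk_queue_max_copy_sectors_hw(q, (4U << 20) >> SECTOR_SHIFT);
	/* enable offload; userspace can toggle it via copy_offload */
	blk_queue_flag_set(QUEUE_FLAG_COPY, q);
}

Userspace can then lower the software limit below the advertised hardware
limit by writing to copy_max_bytes.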

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 Documentation/ABI/stable/sysfs-block | 36 ++++++++++++++++
 block/blk-settings.c                 | 24 +++++++++++
 block/blk-sysfs.c                    | 64 ++++++++++++++++++++++++++++
 include/linux/blkdev.h               | 12 ++++++
 include/uapi/linux/fs.h              |  3 ++
 5 files changed, 139 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index cd14ecb3c9a5..e0c9be009706 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,42 @@ Description:
 		last zone of the device which may be smaller.
 
 
+What:		/sys/block/<disk>/queue/copy_offload
+Date:		November 2022
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RW] When read, this file shows whether offloading copies to
+		the device is enabled (1) or disabled (0). Writing '0' to this
+		file will disable offloading copies for this device.
+		Writing '1' will enable this feature. If the device does not
+		support offloading, writing '1' will result in an error.
+
+
+What:		/sys/block/<disk>/queue/copy_max_bytes
+Date:		November 2022
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RW] While 'copy_max_bytes_hw' is the hardware limit for the
+		device, 'copy_max_bytes' is the software limit. Setting this
+		value lower will make Linux issue smaller copies from the
+		block layer.
+
+
+What:		/sys/block/<disk>/queue/copy_max_bytes_hw
+Date:		November 2022
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RO] Devices that support offloading copy functionality may have
+		internal limits on the number of bytes that can be offloaded
+		in a single operation. The `copy_max_bytes_hw`
+		parameter is set by the device driver to the maximum number of
+		bytes that can be copied in a single operation. Copy
+		requests issued to the device must not exceed this limit.
+		A value of 0 means that the device does not
+		support copy offload.
+
+
 What:		/sys/block/<disk>/queue/crypto/
 Date:		February 2022
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0477c4d527fe..ca6f15a70fdc 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -58,6 +58,8 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->zoned = BLK_ZONED_NONE;
 	lim->zone_write_granularity = 0;
 	lim->dma_alignment = 511;
+	lim->max_copy_sectors_hw = 0;
+	lim->max_copy_sectors = 0;
 }
 
 /**
@@ -81,6 +83,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
 	lim->max_zone_append_sectors = UINT_MAX;
+	lim->max_copy_sectors_hw = ULONG_MAX;
+	lim->max_copy_sectors = ULONG_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -177,6 +181,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+		unsigned int max_copy_sectors)
+{
+	if (max_copy_sectors >= MAX_COPY_TOTAL_LENGTH)
+		max_copy_sectors = MAX_COPY_TOTAL_LENGTH;
+
+	q->limits.max_copy_sectors_hw = max_copy_sectors;
+	q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -572,6 +592,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_segment_size = min_not_zero(t->max_segment_size,
 					   b->max_segment_size);
 
+	t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+	t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
+						b->max_copy_sectors_hw);
+
 	t->misaligned |= b->misaligned;
 
 	alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 02e94c4beff1..903285b04029 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -212,6 +212,63 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
 	return queue_var_show(0, page);
 }
 
+static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(blk_queue_copy(q), page);
+}
+
+static ssize_t queue_copy_offload_store(struct request_queue *q,
+				       const char *page, size_t count)
+{
+	s64 copy_offload;
+	ssize_t ret = queue_var_store64(&copy_offload, page);
+
+	if (ret < 0)
+		return ret;
+
+	if (copy_offload && !q->limits.max_copy_sectors_hw)
+		return -EINVAL;
+
+	if (copy_offload)
+		blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+	else
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+
+	return count;
+}
+
+static ssize_t queue_copy_max_hw_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)
+			q->limits.max_copy_sectors_hw << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)
+			q->limits.max_copy_sectors << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_store(struct request_queue *q,
+				       const char *page, size_t count)
+{
+	s64 max_copy;
+	ssize_t ret = queue_var_store64(&max_copy, page);
+
+	if (ret < 0)
+		return ret;
+
+	if (max_copy & (queue_logical_block_size(q) - 1))
+		return -EINVAL;
+
+	max_copy >>= SECTOR_SHIFT;
+	if (max_copy > q->limits.max_copy_sectors_hw)
+		max_copy = q->limits.max_copy_sectors_hw;
+
+	q->limits.max_copy_sectors = max_copy;
+	return count;
+}
+
 static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(0, page);
@@ -604,6 +661,10 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
 QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
 QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
 
+QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
+QUEUE_RO_ENTRY(queue_copy_max_hw, "copy_max_bytes_hw");
+QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
+
 QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
 QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
 QUEUE_RW_ENTRY(queue_poll, "io_poll");
@@ -651,6 +712,9 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_max_entry.attr,
 	&queue_discard_max_hw_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
+	&queue_copy_offload_entry.attr,
+	&queue_copy_max_hw_entry.attr,
+	&queue_copy_max_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
 	&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a0452ba08e9a..3ac324208f2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -302,6 +302,9 @@ struct queue_limits {
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
 
+	unsigned long		max_copy_sectors_hw;
+	unsigned long		max_copy_sectors;
+
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;
 	unsigned short		max_discard_segments;
@@ -573,6 +576,7 @@ struct request_queue {
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED     30	/* single queue style io dispatch */
 #define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skip the queue*/
+#define QUEUE_FLAG_COPY		32	/* supports copy offload */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |		\
 				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
@@ -593,6 +597,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 	test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags)
 #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
 #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
+#define blk_queue_copy(q)	test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
 #define blk_queue_zone_resetall(q)	\
 	test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
@@ -913,6 +918,8 @@ extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
 extern void blk_queue_max_segments(struct request_queue *, unsigned short);
 extern void blk_queue_max_discard_segments(struct request_queue *,
 		unsigned short);
+extern void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+		unsigned int max_copy_sectors);
 void blk_queue_max_secure_erase_sectors(struct request_queue *q,
 		unsigned int max_sectors);
 extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
@@ -1231,6 +1238,11 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
 	return bdev_get_queue(bdev)->limits.discard_granularity;
 }
 
+static inline unsigned int bdev_max_copy_sectors(struct block_device *bdev)
+{
+	return bdev_get_queue(bdev)->limits.max_copy_sectors;
+}
+
 static inline unsigned int
 bdev_max_secure_erase_sectors(struct block_device *bdev)
 {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..b3ad173f619c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,9 @@ struct fstrim_range {
 	__u64 minlen;
 };
 
+/* maximum total copy length */
+#define MAX_COPY_TOTAL_LENGTH	(1 << 27)
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
-- 
2.35.1.500.gb896f729e2



* [PATCH v5 02/10] block: Add copy offload support infrastructure
       [not found]   ` <CGME20221123061017epcas5p246a589e20eac655ac340cfda6028ff35@epcas5p2.samsung.com>
@ 2022-11-23  5:58     ` Nitesh Shetty
  2022-11-23  8:04       ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty

Introduce blkdev_issue_copy, which takes source and destination bdevs
and an array of (source, destination, copy length) tuples.
Introduce the REQ_COPY operation flag. A read-write bio pair is created
with a token as payload and submitted to the device in order: the read
request populates the token with source-specific information, which is
then passed along with the write request.
This design is courtesy of Mikulas Patocka's token-based copy.

Larger copies are split based on the max_copy_sectors limit.
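
As a rough illustration (not part of this patch), a synchronous in-kernel
caller could look like the sketch below; the function name, offsets and
length are made up, and all range_entry fields are byte addressed:

#include <linux/blkdev.h>

/* hypothetical caller: copy 1 MiB from offset 0 to offset 8 MiB */
static int foo_copy_one_meg(struct block_device *src_bdev,
			    struct block_device *dst_bdev)
{
	struct range_entry range = {
		.src = 0,
		.dst = 8 * 1024 * 1024,
		.len = 1024 * 1024,
	};
	int ret;

	/* NULL end_io/private selects the synchronous path */
	ret = blkdev_issue_copy(src_bdev, dst_bdev, &range, 1, NULL, NULL,
				GFP_KERNEL);
	if (ret)
		pr_warn("copy failed (%d), completed %llu of %llu bytes\n",
			ret, range.comp_len, range.len);
	return ret;
}

Passing an end_io callback and a private pointer instead of NULL selects
the asynchronous completion path.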

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-lib.c           | 358 ++++++++++++++++++++++++++++++++++++++
 block/blk.h               |   2 +
 include/linux/blk_types.h |  44 +++++
 include/linux/blkdev.h    |   3 +
 include/uapi/linux/fs.h   |  15 ++
 5 files changed, 422 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..2ce3c872ca49 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,364 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * bio_copy_*_write_end_io. IO errors are propagated through cio->io_error.
+ */
+static int cio_await_completion(struct cio *cio)
+{
+	int ret = 0;
+
+	atomic_dec(&cio->refcount);
+
+	if (cio->endio)
+		return 0;
+
+	if (atomic_read(&cio->refcount)) {
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		blk_io_schedule();
+	}
+
+	ret = cio->io_err;
+	kfree(cio);
+
+	return ret;
+}
+
+static void blk_copy_offload_write_end_io(struct bio *bio)
+{
+	struct copy_ctx *ctx = bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+
+	if (bio->bi_status) {
+		cio->io_err = blk_status_to_errno(bio->bi_status);
+		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+			cio->ranges[ri].dst;
+		cio->ranges[ri].comp_len = min_t(sector_t, clen,
+				cio->ranges[ri].comp_len);
+	}
+	__free_page(bio->bi_io_vec[0].bv_page);
+	bio_put(bio);
+
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_offload_read_end_io(struct bio *read_bio)
+{
+	struct copy_ctx *ctx = read_bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+	unsigned long flags;
+
+	if (read_bio->bi_status) {
+		cio->io_err = blk_status_to_errno(read_bio->bi_status);
+		goto err_rw_bio;
+	}
+
+	/*
+	 * For zoned devices, check whether the completed bio is the first
+	 * entry in the linked list. If yes, start the worker to submit the
+	 * write bios. If not, just update the status of the bio in ctx;
+	 * once the worker gets scheduled, it will submit writes for all
+	 * consecutive REQ_COPY_READ_COMPLETE bios.
+	 */
+	if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
+		spin_lock_irqsave(&cio->list_lock, flags);
+		ctx->status = REQ_COPY_READ_COMPLETE;
+		if (ctx == list_first_entry(&cio->list,
+					struct copy_ctx, list)) {
+			spin_unlock_irqrestore(&cio->list_lock, flags);
+			schedule_work(&ctx->dispatch_work);
+			goto free_read_bio;
+		}
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+	} else
+		schedule_work(&ctx->dispatch_work);
+
+free_read_bio:
+	bio_put(read_bio);
+
+	return;
+
+err_rw_bio:
+	clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+					cio->ranges[ri].src;
+	cio->ranges[ri].comp_len = min_t(sector_t, clen,
+					cio->ranges[ri].comp_len);
+	__free_page(read_bio->bi_io_vec[0].bv_page);
+	bio_put(ctx->write_bio);
+	bio_put(read_bio);
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_dispatch_work_fn(struct work_struct *work)
+{
+	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+			dispatch_work);
+
+	submit_bio(ctx->write_bio);
+}
+
+static void blk_zoned_copy_dispatch_work_fn(struct work_struct *work)
+{
+	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+			dispatch_work);
+	struct cio *cio = ctx->cio;
+	unsigned long flags = 0;
+
+	atomic_inc(&cio->refcount);
+	spin_lock_irqsave(&cio->list_lock, flags);
+
+	while (!list_empty(&cio->list)) {
+		ctx = list_first_entry(&cio->list, struct copy_ctx, list);
+
+		if (ctx->status == REQ_COPY_READ_PROGRESS)
+			break;
+
+		atomic_inc(&ctx->refcount);
+		ctx->status = REQ_COPY_WRITE_PROGRESS;
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+		submit_bio(ctx->write_bio);
+		spin_lock_irqsave(&cio->list_lock, flags);
+
+		list_del(&ctx->list);
+		if (atomic_dec_and_test(&ctx->refcount))
+			kfree(ctx);
+	}
+
+	spin_unlock_irqrestore(&cio->list_lock, flags);
+	if (atomic_dec_and_test(&cio->refcount))
+		blk_wake_io_task(cio->waiter);
+}
+
+/*
+ * blk_copy_offload	- Use the device's native copy offload feature.
+ * We perform the copy operation by sending two bios:
+ * 1. First we send a read bio with the REQ_COPY flag along with a token,
+ * source and length. Once the read bio reaches the driver layer, the device
+ * driver adds all the source info to the token and does a fake completion.
+ * 2. Once the read operation completes, we issue a write with the REQ_COPY
+ * flag and the same token. In the driver layer, the token info is used to
+ * form a copy offload command.
+ *
+ * For conventional devices we submit the write bio independently once the
+ * read completes. For zoned devices, reads can complete out of order, so we
+ * maintain a linked list and submit writes in the order the reads were
+ * submitted.
+ */
+static int blk_copy_offload(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct cio *cio;
+	struct copy_ctx *ctx;
+	struct bio *read_bio, *write_bio;
+	struct page *token;
+	sector_t src_blk, copy_len, dst_blk;
+	sector_t rem, max_copy_len;
+	int ri = 0, ret = 0;
+	unsigned long flags;
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	cio->ranges = ranges;
+	atomic_set(&cio->refcount, 1);
+	cio->waiter = current;
+	cio->endio = end_io;
+	cio->private = private;
+	if (bdev_is_zoned(dst_bdev)) {
+		INIT_LIST_HEAD(&cio->list);
+		spin_lock_init(&cio->list_lock);
+	}
+
+	max_copy_len = min(bdev_max_copy_sectors(src_bdev),
+			bdev_max_copy_sectors(dst_bdev)) << SECTOR_SHIFT;
+
+	for (ri = 0; ri < nr; ri++) {
+		cio->ranges[ri].comp_len = ranges[ri].len;
+		src_blk = ranges[ri].src;
+		dst_blk = ranges[ri].dst;
+		for (rem = ranges[ri].len; rem > 0; rem -= copy_len) {
+			copy_len = min(rem, max_copy_len);
+
+			token = alloc_page(gfp_mask);
+			if (unlikely(!token)) {
+				ret = -ENOMEM;
+				goto err_token;
+			}
+
+			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+			if (!ctx) {
+				ret = -ENOMEM;
+				goto err_ctx;
+			}
+			read_bio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY
+					| REQ_SYNC | REQ_NOMERGE, gfp_mask);
+			if (!read_bio) {
+				ret = -ENOMEM;
+				goto err_read_bio;
+			}
+			write_bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE
+					| REQ_COPY | REQ_SYNC | REQ_NOMERGE,
+					gfp_mask);
+			if (!write_bio) {
+				cio->io_err = -ENOMEM;
+				goto err_write_bio;
+			}
+
+			ctx->cio = cio;
+			ctx->range_idx = ri;
+			ctx->write_bio = write_bio;
+			atomic_set(&ctx->refcount, 1);
+
+			if (bdev_is_zoned(dst_bdev)) {
+				INIT_WORK(&ctx->dispatch_work,
+					blk_zoned_copy_dispatch_work_fn);
+				INIT_LIST_HEAD(&ctx->list);
+				spin_lock_irqsave(&cio->list_lock, flags);
+				ctx->status = REQ_COPY_READ_PROGRESS;
+				list_add_tail(&ctx->list, &cio->list);
+				spin_unlock_irqrestore(&cio->list_lock, flags);
+			} else
+				INIT_WORK(&ctx->dispatch_work,
+					blk_copy_dispatch_work_fn);
+
+			__bio_add_page(read_bio, token, PAGE_SIZE, 0);
+			read_bio->bi_iter.bi_size = copy_len;
+			read_bio->bi_iter.bi_sector = src_blk >> SECTOR_SHIFT;
+			read_bio->bi_end_io = blk_copy_offload_read_end_io;
+			read_bio->bi_private = ctx;
+
+			__bio_add_page(write_bio, token, PAGE_SIZE, 0);
+			write_bio->bi_iter.bi_size = copy_len;
+			write_bio->bi_end_io = blk_copy_offload_write_end_io;
+			write_bio->bi_iter.bi_sector = dst_blk >> SECTOR_SHIFT;
+			write_bio->bi_private = ctx;
+
+			atomic_inc(&cio->refcount);
+			submit_bio(read_bio);
+			src_blk += copy_len;
+			dst_blk += copy_len;
+		}
+	}
+
+	/* Wait for completion of all IOs */
+	return cio_await_completion(cio);
+
+err_write_bio:
+	bio_put(read_bio);
+err_read_bio:
+	kfree(ctx);
+err_ctx:
+	__free_page(token);
+err_token:
+	ranges[ri].comp_len = min_t(sector_t,
+			ranges[ri].comp_len, (ranges[ri].len - rem));
+
+	cio->io_err = ret;
+	return cio_await_completion(cio);
+}
+
+static inline int blk_copy_sanity_check(struct block_device *src_bdev,
+	struct block_device *dst_bdev, struct range_entry *ranges, int nr)
+{
+	unsigned int align_mask = max(bdev_logical_block_size(dst_bdev),
+					bdev_logical_block_size(src_bdev)) - 1;
+	sector_t len = 0;
+	int i;
+
+	if (!nr)
+		return -EINVAL;
+
+	if (nr >= MAX_COPY_NR_RANGE)
+		return -EINVAL;
+
+	if (bdev_read_only(dst_bdev))
+		return -EPERM;
+
+	for (i = 0; i < nr; i++) {
+		if (!ranges[i].len)
+			return -EINVAL;
+
+		len += ranges[i].len;
+		if ((ranges[i].dst & align_mask) ||
+				(ranges[i].src & align_mask) ||
+				(ranges[i].len & align_mask))
+			return -EINVAL;
+		ranges[i].comp_len = 0;
+	}
+
+	if (len && len >= MAX_COPY_TOTAL_LENGTH)
+		return -EINVAL;
+
+	return 0;
+}
+
+static inline bool blk_check_copy_offload(struct request_queue *src_q,
+		struct request_queue *dst_q)
+{
+	return blk_queue_copy(dst_q) && blk_queue_copy(src_q);
+}
+
+/*
+ * blkdev_issue_copy - queue a copy
+ * @src_bdev:	source block device
+ * @dst_bdev:	destination block device
+ * @ranges:	array of source/dest/len,
+ *		ranges are expected to be allocated/freed by caller
+ * @nr:		number of source ranges to copy
+ * @end_io:	end_io function to be called on completion of copy operation,
+ *		for synchronous operation this should be NULL
+ * @private:	end_io function will be called with this private data, should be
+ *		NULL, if operation is synchronous in nature
+ * @gfp_mask:   memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *	Copy source ranges from the source block device to the destination
+ *	block device. The length of a source range cannot be zero. The total
+ *	copy length is limited to MAX_COPY_TOTAL_LENGTH and the maximum
+ *	number of entries is limited to MAX_COPY_NR_RANGE.
+ */
+int blkdev_issue_copy(struct block_device *src_bdev,
+	struct block_device *dst_bdev, struct range_entry *ranges, int nr,
+	cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct request_queue *src_q = bdev_get_queue(src_bdev);
+	struct request_queue *dst_q = bdev_get_queue(dst_bdev);
+	int ret = -EINVAL;
+
+	ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
+	if (ret)
+		return ret;
+
+	if (blk_check_copy_offload(src_q, dst_q))
+		ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
+				end_io, private, gfp_mask);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_issue_copy);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/block/blk.h b/block/blk.h
index 5929559acd71..6d534047f20d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -308,6 +308,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
 		break;
 	}
 
+	if (unlikely(op_is_copy(bio->bi_opf)))
+		return false;
 	/*
 	 * All drivers must accept single-segments bios that are <= PAGE_SIZE.
 	 * This is a quick and dirty check that relies on the fact that
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index e0b098089ef2..71278c862bba 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -422,6 +422,7 @@ enum req_flag_bits {
 	 */
 	/* for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
+	__REQ_COPY,		/* copy request */
 
 	__REQ_NR_BITS,		/* stops here */
 };
@@ -451,6 +452,7 @@ enum req_flag_bits {
 
 #define REQ_DRV		(__force blk_opf_t)(1ULL << __REQ_DRV)
 #define REQ_SWAP	(__force blk_opf_t)(1ULL << __REQ_SWAP)
+#define REQ_COPY	((__force blk_opf_t)(1ULL << __REQ_COPY))
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
@@ -484,6 +486,11 @@ static inline bool op_is_write(blk_opf_t op)
 	return !!(op & (__force blk_opf_t)1);
 }
 
+static inline bool op_is_copy(blk_opf_t op)
+{
+	return (op & REQ_COPY);
+}
+
 /*
  * Check if the bio or request is one that needs special treatment in the
  * flush state machine.
@@ -543,4 +550,41 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+typedef void (cio_iodone_t)(void *private, int status);
+
+struct cio {
+	struct range_entry *ranges;
+	struct task_struct *waiter;     /* waiting task (NULL if none) */
+	atomic_t refcount;
+	int io_err;
+	cio_iodone_t *endio;		/* applicable for async operation */
+	void *private;			/* applicable for async operation */
+
+	/* For zoned devices we maintain a linked list of IO submissions.
+	 * This is to make sure we preserve the order of submissions;
+	 * otherwise reads completing out of order could result in writes
+	 * that are not aligned with the zone write pointer.
+	 */
+	struct list_head list;
+	spinlock_t list_lock;
+};
+
+enum copy_io_status {
+	REQ_COPY_READ_PROGRESS,
+	REQ_COPY_READ_COMPLETE,
+	REQ_COPY_WRITE_PROGRESS,
+};
+
+struct copy_ctx {
+	struct cio *cio;
+	struct work_struct dispatch_work;
+	struct bio *write_bio;
+	atomic_t refcount;
+	int range_idx;			/* used in error/partial completion */
+
+	/* For zoned devices a linked list is maintained, along with the IO state */
+	struct list_head list;
+	enum copy_io_status status;
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3ac324208f2f..a3b12ad42ed7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1065,6 +1065,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
 int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp);
+int blkdev_issue_copy(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b3ad173f619c..9248b6d259de 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -67,6 +67,21 @@ struct fstrim_range {
 /* maximum total copy length */
 #define MAX_COPY_TOTAL_LENGTH	(1 << 27)
 
+/* Maximum no of entries supported */
+#define MAX_COPY_NR_RANGE	(1 << 12)
+
+/* range entry for copy offload, all fields should be byte addressed */
+struct range_entry {
+	__u64 src;		/* source to be copied */
+	__u64 dst;		/* destination */
+	__u64 len;		/* length in bytes to be copied */
+
+	/* length of the copy actually completed. This will be filled by
+	 * the kernel once the copy completes
+	 */
+	__u64 comp_len;
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
-- 
2.35.1.500.gb896f729e2



* [PATCH v5 03/10] block: add emulation for copy
       [not found]   ` <CGME20221123061021epcas5p276b6d48db889932282d017b27c9a3291@epcas5p2.samsung.com>
@ 2022-11-23  5:58     ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty,
	Vincent Fu

For devices which do not support copy, copy emulation is added.
Copy emulation is implemented by reading from the source ranges
into memory and writing to the corresponding destinations asynchronously.
For zoned devices we maintain a linked list of read submissions and try to
submit the corresponding writes in the same order.
Emulation is also used if copy offload fails or partially completes.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-lib.c        | 241 ++++++++++++++++++++++++++++++++++++++++-
 block/blk-map.c        |   4 +-
 include/linux/blkdev.h |   3 +
 3 files changed, 245 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2ce3c872ca49..43b1d0ef5732 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -428,6 +428,239 @@ static inline int blk_copy_sanity_check(struct block_device *src_bdev,
 	return 0;
 }
 
+static void *blk_alloc_buf(sector_t req_size, sector_t *alloc_size,
+		gfp_t gfp_mask)
+{
+	int min_size = PAGE_SIZE;
+	void *buf;
+
+	while (req_size >= min_size) {
+		buf = kvmalloc(req_size, gfp_mask);
+		if (buf) {
+			*alloc_size = req_size;
+			return buf;
+		}
+		/* retry half the requested size */
+		req_size >>= 1;
+	}
+
+	return NULL;
+}
+
+static void blk_copy_emulate_write_end_io(struct bio *bio)
+{
+	struct copy_ctx *ctx = bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+
+	if (bio->bi_status) {
+		cio->io_err = blk_status_to_errno(bio->bi_status);
+		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+			cio->ranges[ri].dst;
+		cio->ranges[ri].comp_len = min_t(sector_t, clen,
+				cio->ranges[ri].comp_len);
+	}
+	kvfree(page_address(bio->bi_io_vec[0].bv_page));
+	bio_map_kern_endio(bio);
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_emulate_read_end_io(struct bio *read_bio)
+{
+	struct copy_ctx *ctx = read_bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+	unsigned long flags;
+
+	if (read_bio->bi_status) {
+		cio->io_err = blk_status_to_errno(read_bio->bi_status);
+		goto err_rw_bio;
+	}
+
+	/*
+	 * For zoned devices, check whether the completed bio is the first
+	 * entry in the linked list. If yes, start the worker to submit the
+	 * write bios. If not, just update the status of the bio in ctx;
+	 * once the worker gets scheduled, it will submit writes for all
+	 * consecutive REQ_COPY_READ_COMPLETE bios.
+	 */
+	if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
+		spin_lock_irqsave(&cio->list_lock, flags);
+		ctx->status = REQ_COPY_READ_COMPLETE;
+		if (ctx == list_first_entry(&cio->list,
+					struct copy_ctx, list)) {
+			spin_unlock_irqrestore(&cio->list_lock, flags);
+			schedule_work(&ctx->dispatch_work);
+			goto free_read_bio;
+		}
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+	} else
+		schedule_work(&ctx->dispatch_work);
+
+free_read_bio:
+	kfree(read_bio);
+
+	return;
+
+err_rw_bio:
+	clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+					cio->ranges[ri].src;
+	cio->ranges[ri].comp_len = min_t(sector_t, clen,
+					cio->ranges[ri].comp_len);
+	__free_page(read_bio->bi_io_vec[0].bv_page);
+	bio_map_kern_endio(read_bio);
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+/*
+ * If the native copy offload feature is absent, this function emulates the
+ * copy by reading data from the source into a temporary buffer and writing
+ * it from the buffer to the destination device.
+ */
+static int blk_copy_emulate(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct request_queue *sq = bdev_get_queue(src_bdev);
+	struct request_queue *dq = bdev_get_queue(dst_bdev);
+	struct bio *read_bio, *write_bio;
+	void *buf = NULL;
+	struct copy_ctx *ctx;
+	struct cio *cio;
+	sector_t src, dst, offset, buf_len, req_len, rem = 0;
+	int ri = 0, ret = 0;
+	unsigned long flags;
+	sector_t max_src_hw_len = min_t(unsigned int, queue_max_hw_sectors(sq),
+			queue_max_segments(sq) << (PAGE_SHIFT - SECTOR_SHIFT))
+			<< SECTOR_SHIFT;
+	sector_t max_dst_hw_len = min_t(unsigned int, queue_max_hw_sectors(dq),
+			queue_max_segments(dq) << (PAGE_SHIFT - SECTOR_SHIFT))
+			<< SECTOR_SHIFT;
+	sector_t max_hw_len = min_t(unsigned int,
+			max_src_hw_len, max_dst_hw_len);
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	cio->ranges = ranges;
+	atomic_set(&cio->refcount, 1);
+	cio->waiter = current;
+	cio->endio = end_io;
+	cio->private = private;
+
+	if (bdev_is_zoned(dst_bdev)) {
+		INIT_LIST_HEAD(&cio->list);
+		spin_lock_init(&cio->list_lock);
+	}
+
+	for (ri = 0; ri < nr; ri++) {
+		offset = ranges[ri].comp_len;
+		src = ranges[ri].src + offset;
+		dst = ranges[ri].dst + offset;
+		/* If IO fails, we truncate comp_len */
+		ranges[ri].comp_len = ranges[ri].len;
+
+		for (rem = ranges[ri].len - offset; rem > 0; rem -= buf_len) {
+			req_len = min_t(int, max_hw_len, rem);
+
+			buf = blk_alloc_buf(req_len, &buf_len, gfp_mask);
+			if (!buf) {
+				ret = -ENOMEM;
+				goto err_alloc_buf;
+			}
+
+			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+			if (!ctx) {
+				ret = -ENOMEM;
+				goto err_ctx;
+			}
+
+			read_bio = bio_map_kern(sq, buf, buf_len, gfp_mask);
+			if (IS_ERR(read_bio)) {
+				ret = PTR_ERR(read_bio);
+				goto err_read_bio;
+			}
+
+			write_bio = bio_map_kern(dq, buf, buf_len, gfp_mask);
+			if (IS_ERR(write_bio)) {
+				ret = PTR_ERR(write_bio);
+				goto err_write_bio;
+			}
+
+			ctx->cio = cio;
+			ctx->range_idx = ri;
+			ctx->write_bio = write_bio;
+			atomic_set(&ctx->refcount, 1);
+
+			read_bio->bi_iter.bi_sector = src >> SECTOR_SHIFT;
+			read_bio->bi_iter.bi_size = buf_len;
+			read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+			bio_set_dev(read_bio, src_bdev);
+			read_bio->bi_end_io = blk_copy_emulate_read_end_io;
+			read_bio->bi_private = ctx;
+
+			write_bio->bi_iter.bi_size = buf_len;
+			write_bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
+			bio_set_dev(write_bio, dst_bdev);
+			write_bio->bi_end_io = blk_copy_emulate_write_end_io;
+			write_bio->bi_iter.bi_sector = dst >> SECTOR_SHIFT;
+			write_bio->bi_private = ctx;
+
+			if (bdev_is_zoned(dst_bdev)) {
+				INIT_WORK(&ctx->dispatch_work,
+					blk_zoned_copy_dispatch_work_fn);
+				INIT_LIST_HEAD(&ctx->list);
+				spin_lock_irqsave(&cio->list_lock, flags);
+				ctx->status = REQ_COPY_READ_PROGRESS;
+				list_add_tail(&ctx->list, &cio->list);
+				spin_unlock_irqrestore(&cio->list_lock, flags);
+			} else
+				INIT_WORK(&ctx->dispatch_work,
+					blk_copy_dispatch_work_fn);
+
+			atomic_inc(&cio->refcount);
+			submit_bio(read_bio);
+
+			src += buf_len;
+			dst += buf_len;
+		}
+	}
+
+	/* Wait for completion of all IOs */
+	return cio_await_completion(cio);
+
+err_write_bio:
+	bio_put(read_bio);
+err_read_bio:
+	kfree(ctx);
+err_ctx:
+	kvfree(buf);
+err_alloc_buf:
+	ranges[ri].comp_len -= min_t(sector_t,
+			ranges[ri].comp_len, (ranges[ri].len - rem));
+
+	cio->io_err = ret;
+	return cio_await_completion(cio);
+}
+
 static inline bool blk_check_copy_offload(struct request_queue *src_q,
 		struct request_queue *dst_q)
 {
@@ -460,15 +693,21 @@ int blkdev_issue_copy(struct block_device *src_bdev,
 	struct request_queue *src_q = bdev_get_queue(src_bdev);
 	struct request_queue *dst_q = bdev_get_queue(dst_bdev);
 	int ret = -EINVAL;
+	bool offload = false;
 
 	ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
 	if (ret)
 		return ret;
 
-	if (blk_check_copy_offload(src_q, dst_q))
+	offload = blk_check_copy_offload(src_q, dst_q);
+	if (offload)
 		ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
 				end_io, private, gfp_mask);
 
+	if (ret || !offload)
+		ret = blk_copy_emulate(src_bdev, dst_bdev, ranges, nr,
+				end_io, private, gfp_mask);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_issue_copy);
diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73..bcf8db2b75f1 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -363,7 +363,7 @@ static void bio_invalidate_vmalloc_pages(struct bio *bio)
 #endif
 }
 
-static void bio_map_kern_endio(struct bio *bio)
+void bio_map_kern_endio(struct bio *bio)
 {
 	bio_invalidate_vmalloc_pages(bio);
 	bio_uninit(bio);
@@ -380,7 +380,7 @@ static void bio_map_kern_endio(struct bio *bio)
  *	Map the kernel address into a bio suitable for io to a block
  *	device. Returns an error pointer in case of error.
  */
-static struct bio *bio_map_kern(struct request_queue *q, void *data,
+struct bio *bio_map_kern(struct request_queue *q, void *data,
 		unsigned int len, gfp_t gfp_mask)
 {
 	unsigned long kaddr = (unsigned long)data;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a3b12ad42ed7..b0b18c30a60b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1068,6 +1068,9 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 int blkdev_issue_copy(struct block_device *src_bdev,
 		struct block_device *dst_bdev, struct range_entry *ranges,
 		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
+struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
+		gfp_t gfp_mask);
+void bio_map_kern_endio(struct bio *bio);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
-- 
2.35.1.500.gb896f729e2



* [PATCH v5 04/10] block: Introduce a new ioctl for copy
       [not found]   ` <CGME20221123061024epcas5p28fd0296018950d722b5a97e2875cf391@epcas5p2.samsung.com>
@ 2022-11-23  5:58     ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty,
	Hannes Reinecke, Javier González

Add a new BLKCOPY ioctl that offloads copying of one or more source ranges
to one or more destinations on a device. The COPY ioctl accepts a
'copy_range' structure that contains the number of ranges and a reserved
field, followed by an array of ranges. Each range is represented by a
'range_entry' that contains the source start offset, the destination start
offset and the length to copy (in bytes).

MAX_COPY_NR_RANGE limits the number of entries the ioctl accepts and
MAX_COPY_TOTAL_LENGTH limits the total copy length the ioctl can handle.

Example code to issue BLKCOPY:
/* Sample program to copy three entries of [dst, src, len]:
 * [32768, 0, 4096] [36864, 4096, 4096] [40960, 8192, 4096] on the same
 * device. BLKCOPY, struct copy_range and struct range_entry come from the
 * patched <linux/fs.h>. */

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
	int i, ret, fd;
	unsigned long src = 0, dst = 32768, len = 4096;
	struct copy_range *cr;

	cr = malloc(sizeof(*cr) + (sizeof(struct range_entry) * 3));
	if (!cr)
		return 1;
	cr->nr_range = 3;
	cr->reserved = 0;
	for (i = 0; i < cr->nr_range; i++, src += len, dst += len) {
		cr->ranges[i].dst = dst;
		cr->ranges[i].src = src;
		cr->ranges[i].len = len;
		cr->ranges[i].comp_len = 0;
	}

	fd = open("/dev/nvme0n1", O_RDWR);
	if (fd < 0)
		return 1;

	ret = ioctl(fd, BLKCOPY, cr);
	if (ret != 0)
		printf("copy failed, ret = %d\n", ret);

	for (i = 0; i < cr->nr_range; i++)
		if (cr->ranges[i].len != cr->ranges[i].comp_len)
			printf("Partial copy for entry %d: requested %llu, completed %llu\n",
			       i, cr->ranges[i].len, cr->ranges[i].comp_len);
	close(fd);
	free(cr);
	return ret;
}

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/ioctl.c           | 36 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  9 +++++++++
 2 files changed, 45 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index 60121e89052b..7daf76199161 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -120,6 +120,40 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 	return err;
 }
 
+static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
+		unsigned long arg)
+{
+	struct copy_range ucopy_range, *kcopy_range = NULL;
+	size_t payload_size = 0;
+	int ret;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (copy_from_user(&ucopy_range, (void __user *)arg,
+				sizeof(ucopy_range)))
+		return -EFAULT;
+
+	if (unlikely(!ucopy_range.nr_range || ucopy_range.reserved ||
+				ucopy_range.nr_range >= MAX_COPY_NR_RANGE))
+		return -EINVAL;
+
+	payload_size = (ucopy_range.nr_range * sizeof(struct range_entry)) +
+				sizeof(ucopy_range);
+
+	kcopy_range = memdup_user((void __user *)arg, payload_size);
+	if (IS_ERR(kcopy_range))
+		return PTR_ERR(kcopy_range);
+
+	ret = blkdev_issue_copy(bdev, bdev, kcopy_range->ranges,
+			kcopy_range->nr_range, NULL, NULL, GFP_KERNEL);
+	if (copy_to_user((void __user *)arg, kcopy_range, payload_size))
+		ret = -EFAULT;
+
+	kfree(kcopy_range);
+	return ret;
+}
+
 static int blk_ioctl_secure_erase(struct block_device *bdev, fmode_t mode,
 		void __user *argp)
 {
@@ -481,6 +515,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
 		return blk_ioctl_discard(bdev, mode, arg);
 	case BLKSECDISCARD:
 		return blk_ioctl_secure_erase(bdev, mode, argp);
+	case BLKCOPY:
+		return blk_ioctl_copy(bdev, mode, arg);
 	case BLKZEROOUT:
 		return blk_ioctl_zeroout(bdev, mode, arg);
 	case BLKGETDISKSEQ:
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 9248b6d259de..8af10b926a6f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -82,6 +82,14 @@ struct range_entry {
 	__u64 comp_len;
 };
 
+struct copy_range {
+	__u64 nr_range;
+	__u64 reserved;
+
+	/* Ranges always must be at the end */
+	struct range_entry ranges[];
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
@@ -203,6 +211,7 @@ struct fsxattr {
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
 #define BLKGETDISKSEQ _IOR(0x12,128,__u64)
+#define BLKCOPY _IOWR(0x12, 129, struct copy_range)
 /*
  * A jump here: 130-136 are reserved for zoned block devices
  * (see uapi/linux/blkzoned.h)
-- 
2.35.1.500.gb896f729e2



* [PATCH v5 05/10] nvme: add copy offload support
       [not found]   ` <CGME20221123061028epcas5p1aecd27b2f4f694b5a18b51d3df5d7432@epcas5p1.samsung.com>
@ 2022-11-23  5:58     ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty,
	Javier González

For devices supporting native copy, the nvme driver receives read and
write requests with the REQ_COPY flag set.
For the read request the nvme driver populates the payload with the source
information.
For the write request the driver converts it to an nvme copy command using
the source information in the payload and submits it to the device.
The current design only supports a single source range.
This design is courtesy of Mikulas Patocka's token-based copy.

Add trace event support for nvme_copy_cmd.
Set the device copy limits as queue limits.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/host/core.c  | 106 +++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/fc.c    |   5 ++
 drivers/nvme/host/nvme.h  |   7 +++
 drivers/nvme/host/pci.c   |  28 ++++++++--
 drivers/nvme/host/rdma.c  |   7 +++
 drivers/nvme/host/tcp.c   |  16 ++++++
 drivers/nvme/host/trace.c |  19 +++++++
 include/linux/nvme.h      |  43 ++++++++++++++--
 8 files changed, 223 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4423ccd0b0b1..26ce482ac112 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -751,6 +751,80 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline blk_status_t nvme_setup_copy_read(struct nvme_ns *ns,
+		struct request *req)
+{
+	struct bio *bio = req->bio;
+	struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+	memcpy(token->subsys, "nvme", 4);
+	token->ns = ns;
+	token->src_sector = bio->bi_iter.bi_sector;
+	token->sectors = bio->bi_iter.bi_size >> 9;
+
+	return BLK_STS_OK;
+}
+
+static inline blk_status_t nvme_setup_copy_write(struct nvme_ns *ns,
+	       struct request *req, struct nvme_command *cmnd)
+{
+	struct nvme_copy_range *range = NULL;
+	struct bio *bio = req->bio;
+	struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+	sector_t src_sector, dst_sector, n_sectors;
+	u64 src_lba, dst_lba, n_lba;
+	unsigned short nr_range = 1;
+	u16 control = 0;
+
+	if (unlikely(memcmp(token->subsys, "nvme", 4)))
+		return BLK_STS_NOTSUPP;
+	if (unlikely(token->ns != ns))
+		return BLK_STS_NOTSUPP;
+
+	src_sector = token->src_sector;
+	dst_sector = bio->bi_iter.bi_sector;
+	n_sectors = token->sectors;
+	if (WARN_ON(n_sectors != bio->bi_iter.bi_size >> 9))
+		return BLK_STS_NOTSUPP;
+
+	src_lba = nvme_sect_to_lba(ns, src_sector);
+	dst_lba = nvme_sect_to_lba(ns, dst_sector);
+	n_lba = nvme_sect_to_lba(ns, n_sectors);
+
+	if (WARN_ON(!n_lba))
+		return BLK_STS_NOTSUPP;
+
+	if (req->cmd_flags & REQ_FUA)
+		control |= NVME_RW_FUA;
+
+	if (req->cmd_flags & REQ_FAILFAST_DEV)
+		control |= NVME_RW_LR;
+
+	memset(cmnd, 0, sizeof(*cmnd));
+	cmnd->copy.opcode = nvme_cmd_copy;
+	cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+	cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+
+	range = kmalloc_array(nr_range, sizeof(*range),
+			GFP_ATOMIC | __GFP_NOWARN);
+	if (!range)
+		return BLK_STS_RESOURCE;
+
+	range[0].slba = cpu_to_le64(src_lba);
+	range[0].nlb = cpu_to_le16(n_lba - 1);
+
+	cmnd->copy.nr_range = 0;
+
+	req->special_vec.bv_page = virt_to_page(range);
+	req->special_vec.bv_offset = offset_in_page(range);
+	req->special_vec.bv_len = sizeof(*range) * nr_range;
+	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+	cmnd->copy.control = cpu_to_le16(control);
+
+	return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *cmnd)
 {
@@ -974,10 +1048,16 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 		ret = nvme_setup_discard(ns, req, cmd);
 		break;
 	case REQ_OP_READ:
-		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
+		if (unlikely(req->cmd_flags & REQ_COPY))
+			ret = nvme_setup_copy_read(ns, req);
+		else
+			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
 		break;
 	case REQ_OP_WRITE:
-		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+		if (unlikely(req->cmd_flags & REQ_COPY))
+			ret = nvme_setup_copy_write(ns, req, cmd);
+		else
+			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
 		break;
 	case REQ_OP_ZONE_APPEND:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
@@ -1704,6 +1784,26 @@ static void nvme_config_discard(struct gendisk *disk, struct nvme_ns *ns)
 		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+				       struct nvme_id_ns *id)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct request_queue *q = disk->queue;
+
+	if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
+		blk_queue_max_copy_sectors_hw(q, 0);
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+		return;
+	}
+
+	/* setting copy limits */
+	if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, q))
+		return;
+
+	blk_queue_max_copy_sectors_hw(q,
+		nvme_lba_to_sect(ns, le16_to_cpu(id->mssrl)));
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
 	return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1903,6 +2003,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	set_capacity_and_notify(disk, capacity);
 
 	nvme_config_discard(disk, ns);
+	nvme_config_copy(disk, ns, id);
 	blk_queue_max_write_zeroes_sectors(disk->queue,
 					   ns->ctrl->max_zeroes_sectors);
 }
@@ -5228,6 +5329,7 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_copy_command) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 5d57a042dbca..b2a1cf37cd92 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2794,6 +2794,11 @@ nvme_fc_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		blk_mq_end_request(rq, BLK_STS_OK);
+		return BLK_STS_OK;
+	}
 	/*
 	 * nvme core doesn't quite treat the rq opaquely. Commands such
 	 * as WRITE ZEROES will return a non-zero rq payload_bytes yet
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f9df10653f3c..17cfcfc58346 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -495,6 +495,13 @@ struct nvme_ns {
 
 };
 
+struct nvme_copy_token {
+	char subsys[4];
+	struct nvme_ns *ns;
+	u64 src_sector;
+	u64 sectors;
+};
+
 /* NVMe ns supports metadata actions by the controller (generate/strip) */
 static inline bool nvme_ns_has_pi(struct nvme_ns *ns)
 {
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0163bfa925aa..eb1ed2c8b3a2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -503,16 +503,19 @@ static inline void nvme_sq_copy_cmd(struct nvme_queue *nvmeq,
 		nvmeq->sq_tail = 0;
 }
 
-static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
+static inline void nvme_commit_sq_db(struct nvme_queue *nvmeq)
 {
-	struct nvme_queue *nvmeq = hctx->driver_data;
-
 	spin_lock(&nvmeq->sq_lock);
 	if (nvmeq->sq_tail != nvmeq->last_sq_tail)
 		nvme_write_sq_db(nvmeq, true);
 	spin_unlock(&nvmeq->sq_lock);
 }
 
+static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	nvme_commit_sq_db(hctx->driver_data);
+}
+
 static void **nvme_pci_iod_list(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -900,6 +903,12 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	if (ret)
 		return ret;
 
+	if (unlikely((req->cmd_flags & REQ_COPY) &&
+				(req_op(req) == REQ_OP_READ))) {
+		blk_mq_start_request(req);
+		return BLK_STS_OK;
+	}
+
 	if (blk_rq_nr_phys_segments(req)) {
 		ret = nvme_map_data(dev, req, &iod->cmd);
 		if (ret)
@@ -913,6 +922,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	}
 
 	blk_mq_start_request(req);
+
 	return BLK_STS_OK;
 out_unmap_data:
 	nvme_unmap_data(dev, req);
@@ -946,6 +956,18 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	ret = nvme_prep_rq(dev, req);
 	if (unlikely(ret))
 		return ret;
+	if (unlikely((req->cmd_flags & REQ_COPY) &&
+				(req_op(req) == REQ_OP_READ))) {
+		blk_mq_set_request_complete(req);
+		blk_mq_end_request(req, BLK_STS_OK);
+		/* Commit the sq if copy read was the last req in the list,
+		 * as copy read doesn't update the sq db
+		 */
+		if (bd->last)
+			nvme_commit_sq_db(nvmeq);
+		return ret;
+	}
+
 	spin_lock(&nvmeq->sq_lock);
 	nvme_sq_copy_cmd(nvmeq, &iod->cmd);
 	nvme_write_sq_db(nvmeq, bd->last);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 6e079abb22ee..693865139e3c 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2040,6 +2040,13 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		goto unmap_qe;
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		blk_mq_end_request(rq, BLK_STS_OK);
+		ret = BLK_STS_OK;
+		goto unmap_qe;
+	}
+
 	blk_mq_start_request(rq);
 
 	if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9b47dcb2a7d9..e42fb53e9dc2 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2348,6 +2348,11 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
 	if (ret)
 		return ret;
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		return BLK_STS_OK;
+	}
+
 	req->state = NVME_TCP_SEND_CMD_PDU;
 	req->status = cpu_to_le16(NVME_SC_SUCCESS);
 	req->offset = 0;
@@ -2416,6 +2421,17 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(rq);
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		blk_mq_set_request_complete(rq);
+		blk_mq_end_request(rq, BLK_STS_OK);
+		/* if copy read is the last req, queue pending tcp reqs */
+		if (bd->last && nvme_tcp_queue_more(queue))
+			queue_work_on(queue->io_cpu, nvme_tcp_wq,
+					&queue->io_work);
+		return ret;
+	}
+
 	nvme_tcp_queue_request(req, true, bd->last);
 
 	return BLK_STS_OK;
diff --git a/drivers/nvme/host/trace.c b/drivers/nvme/host/trace.c
index 1c36fcedea20..da4a7494e5a7 100644
--- a/drivers/nvme/host/trace.c
+++ b/drivers/nvme/host/trace.c
@@ -150,6 +150,23 @@ static const char *nvme_trace_read_write(struct trace_seq *p, u8 *cdw10)
 	return ret;
 }
 
+static const char *nvme_trace_copy(struct trace_seq *p, u8 *cdw10)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	u64 slba = get_unaligned_le64(cdw10);
+	u8 nr_range = get_unaligned_le16(cdw10 + 8);
+	u16 control = get_unaligned_le16(cdw10 + 10);
+	u32 dsmgmt = get_unaligned_le32(cdw10 + 12);
+	u32 reftag = get_unaligned_le32(cdw10 + 16);
+
+	trace_seq_printf(p,
+			 "slba=%llu, nr_range=%u, ctrl=0x%x, dsmgmt=%u, reftag=%u",
+			 slba, nr_range, control, dsmgmt, reftag);
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
 static const char *nvme_trace_dsm(struct trace_seq *p, u8 *cdw10)
 {
 	const char *ret = trace_seq_buffer_ptr(p);
@@ -243,6 +260,8 @@ const char *nvme_trace_parse_nvm_cmd(struct trace_seq *p,
 		return nvme_trace_zone_mgmt_send(p, cdw10);
 	case nvme_cmd_zone_mgmt_recv:
 		return nvme_trace_zone_mgmt_recv(p, cdw10);
+	case nvme_cmd_copy:
+		return nvme_trace_copy(p, cdw10);
 	default:
 		return nvme_trace_common(p, cdw10);
 	}
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 050d7d0cd81b..41349d78d410 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -336,7 +336,7 @@ struct nvme_id_ctrl {
 	__u8			nvscc;
 	__u8			nwpc;
 	__le16			acwu;
-	__u8			rsvd534[2];
+	__le16			ocfs;
 	__le32			sgls;
 	__le32			mnan;
 	__u8			rsvd544[224];
@@ -364,6 +364,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_ONCS_RESERVATIONS		= 1 << 5,
 	NVME_CTRL_ONCS_TIMESTAMP		= 1 << 6,
+	NVME_CTRL_ONCS_COPY			= 1 << 8,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
 	NVME_CTRL_OACS_NS_MNGT_SUPP		= 1 << 3,
@@ -413,7 +414,10 @@ struct nvme_id_ns {
 	__le16			npdg;
 	__le16			npda;
 	__le16			nows;
-	__u8			rsvd74[18];
+	__le16			mssrl;
+	__le32			mcl;
+	__u8			msrc;
+	__u8			rsvd91[11];
 	__le32			anagrpid;
 	__u8			rsvd96[3];
 	__u8			nsattr;
@@ -794,6 +798,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
 	nvme_cmd_resv_release	= 0x15,
+	nvme_cmd_copy		= 0x19,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
 	nvme_cmd_zone_append	= 0x7d,
@@ -815,7 +820,8 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_release),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
-		nvme_opcode_name(nvme_cmd_zone_append))
+		nvme_opcode_name(nvme_cmd_zone_append),		\
+		nvme_opcode_name(nvme_cmd_copy))
 
 
 
@@ -991,6 +997,36 @@ struct nvme_dsm_range {
 	__le64			slba;
 };
 
+struct nvme_copy_command {
+	__u8                    opcode;
+	__u8                    flags;
+	__u16                   command_id;
+	__le32                  nsid;
+	__u64                   rsvd2;
+	__le64                  metadata;
+	union nvme_data_ptr     dptr;
+	__le64                  sdlba;
+	__u8			nr_range;
+	__u8			rsvd12;
+	__le16                  control;
+	__le16                  rsvd13;
+	__le16			dspec;
+	__le32                  ilbrt;
+	__le16                  lbat;
+	__le16                  lbatm;
+};
+
+struct nvme_copy_range {
+	__le64			rsvd0;
+	__le64			slba;
+	__le16			nlb;
+	__le16			rsvd18;
+	__le32			rsvd20;
+	__le32			eilbrt;
+	__le16			elbat;
+	__le16			elbatm;
+};
+
 struct nvme_write_zeroes_cmd {
 	__u8			opcode;
 	__u8			flags;
@@ -1748,6 +1784,7 @@ struct nvme_command {
 		struct nvme_download_firmware dlfw;
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
+		struct nvme_copy_command copy;
 		struct nvme_write_zeroes_cmd write_zeroes;
 		struct nvme_zone_mgmt_send_cmd zms;
 		struct nvme_zone_mgmt_recv_cmd zmr;
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 06/10] nvmet: add copy command support for bdev and file ns
       [not found]   ` <CGME20221123061031epcas5p3745558c2caffd2fd21d15feff00495e9@epcas5p3.samsung.com>
@ 2022-11-23  5:58     ` Nitesh Shetty
       [not found]       ` <482586a3-f45d-a17b-7630-341fb0e1ee96@linux.alibaba.com>
  0 siblings, 1 reply; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty

Add support for handling the copy command on the target.
For bdev-ns we call into blkdev_issue_copy, which the block layer
completes either by an offloaded copy request to the backend bdev or by
emulating the request.

For file-ns we call vfs_copy_file_range to service our request.

Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS field.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/target/admin-cmd.c   |  9 +++-
 drivers/nvme/target/io-cmd-bdev.c | 79 +++++++++++++++++++++++++++++++
 drivers/nvme/target/io-cmd-file.c | 51 ++++++++++++++++++++
 drivers/nvme/target/loop.c        |  6 +++
 drivers/nvme/target/nvmet.h       |  2 +
 5 files changed, 145 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index c8a061ce3ee5..5ae509ff4b19 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -431,8 +431,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 	id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-			NVME_CTRL_ONCS_WRITE_ZEROES);
-
+			NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
 	/* XXX: don't report vwc if the underlying device is write through */
 	id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -534,6 +533,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
 
 	if (req->ns->bdev)
 		nvmet_bdev_set_limits(req->ns->bdev, id);
+	else {
+		id->msrc = (u8)to0based(BIO_MAX_VECS - 1);
+		id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+				(PAGE_SHIFT - SECTOR_SHIFT));
+		id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+	}
 
 	/*
 	 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index c2d6cea0236b..01f0160125fb 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,19 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 	id->npda = id->npdg;
 	/* NOWS = Namespace Optimal Write Size */
 	id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+	/* Copy limits */
+	if (bdev_max_copy_sectors(bdev)) {
+		id->msrc = id->msrc;
+		id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+				SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+		id->mcl = cpu_to_le32(id->mssrl);
+	} else {
+		id->msrc = (u8)to0based(BIO_MAX_VECS - 1);
+		id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+				bdev_logical_block_size(bdev));
+		id->mcl = cpu_to_le32(id->mssrl);
+	}
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -184,6 +197,23 @@ static void nvmet_bio_done(struct bio *bio)
 	nvmet_req_bio_put(req, bio);
 }
 
+static void nvmet_bdev_copy_end_io(void *private, int status)
+{
+	struct nvmet_req *req = (struct nvmet_req *)private;
+	int id;
+
+	if (status) {
+		for (id = 0 ; id < req->nr_range; id++) {
+			if (req->ranges[id].len != req->ranges[id].comp_len) {
+				req->cqe->result.u32 = cpu_to_le32(id);
+				break;
+			}
+		}
+	}
+	kfree(req->ranges);
+	nvmet_req_complete(req, errno_to_nvme_status(req, status));
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
 				struct sg_mapping_iter *miter)
@@ -450,6 +480,51 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
 	}
 }
 
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+	struct nvme_copy_range range;
+	struct range_entry *ranges;
+	struct nvme_command *cmnd = req->cmd;
+	sector_t dest, dest_off = 0;
+	int ret, id, nr_range;
+
+	nr_range = cmnd->copy.nr_range + 1;
+	dest = le64_to_cpu(cmnd->copy.sdlba) << req->ns->blksize_shift;
+	ranges = kmalloc_array(nr_range, sizeof(*ranges), GFP_KERNEL);
+	if (!ranges) {
+		nvmet_req_complete(req, errno_to_nvme_status(req, -ENOMEM));
+		return;
+	}
+
+	for (id = 0 ; id < nr_range; id++) {
+		ret = nvmet_copy_from_sgl(req, id * sizeof(range),
+					&range, sizeof(range));
+		if (ret)
+			goto out;
+
+		ranges[id].dst = dest + dest_off;
+		ranges[id].src = le64_to_cpu(range.slba) <<
+					req->ns->blksize_shift;
+		ranges[id].len = (le16_to_cpu(range.nlb) + 1) <<
+					req->ns->blksize_shift;
+		ranges[id].comp_len = 0;
+		dest_off += ranges[id].len;
+	}
+	req->ranges = ranges;
+	req->nr_range = nr_range;
+	ret = blkdev_issue_copy(req->ns->bdev, req->ns->bdev, ranges, nr_range,
+			nvmet_bdev_copy_end_io, (void *)req, GFP_KERNEL);
+	if (ret) {
+		for (id = 0 ; id < nr_range; id++) {
+			if (ranges[id].len != ranges[id].comp_len) {
+				req->cqe->result.u32 = cpu_to_le32(id);
+				break;
+			}
+		}
+		goto out;
+	} else
+		return;
+out:
+	kfree(ranges);
+	nvmet_req_complete(req, errno_to_nvme_status(req, ret));
+}
+
 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 {
 	switch (req->cmd->common.opcode) {
@@ -468,6 +543,10 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_bdev_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_copy:
+		req->execute = nvmet_bdev_execute_copy;
+		return 0;
+
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 64b47e2a4633..a81d38796e17 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -338,6 +338,48 @@ static void nvmet_file_dsm_work(struct work_struct *w)
 	}
 }
 
+static void nvmet_file_copy_work(struct work_struct *w)
+{
+	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
+	int nr_range;
+	loff_t pos, src;
+	struct nvme_command *cmnd = req->cmd;
+	int ret = 0, len = 0, id;
+
+	nr_range = cmnd->copy.nr_range + 1;
+	pos = le64_to_cpu(req->cmd->copy.sdlba) << req->ns->blksize_shift;
+	if (unlikely(pos + req->transfer_len > req->ns->size)) {
+		nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
+		return;
+	}
+
+	for (id = 0 ; id < nr_range; id++) {
+		struct nvme_copy_range range;
+
+		ret = nvmet_copy_from_sgl(req, id * sizeof(range), &range,
+					sizeof(range));
+		if (ret)
+			goto out;
+
+		len = (le16_to_cpu(range.nlb) + 1) << (req->ns->blksize_shift);
+		src = (le64_to_cpu(range.slba) << (req->ns->blksize_shift));
+		ret = vfs_copy_file_range(req->ns->file, src, req->ns->file,
+					pos, len, 0);
+out:
+		if (ret != len) {
+			pos += ret;
+			req->cqe->result.u32 = cpu_to_le32(id);
+			nvmet_req_complete(req, ret < 0 ?
+					errno_to_nvme_status(req, ret) :
+					errno_to_nvme_status(req, -EIO));
+			return;
+
+		} else
+			pos += len;
+	}
+	nvmet_req_complete(req, 0);
+
+}
 static void nvmet_file_execute_dsm(struct nvmet_req *req)
 {
 	if (!nvmet_check_data_len_lte(req, nvmet_dsm_len(req)))
@@ -346,6 +388,12 @@ static void nvmet_file_execute_dsm(struct nvmet_req *req)
 	queue_work(nvmet_wq, &req->f.work);
 }
 
+static void nvmet_file_execute_copy(struct nvmet_req *req)
+{
+	INIT_WORK(&req->f.work, nvmet_file_copy_work);
+	queue_work(nvmet_wq, &req->f.work);
+}
+
 static void nvmet_file_write_zeroes_work(struct work_struct *w)
 {
 	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
@@ -392,6 +440,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_file_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_copy:
+		req->execute = nvmet_file_execute_copy;
+		return 0;
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index b45fe3adf015..55802632b407 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -146,6 +146,12 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		return ret;
 
 	blk_mq_start_request(req);
+	if (unlikely((req->cmd_flags & REQ_COPY) &&
+				(req_op(req) == REQ_OP_READ))) {
+		blk_mq_set_request_complete(req);
+		blk_mq_end_request(req, BLK_STS_OK);
+		return BLK_STS_OK;
+	}
 	iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
 	iod->req.port = queue->ctrl->port;
 	if (!nvmet_req_init(&iod->req, &queue->nvme_cq,
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index dfe3894205aa..3b4c7d2ee45d 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -391,6 +391,8 @@ struct nvmet_req {
 	struct device		*p2p_client;
 	u16			error_loc;
 	u64			error_slba;
+	struct range_entry *ranges;
+	unsigned int nr_range;
 };
 
 extern struct workqueue_struct *buffered_io_wq;
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 07/10] dm: Add support for copy offload.
       [not found]   ` <CGME20221123061034epcas5p3fe90293ad08df4901f98bae2d7cfc1ba@epcas5p3.samsung.com>
@ 2022-11-23  5:58     ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty

Before enabling copy for a dm target, check if the underlying devices and
the dm target support copy. Avoid splits happening inside the dm target:
fail early if the request needs a split, since splitting a copy request is
currently not supported.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/md/dm-table.c         | 42 +++++++++++++++++++++++++++++++++++
 drivers/md/dm.c               |  7 ++++++
 include/linux/device-mapper.h |  5 +++++
 3 files changed, 54 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 078da18bb86d..b2073e857a74 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1875,6 +1875,39 @@ static bool dm_table_supports_nowait(struct dm_table *t)
 	return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+				      sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return !blk_queue_copy(q);
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+	struct dm_target *ti;
+	unsigned int i;
+
+	for (i = 0; i < t->num_targets; i++) {
+		ti = dm_table_get_target(t, i);
+
+		if (!ti->copy_offload_supported)
+			return false;
+
+		/*
+		 * target provides copy support (as implied by setting
+		 * 'copy_offload_supported')
+		 * and it relies on _all_ data devices having copy support.
+		 */
+		if (!ti->type->iterate_devices ||
+		     ti->type->iterate_devices(ti,
+			     device_not_copy_capable, NULL))
+			return false;
+	}
+
+	return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
 				      sector_t start, sector_t len, void *data)
 {
@@ -1957,6 +1990,15 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 		q->limits.discard_misaligned = 0;
 	}
 
+	if (!dm_table_supports_copy(t)) {
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+		/* Must also clear copy limits... */
+		q->limits.max_copy_sectors = 0;
+		q->limits.max_copy_sectors_hw = 0;
+	} else {
+		blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+	}
+
 	if (!dm_table_supports_secure_erase(t))
 		q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e1ea3a7bd9d9..713335995290 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1690,6 +1690,13 @@ static blk_status_t __split_and_process_bio(struct clone_info *ci)
 	if (unlikely(ci->is_abnormal_io))
 		return __process_abnormal_io(ci, ti);
 
+	if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+			max_io_len(ti, ci->sector) < ci->sector_count)) {
+		DMERR("Error, IO size(%u) > max target size(%llu)\n",
+			ci->sector_count, max_io_len(ti, ci->sector));
+		return BLK_STS_IOERR;
+	}
+
 	/*
 	 * Only support bio polling for normal IO, and the target io is
 	 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 04c6acf7faaa..da4e77e81011 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -379,6 +379,11 @@ struct dm_target {
 	 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 	 */
 	bool needs_bio_set_dev:1;
+
+	/*
+	 * copy offload is supported
+	 */
+	bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 08/10] dm: Enable copy offload for dm-linear target
       [not found]   ` <CGME20221123061037epcas5p4d57436204fbe0065819b156eeeddbfac@epcas5p4.samsung.com>
@ 2022-11-23  5:58     ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty

Set the copy_offload_supported flag to enable copy offload for the
dm-linear target.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/md/dm-linear.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 3212ef6aa81b..b4b57bead495 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -61,6 +61,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->num_discard_bios = 1;
 	ti->num_secure_erase_bios = 1;
 	ti->num_write_zeroes_bios = 1;
+	ti->copy_offload_supported = 1;
 	ti->private = lc;
 	return 0;
 
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 09/10] dm kcopyd: use copy offload support
       [not found]   ` <CGME20221123061041epcas5p4413569a46ee730cd3033a9025c8f134a@epcas5p4.samsung.com>
@ 2022-11-23  5:58     ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty

Introduce a copy_jobs list to use copy offload if it is supported by the
underlying devices; otherwise fall back to the existing method.

run_copy_job() calls the block layer copy offload API if both the source and
destination request queues are the same and support copy offload.
On successful completion, the count of each copied destination region is set
to zero; failed regions are processed via the existing method.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/md/dm-kcopyd.c | 56 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 50 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 4d3bbbea2e9a..2f9985f671ac 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -74,18 +74,20 @@ struct dm_kcopyd_client {
 	atomic_t nr_jobs;
 
 /*
- * We maintain four lists of jobs:
+ * We maintain five lists of jobs:
  *
- * i)   jobs waiting for pages
- * ii)  jobs that have pages, and are waiting for the io to be issued.
- * iii) jobs that don't need to do any IO and just run a callback
- * iv) jobs that have completed.
+ * i)    jobs waiting to try copy offload
+ * ii)   jobs waiting for pages
+ * iii)  jobs that have pages, and are waiting for the io to be issued.
+ * iv)   jobs that don't need to do any IO and just run a callback
+ * v)    jobs that have completed.
  *
- * All four of these are protected by job_lock.
+ * All five of these are protected by job_lock.
  */
 	spinlock_t job_lock;
 	struct list_head callback_jobs;
 	struct list_head complete_jobs;
+	struct list_head copy_jobs;
 	struct list_head io_jobs;
 	struct list_head pages_jobs;
 };
@@ -579,6 +581,43 @@ static int run_io_job(struct kcopyd_job *job)
 	return r;
 }
 
+static int run_copy_job(struct kcopyd_job *job)
+{
+	int r, i, count = 0;
+	struct range_entry range;
+
+	struct request_queue *src_q, *dest_q;
+
+	for (i = 0; i < job->num_dests; i++) {
+		range.dst = job->dests[i].sector << SECTOR_SHIFT;
+		range.src = job->source.sector << SECTOR_SHIFT;
+		range.len = job->source.count << SECTOR_SHIFT;
+
+		src_q = bdev_get_queue(job->source.bdev);
+		dest_q = bdev_get_queue(job->dests[i].bdev);
+
+		if (src_q != dest_q || !blk_queue_copy(src_q))
+			break;
+
+		r = blkdev_issue_copy(job->source.bdev, job->dests[i].bdev,
+				&range, 1, NULL, NULL, GFP_KERNEL);
+		if (r)
+			break;
+
+		job->dests[i].count = 0;
+		count++;
+	}
+
+	if (count == job->num_dests) {
+		push(&job->kc->complete_jobs, job);
+	} else {
+		push(&job->kc->pages_jobs, job);
+		r = 0;
+	}
+
+	return r;
+}
+
 static int run_pages_job(struct kcopyd_job *job)
 {
 	int r;
@@ -659,6 +698,7 @@ static void do_work(struct work_struct *work)
 	spin_unlock_irq(&kc->job_lock);
 
 	blk_start_plug(&plug);
+	process_jobs(&kc->copy_jobs, kc, run_copy_job);
 	process_jobs(&kc->complete_jobs, kc, run_complete_job);
 	process_jobs(&kc->pages_jobs, kc, run_pages_job);
 	process_jobs(&kc->io_jobs, kc, run_io_job);
@@ -676,6 +716,8 @@ static void dispatch_job(struct kcopyd_job *job)
 	atomic_inc(&kc->nr_jobs);
 	if (unlikely(!job->source.count))
 		push(&kc->callback_jobs, job);
+	else if (job->source.bdev->bd_disk == job->dests[0].bdev->bd_disk)
+		push(&kc->copy_jobs, job);
 	else if (job->pages == &zero_page_list)
 		push(&kc->io_jobs, job);
 	else
@@ -916,6 +958,7 @@ struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *thro
 	spin_lock_init(&kc->job_lock);
 	INIT_LIST_HEAD(&kc->callback_jobs);
 	INIT_LIST_HEAD(&kc->complete_jobs);
+	INIT_LIST_HEAD(&kc->copy_jobs);
 	INIT_LIST_HEAD(&kc->io_jobs);
 	INIT_LIST_HEAD(&kc->pages_jobs);
 	kc->throttle = throttle;
@@ -971,6 +1014,7 @@ void dm_kcopyd_client_destroy(struct dm_kcopyd_client *kc)
 
 	BUG_ON(!list_empty(&kc->callback_jobs));
 	BUG_ON(!list_empty(&kc->complete_jobs));
+	WARN_ON(!list_empty(&kc->copy_jobs));
 	BUG_ON(!list_empty(&kc->io_jobs));
 	BUG_ON(!list_empty(&kc->pages_jobs));
 	destroy_workqueue(kc->kcopyd_wq);
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 10/10] fs: add support for copy file range in zonefs
       [not found]   ` <CGME20221123061044epcas5p2ac082a91fc8197821f29e84278b6203c@epcas5p2.samsung.com>
@ 2022-11-23  5:58     ` Nitesh Shetty
  2022-11-23  6:53       ` Amir Goldstein
  2022-11-24  1:32       ` Damien Le Moal
  0 siblings, 2 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  5:58 UTC (permalink / raw)
  To: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev, Nitesh Shetty

copy_file_range is implemented using copy offload; copy offloading to the
device is always enabled. To disable copy offloading, mount with the
"no_copy_offload" mount option. At present copy offload is only used if the
source and destination files are on the same block device; otherwise the
request is completed by generic copy file range.

copy file range is implemented as follows:
	- flush pending writes on the src and dest files
	- drop the page cache for the dest file if it is a conv zone
	- copy the range using offload
	- update the dest file info

For all failure cases we fall back to generic copy file range.
At present this implementation does not support conv aggregation.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/zonefs/super.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 179 insertions(+)
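
As a usage illustration only (not part of the patch), a minimal userspace
sketch that exercises this path by copying within one zonefs mount; the file
paths, offsets and copy length below are assumptions:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical zonefs files; adjust paths to the actual mount */
	int src = open("/mnt/zonefs/seq/0", O_RDONLY);
	int dst = open("/mnt/zonefs/seq/1", O_WRONLY);
	loff_t src_off = 0, dst_off = 0;	/* dst_off: write pointer of the dest zone */
	ssize_t ret;

	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* copy 1 MiB; zonefs may complete this via copy offload or emulation */
	ret = copy_file_range(src, &src_off, dst, &dst_off, 1 << 20, 0);
	if (ret < 0)
		perror("copy_file_range");
	else
		printf("copied %zd bytes\n", ret);

	close(src);
	close(dst);
	return ret < 0;
}

Note that for a sequential zone file dst_off must equal the zone's write
pointer, otherwise the checks added above fail the copy.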

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index abc9a85106f2..15613433d4ae 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -1223,6 +1223,183 @@ static int zonefs_file_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static int zonefs_is_file_copy_offset_ok(struct inode *src_inode,
+		struct inode *dst_inode, loff_t src_off, loff_t dst_off,
+		size_t *len)
+{
+	loff_t size, endoff;
+	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
+
+	inode_lock(src_inode);
+	size = i_size_read(src_inode);
+	inode_unlock(src_inode);
+	/* Don't copy beyond source file EOF. */
+	if (src_off < size) {
+		if (src_off + *len > size)
+			*len = size - src_off;
+	} else
+		*len = 0;
+
+	mutex_lock(&dst_zi->i_truncate_mutex);
+	if (dst_zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
+		if (*len > dst_zi->i_max_size - dst_zi->i_wpoffset)
+			*len = dst_zi->i_max_size - dst_zi->i_wpoffset;
+
+		if (dst_off != dst_zi->i_wpoffset)
+			goto err;
+	}
+	mutex_unlock(&dst_zi->i_truncate_mutex);
+
+	endoff = dst_off + *len;
+	inode_lock(dst_inode);
+	if (endoff > dst_zi->i_max_size ||
+			inode_newsize_ok(dst_inode, endoff)) {
+		inode_unlock(dst_inode);
+		goto err;
+	}
+	inode_unlock(dst_inode);
+
+	return 0;
+err:
+	mutex_unlock(&dst_zi->i_truncate_mutex);
+	return -EINVAL;
+}
+
+static ssize_t zonefs_issue_copy(struct zonefs_inode_info *src_zi,
+		loff_t src_off, struct zonefs_inode_info *dst_zi,
+		loff_t dst_off, size_t len)
+{
+	struct block_device *src_bdev = src_zi->i_vnode.i_sb->s_bdev;
+	struct block_device *dst_bdev = dst_zi->i_vnode.i_sb->s_bdev;
+	struct range_entry *rlist = NULL;
+	int ret = len;
+
+	rlist = kmalloc(sizeof(*rlist), GFP_KERNEL);
+	if (!rlist)
+		return -ENOMEM;
+
+	rlist[0].dst = (dst_zi->i_zsector << SECTOR_SHIFT) + dst_off;
+	rlist[0].src = (src_zi->i_zsector << SECTOR_SHIFT) + src_off;
+	rlist[0].len = len;
+	rlist[0].comp_len = 0;
+	ret = blkdev_issue_copy(src_bdev, dst_bdev, rlist, 1, NULL, NULL,
+			GFP_KERNEL);
+	if (rlist[0].comp_len > 0)
+		ret = rlist[0].comp_len;
+	kfree(rlist);
+
+	return ret;
+}
+
+/* Returns length of possible copy, else returns error */
+static ssize_t zonefs_copy_file_checks(struct file *src_file, loff_t src_off,
+					struct file *dst_file, loff_t dst_off,
+					size_t *len, unsigned int flags)
+{
+	struct inode *src_inode = file_inode(src_file);
+	struct inode *dst_inode = file_inode(dst_file);
+	struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
+	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
+	ssize_t ret;
+
+	if (src_inode->i_sb != dst_inode->i_sb)
+		return -EXDEV;
+
+	/* Start by sync'ing the source file for conv zones */
+	if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
+		ret = file_write_and_wait_range(src_file, src_off,
+				(src_off + *len));
+		if (ret < 0)
+			goto io_error;
+	}
+	inode_dio_wait(src_inode);
+
+	/* Now sync the destination file for conv zones */
+	if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
+		ret = file_write_and_wait_range(dst_file, dst_off,
+				(dst_off + *len));
+		if (ret < 0)
+			goto io_error;
+	}
+	inode_dio_wait(dst_inode);
+
+	/* Drop dst file cached pages for a conv zone */
+	if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
+		ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
+				dst_off >> PAGE_SHIFT,
+				(dst_off + *len) >> PAGE_SHIFT);
+		if (ret < 0)
+			goto io_error;
+	}
+
+	ret = zonefs_is_file_copy_offset_ok(src_inode, dst_inode, src_off,
+			dst_off, len);
+	if (ret < 0)
+		return ret;
+
+	return *len;
+
+io_error:
+	zonefs_io_error(dst_inode, true);
+	return ret;
+}
+
+static ssize_t zonefs_copy_file(struct file *src_file, loff_t src_off,
+		struct file *dst_file, loff_t dst_off,
+		size_t len, unsigned int flags)
+{
+	struct inode *src_inode = file_inode(src_file);
+	struct inode *dst_inode = file_inode(dst_file);
+	struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
+	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
+	ssize_t ret = 0, bytes;
+
+	inode_lock(src_inode);
+	inode_lock(dst_inode);
+	bytes = zonefs_issue_copy(src_zi, src_off, dst_zi, dst_off, len);
+	if (bytes < 0)
+		goto unlock_exit;
+
+	ret += bytes;
+
+	file_update_time(dst_file);
+	mutex_lock(&dst_zi->i_truncate_mutex);
+	zonefs_update_stats(dst_inode, dst_off + bytes);
+	zonefs_i_size_write(dst_inode, dst_off + bytes);
+	dst_zi->i_wpoffset += bytes;
+	mutex_unlock(&dst_zi->i_truncate_mutex);
+	/* if we still have some bytes left, do splice copy */
+	if (bytes && (bytes < len)) {
+		bytes = do_splice_direct(src_file, &src_off, dst_file,
+					 &dst_off, len, flags);
+		if (bytes > 0)
+			ret += bytes;
+	}
+unlock_exit:
+	if (ret < 0)
+		zonefs_io_error(dst_inode, true);
+	inode_unlock(src_inode);
+	inode_unlock(dst_inode);
+	return ret;
+}
+
+static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off,
+				      struct file *dst_file, loff_t dst_off,
+				      size_t len, unsigned int flags)
+{
+	ssize_t ret = -EIO;
+
+	ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off,
+				     &len, flags);
+	if (ret > 0)
+		ret = zonefs_copy_file(src_file, src_off, dst_file, dst_off,
+				     len, flags);
+	else if (ret < 0 && ret == -EXDEV)
+		ret = generic_copy_file_range(src_file, src_off, dst_file,
+					      dst_off, len, flags);
+	return ret;
+}
+
 static const struct file_operations zonefs_file_operations = {
 	.open		= zonefs_file_open,
 	.release	= zonefs_file_release,
@@ -1234,6 +1411,7 @@ static const struct file_operations zonefs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.iopoll		= iocb_bio_iopoll,
+	.copy_file_range = zonefs_copy_file_range,
 };
 
 static struct kmem_cache *zonefs_inode_cachep;
@@ -1804,6 +1982,7 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	atomic_set(&sbi->s_active_seq_files, 0);
 	sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev);
 
+	/* set copy support by default */
 	ret = zonefs_read_super(sb);
 	if (ret)
 		return ret;
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 10/10] fs: add support for copy file range in zonefs
  2022-11-23  5:58     ` [PATCH v5 10/10] fs: add support for copy file range in zonefs Nitesh Shetty
@ 2022-11-23  6:53       ` Amir Goldstein
  2022-11-23 10:13         ` Nitesh Shetty
  2022-11-24  1:32       ` Damien Le Moal
  1 sibling, 1 reply; 32+ messages in thread
From: Amir Goldstein @ 2022-11-23  6:53 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro, linux-block,
	linux-kernel, linux-nvme, linux-fsdevel, anuj20.g, joshi.k,
	p.raghav, nitheshshetty, gost.dev

On Wed, Nov 23, 2022 at 8:26 AM Nitesh Shetty <nj.shetty@samsung.com> wrote:
>
> copy_file_range is implemented using copy offload,
> copy offloading to device is always enabled.
> To disable copy offloading mount with "no_copy_offload" mount option.
> At present copy offload is only used, if the source and destination files
> are on same block device, otherwise copy file range is completed by
> generic copy file range.
>
> copy file range implemented as following:
>         - write pending writes on the src and dest files
>         - drop page cache for dest file if its conv zone
>         - copy the range using offload
>         - update dest file info
>
> For all failure cases we fallback to generic file copy range
> At present this implementation does not support conv aggregation
>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>  fs/zonefs/super.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 179 insertions(+)
>
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index abc9a85106f2..15613433d4ae 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -1223,6 +1223,183 @@ static int zonefs_file_release(struct inode *inode, struct file *file)
>         return 0;
>  }
>
> +static int zonefs_is_file_copy_offset_ok(struct inode *src_inode,
> +               struct inode *dst_inode, loff_t src_off, loff_t dst_off,
> +               size_t *len)
> +{
> +       loff_t size, endoff;
> +       struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> +
> +       inode_lock(src_inode);
> +       size = i_size_read(src_inode);
> +       inode_unlock(src_inode);
> +       /* Don't copy beyond source file EOF. */
> +       if (src_off < size) {
> +               if (src_off + *len > size)
> +                       *len = (size - (src_off + *len));
> +       } else
> +               *len = 0;
> +
> +       mutex_lock(&dst_zi->i_truncate_mutex);
> +       if (dst_zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
> +               if (*len > dst_zi->i_max_size - dst_zi->i_wpoffset)
> +                       *len -= dst_zi->i_max_size - dst_zi->i_wpoffset;
> +
> +               if (dst_off != dst_zi->i_wpoffset)
> +                       goto err;
> +       }
> +       mutex_unlock(&dst_zi->i_truncate_mutex);
> +
> +       endoff = dst_off + *len;
> +       inode_lock(dst_inode);
> +       if (endoff > dst_zi->i_max_size ||
> +                       inode_newsize_ok(dst_inode, endoff)) {
> +               inode_unlock(dst_inode);
> +               goto err;
> +       }
> +       inode_unlock(dst_inode);
> +
> +       return 0;
> +err:
> +       mutex_unlock(&dst_zi->i_truncate_mutex);
> +       return -EINVAL;
> +}
> +
> +static ssize_t zonefs_issue_copy(struct zonefs_inode_info *src_zi,
> +               loff_t src_off, struct zonefs_inode_info *dst_zi,
> +               loff_t dst_off, size_t len)
> +{
> +       struct block_device *src_bdev = src_zi->i_vnode.i_sb->s_bdev;
> +       struct block_device *dst_bdev = dst_zi->i_vnode.i_sb->s_bdev;
> +       struct range_entry *rlist = NULL;
> +       int ret = len;
> +
> +       rlist = kmalloc(sizeof(*rlist), GFP_KERNEL);
> +       if (!rlist)
> +               return -ENOMEM;
> +
> +       rlist[0].dst = (dst_zi->i_zsector << SECTOR_SHIFT) + dst_off;
> +       rlist[0].src = (src_zi->i_zsector << SECTOR_SHIFT) + src_off;
> +       rlist[0].len = len;
> +       rlist[0].comp_len = 0;
> +       ret = blkdev_issue_copy(src_bdev, dst_bdev, rlist, 1, NULL, NULL,
> +                       GFP_KERNEL);
> +       if (rlist[0].comp_len > 0)
> +               ret = rlist[0].comp_len;
> +       kfree(rlist);
> +
> +       return ret;
> +}
> +
> +/* Returns length of possible copy, else returns error */
> +static ssize_t zonefs_copy_file_checks(struct file *src_file, loff_t src_off,
> +                                       struct file *dst_file, loff_t dst_off,
> +                                       size_t *len, unsigned int flags)
> +{
> +       struct inode *src_inode = file_inode(src_file);
> +       struct inode *dst_inode = file_inode(dst_file);
> +       struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
> +       struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> +       ssize_t ret;
> +
> +       if (src_inode->i_sb != dst_inode->i_sb)
> +               return -EXDEV;
> +
> +       /* Start by sync'ing the source and destination files for conv zones */
> +       if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> +               ret = file_write_and_wait_range(src_file, src_off,
> +                               (src_off + *len));
> +               if (ret < 0)
> +                       goto io_error;
> +       }
> +       inode_dio_wait(src_inode);
> +
> +       /* Start by sync'ing the source and destination files ifor conv zones */
> +       if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> +               ret = file_write_and_wait_range(dst_file, dst_off,
> +                               (dst_off + *len));
> +               if (ret < 0)
> +                       goto io_error;
> +       }
> +       inode_dio_wait(dst_inode);
> +
> +       /* Drop dst file cached pages for a conv zone*/
> +       if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> +               ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
> +                               dst_off >> PAGE_SHIFT,
> +                               (dst_off + *len) >> PAGE_SHIFT);
> +               if (ret < 0)
> +                       goto io_error;
> +       }
> +
> +       ret = zonefs_is_file_copy_offset_ok(src_inode, dst_inode, src_off,
> +                       dst_off, len);
> +       if (ret < 0)
> +               return ret;
> +
> +       return *len;
> +
> +io_error:
> +       zonefs_io_error(dst_inode, true);
> +       return ret;
> +}
> +
> +static ssize_t zonefs_copy_file(struct file *src_file, loff_t src_off,
> +               struct file *dst_file, loff_t dst_off,
> +               size_t len, unsigned int flags)
> +{
> +       struct inode *src_inode = file_inode(src_file);
> +       struct inode *dst_inode = file_inode(dst_file);
> +       struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
> +       struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> +       ssize_t ret = 0, bytes;
> +
> +       inode_lock(src_inode);
> +       inode_lock(dst_inode);
> +       bytes = zonefs_issue_copy(src_zi, src_off, dst_zi, dst_off, len);
> +       if (bytes < 0)
> +               goto unlock_exit;
> +
> +       ret += bytes;
> +
> +       file_update_time(dst_file);
> +       mutex_lock(&dst_zi->i_truncate_mutex);
> +       zonefs_update_stats(dst_inode, dst_off + bytes);
> +       zonefs_i_size_write(dst_inode, dst_off + bytes);
> +       dst_zi->i_wpoffset += bytes;
> +       mutex_unlock(&dst_zi->i_truncate_mutex);
> +       /* if we still have some bytes left, do splice copy */
> +       if (bytes && (bytes < len)) {
> +               bytes = do_splice_direct(src_file, &src_off, dst_file,
> +                                        &dst_off, len, flags);
> +               if (bytes > 0)
> +                       ret += bytes;
> +       }
> +unlock_exit:
> +       if (ret < 0)
> +               zonefs_io_error(dst_inode, true);
> +       inode_unlock(src_inode);
> +       inode_unlock(dst_inode);
> +       return ret;
> +}
> +
> +static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off,
> +                                     struct file *dst_file, loff_t dst_off,
> +                                     size_t len, unsigned int flags)
> +{
> +       ssize_t ret = -EIO;
> +
> +       ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off,
> +                                    &len, flags);
> +       if (ret > 0)
> +               ret = zonefs_copy_file(src_file, src_off, dst_file, dst_off,
> +                                    len, flags);
> +       else if (ret < 0 && ret == -EXDEV)

First of all, ret < 0 is redundant.

> +               ret = generic_copy_file_range(src_file, src_off, dst_file,
> +                                             dst_off, len, flags);

But more importantly, why do you want to fall back to
do_splice_direct() in zonefs copy_file_range?
How does it serve your patch set or the prospective consumers
of zonefs copy_file_range?

The reason I am asking is because commit 5dae222a5ff0
("vfs: allow copy_file_range to copy across devices")
turned out to be an API mistake that was later reverted by
868f9f2f8e00 ("vfs: fix copy_file_range() regression in cross-fs copies")

It is always better to return EXDEV to userspace, which can
always fall back to splice itself, but maybe it has something
smarter to do.
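
For instance, a minimal userspace sketch of that fallback (fallback_copy()
here is a hypothetical read/write or splice loop, not an existing helper):

#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

static ssize_t copy_with_fallback(int src_fd, int dst_fd, size_t len)
{
	ssize_t n = copy_file_range(src_fd, NULL, dst_fd, NULL, len, 0);

	/* hypothetical helper: plain read/write or splice loop */
	if (n < 0 && errno == EXDEV)
		n = fallback_copy(src_fd, dst_fd, len);
	return n;
}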

The place where it made sense for the kernel to fall back to
direct splice was network servers' server-side copy, but that
is independent of any specific filesystem copy_file_range()
implementation.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/10] block: Add copy offload support infrastructure
  2022-11-23  5:58     ` [PATCH v5 02/10] block: Add copy offload support infrastructure Nitesh Shetty
@ 2022-11-23  8:04       ` Ming Lei
  2022-11-23 10:07         ` Nitesh Shetty
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2022-11-23  8:04 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro, linux-block,
	linux-kernel, linux-nvme, linux-fsdevel, anuj20.g, joshi.k,
	p.raghav, nitheshshetty, gost.dev, ming.lei

On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> Introduce blkdev_issue_copy which supports source and destination bdevs,
> and an array of (source, destination and copy length) tuples.
> Introduce REQ_COPY copy offload operation flag. Create a read-write
> bio pair with a token as payload and submitted to the device in order.
> Read request populates token with source specific information which
> is then passed with write request.
> This design is courtesy Mikulas Patocka's token based copy

I thought this patchset was just for enabling the copy command which is
supported by hardware. But it turns out it isn't, because blk_copy_offload()
still submits read/write bios for doing the copy.

I am just wondering why not let copy_file_range() cover this kind of copy,
since the framework is already there.

When I was researching the pipe/splice code for supporting ublk zero copy[1], I
got ideas for an async copy_file_range(), such as io_uring based direct splice
and a user-backed intermediate buffer, still zero copy. If these ideas are
finally implemented, we could get a super-fast generic offload copy, and bdev
copy would really be covered too.

[1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@redhat.com/

thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/10] nvmet: add copy command support for bdev and file ns
       [not found]       ` <482586a3-f45d-a17b-7630-341fb0e1ee96@linux.alibaba.com>
@ 2022-11-23  9:39         ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23  9:39 UTC (permalink / raw)
  To: Guixin Liu
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro, linux-block,
	linux-kernel, linux-nvme, linux-fsdevel, anuj20.g, joshi.k,
	p.raghav, nitheshshetty, gost.dev

On Wed, Nov 23, 2022 at 04:17:41PM +0800, Guixin Liu wrote:
> 
> On 2022/11/23 13:58, Nitesh Shetty wrote:
> > Add support for handling target command on target.
> > For bdev-ns we call into blkdev_issue_copy, which the block layer
> > completes by a offloaded copy request to backend bdev or by emulating the
> > request.
> > 
> > For file-ns we call vfs_copy_file_range to service our request.
> > 
> > Currently target always shows copy capability by setting
> > NVME_CTRL_ONCS_COPY in controller ONCS.
> > 
> > Signed-off-by: Nitesh Shetty<nj.shetty@samsung.com>
> > Signed-off-by: Anuj Gupta<anuj20.g@samsung.com>
> > ---
> >   drivers/nvme/target/admin-cmd.c   |  9 +++-
> >   drivers/nvme/target/io-cmd-bdev.c | 79 +++++++++++++++++++++++++++++++
> >   drivers/nvme/target/io-cmd-file.c | 51 ++++++++++++++++++++
> >   drivers/nvme/target/loop.c        |  6 +++
> >   drivers/nvme/target/nvmet.h       |  2 +
> >   5 files changed, 145 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
> > index c8a061ce3ee5..5ae509ff4b19 100644
> > --- a/drivers/nvme/target/admin-cmd.c
> > +++ b/drivers/nvme/target/admin-cmd.c
> > @@ -431,8 +431,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
> >   	id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
> >   	id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
> >   	id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
> > -			NVME_CTRL_ONCS_WRITE_ZEROES);
> > -
> > +			NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
> >   	/* XXX: don't report vwc if the underlying device is write through */
> >   	id->vwc = NVME_CTRL_VWC_PRESENT;
> > @@ -534,6 +533,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
> >   	if (req->ns->bdev)
> >   		nvmet_bdev_set_limits(req->ns->bdev, id);
> > +	else {
> > +		id->msrc = (u8)to0based(BIO_MAX_VECS - 1);
> > +		id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
> > +				(PAGE_SHIFT - SECTOR_SHIFT));
> > +		id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
> > +	}
> >   	/*
> >   	 * We just provide a single LBA format that matches what the
> > diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
> > index c2d6cea0236b..01f0160125fb 100644
> > --- a/drivers/nvme/target/io-cmd-bdev.c
> > +++ b/drivers/nvme/target/io-cmd-bdev.c
> > @@ -46,6 +46,19 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
> >   	id->npda = id->npdg;
> >   	/* NOWS = Namespace Optimal Write Size */
> >   	id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
> > +
> > +	/*Copy limits*/
> > +	if (bdev_max_copy_sectors(bdev)) {
> > +		id->msrc = id->msrc;
> > +		id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
> > +				SECTOR_SHIFT) / bdev_logical_block_size(bdev));
> > +		id->mcl = cpu_to_le32(id->mssrl);
> > +	} else {
> > +		id->msrc = (u8)to0based(BIO_MAX_VECS - 1);
> > +		id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
> > +				bdev_logical_block_size(bdev));
> > +		id->mcl = cpu_to_le32(id->mssrl);
> > +	}
> 
> Based on my understanding of the NVMe 2.0 protocol, mssrl is the max
> length per single range entry, and mcl is the total max copy length in
> one copy command. May I ask why mcl = mssrl and not mcl = mssrl * msrc?
> 
> Best Regards,
> 
> Guixin Liu
>

You are right; as per the NVMe spec, mcl >= mssrl. Since we decided to make
copy offload generic for NVMe/XCOPY/copy across namespaces, we went with the
2 bio/bdev design, which is compatible with device mapper.
So effectively we are using a single range (msrc). When using only one range,
I feel it makes sense to report one of the limits, so we went with mssrl.
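
As a rough illustration (assuming BIO_MAX_VECS == 256, 4 KiB pages and a
512-byte logical block size), the emulation path would then advertise
mssrl = (256 << 12) / 512 = 2048 blocks, i.e. 1 MiB per range, and report
mcl = 2048 as well, since only a single range is used per copy.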

Thanks,
Nitesh

> >   }
> >   void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
> > @@ -184,6 +197,23 @@ static void nvmet_bio_done(struct bio *bio)
> >   	nvmet_req_bio_put(req, bio);
> >   }
> > +static void nvmet_bdev_copy_end_io(void *private, int status)
> > +{
> > +	struct nvmet_req *req = (struct nvmet_req *)private;
> > +	int id;
> > +
> > +	if (status) {
> > +		for (id = 0 ; id < req->nr_range; id++) {
> > +			if (req->ranges[id].len != req->ranges[id].comp_len) {
> > +				req->cqe->result.u32 = cpu_to_le32(id);
> > +				break;
> > +			}
> > +		}
> > +	}
> > +	kfree(req->ranges);
> > +	nvmet_req_complete(req, errno_to_nvme_status(req, status));
> > +}
> > +
> >   #ifdef CONFIG_BLK_DEV_INTEGRITY
> >   static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
> >   				struct sg_mapping_iter *miter)
> > @@ -450,6 +480,51 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
> >   	}
> >   }
> > +static void nvmet_bdev_execute_copy(struct nvmet_req *req)
> > +{
> > +	struct nvme_copy_range range;
> > +	struct range_entry *ranges;
> > +	struct nvme_command *cmnd = req->cmd;
> > +	sector_t dest, dest_off = 0;
> > +	int ret, id, nr_range;
> > +
> > +	nr_range = cmnd->copy.nr_range + 1;
> > +	dest = le64_to_cpu(cmnd->copy.sdlba) << req->ns->blksize_shift;
> > +	ranges = kmalloc_array(nr_range, sizeof(*ranges), GFP_KERNEL);
> > +
> > +	for (id = 0 ; id < nr_range; id++) {
> > +		ret = nvmet_copy_from_sgl(req, id * sizeof(range),
> > +					&range, sizeof(range));
> > +		if (ret)
> > +			goto out;
> > +
> > +		ranges[id].dst = dest + dest_off;
> > +		ranges[id].src = le64_to_cpu(range.slba) <<
> > +					req->ns->blksize_shift;
> > +		ranges[id].len = (le16_to_cpu(range.nlb) + 1) <<
> > +					req->ns->blksize_shift;
> > +		ranges[id].comp_len = 0;
> > +		dest_off += ranges[id].len;
> > +	}
> > +	req->ranges = ranges;
> > +	req->nr_range = nr_range;
> > +	ret = blkdev_issue_copy(req->ns->bdev, req->ns->bdev, ranges, nr_range,
> > +			nvmet_bdev_copy_end_io, (void *)req, GFP_KERNEL);
> > +	if (ret) {
> > +		for (id = 0 ; id < nr_range; id++) {
> > +			if (ranges[id].len != ranges[id].comp_len) {
> > +				req->cqe->result.u32 = cpu_to_le32(id);
> > +				break;
> > +			}
> > +		}
> > +		goto out;
> > +	} else
> > +		return;
> > +out:
> > +	kfree(ranges);
> > +	nvmet_req_complete(req, errno_to_nvme_status(req, ret));
> > +}
> > +
> >   u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
> >   {
> >   	switch (req->cmd->common.opcode) {
> > @@ -468,6 +543,10 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
> >   	case nvme_cmd_write_zeroes:
> >   		req->execute = nvmet_bdev_execute_write_zeroes;
> >   		return 0;
> > +	case nvme_cmd_copy:
> > +		req->execute = nvmet_bdev_execute_copy;
> > +		return 0;
> > +
> >   	default:
> >   		return nvmet_report_invalid_opcode(req);
> >   	}
> > diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
> > index 64b47e2a4633..a81d38796e17 100644
> > --- a/drivers/nvme/target/io-cmd-file.c
> > +++ b/drivers/nvme/target/io-cmd-file.c
> > @@ -338,6 +338,48 @@ static void nvmet_file_dsm_work(struct work_struct *w)
> >   	}
> >   }
> > +static void nvmet_file_copy_work(struct work_struct *w)
> > +{
> > +	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
> > +	int nr_range;
> > +	loff_t pos;
> > +	struct nvme_command *cmnd = req->cmd;
> > +	int ret = 0, len = 0, src, id;
> > +
> > +	nr_range = cmnd->copy.nr_range + 1;
> > +	pos = le64_to_cpu(req->cmd->copy.sdlba) << req->ns->blksize_shift;
> > +	if (unlikely(pos + req->transfer_len > req->ns->size)) {
> > +		nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
> > +		return;
> > +	}
> > +
> > +	for (id = 0 ; id < nr_range; id++) {
> > +		struct nvme_copy_range range;
> > +
> > +		ret = nvmet_copy_from_sgl(req, id * sizeof(range), &range,
> > +					sizeof(range));
> > +		if (ret)
> > +			goto out;
> > +
> > +		len = (le16_to_cpu(range.nlb) + 1) << (req->ns->blksize_shift);
> > +		src = (le64_to_cpu(range.slba) << (req->ns->blksize_shift));
> > +		ret = vfs_copy_file_range(req->ns->file, src, req->ns->file,
> > +					pos, len, 0);
> > +out:
> > +		if (ret != len) {
> > +			pos += ret;
> > +			req->cqe->result.u32 = cpu_to_le32(id);
> > +			nvmet_req_complete(req, ret < 0 ?
> > +					errno_to_nvme_status(req, ret) :
> > +					errno_to_nvme_status(req, -EIO));
> > +			return;
> > +
> > +		} else
> > +			pos += len;
> > +}
> > +	nvmet_req_complete(req, ret);
> > +
> > +}
> >   static void nvmet_file_execute_dsm(struct nvmet_req *req)
> >   {
> >   	if (!nvmet_check_data_len_lte(req, nvmet_dsm_len(req)))
> > @@ -346,6 +388,12 @@ static void nvmet_file_execute_dsm(struct nvmet_req *req)
> >   	queue_work(nvmet_wq, &req->f.work);
> >   }
> > +static void nvmet_file_execute_copy(struct nvmet_req *req)
> > +{
> > +	INIT_WORK(&req->f.work, nvmet_file_copy_work);
> > +	queue_work(nvmet_wq, &req->f.work);
> > +}
> > +
> >   static void nvmet_file_write_zeroes_work(struct work_struct *w)
> >   {
> >   	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
> > @@ -392,6 +440,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
> >   	case nvme_cmd_write_zeroes:
> >   		req->execute = nvmet_file_execute_write_zeroes;
> >   		return 0;
> > +	case nvme_cmd_copy:
> > +		req->execute = nvmet_file_execute_copy;
> > +		return 0;
> >   	default:
> >   		return nvmet_report_invalid_opcode(req);
> >   	}
> > diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> > index b45fe3adf015..55802632b407 100644
> > --- a/drivers/nvme/target/loop.c
> > +++ b/drivers/nvme/target/loop.c
> > @@ -146,6 +146,12 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> >   		return ret;
> >   	blk_mq_start_request(req);
> > +	if (unlikely((req->cmd_flags & REQ_COPY) &&
> > +				(req_op(req) == REQ_OP_READ))) {
> > +		blk_mq_set_request_complete(req);
> > +		blk_mq_end_request(req, BLK_STS_OK);
> > +		return BLK_STS_OK;
> > +	}
> >   	iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
> >   	iod->req.port = queue->ctrl->port;
> >   	if (!nvmet_req_init(&iod->req, &queue->nvme_cq,
> > diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> > index dfe3894205aa..3b4c7d2ee45d 100644
> > --- a/drivers/nvme/target/nvmet.h
> > +++ b/drivers/nvme/target/nvmet.h
> > @@ -391,6 +391,8 @@ struct nvmet_req {
> >   	struct device		*p2p_client;
> >   	u16			error_loc;
> >   	u64			error_slba;
> > +	struct range_entry *ranges;
> > +	unsigned int nr_range;
> >   };
> >   extern struct workqueue_struct *buffered_io_wq;


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/10] block: Add copy offload support infrastructure
  2022-11-23  8:04       ` Ming Lei
@ 2022-11-23 10:07         ` Nitesh Shetty
  2022-11-24  0:03           ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23 10:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro, linux-block,
	linux-kernel, linux-nvme, linux-fsdevel, anuj20.g, joshi.k,
	p.raghav, nitheshshetty, gost.dev

[-- Attachment #1: Type: text/plain, Size: 2380 bytes --]

On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > and an array of (source, destination and copy length) tuples.
> > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > bio pair with a token as payload and submitted to the device in order.
> > Read request populates token with source specific information which
> > is then passed with write request.
> > This design is courtesy Mikulas Patocka's token based copy
> 
> I thought this patchset is just for enabling copy command which is
> supported by hardware. But turns out it isn't, because blk_copy_offload()
> still submits read/write bios for doing the copy.
> 
> I am just wondering why not let copy_file_range() cover this kind of copy,
> and the framework has been there.
> 

Main goal was to enable copy command, but community suggested to add
copy emulation as well.

blk_copy_offload - actually issues copy command in driver layer.
The way read/write BIOs are perceived is different for copy offload.
In copy offload we check REQ_COPY flag in NVMe driver layer to issue
copy command. But we missed adding it in other drivers, where they
might be treated as normal READ/WRITE.
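
(The nvmet loopback hunk earlier in this series shows what handling the flag
looks like; the READ half of the token pair is completed without doing any
I/O, and the actual copy is carried by the write/copy side:

	if (unlikely((req->cmd_flags & REQ_COPY) &&
				(req_op(req) == REQ_OP_READ))) {
		blk_mq_set_request_complete(req);
		blk_mq_end_request(req, BLK_STS_OK);
		return BLK_STS_OK;
	}

A driver without an equivalent check would just see a normal READ/WRITE with
a token page attached.)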

blk_copy_emulate - is used if we fail or if device doesn't support native
copy offload command. Here we do READ/WRITE. Using copy_file_range for
emulation might be possible, but we see 2 issues here.
1. We explored the possibility of pulling dm-kcopyd into the block layer so that
we can readily use it. But we found it had many dependencies on the dm-layer,
so we later dropped that idea.
2. copy_file_range: for block devices at least, we saw a few checks which fail
it for raw block devices. At this point I don't know much about the history of
why such checks are present.
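
(For reference, the check we hit appears to be the S_ISREG() test done by
generic_file_rw_checks(), which generic_copy_file_checks() calls in
fs/read_write.c; quoting roughly from memory, so the exact code may differ:

	/* Don't copy dirs, pipes, sockets... */
	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
		return -EISDIR;
	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
		return -EINVAL;

so copy_file_range() on a raw block device fd fails with -EINVAL before any
filesystem or driver code is reached.)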

> When I was researching pipe/splice code for supporting ublk zero copy[1], I
> have got idea for async copy_file_range(), such as: io uring based
> direct splice, user backed intermediate buffer, still zero copy, if these
> ideas are finally implemented, we could get super-fast generic offload copy,
> and bdev copy is really covered too.
> 
> [1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@redhat.com/
> 

Seems interesting. We will take a look into this.

> 
> 

Thanks,
Nitesh


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 10/10] fs: add support for copy file range in zonefs
  2022-11-23  6:53       ` Amir Goldstein
@ 2022-11-23 10:13         ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-23 10:13 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro, linux-block,
	linux-kernel, linux-nvme, linux-fsdevel, anuj20.g, joshi.k,
	p.raghav, nitheshshetty, gost.dev

[-- Attachment #1: Type: text/plain, Size: 9677 bytes --]

On Wed, Nov 23, 2022 at 08:53:14AM +0200, Amir Goldstein wrote:
> On Wed, Nov 23, 2022 at 8:26 AM Nitesh Shetty <nj.shetty@samsung.com> wrote:
> >
> > copy_file_range is implemented using copy offload,
> > copy offloading to device is always enabled.
> > To disable copy offloading mount with "no_copy_offload" mount option.
> > At present copy offload is only used, if the source and destination files
> > are on same block device, otherwise copy file range is completed by
> > generic copy file range.
> >
> > copy file range implemented as following:
> >         - write pending writes on the src and dest files
> >         - drop page cache for dest file if its conv zone
> >         - copy the range using offload
> >         - update dest file info
> >
> > For all failure cases we fallback to generic file copy range
> > At present this implementation does not support conv aggregation
> >
> > Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> > ---
> >  fs/zonefs/super.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 179 insertions(+)
> >
> > diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> > index abc9a85106f2..15613433d4ae 100644
> > --- a/fs/zonefs/super.c
> > +++ b/fs/zonefs/super.c
> > @@ -1223,6 +1223,183 @@ static int zonefs_file_release(struct inode *inode, struct file *file)
> >         return 0;
> >  }
> >
> > +static int zonefs_is_file_copy_offset_ok(struct inode *src_inode,
> > +               struct inode *dst_inode, loff_t src_off, loff_t dst_off,
> > +               size_t *len)
> > +{
> > +       loff_t size, endoff;
> > +       struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> > +
> > +       inode_lock(src_inode);
> > +       size = i_size_read(src_inode);
> > +       inode_unlock(src_inode);
> > +       /* Don't copy beyond source file EOF. */
> > +       if (src_off < size) {
> > +               if (src_off + *len > size)
> > +                       *len = (size - (src_off + *len));
> > +       } else
> > +               *len = 0;
> > +
> > +       mutex_lock(&dst_zi->i_truncate_mutex);
> > +       if (dst_zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
> > +               if (*len > dst_zi->i_max_size - dst_zi->i_wpoffset)
> > +                       *len -= dst_zi->i_max_size - dst_zi->i_wpoffset;
> > +
> > +               if (dst_off != dst_zi->i_wpoffset)
> > +                       goto err;
> > +       }
> > +       mutex_unlock(&dst_zi->i_truncate_mutex);
> > +
> > +       endoff = dst_off + *len;
> > +       inode_lock(dst_inode);
> > +       if (endoff > dst_zi->i_max_size ||
> > +                       inode_newsize_ok(dst_inode, endoff)) {
> > +               inode_unlock(dst_inode);
> > +               goto err;
> > +       }
> > +       inode_unlock(dst_inode);
> > +
> > +       return 0;
> > +err:
> > +       mutex_unlock(&dst_zi->i_truncate_mutex);
> > +       return -EINVAL;
> > +}
> > +
> > +static ssize_t zonefs_issue_copy(struct zonefs_inode_info *src_zi,
> > +               loff_t src_off, struct zonefs_inode_info *dst_zi,
> > +               loff_t dst_off, size_t len)
> > +{
> > +       struct block_device *src_bdev = src_zi->i_vnode.i_sb->s_bdev;
> > +       struct block_device *dst_bdev = dst_zi->i_vnode.i_sb->s_bdev;
> > +       struct range_entry *rlist = NULL;
> > +       int ret = len;
> > +
> > +       rlist = kmalloc(sizeof(*rlist), GFP_KERNEL);
> > +       if (!rlist)
> > +               return -ENOMEM;
> > +
> > +       rlist[0].dst = (dst_zi->i_zsector << SECTOR_SHIFT) + dst_off;
> > +       rlist[0].src = (src_zi->i_zsector << SECTOR_SHIFT) + src_off;
> > +       rlist[0].len = len;
> > +       rlist[0].comp_len = 0;
> > +       ret = blkdev_issue_copy(src_bdev, dst_bdev, rlist, 1, NULL, NULL,
> > +                       GFP_KERNEL);
> > +       if (rlist[0].comp_len > 0)
> > +               ret = rlist[0].comp_len;
> > +       kfree(rlist);
> > +
> > +       return ret;
> > +}
> > +
> > +/* Returns length of possible copy, else returns error */
> > +static ssize_t zonefs_copy_file_checks(struct file *src_file, loff_t src_off,
> > +                                       struct file *dst_file, loff_t dst_off,
> > +                                       size_t *len, unsigned int flags)
> > +{
> > +       struct inode *src_inode = file_inode(src_file);
> > +       struct inode *dst_inode = file_inode(dst_file);
> > +       struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
> > +       struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> > +       ssize_t ret;
> > +
> > +       if (src_inode->i_sb != dst_inode->i_sb)
> > +               return -EXDEV;
> > +
> > +       /* Start by sync'ing the source and destination files for conv zones */
> > +       if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> > +               ret = file_write_and_wait_range(src_file, src_off,
> > +                               (src_off + *len));
> > +               if (ret < 0)
> > +                       goto io_error;
> > +       }
> > +       inode_dio_wait(src_inode);
> > +
> > +       /* Start by sync'ing the source and destination files ifor conv zones */
> > +       if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> > +               ret = file_write_and_wait_range(dst_file, dst_off,
> > +                               (dst_off + *len));
> > +               if (ret < 0)
> > +                       goto io_error;
> > +       }
> > +       inode_dio_wait(dst_inode);
> > +
> > +       /* Drop dst file cached pages for a conv zone*/
> > +       if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> > +               ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
> > +                               dst_off >> PAGE_SHIFT,
> > +                               (dst_off + *len) >> PAGE_SHIFT);
> > +               if (ret < 0)
> > +                       goto io_error;
> > +       }
> > +
> > +       ret = zonefs_is_file_copy_offset_ok(src_inode, dst_inode, src_off,
> > +                       dst_off, len);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       return *len;
> > +
> > +io_error:
> > +       zonefs_io_error(dst_inode, true);
> > +       return ret;
> > +}
> > +
> > +static ssize_t zonefs_copy_file(struct file *src_file, loff_t src_off,
> > +               struct file *dst_file, loff_t dst_off,
> > +               size_t len, unsigned int flags)
> > +{
> > +       struct inode *src_inode = file_inode(src_file);
> > +       struct inode *dst_inode = file_inode(dst_file);
> > +       struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
> > +       struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> > +       ssize_t ret = 0, bytes;
> > +
> > +       inode_lock(src_inode);
> > +       inode_lock(dst_inode);
> > +       bytes = zonefs_issue_copy(src_zi, src_off, dst_zi, dst_off, len);
> > +       if (bytes < 0)
> > +               goto unlock_exit;
> > +
> > +       ret += bytes;
> > +
> > +       file_update_time(dst_file);
> > +       mutex_lock(&dst_zi->i_truncate_mutex);
> > +       zonefs_update_stats(dst_inode, dst_off + bytes);
> > +       zonefs_i_size_write(dst_inode, dst_off + bytes);
> > +       dst_zi->i_wpoffset += bytes;
> > +       mutex_unlock(&dst_zi->i_truncate_mutex);
> > +       /* if we still have some bytes left, do splice copy */
> > +       if (bytes && (bytes < len)) {
> > +               bytes = do_splice_direct(src_file, &src_off, dst_file,
> > +                                        &dst_off, len, flags);
> > +               if (bytes > 0)
> > +                       ret += bytes;
> > +       }
> > +unlock_exit:
> > +       if (ret < 0)
> > +               zonefs_io_error(dst_inode, true);
> > +       inode_unlock(src_inode);
> > +       inode_unlock(dst_inode);
> > +       return ret;
> > +}
> > +
> > +static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off,
> > +                                     struct file *dst_file, loff_t dst_off,
> > +                                     size_t len, unsigned int flags)
> > +{
> > +       ssize_t ret = -EIO;
> > +
> > +       ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off,
> > +                                    &len, flags);
> > +       if (ret > 0)
> > +               ret = zonefs_copy_file(src_file, src_off, dst_file, dst_off,
> > +                                    len, flags);
> > +       else if (ret < 0 && ret == -EXDEV)
> 
> First of all, ret < 0 is redundant.
> 

acked

> > +               ret = generic_copy_file_range(src_file, src_off, dst_file,
> > +                                             dst_off, len, flags);
> 
> But more importantly, why do you want to fall back to
> do_splice_direct() in zonefs copy_file_range?
> How does it serve your patch set or the prospect consumers
> of zonefs copy_file_range?
> 
> The reason I am asking is because commit 5dae222a5ff0
> ("vfs: allow copy_file_range to copy across devices")
> turned out to be an API mistake that was later reverted by
> 868f9f2f8e00 ("vfs: fix copy_file_range() regression in cross-fs copies")
> 
> It is always better to return EXDEV to userspace which can
> always fallback to splice itself, but maybe it has something
> smarter to do.
> 
> The places where it made sense for kernel to fallback to
> direct splice was for network servers server-side-copy, but that
> is independent of any specific filesystem copy_file_range()
> implementation.
> 
> Thanks,
> Amir.
> 

At present we don't handle a few cases, such as the IO getting split in case
of copy offload, so we wanted to fall back to the existing mechanism. So we
went with the default operation, do_splice_direct.

Regards,
Nitesh Shetty



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 00/10] Implement copy offload support
  2022-11-23  5:58 ` [PATCH v5 00/10] Implement copy offload support Nitesh Shetty
                     ` (9 preceding siblings ...)
       [not found]   ` <CGME20221123061044epcas5p2ac082a91fc8197821f29e84278b6203c@epcas5p2.samsung.com>
@ 2022-11-23 22:56   ` Chaitanya Kulkarni
  2022-11-29 12:16     ` Nitesh Shetty
  10 siblings, 1 reply; 32+ messages in thread
From: Chaitanya Kulkarni @ 2022-11-23 22:56 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: linux-block, agk, linux-kernel, linux-nvme, linux-fsdevel, axboe,
	viro, sagi, anuj20.g, joshi.k, p.raghav, naohiro.aota,
	damien.lemoal, Chaitanya Kulkarni, snitzer, kbusch, jth,
	nitheshshetty, james.smart, gost.dev, dm-devel, hch,
	shinichiro.kawasaki

(+ Shinichiro)

On 11/22/22 21:58, Nitesh Shetty wrote:
> The patch series covers the points discussed in November 2021 virtual
> call [LSF/MM/BFP TOPIC] Storage: Copy Offload [0].
> We have covered the initial agreed requirements in this patchset and
> further additional features suggested by community.
> Patchset borrows Mikulas's token based approach for 2 bdev
> implementation.
> 
> This is on top of our previous patchset v4[1].

Now that the series is converging, and since the patch series touches
drivers and key components in the block layer, you need to accompany
the patch series with blktests to cover the corner cases in the
drivers which support these operations, as I mentioned in the
call last year....

If you need any help with that, feel free to send an email to linux-block
and CC me or Shinichiro (added in CC)...

-ck


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/10] block: Add copy offload support infrastructure
  2022-11-23 10:07         ` Nitesh Shetty
@ 2022-11-24  0:03           ` Ming Lei
  2022-11-29 11:44             ` Nitesh Shetty
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2022-11-24  0:03 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro, linux-block,
	linux-kernel, linux-nvme, linux-fsdevel, anuj20.g, joshi.k,
	p.raghav, nitheshshetty, gost.dev, ming.lei

On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > and an array of (source, destination and copy length) tuples.
> > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > bio pair with a token as payload and submitted to the device in order.
> > > Read request populates token with source specific information which
> > > is then passed with write request.
> > > This design is courtesy Mikulas Patocka's token based copy
> > 
> > I thought this patchset is just for enabling copy command which is
> > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > still submits read/write bios for doing the copy.
> > 
> > I am just wondering why not let copy_file_range() cover this kind of copy,
> > and the framework has been there.
> > 
> 
> Main goal was to enable copy command, but community suggested to add
> copy emulation as well.
> 
> blk_copy_offload - actually issues copy command in driver layer.
> The way read/write BIOs are perceived is different for copy offload.
> In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> copy command. But we missed adding it in other drivers, where they
> might be treated as normal READ/WRITE.
> 
> blk_copy_emulate - is used if we fail or if device doesn't support native
> copy offload command. Here we do READ/WRITE. Using copy_file_range for
> emulation might be possible, but we see 2 issues here.
> 1. We explored the possibility of pulling dm-kcopyd into the block layer so that
> we can readily use it. But we found it had many dependencies on the dm-layer,
> so we later dropped that idea.

Is it just because dm-kcopyd supports async copy? If yes, I believe we
can rely on io_uring for implementing async copy_file_range, which will
be a generic interface for async copy, and could get better perf.

> 2. copy_file_range: for block devices at least, we saw a few checks which fail
> it for raw block devices. At this point I don't know much about the history of
> why such checks are present.

Got it, but IMO the check in generic_copy_file_checks() can be
relaxed to cover blkdev cause splice does support blkdev.

Then your bdev offload copy work can be simplified into:

1) implement .copy_file_range for def_blk_fops, suppose it is
blkdev_copy_file_range()

2) inside blkdev_copy_file_range()

- if the bdev supports offload copy, just submit one bio to the device,
and this will be converted to one pt req to device

- otherwise, fallback to generic_copy_file_range()
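
To make 2) concrete, a rough sketch of what I mean (bdev_copy_offload_supported()
is a made-up name for whatever capability check your series ends up exporting,
and I am reusing the blkdev_issue_copy()/range_entry interface from this series
as-is, so treat this as pseudo code rather than something tested):

static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
		struct file *file_out, loff_t pos_out,
		size_t len, unsigned int flags)
{
	struct block_device *bdev_in = I_BDEV(file_in->f_mapping->host);
	struct block_device *bdev_out = I_BDEV(file_out->f_mapping->host);
	struct range_entry range = {
		.src = pos_in,
		.dst = pos_out,
		.len = len,
	};

	/* offload path: one copy request goes straight to the device */
	if (bdev_copy_offload_supported(bdev_in, bdev_out)) {
		if (!blkdev_issue_copy(bdev_in, bdev_out, &range, 1,
				       NULL, NULL, GFP_KERNEL))
			return range.comp_len;
	}

	/* no offload (or it failed): fall back to the generic path */
	return generic_copy_file_range(file_in, pos_in, file_out, pos_out,
				       len, flags);
}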

> 
> > When I was researching pipe/splice code for supporting ublk zero copy[1], I
> > have got idea for async copy_file_range(), such as: io uring based
> > direct splice, user backed intermediate buffer, still zero copy, if these
> > ideas are finally implemented, we could get super-fast generic offload copy,
> > and bdev copy is really covered too.
> > 
> > [1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@redhat.com/
> > 
> 
> Seems interesting. We will take a look into this.

BTW, that is probably one direction of ublk's async zero copy IO too.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 10/10] fs: add support for copy file range in zonefs
  2022-11-23  5:58     ` [PATCH v5 10/10] fs: add support for copy file range in zonefs Nitesh Shetty
  2022-11-23  6:53       ` Amir Goldstein
@ 2022-11-24  1:32       ` Damien Le Moal
  2022-11-24  1:47         ` Damien Le Moal
  1 sibling, 1 reply; 32+ messages in thread
From: Damien Le Moal @ 2022-11-24  1:32 UTC (permalink / raw)
  To: Nitesh Shetty, axboe, agk, snitzer, dm-devel, kbusch, hch, sagi,
	james.smart, kch, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev

On 11/23/22 14:58, Nitesh Shetty wrote:
> copy_file_range is implemented using copy offload,
> copy offloading to device is always enabled.
> To disable copy offloading mount with "no_copy_offload" mount option.

And where is the code that handles this option ?

> At present copy offload is only used, if the source and destination files
> are on same block device, otherwise copy file range is completed by
> generic copy file range.
> 
> copy file range implemented as following:
> 	- write pending writes on the src and dest files
> 	- drop page cache for dest file if its conv zone
> 	- copy the range using offload
> 	- update dest file info
> 
> For all failure cases we fallback to generic file copy range

For all cases ? That would be weird. What would be the point of trying to
copy again if e.g. the dest zone has gone offline or read only ?

> At present this implementation does not support conv aggregation

Please check this commit message overall: the grammar and punctuation
could really be improved.

> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>  fs/zonefs/super.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 179 insertions(+)
> 
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index abc9a85106f2..15613433d4ae 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -1223,6 +1223,183 @@ static int zonefs_file_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>  
> +static int zonefs_is_file_copy_offset_ok(struct inode *src_inode,
> +		struct inode *dst_inode, loff_t src_off, loff_t dst_off,
> +		size_t *len)
> +{
> +	loff_t size, endoff;
> +	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> +
> +	inode_lock(src_inode);
> +	size = i_size_read(src_inode);
> +	inode_unlock(src_inode);
> +	/* Don't copy beyond source file EOF. */
> +	if (src_off < size) {
> +		if (src_off + *len > size)
> +			*len = (size - (src_off + *len));
> +	} else
> +		*len = 0;

Missing curly brackets for the else.

> +
> +	mutex_lock(&dst_zi->i_truncate_mutex);
> +	if (dst_zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
> +		if (*len > dst_zi->i_max_size - dst_zi->i_wpoffset)
> +			*len -= dst_zi->i_max_size - dst_zi->i_wpoffset;
> +
> +		if (dst_off != dst_zi->i_wpoffset)
> +			goto err;
> +	}
> +	mutex_unlock(&dst_zi->i_truncate_mutex);
> +
> +	endoff = dst_off + *len;
> +	inode_lock(dst_inode);
> +	if (endoff > dst_zi->i_max_size ||
> +			inode_newsize_ok(dst_inode, endoff)) {
> +		inode_unlock(dst_inode);
> +		goto err;

And here truncate mutex is not locked, but goto err will unlock it. This
is broken...

> +	}
> +	inode_unlock(dst_inode);

...The locking is completely broken in this function anyway. You take the
lock, look at something, then release the lock. Then what if a write or a
truncate comes in when the inode is not locked ? This is completely
broken. The inode should be locked with no dio pending when this function
is called. This is only to check if everything is ok. This has no business
playing with the inode and truncate locks.

> +
> +	return 0;
> +err:
> +	mutex_unlock(&dst_zi->i_truncate_mutex);
> +	return -EINVAL;
> +}
> +
> +static ssize_t zonefs_issue_copy(struct zonefs_inode_info *src_zi,
> +		loff_t src_off, struct zonefs_inode_info *dst_zi,
> +		loff_t dst_off, size_t len)
> +{
> +	struct block_device *src_bdev = src_zi->i_vnode.i_sb->s_bdev;
> +	struct block_device *dst_bdev = dst_zi->i_vnode.i_sb->s_bdev;
> +	struct range_entry *rlist = NULL;
> +	int ret = len;
> +
> +	rlist = kmalloc(sizeof(*rlist), GFP_KERNEL);

GFP_NOIO ?

> +	if (!rlist)
> +		return -ENOMEM;
> +
> +	rlist[0].dst = (dst_zi->i_zsector << SECTOR_SHIFT) + dst_off;
> +	rlist[0].src = (src_zi->i_zsector << SECTOR_SHIFT) + src_off;
> +	rlist[0].len = len;
> +	rlist[0].comp_len = 0;
> +	ret = blkdev_issue_copy(src_bdev, dst_bdev, rlist, 1, NULL, NULL,
> +			GFP_KERNEL);
> +	if (rlist[0].comp_len > 0)
> +		ret = rlist[0].comp_len;
> +	kfree(rlist);
> +
> +	return ret;
> +}
> +
> +/* Returns length of possible copy, else returns error */
> +static ssize_t zonefs_copy_file_checks(struct file *src_file, loff_t src_off,
> +					struct file *dst_file, loff_t dst_off,
> +					size_t *len, unsigned int flags)
> +{
> +	struct inode *src_inode = file_inode(src_file);
> +	struct inode *dst_inode = file_inode(dst_file);
> +	struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
> +	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> +	ssize_t ret;
> +
> +	if (src_inode->i_sb != dst_inode->i_sb)
> +		return -EXDEV;
> +
> +	/* Start by sync'ing the source and destination files for conv zones */
> +	if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> +		ret = file_write_and_wait_range(src_file, src_off,
> +				(src_off + *len));
> +		if (ret < 0)
> +			goto io_error;
> +	}
> +	inode_dio_wait(src_inode);

That is not a "check". So having this in a function called
zonefs_copy_file_checks() is a little strange.

> +
> +	/* Start by sync'ing the source and destination files ifor conv zones */

Same comment repeated, with typos.

> +	if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> +		ret = file_write_and_wait_range(dst_file, dst_off,
> +				(dst_off + *len));
> +		if (ret < 0)
> +			goto io_error;
> +	}
> +	inode_dio_wait(dst_inode);
> +
> +	/* Drop dst file cached pages for a conv zone*/
> +	if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> +		ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
> +				dst_off >> PAGE_SHIFT,
> +				(dst_off + *len) >> PAGE_SHIFT);
> +		if (ret < 0)
> +			goto io_error;
> +	}
> +
> +	ret = zonefs_is_file_copy_offset_ok(src_inode, dst_inode, src_off,
> +			dst_off, len);
> +	if (ret < 0)

if (ret)

> +		return ret;
> +
> +	return *len;
> +
> +io_error:
> +	zonefs_io_error(dst_inode, true);
> +	return ret;
> +}
> +
> +static ssize_t zonefs_copy_file(struct file *src_file, loff_t src_off,
> +		struct file *dst_file, loff_t dst_off,
> +		size_t len, unsigned int flags)
> +{
> +	struct inode *src_inode = file_inode(src_file);
> +	struct inode *dst_inode = file_inode(dst_file);
> +	struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
> +	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> +	ssize_t ret = 0, bytes;
> +
> +	inode_lock(src_inode);
> +	inode_lock(dst_inode);

So you did zonefs_copy_file_checks() outside of these locks, which means
that everything about the source and destination files may have changed.
This does not work.

> +	bytes = zonefs_issue_copy(src_zi, src_off, dst_zi, dst_off, len);
> +	if (bytes < 0)
> +		goto unlock_exit;
> +
> +	ret += bytes;
> +
> +	file_update_time(dst_file);
> +	mutex_lock(&dst_zi->i_truncate_mutex);
> +	zonefs_update_stats(dst_inode, dst_off + bytes);
> +	zonefs_i_size_write(dst_inode, dst_off + bytes);
> +	dst_zi->i_wpoffset += bytes;

This is weird. isize for dst will be dst_zi->i_wpoffset. So please do:

	dst_zi->i_wpoffset += bytes;
	zonefs_i_size_write(dst_inode, dst_zi->i_wpoffset);

> +	mutex_unlock(&dst_zi->i_truncate_mutex);

And you are not taking care of the accounting for active zones here. If
the copy made the dst zone full, it is not active anymore. You need to
call zonefs_account_active();

> +	/* if we still have some bytes left, do splice copy */
> +	if (bytes && (bytes < len)) {
> +		bytes = do_splice_direct(src_file, &src_off, dst_file,
> +					 &dst_off, len, flags);

No way.

> +		if (bytes > 0)
> +			ret += bytes;
> +	}
> +unlock_exit:
> +	if (ret < 0)
> +		zonefs_io_error(dst_inode, true);

How can you be sure that you even did an IO when you get an error ?
zonefs_issue_copy() may have failed on its kmalloc() and no IO was done.

> +	inode_unlock(src_inode);
> +	inode_unlock(dst_inode);
> +	return ret;
> +}
> +
> +static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off,
> +				      struct file *dst_file, loff_t dst_off,
> +				      size_t len, unsigned int flags)
> +{
> +	ssize_t ret = -EIO;

This does not need to be initialized.

> +
> +	ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off,
> +				     &len, flags);

These checks need to be done for the generic implementation too, no ? Why
would checking this automatically trigger the offload ? What if the device
does not support offloading ?

> +	if (ret > 0)
> +		ret = zonefs_copy_file(src_file, src_off, dst_file, dst_off,
> +				     len, flags);

return here, then no need for the else. But see above. This seems all
broken to me.

> +	else if (ret < 0 && ret == -EXDEV)
> +		ret = generic_copy_file_range(src_file, src_off, dst_file,
> +					      dst_off, len, flags);
> +	return ret;
> +}
> +
>  static const struct file_operations zonefs_file_operations = {
>  	.open		= zonefs_file_open,
>  	.release	= zonefs_file_release,
> @@ -1234,6 +1411,7 @@ static const struct file_operations zonefs_file_operations = {
>  	.splice_read	= generic_file_splice_read,
>  	.splice_write	= iter_file_splice_write,
>  	.iopoll		= iocb_bio_iopoll,
> +	.copy_file_range = zonefs_copy_file_range,
>  };
>  
>  static struct kmem_cache *zonefs_inode_cachep;
> @@ -1804,6 +1982,7 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
>  	atomic_set(&sbi->s_active_seq_files, 0);
>  	sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev);
>  
> +	/* set copy support by default */

What is this comment supposed to be for ?

>  	ret = zonefs_read_super(sb);
>  	if (ret)
>  		return ret;

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 10/10] fs: add support for copy file range in zonefs
  2022-11-24  1:32       ` Damien Le Moal
@ 2022-11-24  1:47         ` Damien Le Moal
  2022-11-25  4:18           ` Al Viro
  2022-11-29 12:22           ` Nitesh Shetty
  0 siblings, 2 replies; 32+ messages in thread
From: Damien Le Moal @ 2022-11-24  1:47 UTC (permalink / raw)
  To: Nitesh Shetty, axboe, agk, snitzer, dm-devel, kbusch, hch, sagi,
	james.smart, kch, naohiro.aota, jth, viro
  Cc: linux-block, linux-kernel, linux-nvme, linux-fsdevel, anuj20.g,
	joshi.k, p.raghav, nitheshshetty, gost.dev

On 11/24/22 10:32, Damien Le Moal wrote:
> On 11/23/22 14:58, Nitesh Shetty wrote:
>> copy_file_range is implemented using copy offload,
>> copy offloading to device is always enabled.
>> To disable copy offloading mount with "no_copy_offload" mount option.
> 
> And where is the code that handles this option ?
> 
>> At present copy offload is only used, if the source and destination files
>> are on same block device, otherwise copy file range is completed by
>> generic copy file range.
>>
>> copy file range implemented as following:
>> 	- write pending writes on the src and dest files
>> 	- drop page cache for dest file if its conv zone
>> 	- copy the range using offload
>> 	- update dest file info
>>
>> For all failure cases we fallback to generic file copy range
> 
> For all cases ? That would be weird. What would be the point of trying to
> copy again if e.g. the dest zone has gone offline or read only ?
> 
>> At present this implementation does not support conv aggregation
> 
> Please check this commit message overall: the grammar and punctuation
> could really be improved.
> 
>>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> ---
>>  fs/zonefs/super.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 179 insertions(+)
>>
>> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
>> index abc9a85106f2..15613433d4ae 100644
>> --- a/fs/zonefs/super.c
>> +++ b/fs/zonefs/super.c
>> @@ -1223,6 +1223,183 @@ static int zonefs_file_release(struct inode *inode, struct file *file)
>>  	return 0;
>>  }
>>  
>> +static int zonefs_is_file_copy_offset_ok(struct inode *src_inode,
>> +		struct inode *dst_inode, loff_t src_off, loff_t dst_off,
>> +		size_t *len)
>> +{
>> +	loff_t size, endoff;
>> +	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
>> +
>> +	inode_lock(src_inode);
>> +	size = i_size_read(src_inode);
>> +	inode_unlock(src_inode);
>> +	/* Don't copy beyond source file EOF. */
>> +	if (src_off < size) {
>> +		if (src_off + *len > size)
>> +			*len = (size - (src_off + *len));
>> +	} else
>> +		*len = 0;
> 
> Missing curly brackets for the else.
> 
>> +
>> +	mutex_lock(&dst_zi->i_truncate_mutex);
>> +	if (dst_zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
>> +		if (*len > dst_zi->i_max_size - dst_zi->i_wpoffset)
>> +			*len -= dst_zi->i_max_size - dst_zi->i_wpoffset;
>> +
>> +		if (dst_off != dst_zi->i_wpoffset)
>> +			goto err;
>> +	}
>> +	mutex_unlock(&dst_zi->i_truncate_mutex);
>> +
>> +	endoff = dst_off + *len;
>> +	inode_lock(dst_inode);
>> +	if (endoff > dst_zi->i_max_size ||
>> +			inode_newsize_ok(dst_inode, endoff)) {
>> +		inode_unlock(dst_inode);
>> +		goto err;
> 
> And here truncate mutex is not locked, but goto err will unlock it. This
> is broken...
> 
>> +	}
>> +	inode_unlock(dst_inode);
> 
> ...The locking is completely broken in this function anyway. You take the
> lock, look at something, then release the lock. Then what if a write or a
> truncate comes in when the inode is not locked ? This is completely
> broken. The inode should be locked with no dio pending when this function
> is called. This is only to check if everything is ok. This has no business
> playing with the inode and truncate locks.
> 
>> +
>> +	return 0;
>> +err:
>> +	mutex_unlock(&dst_zi->i_truncate_mutex);
>> +	return -EINVAL;
>> +}
>> +
>> +static ssize_t zonefs_issue_copy(struct zonefs_inode_info *src_zi,
>> +		loff_t src_off, struct zonefs_inode_info *dst_zi,
>> +		loff_t dst_off, size_t len)
>> +{
>> +	struct block_device *src_bdev = src_zi->i_vnode.i_sb->s_bdev;
>> +	struct block_device *dst_bdev = dst_zi->i_vnode.i_sb->s_bdev;
>> +	struct range_entry *rlist = NULL;
>> +	int ret = len;
>> +
>> +	rlist = kmalloc(sizeof(*rlist), GFP_KERNEL);
> 
> GFP_NOIO ?
> 
>> +	if (!rlist)
>> +		return -ENOMEM;
>> +
>> +	rlist[0].dst = (dst_zi->i_zsector << SECTOR_SHIFT) + dst_off;
>> +	rlist[0].src = (src_zi->i_zsector << SECTOR_SHIFT) + src_off;
>> +	rlist[0].len = len;
>> +	rlist[0].comp_len = 0;
>> +	ret = blkdev_issue_copy(src_bdev, dst_bdev, rlist, 1, NULL, NULL,
>> +			GFP_KERNEL);
>> +	if (rlist[0].comp_len > 0)
>> +		ret = rlist[0].comp_len;
>> +	kfree(rlist);
>> +
>> +	return ret;
>> +}
>> +
>> +/* Returns length of possible copy, else returns error */
>> +static ssize_t zonefs_copy_file_checks(struct file *src_file, loff_t src_off,
>> +					struct file *dst_file, loff_t dst_off,
>> +					size_t *len, unsigned int flags)
>> +{
>> +	struct inode *src_inode = file_inode(src_file);
>> +	struct inode *dst_inode = file_inode(dst_file);
>> +	struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
>> +	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
>> +	ssize_t ret;
>> +
>> +	if (src_inode->i_sb != dst_inode->i_sb)
>> +		return -EXDEV;
>> +
>> +	/* Start by sync'ing the source and destination files for conv zones */
>> +	if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
>> +		ret = file_write_and_wait_range(src_file, src_off,
>> +				(src_off + *len));
>> +		if (ret < 0)
>> +			goto io_error;
>> +	}
>> +	inode_dio_wait(src_inode);
> 
> That is not a "check". So having this in a function called
> zonefs_copy_file_checks() is a little strange.
> 
>> +
>> +	/* Start by sync'ing the source and destination files ifor conv zones */
> 
> Same comment repeated, with typos.
> 
>> +	if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
>> +		ret = file_write_and_wait_range(dst_file, dst_off,
>> +				(dst_off + *len));
>> +		if (ret < 0)
>> +			goto io_error;
>> +	}
>> +	inode_dio_wait(dst_inode);
>> +
>> +	/* Drop dst file cached pages for a conv zone*/
>> +	if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
>> +		ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
>> +				dst_off >> PAGE_SHIFT,
>> +				(dst_off + *len) >> PAGE_SHIFT);
>> +		if (ret < 0)
>> +			goto io_error;
>> +	}
>> +
>> +	ret = zonefs_is_file_copy_offset_ok(src_inode, dst_inode, src_off,
>> +			dst_off, len);
>> +	if (ret < 0)
> 
> if (ret)
> 
>> +		return ret;
>> +
>> +	return *len;
>> +
>> +io_error:
>> +	zonefs_io_error(dst_inode, true);
>> +	return ret;
>> +}
>> +
>> +static ssize_t zonefs_copy_file(struct file *src_file, loff_t src_off,
>> +		struct file *dst_file, loff_t dst_off,
>> +		size_t len, unsigned int flags)
>> +{
>> +	struct inode *src_inode = file_inode(src_file);
>> +	struct inode *dst_inode = file_inode(dst_file);
>> +	struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
>> +	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
>> +	ssize_t ret = 0, bytes;
>> +
>> +	inode_lock(src_inode);
>> +	inode_lock(dst_inode);
> 
> So you did zonefs_copy_file_checks() outside of these locks, which means
> that everything about the source and destination files may have changed.
> This does not work.

I forgot to mention that locking 2 inodes blindly like this can lead to
deadlocks if another process tries a copy range from dst to src at the
same time (lock order is reversed and so can deadlock).
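
The usual way out is to take the two locks in a fixed order, e.g. with
lock_two_nondirectories() (rough sketch):

	lock_two_nondirectories(src_inode, dst_inode);
	/* ... issue the copy, update the destination size ... */
	unlock_two_nondirectories(src_inode, dst_inode);

lock_two_nondirectories() orders the two locks by inode address, so two copies
going in opposite directions cannot deadlock against each other.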

> 
>> +	bytes = zonefs_issue_copy(src_zi, src_off, dst_zi, dst_off, len);
>> +	if (bytes < 0)
>> +		goto unlock_exit;
>> +
>> +	ret += bytes;
>> +
>> +	file_update_time(dst_file);
>> +	mutex_lock(&dst_zi->i_truncate_mutex);
>> +	zonefs_update_stats(dst_inode, dst_off + bytes);
>> +	zonefs_i_size_write(dst_inode, dst_off + bytes);
>> +	dst_zi->i_wpoffset += bytes;
> 
> This is weird. isize for dst will be dst_zi->i_wpoffset. So please do:
> 
> 	dst_zi->i_wpoffset += bytes;
> 	zonefs_i_size_write(dst_inode, dst_zi->i_wpoffset);
> 
>> +	mutex_unlock(&dst_zi->i_truncate_mutex);
> 
> And you are not taking care of the accounting for active zones here. If
> the copy made the dst zone full, it is not active anymore. You need to
> call zonefs_account_active();
> 
>> +	/* if we still have some bytes left, do splice copy */
>> +	if (bytes && (bytes < len)) {
>> +		bytes = do_splice_direct(src_file, &src_off, dst_file,
>> +					 &dst_off, len, flags);
> 
> No way.
> 
>> +		if (bytes > 0)
>> +			ret += bytes;
>> +	}
>> +unlock_exit:
>> +	if (ret < 0)
>> +		zonefs_io_error(dst_inode, true);
> 
> How can you be sure that you even did an IO when you get an error ?
> zonefs_issue_copy() may have failed on its kmalloc() and no IO was done.
> 
>> +	inode_unlock(src_inode);
>> +	inode_unlock(dst_inode);
>> +	return ret;
>> +}
>> +
>> +static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off,
>> +				      struct file *dst_file, loff_t dst_off,
>> +				      size_t len, unsigned int flags)
>> +{
>> +	ssize_t ret = -EIO;
> 
> This does not need to be initialized.
> 
>> +
>> +	ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off,
>> +				     &len, flags);
> 
> These checks need to be done for the generic implementation too, no ? Why
> would checking this automatically trigger the offload ? What if the device
> does not support offloading ?
> 
>> +	if (ret > 0)
>> +		ret = zonefs_copy_file(src_file, src_off, dst_file, dst_off,
>> +				     len, flags);
> 
> return here, then no need for the else. But see above. This seems all
> broken to me.
> 
>> +	else if (ret < 0 && ret == -EXDEV)
>> +		ret = generic_copy_file_range(src_file, src_off, dst_file,
>> +					      dst_off, len, flags);
>> +	return ret;
>> +}
>> +
>>  static const struct file_operations zonefs_file_operations = {
>>  	.open		= zonefs_file_open,
>>  	.release	= zonefs_file_release,
>> @@ -1234,6 +1411,7 @@ static const struct file_operations zonefs_file_operations = {
>>  	.splice_read	= generic_file_splice_read,
>>  	.splice_write	= iter_file_splice_write,
>>  	.iopoll		= iocb_bio_iopoll,
>> +	.copy_file_range = zonefs_copy_file_range,
>>  };
>>  
>>  static struct kmem_cache *zonefs_inode_cachep;
>> @@ -1804,6 +1982,7 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
>>  	atomic_set(&sbi->s_active_seq_files, 0);
>>  	sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev);
>>  
>> +	/* set copy support by default */
> 
> What is this comment supposed to be for ?
> 
>>  	ret = zonefs_read_super(sb);
>>  	if (ret)
>>  		return ret;
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 10/10] fs: add support for copy file range in zonefs
  2022-11-24  1:47         ` Damien Le Moal
@ 2022-11-25  4:18           ` Al Viro
  2022-11-29 12:22           ` Nitesh Shetty
  1 sibling, 0 replies; 32+ messages in thread
From: Al Viro @ 2022-11-25  4:18 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Nitesh Shetty, axboe, agk, snitzer, dm-devel, kbusch, hch, sagi,
	james.smart, kch, naohiro.aota, jth, linux-block, linux-kernel,
	linux-nvme, linux-fsdevel, anuj20.g, joshi.k, p.raghav,
	nitheshshetty, gost.dev

On Thu, Nov 24, 2022 at 10:47:55AM +0900, Damien Le Moal wrote:
> >> +	inode_lock(src_inode);
> >> +	inode_lock(dst_inode);
> > 
> > So you did zonefs_copy_file_checks() outside of these locks, which means
> > that everything about the source and destination files may have changed.
> > This does not work.
> 
> I forgot to mention that locking 2 inodes blindly like this can lead to
> deadlocks if another process tries a copy range from dst to src at the
> same time (lock order is reversed and so can deadlock).

Not to mention the deadlocks with existing places where fs/namei.c locks
two inodes...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/10] block: Add copy offload support infrastructure
  2022-11-24  0:03           ` Ming Lei
@ 2022-11-29 11:44             ` Nitesh Shetty
  2022-12-07  5:54               ` Nitesh Shetty
  0 siblings, 1 reply; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-29 11:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro, linux-block,
	linux-kernel, linux-nvme, linux-fsdevel, anuj20.g, joshi.k,
	p.raghav, nitheshshetty, gost.dev

[-- Attachment #1: Type: text/plain, Size: 4006 bytes --]

On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > > and an array of (source, destination and copy length) tuples.
> > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > bio pair with a token as payload and submitted to the device in order.
> > > > Read request populates token with source specific information which
> > > > is then passed with write request.
> > > > This design is courtesy Mikulas Patocka's token based copy
> > > 
> > > I thought this patchset is just for enabling copy command which is
> > > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > > still submits read/write bios for doing the copy.
> > > 
> > > I am just wondering why not let copy_file_range() cover this kind of copy,
> > > and the framework has been there.
> > > 
> > 
> > Main goal was to enable copy command, but community suggested to add
> > copy emulation as well.
> > 
> > blk_copy_offload - actually issues copy command in driver layer.
> > The way read/write BIOs are perceived is different for copy offload.
> > In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> > copy command. But we missed adding it in other drivers, where they
> > might be treated as normal READ/WRITE.
> > 
> > blk_copy_emulate - is used if we fail or if device doesn't support native
> > copy offload command. Here we do READ/WRITE. Using copy_file_range for
> > emulation might be possible, but we see 2 issues here.
> > 1. We explored the possibility of pulling dm-kcopyd into the block layer so that
> > we can readily use it. But we found it had many dependencies on the dm-layer,
> > so we later dropped that idea.
> 
> Is it just because dm-kcopyd supports async copy? If yes, I believe we
> can rely on io_uring for implementing async copy_file_range, which will
> be a generic interface for async copy, and could get better perf.
>

It supports both sync and async, but it is used only inside the dm layer.
An async version of copy_file_range can help; using io_uring can be helpful
for userspace, but in-kernel users can't use uring.
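
(For context, the dm-kcopyd interface we looked at is roughly the following;
kc, copy_done_fn and ctx stand in for the kcopyd client, the completion
callback and its context, and the bdev/sector values are placeholders too:

	struct dm_io_region from = { .bdev = src_bdev, .sector = src, .count = nr };
	struct dm_io_region to = { .bdev = dst_bdev, .sector = dst, .count = nr };

	dm_kcopyd_copy(kc, &from, 1, &to, 0, copy_done_fn, ctx);

where kc comes from dm_kcopyd_client_create(); the client setup and the dm-io
machinery underneath it are the kind of dm-layer dependencies meant above.)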

> > 2. copy_file_range: for block devices at least, we saw a few checks which fail
> > it for raw block devices. At this point I don't know much about the history of
> > why such checks are present.
> 
> Got it, but IMO the check in generic_copy_file_checks() can be
> relaxed to cover blkdev cause splice does support blkdev.
> 
> Then your bdev offload copy work can be simplified into:
> 
> 1) implement .copy_file_range for def_blk_fops, suppose it is
> blkdev_copy_file_range()
> 
> 2) inside blkdev_copy_file_range()
> 
> - if the bdev supports offload copy, just submit one bio to the device,
> and this will be converted to one pt req to device
> 
> - otherwise, fallback to generic_copy_file_range()
>

We will check the feasibility and try to implement the scheme in the next
versions. It would be helpful if someone in the community knows why such
checks were present. We see copy_file_range accepts only regular files. Was
it designed only for regular files, or can we extend it to regular block
devices?

> > 
> > > When I was researching pipe/splice code for supporting ublk zero copy[1], I
> > > have got idea for async copy_file_range(), such as: io uring based
> > > direct splice, user backed intermediate buffer, still zero copy, if these
> > > ideas are finally implemented, we could get super-fast generic offload copy,
> > > and bdev copy is really covered too.
> > > 
> > > [1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@redhat.com/
> > > 
> > 
> > Seems interesting. We will take a look into this.
> 
> BTW, that is probably one direction of ublk's async zero copy IO too.
> 
> 
> Thanks, 
> Ming
> 
> 


Thanks,
Nitesh


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 00/10] Implement copy offload support
  2022-11-23 22:56   ` [PATCH v5 00/10] Implement copy offload support Chaitanya Kulkarni
@ 2022-11-29 12:16     ` Nitesh Shetty
  2022-11-30  0:05       ` Chaitanya Kulkarni
  0 siblings, 1 reply; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-29 12:16 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, agk, linux-kernel, linux-nvme, linux-fsdevel, axboe,
	viro, sagi, anuj20.g, joshi.k, p.raghav, naohiro.aota,
	damien.lemoal, snitzer, kbusch, jth, nitheshshetty, james.smart,
	gost.dev, dm-devel, hch, shinichiro.kawasaki

[-- Attachment #1: Type: text/plain, Size: 1176 bytes --]

On Wed, Nov 23, 2022 at 10:56:23PM +0000, Chaitanya Kulkarni wrote:
> (+ Shinichiro)
> 
> On 11/22/22 21:58, Nitesh Shetty wrote:
> > The patch series covers the points discussed in November 2021 virtual
> > call [LSF/MM/BFP TOPIC] Storage: Copy Offload [0].
> > We have covered the initial agreed requirements in this patchset and
> > further additional features suggested by community.
> > Patchset borrows Mikulas's token based approach for 2 bdev
> > implementation.
> > 
> > This is on top of our previous patchset v4[1].
> 
> Now that the series is converging, and since the patch series touches
> drivers and key components in the block layer, you need to accompany
> the patch series with blktests to cover the corner cases in the
> drivers which support these operations, as I mentioned in the
> call last year....
> 
> If you need any help with that, feel free to send an email to linux-block
> and CC me or Shinichiro (added in CC)...
> 
> -ck
>

Yes, any help would be appreciated. I am not familiar with the blktests
development/testing cycle. Do we need to add blktests along with the patch
series, or do we add them after the patch series gets merged (or is about
to be merged)?

Thanks
Nitesh


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 10/10] fs: add support for copy file range in zonefs
  2022-11-24  1:47         ` Damien Le Moal
  2022-11-25  4:18           ` Al Viro
@ 2022-11-29 12:22           ` Nitesh Shetty
  2022-11-29 23:45             ` Damien Le Moal
  1 sibling, 1 reply; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-29 12:22 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, naohiro.aota, jth, viro, linux-block, linux-kernel,
	linux-nvme, linux-fsdevel, anuj20.g, joshi.k, p.raghav,
	nitheshshetty, gost.dev

[-- Attachment #1: Type: text/plain, Size: 11029 bytes --]

On Thu, Nov 24, 2022 at 10:47:55AM +0900, Damien Le Moal wrote:
> On 11/24/22 10:32, Damien Le Moal wrote:
> > On 11/23/22 14:58, Nitesh Shetty wrote:
> >> copy_file_range is implemented using copy offload,
> >> copy offloading to device is always enabled.
> >> To disable copy offloading mount with "no_copy_offload" mount option.
> > 
> > And where is the code that handles this option ?
> > 
> >> At present copy offload is only used, if the source and destination files
> >> are on same block device, otherwise copy file range is completed by
> >> generic copy file range.
> >>
> >> copy file range implemented as following:
> >> 	- write pending writes on the src and dest files
> >> 	- drop page cache for dest file if its conv zone
> >> 	- copy the range using offload
> >> 	- update dest file info
> >>
> >> For all failure cases we fallback to generic file copy range
> > 
> > For all cases ? That would be weird. What would be the point of trying to
> > copy again if e.g. the dest zone has gone offline or read only ?
> > 
> >> At present this implementation does not support conv aggregation
> > 
> > Please check this commit message overall: the grammar and punctuation
> > could really be improved.
> > 
> >>
> >> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> >> ---
> >>  fs/zonefs/super.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 179 insertions(+)
> >>
> >> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> >> index abc9a85106f2..15613433d4ae 100644
> >> --- a/fs/zonefs/super.c
> >> +++ b/fs/zonefs/super.c
> >> @@ -1223,6 +1223,183 @@ static int zonefs_file_release(struct inode *inode, struct file *file)
> >>  	return 0;
> >>  }
> >>  
> >> +static int zonefs_is_file_copy_offset_ok(struct inode *src_inode,
> >> +		struct inode *dst_inode, loff_t src_off, loff_t dst_off,
> >> +		size_t *len)
> >> +{
> >> +	loff_t size, endoff;
> >> +	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> >> +
> >> +	inode_lock(src_inode);
> >> +	size = i_size_read(src_inode);
> >> +	inode_unlock(src_inode);
> >> +	/* Don't copy beyond source file EOF. */
> >> +	if (src_off < size) {
> >> +		if (src_off + *len > size)
> >> +			*len = (size - (src_off + *len));
> >> +	} else
> >> +		*len = 0;
> > 
> > Missing curly brackets for the else.
> > 
> >> +
> >> +	mutex_lock(&dst_zi->i_truncate_mutex);
> >> +	if (dst_zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
> >> +		if (*len > dst_zi->i_max_size - dst_zi->i_wpoffset)
> >> +			*len -= dst_zi->i_max_size - dst_zi->i_wpoffset;
> >> +
> >> +		if (dst_off != dst_zi->i_wpoffset)
> >> +			goto err;
> >> +	}
> >> +	mutex_unlock(&dst_zi->i_truncate_mutex);
> >> +
> >> +	endoff = dst_off + *len;
> >> +	inode_lock(dst_inode);
> >> +	if (endoff > dst_zi->i_max_size ||
> >> +			inode_newsize_ok(dst_inode, endoff)) {
> >> +		inode_unlock(dst_inode);
> >> +		goto err;
> > 
> > And here truncate mutex is not locked, but goto err will unlock it. This
> > is broken...
> > 
> >> +	}
> >> +	inode_unlock(dst_inode);
> > 
> > ...The locking is completely broken in this function anyway. You take the
> > lock, look at something, then release the lock. Then what if a write or a
> > truncate comes in when the inode is not locked ? This is completely
> > broken. The inode should be locked with no dio pending when this function
> > is called. This is only to check if everything is ok. This has no business
> > playing with the inode and truncate locks.
> > 
> >> +
> >> +	return 0;
> >> +err:
> >> +	mutex_unlock(&dst_zi->i_truncate_mutex);
> >> +	return -EINVAL;
> >> +}
> >> +
> >> +static ssize_t zonefs_issue_copy(struct zonefs_inode_info *src_zi,
> >> +		loff_t src_off, struct zonefs_inode_info *dst_zi,
> >> +		loff_t dst_off, size_t len)
> >> +{
> >> +	struct block_device *src_bdev = src_zi->i_vnode.i_sb->s_bdev;
> >> +	struct block_device *dst_bdev = dst_zi->i_vnode.i_sb->s_bdev;
> >> +	struct range_entry *rlist = NULL;
> >> +	int ret = len;
> >> +
> >> +	rlist = kmalloc(sizeof(*rlist), GFP_KERNEL);
> > 
> > GFP_NOIO ?
> > 
> >> +	if (!rlist)
> >> +		return -ENOMEM;
> >> +
> >> +	rlist[0].dst = (dst_zi->i_zsector << SECTOR_SHIFT) + dst_off;
> >> +	rlist[0].src = (src_zi->i_zsector << SECTOR_SHIFT) + src_off;
> >> +	rlist[0].len = len;
> >> +	rlist[0].comp_len = 0;
> >> +	ret = blkdev_issue_copy(src_bdev, dst_bdev, rlist, 1, NULL, NULL,
> >> +			GFP_KERNEL);
> >> +	if (rlist[0].comp_len > 0)
> >> +		ret = rlist[0].comp_len;
> >> +	kfree(rlist);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +/* Returns length of possible copy, else returns error */
> >> +static ssize_t zonefs_copy_file_checks(struct file *src_file, loff_t src_off,
> >> +					struct file *dst_file, loff_t dst_off,
> >> +					size_t *len, unsigned int flags)
> >> +{
> >> +	struct inode *src_inode = file_inode(src_file);
> >> +	struct inode *dst_inode = file_inode(dst_file);
> >> +	struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
> >> +	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> >> +	ssize_t ret;
> >> +
> >> +	if (src_inode->i_sb != dst_inode->i_sb)
> >> +		return -EXDEV;
> >> +
> >> +	/* Start by sync'ing the source and destination files for conv zones */
> >> +	if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> >> +		ret = file_write_and_wait_range(src_file, src_off,
> >> +				(src_off + *len));
> >> +		if (ret < 0)
> >> +			goto io_error;
> >> +	}
> >> +	inode_dio_wait(src_inode);
> > 
> > That is not a "check". So having this in a function called
> > zonefs_copy_file_checks() is a little strange.
> > 
> >> +
> >> +	/* Start by sync'ing the source and destination files ifor conv zones */
> > 
> > Same comment repeated, with typos.
> > 
> >> +	if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> >> +		ret = file_write_and_wait_range(dst_file, dst_off,
> >> +				(dst_off + *len));
> >> +		if (ret < 0)
> >> +			goto io_error;
> >> +	}
> >> +	inode_dio_wait(dst_inode);
> >> +
> >> +	/* Drop dst file cached pages for a conv zone*/
> >> +	if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> >> +		ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
> >> +				dst_off >> PAGE_SHIFT,
> >> +				(dst_off + *len) >> PAGE_SHIFT);
> >> +		if (ret < 0)
> >> +			goto io_error;
> >> +	}
> >> +
> >> +	ret = zonefs_is_file_copy_offset_ok(src_inode, dst_inode, src_off,
> >> +			dst_off, len);
> >> +	if (ret < 0)
> > 
> > if (ret)
> > 
> >> +		return ret;
> >> +
> >> +	return *len;
> >> +
> >> +io_error:
> >> +	zonefs_io_error(dst_inode, true);
> >> +	return ret;
> >> +}
> >> +
> >> +static ssize_t zonefs_copy_file(struct file *src_file, loff_t src_off,
> >> +		struct file *dst_file, loff_t dst_off,
> >> +		size_t len, unsigned int flags)
> >> +{
> >> +	struct inode *src_inode = file_inode(src_file);
> >> +	struct inode *dst_inode = file_inode(dst_file);
> >> +	struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
> >> +	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
> >> +	ssize_t ret = 0, bytes;
> >> +
> >> +	inode_lock(src_inode);
> >> +	inode_lock(dst_inode);
> > 
> > So you did zonefs_copy_file_checks() outside of these locks, which means
> > that everything about the source and destination files may have changed.
> > This does not work.
> 
> I forgot to mention that locking 2 inodes blindly like this can lead to
> deadlocks if another process tries a copy range from dst to src at the
> same time (lock order is reversed and so can deadlock).
> 
> > 
> >> +	bytes = zonefs_issue_copy(src_zi, src_off, dst_zi, dst_off, len);
> >> +	if (bytes < 0)
> >> +		goto unlock_exit;
> >> +
> >> +	ret += bytes;
> >> +
> >> +	file_update_time(dst_file);
> >> +	mutex_lock(&dst_zi->i_truncate_mutex);
> >> +	zonefs_update_stats(dst_inode, dst_off + bytes);
> >> +	zonefs_i_size_write(dst_inode, dst_off + bytes);
> >> +	dst_zi->i_wpoffset += bytes;
> > 
> > This is weird. isize for dst will be dst_zi->i_wpoffset. So please do:
> > 
> > 	dst_zi->i_wpoffset += bytes;
> > 	zonefs_i_size_write(dst_inode, dst_zi->i_wpoffset);
> > 
> >> +	mutex_unlock(&dst_zi->i_truncate_mutex);
> > 
> > And you are not taking care of the accounting for active zones here. If
> > the copy made the dst zone full, it is not active anymore. You need to
> > call zonefs_account_active();
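Putting the two comments above together, the post-copy accounting would look roughly like the sketch below. This is illustrative only; zonefs_account_active() is the zonefs helper that drops a zone from the active count once it becomes full, and it is assumed here to be callable under i_truncate_mutex.

	mutex_lock(&dst_zi->i_truncate_mutex);
	zonefs_update_stats(dst_inode, dst_off + bytes);
	dst_zi->i_wpoffset += bytes;
	zonefs_i_size_write(dst_inode, dst_zi->i_wpoffset);
	zonefs_account_active(dst_inode);	/* zone may just have become full */
	mutex_unlock(&dst_zi->i_truncate_mutex);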
> > 
> >> +	/* if we still have some bytes left, do splice copy */
> >> +	if (bytes && (bytes < len)) {
> >> +		bytes = do_splice_direct(src_file, &src_off, dst_file,
> >> +					 &dst_off, len, flags);
> > 
> > No way.
> > 
> >> +		if (bytes > 0)
> >> +			ret += bytes;
> >> +	}
> >> +unlock_exit:
> >> +	if (ret < 0)
> >> +		zonefs_io_error(dst_inode, true);
> > 
> > How can you be sure that you even did an IO when you get an error ?
> > zonefs_issue_copy() may have failed on its kmalloc() and no IO was done.
> > 
> >> +	inode_unlock(src_inode);
> >> +	inode_unlock(dst_inode);
> >> +	return ret;
> >> +}
> >> +
> >> +static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off,
> >> +				      struct file *dst_file, loff_t dst_off,
> >> +				      size_t len, unsigned int flags)
> >> +{
> >> +	ssize_t ret = -EIO;
> > 
> > This does not need to be initialized.
> > 
> >> +
> >> +	ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off,
> >> +				     &len, flags);
> > 
> > These checks need to be done for the generic implementation too, no ? Why
> > would checking this automatically trigger the offload ? What if the device
> > does not support offloading ?
> > 
> >> +	if (ret > 0)
> >> +		ret = zonefs_copy_file(src_file, src_off, dst_file, dst_off,
> >> +				     len, flags);
> > 
> > return here, then no need for the else. But see above. This seems all
> > broken to me.
> > 
> >> +	else if (ret < 0 && ret == -EXDEV)
> >> +		ret = generic_copy_file_range(src_file, src_off, dst_file,
> >> +					      dst_off, len, flags);
> >> +	return ret;
> >> +}
> >> +
> >>  static const struct file_operations zonefs_file_operations = {
> >>  	.open		= zonefs_file_open,
> >>  	.release	= zonefs_file_release,
> >> @@ -1234,6 +1411,7 @@ static const struct file_operations zonefs_file_operations = {
> >>  	.splice_read	= generic_file_splice_read,
> >>  	.splice_write	= iter_file_splice_write,
> >>  	.iopoll		= iocb_bio_iopoll,
> >> +	.copy_file_range = zonefs_copy_file_range,
> >>  };
> >>  
> >>  static struct kmem_cache *zonefs_inode_cachep;
> >> @@ -1804,6 +1982,7 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
> >>  	atomic_set(&sbi->s_active_seq_files, 0);
> >>  	sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev);
> >>  
> >> +	/* set copy support by default */
> > 
> > What is this comment supposed to be for ?
> > 
> >>  	ret = zonefs_read_super(sb);
> >>  	if (ret)
> >>  		return ret;
> > 
> 
> -- 
> Damien Le Moal
> Western Digital Research
> 
> 

Acked. I do see a gap in the current zonefs cfr implementation. I will drop this
implementation for the next version. Once we finalize the block copy offload
implementation, we will re-pick this and send it with the above reviews addressed.

Thank you,
Nitesh




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 10/10] fs: add support for copy file range in zonefs
  2022-11-29 12:22           ` Nitesh Shetty
@ 2022-11-29 23:45             ` Damien Le Moal
  2022-11-30  4:17               ` Nitesh Shetty
  0 siblings, 1 reply; 32+ messages in thread
From: Damien Le Moal @ 2022-11-29 23:45 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, naohiro.aota, jth, viro, linux-block, linux-kernel,
	linux-nvme, linux-fsdevel, anuj20.g, joshi.k, p.raghav,
	nitheshshetty, gost.dev

On 11/29/22 21:22, Nitesh Shetty wrote:
> Acked. I do see a gap in the current zonefs cfr implementation. I will drop this

cfr ?

> implementation for the next version. Once we finalize the block copy offload
> implementation, we will re-pick this and send it with the above reviews addressed.
> 
> Thank you,
> Nitesh

Please trim your replies.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 00/10] Implement copy offload support
  2022-11-29 12:16     ` Nitesh Shetty
@ 2022-11-30  0:05       ` Chaitanya Kulkarni
  2022-11-30  4:14         ` Nitesh Shetty
  0 siblings, 1 reply; 32+ messages in thread
From: Chaitanya Kulkarni @ 2022-11-30  0:05 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: linux-block, agk, linux-kernel, linux-nvme, linux-fsdevel, axboe,
	viro, sagi, anuj20.g, joshi.k, p.raghav, naohiro.aota,
	damien.lemoal, snitzer, kbusch, jth, nitheshshetty, james.smart,
	gost.dev, dm-devel, hch, shinichiro.kawasaki

On 11/29/22 04:16, Nitesh Shetty wrote:
> On Wed, Nov 23, 2022 at 10:56:23PM +0000, Chaitanya Kulkarni wrote:
>> (+ Shinichiro)
>>
>> On 11/22/22 21:58, Nitesh Shetty wrote:
>>> The patch series covers the points discussed in November 2021 virtual
>>> call [LSF/MM/BFP TOPIC] Storage: Copy Offload [0].
>>> We have covered the initial agreed requirements in this patchset and
>>> further additional features suggested by community.
>>> Patchset borrows Mikulas's token based approach for 2 bdev
>>> implementation.
>>>
>>> This is on top of our previous patchset v4[1].
>>
>> Now that series is converging, since patch-series touches
>> drivers and key components in the block layer you need accompany
>> the patch-series with the blktests to cover the corner cases in the
>> drivers which supports this operations, as I mentioned this in the
>> call last year....
>>
>> If you need any help with that feel free to send an email to linux-block
>> and CC me or Shinichiro (added in CC )...
>>
>> -ck
>>
> 
> Yes any help would be appreciated. I am not familiar with blktest
> development/testing cycle. Do we need add blktests along with patch
> series or do we need to add after patch series gets merged(to be merged)?
> 
> Thanks
> Nitesh
> 
> 

We have many test cases you can refer to as examples.
Your cover letter mentions that you have tested this code; just move
all of those test cases into blktests.

More importantly, for a feature like this you should be providing
thorough test cases in your github tree when you post the
series, covering the critical parts of the block layer and
the drivers in question.

The objective here is to have blktests updated by the time the code
is upstream, so that all the distros can test the code from the
upstream blktests repo. You can refer to what we have done
for NVMeOF in-band authentication (thanks to Hannes and Sagi;
see the linux-nvme email archives).

-ck


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 00/10] Implement copy offload support
  2022-11-30  0:05       ` Chaitanya Kulkarni
@ 2022-11-30  4:14         ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-30  4:14 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, agk, linux-kernel, linux-nvme, linux-fsdevel, axboe,
	viro, sagi, anuj20.g, joshi.k, p.raghav, naohiro.aota,
	damien.lemoal, snitzer, kbusch, jth, nitheshshetty, james.smart,
	gost.dev, dm-devel, hch, shinichiro.kawasaki


On Wed, Nov 30, 2022 at 12:05:00AM +0000, Chaitanya Kulkarni wrote:
> On 11/29/22 04:16, Nitesh Shetty wrote:
> > On Wed, Nov 23, 2022 at 10:56:23PM +0000, Chaitanya Kulkarni wrote:
> >> (+ Shinichiro)
> >>
> >> On 11/22/22 21:58, Nitesh Shetty wrote:
> >>> The patch series covers the points discussed in November 2021 virtual
> >>> call [LSF/MM/BFP TOPIC] Storage: Copy Offload [0].
> >>> We have covered the initial agreed requirements in this patchset and
> >>> further additional features suggested by community.
> >>> Patchset borrows Mikulas's token based approach for 2 bdev
> >>> implementation.
> >>>
> >>> This is on top of our previous patchset v4[1].
> >>
> >> Now that series is converging, since patch-series touches
> >> drivers and key components in the block layer you need accompany
> >> the patch-series with the blktests to cover the corner cases in the
> >> drivers which supports this operations, as I mentioned this in the
> >> call last year....
> >>
> >> If you need any help with that feel free to send an email to linux-block
> >> and CC me or Shinichiro (added in CC )...
> >>
> >> -ck
> >>
> > 
> > Yes any help would be appreciated. I am not familiar with blktest
> > development/testing cycle. Do we need add blktests along with patch
> > series or do we need to add after patch series gets merged(to be merged)?
> > 
> > Thanks
> > Nitesh
> > 
> > 
> 
> We have many test cases you can refer to as examples.
> Your cover letter mentions that you have tested this code; just move
> all of those test cases into blktests.
> 
> More importantly, for a feature like this you should be providing
> thorough test cases in your github tree when you post the
> series, covering the critical parts of the block layer and
> the drivers in question.
> 
> The objective here is to have blktests updated by the time the code
> is upstream, so that all the distros can test the code from the
> upstream blktests repo. You can refer to what we have done
> for NVMeOF in-band authentication (thanks to Hannes and Sagi;
> see the linux-nvme email archives).
> 
> -ck
>

Sure, the next version will include blktests updates.

Thank you,
Nitesh




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 10/10] fs: add support for copy file range in zonefs
  2022-11-29 23:45             ` Damien Le Moal
@ 2022-11-30  4:17               ` Nitesh Shetty
  2022-11-30  9:55                 ` Damien Le Moal
  0 siblings, 1 reply; 32+ messages in thread
From: Nitesh Shetty @ 2022-11-30  4:17 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, naohiro.aota, jth, viro, linux-block, linux-kernel,
	linux-nvme, linux-fsdevel, anuj20.g, joshi.k, p.raghav,
	nitheshshetty, gost.dev


On Wed, Nov 30, 2022 at 08:45:55AM +0900, Damien Le Moal wrote:
> On 11/29/22 21:22, Nitesh Shetty wrote:
> > Acked. I do see a gap in the current zonefs cfr implementation. I will drop this
> 
> cfr ?
>

Yes, I will drop zonefs cfr for the next version.

> > implementation for the next version. Once we finalize the block copy offload
> > implementation, we will re-pick this and send it with the above reviews addressed.
> > 
> > Thank you,
> > Nitesh
> 
> Please trim your replies.
> 

Noted

> 
> -- 
> Damien Le Moal
> Western Digital Research
> 
> 

Thanks,
Nitesh Shetty




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 10/10] fs: add support for copy file range in zonefs
  2022-11-30  4:17               ` Nitesh Shetty
@ 2022-11-30  9:55                 ` Damien Le Moal
  0 siblings, 0 replies; 32+ messages in thread
From: Damien Le Moal @ 2022-11-30  9:55 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, naohiro.aota, jth, viro, linux-block, linux-kernel,
	linux-nvme, linux-fsdevel, anuj20.g, joshi.k, p.raghav,
	nitheshshetty, gost.dev

On 11/30/22 13:17, Nitesh Shetty wrote:
> On Wed, Nov 30, 2022 at 08:45:55AM +0900, Damien Le Moal wrote:
>> On 11/29/22 21:22, Nitesh Shetty wrote:
>>> Acked. I do see a gap in the current zonefs cfr implementation. I will drop this
>>
>> cfr ?
>>
> 
> Yes, I will drop zonefs cfr for the next version.

I meant: I do not understand "cfr". I now realize that it probably means
copy_file_range? Please be clear and do not use abbreviations.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/10] block: Add copy offload support infrastructure
  2022-11-29 11:44             ` Nitesh Shetty
@ 2022-12-07  5:54               ` Nitesh Shetty
  2022-12-07 11:19                 ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Nitesh Shetty @ 2022-12-07  5:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro, linux-block,
	linux-kernel, linux-nvme, linux-fsdevel, anuj20.g, joshi.k,
	p.raghav, nitheshshetty, gost.dev


On Tue, Nov 29, 2022 at 05:14:28PM +0530, Nitesh Shetty wrote:
> On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> > On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > > > and an array of (source, destination and copy length) tuples.
> > > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > > bio pair with a token as payload and submitted to the device in order.
> > > > > Read request populates token with source specific information which
> > > > > is then passed with write request.
> > > > > This design is courtesy Mikulas Patocka's token based copy
> > > > 
> > > > I thought this patchset is just for enabling copy command which is
> > > > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > > > still submits read/write bios for doing the copy.
> > > > 
> > > > I am just wondering why not let copy_file_range() cover this kind of copy,
> > > > and the framework has been there.
> > > > 
> > > 
> > > Main goal was to enable copy command, but community suggested to add
> > > copy emulation as well.
> > > 
> > > blk_copy_offload - actually issues copy command in driver layer.
> > > The way read/write BIOs are percieved is different for copy offload.
> > > In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> > > copy command. But we did missed it to add in other driver's, where they
> > > might be treated as normal READ/WRITE.
> > > 
> > > blk_copy_emulate - is used if we fail or if device doesn't support native
> > > copy offload command. Here we do READ/WRITE. Using copy_file_range for
> > > emulation might be possible, but we see 2 issues here.
> > > 1. We explored possibility of pulling dm-kcopyd to block layer so that we 
> > > can readily use it. But we found it had many dependecies from dm-layer.
> > > So later dropped that idea.
> > 
> > Is it just because dm-kcopyd supports async copy? If yes, I believe we
> > can reply on io_uring for implementing async copy_file_range, which will
> > be generic interface for async copy, and could get better perf.
> >
> 
> It supports both sync and async. But used only inside dm-layer.
> Async version of copy_file_range can help, using io-uring can be helpful
> for user , but in-kernel users can't use uring.
> 
> > > 2. copy_file_range, for block device atleast we saw few check's which fail
> > > it for raw block device. At this point I dont know much about the history of
> > > why such check is present.
> > 
> > Got it, but IMO the check in generic_copy_file_checks() can be
> > relaxed to cover blkdev cause splice does support blkdev.
> > 
> > Then your bdev offload copy work can be simplified into:
> > 
> > 1) implement .copy_file_range for def_blk_fops, suppose it is
> > blkdev_copy_file_range()
> > 
> > 2) inside blkdev_copy_file_range()
> > 
> > - if the bdev supports offload copy, just submit one bio to the device,
> > and this will be converted to one pt req to device
> > 
> > - otherwise, fallback to generic_copy_file_range()
> >
> 

Actually, we sent the initial version with a single bio, but the community later
suggested that two bios are a must for offload, the main reasoning being
dm-layer, XCOPY, and copy-across-namespace compatibility.
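For readers following along, the two-bio, token-based submission described in the commit message quoted above works conceptually like the sketch below. It is illustrative only: the real series drives this through blkdev_issue_copy(), the token layout is driver-owned, and the per-range length bookkeeping is omitted here.

	/*
	 * Conceptual sketch of Mikulas' token-based design: a REQ_COPY read
	 * on the source lets the driver record source information in an
	 * opaque token, and a REQ_COPY write on the destination consumes
	 * that token and performs the actual copy.
	 */
	static int submit_copy_token_pair(struct block_device *src_bdev,
					  sector_t src_sector,
					  struct block_device *dst_bdev,
					  sector_t dst_sector)
	{
		void *token = kzalloc(PAGE_SIZE, GFP_KERNEL);
		struct bio *bio;
		int ret;

		if (!token)
			return -ENOMEM;

		/* Step 1: REQ_COPY read fills the token with source details. */
		bio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY, GFP_KERNEL);
		bio->bi_iter.bi_sector = src_sector;
		__bio_add_page(bio, virt_to_page(token), PAGE_SIZE, 0);
		ret = submit_bio_wait(bio);
		bio_put(bio);
		if (ret)
			goto out;

		/* Step 2: REQ_COPY write hands the token to the destination. */
		bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE | REQ_COPY, GFP_KERNEL);
		bio->bi_iter.bi_sector = dst_sector;
		__bio_add_page(bio, virt_to_page(token), PAGE_SIZE, 0);
		ret = submit_bio_wait(bio);
		bio_put(bio);
	out:
		kfree(token);
		return ret;
	}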

> We will check the feasibilty and try to implement the scheme in next versions.
> It would be helpful, if someone in community know's why such checks were
> present ? We see copy_file_range accepts only regular file. Was it
> designed only for regular files or can we extend it to regular block
> device.
>

As you suggested, we were able to integrate with def_blk_ops and
run it with a user application, but we see one main issue with this approach.
Using blkdev_copy_file_range requires having 2 file descriptors, which
is not possible for in-kernel users such as fabrics/dm-kcopyd, which have
only bdev descriptors.
Do you have any plumbing suggestions here?
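As a point of reference, the def_blk_ops integration being discussed could look roughly like the sketch below. blk_queue_copy() stands in for whatever capability check the series ends up exposing (e.g. a non-zero max_copy_sectors limit), struct range_entry follows the zonefs hunk quoted earlier in the thread, and error handling is simplified.

	static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
					      struct file *file_out, loff_t pos_out,
					      size_t len, unsigned int flags)
	{
		struct block_device *in_bdev = I_BDEV(file_in->f_mapping->host);
		struct block_device *out_bdev = I_BDEV(file_out->f_mapping->host);
		struct range_entry range = {
			.src = pos_in,
			.dst = pos_out,
			.len = len,
		};
		int ret;

		/* Offload only when both queues advertise native copy support. */
		if (blk_queue_copy(bdev_get_queue(in_bdev)) &&
		    blk_queue_copy(bdev_get_queue(out_bdev))) {
			ret = blkdev_issue_copy(in_bdev, out_bdev, &range, 1,
						NULL, NULL, GFP_KERNEL);
			if (!ret && range.comp_len)
				return range.comp_len;
		}

		/* No usable copy command: fall back to the generic splice path. */
		return generic_copy_file_range(file_in, pos_in, file_out, pos_out,
					       len, flags);
	}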

> > > 
> > > > When I was researching pipe/splice code for supporting ublk zero copy[1], I
> > > > have got idea for async copy_file_range(), such as: io uring based
> > > > direct splice, user backed intermediate buffer, still zero copy, if these
> > > > ideas are finally implemented, we could get super-fast generic offload copy,
> > > > and bdev copy is really covered too.
> > > > 
> > > > [1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@redhat.com/
> > > > 
> > > 
> > > Seems interesting, We will take a look into this.
> > 
> > BTW, that is probably one direction of ublk's async zero copy IO too.
> > 
> > 
> > Thanks, 
> > Ming
> > 
> > 
> 
> 
> Thanks,
> Nitesh

Thanks,
Nitesh Shetty





^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/10] block: Add copy offload support infrastructure
  2022-12-07  5:54               ` Nitesh Shetty
@ 2022-12-07 11:19                 ` Ming Lei
  2022-12-09  8:16                   ` Nitesh Shetty
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2022-12-07 11:19 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro, linux-block,
	linux-kernel, linux-nvme, linux-fsdevel, anuj20.g, joshi.k,
	p.raghav, nitheshshetty, gost.dev

On Wed, Dec 07, 2022 at 11:24:00AM +0530, Nitesh Shetty wrote:
> On Tue, Nov 29, 2022 at 05:14:28PM +0530, Nitesh Shetty wrote:
> > On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> > > On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > > > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > > > > and an array of (source, destination and copy length) tuples.
> > > > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > > > bio pair with a token as payload and submitted to the device in order.
> > > > > > Read request populates token with source specific information which
> > > > > > is then passed with write request.
> > > > > > This design is courtesy Mikulas Patocka's token based copy
> > > > > 
> > > > > I thought this patchset is just for enabling copy command which is
> > > > > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > > > > still submits read/write bios for doing the copy.
> > > > > 
> > > > > I am just wondering why not let copy_file_range() cover this kind of copy,
> > > > > and the framework has been there.
> > > > > 
> > > > 
> > > > Main goal was to enable copy command, but community suggested to add
> > > > copy emulation as well.
> > > > 
> > > > blk_copy_offload - actually issues copy command in driver layer.
> > > > The way read/write BIOs are percieved is different for copy offload.
> > > > In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> > > > copy command. But we did missed it to add in other driver's, where they
> > > > might be treated as normal READ/WRITE.
> > > > 
> > > > blk_copy_emulate - is used if we fail or if device doesn't support native
> > > > copy offload command. Here we do READ/WRITE. Using copy_file_range for
> > > > emulation might be possible, but we see 2 issues here.
> > > > 1. We explored possibility of pulling dm-kcopyd to block layer so that we 
> > > > can readily use it. But we found it had many dependecies from dm-layer.
> > > > So later dropped that idea.
> > > 
> > > Is it just because dm-kcopyd supports async copy? If yes, I believe we
> > > can reply on io_uring for implementing async copy_file_range, which will
> > > be generic interface for async copy, and could get better perf.
> > >
> > 
> > It supports both sync and async. But used only inside dm-layer.
> > Async version of copy_file_range can help, using io-uring can be helpful
> > for user , but in-kernel users can't use uring.
> > 
> > > > 2. copy_file_range, for block device atleast we saw few check's which fail
> > > > it for raw block device. At this point I dont know much about the history of
> > > > why such check is present.
> > > 
> > > Got it, but IMO the check in generic_copy_file_checks() can be
> > > relaxed to cover blkdev cause splice does support blkdev.
> > > 
> > > Then your bdev offload copy work can be simplified into:
> > > 
> > > 1) implement .copy_file_range for def_blk_fops, suppose it is
> > > blkdev_copy_file_range()
> > > 
> > > 2) inside blkdev_copy_file_range()
> > > 
> > > - if the bdev supports offload copy, just submit one bio to the device,
> > > and this will be converted to one pt req to device
> > > 
> > > - otherwise, fallback to generic_copy_file_range()
> > >
> > 
> 
> > Actually, we sent the initial version with a single bio, but the community later
> > suggested that two bios are a must for offload, the main reasoning being

Is there a link to that discussion?

> dm-layer, XCOPY, and copy-across-namespace compatibility.

But dm kcopy already supports bdev copy, so once your patch is
ready, dm kcopy can just send one bio with REQ_COPY if the device
supports the offload command; otherwise the current dm kcopy code can work
as before.
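A rough sketch of that dm-kcopyd branch: try one offloaded copy first and let the caller fall back to the existing read/write job machinery otherwise. blk_queue_copy() and the single-range blkdev_issue_copy() call follow the illustrative sketch earlier in the thread, and the range offsets are assumed to be byte-based.

	static int kcopyd_try_copy_offload(struct dm_io_region *from,
					   struct dm_io_region *to)
	{
		struct range_entry range = {
			.src = (u64)from->sector << SECTOR_SHIFT,
			.dst = (u64)to->sector << SECTOR_SHIFT,
			.len = (u64)from->count << SECTOR_SHIFT,
		};

		if (!blk_queue_copy(bdev_get_queue(from->bdev)) ||
		    !blk_queue_copy(bdev_get_queue(to->bdev)))
			return -EOPNOTSUPP;	/* caller falls back to read/write jobs */

		return blkdev_issue_copy(from->bdev, to->bdev, &range, 1,
					 NULL, NULL, GFP_NOIO);
	}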

> 
> > We will check the feasibilty and try to implement the scheme in next versions.
> > It would be helpful, if someone in community know's why such checks were
> > present ? We see copy_file_range accepts only regular file. Was it
> > designed only for regular files or can we extend it to regular block
> > device.
> >
> 
> > As you suggested, we were able to integrate with def_blk_ops and
> > run it with a user application, but we see one main issue with this approach.
> > Using blkdev_copy_file_range requires having 2 file descriptors, which
> > is not possible for in-kernel users such as fabrics/dm-kcopyd, which have
> > only bdev descriptors.
> > Do you have any plumbing suggestions here?

What is the fabrics kernel user? Any kernel target code (such as the nvme target)
has a file/bdev path available, so vfs_copy_file_range() should be fine.

Also, IMO, in-kernel copy users shouldn't be important long term, especially if
io_uring copy_file_range() can be supported; forwarding to userspace not
only gets better performance, but also cleans up a lot of the kernel copy
code.


thanks, 
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/10] block: Add copy offload support infrastructure
  2022-12-07 11:19                 ` Ming Lei
@ 2022-12-09  8:16                   ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2022-12-09  8:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, agk, snitzer, dm-devel, kbusch, hch, sagi, james.smart,
	kch, damien.lemoal, naohiro.aota, jth, viro, linux-block,
	linux-kernel, linux-nvme, linux-fsdevel, anuj20.g, joshi.k,
	p.raghav, nitheshshetty, gost.dev


On Wed, Dec 07, 2022 at 07:19:00PM +0800, Ming Lei wrote:
> On Wed, Dec 07, 2022 at 11:24:00AM +0530, Nitesh Shetty wrote:
> > On Tue, Nov 29, 2022 at 05:14:28PM +0530, Nitesh Shetty wrote:
> > > On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> > > > On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > > > > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > > > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > > > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > > > > > and an array of (source, destination and copy length) tuples.
> > > > > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > > > > bio pair with a token as payload and submitted to the device in order.
> > > > > > > Read request populates token with source specific information which
> > > > > > > is then passed with write request.
> > > > > > > This design is courtesy Mikulas Patocka's token based copy
> > > > > > 
> > > > > > I thought this patchset is just for enabling copy command which is
> > > > > > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > > > > > still submits read/write bios for doing the copy.
> > > > > > 
> > > > > > I am just wondering why not let copy_file_range() cover this kind of copy,
> > > > > > and the framework has been there.
> > > > > > 
> > > > > 
> > > > > Main goal was to enable copy command, but community suggested to add
> > > > > copy emulation as well.
> > > > > 
> > > > > blk_copy_offload - actually issues copy command in driver layer.
> > > > > The way read/write BIOs are percieved is different for copy offload.
> > > > > In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> > > > > copy command. But we did missed it to add in other driver's, where they
> > > > > might be treated as normal READ/WRITE.
> > > > > 
> > > > > blk_copy_emulate - is used if we fail or if device doesn't support native
> > > > > copy offload command. Here we do READ/WRITE. Using copy_file_range for
> > > > > emulation might be possible, but we see 2 issues here.
> > > > > 1. We explored possibility of pulling dm-kcopyd to block layer so that we 
> > > > > can readily use it. But we found it had many dependecies from dm-layer.
> > > > > So later dropped that idea.
> > > > 
> > > > Is it just because dm-kcopyd supports async copy? If yes, I believe we
> > > > can reply on io_uring for implementing async copy_file_range, which will
> > > > be generic interface for async copy, and could get better perf.
> > > >
> > > 
> > > It supports both sync and async. But used only inside dm-layer.
> > > Async version of copy_file_range can help, using io-uring can be helpful
> > > for user , but in-kernel users can't use uring.
> > > 
> > > > > 2. copy_file_range, for block device atleast we saw few check's which fail
> > > > > it for raw block device. At this point I dont know much about the history of
> > > > > why such check is present.
> > > > 
> > > > Got it, but IMO the check in generic_copy_file_checks() can be
> > > > relaxed to cover blkdev cause splice does support blkdev.
> > > > 
> > > > Then your bdev offload copy work can be simplified into:
> > > > 
> > > > 1) implement .copy_file_range for def_blk_fops, suppose it is
> > > > blkdev_copy_file_range()
> > > > 
> > > > 2) inside blkdev_copy_file_range()
> > > > 
> > > > - if the bdev supports offload copy, just submit one bio to the device,
> > > > and this will be converted to one pt req to device
> > > > 
> > > > - otherwise, fallback to generic_copy_file_range()
> > > >
> > > 
> > 
> > Actually, we sent the initial version with a single bio, but the community later
> > suggested that two bios are a must for offload, the main reasoning being
> 
> Is there a link to that discussion?
> 

This[1] is the starting thread for the LSF/MM session that Chaitanya organized; it
was a conference call. There was also a discussion on the copy topic at LSF/MM
2022. One of the main conclusions was that using 2 bios is a must-have.

Bart has summarized the previous copy efforts[2], and so has Martin[3];
these might help in understanding why 2 bios are a must.

> > dm-layer, XCOPY, and copy-across-namespace compatibility.
> 
> But dm kcopy already supports bdev copy, so once your patch is
> ready, dm kcopy can just send one bio with REQ_COPY if the device
> supports the offload command; otherwise the current dm kcopy code can work
> as before.
> 
> > 
> > > We will check the feasibilty and try to implement the scheme in next versions.
> > > It would be helpful, if someone in community know's why such checks were
> > > present ? We see copy_file_range accepts only regular file. Was it
> > > designed only for regular files or can we extend it to regular block
> > > device.
> > >
> > 
> > As you suggested, we were able to integrate with def_blk_ops and
> > run it with a user application, but we see one main issue with this approach.
> > Using blkdev_copy_file_range requires having 2 file descriptors, which
> > is not possible for in-kernel users such as fabrics/dm-kcopyd, which have
> > only bdev descriptors.
> > Do you have any plumbing suggestions here?
> 
> What is the fabrics kernel user? 

In a fabrics setup we expose the target-side NVMe device to the host as a
copy-offload-capable device, irrespective of the target device's capability.
This enables sending the copy offload command from the host.
On the target side, if the device doesn't support offload, we emulate it.
This way we avoid copying the data over the network.

> Any kernel target code (such as the nvme target)
> has a file/bdev path available, so vfs_copy_file_range() should be fine.
> 

On the target side, a fabrics device can be created in 2 ways:
1. file-backed device: in this setup we have a file descriptor, and in fact our
patches[4] use vfs_copy_file_range to offload the copy.
2. block-backed device: here we have only a block device descriptor, so we
can't use vfs_copy_file_range, since there is no file descriptor.
Your input on how we can plumb blkdev_copy_file_range with a bdev would help.

> Also, IMO, in-kernel copy users shouldn't be important long term, especially if
> io_uring copy_file_range() can be supported; forwarding to userspace not
> only gets better performance, but also cleans up a lot of the kernel copy
> code.
>

Using io_uring is valid when the consumer of copy_file_range is userspace.
But there are many possible in-kernel users of copy, such as dm-kcopyd,
garbage collection cases such as F2FS GC, and also fabrics as explained above.
We can't use io_uring from those layers.

The present implementation of vfs_copy_file_range (CFR) lacks
a. an async implementation, which helps NVMe over Fabrics
b. multi-range (src, dst pair) support
So going through blkdev_copy_file_range would actually regress our present
results. But we do see the benefit of using the mature vfs_copy_file_range
for emulation.
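For context, the multi-range interface referred to in (b) looks roughly like this, going by the zonefs hunk quoted earlier in the thread (field names as in that hunk; comp_len reports how much of each range actually completed, and the exact units and layout may differ in the final series):

	struct range_entry {
		__u64 src;
		__u64 dst;
		__u64 len;
		__u64 comp_len;
	};

	/* Hypothetical caller copying two extents in a single submission. */
	static int copy_two_extents(struct block_device *src_bdev,
				    struct block_device *dst_bdev)
	{
		struct range_entry rlist[2] = {
			{ .src = 0,	  .dst = 1 << 20, .len = 1 << 20 },
			{ .src = 4 << 20, .dst = 8 << 20, .len = 2 << 20 },
		};

		return blkdev_issue_copy(src_bdev, dst_bdev, rlist, 2,
					 NULL, NULL, GFP_KERNEL);
	}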

[1] https://lore.kernel.org/linux-nvme/BYAPR04MB49652C4B75E38F3716F3C06386539@BYAPR04MB4965.namprd04.prod.outlook.com/
[2] https://github.com/bvanassche/linux-kernel-copy-offload
[3] https://oss.oracle.com/~mkp/docs/xcopy.pdf
[4] https://lore.kernel.org/lkml/20221130041450.GA17533@test-zns/T/#Z2e.:..:20221123055827.26996-7-nj.shetty::40samsung.com:1drivers:nvme:target:io-cmd-file.c

Thanks,
Nitesh Shetty




^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2022-12-13  5:17 UTC | newest]

Thread overview: 32+ messages
     [not found] <CGME20221123061010epcas5p21cef9d23e4362b01f2b19d1117e1cdf5@epcas5p2.samsung.com>
2022-11-23  5:58 ` [PATCH v5 00/10] Implement copy offload support Nitesh Shetty
     [not found]   ` <CGME20221123061014epcas5p150fd8add12fe6d09b63c56972818e6a2@epcas5p1.samsung.com>
2022-11-23  5:58     ` [PATCH v5 01/10] block: Introduce queue limits for copy-offload support Nitesh Shetty
     [not found]   ` <CGME20221123061017epcas5p246a589e20eac655ac340cfda6028ff35@epcas5p2.samsung.com>
2022-11-23  5:58     ` [PATCH v5 02/10] block: Add copy offload support infrastructure Nitesh Shetty
2022-11-23  8:04       ` Ming Lei
2022-11-23 10:07         ` Nitesh Shetty
2022-11-24  0:03           ` Ming Lei
2022-11-29 11:44             ` Nitesh Shetty
2022-12-07  5:54               ` Nitesh Shetty
2022-12-07 11:19                 ` Ming Lei
2022-12-09  8:16                   ` Nitesh Shetty
     [not found]   ` <CGME20221123061021epcas5p276b6d48db889932282d017b27c9a3291@epcas5p2.samsung.com>
2022-11-23  5:58     ` [PATCH v5 03/10] block: add emulation for copy Nitesh Shetty
     [not found]   ` <CGME20221123061024epcas5p28fd0296018950d722b5a97e2875cf391@epcas5p2.samsung.com>
2022-11-23  5:58     ` [PATCH v5 04/10] block: Introduce a new ioctl " Nitesh Shetty
     [not found]   ` <CGME20221123061028epcas5p1aecd27b2f4f694b5a18b51d3df5d7432@epcas5p1.samsung.com>
2022-11-23  5:58     ` [PATCH v5 05/10] nvme: add copy offload support Nitesh Shetty
     [not found]   ` <CGME20221123061031epcas5p3745558c2caffd2fd21d15feff00495e9@epcas5p3.samsung.com>
2022-11-23  5:58     ` [PATCH v5 06/10] nvmet: add copy command support for bdev and file ns Nitesh Shetty
     [not found]       ` <482586a3-f45d-a17b-7630-341fb0e1ee96@linux.alibaba.com>
2022-11-23  9:39         ` Nitesh Shetty
     [not found]   ` <CGME20221123061034epcas5p3fe90293ad08df4901f98bae2d7cfc1ba@epcas5p3.samsung.com>
2022-11-23  5:58     ` [PATCH v5 07/10] dm: Add support for copy offload Nitesh Shetty
     [not found]   ` <CGME20221123061037epcas5p4d57436204fbe0065819b156eeeddbfac@epcas5p4.samsung.com>
2022-11-23  5:58     ` [PATCH v5 08/10] dm: Enable copy offload for dm-linear target Nitesh Shetty
     [not found]   ` <CGME20221123061041epcas5p4413569a46ee730cd3033a9025c8f134a@epcas5p4.samsung.com>
2022-11-23  5:58     ` [PATCH v5 09/10] dm kcopyd: use copy offload support Nitesh Shetty
     [not found]   ` <CGME20221123061044epcas5p2ac082a91fc8197821f29e84278b6203c@epcas5p2.samsung.com>
2022-11-23  5:58     ` [PATCH v5 10/10] fs: add support for copy file range in zonefs Nitesh Shetty
2022-11-23  6:53       ` Amir Goldstein
2022-11-23 10:13         ` Nitesh Shetty
2022-11-24  1:32       ` Damien Le Moal
2022-11-24  1:47         ` Damien Le Moal
2022-11-25  4:18           ` Al Viro
2022-11-29 12:22           ` Nitesh Shetty
2022-11-29 23:45             ` Damien Le Moal
2022-11-30  4:17               ` Nitesh Shetty
2022-11-30  9:55                 ` Damien Le Moal
2022-11-23 22:56   ` [PATCH v5 00/10] Implement copy offload support Chaitanya Kulkarni
2022-11-29 12:16     ` Nitesh Shetty
2022-11-30  0:05       ` Chaitanya Kulkarni
2022-11-30  4:14         ` Nitesh Shetty
