linux-scsi.vger.kernel.org archive mirror
* [PATCH v13 0/9] Implement copy offload support
       [not found] <CGME20230627183950epcas5p1b924785633509f612ffa5d9616bfe447@epcas5p1.samsung.com>
@ 2023-06-27 18:36 ` Nitesh Shetty
       [not found]   ` <CGME20230627184000epcas5p1c7cb01eb1c70bc5a19f76ce21f2ec3f8@epcas5p1.samsung.com>
                     ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-27 18:36 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Nitesh Shetty,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

The patch series covers the points discussed in the past, most recently
at LSFMM'23[0].
This patchset covers the initially agreed requirements, along with
additional features suggested by the community.

This is the next iteration of our previous patchset v12[1].
We have changed from a token-based approach to a request-based approach:
instead of storing the copy information in a token, we now merge the copy
bios in the request layer and send the merged request to the driver.
So this design works only for request-based storage drivers.

Overall series supports:
========================
	1. Driver
		- NVMe Copy command (single NS, TP 4065), including support
		in nvme-target (for block and file backend).

	2. Block layer
		- Block-generic copy (REQ_OP_COPY_DST/SRC) operations, with an
		  interface accommodating two block devices
		- Merging of copy requests in the request layer
		- Emulation for in-kernel users when native offload is absent
		- dm-linear support (for cases not requiring a split)

	3. User-interface
		- copy_file_range

Testing
=======
	Copy offload can be tested on:
	a. QEMU: NVMe simple copy (TP 4065), by setting the nvme-ns
	   parameters mssrl, mcl and msrc. For more info see [2].
	b. Null block device
	c. NVMe Fabrics loopback
	d. blktests[3] (tests block/035-038, nvme/050-053)

	Emulation can be tested on any device.

	Copy offload can also be exercised using a modified fio[4].

Infra and plumbing:
===================
        We populate the copy_file_range callback in def_blk_fops.
        For devices that support copy offload, __blkdev_copy_offload is used
        to achieve an in-device copy.
        For cases where the device does not support offload, we fall back to
        generic_copy_file_range.
        For in-kernel users (like NVMe fabrics), where an fd is not
        available, blkdev_copy_offload is used, which implements its own
        emulation.
        The checks in generic_copy_file_range are modified to support block
        devices.
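
        As a rough illustration of the user-facing path (the device path
        below is only an example; both descriptors must be opened with
        O_DIRECT for the offload path to be attempted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int fd_in = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	int fd_out = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
	loff_t off_in = 0, off_out = 1 << 20;	/* copy first MiB to second MiB */
	ssize_t ret;

	if (fd_in < 0 || fd_out < 0)
		return 1;

	/* Offloaded to the device when both queues advertise copy support,
	 * otherwise serviced by generic_copy_file_range.
	 */
	ret = copy_file_range(fd_in, &off_in, fd_out, &off_out, 1 << 20, 0);
	printf("copied %zd bytes\n", ret);

	close(fd_in);
	close(fd_out);
	return ret < 0;
}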

Blktests[3]
======================
	tests/block/035,036: Run copy offload and emulation on a block device.
	tests/block/037,038: Run copy offload and emulation on a null block
			     device.
	tests/nvme/050-053:  Create a loop-backed fabrics device and run copy
			     offload and emulation.

Future Work
===========
	- loopback device copy offload support
	- upstream fio to use copy offload
	- upstream blktest to test copy offload

	These are to be taken up after this minimal series is agreed upon.

Additional links:
=================
	[0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
            https://lore.kernel.org/linux-nvme/f0e19ae4-b37a-e9a3-2be7-a5afb334a5c3@nvidia.com/
            https://lore.kernel.org/linux-nvme/20230113094648.15614-1-nj.shetty@samsung.com/
	[1] https://lore.kernel.org/linux-block/20230605121732.28468-1-nj.shetty@samsung.com/T/#m4db1801c86a5490dc736266609f8458fd52b9eb5
	[2] https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
	[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v13
	[4] https://github.com/vincentkfu/fio/commits/copyoffload-3.35-v13

Changes since v12:
=================
        - block,nvme: Replaced token based approach with request based
          single namespace capable approach (Christoph Hellwig)

Changes since v11:
=================
        - Documentation: Improved documentation (Damien Le Moal)
        - block,nvme: ssize_t return values (Darrick J. Wong)
        - block: token is allocated to SECTOR_SIZE (Matthew Wilcox)
        - block: mem leak fix (Maurizio Lombardi)

Changes since v10:
=================
        - NVMeOF: optimization in NVMe fabrics (Chaitanya Kulkarni)
        - NVMeOF: sparse warnings (kernel test robot)

Changes since v9:
=================
        - null_blk, improved documentation, minor fixes (Chaitanya Kulkarni)
        - fio, expanded testing and minor fixes (Vincent Fu)

Changes since v8:
=================
        - null_blk, copy_max_bytes_hw is made a configfs parameter
          (Damien Le Moal)
        - Negative error handling in copy_file_range (Christian Brauner)
        - minor fixes, better documentation (Damien Le Moal)
        - fio upgraded to 3.34 (Vincent Fu)

Changes since v7:
=================
        - null block copy offload support for testing (Damien Le Moal)
        - added a direct flag check for copy offload to block device,
          as generic_copy_file_range is used for cached cases.
        - Minor fixes

Changes since v6:
=================
        - copy_file_range instead of ioctl for direct block device
        - Remove support for multi range (vectored) copy
        - Remove ioctl interface for copy.
        - Remove offload support in dm kcopyd.

Changes since v5:
=================
	- Addition of blktests (Chaitanya Kulkarni)
        - Minor fix for fabrics file backed path
        - Remove buggy zonefs copy file range implementation.

Changes since v4:
=================
	- make the offload and emulation design asynchronous (Hannes
	  Reinecke)
	- fabrics loopback support
	- sysfs naming improvements (Damien Le Moal)
	- use kfree() instead of kvfree() in cio_await_completion
	  (Damien Le Moal)
	- use ranges instead of rlist to represent range_entry (Damien
	  Le Moal)
	- change argument ordering in blk_copy_offload as suggested (Damien
	  Le Moal)
	- removed multiple copy limit and merged into only one limit
	  (Damien Le Moal)
	- wrap overly long lines (Damien Le Moal)
	- other naming improvements and cleanups (Damien Le Moal)
	- correctly format the code example in description (Damien Le
	  Moal)
	- mark blk_copy_offload as static (kernel test robot)
	
Changes since v3:
=================
	- added copy_file_range support for zonefs
	- added documentation about new sysfs entries
	- incorporated review comments on v3
	- minor fixes

Changes since v2:
=================
	- fixed possible race condition reported by Damien Le Moal
	- new sysfs controls as suggested by Damien Le Moal
	- fixed possible memory leak reported by Dan Carpenter, lkp
	- minor fixes

Changes since v1:
=================
	- sysfs documentation (Greg KH)
        - 2 bios for copy operation (Bart Van Assche, Mikulas Patocka,
          Martin K. Petersen, Douglas Gilbert)
        - better payload design (Darrick J. Wong)


Nitesh Shetty (9):
  block: Introduce queue limits for copy-offload support
  block: Add copy offload support infrastructure
  block: add emulation for copy
  fs, block: copy_file_range for def_blk_ops for direct block device
  nvme: add copy offload support
  nvmet: add copy command support for bdev and file ns
  dm: Add support for copy offload
  dm: Enable copy offload for dm-linear target
  null_blk: add support for copy offload

 Documentation/ABI/stable/sysfs-block |  33 +++
 Documentation/block/null_blk.rst     |   5 +
 block/blk-core.c                     |   5 +
 block/blk-lib.c                      | 384 +++++++++++++++++++++++++++
 block/blk-map.c                      |   4 +-
 block/blk-merge.c                    |  21 ++
 block/blk-settings.c                 |  24 ++
 block/blk-sysfs.c                    |  63 +++++
 block/blk.h                          |   9 +
 block/elevator.h                     |   1 +
 block/fops.c                         |  20 ++
 drivers/block/null_blk/main.c        |  85 +++++-
 drivers/block/null_blk/null_blk.h    |   1 +
 drivers/md/dm-linear.c               |   1 +
 drivers/md/dm-table.c                |  41 +++
 drivers/md/dm.c                      |   7 +
 drivers/nvme/host/constants.c        |   1 +
 drivers/nvme/host/core.c             |  79 ++++++
 drivers/nvme/host/trace.c            |  19 ++
 drivers/nvme/target/admin-cmd.c      |   9 +-
 drivers/nvme/target/io-cmd-bdev.c    |  62 +++++
 drivers/nvme/target/io-cmd-file.c    |  52 ++++
 drivers/nvme/target/nvmet.h          |   1 +
 fs/read_write.c                      |   7 +-
 include/linux/bio.h                  |   4 +-
 include/linux/blk_types.h            |  26 ++
 include/linux/blkdev.h               |  23 ++
 include/linux/device-mapper.h        |   5 +
 include/linux/nvme.h                 |  43 ++-
 include/uapi/linux/fs.h              |   3 +
 30 files changed, 1025 insertions(+), 13 deletions(-)


base-commit: 53cdf865f90ba922a854c65ed05b519f9d728424
-- 
2.35.1.500.gb896f729e2



* [PATCH v13 1/9] block: Introduce queue limits for copy-offload support
       [not found]   ` <CGME20230627184000epcas5p1c7cb01eb1c70bc5a19f76ce21f2ec3f8@epcas5p1.samsung.com>
@ 2023-06-27 18:36     ` Nitesh Shetty
  2023-06-28  6:40       ` Damien Le Moal
                         ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-27 18:36 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Nitesh Shetty,
	Kanchan Joshi, Anuj Gupta, linux-block, linux-kernel, linux-doc,
	linux-nvme, linux-fsdevel

Add device limits as sysfs entries:
        - copy_offload (RW)
        - copy_max_bytes (RW)
        - copy_max_bytes_hw (RO)

The above limits help to split the copy payload in the block layer.
copy_offload: selects copy offload (1) or emulation (0).
copy_max_bytes: maximum total length of a copy in a single payload.
copy_max_bytes_hw: reflects the maximum limit supported by the device.
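
A rough sketch of how a driver is expected to advertise these limits (the
function name and the 4 MiB value here are only illustrative; null_blk in
patch 9 wires up something similar via configfs):

/* Hypothetical driver setup: advertise a 4 MiB per-command copy limit and
 * enable offload. blk_queue_max_copy_sectors_hw() also seeds the soft
 * limit exposed as copy_max_bytes, which userspace may lower via sysfs.
 */
static void example_enable_copy_offload(struct request_queue *q)
{
	blk_queue_max_copy_sectors_hw(q, (4 << 20) >> SECTOR_SHIFT);
	blk_queue_flag_set(QUEUE_FLAG_COPY, q);
}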

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 Documentation/ABI/stable/sysfs-block | 33 +++++++++++++++
 block/blk-settings.c                 | 24 +++++++++++
 block/blk-sysfs.c                    | 63 ++++++++++++++++++++++++++++
 include/linux/blkdev.h               | 12 ++++++
 include/uapi/linux/fs.h              |  3 ++
 5 files changed, 135 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..3c97303f658b 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,39 @@ Description:
 		last zone of the device which may be smaller.
 
 
+What:		/sys/block/<disk>/queue/copy_offload
+Date:		June 2023
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RW] When read, this file shows whether offloading copies to
+		the device is enabled (1) or disabled (0). Writing '0' to this
+		file will disable offloading copies for this device.
+		Writing '1' will enable this feature. If the device does
+		not support offloading, then writing '1' will result in an
+		error.
+
+
+What:		/sys/block/<disk>/queue/copy_max_bytes
+Date:		June 2023
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RW] This is the maximum number of bytes that the block layer
+		will allow for a copy request. This is always smaller than or
+		equal to the maximum size allowed by the hardware, indicated by
+		'copy_max_bytes_hw'. An attempt to set a value higher than
+		'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
+
+
+What:		/sys/block/<disk>/queue/copy_max_bytes_hw
+Date:		June 2023
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RO] This is the maximum number of bytes that the hardware
+		will allow for a single data copy request.
+		A value of 0 means that the device does not support
+		copy offload.
+
+
 What:		/sys/block/<disk>/queue/crypto/
 Date:		February 2022
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4dd59059b788..738cd3f21259 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->zoned = BLK_ZONED_NONE;
 	lim->zone_write_granularity = 0;
 	lim->dma_alignment = 511;
+	lim->max_copy_sectors_hw = 0;
+	lim->max_copy_sectors = 0;
 }
 
 /**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
 	lim->max_zone_append_sectors = UINT_MAX;
+	lim->max_copy_sectors_hw = UINT_MAX;
+	lim->max_copy_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+		unsigned int max_copy_sectors)
+{
+	if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
+		max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+	q->limits.max_copy_sectors_hw = max_copy_sectors;
+	q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_segment_size = min_not_zero(t->max_segment_size,
 					   b->max_segment_size);
 
+	t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+	t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
+						b->max_copy_sectors_hw);
+
 	t->misaligned |= b->misaligned;
 
 	alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index afc797fb0dfc..43551778d035 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -199,6 +199,62 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
 	return queue_var_show(0, page);
 }
 
+static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(blk_queue_copy(q), page);
+}
+
+static ssize_t queue_copy_offload_store(struct request_queue *q,
+				       const char *page, size_t count)
+{
+	unsigned long copy_offload;
+	ssize_t ret = queue_var_store(&copy_offload, page, count);
+
+	if (ret < 0)
+		return ret;
+
+	if (copy_offload && !q->limits.max_copy_sectors_hw)
+		return -EINVAL;
+
+	if (copy_offload)
+		blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+	else
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+
+	return count;
+}
+
+static ssize_t queue_copy_max_hw_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)
+			q->limits.max_copy_sectors_hw << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)
+			q->limits.max_copy_sectors << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_store(struct request_queue *q,
+				       const char *page, size_t count)
+{
+	unsigned long max_copy;
+	ssize_t ret = queue_var_store(&max_copy, page, count);
+
+	if (ret < 0)
+		return ret;
+
+	if (max_copy & (queue_logical_block_size(q) - 1))
+		return -EINVAL;
+
+	max_copy >>= SECTOR_SHIFT;
+	q->limits.max_copy_sectors = min_t(unsigned int, max_copy,
+					q->limits.max_copy_sectors_hw);
+
+	return count;
+}
+
 static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(0, page);
@@ -522,6 +578,10 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
 QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
 QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
 
+QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
+QUEUE_RO_ENTRY(queue_copy_max_hw, "copy_max_bytes_hw");
+QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
+
 QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
 QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
 QUEUE_RW_ENTRY(queue_poll, "io_poll");
@@ -638,6 +698,9 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_max_entry.attr,
 	&queue_discard_max_hw_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
+	&queue_copy_offload_entry.attr,
+	&queue_copy_max_hw_entry.attr,
+	&queue_copy_max_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
 	&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed44a997f629..6098665953e6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -309,6 +309,9 @@ struct queue_limits {
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
 
+	unsigned int		max_copy_sectors_hw;
+	unsigned int		max_copy_sectors;
+
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;
 	unsigned short		max_discard_segments;
@@ -554,6 +557,7 @@ struct request_queue {
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED     30	/* single queue style io dispatch */
 #define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skip the queue*/
+#define QUEUE_FLAG_COPY		32	/* enable/disable device copy offload */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |		\
 				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
@@ -574,6 +578,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 	test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags)
 #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
 #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
+#define blk_queue_copy(q)	test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
 #define blk_queue_zone_resetall(q)	\
 	test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
@@ -891,6 +896,8 @@ extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
 extern void blk_queue_max_segments(struct request_queue *, unsigned short);
 extern void blk_queue_max_discard_segments(struct request_queue *,
 		unsigned short);
+extern void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+		unsigned int max_copy_sectors);
 void blk_queue_max_secure_erase_sectors(struct request_queue *q,
 		unsigned int max_sectors);
 extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
@@ -1210,6 +1217,11 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
 	return bdev_get_queue(bdev)->limits.discard_granularity;
 }
 
+static inline unsigned int bdev_max_copy_sectors(struct block_device *bdev)
+{
+	return bdev_get_queue(bdev)->limits.max_copy_sectors;
+}
+
 static inline unsigned int
 bdev_max_secure_erase_sectors(struct block_device *bdev)
 {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..dff56813f95a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,9 @@ struct fstrim_range {
 	__u64 minlen;
 };
 
+/* maximum copy offload length, this is set to 128MB based on current testing */
+#define COPY_MAX_BYTES	(1 << 27)
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
-- 
2.35.1.500.gb896f729e2



* [PATCH v13 2/9] block: Add copy offload support infrastructure
       [not found]   ` <CGME20230627184010epcas5p4bb6581408d9b67bbbcad633fb26689c9@epcas5p4.samsung.com>
@ 2023-06-27 18:36     ` Nitesh Shetty
  2023-06-28  6:45       ` Damien Le Moal
  2023-07-20  7:42       ` Christoph Hellwig
  0 siblings, 2 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-27 18:36 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Nitesh Shetty,
	Anuj Gupta, linux-block, linux-kernel, linux-doc, linux-nvme,
	linux-fsdevel

Introduce blkdev_copy_offload, which takes arguments similar to
copy_file_range and performs copy offload between two bdevs.
Introduce the REQ_OP_COPY_DST and REQ_OP_COPY_SRC operations.
A REQ_OP_COPY_DST bio is issued with the destination info while holding a
plug; it flows down to the request layer and waits for the src bio to be
merged.
A REQ_OP_COPY_SRC bio is then issued with the source info; when it reaches
the request layer it is merged into the dst request.
If, for any reason, a request reaches the driver with only one of the
src/dst bios, the copy offload is failed.

Larger copies are divided based on the max_copy_sectors limit.
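
For reference, a minimal synchronous in-kernel caller would look roughly
like the sketch below (helper name hypothetical); with endio == NULL the
call sleeps until all split copies complete and returns the number of
bytes copied or a negative error:

static ssize_t example_bdev_copy(struct block_device *bdev, loff_t src,
				 loff_t dst, size_t len)
{
	/* same bdev as source and destination, synchronous completion */
	return blkdev_copy_offload(bdev, src, bdev, dst, len,
				   NULL, NULL, GFP_KERNEL);
}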

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 block/blk-core.c          |   5 ++
 block/blk-lib.c           | 177 ++++++++++++++++++++++++++++++++++++++
 block/blk-merge.c         |  21 +++++
 block/blk.h               |   9 ++
 block/elevator.h          |   1 +
 include/linux/bio.h       |   4 +-
 include/linux/blk_types.h |  21 +++++
 include/linux/blkdev.h    |   4 +
 8 files changed, 241 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 99d8b9812b18..e6714391c93f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -796,6 +796,11 @@ void submit_bio_noacct(struct bio *bio)
 		if (!q->limits.max_write_zeroes_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_COPY_SRC:
+	case REQ_OP_COPY_DST:
+		if (!blk_queue_copy(q))
+			goto not_supported;
+		break;
 	default:
 		break;
 	}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..10c3eadd5bf6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,183 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * blkdev_copy_(offload/emulate)_(read/write)_endio.
+ */
+static ssize_t blkdev_copy_wait_io_completion(struct cio *cio)
+{
+	ssize_t ret;
+
+	if (cio->endio)
+		return 0;
+
+	if (atomic_read(&cio->refcount)) {
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		blk_io_schedule();
+	}
+
+	ret = cio->comp_len;
+	kfree(cio);
+
+	return ret;
+}
+
+static void blkdev_copy_offload_read_endio(struct bio *bio)
+{
+	struct cio *cio = bio->bi_private;
+	sector_t clen;
+
+	if (bio->bi_status) {
+		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+		cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+	}
+	bio_put(bio);
+
+	if (!atomic_dec_and_test(&cio->refcount))
+		return;
+	if (cio->endio) {
+		cio->endio(cio->private, cio->comp_len);
+		kfree(cio);
+	} else
+		blk_wake_io_task(cio->waiter);
+}
+
+/*
+ * __blkdev_copy_offload	- Use device's native copy offload feature.
+ * We perform the copy operation by sending two bios:
+ * 1. Take a plug and send a REQ_OP_COPY_DST bio along with the destination
+ * sector and length. Once this bio reaches the request layer, a request is
+ * formed that waits for the src bio to arrive.
+ * 2. Issue a REQ_OP_COPY_SRC bio along with the source sector and length.
+ * Once this bio reaches the request layer and finds a request with the
+ * previously sent destination info, the source bio is merged and we return.
+ * 3. Release the plug; the request is then sent to the driver.
+ *
+ * Returns the number of bytes copied or a negative error if encountered
+ */
+static ssize_t __blkdev_copy_offload(
+		struct block_device *bdev_in, loff_t pos_in,
+		struct block_device *bdev_out, loff_t pos_out,
+		size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+	struct cio *cio;
+	struct bio *read_bio, *write_bio;
+	sector_t rem, copy_len, max_copy_len;
+	struct blk_plug plug;
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	atomic_set(&cio->refcount, 0);
+	cio->waiter = current;
+	cio->endio = endio;
+	cio->private = private;
+
+	max_copy_len = min(bdev_max_copy_sectors(bdev_in),
+			bdev_max_copy_sectors(bdev_out)) << SECTOR_SHIFT;
+
+	cio->pos_in = pos_in;
+	cio->pos_out = pos_out;
+	/* If there is an error, comp_len will be set to the smallest
+	 * successfully copied length
+	 */
+	cio->comp_len = len;
+	for (rem = len; rem > 0; rem -= copy_len) {
+		copy_len = min(rem, max_copy_len);
+
+		write_bio = bio_alloc(bdev_out, 0, REQ_OP_COPY_DST, gfp_mask);
+		if (!write_bio)
+			goto err_write_bio_alloc;
+		write_bio->bi_iter.bi_size = copy_len;
+		write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
+
+		blk_start_plug(&plug);
+		read_bio = blk_next_bio(write_bio, bdev_in, 0, REQ_OP_COPY_SRC,
+						gfp_mask);
+		read_bio->bi_iter.bi_size = copy_len;
+		read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+		read_bio->bi_end_io = blkdev_copy_offload_read_endio;
+		read_bio->bi_private = cio;
+
+		atomic_inc(&cio->refcount);
+		submit_bio(read_bio);
+		blk_finish_plug(&plug);
+		pos_in += copy_len;
+		pos_out += copy_len;
+	}
+
+	return blkdev_copy_wait_io_completion(cio);
+
+err_write_bio_alloc:
+	cio->comp_len = min_t(sector_t, cio->comp_len, (len - rem));
+	if (!atomic_read(&cio->refcount)) {
+		kfree(cio);
+		return -ENOMEM;
+	}
+	return blkdev_copy_wait_io_completion(cio);
+}
+
+static inline ssize_t blkdev_copy_sanity_check(
+	struct block_device *bdev_in, loff_t pos_in,
+	struct block_device *bdev_out, loff_t pos_out,
+	size_t len)
+{
+	unsigned int align = max(bdev_logical_block_size(bdev_out),
+					bdev_logical_block_size(bdev_in)) - 1;
+
+	if (bdev_read_only(bdev_out))
+		return -EPERM;
+
+	if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
+		len >= COPY_MAX_BYTES)
+		return -EINVAL;
+
+	return 0;
+}
+
+/*
+ * @bdev_in:	source block device
+ * @pos_in:	source offset
+ * @bdev_out:	destination block device
+ * @pos_out:	destination offset
+ * @len:	length in bytes to be copied
+ * @endio:	endio function to be called on completion of copy operation,
+ *		for synchronous operation this should be NULL
+ * @private:	endio function will be called with this private data, should be
+ *		NULL, if operation is synchronous in nature
+ * @gfp_mask:   memory allocation flags (for bio_alloc)
+ *
+ * Returns the length of bytes copied or error if encountered
+ *
+ * Description:
+ *	Copy source offset from source block device to destination block
+ *	device. If copy offload is not supported or fails, fallback to
+ *	emulation. Max total length of copy is limited to COPY_MAX_BYTES
+ */
+ssize_t blkdev_copy_offload(
+		struct block_device *bdev_in, loff_t pos_in,
+		struct block_device *bdev_out, loff_t pos_out,
+		size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+	struct request_queue *q_in = bdev_get_queue(bdev_in);
+	struct request_queue *q_out = bdev_get_queue(bdev_out);
+	ssize_t ret;
+
+	ret = blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len);
+	if (ret)
+		return ret;
+
+	if (blk_queue_copy(q_in) && blk_queue_copy(q_out))
+		ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
+			   len, endio, private, gfp_mask);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_offload);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 65e75efa9bd3..bfd86c54df22 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -922,6 +922,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (!rq_mergeable(rq) || !bio_mergeable(bio))
 		return false;
 
+	if ((req_op(rq) == REQ_OP_COPY_DST) && (bio_op(bio) == REQ_OP_COPY_SRC))
+		return true;
+
 	if (req_op(rq) != bio_op(bio))
 		return false;
 
@@ -951,6 +954,8 @@ enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
 {
 	if (blk_discard_mergable(rq))
 		return ELEVATOR_DISCARD_MERGE;
+	else if (blk_copy_offload_mergable(rq, bio))
+		return ELEVATOR_COPY_OFFLOAD_MERGE;
 	else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
 		return ELEVATOR_BACK_MERGE;
 	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
@@ -1053,6 +1058,20 @@ static enum bio_merge_status bio_attempt_discard_merge(struct request_queue *q,
 	return BIO_MERGE_FAILED;
 }
 
+static enum bio_merge_status bio_attempt_copy_offload_merge(
+	struct request_queue *q, struct request *req, struct bio *bio)
+{
+	if (req->__data_len != bio->bi_iter.bi_size)
+		return BIO_MERGE_FAILED;
+
+	req->biotail->bi_next = bio;
+	req->biotail = bio;
+	req->nr_phys_segments = blk_rq_nr_phys_segments(req) + 1;
+	req->__data_len += bio->bi_iter.bi_size;
+
+	return BIO_MERGE_OK;
+}
+
 static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
 						   struct request *rq,
 						   struct bio *bio,
@@ -1073,6 +1092,8 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
 		break;
 	case ELEVATOR_DISCARD_MERGE:
 		return bio_attempt_discard_merge(q, rq, bio);
+	case ELEVATOR_COPY_OFFLOAD_MERGE:
+		return bio_attempt_copy_offload_merge(q, rq, bio);
 	default:
 		return BIO_MERGE_NONE;
 	}
diff --git a/block/blk.h b/block/blk.h
index 608c5dcc516b..440bfa148461 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -156,6 +156,13 @@ static inline bool blk_discard_mergable(struct request *req)
 	return false;
 }
 
+static inline bool blk_copy_offload_mergable(struct request *req,
+					     struct bio *bio)
+{
+	return ((req_op(req) == REQ_OP_COPY_DST)  &&
+		(bio_op(bio) == REQ_OP_COPY_SRC));
+}
+
 static inline unsigned int blk_rq_get_max_segments(struct request *rq)
 {
 	if (req_op(rq) == REQ_OP_DISCARD)
@@ -303,6 +310,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
 		break;
 	}
 
+	if (unlikely(op_is_copy(bio->bi_opf)))
+		return false;
 	/*
 	 * All drivers must accept single-segments bios that are <= PAGE_SIZE.
 	 * This is a quick and dirty check that relies on the fact that
diff --git a/block/elevator.h b/block/elevator.h
index 7ca3d7b6ed82..eec442bbf384 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -18,6 +18,7 @@ enum elv_merge {
 	ELEVATOR_FRONT_MERGE	= 1,
 	ELEVATOR_BACK_MERGE	= 2,
 	ELEVATOR_DISCARD_MERGE	= 3,
+	ELEVATOR_COPY_OFFLOAD_MERGE	= 4,
 };
 
 struct blk_mq_alloc_data;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c4f5b5228105..a2673f24e493 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -57,7 +57,9 @@ static inline bool bio_has_data(struct bio *bio)
 	    bio->bi_iter.bi_size &&
 	    bio_op(bio) != REQ_OP_DISCARD &&
 	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
-	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
+	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+	    bio_op(bio) != REQ_OP_COPY_DST &&
+	    bio_op(bio) != REQ_OP_COPY_SRC)
 		return true;
 
 	return false;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 0bad62cca3d0..336146798e56 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -394,6 +394,9 @@ enum req_op {
 	/* reset all the zone present on the device */
 	REQ_OP_ZONE_RESET_ALL	= (__force blk_opf_t)17,
 
+	REQ_OP_COPY_SRC		= (__force blk_opf_t)18,
+	REQ_OP_COPY_DST		= (__force blk_opf_t)19,
+
 	/* Driver private requests */
 	REQ_OP_DRV_IN		= (__force blk_opf_t)34,
 	REQ_OP_DRV_OUT		= (__force blk_opf_t)35,
@@ -482,6 +485,12 @@ static inline bool op_is_write(blk_opf_t op)
 	return !!(op & (__force blk_opf_t)1);
 }
 
+static inline bool op_is_copy(blk_opf_t op)
+{
+	return (((op & REQ_OP_MASK) == REQ_OP_COPY_SRC) ||
+		((op & REQ_OP_MASK) == REQ_OP_COPY_DST));
+}
+
 /*
  * Check if the bio or request is one that needs special treatment in the
  * flush state machine.
@@ -541,4 +550,16 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+typedef void (cio_iodone_t)(void *private, int comp_len);
+
+struct cio {
+	struct task_struct *waiter;     /* waiting task (NULL if none) */
+	loff_t pos_in;
+	loff_t pos_out;
+	ssize_t comp_len;
+	cio_iodone_t *endio;		/* applicable for async operation */
+	void *private;			/* applicable for async operation */
+	atomic_t refcount;
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6098665953e6..963f5c97dec0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1043,6 +1043,10 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
 int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp);
+ssize_t blkdev_copy_offload(
+		struct block_device *bdev_in, loff_t pos_in,
+		struct block_device *bdev_out, loff_t pos_out,
+		size_t len, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
-- 
2.35.1.500.gb896f729e2



* [PATCH v13 3/9] block: add emulation for copy
       [not found]   ` <CGME20230627184020epcas5p13fdcea52edead5ffa3fae444f923439e@epcas5p1.samsung.com>
@ 2023-06-27 18:36     ` Nitesh Shetty
  2023-06-28  6:50       ` Damien Le Moal
                         ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-27 18:36 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Nitesh Shetty,
	Vincent Fu, Anuj Gupta, linux-block, linux-kernel, linux-doc,
	linux-nvme, linux-fsdevel

For devices which do not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where a file descriptor
is not available and hence copy_file_range cannot be used.
Copy emulation is implemented by reading from the source into memory and
writing to the corresponding destination asynchronously.
Emulation is also used if copy offload fails or completes partially.
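
A rough sketch of an asynchronous in-kernel caller using the endio/private
arguments introduced earlier (all names below are hypothetical, and the
sketch assumes endio is invoked exactly once per blkdev_copy_offload()
call):

struct example_copy_done {
	struct completion done;
	ssize_t copied;
};

static void example_copy_endio(void *private, int comp_len)
{
	struct example_copy_done *d = private;

	d->copied = comp_len;		/* bytes successfully copied */
	complete(&d->done);
}

static ssize_t example_copy_async(struct block_device *bdev, loff_t src,
				  loff_t dst, size_t len)
{
	struct example_copy_done d = { .copied = 0 };
	ssize_t ret;

	init_completion(&d.done);
	ret = blkdev_copy_offload(bdev, src, bdev, dst, len,
				  example_copy_endio, &d, GFP_KERNEL);
	if (ret < 0)			/* nothing was submitted */
		return ret;

	wait_for_completion(&d.done);	/* endio fires when all bios finish */
	return d.copied;
}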

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-lib.c           | 183 +++++++++++++++++++++++++++++++++++++-
 block/blk-map.c           |   4 +-
 include/linux/blk_types.h |   5 ++
 include/linux/blkdev.h    |   3 +
 4 files changed, 192 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 10c3eadd5bf6..09e0d5d51d03 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -234,6 +234,180 @@ static ssize_t __blkdev_copy_offload(
 	return blkdev_copy_wait_io_completion(cio);
 }
 
+static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
+		gfp_t gfp_mask)
+{
+	int min_size = PAGE_SIZE;
+	void *buf;
+
+	while (req_size >= min_size) {
+		buf = kvmalloc(req_size, gfp_mask);
+		if (buf) {
+			*alloc_size = req_size;
+			return buf;
+		}
+		/* retry half the requested size */
+		req_size >>= 1;
+	}
+
+	return NULL;
+}
+
+static void blkdev_copy_emulate_write_endio(struct bio *bio)
+{
+	struct copy_ctx *ctx = bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+
+	if (bio->bi_status) {
+		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+		cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+	}
+	kfree(bvec_virt(&bio->bi_io_vec[0]));
+	bio_map_kern_endio(bio);
+	kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->comp_len);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blkdev_copy_emulate_read_endio(struct bio *read_bio)
+{
+	struct copy_ctx *ctx = read_bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+
+	if (read_bio->bi_status) {
+		clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+						cio->pos_in;
+		cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+		kfree(bvec_virt(&read_bio->bi_io_vec[0]));
+		bio_map_kern_endio(read_bio);
+		kfree(ctx);
+
+		if (atomic_dec_and_test(&cio->refcount)) {
+			if (cio->endio) {
+				cio->endio(cio->private, cio->comp_len);
+				kfree(cio);
+			} else
+				blk_wake_io_task(cio->waiter);
+		}
+	}
+	schedule_work(&ctx->dispatch_work);
+	kfree(read_bio);
+}
+
+static void blkdev_copy_dispatch_work(struct work_struct *work)
+{
+	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+			dispatch_work);
+
+	submit_bio(ctx->write_bio);
+}
+
+/*
+ * If the native copy offload feature is absent, this function emulates the
+ * copy by reading data from the source into a temporary buffer and writing
+ * it to the destination device.
+ * Returns the number of bytes copied or a negative error if encountered
+ */
+static ssize_t __blkdev_copy_emulate(
+		struct block_device *bdev_in, loff_t pos_in,
+		struct block_device *bdev_out, loff_t pos_out,
+		size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+	struct request_queue *in = bdev_get_queue(bdev_in);
+	struct request_queue *out = bdev_get_queue(bdev_out);
+	struct bio *read_bio, *write_bio;
+	void *buf = NULL;
+	struct copy_ctx *ctx;
+	struct cio *cio;
+	sector_t buf_len, req_len, rem = 0;
+	sector_t max_src_hw_len = min_t(unsigned int,
+			queue_max_hw_sectors(in),
+			queue_max_segments(in) << (PAGE_SHIFT - SECTOR_SHIFT))
+			<< SECTOR_SHIFT;
+	sector_t max_dst_hw_len = min_t(unsigned int,
+		queue_max_hw_sectors(out),
+			queue_max_segments(out) << (PAGE_SHIFT - SECTOR_SHIFT))
+			<< SECTOR_SHIFT;
+	sector_t max_hw_len = min_t(unsigned int,
+			max_src_hw_len, max_dst_hw_len);
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	atomic_set(&cio->refcount, 0);
+	cio->pos_in = pos_in;
+	cio->pos_out = pos_out;
+	cio->waiter = current;
+	cio->endio = endio;
+	cio->private = private;
+
+	for (rem = len; rem > 0; rem -= buf_len) {
+		req_len = min_t(int, max_hw_len, rem);
+
+		buf = blkdev_copy_alloc_buf(req_len, &buf_len, gfp_mask);
+		if (!buf)
+			goto err_alloc_buf;
+
+		ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+		if (!ctx)
+			goto err_ctx;
+
+		read_bio = bio_map_kern(in, buf, buf_len, gfp_mask);
+		if (IS_ERR(read_bio))
+			goto err_read_bio;
+
+		write_bio = bio_map_kern(out, buf, buf_len, gfp_mask);
+		if (IS_ERR(write_bio))
+			goto err_write_bio;
+
+		ctx->cio = cio;
+		ctx->write_bio = write_bio;
+		INIT_WORK(&ctx->dispatch_work, blkdev_copy_dispatch_work);
+
+		read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+		read_bio->bi_iter.bi_size = buf_len;
+		read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+		bio_set_dev(read_bio, bdev_in);
+		read_bio->bi_end_io = blkdev_copy_emulate_read_endio;
+		read_bio->bi_private = ctx;
+
+		write_bio->bi_iter.bi_size = buf_len;
+		write_bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
+		bio_set_dev(write_bio, bdev_out);
+		write_bio->bi_end_io = blkdev_copy_emulate_write_endio;
+		write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
+		write_bio->bi_private = ctx;
+
+		atomic_inc(&cio->refcount);
+		submit_bio(read_bio);
+
+		pos_in += buf_len;
+		pos_out += buf_len;
+	}
+	return blkdev_copy_wait_io_completion(cio);
+
+err_write_bio:
+	bio_put(read_bio);
+err_read_bio:
+	kfree(ctx);
+err_ctx:
+	kvfree(buf);
+err_alloc_buf:
+	cio->comp_len -= min_t(sector_t, cio->comp_len, len - rem);
+	if (!atomic_read(&cio->refcount)) {
+		kfree(cio);
+		return -ENOMEM;
+	}
+	return blkdev_copy_wait_io_completion(cio);
+}
+
 static inline ssize_t blkdev_copy_sanity_check(
 	struct block_device *bdev_in, loff_t pos_in,
 	struct block_device *bdev_out, loff_t pos_out,
@@ -284,9 +458,16 @@ ssize_t blkdev_copy_offload(
 	if (ret)
 		return ret;
 
-	if (blk_queue_copy(q_in) && blk_queue_copy(q_out))
+	if (blk_queue_copy(q_in) && blk_queue_copy(q_out)) {
 		ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
 			   len, endio, private, gfp_mask);
+		if (ret < 0)
+			ret = 0;
+	}
+
+	if (ret != len)
+		ret = __blkdev_copy_emulate(bdev_in, pos_in + ret, bdev_out,
+			 pos_out + ret, len - ret, endio, private, gfp_mask);
 
 	return ret;
 }
diff --git a/block/blk-map.c b/block/blk-map.c
index 44d74a30ddac..ceeb70a95fd1 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -363,7 +363,7 @@ static void bio_invalidate_vmalloc_pages(struct bio *bio)
 #endif
 }
 
-static void bio_map_kern_endio(struct bio *bio)
+void bio_map_kern_endio(struct bio *bio)
 {
 	bio_invalidate_vmalloc_pages(bio);
 	bio_uninit(bio);
@@ -380,7 +380,7 @@ static void bio_map_kern_endio(struct bio *bio)
  *	Map the kernel address into a bio suitable for io to a block
  *	device. Returns an error pointer in case of error.
  */
-static struct bio *bio_map_kern(struct request_queue *q, void *data,
+struct bio *bio_map_kern(struct request_queue *q, void *data,
 		unsigned int len, gfp_t gfp_mask)
 {
 	unsigned long kaddr = (unsigned long)data;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 336146798e56..f8c80940c7ad 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -562,4 +562,9 @@ struct cio {
 	atomic_t refcount;
 };
 
+struct copy_ctx {
+	struct cio *cio;
+	struct work_struct dispatch_work;
+	struct bio *write_bio;
+};
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 963f5c97dec0..c176bf6173c5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1047,6 +1047,9 @@ ssize_t blkdev_copy_offload(
 		struct block_device *bdev_in, loff_t pos_in,
 		struct block_device *bdev_out, loff_t pos_out,
 		size_t len, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
+struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
+		gfp_t gfp_mask);
+void bio_map_kern_endio(struct bio *bio);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
-- 
2.35.1.500.gb896f729e2



* [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device
       [not found]   ` <CGME20230627184029epcas5p49a29676fa6dff5f24ddfa5c64e525a51@epcas5p4.samsung.com>
@ 2023-06-27 18:36     ` Nitesh Shetty
  2023-06-28  6:51       ` Damien Le Moal
  2023-07-20  7:57       ` Christoph Hellwig
  0 siblings, 2 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-27 18:36 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Nitesh Shetty,
	Anuj Gupta, linux-block, linux-kernel, linux-doc, linux-nvme,
	linux-fsdevel

For a block device opened with O_DIRECT, use copy_file_range to issue a
device copy offload, and fall back to generic_copy_file_range in case the
device copy offload capability is absent.
Modify the checks to allow bdevs to use copy_file_range.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 block/blk-lib.c        | 26 ++++++++++++++++++++++++++
 block/fops.c           | 20 ++++++++++++++++++++
 fs/read_write.c        |  7 +++++--
 include/linux/blkdev.h |  4 ++++
 4 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 09e0d5d51d03..7d8e09a99254 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -473,6 +473,32 @@ ssize_t blkdev_copy_offload(
 }
 EXPORT_SYMBOL_GPL(blkdev_copy_offload);
 
+/* Copy source offset from source block device to destination block
+ * device. Returns the length of bytes copied.
+ */
+ssize_t blkdev_copy_offload_failfast(
+		struct block_device *bdev_in, loff_t pos_in,
+		struct block_device *bdev_out, loff_t pos_out,
+		size_t len, gfp_t gfp_mask)
+{
+	struct request_queue *in_q = bdev_get_queue(bdev_in);
+	struct request_queue *out_q = bdev_get_queue(bdev_out);
+	ssize_t ret = 0;
+
+	if (blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len))
+		return 0;
+
+	if (blk_queue_copy(in_q) && blk_queue_copy(out_q)) {
+		ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
+				len, NULL, NULL, gfp_mask);
+		if (ret < 0)
+			return 0;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_offload_failfast);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/block/fops.c b/block/fops.c
index a286bf3325c5..a1576304f269 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -621,6 +621,25 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return ret;
 }
 
+static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				size_t len, unsigned int flags)
+{
+	struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
+	struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
+	ssize_t comp_len = 0;
+
+	if ((file_in->f_iocb_flags & IOCB_DIRECT) &&
+		(file_out->f_iocb_flags & IOCB_DIRECT))
+		comp_len = blkdev_copy_offload_failfast(in_bdev, pos_in,
+				out_bdev, pos_out, len, GFP_KERNEL);
+	if (comp_len != len)
+		comp_len = generic_copy_file_range(file_in, pos_in + comp_len,
+			file_out, pos_out + comp_len, len - comp_len, flags);
+
+	return comp_len;
+}
+
 #define	BLKDEV_FALLOC_FL_SUPPORTED					\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
@@ -714,6 +733,7 @@ const struct file_operations def_blk_fops = {
 	.splice_read	= filemap_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= blkdev_fallocate,
+	.copy_file_range = blkdev_copy_file_range,
 };
 
 static __init int blkdev_init(void)
diff --git a/fs/read_write.c b/fs/read_write.c
index b07de77ef126..d27148a2543f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1447,7 +1447,8 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 		return -EOVERFLOW;
 
 	/* Shorten the copy to EOF */
-	size_in = i_size_read(inode_in);
+	size_in = i_size_read(file_in->f_mapping->host);
+
 	if (pos_in >= size_in)
 		count = 0;
 	else
@@ -1708,7 +1709,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
 	/* Don't copy dirs, pipes, sockets... */
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
 		return -EISDIR;
-	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+
+	if ((!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) &&
+		(!S_ISBLK(inode_in->i_mode) || !S_ISBLK(inode_out->i_mode)))
 		return -EINVAL;
 
 	if (!(file_in->f_mode & FMODE_READ) ||
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c176bf6173c5..850168cad080 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1047,6 +1047,10 @@ ssize_t blkdev_copy_offload(
 		struct block_device *bdev_in, loff_t pos_in,
 		struct block_device *bdev_out, loff_t pos_out,
 		size_t len, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
+ssize_t blkdev_copy_offload_failfast(
+		struct block_device *bdev_in, loff_t pos_in,
+		struct block_device *bdev_out, loff_t pos_out,
+		size_t len, gfp_t gfp_mask);
 struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
 		gfp_t gfp_mask);
 void bio_map_kern_endio(struct bio *bio);
-- 
2.35.1.500.gb896f729e2



* [PATCH v13 5/9] nvme: add copy offload support
       [not found]   ` <CGME20230627184039epcas5p2decb92731d3e7dfdf9f2c05309a90bd7@epcas5p2.samsung.com>
@ 2023-06-27 18:36     ` Nitesh Shetty
  2023-07-20  8:00       ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-27 18:36 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Nitesh Shetty,
	Kanchan Joshi, Javier González, Anuj Gupta, linux-block,
	linux-kernel, linux-doc, linux-nvme, linux-fsdevel

The current design only supports a single source range.
We receive a request with REQ_OP_COPY_DST, parse this request, which
consists of the dst (1st) and src (2nd) bios, and form a copy command
(TP 4065).

Add trace event support for nvme_cmd_copy.
Set the device copy limits in the queue limits.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/host/constants.c |  1 +
 drivers/nvme/host/core.c      | 79 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/trace.c     | 19 +++++++++
 include/linux/nvme.h          | 43 +++++++++++++++++--
 4 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index 5e4f8848dce0..311ad67e9cf3 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
 	[nvme_cmd_resv_report] = "Reservation Report",
 	[nvme_cmd_resv_acquire] = "Reservation Acquire",
 	[nvme_cmd_resv_release] = "Reservation Release",
+	[nvme_cmd_copy] = "Copy Offload",
 	[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
 	[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
 	[nvme_cmd_zone_append] = "Zone Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 98bfb3d9c22a..d4063e981492 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -763,6 +763,60 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline blk_status_t nvme_setup_copy_write(struct nvme_ns *ns,
+	       struct request *req, struct nvme_command *cmnd)
+{
+	struct nvme_copy_range *range = NULL;
+	struct bio *bio;
+	u64 dst_lba, src_lba, n_lba;
+	u16 nr_range = 1, control = 0;
+
+	if (blk_rq_nr_phys_segments(req) != 2)
+		return BLK_STS_IOERR;
+
+	/* +1 shift because dst and src lengths are summed during request
+	 * merging; the copy is issued for half that length.
+	 */
+	n_lba = blk_rq_bytes(req) >> (ns->lba_shift + 1);
+	if (WARN_ON(!n_lba))
+		return BLK_STS_NOTSUPP;
+
+	dst_lba = nvme_sect_to_lba(ns, blk_rq_pos(req));
+	__rq_for_each_bio(bio, req) {
+		src_lba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
+		if (n_lba != bio->bi_iter.bi_size >> ns->lba_shift)
+			return BLK_STS_IOERR;
+	}
+
+	if (req->cmd_flags & REQ_FUA)
+		control |= NVME_RW_FUA;
+
+	if (req->cmd_flags & REQ_FAILFAST_DEV)
+		control |= NVME_RW_LR;
+
+	memset(cmnd, 0, sizeof(*cmnd));
+	cmnd->copy.opcode = nvme_cmd_copy;
+	cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+	cmnd->copy.control = cpu_to_le16(control);
+	cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+	cmnd->copy.nr_range = 0;
+
+	range = kmalloc_array(nr_range, sizeof(*range),
+			GFP_ATOMIC | __GFP_NOWARN);
+	if (!range)
+		return BLK_STS_RESOURCE;
+
+	range[0].slba = cpu_to_le64(src_lba);
+	range[0].nlb = cpu_to_le16(n_lba - 1);
+
+	req->special_vec.bv_page = virt_to_page(range);
+	req->special_vec.bv_offset = offset_in_page(range);
+	req->special_vec.bv_len = sizeof(*range) * nr_range;
+	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+	return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *cmnd)
 {
@@ -1005,6 +1059,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 	case REQ_OP_ZONE_APPEND:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
 		break;
+	case REQ_OP_COPY_DST:
+		ret = nvme_setup_copy_write(ns, req, cmd);
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		return BLK_STS_IOERR;
@@ -1742,6 +1799,26 @@ static void nvme_config_discard(struct gendisk *disk, struct nvme_ns *ns)
 		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+				       struct nvme_id_ns *id)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct request_queue *q = disk->queue;
+
+	if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
+		blk_queue_max_copy_sectors_hw(q, 0);
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+		return;
+	}
+
+	/* setting copy limits */
+	if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, q))
+		return;
+
+	blk_queue_max_copy_sectors_hw(q,
+		nvme_lba_to_sect(ns, le16_to_cpu(id->mssrl)));
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
 	return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1941,6 +2018,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	set_capacity_and_notify(disk, capacity);
 
 	nvme_config_discard(disk, ns);
+	nvme_config_copy(disk, ns, id);
 	blk_queue_max_write_zeroes_sectors(disk->queue,
 					   ns->ctrl->max_zeroes_sectors);
 }
@@ -4600,6 +4678,7 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_copy_command) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64);
diff --git a/drivers/nvme/host/trace.c b/drivers/nvme/host/trace.c
index 1c36fcedea20..da4a7494e5a7 100644
--- a/drivers/nvme/host/trace.c
+++ b/drivers/nvme/host/trace.c
@@ -150,6 +150,23 @@ static const char *nvme_trace_read_write(struct trace_seq *p, u8 *cdw10)
 	return ret;
 }
 
+static const char *nvme_trace_copy(struct trace_seq *p, u8 *cdw10)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	u64 slba = get_unaligned_le64(cdw10);
+	u8 nr_range = get_unaligned_le16(cdw10 + 8);
+	u16 control = get_unaligned_le16(cdw10 + 10);
+	u32 dsmgmt = get_unaligned_le32(cdw10 + 12);
+	u32 reftag = get_unaligned_le32(cdw10 +  16);
+
+	trace_seq_printf(p,
+			 "slba=%llu, nr_range=%u, ctrl=0x%x, dsmgmt=%u, reftag=%u",
+			 slba, nr_range, control, dsmgmt, reftag);
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
 static const char *nvme_trace_dsm(struct trace_seq *p, u8 *cdw10)
 {
 	const char *ret = trace_seq_buffer_ptr(p);
@@ -243,6 +260,8 @@ const char *nvme_trace_parse_nvm_cmd(struct trace_seq *p,
 		return nvme_trace_zone_mgmt_send(p, cdw10);
 	case nvme_cmd_zone_mgmt_recv:
 		return nvme_trace_zone_mgmt_recv(p, cdw10);
+	case nvme_cmd_copy:
+		return nvme_trace_copy(p, cdw10);
 	default:
 		return nvme_trace_common(p, cdw10);
 	}
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 182b6d614eb1..bbd877111b57 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -337,7 +337,7 @@ struct nvme_id_ctrl {
 	__u8			nvscc;
 	__u8			nwpc;
 	__le16			acwu;
-	__u8			rsvd534[2];
+	__le16			ocfs;
 	__le32			sgls;
 	__le32			mnan;
 	__u8			rsvd544[224];
@@ -365,6 +365,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_ONCS_RESERVATIONS		= 1 << 5,
 	NVME_CTRL_ONCS_TIMESTAMP		= 1 << 6,
+	NVME_CTRL_ONCS_COPY			= 1 << 8,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
 	NVME_CTRL_OACS_NS_MNGT_SUPP		= 1 << 3,
@@ -414,7 +415,10 @@ struct nvme_id_ns {
 	__le16			npdg;
 	__le16			npda;
 	__le16			nows;
-	__u8			rsvd74[18];
+	__le16			mssrl;
+	__le32			mcl;
+	__u8			msrc;
+	__u8			rsvd91[11];
 	__le32			anagrpid;
 	__u8			rsvd96[3];
 	__u8			nsattr;
@@ -831,6 +835,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
 	nvme_cmd_resv_release	= 0x15,
+	nvme_cmd_copy		= 0x19,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
 	nvme_cmd_zone_append	= 0x7d,
@@ -854,7 +859,8 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_release),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
-		nvme_opcode_name(nvme_cmd_zone_append))
+		nvme_opcode_name(nvme_cmd_zone_append),		\
+		nvme_opcode_name(nvme_cmd_copy))
 
 
 
@@ -1031,6 +1037,36 @@ struct nvme_dsm_range {
 	__le64			slba;
 };
 
+struct nvme_copy_command {
+	__u8                    opcode;
+	__u8                    flags;
+	__u16                   command_id;
+	__le32                  nsid;
+	__u64                   rsvd2;
+	__le64                  metadata;
+	union nvme_data_ptr     dptr;
+	__le64                  sdlba;
+	__u8			nr_range;
+	__u8			rsvd12;
+	__le16                  control;
+	__le16                  rsvd13;
+	__le16			dspec;
+	__le32                  ilbrt;
+	__le16                  lbat;
+	__le16                  lbatm;
+};
+
+struct nvme_copy_range {
+	__le64			rsvd0;
+	__le64			slba;
+	__le16			nlb;
+	__le16			rsvd18;
+	__le32			rsvd20;
+	__le32			eilbrt;
+	__le16			elbat;
+	__le16			elbatm;
+};
+
 struct nvme_write_zeroes_cmd {
 	__u8			opcode;
 	__u8			flags;
@@ -1792,6 +1828,7 @@ struct nvme_command {
 		struct nvme_download_firmware dlfw;
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
+		struct nvme_copy_command copy;
 		struct nvme_write_zeroes_cmd write_zeroes;
 		struct nvme_zone_mgmt_send_cmd zms;
 		struct nvme_zone_mgmt_recv_cmd zmr;
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v13 6/9] nvmet: add copy command support for bdev and file ns
       [not found]   ` <CGME20230627184049epcas5p293a6e6b75c93e39c7fca1a702e3e3774@epcas5p2.samsung.com>
@ 2023-06-27 18:36     ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-27 18:36 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Nitesh Shetty,
	Anuj Gupta, linux-block, linux-kernel, linux-doc, linux-nvme,
	linux-fsdevel

Add support for handling the nvme_cmd_copy command on the target.
For bdev-ns we call into blkdev_copy_offload, which the block layer
completes either by sending an offloaded copy request to the backing
bdev or by emulating the request.

For file-ns we call vfs_copy_file_range to service our request.

Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller's ONCS field.

The loop target has copy support, which can be used to test copy offload.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/target/admin-cmd.c   |  9 ++++-
 drivers/nvme/target/io-cmd-bdev.c | 62 +++++++++++++++++++++++++++++++
 drivers/nvme/target/io-cmd-file.c | 52 ++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h       |  1 +
 4 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 39cb570f833d..8e644b8ec0fd 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 	id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-			NVME_CTRL_ONCS_WRITE_ZEROES);
-
+			NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
 	/* XXX: don't report vwc if the underlying device is write through */
 	id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
 
 	if (req->ns->bdev)
 		nvmet_bdev_set_limits(req->ns->bdev, id);
+	else {
+		id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+		id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+				(PAGE_SHIFT - SECTOR_SHIFT));
+		id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+	}
 
 	/*
 	 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 2733e0158585..5c4c6a460cfa 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,18 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 	id->npda = id->npdg;
 	/* NOWS = Namespace Optimal Write Size */
 	id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+	if (bdev_max_copy_sectors(bdev)) {
+		id->msrc = id->msrc;
+		id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+				SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+		id->mcl = cpu_to_le32((__force u32)id->mssrl);
+	} else {
+		id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+		id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+				bdev_logical_block_size(bdev));
+		id->mcl = cpu_to_le32((__force u32)id->mssrl);
+	}
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -184,6 +196,21 @@ static void nvmet_bio_done(struct bio *bio)
 	nvmet_req_bio_put(req, bio);
 }
 
+static void nvmet_bdev_copy_end_io(void *private, int comp_len)
+{
+	struct nvmet_req *req = (struct nvmet_req *)private;
+	u16 status;
+
+	if (comp_len == req->copy_len) {
+		req->cqe->result.u32 = cpu_to_le32(1);
+		status = errno_to_nvme_status(req, 0);
+	} else {
+		req->cqe->result.u32 = cpu_to_le32(0);
+		status = errno_to_nvme_status(req, (__force u16)BLK_STS_IOERR);
+	}
+	nvmet_req_complete(req, status);
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
 				struct sg_mapping_iter *miter)
@@ -450,6 +477,37 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
 	}
 }
 
+/* At present we handle only one range entry: since copy offload is aligned
+ * with copy_file_range, only one entry is passed from the block layer.
+ */
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+	struct nvme_copy_range range;
+	struct nvme_command *cmd = req->cmd;
+	ssize_t ret;
+	u16 status;
+
+	status = nvmet_copy_from_sgl(req, 0, &range, sizeof(range));
+	if (status)
+		goto out;
+
+	ret = blkdev_copy_offload(req->ns->bdev,
+		le64_to_cpu(cmd->copy.sdlba) << req->ns->blksize_shift,
+		req->ns->bdev,
+		le64_to_cpu(range.slba) << req->ns->blksize_shift,
+		(le16_to_cpu(range.nlb) + 1) << req->ns->blksize_shift,
+		nvmet_bdev_copy_end_io, (void *)req, GFP_KERNEL);
+	if (ret) {
+		req->cqe->result.u32 = cpu_to_le32(0);
+		status = blk_to_nvme_status(req, BLK_STS_IOERR);
+		goto out;
+	}
+
+	return;
+out:
+	nvmet_req_complete(req, errno_to_nvme_status(req, status));
+}
+
 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 {
 	switch (req->cmd->common.opcode) {
@@ -468,6 +526,10 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_bdev_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_copy:
+		req->execute = nvmet_bdev_execute_copy;
+		return 0;
+
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 2d068439b129..f61aa834f7a5 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -322,6 +322,49 @@ static void nvmet_file_dsm_work(struct work_struct *w)
 	}
 }
 
+static void nvmet_file_copy_work(struct work_struct *w)
+{
+	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
+	int nr_range = req->cmd->copy.nr_range + 1;
+	u16 status = 0;
+	int src, id;
+	ssize_t len, ret;
+	loff_t pos;
+
+	pos = le64_to_cpu(req->cmd->copy.sdlba) << req->ns->blksize_shift;
+	if (unlikely(pos + req->transfer_len > req->ns->size)) {
+		nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
+		return;
+	}
+
+	for (id = 0 ; id < nr_range; id++) {
+		struct nvme_copy_range range;
+
+		status = nvmet_copy_from_sgl(req, id * sizeof(range), &range,
+					sizeof(range));
+		if (status)
+			goto out;
+
+		src = (le64_to_cpu(range.slba) << (req->ns->blksize_shift));
+		len = (le16_to_cpu(range.nlb) + 1) << (req->ns->blksize_shift);
+		ret = vfs_copy_file_range(req->ns->file, src, req->ns->file,
+					pos, len, 0);
+		if (ret != len) {
+			pos += ret;
+			req->cqe->result.u32 = cpu_to_le32(id);
+			if (ret < 0)
+				status = errno_to_nvme_status(req, ret);
+			else
+				status = errno_to_nvme_status(req, -EIO);
+			goto out;
+		} else
+			pos += len;
+	}
+
+out:
+	nvmet_req_complete(req, status);
+}
+
 static void nvmet_file_execute_dsm(struct nvmet_req *req)
 {
 	if (!nvmet_check_data_len_lte(req, nvmet_dsm_len(req)))
@@ -330,6 +373,12 @@ static void nvmet_file_execute_dsm(struct nvmet_req *req)
 	queue_work(nvmet_wq, &req->f.work);
 }
 
+static void nvmet_file_execute_copy(struct nvmet_req *req)
+{
+	INIT_WORK(&req->f.work, nvmet_file_copy_work);
+	queue_work(nvmet_wq, &req->f.work);
+}
+
 static void nvmet_file_write_zeroes_work(struct work_struct *w)
 {
 	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
@@ -376,6 +425,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_file_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_copy:
+		req->execute = nvmet_file_execute_copy;
+		return 0;
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 6cf723bc664e..c6fb515ee1c5 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -393,6 +393,7 @@ struct nvmet_req {
 	struct device		*p2p_client;
 	u16			error_loc;
 	u64			error_slba;
+	size_t			copy_len;
 };
 
 #define NVMET_MAX_MPOOL_BVEC		16
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v13 7/9] dm: Add support for copy offload
       [not found]   ` <CGME20230627184058epcas5p2226835b15381b856859b162e58572d63@epcas5p2.samsung.com>
@ 2023-06-27 18:36     ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-27 18:36 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Nitesh Shetty,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

Before enabling copy for a dm target, check whether the underlying
devices and the dm target support copy. Avoid splits happening inside
the dm target: fail early if the request needs a split, since splitting
a copy request is currently not supported.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/md/dm-table.c         | 41 +++++++++++++++++++++++++++++++++++
 drivers/md/dm.c               |  7 ++++++
 include/linux/device-mapper.h |  5 +++++
 3 files changed, 53 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 7d208b2b1a19..2d08a890d7e1 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1862,6 +1862,39 @@ static bool dm_table_supports_nowait(struct dm_table *t)
 	return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+				      sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return !blk_queue_copy(q);
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+	struct dm_target *ti;
+	unsigned int i;
+
+	for (i = 0; i < t->num_targets; i++) {
+		ti = dm_table_get_target(t, i);
+
+		if (!ti->copy_offload_supported)
+			return false;
+
+		/*
+		 * target provides copy support (as implied by setting
+		 * 'copy_offload_supported')
+		 * and it relies on _all_ data devices having copy support.
+		 */
+		if (!ti->type->iterate_devices ||
+		     ti->type->iterate_devices(ti,
+			     device_not_copy_capable, NULL))
+			return false;
+	}
+
+	return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
 				      sector_t start, sector_t len, void *data)
 {
@@ -1944,6 +1977,14 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 		q->limits.discard_misaligned = 0;
 	}
 
+	if (!dm_table_supports_copy(t)) {
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+		q->limits.max_copy_sectors = 0;
+		q->limits.max_copy_sectors_hw = 0;
+	} else {
+		blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+	}
+
 	if (!dm_table_supports_secure_erase(t))
 		q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f0f118ab20fa..6245e16bf066 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1732,6 +1732,13 @@ static blk_status_t __split_and_process_bio(struct clone_info *ci)
 	if (unlikely(ci->is_abnormal_io))
 		return __process_abnormal_io(ci, ti);
 
+	if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+			max_io_len(ti, ci->sector) < ci->sector_count)) {
+		DMERR("Error, IO size(%u) > max target size(%llu)\n",
+			ci->sector_count, max_io_len(ti, ci->sector));
+		return BLK_STS_IOERR;
+	}
+
 	/*
 	 * Only support bio polling for normal IO, and the target io is
 	 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 69d0435c7ebb..8ffee7e8cd06 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -396,6 +396,11 @@ struct dm_target {
 	 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 	 */
 	bool needs_bio_set_dev:1;
+
+	/*
+	 * copy offload is supported
+	 */
+	bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v13 8/9] dm: Enable copy offload for dm-linear target
       [not found]   ` <CGME20230627184107epcas5p3e01453c42bafa3ba08b8c8ba183927e6@epcas5p3.samsung.com>
@ 2023-06-27 18:36     ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-27 18:36 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Nitesh Shetty,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

Set the copy_offload_supported flag to enable offload.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/md/dm-linear.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index f4448d520ee9..1d1ee30bbefb 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->num_discard_bios = 1;
 	ti->num_secure_erase_bios = 1;
 	ti->num_write_zeroes_bios = 1;
+	ti->copy_offload_supported = 1;
 	ti->private = lc;
 	return 0;
 
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v13 9/9] null_blk: add support for copy offload
       [not found]   ` <CGME20230627184117epcas5p3a9102988870743b20127422928f072bd@epcas5p3.samsung.com>
@ 2023-06-27 18:36     ` Nitesh Shetty
  2023-06-28 12:11       ` kernel test robot
  2023-06-28 12:52       ` kernel test robot
  0 siblings, 2 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-27 18:36 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Nitesh Shetty,
	Damien Le Moal, Anuj Gupta, Vincent Fu, linux-block,
	linux-kernel, linux-doc, linux-nvme, linux-fsdevel

Implementation is based on the existing read and write infrastructure.
copy_max_bytes: a new configfs and module parameter is introduced, which
can be used to set the hardware/driver supported maximum copy limit.

Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
---
 Documentation/block/null_blk.rst  |  5 ++
 drivers/block/null_blk/main.c     | 85 +++++++++++++++++++++++++++++--
 drivers/block/null_blk/null_blk.h |  1 +
 3 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst
index 4dd78f24d10a..6153e02fcf13 100644
--- a/Documentation/block/null_blk.rst
+++ b/Documentation/block/null_blk.rst
@@ -149,3 +149,8 @@ zone_size=[MB]: Default: 256
 zone_nr_conv=[nr_conv]: Default: 0
   The number of conventional zones to create when block device is zoned.  If
   zone_nr_conv >= nr_zones, it will be reduced to nr_zones - 1.
+
+copy_max_bytes=[size in bytes]: Default: COPY_MAX_BYTES
+  A module and configfs parameter which can be used to set the hardware/driver
+  supported maximum copy offload limit.
+  COPY_MAX_BYTES (128 MB at present) is defined in include/uapi/linux/fs.h.
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 864013019d6b..e9461bd4dc2c 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -157,6 +157,10 @@ static int g_max_sectors;
 module_param_named(max_sectors, g_max_sectors, int, 0444);
 MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
 
+static unsigned long g_copy_max_bytes = COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
 static unsigned int nr_devices = 1;
 module_param(nr_devices, uint, 0444);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -409,6 +413,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
 NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
 NULLB_DEVICE_ATTR(blocksize, uint, NULL);
 NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
 NULLB_DEVICE_ATTR(irqmode, uint, NULL);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
 NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -550,6 +555,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_queue_mode,
 	&nullb_device_attr_blocksize,
 	&nullb_device_attr_max_sectors,
+	&nullb_device_attr_copy_max_bytes,
 	&nullb_device_attr_irqmode,
 	&nullb_device_attr_hw_queue_depth,
 	&nullb_device_attr_index,
@@ -656,7 +662,8 @@ static ssize_t memb_group_features_show(struct config_item *item, char *page)
 			"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
 			"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
 			"zone_capacity,zone_max_active,zone_max_open,"
-			"zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+			"zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+			"copy_max_bytes\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -722,6 +729,7 @@ static struct nullb_device *null_alloc_dev(void)
 	dev->queue_mode = g_queue_mode;
 	dev->blocksize = g_bs;
 	dev->max_sectors = g_max_sectors;
+	dev->copy_max_bytes = g_copy_max_bytes;
 	dev->irqmode = g_irqmode;
 	dev->hw_queue_depth = g_hw_queue_depth;
 	dev->blocking = g_blocking;
@@ -1271,6 +1279,67 @@ static int null_transfer(struct nullb *nullb, struct page *page,
 	return err;
 }
 
+static inline int nullb_setup_copy_write(struct nullb *nullb,
+		struct request *req, bool is_fua)
+{
+	sector_t sector_in, sector_out;
+	void *in, *out;
+	size_t rem, temp;
+	struct bio *bio;
+	unsigned long offset_in, offset_out;
+	struct nullb_page *t_page_in, *t_page_out;
+	int ret = -EIO;
+
+	sector_out = blk_rq_pos(req);
+
+	__rq_for_each_bio(bio, req) {
+		sector_in = bio->bi_iter.bi_sector;
+		rem = bio->bi_iter.bi_size;
+	}
+
+	if (WARN_ON(!rem))
+		return BLK_STS_NOTSUPP;
+
+	spin_lock_irq(&nullb->lock);
+	while (rem > 0) {
+		temp = min_t(size_t, nullb->dev->blocksize, rem);
+		offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
+		offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
+
+		if (null_cache_active(nullb) && !is_fua)
+			null_make_cache_space(nullb, PAGE_SIZE);
+
+		t_page_in = null_lookup_page(nullb, sector_in, false,
+			!null_cache_active(nullb));
+		if (!t_page_in)
+			goto err;
+		t_page_out = null_insert_page(nullb, sector_out,
+			!null_cache_active(nullb) || is_fua);
+		if (!t_page_out)
+			goto err;
+
+		in = kmap_local_page(t_page_in->page);
+		out = kmap_local_page(t_page_out->page);
+
+		memcpy(out + offset_out, in + offset_in, temp);
+		kunmap_local(out);
+		kunmap_local(in);
+		__set_bit(sector_out & SECTOR_MASK, t_page_out->bitmap);
+
+		if (is_fua)
+			null_free_sector(nullb, sector_out, true);
+
+		rem -= temp;
+		sector_in += temp >> SECTOR_SHIFT;
+		sector_out += temp >> SECTOR_SHIFT;
+	}
+
+	ret = 0;
+err:
+	spin_unlock_irq(&nullb->lock);
+	return ret;
+}
+
 static int null_handle_rq(struct nullb_cmd *cmd)
 {
 	struct request *rq = cmd->rq;
@@ -1280,13 +1349,16 @@ static int null_handle_rq(struct nullb_cmd *cmd)
 	sector_t sector = blk_rq_pos(rq);
 	struct req_iterator iter;
 	struct bio_vec bvec;
+	bool fua = rq->cmd_flags & REQ_FUA;
+
+	if (op_is_copy(req_op(rq)))
+		return nullb_setup_copy_write(nullb, rq, fua);
 
 	spin_lock_irq(&nullb->lock);
 	rq_for_each_segment(bvec, rq, iter) {
 		len = bvec.bv_len;
 		err = null_transfer(nullb, bvec.bv_page, len, bvec.bv_offset,
-				     op_is_write(req_op(rq)), sector,
-				     rq->cmd_flags & REQ_FUA);
+				     op_is_write(req_op(rq)), sector, fua);
 		if (err) {
 			spin_unlock_irq(&nullb->lock);
 			return err;
@@ -2042,6 +2114,9 @@ static int null_validate_conf(struct nullb_device *dev)
 		return -EINVAL;
 	}
 
+	if (dev->queue_mode == NULL_Q_BIO)
+		dev->copy_max_bytes = 0;
+
 	return 0;
 }
 
@@ -2161,6 +2236,10 @@ static int null_add_dev(struct nullb_device *dev)
 		dev->max_sectors = queue_max_hw_sectors(nullb->q);
 	dev->max_sectors = min(dev->max_sectors, BLK_DEF_MAX_SECTORS);
 	blk_queue_max_hw_sectors(nullb->q, dev->max_sectors);
+	blk_queue_max_copy_sectors_hw(nullb->q,
+			       dev->copy_max_bytes >> SECTOR_SHIFT);
+	if (dev->copy_max_bytes)
+		blk_queue_flag_set(QUEUE_FLAG_COPY, nullb->disk->queue);
 
 	if (dev->virt_boundary)
 		blk_queue_virt_boundary(nullb->q, PAGE_SIZE - 1);
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 929f659dd255..e82e53a2e2df 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -107,6 +107,7 @@ struct nullb_device {
 	unsigned int queue_mode; /* block interface */
 	unsigned int blocksize; /* block size */
 	unsigned int max_sectors; /* Max sectors per command */
+	unsigned long copy_max_bytes; /* Max copy offload length in bytes */
 	unsigned int irqmode; /* IRQ completion handler */
 	unsigned int hw_queue_depth; /* queue depth */
 	unsigned int index; /* index of the disk, only valid with a disk */
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 1/9] block: Introduce queue limits for copy-offload support
  2023-06-27 18:36     ` [PATCH v13 1/9] block: Introduce queue limits for copy-offload support Nitesh Shetty
@ 2023-06-28  6:40       ` Damien Le Moal
  2023-06-28 15:35         ` Nitesh Shetty
  2023-07-20  7:06       ` Christoph Hellwig
  2023-07-20  7:58       ` Christoph Hellwig
  2 siblings, 1 reply; 32+ messages in thread
From: Damien Le Moal @ 2023-06-28  6:40 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, dm-devel, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, nitheshshetty, gost.dev, Kanchan Joshi, Anuj Gupta,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

On 6/28/23 03:36, Nitesh Shetty wrote:
> Add device limits as sysfs entries,
>         - copy_offload (RW)
>         - copy_max_bytes (RW)
>         - copy_max_bytes_hw (RO)
> 
> Above limits help to split the copy payload in block layer.
> copy_offload: used for setting copy offload(1) or emulation(0).
> copy_max_bytes: maximum total length of copy in single payload.
> copy_max_bytes_hw: Reflects the device supported maximum limit.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>  Documentation/ABI/stable/sysfs-block | 33 +++++++++++++++
>  block/blk-settings.c                 | 24 +++++++++++
>  block/blk-sysfs.c                    | 63 ++++++++++++++++++++++++++++
>  include/linux/blkdev.h               | 12 ++++++
>  include/uapi/linux/fs.h              |  3 ++
>  5 files changed, 135 insertions(+)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index c57e5b7cb532..3c97303f658b 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -155,6 +155,39 @@ Description:
>  		last zone of the device which may be smaller.
>  
>  
> +What:		/sys/block/<disk>/queue/copy_offload
> +Date:		June 2023
> +Contact:	linux-block@vger.kernel.org
> +Description:
> +		[RW] When read, this file shows whether offloading copy to a
> +		device is enabled (1) or disabled (0). Writing '0' to this
> +		file will disable offloading copies for this device.
> +		Writing any '1' value will enable this feature. If the device
> +		does not support offloading, then writing 1, will result in an
> +		error.

I am still not convinced that this one is really necessary. copy_max_bytes_hw !=
0 indicates that the device supports copy offload. And setting copy_max_bytes
to 0 can be used to disable copy offload (which probably should be the default
for now).
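
To illustrate the suggested semantics (just a sketch; the helper name
here is made up and not part of the series), the enable check could then
be derived from the limit alone:

	/* sketch: offload is usable iff a non-zero copy limit is set */
	static inline bool bdev_copy_offload_enabled(struct block_device *bdev)
	{
		return bdev_get_queue(bdev)->limits.max_copy_sectors != 0;
	}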

> +
> +
> +What:		/sys/block/<disk>/queue/copy_max_bytes
> +Date:		June 2023
> +Contact:	linux-block@vger.kernel.org
> +Description:
> +		[RW] This is the maximum number of bytes that the block layer
> +		will allow for a copy request. This will is always smaller or

will is -> is

> +		equal to the maximum size allowed by the hardware, indicated by
> +		'copy_max_bytes_hw'. An attempt to set a value higher than
> +		'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
> +
> +
> +What:		/sys/block/<disk>/queue/copy_max_bytes_hw

Nit: In keeping with the spirit of attributes like
max_hw_sectors_kb/max_sectors_kb, I would call this one copy_max_hw_bytes.

> +Date:		June 2023
> +Contact:	linux-block@vger.kernel.org
> +Description:
> +		[RO] This is the maximum number of bytes that the hardware
> +		will allow for single data copy request.
> +		A value of 0 means that the device does not support
> +		copy offload.
> +
> +
>  What:		/sys/block/<disk>/queue/crypto/
>  Date:		February 2022
>  Contact:	linux-block@vger.kernel.org
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 4dd59059b788..738cd3f21259 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
>  	lim->zoned = BLK_ZONED_NONE;
>  	lim->zone_write_granularity = 0;
>  	lim->dma_alignment = 511;
> +	lim->max_copy_sectors_hw = 0;
> +	lim->max_copy_sectors = 0;
>  }
>  
>  /**
> @@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
>  	lim->max_dev_sectors = UINT_MAX;
>  	lim->max_write_zeroes_sectors = UINT_MAX;
>  	lim->max_zone_append_sectors = UINT_MAX;
> +	lim->max_copy_sectors_hw = UINT_MAX;
> +	lim->max_copy_sectors = UINT_MAX;
>  }
>  EXPORT_SYMBOL(blk_set_stacking_limits);
>  
> @@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
>  }
>  EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>  
> +/**
> + * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
> + * @q:  the request queue for the device
> + * @max_copy_sectors: maximum number of sectors to copy
> + **/
> +void blk_queue_max_copy_sectors_hw(struct request_queue *q,
> +		unsigned int max_copy_sectors)
> +{
> +	if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
> +		max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
> +
> +	q->limits.max_copy_sectors_hw = max_copy_sectors;
> +	q->limits.max_copy_sectors = max_copy_sectors;
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
> +
>  /**
>   * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
>   * @q:  the request queue for the device
> @@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  	t->max_segment_size = min_not_zero(t->max_segment_size,
>  					   b->max_segment_size);
>  
> +	t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
> +	t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
> +						b->max_copy_sectors_hw);
> +
>  	t->misaligned |= b->misaligned;
>  
>  	alignment = queue_limit_alignment_offset(b, start);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index afc797fb0dfc..43551778d035 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -199,6 +199,62 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
>  	return queue_var_show(0, page);
>  }
>  
> +static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
> +{
> +	return queue_var_show(blk_queue_copy(q), page);
> +}
> +
> +static ssize_t queue_copy_offload_store(struct request_queue *q,
> +				       const char *page, size_t count)
> +{
> +	unsigned long copy_offload;
> +	ssize_t ret = queue_var_store(&copy_offload, page, count);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	if (copy_offload && !q->limits.max_copy_sectors_hw)
> +		return -EINVAL;
> +
> +	if (copy_offload)
> +		blk_queue_flag_set(QUEUE_FLAG_COPY, q);
> +	else
> +		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
> +
> +	return count;
> +}

See above. I think we can drop this attribute.

> +
> +static ssize_t queue_copy_max_hw_show(struct request_queue *q, char *page)
> +{
> +	return sprintf(page, "%llu\n", (unsigned long long)
> +			q->limits.max_copy_sectors_hw << SECTOR_SHIFT);
> +}
> +
> +static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
> +{
> +	return sprintf(page, "%llu\n", (unsigned long long)
> +			q->limits.max_copy_sectors << SECTOR_SHIFT);
> +}
> +
> +static ssize_t queue_copy_max_store(struct request_queue *q,
> +				       const char *page, size_t count)
> +{
> +	unsigned long max_copy;
> +	ssize_t ret = queue_var_store(&max_copy, page, count);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	if (max_copy & (queue_logical_block_size(q) - 1))
> +		return -EINVAL;
> +
> +	max_copy >>= SECTOR_SHIFT;
> +	q->limits.max_copy_sectors = min_t(unsigned int, max_copy,
> +					q->limits.max_copy_sectors_hw);
> +
> +	return count;
> +}
> +
>  static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
>  {
>  	return queue_var_show(0, page);
> @@ -522,6 +578,10 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
>  QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
>  QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>  
> +QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
> +QUEUE_RO_ENTRY(queue_copy_max_hw, "copy_max_bytes_hw");
> +QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
> +
>  QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
>  QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
>  QUEUE_RW_ENTRY(queue_poll, "io_poll");
> @@ -638,6 +698,9 @@ static struct attribute *queue_attrs[] = {
>  	&queue_discard_max_entry.attr,
>  	&queue_discard_max_hw_entry.attr,
>  	&queue_discard_zeroes_data_entry.attr,
> +	&queue_copy_offload_entry.attr,
> +	&queue_copy_max_hw_entry.attr,
> +	&queue_copy_max_entry.attr,
>  	&queue_write_same_max_entry.attr,
>  	&queue_write_zeroes_max_entry.attr,
>  	&queue_zone_append_max_entry.attr,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ed44a997f629..6098665953e6 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -309,6 +309,9 @@ struct queue_limits {
>  	unsigned int		discard_alignment;
>  	unsigned int		zone_write_granularity;
>  
> +	unsigned int		max_copy_sectors_hw;
> +	unsigned int		max_copy_sectors;
> +
>  	unsigned short		max_segments;
>  	unsigned short		max_integrity_segments;
>  	unsigned short		max_discard_segments;
> @@ -554,6 +557,7 @@ struct request_queue {
>  #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
>  #define QUEUE_FLAG_SQ_SCHED     30	/* single queue style io dispatch */
>  #define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skip the queue*/
> +#define QUEUE_FLAG_COPY		32	/* enable/disable device copy offload */
>  
>  #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |		\
>  				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
> @@ -574,6 +578,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
>  	test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags)
>  #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
>  #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
> +#define blk_queue_copy(q)	test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
>  #define blk_queue_zone_resetall(q)	\
>  	test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
>  #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
> @@ -891,6 +896,8 @@ extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
>  extern void blk_queue_max_segments(struct request_queue *, unsigned short);
>  extern void blk_queue_max_discard_segments(struct request_queue *,
>  		unsigned short);
> +extern void blk_queue_max_copy_sectors_hw(struct request_queue *q,
> +		unsigned int max_copy_sectors);
>  void blk_queue_max_secure_erase_sectors(struct request_queue *q,
>  		unsigned int max_sectors);
>  extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
> @@ -1210,6 +1217,11 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
>  	return bdev_get_queue(bdev)->limits.discard_granularity;
>  }
>  
> +static inline unsigned int bdev_max_copy_sectors(struct block_device *bdev)
> +{
> +	return bdev_get_queue(bdev)->limits.max_copy_sectors;
> +}
> +
>  static inline unsigned int
>  bdev_max_secure_erase_sectors(struct block_device *bdev)
>  {
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..dff56813f95a 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -64,6 +64,9 @@ struct fstrim_range {
>  	__u64 minlen;
>  };
>  
> +/* maximum copy offload length, this is set to 128MB based on current testing */
> +#define COPY_MAX_BYTES	(1 << 27)
> +
>  /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
>  #define FILE_DEDUPE_RANGE_SAME		0
>  #define FILE_DEDUPE_RANGE_DIFFERS	1

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 2/9] block: Add copy offload support infrastructure
  2023-06-27 18:36     ` [PATCH v13 2/9] block: Add copy offload support infrastructure Nitesh Shetty
@ 2023-06-28  6:45       ` Damien Le Moal
  2023-06-28 16:03         ` Nitesh Shetty
  2023-07-20  7:42       ` Christoph Hellwig
  1 sibling, 1 reply; 32+ messages in thread
From: Damien Le Moal @ 2023-06-28  6:45 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, dm-devel, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, nitheshshetty, gost.dev, Anuj Gupta, linux-block,
	linux-kernel, linux-doc, linux-nvme, linux-fsdevel

On 6/28/23 03:36, Nitesh Shetty wrote:
> Introduce blkdev_copy_offload which takes similar arguments as
> copy_file_range and performs copy offload between two bdevs.

I am confused... I thought it was discussed to only allow copy offload
within a single bdev for now... Did I miss something?

> Introduce REQ_OP_COPY_DST, REQ_OP_COPY_SRC operation.
> Issue REQ_OP_COPY_DST with destination info along with taking a plug.
> This flows till request layer and waits for src bio to get merged.
> Issue REQ_OP_COPY_SRC with source info and this bio reaches request
> layer and merges with dst request.
> For any reason, if request comes to driver with either only one of src/dst
> info we fail the copy offload.
> 
> Larger copy will be divided, based on max_copy_sectors limit.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>  block/blk-core.c          |   5 ++
>  block/blk-lib.c           | 177 ++++++++++++++++++++++++++++++++++++++
>  block/blk-merge.c         |  21 +++++
>  block/blk.h               |   9 ++
>  block/elevator.h          |   1 +
>  include/linux/bio.h       |   4 +-
>  include/linux/blk_types.h |  21 +++++
>  include/linux/blkdev.h    |   4 +
>  8 files changed, 241 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 99d8b9812b18..e6714391c93f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -796,6 +796,11 @@ void submit_bio_noacct(struct bio *bio)
>  		if (!q->limits.max_write_zeroes_sectors)
>  			goto not_supported;
>  		break;
> +	case REQ_OP_COPY_SRC:
> +	case REQ_OP_COPY_DST:
> +		if (!blk_queue_copy(q))
> +			goto not_supported;
> +		break;
>  	default:
>  		break;
>  	}
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e59c3069e835..10c3eadd5bf6 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -115,6 +115,183 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  }
>  EXPORT_SYMBOL(blkdev_issue_discard);
>  
> +/*
> + * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
> + * This must only be called once all bios have been issued so that the refcount
> + * can only decrease. This just waits for all bios to make it through
> + * blkdev_copy_(offload/emulate)_(read/write)_endio.
> + */
> +static ssize_t blkdev_copy_wait_io_completion(struct cio *cio)
> +{
> +	ssize_t ret;
> +
> +	if (cio->endio)
> +		return 0;
> +
> +	if (atomic_read(&cio->refcount)) {
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		blk_io_schedule();
> +	}
> +
> +	ret = cio->comp_len;
> +	kfree(cio);
> +
> +	return ret;
> +}
> +
> +static void blkdev_copy_offload_read_endio(struct bio *bio)
> +{
> +	struct cio *cio = bio->bi_private;
> +	sector_t clen;
> +
> +	if (bio->bi_status) {
> +		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
> +		cio->comp_len = min_t(sector_t, clen, cio->comp_len);
> +	}
> +	bio_put(bio);
> +
> +	if (!atomic_dec_and_test(&cio->refcount))
> +		return;
> +	if (cio->endio) {
> +		cio->endio(cio->private, cio->comp_len);
> +		kfree(cio);
> +	} else
> +		blk_wake_io_task(cio->waiter);

Curly brackets around else missing.
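
I.e., per coding style, since the if branch already uses braces:

	if (cio->endio) {
		cio->endio(cio->private, cio->comp_len);
		kfree(cio);
	} else {
		blk_wake_io_task(cio->waiter);
	}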

> +}
> +
> +/*
> + * __blkdev_copy_offload	- Use device's native copy offload feature.
> + * we perform copy operation by sending 2 bio.
> + * 1. We take a plug and send a REQ_OP_COPY_DST bio along with destination
> + * sector and length. Once this bio reaches request layer, we form a request and
> + * wait for src bio to arrive.
> + * 2. We issue REQ_OP_COPY_SRC bio along with source sector and length. Once
> + * this bio reaches request layer and find a request with previously sent
> + * destination info we merge the source bio and return.
> + * 3. Release the plug and request is sent to driver
> + *
> + * Returns the length of bytes copied or error if encountered
> + */
> +static ssize_t __blkdev_copy_offload(
> +		struct block_device *bdev_in, loff_t pos_in,
> +		struct block_device *bdev_out, loff_t pos_out,
> +		size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
> +{
> +	struct cio *cio;
> +	struct bio *read_bio, *write_bio;
> +	sector_t rem, copy_len, max_copy_len;
> +	struct blk_plug plug;
> +
> +	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
> +	if (!cio)
> +		return -ENOMEM;
> +	atomic_set(&cio->refcount, 0);
> +	cio->waiter = current;
> +	cio->endio = endio;
> +	cio->private = private;
> +
> +	max_copy_len = min(bdev_max_copy_sectors(bdev_in),
> +			bdev_max_copy_sectors(bdev_out)) << SECTOR_SHIFT;

According to patch 1, this can end up being 0, so the loop below will be infinite.
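
Purely as a sketch, a guard right after computing max_copy_len would
avoid that, e.g.:

	if (!max_copy_len) {
		kfree(cio);
		return -EOPNOTSUPP;
	}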

> +
> +	cio->pos_in = pos_in;
> +	cio->pos_out = pos_out;
> +	/* If there is a error, comp_len will be set to least successfully
> +	 * completed copied length
> +	 */
> +	cio->comp_len = len;
> +	for (rem = len; rem > 0; rem -= copy_len) {
> +		copy_len = min(rem, max_copy_len);
> +
> +		write_bio = bio_alloc(bdev_out, 0, REQ_OP_COPY_DST, gfp_mask);
> +		if (!write_bio)
> +			goto err_write_bio_alloc;
> +		write_bio->bi_iter.bi_size = copy_len;
> +		write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
> +
> +		blk_start_plug(&plug);
> +		read_bio = blk_next_bio(write_bio, bdev_in, 0, REQ_OP_COPY_SRC,
> +						gfp_mask);
> +		read_bio->bi_iter.bi_size = copy_len;
> +		read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
> +		read_bio->bi_end_io = blkdev_copy_offload_read_endio;
> +		read_bio->bi_private = cio;
> +
> +		atomic_inc(&cio->refcount);
> +		submit_bio(read_bio);
> +		blk_finish_plug(&plug);
> +		pos_in += copy_len;
> +		pos_out += copy_len;
> +	}
> +
> +	return blkdev_copy_wait_io_completion(cio);
> +
> +err_write_bio_alloc:
> +	cio->comp_len = min_t(sector_t, cio->comp_len, (len - rem));
> +	if (!atomic_read(&cio->refcount)) {
> +		kfree(cio);
> +		return -ENOMEM;
> +	}
> +	return blkdev_copy_wait_io_completion(cio);
> +}
> +
> +static inline ssize_t blkdev_copy_sanity_check(
> +	struct block_device *bdev_in, loff_t pos_in,
> +	struct block_device *bdev_out, loff_t pos_out,
> +	size_t len)
> +{
> +	unsigned int align = max(bdev_logical_block_size(bdev_out),
> +					bdev_logical_block_size(bdev_in)) - 1;
> +
> +	if (bdev_read_only(bdev_out))
> +		return -EPERM;
> +
> +	if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
> +		len >= COPY_MAX_BYTES)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +/*
> + * @bdev_in:	source block device
> + * @pos_in:	source offset
> + * @bdev_out:	destination block device
> + * @pos_out:	destination offset
> + * @len:	length in bytes to be copied
> + * @endio:	endio function to be called on completion of copy operation,
> + *		for synchronous operation this should be NULL
> + * @private:	endio function will be called with this private data, should be
> + *		NULL, if operation is synchronous in nature
> + * @gfp_mask:   memory allocation flags (for bio_alloc)
> + *
> + * Returns the length of bytes copied or error if encountered
> + *
> + * Description:
> + *	Copy source offset from source block device to destination block
> + *	device. If copy offload is not supported or fails, fallback to
> + *	emulation. Max total length of copy is limited to COPY_MAX_BYTES
> + */
> +ssize_t blkdev_copy_offload(
> +		struct block_device *bdev_in, loff_t pos_in,
> +		struct block_device *bdev_out, loff_t pos_out,
> +		size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
> +{
> +	struct request_queue *q_in = bdev_get_queue(bdev_in);
> +	struct request_queue *q_out = bdev_get_queue(bdev_out);
> +	ssize_t ret;
> +
> +	ret = blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len);
> +	if (ret)
> +		return ret;
> +
> +	if (blk_queue_copy(q_in) && blk_queue_copy(q_out))
> +		ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
> +			   len, endio, private, gfp_mask);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(blkdev_copy_offload);
> +
>  static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>  		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
>  		struct bio **biop, unsigned flags)
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 65e75efa9bd3..bfd86c54df22 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -922,6 +922,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
>  	if (!rq_mergeable(rq) || !bio_mergeable(bio))
>  		return false;
>  
> +	if ((req_op(rq) == REQ_OP_COPY_DST) && (bio_op(bio) == REQ_OP_COPY_SRC))
> +		return true;
> +
>  	if (req_op(rq) != bio_op(bio))
>  		return false;
>  
> @@ -951,6 +954,8 @@ enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
>  {
>  	if (blk_discard_mergable(rq))
>  		return ELEVATOR_DISCARD_MERGE;
> +	else if (blk_copy_offload_mergable(rq, bio))
> +		return ELEVATOR_COPY_OFFLOAD_MERGE;
>  	else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
>  		return ELEVATOR_BACK_MERGE;
>  	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
> @@ -1053,6 +1058,20 @@ static enum bio_merge_status bio_attempt_discard_merge(struct request_queue *q,
>  	return BIO_MERGE_FAILED;
>  }
>  
> +static enum bio_merge_status bio_attempt_copy_offload_merge(
> +	struct request_queue *q, struct request *req, struct bio *bio)
> +{
> +	if (req->__data_len != bio->bi_iter.bi_size)
> +		return BIO_MERGE_FAILED;
> +
> +	req->biotail->bi_next = bio;
> +	req->biotail = bio;
> +	req->nr_phys_segments = blk_rq_nr_phys_segments(req) + 1;
> +	req->__data_len += bio->bi_iter.bi_size;
> +
> +	return BIO_MERGE_OK;
> +}
> +
>  static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
>  						   struct request *rq,
>  						   struct bio *bio,
> @@ -1073,6 +1092,8 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
>  		break;
>  	case ELEVATOR_DISCARD_MERGE:
>  		return bio_attempt_discard_merge(q, rq, bio);
> +	case ELEVATOR_COPY_OFFLOAD_MERGE:
> +		return bio_attempt_copy_offload_merge(q, rq, bio);
>  	default:
>  		return BIO_MERGE_NONE;
>  	}
> diff --git a/block/blk.h b/block/blk.h
> index 608c5dcc516b..440bfa148461 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -156,6 +156,13 @@ static inline bool blk_discard_mergable(struct request *req)
>  	return false;
>  }
>  
> +static inline bool blk_copy_offload_mergable(struct request *req,
> +					     struct bio *bio)
> +{
> +	return ((req_op(req) == REQ_OP_COPY_DST)  &&
> +		(bio_op(bio) == REQ_OP_COPY_SRC));
> +}
> +
>  static inline unsigned int blk_rq_get_max_segments(struct request *rq)
>  {
>  	if (req_op(rq) == REQ_OP_DISCARD)
> @@ -303,6 +310,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
>  		break;
>  	}
>  
> +	if (unlikely(op_is_copy(bio->bi_opf)))
> +		return false;
>  	/*
>  	 * All drivers must accept single-segments bios that are <= PAGE_SIZE.
>  	 * This is a quick and dirty check that relies on the fact that
> diff --git a/block/elevator.h b/block/elevator.h
> index 7ca3d7b6ed82..eec442bbf384 100644
> --- a/block/elevator.h
> +++ b/block/elevator.h
> @@ -18,6 +18,7 @@ enum elv_merge {
>  	ELEVATOR_FRONT_MERGE	= 1,
>  	ELEVATOR_BACK_MERGE	= 2,
>  	ELEVATOR_DISCARD_MERGE	= 3,
> +	ELEVATOR_COPY_OFFLOAD_MERGE	= 4,
>  };
>  
>  struct blk_mq_alloc_data;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index c4f5b5228105..a2673f24e493 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -57,7 +57,9 @@ static inline bool bio_has_data(struct bio *bio)
>  	    bio->bi_iter.bi_size &&
>  	    bio_op(bio) != REQ_OP_DISCARD &&
>  	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
> -	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
> +	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
> +	    bio_op(bio) != REQ_OP_COPY_DST &&
> +	    bio_op(bio) != REQ_OP_COPY_SRC)
>  		return true;
>  
>  	return false;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 0bad62cca3d0..336146798e56 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -394,6 +394,9 @@ enum req_op {
>  	/* reset all the zone present on the device */
>  	REQ_OP_ZONE_RESET_ALL	= (__force blk_opf_t)17,
>  
> +	REQ_OP_COPY_SRC		= (__force blk_opf_t)18,
> +	REQ_OP_COPY_DST		= (__force blk_opf_t)19,
> +
>  	/* Driver private requests */
>  	REQ_OP_DRV_IN		= (__force blk_opf_t)34,
>  	REQ_OP_DRV_OUT		= (__force blk_opf_t)35,
> @@ -482,6 +485,12 @@ static inline bool op_is_write(blk_opf_t op)
>  	return !!(op & (__force blk_opf_t)1);
>  }
>  
> +static inline bool op_is_copy(blk_opf_t op)
> +{
> +	return (((op & REQ_OP_MASK) == REQ_OP_COPY_SRC) ||
> +		((op & REQ_OP_MASK) == REQ_OP_COPY_DST));
> +}
> +
>  /*
>   * Check if the bio or request is one that needs special treatment in the
>   * flush state machine.
> @@ -541,4 +550,16 @@ struct blk_rq_stat {
>  	u64 batch;
>  };
>  
> +typedef void (cio_iodone_t)(void *private, int comp_len);
> +
> +struct cio {
> +	struct task_struct *waiter;     /* waiting task (NULL if none) */
> +	loff_t pos_in;
> +	loff_t pos_out;
> +	ssize_t comp_len;
> +	cio_iodone_t *endio;		/* applicable for async operation */
> +	void *private;			/* applicable for async operation */
> +	atomic_t refcount;
> +};
> +
>  #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 6098665953e6..963f5c97dec0 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1043,6 +1043,10 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
>  int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp);
> +ssize_t blkdev_copy_offload(
> +		struct block_device *bdev_in, loff_t pos_in,
> +		struct block_device *bdev_out, loff_t pos_out,
> +		size_t len, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
>  
>  #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
>  #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 3/9] block: add emulation for copy
  2023-06-27 18:36     ` [PATCH v13 3/9] block: add emulation for copy Nitesh Shetty
@ 2023-06-28  6:50       ` Damien Le Moal
  2023-06-28 16:10         ` Nitesh Shetty
  2023-06-29  8:33       ` Ming Lei
  2023-07-20  7:50       ` Christoph Hellwig
  2 siblings, 1 reply; 32+ messages in thread
From: Damien Le Moal @ 2023-06-28  6:50 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, dm-devel, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, nitheshshetty, gost.dev, Vincent Fu, Anuj Gupta,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

On 6/28/23 03:36, Nitesh Shetty wrote:
> For the devices which does not support copy, copy emulation is added.
> It is required for in-kernel users like fabrics, where file descriptor is
> not available and hence they can't use copy_file_range.
> Copy-emulation is implemented by reading from source into memory and
> writing to the corresponding destination asynchronously.
> Also emulation is used, if copy offload fails or partially completes.
> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>  block/blk-lib.c           | 183 +++++++++++++++++++++++++++++++++++++-
>  block/blk-map.c           |   4 +-
>  include/linux/blk_types.h |   5 ++
>  include/linux/blkdev.h    |   3 +
>  4 files changed, 192 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 10c3eadd5bf6..09e0d5d51d03 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -234,6 +234,180 @@ static ssize_t __blkdev_copy_offload(
>  	return blkdev_copy_wait_io_completion(cio);
>  }
>  
> +static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
> +		gfp_t gfp_mask)
> +{
> +	int min_size = PAGE_SIZE;
> +	void *buf;
> +
> +	while (req_size >= min_size) {
> +		buf = kvmalloc(req_size, gfp_mask);
> +		if (buf) {
> +			*alloc_size = req_size;
> +			return buf;
> +		}
> +		/* retry half the requested size */

Kind of obvious :)

> +		req_size >>= 1;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void blkdev_copy_emulate_write_endio(struct bio *bio)
> +{
> +	struct copy_ctx *ctx = bio->bi_private;
> +	struct cio *cio = ctx->cio;
> +	sector_t clen;
> +
> +	if (bio->bi_status) {
> +		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
> +		cio->comp_len = min_t(sector_t, clen, cio->comp_len);
> +	}
> +	kfree(bvec_virt(&bio->bi_io_vec[0]));
> +	bio_map_kern_endio(bio);
> +	kfree(ctx);
> +	if (atomic_dec_and_test(&cio->refcount)) {
> +		if (cio->endio) {
> +			cio->endio(cio->private, cio->comp_len);
> +			kfree(cio);
> +		} else
> +			blk_wake_io_task(cio->waiter);

Curly brackets.

> +	}
> +}
> +
> +static void blkdev_copy_emulate_read_endio(struct bio *read_bio)
> +{
> +	struct copy_ctx *ctx = read_bio->bi_private;
> +	struct cio *cio = ctx->cio;
> +	sector_t clen;
> +
> +	if (read_bio->bi_status) {
> +		clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
> +						cio->pos_in;
> +		cio->comp_len = min_t(sector_t, clen, cio->comp_len);
> +		kfree(bvec_virt(&read_bio->bi_io_vec[0]));
> +		bio_map_kern_endio(read_bio);
> +		kfree(ctx);
> +
> +		if (atomic_dec_and_test(&cio->refcount)) {
> +			if (cio->endio) {
> +				cio->endio(cio->private, cio->comp_len);
> +				kfree(cio);
> +			} else
> +				blk_wake_io_task(cio->waiter);

Same.

> +		}
> +	}
> +	schedule_work(&ctx->dispatch_work);

ctx may have been freed above.
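
One way to avoid it (sketch only) is to finish the error path and return
before the schedule_work() call:

	if (read_bio->bi_status) {
		/* ... existing error handling, including kfree(ctx) ... */
		kfree(read_bio);
		return;		/* ctx is gone, do not schedule the write */
	}
	schedule_work(&ctx->dispatch_work);
	kfree(read_bio);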

> +	kfree(read_bio);
> +}
> +
> +static void blkdev_copy_dispatch_work(struct work_struct *work)
> +{
> +	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
> +			dispatch_work);

Please align the argument, or even better: split the line after "=".
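
For example:

	struct copy_ctx *ctx =
		container_of(work, struct copy_ctx, dispatch_work);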

> +
> +	submit_bio(ctx->write_bio);
> +}
> +
> +/*
> + * If native copy offload feature is absent, this function tries to emulate,
> + * by copying data from source to a temporary buffer and from buffer to
> + * destination device.
> + * Returns the length of bytes copied or error if encountered
> + */
> +static ssize_t __blkdev_copy_emulate(
> +		struct block_device *bdev_in, loff_t pos_in,
> +		struct block_device *bdev_out, loff_t pos_out,
> +		size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
> +{
> +	struct request_queue *in = bdev_get_queue(bdev_in);
> +	struct request_queue *out = bdev_get_queue(bdev_out);
> +	struct bio *read_bio, *write_bio;
> +	void *buf = NULL;
> +	struct copy_ctx *ctx;
> +	struct cio *cio;
> +	sector_t buf_len, req_len, rem = 0;
> +	sector_t max_src_hw_len = min_t(unsigned int,
> +			queue_max_hw_sectors(in),
> +			queue_max_segments(in) << (PAGE_SHIFT - SECTOR_SHIFT))
> +			<< SECTOR_SHIFT;
> +	sector_t max_dst_hw_len = min_t(unsigned int,
> +		queue_max_hw_sectors(out),
> +			queue_max_segments(out) << (PAGE_SHIFT - SECTOR_SHIFT))
> +			<< SECTOR_SHIFT;
> +	sector_t max_hw_len = min_t(unsigned int,
> +			max_src_hw_len, max_dst_hw_len);
> +
> +	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
> +	if (!cio)
> +		return -ENOMEM;
> +	atomic_set(&cio->refcount, 0);
> +	cio->pos_in = pos_in;
> +	cio->pos_out = pos_out;
> +	cio->waiter = current;
> +	cio->endio = endio;
> +	cio->private = private;
> +
> +	for (rem = len; rem > 0; rem -= buf_len) {
> +		req_len = min_t(int, max_hw_len, rem);
> +
> +		buf = blkdev_copy_alloc_buf(req_len, &buf_len, gfp_mask);
> +		if (!buf)
> +			goto err_alloc_buf;
> +
> +		ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
> +		if (!ctx)
> +			goto err_ctx;
> +
> +		read_bio = bio_map_kern(in, buf, buf_len, gfp_mask);
> +		if (IS_ERR(read_bio))
> +			goto err_read_bio;
> +
> +		write_bio = bio_map_kern(out, buf, buf_len, gfp_mask);
> +		if (IS_ERR(write_bio))
> +			goto err_write_bio;
> +
> +		ctx->cio = cio;
> +		ctx->write_bio = write_bio;
> +		INIT_WORK(&ctx->dispatch_work, blkdev_copy_dispatch_work);
> +
> +		read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
> +		read_bio->bi_iter.bi_size = buf_len;
> +		read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
> +		bio_set_dev(read_bio, bdev_in);
> +		read_bio->bi_end_io = blkdev_copy_emulate_read_endio;
> +		read_bio->bi_private = ctx;
> +
> +		write_bio->bi_iter.bi_size = buf_len;
> +		write_bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
> +		bio_set_dev(write_bio, bdev_out);
> +		write_bio->bi_end_io = blkdev_copy_emulate_write_endio;
> +		write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
> +		write_bio->bi_private = ctx;
> +
> +		atomic_inc(&cio->refcount);
> +		submit_bio(read_bio);
> +
> +		pos_in += buf_len;
> +		pos_out += buf_len;
> +	}
> +	return blkdev_copy_wait_io_completion(cio);
> +
> +err_write_bio:
> +	bio_put(read_bio);
> +err_read_bio:
> +	kfree(ctx);
> +err_ctx:
> +	kvfree(buf);
> +err_alloc_buf:
> +	cio->comp_len -= min_t(sector_t, cio->comp_len, len - rem);
> +	if (!atomic_read(&cio->refcount)) {
> +		kfree(cio);
> +		return -ENOMEM;
> +	}
> +	return blkdev_copy_wait_io_completion(cio);
> +}
> +
>  static inline ssize_t blkdev_copy_sanity_check(
>  	struct block_device *bdev_in, loff_t pos_in,
>  	struct block_device *bdev_out, loff_t pos_out,
> @@ -284,9 +458,16 @@ ssize_t blkdev_copy_offload(
>  	if (ret)
>  		return ret;
>  
> -	if (blk_queue_copy(q_in) && blk_queue_copy(q_out))
> +	if (blk_queue_copy(q_in) && blk_queue_copy(q_out)) {
>  		ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
>  			   len, endio, private, gfp_mask);
> +		if (ret < 0)
> +			ret = 0;
> +	}
> +
> +	if (ret != len)
> +		ret = __blkdev_copy_emulate(bdev_in, pos_in + ret, bdev_out,
> +			 pos_out + ret, len - ret, endio, private, gfp_mask);
>  
>  	return ret;
>  }
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 44d74a30ddac..ceeb70a95fd1 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -363,7 +363,7 @@ static void bio_invalidate_vmalloc_pages(struct bio *bio)
>  #endif
>  }
>  
> -static void bio_map_kern_endio(struct bio *bio)
> +void bio_map_kern_endio(struct bio *bio)
>  {
>  	bio_invalidate_vmalloc_pages(bio);
>  	bio_uninit(bio);
> @@ -380,7 +380,7 @@ static void bio_map_kern_endio(struct bio *bio)
>   *	Map the kernel address into a bio suitable for io to a block
>   *	device. Returns an error pointer in case of error.
>   */
> -static struct bio *bio_map_kern(struct request_queue *q, void *data,
> +struct bio *bio_map_kern(struct request_queue *q, void *data,
>  		unsigned int len, gfp_t gfp_mask)
>  {
>  	unsigned long kaddr = (unsigned long)data;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 336146798e56..f8c80940c7ad 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -562,4 +562,9 @@ struct cio {
>  	atomic_t refcount;
>  };
>  
> +struct copy_ctx {
> +	struct cio *cio;
> +	struct work_struct dispatch_work;
> +	struct bio *write_bio;
> +};
>  #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 963f5c97dec0..c176bf6173c5 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1047,6 +1047,9 @@ ssize_t blkdev_copy_offload(
>  		struct block_device *bdev_in, loff_t pos_in,
>  		struct block_device *bdev_out, loff_t pos_out,
>  		size_t len, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
> +struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
> +		gfp_t gfp_mask);
> +void bio_map_kern_endio(struct bio *bio);
>  
>  #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
>  #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device
  2023-06-27 18:36     ` [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device Nitesh Shetty
@ 2023-06-28  6:51       ` Damien Le Moal
  2023-06-28 16:39         ` Nitesh Shetty
  2023-07-20  7:57       ` Christoph Hellwig
  1 sibling, 1 reply; 32+ messages in thread
From: Damien Le Moal @ 2023-06-28  6:51 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, dm-devel, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner
  Cc: martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, nitheshshetty, gost.dev, Anuj Gupta, linux-block,
	linux-kernel, linux-doc, linux-nvme, linux-fsdevel

On 6/28/23 03:36, Nitesh Shetty wrote:
> For direct block device opened with O_DIRECT, use copy_file_range to
> issue device copy offload, and fallback to generic_copy_file_range incase
> device copy offload capability is absent.

...if the device does not support copy offload or the device files are not open
with O_DIRECT.

No ?

> Modify checks to allow bdevs to use copy_file_range.
> 
> Suggested-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>  block/blk-lib.c        | 26 ++++++++++++++++++++++++++
>  block/fops.c           | 20 ++++++++++++++++++++
>  fs/read_write.c        |  7 +++++--
>  include/linux/blkdev.h |  4 ++++
>  4 files changed, 55 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 09e0d5d51d03..7d8e09a99254 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -473,6 +473,32 @@ ssize_t blkdev_copy_offload(
>  }
>  EXPORT_SYMBOL_GPL(blkdev_copy_offload);
>  
> +/* Copy source offset from source block device to destination block
> + * device. Returns the length of bytes copied.
> + */

Multi-line comment style: start with a "/*" line please.
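
I.e.:

	/*
	 * Copy source offset from source block device to destination block
	 * device. Returns the length of bytes copied.
	 */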

> +ssize_t blkdev_copy_offload_failfast(

What is the "failfast" in the name for ?

> +		struct block_device *bdev_in, loff_t pos_in,
> +		struct block_device *bdev_out, loff_t pos_out,
> +		size_t len, gfp_t gfp_mask)
> +{
> +	struct request_queue *in_q = bdev_get_queue(bdev_in);
> +	struct request_queue *out_q = bdev_get_queue(bdev_out);
> +	ssize_t ret = 0;

You do not need this initialization.

> +
> +	if (blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len))
> +		return 0;
> +
> +	if (blk_queue_copy(in_q) && blk_queue_copy(out_q)) {

Given that I think we do not allow copies between different devices, in_q and
out_q should always be the same, no ?

> +		ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
> +				len, NULL, NULL, gfp_mask);

Same here. Why pass 2 bdevs if we only allow copies within the same device ?

> +		if (ret < 0)
> +			return 0;
> +	}
> +
> +	return ret;

return 0;

> +}
> +EXPORT_SYMBOL_GPL(blkdev_copy_offload_failfast);
> +
>  static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>  		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
>  		struct bio **biop, unsigned flags)
> diff --git a/block/fops.c b/block/fops.c
> index a286bf3325c5..a1576304f269 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -621,6 +621,25 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	return ret;
>  }
>  
> +static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
> +				struct file *file_out, loff_t pos_out,
> +				size_t len, unsigned int flags)
> +{
> +	struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
> +	struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
> +	ssize_t comp_len = 0;
> +
> +	if ((file_in->f_iocb_flags & IOCB_DIRECT) &&
> +		(file_out->f_iocb_flags & IOCB_DIRECT))
> +		comp_len = blkdev_copy_offload_failfast(in_bdev, pos_in,
> +				out_bdev, pos_out, len, GFP_KERNEL);
> +	if (comp_len != len)
> +		comp_len = generic_copy_file_range(file_in, pos_in + comp_len,
> +			file_out, pos_out + comp_len, len - comp_len, flags);
> +
> +	return comp_len;
> +}
> +
>  #define	BLKDEV_FALLOC_FL_SUPPORTED					\
>  		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
>  		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> @@ -714,6 +733,7 @@ const struct file_operations def_blk_fops = {
>  	.splice_read	= filemap_splice_read,
>  	.splice_write	= iter_file_splice_write,
>  	.fallocate	= blkdev_fallocate,
> +	.copy_file_range = blkdev_copy_file_range,
>  };
>  
>  static __init int blkdev_init(void)
> diff --git a/fs/read_write.c b/fs/read_write.c
> index b07de77ef126..d27148a2543f 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1447,7 +1447,8 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>  		return -EOVERFLOW;
>  
>  	/* Shorten the copy to EOF */
> -	size_in = i_size_read(inode_in);
> +	size_in = i_size_read(file_in->f_mapping->host);
> +
>  	if (pos_in >= size_in)
>  		count = 0;
>  	else
> @@ -1708,7 +1709,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
>  	/* Don't copy dirs, pipes, sockets... */
>  	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>  		return -EISDIR;
> -	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> +
> +	if ((!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) &&
> +		(!S_ISBLK(inode_in->i_mode) || !S_ISBLK(inode_out->i_mode)))
>  		return -EINVAL;
>  
>  	if (!(file_in->f_mode & FMODE_READ) ||
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c176bf6173c5..850168cad080 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1047,6 +1047,10 @@ ssize_t blkdev_copy_offload(
>  		struct block_device *bdev_in, loff_t pos_in,
>  		struct block_device *bdev_out, loff_t pos_out,
>  		size_t len, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
> +ssize_t blkdev_copy_offload_failfast(
> +		struct block_device *bdev_in, loff_t pos_in,
> +		struct block_device *bdev_out, loff_t pos_out,
> +		size_t len, gfp_t gfp_mask);
>  struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
>  		gfp_t gfp_mask);
>  void bio_map_kern_endio(struct bio *bio);

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 9/9] null_blk: add support for copy offload
  2023-06-27 18:36     ` [PATCH v13 9/9] null_blk: add support for copy offload Nitesh Shetty
@ 2023-06-28 12:11       ` kernel test robot
  2023-06-28 12:52       ` kernel test robot
  1 sibling, 0 replies; 32+ messages in thread
From: kernel test robot @ 2023-06-28 12:11 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, dm-devel, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner
  Cc: llvm, oe-kbuild-all, martin.petersen, linux-scsi, willy, hare,
	djwong, bvanassche, ming.lei, dlemoal, nitheshshetty, gost.dev,
	Nitesh Shetty, Damien Le Moal, Anuj Gupta, Vincent Fu,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

Hi Nitesh,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 53cdf865f90ba922a854c65ed05b519f9d728424]

url:    https://github.com/intel-lab-lkp/linux/commits/Nitesh-Shetty/block-Introduce-queue-limits-for-copy-offload-support/20230628-163126
base:   53cdf865f90ba922a854c65ed05b519f9d728424
patch link:    https://lore.kernel.org/r/20230627183629.26571-10-nj.shetty%40samsung.com
patch subject: [PATCH v13 9/9] null_blk: add support for copy offload
config: hexagon-randconfig-r045-20230628 (https://download.01.org/0day-ci/archive/20230628/202306281909.TRNCf5eG-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230628/202306281909.TRNCf5eG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306281909.TRNCf5eG-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from drivers/block/null_blk/main.c:12:
   In file included from drivers/block/null_blk/null_blk.h:8:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from drivers/block/null_blk/main.c:12:
   In file included from drivers/block/null_blk/null_blk.h:8:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from drivers/block/null_blk/main.c:12:
   In file included from drivers/block/null_blk/null_blk.h:8:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
>> drivers/block/null_blk/main.c:1295:2: warning: variable 'rem' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
    1295 |         __rq_for_each_bio(bio, req) {
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/blk-mq.h:1012:2: note: expanded from macro '__rq_for_each_bio'
    1012 |         if ((rq->bio))                  \
         |         ^~~~~~~~~~~~~~
   include/linux/compiler.h:55:28: note: expanded from macro 'if'
      55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:57:30: note: expanded from macro '__trace_if_var'
      57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
         |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/block/null_blk/main.c:1300:15: note: uninitialized use occurs here
    1300 |         if (WARN_ON(!rem))
         |                      ^~~
   include/asm-generic/bug.h:123:25: note: expanded from macro 'WARN_ON'
     123 |         int __ret_warn_on = !!(condition);                              \
         |                                ^~~~~~~~~
   include/linux/compiler.h:55:47: note: expanded from macro 'if'
      55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                                               ^~~~
   include/linux/compiler.h:57:52: note: expanded from macro '__trace_if_var'
      57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
         |                                                    ^~~~
   drivers/block/null_blk/main.c:1295:2: note: remove the 'if' if its condition is always true
    1295 |         __rq_for_each_bio(bio, req) {
         |         ^
   include/linux/blk-mq.h:1012:2: note: expanded from macro '__rq_for_each_bio'
    1012 |         if ((rq->bio))                  \
         |         ^
   include/linux/compiler.h:55:23: note: expanded from macro 'if'
      55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                       ^
   drivers/block/null_blk/main.c:1287:12: note: initialize the variable 'rem' to silence this warning
    1287 |         size_t rem, temp;
         |                   ^
         |                    = 0
   7 warnings generated.


vim +1295 drivers/block/null_blk/main.c

  1281	
  1282	static inline int nullb_setup_copy_write(struct nullb *nullb,
  1283			struct request *req, bool is_fua)
  1284	{
  1285		sector_t sector_in, sector_out;
  1286		void *in, *out;
  1287		size_t rem, temp;
  1288		struct bio *bio;
  1289		unsigned long offset_in, offset_out;
  1290		struct nullb_page *t_page_in, *t_page_out;
  1291		int ret = -EIO;
  1292	
  1293		sector_out = blk_rq_pos(req);
  1294	
> 1295		__rq_for_each_bio(bio, req) {
  1296			sector_in = bio->bi_iter.bi_sector;
  1297			rem = bio->bi_iter.bi_size;
  1298		}
  1299	
  1300		if (WARN_ON(!rem))
  1301			return BLK_STS_NOTSUPP;
  1302	
  1303		spin_lock_irq(&nullb->lock);
  1304		while (rem > 0) {
  1305			temp = min_t(size_t, nullb->dev->blocksize, rem);
  1306			offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
  1307			offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
  1308	
  1309			if (null_cache_active(nullb) && !is_fua)
  1310				null_make_cache_space(nullb, PAGE_SIZE);
  1311	
  1312			t_page_in = null_lookup_page(nullb, sector_in, false,
  1313				!null_cache_active(nullb));
  1314			if (!t_page_in)
  1315				goto err;
  1316			t_page_out = null_insert_page(nullb, sector_out,
  1317				!null_cache_active(nullb) || is_fua);
  1318			if (!t_page_out)
  1319				goto err;
  1320	
  1321			in = kmap_local_page(t_page_in->page);
  1322			out = kmap_local_page(t_page_out->page);
  1323	
  1324			memcpy(out + offset_out, in + offset_in, temp);
  1325			kunmap_local(out);
  1326			kunmap_local(in);
  1327			__set_bit(sector_out & SECTOR_MASK, t_page_out->bitmap);
  1328	
  1329			if (is_fua)
  1330				null_free_sector(nullb, sector_out, true);
  1331	
  1332			rem -= temp;
  1333			sector_in += temp >> SECTOR_SHIFT;
  1334			sector_out += temp >> SECTOR_SHIFT;
  1335		}
  1336	
  1337		ret = 0;
  1338	err:
  1339		spin_unlock_irq(&nullb->lock);
  1340		return ret;
  1341	}
  1342	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 9/9] null_blk: add support for copy offload
  2023-06-27 18:36     ` [PATCH v13 9/9] null_blk: add support for copy offload Nitesh Shetty
  2023-06-28 12:11       ` kernel test robot
@ 2023-06-28 12:52       ` kernel test robot
  1 sibling, 0 replies; 32+ messages in thread
From: kernel test robot @ 2023-06-28 12:52 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, dm-devel, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner
  Cc: llvm, oe-kbuild-all, martin.petersen, linux-scsi, willy, hare,
	djwong, bvanassche, ming.lei, dlemoal, nitheshshetty, gost.dev,
	Nitesh Shetty, Damien Le Moal, Anuj Gupta, Vincent Fu,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

Hi Nitesh,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 53cdf865f90ba922a854c65ed05b519f9d728424]

url:    https://github.com/intel-lab-lkp/linux/commits/Nitesh-Shetty/block-Introduce-queue-limits-for-copy-offload-support/20230628-163126
base:   53cdf865f90ba922a854c65ed05b519f9d728424
patch link:    https://lore.kernel.org/r/20230627183629.26571-10-nj.shetty%40samsung.com
patch subject: [PATCH v13 9/9] null_blk: add support for copy offload
config: i386-randconfig-i006-20230628 (https://download.01.org/0day-ci/archive/20230628/202306282001.ba1qWTf0-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce: (https://download.01.org/0day-ci/archive/20230628/202306282001.ba1qWTf0-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306282001.ba1qWTf0-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/block/null_blk/main.c:1295:2: warning: variable 'rem' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
           __rq_for_each_bio(bio, req) {
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/blk-mq.h:1012:6: note: expanded from macro '__rq_for_each_bio'
           if ((rq->bio))                  \
               ^~~~~~~~~
   drivers/block/null_blk/main.c:1300:15: note: uninitialized use occurs here
           if (WARN_ON(!rem))
                        ^~~
   include/asm-generic/bug.h:123:25: note: expanded from macro 'WARN_ON'
           int __ret_warn_on = !!(condition);                              \
                                  ^~~~~~~~~
   drivers/block/null_blk/main.c:1295:2: note: remove the 'if' if its condition is always true
           __rq_for_each_bio(bio, req) {
           ^
   include/linux/blk-mq.h:1012:2: note: expanded from macro '__rq_for_each_bio'
           if ((rq->bio))                  \
           ^
   drivers/block/null_blk/main.c:1287:12: note: initialize the variable 'rem' to silence this warning
           size_t rem, temp;
                     ^
                      = 0
   1 warning generated.


vim +1295 drivers/block/null_blk/main.c

  1281	
  1282	static inline int nullb_setup_copy_write(struct nullb *nullb,
  1283			struct request *req, bool is_fua)
  1284	{
  1285		sector_t sector_in, sector_out;
  1286		void *in, *out;
  1287		size_t rem, temp;
  1288		struct bio *bio;
  1289		unsigned long offset_in, offset_out;
  1290		struct nullb_page *t_page_in, *t_page_out;
  1291		int ret = -EIO;
  1292	
  1293		sector_out = blk_rq_pos(req);
  1294	
> 1295		__rq_for_each_bio(bio, req) {
  1296			sector_in = bio->bi_iter.bi_sector;
  1297			rem = bio->bi_iter.bi_size;
  1298		}
  1299	
  1300		if (WARN_ON(!rem))
  1301			return BLK_STS_NOTSUPP;
  1302	
  1303		spin_lock_irq(&nullb->lock);
  1304		while (rem > 0) {
  1305			temp = min_t(size_t, nullb->dev->blocksize, rem);
  1306			offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
  1307			offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
  1308	
  1309			if (null_cache_active(nullb) && !is_fua)
  1310				null_make_cache_space(nullb, PAGE_SIZE);
  1311	
  1312			t_page_in = null_lookup_page(nullb, sector_in, false,
  1313				!null_cache_active(nullb));
  1314			if (!t_page_in)
  1315				goto err;
  1316			t_page_out = null_insert_page(nullb, sector_out,
  1317				!null_cache_active(nullb) || is_fua);
  1318			if (!t_page_out)
  1319				goto err;
  1320	
  1321			in = kmap_local_page(t_page_in->page);
  1322			out = kmap_local_page(t_page_out->page);
  1323	
  1324			memcpy(out + offset_out, in + offset_in, temp);
  1325			kunmap_local(out);
  1326			kunmap_local(in);
  1327			__set_bit(sector_out & SECTOR_MASK, t_page_out->bitmap);
  1328	
  1329			if (is_fua)
  1330				null_free_sector(nullb, sector_out, true);
  1331	
  1332			rem -= temp;
  1333			sector_in += temp >> SECTOR_SHIFT;
  1334			sector_out += temp >> SECTOR_SHIFT;
  1335		}
  1336	
  1337		ret = 0;
  1338	err:
  1339		spin_unlock_irq(&nullb->lock);
  1340		return ret;
  1341	}
  1342	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 1/9] block: Introduce queue limits for copy-offload support
  2023-06-28  6:40       ` Damien Le Moal
@ 2023-06-28 15:35         ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-28 15:35 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, nitheshshetty, gost.dev, Kanchan Joshi, Anuj Gupta,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 6418 bytes --]

On 23/06/28 03:40PM, Damien Le Moal wrote:
>On 6/28/23 03:36, Nitesh Shetty wrote:
>> Add device limits as sysfs entries,
>>         - copy_offload (RW)
>>         - copy_max_bytes (RW)
>>         - copy_max_bytes_hw (RO)
>>
>> Above limits help to split the copy payload in block layer.
>> copy_offload: used for setting copy offload(1) or emulation(0).
>> copy_max_bytes: maximum total length of copy in single payload.
>> copy_max_bytes_hw: Reflects the device supported maximum limit.
>>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> ---
>>  Documentation/ABI/stable/sysfs-block | 33 +++++++++++++++
>>  block/blk-settings.c                 | 24 +++++++++++
>>  block/blk-sysfs.c                    | 63 ++++++++++++++++++++++++++++
>>  include/linux/blkdev.h               | 12 ++++++
>>  include/uapi/linux/fs.h              |  3 ++
>>  5 files changed, 135 insertions(+)
>>
>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>> index c57e5b7cb532..3c97303f658b 100644
>> --- a/Documentation/ABI/stable/sysfs-block
>> +++ b/Documentation/ABI/stable/sysfs-block
>> @@ -155,6 +155,39 @@ Description:
>>  		last zone of the device which may be smaller.
>>
>>
>> +What:		/sys/block/<disk>/queue/copy_offload
>> +Date:		June 2023
>> +Contact:	linux-block@vger.kernel.org
>> +Description:
>> +		[RW] When read, this file shows whether offloading copy to a
>> +		device is enabled (1) or disabled (0). Writing '0' to this
>> +		file will disable offloading copies for this device.
>> +		Writing any '1' value will enable this feature. If the device
>> +		does not support offloading, then writing 1, will result in an
>> +		error.
>
>I am still not convinced that this one is really necessary. copy_max_bytes_hw !=
>0 indicates that the devices supports copy offload. And setting copy_max_bytes
>to 0 can be used to disable copy offload (which probably should be the default
>for now).
>

Agreed, we will do this in the next iteration.
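
For illustration only, with the boolean attribute gone the "is offload enabled"
check could collapse to something like this (hypothetical helper, not part of
this series):

static inline bool blk_queue_copy_offload(struct request_queue *q)
{
	/* a non-zero user-visible copy limit means offload is usable */
	return q->limits.max_copy_sectors != 0;
}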

>> +
>> +
>> +What:		/sys/block/<disk>/queue/copy_max_bytes
>> +Date:		June 2023
>> +Contact:	linux-block@vger.kernel.org
>> +Description:
>> +		[RW] This is the maximum number of bytes that the block layer
>> +		will allow for a copy request. This will is always smaller or
>
>will is -> is
>

acked

>> +		equal to the maximum size allowed by the hardware, indicated by
>> +		'copy_max_bytes_hw'. An attempt to set a value higher than
>> +		'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
>> +
>> +
>> +What:		/sys/block/<disk>/queue/copy_max_bytes_hw
>
>Nit: In keeping with the spirit of attributes like
>max_hw_sectors_kb/max_sectors_kb, I would call this one copy_max_hw_bytes.
>

acked, will update in the next iteration.

>> +Date:		June 2023
>> +Contact:	linux-block@vger.kernel.org
>> +Description:
>> +		[RO] This is the maximum number of bytes that the hardware
>> +		will allow for single data copy request.
>> +		A value of 0 means that the device does not support
>> +		copy offload.
>> +
>> +
>>  What:		/sys/block/<disk>/queue/crypto/
>>  Date:		February 2022
>>  Contact:	linux-block@vger.kernel.org
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 4dd59059b788..738cd3f21259 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
>>  	lim->zoned = BLK_ZONED_NONE;
>>  	lim->zone_write_granularity = 0;
>>  	lim->dma_alignment = 511;
>> +	lim->max_copy_sectors_hw = 0;
>> +	lim->max_copy_sectors = 0;
>>  }
>>
>>  /**
>> @@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
>>  	lim->max_dev_sectors = UINT_MAX;
>>  	lim->max_write_zeroes_sectors = UINT_MAX;
>>  	lim->max_zone_append_sectors = UINT_MAX;
>> +	lim->max_copy_sectors_hw = UINT_MAX;
>> +	lim->max_copy_sectors = UINT_MAX;
>>  }
>>  EXPORT_SYMBOL(blk_set_stacking_limits);
>>
>> @@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
>>  }
>>  EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>>
>> +/**
>> + * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
>> + * @q:  the request queue for the device
>> + * @max_copy_sectors: maximum number of sectors to copy
>> + **/
>> +void blk_queue_max_copy_sectors_hw(struct request_queue *q,
>> +		unsigned int max_copy_sectors)
>> +{
>> +	if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
>> +		max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
>> +
>> +	q->limits.max_copy_sectors_hw = max_copy_sectors;
>> +	q->limits.max_copy_sectors = max_copy_sectors;
>> +}
>> +EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
>> +
>>  /**
>>   * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
>>   * @q:  the request queue for the device
>> @@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>  	t->max_segment_size = min_not_zero(t->max_segment_size,
>>  					   b->max_segment_size);
>>
>> +	t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
>> +	t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
>> +						b->max_copy_sectors_hw);
>> +
>>  	t->misaligned |= b->misaligned;
>>
>>  	alignment = queue_limit_alignment_offset(b, start);
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index afc797fb0dfc..43551778d035 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -199,6 +199,62 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
>>  	return queue_var_show(0, page);
>>  }
>>
>> +static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
>> +{
>> +	return queue_var_show(blk_queue_copy(q), page);
>> +}
>> +
>> +static ssize_t queue_copy_offload_store(struct request_queue *q,
>> +				       const char *page, size_t count)
>> +{
>> +	unsigned long copy_offload;
>> +	ssize_t ret = queue_var_store(&copy_offload, page, count);
>> +
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (copy_offload && !q->limits.max_copy_sectors_hw)
>> +		return -EINVAL;
>> +
>> +	if (copy_offload)
>> +		blk_queue_flag_set(QUEUE_FLAG_COPY, q);
>> +	else
>> +		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
>> +
>> +	return count;
>> +}
>
>See above. I think we can drop this attribute.
>
acked

Thank you, 
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 2/9] block: Add copy offload support infrastructure
  2023-06-28  6:45       ` Damien Le Moal
@ 2023-06-28 16:03         ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-28 16:03 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, nitheshshetty, gost.dev, Anuj Gupta, linux-block,
	linux-kernel, linux-doc, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5119 bytes --]

On 23/06/28 03:45PM, Damien Le Moal wrote:
>On 6/28/23 03:36, Nitesh Shetty wrote:
>> Introduce blkdev_copy_offload which takes similar arguments as
>> copy_file_range and performs copy offload between two bdevs.
>
>I am confused... I thought it was discussed to only allow copy offload only
>within a single bdev for now... Did I missi something ?
>

Yes, you are right. Copy is supported within a single bdev only.
We will update this.

>> Introduce REQ_OP_COPY_DST, REQ_OP_COPY_SRC operation.
>> Issue REQ_OP_COPY_DST with destination info along with taking a plug.
>> This flows till request layer and waits for src bio to get merged.
>> Issue REQ_OP_COPY_SRC with source info and this bio reaches request
>> layer and merges with dst request.
>> For any reason, if request comes to driver with either only one of src/dst
>> info we fail the copy offload.
>>
>> Larger copy will be divided, based on max_copy_sectors limit.
>>
>> Suggested-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>>  block/blk-core.c          |   5 ++
>>  block/blk-lib.c           | 177 ++++++++++++++++++++++++++++++++++++++
>>  block/blk-merge.c         |  21 +++++
>>  block/blk.h               |   9 ++
>>  block/elevator.h          |   1 +
>>  include/linux/bio.h       |   4 +-
>>  include/linux/blk_types.h |  21 +++++
>>  include/linux/blkdev.h    |   4 +
>>  8 files changed, 241 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 99d8b9812b18..e6714391c93f 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -796,6 +796,11 @@ void submit_bio_noacct(struct bio *bio)
>>  		if (!q->limits.max_write_zeroes_sectors)
>>  			goto not_supported;
>>  		break;
>> +	case REQ_OP_COPY_SRC:
>> +	case REQ_OP_COPY_DST:
>> +		if (!blk_queue_copy(q))
>> +			goto not_supported;
>> +		break;
>>  	default:
>>  		break;
>>  	}
>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>> index e59c3069e835..10c3eadd5bf6 100644
>> --- a/block/blk-lib.c
>> +++ b/block/blk-lib.c
>> @@ -115,6 +115,183 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>>  }
>>  EXPORT_SYMBOL(blkdev_issue_discard);
>>
>> +/*
>> + * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
>> + * This must only be called once all bios have been issued so that the refcount
>> + * can only decrease. This just waits for all bios to make it through
>> + * blkdev_copy_(offload/emulate)_(read/write)_endio.
>> + */
>> +static ssize_t blkdev_copy_wait_io_completion(struct cio *cio)
>> +{
>> +	ssize_t ret;
>> +
>> +	if (cio->endio)
>> +		return 0;
>> +
>> +	if (atomic_read(&cio->refcount)) {
>> +		__set_current_state(TASK_UNINTERRUPTIBLE);
>> +		blk_io_schedule();
>> +	}
>> +
>> +	ret = cio->comp_len;
>> +	kfree(cio);
>> +
>> +	return ret;
>> +}
>> +
>> +static void blkdev_copy_offload_read_endio(struct bio *bio)
>> +{
>> +	struct cio *cio = bio->bi_private;
>> +	sector_t clen;
>> +
>> +	if (bio->bi_status) {
>> +		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
>> +		cio->comp_len = min_t(sector_t, clen, cio->comp_len);
>> +	}
>> +	bio_put(bio);
>> +
>> +	if (!atomic_dec_and_test(&cio->refcount))
>> +		return;
>> +	if (cio->endio) {
>> +		cio->endio(cio->private, cio->comp_len);
>> +		kfree(cio);
>> +	} else
>> +		blk_wake_io_task(cio->waiter);
>
>Curly brackets around else missing.
>

Acked.

>> +}
>> +
>> +/*
>> + * __blkdev_copy_offload	- Use device's native copy offload feature.
>> + * we perform copy operation by sending 2 bio.
>> + * 1. We take a plug and send a REQ_OP_COPY_DST bio along with destination
>> + * sector and length. Once this bio reaches request layer, we form a request and
>> + * wait for src bio to arrive.
>> + * 2. We issue REQ_OP_COPY_SRC bio along with source sector and length. Once
>> + * this bio reaches request layer and find a request with previously sent
>> + * destination info we merge the source bio and return.
>> + * 3. Release the plug and request is sent to driver
>> + *
>> + * Returns the length of bytes copied or error if encountered
>> + */
>> +static ssize_t __blkdev_copy_offload(
>> +		struct block_device *bdev_in, loff_t pos_in,
>> +		struct block_device *bdev_out, loff_t pos_out,
>> +		size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
>> +{
>> +	struct cio *cio;
>> +	struct bio *read_bio, *write_bio;
>> +	sector_t rem, copy_len, max_copy_len;
>> +	struct blk_plug plug;
>> +
>> +	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
>> +	if (!cio)
>> +		return -ENOMEM;
>> +	atomic_set(&cio->refcount, 0);
>> +	cio->waiter = current;
>> +	cio->endio = endio;
>> +	cio->private = private;
>> +
>> +	max_copy_len = min(bdev_max_copy_sectors(bdev_in),
>> +			bdev_max_copy_sectors(bdev_out)) << SECTOR_SHIFT;
>
>According to patch 1, this can end up being 0, so the loop below will be infinite.
>

Agreed. As you suggested earlier, removing the copy_offload parameter
and checking copy_max_sector to identify copy offload capability should
solve this.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 3/9] block: add emulation for copy
  2023-06-28  6:50       ` Damien Le Moal
@ 2023-06-28 16:10         ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-28 16:10 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, nitheshshetty, gost.dev, Vincent Fu, Anuj Gupta,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3408 bytes --]

On 23/06/28 03:50PM, Damien Le Moal wrote:
>On 6/28/23 03:36, Nitesh Shetty wrote:
>> For the devices which does not support copy, copy emulation is added.
>> It is required for in-kernel users like fabrics, where file descriptor is
>> not available and hence they can't use copy_file_range.
>> Copy-emulation is implemented by reading from source into memory and
>> writing to the corresponding destination asynchronously.
>> Also emulation is used, if copy offload fails or partially completes.
>>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> ---
>>  block/blk-lib.c           | 183 +++++++++++++++++++++++++++++++++++++-
>>  block/blk-map.c           |   4 +-
>>  include/linux/blk_types.h |   5 ++
>>  include/linux/blkdev.h    |   3 +
>>  4 files changed, 192 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>> index 10c3eadd5bf6..09e0d5d51d03 100644
>> --- a/block/blk-lib.c
>> +++ b/block/blk-lib.c
>> @@ -234,6 +234,180 @@ static ssize_t __blkdev_copy_offload(
>>  	return blkdev_copy_wait_io_completion(cio);
>>  }
>>
>> +static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
>> +		gfp_t gfp_mask)
>> +{
>> +	int min_size = PAGE_SIZE;
>> +	void *buf;
>> +
>> +	while (req_size >= min_size) {
>> +		buf = kvmalloc(req_size, gfp_mask);
>> +		if (buf) {
>> +			*alloc_size = req_size;
>> +			return buf;
>> +		}
>> +		/* retry half the requested size */
>
>Kind of obvious :)

Acked, will remove.

>
>> +		req_size >>= 1;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static void blkdev_copy_emulate_write_endio(struct bio *bio)
>> +{
>> +	struct copy_ctx *ctx = bio->bi_private;
>> +	struct cio *cio = ctx->cio;
>> +	sector_t clen;
>> +
>> +	if (bio->bi_status) {
>> +		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
>> +		cio->comp_len = min_t(sector_t, clen, cio->comp_len);
>> +	}
>> +	kfree(bvec_virt(&bio->bi_io_vec[0]));
>> +	bio_map_kern_endio(bio);
>> +	kfree(ctx);
>> +	if (atomic_dec_and_test(&cio->refcount)) {
>> +		if (cio->endio) {
>> +			cio->endio(cio->private, cio->comp_len);
>> +			kfree(cio);
>> +		} else
>> +			blk_wake_io_task(cio->waiter);
>
>Curly brackets.
>
acked

>> +	}
>> +}
>> +
>> +static void blkdev_copy_emulate_read_endio(struct bio *read_bio)
>> +{
>> +	struct copy_ctx *ctx = read_bio->bi_private;
>> +	struct cio *cio = ctx->cio;
>> +	sector_t clen;
>> +
>> +	if (read_bio->bi_status) {
>> +		clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
>> +						cio->pos_in;
>> +		cio->comp_len = min_t(sector_t, clen, cio->comp_len);
>> +		kfree(bvec_virt(&read_bio->bi_io_vec[0]));
>> +		bio_map_kern_endio(read_bio);
>> +		kfree(ctx);
>> +
>> +		if (atomic_dec_and_test(&cio->refcount)) {
>> +			if (cio->endio) {
>> +				cio->endio(cio->private, cio->comp_len);
>> +				kfree(cio);
>> +			} else
>> +				blk_wake_io_task(cio->waiter);
>
>Same.

acked

>
>> +		}
>> +	}
>> +	schedule_work(&ctx->dispatch_work);
>
>ctx may have been freed above.

acked, will fix this.

>
>> +	kfree(read_bio);
>> +}
>> +
>> +static void blkdev_copy_dispatch_work(struct work_struct *work)
>> +{
>> +	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
>> +			dispatch_work);
>
>Please align the argument, or even better: split the line after "=".


acked.

Thank you
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device
  2023-06-28  6:51       ` Damien Le Moal
@ 2023-06-28 16:39         ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-28 16:39 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, nitheshshetty, gost.dev, Anuj Gupta, linux-block,
	linux-kernel, linux-doc, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2871 bytes --]

On 23/06/28 03:51PM, Damien Le Moal wrote:
>On 6/28/23 03:36, Nitesh Shetty wrote:
>> For direct block device opened with O_DIRECT, use copy_file_range to
>> issue device copy offload, and fallback to generic_copy_file_range incase
>> device copy offload capability is absent.
>
>...if the device does not support copy offload or the device files are not open
>with O_DIRECT.
>
>No ?
>
Yes, you're right. We will fall back to generic_copy_file_range in either of
these cases.

>> Modify checks to allow bdevs to use copy_file_range.
>>
>> Suggested-by: Ming Lei <ming.lei@redhat.com>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>>  block/blk-lib.c        | 26 ++++++++++++++++++++++++++
>>  block/fops.c           | 20 ++++++++++++++++++++
>>  fs/read_write.c        |  7 +++++--
>>  include/linux/blkdev.h |  4 ++++
>>  4 files changed, 55 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>> index 09e0d5d51d03..7d8e09a99254 100644
>> --- a/block/blk-lib.c
>> +++ b/block/blk-lib.c
>> @@ -473,6 +473,32 @@ ssize_t blkdev_copy_offload(
>>  }
>>  EXPORT_SYMBOL_GPL(blkdev_copy_offload);
>>
>> +/* Copy source offset from source block device to destination block
>> + * device. Returns the length of bytes copied.
>> + */
>
>Multi-line comment style: start with a "/*" line please.
>
acked

>> +ssize_t blkdev_copy_offload_failfast(
>
>What is the "failfast" in the name for ?

We don't want failed copy offload IOs to fall back to block layer copy emulation.
We wanted an API that returns an error if the offload fails.

>
>> +		struct block_device *bdev_in, loff_t pos_in,
>> +		struct block_device *bdev_out, loff_t pos_out,
>> +		size_t len, gfp_t gfp_mask)
>> +{
>> +	struct request_queue *in_q = bdev_get_queue(bdev_in);
>> +	struct request_queue *out_q = bdev_get_queue(bdev_out);
>> +	ssize_t ret = 0;
>
>You do not need this initialization.
>

We need this initialization because __blkdev_copy_offload returns the number of
bytes copied or an error value.
So we cannot return 0 in case of success/partial completion.
blkdev_copy_offload_failfast is expected to return the number of bytes copied.

>> +
>> +	if (blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len))
>> +		return 0;
>> +
>> +	if (blk_queue_copy(in_q) && blk_queue_copy(out_q)) {
>
>Given that I think we do not allow copies between different devices, in_q and
>out_q should always be the same, no ?

acked, will update this.

>
>> +		ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
>> +				len, NULL, NULL, gfp_mask);
>
>Same here. Why pass 2 bdevs if we only allow copies within the same device ?
>

acked, will update the function arguments to take a single bdev.

>> +		if (ret < 0)
>> +			return 0;
>> +	}
>> +
>> +	return ret;
>
>return 0;
>

Nack, explained above.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 3/9] block: add emulation for copy
  2023-06-27 18:36     ` [PATCH v13 3/9] block: add emulation for copy Nitesh Shetty
  2023-06-28  6:50       ` Damien Le Moal
@ 2023-06-29  8:33       ` Ming Lei
  2023-06-30 11:22         ` Nitesh Shetty
  2023-07-20  7:50       ` Christoph Hellwig
  2 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2023-06-29  8:33 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	dlemoal, nitheshshetty, gost.dev, Vincent Fu, Anuj Gupta,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel,
	ming.lei

Hi Nitesh,

On Wed, Jun 28, 2023 at 12:06:17AM +0530, Nitesh Shetty wrote:
> For the devices which does not support copy, copy emulation is added.
> It is required for in-kernel users like fabrics, where file descriptor is

I can understand that the copy command helps for FS GC and fabrics storage,
but it is still not very clear why copy emulation is needed for in-kernel users.
Is it just to cover both the copy command and emulation behind a single
interface? Or are there other purposes?

I'd suggest to add more words about in-kernel users of copy emulation.

> not available and hence they can't use copy_file_range.
> Copy-emulation is implemented by reading from source into memory and
> writing to the corresponding destination asynchronously.
> Also emulation is used, if copy offload fails or partially completes.

Per my understanding, this kind of emulation may not be as efficient
as doing it in userspace (two linked io_uring SQEs, read & write with a
shared buffer). But it is fine if there are real in-kernel users like that.
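
For reference, the userspace alternative described above would look roughly like
this (liburing sketch; fd, src_off and dst_off are placeholders, error handling
omitted):

#include <liburing.h>

	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[64 * 1024];			/* shared buffer */

	io_uring_queue_init(8, &ring, 0);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, sizeof(buf), src_off);
	sqe->flags |= IOSQE_IO_LINK;		/* write starts only after the read */

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, sizeof(buf), dst_off);

	io_uring_submit(&ring);
	for (int i = 0; i < 2; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
	}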


Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 3/9] block: add emulation for copy
  2023-06-29  8:33       ` Ming Lei
@ 2023-06-30 11:22         ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-06-30 11:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	dlemoal, nitheshshetty, gost.dev, Vincent Fu, Anuj Gupta,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1607 bytes --]

On 23/06/29 04:33PM, Ming Lei wrote:
>Hi Nitesh,
>
>On Wed, Jun 28, 2023 at 12:06:17AM +0530, Nitesh Shetty wrote:
>> For the devices which does not support copy, copy emulation is added.
>> It is required for in-kernel users like fabrics, where file descriptor is
>
>I can understand that the copy command helps for FS GC and fabrics storage,
>but it is still not very clear why copy emulation is needed for in-kernel users.
>Is it just to cover both the copy command and emulation behind a single
>interface? Or are there other purposes?
>
>I'd suggest to add more words about in-kernel users of copy emulation.
>

As you mentioned above, we need a single interface covering both the
copy command and emulation.
This is needed in the fabrics case: we expose even targets that do not
support the copy command as copy capable, so we fall back to emulation
once we receive a copy from the host/initiator.
Agreed, we will add more description to convey this.

>> not available and hence they can't use copy_file_range.
>> Copy-emulation is implemented by reading from source into memory and
>> writing to the corresponding destination asynchronously.
>> Also emulation is used, if copy offload fails or partially completes.
>
>Per my understanding, this kind of emulation may not be as efficient
>as doing it in userspace (two linked io_uring SQEs, read & write with a
>shared buffer). But it is fine if there are real in-kernel users like that.
>

We do have plans for a uring-based copy interface in the next phase,
once the current series is merged.
With the current design we really see the advantage of emulation in the fabrics case.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 1/9] block: Introduce queue limits for copy-offload support
  2023-06-27 18:36     ` [PATCH v13 1/9] block: Introduce queue limits for copy-offload support Nitesh Shetty
  2023-06-28  6:40       ` Damien Le Moal
@ 2023-07-20  7:06       ` Christoph Hellwig
  2023-07-20  7:58       ` Christoph Hellwig
  2 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2023-07-20  7:06 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Kanchan Joshi,
	Anuj Gupta, linux-block, linux-kernel, linux-doc, linux-nvme,
	linux-fsdevel

Just a little heads up:  I think we need to properly solve the
hw vs user limit, as the various ad-hoc mechanisms tend to be
broken and don't scale.  I plan to start looking into that in the
next few days.

On Wed, Jun 28, 2023 at 12:06:15AM +0530, Nitesh Shetty wrote:
> Add device limits as sysfs entries,
>         - copy_offload (RW)
>         - copy_max_bytes (RW)
>         - copy_max_bytes_hw (RO)
> 
> Above limits help to split the copy payload in block layer.
> copy_offload: used for setting copy offload(1) or emulation(0).
> copy_max_bytes: maximum total length of copy in single payload.
> copy_max_bytes_hw: Reflects the device supported maximum limit.

Why do we need the separate copy_offload boolean vs just looking
at the max_bytes value?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 2/9] block: Add copy offload support infrastructure
  2023-06-27 18:36     ` [PATCH v13 2/9] block: Add copy offload support infrastructure Nitesh Shetty
  2023-06-28  6:45       ` Damien Le Moal
@ 2023-07-20  7:42       ` Christoph Hellwig
  2023-07-27 10:29         ` Nitesh Shetty
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2023-07-20  7:42 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Anuj Gupta,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

I wonder if this might benefit from splitting the actual block
layer copy infrastructure out from the blkdev_copy_offload* helpers
that just make use of it.

> Suggested-by: Christoph Hellwig <hch@lst.de>

Hmm, I'm not sure I suggested adding copy offload..

> +/*
> + * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
> + * This must only be called once all bios have been issued so that the refcount
> + * can only decrease. This just waits for all bios to make it through
> + * blkdev_copy_(offload/emulate)_(read/write)_endio.
> + */
> +static ssize_t blkdev_copy_wait_io_completion(struct cio *cio)
> +{
> +	ssize_t ret;
> +
> +	if (cio->endio)
> +		return 0;

I'd move this to the caller to make things more clear.

> +
> +	if (atomic_read(&cio->refcount)) {
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		blk_io_schedule();

I don't think the refcount scheme you have works, instead you need
to have an extra count for the submitter, which is dropped using
atomic_dec_and_test here.  Take a look at ref scheme in blkdev_dio
which should be applicable here.
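
Roughly, the blkdev_dio-style scheme could look like this (sketch only, not the
eventual code):

	/* at allocation time: the submitter itself holds one reference */
	atomic_set(&cio->refcount, 1);

	/* one extra reference per submitted bio */
	atomic_inc(&cio->refcount);
	submit_bio(read_bio);

	/*
	 * In blkdev_copy_wait_io_completion(): drop the submitter reference,
	 * then wait.  The final endio would clear cio->waiter (WRITE_ONCE)
	 * before calling blk_wake_io_task().
	 */
	if (!atomic_dec_and_test(&cio->refcount)) {
		for (;;) {
			set_current_state(TASK_UNINTERRUPTIBLE);
			if (!READ_ONCE(cio->waiter))
				break;
			blk_io_schedule();
		}
		__set_current_state(TASK_RUNNING);
	}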

> +	ret = cio->comp_len;

The comp_len name for this variable confuses me.  I think it is the length
that has successfully been copied.  So maybe name it copied?

> +static void blkdev_copy_offload_read_endio(struct bio *bio)
> +{
> +	struct cio *cio = bio->bi_private;
> +	sector_t clen;
> +
> +	if (bio->bi_status) {
> +		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
> +		cio->comp_len = min_t(sector_t, clen, cio->comp_len);

bi_sector can be thrashed once you hit the end_io handler.

> +	if (!atomic_dec_and_test(&cio->refcount))
> +		return;
> +	if (cio->endio) {
> +		cio->endio(cio->private, cio->comp_len);
> +		kfree(cio);
> +	} else
> +		blk_wake_io_task(cio->waiter);
> +}

This code is duplicated in a few places, please add a helper for it.

Also don't we need a way to return an error code through ->endio?
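
For instance (hypothetical helper name; whether to also pass a status through
->endio is the open question above):

static void blkdev_copy_put_cio(struct cio *cio)
{
	if (!atomic_dec_and_test(&cio->refcount))
		return;
	if (cio->endio) {
		/* possibly pass an error/status code here as well */
		cio->endio(cio->private, cio->comp_len);
		kfree(cio);
	} else {
		blk_wake_io_task(cio->waiter);
	}
}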

> +static ssize_t __blkdev_copy_offload(
> +		struct block_device *bdev_in, loff_t pos_in,
> +		struct block_device *bdev_out, loff_t pos_out,
> +		size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)

I'd call this something like blkdev_copy_native, or maybe just
blkdev_copy_offload.  Also, given that we only want to support
single-device copies, there is no need to pass two block_devices here.

Also please use the available space on the declaration line:

static ssize_t __blkdev_copy_offload(struct block_device *bdev_in,
		loff_t pos_in, struct block_device *bdev_out, loff_t pos_out,
		size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)

Also, the cio_iodone_t name is very generic.  Given that there aren't
many places where we pass these callbacks, I'd probably just drop the
typedef entirely.

> +	/* If there is a error, comp_len will be set to least successfully
> +	 * completed copied length
> +	 */

This is not the canonical kernel comment style.
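
I.e.:

	/*
	 * If there is an error, comp_len will be set to the least successfully
	 * completed copied length.
	 */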

> +	cio->comp_len = len;
> +	for (rem = len; rem > 0; rem -= copy_len) {
> +		copy_len = min(rem, max_copy_len);
> +
> +		write_bio = bio_alloc(bdev_out, 0, REQ_OP_COPY_DST, gfp_mask);
> +		if (!write_bio)
> +			goto err_write_bio_alloc;
> +		write_bio->bi_iter.bi_size = copy_len;
> +		write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
> +
> +		blk_start_plug(&plug);
> +		read_bio = blk_next_bio(write_bio, bdev_in, 0, REQ_OP_COPY_SRC,
> +						gfp_mask);
> +		read_bio->bi_iter.bi_size = copy_len;
> +		read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
> +		read_bio->bi_end_io = blkdev_copy_offload_read_endio;
> +		read_bio->bi_private = cio;

The chaining order here seems inverse to what I'd expect.  At least
for NVMe the copy command supports multiple input ranges being copied
to a single output range, and that is a very useful and important
feature for garbage collection in out-of-place-write file systems.

So I'd expect to see one or more read bios first, which get chained
to the write bio that drives the completion.  We don't need the
multiple input range support in the very first version, but I'd expect
it to be added soon after, so we had better get the infrastructure right
for it.
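
I.e. roughly (untested sketch; the dst endio name is made up):

	/* the write bio is the parent and drives the completion */
	write_bio = bio_alloc(bdev, 0, REQ_OP_COPY_DST, gfp_mask);
	write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
	write_bio->bi_iter.bi_size = copy_len;
	write_bio->bi_end_io = blkdev_copy_offload_dst_endio;
	write_bio->bi_private = cio;

	/* one (or later more) source bios chained to it */
	read_bio = bio_alloc(bdev, 0, REQ_OP_COPY_SRC, gfp_mask);
	read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
	read_bio->bi_iter.bi_size = copy_len;
	bio_chain(read_bio, write_bio);

	/* dst first so the src bio can merge into the dst request */
	submit_bio(write_bio);
	submit_bio(read_bio);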

> +
> +static inline ssize_t blkdev_copy_sanity_check(
> +	struct block_device *bdev_in, loff_t pos_in,
> +	struct block_device *bdev_out, loff_t pos_out,
> +	size_t len)

Two tab indents for the prototype, please.

> +{
> +	unsigned int align = max(bdev_logical_block_size(bdev_out),
> +					bdev_logical_block_size(bdev_in)) - 1;
> +
> +	if (bdev_read_only(bdev_out))
> +		return -EPERM;
> +
> +	if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
> +		len >= COPY_MAX_BYTES)

This indentation should also use two tabs or alignment with the opening
brace, and not the same indentation as the indented branch.

> +ssize_t blkdev_copy_offload(

Just blkdev_copy?  Especially as the non-offloaded version is added
later.

> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 65e75efa9bd3..bfd86c54df22 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -922,6 +922,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
>  	if (!rq_mergeable(rq) || !bio_mergeable(bio))
>  		return false;
>  
> +	if ((req_op(rq) == REQ_OP_COPY_DST) && (bio_op(bio) == REQ_OP_COPY_SRC))
> +		return true;

This seems to be equivalent to blk_copy_offload_mergable, so why not
use that?
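
I.e. just:

	if (blk_copy_offload_mergable(rq, bio))
		return true;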

> +static enum bio_merge_status bio_attempt_copy_offload_merge(
> +	struct request_queue *q, struct request *req, struct bio *bio)

Same comment about the indentation as above (I'm not going to mention
it further, please do a sweep).

Also we don't need the q argument, it must be req->q.

> +{
> +	if (req->__data_len != bio->bi_iter.bi_size)
> +		return BIO_MERGE_FAILED;
> +
> +	req->biotail->bi_next = bio;
> +	req->biotail = bio;
> +	req->nr_phys_segments = blk_rq_nr_phys_segments(req) + 1;

This should just be req->nr_phys_segments++, shouldn't it?

>  }
>  
> +static inline bool blk_copy_offload_mergable(struct request *req,
> +					     struct bio *bio)
> +{
> +	return ((req_op(req) == REQ_OP_COPY_DST)  &&
> +		(bio_op(bio) == REQ_OP_COPY_SRC));
> +}

Can you please add a comment explaining the logic of merging different
operations here?

Also all the braces in the function body are superfluous and there is a
double whitespace before the &&.
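
E.g. (sketch):

/*
 * Copy offload submits a REQ_OP_COPY_DST bio followed by a
 * REQ_OP_COPY_SRC bio, and the driver needs both in a single request,
 * so this particular pairing of two different operations is allowed
 * to merge.
 */
static inline bool blk_copy_offload_mergable(struct request *req,
					     struct bio *bio)
{
	return req_op(req) == REQ_OP_COPY_DST &&
	       bio_op(bio) == REQ_OP_COPY_SRC;
}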

>  static inline unsigned int blk_rq_get_max_segments(struct request *rq)
>  {
>  	if (req_op(rq) == REQ_OP_DISCARD)
> @@ -303,6 +310,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
>  		break;
>  	}
>  
> +	if (unlikely(op_is_copy(bio->bi_opf)))
> +		return false;

This looks wrong to me.  I think the copy ops need to be added to the
switch statement above as they have non-trivial splitting decisions.
Or at least should have those as we're missing the code to split
copy commands right now.

> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index c4f5b5228105..a2673f24e493 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -57,7 +57,9 @@ static inline bool bio_has_data(struct bio *bio)
>  	    bio->bi_iter.bi_size &&
>  	    bio_op(bio) != REQ_OP_DISCARD &&
>  	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
> -	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
> +	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
> +	    bio_op(bio) != REQ_OP_COPY_DST &&
> +	    bio_op(bio) != REQ_OP_COPY_SRC)

It probably makes sense to replace this with a positive check
for the operations that do have data as a prep patch, which is
just REQ_OP_READ and REQ_OP_WRITE.
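
I.e. something like this as the prep patch (sketch; note that
REQ_OP_ZONE_APPEND also carries data today, so it probably belongs in
the positive list as well):

static inline bool bio_has_data(struct bio *bio)
{
	if (!bio || !bio->bi_iter.bi_size)
		return false;

	return bio_op(bio) == REQ_OP_READ ||
	       bio_op(bio) == REQ_OP_WRITE ||
	       bio_op(bio) == REQ_OP_ZONE_APPEND;
}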

>  	/* reset all the zone present on the device */
>  	REQ_OP_ZONE_RESET_ALL	= (__force blk_opf_t)17,
>  
> +	REQ_OP_COPY_SRC		= (__force blk_opf_t)18,
> +	REQ_OP_COPY_DST		= (__force blk_opf_t)19,

Little comments on these ops, please.

> +static inline bool op_is_copy(blk_opf_t op)
> +{
> +	return (((op & REQ_OP_MASK) == REQ_OP_COPY_SRC) ||
> +		((op & REQ_OP_MASK) == REQ_OP_COPY_DST));

All but the innermost braces here are superfluous.
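
I.e.:

static inline bool op_is_copy(blk_opf_t op)
{
	return (op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
	       (op & REQ_OP_MASK) == REQ_OP_COPY_DST;
}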

> +struct cio {
> +	struct task_struct *waiter;     /* waiting task (NULL if none) */
> +	loff_t pos_in;
> +	loff_t pos_out;
> +	ssize_t comp_len;
> +	cio_iodone_t *endio;		/* applicable for async operation */
> +	void *private;			/* applicable for async operation */
> +	atomic_t refcount;
> +};

The name for this structure is way too generic.  It also is only used
inside of blk-lib.c and should be moved there.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 3/9] block: add emulation for copy
  2023-06-27 18:36     ` [PATCH v13 3/9] block: add emulation for copy Nitesh Shetty
  2023-06-28  6:50       ` Damien Le Moal
  2023-06-29  8:33       ` Ming Lei
@ 2023-07-20  7:50       ` Christoph Hellwig
  2023-08-01 13:07         ` Nitesh Shetty
  2 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2023-07-20  7:50 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Vincent Fu,
	Anuj Gupta, linux-block, linux-kernel, linux-doc, linux-nvme,
	linux-fsdevel

> +static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
> +		gfp_t gfp_mask)
> +{
> +	int min_size = PAGE_SIZE;
> +	void *buf;
> +
> +	while (req_size >= min_size) {
> +		buf = kvmalloc(req_size, gfp_mask);
> +		if (buf) {
> +			*alloc_size = req_size;
> +			return buf;
> +		}
> +		/* retry half the requested size */
> +		req_size >>= 1;
> +	}
> +
> +	return NULL;

Is there any good reason for using vmalloc instead of a bunch
of discontiguous pages?

> +		ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
> +		if (!ctx)
> +			goto err_ctx;

I'd suspect it would be better to just allocate a single buffer and
only have a single outstanding copy.  That will reduce the bandwidth
you can theoretically get, but copies tend to be background operations
anyway.  It will reduce the required memory, and thus the chance for
this operation to fail on a loaded system.  It will also dramatically
reduce the effect on memory management.
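
Roughly like this (untested sketch; blkdev_copy_emulate_one() is a
made-up helper that synchronously reads one chunk from the source and
writes it back out to the destination):

	size_t chunk = min_t(size_t, len, COPY_MAX_BYTES);
	size_t copied, cur;
	ssize_t ret = 0;
	void *buf;

	buf = kvmalloc(chunk, gfp_mask);
	if (!buf)
		return -ENOMEM;

	for (copied = 0; copied < len; copied += cur) {
		cur = min_t(size_t, len - copied, chunk);
		ret = blkdev_copy_emulate_one(bdev_in, pos_in + copied,
					      bdev_out, pos_out + copied,
					      buf, cur, gfp_mask);
		if (ret)
			break;
	}

	kvfree(buf);
	return copied ? copied : ret;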

> +		read_bio = bio_map_kern(in, buf, buf_len, gfp_mask);
> +		if (IS_ERR(read_bio))
> +			goto err_read_bio;
> +
> +		write_bio = bio_map_kern(out, buf, buf_len, gfp_mask);
> +		if (IS_ERR(write_bio))
> +			goto err_write_bio;

bio_map_kern is only for passthrough I/Os.  You need to use
a bio_add_page loop here.
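
Something like this, for example (untested; it has to handle both the
kmalloc and the vmalloc case of a kvmalloc() buffer):

static int blkdev_copy_add_buf(struct bio *bio, void *buf, size_t len)
{
	while (len) {
		struct page *page = is_vmalloc_addr(buf) ?
				vmalloc_to_page(buf) : virt_to_page(buf);
		unsigned int off = offset_in_page(buf);
		unsigned int cur = min_t(size_t, len, PAGE_SIZE - off);

		if (bio_add_page(bio, page, cur, off) != cur)
			return -ENOMEM;
		buf += cur;
		len -= cur;
	}
	return 0;
}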

> index 336146798e56..f8c80940c7ad 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -562,4 +562,9 @@ struct cio {
>  	atomic_t refcount;
>  };
>  
> +struct copy_ctx {
> +	struct cio *cio;
> +	struct work_struct dispatch_work;
> +	struct bio *write_bio;
> +};

This is misnamed as it's only used by the fallback code, and also
should be private to blk-lib.c.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device
  2023-06-27 18:36     ` [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device Nitesh Shetty
  2023-06-28  6:51       ` Damien Le Moal
@ 2023-07-20  7:57       ` Christoph Hellwig
  2023-07-24  5:46         ` Nitesh Shetty
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2023-07-20  7:57 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Anuj Gupta,
	linux-block, linux-kernel, linux-doc, linux-nvme, linux-fsdevel

> +/* Copy source offset from source block device to destination block
> + * device. Returns the length of bytes copied.
> + */
> +ssize_t blkdev_copy_offload_failfast(
> +		struct block_device *bdev_in, loff_t pos_in,
> +		struct block_device *bdev_out, loff_t pos_out,
> +		size_t len, gfp_t gfp_mask)

This is an odd and very misnamed interface.

Either we have a blkdev_copy() interface that automatically falls back
to emulation (maybe with an opt-out), or we have separate
blkdev_copy_offload/blkdev_copy_emulated interfaces and let the caller
decide.  But none of that really is "failfast".
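
I.e. either (prototypes only, with a single bdev as discussed for the
infrastructure patch):

ssize_t blkdev_copy(struct block_device *bdev, loff_t pos_in,
		loff_t pos_out, size_t len, gfp_t gfp_mask);

or:

ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
		loff_t pos_out, size_t len, gfp_t gfp_mask);
ssize_t blkdev_copy_emulated(struct block_device *bdev, loff_t pos_in,
		loff_t pos_out, size_t len, gfp_t gfp_mask);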

Also this needs to go into the helpers patch and not a patch that is
supposed to just wire copying up for block device node.

> index b07de77ef126..d27148a2543f 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1447,7 +1447,8 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>  		return -EOVERFLOW;
>  
>  	/* Shorten the copy to EOF */
> -	size_in = i_size_read(inode_in);
> +	size_in = i_size_read(file_in->f_mapping->host);

generic_copy_file_checks needs to be fixed to use ->mapping->host for both
inode_in and inode_out at the top of the function instead of this
band-aid.  And that needs to be a separate patch with a Fixes tag.
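
I.e. at the top of generic_copy_file_checks(), as the separate prep
patch (sketch):

	struct inode *inode_in = file_in->f_mapping->host;
	struct inode *inode_out = file_out->f_mapping->host;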

> @@ -1708,7 +1709,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
>  	/* Don't copy dirs, pipes, sockets... */
>  	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>  		return -EISDIR;
> -	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> +
> +	if ((!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) &&
> +		(!S_ISBLK(inode_in->i_mode) || !S_ISBLK(inode_out->i_mode)))

This is using weird indentation, and might also not be doing
exactly what we want.  I think the better thing to do here is to:

 1) check for the acceptable types only on the in inode
 2) have a check that the mode matches for the in and out inodes

And please do this as a separate prep patch instead of hiding it here.
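
Roughly (sketch only):

	/* 1) only regular files and block devices are acceptable here */
	if (!S_ISREG(inode_in->i_mode) && !S_ISBLK(inode_in->i_mode))
		return -EINVAL;

	/* 2) source and destination must be the same kind of file */
	if ((inode_in->i_mode & S_IFMT) != (inode_out->i_mode & S_IFMT))
		return -EINVAL;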


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 1/9] block: Introduce queue limits for copy-offload support
  2023-06-27 18:36     ` [PATCH v13 1/9] block: Introduce queue limits for copy-offload support Nitesh Shetty
  2023-06-28  6:40       ` Damien Le Moal
  2023-07-20  7:06       ` Christoph Hellwig
@ 2023-07-20  7:58       ` Christoph Hellwig
  2 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2023-07-20  7:58 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Kanchan Joshi,
	Anuj Gupta, linux-block, linux-kernel, linux-doc, linux-nvme,
	linux-fsdevel

> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..dff56813f95a 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -64,6 +64,9 @@ struct fstrim_range {
>  	__u64 minlen;
>  };
>  
> +/* maximum copy offload length, this is set to 128MB based on current testing */
> +#define COPY_MAX_BYTES	(1 << 27)

This should not be in the UAPI.  It is a totally arbitrary internal
limit.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 5/9] nvme: add copy offload support
  2023-06-27 18:36     ` [PATCH v13 5/9] nvme: add copy offload support Nitesh Shetty
@ 2023-07-20  8:00       ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2023-07-20  8:00 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Kanchan Joshi,
	Javier González, Anuj Gupta, linux-block, linux-kernel,
	linux-doc, linux-nvme, linux-fsdevel

> +	if (blk_rq_nr_phys_segments(req) != 2)
> +		return BLK_STS_IOERR;

The magic number of segments adding up source and dest really needs
constants and helpers to make the code understandable.
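
E.g. (sketch, names made up):

/* a copy request is one REQ_OP_COPY_DST bio plus one REQ_OP_COPY_SRC bio */
#define BLK_COPY_NR_SEGMENTS	2

	if (blk_rq_nr_phys_segments(req) != BLK_COPY_NR_SEGMENTS)
		return BLK_STS_IOERR;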

> +	/* +1 shift as dst+src length is added in request merging, we send copy
> +	 * for half the length.
> +	 */
> +	n_lba = blk_rq_bytes(req) >> (ns->lba_shift + 1);

I do not understand the logic and comment here.

> +	if (WARN_ON(!n_lba))

WARN_ON_ONCE

> +		return BLK_STS_NOTSUPP;

and BLK_STS_NOTSUPP seems like the wrong error here, this is an
invalid argument.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device
  2023-07-20  7:57       ` Christoph Hellwig
@ 2023-07-24  5:46         ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-07-24  5:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, martin.petersen, linux-scsi,
	willy, hare, djwong, bvanassche, ming.lei, dlemoal,
	nitheshshetty, gost.dev, Anuj Gupta, linux-block, linux-kernel,
	linux-doc, linux-nvme, linux-fsdevel

On 23/07/20 09:57AM, Christoph Hellwig wrote:
>> +/* Copy source offset from source block device to destination block
>> + * device. Returns the length of bytes copied.
>> + */
>> +ssize_t blkdev_copy_offload_failfast(
>> +		struct block_device *bdev_in, loff_t pos_in,
>> +		struct block_device *bdev_out, loff_t pos_out,
>> +		size_t len, gfp_t gfp_mask)
>
>This is an odd and very misnamed interface.
>
>Either we have a blkdev_copy() interface that automatically falls back
>to emulation (maybe with an opt-out), or we have separate
>blkdev_copy_offload/blkdev_copy_emulated interfaces and let the caller
>decide.  But none of that really is "failfast".
>
>Also this needs to go into the helpers patch and not a patch that is
>supposed to just wire copying up for block device node.
>
Acked.

>> index b07de77ef126..d27148a2543f 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1447,7 +1447,8 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>>  		return -EOVERFLOW;
>>
>>  	/* Shorten the copy to EOF */
>> -	size_in = i_size_read(inode_in);
>> +	size_in = i_size_read(file_in->f_mapping->host);
>
>generic_copy_file_checks needs to be fixed to use ->mapping->host for both
>inode_in and inode_out at the top of the function instead of this
>band-aid.  And that needs to be a separate patch with a Fixes tag.
>
Addressed below.

>> @@ -1708,7 +1709,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
>>  	/* Don't copy dirs, pipes, sockets... */
>>  	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>>  		return -EISDIR;
>> -	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
>> +
>> +	if ((!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) &&
>> +		(!S_ISBLK(inode_in->i_mode) || !S_ISBLK(inode_out->i_mode)))
>
>This is using weird indentation, and might also not be doing
>exactly what we want.  I think the better thing to do here is to:
>
> 1) check for the acceptable types only on the in inode
> 2) have a check that the mode matches for the in and out inodes
>
>And please do this as a separate prep patch instead of hiding it here.
>
Agreed. We will send a separate patch that enables copy_file_range on
block devices.

Thank you,
Nitesh Shetty


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 2/9] block: Add copy offload support infrastructure
  2023-07-20  7:42       ` Christoph Hellwig
@ 2023-07-27 10:29         ` Nitesh Shetty
  0 siblings, 0 replies; 32+ messages in thread
From: Nitesh Shetty @ 2023-07-27 10:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, dm-devel, Keith Busch, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, gost.dev, Anuj Gupta, linux-block,
	linux-kernel, linux-doc, linux-nvme, linux-fsdevel

On Thu, Jul 20, 2023 at 1:12 PM Christoph Hellwig <hch@lst.de> wrote:
> > Suggested-by: Christoph Hellwig <hch@lst.de>
>
> Hmm, I'm not sure I suggested adding copy offload..
>
We meant it for the request-based design; we will remove it.

> >  static inline unsigned int blk_rq_get_max_segments(struct request *rq)
> >  {
> >       if (req_op(rq) == REQ_OP_DISCARD)
> > @@ -303,6 +310,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
> >               break;
> >       }
> >
> > +     if (unlikely(op_is_copy(bio->bi_opf)))
> > +             return false;
>
> This looks wrong to me.  I think the copy ops need to be added to the
> switch statement above as they have non-trivial splitting decisions.
> Or at least should have those as we're missing the code to split
> copy commands right now.
>

Agreed, copy will have non-trivial splitting decisions. But I couldn't
think of scenarios where this could happen, as we check the queue
limits before issuing a copy. Do you see scenarios where a split could
happen for copy here?

Acked for all other review comments.

Thank you,
Nitesh Shetty

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 3/9] block: add emulation for copy
  2023-07-20  7:50       ` Christoph Hellwig
@ 2023-08-01 13:07         ` Nitesh Shetty
  2023-08-02  6:31           ` Kent Overstreet
  0 siblings, 1 reply; 32+ messages in thread
From: Nitesh Shetty @ 2023-08-01 13:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, martin.petersen, linux-scsi,
	willy, hare, djwong, bvanassche, ming.lei, dlemoal,
	nitheshshetty, gost.dev, Vincent Fu, Anuj Gupta, linux-block,
	linux-kernel, linux-doc, linux-nvme, linux-fsdevel

On 23/07/20 09:50AM, Christoph Hellwig wrote:
>> +static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
>> +		gfp_t gfp_mask)
>> +{
>> +	int min_size = PAGE_SIZE;
>> +	void *buf;
>> +
>> +	while (req_size >= min_size) {
>> +		buf = kvmalloc(req_size, gfp_mask);
>> +		if (buf) {
>> +			*alloc_size = req_size;
>> +			return buf;
>> +		}
>> +		/* retry half the requested size */
>> +		req_size >>= 1;
>> +	}
>> +
>> +	return NULL;
>
>Is there any good reason for using vmalloc instead of a bunch
>of discontiguous pages?
>

kvmalloc seemed convenient for the purpose.
We will need to call alloc_page in a loop to guarantee discontiguous pages.
Do you prefer that over kvmalloc?

>> +		ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
>> +		if (!ctx)
>> +			goto err_ctx;
>
>I'd suspect it would be better to just allocate a single buffer and
>only have a single outstanding copy.  That will reduce the bandwidth
>you can theoretically get, but copies tend to be background operations
>anyway.  It will reduce the required memory, and thus the chance for
>this operation to fail on a loaded system.  It will also dramatically
>reduce the effect on memory management.
>

Next version will have that change.

Thank You,
Nitesh Shetty


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v13 3/9] block: add emulation for copy
  2023-08-01 13:07         ` Nitesh Shetty
@ 2023-08-02  6:31           ` Kent Overstreet
  0 siblings, 0 replies; 32+ messages in thread
From: Kent Overstreet @ 2023-08-02  6:31 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Christoph Hellwig, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, dm-devel, Keith Busch, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner,
	martin.petersen, linux-scsi, willy, hare, djwong, bvanassche,
	ming.lei, dlemoal, nitheshshetty, gost.dev, Vincent Fu,
	Anuj Gupta, linux-block, linux-kernel, linux-doc, linux-nvme,
	linux-fsdevel

On Tue, Aug 01, 2023 at 06:37:02PM +0530, Nitesh Shetty wrote:
> On 23/07/20 09:50AM, Christoph Hellwig wrote:
> > > +static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
> > > +		gfp_t gfp_mask)
> > > +{
> > > +	int min_size = PAGE_SIZE;
> > > +	void *buf;
> > > +
> > > +	while (req_size >= min_size) {
> > > +		buf = kvmalloc(req_size, gfp_mask);
> > > +		if (buf) {
> > > +			*alloc_size = req_size;
> > > +			return buf;
> > > +		}
> > > +		/* retry half the requested size */
> > > +		req_size >>= 1;
> > > +	}
> > > +
> > > +	return NULL;
> > 
> > Is there any good reason for using vmalloc instead of a bunch
> > of discontiguous pages?
> > 
> 
> kvmalloc seemed convenient for the purpose. We will need to call alloc_page
> in a loop to guarantee discontiguous pages. Do you prefer that over kvmalloc?

No, kvmalloc should be the preferred approach here now: with large
folios, we're now getting better about doing more large memory
allocations and avoiding fragmentation, so in practice this won't be a
vmalloc allocation except in exceptional circumstances, and performance
will be better and the code will be simpler doing a single large
allocation.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2023-08-02  6:41 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20230627183950epcas5p1b924785633509f612ffa5d9616bfe447@epcas5p1.samsung.com>
2023-06-27 18:36 ` [PATCH v13 0/9] Implement copy offload support Nitesh Shetty
     [not found]   ` <CGME20230627184000epcas5p1c7cb01eb1c70bc5a19f76ce21f2ec3f8@epcas5p1.samsung.com>
2023-06-27 18:36     ` [PATCH v13 1/9] block: Introduce queue limits for copy-offload support Nitesh Shetty
2023-06-28  6:40       ` Damien Le Moal
2023-06-28 15:35         ` Nitesh Shetty
2023-07-20  7:06       ` Christoph Hellwig
2023-07-20  7:58       ` Christoph Hellwig
     [not found]   ` <CGME20230627184010epcas5p4bb6581408d9b67bbbcad633fb26689c9@epcas5p4.samsung.com>
2023-06-27 18:36     ` [PATCH v13 2/9] block: Add copy offload support infrastructure Nitesh Shetty
2023-06-28  6:45       ` Damien Le Moal
2023-06-28 16:03         ` Nitesh Shetty
2023-07-20  7:42       ` Christoph Hellwig
2023-07-27 10:29         ` Nitesh Shetty
     [not found]   ` <CGME20230627184020epcas5p13fdcea52edead5ffa3fae444f923439e@epcas5p1.samsung.com>
2023-06-27 18:36     ` [PATCH v13 3/9] block: add emulation for copy Nitesh Shetty
2023-06-28  6:50       ` Damien Le Moal
2023-06-28 16:10         ` Nitesh Shetty
2023-06-29  8:33       ` Ming Lei
2023-06-30 11:22         ` Nitesh Shetty
2023-07-20  7:50       ` Christoph Hellwig
2023-08-01 13:07         ` Nitesh Shetty
2023-08-02  6:31           ` Kent Overstreet
     [not found]   ` <CGME20230627184029epcas5p49a29676fa6dff5f24ddfa5c64e525a51@epcas5p4.samsung.com>
2023-06-27 18:36     ` [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device Nitesh Shetty
2023-06-28  6:51       ` Damien Le Moal
2023-06-28 16:39         ` Nitesh Shetty
2023-07-20  7:57       ` Christoph Hellwig
2023-07-24  5:46         ` Nitesh Shetty
     [not found]   ` <CGME20230627184039epcas5p2decb92731d3e7dfdf9f2c05309a90bd7@epcas5p2.samsung.com>
2023-06-27 18:36     ` [PATCH v13 5/9] nvme: add copy offload support Nitesh Shetty
2023-07-20  8:00       ` Christoph Hellwig
     [not found]   ` <CGME20230627184049epcas5p293a6e6b75c93e39c7fca1a702e3e3774@epcas5p2.samsung.com>
2023-06-27 18:36     ` [PATCH v13 6/9] nvmet: add copy command support for bdev and file ns Nitesh Shetty
     [not found]   ` <CGME20230627184058epcas5p2226835b15381b856859b162e58572d63@epcas5p2.samsung.com>
2023-06-27 18:36     ` [PATCH v13 7/9] dm: Add support for copy offload Nitesh Shetty
     [not found]   ` <CGME20230627184107epcas5p3e01453c42bafa3ba08b8c8ba183927e6@epcas5p3.samsung.com>
2023-06-27 18:36     ` [PATCH v13 8/9] dm: Enable copy offload for dm-linear target Nitesh Shetty
     [not found]   ` <CGME20230627184117epcas5p3a9102988870743b20127422928f072bd@epcas5p3.samsung.com>
2023-06-27 18:36     ` [PATCH v13 9/9] null_blk: add support for copy offload Nitesh Shetty
2023-06-28 12:11       ` kernel test robot
2023-06-28 12:52       ` kernel test robot
