* [PATCH v6 0/9] Implement copy offload support
From: Nitesh Shetty @ 2023-01-12 11:58 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev,
	Nitesh Shetty, linux-block, linux-kernel, linux-nvme,
	linux-fsdevel

This patch series covers the points discussed in the November 2021
virtual call [LSF/MM/BFP TOPIC] Storage: Copy Offload [0].
We have covered the initially agreed requirements in this patchset and
further additional features suggested by the community.
The patchset borrows Mikulas's token-based approach for the two-bdev
implementation.

This is on top of our previous patchset v5[1].

Overall series supports:
========================
	1. Driver
		- NVMe Copy command (single NS, TP 4065), including support
		in nvme-target (for block and file backend).

	2. Block layer
		- Block-generic copy (REQ_COPY flag), with an interface
		accommodating two block devices, and a
		multi-source/destination interface
		- Emulation, when offload is natively absent
		- dm-linear support (for cases not requiring split)

	3. User-interface
		- new ioctl

	4. In-kernel user
		- dm-kcopyd

Testing
=======
	Copy offload can be tested on:
	a. QEMU: NVMe simple copy (TP 4065), by setting the nvme-ns
		parameters mssrl, mcl and msrc. For more info see [2].
	b. Fabrics loopback.
	c. blktests[3] (tests block/032,033, nvme/046,047,048,049)

	Emulation can be tested on any device.

	A sample application using the IOCTL is present in the patch
	description; fio[4] can also be used. A minimal usage sketch
	follows below.
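
	A rough sketch of such a userspace caller, assuming the struct
	range_entry layout added in patch 2; BLKCOPY is only a placeholder
	name here, the actual ioctl name and argument layout are defined in
	the ioctl patch of this series and may differ:

	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>	/* struct range_entry, once this series is applied */

	/* copy len bytes from byte offset src to dst on one block device fd */
	static int do_copy(int fd, uint64_t src, uint64_t dst, uint64_t len)
	{
		struct range_entry range;

		memset(&range, 0, sizeof(range));
		range.src = src;	/* byte addressed, logical-block aligned */
		range.dst = dst;
		range.len = len;

		/* BLKCOPY: placeholder for the copy ioctl added by this series */
		if (ioctl(fd, BLKCOPY, &range) < 0)
			return -1;
		/* range.comp_len is filled by the kernel with the bytes copied */
		return 0;
	}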

Performance
===========
	With the async design of copy-emulation/offload using fio[4],
	we were able to see the following improvements compared to
	userspace read and write on an NVMeOF TCP setup:
	Setup1: Network Speed: 1000Mb/s
		Host PC: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
		Target PC: AMD Ryzen 9 5900X 12-Core Processor
		block-size 8k, range 1:
			635% improvement in IO BW (107 MiB/s to 787 MiB/s).
			Network utilisation drops from 97% to 14%.
		block-size 2M, range 16:
			2555% improvement in IO BW (100 MiB/s to 2655 MiB/s).
			Network utilisation drops from 89% to 0.62%.
	Setup2: Network Speed: 100Gb/s
		Server: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 72 cores
			(host and target have the same configuration)
		block-size 8k, range 1:
			6.5% improvement in IO BW (791 MiB/s to 843 MiB/s).
			Network utilisation drops from  6.75% to 0.14%.
		block-size 2M, range 16:
			15% improvement in IO BW (1027 MiB/s to 1183 MiB/s).
			Network utilisation drops from 8.42% to ~0%.
		block-size 8k, 8 ranges:
			18% drop in IO BW (from 798 MiB/s to 647 MiB/s)
			Network utilisation drops from 6.66% to 0.13%.

		At present we see a drop in performance for block sizes 8k and
		16k at higher range counts (8, 16), so there is more to
		investigate there.
	Overall, in these tests, kernel copy emulation performs better than
	userspace read+write.

Blktests[3]
===========
	tests/block/032,033: Runs copy offload and emulation on a block device.
	tests/nvme/046,047,048,049: Create a loop-backed fabrics device and
	run copy offload and emulation.

Future Work
===========
	- nullblk: copy-offload emulation
	- generic copy file range (CFR):
		We explored the possibility of using the block device
		def_blk_ops, but we saw a major disadvantage for in-kernel
		users: an fd is not available to an in-kernel user [5].
	- loopback device copy offload support
	- upstream fio to use copy offload

	These are to be taken up after we reach consensus on the
	plumbing of current elements that are part of this series.


Additional links:
=================
	[0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
	[1] https://lore.kernel.org/lkml/20221130041450.GA17533@test-zns/T/
	[2] https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
	[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v6
	[4] https://github.com/vincentkfu/fio/tree/copyoffload
	[5] https://lore.kernel.org/lkml/20221130041450.GA17533@test-zns/T/#m0e2754202fc2223e937c8e7ba3cf7336a93f97a3

Changes since v5:
=================
	- Addition of blktests (Chaitanya Kulkarni)
	- Minor fix for fabrics file-backed path
	- Remove buggy zonefs copy_file_range implementation

Changes since v4:
=================
	- make the offload and emulation design asynchronous (Hannes
	  Reinecke)
	- fabrics loopback support
	- sysfs naming improvements (Damien Le Moal)
	- use kfree() instead of kvfree() in cio_await_completion
	  (Damien Le Moal)
	- use ranges instead of rlist to represent range_entry (Damien
	  Le Moal)
	- change argument ordering in blk_copy_offload as suggested (Damien
	  Le Moal)
	- removed multiple copy limit and merged into only one limit
	  (Damien Le Moal)
	- wrap overly long lines (Damien Le Moal)
	- other naming improvements and cleanups (Damien Le Moal)
	- correctly format the code example in description (Damien Le
	  Moal)
	- mark blk_copy_offload as static (kernel test robot)
	
Changes since v3:
=================
	- added copy_file_range support for zonefs
	- added documentation about new sysfs entries
	- incorporated review comments on v3
	- minor fixes

Changes since v2:
=================
	- fixed possible race condition reported by Damien Le Moal
	- new sysfs controls as suggested by Damien Le Moal
	- fixed possible memory leak reported by Dan Carpenter, lkp
	- minor fixes

Nitesh Shetty (9):
  block: Introduce queue limits for copy-offload support
  block: Add copy offload support infrastructure
  block: add emulation for copy
  block: Introduce a new ioctl for copy
  nvme: add copy offload support
  nvmet: add copy command support for bdev and file ns
  dm: Add support for copy offload.
  dm: Enable copy offload for dm-linear target
  dm kcopyd: use copy offload support

 Documentation/ABI/stable/sysfs-block |  36 ++
 block/blk-lib.c                      | 597 +++++++++++++++++++++++++++
 block/blk-map.c                      |   4 +-
 block/blk-settings.c                 |  24 ++
 block/blk-sysfs.c                    |  64 +++
 block/blk.h                          |   2 +
 block/ioctl.c                        |  36 ++
 drivers/md/dm-kcopyd.c               |  56 ++-
 drivers/md/dm-linear.c               |   1 +
 drivers/md/dm-table.c                |  42 ++
 drivers/md/dm.c                      |   7 +
 drivers/nvme/host/constants.c        |   1 +
 drivers/nvme/host/core.c             | 106 ++++-
 drivers/nvme/host/fc.c               |   5 +
 drivers/nvme/host/nvme.h             |   7 +
 drivers/nvme/host/pci.c              |  27 +-
 drivers/nvme/host/rdma.c             |   7 +
 drivers/nvme/host/tcp.c              |  16 +
 drivers/nvme/host/trace.c            |  19 +
 drivers/nvme/target/admin-cmd.c      |   9 +-
 drivers/nvme/target/io-cmd-bdev.c    |  79 ++++
 drivers/nvme/target/io-cmd-file.c    |  52 +++
 drivers/nvme/target/loop.c           |   6 +
 drivers/nvme/target/nvmet.h          |   2 +
 include/linux/blk_types.h            |  44 ++
 include/linux/blkdev.h               |  18 +
 include/linux/device-mapper.h        |   5 +
 include/linux/nvme.h                 |  43 +-
 include/uapi/linux/fs.h              |  27 ++
 29 files changed, 1324 insertions(+), 18 deletions(-)


base-commit: 469a89fd3bb73bb2eea628da2b3e0f695f80b7ce
-- 
2.35.1.500.gb896f729e2



* [PATCH v6 1/9] block: Introduce queue limits for copy-offload support
From: Nitesh Shetty @ 2023-01-12 11:58 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev,
	Nitesh Shetty, Hannes Reinecke, linux-block, linux-kernel,
	linux-nvme, linux-fsdevel

Add device limits as sysfs entries:
        - copy_offload (RW)
        - copy_max_bytes (RW)
        - copy_max_bytes_hw (RO)

The above limits help to split the copy payload in the block layer.
copy_offload: used to set copy offload (1) or emulation (0).
copy_max_bytes: maximum total length of a copy in a single payload.
copy_max_bytes_hw: reflects the maximum limit supported by the device.
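
A minimal sketch of how a copy-capable driver could register its hardware
limit through the helper added in this patch (the function name and the
1 MiB value below are illustrative assumptions, not taken from any driver
in this series):

	/* during queue setup in a copy-capable driver (sketch) */
	static void example_set_copy_limits(struct request_queue *q)
	{
		/*
		 * Advertise up to 2048 sectors (1 MiB) per copy payload.
		 * This also seeds the copy_max_bytes software limit with
		 * the same value.
		 */
		blk_queue_max_copy_sectors_hw(q, 2048);
	}

Userspace can then enable offload by writing 1 to
/sys/block/<disk>/queue/copy_offload, or lower the software limit via
copy_max_bytes.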

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 Documentation/ABI/stable/sysfs-block | 36 ++++++++++++++++
 block/blk-settings.c                 | 24 +++++++++++
 block/blk-sysfs.c                    | 64 ++++++++++++++++++++++++++++
 include/linux/blkdev.h               | 12 ++++++
 include/uapi/linux/fs.h              |  3 ++
 5 files changed, 139 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index cd14ecb3c9a5..e0c9be009706 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,42 @@ Description:
 		last zone of the device which may be smaller.
 
 
+What:		/sys/block/<disk>/queue/copy_offload
+Date:		November 2022
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RW] When read, this file shows whether offloading copy to
+		device is enabled (1) or disabled (0). Writing '0' to this
+		file will disable offloading copies for this device.
+		Writing any '1' value will enable this feature. If device
+		does not support offloading, then writing 1, will result in
+		error.
+
+
+What:		/sys/block/<disk>/queue/copy_max_bytes
+Date:		November 2022
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RW] While 'copy_max_bytes_hw' is the hardware limit for the
+		device, 'copy_max_bytes' setting is the software limit.
+		Setting this value lower will make Linux issue smaller size
+		copies from block layer.
+
+
+What:		/sys/block/<disk>/queue/copy_max_bytes_hw
+Date:		November 2022
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RO] Devices that support offloading copy functionality may have
+		internal limits on the number of bytes that can be offloaded
+		in a single operation. The `copy_max_bytes_hw`
+		parameter is set by the device driver to the maximum number of
+		bytes that can be copied in a single operation. Copy
+		requests issued to the device must not exceed this limit.
+		A value of 0 means that the device does not
+		support copy offload.
+
+
 What:		/sys/block/<disk>/queue/crypto/
 Date:		February 2022
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0477c4d527fe..ca6f15a70fdc 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -58,6 +58,8 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->zoned = BLK_ZONED_NONE;
 	lim->zone_write_granularity = 0;
 	lim->dma_alignment = 511;
+	lim->max_copy_sectors_hw = 0;
+	lim->max_copy_sectors = 0;
 }
 
 /**
@@ -81,6 +83,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
 	lim->max_zone_append_sectors = UINT_MAX;
+	lim->max_copy_sectors_hw = ULONG_MAX;
+	lim->max_copy_sectors = ULONG_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -177,6 +181,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+		unsigned int max_copy_sectors)
+{
+	if (max_copy_sectors >= MAX_COPY_TOTAL_LENGTH)
+		max_copy_sectors = MAX_COPY_TOTAL_LENGTH;
+
+	q->limits.max_copy_sectors_hw = max_copy_sectors;
+	q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -572,6 +592,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_segment_size = min_not_zero(t->max_segment_size,
 					   b->max_segment_size);
 
+	t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+	t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
+						b->max_copy_sectors_hw);
+
 	t->misaligned |= b->misaligned;
 
 	alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 93d9e9c9a6ea..82a28a6c2e8a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -212,6 +212,63 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
 	return queue_var_show(0, page);
 }
 
+static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(blk_queue_copy(q), page);
+}
+
+static ssize_t queue_copy_offload_store(struct request_queue *q,
+				       const char *page, size_t count)
+{
+	s64 copy_offload;
+	ssize_t ret = queue_var_store64(&copy_offload, page);
+
+	if (ret < 0)
+		return ret;
+
+	if (copy_offload && !q->limits.max_copy_sectors_hw)
+		return -EINVAL;
+
+	if (copy_offload)
+		blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+	else
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+
+	return count;
+}
+
+static ssize_t queue_copy_max_hw_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)
+			q->limits.max_copy_sectors_hw << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)
+			q->limits.max_copy_sectors << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_store(struct request_queue *q,
+				       const char *page, size_t count)
+{
+	s64 max_copy;
+	ssize_t ret = queue_var_store64(&max_copy, page);
+
+	if (ret < 0)
+		return ret;
+
+	if (max_copy & (queue_logical_block_size(q) - 1))
+		return -EINVAL;
+
+	max_copy >>= SECTOR_SHIFT;
+	if (max_copy > q->limits.max_copy_sectors_hw)
+		max_copy = q->limits.max_copy_sectors_hw;
+
+	q->limits.max_copy_sectors = max_copy;
+	return count;
+}
+
 static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(0, page);
@@ -604,6 +661,10 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
 QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
 QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
 
+QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
+QUEUE_RO_ENTRY(queue_copy_max_hw, "copy_max_bytes_hw");
+QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
+
 QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
 QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
 QUEUE_RW_ENTRY(queue_poll, "io_poll");
@@ -651,6 +712,9 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_max_entry.attr,
 	&queue_discard_max_hw_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
+	&queue_copy_offload_entry.attr,
+	&queue_copy_max_hw_entry.attr,
+	&queue_copy_max_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
 	&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 43d4e073b111..807ffb5f715d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -303,6 +303,9 @@ struct queue_limits {
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
 
+	unsigned long		max_copy_sectors_hw;
+	unsigned long		max_copy_sectors;
+
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;
 	unsigned short		max_discard_segments;
@@ -571,6 +574,7 @@ struct request_queue {
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED     30	/* single queue style io dispatch */
 #define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skip the queue*/
+#define QUEUE_FLAG_COPY		32	/* supports copy offload */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |		\
 				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
@@ -591,6 +595,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 	test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags)
 #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
 #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
+#define blk_queue_copy(q)	test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
 #define blk_queue_zone_resetall(q)	\
 	test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
@@ -911,6 +916,8 @@ extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
 extern void blk_queue_max_segments(struct request_queue *, unsigned short);
 extern void blk_queue_max_discard_segments(struct request_queue *,
 		unsigned short);
+extern void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+		unsigned int max_copy_sectors);
 void blk_queue_max_secure_erase_sectors(struct request_queue *q,
 		unsigned int max_sectors);
 extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
@@ -1229,6 +1236,11 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
 	return bdev_get_queue(bdev)->limits.discard_granularity;
 }
 
+static inline unsigned int bdev_max_copy_sectors(struct block_device *bdev)
+{
+	return bdev_get_queue(bdev)->limits.max_copy_sectors;
+}
+
 static inline unsigned int
 bdev_max_secure_erase_sectors(struct block_device *bdev)
 {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..b3ad173f619c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,9 @@ struct fstrim_range {
 	__u64 minlen;
 };
 
+/* maximum total copy length */
+#define MAX_COPY_TOTAL_LENGTH	(1 << 27)
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
-- 
2.35.1.500.gb896f729e2



* [PATCH v6 2/9] block: Add copy offload support infrastructure
From: Nitesh Shetty @ 2023-01-12 11:58 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev,
	Nitesh Shetty, linux-block, linux-kernel, linux-nvme,
	linux-fsdevel

Introduce blkdev_issue_copy, which takes source and destination bdevs
and an array of (source, destination, copy length) tuples.
Introduce the REQ_COPY copy offload operation flag. A read-write bio
pair is created with a token as payload and submitted to the device in
order. The read request populates the token with source-specific
information, which is then passed along with the write request.
This design is courtesy of Mikulas Patocka's token-based copy.

Larger copies are split based on the max_copy_sectors limit.
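
A minimal sketch of a synchronous in-kernel caller (the bdev pointers,
offsets and length below are assumptions; dm-kcopyd, later in this
series, is the real in-kernel user):

	/* copy 1 MiB from offset 0 of src_bdev to offset 0 of dst_bdev */
	static int example_copy(struct block_device *src_bdev,
				struct block_device *dst_bdev)
	{
		struct range_entry range = {
			.src = 0,	/* byte addressed, LBA aligned */
			.dst = 0,
			.len = 1 << 20,
		};

		/* NULL end_io/private makes the operation synchronous */
		return blkdev_issue_copy(src_bdev, dst_bdev, &range, 1,
					 NULL, NULL, GFP_KERNEL);
	}

On return, range.comp_len holds the number of bytes actually copied.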

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-lib.c           | 358 ++++++++++++++++++++++++++++++++++++++
 block/blk.h               |   2 +
 include/linux/blk_types.h |  44 +++++
 include/linux/blkdev.h    |   3 +
 include/uapi/linux/fs.h   |  15 ++
 5 files changed, 422 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..2ce3c872ca49 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,364 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * bio_copy_*_write_end_io. IO errors are propagated through cio->io_error.
+ */
+static int cio_await_completion(struct cio *cio)
+{
+	int ret = 0;
+
+	atomic_dec(&cio->refcount);
+
+	if (cio->endio)
+		return 0;
+
+	if (atomic_read(&cio->refcount)) {
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		blk_io_schedule();
+	}
+
+	ret = cio->io_err;
+	kfree(cio);
+
+	return ret;
+}
+
+static void blk_copy_offload_write_end_io(struct bio *bio)
+{
+	struct copy_ctx *ctx = bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+
+	if (bio->bi_status) {
+		cio->io_err = blk_status_to_errno(bio->bi_status);
+		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+			cio->ranges[ri].dst;
+		cio->ranges[ri].comp_len = min_t(sector_t, clen,
+				cio->ranges[ri].comp_len);
+	}
+	__free_page(bio->bi_io_vec[0].bv_page);
+	bio_put(bio);
+
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_offload_read_end_io(struct bio *read_bio)
+{
+	struct copy_ctx *ctx = read_bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+	unsigned long flags;
+
+	if (read_bio->bi_status) {
+		cio->io_err = blk_status_to_errno(read_bio->bi_status);
+		goto err_rw_bio;
+	}
+
+	/* For zoned device, we check if completed bio is first entry in linked
+	 * list,
+	 * if yes, we start the worker to submit write bios.
+	 * if not, then we just update status of bio in ctx,
+	 * once the worker gets scheduled, it will submit writes for all
+	 * the consecutive REQ_COPY_READ_COMPLETE bios.
+	 */
+	if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
+		spin_lock_irqsave(&cio->list_lock, flags);
+		ctx->status = REQ_COPY_READ_COMPLETE;
+		if (ctx == list_first_entry(&cio->list,
+					struct copy_ctx, list)) {
+			spin_unlock_irqrestore(&cio->list_lock, flags);
+			schedule_work(&ctx->dispatch_work);
+			goto free_read_bio;
+		}
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+	} else
+		schedule_work(&ctx->dispatch_work);
+
+free_read_bio:
+	bio_put(read_bio);
+
+	return;
+
+err_rw_bio:
+	clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+					cio->ranges[ri].src;
+	cio->ranges[ri].comp_len = min_t(sector_t, clen,
+					cio->ranges[ri].comp_len);
+	__free_page(read_bio->bi_io_vec[0].bv_page);
+	bio_put(ctx->write_bio);
+	bio_put(read_bio);
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_dispatch_work_fn(struct work_struct *work)
+{
+	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+			dispatch_work);
+
+	submit_bio(ctx->write_bio);
+}
+
+static void blk_zoned_copy_dispatch_work_fn(struct work_struct *work)
+{
+	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+			dispatch_work);
+	struct cio *cio = ctx->cio;
+	unsigned long flags = 0;
+
+	atomic_inc(&cio->refcount);
+	spin_lock_irqsave(&cio->list_lock, flags);
+
+	while (!list_empty(&cio->list)) {
+		ctx = list_first_entry(&cio->list, struct copy_ctx, list);
+
+		if (ctx->status == REQ_COPY_READ_PROGRESS)
+			break;
+
+		atomic_inc(&ctx->refcount);
+		ctx->status = REQ_COPY_WRITE_PROGRESS;
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+		submit_bio(ctx->write_bio);
+		spin_lock_irqsave(&cio->list_lock, flags);
+
+		list_del(&ctx->list);
+		if (atomic_dec_and_test(&ctx->refcount))
+			kfree(ctx);
+	}
+
+	spin_unlock_irqrestore(&cio->list_lock, flags);
+	if (atomic_dec_and_test(&cio->refcount))
+		blk_wake_io_task(cio->waiter);
+}
+
+/*
+ * blk_copy_offload	- Use device's native copy offload feature.
+ * We perform the copy operation by sending 2 bios.
+ * 1. First we send a read bio with REQ_COPY flag along with a token and source
+ * and length. Once read bio reaches driver layer, device driver adds all the
+ * source info to token and does a fake completion.
+ * 2. Once read operation completes, we issue write with REQ_COPY flag with same
+ * token. In driver layer, token info is used to form a copy offload command.
+ *
+ * For conventional devices we submit write bio independently once read
+ * completes. For zoned devices, reads can complete out of order, so we
+ * maintain a linked list and submit writes in the order reads are submitted.
+ */
+static int blk_copy_offload(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct cio *cio;
+	struct copy_ctx *ctx;
+	struct bio *read_bio, *write_bio;
+	struct page *token;
+	sector_t src_blk, copy_len, dst_blk;
+	sector_t rem, max_copy_len;
+	int ri = 0, ret = 0;
+	unsigned long flags;
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	cio->ranges = ranges;
+	atomic_set(&cio->refcount, 1);
+	cio->waiter = current;
+	cio->endio = end_io;
+	cio->private = private;
+	if (bdev_is_zoned(dst_bdev)) {
+		INIT_LIST_HEAD(&cio->list);
+		spin_lock_init(&cio->list_lock);
+	}
+
+	max_copy_len = min(bdev_max_copy_sectors(src_bdev),
+			bdev_max_copy_sectors(dst_bdev)) << SECTOR_SHIFT;
+
+	for (ri = 0; ri < nr; ri++) {
+		cio->ranges[ri].comp_len = ranges[ri].len;
+		src_blk = ranges[ri].src;
+		dst_blk = ranges[ri].dst;
+		for (rem = ranges[ri].len; rem > 0; rem -= copy_len) {
+			copy_len = min(rem, max_copy_len);
+
+			token = alloc_page(gfp_mask);
+			if (unlikely(!token)) {
+				ret = -ENOMEM;
+				goto err_token;
+			}
+
+			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+			if (!ctx) {
+				ret = -ENOMEM;
+				goto err_ctx;
+			}
+			read_bio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY
+					| REQ_SYNC | REQ_NOMERGE, gfp_mask);
+			if (!read_bio) {
+				ret = -ENOMEM;
+				goto err_read_bio;
+			}
+			write_bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE
+					| REQ_COPY | REQ_SYNC | REQ_NOMERGE,
+					gfp_mask);
+			if (!write_bio) {
+				cio->io_err = -ENOMEM;
+				goto err_write_bio;
+			}
+
+			ctx->cio = cio;
+			ctx->range_idx = ri;
+			ctx->write_bio = write_bio;
+			atomic_set(&ctx->refcount, 1);
+
+			if (bdev_is_zoned(dst_bdev)) {
+				INIT_WORK(&ctx->dispatch_work,
+					blk_zoned_copy_dispatch_work_fn);
+				INIT_LIST_HEAD(&ctx->list);
+				spin_lock_irqsave(&cio->list_lock, flags);
+				ctx->status = REQ_COPY_READ_PROGRESS;
+				list_add_tail(&ctx->list, &cio->list);
+				spin_unlock_irqrestore(&cio->list_lock, flags);
+			} else
+				INIT_WORK(&ctx->dispatch_work,
+					blk_copy_dispatch_work_fn);
+
+			__bio_add_page(read_bio, token, PAGE_SIZE, 0);
+			read_bio->bi_iter.bi_size = copy_len;
+			read_bio->bi_iter.bi_sector = src_blk >> SECTOR_SHIFT;
+			read_bio->bi_end_io = blk_copy_offload_read_end_io;
+			read_bio->bi_private = ctx;
+
+			__bio_add_page(write_bio, token, PAGE_SIZE, 0);
+			write_bio->bi_iter.bi_size = copy_len;
+			write_bio->bi_end_io = blk_copy_offload_write_end_io;
+			write_bio->bi_iter.bi_sector = dst_blk >> SECTOR_SHIFT;
+			write_bio->bi_private = ctx;
+
+			atomic_inc(&cio->refcount);
+			submit_bio(read_bio);
+			src_blk += copy_len;
+			dst_blk += copy_len;
+		}
+	}
+
+	/* Wait for completion of all IO's*/
+	return cio_await_completion(cio);
+
+err_write_bio:
+	bio_put(read_bio);
+err_read_bio:
+	kfree(ctx);
+err_ctx:
+	__free_page(token);
+err_token:
+	ranges[ri].comp_len = min_t(sector_t,
+			ranges[ri].comp_len, (ranges[ri].len - rem));
+
+	cio->io_err = ret;
+	return cio_await_completion(cio);
+}
+
+static inline int blk_copy_sanity_check(struct block_device *src_bdev,
+	struct block_device *dst_bdev, struct range_entry *ranges, int nr)
+{
+	unsigned int align_mask = max(bdev_logical_block_size(dst_bdev),
+					bdev_logical_block_size(src_bdev)) - 1;
+	sector_t len = 0;
+	int i;
+
+	if (!nr)
+		return -EINVAL;
+
+	if (nr >= MAX_COPY_NR_RANGE)
+		return -EINVAL;
+
+	if (bdev_read_only(dst_bdev))
+		return -EPERM;
+
+	for (i = 0; i < nr; i++) {
+		if (!ranges[i].len)
+			return -EINVAL;
+
+		len += ranges[i].len;
+		if ((ranges[i].dst & align_mask) ||
+				(ranges[i].src & align_mask) ||
+				(ranges[i].len & align_mask))
+			return -EINVAL;
+		ranges[i].comp_len = 0;
+	}
+
+	if (len && len >= MAX_COPY_TOTAL_LENGTH)
+		return -EINVAL;
+
+	return 0;
+}
+
+static inline bool blk_check_copy_offload(struct request_queue *src_q,
+		struct request_queue *dst_q)
+{
+	return blk_queue_copy(dst_q) && blk_queue_copy(src_q);
+}
+
+/*
+ * blkdev_issue_copy - queue a copy
+ * @src_bdev:	source block device
+ * @dst_bdev:	destination block device
+ * @ranges:	array of source/dest/len,
+ *		ranges are expected to be allocated/freed by caller
+ * @nr:		number of source ranges to copy
+ * @end_io:	end_io function to be called on completion of copy operation,
+ *		for synchronous operation this should be NULL
+ * @private:	end_io function will be called with this private data, should be
+ *		NULL, if operation is synchronous in nature
+ * @gfp_mask:   memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *	Copy source ranges from source block device to destination block
+ *	device. length of a source range cannot be zero. Max total length of
+ *	copy is limited to MAX_COPY_TOTAL_LENGTH and also maximum number of
+ *	entries is limited to MAX_COPY_NR_RANGE
+ */
+int blkdev_issue_copy(struct block_device *src_bdev,
+	struct block_device *dst_bdev, struct range_entry *ranges, int nr,
+	cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct request_queue *src_q = bdev_get_queue(src_bdev);
+	struct request_queue *dst_q = bdev_get_queue(dst_bdev);
+	int ret = -EINVAL;
+
+	ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
+	if (ret)
+		return ret;
+
+	if (blk_check_copy_offload(src_q, dst_q))
+		ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
+				end_io, private, gfp_mask);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_issue_copy);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..6d9924a7d559 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -304,6 +304,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
 		break;
 	}
 
+	if (unlikely(op_is_copy(bio->bi_opf)))
+		return false;
 	/*
 	 * All drivers must accept single-segments bios that are <= PAGE_SIZE.
 	 * This is a quick and dirty check that relies on the fact that
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..de1638c87ecf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -422,6 +422,7 @@ enum req_flag_bits {
 	 */
 	/* for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
+	__REQ_COPY,		/* copy request */
 
 	__REQ_NR_BITS,		/* stops here */
 };
@@ -451,6 +452,7 @@ enum req_flag_bits {
 
 #define REQ_DRV		(__force blk_opf_t)(1ULL << __REQ_DRV)
 #define REQ_SWAP	(__force blk_opf_t)(1ULL << __REQ_SWAP)
+#define REQ_COPY	((__force blk_opf_t)(1ULL << __REQ_COPY))
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
@@ -477,6 +479,11 @@ static inline bool op_is_write(blk_opf_t op)
 	return !!(op & (__force blk_opf_t)1);
 }
 
+static inline bool op_is_copy(blk_opf_t op)
+{
+	return (op & REQ_COPY);
+}
+
 /*
  * Check if the bio or request is one that needs special treatment in the
  * flush state machine.
@@ -536,4 +543,41 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+typedef void (cio_iodone_t)(void *private, int status);
+
+struct cio {
+	struct range_entry *ranges;
+	struct task_struct *waiter;     /* waiting task (NULL if none) */
+	atomic_t refcount;
+	int io_err;
+	cio_iodone_t *endio;		/* applicable for async operation */
+	void *private;			/* applicable for async operation */
+
+	/* For zoned device we maintain a linked list of IO submissions.
+	 * This is to make sure we maintain the order of submissions.
+	 * Otherwise some reads completing out of order, will submit writes not
+	 * aligned with zone write pointer.
+	 */
+	struct list_head list;
+	spinlock_t list_lock;
+};
+
+enum copy_io_status {
+	REQ_COPY_READ_PROGRESS,
+	REQ_COPY_READ_COMPLETE,
+	REQ_COPY_WRITE_PROGRESS,
+};
+
+struct copy_ctx {
+	struct cio *cio;
+	struct work_struct dispatch_work;
+	struct bio *write_bio;
+	atomic_t refcount;
+	int range_idx;			/* used in error/partial completion */
+
+	/* For zoned device linked list is maintained. Along with state of IO */
+	struct list_head list;
+	enum copy_io_status status;
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 807ffb5f715d..48e9160b7195 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1063,6 +1063,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
 int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp);
+int blkdev_issue_copy(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b3ad173f619c..9248b6d259de 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -67,6 +67,21 @@ struct fstrim_range {
 /* maximum total copy length */
 #define MAX_COPY_TOTAL_LENGTH	(1 << 27)
 
+/* Maximum no of entries supported */
+#define MAX_COPY_NR_RANGE	(1 << 12)
+
+/* range entry for copy offload, all fields should be byte addressed */
+struct range_entry {
+	__u64 src;		/* source to be copied */
+	__u64 dst;		/* destination */
+	__u64 len;		/* length in bytes to be copied */
+
+	/* length of data copy actually completed. This will be filled by
+	 * kernel, once copy completes
+	 */
+	__u64 comp_len;
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dm-devel] [PATCH v6 2/9] block: Add copy offload support infrastructure
@ 2023-01-12 11:58       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:58 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: p.raghav, joshi.k, gost.dev, anuj20.g, linux-kernel, linux-nvme,
	Nitesh Shetty, linux-block, linux-fsdevel, nitheshshetty

Introduce blkdev_issue_copy which supports source and destination bdevs,
and an array of (source, destination and copy length) tuples.
Introduce REQ_COPY copy offload operation flag. Create a read-write
bio pair with a token as payload and submitted to the device in order.
Read request populates token with source specific information which
is then passed with write request.
This design is courtesy Mikulas Patocka's token based copy

Larger copy will be divided, based on max_copy_sectors limit.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-lib.c           | 358 ++++++++++++++++++++++++++++++++++++++
 block/blk.h               |   2 +
 include/linux/blk_types.h |  44 +++++
 include/linux/blkdev.h    |   3 +
 include/uapi/linux/fs.h   |  15 ++
 5 files changed, 422 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..2ce3c872ca49 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,364 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * bio_copy_*_write_end_io. IO errors are propagated through cio->io_error.
+ */
+static int cio_await_completion(struct cio *cio)
+{
+	int ret = 0;
+
+	atomic_dec(&cio->refcount);
+
+	if (cio->endio)
+		return 0;
+
+	if (atomic_read(&cio->refcount)) {
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		blk_io_schedule();
+	}
+
+	ret = cio->io_err;
+	kfree(cio);
+
+	return ret;
+}
+
+static void blk_copy_offload_write_end_io(struct bio *bio)
+{
+	struct copy_ctx *ctx = bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+
+	if (bio->bi_status) {
+		cio->io_err = blk_status_to_errno(bio->bi_status);
+		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+			cio->ranges[ri].dst;
+		cio->ranges[ri].comp_len = min_t(sector_t, clen,
+				cio->ranges[ri].comp_len);
+	}
+	__free_page(bio->bi_io_vec[0].bv_page);
+	bio_put(bio);
+
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_offload_read_end_io(struct bio *read_bio)
+{
+	struct copy_ctx *ctx = read_bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+	unsigned long flags;
+
+	if (read_bio->bi_status) {
+		cio->io_err = blk_status_to_errno(read_bio->bi_status);
+		goto err_rw_bio;
+	}
+
+	/* For zoned devices, check whether the completed bio is the first
+	 * entry in the linked list.
+	 * If it is, start the worker to submit write bios.
+	 * If not, just update the status of the bio in ctx; once the worker
+	 * gets scheduled, it will submit writes for all the consecutive
+	 * REQ_COPY_READ_COMPLETE bios.
+	 */
+	if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
+		spin_lock_irqsave(&cio->list_lock, flags);
+		ctx->status = REQ_COPY_READ_COMPLETE;
+		if (ctx == list_first_entry(&cio->list,
+					struct copy_ctx, list)) {
+			spin_unlock_irqrestore(&cio->list_lock, flags);
+			schedule_work(&ctx->dispatch_work);
+			goto free_read_bio;
+		}
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+	} else
+		schedule_work(&ctx->dispatch_work);
+
+free_read_bio:
+	bio_put(read_bio);
+
+	return;
+
+err_rw_bio:
+	clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+					cio->ranges[ri].src;
+	cio->ranges[ri].comp_len = min_t(sector_t, clen,
+					cio->ranges[ri].comp_len);
+	__free_page(read_bio->bi_io_vec[0].bv_page);
+	bio_put(ctx->write_bio);
+	bio_put(read_bio);
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_dispatch_work_fn(struct work_struct *work)
+{
+	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+			dispatch_work);
+
+	submit_bio(ctx->write_bio);
+}
+
+static void blk_zoned_copy_dispatch_work_fn(struct work_struct *work)
+{
+	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+			dispatch_work);
+	struct cio *cio = ctx->cio;
+	unsigned long flags = 0;
+
+	atomic_inc(&cio->refcount);
+	spin_lock_irqsave(&cio->list_lock, flags);
+
+	while (!list_empty(&cio->list)) {
+		ctx = list_first_entry(&cio->list, struct copy_ctx, list);
+
+		if (ctx->status == REQ_COPY_READ_PROGRESS)
+			break;
+
+		atomic_inc(&ctx->refcount);
+		ctx->status = REQ_COPY_WRITE_PROGRESS;
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+		submit_bio(ctx->write_bio);
+		spin_lock_irqsave(&cio->list_lock, flags);
+
+		list_del(&ctx->list);
+		if (atomic_dec_and_test(&ctx->refcount))
+			kfree(ctx);
+	}
+
+	spin_unlock_irqrestore(&cio->list_lock, flags);
+	if (atomic_dec_and_test(&cio->refcount))
+		blk_wake_io_task(cio->waiter);
+}
+
+/*
+ * blk_copy_offload	- Use the device's native copy offload feature.
+ * The copy is performed by sending two bios:
+ * 1. First we send a read bio with the REQ_COPY flag along with a token,
+ * source and length. Once the read bio reaches the driver layer, the device
+ * driver adds all source info to the token and does a fake completion.
+ * 2. Once the read operation completes, we issue a write with the REQ_COPY
+ * flag and the same token. In the driver layer, the token info is used to
+ * form a copy offload command.
+ *
+ * For conventional devices we submit the write bio independently once the
+ * read completes. For zoned devices, reads can complete out of order, so we
+ * maintain a linked list and submit writes in the order the reads were issued.
+ */
+static int blk_copy_offload(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct cio *cio;
+	struct copy_ctx *ctx;
+	struct bio *read_bio, *write_bio;
+	struct page *token;
+	sector_t src_blk, copy_len, dst_blk;
+	sector_t rem, max_copy_len;
+	int ri = 0, ret = 0;
+	unsigned long flags;
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	cio->ranges = ranges;
+	atomic_set(&cio->refcount, 1);
+	cio->waiter = current;
+	cio->endio = end_io;
+	cio->private = private;
+	if (bdev_is_zoned(dst_bdev)) {
+		INIT_LIST_HEAD(&cio->list);
+		spin_lock_init(&cio->list_lock);
+	}
+
+	max_copy_len = min(bdev_max_copy_sectors(src_bdev),
+			bdev_max_copy_sectors(dst_bdev)) << SECTOR_SHIFT;
+
+	for (ri = 0; ri < nr; ri++) {
+		cio->ranges[ri].comp_len = ranges[ri].len;
+		src_blk = ranges[ri].src;
+		dst_blk = ranges[ri].dst;
+		for (rem = ranges[ri].len; rem > 0; rem -= copy_len) {
+			copy_len = min(rem, max_copy_len);
+
+			token = alloc_page(gfp_mask);
+			if (unlikely(!token)) {
+				ret = -ENOMEM;
+				goto err_token;
+			}
+
+			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+			if (!ctx) {
+				ret = -ENOMEM;
+				goto err_ctx;
+			}
+			read_bio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY
+					| REQ_SYNC | REQ_NOMERGE, gfp_mask);
+			if (!read_bio) {
+				ret = -ENOMEM;
+				goto err_read_bio;
+			}
+			write_bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE
+					| REQ_COPY | REQ_SYNC | REQ_NOMERGE,
+					gfp_mask);
+			if (!write_bio) {
+				cio->io_err = -ENOMEM;
+				goto err_write_bio;
+			}
+
+			ctx->cio = cio;
+			ctx->range_idx = ri;
+			ctx->write_bio = write_bio;
+			atomic_set(&ctx->refcount, 1);
+
+			if (bdev_is_zoned(dst_bdev)) {
+				INIT_WORK(&ctx->dispatch_work,
+					blk_zoned_copy_dispatch_work_fn);
+				INIT_LIST_HEAD(&ctx->list);
+				spin_lock_irqsave(&cio->list_lock, flags);
+				ctx->status = REQ_COPY_READ_PROGRESS;
+				list_add_tail(&ctx->list, &cio->list);
+				spin_unlock_irqrestore(&cio->list_lock, flags);
+			} else
+				INIT_WORK(&ctx->dispatch_work,
+					blk_copy_dispatch_work_fn);
+
+			__bio_add_page(read_bio, token, PAGE_SIZE, 0);
+			read_bio->bi_iter.bi_size = copy_len;
+			read_bio->bi_iter.bi_sector = src_blk >> SECTOR_SHIFT;
+			read_bio->bi_end_io = blk_copy_offload_read_end_io;
+			read_bio->bi_private = ctx;
+
+			__bio_add_page(write_bio, token, PAGE_SIZE, 0);
+			write_bio->bi_iter.bi_size = copy_len;
+			write_bio->bi_end_io = blk_copy_offload_write_end_io;
+			write_bio->bi_iter.bi_sector = dst_blk >> SECTOR_SHIFT;
+			write_bio->bi_private = ctx;
+
+			atomic_inc(&cio->refcount);
+			submit_bio(read_bio);
+			src_blk += copy_len;
+			dst_blk += copy_len;
+		}
+	}
+
+	/* Wait for completion of all IOs */
+	return cio_await_completion(cio);
+
+err_write_bio:
+	bio_put(read_bio);
+err_read_bio:
+	kfree(ctx);
+err_ctx:
+	__free_page(token);
+err_token:
+	ranges[ri].comp_len = min_t(sector_t,
+			ranges[ri].comp_len, (ranges[ri].len - rem));
+
+	cio->io_err = ret;
+	return cio_await_completion(cio);
+}
+
+static inline int blk_copy_sanity_check(struct block_device *src_bdev,
+	struct block_device *dst_bdev, struct range_entry *ranges, int nr)
+{
+	unsigned int align_mask = max(bdev_logical_block_size(dst_bdev),
+					bdev_logical_block_size(src_bdev)) - 1;
+	sector_t len = 0;
+	int i;
+
+	if (!nr)
+		return -EINVAL;
+
+	if (nr >= MAX_COPY_NR_RANGE)
+		return -EINVAL;
+
+	if (bdev_read_only(dst_bdev))
+		return -EPERM;
+
+	for (i = 0; i < nr; i++) {
+		if (!ranges[i].len)
+			return -EINVAL;
+
+		len += ranges[i].len;
+		if ((ranges[i].dst & align_mask) ||
+				(ranges[i].src & align_mask) ||
+				(ranges[i].len & align_mask))
+			return -EINVAL;
+		ranges[i].comp_len = 0;
+	}
+
+	if (len && len >= MAX_COPY_TOTAL_LENGTH)
+		return -EINVAL;
+
+	return 0;
+}
+
+static inline bool blk_check_copy_offload(struct request_queue *src_q,
+		struct request_queue *dst_q)
+{
+	return blk_queue_copy(dst_q) && blk_queue_copy(src_q);
+}
+
+/*
+ * blkdev_issue_copy - queue a copy
+ * @src_bdev:	source block device
+ * @dst_bdev:	destination block device
+ * @ranges:	array of source/dest/len,
+ *		ranges are expected to be allocated/freed by caller
+ * @nr:		number of source ranges to copy
+ * @end_io:	end_io function to be called on completion of copy operation,
+ *		for synchronous operation this should be NULL
+ * @private:	end_io function will be called with this private data, should be
+ *		NULL, if operation is synchronous in nature
+ * @gfp_mask:   memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *	Copy source ranges from the source block device to the destination
+ *	block device. The length of a source range cannot be zero. The total
+ *	copy length is limited to MAX_COPY_TOTAL_LENGTH and the number of
+ *	entries is limited to MAX_COPY_NR_RANGE.
+ */
+int blkdev_issue_copy(struct block_device *src_bdev,
+	struct block_device *dst_bdev, struct range_entry *ranges, int nr,
+	cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct request_queue *src_q = bdev_get_queue(src_bdev);
+	struct request_queue *dst_q = bdev_get_queue(dst_bdev);
+	int ret = -EINVAL;
+
+	ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
+	if (ret)
+		return ret;
+
+	if (blk_check_copy_offload(src_q, dst_q))
+		ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
+				end_io, private, gfp_mask);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_issue_copy);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..6d9924a7d559 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -304,6 +304,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
 		break;
 	}
 
+	if (unlikely(op_is_copy(bio->bi_opf)))
+		return false;
 	/*
 	 * All drivers must accept single-segments bios that are <= PAGE_SIZE.
 	 * This is a quick and dirty check that relies on the fact that
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..de1638c87ecf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -422,6 +422,7 @@ enum req_flag_bits {
 	 */
 	/* for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
+	__REQ_COPY,		/* copy request */
 
 	__REQ_NR_BITS,		/* stops here */
 };
@@ -451,6 +452,7 @@ enum req_flag_bits {
 
 #define REQ_DRV		(__force blk_opf_t)(1ULL << __REQ_DRV)
 #define REQ_SWAP	(__force blk_opf_t)(1ULL << __REQ_SWAP)
+#define REQ_COPY	((__force blk_opf_t)(1ULL << __REQ_COPY))
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
@@ -477,6 +479,11 @@ static inline bool op_is_write(blk_opf_t op)
 	return !!(op & (__force blk_opf_t)1);
 }
 
+static inline bool op_is_copy(blk_opf_t op)
+{
+	return (op & REQ_COPY);
+}
+
 /*
  * Check if the bio or request is one that needs special treatment in the
  * flush state machine.
@@ -536,4 +543,41 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+typedef void (cio_iodone_t)(void *private, int status);
+
+struct cio {
+	struct range_entry *ranges;
+	struct task_struct *waiter;     /* waiting task (NULL if none) */
+	atomic_t refcount;
+	int io_err;
+	cio_iodone_t *endio;		/* applicable for async operation */
+	void *private;			/* applicable for async operation */
+
+	/* For zoned devices we maintain a linked list of IO submissions.
+	 * This ensures the order of submissions is preserved; otherwise,
+	 * reads completing out of order would submit writes that are not
+	 * aligned with the zone write pointer.
+	 */
+	struct list_head list;
+	spinlock_t list_lock;
+};
+
+enum copy_io_status {
+	REQ_COPY_READ_PROGRESS,
+	REQ_COPY_READ_COMPLETE,
+	REQ_COPY_WRITE_PROGRESS,
+};
+
+struct copy_ctx {
+	struct cio *cio;
+	struct work_struct dispatch_work;
+	struct bio *write_bio;
+	atomic_t refcount;
+	int range_idx;			/* used in error/partial completion */
+
+	/* For zoned devices a linked list is maintained, along with the IO state */
+	struct list_head list;
+	enum copy_io_status status;
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 807ffb5f715d..48e9160b7195 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1063,6 +1063,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
 int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp);
+int blkdev_issue_copy(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b3ad173f619c..9248b6d259de 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -67,6 +67,21 @@ struct fstrim_range {
 /* maximum total copy length */
 #define MAX_COPY_TOTAL_LENGTH	(1 << 27)
 
+/* Maximum no of entries supported */
+#define MAX_COPY_NR_RANGE	(1 << 12)
+
+/* range entry for copy offload, all fields should be byte addressed */
+struct range_entry {
+	__u64 src;		/* source to be copied */
+	__u64 dst;		/* destination */
+	__u64 len;		/* length in bytes to be copied */
+
+	/* length of the data copy actually completed. This will be filled by
+	 * the kernel once the copy completes.
+	 */
+	__u64 comp_len;
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 3/9] block: add emulation for copy
       [not found]   ` <CGME20230112120054epcas5p3ec5887c4e1de59f7529dafca1cd6aa65@epcas5p3.samsung.com>
@ 2023-01-12 11:58       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:58 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev,
	Nitesh Shetty, Vincent Fu, linux-block, linux-kernel, linux-nvme,
	linux-fsdevel

For devices which do not support copy, copy emulation is added.
Copy emulation is implemented by reading from the source ranges
into memory and writing to the corresponding destinations asynchronously.
For zoned devices we maintain a linked list of read submissions and
submit the corresponding writes in the same order.
Emulation is also used if copy offload fails or completes partially.
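
As a sketch of the caller-visible behaviour (a hypothetical snippet, not
part of this patch, assuming a synchronous call): whether the copy was
offloaded or emulated, the block layer fills comp_len for each range, so
partial completion can be detected after the call returns:

	ret = blkdev_issue_copy(src_bdev, dst_bdev, ranges, nr,
				NULL, NULL, GFP_KERNEL);
	for (i = 0; i < nr; i++) {
		if (ranges[i].comp_len < ranges[i].len)
			pr_warn("range %d: copied %llu of %llu bytes\n",
				i, ranges[i].comp_len, ranges[i].len);
	}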

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-lib.c        | 241 ++++++++++++++++++++++++++++++++++++++++-
 block/blk-map.c        |   4 +-
 include/linux/blkdev.h |   3 +
 3 files changed, 245 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2ce3c872ca49..43b1d0ef5732 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -428,6 +428,239 @@ static inline int blk_copy_sanity_check(struct block_device *src_bdev,
 	return 0;
 }
 
+static void *blk_alloc_buf(sector_t req_size, sector_t *alloc_size,
+		gfp_t gfp_mask)
+{
+	int min_size = PAGE_SIZE;
+	void *buf;
+
+	while (req_size >= min_size) {
+		buf = kvmalloc(req_size, gfp_mask);
+		if (buf) {
+			*alloc_size = req_size;
+			return buf;
+		}
+		/* retry half the requested size */
+		req_size >>= 1;
+	}
+
+	return NULL;
+}
+
+static void blk_copy_emulate_write_end_io(struct bio *bio)
+{
+	struct copy_ctx *ctx = bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+
+	if (bio->bi_status) {
+		cio->io_err = blk_status_to_errno(bio->bi_status);
+		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+			cio->ranges[ri].dst;
+		cio->ranges[ri].comp_len = min_t(sector_t, clen,
+				cio->ranges[ri].comp_len);
+	}
+	kvfree(page_address(bio->bi_io_vec[0].bv_page));
+	bio_map_kern_endio(bio);
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_emulate_read_end_io(struct bio *read_bio)
+{
+	struct copy_ctx *ctx = read_bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+	unsigned long flags;
+
+	if (read_bio->bi_status) {
+		cio->io_err = blk_status_to_errno(read_bio->bi_status);
+		goto err_rw_bio;
+	}
+
+	/* For zoned devices, check whether the completed bio is the first
+	 * entry in the linked list.
+	 * If it is, start the worker to submit write bios.
+	 * If not, just update the status of the bio in ctx; once the worker
+	 * gets scheduled, it will submit writes for all the consecutive
+	 * REQ_COPY_READ_COMPLETE bios.
+	 */
+	if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
+		spin_lock_irqsave(&cio->list_lock, flags);
+		ctx->status = REQ_COPY_READ_COMPLETE;
+		if (ctx == list_first_entry(&cio->list,
+					struct copy_ctx, list)) {
+			spin_unlock_irqrestore(&cio->list_lock, flags);
+			schedule_work(&ctx->dispatch_work);
+			goto free_read_bio;
+		}
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+	} else
+		schedule_work(&ctx->dispatch_work);
+
+free_read_bio:
+	kfree(read_bio);
+
+	return;
+
+err_rw_bio:
+	clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+					cio->ranges[ri].src;
+	cio->ranges[ri].comp_len = min_t(sector_t, clen,
+					cio->ranges[ri].comp_len);
+	__free_page(read_bio->bi_io_vec[0].bv_page);
+	bio_map_kern_endio(read_bio);
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+/*
+ * If the native copy offload feature is absent, this function emulates the
+ * copy by reading data from the source into a temporary buffer and writing
+ * it from the buffer to the destination device.
+ */
+static int blk_copy_emulate(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct request_queue *sq = bdev_get_queue(src_bdev);
+	struct request_queue *dq = bdev_get_queue(dst_bdev);
+	struct bio *read_bio, *write_bio;
+	void *buf = NULL;
+	struct copy_ctx *ctx;
+	struct cio *cio;
+	sector_t src, dst, offset, buf_len, req_len, rem = 0;
+	int ri = 0, ret = 0;
+	unsigned long flags;
+	sector_t max_src_hw_len = min_t(unsigned int, queue_max_hw_sectors(sq),
+			queue_max_segments(sq) << (PAGE_SHIFT - SECTOR_SHIFT))
+			<< SECTOR_SHIFT;
+	sector_t max_dst_hw_len = min_t(unsigned int, queue_max_hw_sectors(dq),
+			queue_max_segments(dq) << (PAGE_SHIFT - SECTOR_SHIFT))
+			<< SECTOR_SHIFT;
+	sector_t max_hw_len = min_t(unsigned int,
+			max_src_hw_len, max_dst_hw_len);
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	cio->ranges = ranges;
+	atomic_set(&cio->refcount, 1);
+	cio->waiter = current;
+	cio->endio = end_io;
+	cio->private = private;
+
+	if (bdev_is_zoned(dst_bdev)) {
+		INIT_LIST_HEAD(&cio->list);
+		spin_lock_init(&cio->list_lock);
+	}
+
+	for (ri = 0; ri < nr; ri++) {
+		offset = ranges[ri].comp_len;
+		src = ranges[ri].src + offset;
+		dst = ranges[ri].dst + offset;
+		/* If IO fails, we truncate comp_len */
+		ranges[ri].comp_len = ranges[ri].len;
+
+		for (rem = ranges[ri].len - offset; rem > 0; rem -= buf_len) {
+			req_len = min_t(int, max_hw_len, rem);
+
+			buf = blk_alloc_buf(req_len, &buf_len, gfp_mask);
+			if (!buf) {
+				ret = -ENOMEM;
+				goto err_alloc_buf;
+			}
+
+			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+			if (!ctx) {
+				ret = -ENOMEM;
+				goto err_ctx;
+			}
+
+			read_bio = bio_map_kern(sq, buf, buf_len, gfp_mask);
+			if (IS_ERR(read_bio)) {
+				ret = PTR_ERR(read_bio);
+				goto err_read_bio;
+			}
+
+			write_bio = bio_map_kern(dq, buf, buf_len, gfp_mask);
+			if (IS_ERR(write_bio)) {
+				ret = PTR_ERR(write_bio);
+				goto err_write_bio;
+			}
+
+			ctx->cio = cio;
+			ctx->range_idx = ri;
+			ctx->write_bio = write_bio;
+			atomic_set(&ctx->refcount, 1);
+
+			read_bio->bi_iter.bi_sector = src >> SECTOR_SHIFT;
+			read_bio->bi_iter.bi_size = buf_len;
+			read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+			bio_set_dev(read_bio, src_bdev);
+			read_bio->bi_end_io = blk_copy_emulate_read_end_io;
+			read_bio->bi_private = ctx;
+
+			write_bio->bi_iter.bi_size = buf_len;
+			write_bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
+			bio_set_dev(write_bio, dst_bdev);
+			write_bio->bi_end_io = blk_copy_emulate_write_end_io;
+			write_bio->bi_iter.bi_sector = dst >> SECTOR_SHIFT;
+			write_bio->bi_private = ctx;
+
+			if (bdev_is_zoned(dst_bdev)) {
+				INIT_WORK(&ctx->dispatch_work,
+					blk_zoned_copy_dispatch_work_fn);
+				INIT_LIST_HEAD(&ctx->list);
+				spin_lock_irqsave(&cio->list_lock, flags);
+				ctx->status = REQ_COPY_READ_PROGRESS;
+				list_add_tail(&ctx->list, &cio->list);
+				spin_unlock_irqrestore(&cio->list_lock, flags);
+			} else
+				INIT_WORK(&ctx->dispatch_work,
+					blk_copy_dispatch_work_fn);
+
+			atomic_inc(&cio->refcount);
+			submit_bio(read_bio);
+
+			src += buf_len;
+			dst += buf_len;
+		}
+	}
+
+	/* Wait for completion of all IOs */
+	return cio_await_completion(cio);
+
+err_write_bio:
+	bio_put(read_bio);
+err_read_bio:
+	kfree(ctx);
+err_ctx:
+	kvfree(buf);
+err_alloc_buf:
+	ranges[ri].comp_len -= min_t(sector_t,
+			ranges[ri].comp_len, (ranges[ri].len - rem));
+
+	cio->io_err = ret;
+	return cio_await_completion(cio);
+}
+
 static inline bool blk_check_copy_offload(struct request_queue *src_q,
 		struct request_queue *dst_q)
 {
@@ -460,15 +693,21 @@ int blkdev_issue_copy(struct block_device *src_bdev,
 	struct request_queue *src_q = bdev_get_queue(src_bdev);
 	struct request_queue *dst_q = bdev_get_queue(dst_bdev);
 	int ret = -EINVAL;
+	bool offload = false;
 
 	ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
 	if (ret)
 		return ret;
 
-	if (blk_check_copy_offload(src_q, dst_q))
+	offload = blk_check_copy_offload(src_q, dst_q);
+	if (offload)
 		ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
 				end_io, private, gfp_mask);
 
+	if (ret || !offload)
+		ret = blk_copy_emulate(src_bdev, dst_bdev, ranges, nr,
+				end_io, private, gfp_mask);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_issue_copy);
diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73..bcf8db2b75f1 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -363,7 +363,7 @@ static void bio_invalidate_vmalloc_pages(struct bio *bio)
 #endif
 }
 
-static void bio_map_kern_endio(struct bio *bio)
+void bio_map_kern_endio(struct bio *bio)
 {
 	bio_invalidate_vmalloc_pages(bio);
 	bio_uninit(bio);
@@ -380,7 +380,7 @@ static void bio_map_kern_endio(struct bio *bio)
  *	Map the kernel address into a bio suitable for io to a block
  *	device. Returns an error pointer in case of error.
  */
-static struct bio *bio_map_kern(struct request_queue *q, void *data,
+struct bio *bio_map_kern(struct request_queue *q, void *data,
 		unsigned int len, gfp_t gfp_mask)
 {
 	unsigned long kaddr = (unsigned long)data;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 48e9160b7195..c5621550e5b4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1066,6 +1066,9 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 int blkdev_issue_copy(struct block_device *src_bdev,
 		struct block_device *dst_bdev, struct range_entry *ranges,
 		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
+struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
+		gfp_t gfp_mask);
+void bio_map_kern_endio(struct bio *bio);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dm-devel] [PATCH v6 3/9] block: add emulation for copy
@ 2023-01-12 11:58       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:58 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: p.raghav, Vincent Fu, joshi.k, gost.dev, anuj20.g, linux-kernel,
	linux-nvme, Nitesh Shetty, linux-block, linux-fsdevel,
	nitheshshetty

For devices which do not support copy, copy emulation is added.
Copy emulation is implemented by reading from the source ranges
into memory and writing to the corresponding destinations asynchronously.
For zoned devices we maintain a linked list of read submissions and
submit the corresponding writes in the same order.
Emulation is also used if copy offload fails or completes partially.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-lib.c        | 241 ++++++++++++++++++++++++++++++++++++++++-
 block/blk-map.c        |   4 +-
 include/linux/blkdev.h |   3 +
 3 files changed, 245 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2ce3c872ca49..43b1d0ef5732 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -428,6 +428,239 @@ static inline int blk_copy_sanity_check(struct block_device *src_bdev,
 	return 0;
 }
 
+static void *blk_alloc_buf(sector_t req_size, sector_t *alloc_size,
+		gfp_t gfp_mask)
+{
+	int min_size = PAGE_SIZE;
+	void *buf;
+
+	while (req_size >= min_size) {
+		buf = kvmalloc(req_size, gfp_mask);
+		if (buf) {
+			*alloc_size = req_size;
+			return buf;
+		}
+		/* retry half the requested size */
+		req_size >>= 1;
+	}
+
+	return NULL;
+}
+
+static void blk_copy_emulate_write_end_io(struct bio *bio)
+{
+	struct copy_ctx *ctx = bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+
+	if (bio->bi_status) {
+		cio->io_err = blk_status_to_errno(bio->bi_status);
+		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+			cio->ranges[ri].dst;
+		cio->ranges[ri].comp_len = min_t(sector_t, clen,
+				cio->ranges[ri].comp_len);
+	}
+	kvfree(page_address(bio->bi_io_vec[0].bv_page));
+	bio_map_kern_endio(bio);
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+static void blk_copy_emulate_read_end_io(struct bio *read_bio)
+{
+	struct copy_ctx *ctx = read_bio->bi_private;
+	struct cio *cio = ctx->cio;
+	sector_t clen;
+	int ri = ctx->range_idx;
+	unsigned long flags;
+
+	if (read_bio->bi_status) {
+		cio->io_err = blk_status_to_errno(read_bio->bi_status);
+		goto err_rw_bio;
+	}
+
+	/* For zoned devices, check whether the completed bio is the first
+	 * entry in the linked list.
+	 * If it is, start the worker to submit write bios.
+	 * If not, just update the status of the bio in ctx; once the worker
+	 * gets scheduled, it will submit writes for all the consecutive
+	 * REQ_COPY_READ_COMPLETE bios.
+	 */
+	if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
+		spin_lock_irqsave(&cio->list_lock, flags);
+		ctx->status = REQ_COPY_READ_COMPLETE;
+		if (ctx == list_first_entry(&cio->list,
+					struct copy_ctx, list)) {
+			spin_unlock_irqrestore(&cio->list_lock, flags);
+			schedule_work(&ctx->dispatch_work);
+			goto free_read_bio;
+		}
+		spin_unlock_irqrestore(&cio->list_lock, flags);
+	} else
+		schedule_work(&ctx->dispatch_work);
+
+free_read_bio:
+	kfree(read_bio);
+
+	return;
+
+err_rw_bio:
+	clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+					cio->ranges[ri].src;
+	cio->ranges[ri].comp_len = min_t(sector_t, clen,
+					cio->ranges[ri].comp_len);
+	__free_page(read_bio->bi_io_vec[0].bv_page);
+	bio_map_kern_endio(read_bio);
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+	if (atomic_dec_and_test(&cio->refcount)) {
+		if (cio->endio) {
+			cio->endio(cio->private, cio->io_err);
+			kfree(cio);
+		} else
+			blk_wake_io_task(cio->waiter);
+	}
+}
+
+/*
+ * If the native copy offload feature is absent, this function emulates the
+ * copy by reading data from the source into a temporary buffer and writing
+ * it from the buffer to the destination device.
+ */
+static int blk_copy_emulate(struct block_device *src_bdev,
+		struct block_device *dst_bdev, struct range_entry *ranges,
+		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+	struct request_queue *sq = bdev_get_queue(src_bdev);
+	struct request_queue *dq = bdev_get_queue(dst_bdev);
+	struct bio *read_bio, *write_bio;
+	void *buf = NULL;
+	struct copy_ctx *ctx;
+	struct cio *cio;
+	sector_t src, dst, offset, buf_len, req_len, rem = 0;
+	int ri = 0, ret = 0;
+	unsigned long flags;
+	sector_t max_src_hw_len = min_t(unsigned int, queue_max_hw_sectors(sq),
+			queue_max_segments(sq) << (PAGE_SHIFT - SECTOR_SHIFT))
+			<< SECTOR_SHIFT;
+	sector_t max_dst_hw_len = min_t(unsigned int, queue_max_hw_sectors(dq),
+			queue_max_segments(dq) << (PAGE_SHIFT - SECTOR_SHIFT))
+			<< SECTOR_SHIFT;
+	sector_t max_hw_len = min_t(unsigned int,
+			max_src_hw_len, max_dst_hw_len);
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	cio->ranges = ranges;
+	atomic_set(&cio->refcount, 1);
+	cio->waiter = current;
+	cio->endio = end_io;
+	cio->private = private;
+
+	if (bdev_is_zoned(dst_bdev)) {
+		INIT_LIST_HEAD(&cio->list);
+		spin_lock_init(&cio->list_lock);
+	}
+
+	for (ri = 0; ri < nr; ri++) {
+		offset = ranges[ri].comp_len;
+		src = ranges[ri].src + offset;
+		dst = ranges[ri].dst + offset;
+		/* If IO fails, we truncate comp_len */
+		ranges[ri].comp_len = ranges[ri].len;
+
+		for (rem = ranges[ri].len - offset; rem > 0; rem -= buf_len) {
+			req_len = min_t(int, max_hw_len, rem);
+
+			buf = blk_alloc_buf(req_len, &buf_len, gfp_mask);
+			if (!buf) {
+				ret = -ENOMEM;
+				goto err_alloc_buf;
+			}
+
+			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+			if (!ctx) {
+				ret = -ENOMEM;
+				goto err_ctx;
+			}
+
+			read_bio = bio_map_kern(sq, buf, buf_len, gfp_mask);
+			if (IS_ERR(read_bio)) {
+				ret = PTR_ERR(read_bio);
+				goto err_read_bio;
+			}
+
+			write_bio = bio_map_kern(dq, buf, buf_len, gfp_mask);
+			if (IS_ERR(write_bio)) {
+				ret = PTR_ERR(write_bio);
+				goto err_write_bio;
+			}
+
+			ctx->cio = cio;
+			ctx->range_idx = ri;
+			ctx->write_bio = write_bio;
+			atomic_set(&ctx->refcount, 1);
+
+			read_bio->bi_iter.bi_sector = src >> SECTOR_SHIFT;
+			read_bio->bi_iter.bi_size = buf_len;
+			read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+			bio_set_dev(read_bio, src_bdev);
+			read_bio->bi_end_io = blk_copy_emulate_read_end_io;
+			read_bio->bi_private = ctx;
+
+			write_bio->bi_iter.bi_size = buf_len;
+			write_bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
+			bio_set_dev(write_bio, dst_bdev);
+			write_bio->bi_end_io = blk_copy_emulate_write_end_io;
+			write_bio->bi_iter.bi_sector = dst >> SECTOR_SHIFT;
+			write_bio->bi_private = ctx;
+
+			if (bdev_is_zoned(dst_bdev)) {
+				INIT_WORK(&ctx->dispatch_work,
+					blk_zoned_copy_dispatch_work_fn);
+				INIT_LIST_HEAD(&ctx->list);
+				spin_lock_irqsave(&cio->list_lock, flags);
+				ctx->status = REQ_COPY_READ_PROGRESS;
+				list_add_tail(&ctx->list, &cio->list);
+				spin_unlock_irqrestore(&cio->list_lock, flags);
+			} else
+				INIT_WORK(&ctx->dispatch_work,
+					blk_copy_dispatch_work_fn);
+
+			atomic_inc(&cio->refcount);
+			submit_bio(read_bio);
+
+			src += buf_len;
+			dst += buf_len;
+		}
+	}
+
+	/* Wait for completion of all IOs */
+	return cio_await_completion(cio);
+
+err_write_bio:
+	bio_put(read_bio);
+err_read_bio:
+	kfree(ctx);
+err_ctx:
+	kvfree(buf);
+err_alloc_buf:
+	ranges[ri].comp_len -= min_t(sector_t,
+			ranges[ri].comp_len, (ranges[ri].len - rem));
+
+	cio->io_err = ret;
+	return cio_await_completion(cio);
+}
+
 static inline bool blk_check_copy_offload(struct request_queue *src_q,
 		struct request_queue *dst_q)
 {
@@ -460,15 +693,21 @@ int blkdev_issue_copy(struct block_device *src_bdev,
 	struct request_queue *src_q = bdev_get_queue(src_bdev);
 	struct request_queue *dst_q = bdev_get_queue(dst_bdev);
 	int ret = -EINVAL;
+	bool offload = false;
 
 	ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
 	if (ret)
 		return ret;
 
-	if (blk_check_copy_offload(src_q, dst_q))
+	offload = blk_check_copy_offload(src_q, dst_q);
+	if (offload)
 		ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
 				end_io, private, gfp_mask);
 
+	if (ret || !offload)
+		ret = blk_copy_emulate(src_bdev, dst_bdev, ranges, nr,
+				end_io, private, gfp_mask);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_issue_copy);
diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73..bcf8db2b75f1 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -363,7 +363,7 @@ static void bio_invalidate_vmalloc_pages(struct bio *bio)
 #endif
 }
 
-static void bio_map_kern_endio(struct bio *bio)
+void bio_map_kern_endio(struct bio *bio)
 {
 	bio_invalidate_vmalloc_pages(bio);
 	bio_uninit(bio);
@@ -380,7 +380,7 @@ static void bio_map_kern_endio(struct bio *bio)
  *	Map the kernel address into a bio suitable for io to a block
  *	device. Returns an error pointer in case of error.
  */
-static struct bio *bio_map_kern(struct request_queue *q, void *data,
+struct bio *bio_map_kern(struct request_queue *q, void *data,
 		unsigned int len, gfp_t gfp_mask)
 {
 	unsigned long kaddr = (unsigned long)data;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 48e9160b7195..c5621550e5b4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1066,6 +1066,9 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 int blkdev_issue_copy(struct block_device *src_bdev,
 		struct block_device *dst_bdev, struct range_entry *ranges,
 		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
+struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
+		gfp_t gfp_mask);
+void bio_map_kern_endio(struct bio *bio);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 4/9] block: Introduce a new ioctl for copy
       [not found]   ` <CGME20230112120131epcas5p4374e6add89990dd546bd0ae38f4386f0@epcas5p4.samsung.com>
@ 2023-01-12 11:58       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:58 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev,
	Nitesh Shetty, Hannes Reinecke, Javier González,
	linux-block, linux-kernel, linux-nvme, linux-fsdevel

Add a new BLKCOPY ioctl that offloads copying of one or more source ranges
to one or more destinations on a device. The COPY ioctl accepts a
'copy_range' structure that contains the number of ranges and a reserved
field, followed by an array of ranges. Each source range is represented by
a 'range_entry' that contains the source start offset, the destination
start offset and the length of the source range (in bytes).

MAX_COPY_NR_RANGE limits the number of entries the IOCTL accepts and
MAX_COPY_TOTAL_LENGTH limits the total copy length it can handle.

Example code to issue BLKCOPY:
/* Sample code to copy three entries with [dst, src, len]:
 * [32768, 0, 4096] [36864, 4096, 4096] [40960, 8192, 4096] on the same device */

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(void)
{
	int i, ret, fd;
	unsigned long src = 0, dst = 32768, len = 4096;
	struct copy_range *cr;

	cr = (struct copy_range *)malloc(sizeof(*cr)+
					(sizeof(struct range_entry)*3));
	cr->nr_range = 3;
	cr->reserved = 0;
	for (i = 0; i < cr->nr_range; i++, src += len, dst += len) {
		cr->ranges[i].dst = dst;
		cr->ranges[i].src = src;
		cr->ranges[i].len = len;
		cr->ranges[i].comp_len = 0;
	}

	fd = open("/dev/nvme0n1", O_RDWR);
	if (fd < 0) return 1;

	ret = ioctl(fd, BLKCOPY, cr);
	if (ret != 0)
	       printf("copy failed, ret= %d\n", ret);

	for (i = 0; i < cr->nr_range; i++)
		if (cr->ranges[i].len != cr->ranges[i].comp_len)
			printf("Partial copy for entry %d: requested %llu, "
				"completed %llu\n",
				i, cr->ranges[i].len,
				cr->ranges[i].comp_len);
	close(fd);
	free(cr);
	return ret;
}

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/ioctl.c           | 36 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  9 +++++++++
 2 files changed, 45 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index 96617512982e..d636bc1f0047 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -120,6 +120,40 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 	return err;
 }
 
+static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
+		unsigned long arg)
+{
+	struct copy_range ucopy_range, *kcopy_range = NULL;
+	size_t payload_size = 0;
+	int ret;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (copy_from_user(&ucopy_range, (void __user *)arg,
+				sizeof(ucopy_range)))
+		return -EFAULT;
+
+	if (unlikely(!ucopy_range.nr_range || ucopy_range.reserved ||
+				ucopy_range.nr_range >= MAX_COPY_NR_RANGE))
+		return -EINVAL;
+
+	payload_size = (ucopy_range.nr_range * sizeof(struct range_entry)) +
+				sizeof(ucopy_range);
+
+	kcopy_range = memdup_user((void __user *)arg, payload_size);
+	if (IS_ERR(kcopy_range))
+		return PTR_ERR(kcopy_range);
+
+	ret = blkdev_issue_copy(bdev, bdev, kcopy_range->ranges,
+			kcopy_range->nr_range, NULL, NULL, GFP_KERNEL);
+	if (copy_to_user((void __user *)arg, kcopy_range, payload_size))
+		ret = -EFAULT;
+
+	kfree(kcopy_range);
+	return ret;
+}
+
 static int blk_ioctl_secure_erase(struct block_device *bdev, fmode_t mode,
 		void __user *argp)
 {
@@ -482,6 +516,8 @@ static int blkdev_common_ioctl(struct file *file, fmode_t mode, unsigned cmd,
 		return blk_ioctl_discard(bdev, mode, arg);
 	case BLKSECDISCARD:
 		return blk_ioctl_secure_erase(bdev, mode, argp);
+	case BLKCOPY:
+		return blk_ioctl_copy(bdev, mode, arg);
 	case BLKZEROOUT:
 		return blk_ioctl_zeroout(bdev, mode, arg);
 	case BLKGETDISKSEQ:
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 9248b6d259de..8af10b926a6f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -82,6 +82,14 @@ struct range_entry {
 	__u64 comp_len;
 };
 
+struct copy_range {
+	__u64 nr_range;
+	__u64 reserved;
+
+	/* Ranges must always be at the end */
+	struct range_entry ranges[];
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
@@ -203,6 +211,7 @@ struct fsxattr {
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
 #define BLKGETDISKSEQ _IOR(0x12,128,__u64)
+#define BLKCOPY _IOWR(0x12, 129, struct copy_range)
 /*
  * A jump here: 130-136 are reserved for zoned block devices
  * (see uapi/linux/blkzoned.h)
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dm-devel] [PATCH v6 4/9] block: Introduce a new ioctl for copy
@ 2023-01-12 11:58       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:58 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: p.raghav, joshi.k, gost.dev, anuj20.g, linux-kernel, linux-nvme,
	Nitesh Shetty, linux-block, linux-fsdevel, Javier González,
	nitheshshetty

Add a new BLKCOPY ioctl that offloads copying of one or more source ranges
to one or more destinations on a device. The COPY ioctl accepts a
'copy_range' structure that contains the number of ranges and a reserved
field, followed by an array of ranges. Each source range is represented by
a 'range_entry' that contains the source start offset, the destination
start offset and the length of the source range (in bytes).

MAX_COPY_NR_RANGE limits the number of entries the IOCTL accepts and
MAX_COPY_TOTAL_LENGTH limits the total copy length it can handle.

Example code to issue BLKCOPY:
/* Sample code to copy three entries with [dst, src, len]:
 * [32768, 0, 4096] [36864, 4096, 4096] [40960, 8192, 4096] on the same device */

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(void)
{
	int i, ret, fd;
	unsigned long src = 0, dst = 32768, len = 4096;
	struct copy_range *cr;

	cr = (struct copy_range *)malloc(sizeof(*cr)+
					(sizeof(struct range_entry)*3));
	cr->nr_range = 3;
	cr->reserved = 0;
	for (i = 0; i < cr->nr_range; i++, src += len, dst += len) {
		cr->ranges[i].dst = dst;
		cr->ranges[i].src = src;
		cr->ranges[i].len = len;
		cr->ranges[i].comp_len = 0;
	}

	fd = open("/dev/nvme0n1", O_RDWR);
	if (fd < 0) return 1;

	ret = ioctl(fd, BLKCOPY, cr);
	if (ret != 0)
	       printf("copy failed, ret= %d\n", ret);

	for (i = 0; i < cr->nr_range; i++)
		if (cr->ranges[i].len != cr->ranges[i].comp_len)
			printf("Partial copy for entry %d: requested %llu, "
				"completed %llu\n",
				i, cr->ranges[i].len,
				cr->ranges[i].comp_len);
	close(fd);
	free(cr);
	return ret;
}

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/ioctl.c           | 36 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  9 +++++++++
 2 files changed, 45 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index 96617512982e..d636bc1f0047 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -120,6 +120,40 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 	return err;
 }
 
+static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
+		unsigned long arg)
+{
+	struct copy_range ucopy_range, *kcopy_range = NULL;
+	size_t payload_size = 0;
+	int ret;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (copy_from_user(&ucopy_range, (void __user *)arg,
+				sizeof(ucopy_range)))
+		return -EFAULT;
+
+	if (unlikely(!ucopy_range.nr_range || ucopy_range.reserved ||
+				ucopy_range.nr_range >= MAX_COPY_NR_RANGE))
+		return -EINVAL;
+
+	payload_size = (ucopy_range.nr_range * sizeof(struct range_entry)) +
+				sizeof(ucopy_range);
+
+	kcopy_range = memdup_user((void __user *)arg, payload_size);
+	if (IS_ERR(kcopy_range))
+		return PTR_ERR(kcopy_range);
+
+	ret = blkdev_issue_copy(bdev, bdev, kcopy_range->ranges,
+			kcopy_range->nr_range, NULL, NULL, GFP_KERNEL);
+	if (copy_to_user((void __user *)arg, kcopy_range, payload_size))
+		ret = -EFAULT;
+
+	kfree(kcopy_range);
+	return ret;
+}
+
 static int blk_ioctl_secure_erase(struct block_device *bdev, fmode_t mode,
 		void __user *argp)
 {
@@ -482,6 +516,8 @@ static int blkdev_common_ioctl(struct file *file, fmode_t mode, unsigned cmd,
 		return blk_ioctl_discard(bdev, mode, arg);
 	case BLKSECDISCARD:
 		return blk_ioctl_secure_erase(bdev, mode, argp);
+	case BLKCOPY:
+		return blk_ioctl_copy(bdev, mode, arg);
 	case BLKZEROOUT:
 		return blk_ioctl_zeroout(bdev, mode, arg);
 	case BLKGETDISKSEQ:
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 9248b6d259de..8af10b926a6f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -82,6 +82,14 @@ struct range_entry {
 	__u64 comp_len;
 };
 
+struct copy_range {
+	__u64 nr_range;
+	__u64 reserved;
+
+	/* Ranges must always be at the end */
+	struct range_entry ranges[];
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
@@ -203,6 +211,7 @@ struct fsxattr {
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
 #define BLKGETDISKSEQ _IOR(0x12,128,__u64)
+#define BLKCOPY _IOWR(0x12, 129, struct copy_range)
 /*
  * A jump here: 130-136 are reserved for zoned block devices
  * (see uapi/linux/blkzoned.h)
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 5/9] nvme: add copy offload support
       [not found]   ` <CGME20230112120151epcas5p1e7c3ec0c7bd0869b9cf0bea64d65991a@epcas5p1.samsung.com>
@ 2023-01-12 11:58       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:58 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev,
	Nitesh Shetty, Javier González, linux-block, linux-kernel,
	linux-nvme, linux-fsdevel

For devices supporting native copy, the nvme driver receives read and
write requests with the REQ_COPY op flag.
For the read request, the nvme driver populates the payload with source
information.
For the write request, the driver converts it to an nvme copy command
using the source information in the payload and submits it to the device.
The current design only supports a single source range.
This design is courtesy of Mikulas Patocka's token-based copy.

Add trace event support for the nvme copy command.
Set the device copy limits in the request queue limits.
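
A condensed sketch of the resulting two-phase flow in the driver (simplified
from the patch below; a single source range is assumed and the snippet is not
complete code):

	/* read phase (nvme_setup_copy_read): stash source info in the token */
	token->ns = ns;
	token->src_sector = bio->bi_iter.bi_sector;
	token->sectors = bio->bi_iter.bi_size >> 9;

	/* write phase (nvme_setup_copy_write): build the NVMe copy command */
	cmnd->copy.opcode = nvme_cmd_copy;
	cmnd->copy.sdlba = cpu_to_le64(nvme_sect_to_lba(ns, dst_sector));
	range[0].slba = cpu_to_le64(nvme_sect_to_lba(ns, token->src_sector));
	range[0].nlb = cpu_to_le16(nvme_sect_to_lba(ns, token->sectors) - 1);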

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/host/constants.c |   1 +
 drivers/nvme/host/core.c      | 106 +++++++++++++++++++++++++++++++++-
 drivers/nvme/host/fc.c        |   5 ++
 drivers/nvme/host/nvme.h      |   7 +++
 drivers/nvme/host/pci.c       |  27 ++++++++-
 drivers/nvme/host/rdma.c      |   7 +++
 drivers/nvme/host/tcp.c       |  16 +++++
 drivers/nvme/host/trace.c     |  19 ++++++
 include/linux/nvme.h          |  43 +++++++++++++-
 9 files changed, 223 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index e958d5015585..4e60946d3aa8 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
 	[nvme_cmd_resv_report] = "Reservation Report",
 	[nvme_cmd_resv_acquire] = "Reservation Acquire",
 	[nvme_cmd_resv_release] = "Reservation Release",
+	[nvme_cmd_copy] = "Copy Offload",
 	[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
 	[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
 	[nvme_cmd_zone_append] = "Zone Management Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7be562a4e1aa..0af4ce771c5d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -753,6 +753,80 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline blk_status_t nvme_setup_copy_read(struct nvme_ns *ns,
+		struct request *req)
+{
+	struct bio *bio = req->bio;
+	struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+	memcpy(token->subsys, "nvme", 4);
+	token->ns = ns;
+	token->src_sector = bio->bi_iter.bi_sector;
+	token->sectors = bio->bi_iter.bi_size >> 9;
+
+	return BLK_STS_OK;
+}
+
+static inline blk_status_t nvme_setup_copy_write(struct nvme_ns *ns,
+	       struct request *req, struct nvme_command *cmnd)
+{
+	struct nvme_copy_range *range = NULL;
+	struct bio *bio = req->bio;
+	struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+	sector_t src_sector, dst_sector, n_sectors;
+	u64 src_lba, dst_lba, n_lba;
+	unsigned short nr_range = 1;
+	u16 control = 0;
+
+	if (unlikely(memcmp(token->subsys, "nvme", 4)))
+		return BLK_STS_NOTSUPP;
+	if (unlikely(token->ns != ns))
+		return BLK_STS_NOTSUPP;
+
+	src_sector = token->src_sector;
+	dst_sector = bio->bi_iter.bi_sector;
+	n_sectors = token->sectors;
+	if (WARN_ON(n_sectors != bio->bi_iter.bi_size >> 9))
+		return BLK_STS_NOTSUPP;
+
+	src_lba = nvme_sect_to_lba(ns, src_sector);
+	dst_lba = nvme_sect_to_lba(ns, dst_sector);
+	n_lba = nvme_sect_to_lba(ns, n_sectors);
+
+	if (WARN_ON(!n_lba))
+		return BLK_STS_NOTSUPP;
+
+	if (req->cmd_flags & REQ_FUA)
+		control |= NVME_RW_FUA;
+
+	if (req->cmd_flags & REQ_FAILFAST_DEV)
+		control |= NVME_RW_LR;
+
+	memset(cmnd, 0, sizeof(*cmnd));
+	cmnd->copy.opcode = nvme_cmd_copy;
+	cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+	cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+
+	range = kmalloc_array(nr_range, sizeof(*range),
+			GFP_ATOMIC | __GFP_NOWARN);
+	if (!range)
+		return BLK_STS_RESOURCE;
+
+	range[0].slba = cpu_to_le64(src_lba);
+	range[0].nlb = cpu_to_le16(n_lba - 1);
+
+	cmnd->copy.nr_range = 0;
+
+	req->special_vec.bv_page = virt_to_page(range);
+	req->special_vec.bv_offset = offset_in_page(range);
+	req->special_vec.bv_len = sizeof(*range) * nr_range;
+	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+	cmnd->copy.control = cpu_to_le16(control);
+
+	return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *cmnd)
 {
@@ -979,10 +1053,16 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 		ret = nvme_setup_discard(ns, req, cmd);
 		break;
 	case REQ_OP_READ:
-		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
+		if (unlikely(req->cmd_flags & REQ_COPY))
+			ret = nvme_setup_copy_read(ns, req);
+		else
+			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
 		break;
 	case REQ_OP_WRITE:
-		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+		if (unlikely(req->cmd_flags & REQ_COPY))
+			ret = nvme_setup_copy_write(ns, req, cmd);
+		else
+			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
 		break;
 	case REQ_OP_ZONE_APPEND:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
@@ -1731,6 +1811,26 @@ static void nvme_config_discard(struct gendisk *disk, struct nvme_ns *ns)
 		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+				       struct nvme_id_ns *id)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct request_queue *q = disk->queue;
+
+	if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
+		blk_queue_max_copy_sectors_hw(q, 0);
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+		return;
+	}
+
+	/* setting copy limits */
+	if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, q))
+		return;
+
+	blk_queue_max_copy_sectors_hw(q,
+		nvme_lba_to_sect(ns, le16_to_cpu(id->mssrl)));
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
 	return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1930,6 +2030,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	set_capacity_and_notify(disk, capacity);
 
 	nvme_config_discard(disk, ns);
+	nvme_config_copy(disk, ns, id);
 	blk_queue_max_write_zeroes_sectors(disk->queue,
 					   ns->ctrl->max_zeroes_sectors);
 }
@@ -5320,6 +5421,7 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_copy_command) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 4564f16a0b20..fd04dee590b4 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2807,6 +2807,11 @@ nvme_fc_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		blk_mq_end_request(rq, BLK_STS_OK);
+		return BLK_STS_OK;
+	}
 	/*
 	 * nvme core doesn't quite treat the rq opaquely. Commands such
 	 * as WRITE ZEROES will return a non-zero rq payload_bytes yet
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 424c8a467a0c..6e282956deb4 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -500,6 +500,13 @@ struct nvme_ns {
 
 };
 
+struct nvme_copy_token {
+	char subsys[4];
+	struct nvme_ns *ns;
+	u64 src_sector;
+	u64 sectors;
+};
+
 /* NVMe ns supports metadata actions by the controller (generate/strip) */
 static inline bool nvme_ns_has_pi(struct nvme_ns *ns)
 {
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b13baccedb4a..8a66dd05cd24 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -499,16 +499,19 @@ static inline void nvme_sq_copy_cmd(struct nvme_queue *nvmeq,
 		nvmeq->sq_tail = 0;
 }
 
-static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
+static inline void nvme_commit_sq_db(struct nvme_queue *nvmeq)
 {
-	struct nvme_queue *nvmeq = hctx->driver_data;
-
 	spin_lock(&nvmeq->sq_lock);
 	if (nvmeq->sq_tail != nvmeq->last_sq_tail)
 		nvme_write_sq_db(nvmeq, true);
 	spin_unlock(&nvmeq->sq_lock);
 }
 
+static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	nvme_commit_sq_db(hctx->driver_data);
+}
+
 static void **nvme_pci_iod_list(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -898,6 +901,12 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	if (ret)
 		return ret;
 
+	if (unlikely((req->cmd_flags & REQ_COPY) &&
+				(req_op(req) == REQ_OP_READ))) {
+		blk_mq_start_request(req);
+		return BLK_STS_OK;
+	}
+
 	if (blk_rq_nr_phys_segments(req)) {
 		ret = nvme_map_data(dev, req, &iod->cmd);
 		if (ret)
@@ -944,6 +953,18 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	ret = nvme_prep_rq(dev, req);
 	if (unlikely(ret))
 		return ret;
+	if (unlikely((req->cmd_flags & REQ_COPY) &&
+				(req_op(req) == REQ_OP_READ))) {
+		blk_mq_set_request_complete(req);
+		blk_mq_end_request(req, BLK_STS_OK);
+		/* Commit the sq if the copy read was the last req in the
+		 * list, as the copy read doesn't update the sq db
+		 */
+		if (bd->last)
+			nvme_commit_sq_db(nvmeq);
+		return ret;
+	}
+
 	spin_lock(&nvmeq->sq_lock);
 	nvme_sq_copy_cmd(nvmeq, &iod->cmd);
 	nvme_write_sq_db(nvmeq, bd->last);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index bbad26b82b56..a8bf2a87f42a 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2043,6 +2043,13 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	nvme_start_request(rq);
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		blk_mq_end_request(rq, BLK_STS_OK);
+		ret = BLK_STS_OK;
+		goto unmap_qe;
+	}
+
 	if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
 	    queue->pi_support &&
 	    (c->common.opcode == nvme_cmd_write ||
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 8cedc1ef496c..776e2ba84911 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2346,6 +2346,11 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
 	if (ret)
 		return ret;
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		return BLK_STS_OK;
+	}
+
 	req->state = NVME_TCP_SEND_CMD_PDU;
 	req->status = cpu_to_le16(NVME_SC_SUCCESS);
 	req->offset = 0;
@@ -2414,6 +2419,17 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	nvme_start_request(rq);
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		blk_mq_set_request_complete(rq);
+		blk_mq_end_request(rq, BLK_STS_OK);
+		/* if the copy read is the last req, queue tcp reqs */
+		if (bd->last && nvme_tcp_queue_more(queue))
+			queue_work_on(queue->io_cpu, nvme_tcp_wq,
+					&queue->io_work);
+		return ret;
+	}
+
 	nvme_tcp_queue_request(req, true, bd->last);
 
 	return BLK_STS_OK;
diff --git a/drivers/nvme/host/trace.c b/drivers/nvme/host/trace.c
index 1c36fcedea20..da4a7494e5a7 100644
--- a/drivers/nvme/host/trace.c
+++ b/drivers/nvme/host/trace.c
@@ -150,6 +150,23 @@ static const char *nvme_trace_read_write(struct trace_seq *p, u8 *cdw10)
 	return ret;
 }
 
+static const char *nvme_trace_copy(struct trace_seq *p, u8 *cdw10)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	u64 slba = get_unaligned_le64(cdw10);
+	u8 nr_range = get_unaligned_le16(cdw10 + 8);
+	u16 control = get_unaligned_le16(cdw10 + 10);
+	u32 dsmgmt = get_unaligned_le32(cdw10 + 12);
+	u32 reftag = get_unaligned_le32(cdw10 +  16);
+
+	trace_seq_printf(p,
+			 "slba=%llu, nr_range=%u, ctrl=0x%x, dsmgmt=%u, reftag=%u",
+			 slba, nr_range, control, dsmgmt, reftag);
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
 static const char *nvme_trace_dsm(struct trace_seq *p, u8 *cdw10)
 {
 	const char *ret = trace_seq_buffer_ptr(p);
@@ -243,6 +260,8 @@ const char *nvme_trace_parse_nvm_cmd(struct trace_seq *p,
 		return nvme_trace_zone_mgmt_send(p, cdw10);
 	case nvme_cmd_zone_mgmt_recv:
 		return nvme_trace_zone_mgmt_recv(p, cdw10);
+	case nvme_cmd_copy:
+		return nvme_trace_copy(p, cdw10);
 	default:
 		return nvme_trace_common(p, cdw10);
 	}
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 4fad4aa245fb..e92dd69c745a 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -337,7 +337,7 @@ struct nvme_id_ctrl {
 	__u8			nvscc;
 	__u8			nwpc;
 	__le16			acwu;
-	__u8			rsvd534[2];
+	__le16			ocfs;
 	__le32			sgls;
 	__le32			mnan;
 	__u8			rsvd544[224];
@@ -365,6 +365,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_ONCS_RESERVATIONS		= 1 << 5,
 	NVME_CTRL_ONCS_TIMESTAMP		= 1 << 6,
+	NVME_CTRL_ONCS_COPY			= 1 << 8,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
 	NVME_CTRL_OACS_NS_MNGT_SUPP		= 1 << 3,
@@ -414,7 +415,10 @@ struct nvme_id_ns {
 	__le16			npdg;
 	__le16			npda;
 	__le16			nows;
-	__u8			rsvd74[18];
+	__le16			mssrl;
+	__le32			mcl;
+	__u8			msrc;
+	__u8			rsvd91[11];
 	__le32			anagrpid;
 	__u8			rsvd96[3];
 	__u8			nsattr;
@@ -796,6 +800,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
 	nvme_cmd_resv_release	= 0x15,
+	nvme_cmd_copy		= 0x19,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
 	nvme_cmd_zone_append	= 0x7d,
@@ -818,7 +823,8 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_release),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
-		nvme_opcode_name(nvme_cmd_zone_append))
+		nvme_opcode_name(nvme_cmd_zone_append),		\
+		nvme_opcode_name(nvme_cmd_copy))
 
 
 
@@ -995,6 +1001,36 @@ struct nvme_dsm_range {
 	__le64			slba;
 };
 
+struct nvme_copy_command {
+	__u8                    opcode;
+	__u8                    flags;
+	__u16                   command_id;
+	__le32                  nsid;
+	__u64                   rsvd2;
+	__le64                  metadata;
+	union nvme_data_ptr     dptr;
+	__le64                  sdlba;
+	__u8			nr_range;
+	__u8			rsvd12;
+	__le16                  control;
+	__le16                  rsvd13;
+	__le16			dspec;
+	__le32                  ilbrt;
+	__le16                  lbat;
+	__le16                  lbatm;
+};
+
+struct nvme_copy_range {
+	__le64			rsvd0;
+	__le64			slba;
+	__le16			nlb;
+	__le16			rsvd18;
+	__le32			rsvd20;
+	__le32			eilbrt;
+	__le16			elbat;
+	__le16			elbatm;
+};
+
 struct nvme_write_zeroes_cmd {
 	__u8			opcode;
 	__u8			flags;
@@ -1752,6 +1788,7 @@ struct nvme_command {
 		struct nvme_download_firmware dlfw;
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
+		struct nvme_copy_command copy;
 		struct nvme_write_zeroes_cmd write_zeroes;
 		struct nvme_zone_mgmt_send_cmd zms;
 		struct nvme_zone_mgmt_recv_cmd zmr;
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dm-devel] [PATCH v6 5/9] nvme: add copy offload support
@ 2023-01-12 11:58       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:58 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: p.raghav, joshi.k, gost.dev, anuj20.g, linux-kernel, linux-nvme,
	Nitesh Shetty, linux-block, linux-fsdevel, Javier González,
	nitheshshetty

For devices supporting native copy, the nvme driver receives read and
write requests with the REQ_COPY flag set.
For the read request the nvme driver populates the payload with the
source information.
For the write request the driver converts it to an nvme copy command
using the source information in the payload and submits it to the device.
The current design only supports a single source range.
This design is courtesy of Mikulas Patocka's token based copy.

Add trace event support for the nvme copy command.
Propagate the device copy limits to the request queue limits.

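As a rough illustration of the sector-to-LBA math the write path performs
(a standalone userspace model, not driver code; the 4 KiB LBA format and
the sector numbers below are made up):

#include <assert.h>
#include <stdint.h>

/* mirrors nvme_sect_to_lba(): 512-byte sectors to LBAs */
static uint64_t sect_to_lba(uint64_t sect, unsigned int lba_shift)
{
	return sect >> (lba_shift - 9);
}

int main(void)
{
	unsigned int lba_shift = 12;			/* assumed 4 KiB LBA format */
	uint64_t src = 2048, dst = 4096, nsect = 256;	/* 128 KiB copy */

	assert(sect_to_lba(src, lba_shift) == 256);	/* slba */
	assert(sect_to_lba(dst, lba_shift) == 512);	/* sdlba */
	assert(sect_to_lba(nsect, lba_shift) == 32);	/* sent 0-based as nlb = 31 */
	return 0;
}
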
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/host/constants.c |   1 +
 drivers/nvme/host/core.c      | 106 +++++++++++++++++++++++++++++++++-
 drivers/nvme/host/fc.c        |   5 ++
 drivers/nvme/host/nvme.h      |   7 +++
 drivers/nvme/host/pci.c       |  27 ++++++++-
 drivers/nvme/host/rdma.c      |   7 +++
 drivers/nvme/host/tcp.c       |  16 +++++
 drivers/nvme/host/trace.c     |  19 ++++++
 include/linux/nvme.h          |  43 +++++++++++++-
 9 files changed, 223 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index e958d5015585..4e60946d3aa8 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
 	[nvme_cmd_resv_report] = "Reservation Report",
 	[nvme_cmd_resv_acquire] = "Reservation Acquire",
 	[nvme_cmd_resv_release] = "Reservation Release",
+	[nvme_cmd_copy] = "Copy Offload",
 	[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
 	[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
 	[nvme_cmd_zone_append] = "Zone Management Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7be562a4e1aa..0af4ce771c5d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -753,6 +753,80 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline blk_status_t nvme_setup_copy_read(struct nvme_ns *ns,
+		struct request *req)
+{
+	struct bio *bio = req->bio;
+	struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+	memcpy(token->subsys, "nvme", 4);
+	token->ns = ns;
+	token->src_sector = bio->bi_iter.bi_sector;
+	token->sectors = bio->bi_iter.bi_size >> 9;
+
+	return BLK_STS_OK;
+}
+
+static inline blk_status_t nvme_setup_copy_write(struct nvme_ns *ns,
+	       struct request *req, struct nvme_command *cmnd)
+{
+	struct nvme_copy_range *range = NULL;
+	struct bio *bio = req->bio;
+	struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+	sector_t src_sector, dst_sector, n_sectors;
+	u64 src_lba, dst_lba, n_lba;
+	unsigned short nr_range = 1;
+	u16 control = 0;
+
+	if (unlikely(memcmp(token->subsys, "nvme", 4)))
+		return BLK_STS_NOTSUPP;
+	if (unlikely(token->ns != ns))
+		return BLK_STS_NOTSUPP;
+
+	src_sector = token->src_sector;
+	dst_sector = bio->bi_iter.bi_sector;
+	n_sectors = token->sectors;
+	if (WARN_ON(n_sectors != bio->bi_iter.bi_size >> 9))
+		return BLK_STS_NOTSUPP;
+
+	src_lba = nvme_sect_to_lba(ns, src_sector);
+	dst_lba = nvme_sect_to_lba(ns, dst_sector);
+	n_lba = nvme_sect_to_lba(ns, n_sectors);
+
+	if (WARN_ON(!n_lba))
+		return BLK_STS_NOTSUPP;
+
+	if (req->cmd_flags & REQ_FUA)
+		control |= NVME_RW_FUA;
+
+	if (req->cmd_flags & REQ_FAILFAST_DEV)
+		control |= NVME_RW_LR;
+
+	memset(cmnd, 0, sizeof(*cmnd));
+	cmnd->copy.opcode = nvme_cmd_copy;
+	cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+	cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+
+	range = kmalloc_array(nr_range, sizeof(*range),
+			GFP_ATOMIC | __GFP_NOWARN);
+	if (!range)
+		return BLK_STS_RESOURCE;
+
+	range[0].slba = cpu_to_le64(src_lba);
+	range[0].nlb = cpu_to_le16(n_lba - 1);
+
+	cmnd->copy.nr_range = 0;
+
+	req->special_vec.bv_page = virt_to_page(range);
+	req->special_vec.bv_offset = offset_in_page(range);
+	req->special_vec.bv_len = sizeof(*range) * nr_range;
+	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+	cmnd->copy.control = cpu_to_le16(control);
+
+	return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *cmnd)
 {
@@ -979,10 +1053,16 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 		ret = nvme_setup_discard(ns, req, cmd);
 		break;
 	case REQ_OP_READ:
-		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
+		if (unlikely(req->cmd_flags & REQ_COPY))
+			ret = nvme_setup_copy_read(ns, req);
+		else
+			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
 		break;
 	case REQ_OP_WRITE:
-		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+		if (unlikely(req->cmd_flags & REQ_COPY))
+			ret = nvme_setup_copy_write(ns, req, cmd);
+		else
+			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
 		break;
 	case REQ_OP_ZONE_APPEND:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
@@ -1731,6 +1811,26 @@ static void nvme_config_discard(struct gendisk *disk, struct nvme_ns *ns)
 		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+				       struct nvme_id_ns *id)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct request_queue *q = disk->queue;
+
+	if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
+		blk_queue_max_copy_sectors_hw(q, 0);
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+		return;
+	}
+
+	/* setting copy limits */
+	if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, q))
+		return;
+
+	blk_queue_max_copy_sectors_hw(q,
+		nvme_lba_to_sect(ns, le16_to_cpu(id->mssrl)));
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
 	return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1930,6 +2030,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	set_capacity_and_notify(disk, capacity);
 
 	nvme_config_discard(disk, ns);
+	nvme_config_copy(disk, ns, id);
 	blk_queue_max_write_zeroes_sectors(disk->queue,
 					   ns->ctrl->max_zeroes_sectors);
 }
@@ -5320,6 +5421,7 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_copy_command) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 4564f16a0b20..fd04dee590b4 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2807,6 +2807,11 @@ nvme_fc_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		blk_mq_end_request(rq, BLK_STS_OK);
+		return BLK_STS_OK;
+	}
 	/*
 	 * nvme core doesn't quite treat the rq opaquely. Commands such
 	 * as WRITE ZEROES will return a non-zero rq payload_bytes yet
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 424c8a467a0c..6e282956deb4 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -500,6 +500,13 @@ struct nvme_ns {
 
 };
 
+struct nvme_copy_token {
+	char subsys[4];
+	struct nvme_ns *ns;
+	u64 src_sector;
+	u64 sectors;
+};
+
 /* NVMe ns supports metadata actions by the controller (generate/strip) */
 static inline bool nvme_ns_has_pi(struct nvme_ns *ns)
 {
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b13baccedb4a..8a66dd05cd24 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -499,16 +499,19 @@ static inline void nvme_sq_copy_cmd(struct nvme_queue *nvmeq,
 		nvmeq->sq_tail = 0;
 }
 
-static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
+static inline void nvme_commit_sq_db(struct nvme_queue *nvmeq)
 {
-	struct nvme_queue *nvmeq = hctx->driver_data;
-
 	spin_lock(&nvmeq->sq_lock);
 	if (nvmeq->sq_tail != nvmeq->last_sq_tail)
 		nvme_write_sq_db(nvmeq, true);
 	spin_unlock(&nvmeq->sq_lock);
 }
 
+static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	nvme_commit_sq_db(hctx->driver_data);
+}
+
 static void **nvme_pci_iod_list(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -898,6 +901,12 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	if (ret)
 		return ret;
 
+	if (unlikely((req->cmd_flags & REQ_COPY) &&
+				(req_op(req) == REQ_OP_READ))) {
+		blk_mq_start_request(req);
+		return BLK_STS_OK;
+	}
+
 	if (blk_rq_nr_phys_segments(req)) {
 		ret = nvme_map_data(dev, req, &iod->cmd);
 		if (ret)
@@ -944,6 +953,18 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	ret = nvme_prep_rq(dev, req);
 	if (unlikely(ret))
 		return ret;
+	if (unlikely((req->cmd_flags & REQ_COPY) &&
+				(req_op(req) == REQ_OP_READ))) {
+		blk_mq_set_request_complete(req);
+		blk_mq_end_request(req, BLK_STS_OK);
+		/* Commit the sq if copy read was the last req in the list,
+		 * as copy read doesn't update the sq db
+		 */
+		if (bd->last)
+			nvme_commit_sq_db(nvmeq);
+		return ret;
+	}
+
 	spin_lock(&nvmeq->sq_lock);
 	nvme_sq_copy_cmd(nvmeq, &iod->cmd);
 	nvme_write_sq_db(nvmeq, bd->last);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index bbad26b82b56..a8bf2a87f42a 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2043,6 +2043,13 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	nvme_start_request(rq);
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		blk_mq_end_request(rq, BLK_STS_OK);
+		ret = BLK_STS_OK;
+		goto unmap_qe;
+	}
+
 	if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
 	    queue->pi_support &&
 	    (c->common.opcode == nvme_cmd_write ||
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 8cedc1ef496c..776e2ba84911 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2346,6 +2346,11 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
 	if (ret)
 		return ret;
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		return BLK_STS_OK;
+	}
+
 	req->state = NVME_TCP_SEND_CMD_PDU;
 	req->status = cpu_to_le16(NVME_SC_SUCCESS);
 	req->offset = 0;
@@ -2414,6 +2419,17 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	nvme_start_request(rq);
 
+	if (unlikely((rq->cmd_flags & REQ_COPY) &&
+				(req_op(rq) == REQ_OP_READ))) {
+		blk_mq_set_request_complete(rq);
+		blk_mq_end_request(rq, BLK_STS_OK);
+		/* if copy read is the last req, kick off sending of queued reqs */
+		if (bd->last && nvme_tcp_queue_more(queue))
+			queue_work_on(queue->io_cpu, nvme_tcp_wq,
+					&queue->io_work);
+		return ret;
+	}
+
 	nvme_tcp_queue_request(req, true, bd->last);
 
 	return BLK_STS_OK;
diff --git a/drivers/nvme/host/trace.c b/drivers/nvme/host/trace.c
index 1c36fcedea20..da4a7494e5a7 100644
--- a/drivers/nvme/host/trace.c
+++ b/drivers/nvme/host/trace.c
@@ -150,6 +150,23 @@ static const char *nvme_trace_read_write(struct trace_seq *p, u8 *cdw10)
 	return ret;
 }
 
+static const char *nvme_trace_copy(struct trace_seq *p, u8 *cdw10)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	u64 slba = get_unaligned_le64(cdw10);
+	u8 nr_range = cdw10[8];
+	u16 control = get_unaligned_le16(cdw10 + 10);
+	u32 dsmgmt = get_unaligned_le32(cdw10 + 12);
+	u32 reftag = get_unaligned_le32(cdw10 +  16);
+
+	trace_seq_printf(p,
+			 "slba=%llu, nr_range=%u, ctrl=0x%x, dsmgmt=%u, reftag=%u",
+			 slba, nr_range, control, dsmgmt, reftag);
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
 static const char *nvme_trace_dsm(struct trace_seq *p, u8 *cdw10)
 {
 	const char *ret = trace_seq_buffer_ptr(p);
@@ -243,6 +260,8 @@ const char *nvme_trace_parse_nvm_cmd(struct trace_seq *p,
 		return nvme_trace_zone_mgmt_send(p, cdw10);
 	case nvme_cmd_zone_mgmt_recv:
 		return nvme_trace_zone_mgmt_recv(p, cdw10);
+	case nvme_cmd_copy:
+		return nvme_trace_copy(p, cdw10);
 	default:
 		return nvme_trace_common(p, cdw10);
 	}
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 4fad4aa245fb..e92dd69c745a 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -337,7 +337,7 @@ struct nvme_id_ctrl {
 	__u8			nvscc;
 	__u8			nwpc;
 	__le16			acwu;
-	__u8			rsvd534[2];
+	__le16			ocfs;
 	__le32			sgls;
 	__le32			mnan;
 	__u8			rsvd544[224];
@@ -365,6 +365,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_ONCS_RESERVATIONS		= 1 << 5,
 	NVME_CTRL_ONCS_TIMESTAMP		= 1 << 6,
+	NVME_CTRL_ONCS_COPY			= 1 << 8,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
 	NVME_CTRL_OACS_NS_MNGT_SUPP		= 1 << 3,
@@ -414,7 +415,10 @@ struct nvme_id_ns {
 	__le16			npdg;
 	__le16			npda;
 	__le16			nows;
-	__u8			rsvd74[18];
+	__le16			mssrl;
+	__le32			mcl;
+	__u8			msrc;
+	__u8			rsvd91[11];
 	__le32			anagrpid;
 	__u8			rsvd96[3];
 	__u8			nsattr;
@@ -796,6 +800,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
 	nvme_cmd_resv_release	= 0x15,
+	nvme_cmd_copy		= 0x19,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
 	nvme_cmd_zone_append	= 0x7d,
@@ -818,7 +823,8 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_release),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
-		nvme_opcode_name(nvme_cmd_zone_append))
+		nvme_opcode_name(nvme_cmd_zone_append),		\
+		nvme_opcode_name(nvme_cmd_copy))
 
 
 
@@ -995,6 +1001,36 @@ struct nvme_dsm_range {
 	__le64			slba;
 };
 
+struct nvme_copy_command {
+	__u8                    opcode;
+	__u8                    flags;
+	__u16                   command_id;
+	__le32                  nsid;
+	__u64                   rsvd2;
+	__le64                  metadata;
+	union nvme_data_ptr     dptr;
+	__le64                  sdlba;
+	__u8			nr_range;
+	__u8			rsvd12;
+	__le16                  control;
+	__le16                  rsvd13;
+	__le16			dspec;
+	__le32                  ilbrt;
+	__le16                  lbat;
+	__le16                  lbatm;
+};
+
+struct nvme_copy_range {
+	__le64			rsvd0;
+	__le64			slba;
+	__le16			nlb;
+	__le16			rsvd18;
+	__le32			rsvd20;
+	__le32			eilbrt;
+	__le16			elbat;
+	__le16			elbatm;
+};
+
 struct nvme_write_zeroes_cmd {
 	__u8			opcode;
 	__u8			flags;
@@ -1752,6 +1788,7 @@ struct nvme_command {
 		struct nvme_download_firmware dlfw;
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
+		struct nvme_copy_command copy;
 		struct nvme_write_zeroes_cmd write_zeroes;
 		struct nvme_zone_mgmt_send_cmd zms;
 		struct nvme_zone_mgmt_recv_cmd zmr;
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 6/9] nvmet: add copy command support for bdev and file ns
       [not found]   ` <CGME20230112120201epcas5p1d2ee1f9fd6a1f458ffa770bb33b4bb41@epcas5p1.samsung.com>
@ 2023-01-12 11:59       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:59 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev,
	Nitesh Shetty, linux-block, linux-kernel, linux-nvme,
	linux-fsdevel

Add support for handling the copy command on the nvmet target.
For a bdev-backed namespace we call into blkdev_issue_copy, which the
block layer completes either by an offloaded copy request to the backing
bdev or by emulating the request.

For a file-backed namespace we call vfs_copy_file_range to service the
request.

Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS field.

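For reference, the file-ns path is essentially copy_file_range(2) per
source range; a minimal userspace sketch of the same operation (the
backing file name and the offsets below are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/ns-backing-file", O_RDWR);	/* hypothetical file */
	off64_t src = 0, dst = 1 << 20;			/* copy 1 MiB to offset 1 MiB */
	size_t len = 1 << 20;

	if (fd < 0)
		return 1;
	while (len) {
		ssize_t ret = copy_file_range(fd, &src, fd, &dst, len, 0);
		if (ret <= 0) {
			perror("copy_file_range");
			break;
		}
		len -= ret;
	}
	close(fd);
	return 0;
}
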
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/target/admin-cmd.c   |  9 +++-
 drivers/nvme/target/io-cmd-bdev.c | 79 +++++++++++++++++++++++++++++++
 drivers/nvme/target/io-cmd-file.c | 52 ++++++++++++++++++++
 drivers/nvme/target/loop.c        |  6 +++
 drivers/nvme/target/nvmet.h       |  2 +
 5 files changed, 146 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 6a54ed6fb121..66a45121f5a0 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 	id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-			NVME_CTRL_ONCS_WRITE_ZEROES);
-
+			NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
 	/* XXX: don't report vwc if the underlying device is write through */
 	id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
 
 	if (req->ns->bdev)
 		nvmet_bdev_set_limits(req->ns->bdev, id);
+	else {
+		id->msrc = (u8)to0based(BIO_MAX_VECS - 1);
+		id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+				(PAGE_SHIFT - SECTOR_SHIFT));
+		id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+	}
 
 	/*
 	 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index c2d6cea0236b..01f0160125fb 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,19 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 	id->npda = id->npdg;
 	/* NOWS = Namespace Optimal Write Size */
 	id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+	/* Copy limits */
+	if (bdev_max_copy_sectors(bdev)) {
+		id->msrc = id->msrc;
+		id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+				SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+		id->mcl = cpu_to_le32(id->mssrl);
+	} else {
+		id->msrc = (u8)to0based(BIO_MAX_VECS - 1);
+		id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+				bdev_logical_block_size(bdev));
+		id->mcl = cpu_to_le32(id->mssrl);
+	}
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -184,6 +197,23 @@ static void nvmet_bio_done(struct bio *bio)
 	nvmet_req_bio_put(req, bio);
 }
 
+static void nvmet_bdev_copy_end_io(void *private, int status)
+{
+	struct nvmet_req *req = (struct nvmet_req *)private;
+	int id;
+
+	if (status) {
+		for (id = 0 ; id < req->nr_range; id++) {
+			if (req->ranges[id].len != req->ranges[id].comp_len) {
+				req->cqe->result.u32 = cpu_to_le32(id);
+				break;
+			}
+		}
+	}
+	kfree(req->ranges);
+	nvmet_req_complete(req, errno_to_nvme_status(req, status));
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
 				struct sg_mapping_iter *miter)
@@ -450,6 +480,51 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
 	}
 }
 
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+	struct nvme_copy_range range;
+	struct range_entry *ranges;
+	struct nvme_command *cmnd = req->cmd;
+	sector_t dest, dest_off = 0;
+	int ret, id, nr_range;
+
+	nr_range = cmnd->copy.nr_range + 1;
+	dest = le64_to_cpu(cmnd->copy.sdlba) << req->ns->blksize_shift;
+	ranges = kmalloc_array(nr_range, sizeof(*ranges), GFP_KERNEL);
+
+	for (id = 0 ; id < nr_range; id++) {
+		ret = nvmet_copy_from_sgl(req, id * sizeof(range),
+					&range, sizeof(range));
+		if (ret)
+			goto out;
+
+		ranges[id].dst = dest + dest_off;
+		ranges[id].src = le64_to_cpu(range.slba) <<
+					req->ns->blksize_shift;
+		ranges[id].len = (le16_to_cpu(range.nlb) + 1) <<
+					req->ns->blksize_shift;
+		ranges[id].comp_len = 0;
+		dest_off += ranges[id].len;
+	}
+	req->ranges = ranges;
+	req->nr_range = nr_range;
+	ret = blkdev_issue_copy(req->ns->bdev, req->ns->bdev, ranges, nr_range,
+			nvmet_bdev_copy_end_io, (void *)req, GFP_KERNEL);
+	if (ret) {
+		for (id = 0 ; id < nr_range; id++) {
+			if (ranges[id].len != ranges[id].comp_len) {
+				req->cqe->result.u32 = cpu_to_le32(id);
+				break;
+			}
+		}
+		goto out;
+	} else
+		return;
+out:
+	kfree(ranges);
+	nvmet_req_complete(req, errno_to_nvme_status(req, ret));
+}
+
 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 {
 	switch (req->cmd->common.opcode) {
@@ -468,6 +543,10 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_bdev_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_copy:
+		req->execute = nvmet_bdev_execute_copy;
+		return 0;
+
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 871c4f32f443..76db307f5b30 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -328,6 +328,49 @@ static void nvmet_file_dsm_work(struct work_struct *w)
 	}
 }
 
+static void nvmet_file_copy_work(struct work_struct *w)
+{
+	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
+	int nr_range;
+	loff_t pos, src;
+	struct nvme_command *cmnd = req->cmd;
+	int ret = 0, len = 0, id;
+
+	nr_range = cmnd->copy.nr_range + 1;
+	pos = le64_to_cpu(req->cmd->copy.sdlba) << req->ns->blksize_shift;
+	if (unlikely(pos + req->transfer_len > req->ns->size)) {
+		nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
+		return;
+	}
+
+	for (id = 0 ; id < nr_range; id++) {
+		struct nvme_copy_range range;
+
+		ret = nvmet_copy_from_sgl(req, id * sizeof(range), &range,
+					sizeof(range));
+		if (ret)
+			goto out;
+
+		len = (le16_to_cpu(range.nlb) + 1) << (req->ns->blksize_shift);
+		src = (le64_to_cpu(range.slba) << (req->ns->blksize_shift));
+		ret = vfs_copy_file_range(req->ns->file, src, req->ns->file,
+					pos, len, 0);
+out:
+		if (ret != len) {
+			pos += ret;
+			req->cqe->result.u32 = cpu_to_le32(id);
+			nvmet_req_complete(req, ret < 0 ?
+					errno_to_nvme_status(req, ret) :
+					errno_to_nvme_status(req, -EIO));
+			return;
+
+		} else
+			pos += len;
+	}
+
+	nvmet_req_complete(req, 0);
+}
+
 static void nvmet_file_execute_dsm(struct nvmet_req *req)
 {
 	if (!nvmet_check_data_len_lte(req, nvmet_dsm_len(req)))
@@ -336,6 +379,12 @@ static void nvmet_file_execute_dsm(struct nvmet_req *req)
 	queue_work(nvmet_wq, &req->f.work);
 }
 
+static void nvmet_file_execute_copy(struct nvmet_req *req)
+{
+	INIT_WORK(&req->f.work, nvmet_file_copy_work);
+	queue_work(nvmet_wq, &req->f.work);
+}
+
 static void nvmet_file_write_zeroes_work(struct work_struct *w)
 {
 	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
@@ -382,6 +431,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_file_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_copy:
+		req->execute = nvmet_file_execute_copy;
+		return 0;
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index f2d24b2d992f..d18ed8067a15 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -146,6 +146,12 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		return ret;
 
 	nvme_start_request(req);
+	if (unlikely((req->cmd_flags & REQ_COPY) &&
+				(req_op(req) == REQ_OP_READ))) {
+		blk_mq_set_request_complete(req);
+		blk_mq_end_request(req, BLK_STS_OK);
+		return BLK_STS_OK;
+	}
 	iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
 	iod->req.port = queue->ctrl->port;
 	if (!nvmet_req_init(&iod->req, &queue->nvme_cq,
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 89bedfcd974c..c36b55c9ff68 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -393,6 +393,8 @@ struct nvmet_req {
 	struct device		*p2p_client;
 	u16			error_loc;
 	u64			error_slba;
+	struct range_entry	*ranges;
+	unsigned int		nr_range;
 };
 
 #define NVMET_MAX_MPOOL_BVEC		16
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dm-devel] [PATCH v6 6/9] nvmet: add copy command support for bdev and file ns
@ 2023-01-12 11:59       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:59 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: p.raghav, joshi.k, gost.dev, anuj20.g, linux-kernel, linux-nvme,
	Nitesh Shetty, linux-block, linux-fsdevel, nitheshshetty

Add support for handling the copy command on the nvmet target.
For a bdev-backed namespace we call into blkdev_issue_copy, which the
block layer completes either by an offloaded copy request to the backing
bdev or by emulating the request.

For a file-backed namespace we call vfs_copy_file_range to service the
request.

Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS field.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/target/admin-cmd.c   |  9 +++-
 drivers/nvme/target/io-cmd-bdev.c | 79 +++++++++++++++++++++++++++++++
 drivers/nvme/target/io-cmd-file.c | 52 ++++++++++++++++++++
 drivers/nvme/target/loop.c        |  6 +++
 drivers/nvme/target/nvmet.h       |  2 +
 5 files changed, 146 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 6a54ed6fb121..66a45121f5a0 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 	id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-			NVME_CTRL_ONCS_WRITE_ZEROES);
-
+			NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
 	/* XXX: don't report vwc if the underlying device is write through */
 	id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
 
 	if (req->ns->bdev)
 		nvmet_bdev_set_limits(req->ns->bdev, id);
+	else {
+		id->msrc = (u8)to0based(BIO_MAX_VECS - 1);
+		id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+				(PAGE_SHIFT - SECTOR_SHIFT));
+		id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+	}
 
 	/*
 	 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index c2d6cea0236b..01f0160125fb 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,19 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 	id->npda = id->npdg;
 	/* NOWS = Namespace Optimal Write Size */
 	id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+	/* Copy limits */
+	if (bdev_max_copy_sectors(bdev)) {
+		id->msrc = id->msrc;
+		id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+				SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+		id->mcl = cpu_to_le32(id->mssrl);
+	} else {
+		id->msrc = (u8)to0based(BIO_MAX_VECS - 1);
+		id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+				bdev_logical_block_size(bdev));
+		id->mcl = cpu_to_le32(id->mssrl);
+	}
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -184,6 +197,23 @@ static void nvmet_bio_done(struct bio *bio)
 	nvmet_req_bio_put(req, bio);
 }
 
+static void nvmet_bdev_copy_end_io(void *private, int status)
+{
+	struct nvmet_req *req = (struct nvmet_req *)private;
+	int id;
+
+	if (status) {
+		for (id = 0 ; id < req->nr_range; id++) {
+			if (req->ranges[id].len != req->ranges[id].comp_len) {
+				req->cqe->result.u32 = cpu_to_le32(id);
+				break;
+			}
+		}
+	}
+	kfree(req->ranges);
+	nvmet_req_complete(req, errno_to_nvme_status(req, status));
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
 				struct sg_mapping_iter *miter)
@@ -450,6 +480,51 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
 	}
 }
 
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+	struct nvme_copy_range range;
+	struct range_entry *ranges;
+	struct nvme_command *cmnd = req->cmd;
+	sector_t dest, dest_off = 0;
+	int ret, id, nr_range;
+
+	nr_range = cmnd->copy.nr_range + 1;
+	dest = le64_to_cpu(cmnd->copy.sdlba) << req->ns->blksize_shift;
+	ranges = kmalloc_array(nr_range, sizeof(*ranges), GFP_KERNEL);
+
+	for (id = 0 ; id < nr_range; id++) {
+		ret = nvmet_copy_from_sgl(req, id * sizeof(range),
+					&range, sizeof(range));
+		if (ret)
+			goto out;
+
+		ranges[id].dst = dest + dest_off;
+		ranges[id].src = le64_to_cpu(range.slba) <<
+					req->ns->blksize_shift;
+		ranges[id].len = (le16_to_cpu(range.nlb) + 1) <<
+					req->ns->blksize_shift;
+		ranges[id].comp_len = 0;
+		dest_off += ranges[id].len;
+	}
+	req->ranges = ranges;
+	req->nr_range = nr_range;
+	ret = blkdev_issue_copy(req->ns->bdev, req->ns->bdev, ranges, nr_range,
+			nvmet_bdev_copy_end_io, (void *)req, GFP_KERNEL);
+	if (ret) {
+		for (id = 0 ; id < nr_range; id++) {
+			if (ranges[id].len != ranges[id].comp_len) {
+				req->cqe->result.u32 = cpu_to_le32(id);
+				break;
+			}
+		}
+		goto out;
+	} else
+		return;
+out:
+	kfree(ranges);
+	nvmet_req_complete(req, errno_to_nvme_status(req, ret));
+}
+
 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 {
 	switch (req->cmd->common.opcode) {
@@ -468,6 +543,10 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_bdev_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_copy:
+		req->execute = nvmet_bdev_execute_copy;
+		return 0;
+
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 871c4f32f443..76db307f5b30 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -328,6 +328,49 @@ static void nvmet_file_dsm_work(struct work_struct *w)
 	}
 }
 
+static void nvmet_file_copy_work(struct work_struct *w)
+{
+	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
+	int nr_range;
+	loff_t pos, src;
+	struct nvme_command *cmnd = req->cmd;
+	int ret = 0, len = 0, id;
+
+	nr_range = cmnd->copy.nr_range + 1;
+	pos = le64_to_cpu(req->cmd->copy.sdlba) << req->ns->blksize_shift;
+	if (unlikely(pos + req->transfer_len > req->ns->size)) {
+		nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
+		return;
+	}
+
+	for (id = 0 ; id < nr_range; id++) {
+		struct nvme_copy_range range;
+
+		ret = nvmet_copy_from_sgl(req, id * sizeof(range), &range,
+					sizeof(range));
+		if (ret)
+			goto out;
+
+		len = (le16_to_cpu(range.nlb) + 1) << (req->ns->blksize_shift);
+		src = (le64_to_cpu(range.slba) << (req->ns->blksize_shift));
+		ret = vfs_copy_file_range(req->ns->file, src, req->ns->file,
+					pos, len, 0);
+out:
+		if (ret != len) {
+			pos += ret;
+			req->cqe->result.u32 = cpu_to_le32(id);
+			nvmet_req_complete(req, ret < 0 ?
+					errno_to_nvme_status(req, ret) :
+					errno_to_nvme_status(req, -EIO));
+			return;
+
+		} else
+			pos += len;
+	}
+
+	nvmet_req_complete(req, 0);
+}
+
 static void nvmet_file_execute_dsm(struct nvmet_req *req)
 {
 	if (!nvmet_check_data_len_lte(req, nvmet_dsm_len(req)))
@@ -336,6 +379,12 @@ static void nvmet_file_execute_dsm(struct nvmet_req *req)
 	queue_work(nvmet_wq, &req->f.work);
 }
 
+static void nvmet_file_execute_copy(struct nvmet_req *req)
+{
+	INIT_WORK(&req->f.work, nvmet_file_copy_work);
+	queue_work(nvmet_wq, &req->f.work);
+}
+
 static void nvmet_file_write_zeroes_work(struct work_struct *w)
 {
 	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
@@ -382,6 +431,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_file_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_copy:
+		req->execute = nvmet_file_execute_copy;
+		return 0;
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index f2d24b2d992f..d18ed8067a15 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -146,6 +146,12 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		return ret;
 
 	nvme_start_request(req);
+	if (unlikely((req->cmd_flags & REQ_COPY) &&
+				(req_op(req) == REQ_OP_READ))) {
+		blk_mq_set_request_complete(req);
+		blk_mq_end_request(req, BLK_STS_OK);
+		return BLK_STS_OK;
+	}
 	iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
 	iod->req.port = queue->ctrl->port;
 	if (!nvmet_req_init(&iod->req, &queue->nvme_cq,
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 89bedfcd974c..c36b55c9ff68 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -393,6 +393,8 @@ struct nvmet_req {
 	struct device		*p2p_client;
 	u16			error_loc;
 	u64			error_slba;
+	struct range_entry	*ranges;
+	unsigned int		nr_range;
 };
 
 #define NVMET_MAX_MPOOL_BVEC		16
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 7/9] dm: Add support for copy offload.
       [not found]   ` <CGME20230112120210epcas5p41524bba73af1dcf283d21b9c7ee9d239@epcas5p4.samsung.com>
@ 2023-01-12 11:59       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:59 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev,
	Nitesh Shetty, linux-block, linux-kernel, linux-nvme,
	linux-fsdevel

Before enabling copy offload for a dm target, check that both the
underlying devices and the dm target support copy. Avoid splits inside
the dm target: fail early if the request would need a split, since
splitting copy requests is not currently supported.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/md/dm-table.c         | 42 +++++++++++++++++++++++++++++++++++
 drivers/md/dm.c               |  7 ++++++
 include/linux/device-mapper.h |  5 +++++
 3 files changed, 54 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 8541d5688f3a..4a1bbbb2493b 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1875,6 +1875,39 @@ static bool dm_table_supports_nowait(struct dm_table *t)
 	return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+				      sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return !blk_queue_copy(q);
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+	struct dm_target *ti;
+	unsigned int i;
+
+	for (i = 0; i < t->num_targets; i++) {
+		ti = dm_table_get_target(t, i);
+
+		if (!ti->copy_offload_supported)
+			return false;
+
+		/*
+		 * target provides copy support (as implied by setting
+		 * 'copy_offload_supported')
+		 * and it relies on _all_ data devices having copy support.
+		 */
+		if (!ti->type->iterate_devices ||
+		     ti->type->iterate_devices(ti,
+			     device_not_copy_capable, NULL))
+			return false;
+	}
+
+	return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
 				      sector_t start, sector_t len, void *data)
 {
@@ -1957,6 +1990,15 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 		q->limits.discard_misaligned = 0;
 	}
 
+	if (!dm_table_supports_copy(t)) {
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+		/* Must also clear copy limits... */
+		q->limits.max_copy_sectors = 0;
+		q->limits.max_copy_sectors_hw = 0;
+	} else {
+		blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+	}
+
 	if (!dm_table_supports_secure_erase(t))
 		q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b424a6ee27ba..0b04093fbeb2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1690,6 +1690,13 @@ static blk_status_t __split_and_process_bio(struct clone_info *ci)
 	if (unlikely(ci->is_abnormal_io))
 		return __process_abnormal_io(ci, ti);
 
+	if (unlikely(op_is_copy(ci->bio->bi_opf) &&
+			max_io_len(ti, ci->sector) < ci->sector_count)) {
+		DMERR("Error, IO size(%u) > max target size(%llu)",
+			ci->sector_count, (unsigned long long)max_io_len(ti, ci->sector));
+		return BLK_STS_IOERR;
+	}
+
 	/*
 	 * Only support bio polling for normal IO, and the target io is
 	 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 04c6acf7faaa..da4e77e81011 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -379,6 +379,11 @@ struct dm_target {
 	 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 	 */
 	bool needs_bio_set_dev:1;
+
+	/*
+	 * copy offload is supported
+	 */
+	bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dm-devel] [PATCH v6 7/9] dm: Add support for copy offload.
@ 2023-01-12 11:59       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:59 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: p.raghav, joshi.k, gost.dev, anuj20.g, linux-kernel, linux-nvme,
	Nitesh Shetty, linux-block, linux-fsdevel, nitheshshetty

Before enabling copy offload for a dm target, check that both the
underlying devices and the dm target support copy. Avoid splits inside
the dm target: fail early if the request would need a split, since
splitting copy requests is not currently supported.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/md/dm-table.c         | 42 +++++++++++++++++++++++++++++++++++
 drivers/md/dm.c               |  7 ++++++
 include/linux/device-mapper.h |  5 +++++
 3 files changed, 54 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 8541d5688f3a..4a1bbbb2493b 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1875,6 +1875,39 @@ static bool dm_table_supports_nowait(struct dm_table *t)
 	return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+				      sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return !blk_queue_copy(q);
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+	struct dm_target *ti;
+	unsigned int i;
+
+	for (i = 0; i < t->num_targets; i++) {
+		ti = dm_table_get_target(t, i);
+
+		if (!ti->copy_offload_supported)
+			return false;
+
+		/*
+		 * target provides copy support (as implied by setting
+		 * 'copy_offload_supported')
+		 * and it relies on _all_ data devices having copy support.
+		 */
+		if (!ti->type->iterate_devices ||
+		     ti->type->iterate_devices(ti,
+			     device_not_copy_capable, NULL))
+			return false;
+	}
+
+	return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
 				      sector_t start, sector_t len, void *data)
 {
@@ -1957,6 +1990,15 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 		q->limits.discard_misaligned = 0;
 	}
 
+	if (!dm_table_supports_copy(t)) {
+		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+		/* Must also clear copy limits... */
+		q->limits.max_copy_sectors = 0;
+		q->limits.max_copy_sectors_hw = 0;
+	} else {
+		blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+	}
+
 	if (!dm_table_supports_secure_erase(t))
 		q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b424a6ee27ba..0b04093fbeb2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1690,6 +1690,13 @@ static blk_status_t __split_and_process_bio(struct clone_info *ci)
 	if (unlikely(ci->is_abnormal_io))
 		return __process_abnormal_io(ci, ti);
 
+	if (unlikely(op_is_copy(ci->bio->bi_opf) &&
+			max_io_len(ti, ci->sector) < ci->sector_count)) {
+		DMERR("Error, IO size(%u) > max target size(%llu)",
+			ci->sector_count, (unsigned long long)max_io_len(ti, ci->sector));
+		return BLK_STS_IOERR;
+	}
+
 	/*
 	 * Only support bio polling for normal IO, and the target io is
 	 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 04c6acf7faaa..da4e77e81011 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -379,6 +379,11 @@ struct dm_target {
 	 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 	 */
 	bool needs_bio_set_dev:1;
+
+	/*
+	 * copy offload is supported
+	 */
+	bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 9/9] dm kcopyd: use copy offload support
       [not found]   ` <CGME20230112120229epcas5p38a07a42302d823422960eb11de5d685b@epcas5p3.samsung.com>
@ 2023-01-12 11:59       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:59 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev,
	Nitesh Shetty, linux-block, linux-kernel, linux-nvme,
	linux-fsdevel

Introduce a copy_jobs list to use copy offload when the underlying
devices support it, and fall back to the existing method otherwise.

run_copy_job() calls the block layer copy offload API when the source
and destination share the same request queue and that queue supports
copy offload. On successful completion the destination region's copied
count is set to zero; failed regions are processed via the existing
method.

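For reference, a single-range use of the block layer API looks roughly
like this (kernel-style sketch based on how the API is called below;
copy_one_range() is a made-up helper and the bdev/sector arguments are
whatever the caller has at hand):

/*
 * Single-range, synchronous copy. A NULL endio callback makes
 * blkdev_issue_copy() wait for completion, as the kcopyd use below does.
 */
static int copy_one_range(struct block_device *bdev, sector_t src,
			  sector_t dst, sector_t nr_sects)
{
	struct range_entry range = {
		.src = src << SECTOR_SHIFT,
		.dst = dst << SECTOR_SHIFT,
		.len = nr_sects << SECTOR_SHIFT,
		.comp_len = 0,
	};

	return blkdev_issue_copy(bdev, bdev, &range, 1, NULL, NULL,
				 GFP_KERNEL);
}
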
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/md/dm-kcopyd.c | 56 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 50 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 4d3bbbea2e9a..2f9985f671ac 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -74,18 +74,20 @@ struct dm_kcopyd_client {
 	atomic_t nr_jobs;
 
 /*
- * We maintain four lists of jobs:
+ * We maintain five lists of jobs:
  *
- * i)   jobs waiting for pages
- * ii)  jobs that have pages, and are waiting for the io to be issued.
- * iii) jobs that don't need to do any IO and just run a callback
- * iv) jobs that have completed.
+ * i)    jobs waiting to try copy offload
+ * ii)   jobs waiting for pages
+ * iii)  jobs that have pages, and are waiting for the io to be issued.
+ * iv)   jobs that don't need to do any IO and just run a callback
+ * v)    jobs that have completed.
  *
- * All four of these are protected by job_lock.
+ * All five of these are protected by job_lock.
  */
 	spinlock_t job_lock;
 	struct list_head callback_jobs;
 	struct list_head complete_jobs;
+	struct list_head copy_jobs;
 	struct list_head io_jobs;
 	struct list_head pages_jobs;
 };
@@ -579,6 +581,43 @@ static int run_io_job(struct kcopyd_job *job)
 	return r;
 }
 
+static int run_copy_job(struct kcopyd_job *job)
+{
+	int r, i, count = 0;
+	struct range_entry range;
+
+	struct request_queue *src_q, *dest_q;
+
+	for (i = 0; i < job->num_dests; i++) {
+		range.dst = job->dests[i].sector << SECTOR_SHIFT;
+		range.src = job->source.sector << SECTOR_SHIFT;
+		range.len = job->source.count << SECTOR_SHIFT;
+
+		src_q = bdev_get_queue(job->source.bdev);
+		dest_q = bdev_get_queue(job->dests[i].bdev);
+
+		if (src_q != dest_q || !blk_queue_copy(src_q))
+			break;
+
+		r = blkdev_issue_copy(job->source.bdev, job->dests[i].bdev,
+				&range, 1, NULL, NULL, GFP_KERNEL);
+		if (r)
+			break;
+
+		job->dests[i].count = 0;
+		count++;
+	}
+
+	if (count == job->num_dests) {
+		push(&job->kc->complete_jobs, job);
+	} else {
+		push(&job->kc->pages_jobs, job);
+		r = 0;
+	}
+
+	return r;
+}
+
 static int run_pages_job(struct kcopyd_job *job)
 {
 	int r;
@@ -659,6 +698,7 @@ static void do_work(struct work_struct *work)
 	spin_unlock_irq(&kc->job_lock);
 
 	blk_start_plug(&plug);
+	process_jobs(&kc->copy_jobs, kc, run_copy_job);
 	process_jobs(&kc->complete_jobs, kc, run_complete_job);
 	process_jobs(&kc->pages_jobs, kc, run_pages_job);
 	process_jobs(&kc->io_jobs, kc, run_io_job);
@@ -676,6 +716,8 @@ static void dispatch_job(struct kcopyd_job *job)
 	atomic_inc(&kc->nr_jobs);
 	if (unlikely(!job->source.count))
 		push(&kc->callback_jobs, job);
+	else if (job->source.bdev->bd_disk == job->dests[0].bdev->bd_disk)
+		push(&kc->copy_jobs, job);
 	else if (job->pages == &zero_page_list)
 		push(&kc->io_jobs, job);
 	else
@@ -916,6 +958,7 @@ struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *thro
 	spin_lock_init(&kc->job_lock);
 	INIT_LIST_HEAD(&kc->callback_jobs);
 	INIT_LIST_HEAD(&kc->complete_jobs);
+	INIT_LIST_HEAD(&kc->copy_jobs);
 	INIT_LIST_HEAD(&kc->io_jobs);
 	INIT_LIST_HEAD(&kc->pages_jobs);
 	kc->throttle = throttle;
@@ -971,6 +1014,7 @@ void dm_kcopyd_client_destroy(struct dm_kcopyd_client *kc)
 
 	BUG_ON(!list_empty(&kc->callback_jobs));
 	BUG_ON(!list_empty(&kc->complete_jobs));
+	WARN_ON(!list_empty(&kc->copy_jobs));
 	BUG_ON(!list_empty(&kc->io_jobs));
 	BUG_ON(!list_empty(&kc->pages_jobs));
 	destroy_workqueue(kc->kcopyd_wq);
-- 
2.35.1.500.gb896f729e2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dm-devel] [PATCH v6 9/9] dm kcopyd: use copy offload support
@ 2023-01-12 11:59       ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-12 11:59 UTC (permalink / raw)
  To: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro
  Cc: p.raghav, joshi.k, gost.dev, anuj20.g, linux-kernel, linux-nvme,
	Nitesh Shetty, linux-block, linux-fsdevel, nitheshshetty

Introduce a copy_jobs list to use copy offload when the underlying
devices support it, and fall back to the existing method otherwise.

run_copy_job() calls the block layer copy offload API when the source
and destination share the same request queue and that queue supports
copy offload. On successful completion the destination region's copied
count is set to zero; failed regions are processed via the existing
method.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/md/dm-kcopyd.c | 56 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 50 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 4d3bbbea2e9a..2f9985f671ac 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -74,18 +74,20 @@ struct dm_kcopyd_client {
 	atomic_t nr_jobs;
 
 /*
- * We maintain four lists of jobs:
+ * We maintain five lists of jobs:
  *
- * i)   jobs waiting for pages
- * ii)  jobs that have pages, and are waiting for the io to be issued.
- * iii) jobs that don't need to do any IO and just run a callback
- * iv) jobs that have completed.
+ * i)    jobs waiting to try copy offload
+ * ii)   jobs waiting for pages
+ * iii)  jobs that have pages, and are waiting for the io to be issued.
+ * iv)   jobs that don't need to do any IO and just run a callback
+ * v)    jobs that have completed.
  *
- * All four of these are protected by job_lock.
+ * All five of these are protected by job_lock.
  */
 	spinlock_t job_lock;
 	struct list_head callback_jobs;
 	struct list_head complete_jobs;
+	struct list_head copy_jobs;
 	struct list_head io_jobs;
 	struct list_head pages_jobs;
 };
@@ -579,6 +581,43 @@ static int run_io_job(struct kcopyd_job *job)
 	return r;
 }
 
+static int run_copy_job(struct kcopyd_job *job)
+{
+	int r, i, count = 0;
+	struct range_entry range;
+
+	struct request_queue *src_q, *dest_q;
+
+	for (i = 0; i < job->num_dests; i++) {
+		range.dst = job->dests[i].sector << SECTOR_SHIFT;
+		range.src = job->source.sector << SECTOR_SHIFT;
+		range.len = job->source.count << SECTOR_SHIFT;
+
+		src_q = bdev_get_queue(job->source.bdev);
+		dest_q = bdev_get_queue(job->dests[i].bdev);
+
+		if (src_q != dest_q || !blk_queue_copy(src_q))
+			break;
+
+		r = blkdev_issue_copy(job->source.bdev, job->dests[i].bdev,
+				&range, 1, NULL, NULL, GFP_KERNEL);
+		if (r)
+			break;
+
+		job->dests[i].count = 0;
+		count++;
+	}
+
+	if (count == job->num_dests) {
+		push(&job->kc->complete_jobs, job);
+	} else {
+		push(&job->kc->pages_jobs, job);
+		r = 0;
+	}
+
+	return r;
+}
+
 static int run_pages_job(struct kcopyd_job *job)
 {
 	int r;
@@ -659,6 +698,7 @@ static void do_work(struct work_struct *work)
 	spin_unlock_irq(&kc->job_lock);
 
 	blk_start_plug(&plug);
+	process_jobs(&kc->copy_jobs, kc, run_copy_job);
 	process_jobs(&kc->complete_jobs, kc, run_complete_job);
 	process_jobs(&kc->pages_jobs, kc, run_pages_job);
 	process_jobs(&kc->io_jobs, kc, run_io_job);
@@ -676,6 +716,8 @@ static void dispatch_job(struct kcopyd_job *job)
 	atomic_inc(&kc->nr_jobs);
 	if (unlikely(!job->source.count))
 		push(&kc->callback_jobs, job);
+	else if (job->source.bdev->bd_disk == job->dests[0].bdev->bd_disk)
+		push(&kc->copy_jobs, job);
 	else if (job->pages == &zero_page_list)
 		push(&kc->io_jobs, job);
 	else
@@ -916,6 +958,7 @@ struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *thro
 	spin_lock_init(&kc->job_lock);
 	INIT_LIST_HEAD(&kc->callback_jobs);
 	INIT_LIST_HEAD(&kc->complete_jobs);
+	INIT_LIST_HEAD(&kc->copy_jobs);
 	INIT_LIST_HEAD(&kc->io_jobs);
 	INIT_LIST_HEAD(&kc->pages_jobs);
 	kc->throttle = throttle;
@@ -971,6 +1014,7 @@ void dm_kcopyd_client_destroy(struct dm_kcopyd_client *kc)
 
 	BUG_ON(!list_empty(&kc->callback_jobs));
 	BUG_ON(!list_empty(&kc->complete_jobs));
+	WARN_ON(!list_empty(&kc->copy_jobs));
 	BUG_ON(!list_empty(&kc->io_jobs));
 	BUG_ON(!list_empty(&kc->pages_jobs));
 	destroy_workqueue(kc->kcopyd_wq);
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 2/9] block: Add copy offload support infrastructure
  2023-01-12 11:58       ` [dm-devel] " Nitesh Shetty
@ 2023-01-12 14:43         ` Hannes Reinecke
  -1 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2023-01-12 14:43 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	James Smart, Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-nvme, linux-fsdevel

On 1/12/23 12:58, Nitesh Shetty wrote:
> Introduce blkdev_issue_copy which supports source and destination bdevs,
> and an array of (source, destination and copy length) tuples.
> Introduce REQ_COPY copy offload operation flag. Create a read-write
> bio pair with a token as payload and submitted to the device in order.
> Read request populates token with source specific information which
> is then passed with write request.
> This design is courtesy Mikulas Patocka's token based copy
> 
> Larger copy will be divided, based on max_copy_sectors limit.
> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>   block/blk-lib.c           | 358 ++++++++++++++++++++++++++++++++++++++
>   block/blk.h               |   2 +
>   include/linux/blk_types.h |  44 +++++
>   include/linux/blkdev.h    |   3 +
>   include/uapi/linux/fs.h   |  15 ++
>   5 files changed, 422 insertions(+)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e59c3069e835..2ce3c872ca49 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -115,6 +115,364 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>   }
>   EXPORT_SYMBOL(blkdev_issue_discard);
>   
> +/*
> + * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
> + * This must only be called once all bios have been issued so that the refcount
> + * can only decrease. This just waits for all bios to make it through
> + * bio_copy_*_write_end_io. IO errors are propagated through cio->io_error.
> + */
> +static int cio_await_completion(struct cio *cio)
> +{
> +	int ret = 0;
> +
> +	atomic_dec(&cio->refcount);
> +
> +	if (cio->endio)
> +		return 0;
> +
> +	if (atomic_read(&cio->refcount)) {
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		blk_io_schedule();
> +	}
> +
Wouldn't it be better to use 'atomic_dec_return()' to avoid a potential 
race condition between atomic_dec() and atomic_read()?
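Something like this, perhaps (untested, just to illustrate; it keeps the
endio early return as-is):

	int remaining = atomic_dec_return(&cio->refcount);

	if (cio->endio)
		return 0;

	if (remaining) {
		__set_current_state(TASK_UNINTERRUPTIBLE);
		blk_io_schedule();
	}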

> +	ret = cio->io_err;
> +	kfree(cio);
> +
> +	return ret;
> +}
> +
> +static void blk_copy_offload_write_end_io(struct bio *bio)
> +{
> +	struct copy_ctx *ctx = bio->bi_private;
> +	struct cio *cio = ctx->cio;
> +	sector_t clen;
> +	int ri = ctx->range_idx;
> +
> +	if (bio->bi_status) {
> +		cio->io_err = blk_status_to_errno(bio->bi_status);
> +		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
> +			cio->ranges[ri].dst;
> +		cio->ranges[ri].comp_len = min_t(sector_t, clen,
> +				cio->ranges[ri].comp_len);
> +	}
> +	__free_page(bio->bi_io_vec[0].bv_page);
> +	bio_put(bio);
> +
> +	if (atomic_dec_and_test(&ctx->refcount))
> +		kfree(ctx);
> +	if (atomic_dec_and_test(&cio->refcount)) {

_Two_ atomic_dec() in a row?
Why?

And if that really is required please add a comment.

> +		if (cio->endio) {
> +			cio->endio(cio->private, cio->io_err);
> +			kfree(cio);
> +		} else
> +			blk_wake_io_task(cio->waiter);
> +	}
> +}
> +
> +static void blk_copy_offload_read_end_io(struct bio *read_bio)
> +{
> +	struct copy_ctx *ctx = read_bio->bi_private;
> +	struct cio *cio = ctx->cio;
> +	sector_t clen;
> +	int ri = ctx->range_idx;
> +	unsigned long flags;
> +
> +	if (read_bio->bi_status) {
> +		cio->io_err = blk_status_to_errno(read_bio->bi_status);
> +		goto err_rw_bio;
> +	}
> +
> +	/* For zoned device, we check if completed bio is first entry in linked
> +	 * list,
> +	 * if yes, we start the worker to submit write bios.
> +	 * if not, then we just update status of bio in ctx,
> +	 * once the worker gets scheduled, it will submit writes for all
> +	 * the consecutive REQ_COPY_READ_COMPLETE bios.
> +	 */
> +	if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
> +		spin_lock_irqsave(&cio->list_lock, flags);
> +		ctx->status = REQ_COPY_READ_COMPLETE;
> +		if (ctx == list_first_entry(&cio->list,
> +					struct copy_ctx, list)) {
> +			spin_unlock_irqrestore(&cio->list_lock, flags);
> +			schedule_work(&ctx->dispatch_work);
> +			goto free_read_bio;
> +		}
> +		spin_unlock_irqrestore(&cio->list_lock, flags);
> +	} else
> +		schedule_work(&ctx->dispatch_work);
> +
> +free_read_bio:
> +	bio_put(read_bio);
> +
> +	return;
> +
> +err_rw_bio:
> +	clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
> +					cio->ranges[ri].src;
> +	cio->ranges[ri].comp_len = min_t(sector_t, clen,
> +					cio->ranges[ri].comp_len);
> +	__free_page(read_bio->bi_io_vec[0].bv_page);
> +	bio_put(ctx->write_bio);
> +	bio_put(read_bio);
> +	if (atomic_dec_and_test(&ctx->refcount))
> +		kfree(ctx);
> +	if (atomic_dec_and_test(&cio->refcount)) {

Same here.

> +		if (cio->endio) {
> +			cio->endio(cio->private, cio->io_err);
> +			kfree(cio);
> +		} else
> +			blk_wake_io_task(cio->waiter);
> +	}
> +}
> +
> +static void blk_copy_dispatch_work_fn(struct work_struct *work)
> +{
> +	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
> +			dispatch_work);
> +
> +	submit_bio(ctx->write_bio);
> +}
> +
> +static void blk_zoned_copy_dispatch_work_fn(struct work_struct *work)
> +{
> +	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
> +			dispatch_work);
> +	struct cio *cio = ctx->cio;
> +	unsigned long flags = 0;
> +
> +	atomic_inc(&cio->refcount);
> +	spin_lock_irqsave(&cio->list_lock, flags);
> +
> +	while (!list_empty(&cio->list)) {
> +		ctx = list_first_entry(&cio->list, struct copy_ctx, list);
> +
That is ever so odd; it'll block 'cio->list' for the time of processing.
Wouldn't it be better to move 'cio->list' to a private list, and do away 
with the list_lock during processing?
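
Something like this, maybe (untested sketch; 'local_list' and 'next' are
illustrative names only, and re-queueing of entries still in READ_PROGRESS
is left out):

	struct copy_ctx *next;
	LIST_HEAD(local_list);

	spin_lock_irqsave(&cio->list_lock, flags);
	list_splice_init(&cio->list, &local_list);
	spin_unlock_irqrestore(&cio->list_lock, flags);

	/* writes still go out in read-submission order */
	list_for_each_entry_safe(ctx, next, &local_list, list) {
		if (ctx->status == REQ_COPY_READ_PROGRESS)
			break;
		ctx->status = REQ_COPY_WRITE_PROGRESS;
		submit_bio(ctx->write_bio);
		list_del(&ctx->list);
	}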

> +		if (ctx->status == REQ_COPY_READ_PROGRESS)
> +			break;
> +
> +		atomic_inc(&ctx->refcount);
> +		ctx->status = REQ_COPY_WRITE_PROGRESS;
> +		spin_unlock_irqrestore(&cio->list_lock, flags);
> +		submit_bio(ctx->write_bio);
> +		spin_lock_irqsave(&cio->list_lock, flags);
> +
> +		list_del(&ctx->list);
> +		if (atomic_dec_and_test(&ctx->refcount))
> +			kfree(ctx);
> +	}
> +
> +	spin_unlock_irqrestore(&cio->list_lock, flags);
> +	if (atomic_dec_and_test(&cio->refcount))
> +		blk_wake_io_task(cio->waiter);
> +}
> +
> +/*
> + * blk_copy_offload	- Use device's native copy offload feature.
> + * We perform the copy operation by sending two bios.
> + * 1. First we send a read bio with the REQ_COPY flag along with a token,
> + * the source and the length. Once the read bio reaches the driver layer,
> + * the device driver adds all the source info to the token and does a fake
> + * completion.
> + * 2. Once the read operation completes, we issue a write with the REQ_COPY
> + * flag and the same token. In the driver layer, the token info is used to
> + * form a copy offload command.
> + *
> + * For conventional devices we submit the write bio independently once the
> + * read completes. For zoned devices, reads can complete out of order, so we
> + * maintain a linked list and submit writes in the order the reads were
> + * submitted.
> + */
> +static int blk_copy_offload(struct block_device *src_bdev,
> +		struct block_device *dst_bdev, struct range_entry *ranges,
> +		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
> +{
> +	struct cio *cio;
> +	struct copy_ctx *ctx;
> +	struct bio *read_bio, *write_bio;
> +	struct page *token;
> +	sector_t src_blk, copy_len, dst_blk;
> +	sector_t rem, max_copy_len;
> +	int ri = 0, ret = 0;
> +	unsigned long flags;
> +
> +	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
> +	if (!cio)
> +		return -ENOMEM;
> +	cio->ranges = ranges;
> +	atomic_set(&cio->refcount, 1);
> +	cio->waiter = current;
> +	cio->endio = end_io;
> +	cio->private = private;
> +	if (bdev_is_zoned(dst_bdev)) {
> +		INIT_LIST_HEAD(&cio->list);
> +		spin_lock_init(&cio->list_lock);
> +	}
> +
> +	max_copy_len = min(bdev_max_copy_sectors(src_bdev),
> +			bdev_max_copy_sectors(dst_bdev)) << SECTOR_SHIFT;
> +
> +	for (ri = 0; ri < nr; ri++) {
> +		cio->ranges[ri].comp_len = ranges[ri].len;
> +		src_blk = ranges[ri].src;
> +		dst_blk = ranges[ri].dst;
> +		for (rem = ranges[ri].len; rem > 0; rem -= copy_len) {
> +			copy_len = min(rem, max_copy_len);
> +
> +			token = alloc_page(gfp_mask);
> +			if (unlikely(!token)) {
> +				ret = -ENOMEM;
> +				goto err_token;
> +			}
> +
> +			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
> +			if (!ctx) {
> +				ret = -ENOMEM;
> +				goto err_ctx;
> +			}
> +			read_bio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY
> +					| REQ_SYNC | REQ_NOMERGE, gfp_mask);
> +			if (!read_bio) {
> +				ret = -ENOMEM;
> +				goto err_read_bio;
> +			}
> +			write_bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE
> +					| REQ_COPY | REQ_SYNC | REQ_NOMERGE,
> +					gfp_mask);
> +			if (!write_bio) {
> +				cio->io_err = -ENOMEM;
> +				goto err_write_bio;
> +			}
> +
> +			ctx->cio = cio;
> +			ctx->range_idx = ri;
> +			ctx->write_bio = write_bio;
> +			atomic_set(&ctx->refcount, 1);
> +
> +			if (bdev_is_zoned(dst_bdev)) {
> +				INIT_WORK(&ctx->dispatch_work,
> +					blk_zoned_copy_dispatch_work_fn);
> +				INIT_LIST_HEAD(&ctx->list);
> +				spin_lock_irqsave(&cio->list_lock, flags);
> +				ctx->status = REQ_COPY_READ_PROGRESS;
> +				list_add_tail(&ctx->list, &cio->list);
> +				spin_unlock_irqrestore(&cio->list_lock, flags);
> +			} else
> +				INIT_WORK(&ctx->dispatch_work,
> +					blk_copy_dispatch_work_fn);
> +
> +			__bio_add_page(read_bio, token, PAGE_SIZE, 0);
> +			read_bio->bi_iter.bi_size = copy_len;
> +			read_bio->bi_iter.bi_sector = src_blk >> SECTOR_SHIFT;
> +			read_bio->bi_end_io = blk_copy_offload_read_end_io;
> +			read_bio->bi_private = ctx;
> +
> +			__bio_add_page(write_bio, token, PAGE_SIZE, 0);
> +			write_bio->bi_iter.bi_size = copy_len;
> +			write_bio->bi_end_io = blk_copy_offload_write_end_io;
> +			write_bio->bi_iter.bi_sector = dst_blk >> SECTOR_SHIFT;
> +			write_bio->bi_private = ctx;
> +
> +			atomic_inc(&cio->refcount);
> +			submit_bio(read_bio);
> +			src_blk += copy_len;
> +			dst_blk += copy_len;
> +		}
> +	}
> +
> +	/* Wait for completion of all IO's*/
> +	return cio_await_completion(cio);
> +
> +err_write_bio:
> +	bio_put(read_bio);
> +err_read_bio:
> +	kfree(ctx);
> +err_ctx:
> +	__free_page(token);
> +err_token:
> +	ranges[ri].comp_len = min_t(sector_t,
> +			ranges[ri].comp_len, (ranges[ri].len - rem));
> +
> +	cio->io_err = ret;
> +	return cio_await_completion(cio);
> +}
> +
> +static inline int blk_copy_sanity_check(struct block_device *src_bdev,
> +	struct block_device *dst_bdev, struct range_entry *ranges, int nr)
> +{
> +	unsigned int align_mask = max(bdev_logical_block_size(dst_bdev),
> +					bdev_logical_block_size(src_bdev)) - 1;
> +	sector_t len = 0;
> +	int i;
> +
> +	if (!nr)
> +		return -EINVAL;
> +
> +	if (nr >= MAX_COPY_NR_RANGE)
> +		return -EINVAL;
> +
> +	if (bdev_read_only(dst_bdev))
> +		return -EPERM;
> +
> +	for (i = 0; i < nr; i++) {
> +		if (!ranges[i].len)
> +			return -EINVAL;
> +
> +		len += ranges[i].len;
> +		if ((ranges[i].dst & align_mask) ||
> +				(ranges[i].src & align_mask) ||
> +				(ranges[i].len & align_mask))
> +			return -EINVAL;
> +		ranges[i].comp_len = 0;
> +	}
> +
> +	if (len && len >= MAX_COPY_TOTAL_LENGTH)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static inline bool blk_check_copy_offload(struct request_queue *src_q,
> +		struct request_queue *dst_q)
> +{
> +	return blk_queue_copy(dst_q) && blk_queue_copy(src_q);
> +}
> +
> +/*
> + * blkdev_issue_copy - queue a copy
> + * @src_bdev:	source block device
> + * @dst_bdev:	destination block device
> + * @ranges:	array of source/dest/len,
> + *		ranges are expected to be allocated/freed by caller
> + * @nr:		number of source ranges to copy
> + * @end_io:	end_io function to be called on completion of copy operation,
> + *		for synchronous operation this should be NULL
> + * @private:	end_io function will be called with this private data, should be
> + *		NULL, if operation is synchronous in nature
> + * @gfp_mask:   memory allocation flags (for bio_alloc)
> + *
> + * Description:
> + *	Copy source ranges from source block device to destination block
> + *	device. length of a source range cannot be zero. Max total length of
> + *	copy is limited to MAX_COPY_TOTAL_LENGTH and also maximum number of
> + *	entries is limited to MAX_COPY_NR_RANGE
> + */
> +int blkdev_issue_copy(struct block_device *src_bdev,
> +	struct block_device *dst_bdev, struct range_entry *ranges, int nr,
> +	cio_iodone_t end_io, void *private, gfp_t gfp_mask)
> +{
> +	struct request_queue *src_q = bdev_get_queue(src_bdev);
> +	struct request_queue *dst_q = bdev_get_queue(dst_bdev);
> +	int ret = -EINVAL;
> +
> +	ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
> +	if (ret)
> +		return ret;
> +
> +	if (blk_check_copy_offload(src_q, dst_q))
> +		ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
> +				end_io, private, gfp_mask);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(blkdev_issue_copy);
> +
>   static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>   		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
>   		struct bio **biop, unsigned flags)
> diff --git a/block/blk.h b/block/blk.h
> index 4c3b3325219a..6d9924a7d559 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -304,6 +304,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
>   		break;
>   	}
>   
> +	if (unlikely(op_is_copy(bio->bi_opf)))
> +		return false;
>   	/*
>   	 * All drivers must accept single-segments bios that are <= PAGE_SIZE.
>   	 * This is a quick and dirty check that relies on the fact that
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 99be590f952f..de1638c87ecf 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -422,6 +422,7 @@ enum req_flag_bits {
>   	 */
>   	/* for REQ_OP_WRITE_ZEROES: */
>   	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
> +	__REQ_COPY,		/* copy request */
>   
>   	__REQ_NR_BITS,		/* stops here */
>   };
> @@ -451,6 +452,7 @@ enum req_flag_bits {
>   
>   #define REQ_DRV		(__force blk_opf_t)(1ULL << __REQ_DRV)
>   #define REQ_SWAP	(__force blk_opf_t)(1ULL << __REQ_SWAP)
> +#define REQ_COPY	((__force blk_opf_t)(1ULL << __REQ_COPY))
>   
>   #define REQ_FAILFAST_MASK \
>   	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
> @@ -477,6 +479,11 @@ static inline bool op_is_write(blk_opf_t op)
>   	return !!(op & (__force blk_opf_t)1);
>   }
>   
> +static inline bool op_is_copy(blk_opf_t op)
> +{
> +	return (op & REQ_COPY);
> +}
> +
>   /*
>    * Check if the bio or request is one that needs special treatment in the
>    * flush state machine.
> @@ -536,4 +543,41 @@ struct blk_rq_stat {
>   	u64 batch;
>   };
>   
> +typedef void (cio_iodone_t)(void *private, int status);
> +
> +struct cio {
> +	struct range_entry *ranges;
> +	struct task_struct *waiter;     /* waiting task (NULL if none) */
> +	atomic_t refcount;
> +	int io_err;
> +	cio_iodone_t *endio;		/* applicable for async operation */
> +	void *private;			/* applicable for async operation */
> +
> +	/* For zoned device we maintain a linked list of IO submissions.
> +	 * This is to make sure we maintain the order of submissions.
> +	 * Otherwise some reads completing out of order, will submit writes not
> +	 * aligned with zone write pointer.
> +	 */
> +	struct list_head list;
> +	spinlock_t list_lock;
> +};
> +
> +enum copy_io_status {
> +	REQ_COPY_READ_PROGRESS,
> +	REQ_COPY_READ_COMPLETE,
> +	REQ_COPY_WRITE_PROGRESS,
> +};
> +
> +struct copy_ctx {
> +	struct cio *cio;
> +	struct work_struct dispatch_work;
> +	struct bio *write_bio;
> +	atomic_t refcount;
> +	int range_idx;			/* used in error/partial completion */
> +
> +	/* For zoned device linked list is maintained. Along with state of IO */
> +	struct list_head list;
> +	enum copy_io_status status;
> +};
> +
>   #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 807ffb5f715d..48e9160b7195 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1063,6 +1063,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>   		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
>   int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
>   		sector_t nr_sects, gfp_t gfp);
> +int blkdev_issue_copy(struct block_device *src_bdev,
> +		struct block_device *dst_bdev, struct range_entry *ranges,
> +		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
>   
>   #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
>   #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b3ad173f619c..9248b6d259de 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -67,6 +67,21 @@ struct fstrim_range {
>   /* maximum total copy length */
>   #define MAX_COPY_TOTAL_LENGTH	(1 << 27)
>   
> +/* Maximum no of entries supported */
> +#define MAX_COPY_NR_RANGE	(1 << 12)
> +
> +/* range entry for copy offload, all fields should be byte addressed */
> +struct range_entry {
> +	__u64 src;		/* source to be copied */
> +	__u64 dst;		/* destination */
> +	__u64 len;		/* length in bytes to be copied */
> +
> +	/* length of data copy actually completed. This will be filled by
> +	 * kernel, once copy completes
> +	 */
> +	__u64 comp_len;
> +};
> +
>   /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
>   #define FILE_DEDUPE_RANGE_SAME		0
>   #define FILE_DEDUPE_RANGE_DIFFERS	1

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 3/9] block: add emulation for copy
  2023-01-12 11:58       ` [dm-devel] " Nitesh Shetty
@ 2023-01-12 14:46         ` Hannes Reinecke
  -1 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2023-01-12 14:46 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	James Smart, Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev, Vincent Fu,
	linux-block, linux-kernel, linux-nvme, linux-fsdevel

On 1/12/23 12:58, Nitesh Shetty wrote:
> For devices which do not support copy, copy emulation is added.
> Copy emulation is implemented by reading from the source ranges into
> memory and writing to the corresponding destinations asynchronously.
> For zoned devices we maintain a linked list of read submissions and try
> to submit the corresponding writes in the same order.
> Emulation is also used if copy offload fails or partially completes.
> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>   block/blk-lib.c        | 241 ++++++++++++++++++++++++++++++++++++++++-
>   block/blk-map.c        |   4 +-
>   include/linux/blkdev.h |   3 +
>   3 files changed, 245 insertions(+), 3 deletions(-)
> 
I'm not sure if I agree with this one.

You just submitted a patch for device-mapper to implement copy offload, 
which (to all intents and purposes) _is_ an emulation.

So why do we need to implement it in the block layer as an emulation?
Or, if we have to, why do we need the device-mapper emulation?
This emulation will be doing the same thing, no?

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 3/9] block: add emulation for copy
  2023-01-12 14:46         ` [dm-devel] " Hannes Reinecke
@ 2023-01-12 14:48           ` Hannes Reinecke
  -1 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2023-01-12 14:48 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Alasdair Kergon, Mike Snitzer,
	dm-devel, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	James Smart, Chaitanya Kulkarni, Alexander Viro
  Cc: anuj20.g, joshi.k, p.raghav, nitheshshetty, gost.dev, Vincent Fu,
	linux-block, linux-kernel, linux-nvme, linux-fsdevel

On 1/12/23 15:46, Hannes Reinecke wrote:
> On 1/12/23 12:58, Nitesh Shetty wrote:
>> For devices which do not support copy, copy emulation is added.
>> Copy emulation is implemented by reading from the source ranges into
>> memory and writing to the corresponding destinations asynchronously.
>> For zoned devices we maintain a linked list of read submissions and try
>> to submit the corresponding writes in the same order.
>> Emulation is also used if copy offload fails or partially completes.
>>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> ---
>>   block/blk-lib.c        | 241 ++++++++++++++++++++++++++++++++++++++++-
>>   block/blk-map.c        |   4 +-
>>   include/linux/blkdev.h |   3 +
>>   3 files changed, 245 insertions(+), 3 deletions(-)
>>
> I'm not sure if I agree with this one.
> 
> You just submitted a patch for device-mapper to implement copy offload, 
> which (to all intents and purposes) _is_ an emulation.
> 
> So why do we need to implement it in the block layer as an emulation?
> Or, if we have to, why do we need the device-mapper emulation?
> This emulation will be doing the same thing, no?
> 
Sheesh. One should read the entire patchset.

Disregard the above comment.

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 2/9] block: Add copy offload support infrastructure
  2023-01-12 14:43         ` [dm-devel] " Hannes Reinecke
@ 2023-01-13  8:26           ` Nitesh Shetty
  -1 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-13  8:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, James Smart,
	Chaitanya Kulkarni, Alexander Viro, anuj20.g, joshi.k, p.raghav,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-nvme,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 19734 bytes --]

On Thu, Jan 12, 2023 at 03:43:07PM +0100, Hannes Reinecke wrote:
> On 1/12/23 12:58, Nitesh Shetty wrote:
> > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > and an array of (source, destination and copy length) tuples.
> > Introduce REQ_COPY copy offload operation flag. A read-write bio pair is
> > created with a token as payload and submitted to the device in order.
> > The read request populates the token with source-specific information,
> > which is then passed along with the write request.
> > This design is courtesy of Mikulas Patocka's token-based copy approach.
> > 
> > Larger copies will be split based on the max_copy_sectors limit.
> > 
> > Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> > ---
> >   block/blk-lib.c           | 358 ++++++++++++++++++++++++++++++++++++++
> >   block/blk.h               |   2 +
> >   include/linux/blk_types.h |  44 +++++
> >   include/linux/blkdev.h    |   3 +
> >   include/uapi/linux/fs.h   |  15 ++
> >   5 files changed, 422 insertions(+)
> > 
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index e59c3069e835..2ce3c872ca49 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -115,6 +115,364 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> >   }
> >   EXPORT_SYMBOL(blkdev_issue_discard);
> > +/*
> > + * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
> > + * This must only be called once all bios have been issued so that the refcount
> > + * can only decrease. This just waits for all bios to make it through
> > + * bio_copy_*_write_end_io. IO errors are propagated through cio->io_error.
> > + */
> > +static int cio_await_completion(struct cio *cio)
> > +{
> > +	int ret = 0;
> > +
> > +	atomic_dec(&cio->refcount);
> > +
> > +	if (cio->endio)
> > +		return 0;
> > +
> > +	if (atomic_read(&cio->refcount)) {
> > +		__set_current_state(TASK_UNINTERRUPTIBLE);
> > +		blk_io_schedule();
> > +	}
> > +
> Wouldn't it be better to use 'atomic_dec_return()' to avoid a potential race
> condition between atomic_dec() and atomic_read()?
> 

cio keeps the total number of submitted IOs. For async copy (with an endio
handler) we just return after decrementing the refcount; if we used
atomic_dec_return(), we would need that endio check twice.
Also, this function is called only after all submissions are complete, so a
race between the decrement and the read shouldn't happen with this ordering.
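
i.e. the intended submission flow is roughly (sketch):

	/*
	 * atomic_set(&cio->refcount, 1);      initial ref, held by the submitter
	 * for each chunk:
	 *         atomic_inc(&cio->refcount);  one ref per in-flight copy
	 *         submit_bio(read_bio);
	 * cio_await_completion(cio);          drops the initial ref, then either
	 *                                     returns (async/endio) or waits (sync)
	 */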

> > +	ret = cio->io_err;
> > +	kfree(cio);
> > +
> > +	return ret;
> > +}
> > +
> > +static void blk_copy_offload_write_end_io(struct bio *bio)
> > +{
> > +	struct copy_ctx *ctx = bio->bi_private;
> > +	struct cio *cio = ctx->cio;
> > +	sector_t clen;
> > +	int ri = ctx->range_idx;
> > +
> > +	if (bio->bi_status) {
> > +		cio->io_err = blk_status_to_errno(bio->bi_status);
> > +		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
> > +			cio->ranges[ri].dst;
> > +		cio->ranges[ri].comp_len = min_t(sector_t, clen,
> > +				cio->ranges[ri].comp_len);
> > +	}
> > +	__free_page(bio->bi_io_vec[0].bv_page);
> > +	bio_put(bio);
> > +
> > +	if (atomic_dec_and_test(&ctx->refcount))
> > +		kfree(ctx);
> > +	if (atomic_dec_and_test(&cio->refcount)) {
> 
> _Two_ atomic_dec() in a row?
> Why?
> 
> And if that really is required please add a comment.
> 

cio is used to keep track of all the submitted IOs.

ctx is used to keep track of a single IO. Each IO again has 4 parts:
1. READ submission (process/submitter context)
2. READ completion (interrupt context; we can't submit the WRITE bio here,
   so we schedule a worker to submit it)
3. WRITE submission by the worker (process context)
4. WRITE completion (interrupt context)
So there is concurrent access to ctx.
Especially when the IO targets a zoned device, we need to maintain the order
of READ submissions so that the WRITE order matches the READ order.

So the cio and ctx refcounts serve different purposes and are updated
accordingly. Agreed, I can add better comments in the next version.
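
Something along these lines, perhaps (untested draft of the comment):

	/*
	 * cio->refcount: one reference per in-flight (read, write) bio pair
	 *                plus the submitter's initial reference; the last
	 *                dropper either calls ->endio() (async) or wakes the
	 *                waiting submitter (sync).
	 * ctx->refcount: per-chunk; shared between the read/write completion
	 *                paths and, for zoned devices, the dispatch worker.
	 */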

> > +		if (cio->endio) {
> > +			cio->endio(cio->private, cio->io_err);
> > +			kfree(cio);
> > +		} else
> > +			blk_wake_io_task(cio->waiter);
> > +	}
> > +}
> > +
> > +static void blk_copy_offload_read_end_io(struct bio *read_bio)
> > +{
> > +	struct copy_ctx *ctx = read_bio->bi_private;
> > +	struct cio *cio = ctx->cio;
> > +	sector_t clen;
> > +	int ri = ctx->range_idx;
> > +	unsigned long flags;
> > +
> > +	if (read_bio->bi_status) {
> > +		cio->io_err = blk_status_to_errno(read_bio->bi_status);
> > +		goto err_rw_bio;
> > +	}
> > +
> > +	/* For zoned device, we check if completed bio is first entry in linked
> > +	 * list,
> > +	 * if yes, we start the worker to submit write bios.
> > +	 * if not, then we just update status of bio in ctx,
> > +	 * once the worker gets scheduled, it will submit writes for all
> > +	 * the consecutive REQ_COPY_READ_COMPLETE bios.
> > +	 */
> > +	if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
> > +		spin_lock_irqsave(&cio->list_lock, flags);
> > +		ctx->status = REQ_COPY_READ_COMPLETE;
> > +		if (ctx == list_first_entry(&cio->list,
> > +					struct copy_ctx, list)) {
> > +			spin_unlock_irqrestore(&cio->list_lock, flags);
> > +			schedule_work(&ctx->dispatch_work);
> > +			goto free_read_bio;
> > +		}
> > +		spin_unlock_irqrestore(&cio->list_lock, flags);
> > +	} else
> > +		schedule_work(&ctx->dispatch_work);
> > +
> > +free_read_bio:
> > +	bio_put(read_bio);
> > +
> > +	return;
> > +
> > +err_rw_bio:
> > +	clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
> > +					cio->ranges[ri].src;
> > +	cio->ranges[ri].comp_len = min_t(sector_t, clen,
> > +					cio->ranges[ri].comp_len);
> > +	__free_page(read_bio->bi_io_vec[0].bv_page);
> > +	bio_put(ctx->write_bio);
> > +	bio_put(read_bio);
> > +	if (atomic_dec_and_test(&ctx->refcount))
> > +		kfree(ctx);
> > +	if (atomic_dec_and_test(&cio->refcount)) {
> 
> Same here.
> 
> > +		if (cio->endio) {
> > +			cio->endio(cio->private, cio->io_err);
> > +			kfree(cio);
> > +		} else
> > +			blk_wake_io_task(cio->waiter);
> > +	}
> > +}
> > +
> > +static void blk_copy_dispatch_work_fn(struct work_struct *work)
> > +{
> > +	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
> > +			dispatch_work);
> > +
> > +	submit_bio(ctx->write_bio);
> > +}
> > +
> > +static void blk_zoned_copy_dispatch_work_fn(struct work_struct *work)
> > +{
> > +	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
> > +			dispatch_work);
> > +	struct cio *cio = ctx->cio;
> > +	unsigned long flags = 0;
> > +
> > +	atomic_inc(&cio->refcount);
> > +	spin_lock_irqsave(&cio->list_lock, flags);
> > +
> > +	while (!list_empty(&cio->list)) {
> > +		ctx = list_first_entry(&cio->list, struct copy_ctx, list);
> > +
> That is ever so odd; it'll block 'cio->list' for the time of processing.
> Wouldn't it be better to move 'cio->list' to a private list, and do away
> with the list_lock during processing?
> 

For zoned devices we need to maintain the ordering of IOs, because writes
can't be issued out of order, so we maintain this list.
Again, this IO list is accessed concurrently by
a. READ submission
b. READ completion
c. WRITE submission (worker)
so moving it to a private list won't be possible, I feel.
Since we are using a global list, we need the lock to synchronize access to it.
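
For example (sketch of the zoned case, two consecutive chunks A and B):

	/*
	 * READ A submitted, READ B submitted     (list order: A, B)
	 * READ B completes first                 -> only marked READ_COMPLETE
	 * READ A completes                       -> worker walks cio->list in
	 *                                           order: WRITE A, then WRITE B
	 *
	 * so writes reach the zone in the order the reads were submitted.
	 */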

> > +		if (ctx->status == REQ_COPY_READ_PROGRESS)
> > +			break;
> > +
> > +		atomic_inc(&ctx->refcount);
> > +		ctx->status = REQ_COPY_WRITE_PROGRESS;
> > +		spin_unlock_irqrestore(&cio->list_lock, flags);
> > +		submit_bio(ctx->write_bio);
> > +		spin_lock_irqsave(&cio->list_lock, flags);
> > +
> > +		list_del(&ctx->list);
> > +		if (atomic_dec_and_test(&ctx->refcount))
> > +			kfree(ctx);
> > +	}
> > +
> > +	spin_unlock_irqrestore(&cio->list_lock, flags);
> > +	if (atomic_dec_and_test(&cio->refcount))
> > +		blk_wake_io_task(cio->waiter);
> > +}
> > +
> > +/*
> > + * blk_copy_offload	- Use device's native copy offload feature.
> > + * We perform the copy operation by sending two bios.
> > + * 1. First we send a read bio with the REQ_COPY flag along with a token,
> > + * the source and the length. Once the read bio reaches the driver layer,
> > + * the device driver adds all the source info to the token and does a fake
> > + * completion.
> > + * 2. Once the read operation completes, we issue a write with the REQ_COPY
> > + * flag and the same token. In the driver layer, the token info is used to
> > + * form a copy offload command.
> > + *
> > + * For conventional devices we submit the write bio independently once the
> > + * read completes. For zoned devices, reads can complete out of order, so we
> > + * maintain a linked list and submit writes in the order the reads were
> > + * submitted.
> > + */
> > +static int blk_copy_offload(struct block_device *src_bdev,
> > +		struct block_device *dst_bdev, struct range_entry *ranges,
> > +		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
> > +{
> > +	struct cio *cio;
> > +	struct copy_ctx *ctx;
> > +	struct bio *read_bio, *write_bio;
> > +	struct page *token;
> > +	sector_t src_blk, copy_len, dst_blk;
> > +	sector_t rem, max_copy_len;
> > +	int ri = 0, ret = 0;
> > +	unsigned long flags;
> > +
> > +	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
> > +	if (!cio)
> > +		return -ENOMEM;
> > +	cio->ranges = ranges;
> > +	atomic_set(&cio->refcount, 1);
> > +	cio->waiter = current;
> > +	cio->endio = end_io;
> > +	cio->private = private;
> > +	if (bdev_is_zoned(dst_bdev)) {
> > +		INIT_LIST_HEAD(&cio->list);
> > +		spin_lock_init(&cio->list_lock);
> > +	}
> > +
> > +	max_copy_len = min(bdev_max_copy_sectors(src_bdev),
> > +			bdev_max_copy_sectors(dst_bdev)) << SECTOR_SHIFT;
> > +
> > +	for (ri = 0; ri < nr; ri++) {
> > +		cio->ranges[ri].comp_len = ranges[ri].len;
> > +		src_blk = ranges[ri].src;
> > +		dst_blk = ranges[ri].dst;
> > +		for (rem = ranges[ri].len; rem > 0; rem -= copy_len) {
> > +			copy_len = min(rem, max_copy_len);
> > +
> > +			token = alloc_page(gfp_mask);
> > +			if (unlikely(!token)) {
> > +				ret = -ENOMEM;
> > +				goto err_token;
> > +			}
> > +
> > +			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
> > +			if (!ctx) {
> > +				ret = -ENOMEM;
> > +				goto err_ctx;
> > +			}
> > +			read_bio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY
> > +					| REQ_SYNC | REQ_NOMERGE, gfp_mask);
> > +			if (!read_bio) {
> > +				ret = -ENOMEM;
> > +				goto err_read_bio;
> > +			}
> > +			write_bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE
> > +					| REQ_COPY | REQ_SYNC | REQ_NOMERGE,
> > +					gfp_mask);
> > +			if (!write_bio) {
> > +				cio->io_err = -ENOMEM;
> > +				goto err_write_bio;
> > +			}
> > +
> > +			ctx->cio = cio;
> > +			ctx->range_idx = ri;
> > +			ctx->write_bio = write_bio;
> > +			atomic_set(&ctx->refcount, 1);
> > +
> > +			if (bdev_is_zoned(dst_bdev)) {
> > +				INIT_WORK(&ctx->dispatch_work,
> > +					blk_zoned_copy_dispatch_work_fn);
> > +				INIT_LIST_HEAD(&ctx->list);
> > +				spin_lock_irqsave(&cio->list_lock, flags);
> > +				ctx->status = REQ_COPY_READ_PROGRESS;
> > +				list_add_tail(&ctx->list, &cio->list);
> > +				spin_unlock_irqrestore(&cio->list_lock, flags);
> > +			} else
> > +				INIT_WORK(&ctx->dispatch_work,
> > +					blk_copy_dispatch_work_fn);
> > +
> > +			__bio_add_page(read_bio, token, PAGE_SIZE, 0);
> > +			read_bio->bi_iter.bi_size = copy_len;
> > +			read_bio->bi_iter.bi_sector = src_blk >> SECTOR_SHIFT;
> > +			read_bio->bi_end_io = blk_copy_offload_read_end_io;
> > +			read_bio->bi_private = ctx;
> > +
> > +			__bio_add_page(write_bio, token, PAGE_SIZE, 0);
> > +			write_bio->bi_iter.bi_size = copy_len;
> > +			write_bio->bi_end_io = blk_copy_offload_write_end_io;
> > +			write_bio->bi_iter.bi_sector = dst_blk >> SECTOR_SHIFT;
> > +			write_bio->bi_private = ctx;
> > +
> > +			atomic_inc(&cio->refcount);
> > +			submit_bio(read_bio);
> > +			src_blk += copy_len;
> > +			dst_blk += copy_len;
> > +		}
> > +	}
> > +
> > +	/* Wait for completion of all IO's*/
> > +	return cio_await_completion(cio);
> > +
> > +err_write_bio:
> > +	bio_put(read_bio);
> > +err_read_bio:
> > +	kfree(ctx);
> > +err_ctx:
> > +	__free_page(token);
> > +err_token:
> > +	ranges[ri].comp_len = min_t(sector_t,
> > +			ranges[ri].comp_len, (ranges[ri].len - rem));
> > +
> > +	cio->io_err = ret;
> > +	return cio_await_completion(cio);
> > +}
> > +
> > +static inline int blk_copy_sanity_check(struct block_device *src_bdev,
> > +	struct block_device *dst_bdev, struct range_entry *ranges, int nr)
> > +{
> > +	unsigned int align_mask = max(bdev_logical_block_size(dst_bdev),
> > +					bdev_logical_block_size(src_bdev)) - 1;
> > +	sector_t len = 0;
> > +	int i;
> > +
> > +	if (!nr)
> > +		return -EINVAL;
> > +
> > +	if (nr >= MAX_COPY_NR_RANGE)
> > +		return -EINVAL;
> > +
> > +	if (bdev_read_only(dst_bdev))
> > +		return -EPERM;
> > +
> > +	for (i = 0; i < nr; i++) {
> > +		if (!ranges[i].len)
> > +			return -EINVAL;
> > +
> > +		len += ranges[i].len;
> > +		if ((ranges[i].dst & align_mask) ||
> > +				(ranges[i].src & align_mask) ||
> > +				(ranges[i].len & align_mask))
> > +			return -EINVAL;
> > +		ranges[i].comp_len = 0;
> > +	}
> > +
> > +	if (len && len >= MAX_COPY_TOTAL_LENGTH)
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
> > +
> > +static inline bool blk_check_copy_offload(struct request_queue *src_q,
> > +		struct request_queue *dst_q)
> > +{
> > +	return blk_queue_copy(dst_q) && blk_queue_copy(src_q);
> > +}
> > +
> > +/*
> > + * blkdev_issue_copy - queue a copy
> > + * @src_bdev:	source block device
> > + * @dst_bdev:	destination block device
> > + * @ranges:	array of source/dest/len,
> > + *		ranges are expected to be allocated/freed by caller
> > + * @nr:		number of source ranges to copy
> > + * @end_io:	end_io function to be called on completion of copy operation,
> > + *		for synchronous operation this should be NULL
> > + * @private:	end_io function will be called with this private data, should be
> > + *		NULL, if operation is synchronous in nature
> > + * @gfp_mask:   memory allocation flags (for bio_alloc)
> > + *
> > + * Description:
> > + *	Copy source ranges from source block device to destination block
> > + *	device. length of a source range cannot be zero. Max total length of
> > + *	copy is limited to MAX_COPY_TOTAL_LENGTH and also maximum number of
> > + *	entries is limited to MAX_COPY_NR_RANGE
> > + */
> > +int blkdev_issue_copy(struct block_device *src_bdev,
> > +	struct block_device *dst_bdev, struct range_entry *ranges, int nr,
> > +	cio_iodone_t end_io, void *private, gfp_t gfp_mask)
> > +{
> > +	struct request_queue *src_q = bdev_get_queue(src_bdev);
> > +	struct request_queue *dst_q = bdev_get_queue(dst_bdev);
> > +	int ret = -EINVAL;
> > +
> > +	ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (blk_check_copy_offload(src_q, dst_q))
> > +		ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
> > +				end_io, private, gfp_mask);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(blkdev_issue_copy);
> > +
> >   static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >   		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
> >   		struct bio **biop, unsigned flags)
> > diff --git a/block/blk.h b/block/blk.h
> > index 4c3b3325219a..6d9924a7d559 100644
> > --- a/block/blk.h
> > +++ b/block/blk.h
> > @@ -304,6 +304,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
> >   		break;
> >   	}
> > +	if (unlikely(op_is_copy(bio->bi_opf)))
> > +		return false;
> >   	/*
> >   	 * All drivers must accept single-segments bios that are <= PAGE_SIZE.
> >   	 * This is a quick and dirty check that relies on the fact that
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index 99be590f952f..de1638c87ecf 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -422,6 +422,7 @@ enum req_flag_bits {
> >   	 */
> >   	/* for REQ_OP_WRITE_ZEROES: */
> >   	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
> > +	__REQ_COPY,		/* copy request */
> >   	__REQ_NR_BITS,		/* stops here */
> >   };
> > @@ -451,6 +452,7 @@ enum req_flag_bits {
> >   #define REQ_DRV		(__force blk_opf_t)(1ULL << __REQ_DRV)
> >   #define REQ_SWAP	(__force blk_opf_t)(1ULL << __REQ_SWAP)
> > +#define REQ_COPY	((__force blk_opf_t)(1ULL << __REQ_COPY))
> >   #define REQ_FAILFAST_MASK \
> >   	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
> > @@ -477,6 +479,11 @@ static inline bool op_is_write(blk_opf_t op)
> >   	return !!(op & (__force blk_opf_t)1);
> >   }
> > +static inline bool op_is_copy(blk_opf_t op)
> > +{
> > +	return (op & REQ_COPY);
> > +}
> > +
> >   /*
> >    * Check if the bio or request is one that needs special treatment in the
> >    * flush state machine.
> > @@ -536,4 +543,41 @@ struct blk_rq_stat {
> >   	u64 batch;
> >   };
> > +typedef void (cio_iodone_t)(void *private, int status);
> > +
> > +struct cio {
> > +	struct range_entry *ranges;
> > +	struct task_struct *waiter;     /* waiting task (NULL if none) */
> > +	atomic_t refcount;
> > +	int io_err;
> > +	cio_iodone_t *endio;		/* applicable for async operation */
> > +	void *private;			/* applicable for async operation */
> > +
> > +	/* For zoned devices we maintain a linked list of IO submissions.
> > +	 * This is to make sure we maintain the order of submissions.
> > +	 * Otherwise reads completing out of order would submit writes that
> > +	 * are not aligned with the zone write pointer.
> > +	 */
> > +	struct list_head list;
> > +	spinlock_t list_lock;
> > +};
> > +
> > +enum copy_io_status {
> > +	REQ_COPY_READ_PROGRESS,
> > +	REQ_COPY_READ_COMPLETE,
> > +	REQ_COPY_WRITE_PROGRESS,
> > +};
> > +
> > +struct copy_ctx {
> > +	struct cio *cio;
> > +	struct work_struct dispatch_work;
> > +	struct bio *write_bio;
> > +	atomic_t refcount;
> > +	int range_idx;			/* used in error/partial completion */
> > +
> > +	/* For zoned devices a linked list is maintained, along with the IO state */
> > +	struct list_head list;
> > +	enum copy_io_status status;
> > +};
> > +
> >   #endif /* __LINUX_BLK_TYPES_H */
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 807ffb5f715d..48e9160b7195 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1063,6 +1063,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> >   		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
> >   int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
> >   		sector_t nr_sects, gfp_t gfp);
> > +int blkdev_issue_copy(struct block_device *src_bdev,
> > +		struct block_device *dst_bdev, struct range_entry *ranges,
> > +		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
> >   #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
> >   #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index b3ad173f619c..9248b6d259de 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -67,6 +67,21 @@ struct fstrim_range {
> >   /* maximum total copy length */
> >   #define MAX_COPY_TOTAL_LENGTH	(1 << 27)
> > +/* Maximum number of entries supported */
> > +#define MAX_COPY_NR_RANGE	(1 << 12)
> > +
> > +/* range entry for copy offload, all fields should be byte addressed */
> > +struct range_entry {
> > +	__u64 src;		/* source to be copied */
> > +	__u64 dst;		/* destination */
> > +	__u64 len;		/* length in bytes to be copied */
> > +
> > +	/* length of the data copy actually completed. This will be filled by
> > +	 * the kernel once the copy completes
> > +	 */
> > +	__u64 comp_len;
> > +};
> > +
> >   /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
> >   #define FILE_DEDUPE_RANGE_SAME		0
> >   #define FILE_DEDUPE_RANGE_DIFFERS	1
> 
> Cheers,
> 
> Hannes
> 

Thanks,
Nitesh Shetty





* Re: [dm-devel] [PATCH v6 2/9] block: Add copy offload support infrastructure
@ 2023-01-13  8:26           ` Nitesh Shetty
  0 siblings, 0 replies; 26+ messages in thread
From: Nitesh Shetty @ 2023-01-13  8:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, p.raghav, Chaitanya Kulkarni, Sagi Grimberg, joshi.k,
	gost.dev, anuj20.g, Mike Snitzer, James Smart, linux-nvme,
	linux-kernel, linux-block, linux-fsdevel, dm-devel,
	Alexander Viro, nitheshshetty, Keith Busch, Christoph Hellwig,
	Alasdair Kergon

[-- Attachment #1: Type: text/plain, Size: 19734 bytes --]

On Thu, Jan 12, 2023 at 03:43:07PM +0100, Hannes Reinecke wrote:
> On 1/12/23 12:58, Nitesh Shetty wrote:
> > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > and an array of (source, destination and copy length) tuples.
> > Introduce the REQ_COPY copy offload operation flag. Create a read-write
> > bio pair with a token as payload and submit it to the device in order.
> > The read request populates the token with source-specific information,
> > which is then passed along with the write request.
> > This design is courtesy of Mikulas Patocka's token-based copy approach.
> > 
> > A larger copy will be split based on the max_copy_sectors limit.
> > 
> > Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> > ---
> >   block/blk-lib.c           | 358 ++++++++++++++++++++++++++++++++++++++
> >   block/blk.h               |   2 +
> >   include/linux/blk_types.h |  44 +++++
> >   include/linux/blkdev.h    |   3 +
> >   include/uapi/linux/fs.h   |  15 ++
> >   5 files changed, 422 insertions(+)
> > 
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index e59c3069e835..2ce3c872ca49 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -115,6 +115,364 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> >   }
> >   EXPORT_SYMBOL(blkdev_issue_discard);
> > +/*
> > + * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
> > + * This must only be called once all bios have been issued so that the refcount
> > + * can only decrease. This just waits for all bios to make it through
> > + * blk_copy_offload_*_end_io. IO errors are propagated through cio->io_err.
> > + */
> > +static int cio_await_completion(struct cio *cio)
> > +{
> > +	int ret = 0;
> > +
> > +	atomic_dec(&cio->refcount);
> > +
> > +	if (cio->endio)
> > +		return 0;
> > +
> > +	if (atomic_read(&cio->refcount)) {
> > +		__set_current_state(TASK_UNINTERRUPTIBLE);
> > +		blk_io_schedule();
> > +	}
> > +
> Wouldn't it be better to use 'atomic_dec_return()' to avoid a potential race
> condition between atomic_dec() and atomic_read()?
> 

cio keeps the total number of submitted IOs. For an async copy (with an endio
handler) we just return after decrementing the refcount; if we used
atomic_dec_return(), we would need that endio check twice (see the sketch
below).
Also, this function is called only after all submissions are complete, so the
refcount can only decrease and that race condition shouldn't happen with this
ordering.
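
To make the trade-off concrete, a variant built around atomic_dec_and_test()
would look roughly like the sketch below (hypothetical, only to illustrate
the point; not what the patch does). The endio check ends up in both
branches, which is the duplication I meant:

	/* Sketch only: single-atomic-op variant of cio_await_completion() */
	static int cio_await_completion(struct cio *cio)
	{
		int ret;

		if (!atomic_dec_and_test(&cio->refcount)) {
			/* completions still in flight */
			if (cio->endio)	/* async: last completion runs endio */
				return 0;
			__set_current_state(TASK_UNINTERRUPTIBLE);
			blk_io_schedule();
		} else if (cio->endio) {
			/* async, but everything completed before we got here */
			return 0;
		}

		ret = cio->io_err;
		kfree(cio);
		return ret;
	}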

> > +	ret = cio->io_err;
> > +	kfree(cio);
> > +
> > +	return ret;
> > +}
> > +
> > +static void blk_copy_offload_write_end_io(struct bio *bio)
> > +{
> > +	struct copy_ctx *ctx = bio->bi_private;
> > +	struct cio *cio = ctx->cio;
> > +	sector_t clen;
> > +	int ri = ctx->range_idx;
> > +
> > +	if (bio->bi_status) {
> > +		cio->io_err = blk_status_to_errno(bio->bi_status);
> > +		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
> > +			cio->ranges[ri].dst;
> > +		cio->ranges[ri].comp_len = min_t(sector_t, clen,
> > +				cio->ranges[ri].comp_len);
> > +	}
> > +	__free_page(bio->bi_io_vec[0].bv_page);
> > +	bio_put(bio);
> > +
> > +	if (atomic_dec_and_test(&ctx->refcount))
> > +		kfree(ctx);
> > +	if (atomic_dec_and_test(&cio->refcount)) {
> 
> _Two_ atomic_dec() in a row?
> Why?
> 
> And if that really is required, please add a comment.
> 

cio is used to keep track of all the submitted IOs.

ctx is used to keep track of a single IO, and each IO has 4 parts:
1. READ submission (process/submitter context)
2. READ completion (interrupt context; we can't submit the WRITE bio here, so
   we schedule a worker to submit it)
3. WRITE submission by the worker (process context)
4. WRITE completion (interrupt context)
So ctx is accessed concurrently. Especially if the IO is for a zoned device,
we need to maintain the order of READ submissions so that the WRITE order is
the same as the READ order.

So the cio and ctx refcounts serve different purposes and are updated
accordingly.
Agreed, I can add better comments in the next version; a rough sketch of what
I have in mind is below.
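
Something along these lines, purely as a sketch of the commenting (same
fields as in this patch, wording not final):

	struct copy_ctx {
		struct cio *cio;	/* parent; one cio per blkdev_issue_copy() call */
		struct work_struct dispatch_work; /* submits write_bio from process
						   * context, since READ completion
						   * runs in interrupt context */
		struct bio *write_bio;	/* write half of the token bio pair */
		atomic_t refcount;	/* last ref dropped by WRITE completion (or
					 * the READ error path); the zoned worker
					 * holds an extra ref across
					 * submit_bio(write_bio) */
		int range_idx;		/* used in error/partial completion */

		/* zoned devices only: node in cio->list, kept in READ submission
		 * order so that writes are issued in the same order
		 */
		struct list_head list;
		enum copy_io_status status;	/* READ_PROGRESS -> READ_COMPLETE ->
						 * WRITE_PROGRESS, updated under
						 * cio->list_lock */
	};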

> > +		if (cio->endio) {
> > +			cio->endio(cio->private, cio->io_err);
> > +			kfree(cio);
> > +		} else
> > +			blk_wake_io_task(cio->waiter);
> > +	}
> > +}
> > +
> > +static void blk_copy_offload_read_end_io(struct bio *read_bio)
> > +{
> > +	struct copy_ctx *ctx = read_bio->bi_private;
> > +	struct cio *cio = ctx->cio;
> > +	sector_t clen;
> > +	int ri = ctx->range_idx;
> > +	unsigned long flags;
> > +
> > +	if (read_bio->bi_status) {
> > +		cio->io_err = blk_status_to_errno(read_bio->bi_status);
> > +		goto err_rw_bio;
> > +	}
> > +
> > +	/* For zoned devices, check if the completed bio is the first entry in
> > +	 * the linked list. If it is, start the worker to submit write bios.
> > +	 * If not, just update the status of the bio in ctx; once the worker
> > +	 * gets scheduled, it will submit writes for all the consecutive
> > +	 * REQ_COPY_READ_COMPLETE bios.
> > +	 */
> > +	if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
> > +		spin_lock_irqsave(&cio->list_lock, flags);
> > +		ctx->status = REQ_COPY_READ_COMPLETE;
> > +		if (ctx == list_first_entry(&cio->list,
> > +					struct copy_ctx, list)) {
> > +			spin_unlock_irqrestore(&cio->list_lock, flags);
> > +			schedule_work(&ctx->dispatch_work);
> > +			goto free_read_bio;
> > +		}
> > +		spin_unlock_irqrestore(&cio->list_lock, flags);
> > +	} else
> > +		schedule_work(&ctx->dispatch_work);
> > +
> > +free_read_bio:
> > +	bio_put(read_bio);
> > +
> > +	return;
> > +
> > +err_rw_bio:
> > +	clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
> > +					cio->ranges[ri].src;
> > +	cio->ranges[ri].comp_len = min_t(sector_t, clen,
> > +					cio->ranges[ri].comp_len);
> > +	__free_page(read_bio->bi_io_vec[0].bv_page);
> > +	bio_put(ctx->write_bio);
> > +	bio_put(read_bio);
> > +	if (atomic_dec_and_test(&ctx->refcount))
> > +		kfree(ctx);
> > +	if (atomic_dec_and_test(&cio->refcount)) {
> 
> Same here.
> 
> > +		if (cio->endio) {
> > +			cio->endio(cio->private, cio->io_err);
> > +			kfree(cio);
> > +		} else
> > +			blk_wake_io_task(cio->waiter);
> > +	}
> > +}
> > +
> > +static void blk_copy_dispatch_work_fn(struct work_struct *work)
> > +{
> > +	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
> > +			dispatch_work);
> > +
> > +	submit_bio(ctx->write_bio);
> > +}
> > +
> > +static void blk_zoned_copy_dispatch_work_fn(struct work_struct *work)
> > +{
> > +	struct copy_ctx *ctx = container_of(work, struct copy_ctx,
> > +			dispatch_work);
> > +	struct cio *cio = ctx->cio;
> > +	unsigned long flags = 0;
> > +
> > +	atomic_inc(&cio->refcount);
> > +	spin_lock_irqsave(&cio->list_lock, flags);
> > +
> > +	while (!list_empty(&cio->list)) {
> > +		ctx = list_first_entry(&cio->list, struct copy_ctx, list);
> > +
> That is ever so odd; it'll block 'cio->list' for the time of processing.
> Wouldn't it be better to move 'cio->list' to a private list, and do away
> with the list_lock during processing?
> 

For zoned devices we need to maintain the ordering of IOs, because writes
can't be issued out of order, so we maintain this list.
Again, this IO list is accessed concurrently by
a. READ submission
b. READ completion
c. WRITE submission (worker)
So moving to a private list won't be possible, I feel: the READ completion
path also has to look at the shared list to decide whether to kick the worker
(see the sketch below). Since we are using a global list, we require the lock
to synchronize it.
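
Just to illustrate that (a hypothetical sketch, not something I am proposing;
ctx/cio refcounting elided): even if the worker spliced the entries onto a
private list, reads that complete while their ctx sits on that private list
would no longer pass the list_first_entry(&cio->list, ...) check in
blk_copy_offload_read_end_io(), so nothing would kick the worker for them,
and the leftovers still have to be spliced back under the lock:

	/* Sketch only: splice-to-private-list variant of the zoned worker */
	static void blk_zoned_copy_dispatch_work_fn(struct work_struct *work)
	{
		struct copy_ctx *ctx = container_of(work, struct copy_ctx,
				dispatch_work);
		struct cio *cio = ctx->cio;
		LIST_HEAD(pending);
		unsigned long flags;

		spin_lock_irqsave(&cio->list_lock, flags);
		list_splice_init(&cio->list, &pending);	/* private snapshot */
		spin_unlock_irqrestore(&cio->list_lock, flags);

		/* Issue writes for the chunks whose reads already completed. */
		while (!list_empty(&pending)) {
			ctx = list_first_entry(&pending, struct copy_ctx, list);
			if (ctx->status == REQ_COPY_READ_PROGRESS)
				break;
			list_del(&ctx->list);
			submit_bio(ctx->write_bio);
		}

		/* Entries whose reads are still in flight have to go back on
		 * the shared list, under the same lock we were trying to avoid.
		 */
		spin_lock_irqsave(&cio->list_lock, flags);
		list_splice(&pending, &cio->list);
		spin_unlock_irqrestore(&cio->list_lock, flags);
	}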

> > +		if (ctx->status == REQ_COPY_READ_PROGRESS)
> > +			break;
> > +
> > +		atomic_inc(&ctx->refcount);
> > +		ctx->status = REQ_COPY_WRITE_PROGRESS;
> > +		spin_unlock_irqrestore(&cio->list_lock, flags);
> > +		submit_bio(ctx->write_bio);
> > +		spin_lock_irqsave(&cio->list_lock, flags);
> > +
> > +		list_del(&ctx->list);
> > +		if (atomic_dec_and_test(&ctx->refcount))
> > +			kfree(ctx);
> > +	}
> > +
> > +	spin_unlock_irqrestore(&cio->list_lock, flags);
> > +	if (atomic_dec_and_test(&cio->refcount))
> > +		blk_wake_io_task(cio->waiter);
> > +}
> > +
> > +/*
> > + * blk_copy_offload	- Use device's native copy offload feature.
> > + * We perform the copy operation by sending two bios:
> > + * 1. First we send a read bio with the REQ_COPY flag along with a token,
> > + * source and length. Once the read bio reaches the driver layer, the device
> > + * driver adds all the source info to the token and does a fake completion.
> > + * 2. Once the read operation completes, we issue a write with the REQ_COPY
> > + * flag and the same token. In the driver layer, the token info is used to
> > + * form a copy offload command.
> > + *
> > + * For conventional devices we submit the write bio independently once the
> > + * read completes. For zoned devices, reads can complete out of order, so we
> > + * maintain a linked list and submit writes in the order the reads were
> > + * submitted.
> > + */
> > +static int blk_copy_offload(struct block_device *src_bdev,
> > +		struct block_device *dst_bdev, struct range_entry *ranges,
> > +		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
> > +{
> > +	struct cio *cio;
> > +	struct copy_ctx *ctx;
> > +	struct bio *read_bio, *write_bio;
> > +	struct page *token;
> > +	sector_t src_blk, copy_len, dst_blk;
> > +	sector_t rem, max_copy_len;
> > +	int ri = 0, ret = 0;
> > +	unsigned long flags;
> > +
> > +	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
> > +	if (!cio)
> > +		return -ENOMEM;
> > +	cio->ranges = ranges;
> > +	atomic_set(&cio->refcount, 1);
> > +	cio->waiter = current;
> > +	cio->endio = end_io;
> > +	cio->private = private;
> > +	if (bdev_is_zoned(dst_bdev)) {
> > +		INIT_LIST_HEAD(&cio->list);
> > +		spin_lock_init(&cio->list_lock);
> > +	}
> > +
> > +	max_copy_len = min(bdev_max_copy_sectors(src_bdev),
> > +			bdev_max_copy_sectors(dst_bdev)) << SECTOR_SHIFT;
> > +
> > +	for (ri = 0; ri < nr; ri++) {
> > +		cio->ranges[ri].comp_len = ranges[ri].len;
> > +		src_blk = ranges[ri].src;
> > +		dst_blk = ranges[ri].dst;
> > +		for (rem = ranges[ri].len; rem > 0; rem -= copy_len) {
> > +			copy_len = min(rem, max_copy_len);
> > +
> > +			token = alloc_page(gfp_mask);
> > +			if (unlikely(!token)) {
> > +				ret = -ENOMEM;
> > +				goto err_token;
> > +			}
> > +
> > +			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
> > +			if (!ctx) {
> > +				ret = -ENOMEM;
> > +				goto err_ctx;
> > +			}
> > +			read_bio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY
> > +					| REQ_SYNC | REQ_NOMERGE, gfp_mask);
> > +			if (!read_bio) {
> > +				ret = -ENOMEM;
> > +				goto err_read_bio;
> > +			}
> > +			write_bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE
> > +					| REQ_COPY | REQ_SYNC | REQ_NOMERGE,
> > +					gfp_mask);
> > +			if (!write_bio) {
> > +				cio->io_err = -ENOMEM;
> > +				goto err_write_bio;
> > +			}
> > +
> > +			ctx->cio = cio;
> > +			ctx->range_idx = ri;
> > +			ctx->write_bio = write_bio;
> > +			atomic_set(&ctx->refcount, 1);
> > +
> > +			if (bdev_is_zoned(dst_bdev)) {
> > +				INIT_WORK(&ctx->dispatch_work,
> > +					blk_zoned_copy_dispatch_work_fn);
> > +				INIT_LIST_HEAD(&ctx->list);
> > +				spin_lock_irqsave(&cio->list_lock, flags);
> > +				ctx->status = REQ_COPY_READ_PROGRESS;
> > +				list_add_tail(&ctx->list, &cio->list);
> > +				spin_unlock_irqrestore(&cio->list_lock, flags);
> > +			} else
> > +				INIT_WORK(&ctx->dispatch_work,
> > +					blk_copy_dispatch_work_fn);
> > +
> > +			__bio_add_page(read_bio, token, PAGE_SIZE, 0);
> > +			read_bio->bi_iter.bi_size = copy_len;
> > +			read_bio->bi_iter.bi_sector = src_blk >> SECTOR_SHIFT;
> > +			read_bio->bi_end_io = blk_copy_offload_read_end_io;
> > +			read_bio->bi_private = ctx;
> > +
> > +			__bio_add_page(write_bio, token, PAGE_SIZE, 0);
> > +			write_bio->bi_iter.bi_size = copy_len;
> > +			write_bio->bi_end_io = blk_copy_offload_write_end_io;
> > +			write_bio->bi_iter.bi_sector = dst_blk >> SECTOR_SHIFT;
> > +			write_bio->bi_private = ctx;
> > +
> > +			atomic_inc(&cio->refcount);
> > +			submit_bio(read_bio);
> > +			src_blk += copy_len;
> > +			dst_blk += copy_len;
> > +		}
> > +	}
> > +
> > +	/* Wait for completion of all IOs */
> > +	return cio_await_completion(cio);
> > +
> > +err_write_bio:
> > +	bio_put(read_bio);
> > +err_read_bio:
> > +	kfree(ctx);
> > +err_ctx:
> > +	__free_page(token);
> > +err_token:
> > +	ranges[ri].comp_len = min_t(sector_t,
> > +			ranges[ri].comp_len, (ranges[ri].len - rem));
> > +
> > +	cio->io_err = ret;
> > +	return cio_await_completion(cio);
> > +}
> > +
> > +static inline int blk_copy_sanity_check(struct block_device *src_bdev,
> > +	struct block_device *dst_bdev, struct range_entry *ranges, int nr)
> > +{
> > +	unsigned int align_mask = max(bdev_logical_block_size(dst_bdev),
> > +					bdev_logical_block_size(src_bdev)) - 1;
> > +	sector_t len = 0;
> > +	int i;
> > +
> > +	if (!nr)
> > +		return -EINVAL;
> > +
> > +	if (nr >= MAX_COPY_NR_RANGE)
> > +		return -EINVAL;
> > +
> > +	if (bdev_read_only(dst_bdev))
> > +		return -EPERM;
> > +
> > +	for (i = 0; i < nr; i++) {
> > +		if (!ranges[i].len)
> > +			return -EINVAL;
> > +
> > +		len += ranges[i].len;
> > +		if ((ranges[i].dst & align_mask) ||
> > +				(ranges[i].src & align_mask) ||
> > +				(ranges[i].len & align_mask))
> > +			return -EINVAL;
> > +		ranges[i].comp_len = 0;
> > +	}
> > +
> > +	if (len && len >= MAX_COPY_TOTAL_LENGTH)
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
> > +
> > +static inline bool blk_check_copy_offload(struct request_queue *src_q,
> > +		struct request_queue *dst_q)
> > +{
> > +	return blk_queue_copy(dst_q) && blk_queue_copy(src_q);
> > +}
> > +
> > +/*
> > + * blkdev_issue_copy - queue a copy
> > + * @src_bdev:	source block device
> > + * @dst_bdev:	destination block device
> > + * @ranges:	array of source/dest/len,
> > + *		ranges are expected to be allocated/freed by caller
> > + * @nr:		number of source ranges to copy
> > + * @end_io:	end_io function to be called on completion of the copy
> > + *		operation; for a synchronous operation this should be NULL
> > + * @private:	end_io function will be called with this private data; should
> > + *		be NULL if the operation is synchronous
> > + * @gfp_mask:   memory allocation flags (for bio_alloc)
> > + *
> > + * Description:
> > + *	Copy source ranges from source block device to destination block
> > + *	device. The length of a source range cannot be zero. The max total
> > + *	copy length is limited to MAX_COPY_TOTAL_LENGTH and the maximum
> > + *	number of entries is limited to MAX_COPY_NR_RANGE.
> > + */
> > +int blkdev_issue_copy(struct block_device *src_bdev,
> > +	struct block_device *dst_bdev, struct range_entry *ranges, int nr,
> > +	cio_iodone_t end_io, void *private, gfp_t gfp_mask)
> > +{
> > +	struct request_queue *src_q = bdev_get_queue(src_bdev);
> > +	struct request_queue *dst_q = bdev_get_queue(dst_bdev);
> > +	int ret = -EINVAL;
> > +
> > +	ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (blk_check_copy_offload(src_q, dst_q))
> > +		ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
> > +				end_io, private, gfp_mask);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(blkdev_issue_copy);
> > +
> >   static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >   		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
> >   		struct bio **biop, unsigned flags)
> > diff --git a/block/blk.h b/block/blk.h
> > index 4c3b3325219a..6d9924a7d559 100644
> > --- a/block/blk.h
> > +++ b/block/blk.h
> > @@ -304,6 +304,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
> >   		break;
> >   	}
> > +	if (unlikely(op_is_copy(bio->bi_opf)))
> > +		return false;
> >   	/*
> >   	 * All drivers must accept single-segments bios that are <= PAGE_SIZE.
> >   	 * This is a quick and dirty check that relies on the fact that
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index 99be590f952f..de1638c87ecf 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -422,6 +422,7 @@ enum req_flag_bits {
> >   	 */
> >   	/* for REQ_OP_WRITE_ZEROES: */
> >   	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
> > +	__REQ_COPY,		/* copy request */
> >   	__REQ_NR_BITS,		/* stops here */
> >   };
> > @@ -451,6 +452,7 @@ enum req_flag_bits {
> >   #define REQ_DRV		(__force blk_opf_t)(1ULL << __REQ_DRV)
> >   #define REQ_SWAP	(__force blk_opf_t)(1ULL << __REQ_SWAP)
> > +#define REQ_COPY	((__force blk_opf_t)(1ULL << __REQ_COPY))
> >   #define REQ_FAILFAST_MASK \
> >   	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
> > @@ -477,6 +479,11 @@ static inline bool op_is_write(blk_opf_t op)
> >   	return !!(op & (__force blk_opf_t)1);
> >   }
> > +static inline bool op_is_copy(blk_opf_t op)
> > +{
> > +	return (op & REQ_COPY);
> > +}
> > +
> >   /*
> >    * Check if the bio or request is one that needs special treatment in the
> >    * flush state machine.
> > @@ -536,4 +543,41 @@ struct blk_rq_stat {
> >   	u64 batch;
> >   };
> > +typedef void (cio_iodone_t)(void *private, int status);
> > +
> > +struct cio {
> > +	struct range_entry *ranges;
> > +	struct task_struct *waiter;     /* waiting task (NULL if none) */
> > +	atomic_t refcount;
> > +	int io_err;
> > +	cio_iodone_t *endio;		/* applicable for async operation */
> > +	void *private;			/* applicable for async operation */
> > +
> > +	/* For zoned devices we maintain a linked list of IO submissions.
> > +	 * This is to make sure we maintain the order of submissions.
> > +	 * Otherwise reads completing out of order would submit writes that
> > +	 * are not aligned with the zone write pointer.
> > +	 */
> > +	struct list_head list;
> > +	spinlock_t list_lock;
> > +};
> > +
> > +enum copy_io_status {
> > +	REQ_COPY_READ_PROGRESS,
> > +	REQ_COPY_READ_COMPLETE,
> > +	REQ_COPY_WRITE_PROGRESS,
> > +};
> > +
> > +struct copy_ctx {
> > +	struct cio *cio;
> > +	struct work_struct dispatch_work;
> > +	struct bio *write_bio;
> > +	atomic_t refcount;
> > +	int range_idx;			/* used in error/partial completion */
> > +
> > +	/* For zoned devices a linked list is maintained, along with the IO state */
> > +	struct list_head list;
> > +	enum copy_io_status status;
> > +};
> > +
> >   #endif /* __LINUX_BLK_TYPES_H */
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 807ffb5f715d..48e9160b7195 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1063,6 +1063,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> >   		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
> >   int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
> >   		sector_t nr_sects, gfp_t gfp);
> > +int blkdev_issue_copy(struct block_device *src_bdev,
> > +		struct block_device *dst_bdev, struct range_entry *ranges,
> > +		int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
> >   #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
> >   #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index b3ad173f619c..9248b6d259de 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -67,6 +67,21 @@ struct fstrim_range {
> >   /* maximum total copy length */
> >   #define MAX_COPY_TOTAL_LENGTH	(1 << 27)
> > +/* Maximum number of entries supported */
> > +#define MAX_COPY_NR_RANGE	(1 << 12)
> > +
> > +/* range entry for copy offload, all fields should be byte addressed */
> > +struct range_entry {
> > +	__u64 src;		/* source to be copied */
> > +	__u64 dst;		/* destination */
> > +	__u64 len;		/* length in bytes to be copied */
> > +
> > +	/* length of the data copy actually completed. This will be filled by
> > +	 * the kernel once the copy completes
> > +	 */
> > +	__u64 comp_len;
> > +};
> > +
> >   /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
> >   #define FILE_DEDUPE_RANGE_SAME		0
> >   #define FILE_DEDUPE_RANGE_DIFFERS	1
> 
> Cheers,
> 
> Hannes
> 

Thanks,
Nitesh Shetty






end of thread, other threads:[~2023-01-16  8:23 UTC | newest]

Thread overview: 26+ messages
     [not found] <CGME20230112115954epcas5p4a959bef952926b8976719f1179bb4436@epcas5p4.samsung.com>
2023-01-12 11:58 ` [PATCH v6 0/9] Implement copy offload support Nitesh Shetty
2023-01-12 11:58   ` [dm-devel] " Nitesh Shetty
     [not found]   ` <CGME20230112120011epcas5p33c10ec9a0cb2ee4c0c68737bb879a154@epcas5p3.samsung.com>
2023-01-12 11:58     ` [PATCH v6 1/9] block: Introduce queue limits for copy-offload support Nitesh Shetty
2023-01-12 11:58       ` [dm-devel] " Nitesh Shetty
     [not found]   ` <CGME20230112120039epcas5p49ccf70d806c530c8228130cc25737b51@epcas5p4.samsung.com>
2023-01-12 11:58     ` [PATCH v6 2/9] block: Add copy offload support infrastructure Nitesh Shetty
2023-01-12 11:58       ` [dm-devel] " Nitesh Shetty
2023-01-12 14:43       ` Hannes Reinecke
2023-01-12 14:43         ` [dm-devel] " Hannes Reinecke
2023-01-13  8:26         ` Nitesh Shetty
2023-01-13  8:26           ` [dm-devel] " Nitesh Shetty
     [not found]   ` <CGME20230112120054epcas5p3ec5887c4e1de59f7529dafca1cd6aa65@epcas5p3.samsung.com>
2023-01-12 11:58     ` [PATCH v6 3/9] block: add emulation for copy Nitesh Shetty
2023-01-12 11:58       ` [dm-devel] " Nitesh Shetty
2023-01-12 14:46       ` Hannes Reinecke
2023-01-12 14:46         ` [dm-devel] " Hannes Reinecke
2023-01-12 14:48         ` Hannes Reinecke
2023-01-12 14:48           ` [dm-devel] " Hannes Reinecke
     [not found]   ` <CGME20230112120131epcas5p4374e6add89990dd546bd0ae38f4386f0@epcas5p4.samsung.com>
2023-01-12 11:58     ` [PATCH v6 4/9] block: Introduce a new ioctl " Nitesh Shetty
2023-01-12 11:58       ` [dm-devel] " Nitesh Shetty
     [not found]   ` <CGME20230112120151epcas5p1e7c3ec0c7bd0869b9cf0bea64d65991a@epcas5p1.samsung.com>
2023-01-12 11:58     ` [PATCH v6 5/9] nvme: add copy offload support Nitesh Shetty
2023-01-12 11:58       ` [dm-devel] " Nitesh Shetty
     [not found]   ` <CGME20230112120201epcas5p1d2ee1f9fd6a1f458ffa770bb33b4bb41@epcas5p1.samsung.com>
2023-01-12 11:59     ` [PATCH v6 6/9] nvmet: add copy command support for bdev and file ns Nitesh Shetty
2023-01-12 11:59       ` [dm-devel] " Nitesh Shetty
     [not found]   ` <CGME20230112120210epcas5p41524bba73af1dcf283d21b9c7ee9d239@epcas5p4.samsung.com>
2023-01-12 11:59     ` [PATCH v6 7/9] dm: Add support for copy offload Nitesh Shetty
2023-01-12 11:59       ` [dm-devel] " Nitesh Shetty
     [not found]   ` <CGME20230112120229epcas5p38a07a42302d823422960eb11de5d685b@epcas5p3.samsung.com>
2023-01-12 11:59     ` [PATCH v6 9/9] dm kcopyd: use copy offload support Nitesh Shetty
2023-01-12 11:59       ` [dm-devel] " Nitesh Shetty
