* [RFC PATCH v5 0/4] add simple copy support
       [not found] <CGME20210219124555epcas5p1334e7c4d64ada5dc4a2ca0feb48c1d44@epcas5p1.samsung.com>
@ 2021-02-19 12:45 ` SelvaKumar S
       [not found]   ` <CGME20210219124559epcas5p41da46f1c248e334953d407a154697903@epcas5p4.samsung.com>
                     ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: SelvaKumar S @ 2021-02-19 12:45 UTC (permalink / raw)
  To: linux-nvme
  Cc: axboe, damien.lemoal, kch, SelvaKumar S, sagi, snitzer,
	selvajove, linux-kernel, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, kbusch, joshiiitr, hch

This patchset adds support for TP 4065a ("Simple Copy Command"),
v2020.05.04 ("Ratified").

The specification can be found at the following link:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip

The simple copy command is a copy-offloading operation used to copy
multiple contiguous ranges (source_ranges) of LBAs to a single
destination LBA within the device, reducing traffic between host and
device.

This implementation doesn't add native copy offload support for stacked
devices; instead, copy offload is done through emulation. Possible use
cases are F2FS GC and BTRFS relocation/balance.

*blkdev_issue_copy* takes the source bdev, the number of sources, an
array of source ranges (in sectors), the destination bdev, and the
destination offset (in sectors). If the source and destination block
devices are the same and copy_offload = 1, the copy is done through
native copy offloading. Copy emulation is used in the other cases.

As SCSI XCOPY can take two different block devices and a single source
range, this interface can be extended in the future to support SCSI
XCOPY.

For devices supporting native simple copy, the control information is
attached as payload to the bio and submitted to the device. For devices
without native copy support, copy emulation is done by reading each
source range into memory and writing it to the destination. The caller
can choose not to try emulation when copy offload is not supported by
setting the BLKDEV_COPY_NOEMULATION flag (see the sketch below).
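
A minimal in-kernel caller sketch (the bdev handle, source ranges and
destination sector are illustrative placeholders, not taken from the
patches):

	struct range_entry ranges[] = {
		{ .src = 0,  .len = 8  },	/* sectors */
		{ .src = 64, .len = 16 },
	};
	int ret;

	/* copy both ranges to sector 1024 on the same device; drop
	 * BLKDEV_COPY_NOEMULATION to allow the emulation fallback
	 */
	ret = blkdev_issue_copy(bdev, ARRAY_SIZE(ranges), ranges, bdev,
				1024, GFP_KERNEL, BLKDEV_COPY_NOEMULATION);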

Following limits are added to the queue limits and are exposed in sysfs
to userspace:
	- *copy_offload* controls copy offload: set to 0 to disable copy
		offload, or 1 to enable native copy offloading support.
	- *max_copy_sectors* limits the sum of all source_range lengths.
	- *max_copy_nr_ranges* limits the number of source ranges.
	- *max_copy_range_sectors* limits the maximum number of sectors
		that can constitute a single source range.

	max_copy_sectors = 0 indicates the device doesn't support copy
offloading.

	The *copy_offload* sysfs entry is configurable and can be used to
toggle between emulation and native support depending upon the use case.
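
For example, native offloading could be toggled from userspace with a
small C program along these lines (the device path is hypothetical, and
the write only succeeds on a device whose max_copy_sectors is non-zero):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		/* illustrative path; pick the right disk */
		int fd = open("/sys/block/nvme0n1/queue/copy_offload",
			      O_WRONLY);

		if (fd < 0)
			return 1;
		write(fd, "1", 1);	/* "0" disables, non-zero enables */
		close(fd);
		return 0;
	}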

Changes from v4:

1. Extend dm-kcopyd to leverage copy offload while copying within the
same device. The other approach was to do copy emulation by moving
dm-kcopyd to the block layer, but that also required moving the core
dm-io infrastructure, causing massive churn across multiple dm-targets.

2. Remove the export of bio_map_kern()
3. Change the copy_offload sysfs entry to accept 0 or any other value
4. Rename the copy support flag to QUEUE_FLAG_SIMPLE_COPY
5. Rename payload entries, add a source bdev field to be used while
partition remapping, and remove copy_size
6. Change the blkdev_issue_copy() interface to accept destination and
source values in sectors rather than in bytes
7. Add the payload to the bio using bio_map_kern() for the copy_offload case
8. Add a check to return an error if any source range length is 0
9. Add the BLKDEV_COPY_NOEMULATION flag to allow the user to skip copy
emulation in case copy offload is not supported. The caller can then use
its existing copying logic to complete the I/O.
10. Fix bugs in the copy checks and reduce the size of the
rcu_read_lock() critical section

Planned for next:
- adding blktests
- handling copies larger than the device limits
- decide on ioctl interface (man-page etc.)

Changes from v3:

1. gfp_flag fixes.
2. Export bio_map_kern() and use it to allocate and add pages to bio.
3. Move copy offload, read-to-buffer, and write-from-buffer into separate functions.
4. Send the read bios of copy emulation by chaining them and submitting asynchronously.
5. Add gendisk->part0 and part->bd_start_sect changes to blk_check_copy().
6. Move the single source range limit check to blk_check_copy().
7. Rename __blkdev_issue_copy() to blkdev_issue_copy() and remove the old helper.
8. Make the blkdev_issue_copy() interface generic, accepting a destination bdev,
	to support XCOPY as well.
9. Add invalidate_kernel_vmap_range() after reading data for vmalloc'ed memory.
10. Fix buf allocation logic to allocate a buffer for the total size of the copy.
11. Reword patch commit description.

Changes from v2:

1. Add emulation support for devices not supporting copy.
2. Add *copy_offload* sysfs entry to enable and disable copy_offload
	in devices supporting simple copy.
3. Remove simple copy support for stacked devices.

Changes from v1:

1. Fix memory leak in __blkdev_issue_copy
2. Unmark blk_check_copy inline
3. Fix line break in blk_check_copy_eod
4. Remove p checks and make the code more readable
5. Don't use bio_set_op_attrs; remove op and set
   bi_opf directly
6. Use struct_size to calculate total_size
7. Fix partition remap of copy destination
8. Remove mcl,mssrl,msrc from nvme_ns
9. Initialize copy queue limits to 0 in nvme_config_copy
10. Remove return in QUEUE_FLAG_COPY check
11. Remove unused OCFS

SelvaKumar S (4):
  block: make bio_map_kern() non static
  block: add simple copy support
  nvme: add simple copy support
  dm kcopyd: add simple copy offload support

 block/blk-core.c          | 102 +++++++++++++++--
 block/blk-lib.c           | 223 ++++++++++++++++++++++++++++++++++++++
 block/blk-map.c           |   2 +-
 block/blk-merge.c         |   2 +
 block/blk-settings.c      |  10 ++
 block/blk-sysfs.c         |  47 ++++++++
 block/blk-zoned.c         |   1 +
 block/bounce.c            |   1 +
 block/ioctl.c             |  33 ++++++
 drivers/md/dm-kcopyd.c    |  49 ++++++++-
 drivers/nvme/host/core.c  |  87 +++++++++++++++
 include/linux/bio.h       |   1 +
 include/linux/blk_types.h |  14 +++
 include/linux/blkdev.h    |  17 +++
 include/linux/nvme.h      |  43 +++++++-
 include/uapi/linux/fs.h   |  13 +++
 16 files changed, 627 insertions(+), 18 deletions(-)

-- 
2.25.1


* [RFC PATCH v5 1/4] block: make bio_map_kern() non static
       [not found]   ` <CGME20210219124559epcas5p41da46f1c248e334953d407a154697903@epcas5p4.samsung.com>
@ 2021-02-19 12:45     ` SelvaKumar S
  0 siblings, 0 replies; 29+ messages in thread
From: SelvaKumar S @ 2021-02-19 12:45 UTC (permalink / raw)
  To: linux-nvme
  Cc: axboe, damien.lemoal, kch, SelvaKumar S, sagi, snitzer,
	selvajove, linux-kernel, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, kbusch, joshiiitr, hch

Make bio_map_kern() non-static so that copy offload emulation can use
it to add vmalloc'ed memory to a bio.
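
A rough sketch of the intended usage from the emulation path (q, bdev,
buf, len and sector are placeholders here; the real callers are added in
the next patch):

	bio = bio_map_kern(q, buf, len, GFP_KERNEL);
	if (IS_ERR(bio))
		return PTR_ERR(bio);

	bio->bi_iter.bi_sector = sector;
	bio->bi_opf = REQ_OP_READ;
	bio_set_dev(bio, bdev);

	ret = submit_bio_wait(bio);
	bio_put(bio);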

Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
Signed-off-by: Chaitanya Kulkarni <kch@kernel.org>
---
 block/blk-map.c        | 2 +-
 include/linux/blkdev.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 21630dccac62..17381b1643b8 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -378,7 +378,7 @@ static void bio_map_kern_endio(struct bio *bio)
  *	Map the kernel address into a bio suitable for io to a block
  *	device. Returns an error pointer in case of error.
  */
-static struct bio *bio_map_kern(struct request_queue *q, void *data,
+struct bio *bio_map_kern(struct request_queue *q, void *data,
 		unsigned int len, gfp_t gfp_mask)
 {
 	unsigned long kaddr = (unsigned long)data;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f94ee3089e01..699ace6b25ff 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -944,6 +944,8 @@ extern int blk_rq_map_user(struct request_queue *, struct request *,
 			   struct rq_map_data *, void __user *, unsigned long,
 			   gfp_t);
 extern int blk_rq_unmap_user(struct bio *);
+struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
+			gfp_t gfp_mask);
 extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, unsigned int, gfp_t);
 extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
 			       struct rq_map_data *, const struct iov_iter *,
-- 
2.25.1


* [RFC PATCH v5 2/4] block: add simple copy support
       [not found]   ` <CGME20210219124603epcas5p33add0f2c1781b2a4d71bf30c9e1ac647@epcas5p3.samsung.com>
@ 2021-02-19 12:45     ` SelvaKumar S
  2021-02-20  4:59       ` Damien Le Moal
  0 siblings, 1 reply; 29+ messages in thread
From: SelvaKumar S @ 2021-02-19 12:45 UTC (permalink / raw)
  To: linux-nvme
  Cc: axboe, damien.lemoal, kch, SelvaKumar S, sagi, snitzer,
	selvajove, linux-kernel, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, kbusch, joshiiitr, hch

Add a new BLKCOPY ioctl that offloads copying of one or more source
ranges to a destination in the device. It accepts a 'copy_range'
structure that contains the destination (in sectors), the number of
sources, and a pointer to an array of source ranges. Each source range
is represented by a 'range_entry' that contains the start and length
of the source range (in sectors).
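
A sketch of how a userspace caller might drive the new ioctl (the device
path, sectors and lengths are illustrative; 'struct range_entry',
'struct copy_range' and BLKCOPY come from the uapi additions below, so
this assumes a linux/fs.h with this patch applied):

	#include <fcntl.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int main(void)
	{
		struct range_entry ranges[] = {
			{ .src = 0,  .len = 8  },	/* sectors */
			{ .src = 64, .len = 16 },
		};
		struct copy_range cr = {
			.dest       = 1024,		/* destination sector */
			.nr_range   = 2,
			.range_list = (uint64_t)(uintptr_t)ranges,
		};
		int fd = open("/dev/nvme0n1", O_WRONLY);

		if (fd < 0)
			return 1;
		return ioctl(fd, BLKCOPY, &cr) ? 1 : 0;
	}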

Introduce REQ_OP_COPY, a no-merge copy offload operation. Create a bio
with the control information as payload and submit it to the device.
REQ_OP_COPY (19) is a write op and takes the zone write lock when
submitted to a zoned device.

If the device doesn't support copy or copy offload is disabled, the
copy operation is emulated by default. The caller can opt out of copy
emulation by specifying the 'BLKDEV_COPY_NOEMULATION' flag.

Copy emulation is implemented by allocating memory of the total copy
size. The source ranges are read into this memory by chaining a bio per
source range, submitting them asynchronously, and waiting for
completion on the last bio. After the data is read, it is written to
the destination.

bio_map_kern() is used to allocate the bio and add the pages of the
copy buffer to it. As bio->bi_private and bio->bi_end_io are needed for
chaining and get overwritten, invalidate_kernel_vmap_range() for the
read is called in the caller.

Introduce queue limits for simple copy and other helper functions.
Add device limits as sysfs entries:
	- copy_offload
	- max_copy_sectors
	- max_copy_range_sectors
	- max_copy_nr_ranges

copy_offload (= 0) is disabled by default. It needs to be enabled if
copy offload is to be used.
max_copy_sectors = 0 indicates the device doesn't support native copy.

Native copy offload is not supported for stacked devices; for them,
copy is done via emulation.

Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Chaitanya Kulkarni <kch@kernel.org>
---
 block/blk-core.c          | 102 ++++++++++++++++--
 block/blk-lib.c           | 222 ++++++++++++++++++++++++++++++++++++++
 block/blk-merge.c         |   2 +
 block/blk-settings.c      |  10 ++
 block/blk-sysfs.c         |  47 ++++++++
 block/blk-zoned.c         |   1 +
 block/bounce.c            |   1 +
 block/ioctl.c             |  33 ++++++
 include/linux/bio.h       |   1 +
 include/linux/blk_types.h |  14 +++
 include/linux/blkdev.h    |  15 +++
 include/uapi/linux/fs.h   |  13 +++
 12 files changed, 453 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7663a9b94b80..23e646e5ae43 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -720,6 +720,17 @@ static noinline int should_fail_bio(struct bio *bio)
 }
 ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
 
+static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
+		sector_t nr_sectors, sector_t max_sect)
+{
+	if (nr_sectors && max_sect &&
+	    (nr_sectors > max_sect || start > max_sect - nr_sectors)) {
+		handle_bad_sector(bio, max_sect);
+		return -EIO;
+	}
+	return 0;
+}
+
 /*
  * Check whether this bio extends beyond the end of the device or partition.
  * This may well happen - the kernel calls bread() without checking the size of
@@ -738,6 +749,75 @@ static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
 	return 0;
 }
 
+/*
+ * Check for copy limits and remap source ranges if needed.
+ */
+static int blk_check_copy(struct bio *bio)
+{
+	struct blk_copy_payload *payload = bio_data(bio);
+	struct request_queue *q = bio->bi_disk->queue;
+	sector_t max_sect, start_sect, copy_size = 0;
+	sector_t src_max_sect, src_start_sect;
+	struct block_device *bd_part;
+	int i, ret = -EIO;
+
+	rcu_read_lock();
+
+	bd_part = __disk_get_part(bio->bi_disk, bio->bi_partno);
+	if (unlikely(!bd_part)) {
+		rcu_read_unlock();
+		goto out;
+	}
+
+	max_sect =  bdev_nr_sectors(bd_part);
+	start_sect = bd_part->bd_start_sect;
+
+	src_max_sect = bdev_nr_sectors(payload->src_bdev);
+	src_start_sect = payload->src_bdev->bd_start_sect;
+
+	if (unlikely(should_fail_request(bd_part, bio->bi_iter.bi_size)))
+		goto out;
+
+	if (unlikely(bio_check_ro(bio, bd_part)))
+		goto out;
+
+	rcu_read_unlock();
+
+	/* cannot handle copy crossing nr_ranges limit */
+	if (payload->copy_nr_ranges > q->limits.max_copy_nr_ranges)
+		goto out;
+
+	for (i = 0; i < payload->copy_nr_ranges; i++) {
+		ret = bio_check_copy_eod(bio, payload->range[i].src,
+				payload->range[i].len, src_max_sect);
+		if (unlikely(ret))
+			goto out;
+
+		/* single source range length limit */
+		if (payload->range[i].len > q->limits.max_copy_range_sectors)
+			goto out;
+
+		payload->range[i].src += src_start_sect;
+		copy_size += payload->range[i].len;
+	}
+
+	/* check if copy length crosses eod */
+	ret = bio_check_copy_eod(bio, bio->bi_iter.bi_sector,
+				copy_size, max_sect);
+	if (unlikely(ret))
+		goto out;
+
+	/* cannot handle copy more than copy limits */
+	if (copy_size > q->limits.max_copy_sectors)
+		goto out;
+
+	bio->bi_iter.bi_sector += start_sect;
+	bio->bi_partno = 0;
+	ret = 0;
+out:
+	return ret;
+}
+
 /*
  * Remap block n of partition p to block n+start(p) of the disk.
  */
@@ -827,14 +907,16 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 	if (should_fail_bio(bio))
 		goto end_io;
 
-	if (bio->bi_partno) {
-		if (unlikely(blk_partition_remap(bio)))
-			goto end_io;
-	} else {
-		if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
-			goto end_io;
-		if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
-			goto end_io;
+	if (likely(!op_is_copy(bio->bi_opf))) {
+		if (bio->bi_partno) {
+			if (unlikely(blk_partition_remap(bio)))
+				goto end_io;
+		} else {
+			if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
+				goto end_io;
+			if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
+				goto end_io;
+		}
 	}
 
 	/*
@@ -858,6 +940,10 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		if (!blk_queue_discard(q))
 			goto not_supported;
 		break;
+	case REQ_OP_COPY:
+		if (unlikely(blk_check_copy(bio)))
+			goto end_io;
+		break;
 	case REQ_OP_SECURE_ERASE:
 		if (!blk_queue_secure_erase(q))
 			goto not_supported;
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 752f9c722062..97ba58d8d9a1 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -150,6 +150,228 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+int blk_copy_offload(struct block_device *dest_bdev, struct blk_copy_payload *payload,
+		sector_t dest, gfp_t gfp_mask)
+{
+	struct request_queue *q = bdev_get_queue(dest_bdev);
+	struct bio *bio;
+	int ret, payload_size;
+
+	payload_size = struct_size(payload, range, payload->copy_nr_ranges);
+	bio = bio_map_kern(q, payload, payload_size, gfp_mask);
+	if (IS_ERR(bio)) {
+		ret = PTR_ERR(bio);
+		goto err;
+	}
+
+	bio->bi_iter.bi_sector = dest;
+	bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
+	bio_set_dev(bio, dest_bdev);
+
+	ret = submit_bio_wait(bio);
+err:
+	bio_put(bio);
+	return ret;
+}
+
+int blk_read_to_buf(struct block_device *src_bdev, struct blk_copy_payload *payload,
+		gfp_t gfp_mask, sector_t copy_size, void **buf_p)
+{
+	struct request_queue *q = bdev_get_queue(src_bdev);
+	struct bio *bio, *parent = NULL;
+	void *buf = NULL;
+	int copy_len = copy_size << SECTOR_SHIFT;
+	int i, nr_srcs, ret, cur_size, t_len = 0;
+	bool is_vmalloc;
+
+	nr_srcs = payload->copy_nr_ranges;
+
+	buf = kvmalloc(copy_len, gfp_mask);
+	if (!buf)
+		return -ENOMEM;
+	is_vmalloc = is_vmalloc_addr(buf);
+
+	for (i = 0; i < nr_srcs; i++) {
+		cur_size = payload->range[i].len << SECTOR_SHIFT;
+
+		bio = bio_map_kern(q, buf + t_len, cur_size, gfp_mask);
+		if (IS_ERR(bio)) {
+			ret = PTR_ERR(bio);
+			goto out;
+		}
+
+		bio->bi_iter.bi_sector = payload->range[i].src;
+		bio->bi_opf = REQ_OP_READ;
+		bio_set_dev(bio, src_bdev);
+		bio->bi_end_io = NULL;
+		bio->bi_private = NULL;
+
+		if (parent) {
+			bio_chain(parent, bio);
+			submit_bio(parent);
+		}
+
+		parent = bio;
+		t_len += cur_size;
+	}
+
+	ret = submit_bio_wait(bio);
+	bio_put(bio);
+	if (is_vmalloc)
+		invalidate_kernel_vmap_range(buf, copy_len);
+	if (ret)
+		goto out;
+
+	*buf_p = buf;
+	return 0;
+out:
+	kvfree(buf);
+	return ret;
+}
+
+int blk_write_from_buf(struct block_device *dest_bdev, void *buf, sector_t dest,
+		sector_t copy_size, gfp_t gfp_mask)
+{
+	struct request_queue *q = bdev_get_queue(dest_bdev);
+	struct bio *bio;
+	int ret, copy_len = copy_size << SECTOR_SHIFT;
+
+	bio = bio_map_kern(q, buf, copy_len, gfp_mask);
+	if (IS_ERR(bio)) {
+		ret = PTR_ERR(bio);
+		goto out;
+	}
+	bio_set_dev(bio, dest_bdev);
+	bio->bi_opf = REQ_OP_WRITE;
+	bio->bi_iter.bi_sector = dest;
+
+	bio->bi_end_io = NULL;
+	ret = submit_bio_wait(bio);
+	bio_put(bio);
+out:
+	return ret;
+}
+
+int blk_prepare_payload(struct block_device *src_bdev, int nr_srcs, struct range_entry *rlist,
+		gfp_t gfp_mask, struct blk_copy_payload **payload_p, sector_t *copy_size)
+{
+
+	struct request_queue *q = bdev_get_queue(src_bdev);
+	struct blk_copy_payload *payload;
+	sector_t bs_mask, total_len = 0;
+	int i, ret, payload_size;
+
+	if (!q)
+		return -ENXIO;
+
+	if (!nr_srcs)
+		return -EINVAL;
+
+	if (bdev_read_only(src_bdev))
+		return -EPERM;
+
+	bs_mask = (bdev_logical_block_size(src_bdev) >> 9) - 1;
+
+	payload_size = struct_size(payload, range, nr_srcs);
+	payload = kmalloc(payload_size, gfp_mask);
+	if (!payload)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_srcs; i++) {
+		if (rlist[i].src & bs_mask || rlist[i].len & bs_mask) {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		payload->range[i].src = rlist[i].src;
+		payload->range[i].len = rlist[i].len;
+
+		total_len += rlist[i].len;
+	}
+
+	payload->copy_nr_ranges = i;
+	payload->src_bdev = src_bdev;
+	*copy_size = total_len;
+
+	*payload_p = payload;
+	return 0;
+err:
+	kfree(payload);
+	return ret;
+}
+
+int blk_copy_emulate(struct block_device *src_bdev, struct blk_copy_payload *payload,
+			struct block_device *dest_bdev, sector_t dest,
+			sector_t copy_size, gfp_t gfp_mask)
+{
+	void *buf = NULL;
+	int ret;
+
+	ret = blk_read_to_buf(src_bdev, payload, gfp_mask, copy_size, &buf);
+	if (ret)
+		goto out;
+
+	ret = blk_write_from_buf(dest_bdev, buf, dest, copy_size, gfp_mask);
+	if (buf)
+		kvfree(buf);
+out:
+	return ret;
+}
+
+/**
+ * blkdev_issue_copy - queue a copy
+ * @src_bdev:	source block device
+ * @nr_srcs:	number of source ranges to copy
+ * @rlist:	array of source ranges in sector
+ * @dest_bdev:	destination block device
+ * @dest:	destination in sector
+ * @gfp_mask:   memory allocation flags (for bio_alloc)
+ * @flags:	BLKDEV_COPY_* flags to control behaviour
+ *
+ * Description:
+ *	Copy array of source ranges from source block device to
+ *	destination block devcie. All source must belong to same bdev and
+ *	length of a source range cannot be zero.
+ */
+
+int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
+		struct range_entry *src_rlist, struct block_device *dest_bdev,
+		sector_t dest, gfp_t gfp_mask, int flags)
+{
+	struct request_queue *q = bdev_get_queue(src_bdev);
+	struct request_queue *dest_q = bdev_get_queue(dest_bdev);
+	struct blk_copy_payload *payload;
+	sector_t bs_mask, copy_size;
+	int ret;
+
+	ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
+			&payload, &copy_size);
+	if (ret)
+		return ret;
+
+	bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
+	if (dest & bs_mask) {
+		return -EINVAL;
+		goto out;
+	}
+
+	if (q == dest_q && q->limits.copy_offload) {
+		ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
+		if (ret)
+			goto out;
+	} else if (flags & BLKDEV_COPY_NOEMULATION) {
+		ret = -EIO;
+		goto out;
+	} else
+		ret = blk_copy_emulate(src_bdev, payload, dest_bdev, dest,
+				copy_size, gfp_mask);
+
+out:
+	kvfree(payload);
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_copy);
+
 /**
  * __blkdev_issue_write_same - generate number of bios with same page
  * @bdev:	target blockdev
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 808768f6b174..4e04f24e13c1 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -309,6 +309,8 @@ void __blk_queue_split(struct bio **bio, unsigned int *nr_segs)
 	struct bio *split = NULL;
 
 	switch (bio_op(*bio)) {
+	case REQ_OP_COPY:
+			break;
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 		split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 43990b1d148b..93c15ba45a69 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -60,6 +60,10 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->io_opt = 0;
 	lim->misaligned = 0;
 	lim->zoned = BLK_ZONED_NONE;
+	lim->copy_offload = 0;
+	lim->max_copy_sectors = 0;
+	lim->max_copy_nr_ranges = 0;
+	lim->max_copy_range_sectors = 0;
 }
 EXPORT_SYMBOL(blk_set_default_limits);
 
@@ -565,6 +569,12 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	if (b->chunk_sectors)
 		t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
 
+	/* simple copy not supported in stacked devices */
+	t->copy_offload = 0;
+	t->max_copy_sectors = 0;
+	t->max_copy_range_sectors = 0;
+	t->max_copy_nr_ranges = 0;
+
 	/* Physical block size a multiple of the logical block size? */
 	if (t->physical_block_size & (t->logical_block_size - 1)) {
 		t->physical_block_size = t->logical_block_size;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b513f1683af0..625a72541263 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -166,6 +166,44 @@ static ssize_t queue_discard_granularity_show(struct request_queue *q, char *pag
 	return queue_var_show(q->limits.discard_granularity, page);
 }
 
+static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->limits.copy_offload, page);
+}
+
+static ssize_t queue_copy_offload_store(struct request_queue *q,
+				       const char *page, size_t count)
+{
+	unsigned long copy_offload;
+	ssize_t ret = queue_var_store(&copy_offload, page, count);
+
+	if (ret < 0)
+		return ret;
+
+	if (copy_offload && q->limits.max_copy_sectors == 0)
+		return -EINVAL;
+
+	q->limits.copy_offload = copy_offload;
+	return ret;
+}
+
+static ssize_t queue_max_copy_sectors_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->limits.max_copy_sectors, page);
+}
+
+static ssize_t queue_max_copy_range_sectors_show(struct request_queue *q,
+		char *page)
+{
+	return queue_var_show(q->limits.max_copy_range_sectors, page);
+}
+
+static ssize_t queue_max_copy_nr_ranges_show(struct request_queue *q,
+		char *page)
+{
+	return queue_var_show(q->limits.max_copy_nr_ranges, page);
+}
+
 static ssize_t queue_discard_max_hw_show(struct request_queue *q, char *page)
 {
 
@@ -591,6 +629,11 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
 QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
 QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
 
+QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
+QUEUE_RO_ENTRY(queue_max_copy_sectors, "max_copy_sectors");
+QUEUE_RO_ENTRY(queue_max_copy_range_sectors, "max_copy_range_sectors");
+QUEUE_RO_ENTRY(queue_max_copy_nr_ranges, "max_copy_nr_ranges");
+
 QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
 QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
 QUEUE_RW_ENTRY(queue_poll, "io_poll");
@@ -636,6 +679,10 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_max_entry.attr,
 	&queue_discard_max_hw_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
+	&queue_copy_offload_entry.attr,
+	&queue_max_copy_sectors_entry.attr,
+	&queue_max_copy_range_sectors_entry.attr,
+	&queue_max_copy_nr_ranges_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
 	&queue_zone_append_max_entry.attr,
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 7a68b6e4300c..02069178d51e 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -75,6 +75,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE_SAME:
 	case REQ_OP_WRITE:
+	case REQ_OP_COPY:
 		return blk_rq_zone_is_seq(rq);
 	default:
 		return false;
diff --git a/block/bounce.c b/block/bounce.c
index d3f51acd6e3b..5e052afe8691 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -254,6 +254,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
 	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
 
 	switch (bio_op(bio)) {
+	case REQ_OP_COPY:
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
diff --git a/block/ioctl.c b/block/ioctl.c
index d61d652078f4..0e52181657a4 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -133,6 +133,37 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 				    GFP_KERNEL, flags);
 }
 
+static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
+		unsigned long arg, unsigned long flags)
+{
+	struct copy_range crange;
+	struct range_entry *rlist;
+	int ret;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (copy_from_user(&crange, (void __user *)arg, sizeof(crange)))
+		return -EFAULT;
+
+	rlist = kmalloc_array(crange.nr_range, sizeof(*rlist),
+			GFP_KERNEL);
+	if (!rlist)
+		return -ENOMEM;
+
+	if (copy_from_user(rlist, (void __user *)crange.range_list,
+				sizeof(*rlist) * crange.nr_range)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = blkdev_issue_copy(bdev, crange.nr_range, rlist, bdev, crange.dest,
+			GFP_KERNEL, flags);
+out:
+	kfree(rlist);
+	return ret;
+}
+
 static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
 		unsigned long arg)
 {
@@ -458,6 +489,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
 	case BLKSECDISCARD:
 		return blk_ioctl_discard(bdev, mode, arg,
 				BLKDEV_DISCARD_SECURE);
+	case BLKCOPY:
+		return blk_ioctl_copy(bdev, mode, arg, 0);
 	case BLKZEROOUT:
 		return blk_ioctl_zeroout(bdev, mode, arg);
 	case BLKREPORTZONE:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1edda614f7ce..164313bdfb35 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -71,6 +71,7 @@ static inline bool bio_has_data(struct bio *bio)
 static inline bool bio_no_advance_iter(const struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
+	       bio_op(bio) == REQ_OP_COPY ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
 	       bio_op(bio) == REQ_OP_WRITE_SAME ||
 	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 866f74261b3b..5a35c02ac0a8 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -380,6 +380,8 @@ enum req_opf {
 	REQ_OP_ZONE_RESET	= 15,
 	/* reset all the zone present on the device */
 	REQ_OP_ZONE_RESET_ALL	= 17,
+	/* copy ranges within device */
+	REQ_OP_COPY		= 19,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
@@ -506,6 +508,11 @@ static inline bool op_is_discard(unsigned int op)
 	return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
 }
 
+static inline bool op_is_copy(unsigned int op)
+{
+	return (op & REQ_OP_MASK) == REQ_OP_COPY;
+}
+
 /*
  * Check if a bio or request operation is a zone management operation, with
  * the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
@@ -565,4 +572,11 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+struct blk_copy_payload {
+	sector_t	dest;
+	int		copy_nr_ranges;
+	struct block_device *src_bdev;
+	struct	range_entry	range[];
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 699ace6b25ff..2bb4513d4bb8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -337,10 +337,14 @@ struct queue_limits {
 	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
+	unsigned int		copy_offload;
+	unsigned int		max_copy_sectors;
 
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;
 	unsigned short		max_discard_segments;
+	unsigned short		max_copy_range_sectors;
+	unsigned short		max_copy_nr_ranges;
 
 	unsigned char		misaligned;
 	unsigned char		discard_misaligned;
@@ -621,6 +625,7 @@ struct request_queue {
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
 #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
+#define QUEUE_FLAG_SIMPLE_COPY	30	/* supports simple copy */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP) |		\
@@ -643,6 +648,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
 #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
 #define blk_queue_discard(q)	test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
+#define blk_queue_copy(q)	test_bit(QUEUE_FLAG_SIMPLE_COPY, &(q)->queue_flags)
 #define blk_queue_zone_resetall(q)	\
 	test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
 #define blk_queue_secure_erase(q) \
@@ -1069,6 +1075,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
 
+	if (unlikely(op == REQ_OP_COPY))
+		return q->limits.max_copy_sectors;
+
 	if (unlikely(op == REQ_OP_WRITE_SAME))
 		return q->limits.max_write_same_sectors;
 
@@ -1343,6 +1352,12 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, int flags,
 		struct bio **biop);
 
+#define BLKDEV_COPY_NOEMULATION	(1 << 0)	/* do not emulate if copy offload not supported */
+
+extern int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
+		struct range_entry *src_rlist, struct block_device *dest_bdev,
+		sector_t dest, gfp_t gfp_mask, int flags);
+
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f44eb0a04afd..5cadb176317a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,18 @@ struct fstrim_range {
 	__u64 minlen;
 };
 
+struct range_entry {
+	__u64 src;
+	__u64 len;
+};
+
+struct copy_range {
+	__u64 dest;
+	__u64 nr_range;
+	__u64 range_list;
+	__u64 rsvd;
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
@@ -184,6 +196,7 @@ struct fsxattr {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+#define BLKCOPY _IOWR(0x12, 128, struct copy_range)
 /*
  * A jump here: 130-131 are reserved for zoned block devices
  * (see uapi/linux/blkzoned.h)
-- 
2.25.1


* [RFC PATCH v5 3/4] nvme: add simple copy support
       [not found]   ` <CGME20210219124608epcas5p2a673f9e00c3e7b5352f115497b0e2d98@epcas5p2.samsung.com>
@ 2021-02-19 12:45     ` SelvaKumar S
  2021-02-20  3:36       ` Matthew Wilcox
  0 siblings, 1 reply; 29+ messages in thread
From: SelvaKumar S @ 2021-02-19 12:45 UTC (permalink / raw)
  To: linux-nvme
  Cc: axboe, damien.lemoal, kch, SelvaKumar S, sagi, snitzer,
	selvajove, linux-kernel, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, kbusch, joshiiitr, hch

Add support for TP 4065a ("Simple Copy Command"), v2020.05.04
("Ratified").

For devices supporting native simple copy, this implementation accepts
the payload passed from the block layer, converts it into a simple copy
command, and submits it to the device.

Set the device copy limits in the queue limits. By default, copy_offload
is disabled.

End-to-end protection is done by setting both PRINFOR and PRINFOW
to 0.
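
As a worked example of the sector-to-LBA conversion done in
nvme_setup_copy() (assuming, hypothetically, a 4096-byte LBA format,
i.e. ns->lba_shift = 12, so 512-byte sector values are shifted down by
lba_shift - 9 = 3): a source range of src = 64, len = 16 becomes

	slba = 64 >> 3 = 8
	nlb  = (16 >> 3) - 1 = 1	/* 0's based, i.e. 2 LBAs */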

Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
---
 drivers/nvme/host/core.c | 87 ++++++++++++++++++++++++++++++++++++++++
 include/linux/nvme.h     | 43 ++++++++++++++++++--
 2 files changed, 127 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f13eb4ded95f..ba4de2f36cd5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -706,6 +706,63 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline blk_status_t nvme_setup_copy(struct nvme_ns *ns,
+	       struct request *req, struct nvme_command *cmnd)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct nvme_copy_range *range = NULL;
+	struct blk_copy_payload *payload;
+	unsigned short nr_range = 0;
+	u16 control = 0, ssrl;
+	u32 dsmgmt = 0;
+	u64 slba;
+	int i;
+
+	payload = bio_data(req->bio);
+	nr_range = payload->copy_nr_ranges;
+
+	if (req->cmd_flags & REQ_FUA)
+		control |= NVME_RW_FUA;
+
+	if (req->cmd_flags & REQ_FAILFAST_DEV)
+		control |= NVME_RW_LR;
+
+	cmnd->copy.opcode = nvme_cmd_copy;
+	cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+	cmnd->copy.sdlba = cpu_to_le64(blk_rq_pos(req) >> (ns->lba_shift - 9));
+
+	range = kmalloc_array(nr_range, sizeof(*range),
+			GFP_ATOMIC | __GFP_NOWARN);
+	if (!range)
+		return BLK_STS_RESOURCE;
+
+	for (i = 0; i < nr_range; i++) {
+		slba = payload->range[i].src;
+		slba = slba >> (ns->lba_shift - 9);
+
+		ssrl = payload->range[i].len;
+		ssrl = ssrl >> (ns->lba_shift - 9);
+
+		range[i].slba = cpu_to_le64(slba);
+		range[i].nlb = cpu_to_le16(ssrl - 1);
+	}
+
+	cmnd->copy.nr_range = nr_range - 1;
+
+	req->special_vec.bv_page = virt_to_page(range);
+	req->special_vec.bv_offset = offset_in_page(range);
+	req->special_vec.bv_len = sizeof(*range) * nr_range;
+	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+	if (ctrl->nr_streams)
+		nvme_assign_write_stream(ctrl, req, &control, &dsmgmt);
+
+	cmnd->rw.control = cpu_to_le16(control);
+	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
+
+	return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *cmnd)
 {
@@ -888,6 +945,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
 	case REQ_OP_DISCARD:
 		ret = nvme_setup_discard(ns, req, cmd);
 		break;
+	case REQ_OP_COPY:
+		ret = nvme_setup_copy(ns, req, cmd);
+		break;
 	case REQ_OP_READ:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
 		break;
@@ -1928,6 +1988,31 @@ static void nvme_config_discard(struct gendisk *disk, struct nvme_ns *ns)
 		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+				       struct nvme_id_ns *id)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct request_queue *queue = disk->queue;
+
+	if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
+		queue->limits.copy_offload = 0;
+		queue->limits.max_copy_sectors = 0;
+		queue->limits.max_copy_range_sectors = 0;
+		queue->limits.max_copy_nr_ranges = 0;
+		blk_queue_flag_clear(QUEUE_FLAG_SIMPLE_COPY, queue);
+		return;
+	}
+
+	/* setting copy limits */
+	blk_queue_flag_test_and_set(QUEUE_FLAG_SIMPLE_COPY, queue);
+	queue->limits.copy_offload = 0;
+	queue->limits.max_copy_sectors = le64_to_cpu(id->mcl) *
+		(1 << (ns->lba_shift - 9));
+	queue->limits.max_copy_range_sectors = le32_to_cpu(id->mssrl) *
+		(1 << (ns->lba_shift - 9));
+	queue->limits.max_copy_nr_ranges = id->msrc + 1;
+}
+
 static void nvme_config_write_zeroes(struct gendisk *disk, struct nvme_ns *ns)
 {
 	u64 max_blocks;
@@ -2123,6 +2208,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	set_capacity_and_notify(disk, capacity);
 
 	nvme_config_discard(disk, ns);
+	nvme_config_copy(disk, ns, id);
 	nvme_config_write_zeroes(disk, ns);
 
 	if ((id->nsattr & NVME_NS_ATTR_RO) ||
@@ -4705,6 +4791,7 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_copy_command) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64);
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index bfed36e342cc..c36e486cbe18 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -295,7 +295,7 @@ struct nvme_id_ctrl {
 	__u8			nvscc;
 	__u8			nwpc;
 	__le16			acwu;
-	__u8			rsvd534[2];
+	__le16			ocfs;
 	__le32			sgls;
 	__le32			mnan;
 	__u8			rsvd544[224];
@@ -320,6 +320,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_ONCS_RESERVATIONS		= 1 << 5,
 	NVME_CTRL_ONCS_TIMESTAMP		= 1 << 6,
+	NVME_CTRL_ONCS_COPY			= 1 << 8,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
 	NVME_CTRL_OACS_DIRECTIVES		= 1 << 5,
@@ -368,7 +369,10 @@ struct nvme_id_ns {
 	__le16			npdg;
 	__le16			npda;
 	__le16			nows;
-	__u8			rsvd74[18];
+	__le16			mssrl;
+	__le32			mcl;
+	__u8			msrc;
+	__u8			rsvd91[11];
 	__le32			anagrpid;
 	__u8			rsvd96[3];
 	__u8			nsattr;
@@ -679,6 +683,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
 	nvme_cmd_resv_release	= 0x15,
+	nvme_cmd_copy		= 0x19,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
 	nvme_cmd_zone_append	= 0x7d,
@@ -697,7 +702,8 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_register),	\
 		nvme_opcode_name(nvme_cmd_resv_report),		\
 		nvme_opcode_name(nvme_cmd_resv_acquire),	\
-		nvme_opcode_name(nvme_cmd_resv_release))
+		nvme_opcode_name(nvme_cmd_resv_release),	\
+		nvme_opcode_name(nvme_cmd_copy))
 
 
 /*
@@ -869,6 +875,36 @@ struct nvme_dsm_range {
 	__le64			slba;
 };
 
+struct nvme_copy_command {
+	__u8                    opcode;
+	__u8                    flags;
+	__u16                   command_id;
+	__le32                  nsid;
+	__u64                   rsvd2;
+	__le64                  metadata;
+	union nvme_data_ptr     dptr;
+	__le64                  sdlba;
+	__u8			nr_range;
+	__u8			rsvd12;
+	__le16                  control;
+	__le16                  rsvd13;
+	__le16			dspec;
+	__le32                  ilbrt;
+	__le16                  lbat;
+	__le16                  lbatm;
+};
+
+struct nvme_copy_range {
+	__le64			rsvd0;
+	__le64			slba;
+	__le16			nlb;
+	__le16			rsvd18;
+	__le32			rsvd20;
+	__le32			eilbrt;
+	__le16			elbat;
+	__le16			elbatm;
+};
+
 struct nvme_write_zeroes_cmd {
 	__u8			opcode;
 	__u8			flags;
@@ -1406,6 +1442,7 @@ struct nvme_command {
 		struct nvme_download_firmware dlfw;
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
+		struct nvme_copy_command copy;
 		struct nvme_write_zeroes_cmd write_zeroes;
 		struct nvme_zone_mgmt_send_cmd zms;
 		struct nvme_zone_mgmt_recv_cmd zmr;
-- 
2.25.1


* [RFC PATCH v5 4/4] dm kcopyd: add simple copy offload support
       [not found]   ` <CGME20210219124611epcas5p1c775b63b537e75da161556e375fcf05e@epcas5p1.samsung.com>
@ 2021-02-19 12:45     ` SelvaKumar S
  0 siblings, 0 replies; 29+ messages in thread
From: SelvaKumar S @ 2021-02-19 12:45 UTC (permalink / raw)
  To: linux-nvme
  Cc: axboe, damien.lemoal, kch, SelvaKumar S, sagi, snitzer,
	selvajove, linux-kernel, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, kbusch, joshiiitr, hch

Introduce copy_jobs to use copy offload if it is natively supported by
the underlying device, or else fall back to the original method.

run_copy_job() calls the block layer copy offload API with
BLKDEV_COPY_NOEMULATION. On successful completion, if only one
destination device was present, the job is queued for completion. If
multiple destinations were present, the completed destination's count is
zeroed and the job is pushed to pages_jobs to process the copy for the
other destinations. In case of a copy offload failure, the remaining
destinations are processed via the regular copying mechanism.

Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
---
 drivers/md/dm-kcopyd.c | 49 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 43 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 1bbe4a34ef4c..2442c4870e97 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -74,18 +74,20 @@ struct dm_kcopyd_client {
 	atomic_t nr_jobs;
 
 /*
- * We maintain four lists of jobs:
+ * We maintain five lists of jobs:
  *
- * i)   jobs waiting for pages
- * ii)  jobs that have pages, and are waiting for the io to be issued.
- * iii) jobs that don't need to do any IO and just run a callback
- * iv) jobs that have completed.
+ * i)	jobs waiting to try copy offload
+ * ii)   jobs waiting for pages
+ * iii)  jobs that have pages, and are waiting for the io to be issued.
+ * iv) jobs that don't need to do any IO and just run a callback
+ * v) jobs that have completed.
  *
- * All four of these are protected by job_lock.
+ * All five of these are protected by job_lock.
  */
 	spinlock_t job_lock;
 	struct list_head callback_jobs;
 	struct list_head complete_jobs;
+	struct list_head copy_jobs;
 	struct list_head io_jobs;
 	struct list_head pages_jobs;
 };
@@ -581,6 +583,36 @@ static int run_io_job(struct kcopyd_job *job)
 	return r;
 }
 
+static int run_copy_job(struct kcopyd_job *job)
+{
+	int r, i, count = 0;
+	unsigned long flags = 0;
+	struct range_entry srange;
+
+	flags |= BLKDEV_COPY_NOEMULATION;
+	for (i = 0; i < job->num_dests; i++) {
+		srange.src = job->source.sector;
+		srange.len = job->source.count;
+
+		r = blkdev_issue_copy(job->source.bdev, 1, &srange,
+			job->dests[i].bdev, job->dests[i].sector, GFP_KERNEL, flags);
+		if (r)
+			break;
+
+		job->dests[i].count = 0;
+		count++;
+	}
+
+	if (count == job->num_dests) {
+		push(&job->kc->complete_jobs, job);
+	} else {
+		push(&job->kc->pages_jobs, job);
+		r = 0;
+	}
+
+	return r;
+}
+
 static int run_pages_job(struct kcopyd_job *job)
 {
 	int r;
@@ -662,6 +694,7 @@ static void do_work(struct work_struct *work)
 	spin_unlock_irqrestore(&kc->job_lock, flags);
 
 	blk_start_plug(&plug);
+	process_jobs(&kc->copy_jobs, kc, run_copy_job);
 	process_jobs(&kc->complete_jobs, kc, run_complete_job);
 	process_jobs(&kc->pages_jobs, kc, run_pages_job);
 	process_jobs(&kc->io_jobs, kc, run_io_job);
@@ -679,6 +712,8 @@ static void dispatch_job(struct kcopyd_job *job)
 	atomic_inc(&kc->nr_jobs);
 	if (unlikely(!job->source.count))
 		push(&kc->callback_jobs, job);
+	else if (job->source.bdev->bd_disk == job->dests[0].bdev->bd_disk)
+		push(&kc->copy_jobs, job);
 	else if (job->pages == &zero_page_list)
 		push(&kc->io_jobs, job);
 	else
@@ -919,6 +954,7 @@ struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *thro
 	spin_lock_init(&kc->job_lock);
 	INIT_LIST_HEAD(&kc->callback_jobs);
 	INIT_LIST_HEAD(&kc->complete_jobs);
+	INIT_LIST_HEAD(&kc->copy_jobs);
 	INIT_LIST_HEAD(&kc->io_jobs);
 	INIT_LIST_HEAD(&kc->pages_jobs);
 	kc->throttle = throttle;
@@ -974,6 +1010,7 @@ void dm_kcopyd_client_destroy(struct dm_kcopyd_client *kc)
 
 	BUG_ON(!list_empty(&kc->callback_jobs));
 	BUG_ON(!list_empty(&kc->complete_jobs));
+	BUG_ON(!list_empty(&kc->copy_jobs));
 	BUG_ON(!list_empty(&kc->io_jobs));
 	BUG_ON(!list_empty(&kc->pages_jobs));
 	destroy_workqueue(kc->kcopyd_wq);
-- 
2.25.1


* Re: [RFC PATCH v5 3/4] nvme: add simple copy support
  2021-02-19 12:45     ` [RFC PATCH v5 3/4] nvme: " SelvaKumar S
@ 2021-02-20  3:36       ` Matthew Wilcox
  2021-02-22 15:57         ` Selva Jove
  0 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2021-02-20  3:36 UTC (permalink / raw)
  To: SelvaKumar S
  Cc: axboe, damien.lemoal, kch, sagi, snitzer, selvajove,
	linux-kernel, linux-nvme, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, kbusch, joshiiitr, hch

On Fri, Feb 19, 2021 at 06:15:16PM +0530, SelvaKumar S wrote:
> +	struct nvme_copy_range *range = NULL;
[...]
> +	range = kmalloc_array(nr_range, sizeof(*range),
> +			GFP_ATOMIC | __GFP_NOWARN);
[...]
> +	req->special_vec.bv_page = virt_to_page(range);
> +	req->special_vec.bv_offset = offset_in_page(range);
> +	req->special_vec.bv_len = sizeof(*range) * nr_range;
[...]
> +struct nvme_copy_range {
> +	__le64			rsvd0;
> +	__le64			slba;
> +	__le16			nlb;
> +	__le16			rsvd18;
> +	__le32			rsvd20;
> +	__le32			eilbrt;
> +	__le16			elbat;
> +	__le16			elbatm;
> +};

so ... at 32 bytes, you can get 128 per 4kB page.  What happens if you
try to send down a command that attempts to copy 129 ranges?

* Re: [RFC PATCH v5 2/4] block: add simple copy support
  2021-02-19 12:45     ` [RFC PATCH v5 2/4] block: add simple copy support SelvaKumar S
@ 2021-02-20  4:59       ` Damien Le Moal
  2021-04-07 11:32         ` Selva Jove
  0 siblings, 1 reply; 29+ messages in thread
From: Damien Le Moal @ 2021-02-20  4:59 UTC (permalink / raw)
  To: SelvaKumar S, linux-nvme
  Cc: axboe, kch, sagi, snitzer, selvajove, linux-kernel, nj.shetty,
	linux-block, linux-fsdevel, dm-devel, joshi.k, javier.gonz,
	kbusch, joshiiitr, hch

On 2021/02/20 11:01, SelvaKumar S wrote:
> Add a new BLKCOPY ioctl that offloads copying of one or more source
> ranges to a destination in the device. It accepts a 'copy_range'
> structure that contains the destination (in sectors), the number of
> sources, and a pointer to an array of source ranges. Each source range
> is represented by a 'range_entry' that contains the start and length
> of the source range (in sectors).
> 
> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create a bio
> with the control information as payload and submit it to the device.
> REQ_OP_COPY (19) is a write op and takes the zone write lock when
> submitted to a zoned device.
> 
> If the device doesn't support copy or copy offload is disabled, the
> copy operation is emulated by default. The caller can opt out of copy
> emulation by specifying the 'BLKDEV_COPY_NOEMULATION' flag.
> 
> Copy emulation is implemented by allocating memory of the total copy
> size. The source ranges are read into this memory by chaining a bio per
> source range, submitting them asynchronously, and waiting for
> completion on the last bio. After the data is read, it is written to
> the destination.
> 
> bio_map_kern() is used to allocate the bio and add the pages of the
> copy buffer to it. As bio->bi_private and bio->bi_end_io are needed for
> chaining and get overwritten, invalidate_kernel_vmap_range() for the
> read is called in the caller.
> 
> Introduce queue limits for simple copy and other helper functions.
> Add device limits as sysfs entries:
> 	- copy_offload
> 	- max_copy_sectors
> 	- max_copy_range_sectors
> 	- max_copy_nr_ranges
> 
> copy_offload (= 0) is disabled by default. It needs to be enabled if
> copy offload is to be used.
> max_copy_sectors = 0 indicates the device doesn't support native copy.
> 
> Native copy offload is not supported for stacked devices; for them,
> copy is done via emulation.
> 
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: Chaitanya Kulkarni <kch@kernel.org>
> ---
>  block/blk-core.c          | 102 ++++++++++++++++--
>  block/blk-lib.c           | 222 ++++++++++++++++++++++++++++++++++++++
>  block/blk-merge.c         |   2 +
>  block/blk-settings.c      |  10 ++
>  block/blk-sysfs.c         |  47 ++++++++
>  block/blk-zoned.c         |   1 +
>  block/bounce.c            |   1 +
>  block/ioctl.c             |  33 ++++++
>  include/linux/bio.h       |   1 +
>  include/linux/blk_types.h |  14 +++
>  include/linux/blkdev.h    |  15 +++
>  include/uapi/linux/fs.h   |  13 +++
>  12 files changed, 453 insertions(+), 8 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7663a9b94b80..23e646e5ae43 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -720,6 +720,17 @@ static noinline int should_fail_bio(struct bio *bio)
>  }
>  ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
>  
> +static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
> +		sector_t nr_sectors, sector_t max_sect)
> +{
> +	if (nr_sectors && max_sect &&
> +	    (nr_sectors > max_sect || start > max_sect - nr_sectors)) {
> +		handle_bad_sector(bio, max_sect);
> +		return -EIO;
> +	}
> +	return 0;
> +}
> +
>  /*
>   * Check whether this bio extends beyond the end of the device or partition.
>   * This may well happen - the kernel calls bread() without checking the size of
> @@ -738,6 +749,75 @@ static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
>  	return 0;
>  }
>  
> +/*
> + * Check for copy limits and remap source ranges if needed.
> + */
> +static int blk_check_copy(struct bio *bio)
> +{
> +	struct blk_copy_payload *payload = bio_data(bio);
> +	struct request_queue *q = bio->bi_disk->queue;
> +	sector_t max_sect, start_sect, copy_size = 0;
> +	sector_t src_max_sect, src_start_sect;
> +	struct block_device *bd_part;
> +	int i, ret = -EIO;
> +
> +	rcu_read_lock();
> +
> +	bd_part = __disk_get_part(bio->bi_disk, bio->bi_partno);
> +	if (unlikely(!bd_part)) {
> +		rcu_read_unlock();
> +		goto out;
> +	}
> +
> +	max_sect =  bdev_nr_sectors(bd_part);
> +	start_sect = bd_part->bd_start_sect;
> +
> +	src_max_sect = bdev_nr_sectors(payload->src_bdev);
> +	src_start_sect = payload->src_bdev->bd_start_sect;
> +
> +	if (unlikely(should_fail_request(bd_part, bio->bi_iter.bi_size)))
> +		goto out;
> +
> +	if (unlikely(bio_check_ro(bio, bd_part)))
> +		goto out;

There is no rcu_unlock() in that out label. Did you test ?

> +
> +	rcu_read_unlock();
> +
> +	/* cannot handle copy crossing nr_ranges limit */
> +	if (payload->copy_nr_ranges > q->limits.max_copy_nr_ranges)
> +		goto out;
> +
> +	for (i = 0; i < payload->copy_nr_ranges; i++) {
> +		ret = bio_check_copy_eod(bio, payload->range[i].src,
> +				payload->range[i].len, src_max_sect);
> +		if (unlikely(ret))
> +			goto out;
> +
> +		/* single source range length limit */
> +		if (payload->range[i].len > q->limits.max_copy_range_sectors)
> +			goto out;

ret is not set. You will return success with this.

> +
> +		payload->range[i].src += src_start_sect;
> +		copy_size += payload->range[i].len;
> +	}
> +
> +	/* check if copy length crosses eod */
> +	ret = bio_check_copy_eod(bio, bio->bi_iter.bi_sector,
> +				copy_size, max_sect);
> +	if (unlikely(ret))
> +		goto out;
> +
> +	/* cannot handle copy more than copy limits */
> +	if (copy_size > q->limits.max_copy_sectors)
> +		goto out;

Again ret is not set... No error return ?

> +
> +	bio->bi_iter.bi_sector += start_sect;
> +	bio->bi_partno = 0;
> +	ret = 0;
> +out:
> +	return ret;
> +}
> +
>  /*
>   * Remap block n of partition p to block n+start(p) of the disk.
>   */
> @@ -827,14 +907,16 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  	if (should_fail_bio(bio))
>  		goto end_io;
>  
> -	if (bio->bi_partno) {
> -		if (unlikely(blk_partition_remap(bio)))
> -			goto end_io;
> -	} else {
> -		if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
> -			goto end_io;
> -		if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
> -			goto end_io;
> +	if (likely(!op_is_copy(bio->bi_opf))) {
> +		if (bio->bi_partno) {
> +			if (unlikely(blk_partition_remap(bio)))
> +				goto end_io;
> +		} else {
> +			if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
> +				goto end_io;
> +			if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
> +				goto end_io;
> +		}
>  	}
>  
>  	/*
> @@ -858,6 +940,10 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  		if (!blk_queue_discard(q))
>  			goto not_supported;
>  		break;
> +	case REQ_OP_COPY:
> +		if (unlikely(blk_check_copy(bio)))
> +			goto end_io;
> +		break;
>  	case REQ_OP_SECURE_ERASE:
>  		if (!blk_queue_secure_erase(q))
>  			goto not_supported;
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 752f9c722062..97ba58d8d9a1 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -150,6 +150,228 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  }
>  EXPORT_SYMBOL(blkdev_issue_discard);
>  
> +int blk_copy_offload(struct block_device *dest_bdev, struct blk_copy_payload *payload,
> +		sector_t dest, gfp_t gfp_mask)

Simple copy is only over the same device, right ? So the name "dest_bdev" is a
little strange.

> +{
> +	struct request_queue *q = bdev_get_queue(dest_bdev);
> +	struct bio *bio;
> +	int ret, payload_size;
> +
> +	payload_size = struct_size(payload, range, payload->copy_nr_ranges);
> +	bio = bio_map_kern(q, payload, payload_size, gfp_mask);
> +	if (IS_ERR(bio)) {
> +		ret = PTR_ERR(bio);
> +		goto err;

This will do a bio_put() on a non existent bio...

> +	}
> +
> +	bio->bi_iter.bi_sector = dest;
> +	bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
> +	bio_set_dev(bio, dest_bdev);
> +
> +	ret = submit_bio_wait(bio);
> +err:
> +	bio_put(bio);
> +	return ret;
> +}
> +
> +int blk_read_to_buf(struct block_device *src_bdev, struct blk_copy_payload *payload,
> +		gfp_t gfp_mask, sector_t copy_size, void **buf_p)
> +{
> +	struct request_queue *q = bdev_get_queue(src_bdev);
> +	struct bio *bio, *parent = NULL;
> +	void *buf = NULL;
> +	int copy_len = copy_size << SECTOR_SHIFT;
> +	int i, nr_srcs, ret, cur_size, t_len = 0;
> +	bool is_vmalloc;
> +
> +	nr_srcs = payload->copy_nr_ranges;
> +
> +	buf = kvmalloc(copy_len, gfp_mask);
> +	if (!buf)
> +		return -ENOMEM;
> +	is_vmalloc = is_vmalloc_addr(buf);
> +
> +	for (i = 0; i < nr_srcs; i++) {
> +		cur_size = payload->range[i].len << SECTOR_SHIFT;
> +
> +		bio = bio_map_kern(q, buf + t_len, cur_size, gfp_mask);
> +		if (IS_ERR(bio)) {
> +			ret = PTR_ERR(bio);
> +			goto out;
> +		}
> +
> +		bio->bi_iter.bi_sector = payload->range[i].src;
> +		bio->bi_opf = REQ_OP_READ;
> +		bio_set_dev(bio, src_bdev);
> +		bio->bi_end_io = NULL;
> +		bio->bi_private = NULL;
> +
> +		if (parent) {
> +			bio_chain(parent, bio);
> +			submit_bio(parent);
> +		}
> +
> +		parent = bio;
> +		t_len += cur_size;
> +	}
> +
> +	ret = submit_bio_wait(bio);
> +	bio_put(bio);
> +	if (is_vmalloc)
> +		invalidate_kernel_vmap_range(buf, copy_len);

But blk_write_from_buf() will use the buffer right after this.. Is this really OK ?

> +	if (ret)
> +		goto out;
> +
> +	*buf_p = buf;
> +	return 0;
> +out:
> +	kvfree(buf);
> +	return ret;
> +}
> +
> +int blk_write_from_buf(struct block_device *dest_bdev, void *buf, sector_t dest,
> +		sector_t copy_size, gfp_t gfp_mask)
> +{
> +	struct request_queue *q = bdev_get_queue(dest_bdev);
> +	struct bio *bio;
> +	int ret, copy_len = copy_size << SECTOR_SHIFT;
> +
> +	bio = bio_map_kern(q, buf, copy_len, gfp_mask);
> +	if (IS_ERR(bio)) {
> +		ret = PTR_ERR(bio);
> +		goto out;
> +	}
> +	bio_set_dev(bio, dest_bdev);
> +	bio->bi_opf = REQ_OP_WRITE;
> +	bio->bi_iter.bi_sector = dest;
> +
> +	bio->bi_end_io = NULL;
> +	ret = submit_bio_wait(bio);
> +	bio_put(bio);
> +out:
> +	return ret;
> +}
> +
> +int blk_prepare_payload(struct block_device *src_bdev, int nr_srcs, struct range_entry *rlist,
> +		gfp_t gfp_mask, struct blk_copy_payload **payload_p, sector_t *copy_size)
> +{
> +
> +	struct request_queue *q = bdev_get_queue(src_bdev);
> +	struct blk_copy_payload *payload;
> +	sector_t bs_mask, total_len = 0;
> +	int i, ret, payload_size;
> +
> +	if (!q)
> +		return -ENXIO;
> +
> +	if (!nr_srcs)
> +		return -EINVAL;
> +
> +	if (bdev_read_only(src_bdev))
> +		return -EPERM;
> +
> +	bs_mask = (bdev_logical_block_size(src_bdev) >> 9) - 1;
> +
> +	payload_size = struct_size(payload, range, nr_srcs);
> +	payload = kmalloc(payload_size, gfp_mask);
> +	if (!payload)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < nr_srcs; i++) {
> +		if (rlist[i].src & bs_mask || rlist[i].len & bs_mask) {
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +
> +		payload->range[i].src = rlist[i].src;
> +		payload->range[i].len = rlist[i].len;
> +
> +		total_len += rlist[i].len;
> +	}
> +
> +	payload->copy_nr_ranges = i;
> +	payload->src_bdev = src_bdev;
> +	*copy_size = total_len;
> +
> +	*payload_p = payload;
> +	return 0;
> +err:
> +	kfree(payload);
> +	return ret;
> +}
> +
> +int blk_copy_emulate(struct block_device *src_bdev, struct blk_copy_payload *payload,
> +			struct block_device *dest_bdev, sector_t dest,
> +			sector_t copy_size, gfp_t gfp_mask)
> +{
> +	void *buf = NULL;
> +	int ret;
> +
> +	ret = blk_read_to_buf(src_bdev, payload, gfp_mask, copy_size, &buf);
> +	if (ret)
> +		goto out;
> +
> +	ret = blk_write_from_buf(dest_bdev, buf, dest, copy_size, gfp_mask);
> +	if (buf)
> +		kvfree(buf);
> +out:
> +	return ret;
> +}

I already commented that this should rather use the dm-kcopyd design, which
would be far more efficient than this. This will be slow...

Your function blkdev_issue_copy() below should deal only with issuing simple
copy (and later scsi xcopy) for devices that support it. Bring the dm-kcopyd
interface into the block layer as a generic interface for handling emulation.
Otherwise you are repeating what dm does, but not as efficiently.
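
To illustrate the direction (an untested sketch, not dm-kcopyd itself): a
bounded bounce buffer, with hypothetical blk_read_chunk()/blk_write_chunk()
helpers and an arbitrary BLK_COPY_CHUNK_SECTORS cap, would at least bound the
allocation; dm-kcopyd additionally pipelines the reads and writes, which this
does not:

	sector_t max = min_t(sector_t, copy_size, BLK_COPY_CHUNK_SECTORS);
	void *buf = kvmalloc(max << SECTOR_SHIFT, gfp_mask);
	int i, ret = 0;

	if (!buf)
		return -ENOMEM;

	for (i = 0; i < payload->copy_nr_ranges; i++) {
		sector_t src = payload->range[i].src;
		sector_t left = payload->range[i].len;

		while (left) {
			sector_t chunk = min(left, max);

			/* synchronous copy of one chunk: read, then write */
			ret = blk_read_chunk(src_bdev, buf, src, chunk, gfp_mask);
			if (!ret)
				ret = blk_write_chunk(dest_bdev, buf, dest,
						      chunk, gfp_mask);
			if (ret)
				goto out;
			src += chunk;
			dest += chunk;
			left -= chunk;
		}
	}
out:
	kvfree(buf);
	return ret;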

> +
> +/**
> + * blkdev_issue_copy - queue a copy
> + * @src_bdev:	source block device
> + * @nr_srcs:	number of source ranges to copy
> + * @rlist:	array of source ranges in sector
> + * @dest_bdev:	destination block device
> + * @dest:	destination in sector
> + * @gfp_mask:   memory allocation flags (for bio_alloc)
> + * @flags:	BLKDEV_COPY_* flags to control behaviour
> + *
> + * Description:
> + *	Copy array of source ranges from source block device to
> + *	destination block devcie. All source must belong to same bdev and
> + *	length of a source range cannot be zero.
> + */
> +
> +int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
> +		struct range_entry *src_rlist, struct block_device *dest_bdev,
> +		sector_t dest, gfp_t gfp_mask, int flags)
> +{
> +	struct request_queue *q = bdev_get_queue(src_bdev);
> +	struct request_queue *dest_q = bdev_get_queue(dest_bdev);
> +	struct blk_copy_payload *payload;
> +	sector_t bs_mask, copy_size;
> +	int ret;
> +
> +	ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
> +			&payload, &copy_size);
> +	if (ret)
> +		return ret;
> +
> +	bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
> +	if (dest & bs_mask) {
> +		return -EINVAL;
> +		goto out;
> +	}
> +
> +	if (q == dest_q && q->limits.copy_offload) {
> +		ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
> +		if (ret)
> +			goto out;
> +	} else if (flags & BLKDEV_COPY_NOEMULATION) {

Why ? Whoever calls blkdev_issue_copy() wants a copy to be done. Why would that
user say "fail on me if the device does not support copy" ??? This is a weird
interface in my opinion.

> +		ret = -EIO;
> +		goto out;
> +	} else

Missing braces. But you do not need all these elses after the gotos anyway.
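
i.e. something like (untested):

	if (q == dest_q && q->limits.copy_offload) {
		ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
		goto out;
	}

	if (flags & BLKDEV_COPY_NOEMULATION) {
		ret = -EIO;
		goto out;
	}

	ret = blk_copy_emulate(src_bdev, payload, dest_bdev, dest,
			       copy_size, gfp_mask);
out:
	kvfree(payload);
	return ret;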

> +		ret = blk_copy_emulate(src_bdev, payload, dest_bdev, dest,
> +				copy_size, gfp_mask);
> +
> +out:
> +	kvfree(payload);
> +	return ret;
> +}
> +EXPORT_SYMBOL(blkdev_issue_copy);
> +
>  /**
>   * __blkdev_issue_write_same - generate number of bios with same page
>   * @bdev:	target blockdev
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 808768f6b174..4e04f24e13c1 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -309,6 +309,8 @@ void __blk_queue_split(struct bio **bio, unsigned int *nr_segs)
>  	struct bio *split = NULL;
>  
>  	switch (bio_op(*bio)) {
> +	case REQ_OP_COPY:
> +			break;

Why would this even be called ? Copy BIOs cannot be split, right ?

>  	case REQ_OP_DISCARD:
>  	case REQ_OP_SECURE_ERASE:
>  		split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 43990b1d148b..93c15ba45a69 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -60,6 +60,10 @@ void blk_set_default_limits(struct queue_limits *lim)
>  	lim->io_opt = 0;
>  	lim->misaligned = 0;
>  	lim->zoned = BLK_ZONED_NONE;
> +	lim->copy_offload = 0;
> +	lim->max_copy_sectors = 0;
> +	lim->max_copy_nr_ranges = 0;
> +	lim->max_copy_range_sectors = 0;
>  }
>  EXPORT_SYMBOL(blk_set_default_limits);
>  
> @@ -565,6 +569,12 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  	if (b->chunk_sectors)
>  		t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
>  
> +	/* simple copy not supported in stacked devices */
> +	t->copy_offload = 0;
> +	t->max_copy_sectors = 0;
> +	t->max_copy_range_sectors = 0;
> +	t->max_copy_nr_ranges = 0;

You do not need this. Limits not explicitly initialized are 0 already.
But I do not see why you can't support copy on stacked devices. That should be
feasible by taking the min() of each of the above limits.
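
Something like this (untested, and assuming the stacking defaults are seeded
with UINT_MAX as blk_set_stacking_limits() does for the other limits):

	t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
	t->max_copy_range_sectors = min(t->max_copy_range_sectors,
					b->max_copy_range_sectors);
	t->max_copy_nr_ranges = min(t->max_copy_nr_ranges,
				    b->max_copy_nr_ranges);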

> +
>  	/* Physical block size a multiple of the logical block size? */
>  	if (t->physical_block_size & (t->logical_block_size - 1)) {
>  		t->physical_block_size = t->logical_block_size;
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index b513f1683af0..625a72541263 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -166,6 +166,44 @@ static ssize_t queue_discard_granularity_show(struct request_queue *q, char *pag
>  	return queue_var_show(q->limits.discard_granularity, page);
>  }
>  
> +static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
> +{
> +	return queue_var_show(q->limits.copy_offload, page);
> +}
> +
> +static ssize_t queue_copy_offload_store(struct request_queue *q,
> +				       const char *page, size_t count)
> +{
> +	unsigned long copy_offload;
> +	ssize_t ret = queue_var_store(&copy_offload, page, count);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	if (copy_offload && q->limits.max_copy_sectors == 0)
> +		return -EINVAL;
> +
> +	q->limits.copy_offload = copy_offload;
> +	return ret;
> +}

This is weird. If you want to allow a user to disable copy offload, then use
max_copy_sectors. This one should be read-only and only indicate if the device
supports it or not. I would actually change this one into
max_copy_hw_sectors, immutable, indicating the max copy sectors that the device
supports, and 0 for no support. That would allow an easy implementation of
max_copy_sectors being read/write for controlling enable/disable.
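
With that scheme, the store side becomes trivial (untested sketch, where
max_copy_hw_sectors is the suggested new, driver-set, immutable limit):

static ssize_t queue_max_copy_sectors_store(struct request_queue *q,
					    const char *page, size_t count)
{
	unsigned long max_copy;
	ssize_t ret = queue_var_store(&max_copy, page, count);

	if (ret < 0)
		return ret;

	/* clamp to the hardware limit; 0 disables copy offload */
	if (max_copy > q->limits.max_copy_hw_sectors)
		max_copy = q->limits.max_copy_hw_sectors;

	q->limits.max_copy_sectors = max_copy;
	return ret;
}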

> +
> +static ssize_t queue_max_copy_sectors_show(struct request_queue *q, char *page)
> +{
> +	return queue_var_show(q->limits.max_copy_sectors, page);
> +}
> +
> +static ssize_t queue_max_copy_range_sectors_show(struct request_queue *q,
> +		char *page)
> +{
> +	return queue_var_show(q->limits.max_copy_range_sectors, page);
> +}
> +
> +static ssize_t queue_max_copy_nr_ranges_show(struct request_queue *q,
> +		char *page)
> +{
> +	return queue_var_show(q->limits.max_copy_nr_ranges, page);
> +}
> +
>  static ssize_t queue_discard_max_hw_show(struct request_queue *q, char *page)
>  {
>  
> @@ -591,6 +629,11 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
>  QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
>  QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>  
> +QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
> +QUEUE_RO_ENTRY(queue_max_copy_sectors, "max_copy_sectors");
> +QUEUE_RO_ENTRY(queue_max_copy_range_sectors, "max_copy_range_sectors");
> +QUEUE_RO_ENTRY(queue_max_copy_nr_ranges, "max_copy_nr_ranges");
> +
>  QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
>  QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
>  QUEUE_RW_ENTRY(queue_poll, "io_poll");
> @@ -636,6 +679,10 @@ static struct attribute *queue_attrs[] = {
>  	&queue_discard_max_entry.attr,
>  	&queue_discard_max_hw_entry.attr,
>  	&queue_discard_zeroes_data_entry.attr,
> +	&queue_copy_offload_entry.attr,
> +	&queue_max_copy_sectors_entry.attr,
> +	&queue_max_copy_range_sectors_entry.attr,
> +	&queue_max_copy_nr_ranges_entry.attr,
>  	&queue_write_same_max_entry.attr,
>  	&queue_write_zeroes_max_entry.attr,
>  	&queue_zone_append_max_entry.attr,
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 7a68b6e4300c..02069178d51e 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -75,6 +75,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
>  	case REQ_OP_WRITE_ZEROES:
>  	case REQ_OP_WRITE_SAME:
>  	case REQ_OP_WRITE:
> +	case REQ_OP_COPY:
>  		return blk_rq_zone_is_seq(rq);
>  	default:
>  		return false;
> diff --git a/block/bounce.c b/block/bounce.c
> index d3f51acd6e3b..5e052afe8691 100644
> --- a/block/bounce.c
> +++ b/block/bounce.c
> @@ -254,6 +254,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
>  	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
>  
>  	switch (bio_op(bio)) {
> +	case REQ_OP_COPY:
>  	case REQ_OP_DISCARD:
>  	case REQ_OP_SECURE_ERASE:
>  	case REQ_OP_WRITE_ZEROES:
> diff --git a/block/ioctl.c b/block/ioctl.c
> index d61d652078f4..0e52181657a4 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -133,6 +133,37 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
>  				    GFP_KERNEL, flags);
>  }
>  
> +static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
> +		unsigned long arg, unsigned long flags)
> +{
> +	struct copy_range crange;
> +	struct range_entry *rlist;
> +	int ret;
> +
> +	if (!(mode & FMODE_WRITE))
> +		return -EBADF;
> +
> +	if (copy_from_user(&crange, (void __user *)arg, sizeof(crange)))
> +		return -EFAULT;
> +
> +	rlist = kmalloc_array(crange.nr_range, sizeof(*rlist),
> +			GFP_KERNEL);
> +	if (!rlist)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(rlist, (void __user *)crange.range_list,
> +				sizeof(*rlist) * crange.nr_range)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = blkdev_issue_copy(bdev, crange.nr_range, rlist, bdev, crange.dest,
> +			GFP_KERNEL, flags);
> +out:
> +	kfree(rlist);
> +	return ret;
> +}
> +
>  static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
>  		unsigned long arg)
>  {
> @@ -458,6 +489,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>  	case BLKSECDISCARD:
>  		return blk_ioctl_discard(bdev, mode, arg,
>  				BLKDEV_DISCARD_SECURE);
> +	case BLKCOPY:
> +		return blk_ioctl_copy(bdev, mode, arg, 0);
>  	case BLKZEROOUT:
>  		return blk_ioctl_zeroout(bdev, mode, arg);
>  	case BLKREPORTZONE:
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1edda614f7ce..164313bdfb35 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -71,6 +71,7 @@ static inline bool bio_has_data(struct bio *bio)
>  static inline bool bio_no_advance_iter(const struct bio *bio)
>  {
>  	return bio_op(bio) == REQ_OP_DISCARD ||
> +	       bio_op(bio) == REQ_OP_COPY ||
>  	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
>  	       bio_op(bio) == REQ_OP_WRITE_SAME ||
>  	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 866f74261b3b..5a35c02ac0a8 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -380,6 +380,8 @@ enum req_opf {
>  	REQ_OP_ZONE_RESET	= 15,
>  	/* reset all the zone present on the device */
>  	REQ_OP_ZONE_RESET_ALL	= 17,
> +	/* copy ranges within device */
> +	REQ_OP_COPY		= 19,
>  
>  	/* SCSI passthrough using struct scsi_request */
>  	REQ_OP_SCSI_IN		= 32,
> @@ -506,6 +508,11 @@ static inline bool op_is_discard(unsigned int op)
>  	return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
>  }
>  
> +static inline bool op_is_copy(unsigned int op)
> +{
> +	return (op & REQ_OP_MASK) == REQ_OP_COPY;
> +}
> +
>  /*
>   * Check if a bio or request operation is a zone management operation, with
>   * the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
> @@ -565,4 +572,11 @@ struct blk_rq_stat {
>  	u64 batch;
>  };
>  
> +struct blk_copy_payload {
> +	sector_t	dest;
> +	int		copy_nr_ranges;
> +	struct block_device *src_bdev;
> +	struct	range_entry	range[];
> +};
> +
>  #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 699ace6b25ff..2bb4513d4bb8 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -337,10 +337,14 @@ struct queue_limits {
>  	unsigned int		max_zone_append_sectors;
>  	unsigned int		discard_granularity;
>  	unsigned int		discard_alignment;
> +	unsigned int		copy_offload;
> +	unsigned int		max_copy_sectors;
>  
>  	unsigned short		max_segments;
>  	unsigned short		max_integrity_segments;
>  	unsigned short		max_discard_segments;
> +	unsigned short		max_copy_range_sectors;
> +	unsigned short		max_copy_nr_ranges;
>  
>  	unsigned char		misaligned;
>  	unsigned char		discard_misaligned;
> @@ -621,6 +625,7 @@ struct request_queue {
>  #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
>  #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
>  #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
> +#define QUEUE_FLAG_SIMPLE_COPY	30	/* supports simple copy */
>  
>  #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
>  				 (1 << QUEUE_FLAG_SAME_COMP) |		\
> @@ -643,6 +648,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
>  #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
>  #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
>  #define blk_queue_discard(q)	test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
> +#define blk_queue_copy(q)	test_bit(QUEUE_FLAG_SIMPLE_COPY, &(q)->queue_flags)
>  #define blk_queue_zone_resetall(q)	\
>  	test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
>  #define blk_queue_secure_erase(q) \
> @@ -1069,6 +1075,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
>  		return min(q->limits.max_discard_sectors,
>  			   UINT_MAX >> SECTOR_SHIFT);
>  
> +	if (unlikely(op == REQ_OP_COPY))
> +		return q->limits.max_copy_sectors;
> +

I would agree with this if a copy BIO was always a single range, but that is
not the case. So I am not sure this makes sense at all.

>  	if (unlikely(op == REQ_OP_WRITE_SAME))
>  		return q->limits.max_write_same_sectors;
>  
> @@ -1343,6 +1352,12 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask, int flags,
>  		struct bio **biop);
>  
> +#define BLKDEV_COPY_NOEMULATION	(1 << 0)	/* do not emulate if copy offload not supported */
> +
> +extern int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
> +		struct range_entry *src_rlist, struct block_device *dest_bdev,
> +		sector_t dest, gfp_t gfp_mask, int flags);

No need for extern.
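
i.e. simply:

int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
		struct range_entry *src_rlist, struct block_device *dest_bdev,
		sector_t dest, gfp_t gfp_mask, int flags);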

> +
>  #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
>  #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
>  
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index f44eb0a04afd..5cadb176317a 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -64,6 +64,18 @@ struct fstrim_range {
>  	__u64 minlen;
>  };
>  
> +struct range_entry {
> +	__u64 src;
> +	__u64 len;
> +};
> +
> +struct copy_range {
> +	__u64 dest;
> +	__u64 nr_range;
> +	__u64 range_list;
> +	__u64 rsvd;
> +};
> +
>  /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
>  #define FILE_DEDUPE_RANGE_SAME		0
>  #define FILE_DEDUPE_RANGE_DIFFERS	1
> @@ -184,6 +196,7 @@ struct fsxattr {
>  #define BLKSECDISCARD _IO(0x12,125)
>  #define BLKROTATIONAL _IO(0x12,126)
>  #define BLKZEROOUT _IO(0x12,127)
> +#define BLKCOPY _IOWR(0x12, 128, struct copy_range)
>  /*
>   * A jump here: 130-131 are reserved for zoned block devices
>   * (see uapi/linux/blkzoned.h)
> 

Please test your code more thoroughly. It is full of problems that you should
have detected with better testing, including RO devices, partitions and error
path coverage.

-- 
Damien Le Moal
Western Digital Research

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [RFC PATCH v5 0/4] add simple copy support
  2021-02-19 12:45 ` [RFC PATCH v5 0/4] add simple copy support SelvaKumar S
                     ` (3 preceding siblings ...)
       [not found]   ` <CGME20210219124611epcas5p1c775b63b537e75da161556e375fcf05e@epcas5p1.samsung.com>
@ 2021-02-20 18:01   ` David Laight
  2021-02-20 19:08     ` Matthew Wilcox
  2021-02-20 19:19     ` Keith Busch
  2021-02-21 23:52   ` Dave Chinner
                     ` (3 subsequent siblings)
  8 siblings, 2 replies; 29+ messages in thread
From: David Laight @ 2021-02-20 18:01 UTC (permalink / raw)
  To: 'SelvaKumar S', linux-nvme
  Cc: axboe, damien.lemoal, kch, sagi, snitzer, selvajove,
	linux-kernel, nj.shetty, linux-block, linux-fsdevel, dm-devel,
	joshi.k, javier.gonz, kbusch, joshiiitr, hch

From: SelvaKumar S
> Sent: 19 February 2021 12:45
> 
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified")
> 
> The Specification can be found in following link.
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> 
> Simple copy command is a copy offloading operation and is  used to copy
> multiple contiguous ranges (source_ranges) of LBA's to a single destination
> LBA within the device reducing traffic between host and device.

Sounds to me like the real reason is that the copy just ends up changing
some indirect block pointers rather than having to actually copy the data.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-02-20 18:01   ` [RFC PATCH v5 0/4] add simple copy support David Laight
@ 2021-02-20 19:08     ` Matthew Wilcox
  2021-02-20 19:19     ` Keith Busch
  1 sibling, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2021-02-20 19:08 UTC (permalink / raw)
  To: David Laight
  Cc: axboe, damien.lemoal, kch, 'SelvaKumar S',
	sagi, snitzer, selvajove, linux-kernel, linux-nvme, nj.shetty,
	linux-block, linux-fsdevel, dm-devel, joshi.k, javier.gonz,
	kbusch, joshiiitr, hch

On Sat, Feb 20, 2021 at 06:01:56PM +0000, David Laight wrote:
> From: SelvaKumar S
> > Sent: 19 February 2021 12:45
> > 
> > This patchset tries to add support for TP4065a ("Simple Copy Command"),
> > v2020.05.04 ("Ratified")
> > 
> > The Specification can be found in following link.
> > https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> > 
> > Simple copy command is a copy offloading operation and is  used to copy
> > multiple contiguous ranges (source_ranges) of LBA's to a single destination
> > LBA within the device reducing traffic between host and device.
> 
> Sounds to me like the real reason is that the copy just ends up changing
> some indirect block pointers rather than having to actually copy the data.

That would be incorrect, at least for firmware that I have knowledge of.
There are checksums which involve the logical block address of the data,
and you can't just rewrite the checksum on NAND, you have to write the
entire block.

Now, firmware doesn't have to implement its checksum like this,
but there are good reasons to do it this way (eg if the command gets
corrupted in transfer and you read the wrong block, it will fail the
checksum, preventing the drive from returning Somebody Else's Data).

So let's take these people at their word.  It is to reduce traffic
between drive and host.  And that is a good enough reason to do it.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-02-20 18:01   ` [RFC PATCH v5 0/4] add simple copy support David Laight
  2021-02-20 19:08     ` Matthew Wilcox
@ 2021-02-20 19:19     ` Keith Busch
  1 sibling, 0 replies; 29+ messages in thread
From: Keith Busch @ 2021-02-20 19:19 UTC (permalink / raw)
  To: David Laight
  Cc: axboe, damien.lemoal, kch, 'SelvaKumar S',
	sagi, snitzer, selvajove, linux-kernel, linux-nvme, nj.shetty,
	linux-block, linux-fsdevel, dm-devel, javier.gonz, joshi.k,
	joshiiitr, hch

On Sat, Feb 20, 2021 at 06:01:56PM +0000, David Laight wrote:
> From: SelvaKumar S
> > Sent: 19 February 2021 12:45
> > 
> > This patchset tries to add support for TP4065a ("Simple Copy Command"),
> > v2020.05.04 ("Ratified")
> > 
> > The Specification can be found in following link.
> > https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> > 
> > Simple copy command is a copy offloading operation and is  used to copy
> > multiple contiguous ranges (source_ranges) of LBA's to a single destination
> > LBA within the device reducing traffic between host and device.
> 
> Sounds to me like the real reason is that the copy just ends up changing
> some indirect block pointers rather than having to actually copy the data.

I guess an implementation could do that, but I think that's missing the
point of the command. The intention is to copy the data to a new
location on the media for host managed garbage collection. 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-02-19 12:45 ` [RFC PATCH v5 0/4] add simple copy support SelvaKumar S
                     ` (4 preceding siblings ...)
  2021-02-20 18:01   ` [RFC PATCH v5 0/4] add simple copy support David Laight
@ 2021-02-21 23:52   ` Dave Chinner
  2021-02-23  9:14     ` Selva Jove
  2021-02-22  1:31   ` Ming Lei
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2021-02-21 23:52 UTC (permalink / raw)
  To: SelvaKumar S
  Cc: axboe, damien.lemoal, kch, sagi, snitzer, selvajove,
	linux-kernel, linux-nvme, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, kbusch, joshiiitr, hch

On Fri, Feb 19, 2021 at 06:15:13PM +0530, SelvaKumar S wrote:
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified")
> 
> The Specification can be found in following link.
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> 
> Simple copy command is a copy offloading operation and is  used to copy
> multiple contiguous ranges (source_ranges) of LBA's to a single destination
> LBA within the device reducing traffic between host and device.
> 
> This implementation doesn't add native copy offload support for stacked
> devices rather copy offload is done through emulation. Possible use
> cases are F2FS gc and BTRFS relocation/balance.

It sounds like you are missing the most obvious use case for this:
hooking up filesystem copy_file_range() implementations to allow
userspace to offload user data copies to hardware....

Another fs level feature that could use this for hardware
acceleration is fallocate(FALLOC_FL_UNSHARE).

These are probably going to be far easier to hook up than filesystem
GC algorithms, and there is also solid data integrity and stress
testing checking infrastructure for these operations via fstests.

> As SCSI XCOPY can take two different block devices and no of source range is
> equal to 1, this interface can be extended in future to support SCSI XCOPY.

That greatly complicates the implementation. Do we even care about
cross-device XCOPY at this point?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-02-19 12:45 ` [RFC PATCH v5 0/4] add simple copy support SelvaKumar S
                     ` (5 preceding siblings ...)
  2021-02-21 23:52   ` Dave Chinner
@ 2021-02-22  1:31   ` Ming Lei
  2021-02-22  6:52   ` Su Yue
  2021-04-10  0:21   ` Max Gurtovoy
  8 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2021-02-22  1:31 UTC (permalink / raw)
  To: SelvaKumar S
  Cc: axboe, damien.lemoal, kch, sagi, snitzer, selvajove,
	linux-kernel, linux-nvme, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, Martin K. Petersen, joshi.k, javier.gonz, kbusch,
	joshiiitr, linux-scsi, hch

On Fri, Feb 19, 2021 at 06:15:13PM +0530, SelvaKumar S wrote:
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified")
> 
> The Specification can be found in following link.
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> 
> Simple copy command is a copy offloading operation and is  used to copy
> multiple contiguous ranges (source_ranges) of LBA's to a single destination
> LBA within the device reducing traffic between host and device.
> 
> This implementation doesn't add native copy offload support for stacked
> devices rather copy offload is done through emulation. Possible use
> cases are F2FS gc and BTRFS relocation/balance.
> 
> *blkdev_issue_copy* takes source bdev, no of sources, array of source
> ranges (in sectors), destination bdev and destination offset(in sectors).
> If both source and destination block devices are same and copy_offload = 1,
> then copy is done through native copy offloading. Copy emulation is used
> in other cases.
> 
> As SCSI XCOPY can take two different block devices and no of source range is
> equal to 1, this interface can be extended in future to support SCSI XCOPY.

The patchset adds ioctl(BLKCOPY) and two userspace-visible data structures
(range_entry and copy_range); all of this is kabi stuff, and the interface is
generic block layer kabi.

The API has to be extensible to support SCSI XCOPY or similar block copy
commands in the future without breaking existing applications, so please CC
linux-scsi and the scsi guys in your next post.
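
For reference, a minimal (hypothetical) userspace invocation of the proposed
interface, based on the uapi structures in this patchset, with all values in
sectors:

	struct range_entry ranges[] = {
		{ .src = 0,    .len = 8 },
		{ .src = 1024, .len = 8 },
	};
	struct copy_range cr = {
		.dest = 2048,
		.nr_range = 2,
		.range_list = (__u64)(uintptr_t)ranges,
	};

	if (ioctl(fd, BLKCOPY, &cr) < 0)
		perror("BLKCOPY");

Any future extension (an XCOPY flag, a second device, ...) has to keep this
layout working for old binaries.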


-- 
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-02-19 12:45 ` [RFC PATCH v5 0/4] add simple copy support SelvaKumar S
                     ` (6 preceding siblings ...)
  2021-02-22  1:31   ` Ming Lei
@ 2021-02-22  6:52   ` Su Yue
  2021-02-23  9:00     ` Selva Jove
  2021-04-10  0:21   ` Max Gurtovoy
  8 siblings, 1 reply; 29+ messages in thread
From: Su Yue @ 2021-02-22  6:52 UTC (permalink / raw)
  To: SelvaKumar S
  Cc: axboe, damien.lemoal, kch, sagi, snitzer, selvajove,
	linux-kernel, linux-nvme, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, kbusch, joshiiitr, hch


On Fri 19 Feb 2021 at 20:45, SelvaKumar S 
<selvakuma.s1@samsung.com> wrote:

> This patchset tries to add support for TP4065a ("Simple Copy 
> Command"),
> v2020.05.04 ("Ratified")
>
> The Specification can be found in following link.
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>

404 not found.
Should it be
https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs.zip
?

> Simple copy command is a copy offloading operation and is  used 
> to copy
> multiple contiguous ranges (source_ranges) of LBA's to a single 
> destination
> LBA within the device reducing traffic between host and device.
>
> This implementation doesn't add native copy offload support for 
> stacked
> devices rather copy offload is done through emulation. Possible 
> use
> cases are F2FS gc and BTRFS relocation/balance.
>
> *blkdev_issue_copy* takes source bdev, no of sources, array of 
> source
> ranges (in sectors), destination bdev and destination offset(in 
> sectors).
> If both source and destination block devices are same and 
> copy_offload = 1,
> then copy is done through native copy offloading. Copy emulation 
> is used
> in other cases.
>
> As SCSI XCOPY can take two different block devices and no of 
> source range is
> equal to 1, this interface can be extended in future to support 
> SCSI XCOPY.
>
> For devices supporting native simple copy, attach the control 
> information
> as payload to the bio and submit to the device. For devices 
> without native
> copy support, copy emulation is done by reading each source 
> range into memory
> and writing it to the destination. Caller can choose not to try
> emulation if copy offload is not supported by setting
> BLKDEV_COPY_NOEMULATION flag.
>
> Following limits are added to queue limits and are exposed in 
> sysfs
> to userspace
> 	- *copy_offload* controls copy_offload. set 0 to disable copy
> 		offload, 1 to enable native copy offloading support.
> 	- *max_copy_sectors* limits the sum of all source_range length
> 	- *max_copy_nr_ranges* limits the number of source ranges
> 	- *max_copy_range_sectors* limit the maximum number of sectors
> 		that can constitute a single source range.
>
> 	max_copy_sectors = 0 indicates the device doesn't support copy
> offloading.
>
> 	*copy offload* sysfs entry is configurable and can be used 
> toggle
> between emulation and native support depending upon the usecase.
>
> Changes from v4
>
> 1. Extend dm-kcopyd to leverage copy-offload, while copying 
> within the
> same device. The other approach was to have copy-emulation by 
> moving
> dm-kcopyd to block layer. But it also required moving core dm-io 
> infra,
> causing a massive churn across multiple dm-targets.
>
> 2. Remove export in bio_map_kern()
> 3. Change copy_offload sysfs to accept 0 or else
> 4. Rename copy support flag to QUEUE_FLAG_SIMPLE_COPY
> 5. Rename payload entries, add source bdev field to be used 
> while
> partition remapping, remove copy_size
> 6. Change the blkdev_issue_copy() interface to accept 
> destination and
> source values in sector rather in bytes
> 7. Add payload to bio using bio_map_kern() for copy_offload case
> 8. Add check to return error if one of the source range length 
> is 0
> 9. Add BLKDEV_COPY_NOEMULATION flag to allow user to not try 
> copy
> emulation incase of copy offload is not supported. Caller can 
> his use
> his existing copying logic to complete the io.
> 10. Bug fix copy checks and reduce size of rcu_lock()
>
> Planned for next:
> - adding blktests
> - handling larger (than device limits) copy
> - decide on ioctl interface (man-page etc.)
>
> Changes from v3
>
> 1. gfp_flag fixes.
> 2. Export bio_map_kern() and use it to allocate and add pages to 
> bio.
> 3. Move copy offload, reading to buf, writing from buf to 
> separate functions.
> 4. Send read bio of copy offload by chaining them and submit 
> asynchronously.
> 5. Add gendisk->part0 and part->bd_start_sect changes to 
> blk_check_copy().
> 6. Move single source range limit check to blk_check_copy()
> 7. Rename __blkdev_issue_copy() to blkdev_issue_copy and remove 
> old helper.
> 8. Change blkdev_issue_copy() interface generic to accepts 
> destination bdev
> 	to support XCOPY as well.
> 9. Add invalidate_kernel_vmap_range() after reading data for 
> vmalloc'ed memory.
> 10. Fix buf allocoation logic to allocate buffer for the total 
> size of copy.
> 11. Reword patch commit description.
>
> Changes from v2
>
> 1. Add emulation support for devices not supporting copy.
> 2. Add *copy_offload* sysfs entry to enable and disable 
> copy_offload
> 	in devices supporting simple copy.
> 3. Remove simple copy support for stacked devices.
>
> Changes from v1:
>
> 1. Fix memory leak in __blkdev_issue_copy
> 2. Unmark blk_check_copy inline
> 3. Fix line break in blk_check_copy_eod
> 4. Remove p checks and made code more readable
> 5. Don't use bio_set_op_attrs and remove op and set
>    bi_opf directly
> 6. Use struct_size to calculate total_size
> 7. Fix partition remap of copy destination
> 8. Remove mcl,mssrl,msrc from nvme_ns
> 9. Initialize copy queue limits to 0 in nvme_config_copy
> 10. Remove return in QUEUE_FLAG_COPY check
> 11. Remove unused OCFS
>
> SelvaKumar S (4):
>   block: make bio_map_kern() non static
>   block: add simple copy support
>   nvme: add simple copy support
>   dm kcopyd: add simple copy offload support
>
>  block/blk-core.c          | 102 +++++++++++++++--
>  block/blk-lib.c           | 223 
>  ++++++++++++++++++++++++++++++++++++++
>  block/blk-map.c           |   2 +-
>  block/blk-merge.c         |   2 +
>  block/blk-settings.c      |  10 ++
>  block/blk-sysfs.c         |  47 ++++++++
>  block/blk-zoned.c         |   1 +
>  block/bounce.c            |   1 +
>  block/ioctl.c             |  33 ++++++
>  drivers/md/dm-kcopyd.c    |  49 ++++++++-
>  drivers/nvme/host/core.c  |  87 +++++++++++++++
>  include/linux/bio.h       |   1 +
>  include/linux/blk_types.h |  14 +++
>  include/linux/blkdev.h    |  17 +++
>  include/linux/nvme.h      |  43 +++++++-
>  include/uapi/linux/fs.h   |  13 +++
>  16 files changed, 627 insertions(+), 18 deletions(-)


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v5 3/4] nvme: add simple copy support
  2021-02-20  3:36       ` Matthew Wilcox
@ 2021-02-22 15:57         ` Selva Jove
  0 siblings, 0 replies; 29+ messages in thread
From: Selva Jove @ 2021-02-22 15:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: axboe, Damien Le Moal, kch, SelvaKumar S, sagi, snitzer,
	linux-kernel, linux-nvme, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, Keith Busch, joshiiitr, hch

Matthew,

Maximum Source Range Count (MSRC) is limited by u8, so the maximum
number of source ranges is 256 (it is a 0's based value). The number
of pages required to be sent to the device is therefore at most 2.
Since we are allocating the memory using kmalloc_array(), we get a
physically contiguous segment. nvme_map_data() maps the physical
segment either by setting 2 PRP pointers or by SGL. So the copy
command sends two pages to the device when copying more than 128 ranges.
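
Concretely (assuming 4 KiB pages): 256 ranges * 32 bytes per
nvme_copy_range = 8192 bytes, i.e. exactly two pages, so two PRP
entries always cover the worst case.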

On Sat, Feb 20, 2021 at 9:08 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Feb 19, 2021 at 06:15:16PM +0530, SelvaKumar S wrote:
> > +     struct nvme_copy_range *range = NULL;
> [...]
> > +     range = kmalloc_array(nr_range, sizeof(*range),
> > +                     GFP_ATOMIC | __GFP_NOWARN);
> [...]
> > +     req->special_vec.bv_page = virt_to_page(range);
> > +     req->special_vec.bv_offset = offset_in_page(range);
> > +     req->special_vec.bv_len = sizeof(*range) * nr_range;
> [...]
> > +struct nvme_copy_range {
> > +     __le64                  rsvd0;
> > +     __le64                  slba;
> > +     __le16                  nlb;
> > +     __le16                  rsvd18;
> > +     __le32                  rsvd20;
> > +     __le32                  eilbrt;
> > +     __le16                  elbat;
> > +     __le16                  elbatm;
> > +};
>
> so ... at 32 bytes, you can get 128 per 4kB page.  What happens if you
> try to send down a command that attempts to copy 129 ranges?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-02-22  6:52   ` Su Yue
@ 2021-02-23  9:00     ` Selva Jove
  0 siblings, 0 replies; 29+ messages in thread
From: Selva Jove @ 2021-02-23  9:00 UTC (permalink / raw)
  To: Su Yue
  Cc: axboe, Damien Le Moal, kch, SelvaKumar S, sagi, snitzer,
	linux-kernel, linux-nvme, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, Keith Busch, joshiiitr, hch

Thanks Su Yue. I'll update the link in the next series.

On Mon, Feb 22, 2021 at 12:23 PM Su Yue <l@damenly.su> wrote:
>
>
> On Fri 19 Feb 2021 at 20:45, SelvaKumar S
> <selvakuma.s1@samsung.com> wrote:
>
> > This patchset tries to add support for TP4065a ("Simple Copy
> > Command"),
> > v2020.05.04 ("Ratified")
> >
> > The Specification can be found in following link.
> > https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> >
>
> 404 not found.
> Should it be
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs.zip
> ?
>
> > Simple copy command is a copy offloading operation and is  used
> > to copy
> > multiple contiguous ranges (source_ranges) of LBA's to a single
> > destination
> > LBA within the device reducing traffic between host and device.
> >
> > This implementation doesn't add native copy offload support for
> > stacked
> > devices rather copy offload is done through emulation. Possible
> > use
> > cases are F2FS gc and BTRFS relocation/balance.
> >
> > *blkdev_issue_copy* takes source bdev, no of sources, array of
> > source
> > ranges (in sectors), destination bdev and destination offset(in
> > sectors).
> > If both source and destination block devices are same and
> > copy_offload = 1,
> > then copy is done through native copy offloading. Copy emulation
> > is used
> > in other cases.
> >
> > As SCSI XCOPY can take two different block devices and no of
> > source range is
> > equal to 1, this interface can be extended in future to support
> > SCSI XCOPY.
> >
> > For devices supporting native simple copy, attach the control
> > information
> > as payload to the bio and submit to the device. For devices
> > without native
> > copy support, copy emulation is done by reading each source
> > range into memory
> > and writing it to the destination. Caller can choose not to try
> > emulation if copy offload is not supported by setting
> > BLKDEV_COPY_NOEMULATION flag.
> >
> > Following limits are added to queue limits and are exposed in
> > sysfs
> > to userspace
> >       - *copy_offload* controls copy_offload. set 0 to disable copy
> >               offload, 1 to enable native copy offloading support.
> >       - *max_copy_sectors* limits the sum of all source_range length
> >       - *max_copy_nr_ranges* limits the number of source ranges
> >       - *max_copy_range_sectors* limit the maximum number of sectors
> >               that can constitute a single source range.
> >
> >       max_copy_sectors = 0 indicates the device doesn't support copy
> > offloading.
> >
> >       *copy offload* sysfs entry is configurable and can be used
> > toggle
> > between emulation and native support depending upon the usecase.
> >
> > Changes from v4
> >
> > 1. Extend dm-kcopyd to leverage copy-offload, while copying
> > within the
> > same device. The other approach was to have copy-emulation by
> > moving
> > dm-kcopyd to block layer. But it also required moving core dm-io
> > infra,
> > causing a massive churn across multiple dm-targets.
> >
> > 2. Remove export in bio_map_kern()
> > 3. Change copy_offload sysfs to accept 0 or else
> > 4. Rename copy support flag to QUEUE_FLAG_SIMPLE_COPY
> > 5. Rename payload entries, add source bdev field to be used
> > while
> > partition remapping, remove copy_size
> > 6. Change the blkdev_issue_copy() interface to accept
> > destination and
> > source values in sector rather in bytes
> > 7. Add payload to bio using bio_map_kern() for copy_offload case
> > 8. Add check to return error if one of the source range length
> > is 0
> > 9. Add BLKDEV_COPY_NOEMULATION flag to allow user to not try
> > copy
> > emulation incase of copy offload is not supported. Caller can
> > his use
> > his existing copying logic to complete the io.
> > 10. Bug fix copy checks and reduce size of rcu_lock()
> >
> > Planned for next:
> > - adding blktests
> > - handling larger (than device limits) copy
> > - decide on ioctl interface (man-page etc.)
> >
> > Changes from v3
> >
> > 1. gfp_flag fixes.
> > 2. Export bio_map_kern() and use it to allocate and add pages to
> > bio.
> > 3. Move copy offload, reading to buf, writing from buf to
> > separate functions.
> > 4. Send read bio of copy offload by chaining them and submit
> > asynchronously.
> > 5. Add gendisk->part0 and part->bd_start_sect changes to
> > blk_check_copy().
> > 6. Move single source range limit check to blk_check_copy()
> > 7. Rename __blkdev_issue_copy() to blkdev_issue_copy and remove
> > old helper.
> > 8. Change blkdev_issue_copy() interface generic to accepts
> > destination bdev
> >       to support XCOPY as well.
> > 9. Add invalidate_kernel_vmap_range() after reading data for
> > vmalloc'ed memory.
> > 10. Fix buf allocoation logic to allocate buffer for the total
> > size of copy.
> > 11. Reword patch commit description.
> >
> > Changes from v2
> >
> > 1. Add emulation support for devices not supporting copy.
> > 2. Add *copy_offload* sysfs entry to enable and disable
> > copy_offload
> >       in devices supporting simple copy.
> > 3. Remove simple copy support for stacked devices.
> >
> > Changes from v1:
> >
> > 1. Fix memory leak in __blkdev_issue_copy
> > 2. Unmark blk_check_copy inline
> > 3. Fix line break in blk_check_copy_eod
> > 4. Remove p checks and made code more readable
> > 5. Don't use bio_set_op_attrs and remove op and set
> >    bi_opf directly
> > 6. Use struct_size to calculate total_size
> > 7. Fix partition remap of copy destination
> > 8. Remove mcl,mssrl,msrc from nvme_ns
> > 9. Initialize copy queue limits to 0 in nvme_config_copy
> > 10. Remove return in QUEUE_FLAG_COPY check
> > 11. Remove unused OCFS
> >
> > SelvaKumar S (4):
> >   block: make bio_map_kern() non static
> >   block: add simple copy support
> >   nvme: add simple copy support
> >   dm kcopyd: add simple copy offload support
> >
> >  block/blk-core.c          | 102 +++++++++++++++--
> >  block/blk-lib.c           | 223
> >  ++++++++++++++++++++++++++++++++++++++
> >  block/blk-map.c           |   2 +-
> >  block/blk-merge.c         |   2 +
> >  block/blk-settings.c      |  10 ++
> >  block/blk-sysfs.c         |  47 ++++++++
> >  block/blk-zoned.c         |   1 +
> >  block/bounce.c            |   1 +
> >  block/ioctl.c             |  33 ++++++
> >  drivers/md/dm-kcopyd.c    |  49 ++++++++-
> >  drivers/nvme/host/core.c  |  87 +++++++++++++++
> >  include/linux/bio.h       |   1 +
> >  include/linux/blk_types.h |  14 +++
> >  include/linux/blkdev.h    |  17 +++
> >  include/linux/nvme.h      |  43 +++++++-
> >  include/uapi/linux/fs.h   |  13 +++
> >  16 files changed, 627 insertions(+), 18 deletions(-)
>

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-02-21 23:52   ` Dave Chinner
@ 2021-02-23  9:14     ` Selva Jove
  0 siblings, 0 replies; 29+ messages in thread
From: Selva Jove @ 2021-02-23  9:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, Damien Le Moal, kch, SelvaKumar S, sagi, snitzer,
	linux-kernel, linux-nvme, nj.shetty, linux-block, linux-fsdevel,
	dm-devel, joshi.k, javier.gonz, Keith Busch, joshiiitr, hch

Dave,

copy_file_range() support is a work in progress. The
fallocate(FALLOC_FL_UNSHARE) use case sounds interesting. I will try to
address both of them in the next series.

Adding SCSI XCOPY support is not in the scope of this patchset. However,
the blkdev_issue_copy() interface is made generic so that it can be
extended to cross-device XCOPY in the future.


Thanks,
Selva

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v5 2/4] block: add simple copy support
  2021-02-20  4:59       ` Damien Le Moal
@ 2021-04-07 11:32         ` Selva Jove
  2021-04-12  0:24           ` Damien Le Moal
  0 siblings, 1 reply; 29+ messages in thread
From: Selva Jove @ 2021-04-07 11:32 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: SelvaKumar S, linux-nvme, kbusch, axboe, hch, sagi, linux-block,
	linux-kernel, dm-devel, snitzer, joshiiitr, nj.shetty, joshi.k,
	javier.gonz, kch, linux-fsdevel

Initially I started moving the dm-kcopyd interface to the block layer
as a generic interface. Once I dug deeper into the dm-kcopyd code, I
found that dm-kcopyd is tightly coupled with dm_io().

Moving dm-kcopyd to the block layer would also require moving the dm_io
code there. That would cause havoc in the dm layer, as dm_io is its
backbone, and it would need a complete rewrite of the dm layer. Do you
see any other way of doing this without having to move the dm_io code
or duplicating it?


On Sat, Feb 20, 2021 at 10:29 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> On 2021/02/20 11:01, SelvaKumar S wrote:
> > Add new BLKCOPY ioctl that offloads copying of one or more sources
> > ranges to a destination in the device. Accepts a 'copy_range' structure
> > that contains destination (in sectors), no of sources and pointer to the
> > array of source ranges. Each source range is represented by 'range_entry'
> > that contains start and length of source ranges (in sectors).
> >
> > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > bio with control information as payload and submit to the device.
> > REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
> > to zoned device.
> >
> > If the device doesn't support copy or copy offload is disabled, then
> > copy operation is emulated by default. However, the copy-emulation is an
> > opt-in feature. Caller can choose not to use the copy-emulation by
> > specifying a flag 'BLKDEV_COPY_NOEMULATION'.
> >
> > Copy-emulation is implemented by allocating memory of total copy size.
> > The source ranges are read into memory by chaining bio for each source
> > ranges and submitting them async and the last bio waits for completion.
> > After data is read, it is written to the destination.
> >
> > bio_map_kern() is used to allocate bio and add pages of copy buffer to
> > bio. As bio->bi_private and bio->bi_end_io are needed for chaining the
> > bio and gets over-written, invalidate_kernel_vmap_range() for read is
> > called in the caller.
> >
> > Introduce queue limits for simple copy and other helper functions.
> > Add device limits as sysfs entries.
> >       - copy_offload
> >       - max_copy_sectors
> >       - max_copy_ranges_sectors
> >       - max_copy_nr_ranges
> >
> > copy_offload(= 0) is disabled by default. This needs to be enabled if
> > copy-offload needs to be used.
> > max_copy_sectors = 0 indicates the device doesn't support native copy.
> >
> > Native copy offload is not supported for stacked devices and is done via
> > copy emulation.
> >
> > Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> > Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> > Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> > Signed-off-by: Javier González <javier.gonz@samsung.com>
> > Signed-off-by: Chaitanya Kulkarni <kch@kernel.org>
> > ---
> >  block/blk-core.c          | 102 ++++++++++++++++--
> >  block/blk-lib.c           | 222 ++++++++++++++++++++++++++++++++++++++
> >  block/blk-merge.c         |   2 +
> >  block/blk-settings.c      |  10 ++
> >  block/blk-sysfs.c         |  47 ++++++++
> >  block/blk-zoned.c         |   1 +
> >  block/bounce.c            |   1 +
> >  block/ioctl.c             |  33 ++++++
> >  include/linux/bio.h       |   1 +
> >  include/linux/blk_types.h |  14 +++
> >  include/linux/blkdev.h    |  15 +++
> >  include/uapi/linux/fs.h   |  13 +++
> >  12 files changed, 453 insertions(+), 8 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 7663a9b94b80..23e646e5ae43 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -720,6 +720,17 @@ static noinline int should_fail_bio(struct bio *bio)
> >  }
> >  ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
> >
> > +static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
> > +             sector_t nr_sectors, sector_t max_sect)
> > +{
> > +     if (nr_sectors && max_sect &&
> > +         (nr_sectors > max_sect || start > max_sect - nr_sectors)) {
> > +             handle_bad_sector(bio, max_sect);
> > +             return -EIO;
> > +     }
> > +     return 0;
> > +}
> > +
> >  /*
> >   * Check whether this bio extends beyond the end of the device or partition.
> >   * This may well happen - the kernel calls bread() without checking the size of
> > @@ -738,6 +749,75 @@ static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
> >       return 0;
> >  }
> >
> > +/*
> > + * Check for copy limits and remap source ranges if needed.
> > + */
> > +static int blk_check_copy(struct bio *bio)
> > +{
> > +     struct blk_copy_payload *payload = bio_data(bio);
> > +     struct request_queue *q = bio->bi_disk->queue;
> > +     sector_t max_sect, start_sect, copy_size = 0;
> > +     sector_t src_max_sect, src_start_sect;
> > +     struct block_device *bd_part;
> > +     int i, ret = -EIO;
> > +
> > +     rcu_read_lock();
> > +
> > +     bd_part = __disk_get_part(bio->bi_disk, bio->bi_partno);
> > +     if (unlikely(!bd_part)) {
> > +             rcu_read_unlock();
> > +             goto out;
> > +     }
> > +
> > +     max_sect =  bdev_nr_sectors(bd_part);
> > +     start_sect = bd_part->bd_start_sect;
> > +
> > +     src_max_sect = bdev_nr_sectors(payload->src_bdev);
> > +     src_start_sect = payload->src_bdev->bd_start_sect;
> > +
> > +     if (unlikely(should_fail_request(bd_part, bio->bi_iter.bi_size)))
> > +             goto out;
> > +
> > +     if (unlikely(bio_check_ro(bio, bd_part)))
> > +             goto out;
>
> There is no rcu_unlock() in that out label. Did you test ?
>
> > +
> > +     rcu_read_unlock();
> > +
> > +     /* cannot handle copy crossing nr_ranges limit */
> > +     if (payload->copy_nr_ranges > q->limits.max_copy_nr_ranges)
> > +             goto out;
> > +
> > +     for (i = 0; i < payload->copy_nr_ranges; i++) {
> > +             ret = bio_check_copy_eod(bio, payload->range[i].src,
> > +                             payload->range[i].len, src_max_sect);
> > +             if (unlikely(ret))
> > +                     goto out;
> > +
> > +             /* single source range length limit */
> > +             if (payload->range[i].len > q->limits.max_copy_range_sectors)
> > +                     goto out;
>
> ret is not set. You will return success with this.
>
> > +
> > +             payload->range[i].src += src_start_sect;
> > +             copy_size += payload->range[i].len;
> > +     }
> > +
> > +     /* check if copy length crosses eod */
> > +     ret = bio_check_copy_eod(bio, bio->bi_iter.bi_sector,
> > +                             copy_size, max_sect);
> > +     if (unlikely(ret))
> > +             goto out;
> > +
> > +     /* cannot handle copy more than copy limits */
> > +     if (copy_size > q->limits.max_copy_sectors)
> > +             goto out;
>
> Again ret is not set... No error return ?
>
> > +
> > +     bio->bi_iter.bi_sector += start_sect;
> > +     bio->bi_partno = 0;
> > +     ret = 0;
> > +out:
> > +     return ret;
> > +}
> > +
> >  /*
> >   * Remap block n of partition p to block n+start(p) of the disk.
> >   */
> > @@ -827,14 +907,16 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >       if (should_fail_bio(bio))
> >               goto end_io;
> >
> > -     if (bio->bi_partno) {
> > -             if (unlikely(blk_partition_remap(bio)))
> > -                     goto end_io;
> > -     } else {
> > -             if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
> > -                     goto end_io;
> > -             if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
> > -                     goto end_io;
> > +     if (likely(!op_is_copy(bio->bi_opf))) {
> > +             if (bio->bi_partno) {
> > +                     if (unlikely(blk_partition_remap(bio)))
> > +                             goto end_io;
> > +             } else {
> > +                     if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
> > +                             goto end_io;
> > +                     if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
> > +                             goto end_io;
> > +             }
> >       }
> >
> >       /*
> > @@ -858,6 +940,10 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >               if (!blk_queue_discard(q))
> >                       goto not_supported;
> >               break;
> > +     case REQ_OP_COPY:
> > +             if (unlikely(blk_check_copy(bio)))
> > +                     goto end_io;
> > +             break;
> >       case REQ_OP_SECURE_ERASE:
> >               if (!blk_queue_secure_erase(q))
> >                       goto not_supported;
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index 752f9c722062..97ba58d8d9a1 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -150,6 +150,228 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> >  }
> >  EXPORT_SYMBOL(blkdev_issue_discard);
> >
> > +int blk_copy_offload(struct block_device *dest_bdev, struct blk_copy_payload *payload,
> > +             sector_t dest, gfp_t gfp_mask)
>
> Simple copy is only over the same device, right ? So the name "dest_bdev" is a
> little strange.
>
> > +{
> > +     struct request_queue *q = bdev_get_queue(dest_bdev);
> > +     struct bio *bio;
> > +     int ret, payload_size;
> > +
> > +     payload_size = struct_size(payload, range, payload->copy_nr_ranges);
> > +     bio = bio_map_kern(q, payload, payload_size, gfp_mask);
> > +     if (IS_ERR(bio)) {
> > +             ret = PTR_ERR(bio);
> > +             goto err;
>
> This will do a bio_put() on a non existent bio...
>
> > +     }
> > +
> > +     bio->bi_iter.bi_sector = dest;
> > +     bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
> > +     bio_set_dev(bio, dest_bdev);
> > +
> > +     ret = submit_bio_wait(bio);
> > +err:
> > +     bio_put(bio);
> > +     return ret;
> > +}
> > +
> > +int blk_read_to_buf(struct block_device *src_bdev, struct blk_copy_payload *payload,
> > +             gfp_t gfp_mask, sector_t copy_size, void **buf_p)
> > +{
> > +     struct request_queue *q = bdev_get_queue(src_bdev);
> > +     struct bio *bio, *parent = NULL;
> > +     void *buf = NULL;
> > +     int copy_len = copy_size << SECTOR_SHIFT;
> > +     int i, nr_srcs, ret, cur_size, t_len = 0;
> > +     bool is_vmalloc;
> > +
> > +     nr_srcs = payload->copy_nr_ranges;
> > +
> > +     buf = kvmalloc(copy_len, gfp_mask);
> > +     if (!buf)
> > +             return -ENOMEM;
> > +     is_vmalloc = is_vmalloc_addr(buf);
> > +
> > +     for (i = 0; i < nr_srcs; i++) {
> > +             cur_size = payload->range[i].len << SECTOR_SHIFT;
> > +
> > +             bio = bio_map_kern(q, buf + t_len, cur_size, gfp_mask);
> > +             if (IS_ERR(bio)) {
> > +                     ret = PTR_ERR(bio);
> > +                     goto out;
> > +             }
> > +
> > +             bio->bi_iter.bi_sector = payload->range[i].src;
> > +             bio->bi_opf = REQ_OP_READ;
> > +             bio_set_dev(bio, src_bdev);
> > +             bio->bi_end_io = NULL;
> > +             bio->bi_private = NULL;
> > +
> > +             if (parent) {
> > +                     bio_chain(parent, bio);
> > +                     submit_bio(parent);
> > +             }
> > +
> > +             parent = bio;
> > +             t_len += cur_size;
> > +     }
> > +
> > +     ret = submit_bio_wait(bio);
> > +     bio_put(bio);
> > +     if (is_vmalloc)
> > +             invalidate_kernel_vmap_range(buf, copy_len);
>
> But blk_write_from_buf() will use the buffer right after this.. Is this really OK ?
>

As we are overwriting bio->bi_private and bio->bi_end_io during
submit_bio_wait(), the original bio_map_kern_endio() can't be used to
call invalidate_kernel_vmap_range(). So we are doing it here
explicitly. invalidate_kernel_vmap_range() is only necessary for data
reads, so the buf is safe to use in blk_write_from_buf().
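
For reference, submit_bio_wait() does roughly the following
(paraphrased from block/bio.c and slightly simplified; the hang-check
timeout handling is omitted), which is why any end_io hook set up by
bio_map_kern() never gets to run:

	static void submit_bio_wait_endio(struct bio *bio)
	{
		complete(bio->bi_private);
	}

	int submit_bio_wait(struct bio *bio)
	{
		DECLARE_COMPLETION_ONSTACK(done);

		/* the caller's bi_private/bi_end_io are replaced here */
		bio->bi_private = &done;
		bio->bi_end_io = submit_bio_wait_endio;
		bio->bi_opf |= REQ_SYNC;
		submit_bio(bio);
		wait_for_completion_io(&done);

		return blk_status_to_errno(bio->bi_status);
	}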

>
> > +     if (ret)
> > +             goto out;
> > +
> > +     *buf_p = buf;
> > +     return 0;
> > +out:
> > +     kvfree(buf);
> > +     return ret;
> > +}
> > +
> > +int blk_write_from_buf(struct block_device *dest_bdev, void *buf, sector_t dest,
> > +             sector_t copy_size, gfp_t gfp_mask)
> > +{
> > +     struct request_queue *q = bdev_get_queue(dest_bdev);
> > +     struct bio *bio;
> > +     int ret, copy_len = copy_size << SECTOR_SHIFT;
> > +
> > +     bio = bio_map_kern(q, buf, copy_len, gfp_mask);
> > +     if (IS_ERR(bio)) {
> > +             ret = PTR_ERR(bio);
> > +             goto out;
> > +     }
> > +     bio_set_dev(bio, dest_bdev);
> > +     bio->bi_opf = REQ_OP_WRITE;
> > +     bio->bi_iter.bi_sector = dest;
> > +
> > +     bio->bi_end_io = NULL;
> > +     ret = submit_bio_wait(bio);
> > +     bio_put(bio);
> > +out:
> > +     return ret;
> > +}
> > +
> > +int blk_prepare_payload(struct block_device *src_bdev, int nr_srcs, struct range_entry *rlist,
> > +             gfp_t gfp_mask, struct blk_copy_payload **payload_p, sector_t *copy_size)
> > +{
> > +
> > +     struct request_queue *q = bdev_get_queue(src_bdev);
> > +     struct blk_copy_payload *payload;
> > +     sector_t bs_mask, total_len = 0;
> > +     int i, ret, payload_size;
> > +
> > +     if (!q)
> > +             return -ENXIO;
> > +
> > +     if (!nr_srcs)
> > +             return -EINVAL;
> > +
> > +     if (bdev_read_only(src_bdev))
> > +             return -EPERM;
> > +
> > +     bs_mask = (bdev_logical_block_size(src_bdev) >> 9) - 1;
> > +
> > +     payload_size = struct_size(payload, range, nr_srcs);
> > +     payload = kmalloc(payload_size, gfp_mask);
> > +     if (!payload)
> > +             return -ENOMEM;
> > +
> > +     for (i = 0; i < nr_srcs; i++) {
> > +             if (rlist[i].src & bs_mask || rlist[i].len & bs_mask) {
> > +                     ret = -EINVAL;
> > +                     goto err;
> > +             }
> > +
> > +             payload->range[i].src = rlist[i].src;
> > +             payload->range[i].len = rlist[i].len;
> > +
> > +             total_len += rlist[i].len;
> > +     }
> > +
> > +     payload->copy_nr_ranges = i;
> > +     payload->src_bdev = src_bdev;
> > +     *copy_size = total_len;
> > +
> > +     *payload_p = payload;
> > +     return 0;
> > +err:
> > +     kfree(payload);
> > +     return ret;
> > +}
> > +
> > +int blk_copy_emulate(struct block_device *src_bdev, struct blk_copy_payload *payload,
> > +                     struct block_device *dest_bdev, sector_t dest,
> > +                     sector_t copy_size, gfp_t gfp_mask)
> > +{
> > +     void *buf = NULL;
> > +     int ret;
> > +
> > +     ret = blk_read_to_buf(src_bdev, payload, gfp_mask, copy_size, &buf);
> > +     if (ret)
> > +             goto out;
> > +
> > +     ret = blk_write_from_buf(dest_bdev, buf, dest, copy_size, gfp_mask);
> > +     if (buf)
> > +             kvfree(buf);
> > +out:
> > +     return ret;
> > +}
>
> I already commented that this should better use the dm-kcopyd design, which
> would be far more efficient than this. This will be slow...
>
> Your function blkdev_issue_copy() below should deal only with issuing simple
> copy (amd later scsi xcopy) for devices that support it. Bring the dm-kcopyd
> interface in the block layer as a generic interface for hadling emulation.
> Otherwise you are repeating what dm does, but not as efficiently.
>
> > +
> > +/**
> > + * blkdev_issue_copy - queue a copy
> > + * @src_bdev:        source block device
> > + * @nr_srcs: number of source ranges to copy
> > + * @rlist:   array of source ranges in sector
> > + * @dest_bdev:       destination block device
> > + * @dest:    destination in sector
> > + * @gfp_mask:   memory allocation flags (for bio_alloc)
> > + * @flags:   BLKDEV_COPY_* flags to control behaviour
> > + *
> > + * Description:
> > + *   Copy array of source ranges from source block device to
> > + *   destination block devcie. All source must belong to same bdev and
> > + *   length of a source range cannot be zero.
> > + */
> > +
> > +int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
> > +             struct range_entry *src_rlist, struct block_device *dest_bdev,
> > +             sector_t dest, gfp_t gfp_mask, int flags)
> > +{
> > +     struct request_queue *q = bdev_get_queue(src_bdev);
> > +     struct request_queue *dest_q = bdev_get_queue(dest_bdev);
> > +     struct blk_copy_payload *payload;
> > +     sector_t bs_mask, copy_size;
> > +     int ret;
> > +
> > +     ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
> > +                     &payload, &copy_size);
> > +     if (ret)
> > +             return ret;
> > +
> > +     bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
> > +     if (dest & bs_mask) {
> > +             return -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     if (q == dest_q && q->limits.copy_offload) {
> > +             ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
> > +             if (ret)
> > +                     goto out;
> > +     } else if (flags & BLKDEV_COPY_NOEMULATION) {
>
> Why ? whoever calls blkdev_issue_copy() wants a copy to be done. Why would that
> user say "Fail on me if the device does not support copy" ??? This is a weird
> interface in my opinion.
>

BLKDEV_COPY_NOEMULATION flag was introduced to allow blkdev_issue_copy() callers
to use their native copying method instead of the emulated copy that I
added. This way we
ensure that dm uses the hw-assisted copy and if that is not present,
it falls back to existing
copy method.

The other users who don't have their native emulation can use this
emulated-copy implementation.
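
For example, a caller that already has a fallback (as dm-kcopyd does
in patch 4/4) could use the flag like this; my_existing_copy() below
is a stand-in for the caller's current copy path, not a real function:

	ret = blkdev_issue_copy(bdev, nr_srcs, ranges, bdev, dest,
				GFP_KERNEL, BLKDEV_COPY_NOEMULATION);
	if (ret) {
		/*
		 * Copy offload is unsupported (or failed); use the
		 * caller's own copy path instead of the generic
		 * emulation.
		 */
		ret = my_existing_copy(bdev, nr_srcs, ranges, dest);
	}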

>
> > +             ret = -EIO;
> > +             goto out;
> > +     } else
>
> Missing braces. But you do not need all these else after the gotos anyway.
>
> > +             ret = blk_copy_emulate(src_bdev, payload, dest_bdev, dest,
> > +                             copy_size, gfp_mask);
> > +
> > +out:
> > +     kvfree(payload);
> > +     return ret;
> > +}
> > +EXPORT_SYMBOL(blkdev_issue_copy);
> > +
> >  /**
> >   * __blkdev_issue_write_same - generate number of bios with same page
> >   * @bdev:    target blockdev
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 808768f6b174..4e04f24e13c1 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -309,6 +309,8 @@ void __blk_queue_split(struct bio **bio, unsigned int *nr_segs)
> >       struct bio *split = NULL;
> >
> >       switch (bio_op(*bio)) {
> > +     case REQ_OP_COPY:
> > +                     break;
>
> Why would this even be called ? Copy BIOs cannot be split, right ?
>
> >       case REQ_OP_DISCARD:
> >       case REQ_OP_SECURE_ERASE:
> >               split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
> > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > index 43990b1d148b..93c15ba45a69 100644
> > --- a/block/blk-settings.c
> > +++ b/block/blk-settings.c
> > @@ -60,6 +60,10 @@ void blk_set_default_limits(struct queue_limits *lim)
> >       lim->io_opt = 0;
> >       lim->misaligned = 0;
> >       lim->zoned = BLK_ZONED_NONE;
> > +     lim->copy_offload = 0;
> > +     lim->max_copy_sectors = 0;
> > +     lim->max_copy_nr_ranges = 0;
> > +     lim->max_copy_range_sectors = 0;
> >  }
> >  EXPORT_SYMBOL(blk_set_default_limits);
> >
> > @@ -565,6 +569,12 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> >       if (b->chunk_sectors)
> >               t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
> >
> > +     /* simple copy not supported in stacked devices */
> > +     t->copy_offload = 0;
> > +     t->max_copy_sectors = 0;
> > +     t->max_copy_range_sectors = 0;
> > +     t->max_copy_nr_ranges = 0;
>
> You do not need this. Limits not explicitely initialized are 0 already.
> But I do not see why you can't support copy on stacked devices. That should be
> feasible taking the min() for each of the above limit.
>

Disabling stacked device support was feedback from v2.

https://patchwork.kernel.org/project/linux-block/patch/20201204094659.12732-2-selvakuma.s1@samsung.com/
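
If stacked-device support is revisited, I imagine combining the limits
would mirror the other entries in blk_stack_limits(), along these
lines (a sketch only, not part of this series):

	/*
	 * 0 means "no copy support", so a plain min() lets any
	 * bottom device without support disable it for the stack.
	 */
	t->max_copy_sectors = min(t->max_copy_sectors,
				  b->max_copy_sectors);
	t->max_copy_range_sectors = min(t->max_copy_range_sectors,
					b->max_copy_range_sectors);
	t->max_copy_nr_ranges = min(t->max_copy_nr_ranges,
				    b->max_copy_nr_ranges);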

>
> > +
> >       /* Physical block size a multiple of the logical block size? */
> >       if (t->physical_block_size & (t->logical_block_size - 1)) {
> >               t->physical_block_size = t->logical_block_size;
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index b513f1683af0..625a72541263 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -166,6 +166,44 @@ static ssize_t queue_discard_granularity_show(struct request_queue *q, char *pag
> >       return queue_var_show(q->limits.discard_granularity, page);
> >  }
> >
> > +static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
> > +{
> > +     return queue_var_show(q->limits.copy_offload, page);
> > +}
> > +
> > +static ssize_t queue_copy_offload_store(struct request_queue *q,
> > +                                    const char *page, size_t count)
> > +{
> > +     unsigned long copy_offload;
> > +     ssize_t ret = queue_var_store(&copy_offload, page, count);
> > +
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     if (copy_offload && q->limits.max_copy_sectors == 0)
> > +             return -EINVAL;
> > +
> > +     q->limits.copy_offload = copy_offload;
> > +     return ret;
> > +}
>
> This is weird. If you want to allow a user to disable copy offload, then use
> max_copy_sectors. This one should be read-only and only indicate if the device
> supports it or not. I also would actually change this one into
> max_copy_hw_sectors, immutable, indicating the max copy sectors that the device
> supports, and 0 for no support. That would allow an easy implementation of
> max_copy_sectors being read/write for controlling enable/disable.
>
> > +
> > +static ssize_t queue_max_copy_sectors_show(struct request_queue *q, char *page)
> > +{
> > +     return queue_var_show(q->limits.max_copy_sectors, page);
> > +}
> > +
> > +static ssize_t queue_max_copy_range_sectors_show(struct request_queue *q,
> > +             char *page)
> > +{
> > +     return queue_var_show(q->limits.max_copy_range_sectors, page);
> > +}
> > +
> > +static ssize_t queue_max_copy_nr_ranges_show(struct request_queue *q,
> > +             char *page)
> > +{
> > +     return queue_var_show(q->limits.max_copy_nr_ranges, page);
> > +}
> > +
> >  static ssize_t queue_discard_max_hw_show(struct request_queue *q, char *page)
> >  {
> >
> > @@ -591,6 +629,11 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
> >  QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
> >  QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
> >
> > +QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
> > +QUEUE_RO_ENTRY(queue_max_copy_sectors, "max_copy_sectors");
> > +QUEUE_RO_ENTRY(queue_max_copy_range_sectors, "max_copy_range_sectors");
> > +QUEUE_RO_ENTRY(queue_max_copy_nr_ranges, "max_copy_nr_ranges");
> > +
> >  QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
> >  QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
> >  QUEUE_RW_ENTRY(queue_poll, "io_poll");
> > @@ -636,6 +679,10 @@ static struct attribute *queue_attrs[] = {
> >       &queue_discard_max_entry.attr,
> >       &queue_discard_max_hw_entry.attr,
> >       &queue_discard_zeroes_data_entry.attr,
> > +     &queue_copy_offload_entry.attr,
> > +     &queue_max_copy_sectors_entry.attr,
> > +     &queue_max_copy_range_sectors_entry.attr,
> > +     &queue_max_copy_nr_ranges_entry.attr,
> >       &queue_write_same_max_entry.attr,
> >       &queue_write_zeroes_max_entry.attr,
> >       &queue_zone_append_max_entry.attr,
> > diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> > index 7a68b6e4300c..02069178d51e 100644
> > --- a/block/blk-zoned.c
> > +++ b/block/blk-zoned.c
> > @@ -75,6 +75,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
> >       case REQ_OP_WRITE_ZEROES:
> >       case REQ_OP_WRITE_SAME:
> >       case REQ_OP_WRITE:
> > +     case REQ_OP_COPY:
> >               return blk_rq_zone_is_seq(rq);
> >       default:
> >               return false;
> > diff --git a/block/bounce.c b/block/bounce.c
> > index d3f51acd6e3b..5e052afe8691 100644
> > --- a/block/bounce.c
> > +++ b/block/bounce.c
> > @@ -254,6 +254,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
> >       bio->bi_iter.bi_size    = bio_src->bi_iter.bi_size;
> >
> >       switch (bio_op(bio)) {
> > +     case REQ_OP_COPY:
> >       case REQ_OP_DISCARD:
> >       case REQ_OP_SECURE_ERASE:
> >       case REQ_OP_WRITE_ZEROES:
> > diff --git a/block/ioctl.c b/block/ioctl.c
> > index d61d652078f4..0e52181657a4 100644
> > --- a/block/ioctl.c
> > +++ b/block/ioctl.c
> > @@ -133,6 +133,37 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
> >                                   GFP_KERNEL, flags);
> >  }
> >
> > +static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
> > +             unsigned long arg, unsigned long flags)
> > +{
> > +     struct copy_range crange;
> > +     struct range_entry *rlist;
> > +     int ret;
> > +
> > +     if (!(mode & FMODE_WRITE))
> > +             return -EBADF;
> > +
> > +     if (copy_from_user(&crange, (void __user *)arg, sizeof(crange)))
> > +             return -EFAULT;
> > +
> > +     rlist = kmalloc_array(crange.nr_range, sizeof(*rlist),
> > +                     GFP_KERNEL);
> > +     if (!rlist)
> > +             return -ENOMEM;
> > +
> > +     if (copy_from_user(rlist, (void __user *)crange.range_list,
> > +                             sizeof(*rlist) * crange.nr_range)) {
> > +             ret = -EFAULT;
> > +             goto out;
> > +     }
> > +
> > +     ret = blkdev_issue_copy(bdev, crange.nr_range, rlist, bdev, crange.dest,
> > +                     GFP_KERNEL, flags);
> > +out:
> > +     kfree(rlist);
> > +     return ret;
> > +}
> > +
> >  static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
> >               unsigned long arg)
> >  {
> > @@ -458,6 +489,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
> >       case BLKSECDISCARD:
> >               return blk_ioctl_discard(bdev, mode, arg,
> >                               BLKDEV_DISCARD_SECURE);
> > +     case BLKCOPY:
> > +             return blk_ioctl_copy(bdev, mode, arg, 0);
> >       case BLKZEROOUT:
> >               return blk_ioctl_zeroout(bdev, mode, arg);
> >       case BLKREPORTZONE:
> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 1edda614f7ce..164313bdfb35 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -71,6 +71,7 @@ static inline bool bio_has_data(struct bio *bio)
> >  static inline bool bio_no_advance_iter(const struct bio *bio)
> >  {
> >       return bio_op(bio) == REQ_OP_DISCARD ||
> > +            bio_op(bio) == REQ_OP_COPY ||
> >              bio_op(bio) == REQ_OP_SECURE_ERASE ||
> >              bio_op(bio) == REQ_OP_WRITE_SAME ||
> >              bio_op(bio) == REQ_OP_WRITE_ZEROES;
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index 866f74261b3b..5a35c02ac0a8 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -380,6 +380,8 @@ enum req_opf {
> >       REQ_OP_ZONE_RESET       = 15,
> >       /* reset all the zone present on the device */
> >       REQ_OP_ZONE_RESET_ALL   = 17,
> > +     /* copy ranges within device */
> > +     REQ_OP_COPY             = 19,
> >
> >       /* SCSI passthrough using struct scsi_request */
> >       REQ_OP_SCSI_IN          = 32,
> > @@ -506,6 +508,11 @@ static inline bool op_is_discard(unsigned int op)
> >       return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
> >  }
> >
> > +static inline bool op_is_copy(unsigned int op)
> > +{
> > +     return (op & REQ_OP_MASK) == REQ_OP_COPY;
> > +}
> > +
> >  /*
> >   * Check if a bio or request operation is a zone management operation, with
> >   * the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
> > @@ -565,4 +572,11 @@ struct blk_rq_stat {
> >       u64 batch;
> >  };
> >
> > +struct blk_copy_payload {
> > +     sector_t        dest;
> > +     int             copy_nr_ranges;
> > +     struct block_device *src_bdev;
> > +     struct  range_entry     range[];
> > +};
> > +
> >  #endif /* __LINUX_BLK_TYPES_H */
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 699ace6b25ff..2bb4513d4bb8 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -337,10 +337,14 @@ struct queue_limits {
> >       unsigned int            max_zone_append_sectors;
> >       unsigned int            discard_granularity;
> >       unsigned int            discard_alignment;
> > +     unsigned int            copy_offload;
> > +     unsigned int            max_copy_sectors;
> >
> >       unsigned short          max_segments;
> >       unsigned short          max_integrity_segments;
> >       unsigned short          max_discard_segments;
> > +     unsigned short          max_copy_range_sectors;
> > +     unsigned short          max_copy_nr_ranges;
> >
> >       unsigned char           misaligned;
> >       unsigned char           discard_misaligned;
> > @@ -621,6 +625,7 @@ struct request_queue {
> >  #define QUEUE_FLAG_RQ_ALLOC_TIME 27  /* record rq->alloc_time_ns */
> >  #define QUEUE_FLAG_HCTX_ACTIVE       28      /* at least one blk-mq hctx is active */
> >  #define QUEUE_FLAG_NOWAIT       29   /* device supports NOWAIT */
> > +#define QUEUE_FLAG_SIMPLE_COPY       30      /* supports simple copy */
> >
> >  #define QUEUE_FLAG_MQ_DEFAULT        ((1 << QUEUE_FLAG_IO_STAT) |            \
> >                                (1 << QUEUE_FLAG_SAME_COMP) |          \
> > @@ -643,6 +648,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
> >  #define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
> >  #define blk_queue_add_random(q)      test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
> >  #define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
> > +#define blk_queue_copy(q)    test_bit(QUEUE_FLAG_SIMPLE_COPY, &(q)->queue_flags)
> >  #define blk_queue_zone_resetall(q)   \
> >       test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
> >  #define blk_queue_secure_erase(q) \
> > @@ -1069,6 +1075,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
> >               return min(q->limits.max_discard_sectors,
> >                          UINT_MAX >> SECTOR_SHIFT);
> >
> > +     if (unlikely(op == REQ_OP_COPY))
> > +             return q->limits.max_copy_sectors;
> > +
>
> I would agree with this if a copy BIO was always a single range, but that is
> not the case. So I am not sure this makes sense at all.
>
> >       if (unlikely(op == REQ_OP_WRITE_SAME))
> >               return q->limits.max_write_same_sectors;
> >
> > @@ -1343,6 +1352,12 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> >               sector_t nr_sects, gfp_t gfp_mask, int flags,
> >               struct bio **biop);
> >
> > +#define BLKDEV_COPY_NOEMULATION      (1 << 0)        /* do not emulate if copy offload not supported */
> > +
> > +extern int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
> > +             struct range_entry *src_rlist, struct block_device *dest_bdev,
> > +             sector_t dest, gfp_t gfp_mask, int flags);
>
> No need for extern.
>
> > +
> >  #define BLKDEV_ZERO_NOUNMAP  (1 << 0)  /* do not free blocks */
> >  #define BLKDEV_ZERO_NOFALLBACK       (1 << 1)  /* don't write explicit zeroes */
> >
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index f44eb0a04afd..5cadb176317a 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -64,6 +64,18 @@ struct fstrim_range {
> >       __u64 minlen;
> >  };
> >
> > +struct range_entry {
> > +     __u64 src;
> > +     __u64 len;
> > +};
> > +
> > +struct copy_range {
> > +     __u64 dest;
> > +     __u64 nr_range;
> > +     __u64 range_list;
> > +     __u64 rsvd;
> > +};
> > +
> >  /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
> >  #define FILE_DEDUPE_RANGE_SAME               0
> >  #define FILE_DEDUPE_RANGE_DIFFERS    1
> > @@ -184,6 +196,7 @@ struct fsxattr {
> >  #define BLKSECDISCARD _IO(0x12,125)
> >  #define BLKROTATIONAL _IO(0x12,126)
> >  #define BLKZEROOUT _IO(0x12,127)
> > +#define BLKCOPY _IOWR(0x12, 128, struct copy_range)
> >  /*
> >   * A jump here: 130-131 are reserved for zoned block devices
> >   * (see uapi/linux/blkzoned.h)
> >
>
> Please test your code more thoroughly. It is full of problems that you should
> have detected with better testing, including RO devices, partitions and error
> path coverage.
>
> --
> Damien Le Moal
> Western Digital Research


* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-02-19 12:45 ` [RFC PATCH v5 0/4] add simple copy support SelvaKumar S
                     ` (7 preceding siblings ...)
  2021-02-22  6:52   ` Su Yue
@ 2021-04-10  0:21   ` Max Gurtovoy
  2021-04-10  0:29     ` Chaitanya Kulkarni
  8 siblings, 1 reply; 29+ messages in thread
From: Max Gurtovoy @ 2021-04-10  0:21 UTC (permalink / raw)
  To: SelvaKumar S, linux-nvme
  Cc: axboe, damien.lemoal, kch, sagi, snitzer, selvajove,
	linux-kernel, nj.shetty, linux-block, linux-fsdevel, dm-devel,
	joshi.k, javier.gonz, kbusch, joshiiitr, hch


On 2/19/2021 2:45 PM, SelvaKumar S wrote:
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified")
>
> The Specification can be found in following link.
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>
> Simple copy command is a copy offloading operation and is  used to copy
> multiple contiguous ranges (source_ranges) of LBA's to a single destination
> LBA within the device reducing traffic between host and device.
>
> This implementation doesn't add native copy offload support for stacked
> devices rather copy offload is done through emulation. Possible use
> cases are F2FS gc and BTRFS relocation/balance.
>
> *blkdev_issue_copy* takes source bdev, no of sources, array of source
> ranges (in sectors), destination bdev and destination offset(in sectors).
> If both source and destination block devices are same and copy_offload = 1,
> then copy is done through native copy offloading. Copy emulation is used
> in other cases.
>
> As SCSI XCOPY can take two different block devices and no of source range is
> equal to 1, this interface can be extended in future to support SCSI XCOPY.

Any idea why this TP wasn't designed for copy offload between 2 
different namespaces in the same controller ?

And a simple copy will be the case where the src_nsid == dst_nsid ?

Also why there are multiple source ranges and only one dst range ? We 
could add a bit to indicate if this range is src or dst..


>
> For devices supporting native simple copy, attach the control information
> as payload to the bio and submit to the device. For devices without native
> copy support, copy emulation is done by reading each source range into memory
> and writing it to the destination. Caller can choose not to try
> emulation if copy offload is not supported by setting
> BLKDEV_COPY_NOEMULATION flag.
>
> Following limits are added to queue limits and are exposed in sysfs
> to userspace
> 	- *copy_offload* controls copy_offload. set 0 to disable copy
> 		offload, 1 to enable native copy offloading support.
> 	- *max_copy_sectors* limits the sum of all source_range length
> 	- *max_copy_nr_ranges* limits the number of source ranges
> 	- *max_copy_range_sectors* limit the maximum number of sectors
> 		that can constitute a single source range.
>
> 	max_copy_sectors = 0 indicates the device doesn't support copy
> offloading.
>
> 	*copy offload* sysfs entry is configurable and can be used toggle
> between emulation and native support depending upon the usecase.
>
> Changes from v4
>
> 1. Extend dm-kcopyd to leverage copy-offload, while copying within the
> same device. The other approach was to have copy-emulation by moving
> dm-kcopyd to block layer. But it also required moving core dm-io infra,
> causing a massive churn across multiple dm-targets.
>
> 2. Remove export in bio_map_kern()
> 3. Change copy_offload sysfs to accept 0 or else
> 4. Rename copy support flag to QUEUE_FLAG_SIMPLE_COPY
> 5. Rename payload entries, add source bdev field to be used while
> partition remapping, remove copy_size
> 6. Change the blkdev_issue_copy() interface to accept destination and
> source values in sector rather in bytes
> 7. Add payload to bio using bio_map_kern() for copy_offload case
> 8. Add check to return error if one of the source range length is 0
> 9. Add BLKDEV_COPY_NOEMULATION flag to allow user to not try copy
> emulation incase of copy offload is not supported. Caller can his use
> his existing copying logic to complete the io.
> 10. Bug fix copy checks and reduce size of rcu_lock()
>
> Planned for next:
> - adding blktests
> - handling larger (than device limits) copy
> - decide on ioctl interface (man-page etc.)
>
> Changes from v3
>
> 1. gfp_flag fixes.
> 2. Export bio_map_kern() and use it to allocate and add pages to bio.
> 3. Move copy offload, reading to buf, writing from buf to separate functions.
> 4. Send read bio of copy offload by chaining them and submit asynchronously.
> 5. Add gendisk->part0 and part->bd_start_sect changes to blk_check_copy().
> 6. Move single source range limit check to blk_check_copy()
> 7. Rename __blkdev_issue_copy() to blkdev_issue_copy and remove old helper.
> 8. Change blkdev_issue_copy() interface generic to accepts destination bdev
> 	to support XCOPY as well.
> 9. Add invalidate_kernel_vmap_range() after reading data for vmalloc'ed memory.
> 10. Fix buf allocoation logic to allocate buffer for the total size of copy.
> 11. Reword patch commit description.
>
> Changes from v2
>
> 1. Add emulation support for devices not supporting copy.
> 2. Add *copy_offload* sysfs entry to enable and disable copy_offload
> 	in devices supporting simple copy.
> 3. Remove simple copy support for stacked devices.
>
> Changes from v1:
>
> 1. Fix memory leak in __blkdev_issue_copy
> 2. Unmark blk_check_copy inline
> 3. Fix line break in blk_check_copy_eod
> 4. Remove p checks and made code more readable
> 5. Don't use bio_set_op_attrs and remove op and set
>     bi_opf directly
> 6. Use struct_size to calculate total_size
> 7. Fix partition remap of copy destination
> 8. Remove mcl,mssrl,msrc from nvme_ns
> 9. Initialize copy queue limits to 0 in nvme_config_copy
> 10. Remove return in QUEUE_FLAG_COPY check
> 11. Remove unused OCFS
>
> SelvaKumar S (4):
>    block: make bio_map_kern() non static
>    block: add simple copy support
>    nvme: add simple copy support
>    dm kcopyd: add simple copy offload support
>
>   block/blk-core.c          | 102 +++++++++++++++--
>   block/blk-lib.c           | 223 ++++++++++++++++++++++++++++++++++++++
>   block/blk-map.c           |   2 +-
>   block/blk-merge.c         |   2 +
>   block/blk-settings.c      |  10 ++
>   block/blk-sysfs.c         |  47 ++++++++
>   block/blk-zoned.c         |   1 +
>   block/bounce.c            |   1 +
>   block/ioctl.c             |  33 ++++++
>   drivers/md/dm-kcopyd.c    |  49 ++++++++-
>   drivers/nvme/host/core.c  |  87 +++++++++++++++
>   include/linux/bio.h       |   1 +
>   include/linux/blk_types.h |  14 +++
>   include/linux/blkdev.h    |  17 +++
>   include/linux/nvme.h      |  43 +++++++-
>   include/uapi/linux/fs.h   |  13 +++
>   16 files changed, 627 insertions(+), 18 deletions(-)
>


* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-04-10  0:21   ` Max Gurtovoy
@ 2021-04-10  0:29     ` Chaitanya Kulkarni
  2021-04-10  6:32       ` Javier González
  0 siblings, 1 reply; 29+ messages in thread
From: Chaitanya Kulkarni @ 2021-04-10  0:29 UTC (permalink / raw)
  To: Max Gurtovoy, SelvaKumar S, linux-nvme
  Cc: axboe, Damien Le Moal, kch, sagi, snitzer, selvajove,
	linux-kernel, nj.shetty, linux-block, linux-fsdevel, dm-devel,
	joshi.k, javier.gonz, kbusch, joshiiitr, hch

On 4/9/21 17:22, Max Gurtovoy wrote:
> On 2/19/2021 2:45 PM, SelvaKumar S wrote:
>> This patchset tries to add support for TP4065a ("Simple Copy Command"),
>> v2020.05.04 ("Ratified")
>>
>> The Specification can be found in following link.
>> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>>
>> Simple copy command is a copy offloading operation and is  used to copy
>> multiple contiguous ranges (source_ranges) of LBA's to a single destination
>> LBA within the device reducing traffic between host and device.
>>
>> This implementation doesn't add native copy offload support for stacked
>> devices rather copy offload is done through emulation. Possible use
>> cases are F2FS gc and BTRFS relocation/balance.
>>
>> *blkdev_issue_copy* takes source bdev, no of sources, array of source
>> ranges (in sectors), destination bdev and destination offset(in sectors).
>> If both source and destination block devices are same and copy_offload = 1,
>> then copy is done through native copy offloading. Copy emulation is used
>> in other cases.
>>
>> As SCSI XCOPY can take two different block devices and no of source range is
>> equal to 1, this interface can be extended in future to support SCSI XCOPY.
> Any idea why this TP wasn't designed for copy offload between 2 
> different namespaces in the same controller ?

Yes, it was the first attempt, so it was kept simple.

Further work is needed on an incremental TP so that we can also do a copy
between namespaces of the same controller (if we can't already) and to
namespaces that belong to a different controller.

> And a simple copy will be the case where the src_nsid == dst_nsid ?
>
> Also why there are multiple source ranges and only one dst range ? We 
> could add a bit to indicate if this range is src or dst..
>
>





* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-04-10  0:29     ` Chaitanya Kulkarni
@ 2021-04-10  6:32       ` Javier González
  2021-04-11  9:10         ` Max Gurtovoy
  0 siblings, 1 reply; 29+ messages in thread
From: Javier González @ 2021-04-10  6:32 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Max Gurtovoy, SelvaKumar S, linux-nvme, axboe, Damien Le Moal,
	kch, sagi, snitzer, selvajove, linux-kernel, nj.shetty,
	linux-block, linux-fsdevel, dm-devel, joshi.k, javier.gonz,
	kbusch, joshiiitr, hch


> On 10 Apr 2021, at 02.30, Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com> wrote:
> 
> On 4/9/21 17:22, Max Gurtovoy wrote:
>>> On 2/19/2021 2:45 PM, SelvaKumar S wrote:
>>> This patchset tries to add support for TP4065a ("Simple Copy Command"),
>>> v2020.05.04 ("Ratified")
>>> 
>>> The Specification can be found in following link.
>>> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>>> 
>>> Simple copy command is a copy offloading operation and is  used to copy
>>> multiple contiguous ranges (source_ranges) of LBA's to a single destination
>>> LBA within the device reducing traffic between host and device.
>>> 
>>> This implementation doesn't add native copy offload support for stacked
>>> devices rather copy offload is done through emulation. Possible use
>>> cases are F2FS gc and BTRFS relocation/balance.
>>> 
>>> *blkdev_issue_copy* takes source bdev, no of sources, array of source
>>> ranges (in sectors), destination bdev and destination offset(in sectors).
>>> If both source and destination block devices are same and copy_offload = 1,
>>> then copy is done through native copy offloading. Copy emulation is used
>>> in other cases.
>>> 
>>> As SCSI XCOPY can take two different block devices and no of source range is
>>> equal to 1, this interface can be extended in future to support SCSI XCOPY.
>> Any idea why this TP wasn't designed for copy offload between 2 
>> different namespaces in the same controller ?
> 
> Yes, it was the first attempt so to keep it simple.
> 
> Further work is needed to add incremental TP so that we can also do a copy
> between the name-spaces of same controller (if we can't already) and to the
> namespaces that belongs to the different controller.
> 
>> And a simple copy will be the case where the src_nsid == dst_nsid ?
>> 
>> Also why there are multiple source ranges and only one dst range ? We 
>> could add a bit to indicate if this range is src or dst..

One of the target use cases was ZNS in order to avoid fabric transfers during host GC. You can see how this plays well with several zone ranges and a single zone destination. 

If we start getting support in Linux through the different past copy offload efforts, I’m sure we can extend this TP in the future. 


* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-04-10  6:32       ` Javier González
@ 2021-04-11  9:10         ` Max Gurtovoy
  2021-04-11 19:26           ` Javier González
  0 siblings, 1 reply; 29+ messages in thread
From: Max Gurtovoy @ 2021-04-11  9:10 UTC (permalink / raw)
  To: Javier González, Chaitanya Kulkarni
  Cc: SelvaKumar S, linux-nvme, axboe, Damien Le Moal, kch, sagi,
	snitzer, selvajove, linux-kernel, nj.shetty, linux-block,
	linux-fsdevel, dm-devel, joshi.k, javier.gonz, kbusch, joshiiitr,
	hch


On 4/10/2021 9:32 AM, Javier González wrote:
>> On 10 Apr 2021, at 02.30, Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com> wrote:
>>
>> On 4/9/21 17:22, Max Gurtovoy wrote:
>>>> On 2/19/2021 2:45 PM, SelvaKumar S wrote:
>>>> This patchset tries to add support for TP4065a ("Simple Copy Command"),
>>>> v2020.05.04 ("Ratified")
>>>>
>>>> The Specification can be found in following link.
>>>> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>>>>
>>>> Simple copy command is a copy offloading operation and is  used to copy
>>>> multiple contiguous ranges (source_ranges) of LBA's to a single destination
>>>> LBA within the device reducing traffic between host and device.
>>>>
>>>> This implementation doesn't add native copy offload support for stacked
>>>> devices rather copy offload is done through emulation. Possible use
>>>> cases are F2FS gc and BTRFS relocation/balance.
>>>>
>>>> *blkdev_issue_copy* takes source bdev, no of sources, array of source
>>>> ranges (in sectors), destination bdev and destination offset(in sectors).
>>>> If both source and destination block devices are same and copy_offload = 1,
>>>> then copy is done through native copy offloading. Copy emulation is used
>>>> in other cases.
>>>>
>>>> As SCSI XCOPY can take two different block devices and no of source range is
>>>> equal to 1, this interface can be extended in future to support SCSI XCOPY.
>>> Any idea why this TP wasn't designed for copy offload between 2
>>> different namespaces in the same controller ?
>> Yes, it was the first attempt so to keep it simple.
>>
>> Further work is needed to add incremental TP so that we can also do a copy
>> between the name-spaces of same controller (if we can't already) and to the
>> namespaces that belongs to the different controller.
>>
>>> And a simple copy will be the case where the src_nsid == dst_nsid ?
>>>
>>> Also why there are multiple source ranges and only one dst range ? We
>>> could add a bit to indicate if this range is src or dst..
> One of the target use cases was ZNS in order to avoid fabric transfers during host GC. You can see how this plays well with several zone ranges and a single zone destination.
>
> If we start getting support in Linux through the different past copy offload efforts, I’m sure we can extend this TP in the future.

But the "copy" command IMO is more general than the ZNS GC case, that 
can be a private case of copy, isn't it ?

We can get a big benefit of offloading the data copy from one ns to 
another in the same controller and even in different controllers in the 
same subsystem.

Do you think the extension should be to "copy" command or to create a 
new command "x_copy" for copying to different destination ns ?


>   


* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-04-11  9:10         ` Max Gurtovoy
@ 2021-04-11 19:26           ` Javier González
  2021-04-13 15:38             ` Max Gurtovoy
  0 siblings, 1 reply; 29+ messages in thread
From: Javier González @ 2021-04-11 19:26 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Chaitanya Kulkarni, SelvaKumar S, linux-nvme, axboe,
	Damien Le Moal, kch, sagi, snitzer, selvajove, linux-kernel,
	nj.shetty, linux-block, linux-fsdevel, dm-devel, joshi.k, kbusch,
	joshiiitr, hch

On 11.04.2021 12:10, Max Gurtovoy wrote:
>
>On 4/10/2021 9:32 AM, Javier González wrote:
>>>On 10 Apr 2021, at 02.30, Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com> wrote:
>>>
>>>On 4/9/21 17:22, Max Gurtovoy wrote:
>>>>>On 2/19/2021 2:45 PM, SelvaKumar S wrote:
>>>>>This patchset tries to add support for TP4065a ("Simple Copy Command"),
>>>>>v2020.05.04 ("Ratified")
>>>>>
>>>>>The Specification can be found in following link.
>>>>>https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>>>>>
>>>>>Simple copy command is a copy offloading operation and is  used to copy
>>>>>multiple contiguous ranges (source_ranges) of LBA's to a single destination
>>>>>LBA within the device reducing traffic between host and device.
>>>>>
>>>>>This implementation doesn't add native copy offload support for stacked
>>>>>devices rather copy offload is done through emulation. Possible use
>>>>>cases are F2FS gc and BTRFS relocation/balance.
>>>>>
>>>>>*blkdev_issue_copy* takes source bdev, no of sources, array of source
>>>>>ranges (in sectors), destination bdev and destination offset(in sectors).
>>>>>If both source and destination block devices are same and copy_offload = 1,
>>>>>then copy is done through native copy offloading. Copy emulation is used
>>>>>in other cases.
>>>>>
>>>>>As SCSI XCOPY can take two different block devices and no of source range is
>>>>>equal to 1, this interface can be extended in future to support SCSI XCOPY.
>>>>Any idea why this TP wasn't designed for copy offload between 2
>>>>different namespaces in the same controller ?
>>>Yes, it was the first attempt so to keep it simple.
>>>
>>>Further work is needed to add incremental TP so that we can also do a copy
>>>between the name-spaces of same controller (if we can't already) and to the
>>>namespaces that belongs to the different controller.
>>>
>>>>And a simple copy will be the case where the src_nsid == dst_nsid ?
>>>>
>>>>Also why there are multiple source ranges and only one dst range ? We
>>>>could add a bit to indicate if this range is src or dst..
>>One of the target use cases was ZNS in order to avoid fabric transfers during host GC. You can see how this plays well with several zone ranges and a single zone destination.
>>
>>If we start getting support in Linux through the different past copy offload efforts, I’m sure we can extend this TP in the future.
>
>But the "copy" command IMO is more general than the ZNS GC case, that 
>can be a private case of copy, isn't it ?

It applies to any namespace type, so yes. I just wanted to give you the
background for the current "simple" scope through one of the use cases
that we had in mind.

>We can get a big benefit of offloading the data copy from one ns to 
>another in the same controller and even in different controllers in 
>the same subsystem.

Definitely.

>
>Do you think the extension should be to "copy" command or to create a 
>new command "x_copy" for copying to different destination ns ?

I believe there is space for extensions to simple copy. But given the
experience with XCOPY, I can imagine that changes will be incremental,
based on very specific use cases.

I think getting support upstream and bringing deployed cases is a very
good start.


* Re: [RFC PATCH v5 2/4] block: add simple copy support
  2021-04-07 11:32         ` Selva Jove
@ 2021-04-12  0:24           ` Damien Le Moal
  2021-04-12 14:34             ` Selva Jove
  0 siblings, 1 reply; 29+ messages in thread
From: Damien Le Moal @ 2021-04-12  0:24 UTC (permalink / raw)
  To: Selva Jove
  Cc: SelvaKumar S, linux-nvme, kbusch, axboe, hch, sagi, linux-block,
	linux-kernel, dm-devel, snitzer, joshiiitr, nj.shetty, joshi.k,
	javier.gonz, kch, linux-fsdevel

On 2021/04/07 20:33, Selva Jove wrote:
> Initially I started moving the dm-kcopyd interface to the block layer
> as a generic interface.
> Once I dig deeper in dm-kcopyd code, I figured that dm-kcopyd is
> tightly coupled with dm_io()
> 
> To move dm-kcopyd to block layer, it would also require dm_io code to
> be moved to block layer.
> It would cause havoc in dm layer, as it is the backbone of the
> dm-layer and needs complete
> rewriting of dm-layer. Do you see any other way of doing this without
> having to move dm_io code
> or to have redundant code ?

Right. Missed that. So reusing dm-kcopyd and making it a common interface will
take some more effort. OK, then. For the first round of commits, let's forget
about this. But I still think that your emulation could be a lot better than a
loop doing blocking writes after blocking reads.

[...]
>>> +int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
>>> +             struct range_entry *src_rlist, struct block_device *dest_bdev,
>>> +             sector_t dest, gfp_t gfp_mask, int flags)
>>> +{
>>> +     struct request_queue *q = bdev_get_queue(src_bdev);
>>> +     struct request_queue *dest_q = bdev_get_queue(dest_bdev);
>>> +     struct blk_copy_payload *payload;
>>> +     sector_t bs_mask, copy_size;
>>> +     int ret;
>>> +
>>> +     ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
>>> +                     &payload, &copy_size);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
>>> +     if (dest & bs_mask) {
>>> +             return -EINVAL;
>>> +             goto out;
>>> +     }
>>> +
>>> +     if (q == dest_q && q->limits.copy_offload) {
>>> +             ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
>>> +             if (ret)
>>> +                     goto out;
>>> +     } else if (flags & BLKDEV_COPY_NOEMULATION) {
>>
>> Why ? whoever calls blkdev_issue_copy() wants a copy to be done. Why would that
>> user say "Fail on me if the device does not support copy" ??? This is a weird
>> interface in my opinion.
>>
> 
> BLKDEV_COPY_NOEMULATION flag was introduced to allow blkdev_issue_copy() callers
> to use their native copying method instead of the emulated copy that I
> added. This way we
> ensure that dm uses the hw-assisted copy and if that is not present,
> it falls back to existing
> copy method.
> 
> The other users who don't have their native emulation can use this
> emulated-copy implementation.

I do not understand. Emulation or not should be entirely driven by the device
reporting support for simple copy (or not). It does not matter which component
is issuing the simple copy call: an FS to a real device, and FS to a DM device
or a DM target driver. If the underlying device reported support for simple
copy, use that. Otherwise, emulate with read/write. What am I missing here ?

[...]
>>> @@ -565,6 +569,12 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>>       if (b->chunk_sectors)
>>>               t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
>>>
>>> +     /* simple copy not supported in stacked devices */
>>> +     t->copy_offload = 0;
>>> +     t->max_copy_sectors = 0;
>>> +     t->max_copy_range_sectors = 0;
>>> +     t->max_copy_nr_ranges = 0;
>>
>> You do not need this. Limits not explicitely initialized are 0 already.
>> But I do not see why you can't support copy on stacked devices. That should be
>> feasible taking the min() for each of the above limit.
>>
> 
> Disabling stacked device support was feedback from v2.
> 
> https://patchwork.kernel.org/project/linux-block/patch/20201204094659.12732-2-selvakuma.s1@samsung.com/

Right. But the initialization to 0 is still not needed. The fields are already
initialized to 0.


-- 
Damien Le Moal
Western Digital Research


* Re: [RFC PATCH v5 2/4] block: add simple copy support
  2021-04-12  0:24           ` Damien Le Moal
@ 2021-04-12 14:34             ` Selva Jove
  2021-04-13  0:32               ` Damien Le Moal
  0 siblings, 1 reply; 29+ messages in thread
From: Selva Jove @ 2021-04-12 14:34 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: SelvaKumar S, linux-nvme, kbusch, axboe, hch, sagi, linux-block,
	linux-kernel, dm-devel, snitzer, joshiiitr, nj.shetty, joshi.k,
	javier.gonz, kch, linux-fsdevel

On Mon, Apr 12, 2021 at 5:55 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> On 2021/04/07 20:33, Selva Jove wrote:
> > Initially I started moving the dm-kcopyd interface to the block layer
> > as a generic interface.
> > Once I dig deeper in dm-kcopyd code, I figured that dm-kcopyd is
> > tightly coupled with dm_io()
> >
> > To move dm-kcopyd to block layer, it would also require dm_io code to
> > be moved to block layer.
> > It would cause havoc in dm layer, as it is the backbone of the
> > dm-layer and needs complete
> > rewriting of dm-layer. Do you see any other way of doing this without
> > having to move dm_io code
> > or to have redundant code ?
>
> Right. Missed that. So reusing dm-kcopyd and making it a common interface will
> take some more efforts. OK, then. For the first round of commits, let's forget
> about this. But I still think that your emulation could be a lot better than a
> loop doing blocking writes after blocking reads.
>

The current implementation issues the reads asynchronously and, once
all the reads are completed, issues the write as a whole to reduce the
IO traffic in the queue. I agree that things can be better. Will
explore another approach of sending each write immediately once its
read completes, with plugging to increase the chances of merging.
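
Roughly along these lines (a sketch only; read_range_sync() and
write_range_async() are stand-ins for the per-range helpers, not
existing functions):

	struct blk_plug plug;
	int i, ret = 0;

	blk_start_plug(&plug);
	for (i = 0; i < payload->copy_nr_ranges; i++) {
		/* read range i into its slice of the bounce buffer */
		ret = read_range_sync(src_bdev, &payload->range[i], buf);
		if (ret)
			break;
		/*
		 * Queue the matching write right away; plugging gives
		 * the block layer a chance to merge adjacent writes.
		 */
		ret = write_range_async(dest_bdev, &payload->range[i],
					buf, dest);
		if (ret)
			break;
	}
	blk_finish_plug(&plug);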

> [...]
> >>> +int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
> >>> +             struct range_entry *src_rlist, struct block_device *dest_bdev,
> >>> +             sector_t dest, gfp_t gfp_mask, int flags)
> >>> +{
> >>> +     struct request_queue *q = bdev_get_queue(src_bdev);
> >>> +     struct request_queue *dest_q = bdev_get_queue(dest_bdev);
> >>> +     struct blk_copy_payload *payload;
> >>> +     sector_t bs_mask, copy_size;
> >>> +     int ret;
> >>> +
> >>> +     ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
> >>> +                     &payload, &copy_size);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
> >>> +     if (dest & bs_mask) {
> >>> +             return -EINVAL;
> >>> +             goto out;
> >>> +     }
> >>> +
> >>> +     if (q == dest_q && q->limits.copy_offload) {
> >>> +             ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
> >>> +             if (ret)
> >>> +                     goto out;
> >>> +     } else if (flags & BLKDEV_COPY_NOEMULATION) {
> >>
> >> Why ? whoever calls blkdev_issue_copy() wants a copy to be done. Why would that
> >> user say "Fail on me if the device does not support copy" ??? This is a weird
> >> interface in my opinion.
> >>
> >
> > BLKDEV_COPY_NOEMULATION flag was introduced to allow blkdev_issue_copy() callers
> > to use their native copying method instead of the emulated copy that I
> > added. This way we
> > ensure that dm uses the hw-assisted copy and if that is not present,
> > it falls back to existing
> > copy method.
> >
> > The other users who don't have their native emulation can use this
> > emulated-copy implementation.
>
> I do not understand. Emulation or not should be entirely driven by the device
> reporting support for simple copy (or not). It does not matter which component
> is issuing the simple copy call: an FS to a real device, and FS to a DM device
> or a DM target driver. If the underlying device reported support for simple
> copy, use that. Otherwise, emulate with read/write. What am I missing here ?
>

The blkdev_issue_copy() api will generally complete the copy
operation, either by using offloaded copy or by using emulated copy.
The caller of the api is not required to figure out the type of
support. However, it can opt out of the emulated copy by specifying
the BLKDEV_COPY_NOEMULATION flag. This is helpful when the caller
already has a sophisticated emulation of its own (e.g. dm-kcopyd
users).

>
> [...]
> >>> @@ -565,6 +569,12 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> >>>       if (b->chunk_sectors)
> >>>               t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
> >>>
> >>> +     /* simple copy not supported in stacked devices */
> >>> +     t->copy_offload = 0;
> >>> +     t->max_copy_sectors = 0;
> >>> +     t->max_copy_range_sectors = 0;
> >>> +     t->max_copy_nr_ranges = 0;
> >>
> >> You do not need this. Limits not explicitely initialized are 0 already.
> >> But I do not see why you can't support copy on stacked devices. That should be
> >> feasible taking the min() for each of the above limit.
> >>
> >
> > Disabling stacked device support was feedback from v2.
> >
> > https://patchwork.kernel.org/project/linux-block/patch/20201204094659.12732-2-selvakuma.s1@samsung.com/
>
> Right. But the initialization to 0 is still not needed. The fields are already
> initialized to 0.
>
>
> --
> Damien Le Moal
> Western Digital Research


* Re: [RFC PATCH v5 2/4] block: add simple copy support
  2021-04-12 14:34             ` Selva Jove
@ 2021-04-13  0:32               ` Damien Le Moal
  2021-04-14  6:58                 ` Selva Jove
  0 siblings, 1 reply; 29+ messages in thread
From: Damien Le Moal @ 2021-04-13  0:32 UTC (permalink / raw)
  To: Selva Jove
  Cc: SelvaKumar S, linux-nvme, kbusch, axboe, hch, sagi, linux-block,
	linux-kernel, dm-devel, snitzer, joshiiitr, nj.shetty, joshi.k,
	javier.gonz, kch, linux-fsdevel

On 2021/04/12 23:35, Selva Jove wrote:
> On Mon, Apr 12, 2021 at 5:55 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>>
>> On 2021/04/07 20:33, Selva Jove wrote:
>>> Initially I started moving the dm-kcopyd interface to the block layer
>>> as a generic interface.
>>> Once I dig deeper in dm-kcopyd code, I figured that dm-kcopyd is
>>> tightly coupled with dm_io()
>>>
>>> To move dm-kcopyd to block layer, it would also require dm_io code to
>>> be moved to block layer.
>>> It would cause havoc in dm layer, as it is the backbone of the
>>> dm-layer and needs complete
>>> rewriting of dm-layer. Do you see any other way of doing this without
>>> having to move dm_io code
>>> or to have redundant code ?
>>
>> Right. Missed that. So reusing dm-kcopyd and making it a common interface will
>> take some more efforts. OK, then. For the first round of commits, let's forget
>> about this. But I still think that your emulation could be a lot better than a
>> loop doing blocking writes after blocking reads.
>>
> 
> Current implementation issues read asynchronously and once all the reads are
> completed, then the write is issued as whole to reduce the IO traffic
> in the queue.
> I agree that things can be better. Will explore another approach of
> sending writes
> immediately once reads are completed and with  plugging to increase the chances
> of merging.
> 
>> [...]
>>>>> +int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
>>>>> +             struct range_entry *src_rlist, struct block_device *dest_bdev,
>>>>> +             sector_t dest, gfp_t gfp_mask, int flags)
>>>>> +{
>>>>> +     struct request_queue *q = bdev_get_queue(src_bdev);
>>>>> +     struct request_queue *dest_q = bdev_get_queue(dest_bdev);
>>>>> +     struct blk_copy_payload *payload;
>>>>> +     sector_t bs_mask, copy_size;
>>>>> +     int ret;
>>>>> +
>>>>> +     ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
>>>>> +                     &payload, &copy_size);
>>>>> +     if (ret)
>>>>> +             return ret;
>>>>> +
>>>>> +     bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
>>>>> +     if (dest & bs_mask) {
>>>>> +             return -EINVAL;
>>>>> +             goto out;
>>>>> +     }
>>>>> +
>>>>> +     if (q == dest_q && q->limits.copy_offload) {
>>>>> +             ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
>>>>> +             if (ret)
>>>>> +                     goto out;
>>>>> +     } else if (flags & BLKDEV_COPY_NOEMULATION) {
>>>>
>>>> Why ? whoever calls blkdev_issue_copy() wants a copy to be done. Why would that
>>>> user say "Fail on me if the device does not support copy" ??? This is a weird
>>>> interface in my opinion.
>>>>
>>>
>>> BLKDEV_COPY_NOEMULATION flag was introduced to allow blkdev_issue_copy() callers
>>> to use their native copying method instead of the emulated copy that I
>>> added. This way we
>>> ensure that dm uses the hw-assisted copy and if that is not present,
>>> it falls back to existing
>>> copy method.
>>>
>>> The other users who don't have their native emulation can use this
>>> emulated-copy implementation.
>>
>> I do not understand. Emulation or not should be entirely driven by the device
>> reporting support for simple copy (or not). It does not matter which component
>> is issuing the simple copy call: an FS to a real device, and FS to a DM device
>> or a DM target driver. If the underlying device reported support for simple
>> copy, use that. Otherwise, emulate with read/write. What am I missing here ?
>>
> 
> blkdev_issue_copy() api will generally complete the copy-operation,
> either by using
> offloaded-copy or by using emulated-copy. The caller of the api is not
> required to
> figure the type of support. However, it can opt out of emulated-copy
> by specifying
> the flag BLKDEV_NOEMULATION. This is helpful for the case when the
> caller already
> has got a sophisticated emulation (e.g. dm-kcopyd users).

This does not make any sense to me. If the user already has another means of
doing copies, then that user will not call blkdev_issue_copy(). So I really do
not understand what "opting out of emulated copy" would be useful for. That
user can check the simple copy support flag in the device request queue and act
accordingly: use its own block copy code when simple copy is not supported, or
use blkdev_issue_copy() when the device has simple copy. Adding
BLKDEV_COPY_NOEMULATION does not serve any purpose at all.
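
A minimal sketch of that caller-side pattern, using the copy_offload queue
limit added by this series (the fallback helper here is hypothetical):

#include <linux/blkdev.h>

/* hypothetical caller-private fallback, e.g. built on dm-kcopyd */
static int my_own_copy_fallback(struct block_device *bdev, int nr_srcs,
				struct range_entry *rlist, sector_t dest,
				gfp_t gfp);

/*
 * Sketch: the caller probes simple-copy support itself and picks its own
 * fallback, instead of passing an opt-out flag to blkdev_issue_copy().
 */
static int do_copy(struct block_device *bdev, int nr_srcs,
		   struct range_entry *rlist, sector_t dest, gfp_t gfp)
{
	struct request_queue *q = bdev_get_queue(bdev);

	if (q->limits.copy_offload)	/* queue limit added by this series */
		return blkdev_issue_copy(bdev, nr_srcs, rlist, bdev, dest,
					 gfp, 0);

	return my_own_copy_fallback(bdev, nr_srcs, rlist, dest, gfp);
}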



-- 
Damien Le Moal
Western Digital Research


* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-04-11 19:26           ` Javier González
@ 2021-04-13 15:38             ` Max Gurtovoy
  2021-04-13 18:25               ` Javier González
  0 siblings, 1 reply; 29+ messages in thread
From: Max Gurtovoy @ 2021-04-13 15:38 UTC (permalink / raw)
  To: Javier González
  Cc: Chaitanya Kulkarni, SelvaKumar S, linux-nvme, axboe,
	Damien Le Moal, kch, sagi, snitzer, selvajove, linux-kernel,
	nj.shetty, linux-block, linux-fsdevel, dm-devel, joshi.k, kbusch,
	joshiiitr, hch


On 4/11/2021 10:26 PM, Javier González wrote:
> On 11.04.2021 12:10, Max Gurtovoy wrote:
>>
>> On 4/10/2021 9:32 AM, Javier González wrote:
>>>> On 10 Apr 2021, at 02.30, Chaitanya Kulkarni 
>>>> <Chaitanya.Kulkarni@wdc.com> wrote:
>>>>
>>>> On 4/9/21 17:22, Max Gurtovoy wrote:
>>>>>> On 2/19/2021 2:45 PM, SelvaKumar S wrote:
>>>>>> This patchset tries to add support for TP4065a ("Simple Copy 
>>>>>> Command"),
>>>>>> v2020.05.04 ("Ratified")
>>>>>>
>>>>>> The Specification can be found in following link.
>>>>>> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip 
>>>>>>
>>>>>>
>>>>>> Simple copy command is a copy offloading operation and is  used 
>>>>>> to copy
>>>>>> multiple contiguous ranges (source_ranges) of LBA's to a single 
>>>>>> destination
>>>>>> LBA within the device reducing traffic between host and device.
>>>>>>
>>>>>> This implementation doesn't add native copy offload support for 
>>>>>> stacked
>>>>>> devices rather copy offload is done through emulation. Possible use
>>>>>> cases are F2FS gc and BTRFS relocation/balance.
>>>>>>
>>>>>> *blkdev_issue_copy* takes source bdev, no of sources, array of 
>>>>>> source
>>>>>> ranges (in sectors), destination bdev and destination offset(in 
>>>>>> sectors).
>>>>>> If both source and destination block devices are same and 
>>>>>> copy_offload = 1,
>>>>>> then copy is done through native copy offloading. Copy emulation 
>>>>>> is used
>>>>>> in other cases.
>>>>>>
>>>>>> As SCSI XCOPY can take two different block devices and no of 
>>>>>> source range is
>>>>>> equal to 1, this interface can be extended in future to support 
>>>>>> SCSI XCOPY.
>>>>> Any idea why this TP wasn't designed for copy offload between 2
>>>>> different namespaces in the same controller ?
>>>> Yes, it was the first attempt, so it was kept simple.
>>>>
>>>> Further work is needed on an incremental TP so that we can also do a copy
>>>> between namespaces of the same controller (if we can't already) and to
>>>> namespaces that belong to a different controller.
>>>>
>>>>> And a simple copy will be the case where the src_nsid == dst_nsid ?
>>>>>
>>>>> Also why there are multiple source ranges and only one dst range ? We
>>>>> could add a bit to indicate if this range is src or dst..
>>> One of the target use cases was ZNS in order to avoid fabric 
>>> transfers during host GC. You can see how this plays well with 
>>> several zone ranges and a single zone destination.
>>>
>>> If we start getting support in Linux through the different past copy 
>>> offload efforts, I’m sure we can extend this TP in the future.
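
To make that concrete, a hedged sketch of such a host-GC call, with several
source zones and a single destination write pointer (the range_entry src/len
field names are assumptions from this series):

#include <linux/blkdev.h>
#include <linux/kernel.h>

/* Sketch: copy three victim extents into a single destination zone. */
static int gc_copy_sketch(struct block_device *bdev, sector_t dst_zone_wp,
			  sector_t src_a, sector_t src_b, sector_t src_c,
			  sector_t len)
{
	struct range_entry ranges[] = {		/* field names assumed */
		{ .src = src_a, .len = len },
		{ .src = src_b, .len = len },
		{ .src = src_c, .len = len },
	};

	/* several source ranges, one destination: the open zone's WP */
	return blkdev_issue_copy(bdev, ARRAY_SIZE(ranges), ranges,
				 bdev, dst_zone_wp, GFP_KERNEL, 0);
}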
>>
>> But the "copy" command IMO is more general than the ZNS GC case, that 
>> can be a private case of copy, isn't it ?
>
> It applies to any namespace type, so yes. I just wanted to give you the
> background for the current "simple" scope through one of the use cases
> that was in mind.
>
>> We can get a big benefit from offloading the data copy from one ns to
>> another in the same controller, and even in different controllers in the
>> same subsystem.
>
> Definitely.
>
>>
>> Do you think the extension should be to the "copy" command, or should we
>> create a new command "x_copy" for copying to a different destination ns ?
>
> I believe there is space for extensions to simple copy. But given the
> experience with XCOPY, I can imagine that changes will be incremental,
> based on very specific use cases.
>
> I think getting support upstream and bringing deployed cases is a very
> good start.

Copying data (files) within the controller/subsystem from ns_A to ns_B 
using NVMf will reduce network BW and memory BW in the host server.

This feature is well known and the use case is well known.

The question is whether we implement it in a vendor-specific manner or add it
to the specification.

I prefer adding it to the spec :)




* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-04-13 15:38             ` Max Gurtovoy
@ 2021-04-13 18:25               ` Javier González
  2021-04-13 18:36                 ` Chaitanya Kulkarni
  0 siblings, 1 reply; 29+ messages in thread
From: Javier González @ 2021-04-13 18:25 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Chaitanya Kulkarni, SelvaKumar S, linux-nvme, axboe,
	Damien Le Moal, kch, sagi, snitzer, selvajove, linux-kernel,
	nj.shetty, linux-block, linux-fsdevel, dm-devel, joshi.k, kbusch,
	joshiiitr, hch

On 13.04.2021 18:38, Max Gurtovoy wrote:
>
>On 4/11/2021 10:26 PM, Javier González wrote:
>>On 11.04.2021 12:10, Max Gurtovoy wrote:
>>>
>>>On 4/10/2021 9:32 AM, Javier González wrote:
>>>>>On 10 Apr 2021, at 02.30, Chaitanya Kulkarni 
>>>>><Chaitanya.Kulkarni@wdc.com> wrote:
>>>>>
>>>>>On 4/9/21 17:22, Max Gurtovoy wrote:
>>>>>>>On 2/19/2021 2:45 PM, SelvaKumar S wrote:
>>>>>>>This patchset tries to add support for TP4065a ("Simple 
>>>>>>>Copy Command"),
>>>>>>>v2020.05.04 ("Ratified")
>>>>>>>
>>>>>>>The Specification can be found in following link.
>>>>>>>https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>>>>>>>
>>>>>>>
>>>>>>>Simple copy command is a copy offloading operation and is  
>>>>>>>used to copy
>>>>>>>multiple contiguous ranges (source_ranges) of LBA's to a 
>>>>>>>single destination
>>>>>>>LBA within the device reducing traffic between host and device.
>>>>>>>
>>>>>>>This implementation doesn't add native copy offload 
>>>>>>>support for stacked
>>>>>>>devices rather copy offload is done through emulation. Possible use
>>>>>>>cases are F2FS gc and BTRFS relocation/balance.
>>>>>>>
>>>>>>>*blkdev_issue_copy* takes source bdev, no of sources, 
>>>>>>>array of source
>>>>>>>ranges (in sectors), destination bdev and destination 
>>>>>>>offset(in sectors).
>>>>>>>If both source and destination block devices are same and 
>>>>>>>copy_offload = 1,
>>>>>>>then copy is done through native copy offloading. Copy 
>>>>>>>emulation is used
>>>>>>>in other cases.
>>>>>>>
>>>>>>>As SCSI XCOPY can take two different block devices and no 
>>>>>>>of source range is
>>>>>>>equal to 1, this interface can be extended in future to 
>>>>>>>support SCSI XCOPY.
>>>>>>Any idea why this TP wasn't designed for copy offload between 2
>>>>>>different namespaces in the same controller ?
>>>>>Yes, it was the first attempt, so it was kept simple.
>>>>>
>>>>>Further work is needed on an incremental TP so that we can also do a
>>>>>copy between namespaces of the same controller (if we can't already)
>>>>>and to namespaces that belong to a different controller.
>>>>>
>>>>>>And a simple copy will be the case where the src_nsid == dst_nsid ?
>>>>>>
>>>>>>Also why there are multiple source ranges and only one dst range ? We
>>>>>>could add a bit to indicate if this range is src or dst..
>>>>One of the target use cases was ZNS in order to avoid fabric 
>>>>transfers during host GC. You can see how this plays well with 
>>>>several zone ranges and a single zone destination.
>>>>
>>>>If we start getting support in Linux through the different past 
>>>>copy offload efforts, I’m sure we can extend this TP in the 
>>>>future.
>>>
>>>But the "copy" command IMO is more general than the ZNS GC case, 
>>>that can be a private case of copy, isn't it ?
>>
>>It applies to any namespace type, so yes. I just wanted to give you the
>>background for the current "simple" scope through one of the use cases
>>that was in mind.
>>
>>>We can get a big benefit from offloading the data copy from one ns
>>>to another in the same controller, and even in different
>>>controllers in the same subsystem.
>>
>>Definitely.
>>
>>>
>>>Do you think the extension should be to the "copy" command, or should
>>>we create a new command "x_copy" for copying to a different
>>>destination ns ?
>>
>>I believe there is space for extensions to simple copy. But given the
>>experience with XCOPY, I can imagine that changes will be incremental,
>>based on very specific use cases.
>>
>>I think getting support upstream and bringing deployed cases is a very
>>good start.
>
>Copying data (files) within the controller/subsystem from ns_A to ns_B 
>using NVMf will reduce network BW and memory BW in the host server.
>
>This feature is well known and the use case is well known.

Definitely.

>
>The question is whether we implement it in a vendor-specific manner or
>add it to the specification.
>
>I prefer adding it to the spec :)

Agree. Let's build on top of Simple Copy. We can talk about it
offline in the context of the NVMe TWG.


* Re: [RFC PATCH v5 0/4] add simple copy support
  2021-04-13 18:25               ` Javier González
@ 2021-04-13 18:36                 ` Chaitanya Kulkarni
  0 siblings, 0 replies; 29+ messages in thread
From: Chaitanya Kulkarni @ 2021-04-13 18:36 UTC (permalink / raw)
  To: Javier González, Max Gurtovoy
  Cc: SelvaKumar S, linux-nvme, axboe, Damien Le Moal, kch, sagi,
	snitzer, selvajove, linux-kernel, nj.shetty, linux-block,
	linux-fsdevel, dm-devel, joshi.k, kbusch, joshiiitr, hch

On 4/13/21 11:26, Javier González wrote:
>>> I believe there is space for extensions to simple copy. But given the
>>> experience with XCOPY, I can imagine that changes will be incremental,
>>> based on very specific use cases.
>>>
>>> I think getting support upstream and bringing deployed cases is a very
>>> good start.
>> Copying data (files) within the controller/subsystem from ns_A to ns_B 
>> using NVMf will reduce network BW and memory BW in the host server.
>>
>> This feature is well known and the use case is well known.
> Definitely.
>

I have working nvmet code for simple copy; I'm waiting for the host interface
for REQ_OP_COPY to be resolved so I can post it with this series.

Let me know if someone wants to collaborate offline on that.

IMHO we first need to sort out the host-side interface, which has been a
challenge for years; based on the history, it is not easy to get right.
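
For reference, a minimal sketch of the host-side submission as described in
this series' cover letter: the control information is attached as a payload
to a REQ_OP_COPY bio via bio_map_kern(). REQ_OP_COPY and blk_copy_payload
come from the series itself; the synchronous wait is only for brevity:

#include <linux/bio.h>
#include <linux/blkdev.h>

static int submit_copy_offload(struct block_device *bdev,
			       struct blk_copy_payload *payload,
			       unsigned int payload_len, sector_t dest,
			       gfp_t gfp_mask)
{
	struct request_queue *q = bdev_get_queue(bdev);
	struct bio *bio;
	int ret;

	/* attach the copy control information as the bio payload */
	bio = bio_map_kern(q, payload, payload_len, gfp_mask);
	if (IS_ERR(bio))
		return PTR_ERR(bio);

	bio_set_dev(bio, bdev);
	bio->bi_iter.bi_sector = dest;		/* destination, in sectors */
	bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;

	ret = submit_bio_wait(bio);		/* synchronous, for the sketch */
	bio_put(bio);
	return ret;
}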




* Re: [RFC PATCH v5 2/4] block: add simple copy support
  2021-04-13  0:32               ` Damien Le Moal
@ 2021-04-14  6:58                 ` Selva Jove
  0 siblings, 0 replies; 29+ messages in thread
From: Selva Jove @ 2021-04-14  6:58 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: SelvaKumar S, linux-nvme, kbusch, axboe, hch, sagi, linux-block,
	linux-kernel, dm-devel, snitzer, joshiiitr, nj.shetty, joshi.k,
	javier.gonz, kch, linux-fsdevel

I agree with you. Will remove BLKDEV_COPY_NOEMULATION.

On Tue, Apr 13, 2021 at 6:03 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> On 2021/04/12 23:35, Selva Jove wrote:
> > On Mon, Apr 12, 2021 at 5:55 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> >>
> >> On 2021/04/07 20:33, Selva Jove wrote:
> >>> Initially I started moving the dm-kcopyd interface to the block layer
> >>> as a generic interface. Once I dug deeper into the dm-kcopyd code, I
> >>> figured out that dm-kcopyd is tightly coupled with dm_io().
> >>>
> >>> To move dm-kcopyd to the block layer, it would also require the dm_io
> >>> code to be moved to the block layer. It would cause havoc in the dm
> >>> layer, as it is the backbone of the dm-layer and would need a complete
> >>> rewrite of it. Do you see any other way of doing this without having to
> >>> move the dm_io code or having redundant code ?
> >>
> >> Right. Missed that. So reusing dm-kcopyd and making it a common interface will
> >> take some more effort. OK, then. For the first round of commits, let's forget
> >> about this. But I still think that your emulation could be a lot better than a
> >> loop doing blocking writes after blocking reads.
> >>
> >
> > The current implementation issues the reads asynchronously and, once all the
> > reads have completed, issues the write as a whole to reduce the IO traffic in
> > the queue.
> > I agree that things can be better. Will explore another approach of sending
> > each write immediately once its reads complete, with plugging to increase the
> > chances of merging.
> >
> >> [...]
> >>>>> +int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
> >>>>> +             struct range_entry *src_rlist, struct block_device *dest_bdev,
> >>>>> +             sector_t dest, gfp_t gfp_mask, int flags)
> >>>>> +{
> >>>>> +     struct request_queue *q = bdev_get_queue(src_bdev);
> >>>>> +     struct request_queue *dest_q = bdev_get_queue(dest_bdev);
> >>>>> +     struct blk_copy_payload *payload;
> >>>>> +     sector_t bs_mask, copy_size;
> >>>>> +     int ret;
> >>>>> +
> >>>>> +     ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
> >>>>> +                     &payload, &copy_size);
> >>>>> +     if (ret)
> >>>>> +             return ret;
> >>>>> +
> >>>>> +     bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
> >>>>> +     if (dest & bs_mask) {
> >>>>> +             ret = -EINVAL;  /* was 'return -EINVAL;', which leaked the payload and left 'goto out' unreachable */
> >>>>> +             goto out;
> >>>>> +     }
> >>>>> +
> >>>>> +     if (q == dest_q && q->limits.copy_offload) {
> >>>>> +             ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
> >>>>> +             if (ret)
> >>>>> +                     goto out;
> >>>>> +     } else if (flags & BLKDEV_COPY_NOEMULATION) {
> >>>>
> >>>> Why ? whoever calls blkdev_issue_copy() wants a copy to be done. Why would that
> >>>> user say "Fail on me if the device does not support copy" ??? This is a weird
> >>>> interface in my opinion.
> >>>>
> >>>
> >>> The BLKDEV_COPY_NOEMULATION flag was introduced to allow blkdev_issue_copy()
> >>> callers to use their own native copying method instead of the emulated copy
> >>> that I added. This way we ensure that dm uses the hw-assisted copy and, if
> >>> that is not present, falls back to its existing copy method.
> >>>
> >>> Other users who don't have a native fallback of their own can use this
> >>> emulated-copy implementation.
> >>
> >> I do not understand. Emulation or not should be entirely driven by the device
> >> reporting support for simple copy (or not). It does not matter which component
> >> is issuing the simple copy call: an FS to a real device, an FS to a DM device,
> >> or a DM target driver. If the underlying device reported support for simple
> >> copy, use that. Otherwise, emulate with read/write. What am I missing here ?
> >>
> >
> > The blkdev_issue_copy() API will generally complete the copy operation,
> > either by using offloaded copy or by using emulated copy. The caller of the
> > API is not required to figure out the type of support. However, it can opt
> > out of emulated copy by specifying the BLKDEV_COPY_NOEMULATION flag. This is
> > helpful when the caller already has a sophisticated emulation of its own
> > (e.g. dm-kcopyd users).
>
> This does not make any sense to me. If the user already has another means of
> doing copies, then that user will not call blkdev_issue_copy(). So I really do
> not understand what "opting out of emulated copy" would be useful for. That
> user can check the simple copy support flag in the device request queue and act
> accordingly: use its own block copy code when simple copy is not supported, or
> use blkdev_issue_copy() when the device has simple copy. Adding
> BLKDEV_COPY_NOEMULATION does not serve any purpose at all.
>
>
>
> --
> Damien Le Moal
> Western Digital Research



Thread overview: 29+ messages
     [not found] <CGME20210219124555epcas5p1334e7c4d64ada5dc4a2ca0feb48c1d44@epcas5p1.samsung.com>
2021-02-19 12:45 ` [RFC PATCH v5 0/4] add simple copy support SelvaKumar S
     [not found]   ` <CGME20210219124559epcas5p41da46f1c248e334953d407a154697903@epcas5p4.samsung.com>
2021-02-19 12:45     ` [RFC PATCH v5 1/4] block: make bio_map_kern() non static SelvaKumar S
     [not found]   ` <CGME20210219124603epcas5p33add0f2c1781b2a4d71bf30c9e1ac647@epcas5p3.samsung.com>
2021-02-19 12:45     ` [RFC PATCH v5 2/4] block: add simple copy support SelvaKumar S
2021-02-20  4:59       ` Damien Le Moal
2021-04-07 11:32         ` Selva Jove
2021-04-12  0:24           ` Damien Le Moal
2021-04-12 14:34             ` Selva Jove
2021-04-13  0:32               ` Damien Le Moal
2021-04-14  6:58                 ` Selva Jove
     [not found]   ` <CGME20210219124608epcas5p2a673f9e00c3e7b5352f115497b0e2d98@epcas5p2.samsung.com>
2021-02-19 12:45     ` [RFC PATCH v5 3/4] nvme: " SelvaKumar S
2021-02-20  3:36       ` Matthew Wilcox
2021-02-22 15:57         ` Selva Jove
     [not found]   ` <CGME20210219124611epcas5p1c775b63b537e75da161556e375fcf05e@epcas5p1.samsung.com>
2021-02-19 12:45     ` [RFC PATCH v5 4/4] dm kcopyd: add simple copy offload support SelvaKumar S
2021-02-20 18:01   ` [RFC PATCH v5 0/4] add simple copy support David Laight
2021-02-20 19:08     ` Matthew Wilcox
2021-02-20 19:19     ` Keith Busch
2021-02-21 23:52   ` Dave Chinner
2021-02-23  9:14     ` Selva Jove
2021-02-22  1:31   ` Ming Lei
2021-02-22  6:52   ` Su Yue
2021-02-23  9:00     ` Selva Jove
2021-04-10  0:21   ` Max Gurtovoy
2021-04-10  0:29     ` Chaitanya Kulkarni
2021-04-10  6:32       ` Javier González
2021-04-11  9:10         ` Max Gurtovoy
2021-04-11 19:26           ` Javier González
2021-04-13 15:38             ` Max Gurtovoy
2021-04-13 18:25               ` Javier González
2021-04-13 18:36                 ` Chaitanya Kulkarni
