* [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper
@ 2014-07-15 19:34 Mikulas Patocka
From: Mikulas Patocka @ 2014-07-15 19:34 UTC
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

This patch series adds SCSI XCOPY offload support to the block layer and
the device mapper.

It is based on Martin Petersen's work
https://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=xcopy&id=0bdeed274e16b3038a851552188512071974eea8,
but it has been changed significantly so that XCOPY bios can be
propagated through the device mapper stack.

The basic architecture is this: in the function blkdev_issue_copy we
create two bios, one for read and one for write (with bi_rw set to
READ|REQ_COPY and WRITE|REQ_COPY respectively). Both bios hold a pointer
to the same bio_copy structure. The two bios travel independently through
the device mapper stack - each bio can go through different device mapper
devices. When both bios reach the physical block device (in the function
blk_queue_bio), the pair is collected, and an XCOPY request is allocated
and sent to the SCSI disk driver.

Note that because device mapper mappings can change dynamically, there is
no guarantee that the XCOPY command succeeds. If it ends with an error,
the caller is supposed to perform the copy manually.
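
For in-kernel users, the intended flow is roughly the following sketch
(written against the blkdev_issue_copy signature introduced in patch 1;
later patches extend it, and copy_manually is a hypothetical fallback
helper, not part of this series):

	#include <linux/blkdev.h>

	static int copy_range(struct block_device *src, sector_t src_sector,
			      struct block_device *dst, sector_t dst_sector,
			      unsigned int nr_sects)
	{
		/* try the XCOPY offload first */
		int r = blkdev_issue_copy(src, src_sector, dst, dst_sector,
					  nr_sects, GFP_KERNEL);
		/* e.g. -EOPNOTSUPP, or the mapping changed mid-copy */
		if (r)
			r = copy_manually(src, src_sector, dst, dst_sector,
					  nr_sects);
		return r;
	}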

The dm-kcopyd subsystem is modified to use the XCOPY command, so device
mapper targets that use it (mirror, snapshot, thin, cache) take advantage
of copy offload automatically.

There is a new ioctl BLKCOPY that makes it possible to use copy offload
from userspace.


* [PATCH 1/15] block copy: initial XCOPY offload support
@ 2014-07-15 19:34 Mikulas Patocka
From: Mikulas Patocka @ 2014-07-15 19:34 UTC
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

This is Martin Petersen's xcopy patch
(https://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=xcopy&id=0bdeed274e16b3038a851552188512071974eea8)
with some bug fixes, ported to the current kernel.

This patch makes it possible to use the SCSI XCOPY command.

We create a bio that has the REQ_COPY flag set in bi_rw and a bi_copy
structure that defines the source device. The target device and sector
are given by bi_bdev and bi_iter.bi_sector.

There is a new BLKCOPY ioctl that makes it possible to use XCOPY from
userspace. The ioctl argument is a pointer to an array of four uint64_t
values.

The first value is the source byte offset, the second value is the
destination byte offset, and the third value is the byte length. The
fourth value is written by the kernel and represents the number of bytes
that the kernel actually copied.
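
For illustration, a minimal userspace call could look like the following
sketch (not part of the patch; the device must be open for writing, and
all offsets and the length must be multiples of 512 bytes):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>	/* BLKCOPY, added by this patch */

	/* copy 1 MiB from byte offset 0 to byte offset 1 MiB */
	int blkcopy_demo(int fd)
	{
		uint64_t range[4] = {
			0,		/* source byte offset */
			1 << 20,	/* destination byte offset */
			1 << 20,	/* length in bytes */
			0		/* out: bytes actually copied */
		};

		if (ioctl(fd, BLKCOPY, range))
			return -1;
		/* on success, range[3] == range[2] */
		return 0;
	}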

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 Documentation/ABI/testing/sysfs-block |    9 +
 block/bio.c                           |    2 
 block/blk-core.c                      |    5 
 block/blk-lib.c                       |   95 ++++++++++++
 block/blk-merge.c                     |    7 
 block/blk-settings.c                  |   13 +
 block/blk-sysfs.c                     |   10 +
 block/compat_ioctl.c                  |    1 
 block/ioctl.c                         |   49 ++++++
 drivers/scsi/scsi.c                   |   57 +++++++
 drivers/scsi/sd.c                     |  263 +++++++++++++++++++++++++++++++++-
 drivers/scsi/sd.h                     |    4 
 include/linux/bio.h                   |    9 -
 include/linux/blk_types.h             |   15 +
 include/linux/blkdev.h                |   15 +
 include/scsi/scsi_device.h            |    3 
 include/uapi/linux/fs.h               |    1 
 17 files changed, 545 insertions(+), 13 deletions(-)

Index: linux-3.16-rc5/Documentation/ABI/testing/sysfs-block
===================================================================
--- linux-3.16-rc5.orig/Documentation/ABI/testing/sysfs-block	2014-07-14 15:17:07.000000000 +0200
+++ linux-3.16-rc5/Documentation/ABI/testing/sysfs-block	2014-07-14 16:26:44.000000000 +0200
@@ -220,3 +220,12 @@ Description:
 		write_same_max_bytes is 0, write same is not supported
 		by the device.
 
+
+What:		/sys/block/<disk>/queue/copy_max_bytes
+Date:		January 2014
+Contact:	Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+		Devices that support copy offloading will set this value
+		to indicate the maximum buffer size in bytes that can be
+		copied in one operation. If copy_max_bytes is 0, the
+		device does not support copy offload.
Index: linux-3.16-rc5/block/blk-core.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-core.c	2014-07-14 16:26:22.000000000 +0200
+++ linux-3.16-rc5/block/blk-core.c	2014-07-14 16:26:44.000000000 +0200
@@ -1831,6 +1831,11 @@ generic_make_request_checks(struct bio *
 		goto end_io;
 	}
 
+	if (bio->bi_rw & REQ_COPY && !bdev_copy_offload(bio->bi_bdev)) {
+		err = -EOPNOTSUPP;
+		goto end_io;
+	}
+
 	/*
 	 * Various block parts want %current->io_context and lazy ioc
 	 * allocation ends up trading a lot of pain for a small amount of
Index: linux-3.16-rc5/block/blk-lib.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-lib.c	2014-07-14 16:26:40.000000000 +0200
+++ linux-3.16-rc5/block/blk-lib.c	2014-07-14 16:32:21.000000000 +0200
@@ -304,3 +304,98 @@ int blkdev_issue_zeroout(struct block_de
 	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_issue_copy - queue a copy operation
+ * @src_bdev:	source blockdev
+ * @src_sector:	source sector
+ * @dst_bdev:	destination blockdev
+ * @dst_sector: destination sector
+ * @nr_sects:	number of sectors to copy
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *    Copy a block range from source device to target device.
+ */
+int blkdev_issue_copy(struct block_device *src_bdev, sector_t src_sector,
+		      struct block_device *dst_bdev, sector_t dst_sector,
+		      unsigned int nr_sects, gfp_t gfp_mask)
+{
+	DECLARE_COMPLETION_ONSTACK(wait);
+	struct request_queue *sq = bdev_get_queue(src_bdev);
+	struct request_queue *dq = bdev_get_queue(dst_bdev);
+	unsigned int max_copy_sectors;
+	struct bio_batch bb;
+	int ret = 0;
+
+	if (!sq || !dq)
+		return -ENXIO;
+
+	max_copy_sectors = min(sq->limits.max_copy_sectors,
+			       dq->limits.max_copy_sectors);
+
+	if (max_copy_sectors == 0)
+		return -EOPNOTSUPP;
+
+	if (src_sector + nr_sects < src_sector ||
+	    dst_sector + nr_sects < dst_sector)
+		return -EINVAL;
+
+	/* Do not support overlapping copies */
+	if (src_bdev == dst_bdev &&
+	    abs64((u64)dst_sector - (u64)src_sector) < nr_sects)
+		return -EOPNOTSUPP;
+
+	atomic_set(&bb.done, 1);
+	bb.error = 0;
+	bb.wait = &wait;
+
+	while (nr_sects) {
+		struct bio *bio;
+		struct bio_copy *bc;
+		unsigned int chunk;
+
+		bc = kmalloc(sizeof(struct bio_copy), gfp_mask);
+		if (!bc) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		bio = bio_alloc(gfp_mask, 1);
+		if (!bio) {
+			kfree(bc);
+			ret = -ENOMEM;
+			break;
+		}
+
+		chunk = min(nr_sects, max_copy_sectors);
+
+		bio->bi_iter.bi_sector = dst_sector;
+		bio->bi_iter.bi_size = chunk << 9;
+		bio->bi_end_io = bio_batch_end_io;
+		bio->bi_bdev = dst_bdev;
+		bio->bi_private = &bb;
+		bio->bi_copy = bc;
+
+		bc->bic_bdev = src_bdev;
+		bc->bic_sector = src_sector;
+
+		atomic_inc(&bb.done);
+		submit_bio(REQ_WRITE | REQ_COPY, bio);
+
+		src_sector += chunk;
+		dst_sector += chunk;
+		nr_sects -= chunk;
+	}
+
+	/* Wait for bios in-flight */
+	if (!atomic_dec_and_test(&bb.done))
+		wait_for_completion_io(&wait);
+
+	if (likely(!ret))
+		ret = bb.error;
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_copy);
+
Index: linux-3.16-rc5/block/blk-merge.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-merge.c	2014-07-14 15:17:07.000000000 +0200
+++ linux-3.16-rc5/block/blk-merge.c	2014-07-14 16:26:44.000000000 +0200
@@ -25,10 +25,7 @@ static unsigned int __blk_recalc_rq_segm
 	 * This should probably be returning 0, but blk_add_request_payload()
 	 * (Christoph!!!!)
 	 */
-	if (bio->bi_rw & REQ_DISCARD)
-		return 1;
-
-	if (bio->bi_rw & REQ_WRITE_SAME)
+	if (bio->bi_rw & (REQ_DISCARD | REQ_WRITE_SAME | REQ_COPY))
 		return 1;
 
 	fbio = bio;
@@ -196,7 +193,7 @@ static int __blk_bios_map_sg(struct requ
 	nsegs = 0;
 	cluster = blk_queue_cluster(q);
 
-	if (bio->bi_rw & REQ_DISCARD) {
+	if (bio->bi_rw & (REQ_DISCARD | REQ_COPY)) {
 		/*
 		 * This is a hack - drivers should be neither modifying the
 		 * biovec, nor relying on bi_vcnt - but because of
Index: linux-3.16-rc5/block/blk-settings.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-settings.c	2014-07-14 15:17:08.000000000 +0200
+++ linux-3.16-rc5/block/blk-settings.c	2014-07-14 16:26:44.000000000 +0200
@@ -115,6 +115,7 @@ void blk_set_default_limits(struct queue
 	lim->max_sectors = lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS;
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
+	lim->max_copy_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->discard_granularity = 0;
 	lim->discard_alignment = 0;
@@ -322,6 +323,18 @@ void blk_queue_max_write_same_sectors(st
 EXPORT_SYMBOL(blk_queue_max_write_same_sectors);
 
 /**
+ * blk_queue_max_copy_sectors - set max sectors for a single copy operation
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors per copy operation
+ **/
+void blk_queue_max_copy_sectors(struct request_queue *q,
+				unsigned int max_copy_sectors)
+{
+	q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_copy_sectors);
+
+/**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
  * @max_segments:  max number of segments
Index: linux-3.16-rc5/block/blk-sysfs.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-sysfs.c	2014-07-14 15:17:08.000000000 +0200
+++ linux-3.16-rc5/block/blk-sysfs.c	2014-07-14 16:26:44.000000000 +0200
@@ -161,6 +161,11 @@ static ssize_t queue_write_same_max_show
 		(unsigned long long)q->limits.max_write_same_sectors << 9);
 }
 
+static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n",
+		(unsigned long long)q->limits.max_copy_sectors << 9);
+}
 
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
@@ -374,6 +379,10 @@ static struct queue_sysfs_entry queue_wr
 	.show = queue_write_same_max_show,
 };
 
+static struct queue_sysfs_entry queue_copy_max_entry = {
+	.attr = {.name = "copy_max_bytes", .mode = S_IRUGO },
+	.show = queue_copy_max_show,
+};
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_show_nonrot,
@@ -422,6 +431,7 @@ static struct attribute *default_attrs[]
 	&queue_discard_max_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
+	&queue_copy_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
Index: linux-3.16-rc5/block/ioctl.c
===================================================================
--- linux-3.16-rc5.orig/block/ioctl.c	2014-07-14 15:17:08.000000000 +0200
+++ linux-3.16-rc5/block/ioctl.c	2014-07-14 16:26:44.000000000 +0200
@@ -201,6 +201,31 @@ static int blk_ioctl_zeroout(struct bloc
 	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL);
 }
 
+static int blk_ioctl_copy(struct block_device *bdev, uint64_t src_offset,
+			  uint64_t dst_offset, uint64_t len)
+{
+	if (src_offset & 511)
+		return -EINVAL;
+	if (dst_offset & 511)
+		return -EINVAL;
+	if (len & 511)
+		return -EINVAL;
+	src_offset >>= 9;
+	dst_offset >>= 9;
+	len >>= 9;
+
+	if (unlikely(src_offset + len < src_offset) ||
+	    unlikely(src_offset + len > (i_size_read(bdev->bd_inode) >> 9)))
+		return -EINVAL;
+
+	if (unlikely(dst_offset + len < dst_offset) ||
+	    unlikely(dst_offset + len > (i_size_read(bdev->bd_inode) >> 9)))
+		return -EINVAL;
+
+	return blkdev_issue_copy(bdev, src_offset, bdev, dst_offset, len,
+				 GFP_KERNEL);
+}
+
 static int put_ushort(unsigned long arg, unsigned short val)
 {
 	return put_user(val, (unsigned short __user *)arg);
@@ -328,6 +353,30 @@ int blkdev_ioctl(struct block_device *bd
 		return blk_ioctl_zeroout(bdev, range[0], range[1]);
 	}
 
+	case BLKCOPY: {
+		uint64_t range[4];
+
+		range[3] = 0;
+
+		if (copy_to_user((void __user *)(arg + 24), &range[3], 8))
+			return -EFAULT;
+
+		if (!(mode & FMODE_WRITE))
+			return -EBADF;
+
+		if (copy_from_user(range, (void __user *)arg, 24))
+			return -EFAULT;
+
+		ret = blk_ioctl_copy(bdev, range[0], range[1], range[2]);
+		if (!ret) {
+			range[3] = range[2];
+			if (copy_to_user((void __user *)(arg + 24), &range[3], 8))
+				return -EFAULT;
+		}
+
+		return ret;
+	}
+
 	case HDIO_GETGEO: {
 		struct hd_geometry geo;
 
Index: linux-3.16-rc5/drivers/scsi/scsi.c
===================================================================
--- linux-3.16-rc5.orig/drivers/scsi/scsi.c	2014-07-14 15:17:08.000000000 +0200
+++ linux-3.16-rc5/drivers/scsi/scsi.c	2014-07-14 16:26:44.000000000 +0200
@@ -1024,6 +1024,62 @@ int scsi_get_vpd_page(struct scsi_device
 EXPORT_SYMBOL_GPL(scsi_get_vpd_page);
 
 /**
+ * scsi_lookup_naa - Lookup NAA descriptor in VPD page 0x83
+ * @sdev: The device to ask
+ *
+ * Copy offloading requires us to know the NAA descriptor for both
+ * source and target device. This descriptor is mandatory in the Device
+ * Identification VPD page. Locate this descriptor in the returned VPD
+ * data so we don't have to do lookups for every copy command.
+ */
+static void scsi_lookup_naa(struct scsi_device *sdev)
+{
+	unsigned char *buf = sdev->vpd_pg83;
+	unsigned int len = sdev->vpd_pg83_len;
+
+	if (buf[1] != 0x83 || get_unaligned_be16(&buf[2]) == 0) {
+		sdev_printk(KERN_ERR, sdev,
+			    "%s: VPD page 0x83 contains no descriptors\n",
+			    __func__);
+		return;
+	}
+
+	buf += 4;
+	len -= 4;
+
+	do {
+		unsigned int desig_len = buf[3] + 4;
+
+		/* Binary code set */
+		if ((buf[0] & 0xf) != 1)
+			goto skip;
+
+		/* Target association */
+		if ((buf[1] >> 4) & 0x3)
+			goto skip;
+
+		/* NAA designator */
+		if ((buf[1] & 0xf) != 0x3)
+			goto skip;
+
+		sdev->naa = buf;
+		sdev->naa_len = desig_len;
+
+		return;
+
+	skip:
+		buf += desig_len;
+		len -= desig_len;
+
+	} while (len > 0);
+
+	sdev_printk(KERN_ERR, sdev,
+		    "%s: VPD page 0x83 NAA descriptor not found\n", __func__);
+
+	return;
+}
+
+/**
  * scsi_attach_vpd - Attach Vital Product Data to a SCSI device structure
  * @sdev: The device to ask
  *
@@ -1107,6 +1163,7 @@ retry_pg83:
 		}
 		sdev->vpd_pg83_len = result;
 		sdev->vpd_pg83 = vpd_buf;
+		scsi_lookup_naa(sdev);
 	}
 }
 
Index: linux-3.16-rc5/drivers/scsi/sd.c
===================================================================
--- linux-3.16-rc5.orig/drivers/scsi/sd.c	2014-07-14 16:26:22.000000000 +0200
+++ linux-3.16-rc5/drivers/scsi/sd.c	2014-07-14 16:26:44.000000000 +0200
@@ -100,6 +100,7 @@ MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC);
 
 static void sd_config_discard(struct scsi_disk *, unsigned int);
 static void sd_config_write_same(struct scsi_disk *);
+static void sd_config_copy(struct scsi_disk *);
 static int  sd_revalidate_disk(struct gendisk *);
 static void sd_unlock_native_capacity(struct gendisk *disk);
 static int  sd_probe(struct device *);
@@ -463,6 +464,48 @@ max_write_same_blocks_store(struct devic
 }
 static DEVICE_ATTR_RW(max_write_same_blocks);
 
+static ssize_t
+max_copy_blocks_show(struct device *dev, struct device_attribute *attr,
+		     char *buf)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+	return snprintf(buf, 20, "%u\n", sdkp->max_copy_blocks);
+}
+
+static ssize_t
+max_copy_blocks_store(struct device *dev, struct device_attribute *attr,
+		      const char *buf, size_t count)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+	struct scsi_device *sdp = sdkp->device;
+	unsigned long max;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (sdp->type != TYPE_DISK)
+		return -EINVAL;
+
+	err = kstrtoul(buf, 10, &max);
+
+	if (err)
+		return err;
+
+	if (max == 0)
+		sdp->no_copy = 1;
+	else if (max <= SD_MAX_COPY_BLOCKS) {
+		sdp->no_copy = 0;
+		sdkp->max_copy_blocks = max;
+	}
+
+	sd_config_copy(sdkp);
+
+	return count;
+}
+static DEVICE_ATTR_RW(max_copy_blocks);
+
 static struct attribute *sd_disk_attrs[] = {
 	&dev_attr_cache_type.attr,
 	&dev_attr_FUA.attr,
@@ -474,6 +517,7 @@ static struct attribute *sd_disk_attrs[]
 	&dev_attr_thin_provisioning.attr,
 	&dev_attr_provisioning_mode.attr,
 	&dev_attr_max_write_same_blocks.attr,
+	&dev_attr_max_copy_blocks.attr,
 	&dev_attr_max_medium_access_timeouts.attr,
 	NULL,
 };
@@ -830,6 +874,109 @@ static int sd_setup_write_same_cmnd(stru
 	return ret;
 }
 
+static void sd_config_copy(struct scsi_disk *sdkp)
+{
+	struct request_queue *q = sdkp->disk->queue;
+	unsigned int logical_block_size = sdkp->device->sector_size;
+
+	if (sdkp->device->no_copy)
+		sdkp->max_copy_blocks = 0;
+
+	/* Segment descriptor 0x02 has a 64k block limit */
+	sdkp->max_copy_blocks = min(sdkp->max_copy_blocks,
+				    (u32)SD_MAX_CSD2_BLOCKS);
+
+	blk_queue_max_copy_sectors(q, sdkp->max_copy_blocks *
+				   (logical_block_size >> 9));
+}
+
+static int sd_setup_copy_cmnd(struct scsi_device *sdp, struct request *rq)
+{
+	struct scsi_device *src_sdp, *dst_sdp;
+	struct gendisk *src_disk;
+	struct request_queue *src_queue, *dst_queue;
+	sector_t src_lba, dst_lba;
+	unsigned int nr_blocks, buf_len, nr_bytes = blk_rq_bytes(rq);
+	int ret;
+	struct bio *bio = rq->bio;
+	struct page *page;
+	unsigned char *buf;
+
+	if (!bio->bi_copy)
+		return BLKPREP_KILL;
+
+	dst_sdp = scsi_disk(rq->rq_disk)->device;
+	dst_queue = rq->rq_disk->queue;
+	src_disk = bio->bi_copy->bic_bdev->bd_disk;
+	src_queue = src_disk->queue;
+	if (!src_queue ||
+	    src_queue->make_request_fn != blk_queue_bio ||
+	    src_queue->request_fn != dst_queue->request_fn ||
+	    *(struct scsi_driver **)rq->rq_disk->private_data !=
+	    *(struct scsi_driver **)src_disk->private_data)
+		return BLKPREP_KILL;
+	src_sdp = scsi_disk(src_disk)->device;
+
+	if (src_sdp->no_copy || dst_sdp->no_copy)
+		return BLKPREP_KILL;
+
+	if (src_sdp->sector_size != dst_sdp->sector_size)
+		return BLKPREP_KILL;
+
+	dst_lba = blk_rq_pos(rq) >> (ilog2(dst_sdp->sector_size) - 9);
+	src_lba = bio->bi_copy->bic_sector >> (ilog2(src_sdp->sector_size) - 9);
+	nr_blocks = blk_rq_sectors(rq) >> (ilog2(dst_sdp->sector_size) - 9);
+
+	page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+	if (!page)
+		return BLKPREP_DEFER;
+
+	buf = page_address(page);
+
+	/* Extended Copy (LID1) Parameter List (16 bytes) */
+	buf[0] = 0;				/* LID */
+	buf[1] = 3 << 3;			/* LID usage 11b */
+	put_unaligned_be16(32 + 32, &buf[2]);	/* 32 bytes per E4 desc. */
+	put_unaligned_be32(28, &buf[8]);	/* 28 bytes per B2B desc. */
+	buf += 16;
+
+	/* Source CSCD (32 bytes) */
+	buf[0] = 0xe4;				/* Identification desc. */
+	memcpy(&buf[4], src_sdp->naa, src_sdp->naa_len);
+	buf += 32;
+
+	/* Destination CSCD (32 bytes) */
+	buf[0] = 0xe4;				/* Identification desc. */
+	memcpy(&buf[4], dst_sdp->naa, dst_sdp->naa_len);
+	buf += 32;
+
+	/* Segment descriptor (28 bytes) */
+	buf[0] = 0x02;				/* Block to block desc. */
+	put_unaligned_be16(0x18, &buf[2]);	/* Descriptor length */
+	put_unaligned_be16(0, &buf[4]);		/* Source is desc. 0 */
+	put_unaligned_be16(1, &buf[6]);		/* Dest. is desc. 1 */
+	put_unaligned_be16(nr_blocks, &buf[10]);
+	put_unaligned_be64(src_lba, &buf[12]);
+	put_unaligned_be64(dst_lba, &buf[20]);
+
+	/* CDB */
+	memset(rq->cmd, 0, rq->cmd_len);
+	rq->cmd[0] = EXTENDED_COPY;
+	rq->cmd[1] = 0; /* LID1 */
+	buf_len = 16 + 32 + 32 + 28;
+	put_unaligned_be32(buf_len, &rq->cmd[10]);
+	rq->timeout = SD_COPY_TIMEOUT;
+
+	rq->completion_data = page;
+	blk_add_request_payload(rq, page, buf_len);
+	ret = scsi_setup_blk_pc_cmnd(sdp, rq);
+	rq->__data_len = nr_bytes;
+
+	if (ret != BLKPREP_OK)
+		__free_page(page);
+	return ret;
+}
+
 static int scsi_setup_flush_cmnd(struct scsi_device *sdp, struct request *rq)
 {
 	rq->timeout *= SD_FLUSH_TIMEOUT_MULTIPLIER;
@@ -844,7 +991,7 @@ static void sd_uninit_command(struct scs
 {
 	struct request *rq = SCpnt->request;
 
-	if (rq->cmd_flags & REQ_DISCARD)
+	if (rq->cmd_flags & (REQ_DISCARD | REQ_COPY))
 		__free_page(rq->completion_data);
 
 	if (SCpnt->cmnd != rq->cmd) {
@@ -876,6 +1023,9 @@ static int sd_init_command(struct scsi_c
 	} else if (rq->cmd_flags & REQ_WRITE_SAME) {
 		ret = sd_setup_write_same_cmnd(sdp, rq);
 		goto out;
+	} else if (rq->cmd_flags & REQ_COPY) {
+		ret = sd_setup_copy_cmnd(sdp, rq);
+		goto out;
 	} else if (rq->cmd_flags & REQ_FLUSH) {
 		ret = scsi_setup_flush_cmnd(sdp, rq);
 		goto out;
@@ -1649,7 +1799,8 @@ static int sd_done(struct scsi_cmnd *SCp
 	unsigned char op = SCpnt->cmnd[0];
 	unsigned char unmap = SCpnt->cmnd[1] & 8;
 
-	if (req->cmd_flags & REQ_DISCARD || req->cmd_flags & REQ_WRITE_SAME) {
+	if (req->cmd_flags & REQ_DISCARD || req->cmd_flags & REQ_WRITE_SAME ||
+	    req->cmd_flags & REQ_COPY) {
 		if (!result) {
 			good_bytes = blk_rq_bytes(req);
 			scsi_set_resid(SCpnt, 0);
@@ -1708,6 +1859,14 @@ static int sd_done(struct scsi_cmnd *SCp
 		/* INVALID COMMAND OPCODE or INVALID FIELD IN CDB */
 		if (sshdr.asc == 0x20 || sshdr.asc == 0x24) {
 			switch (op) {
+			case EXTENDED_COPY:
+				sdkp->device->no_copy = 1;
+				sd_config_copy(sdkp);
+
+				good_bytes = 0;
+				req->__data_len = blk_rq_bytes(req);
+				req->cmd_flags |= REQ_QUIET;
+				break;
 			case UNMAP:
 				sd_config_discard(sdkp, SD_LBP_DISABLE);
 				break;
@@ -2681,6 +2840,105 @@ static void sd_read_write_same(struct sc
 		sdkp->ws10 = 1;
 }
 
+static void sd_read_copy_operations(struct scsi_disk *sdkp,
+				    unsigned char *buffer)
+{
+	struct scsi_device *sdev = sdkp->device;
+	struct scsi_sense_hdr sshdr;
+	unsigned char cdb[16];
+	unsigned int result, len, i;
+	bool b2b_desc = false, id_desc = false;
+
+	if (sdev->naa_len == 0)
+		return;
+
+	/* Verify that the device has 3PC set in INQUIRY response */
+	if (sdev->inquiry_len < 6 || (sdev->inquiry[5] & (1 << 3)) == 0)
+		return;
+
+	/* Receive Copy Operation Parameters */
+	memset(cdb, 0, 16);
+	cdb[0] = RECEIVE_COPY_RESULTS;
+	cdb[1] = 0x3;
+	put_unaligned_be32(SD_BUF_SIZE, &cdb[10]);
+
+	memset(buffer, 0, SD_BUF_SIZE);
+	result = scsi_execute_req(sdev, cdb, DMA_FROM_DEVICE,
+				  buffer, SD_BUF_SIZE, &sshdr,
+				  SD_TIMEOUT, SD_MAX_RETRIES, NULL);
+
+	if (!scsi_status_is_good(result)) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Receive Copy Operating Parameters failed\n",
+			  __func__);
+		return;
+	}
+
+	/* The RCOP response is a minimum of 44 bytes long. First 4
+	 * bytes contain the length of the remaining buffer, i.e. 40+
+	 * bytes. Trailing the defined fields is a list of supported
+	 * descriptors. We need at least 2 descriptors to drive the
+	 * target, hence 42.
+	 */
+	len = get_unaligned_be32(&buffer[0]);
+	if (len < 42) {
+		sd_printk(KERN_ERR, sdkp, "%s: result too short (%u)\n",
+			  __func__, len);
+		return;
+	}
+
+	if ((buffer[4] & 1) == 0) {
+		sd_printk(KERN_ERR, sdkp, "%s: does not support SNLID\n",
+			  __func__);
+		return;
+	}
+
+	if (get_unaligned_be16(&buffer[8]) < 2) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Need 2 or more CSCD descriptors\n", __func__);
+		return;
+	}
+
+	if (get_unaligned_be16(&buffer[10]) < 1) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Need 1 or more segment descriptor\n", __func__);
+		return;
+	}
+
+	if (len - 40 != buffer[43]) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Buffer len and descriptor count mismatch " \
+			  "(%u vs. %u)\n", __func__, len - 40, buffer[43]);
+		return;
+	}
+
+	for (i = 44 ; i < len + 4 ; i++) {
+		if (buffer[i] == 0x02)
+			b2b_desc = true;
+
+		if (buffer[i] == 0xe4)
+			id_desc = true;
+	}
+
+	if (!b2b_desc) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: No block 2 block descriptor (0x02)\n",
+			  __func__);
+		return;
+	}
+
+	if (!id_desc) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: No identification descriptor (0xE4)\n",
+			  __func__);
+		return;
+	}
+
+	sdkp->max_copy_blocks = get_unaligned_be32(&buffer[16])
+		>> ilog2(sdev->sector_size);
+	sd_config_copy(sdkp);
+}
+
 static int sd_try_extended_inquiry(struct scsi_device *sdp)
 {
 	/*
@@ -2741,6 +2999,7 @@ static int sd_revalidate_disk(struct gen
 		sd_read_cache_type(sdkp, buffer);
 		sd_read_app_tag_own(sdkp, buffer);
 		sd_read_write_same(sdkp, buffer);
+		sd_read_copy_operations(sdkp, buffer);
 	}
 
 	sdkp->first_scan = 0;
Index: linux-3.16-rc5/drivers/scsi/sd.h
===================================================================
--- linux-3.16-rc5.orig/drivers/scsi/sd.h	2014-07-14 15:17:08.000000000 +0200
+++ linux-3.16-rc5/drivers/scsi/sd.h	2014-07-14 16:26:44.000000000 +0200
@@ -19,6 +19,7 @@
  */
 #define SD_FLUSH_TIMEOUT_MULTIPLIER	2
 #define SD_WRITE_SAME_TIMEOUT	(120 * HZ)
+#define SD_COPY_TIMEOUT		(120 * HZ)
 
 /*
  * Number of allowed retries
@@ -46,6 +47,8 @@ enum {
 enum {
 	SD_MAX_WS10_BLOCKS = 0xffff,
 	SD_MAX_WS16_BLOCKS = 0x7fffff,
+	SD_MAX_CSD2_BLOCKS = 0xffff,
+	SD_MAX_COPY_BLOCKS = 0xffffffff,
 };
 
 enum {
@@ -66,6 +69,7 @@ struct scsi_disk {
 	sector_t	capacity;	/* size in 512-byte sectors */
 	u32		max_ws_blocks;
 	u32		max_unmap_blocks;
+	u32		max_copy_blocks;
 	u32		unmap_granularity;
 	u32		unmap_alignment;
 	u32		index;
Index: linux-3.16-rc5/include/linux/bio.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/bio.h	2014-07-14 15:17:09.000000000 +0200
+++ linux-3.16-rc5/include/linux/bio.h	2014-07-14 16:26:44.000000000 +0200
@@ -106,7 +106,7 @@ static inline bool bio_has_data(struct b
 {
 	if (bio &&
 	    bio->bi_iter.bi_size &&
-	    !(bio->bi_rw & REQ_DISCARD))
+	    !(bio->bi_rw & (REQ_DISCARD | REQ_COPY)))
 		return true;
 
 	return false;
@@ -260,8 +260,8 @@ static inline unsigned bio_segments(stru
 	struct bvec_iter iter;
 
 	/*
-	 * We special case discard/write same, because they interpret bi_size
-	 * differently:
+	 * We special case discard/write same/copy, because they
+	 * interpret bi_size differently:
 	 */
 
 	if (bio->bi_rw & REQ_DISCARD)
@@ -270,6 +270,9 @@ static inline unsigned bio_segments(stru
 	if (bio->bi_rw & REQ_WRITE_SAME)
 		return 1;
 
+	if (bio->bi_rw & REQ_COPY)
+		return 1;
+
 	bio_for_each_segment(bv, bio, iter)
 		segs++;
 
Index: linux-3.16-rc5/include/linux/blk_types.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/blk_types.h	2014-07-14 15:17:09.000000000 +0200
+++ linux-3.16-rc5/include/linux/blk_types.h	2014-07-14 16:26:44.000000000 +0200
@@ -39,6 +39,11 @@ struct bvec_iter {
 						   current bvec */
 };
 
+struct bio_copy {
+	struct block_device	*bic_bdev;
+	sector_t		bic_sector;
+};
+
 /*
  * main unit of I/O for the block layer and lower layers (ie drivers and
  * stacking drivers)
@@ -81,6 +86,7 @@ struct bio {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 	struct bio_integrity_payload *bi_integrity;  /* data integrity */
 #endif
+	struct bio_copy		*bi_copy; 	/* TODO, use bi_integrity */
 
 	unsigned short		bi_vcnt;	/* how many bio_vec's */
 
@@ -160,6 +166,7 @@ enum rq_flag_bits {
 	__REQ_DISCARD,		/* request to discard sectors */
 	__REQ_SECURE,		/* secure discard (used with __REQ_DISCARD) */
 	__REQ_WRITE_SAME,	/* write same block many times */
+	__REQ_COPY,		/* copy block range */
 
 	__REQ_NOIDLE,		/* don't anticipate more IO after this one */
 	__REQ_FUA,		/* forced unit access */
@@ -203,6 +210,7 @@ enum rq_flag_bits {
 #define REQ_PRIO		(1ULL << __REQ_PRIO)
 #define REQ_DISCARD		(1ULL << __REQ_DISCARD)
 #define REQ_WRITE_SAME		(1ULL << __REQ_WRITE_SAME)
+#define REQ_COPY		(1ULL << __REQ_COPY)
 #define REQ_NOIDLE		(1ULL << __REQ_NOIDLE)
 
 #define REQ_FAILFAST_MASK \
@@ -210,14 +218,15 @@ enum rq_flag_bits {
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
 	 REQ_DISCARD | REQ_WRITE_SAME | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | \
-	 REQ_SECURE)
+	 REQ_SECURE | REQ_COPY)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
 
-#define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME)
+#define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME|REQ_COPY)
 
 /* This mask is used for both bio and request merge checking */
 #define REQ_NOMERGE_FLAGS \
-	(REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | REQ_FUA)
+	(REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | REQ_FUA | \
+	 REQ_COPY)
 
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_THROTTLED		(1ULL << __REQ_THROTTLED)
Index: linux-3.16-rc5/include/linux/blkdev.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/blkdev.h	2014-07-14 16:26:22.000000000 +0200
+++ linux-3.16-rc5/include/linux/blkdev.h	2014-07-14 16:26:44.000000000 +0200
@@ -289,6 +289,7 @@ struct queue_limits {
 	unsigned int		io_opt;
 	unsigned int		max_discard_sectors;
 	unsigned int		max_write_same_sectors;
+	unsigned int		max_copy_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -1012,6 +1013,8 @@ extern void blk_queue_max_discard_sector
 		unsigned int max_discard_sectors);
 extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
+extern void blk_queue_max_copy_sectors(struct request_queue *q,
+		unsigned int max_copy_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned short);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
@@ -1168,6 +1171,8 @@ extern int blkdev_issue_discard(struct b
 		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
 extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
+extern int blkdev_issue_copy(struct block_device *, sector_t,
+		struct block_device *, sector_t, unsigned int, gfp_t);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			sector_t nr_sects, gfp_t gfp_mask);
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
@@ -1367,6 +1372,16 @@ static inline unsigned int bdev_write_sa
 	return 0;
 }
 
+static inline unsigned int bdev_copy_offload(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return q->limits.max_copy_sectors;
+
+	return 0;
+}
+
 static inline int queue_dma_alignment(struct request_queue *q)
 {
 	return q ? q->dma_alignment : 511;
Index: linux-3.16-rc5/include/scsi/scsi_device.h
===================================================================
--- linux-3.16-rc5.orig/include/scsi/scsi_device.h	2014-07-14 15:17:09.000000000 +0200
+++ linux-3.16-rc5/include/scsi/scsi_device.h	2014-07-14 16:26:44.000000000 +0200
@@ -119,6 +119,8 @@ struct scsi_device {
 	unsigned char *vpd_pg83;
 	int vpd_pg80_len;
 	unsigned char *vpd_pg80;
+	unsigned char naa_len;
+	unsigned char *naa;
 	unsigned char current_tag;	/* current tag */
 	struct scsi_target      *sdev_target;   /* used only for single_lun */
 
@@ -151,6 +153,7 @@ struct scsi_device {
 	unsigned use_10_for_ms:1; /* first try 10-byte mode sense/select */
 	unsigned no_report_opcodes:1;	/* no REPORT SUPPORTED OPERATION CODES */
 	unsigned no_write_same:1;	/* no WRITE SAME command */
+	unsigned no_copy:1;		/* no copy offload */
 	unsigned use_16_for_rw:1; /* Use read/write(16) over read/write(10) */
 	unsigned skip_ms_page_8:1;	/* do not use MODE SENSE page 0x08 */
 	unsigned skip_ms_page_3f:1;	/* do not use MODE SENSE page 0x3f */
Index: linux-3.16-rc5/include/uapi/linux/fs.h
===================================================================
--- linux-3.16-rc5.orig/include/uapi/linux/fs.h	2014-07-14 15:17:09.000000000 +0200
+++ linux-3.16-rc5/include/uapi/linux/fs.h	2014-07-14 16:26:44.000000000 +0200
@@ -149,6 +149,7 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+#define BLKCOPY _IO(0x12,128)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
Index: linux-3.16-rc5/block/compat_ioctl.c
===================================================================
--- linux-3.16-rc5.orig/block/compat_ioctl.c	2014-07-14 16:26:38.000000000 +0200
+++ linux-3.16-rc5/block/compat_ioctl.c	2014-07-14 16:26:44.000000000 +0200
@@ -696,6 +696,7 @@ long compat_blkdev_ioctl(struct file *fi
 	 * but we call blkdev_ioctl, which gets the lock for us
 	 */
 	case BLKRRPART:
+	case BLKCOPY:
 		return blkdev_ioctl(bdev, mode, cmd,
 				(unsigned long)compat_ptr(arg));
 	case BLKBSZSET_32:
Index: linux-3.16-rc5/block/bio.c
===================================================================
--- linux-3.16-rc5.orig/block/bio.c	2014-07-14 16:26:24.000000000 +0200
+++ linux-3.16-rc5/block/bio.c	2014-07-14 16:26:44.000000000 +0200
@@ -239,6 +239,8 @@ static void __bio_free(struct bio *bio)
 {
 	bio_disassociate_task(bio);
 
+	kfree(bio->bi_copy);
+
 	if (bio_integrity(bio))
 		bio_integrity_free(bio);
 }



* [PATCH 2/15] block copy: use two bios
@ 2014-07-15 19:35 Mikulas Patocka
From: Mikulas Patocka @ 2014-07-15 19:35 UTC
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

This patch changes the architecture of xcopy so that two bios are used.

There used to be just one bio that held pointers to both the source and
destination block devices. However, a bio with two block devices cannot
really be passed through block midlayer drivers (dm and md).

When we need to send the XCOPY command, we call the function
blkdev_issue_copy. This function creates two bios, the first with bi_rw
READ | REQ_COPY and the second with WRITE | REQ_COPY. Both bios hold a
pointer to a common bi_copy structure.

These bios travel independently through the block device stack. When both
bios reach the physical disk driver (the function blk_queue_bio), they
are paired, a request is allocated, and the request is sent to the SCSI
disk driver.

It is possible that one of the bios reaches a device that doesn't support
XCOPY; in that case both bios are aborted with an error.

Note that there is no guarantee that the XCOPY command will succeed. If it
doesn't succeed, the caller is supposed to perform the copy manually.
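
Condensed from the blkdev_issue_copy changes below, the submission side
of the scheme is (sector/size setup and error handling trimmed):

	struct bio_copy *bc = kmalloc(sizeof(*bc), gfp_mask);
	struct bio *read_bio = bio_alloc(gfp_mask, 1);
	struct bio *write_bio = bio_alloc(gfp_mask, 1);

	atomic_set(&bc->in_flight, 2);
	bc->error = 1;			/* 1 = waiting to be paired */

	read_bio->bi_bdev = src_bdev;	/* each half may be remapped by */
	write_bio->bi_bdev = dst_bdev;	/* a different dm stack */
	read_bio->bi_copy = bc;
	write_bio->bi_copy = bc;	/* the shared pairing structure */

	submit_bio(READ | REQ_COPY, read_bio);
	submit_bio(WRITE | REQ_COPY, write_bio);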

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 block/bio.c               |   26 ++++++++++++++-
 block/blk-core.c          |   34 ++++++++++++++++++++
 block/blk-lib.c           |   76 ++++++++++++++++++++++++++++++++++++----------
 drivers/scsi/sd.c         |    7 +---
 include/linux/blk_types.h |   12 ++++++-
 5 files changed, 131 insertions(+), 24 deletions(-)

Index: linux-3.16-rc5/block/blk-lib.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-lib.c	2014-07-14 16:32:21.000000000 +0200
+++ linux-3.16-rc5/block/blk-lib.c	2014-07-15 15:26:33.000000000 +0200
@@ -305,6 +305,36 @@ int blkdev_issue_zeroout(struct block_de
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
 
+static void bio_copy_end_io(struct bio *bio, int error)
+{
+	struct bio_copy *bc = bio->bi_copy;
+	if (unlikely(error)) {
+		unsigned long flags;
+		int dir;
+		struct bio *other;
+
+		/* if the other bio is waiting for the pair, release it */
+		spin_lock_irqsave(&bc->spinlock, flags);
+		if (bc->error >= 0)
+			bc->error = error;
+		dir = bio_data_dir(bio);
+		other = bc->pair[dir ^ 1];
+		bc->pair[dir ^ 1] = NULL;
+		spin_unlock_irqrestore(&bc->spinlock, flags);
+		if (other)
+			bio_endio(other, error);
+	}
+	bio_put(bio);
+	if (atomic_dec_and_test(&bc->in_flight)) {
+		struct bio_batch *bb = bc->private;
+		if (unlikely(bc->error < 0) && !ACCESS_ONCE(bb->error))
+			ACCESS_ONCE(bb->error) = bc->error;
+		kfree(bc);
+		if (atomic_dec_and_test(&bb->done))
+			complete(bb->wait);
+	}
+}
+
 /**
  * blkdev_issue_copy - queue a copy operation
  * @src_bdev:	source blockdev
@@ -351,9 +381,9 @@ int blkdev_issue_copy(struct block_devic
 	bb.wait = &wait;
 
 	while (nr_sects) {
-		struct bio *bio;
+		struct bio *read_bio, *write_bio;
 		struct bio_copy *bc;
-		unsigned int chunk;
+		unsigned int chunk = min(nr_sects, max_copy_sectors);
 
 		bc = kmalloc(sizeof(struct bio_copy), gfp_mask);
 		if (!bc) {
@@ -361,27 +391,43 @@ int blkdev_issue_copy(struct block_devic
 			break;
 		}
 
-		bio = bio_alloc(gfp_mask, 1);
-		if (!bio) {
+		read_bio = bio_alloc(gfp_mask, 1);
+		if (!read_bio) {
 			kfree(bc);
 			ret = -ENOMEM;
 			break;
 		}
 
-		chunk = min(nr_sects, max_copy_sectors);
-
-		bio->bi_iter.bi_sector = dst_sector;
-		bio->bi_iter.bi_size = chunk << 9;
-		bio->bi_end_io = bio_batch_end_io;
-		bio->bi_bdev = dst_bdev;
-		bio->bi_private = &bb;
-		bio->bi_copy = bc;
+		write_bio = bio_alloc(gfp_mask, 1);
+		if (!write_bio) {
+			bio_put(read_bio);
+			kfree(bc);
+			ret = -ENOMEM;
+			break;
+		}
 
-		bc->bic_bdev = src_bdev;
-		bc->bic_sector = src_sector;
+		atomic_set(&bc->in_flight, 2);
+		bc->error = 1;
+		bc->pair[0] = NULL;
+		bc->pair[1] = NULL;
+		bc->private = &bb;
+		spin_lock_init(&bc->spinlock);
+
+		read_bio->bi_iter.bi_sector = src_sector;
+		read_bio->bi_iter.bi_size = chunk << 9;
+		read_bio->bi_end_io = bio_copy_end_io;
+		read_bio->bi_bdev = src_bdev;
+		read_bio->bi_copy = bc;
+
+		write_bio->bi_iter.bi_sector = dst_sector;
+		write_bio->bi_iter.bi_size = chunk << 9;
+		write_bio->bi_end_io = bio_copy_end_io;
+		write_bio->bi_bdev = dst_bdev;
+		write_bio->bi_copy = bc;
 
 		atomic_inc(&bb.done);
-		submit_bio(REQ_WRITE | REQ_COPY, bio);
+		submit_bio(READ | REQ_COPY, read_bio);
+		submit_bio(WRITE | REQ_COPY, write_bio);
 
 		src_sector += chunk;
 		dst_sector += chunk;
Index: linux-3.16-rc5/include/linux/blk_types.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/blk_types.h	2014-07-14 16:26:44.000000000 +0200
+++ linux-3.16-rc5/include/linux/blk_types.h	2014-07-15 15:26:05.000000000 +0200
@@ -40,8 +40,16 @@ struct bvec_iter {
 };
 
 struct bio_copy {
-	struct block_device	*bic_bdev;
-	sector_t		bic_sector;
+	/*
+	 * error == 1 - bios are waiting to be paired
+	 * error == 0 - pair was issued
+	 * error < 0  - error
+	 */
+	int error;
+	atomic_t in_flight;
+	struct bio *pair[2];
+	void *private;
+	spinlock_t spinlock;
 };
 
 /*
Index: linux-3.16-rc5/block/bio.c
===================================================================
--- linux-3.16-rc5.orig/block/bio.c	2014-07-14 16:26:44.000000000 +0200
+++ linux-3.16-rc5/block/bio.c	2014-07-14 16:39:04.000000000 +0200
@@ -239,8 +239,6 @@ static void __bio_free(struct bio *bio)
 {
 	bio_disassociate_task(bio);
 
-	kfree(bio->bi_copy);
-
 	if (bio_integrity(bio))
 		bio_integrity_free(bio);
 }
@@ -566,6 +564,7 @@ void __bio_clone_fast(struct bio *bio, s
 	bio->bi_flags |= 1 << BIO_CLONED;
 	bio->bi_rw = bio_src->bi_rw;
 	bio->bi_iter = bio_src->bi_iter;
+	bio->bi_copy = bio_src->bi_copy;
 	bio->bi_io_vec = bio_src->bi_io_vec;
 }
 EXPORT_SYMBOL(__bio_clone_fast);
@@ -1750,6 +1749,26 @@ void bio_flush_dcache_pages(struct bio *
 EXPORT_SYMBOL(bio_flush_dcache_pages);
 #endif
 
+static noinline_for_stack void bio_endio_copy(struct bio *bio, int error)
+{
+	struct bio_copy *bc = bio->bi_copy;
+	struct bio *other = NULL;
+	unsigned long flags;
+	int dir;
+
+	spin_lock_irqsave(&bc->spinlock, flags);
+	dir = bio_data_dir(bio);
+	if (bc->pair[dir]) {
+		BUG_ON(bc->pair[dir] != bio);
+		other = bc->pair[dir ^ 1];
+		bc->pair[0] = bc->pair[1] = NULL;
+	}
+	spin_unlock_irqrestore(&bc->spinlock, flags);
+
+	if (other)
+		bio_endio(other, error);
+}
+
 /**
  * bio_endio - end I/O on a bio
  * @bio:	bio
@@ -1777,6 +1796,9 @@ void bio_endio(struct bio *bio, int erro
 		if (!atomic_dec_and_test(&bio->bi_remaining))
 			return;
 
+		if (unlikely((bio->bi_rw & REQ_COPY) != 0))
+			bio_endio_copy(bio, error);
+
 		/*
 		 * Need to have a real endio function for chained bios,
 		 * otherwise various corner cases will break (like stacking
Index: linux-3.16-rc5/block/blk-core.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-core.c	2014-07-14 16:26:44.000000000 +0200
+++ linux-3.16-rc5/block/blk-core.c	2014-07-14 16:39:04.000000000 +0200
@@ -1544,6 +1544,32 @@ void init_request_from_bio(struct reques
 	blk_rq_bio_prep(req->q, req, bio);
 }
 
+static noinline_for_stack struct bio *blk_queue_copy(struct bio *bio)
+{
+	struct bio_copy *bc = bio->bi_copy;
+	int dir, error;
+	struct bio *ret;
+
+	spin_lock_irq(&bc->spinlock);
+	error = bc->error;
+	if (unlikely(error < 0)) {
+		spin_unlock_irq(&bc->spinlock);
+		bio_endio(bio, error);
+		return NULL;
+	}
+	dir = bio_data_dir(bio);
+	bc->pair[dir] = bio;
+	if (bc->pair[dir ^ 1]) {
+		ret = bc->pair[1];
+		bc->error = 0;
+	} else {
+		ret = NULL;
+	}
+	spin_unlock_irq(&bc->spinlock);
+
+	return ret;
+}
+
 void blk_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	const bool sync = !!(bio->bi_rw & REQ_SYNC);
@@ -1598,6 +1624,14 @@ void blk_queue_bio(struct request_queue 
 	}
 
 get_rq:
+	if (unlikely((bio->bi_rw & REQ_COPY) != 0)) {
+		spin_unlock_irq(q->queue_lock);
+		bio = blk_queue_copy(bio);
+		if (!bio)
+			return;
+		spin_lock_irq(q->queue_lock);
+	}
+
 	/*
 	 * This sync check and mask will be re-done in init_request_from_bio(),
 	 * but we need to set it earlier to expose the sync flag to the
Index: linux-3.16-rc5/drivers/scsi/sd.c
===================================================================
--- linux-3.16-rc5.orig/drivers/scsi/sd.c	2014-07-14 16:26:44.000000000 +0200
+++ linux-3.16-rc5/drivers/scsi/sd.c	2014-07-14 16:41:31.000000000 +0200
@@ -902,12 +902,9 @@ static int sd_setup_copy_cmnd(struct scs
 	struct page *page;
 	unsigned char *buf;
 
-	if (!bio->bi_copy)
-		return BLKPREP_KILL;
-
 	dst_sdp = scsi_disk(rq->rq_disk)->device;
 	dst_queue = rq->rq_disk->queue;
-	src_disk = bio->bi_copy->bic_bdev->bd_disk;
+	src_disk = bio->bi_copy->pair[0]->bi_bdev->bd_disk;
 	src_queue = src_disk->queue;
 	if (!src_queue ||
 	    src_queue->make_request_fn != blk_queue_bio ||
@@ -924,7 +921,7 @@ static int sd_setup_copy_cmnd(struct scs
 		return BLKPREP_KILL;
 
 	dst_lba = blk_rq_pos(rq) >> (ilog2(dst_sdp->sector_size) - 9);
-	src_lba = bio->bi_copy->bic_sector >> (ilog2(src_sdp->sector_size) - 9);
+	src_lba = bio->bi_copy->pair[0]->bi_iter.bi_sector >> (ilog2(src_sdp->sector_size) - 9);
 	nr_blocks = blk_rq_sectors(rq) >> (ilog2(dst_sdp->sector_size) - 9);
 
 	page = alloc_page(GFP_ATOMIC | __GFP_ZERO);



* [PATCH 3/15] block copy: report the amount of copied data
@ 2014-07-15 19:35 Mikulas Patocka
From: Mikulas Patocka @ 2014-07-15 19:35 UTC
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

This patch changes blkdev_issue_copy so that it returns the number of
copied sectors in the variable "copied".

The kernel makes a best effort to copy as much data as possible, but
because of device mapper mappings, copying may fail at some stage. If we
just returned an error number, the caller wouldn't know whether all or
only part of the operation failed, and would be required to redo the
whole copy operation.

We return the number of copied sectors so that the caller can skip these
sectors when doing the copy manually. On success (zero return code), the
number of copied sectors is equal to the number of requested sectors. On
error (negative return code), the number of copied sectors is smaller than
the number of requested sectors.

The number of copied bytes is returned as a fourth uint64_t argument in
the BLKCOPY ioctl.
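
A sketch of how userspace can use the fourth value to resume after a
partial failure (copy_manually is a hypothetical fallback, e.g. a
read()/write() loop, not part of this series):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int copy_with_fallback(int fd, uint64_t src, uint64_t dst,
			       uint64_t len)
	{
		uint64_t range[4] = { src, dst, len, 0 };

		if (!ioctl(fd, BLKCOPY, range))
			return 0;	/* range[3] == len */

		/* range[3] bytes were copied before the failure; the
		 * manual copy only has to cover the remainder. */
		return copy_manually(fd, src + range[3], dst + range[3],
				     len - range[3]);	/* hypothetical */
	}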

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 block/blk-lib.c           |   30 +++++++++++++++++++++++++-----
 block/ioctl.c             |   25 ++++++++++++++++---------
 include/linux/blk_types.h |    2 ++
 include/linux/blkdev.h    |    3 ++-
 4 files changed, 45 insertions(+), 15 deletions(-)

Index: linux-3.16-rc5/block/blk-lib.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-lib.c	2014-07-15 15:26:33.000000000 +0200
+++ linux-3.16-rc5/block/blk-lib.c	2014-07-15 15:27:25.000000000 +0200
@@ -327,8 +327,17 @@ static void bio_copy_end_io(struct bio *
 	bio_put(bio);
 	if (atomic_dec_and_test(&bc->in_flight)) {
 		struct bio_batch *bb = bc->private;
-		if (unlikely(bc->error < 0) && !ACCESS_ONCE(bb->error))
-			ACCESS_ONCE(bb->error) = bc->error;
+		if (unlikely(bc->error < 0)) {
+			u64 first_error;
+			if (!ACCESS_ONCE(bb->error))
+				ACCESS_ONCE(bb->error) = bc->error;
+			do {
+				first_error = atomic64_read(bc->first_error);
+				if (bc->offset >= first_error)
+					break;
+			} while (unlikely(atomic64_cmpxchg(bc->first_error,
+				first_error, bc->offset) != first_error));
+		}
 		kfree(bc);
 		if (atomic_dec_and_test(&bb->done))
 			complete(bb->wait);
@@ -349,7 +358,7 @@ static void bio_copy_end_io(struct bio *
  */
 int blkdev_issue_copy(struct block_device *src_bdev, sector_t src_sector,
 		      struct block_device *dst_bdev, sector_t dst_sector,
-		      unsigned int nr_sects, gfp_t gfp_mask)
+		      sector_t nr_sects, gfp_t gfp_mask, sector_t *copied)
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *sq = bdev_get_queue(src_bdev);
@@ -357,6 +366,11 @@ int blkdev_issue_copy(struct block_devic
 	unsigned int max_copy_sectors;
 	struct bio_batch bb;
 	int ret = 0;
+	atomic64_t first_error = ATOMIC64_INIT(nr_sects);
+	sector_t offset = 0;
+
+	if (copied)
+		*copied = 0;
 
 	if (!sq || !dq)
 		return -ENXIO;
@@ -380,10 +394,10 @@ int blkdev_issue_copy(struct block_devic
 	bb.error = 0;
 	bb.wait = &wait;
 
-	while (nr_sects) {
+	while (nr_sects && !ACCESS_ONCE(bb.error)) {
 		struct bio *read_bio, *write_bio;
 		struct bio_copy *bc;
-		unsigned int chunk = min(nr_sects, max_copy_sectors);
+		unsigned chunk = (unsigned)min(nr_sects, (sector_t)max_copy_sectors);
 
 		bc = kmalloc(sizeof(struct bio_copy), gfp_mask);
 		if (!bc) {
@@ -411,6 +425,8 @@ int blkdev_issue_copy(struct block_devic
 		bc->pair[0] = NULL;
 		bc->pair[1] = NULL;
 		bc->private = &bb;
+		bc->first_error = &first_error;
+		bc->offset = offset;
 		spin_lock_init(&bc->spinlock);
 
 		read_bio->bi_iter.bi_sector = src_sector;
@@ -432,12 +448,16 @@ int blkdev_issue_copy(struct block_devic
 		src_sector += chunk;
 		dst_sector += chunk;
 		nr_sects -= chunk;
+		offset += chunk;
 	}
 
 	/* Wait for bios in-flight */
 	if (!atomic_dec_and_test(&bb.done))
 		wait_for_completion_io(&wait);
 
+	if (copied)
+		*copied = min((sector_t)atomic64_read(&first_error), offset);
+
 	if (likely(!ret))
 		ret = bb.error;
 
Index: linux-3.16-rc5/include/linux/blk_types.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/blk_types.h	2014-07-15 15:26:05.000000000 +0200
+++ linux-3.16-rc5/include/linux/blk_types.h	2014-07-15 15:27:12.000000000 +0200
@@ -49,6 +49,8 @@ struct bio_copy {
 	atomic_t in_flight;
 	struct bio *pair[2];
 	void *private;
+	atomic64_t *first_error;
+	sector_t offset;
 	spinlock_t spinlock;
 };
 
Index: linux-3.16-rc5/include/linux/blkdev.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/blkdev.h	2014-07-15 15:26:05.000000000 +0200
+++ linux-3.16-rc5/include/linux/blkdev.h	2014-07-15 15:27:12.000000000 +0200
@@ -1172,7 +1172,8 @@ extern int blkdev_issue_discard(struct b
 extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
 extern int blkdev_issue_copy(struct block_device *, sector_t,
-		struct block_device *, sector_t, unsigned int, gfp_t);
+		struct block_device *, sector_t, sector_t, gfp_t,
+		sector_t *);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			sector_t nr_sects, gfp_t gfp_mask);
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
Index: linux-3.16-rc5/block/ioctl.c
===================================================================
--- linux-3.16-rc5.orig/block/ioctl.c	2014-07-15 15:26:05.000000000 +0200
+++ linux-3.16-rc5/block/ioctl.c	2014-07-15 15:27:12.000000000 +0200
@@ -202,8 +202,13 @@ static int blk_ioctl_zeroout(struct bloc
 }
 
 static int blk_ioctl_copy(struct block_device *bdev, uint64_t src_offset,
-			  uint64_t dst_offset, uint64_t len)
+			  uint64_t dst_offset, uint64_t len, uint64_t *copied)
 {
+	int ret;
+	sector_t copied_sec;
+
+	*copied = 0;
+
 	if (src_offset & 511)
 		return -EINVAL;
 	if (dst_offset & 511)
@@ -222,8 +227,12 @@ static int blk_ioctl_copy(struct block_d
 	    unlikely(dst_offset + len > (i_size_read(bdev->bd_inode) >> 9)))
 		return -EINVAL;
 
-	return blkdev_issue_copy(bdev, src_offset, bdev, dst_offset, len,
-				 GFP_KERNEL);
+	ret = blkdev_issue_copy(bdev, src_offset, bdev, dst_offset, len,
+				GFP_KERNEL, &copied_sec);
+
+	*copied = (uint64_t)copied_sec << 9;
+
+	return ret;
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
@@ -367,12 +376,10 @@ int blkdev_ioctl(struct block_device *bd
 		if (copy_from_user(range, (void __user *)arg, 24))
 			return -EFAULT;
 
-		ret = blk_ioctl_copy(bdev, range[0], range[1], range[2]);
-		if (!ret) {
-			range[3] = range[2];
-			if (copy_to_user((void __user *)(arg + 24), &range[3], 8))
-				return -EFAULT;
-		}
+		ret = blk_ioctl_copy(bdev, range[0], range[1], range[2], &range[3]);
+
+		if (copy_to_user((void __user *)(arg + 24), &range[3], 8))
+			return -EFAULT;
 
 		return ret;
 	}



* [PATCH 4/15] block copy: use a timer to fix a theoretical deadlock
@ 2014-07-15 19:36 Mikulas Patocka
From: Mikulas Patocka @ 2014-07-15 19:36 UTC
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

The block layer creates two bios for each copy operation. The bios travel
independently through the storage stack and they are paired at the block
device.

There is a theoretical problem with this - the block device stack only
guarantees forward progress for a single bio. When two bios are sent, it
is possible (though very unlikely) that the first bio exhausts some
mempool and the second bio waits until there is free space in the mempool
(and thus it waits until the first bio finishes).

To avoid this deadlock, we introduce a timer. If the two bios are not
paired at the physical block device within 10 seconds, the copy operation
is aborted and the bio that is waiting to be paired is released with an
error.

Note that there is no guarantee that any XCOPY operation succeeds, so
aborting an operation with an error shouldn't cause any problems - the
caller is supposed to perform the copy manually if XCOPY fails.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 block/blk-lib.c           |   27 +++++++++++++++++++++++++++
 include/linux/blk_types.h |    2 ++
 2 files changed, 29 insertions(+)

Index: linux-3.16-rc5/block/blk-lib.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-lib.c	2014-07-15 15:27:49.000000000 +0200
+++ linux-3.16-rc5/block/blk-lib.c	2014-07-15 15:27:51.000000000 +0200
@@ -305,6 +305,30 @@ int blkdev_issue_zeroout(struct block_de
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
 
+#define BLK_COPY_TIMEOUT	(10 * HZ)
+
+static void blk_copy_timeout(unsigned long bc_)
+{
+	struct bio_copy *bc = (struct bio_copy *)bc_;
+	struct bio *bio0 = NULL, *bio1 = NULL;
+
+	WARN_ON(!irqs_disabled());
+
+	spin_lock(&bc->spinlock);	/* the timer is IRQSAFE */
+	if (bc->error == 1) {
+		bc->error = -ETIMEDOUT;
+		bio0 = bc->pair[0];
+		bio1 = bc->pair[1];
+		bc->pair[0] = bc->pair[1] = NULL;
+	}
+	spin_unlock(&bc->spinlock);
+
+	if (bio0)
+		bio_endio(bio0, -ETIMEDOUT);
+	if (bio1)
+		bio_endio(bio1, -ETIMEDOUT);
+}
+
 static void bio_copy_end_io(struct bio *bio, int error)
 {
 	struct bio_copy *bc = bio->bi_copy;
@@ -338,6 +362,7 @@ static void bio_copy_end_io(struct bio *
 			} while (unlikely(atomic64_cmpxchg(bc->first_error,
 				first_error, bc->offset) != first_error));
 		}
+		del_timer_sync(&bc->timer);
 		kfree(bc);
 		if (atomic_dec_and_test(&bb->done))
 			complete(bb->wait);
@@ -428,6 +453,8 @@ int blkdev_issue_copy(struct block_devic
 		bc->first_error = &first_error;
 		bc->offset = offset;
 		spin_lock_init(&bc->spinlock);
+		__setup_timer(&bc->timer, blk_copy_timeout, (unsigned long)bc, TIMER_IRQSAFE);
+		mod_timer(&bc->timer, jiffies + BLK_COPY_TIMEOUT);
 
 		read_bio->bi_iter.bi_sector = src_sector;
 		read_bio->bi_iter.bi_size = chunk << 9;
Index: linux-3.16-rc5/include/linux/blk_types.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/blk_types.h	2014-07-15 15:27:49.000000000 +0200
+++ linux-3.16-rc5/include/linux/blk_types.h	2014-07-15 15:27:51.000000000 +0200
@@ -6,6 +6,7 @@
 #define __LINUX_BLK_TYPES_H
 
 #include <linux/types.h>
+#include <linux/timer.h>
 
 struct bio_set;
 struct bio;
@@ -52,6 +53,7 @@ struct bio_copy {
 	atomic64_t *first_error;
 	sector_t offset;
 	spinlock_t spinlock;
+	struct timer_list timer;
 };
 
 /*



* [PATCH 5/15] block copy: use merge_bvec_fn for copies
@ 2014-07-15 19:37 Mikulas Patocka
From: Mikulas Patocka @ 2014-07-15 19:37 UTC
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

We use merge_bvec_fn to make sure that copies do not split internal
boundaries of device mapper devices.

It is not possible to split a copy bio (splitting would complicate the
design significantly), so we must use merge_bvec_fn to make sure that
the bios have an appropriate size for the device mapper stack.
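
For orientation, the issuing loop in the diff below clamps each chunk
against both queues before the bio pair is built, roughly like this:

	chunk = min(nr_sects, max_copy_sectors);
	chunk = blkdev_copy_merge(src_bdev, sq, READ | REQ_COPY, src_sector, chunk);
	chunk = blkdev_copy_merge(dst_bdev, dq, WRITE | REQ_COPY, dst_sector, chunk);
	if (!chunk)
		return -EOPNOTSUPP;	/* boundary cannot be crossed at all */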

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 block/blk-lib.c |   37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

Index: linux-3.16-rc5/block/blk-lib.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-lib.c	2014-07-15 15:27:51.000000000 +0200
+++ linux-3.16-rc5/block/blk-lib.c	2014-07-15 15:27:59.000000000 +0200
@@ -369,6 +369,31 @@ static void bio_copy_end_io(struct bio *
 	}
 }
 
+static unsigned blkdev_copy_merge(struct block_device *bdev,
+				  struct request_queue *q, unsigned long bi_rw,
+				  sector_t sector, unsigned n)
+{
+	if (!q->merge_bvec_fn) {
+		return n;
+	} else {
+		unsigned m;
+		struct bvec_merge_data bvm = {
+			.bi_bdev = bdev,
+			.bi_sector = sector,
+			.bi_size = 0,
+			.bi_rw = bi_rw,
+		};
+		struct bio_vec vec = {
+			.bv_page = NULL,
+			.bv_len = likely(n <= UINT_MAX >> 9) ? n << 9 : UINT_MAX & ~511U,
+			.bv_offset = 0,
+		};
+		m = q->merge_bvec_fn(q, &bvm, &vec);
+		m >>= 9;
+		return min(m, n);
+	}
+}
+
 /**
  * blkdev_issue_copy - queue a copy operation
  * @src_bdev:	source blockdev
@@ -424,6 +449,18 @@ int blkdev_issue_copy(struct block_devic
 		struct bio_copy *bc;
 		unsigned chunk = (unsigned)min(nr_sects, (sector_t)max_copy_sectors);
 
+		chunk = blkdev_copy_merge(src_bdev, sq, READ | REQ_COPY, src_sector, chunk);
+		if (!chunk) {
+			ret = -EOPNOTSUPP;
+			break;
+		}
+
+		chunk = blkdev_copy_merge(dst_bdev, dq, WRITE | REQ_COPY, dst_sector, chunk);
+		if (!chunk) {
+			ret = -EOPNOTSUPP;
+			break;
+		}
+
 		bc = kmalloc(sizeof(struct bio_copy), gfp_mask);
 		if (!bc) {
 			ret = -ENOMEM;


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 6/15] block copy: use asynchronous notification
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (4 preceding siblings ...)
  2014-07-15 19:37 ` [PATCH 5/15] block copy: use merge_bvec_fn for copies Mikulas Patocka
@ 2014-07-15 19:37 ` Mikulas Patocka
  2014-07-15 19:39 ` [PATCH 7/15] dm: remove num_write_bios Mikulas Patocka
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-15 19:37 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

In the dm-snapshot target there may be a large number of copy requests in
progress. If every pending copy request consumed a process context, it
would put too much load on the system.

To avoid this load, we need asynchronous notification when a copy
finishes. We can pass a callback to the function blkdev_issue_copy; if the
callback is non-NULL, blkdev_issue_copy returns as soon as it has
submitted all the copy bios, and the callback is called when the copy
operation finishes.

With the callback mechanism, there can be a large number of in-progress
copy requests and we do not need a process context for each of them.
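
As an illustration, an asynchronous caller might look like the sketch
below (my_copy_done() and struct my_ctx are hypothetical; the signature is
the one this patch introduces):

	struct my_ctx {
		struct completion done;
		sector_t copied;
		int status;
	};

	static void my_copy_done(void *data, int error)
	{
		struct my_ctx *ctx = data;

		/* the whole copy has finished; ctx->copied was filled in
		   before this callback was invoked */
		ctx->status = error;
		complete(&ctx->done);
	}

	/* in the submitter: returns 0 as soon as the bios are in flight;
	   the final status arrives via my_copy_done() */
	blkdev_issue_copy(src_bdev, src_sector, dst_bdev, dst_sector,
			  nr_sects, GFP_NOIO, my_copy_done, ctx,
			  &ctx->copied);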

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 block/blk-lib.c           |  152 ++++++++++++++++++++++++++++++++--------------
 block/ioctl.c             |    2 
 include/linux/blk_types.h |    5 -
 include/linux/blkdev.h    |    2 
 4 files changed, 114 insertions(+), 47 deletions(-)

Index: linux-3.16-rc5/block/blk-lib.c
===================================================================
--- linux-3.16-rc5.orig/block/blk-lib.c	2014-07-15 15:27:59.000000000 +0200
+++ linux-3.16-rc5/block/blk-lib.c	2014-07-15 16:16:53.000000000 +0200
@@ -305,6 +305,17 @@ int blkdev_issue_zeroout(struct block_de
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
 
+struct bio_copy_batch {
+	atomic_long_t done;
+	int async_error;
+	int sync_error;
+	sector_t sync_copied;
+	atomic64_t first_error;
+	void (*callback)(void *data, int error);
+	void *data;
+	sector_t *copied;
+};
+
 #define BLK_COPY_TIMEOUT	(10 * HZ)
 
 static void blk_copy_timeout(unsigned long bc_)
@@ -329,6 +340,18 @@ static void blk_copy_timeout(unsigned lo
 		bio_endio(bio1, -ETIMEDOUT);
 }
 
+static void blk_copy_batch_finish(struct bio_copy_batch *batch)
+{
+	void (*fn)(void *, int) = batch->callback;
+	void *data = batch->data;
+	int error = unlikely(batch->sync_error) ? batch->sync_error : batch->async_error;
+	if (batch->copied)
+		*batch->copied = min(batch->sync_copied, (sector_t)atomic64_read(&batch->first_error));
+	kfree(batch);
+	if (fn)
+		fn(data, error);
+}
+
 static void bio_copy_end_io(struct bio *bio, int error)
 {
 	struct bio_copy *bc = bio->bi_copy;
@@ -350,22 +373,22 @@ static void bio_copy_end_io(struct bio *
 	}
 	bio_put(bio);
 	if (atomic_dec_and_test(&bc->in_flight)) {
-		struct bio_batch *bb = bc->private;
+		struct bio_copy_batch *batch = bc->batch;
 		if (unlikely(bc->error < 0)) {
 			u64 first_error;
-			if (!ACCESS_ONCE(bb->error))
-				ACCESS_ONCE(bb->error) = bc->error;
+			if (!ACCESS_ONCE(batch->async_error))
+				ACCESS_ONCE(batch->async_error) = bc->error;
 			do {
-				first_error = atomic64_read(bc->first_error);
+				first_error = atomic64_read(&batch->first_error);
 				if (bc->offset >= first_error)
 					break;
-			} while (unlikely(atomic64_cmpxchg(bc->first_error,
+			} while (unlikely(atomic64_cmpxchg(&batch->first_error,
 				first_error, bc->offset) != first_error));
 		}
 		del_timer_sync(&bc->timer);
 		kfree(bc);
-		if (atomic_dec_and_test(&bb->done))
-			complete(bb->wait);
+		if (atomic_long_dec_and_test(&batch->done))
+			blk_copy_batch_finish(batch);
 	}
 }
 
@@ -394,6 +417,18 @@ static unsigned blkdev_copy_merge(struct
 	}
 }
 
+struct bio_copy_completion {
+	struct completion wait;
+	int error;
+};
+
+static void bio_copy_sync_callback(void *ptr, int error)
+{
+	struct bio_copy_completion *comp = ptr;
+	comp->error = error;
+	complete(&comp->wait);
+}
+
 /**
  * blkdev_issue_copy - queue a copy operation
  * @src_bdev:	source blockdev
@@ -408,69 +443,95 @@ static unsigned blkdev_copy_merge(struct
  */
 int blkdev_issue_copy(struct block_device *src_bdev, sector_t src_sector,
 		      struct block_device *dst_bdev, sector_t dst_sector,
-		      sector_t nr_sects, gfp_t gfp_mask, sector_t *copied)
+		      sector_t nr_sects, gfp_t gfp_mask,
+		      void (*callback)(void *, int), void *data,
+		      sector_t *copied)
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *sq = bdev_get_queue(src_bdev);
 	struct request_queue *dq = bdev_get_queue(dst_bdev);
 	unsigned int max_copy_sectors;
-	struct bio_batch bb;
-	int ret = 0;
-	atomic64_t first_error = ATOMIC64_INIT(nr_sects);
-	sector_t offset = 0;
+	int ret;
+	struct bio_copy_batch *batch;
+	struct bio_copy_completion comp;
 
 	if (copied)
 		*copied = 0;
 
-	if (!sq || !dq)
-		return -ENXIO;
+	if (!sq || !dq) {
+		ret = -ENXIO;
+		goto end_callback;
+	}
 
 	max_copy_sectors = min(sq->limits.max_copy_sectors,
 			       dq->limits.max_copy_sectors);
 
-	if (max_copy_sectors == 0)
-		return -EOPNOTSUPP;
+	if (max_copy_sectors == 0) {
+		ret = -EOPNOTSUPP;
+		goto end_callback;
+	}
 
 	if (src_sector + nr_sects < src_sector ||
-	    dst_sector + nr_sects < dst_sector)
-		return -EINVAL;
+	    dst_sector + nr_sects < dst_sector) {
+		ret = -EINVAL;
+		goto end_callback;
+	}
 
 	/* Do not support overlapping copies */
 	if (src_bdev == dst_bdev &&
-	    abs64((u64)dst_sector - (u64)src_sector) < nr_sects)
-		return -EOPNOTSUPP;
+	    abs64((u64)dst_sector - (u64)src_sector) < nr_sects) {
+		ret = -EOPNOTSUPP;
+		goto end_callback;
+	}
+
+	batch = kmalloc(sizeof(struct bio_copy_batch), gfp_mask);
+	if (!batch) {
+		ret = -ENOMEM;
+		goto end_callback;
+	}
 
-	atomic_set(&bb.done, 1);
-	bb.error = 0;
-	bb.wait = &wait;
+	batch->done = (atomic_long_t)ATOMIC_LONG_INIT(1);
+	batch->async_error = 0;
+	batch->sync_error = 0;
+	batch->sync_copied = 0;
+	batch->first_error = (atomic64_t)ATOMIC64_INIT(nr_sects);
+	batch->copied = copied;
+	if (callback) {
+		batch->callback = callback;
+		batch->data = data;
+	} else {
+		comp.wait = COMPLETION_INITIALIZER_ONSTACK(comp.wait);
+		batch->callback = bio_copy_sync_callback;
+		batch->data = &comp;
+	}
 
-	while (nr_sects && !ACCESS_ONCE(bb.error)) {
+	while (nr_sects && !ACCESS_ONCE(batch->async_error)) {
 		struct bio *read_bio, *write_bio;
 		struct bio_copy *bc;
 		unsigned chunk = (unsigned)min(nr_sects, (sector_t)max_copy_sectors);
 
 		chunk = blkdev_copy_merge(src_bdev, sq, READ | REQ_COPY, src_sector, chunk);
 		if (!chunk) {
-			ret = -EOPNOTSUPP;
+			batch->sync_error = -EOPNOTSUPP;
 			break;
 		}
 
 		chunk = blkdev_copy_merge(dst_bdev, dq, WRITE | REQ_COPY, dst_sector, chunk);
 		if (!chunk) {
-			ret = -EOPNOTSUPP;
+			batch->sync_error = -EOPNOTSUPP;
 			break;
 		}
 
 		bc = kmalloc(sizeof(struct bio_copy), gfp_mask);
 		if (!bc) {
-			ret = -ENOMEM;
+			batch->sync_error = -ENOMEM;
 			break;
 		}
 
 		read_bio = bio_alloc(gfp_mask, 1);
 		if (!read_bio) {
 			kfree(bc);
-			ret = -ENOMEM;
+			batch->sync_error = -ENOMEM;
 			break;
 		}
 
@@ -478,7 +539,7 @@ int blkdev_issue_copy(struct block_devic
 		if (!write_bio) {
 			bio_put(read_bio);
 			kfree(bc);
-			ret = -ENOMEM;
+			batch->sync_error = -ENOMEM;
 			break;
 		}
 
@@ -486,9 +547,8 @@ int blkdev_issue_copy(struct block_devic
 		bc->error = 1;
 		bc->pair[0] = NULL;
 		bc->pair[1] = NULL;
-		bc->private = &bb;
-		bc->first_error = &first_error;
-		bc->offset = offset;
+		bc->batch = batch;
+		bc->offset = batch->sync_copied;
 		spin_lock_init(&bc->spinlock);
 		__setup_timer(&bc->timer, blk_copy_timeout, (unsigned long)bc, TIMER_IRQSAFE);
 		mod_timer(&bc->timer, jiffies + BLK_COPY_TIMEOUT);
@@ -505,27 +565,33 @@ int blkdev_issue_copy(struct block_devic
 		write_bio->bi_bdev = dst_bdev;
 		write_bio->bi_copy = bc;
 
-		atomic_inc(&bb.done);
+		atomic_long_inc(&batch->done);
 		submit_bio(READ | REQ_COPY, read_bio);
 		submit_bio(WRITE | REQ_COPY, write_bio);
 
 		src_sector += chunk;
 		dst_sector += chunk;
 		nr_sects -= chunk;
-		offset += chunk;
+		batch->sync_copied += chunk;
 	}
 
-	/* Wait for bios in-flight */
-	if (!atomic_dec_and_test(&bb.done))
-		wait_for_completion_io(&wait);
+	if (atomic_long_dec_and_test(&batch->done))
+		blk_copy_batch_finish(batch);
 
-	if (copied)
-		*copied = min((sector_t)atomic64_read(&first_error), offset);
-
-	if (likely(!ret))
-		ret = bb.error;
+	if (callback) {
+		return 0;
+	} else {
+		wait_for_completion_io(&comp.wait);
+		return comp.error;
+	}
 
-	return ret;
+end_callback:
+	if (callback) {
+		callback(data, ret);
+		return 0;
+	} else {
+		return ret;
+	}
 }
 EXPORT_SYMBOL(blkdev_issue_copy);
 
Index: linux-3.16-rc5/include/linux/blk_types.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/blk_types.h	2014-07-15 15:27:51.000000000 +0200
+++ linux-3.16-rc5/include/linux/blk_types.h	2014-07-15 15:28:46.000000000 +0200
@@ -40,6 +40,8 @@ struct bvec_iter {
 						   current bvec */
 };
 
+struct bio_copy_batch;
+
 struct bio_copy {
 	/*
 	 * error == 1 - bios are waiting to be paired
@@ -49,8 +51,7 @@ struct bio_copy {
 	int error;
 	atomic_t in_flight;
 	struct bio *pair[2];
-	void *private;
-	atomic64_t *first_error;
+	struct bio_copy_batch *batch;
 	sector_t offset;
 	spinlock_t spinlock;
 	struct timer_list timer;
Index: linux-3.16-rc5/include/linux/blkdev.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/blkdev.h	2014-07-15 15:27:49.000000000 +0200
+++ linux-3.16-rc5/include/linux/blkdev.h	2014-07-15 15:28:46.000000000 +0200
@@ -1173,7 +1173,7 @@ extern int blkdev_issue_write_same(struc
 		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
 extern int blkdev_issue_copy(struct block_device *, sector_t,
 		struct block_device *, sector_t, sector_t, gfp_t,
-		sector_t *);
+		void (*)(void *, int), void *, sector_t *);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			sector_t nr_sects, gfp_t gfp_mask);
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
Index: linux-3.16-rc5/block/ioctl.c
===================================================================
--- linux-3.16-rc5.orig/block/ioctl.c	2014-07-15 15:27:49.000000000 +0200
+++ linux-3.16-rc5/block/ioctl.c	2014-07-15 15:28:46.000000000 +0200
@@ -228,7 +228,7 @@ static int blk_ioctl_copy(struct block_d
 		return -EINVAL;
 
 	ret = blkdev_issue_copy(bdev, src_offset, bdev, dst_offset, len,
-				GFP_KERNEL, &copied_sec);
+				GFP_KERNEL, NULL, NULL, &copied_sec);
 
 	*copied = (uint64_t)copied_sec << 9;
 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 7/15] dm: remove num_write_bios
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (5 preceding siblings ...)
  2014-07-15 19:37 ` [PATCH 6/15] block copy: use asynchronous notification Mikulas Patocka
@ 2014-07-15 19:39 ` Mikulas Patocka
  2014-07-15 19:39 ` [PATCH 8/15] dm: introduce dm_ask_for_duplicate_bios Mikulas Patocka
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-15 19:39 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

[ this isn't connected to XCOPY, but it is required for the following 
device mapper patches to apply cleanly ]

The target can set the function num_write_bios - dm will issue this
callback to ask the target how many bios it wants to receive.

This was intended for the dm-cache target, but it is not usable due to a
race condition (see the description of
e2e74d617eadc15f601983270c4f4a6935c5a943). num_write_bios is unused, so we
remove it.

Note that we deliberately leave the for loop in __clone_and_map_data_bio -
it will be used in the next patch.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm.c               |    6 ------
 include/linux/device-mapper.h |   15 ---------------
 2 files changed, 21 deletions(-)

Index: linux-3.16-rc5/drivers/md/dm.c
===================================================================
--- linux-3.16-rc5.orig/drivers/md/dm.c	2014-07-14 16:25:01.000000000 +0200
+++ linux-3.16-rc5/drivers/md/dm.c	2014-07-14 16:25:29.000000000 +0200
@@ -1315,12 +1315,6 @@ static void __clone_and_map_data_bio(str
 	unsigned target_bio_nr;
 	unsigned num_target_bios = 1;
 
-	/*
-	 * Does the target want to receive duplicate copies of the bio?
-	 */
-	if (bio_data_dir(bio) == WRITE && ti->num_write_bios)
-		num_target_bios = ti->num_write_bios(ti, bio);
-
 	for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) {
 		tio = alloc_tio(ci, ti, 0, target_bio_nr);
 		tio->len_ptr = len;
Index: linux-3.16-rc5/include/linux/device-mapper.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/device-mapper.h	2014-07-14 16:25:07.000000000 +0200
+++ linux-3.16-rc5/include/linux/device-mapper.h	2014-07-14 16:25:29.000000000 +0200
@@ -184,14 +184,6 @@ struct target_type {
 #define DM_TARGET_IMMUTABLE		0x00000004
 #define dm_target_is_immutable(type)	((type)->features & DM_TARGET_IMMUTABLE)
 
-/*
- * Some targets need to be sent the same WRITE bio severals times so
- * that they can send copies of it to different devices.  This function
- * examines any supplied bio and returns the number of copies of it the
- * target requires.
- */
-typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
-
 struct dm_target {
 	struct dm_table *table;
 	struct target_type *type;
@@ -231,13 +223,6 @@ struct dm_target {
 	 */
 	unsigned per_bio_data_size;
 
-	/*
-	 * If defined, this function is called to find out how many
-	 * duplicate bios should be sent to the target when writing
-	 * data.
-	 */
-	dm_num_write_bios_fn num_write_bios;
-
 	/* target specific data */
 	void *private;
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 8/15] dm: introduce dm_ask_for_duplicate_bios
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (6 preceding siblings ...)
  2014-07-15 19:39 ` [PATCH 7/15] dm: remove num_write_bios Mikulas Patocka
@ 2014-07-15 19:39 ` Mikulas Patocka
  2014-07-15 19:40 ` [PATCH 9/15] dm: implement copy Mikulas Patocka
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-15 19:39 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

[ this isn't connected to XCOPY, but it is required for the following
device mapper patches to apply cleanly ]

This function can be used if the target needs to receive another duplicate
of the current bio.
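
A hypothetical use from a target's map function (the only context the
helper may be called from) could look like this sketch;
my_second_leg_needed() and my_pick_bdev() are made-up helpers:

	static int my_map(struct dm_target *ti, struct bio *bio)
	{
		/* ask dm core to send us one more duplicate of this bio */
		if (bio_data_dir(bio) == WRITE && my_second_leg_needed(ti, bio))
			dm_ask_for_duplicate_bios(bio, 1);

		bio->bi_bdev = my_pick_bdev(ti, bio);
		return DM_MAPIO_REMAPPED;
	}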

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm.c               |   24 +++++++++++++++++++-----
 include/linux/device-mapper.h |    2 ++
 2 files changed, 21 insertions(+), 5 deletions(-)

Index: linux-3.16-rc5/drivers/md/dm.c
===================================================================
--- linux-3.16-rc5.orig/drivers/md/dm.c	2014-07-14 16:25:29.000000000 +0200
+++ linux-3.16-rc5/drivers/md/dm.c	2014-07-14 16:25:31.000000000 +0200
@@ -1161,9 +1161,9 @@ EXPORT_SYMBOL_GPL(dm_set_target_max_io_l
  *	 to make it empty)
  * The target requires that region 3 is to be sent in the next bio.
  *
- * If the target wants to receive multiple copies of the bio (via num_*bios, etc),
- * the partially processed part (the sum of regions 1+2) must be the same for all
- * copies of the bio.
+ * If the target wants to receive multiple copies of the bio with num_*_bios or
+ * dm_ask_for_duplicate_bio, the partially processed part (the sum of regions
+ * 1+2) must be the same for all copies of the bio.
  */
 void dm_accept_partial_bio(struct bio *bio, unsigned n_sectors)
 {
@@ -1177,6 +1177,17 @@ void dm_accept_partial_bio(struct bio *b
 }
 EXPORT_SYMBOL_GPL(dm_accept_partial_bio);
 
+/*
+ * The target driver can call this function only from the map routine. The
+ * target driver requests that the dm sends more duplicates of the current bio.
+ */
+void dm_ask_for_duplicate_bios(struct bio *bio, unsigned n_duplicates)
+{
+	struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
+	(*tio->num_bios) += n_duplicates;
+}
+EXPORT_SYMBOL_GPL(dm_ask_for_duplicate_bios);
+
 static void __map_bio(struct dm_target_io *tio)
 {
 	int r;
@@ -1267,12 +1278,14 @@ static struct dm_target_io *alloc_tio(st
 
 static void __clone_and_map_simple_bio(struct clone_info *ci,
 				       struct dm_target *ti,
-				       unsigned target_bio_nr, unsigned *len)
+				       unsigned target_bio_nr, unsigned *len,
+				       unsigned *num_bios)
 {
 	struct dm_target_io *tio = alloc_tio(ci, ti, ci->bio->bi_max_vecs, target_bio_nr);
 	struct bio *clone = &tio->clone;
 
 	tio->len_ptr = len;
+	tio->num_bios = num_bios;
 
 	/*
 	 * Discard requests require the bio's inline iovecs be initialized.
@@ -1292,7 +1305,7 @@ static void __send_duplicate_bios(struct
 	unsigned target_bio_nr;
 
 	for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++)
-		__clone_and_map_simple_bio(ci, ti, target_bio_nr, len);
+		__clone_and_map_simple_bio(ci, ti, target_bio_nr, len, &num_bios);
 }
 
 static int __send_empty_flush(struct clone_info *ci)
@@ -1318,6 +1331,7 @@ static void __clone_and_map_data_bio(str
 	for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) {
 		tio = alloc_tio(ci, ti, 0, target_bio_nr);
 		tio->len_ptr = len;
+		tio->num_bios = &num_target_bios;
 		clone_bio(tio, bio, sector, *len);
 		__map_bio(tio);
 	}
Index: linux-3.16-rc5/include/linux/device-mapper.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/device-mapper.h	2014-07-14 16:25:29.000000000 +0200
+++ linux-3.16-rc5/include/linux/device-mapper.h	2014-07-14 16:25:31.000000000 +0200
@@ -271,6 +271,7 @@ struct dm_target_io {
 	struct dm_target *ti;
 	unsigned target_bio_nr;
 	unsigned *len_ptr;
+	unsigned *num_bios;
 	struct bio clone;
 };
 
@@ -382,6 +383,7 @@ struct gendisk *dm_disk(struct mapped_de
 int dm_suspended(struct dm_target *ti);
 int dm_noflush_suspending(struct dm_target *ti);
 void dm_accept_partial_bio(struct bio *bio, unsigned n_sectors);
+void dm_ask_for_duplicate_bios(struct bio *bio, unsigned n_duplicates);
 union map_info *dm_get_rq_mapinfo(struct request *rq);
 
 struct queue_limits *dm_get_queue_limits(struct mapped_device *md);


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 9/15] dm: implement copy
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (7 preceding siblings ...)
  2014-07-15 19:39 ` [PATCH 8/15] dm: introduce dm_ask_for_duplicate_bios Mikulas Patocka
@ 2014-07-15 19:40 ` Mikulas Patocka
  2014-07-15 19:40 ` [PATCH 10/15] dm linear: support copy Mikulas Patocka
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-15 19:40 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

This patch implements basic copy support for the device mapper core.
Individual targets can enable copy support by setting ti->copy_supported.

A device mapper device advertises copy support if at least one target
supports copy and, for that target, at least one underlying device
supports copy.
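
A target opts in from its constructor by setting the new flag, along the
lines of this sketch (the dm-linear and dm-stripe patches later in the
series do exactly this):

	static int my_ctr(struct dm_target *ti, unsigned argc, char **argv)
	{
		ti->num_flush_bios = 1;
		ti->copy_supported = 1;	/* advertise copy offload */
		return 0;
	}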

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-table.c         |    9 +++++++++
 drivers/md/dm.c               |   42 +++++++++++++++++++++++++++++++++++++++---
 include/linux/device-mapper.h |    5 +++++
 3 files changed, 53 insertions(+), 3 deletions(-)

Index: linux-3.16-rc5/drivers/md/dm.c
===================================================================
--- linux-3.16-rc5.orig/drivers/md/dm.c	2014-07-14 16:26:24.000000000 +0200
+++ linux-3.16-rc5/drivers/md/dm.c	2014-07-14 16:41:15.000000000 +0200
@@ -1403,6 +1403,31 @@ static int __send_write_same(struct clon
 	return __send_changing_extent_only(ci, get_num_write_same_bios, NULL);
 }
 
+static int __send_copy(struct clone_info *ci)
+{
+	struct dm_target *ti;
+	sector_t bound;
+
+	ti = dm_table_find_target(ci->map, ci->sector);
+	if (!dm_target_is_valid(ti))
+		return -EIO;
+
+	if (!ti->copy_supported)
+		return -EOPNOTSUPP;
+
+	bound = max_io_len(ci->sector, ti);
+
+	if (unlikely(ci->sector_count > bound))
+		return -EOPNOTSUPP;
+
+	__clone_and_map_simple_bio(ci, ti, 0, NULL, NULL);
+
+	ci->sector += ci->sector_count;
+	ci->sector_count = 0;
+
+	return 0;
+}
+
 /*
  * Select the correct strategy for processing a non-flush bio.
  */
@@ -1416,6 +1441,8 @@ static int __split_and_process_non_flush
 		return __send_discard(ci);
 	else if (unlikely(bio->bi_rw & REQ_WRITE_SAME))
 		return __send_write_same(ci);
+	else if (unlikely(bio->bi_rw & REQ_COPY))
+		return __send_copy(ci);
 
 	ti = dm_table_find_target(ci->map, ci->sector);
 	if (!dm_target_is_valid(ti))
@@ -1500,6 +1527,11 @@ static int dm_merge_bvec(struct request_
 	if (!dm_target_is_valid(ti))
 		goto out;
 
+	if (unlikely((bvm->bi_rw & REQ_COPY) != 0)) {
+		if (!ti->copy_supported)
+			goto out_ret_max_size;
+	}
+
 	/*
 	 * Find maximum amount of I/O that won't need splitting
 	 */
@@ -1523,17 +1555,21 @@ static int dm_merge_bvec(struct request_
 	 * entries.  So always set max_size to 0, and the code below allows
 	 * just one page.
 	 */
-	else if (queue_max_hw_sectors(q) <= PAGE_SIZE >> 9)
+	else if (likely(!(bvm->bi_rw & REQ_COPY)) &&
+		 queue_max_hw_sectors(q) <= PAGE_SIZE >> 9)
 		max_size = 0;
 
 out:
-	dm_put_live_table_fast(md);
 	/*
 	 * Always allow an entire first page
 	 */
-	if (max_size <= biovec->bv_len && !(bvm->bi_size >> SECTOR_SHIFT))
+	if (likely(!(bvm->bi_rw & REQ_COPY)) &&
+	    max_size <= biovec->bv_len && !(bvm->bi_size >> SECTOR_SHIFT))
 		max_size = biovec->bv_len;
 
+out_ret_max_size:
+	dm_put_live_table_fast(md);
+
 	return max_size;
 }
 
Index: linux-3.16-rc5/include/linux/device-mapper.h
===================================================================
--- linux-3.16-rc5.orig/include/linux/device-mapper.h	2014-07-14 16:26:24.000000000 +0200
+++ linux-3.16-rc5/include/linux/device-mapper.h	2014-07-14 16:41:15.000000000 +0200
@@ -251,6 +251,11 @@ struct dm_target {
 	 * Set if this target does not return zeroes on discarded blocks.
 	 */
 	bool discard_zeroes_data_unsupported:1;
+
+	/*
+	 * Set if the target supports XCOPY.
+	 */
+	bool copy_supported:1;
 };
 
 /* Each target can link one of these into the table */
Index: linux-3.16-rc5/drivers/md/dm-table.c
===================================================================
--- linux-3.16-rc5.orig/drivers/md/dm-table.c	2014-07-14 16:26:25.000000000 +0200
+++ linux-3.16-rc5/drivers/md/dm-table.c	2014-07-14 16:41:15.000000000 +0200
@@ -489,6 +489,11 @@ static int dm_set_device_limits(struct d
 		       q->limits.alignment_offset,
 		       (unsigned long long) start << SECTOR_SHIFT);
 
+	if (ti->copy_supported)
+		limits->max_copy_sectors =
+			min_not_zero(limits->max_copy_sectors,
+				bdev_get_queue(bdev)->limits.max_copy_sectors);
+
 	/*
 	 * Check if merge fn is supported.
 	 * If not we'll force DM to use PAGE_SIZE or
@@ -1298,6 +1303,10 @@ combine_limits:
 			       dm_device_name(table->md),
 			       (unsigned long long) ti->begin,
 			       (unsigned long long) ti->len);
+
+		limits->max_copy_sectors =
+					min_not_zero(limits->max_copy_sectors,
+					ti_limits.max_copy_sectors);
 	}
 
 	return validate_hardware_logical_block_alignment(table, limits);


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 10/15] dm linear: support copy
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (8 preceding siblings ...)
  2014-07-15 19:40 ` [PATCH 9/15] dm: implement copy Mikulas Patocka
@ 2014-07-15 19:40 ` Mikulas Patocka
  2014-07-15 19:41 ` [PATCH 11/15] dm stripe: " Mikulas Patocka
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-15 19:40 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

Support the copy operation in the linear target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-linear.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-3.16-rc4/drivers/md/dm-linear.c
===================================================================
--- linux-3.16-rc4.orig/drivers/md/dm-linear.c	2014-07-11 22:20:27.000000000 +0200
+++ linux-3.16-rc4/drivers/md/dm-linear.c	2014-07-11 22:22:20.000000000 +0200
@@ -56,6 +56,7 @@ static int linear_ctr(struct dm_target *
 	ti->num_flush_bios = 1;
 	ti->num_discard_bios = 1;
 	ti->num_write_same_bios = 1;
+	ti->copy_supported = 1;
 	ti->private = lc;
 	return 0;
 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 11/15] dm stripe: support copy
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (9 preceding siblings ...)
  2014-07-15 19:40 ` [PATCH 10/15] dm linear: support copy Mikulas Patocka
@ 2014-07-15 19:41 ` Mikulas Patocka
  2014-07-15 19:42 ` [PATCH 12/15] dm kcopyd: introduce the function submit_job Mikulas Patocka
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-15 19:41 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

Support the copy operation for the stripe target.

In stripe_merge, we verify that the underlying device supports copy. If it
doesn't, we can fail fast without any bio being constructed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-stripe.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Index: linux-3.16-rc4/drivers/md/dm-stripe.c
===================================================================
--- linux-3.16-rc4.orig/drivers/md/dm-stripe.c	2014-07-11 22:20:25.000000000 +0200
+++ linux-3.16-rc4/drivers/md/dm-stripe.c	2014-07-11 22:23:54.000000000 +0200
@@ -165,6 +165,7 @@ static int stripe_ctr(struct dm_target *
 	ti->num_flush_bios = stripes;
 	ti->num_discard_bios = stripes;
 	ti->num_write_same_bios = stripes;
+	ti->copy_supported = 1;
 
 	sc->chunk_size = chunk_size;
 	if (chunk_size & (chunk_size - 1))
@@ -416,11 +417,19 @@ static int stripe_merge(struct dm_target
 	struct stripe_c *sc = ti->private;
 	sector_t bvm_sector = bvm->bi_sector;
 	uint32_t stripe;
+	struct block_device *bdev;
 	struct request_queue *q;
 
 	stripe_map_sector(sc, bvm_sector, &stripe, &bvm_sector);
 
-	q = bdev_get_queue(sc->stripe[stripe].dev->bdev);
+	bdev = sc->stripe[stripe].dev->bdev;
+
+	if (unlikely((bvm->bi_rw & REQ_COPY) != 0)) {
+		if (!bdev_copy_offload(bdev))
+			return 0;
+	}
+
+	q = bdev_get_queue(bdev);
 	if (!q->merge_bvec_fn)
 		return max_size;
 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 12/15] dm kcopyd: introduce the function submit_job
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (10 preceding siblings ...)
  2014-07-15 19:41 ` [PATCH 11/15] dm stripe: " Mikulas Patocka
@ 2014-07-15 19:42 ` Mikulas Patocka
  2014-07-15 19:43 ` [PATCH 13/15] dm kcopyd: support copy offload Mikulas Patocka
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-15 19:42 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

We move some code into a new function, submit_job. It is needed by the
next patch, which calls submit_job from another place.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-kcopyd.c |   19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

Index: linux-3.16-rc5/drivers/md/dm-kcopyd.c
===================================================================
--- linux-3.16-rc5.orig/drivers/md/dm-kcopyd.c	2014-07-14 16:45:23.000000000 +0200
+++ linux-3.16-rc5/drivers/md/dm-kcopyd.c	2014-07-14 17:28:36.000000000 +0200
@@ -698,6 +698,17 @@ static void split_job(struct kcopyd_job 
 	}
 }
 
+static void submit_job(struct kcopyd_job *job)
+{
+	if (job->source.count <= SUB_JOB_SIZE)
+		dispatch_job(job);
+	else {
+		mutex_init(&job->lock);
+		job->progress = 0;
+		split_job(job);
+	}
+}
+
 int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
 		   unsigned int num_dests, struct dm_io_region *dests,
 		   unsigned int flags, dm_kcopyd_notify_fn fn, void *context)
@@ -746,13 +757,7 @@ int dm_kcopyd_copy(struct dm_kcopyd_clie
 	job->context = context;
 	job->master_job = job;
 
-	if (job->source.count <= SUB_JOB_SIZE)
-		dispatch_job(job);
-	else {
-		mutex_init(&job->lock);
-		job->progress = 0;
-		split_job(job);
-	}
+	submit_job(job);
 
 	return 0;
 }


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 13/15] dm kcopyd: support copy offload
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (11 preceding siblings ...)
  2014-07-15 19:42 ` [PATCH 12/15] dm kcopyd: introduce the function submit_job Mikulas Patocka
@ 2014-07-15 19:43 ` Mikulas Patocka
  2014-07-15 19:43 ` [PATCH 14/15] dm kcopyd: change mutex to spinlock Mikulas Patocka
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-15 19:43 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

This patch adds copy offload support to dm-kcopyd. If copy offload fails,
copying is performed using dm-io, just like before.

There is a module parameter "copy_offload" that can be set to enable or
disable this feature. It can be used to test the performance of copy
offload.
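
Since the parameter is declared with module_param(copy_offload, bool,
S_IRUGO | S_IWUSR), it should be togglable at runtime through sysfs -
assuming the module is built as dm_kcopyd, something like:

	echo 0 > /sys/module/dm_kcopyd/parameters/copy_offload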

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-kcopyd.c |   38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

Index: linux-3.16-rc5/drivers/md/dm-kcopyd.c
===================================================================
--- linux-3.16-rc5.orig/drivers/md/dm-kcopyd.c	2014-07-15 16:18:15.000000000 +0200
+++ linux-3.16-rc5/drivers/md/dm-kcopyd.c	2014-07-15 19:20:34.000000000 +0200
@@ -96,6 +96,9 @@ static DEFINE_SPINLOCK(throttle_spinlock
  */
 #define MAX_SLEEPS			10
 
+static bool copy_offload = true;
+module_param(copy_offload, bool, S_IRUGO | S_IWUSR);
+
 static void io_job_start(struct dm_kcopyd_throttle *t)
 {
 	unsigned throttle, now, difference;
@@ -358,6 +361,8 @@ struct kcopyd_job {
 	sector_t progress;
 
 	struct kcopyd_job *master_job;
+
+	struct work_struct copy_work;
 };
 
 static struct kmem_cache *_job_cache;
@@ -709,6 +714,31 @@ static void submit_job(struct kcopyd_job
 	}
 }
 
+static void copy_offload_work(struct work_struct *work)
+{
+	struct kcopyd_job *job = container_of(work, struct kcopyd_job, copy_work);
+	sector_t copied;
+
+	blkdev_issue_copy(job->source.bdev, job->source.sector,
+			  job->dests[0].bdev, job->dests[0].sector,
+			  job->source.count,
+			  GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN,
+			  NULL, NULL, &copied);
+
+	job->source.sector += copied;
+	job->source.count -= copied;
+	job->dests[0].sector += copied;
+	job->dests[0].count -= copied;
+
+	submit_job(job);
+}
+
+static void try_copy_offload(struct kcopyd_job *job)
+{
+	INIT_WORK(&job->copy_work, copy_offload_work);
+	queue_work(job->kc->kcopyd_wq, &job->copy_work);
+}
+
 int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
 		   unsigned int num_dests, struct dm_io_region *dests,
 		   unsigned int flags, dm_kcopyd_notify_fn fn, void *context)
@@ -757,7 +787,13 @@ int dm_kcopyd_copy(struct dm_kcopyd_clie
 	job->context = context;
 	job->master_job = job;
 
-	submit_job(job);
+	if (copy_offload && num_dests == 1 &&
+	    bdev_copy_offload(job->source.bdev) &&
+	    bdev_copy_offload(job->dests[0].bdev)) {
+		try_copy_offload(job);
+	} else {
+		submit_job(job);
+	}
 
 	return 0;
 }


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 14/15] dm kcopyd: change mutex to spinlock
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (12 preceding siblings ...)
  2014-07-15 19:43 ` [PATCH 13/15] dm kcopyd: support copy offload Mikulas Patocka
@ 2014-07-15 19:43 ` Mikulas Patocka
  2014-07-15 19:44 ` [PATCH 15/15] dm kcopyd: call copy offload with asynchronous callback Mikulas Patocka
  2014-08-28 21:37 ` [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mike Snitzer
  15 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-15 19:43 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

job->lock is only taken for a finite amount of time and the process
doesn't block while holding it, so change it from a mutex to a spinlock.

This change is needed by the next patch, which makes it possible to call
segment_complete from an interrupt. Taking mutexes inside an interrupt is
not allowed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-kcopyd.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-3.16-rc5/drivers/md/dm-kcopyd.c
===================================================================
--- linux-3.16-rc5.orig/drivers/md/dm-kcopyd.c	2014-07-15 19:20:34.000000000 +0200
+++ linux-3.16-rc5/drivers/md/dm-kcopyd.c	2014-07-15 19:24:20.000000000 +0200
@@ -21,7 +21,7 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/workqueue.h>
-#include <linux/mutex.h>
+#include <linux/spinlock.h>
 #include <linux/delay.h>
 #include <linux/device-mapper.h>
 #include <linux/dm-kcopyd.h>
@@ -356,7 +356,7 @@ struct kcopyd_job {
 	 * These fields are only used if the job has been split
 	 * into more manageable parts.
 	 */
-	struct mutex lock;
+	spinlock_t lock;
 	atomic_t sub_jobs;
 	sector_t progress;
 
@@ -629,7 +629,7 @@ static void segment_complete(int read_er
 	struct kcopyd_job *job = sub_job->master_job;
 	struct dm_kcopyd_client *kc = job->kc;
 
-	mutex_lock(&job->lock);
+	spin_lock(&job->lock);
 
 	/* update the error */
 	if (read_err)
@@ -653,7 +653,7 @@ static void segment_complete(int read_er
 			job->progress += count;
 		}
 	}
-	mutex_unlock(&job->lock);
+	spin_unlock(&job->lock);
 
 	if (count) {
 		int i;
@@ -708,7 +708,7 @@ static void submit_job(struct kcopyd_job
 	if (job->source.count <= SUB_JOB_SIZE)
 		dispatch_job(job);
 	else {
-		mutex_init(&job->lock);
+		spin_lock_init(&job->lock);
 		job->progress = 0;
 		split_job(job);
 	}


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 15/15] dm kcopyd: call copy offload with asynchronous callback
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (13 preceding siblings ...)
  2014-07-15 19:43 ` [PATCH 14/15] dm kcopyd: change mutex to spinlock Mikulas Patocka
@ 2014-07-15 19:44 ` Mikulas Patocka
  2014-08-28 21:37 ` [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mike Snitzer
  15 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-15 19:44 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

Change dm kcopyd so that it calls blkdev_issue_copy with an asynchronous
callback. There can be a large number of pending kcopyd requests, and
holding a process context for each of them may put too much load on the
workqueue subsystem.

This patch changes it so that blkdev_issue_copy returns after it has
submitted the requests, and copy_offload_callback is called when the copy
operation finishes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-kcopyd.c |   33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

Index: linux-3.16-rc5/drivers/md/dm-kcopyd.c
===================================================================
--- linux-3.16-rc5.orig/drivers/md/dm-kcopyd.c	2014-07-15 19:24:20.000000000 +0200
+++ linux-3.16-rc5/drivers/md/dm-kcopyd.c	2014-07-15 19:24:54.000000000 +0200
@@ -361,8 +361,6 @@ struct kcopyd_job {
 	sector_t progress;
 
 	struct kcopyd_job *master_job;
-
-	struct work_struct copy_work;
 };
 
 static struct kmem_cache *_job_cache;
@@ -628,8 +626,9 @@ static void segment_complete(int read_er
 	struct kcopyd_job *sub_job = (struct kcopyd_job *) context;
 	struct kcopyd_job *job = sub_job->master_job;
 	struct dm_kcopyd_client *kc = job->kc;
+	unsigned long flags;
 
-	spin_lock(&job->lock);
+	spin_lock_irqsave(&job->lock, flags);
 
 	/* update the error */
 	if (read_err)
@@ -653,7 +652,7 @@ static void segment_complete(int read_er
 			job->progress += count;
 		}
 	}
-	spin_unlock(&job->lock);
+	spin_unlock_irqrestore(&job->lock, flags);
 
 	if (count) {
 		int i;
@@ -714,29 +713,25 @@ static void submit_job(struct kcopyd_job
 	}
 }
 
-static void copy_offload_work(struct work_struct *work)
+static void copy_offload_callback(void *ptr, int error)
 {
-	struct kcopyd_job *job = container_of(work, struct kcopyd_job, copy_work);
-	sector_t copied;
+	struct kcopyd_job *job = ptr;
 
-	blkdev_issue_copy(job->source.bdev, job->source.sector,
-			  job->dests[0].bdev, job->dests[0].sector,
-			  job->source.count,
-			  GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN,
-			  NULL, NULL, &copied);
-
-	job->source.sector += copied;
-	job->source.count -= copied;
-	job->dests[0].sector += copied;
-	job->dests[0].count -= copied;
+	job->source.sector += job->progress;
+	job->source.count -= job->progress;
+	job->dests[0].sector += job->progress;
+	job->dests[0].count -= job->progress;
 
 	submit_job(job);
 }
 
 static void try_copy_offload(struct kcopyd_job *job)
 {
-	INIT_WORK(&job->copy_work, copy_offload_work);
-	queue_work(job->kc->kcopyd_wq, &job->copy_work);
+	blkdev_issue_copy(job->source.bdev, job->source.sector,
+			  job->dests[0].bdev, job->dests[0].sector,
+			  job->source.count,
+			  GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN,
+			  copy_offload_callback, job, &job->progress);
 }
 
 int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/15] block copy: initial XCOPY offload support
  2014-07-15 19:34 ` [PATCH 1/15] block copy: initial XCOPY offload support Mikulas Patocka
@ 2014-07-18 13:03   ` Tomas Henzl
  2014-07-18 14:35     ` Mikulas Patocka
  2014-08-04 14:09   ` Pavel Machek
  1 sibling, 1 reply; 24+ messages in thread
From: Tomas Henzl @ 2014-07-18 13:03 UTC (permalink / raw)
  To: Mikulas Patocka, Alasdair G. Kergon, Mike Snitzer,
	Jonathan Brassow, Edward Thornber, Martin K. Petersen,
	Jens Axboe, Christoph Hellwig
  Cc: dm-devel, linux-kernel, linux-scsi

On 07/15/2014 09:34 PM, Mikulas Patocka wrote:
> This is Martin Petersen's xcopy patch
> (https://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=xcopy&id=0bdeed274e16b3038a851552188512071974eea8)
> with some bug fixes, ported to the current kernel.
>
> This patch makes it possible to use the SCSI XCOPY command.
>
> We create a bio that has REQ_COPY flag in bi_rw and a bi_copy structure
> that defines the source device. The target device is defined in the
> bi_bdev and bi_iter.bi_sector.
>
> There is a new BLKCOPY ioctl that makes it possible to use XCOPY from
> userspace. The ioctl argument is a pointer to an array of four uint64_t
> values.
>
> The first value is a source byte offset, the second value is a destination
> byte offset, the third value is byte length. The forth value is written by
> the kernel and it represents the number of bytes that the kernel actually
> copied.
>
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
>  Documentation/ABI/testing/sysfs-block |    9 +
>  block/bio.c                           |    2 
>  block/blk-core.c                      |    5 
>  block/blk-lib.c                       |   95 ++++++++++++
>  block/blk-merge.c                     |    7 
>  block/blk-settings.c                  |   13 +
>  block/blk-sysfs.c                     |   10 +
>  block/compat_ioctl.c                  |    1 
>  block/ioctl.c                         |   49 ++++++
>  drivers/scsi/scsi.c                   |   57 +++++++
>  drivers/scsi/sd.c                     |  263 +++++++++++++++++++++++++++++++++-
>  drivers/scsi/sd.h                     |    4 
>  include/linux/bio.h                   |    9 -
>  include/linux/blk_types.h             |   15 +
>  include/linux/blkdev.h                |   15 +
>  include/scsi/scsi_device.h            |    3 
>  include/uapi/linux/fs.h               |    1 
>  17 files changed, 545 insertions(+), 13 deletions(-)
>
> Index: linux-3.16-rc5/Documentation/ABI/testing/sysfs-block
> ===================================================================
> --- linux-3.16-rc5.orig/Documentation/ABI/testing/sysfs-block	2014-07-14 15:17:07.000000000 +0200
> +++ linux-3.16-rc5/Documentation/ABI/testing/sysfs-block	2014-07-14 16:26:44.000000000 +0200
> @@ -220,3 +220,12 @@ Description:
>  		write_same_max_bytes is 0, write same is not supported
>  		by the device.
>  
> +
> +What:		/sys/block/<disk>/queue/copy_max_bytes
> +Date:		January 2014
> +Contact:	Martin K. Petersen <martin.petersen@oracle.com>
> +Description:
> +		Devices that support copy offloading will set this value
> +		to indicate the maximum buffer size in bytes that can be
> +		copied in one operation. If the copy_max_bytes is 0 the
> +		device does not support copy offload.
> Index: linux-3.16-rc5/block/blk-core.c
> ===================================================================
> --- linux-3.16-rc5.orig/block/blk-core.c	2014-07-14 16:26:22.000000000 +0200
> +++ linux-3.16-rc5/block/blk-core.c	2014-07-14 16:26:44.000000000 +0200
> @@ -1831,6 +1831,11 @@ generic_make_request_checks(struct bio *
>  		goto end_io;
>  	}
>  
> +	if (bio->bi_rw & REQ_COPY && !bdev_copy_offload(bio->bi_bdev)) {
> +		err = -EOPNOTSUPP;
> +		goto end_io;
> +	}
> +
>  	/*
>  	 * Various block parts want %current->io_context and lazy ioc
>  	 * allocation ends up trading a lot of pain for a small amount of
> Index: linux-3.16-rc5/block/blk-lib.c
> ===================================================================
> --- linux-3.16-rc5.orig/block/blk-lib.c	2014-07-14 16:26:40.000000000 +0200
> +++ linux-3.16-rc5/block/blk-lib.c	2014-07-14 16:32:21.000000000 +0200
> @@ -304,3 +304,98 @@ int blkdev_issue_zeroout(struct block_de
>  	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
>  }
>  EXPORT_SYMBOL(blkdev_issue_zeroout);
> +
> +/**
> + * blkdev_issue_copy - queue a copy operation
> + * @src_bdev:	source blockdev
> + * @src_sector:	source sector
> + * @dst_bdev:	destination blockdev
> + * @dst_sector: destination sector
> + * @nr_sects:	number of sectors to copy
> + * @gfp_mask:	memory allocation flags (for bio_alloc)
> + *
> + * Description:
> + *    Copy a block range from source device to target device.
> + */
> +int blkdev_issue_copy(struct block_device *src_bdev, sector_t src_sector,
> +		      struct block_device *dst_bdev, sector_t dst_sector,
> +		      unsigned int nr_sects, gfp_t gfp_mask)
> +{
> +	DECLARE_COMPLETION_ONSTACK(wait);
> +	struct request_queue *sq = bdev_get_queue(src_bdev);
> +	struct request_queue *dq = bdev_get_queue(dst_bdev);
> +	unsigned int max_copy_sectors;
> +	struct bio_batch bb;
> +	int ret = 0;
> +
> +	if (!sq || !dq)
> +		return -ENXIO;
> +
> +	max_copy_sectors = min(sq->limits.max_copy_sectors,
> +			       dq->limits.max_copy_sectors);
> +
> +	if (max_copy_sectors == 0)
> +		return -EOPNOTSUPP;
> +
> +	if (src_sector + nr_sects < src_sector ||
> +	    dst_sector + nr_sects < dst_sector)
> +		return -EINVAL;

Hi Mikulas,
is this^ meant as an overflow test, or what is the reason?
Thanks, Tomas

> +
> +	/* Do not support overlapping copies */
> +	if (src_bdev == dst_bdev &&
> +	    abs64((u64)dst_sector - (u64)src_sector) < nr_sects)
> +		return -EOPNOTSUPP;
> +
> +	atomic_set(&bb.done, 1);
> +	bb.error = 0;
> +	bb.wait = &wait;
> +
> +	while (nr_sects) {
> +		struct bio *bio;
> +		struct bio_copy *bc;
> +		unsigned int chunk;
> +
> +		bc = kmalloc(sizeof(struct bio_copy), gfp_mask);
> +		if (!bc) {
> +			ret = -ENOMEM;
> +			break;
> +		}
> +
> +		bio = bio_alloc(gfp_mask, 1);
> +		if (!bio) {
> +			kfree(bc);
> +			ret = -ENOMEM;
> +			break;
> +		}
> +
> +		chunk = min(nr_sects, max_copy_sectors);
> +
> +		bio->bi_iter.bi_sector = dst_sector;
> +		bio->bi_iter.bi_size = chunk << 9;
> +		bio->bi_end_io = bio_batch_end_io;
> +		bio->bi_bdev = dst_bdev;
> +		bio->bi_private = &bb;
> +		bio->bi_copy = bc;
> +
> +		bc->bic_bdev = src_bdev;
> +		bc->bic_sector = src_sector;
> +
> +		atomic_inc(&bb.done);
> +		submit_bio(REQ_WRITE | REQ_COPY, bio);
> +
> +		src_sector += chunk;
> +		dst_sector += chunk;
> +		nr_sects -= chunk;
> +	}
> +
> +	/* Wait for bios in-flight */
> +	if (!atomic_dec_and_test(&bb.done))
> +		wait_for_completion_io(&wait);
> +
> +	if (likely(!ret))
> +		ret = bb.error;
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(blkdev_issue_copy);
> +
> Index: linux-3.16-rc5/block/blk-merge.c
> ===================================================================
> --- linux-3.16-rc5.orig/block/blk-merge.c	2014-07-14 15:17:07.000000000 +0200
> +++ linux-3.16-rc5/block/blk-merge.c	2014-07-14 16:26:44.000000000 +0200
> @@ -25,10 +25,7 @@ static unsigned int __blk_recalc_rq_segm
>  	 * This should probably be returning 0, but blk_add_request_payload()
>  	 * (Christoph!!!!)
>  	 */
> -	if (bio->bi_rw & REQ_DISCARD)
> -		return 1;
> -
> -	if (bio->bi_rw & REQ_WRITE_SAME)
> +	if (bio->bi_rw & (REQ_DISCARD | REQ_WRITE_SAME | REQ_COPY))
>  		return 1;
>  
>  	fbio = bio;
> @@ -196,7 +193,7 @@ static int __blk_bios_map_sg(struct requ
>  	nsegs = 0;
>  	cluster = blk_queue_cluster(q);
>  
> -	if (bio->bi_rw & REQ_DISCARD) {
> +	if (bio->bi_rw & (REQ_DISCARD | REQ_COPY)) {
>  		/*
>  		 * This is a hack - drivers should be neither modifying the
>  		 * biovec, nor relying on bi_vcnt - but because of
> Index: linux-3.16-rc5/block/blk-settings.c
> ===================================================================
> --- linux-3.16-rc5.orig/block/blk-settings.c	2014-07-14 15:17:08.000000000 +0200
> +++ linux-3.16-rc5/block/blk-settings.c	2014-07-14 16:26:44.000000000 +0200
> @@ -115,6 +115,7 @@ void blk_set_default_limits(struct queue
>  	lim->max_sectors = lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS;
>  	lim->chunk_sectors = 0;
>  	lim->max_write_same_sectors = 0;
> +	lim->max_copy_sectors = 0;
>  	lim->max_discard_sectors = 0;
>  	lim->discard_granularity = 0;
>  	lim->discard_alignment = 0;
> @@ -322,6 +323,18 @@ void blk_queue_max_write_same_sectors(st
>  EXPORT_SYMBOL(blk_queue_max_write_same_sectors);
>  
>  /**
> + * blk_queue_max_copy_sectors - set max sectors for a single copy operation
> + * @q:  the request queue for the device
> + * @max_copy_sectors: maximum number of sectors per copy operation
> + **/
> +void blk_queue_max_copy_sectors(struct request_queue *q,
> +				unsigned int max_copy_sectors)
> +{
> +	q->limits.max_copy_sectors = max_copy_sectors;
> +}
> +EXPORT_SYMBOL(blk_queue_max_copy_sectors);
> +
> +/**
>   * blk_queue_max_segments - set max hw segments for a request for this queue
>   * @q:  the request queue for the device
>   * @max_segments:  max number of segments
> Index: linux-3.16-rc5/block/blk-sysfs.c
> ===================================================================
> --- linux-3.16-rc5.orig/block/blk-sysfs.c	2014-07-14 15:17:08.000000000 +0200
> +++ linux-3.16-rc5/block/blk-sysfs.c	2014-07-14 16:26:44.000000000 +0200
> @@ -161,6 +161,11 @@ static ssize_t queue_write_same_max_show
>  		(unsigned long long)q->limits.max_write_same_sectors << 9);
>  }
>  
> +static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
> +{
> +	return sprintf(page, "%llu\n",
> +		(unsigned long long)q->limits.max_copy_sectors << 9);
> +}
>  
>  static ssize_t
>  queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
> @@ -374,6 +379,10 @@ static struct queue_sysfs_entry queue_wr
>  	.show = queue_write_same_max_show,
>  };
>  
> +static struct queue_sysfs_entry queue_copy_max_entry = {
> +	.attr = {.name = "copy_max_bytes", .mode = S_IRUGO },
> +	.show = queue_copy_max_show,
> +};
>  static struct queue_sysfs_entry queue_nonrot_entry = {
>  	.attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR },
>  	.show = queue_show_nonrot,
> @@ -422,6 +431,7 @@ static struct attribute *default_attrs[]
>  	&queue_discard_max_entry.attr,
>  	&queue_discard_zeroes_data_entry.attr,
>  	&queue_write_same_max_entry.attr,
> +	&queue_copy_max_entry.attr,
>  	&queue_nonrot_entry.attr,
>  	&queue_nomerges_entry.attr,
>  	&queue_rq_affinity_entry.attr,
> Index: linux-3.16-rc5/block/ioctl.c
> ===================================================================
> --- linux-3.16-rc5.orig/block/ioctl.c	2014-07-14 15:17:08.000000000 +0200
> +++ linux-3.16-rc5/block/ioctl.c	2014-07-14 16:26:44.000000000 +0200
> @@ -201,6 +201,31 @@ static int blk_ioctl_zeroout(struct bloc
>  	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL);
>  }
>  
> +static int blk_ioctl_copy(struct block_device *bdev, uint64_t src_offset,
> +			  uint64_t dst_offset, uint64_t len)
> +{
> +	if (src_offset & 511)
> +		return -EINVAL;
> +	if (dst_offset & 511)
> +		return -EINVAL;
> +	if (len & 511)
> +		return -EINVAL;
> +	src_offset >>= 9;
> +	dst_offset >>= 9;
> +	len >>= 9;
> +
> +	if (unlikely(src_offset + len < src_offset) ||
> +	    unlikely(src_offset + len > (i_size_read(bdev->bd_inode) >> 9)))
> +		return -EINVAL;
> +
> +	if (unlikely(dst_offset + len < dst_offset) ||
> +	    unlikely(dst_offset + len > (i_size_read(bdev->bd_inode) >> 9)))
> +		return -EINVAL;
> +
> +	return blkdev_issue_copy(bdev, src_offset, bdev, dst_offset, len,
> +				 GFP_KERNEL);
> +}
> +
>  static int put_ushort(unsigned long arg, unsigned short val)
>  {
>  	return put_user(val, (unsigned short __user *)arg);
> @@ -328,6 +353,30 @@ int blkdev_ioctl(struct block_device *bd
>  		return blk_ioctl_zeroout(bdev, range[0], range[1]);
>  	}
>  
> +	case BLKCOPY: {
> +		uint64_t range[4];
> +
> +		range[3] = 0;
> +
> +		if (copy_to_user((void __user *)(arg + 24), &range[3], 8))
> +			return -EFAULT;
> +
> +		if (!(mode & FMODE_WRITE))
> +			return -EBADF;
> +
> +		if (copy_from_user(range, (void __user *)arg, 24))
> +			return -EFAULT;
> +
> +		ret = blk_ioctl_copy(bdev, range[0], range[1], range[2]);
> +		if (!ret) {
> +			range[3] = range[2];
> +			if (copy_to_user((void __user *)(arg + 24), &range[3], 8))
> +				return -EFAULT;
> +		}
> +
> +		return ret;
> +	}
> +
>  	case HDIO_GETGEO: {
>  		struct hd_geometry geo;
>  
> Index: linux-3.16-rc5/drivers/scsi/scsi.c
> ===================================================================
> --- linux-3.16-rc5.orig/drivers/scsi/scsi.c	2014-07-14 15:17:08.000000000 +0200
> +++ linux-3.16-rc5/drivers/scsi/scsi.c	2014-07-14 16:26:44.000000000 +0200
> @@ -1024,6 +1024,62 @@ int scsi_get_vpd_page(struct scsi_device
>  EXPORT_SYMBOL_GPL(scsi_get_vpd_page);
>  
>  /**
> + * scsi_lookup_naa - Lookup NAA descriptor in VPD page 0x83
> + * @sdev: The device to ask
> + *
> + * Copy offloading requires us to know the NAA descriptor for both
> + * source and target device. This descriptor is mandatory in the Device
> + * Identification VPD page. Locate this descriptor in the returned VPD
> + * data so we don't have to do lookups for every copy command.
> + */
> +static void scsi_lookup_naa(struct scsi_device *sdev)
> +{
> +	unsigned char *buf = sdev->vpd_pg83;
> +	unsigned int len = sdev->vpd_pg83_len;
> +
> +	if (buf[1] != 0x83 || get_unaligned_be16(&buf[2]) == 0) {
> +		sdev_printk(KERN_ERR, sdev,
> +			    "%s: VPD page 0x83 contains no descriptors\n",
> +			    __func__);
> +		return;
> +	}
> +
> +	buf += 4;
> +	len -= 4;
> +
> +	do {
> +		unsigned int desig_len = buf[3] + 4;
> +
> +		/* Binary code set */
> +		if ((buf[0] & 0xf) != 1)
> +			goto skip;
> +
> +		/* Target association */
> +		if ((buf[1] >> 4) & 0x3)
> +			goto skip;
> +
> +		/* NAA designator */
> +		if ((buf[1] & 0xf) != 0x3)
> +			goto skip;
> +
> +		sdev->naa = buf;
> +		sdev->naa_len = desig_len;
> +
> +		return;
> +
> +	skip:
> +		buf += desig_len;
> +		len -= desig_len;
> +
> +	} while (len > 0);
> +
> +	sdev_printk(KERN_ERR, sdev,
> +		    "%s: VPD page 0x83 NAA descriptor not found\n", __func__);
> +
> +	return;
> +}
> +
> +/**
>   * scsi_attach_vpd - Attach Vital Product Data to a SCSI device structure
>   * @sdev: The device to ask
>   *
> @@ -1107,6 +1163,7 @@ retry_pg83:
>  		}
>  		sdev->vpd_pg83_len = result;
>  		sdev->vpd_pg83 = vpd_buf;
> +		scsi_lookup_naa(sdev);
>  	}
>  }
>  
> Index: linux-3.16-rc5/drivers/scsi/sd.c
> ===================================================================
> --- linux-3.16-rc5.orig/drivers/scsi/sd.c	2014-07-14 16:26:22.000000000 +0200
> +++ linux-3.16-rc5/drivers/scsi/sd.c	2014-07-14 16:26:44.000000000 +0200
> @@ -100,6 +100,7 @@ MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC);
>  
>  static void sd_config_discard(struct scsi_disk *, unsigned int);
>  static void sd_config_write_same(struct scsi_disk *);
> +static void sd_config_copy(struct scsi_disk *);
>  static int  sd_revalidate_disk(struct gendisk *);
>  static void sd_unlock_native_capacity(struct gendisk *disk);
>  static int  sd_probe(struct device *);
> @@ -463,6 +464,48 @@ max_write_same_blocks_store(struct devic
>  }
>  static DEVICE_ATTR_RW(max_write_same_blocks);
>  
> +static ssize_t
> +max_copy_blocks_show(struct device *dev, struct device_attribute *attr,
> +		     char *buf)
> +{
> +	struct scsi_disk *sdkp = to_scsi_disk(dev);
> +
> +	return snprintf(buf, 20, "%u\n", sdkp->max_copy_blocks);
> +}
> +
> +static ssize_t
> +max_copy_blocks_store(struct device *dev, struct device_attribute *attr,
> +		      const char *buf, size_t count)
> +{
> +	struct scsi_disk *sdkp = to_scsi_disk(dev);
> +	struct scsi_device *sdp = sdkp->device;
> +	unsigned long max;
> +	int err;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EACCES;
> +
> +	if (sdp->type != TYPE_DISK)
> +		return -EINVAL;
> +
> +	err = kstrtoul(buf, 10, &max);
> +
> +	if (err)
> +		return err;
> +
> +	if (max == 0)
> +		sdp->no_copy = 1;
> +	else if (max <= SD_MAX_COPY_BLOCKS) {
> +		sdp->no_copy = 0;
> +		sdkp->max_copy_blocks = max;
> +	}
> +
> +	sd_config_copy(sdkp);
> +
> +	return count;
> +}
> +static DEVICE_ATTR_RW(max_copy_blocks);
> +
>  static struct attribute *sd_disk_attrs[] = {
>  	&dev_attr_cache_type.attr,
>  	&dev_attr_FUA.attr,
> @@ -474,6 +517,7 @@ static struct attribute *sd_disk_attrs[]
>  	&dev_attr_thin_provisioning.attr,
>  	&dev_attr_provisioning_mode.attr,
>  	&dev_attr_max_write_same_blocks.attr,
> +	&dev_attr_max_copy_blocks.attr,
>  	&dev_attr_max_medium_access_timeouts.attr,
>  	NULL,
>  };
> @@ -830,6 +874,109 @@ static int sd_setup_write_same_cmnd(stru
>  	return ret;
>  }
>  
> +static void sd_config_copy(struct scsi_disk *sdkp)
> +{
> +	struct request_queue *q = sdkp->disk->queue;
> +	unsigned int logical_block_size = sdkp->device->sector_size;
> +
> +	if (sdkp->device->no_copy)
> +		sdkp->max_copy_blocks = 0;
> +
> +	/* Segment descriptor 0x02 has a 64k block limit */
> +	sdkp->max_copy_blocks = min(sdkp->max_copy_blocks,
> +				    (u32)SD_MAX_CSD2_BLOCKS);
> +
> +	blk_queue_max_copy_sectors(q, sdkp->max_copy_blocks *
> +				   (logical_block_size >> 9));
> +}
> +
> +static int sd_setup_copy_cmnd(struct scsi_device *sdp, struct request *rq)
> +{
> +	struct scsi_device *src_sdp, *dst_sdp;
> +	struct gendisk *src_disk;
> +	struct request_queue *src_queue, *dst_queue;
> +	sector_t src_lba, dst_lba;
> +	unsigned int nr_blocks, buf_len, nr_bytes = blk_rq_bytes(rq);
> +	int ret;
> +	struct bio *bio = rq->bio;
> +	struct page *page;
> +	unsigned char *buf;
> +
> +	if (!bio->bi_copy)
> +		return BLKPREP_KILL;
> +
> +	dst_sdp = scsi_disk(rq->rq_disk)->device;
> +	dst_queue = rq->rq_disk->queue;
> +	src_disk = bio->bi_copy->bic_bdev->bd_disk;
> +	src_queue = src_disk->queue;
> +	if (!src_queue ||
> +	    src_queue->make_request_fn != blk_queue_bio ||
> +	    src_queue->request_fn != dst_queue->request_fn ||
> +	    *(struct scsi_driver **)rq->rq_disk->private_data !=
> +	    *(struct scsi_driver **)src_disk->private_data)
> +		return BLKPREP_KILL;
> +	src_sdp = scsi_disk(src_disk)->device;
> +
> +	if (src_sdp->no_copy || dst_sdp->no_copy)
> +		return BLKPREP_KILL;
> +
> +	if (src_sdp->sector_size != dst_sdp->sector_size)
> +		return BLKPREP_KILL;
> +
> +	dst_lba = blk_rq_pos(rq) >> (ilog2(dst_sdp->sector_size) - 9);
> +	src_lba = bio->bi_copy->bic_sector >> (ilog2(src_sdp->sector_size) - 9);
> +	nr_blocks = blk_rq_sectors(rq) >> (ilog2(dst_sdp->sector_size) - 9);
> +
> +	page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
> +	if (!page)
> +		return BLKPREP_DEFER;
> +
> +	buf = page_address(page);
> +
> +	/* Extended Copy (LID1) Parameter List (16 bytes) */
> +	buf[0] = 0;				/* LID */
> +	buf[1] = 3 << 3;			/* LID usage 11b */
> +	put_unaligned_be16(32 + 32, &buf[2]);	/* 32 bytes per E4 desc. */
> +	put_unaligned_be32(28, &buf[8]);	/* 28 bytes per B2B desc. */
> +	buf += 16;
> +
> +	/* Source CSCD (32 bytes) */
> +	buf[0] = 0xe4;				/* Identification desc. */
> +	memcpy(&buf[4], src_sdp->naa, src_sdp->naa_len);
> +	buf += 32;
> +
> +	/* Destination CSCD (32 bytes) */
> +	buf[0] = 0xe4;				/* Identification desc. */
> +	memcpy(&buf[4], dst_sdp->naa, dst_sdp->naa_len);
> +	buf += 32;
> +
> +	/* Segment descriptor (28 bytes) */
> +	buf[0] = 0x02;				/* Block to block desc. */
> +	put_unaligned_be16(0x18, &buf[2]);	/* Descriptor length */
> +	put_unaligned_be16(0, &buf[4]);		/* Source is desc. 0 */
> +	put_unaligned_be16(1, &buf[6]);		/* Dest. is desc. 1 */
> +	put_unaligned_be16(nr_blocks, &buf[10]);
> +	put_unaligned_be64(src_lba, &buf[12]);
> +	put_unaligned_be64(dst_lba, &buf[20]);
> +
> +	/* CDB */
> +	memset(rq->cmd, 0, rq->cmd_len);
> +	rq->cmd[0] = EXTENDED_COPY;
> +	rq->cmd[1] = 0; /* LID1 */
> +	buf_len = 16 + 32 + 32 + 28;
> +	put_unaligned_be32(buf_len, &rq->cmd[10]);
> +	rq->timeout = SD_COPY_TIMEOUT;
> +
> +	rq->completion_data = page;
> +	blk_add_request_payload(rq, page, buf_len);
> +	ret = scsi_setup_blk_pc_cmnd(sdp, rq);
> +	rq->__data_len = nr_bytes;
> +
> +	if (ret != BLKPREP_OK)
> +		__free_page(page);
> +	return ret;
> +}
> +
>  static int scsi_setup_flush_cmnd(struct scsi_device *sdp, struct request *rq)
>  {
>  	rq->timeout *= SD_FLUSH_TIMEOUT_MULTIPLIER;
> @@ -844,7 +991,7 @@ static void sd_uninit_command(struct scs
>  {
>  	struct request *rq = SCpnt->request;
>  
> -	if (rq->cmd_flags & REQ_DISCARD)
> +	if (rq->cmd_flags & (REQ_DISCARD | REQ_COPY))
>  		__free_page(rq->completion_data);
>  
>  	if (SCpnt->cmnd != rq->cmd) {
> @@ -876,6 +1023,9 @@ static int sd_init_command(struct scsi_c
>  	} else if (rq->cmd_flags & REQ_WRITE_SAME) {
>  		ret = sd_setup_write_same_cmnd(sdp, rq);
>  		goto out;
> +	} else if (rq->cmd_flags & REQ_COPY) {
> +		ret = sd_setup_copy_cmnd(sdp, rq);
> +		goto out;
>  	} else if (rq->cmd_flags & REQ_FLUSH) {
>  		ret = scsi_setup_flush_cmnd(sdp, rq);
>  		goto out;
> @@ -1649,7 +1799,8 @@ static int sd_done(struct scsi_cmnd *SCp
>  	unsigned char op = SCpnt->cmnd[0];
>  	unsigned char unmap = SCpnt->cmnd[1] & 8;
>  
> -	if (req->cmd_flags & REQ_DISCARD || req->cmd_flags & REQ_WRITE_SAME) {
> +	if (req->cmd_flags & REQ_DISCARD || req->cmd_flags & REQ_WRITE_SAME ||
> +	    req->cmd_flags & REQ_COPY) {
>  		if (!result) {
>  			good_bytes = blk_rq_bytes(req);
>  			scsi_set_resid(SCpnt, 0);
> @@ -1708,6 +1859,14 @@ static int sd_done(struct scsi_cmnd *SCp
>  		/* INVALID COMMAND OPCODE or INVALID FIELD IN CDB */
>  		if (sshdr.asc == 0x20 || sshdr.asc == 0x24) {
>  			switch (op) {
> +			case EXTENDED_COPY:
> +				sdkp->device->no_copy = 1;
> +				sd_config_copy(sdkp);
> +
> +				good_bytes = 0;
> +				req->__data_len = blk_rq_bytes(req);
> +				req->cmd_flags |= REQ_QUIET;
> +				break;
>  			case UNMAP:
>  				sd_config_discard(sdkp, SD_LBP_DISABLE);
>  				break;
> @@ -2681,6 +2840,105 @@ static void sd_read_write_same(struct sc
>  		sdkp->ws10 = 1;
>  }
>  
> +static void sd_read_copy_operations(struct scsi_disk *sdkp,
> +				    unsigned char *buffer)
> +{
> +	struct scsi_device *sdev = sdkp->device;
> +	struct scsi_sense_hdr sshdr;
> +	unsigned char cdb[16];
> +	unsigned int result, len, i;
> +	bool b2b_desc = false, id_desc = false;
> +
> +	if (sdev->naa_len == 0)
> +		return;
> +
> +	/* Verify that the device has 3PC set in INQUIRY response */
> +	if (sdev->inquiry_len < 6 || (sdev->inquiry[5] & (1 << 3)) == 0)
> +		return;
> +
> +	/* Receive Copy Operation Parameters */
> +	memset(cdb, 0, 16);
> +	cdb[0] = RECEIVE_COPY_RESULTS;
> +	cdb[1] = 0x3;
> +	put_unaligned_be32(SD_BUF_SIZE, &cdb[10]);
> +
> +	memset(buffer, 0, SD_BUF_SIZE);
> +	result = scsi_execute_req(sdev, cdb, DMA_FROM_DEVICE,
> +				  buffer, SD_BUF_SIZE, &sshdr,
> +				  SD_TIMEOUT, SD_MAX_RETRIES, NULL);
> +
> +	if (!scsi_status_is_good(result)) {
> +		sd_printk(KERN_ERR, sdkp,
> +			  "%s: Receive Copy Operating Parameters failed\n",
> +			  __func__);
> +		return;
> +	}
> +
> +	/* The RCOP response is a minimum of 44 bytes long. First 4
> +	 * bytes contain the length of the remaining buffer, i.e. 40+
> +	 * bytes. Trailing the defined fields is a list of supported
> +	 * descriptors. We need at least 2 descriptors to drive the
> +	 * target, hence 42.
> +	 */
> +	len = get_unaligned_be32(&buffer[0]);
> +	if (len < 42) {
> +		sd_printk(KERN_ERR, sdkp, "%s: result too short (%u)\n",
> +			  __func__, len);
> +		return;
> +	}
> +
> +	if ((buffer[4] & 1) == 0) {
> +		sd_printk(KERN_ERR, sdkp, "%s: does not support SNLID\n",
> +			  __func__);
> +		return;
> +	}
> +
> +	if (get_unaligned_be16(&buffer[8]) < 2) {
> +		sd_printk(KERN_ERR, sdkp,
> +			  "%s: Need 2 or more CSCD descriptors\n", __func__);
> +		return;
> +	}
> +
> +	if (get_unaligned_be16(&buffer[10]) < 1) {
> +		sd_printk(KERN_ERR, sdkp,
> +			  "%s: Need 1 or more segment descriptor\n", __func__);
> +		return;
> +	}
> +
> +	if (len - 40 != buffer[43]) {
> +		sd_printk(KERN_ERR, sdkp,
> +			  "%s: Buffer len and descriptor count mismatch " \
> +			  "(%u vs. %u)\n", __func__, len - 40, buffer[43]);
> +		return;
> +	}
> +
> +	for (i = 44 ; i < len + 4 ; i++) {
> +		if (buffer[i] == 0x02)
> +			b2b_desc = true;
> +
> +		if (buffer[i] == 0xe4)
> +			id_desc = true;
> +	}
> +
> +	if (!b2b_desc) {
> +		sd_printk(KERN_ERR, sdkp,
> +			  "%s: No block 2 block descriptor (0x02)\n",
> +			  __func__);
> +		return;
> +	}
> +
> +	if (!id_desc) {
> +		sd_printk(KERN_ERR, sdkp,
> +			  "%s: No identification descriptor (0xE4)\n",
> +			  __func__);
> +		return;
> +	}
> +
> +	sdkp->max_copy_blocks = get_unaligned_be32(&buffer[16])
> +		>> ilog2(sdev->sector_size);
> +	sd_config_copy(sdkp);
> +}
> +
>  static int sd_try_extended_inquiry(struct scsi_device *sdp)
>  {
>  	/*
> @@ -2741,6 +2999,7 @@ static int sd_revalidate_disk(struct gen
>  		sd_read_cache_type(sdkp, buffer);
>  		sd_read_app_tag_own(sdkp, buffer);
>  		sd_read_write_same(sdkp, buffer);
> +		sd_read_copy_operations(sdkp, buffer);
>  	}
>  
>  	sdkp->first_scan = 0;
> Index: linux-3.16-rc5/drivers/scsi/sd.h
> ===================================================================
> --- linux-3.16-rc5.orig/drivers/scsi/sd.h	2014-07-14 15:17:08.000000000 +0200
> +++ linux-3.16-rc5/drivers/scsi/sd.h	2014-07-14 16:26:44.000000000 +0200
> @@ -19,6 +19,7 @@
>   */
>  #define SD_FLUSH_TIMEOUT_MULTIPLIER	2
>  #define SD_WRITE_SAME_TIMEOUT	(120 * HZ)
> +#define SD_COPY_TIMEOUT		(120 * HZ)
>  
>  /*
>   * Number of allowed retries
> @@ -46,6 +47,8 @@ enum {
>  enum {
>  	SD_MAX_WS10_BLOCKS = 0xffff,
>  	SD_MAX_WS16_BLOCKS = 0x7fffff,
> +	SD_MAX_CSD2_BLOCKS = 0xffff,
> +	SD_MAX_COPY_BLOCKS = 0xffffffff,
>  };
>  
>  enum {
> @@ -66,6 +69,7 @@ struct scsi_disk {
>  	sector_t	capacity;	/* size in 512-byte sectors */
>  	u32		max_ws_blocks;
>  	u32		max_unmap_blocks;
> +	u32		max_copy_blocks;
>  	u32		unmap_granularity;
>  	u32		unmap_alignment;
>  	u32		index;
> Index: linux-3.16-rc5/include/linux/bio.h
> ===================================================================
> --- linux-3.16-rc5.orig/include/linux/bio.h	2014-07-14 15:17:09.000000000 +0200
> +++ linux-3.16-rc5/include/linux/bio.h	2014-07-14 16:26:44.000000000 +0200
> @@ -106,7 +106,7 @@ static inline bool bio_has_data(struct b
>  {
>  	if (bio &&
>  	    bio->bi_iter.bi_size &&
> -	    !(bio->bi_rw & REQ_DISCARD))
> +	    !(bio->bi_rw & (REQ_DISCARD | REQ_COPY)))
>  		return true;
>  
>  	return false;
> @@ -260,8 +260,8 @@ static inline unsigned bio_segments(stru
>  	struct bvec_iter iter;
>  
>  	/*
> -	 * We special case discard/write same, because they interpret bi_size
> -	 * differently:
> +	 * We special case discard/write same/copy, because they
> +	 * interpret bi_size differently:
>  	 */
>  
>  	if (bio->bi_rw & REQ_DISCARD)
> @@ -270,6 +270,9 @@ static inline unsigned bio_segments(stru
>  	if (bio->bi_rw & REQ_WRITE_SAME)
>  		return 1;
>  
> +	if (bio->bi_rw & REQ_COPY)
> +		return 1;
> +
>  	bio_for_each_segment(bv, bio, iter)
>  		segs++;
>  
> Index: linux-3.16-rc5/include/linux/blk_types.h
> ===================================================================
> --- linux-3.16-rc5.orig/include/linux/blk_types.h	2014-07-14 15:17:09.000000000 +0200
> +++ linux-3.16-rc5/include/linux/blk_types.h	2014-07-14 16:26:44.000000000 +0200
> @@ -39,6 +39,11 @@ struct bvec_iter {
>  						   current bvec */
>  };
>  
> +struct bio_copy {
> +	struct block_device	*bic_bdev;
> +	sector_t		bic_sector;
> +};
> +
>  /*
>   * main unit of I/O for the block layer and lower layers (ie drivers and
>   * stacking drivers)
> @@ -81,6 +86,7 @@ struct bio {
>  #if defined(CONFIG_BLK_DEV_INTEGRITY)
>  	struct bio_integrity_payload *bi_integrity;  /* data integrity */
>  #endif
> +	struct bio_copy		*bi_copy; 	/* TODO, use bi_integrity */
>  
>  	unsigned short		bi_vcnt;	/* how many bio_vec's */
>  
> @@ -160,6 +166,7 @@ enum rq_flag_bits {
>  	__REQ_DISCARD,		/* request to discard sectors */
>  	__REQ_SECURE,		/* secure discard (used with __REQ_DISCARD) */
>  	__REQ_WRITE_SAME,	/* write same block many times */
> +	__REQ_COPY,		/* copy block range */
>  
>  	__REQ_NOIDLE,		/* don't anticipate more IO after this one */
>  	__REQ_FUA,		/* forced unit access */
> @@ -203,6 +210,7 @@ enum rq_flag_bits {
>  #define REQ_PRIO		(1ULL << __REQ_PRIO)
>  #define REQ_DISCARD		(1ULL << __REQ_DISCARD)
>  #define REQ_WRITE_SAME		(1ULL << __REQ_WRITE_SAME)
> +#define REQ_COPY		(1ULL << __REQ_COPY)
>  #define REQ_NOIDLE		(1ULL << __REQ_NOIDLE)
>  
>  #define REQ_FAILFAST_MASK \
> @@ -210,14 +218,15 @@ enum rq_flag_bits {
>  #define REQ_COMMON_MASK \
>  	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
>  	 REQ_DISCARD | REQ_WRITE_SAME | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | \
> -	 REQ_SECURE)
> +	 REQ_SECURE | REQ_COPY)
>  #define REQ_CLONE_MASK		REQ_COMMON_MASK
>  
> -#define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME)
> +#define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME|REQ_COPY)
>  
>  /* This mask is used for both bio and request merge checking */
>  #define REQ_NOMERGE_FLAGS \
> -	(REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | REQ_FUA)
> +	(REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | REQ_FUA | \
> +	 REQ_COPY)
>  
>  #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
>  #define REQ_THROTTLED		(1ULL << __REQ_THROTTLED)
> Index: linux-3.16-rc5/include/linux/blkdev.h
> ===================================================================
> --- linux-3.16-rc5.orig/include/linux/blkdev.h	2014-07-14 16:26:22.000000000 +0200
> +++ linux-3.16-rc5/include/linux/blkdev.h	2014-07-14 16:26:44.000000000 +0200
> @@ -289,6 +289,7 @@ struct queue_limits {
>  	unsigned int		io_opt;
>  	unsigned int		max_discard_sectors;
>  	unsigned int		max_write_same_sectors;
> +	unsigned int		max_copy_sectors;
>  	unsigned int		discard_granularity;
>  	unsigned int		discard_alignment;
>  
> @@ -1012,6 +1013,8 @@ extern void blk_queue_max_discard_sector
>  		unsigned int max_discard_sectors);
>  extern void blk_queue_max_write_same_sectors(struct request_queue *q,
>  		unsigned int max_write_same_sectors);
> +extern void blk_queue_max_copy_sectors(struct request_queue *q,
> +		unsigned int max_copy_sectors);
>  extern void blk_queue_logical_block_size(struct request_queue *, unsigned short);
>  extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
>  extern void blk_queue_alignment_offset(struct request_queue *q,
> @@ -1168,6 +1171,8 @@ extern int blkdev_issue_discard(struct b
>  		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
>  extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
> +extern int blkdev_issue_copy(struct block_device *, sector_t,
> +		struct block_device *, sector_t, unsigned int, gfp_t);
>  extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>  			sector_t nr_sects, gfp_t gfp_mask);
>  static inline int sb_issue_discard(struct super_block *sb, sector_t block,
> @@ -1367,6 +1372,16 @@ static inline unsigned int bdev_write_sa
>  	return 0;
>  }
>  
> +static inline unsigned int bdev_copy_offload(struct block_device *bdev)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (q)
> +		return q->limits.max_copy_sectors;
> +
> +	return 0;
> +}
> +
>  static inline int queue_dma_alignment(struct request_queue *q)
>  {
>  	return q ? q->dma_alignment : 511;
> Index: linux-3.16-rc5/include/scsi/scsi_device.h
> ===================================================================
> --- linux-3.16-rc5.orig/include/scsi/scsi_device.h	2014-07-14 15:17:09.000000000 +0200
> +++ linux-3.16-rc5/include/scsi/scsi_device.h	2014-07-14 16:26:44.000000000 +0200
> @@ -119,6 +119,8 @@ struct scsi_device {
>  	unsigned char *vpd_pg83;
>  	int vpd_pg80_len;
>  	unsigned char *vpd_pg80;
> +	unsigned char naa_len;
> +	unsigned char *naa;
>  	unsigned char current_tag;	/* current tag */
>  	struct scsi_target      *sdev_target;   /* used only for single_lun */
>  
> @@ -151,6 +153,7 @@ struct scsi_device {
>  	unsigned use_10_for_ms:1; /* first try 10-byte mode sense/select */
>  	unsigned no_report_opcodes:1;	/* no REPORT SUPPORTED OPERATION CODES */
>  	unsigned no_write_same:1;	/* no WRITE SAME command */
> +	unsigned no_copy:1;		/* no copy offload */
>  	unsigned use_16_for_rw:1; /* Use read/write(16) over read/write(10) */
>  	unsigned skip_ms_page_8:1;	/* do not use MODE SENSE page 0x08 */
>  	unsigned skip_ms_page_3f:1;	/* do not use MODE SENSE page 0x3f */
> Index: linux-3.16-rc5/include/uapi/linux/fs.h
> ===================================================================
> --- linux-3.16-rc5.orig/include/uapi/linux/fs.h	2014-07-14 15:17:09.000000000 +0200
> +++ linux-3.16-rc5/include/uapi/linux/fs.h	2014-07-14 16:26:44.000000000 +0200
> @@ -149,6 +149,7 @@ struct inodes_stat_t {
>  #define BLKSECDISCARD _IO(0x12,125)
>  #define BLKROTATIONAL _IO(0x12,126)
>  #define BLKZEROOUT _IO(0x12,127)
> +#define BLKCOPY _IO(0x12,128)
>  
>  #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
>  #define FIBMAP	   _IO(0x00,1)	/* bmap access */
> Index: linux-3.16-rc5/block/compat_ioctl.c
> ===================================================================
> --- linux-3.16-rc5.orig/block/compat_ioctl.c	2014-07-14 16:26:38.000000000 +0200
> +++ linux-3.16-rc5/block/compat_ioctl.c	2014-07-14 16:26:44.000000000 +0200
> @@ -696,6 +696,7 @@ long compat_blkdev_ioctl(struct file *fi
>  	 * but we call blkdev_ioctl, which gets the lock for us
>  	 */
>  	case BLKRRPART:
> +	case BLKCOPY:
>  		return blkdev_ioctl(bdev, mode, cmd,
>  				(unsigned long)compat_ptr(arg));
>  	case BLKBSZSET_32:
> Index: linux-3.16-rc5/block/bio.c
> ===================================================================
> --- linux-3.16-rc5.orig/block/bio.c	2014-07-14 16:26:24.000000000 +0200
> +++ linux-3.16-rc5/block/bio.c	2014-07-14 16:26:44.000000000 +0200
> @@ -239,6 +239,8 @@ static void __bio_free(struct bio *bio)
>  {
>  	bio_disassociate_task(bio);
>  
> +	kfree(bio->bi_copy);
> +
>  	if (bio_integrity(bio))
>  		bio_integrity_free(bio);
>  }
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/15] block copy: initial XCOPY offload support
  2014-07-18 13:03   ` Tomas Henzl
@ 2014-07-18 14:35     ` Mikulas Patocka
  0 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-07-18 14:35 UTC (permalink / raw)
  To: Tomas Henzl
  Cc: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig, dm-devel, linux-kernel, linux-scsi



On Fri, 18 Jul 2014, Tomas Henzl wrote:

> > +	if (src_sector + nr_sects < src_sector ||
> > +	    dst_sector + nr_sects < dst_sector)
> > +		return -EINVAL;
> 
> Hi Mikulas,
> is this^ meant as an overflow test, or what is the reason?
> Thanks, Tomas

Yes. It is a test for overflow.

Mikulas
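
For the curious, a toy demonstration of the idiom in plain userspace C
(not kernel code): sector_t is unsigned, and unsigned addition wraps
modulo 2^N, so the sum is smaller than either operand exactly when the
addition overflowed.

#include <assert.h>
#include <limits.h>

int main(void)
{
	unsigned long long src = ULLONG_MAX - 10;	/* near the top */
	unsigned long long nr = 100;

	assert(src + nr < src);		/* wrapped - would be rejected */

	src = 0;
	assert(!(src + nr < src));	/* no wrap - accepted */
	return 0;
}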

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/15] block copy: initial XCOPY offload support
  2014-07-15 19:34 ` [PATCH 1/15] block copy: initial XCOPY offload support Mikulas Patocka
  2014-07-18 13:03   ` Tomas Henzl
@ 2014-08-04 14:09   ` Pavel Machek
  2014-08-05 22:45     ` Mikulas Patocka
  1 sibling, 1 reply; 24+ messages in thread
From: Pavel Machek @ 2014-08-04 14:09 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig, dm-devel, linux-kernel, linux-scsi

On Tue 2014-07-15 15:34:47, Mikulas Patocka wrote:
> This is Martin Petersen's xcopy patch
> (https://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=xcopy&id=0bdeed274e16b3038a851552188512071974eea8)
> with some bug fixes, ported to the current kernel.
> 
> This patch makes it possible to use the SCSI XCOPY command.
> 
> We create a bio that has REQ_COPY flag in bi_rw and a bi_copy structure
> that defines the source device. The target device is defined in the
> bi_bdev and bi_iter.bi_sector.
> 
> There is a new BLKCOPY ioctl that makes it possible to use XCOPY from
> userspace. The ioctl argument is a pointer to an array of four uint64_t
> values.

But it is there only for block devices, right?

Is there a plan to enable tools such as /bin/cp to use XCOPY?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/15] block copy: initial XCOPY offload support
  2014-08-04 14:09   ` Pavel Machek
@ 2014-08-05 22:45     ` Mikulas Patocka
  0 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2014-08-05 22:45 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alasdair G. Kergon, Mike Snitzer, Jonathan Brassow,
	Edward Thornber, Martin K. Petersen, Jens Axboe,
	Christoph Hellwig, dm-devel, linux-kernel, linux-scsi



On Mon, 4 Aug 2014, Pavel Machek wrote:

> On Tue 2014-07-15 15:34:47, Mikulas Patocka wrote:
> > This is Martin Petersen's xcopy patch
> > (https://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=xcopy&id=0bdeed274e16b3038a851552188512071974eea8)
> > with some bug fixes, ported to the current kernel.
> > 
> > This patch makes it possible to use the SCSI XCOPY command.
> > 
> > We create a bio that has REQ_COPY flag in bi_rw and a bi_copy structure
> > that defines the source device. The target device is defined in the
> > bi_bdev and bi_iter.bi_sector.
> > 
> > There is a new BLKCOPY ioctl that makes it possible to use XCOPY from
> > userspace. The ioctl argument is a pointer to an array of four uint64_t
> > values.
> 
> But it is there only for block devices, right?
> 
> Is there a plan to enable tools such as /bin/cp to use XCOPY?
> 
> 									Pavel

It is an interesting idea, but it would be far from simple. You could
make sendfile (or maybe splice) use XCOPY, but it would need to interact
with the page cache.

Mikulas

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper
  2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
                   ` (14 preceding siblings ...)
  2014-07-15 19:44 ` [PATCH 15/15] dm kcopyd: call copy offload with asynchronous callback Mikulas Patocka
@ 2014-08-28 21:37 ` Mike Snitzer
  2014-08-29 10:29   ` Martin K. Petersen
  15 siblings, 1 reply; 24+ messages in thread
From: Mike Snitzer @ 2014-08-28 21:37 UTC (permalink / raw)
  To: Martin K. Petersen, Mikulas Patocka
  Cc: Alasdair G. Kergon, Jonathan Brassow, Edward Thornber,
	Jens Axboe, Christoph Hellwig, dm-devel, linux-kernel,
	linux-scsi

On Tue, Jul 15 2014 at  3:34pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> This patch series makes it possible to use SCSI XCOPY offload for the 
> block layer and device mapper.
> 
> It is based on Martin Petersen's work
> https://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=xcopy&id=0bdeed274e16b3038a851552188512071974eea8,
> but it is changed significantly so that it is possible to propagate XCOPY
> bios through the device mapper stack.
> 
> The basic architecture is this: in the function blkdev_issue_copy we
> create two bios, one for read and one for write (with bi_rw READ|REQ_COPY
> and WRITE|REQ_COPY). Both bios have a pointer to the same bio_copy
> structure. These two bios travel independently through the device mapper
> stack - each bio can go through different device mapper devices. When both
> the bios reach the physical block device (in the function blk_queue_bio)
> the bio pair is collected and a XCOPY request is allocated and sent to the
> scsi disk driver.
> 
> Note that because the device mapper mapping can dynamically change, there is no
> guarantee that the XCOPY command succeeds. If it ends with an error, the
> caller is supposed to perform the copying manually.
> 
> The dm-kcopyd subsystem is modified to use the XCOPY command, so device
> mapper targets that use it (mirror, snapshot, thin, cache) take advantage
> of copy offload automatically.
> 
> There is a new ioctl BLKCOPY that makes it possible to use copy offload
> from userspace.

Hi Martin (and others on linux-scsi),

It would be ideal for XCOPY support to make its way upstream for
3.18, but the window for staging this work in time is closing.

Any chance you might have some time to review Mikulas' revised approach
to your initial XCOPY support?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper
  2014-08-28 21:37 ` [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mike Snitzer
@ 2014-08-29 10:29   ` Martin K. Petersen
  0 siblings, 0 replies; 24+ messages in thread
From: Martin K. Petersen @ 2014-08-29 10:29 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Martin K. Petersen, Mikulas Patocka, Alasdair G. Kergon,
	Jonathan Brassow, Edward Thornber, Jens Axboe, Christoph Hellwig,
	dm-devel, linux-kernel, linux-scsi

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> It would be ideal for XCOPY support to make its way upstream for
Mike> 3.18.. but the window for staging this work in time is closing.

Mike> Any chance you might have some time to review Mikulas' revised
Mike> approach to your initial XCOPY support? 

It is at the top of my list.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/15] block copy: initial XCOPY offload support
  2015-12-10 17:29 [PATCH 0/15] copy offload patches Mikulas Patocka
@ 2015-12-10 17:30   ` Mikulas Patocka
  0 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2015-12-10 17:30 UTC (permalink / raw)
  To: James E.J. Bottomley, Martin K. Petersen, Jens Axboe,
	Mike Snitzer, Jonathan Brassow
  Cc: dm-devel, linux-scsi, linux-kernel, linux-block

This is Martin Petersen's xcopy patch
(https://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=xcopy&id=0bdeed274e16b3038a851552188512071974eea8)
with some bug fixes, ported to the current kernel.

This patch makes it possible to use the SCSI XCOPY command.

We create a bio that has REQ_COPY flag in bi_rw and a bi_copy structure
that defines the source device. The target device is defined in the
bi_bdev and bi_iter.bi_sector.

There is a new BLKCOPY ioctl that makes it possible to use XCOPY from
userspace. The ioctl argument is a pointer to an array of four uint64_t
values.

The first value is the source byte offset, the second value is the
destination byte offset, and the third value is the byte length. The
fourth value is written by the kernel and represents the number of
bytes that the kernel actually copied.
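
Purely as illustration (not part of the patch), a userspace caller might
look like the sketch below. The device path is made up; per
blk_ioctl_copy() the offsets and length must be multiples of 512, and
overlapping ranges on the same device are rejected.

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKCOPY, once this patch is applied */

int main(void)
{
	uint64_t range[4] = {
		0,		/* source byte offset */
		1 << 20,	/* destination byte offset */
		1 << 20,	/* byte length */
		0,		/* result: filled in by the kernel */
	};
	int fd = open("/dev/sdX", O_RDWR);	/* hypothetical device */

	if (fd < 0 || ioctl(fd, BLKCOPY, range) < 0)
		perror("BLKCOPY");
	else
		printf("copied %llu bytes\n",
		       (unsigned long long)range[3]);
	return 0;
}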

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 Documentation/ABI/testing/sysfs-block |    9 +
 block/bio.c                           |    2 
 block/blk-core.c                      |    5 
 block/blk-lib.c                       |   95 +++++++++++
 block/blk-merge.c                     |   11 -
 block/blk-settings.c                  |   13 +
 block/blk-sysfs.c                     |   11 +
 block/compat_ioctl.c                  |    1 
 block/ioctl.c                         |   50 ++++++
 drivers/scsi/scsi.c                   |   57 +++++++
 drivers/scsi/sd.c                     |  271 +++++++++++++++++++++++++++++++++-
 drivers/scsi/sd.h                     |    4 
 include/linux/bio.h                   |    9 -
 include/linux/blk_types.h             |   14 +
 include/linux/blkdev.h                |   15 +
 include/scsi/scsi_device.h            |    3 
 include/uapi/linux/fs.h               |    1 
 17 files changed, 557 insertions(+), 14 deletions(-)

Index: linux-4.4-rc4/Documentation/ABI/testing/sysfs-block
===================================================================
--- linux-4.4-rc4.orig/Documentation/ABI/testing/sysfs-block	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/Documentation/ABI/testing/sysfs-block	2015-12-10 17:04:30.000000000 +0100
@@ -235,3 +235,12 @@ Description:
 		write_same_max_bytes is 0, write same is not supported
 		by the device.
 
+
+What:		/sys/block/<disk>/queue/copy_max_bytes
+Date:		January 2014
+Contact:	Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+		Devices that support copy offloading will set this value
+		to indicate the maximum buffer size in bytes that can be
+		copied in one operation. If the copy_max_bytes is 0 the
+		device does not support copy offload.
Index: linux-4.4-rc4/block/blk-core.c
===================================================================
--- linux-4.4-rc4.orig/block/blk-core.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/blk-core.c	2015-12-10 17:04:30.000000000 +0100
@@ -1957,6 +1957,11 @@ generic_make_request_checks(struct bio *
 		goto end_io;
 	}
 
+	if (bio->bi_rw & REQ_COPY && !bdev_copy_offload(bio->bi_bdev)) {
+		err = -EOPNOTSUPP;
+		goto end_io;
+	}
+
 	/*
 	 * Various block parts want %current->io_context and lazy ioc
 	 * allocation ends up trading a lot of pain for a small amount of
Index: linux-4.4-rc4/block/blk-lib.c
===================================================================
--- linux-4.4-rc4.orig/block/blk-lib.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/blk-lib.c	2015-12-10 17:04:30.000000000 +0100
@@ -299,3 +299,98 @@ int blkdev_issue_zeroout(struct block_de
 	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_issue_copy - queue a copy same operation
+ * @src_bdev:	source blockdev
+ * @src_sector:	source sector
+ * @dst_bdev:	destination blockdev
+ * @dst_sector: destination sector
+ * @nr_sects:	number of sectors to copy
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *    Copy a block range from source device to target device.
+ */
+int blkdev_issue_copy(struct block_device *src_bdev, sector_t src_sector,
+		      struct block_device *dst_bdev, sector_t dst_sector,
+		      unsigned int nr_sects, gfp_t gfp_mask)
+{
+	DECLARE_COMPLETION_ONSTACK(wait);
+	struct request_queue *sq = bdev_get_queue(src_bdev);
+	struct request_queue *dq = bdev_get_queue(dst_bdev);
+	unsigned int max_copy_sectors;
+	struct bio_batch bb;
+	int ret = 0;
+
+	if (!sq || !dq)
+		return -ENXIO;
+
+	max_copy_sectors = min(sq->limits.max_copy_sectors,
+			       dq->limits.max_copy_sectors);
+
+	if (max_copy_sectors == 0)
+		return -EOPNOTSUPP;
+
+	if (src_sector + nr_sects < src_sector ||
+	    dst_sector + nr_sects < dst_sector)
+		return -EINVAL;
+
+	/* Do not support overlapping copies */
+	if (src_bdev == dst_bdev &&
+	    abs((u64)dst_sector - (u64)src_sector) < nr_sects)
+		return -EOPNOTSUPP;
+
+	atomic_set(&bb.done, 1);
+	bb.error = 0;
+	bb.wait = &wait;
+
+	while (nr_sects) {
+		struct bio *bio;
+		struct bio_copy *bc;
+		unsigned int chunk;
+
+		bc = kmalloc(sizeof(struct bio_copy), gfp_mask);
+		if (!bc) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		bio = bio_alloc(gfp_mask, 1);
+		if (!bio) {
+			kfree(bc);
+			ret = -ENOMEM;
+			break;
+		}
+
+		chunk = min(nr_sects, max_copy_sectors);
+
+		bio->bi_iter.bi_sector = dst_sector;
+		bio->bi_iter.bi_size = chunk << 9;
+		bio->bi_end_io = bio_batch_end_io;
+		bio->bi_bdev = dst_bdev;
+		bio->bi_private = &bb;
+		bio->bi_copy = bc;
+
+		bc->bic_bdev = src_bdev;
+		bc->bic_sector = src_sector;
+
+		atomic_inc(&bb.done);
+		submit_bio(REQ_WRITE | REQ_COPY, bio);
+
+		src_sector += chunk;
+		dst_sector += chunk;
+		nr_sects -= chunk;
+	}
+
+	/* Wait for bios in-flight */
+	if (!atomic_dec_and_test(&bb.done))
+		wait_for_completion_io(&wait);
+
+	if (likely(!ret))
+		ret = bb.error;
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_copy);
+
Index: linux-4.4-rc4/block/blk-merge.c
===================================================================
--- linux-4.4-rc4.orig/block/blk-merge.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/blk-merge.c	2015-12-10 17:04:30.000000000 +0100
@@ -145,7 +145,9 @@ void blk_queue_split(struct request_queu
 	struct bio *split, *res;
 	unsigned nsegs;
 
-	if ((*bio)->bi_rw & REQ_DISCARD)
+	if ((*bio)->bi_rw & REQ_COPY)
+		return;
+	else if ((*bio)->bi_rw & REQ_DISCARD)
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
 	else if ((*bio)->bi_rw & REQ_WRITE_SAME)
 		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
@@ -185,10 +187,7 @@ static unsigned int __blk_recalc_rq_segm
 	 * This should probably be returning 0, but blk_add_request_payload()
 	 * (Christoph!!!!)
 	 */
-	if (bio->bi_rw & REQ_DISCARD)
-		return 1;
-
-	if (bio->bi_rw & REQ_WRITE_SAME)
+	if (bio->bi_rw & (REQ_DISCARD | REQ_WRITE_SAME | REQ_COPY))
 		return 1;
 
 	fbio = bio;
@@ -361,7 +360,7 @@ static int __blk_bios_map_sg(struct requ
 	nsegs = 0;
 	cluster = blk_queue_cluster(q);
 
-	if (bio->bi_rw & REQ_DISCARD) {
+	if (bio->bi_rw & (REQ_DISCARD | REQ_COPY)) {
 		/*
 		 * This is a hack - drivers should be neither modifying the
 		 * biovec, nor relying on bi_vcnt - but because of
Index: linux-4.4-rc4/block/blk-settings.c
===================================================================
--- linux-4.4-rc4.orig/block/blk-settings.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/blk-settings.c	2015-12-10 17:04:30.000000000 +0100
@@ -95,6 +95,7 @@ void blk_set_default_limits(struct queue
 		BLK_SAFE_MAX_SECTORS;
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
+	lim->max_copy_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -298,6 +299,18 @@ void blk_queue_max_write_same_sectors(st
 EXPORT_SYMBOL(blk_queue_max_write_same_sectors);
 
 /**
+ * blk_queue_max_copy_sectors - set max sectors for a single copy operation
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors per copy operation
+ **/
+void blk_queue_max_copy_sectors(struct request_queue *q,
+				unsigned int max_copy_sectors)
+{
+	q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_copy_sectors);
+
+/**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
  * @max_segments:  max number of segments
Index: linux-4.4-rc4/block/blk-sysfs.c
===================================================================
--- linux-4.4-rc4.orig/block/blk-sysfs.c	2015-12-10 17:04:01.000000000 +0100
+++ linux-4.4-rc4/block/blk-sysfs.c	2015-12-10 17:04:30.000000000 +0100
@@ -193,6 +193,11 @@ static ssize_t queue_write_same_max_show
 		(unsigned long long)q->limits.max_write_same_sectors << 9);
 }
 
+static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n",
+		(unsigned long long)q->limits.max_copy_sectors << 9);
+}
 
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
@@ -443,6 +448,11 @@ static struct queue_sysfs_entry queue_wr
 	.show = queue_write_same_max_show,
 };
 
+static struct queue_sysfs_entry queue_copy_max_entry = {
+	.attr = {.name = "copy_max_bytes", .mode = S_IRUGO },
+	.show = queue_copy_max_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_show_nonrot,
@@ -498,6 +508,7 @@ static struct attribute *default_attrs[]
 	&queue_discard_max_hw_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
+	&queue_copy_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
Index: linux-4.4-rc4/block/ioctl.c
===================================================================
--- linux-4.4-rc4.orig/block/ioctl.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/ioctl.c	2015-12-10 17:04:30.000000000 +0100
@@ -249,6 +249,31 @@ static int blk_ioctl_zeroout(struct bloc
 	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
 }
 
+static int blk_ioctl_copy(struct block_device *bdev, uint64_t src_offset,
+			  uint64_t dst_offset, uint64_t len)
+{
+	if (src_offset & 511)
+		return -EINVAL;
+	if (dst_offset & 511)
+		return -EINVAL;
+	if (len & 511)
+		return -EINVAL;
+	src_offset >>= 9;
+	dst_offset >>= 9;
+	len >>= 9;
+
+	if (unlikely(src_offset + len < src_offset) ||
+	    unlikely(src_offset + len > (i_size_read(bdev->bd_inode) >> 9)))
+		return -EINVAL;
+
+	if (unlikely(dst_offset + len < dst_offset) ||
+	    unlikely(dst_offset + len > (i_size_read(bdev->bd_inode) >> 9)))
+		return -EINVAL;
+
+	return blkdev_issue_copy(bdev, src_offset, bdev, dst_offset, len,
+				 GFP_KERNEL);
+}
+
 static int put_ushort(unsigned long arg, unsigned short val)
 {
 	return put_user(val, (unsigned short __user *)arg);
@@ -513,6 +538,31 @@ int blkdev_ioctl(struct block_device *bd
 				BLKDEV_DISCARD_SECURE);
 	case BLKZEROOUT:
 		return blk_ioctl_zeroout(bdev, mode, arg);
+	case BLKCOPY: {
+		uint64_t range[4];
+		int ret;
+
+		range[3] = 0;
+
+		if (copy_to_user((void __user *)(arg + 24), &range[3], 8))
+			return -EFAULT;
+
+		if (!(mode & FMODE_WRITE))
+			return -EBADF;
+
+		if (copy_from_user(range, (void __user *)arg, 24))
+			return -EFAULT;
+
+		ret = blk_ioctl_copy(bdev, range[0], range[1], range[2]);
+		if (!ret) {
+			range[3] = range[2];
+			if (copy_to_user((void __user *)(arg + 24), &range[3], 8))
+				return -EFAULT;
+		}
+
+		return ret;
+	}
+
 	case HDIO_GETGEO:
 		return blkdev_getgeo(bdev, argp);
 	case BLKRAGET:
Index: linux-4.4-rc4/drivers/scsi/scsi.c
===================================================================
--- linux-4.4-rc4.orig/drivers/scsi/scsi.c	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/drivers/scsi/scsi.c	2015-12-10 17:04:30.000000000 +0100
@@ -768,6 +768,62 @@ int scsi_get_vpd_page(struct scsi_device
 EXPORT_SYMBOL_GPL(scsi_get_vpd_page);
 
 /**
+ * scsi_lookup_naa - Lookup NAA descriptor in VPD page 0x83
+ * @sdev: The device to ask
+ *
+ * Copy offloading requires us to know the NAA descriptor for both
+ * source and target device. This descriptor is mandatory in the Device
+ * Identification VPD page. Locate this descriptor in the returned VPD
+ * data so we don't have to do lookups for every copy command.
+ */
+static void scsi_lookup_naa(struct scsi_device *sdev)
+{
+	unsigned char *buf = sdev->vpd_pg83;
+	unsigned int len = sdev->vpd_pg83_len;
+
+	if (buf[1] != 0x83 || get_unaligned_be16(&buf[2]) == 0) {
+		sdev_printk(KERN_ERR, sdev,
+			    "%s: VPD page 0x83 contains no descriptors\n",
+			    __func__);
+		return;
+	}
+
+	buf += 4;
+	len -= 4;
+
+	do {
+		unsigned int desig_len = buf[3] + 4;
+
+		/* Binary code set */
+		if ((buf[0] & 0xf) != 1)
+			goto skip;
+
+		/* Target association */
+		if ((buf[1] >> 4) & 0x3)
+			goto skip;
+
+		/* NAA designator */
+		if ((buf[1] & 0xf) != 0x3)
+			goto skip;
+
+		sdev->naa = buf;
+		sdev->naa_len = desig_len;
+
+		return;
+
+	skip:
+		buf += desig_len;
+		len -= desig_len;
+
+	} while (len > 0);
+
+	sdev_printk(KERN_ERR, sdev,
+		    "%s: VPD page 0x83 NAA descriptor not found\n", __func__);
+
+	return;
+}
+
+/**
  * scsi_attach_vpd - Attach Vital Product Data to a SCSI device structure
  * @sdev: The device to ask
  *
@@ -851,6 +907,7 @@ retry_pg83:
 		}
 		sdev->vpd_pg83_len = result;
 		sdev->vpd_pg83 = vpd_buf;
+		scsi_lookup_naa(sdev);
 	}
 }
 
Index: linux-4.4-rc4/drivers/scsi/sd.c
===================================================================
--- linux-4.4-rc4.orig/drivers/scsi/sd.c	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/drivers/scsi/sd.c	2015-12-10 17:04:30.000000000 +0100
@@ -101,6 +101,7 @@ MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC);
 
 static void sd_config_discard(struct scsi_disk *, unsigned int);
 static void sd_config_write_same(struct scsi_disk *);
+static void sd_config_copy(struct scsi_disk *);
 static int  sd_revalidate_disk(struct gendisk *);
 static void sd_unlock_native_capacity(struct gendisk *disk);
 static int  sd_probe(struct device *);
@@ -479,6 +480,48 @@ max_write_same_blocks_store(struct devic
 }
 static DEVICE_ATTR_RW(max_write_same_blocks);
 
+static ssize_t
+max_copy_blocks_show(struct device *dev, struct device_attribute *attr,
+		     char *buf)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+	return snprintf(buf, 20, "%u\n", sdkp->max_copy_blocks);
+}
+
+static ssize_t
+max_copy_blocks_store(struct device *dev, struct device_attribute *attr,
+		      const char *buf, size_t count)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+	struct scsi_device *sdp = sdkp->device;
+	unsigned long max;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (sdp->type != TYPE_DISK)
+		return -EINVAL;
+
+	err = kstrtoul(buf, 10, &max);
+
+	if (err)
+		return err;
+
+	if (max == 0)
+		sdp->no_copy = 1;
+	else if (max <= SD_MAX_COPY_BLOCKS) {
+		sdp->no_copy = 0;
+		sdkp->max_copy_blocks = max;
+	}
+
+	sd_config_copy(sdkp);
+
+	return count;
+}
+static DEVICE_ATTR_RW(max_copy_blocks);
+
 static struct attribute *sd_disk_attrs[] = {
 	&dev_attr_cache_type.attr,
 	&dev_attr_FUA.attr,
@@ -490,6 +533,7 @@ static struct attribute *sd_disk_attrs[]
 	&dev_attr_thin_provisioning.attr,
 	&dev_attr_provisioning_mode.attr,
 	&dev_attr_max_write_same_blocks.attr,
+	&dev_attr_max_copy_blocks.attr,
 	&dev_attr_max_medium_access_timeouts.attr,
 	NULL,
 };
@@ -879,6 +923,116 @@ static int sd_setup_write_same_cmnd(stru
 	return ret;
 }
 
+static void sd_config_copy(struct scsi_disk *sdkp)
+{
+	struct request_queue *q = sdkp->disk->queue;
+	unsigned int logical_block_size = sdkp->device->sector_size;
+
+	if (sdkp->device->no_copy)
+		sdkp->max_copy_blocks = 0;
+
+	/* Segment descriptor 0x02 has a 64k block limit */
+	sdkp->max_copy_blocks = min(sdkp->max_copy_blocks,
+				    (u32)SD_MAX_CSD2_BLOCKS);
+
+	blk_queue_max_copy_sectors(q, sdkp->max_copy_blocks *
+				   (logical_block_size >> 9));
+}
+
+static int sd_setup_copy_cmnd(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_device *src_sdp, *dst_sdp;
+	struct gendisk *src_disk;
+	struct request_queue *src_queue, *dst_queue;
+	sector_t src_lba, dst_lba;
+	unsigned int nr_blocks, buf_len, nr_bytes = blk_rq_bytes(rq);
+	int ret;
+	struct bio *bio = rq->bio;
+	struct page *page;
+	unsigned char *buf;
+
+	if (!bio->bi_copy)
+		return BLKPREP_KILL;
+
+	dst_sdp = scsi_disk(rq->rq_disk)->device;
+	dst_queue = rq->rq_disk->queue;
+	src_disk = bio->bi_copy->bic_bdev->bd_disk;
+	src_queue = src_disk->queue;
+	if (!src_queue ||
+	    src_queue->make_request_fn != dst_queue->make_request_fn ||
+	    src_queue->request_fn != dst_queue->request_fn ||
+	    *(struct scsi_driver **)rq->rq_disk->private_data !=
+	    *(struct scsi_driver **)src_disk->private_data)
+		return BLKPREP_KILL;
+	src_sdp = scsi_disk(src_disk)->device;
+
+	if (src_sdp->no_copy || dst_sdp->no_copy)
+		return BLKPREP_KILL;
+
+	if (src_sdp->sector_size != dst_sdp->sector_size)
+		return BLKPREP_KILL;
+
+	dst_lba = blk_rq_pos(rq) >> (ilog2(dst_sdp->sector_size) - 9);
+	src_lba = bio->bi_copy->bic_sector >> (ilog2(src_sdp->sector_size) - 9);
+	nr_blocks = blk_rq_sectors(rq) >> (ilog2(dst_sdp->sector_size) - 9);
+
+	page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+	if (!page)
+		return BLKPREP_DEFER;
+
+	buf = page_address(page);
+
+	/* Extended Copy (LID1) Parameter List (16 bytes) */
+	buf[0] = 0;				/* LID */
+	buf[1] = 3 << 3;			/* LID usage 11b */
+	put_unaligned_be16(32 + 32, &buf[2]);	/* 32 bytes per E4 desc. */
+	put_unaligned_be32(28, &buf[8]);	/* 28 bytes per B2B desc. */
+	buf += 16;
+
+	/* Source CSCD (32 bytes) */
+	buf[0] = 0xe4;				/* Identification desc. */
+	memcpy(&buf[4], src_sdp->naa, src_sdp->naa_len);
+	buf += 32;
+
+	/* Destination CSCD (32 bytes) */
+	buf[0] = 0xe4;				/* Identification desc. */
+	memcpy(&buf[4], dst_sdp->naa, dst_sdp->naa_len);
+	buf += 32;
+
+	/* Segment descriptor (28 bytes) */
+	buf[0] = 0x02;				/* Block to block desc. */
+	put_unaligned_be16(0x18, &buf[2]);	/* Descriptor length */
+	put_unaligned_be16(0, &buf[4]);		/* Source is desc. 0 */
+	put_unaligned_be16(1, &buf[6]);		/* Dest. is desc. 1 */
+	put_unaligned_be16(nr_blocks, &buf[10]);
+	put_unaligned_be64(src_lba, &buf[12]);
+	put_unaligned_be64(dst_lba, &buf[20]);
+
+	/* CDB */
+	cmd->cmd_len = 16;
+	memset(cmd->cmnd, 0, cmd->cmd_len);
+	cmd->cmnd[0] = EXTENDED_COPY;
+	cmd->cmnd[1] = 0; /* LID1 */
+	buf_len = 16 + 32 + 32 + 28;
+	put_unaligned_be32(buf_len, &cmd->cmnd[10]);
+	rq->timeout = SD_COPY_TIMEOUT;
+
+	rq->completion_data = page;
+	blk_add_request_payload(rq, page, buf_len);
+
+	cmd->transfersize = buf_len;
+	cmd->allowed = 0;	/* don't retry */
+
+	rq->__data_len = buf_len;
+	ret = scsi_init_io(cmd);
+	rq->__data_len = nr_bytes;
+
+	if (ret != BLKPREP_OK)
+		__free_page(page);
+	return ret;
+}
+
 static int sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
 {
 	struct request *rq = cmd->request;
@@ -1141,6 +1295,8 @@ static int sd_init_command(struct scsi_c
 		return sd_setup_discard_cmnd(cmd);
 	else if (rq->cmd_flags & REQ_WRITE_SAME)
 		return sd_setup_write_same_cmnd(cmd);
+	else if (rq->cmd_flags & REQ_COPY)
+		return sd_setup_copy_cmnd(cmd);
 	else if (rq->cmd_flags & REQ_FLUSH)
 		return sd_setup_flush_cmnd(cmd);
 	else
@@ -1151,7 +1307,7 @@ static void sd_uninit_command(struct scs
 {
 	struct request *rq = SCpnt->request;
 
-	if (rq->cmd_flags & REQ_DISCARD)
+	if (rq->cmd_flags & (REQ_DISCARD | REQ_COPY))
 		__free_page(rq->completion_data);
 
 	if (SCpnt->cmnd != rq->cmd) {
@@ -1768,7 +1924,8 @@ static int sd_done(struct scsi_cmnd *SCp
 	unsigned char op = SCpnt->cmnd[0];
 	unsigned char unmap = SCpnt->cmnd[1] & 8;
 
-	if (req->cmd_flags & REQ_DISCARD || req->cmd_flags & REQ_WRITE_SAME) {
+	if (req->cmd_flags & REQ_DISCARD || req->cmd_flags & REQ_WRITE_SAME ||
+	    req->cmd_flags & REQ_COPY) {
 		if (!result) {
 			good_bytes = blk_rq_bytes(req);
 			scsi_set_resid(SCpnt, 0);
@@ -1815,6 +1972,16 @@ static int sd_done(struct scsi_cmnd *SCp
 		/* INVALID COMMAND OPCODE or INVALID FIELD IN CDB */
 		if (sshdr.asc == 0x20 || sshdr.asc == 0x24) {
 			switch (op) {
+			case EXTENDED_COPY:
+				if ((SCpnt->cmnd[1] & 0x1f) == 0) {
+					sdkp->device->no_copy = 1;
+					sd_config_copy(sdkp);
+
+					good_bytes = 0;
+					req->__data_len = blk_rq_bytes(req);
+					req->cmd_flags |= REQ_QUIET;
+				}
+				break;
 			case UNMAP:
 				sd_config_discard(sdkp, SD_LBP_DISABLE);
 				break;
@@ -2797,6 +2964,105 @@ static void sd_read_write_same(struct sc
 		sdkp->ws10 = 1;
 }
 
+static void sd_read_copy_operations(struct scsi_disk *sdkp,
+				    unsigned char *buffer)
+{
+	struct scsi_device *sdev = sdkp->device;
+	struct scsi_sense_hdr sshdr;
+	unsigned char cdb[16];
+	unsigned int result, len, i;
+	bool b2b_desc = false, id_desc = false;
+
+	if (sdev->naa_len == 0)
+		return;
+
+	/* Verify that the device has 3PC set in INQUIRY response */
+	if (sdev->inquiry_len < 6 || (sdev->inquiry[5] & (1 << 3)) == 0)
+		return;
+
+	/* Receive Copy Operation Parameters */
+	memset(cdb, 0, 16);
+	cdb[0] = RECEIVE_COPY_RESULTS;
+	cdb[1] = 0x3;
+	put_unaligned_be32(SD_BUF_SIZE, &cdb[10]);
+
+	memset(buffer, 0, SD_BUF_SIZE);
+	result = scsi_execute_req(sdev, cdb, DMA_FROM_DEVICE,
+				  buffer, SD_BUF_SIZE, &sshdr,
+				  SD_TIMEOUT, SD_MAX_RETRIES, NULL);
+
+	if (!scsi_status_is_good(result)) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Receive Copy Operating Parameters failed\n",
+			  __func__);
+		return;
+	}
+
+	/* The RCOP response is a minimum of 44 bytes long. First 4
+	 * bytes contain the length of the remaining buffer, i.e. 40+
+	 * bytes. Trailing the defined fields is a list of supported
+	 * descriptors. We need at least 2 descriptors to drive the
+	 * target, hence 42.
+	 */
+	len = get_unaligned_be32(&buffer[0]);
+	if (len < 42) {
+		sd_printk(KERN_ERR, sdkp, "%s: result too short (%u)\n",
+			  __func__, len);
+		return;
+	}
+
+	if ((buffer[4] & 1) == 0) {
+		sd_printk(KERN_ERR, sdkp, "%s: does not support SNLID\n",
+			  __func__);
+		return;
+	}
+
+	if (get_unaligned_be16(&buffer[8]) < 2) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Need 2 or more CSCD descriptors\n", __func__);
+		return;
+	}
+
+	if (get_unaligned_be16(&buffer[10]) < 1) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Need 1 or more segment descriptor\n", __func__);
+		return;
+	}
+
+	if (len - 40 != buffer[43]) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Buffer len and descriptor count mismatch " \
+			  "(%u vs. %u)\n", __func__, len - 40, buffer[43]);
+		return;
+	}
+
+	for (i = 44 ; i < len + 4 ; i++) {
+		if (buffer[i] == 0x02)
+			b2b_desc = true;
+
+		if (buffer[i] == 0xe4)
+			id_desc = true;
+	}
+
+	if (!b2b_desc) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: No block 2 block descriptor (0x02)\n",
+			  __func__);
+		return;
+	}
+
+	if (!id_desc) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: No identification descriptor (0xE4)\n",
+			  __func__);
+		return;
+	}
+
+	sdkp->max_copy_blocks = get_unaligned_be32(&buffer[16])
+		>> ilog2(sdev->sector_size);
+	sd_config_copy(sdkp);
+}
+
 static int sd_try_extended_inquiry(struct scsi_device *sdp)
 {
 	/* Attempt VPD inquiry if the device blacklist explicitly calls
@@ -2868,6 +3134,7 @@ static int sd_revalidate_disk(struct gen
 		sd_read_cache_type(sdkp, buffer);
 		sd_read_app_tag_own(sdkp, buffer);
 		sd_read_write_same(sdkp, buffer);
+		sd_read_copy_operations(sdkp, buffer);
 	}
 
 	sdkp->first_scan = 0;
Index: linux-4.4-rc4/drivers/scsi/sd.h
===================================================================
--- linux-4.4-rc4.orig/drivers/scsi/sd.h	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/drivers/scsi/sd.h	2015-12-10 17:04:30.000000000 +0100
@@ -19,6 +19,7 @@
  */
 #define SD_FLUSH_TIMEOUT_MULTIPLIER	2
 #define SD_WRITE_SAME_TIMEOUT	(120 * HZ)
+#define SD_COPY_TIMEOUT		(120 * HZ)
 
 /*
  * Number of allowed retries
@@ -48,6 +49,8 @@ enum {
 	SD_MAX_XFER_BLOCKS = 0xffffffff,
 	SD_MAX_WS10_BLOCKS = 0xffff,
 	SD_MAX_WS16_BLOCKS = 0x7fffff,
+	SD_MAX_CSD2_BLOCKS = 0xffff,
+	SD_MAX_COPY_BLOCKS = 0xffffffff,
 };
 
 enum {
@@ -70,6 +73,7 @@ struct scsi_disk {
 	u32		opt_xfer_blocks;
 	u32		max_ws_blocks;
 	u32		max_unmap_blocks;
+	u32		max_copy_blocks;
 	u32		unmap_granularity;
 	u32		unmap_alignment;
 	u32		index;
Index: linux-4.4-rc4/include/linux/bio.h
===================================================================
--- linux-4.4-rc4.orig/include/linux/bio.h	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/include/linux/bio.h	2015-12-10 17:04:30.000000000 +0100
@@ -106,7 +106,7 @@ static inline bool bio_has_data(struct b
 {
 	if (bio &&
 	    bio->bi_iter.bi_size &&
-	    !(bio->bi_rw & REQ_DISCARD))
+	    !(bio->bi_rw & (REQ_DISCARD | REQ_COPY)))
 		return true;
 
 	return false;
@@ -249,8 +249,8 @@ static inline unsigned bio_segments(stru
 	struct bvec_iter iter;
 
 	/*
-	 * We special case discard/write same, because they interpret bi_size
-	 * differently:
+	 * We special case discard/write same/copy, because they
+	 * interpret bi_size differently:
 	 */
 
 	if (bio->bi_rw & REQ_DISCARD)
@@ -259,6 +259,9 @@ static inline unsigned bio_segments(stru
 	if (bio->bi_rw & REQ_WRITE_SAME)
 		return 1;
 
+	if (bio->bi_rw & REQ_COPY)
+		return 1;
+
 	bio_for_each_segment(bv, bio, iter)
 		segs++;
 
Index: linux-4.4-rc4/include/linux/blk_types.h
===================================================================
--- linux-4.4-rc4.orig/include/linux/blk_types.h	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/include/linux/blk_types.h	2015-12-10 17:04:30.000000000 +0100
@@ -39,6 +39,11 @@ struct bvec_iter {
 						   current bvec */
 };
 
+struct bio_copy {
+	struct block_device	*bic_bdev;
+	sector_t		bic_sector;
+};
+
 /*
  * main unit of I/O for the block layer and lower layers (ie drivers and
  * stacking drivers)
@@ -84,6 +89,7 @@ struct bio {
 		struct bio_integrity_payload *bi_integrity; /* data integrity */
 #endif
 	};
+	struct bio_copy		*bi_copy; 	/* TODO, use bi_integrity */
 
 	unsigned short		bi_vcnt;	/* how many bio_vec's */
 
@@ -156,6 +162,7 @@ enum rq_flag_bits {
 	__REQ_DISCARD,		/* request to discard sectors */
 	__REQ_SECURE,		/* secure discard (used with __REQ_DISCARD) */
 	__REQ_WRITE_SAME,	/* write same block many times */
+	__REQ_COPY,		/* copy block range */
 
 	__REQ_NOIDLE,		/* don't anticipate more IO after this one */
 	__REQ_INTEGRITY,	/* I/O includes block integrity payload */
@@ -201,6 +208,7 @@ enum rq_flag_bits {
 #define REQ_PRIO		(1ULL << __REQ_PRIO)
 #define REQ_DISCARD		(1ULL << __REQ_DISCARD)
 #define REQ_WRITE_SAME		(1ULL << __REQ_WRITE_SAME)
+#define REQ_COPY		(1ULL << __REQ_COPY)
 #define REQ_NOIDLE		(1ULL << __REQ_NOIDLE)
 #define REQ_INTEGRITY		(1ULL << __REQ_INTEGRITY)
 
@@ -209,14 +217,14 @@ enum rq_flag_bits {
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
 	 REQ_DISCARD | REQ_WRITE_SAME | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | \
-	 REQ_SECURE | REQ_INTEGRITY)
+	 REQ_SECURE | REQ_INTEGRITY | REQ_COPY)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
 
-#define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME)
+#define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME|REQ_COPY)
 
 /* This mask is used for both bio and request merge checking */
 #define REQ_NOMERGE_FLAGS \
-	(REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | REQ_FUA | REQ_FLUSH_SEQ)
+	(REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | REQ_FUA | REQ_FLUSH_SEQ | REQ_COPY)
 
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_THROTTLED		(1ULL << __REQ_THROTTLED)
Index: linux-4.4-rc4/include/linux/blkdev.h
===================================================================
--- linux-4.4-rc4.orig/include/linux/blkdev.h	2015-12-10 17:04:01.000000000 +0100
+++ linux-4.4-rc4/include/linux/blkdev.h	2015-12-10 17:04:30.000000000 +0100
@@ -265,6 +265,7 @@ struct queue_limits {
 	unsigned int		max_discard_sectors;
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
+	unsigned int		max_copy_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -968,6 +969,8 @@ extern void blk_queue_max_discard_sector
 		unsigned int max_discard_sectors);
 extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
+extern void blk_queue_max_copy_sectors(struct request_queue *q,
+		unsigned int max_copy_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned short);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
@@ -1137,6 +1140,8 @@ extern int blkdev_issue_discard(struct b
 		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
 extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
+extern int blkdev_issue_copy(struct block_device *, sector_t,
+		struct block_device *, sector_t, unsigned int, gfp_t);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, bool discard);
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
@@ -1340,6 +1345,16 @@ static inline unsigned int bdev_write_sa
 	return 0;
 }
 
+static inline unsigned int bdev_copy_offload(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return q->limits.max_copy_sectors;
+
+	return 0;
+}
+
 static inline int queue_dma_alignment(struct request_queue *q)
 {
 	return q ? q->dma_alignment : 511;
Index: linux-4.4-rc4/include/scsi/scsi_device.h
===================================================================
--- linux-4.4-rc4.orig/include/scsi/scsi_device.h	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/include/scsi/scsi_device.h	2015-12-10 17:04:30.000000000 +0100
@@ -120,6 +120,8 @@ struct scsi_device {
 	unsigned char *vpd_pg83;
 	int vpd_pg80_len;
 	unsigned char *vpd_pg80;
+	unsigned char naa_len;
+	unsigned char *naa;
 	unsigned char current_tag;	/* current tag */
 	struct scsi_target      *sdev_target;   /* used only for single_lun */
 
@@ -150,6 +152,7 @@ struct scsi_device {
 	unsigned use_10_for_ms:1; /* first try 10-byte mode sense/select */
 	unsigned no_report_opcodes:1;	/* no REPORT SUPPORTED OPERATION CODES */
 	unsigned no_write_same:1;	/* no WRITE SAME command */
+	unsigned no_copy:1;		/* no copy offload */
 	unsigned use_16_for_rw:1; /* Use read/write(16) over read/write(10) */
 	unsigned skip_ms_page_8:1;	/* do not use MODE SENSE page 0x08 */
 	unsigned skip_ms_page_3f:1;	/* do not use MODE SENSE page 0x3f */
Index: linux-4.4-rc4/include/uapi/linux/fs.h
===================================================================
--- linux-4.4-rc4.orig/include/uapi/linux/fs.h	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/include/uapi/linux/fs.h	2015-12-10 17:04:30.000000000 +0100
@@ -152,6 +152,7 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+#define BLKCOPY _IO(0x12,128)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
Index: linux-4.4-rc4/block/compat_ioctl.c
===================================================================
--- linux-4.4-rc4.orig/block/compat_ioctl.c	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/block/compat_ioctl.c	2015-12-10 17:04:30.000000000 +0100
@@ -697,6 +697,7 @@ long compat_blkdev_ioctl(struct file *fi
 	 * but we call blkdev_ioctl, which gets the lock for us
 	 */
 	case BLKRRPART:
+	case BLKCOPY:
 		return blkdev_ioctl(bdev, mode, cmd,
 				(unsigned long)compat_ptr(arg));
 	case BLKBSZSET_32:
Index: linux-4.4-rc4/block/bio.c
===================================================================
--- linux-4.4-rc4.orig/block/bio.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/bio.c	2015-12-10 17:04:30.000000000 +0100
@@ -238,6 +238,8 @@ static void __bio_free(struct bio *bio)
 {
 	bio_disassociate_task(bio);
 
+	kfree(bio->bi_copy);
+
 	if (bio_integrity(bio))
 		bio_integrity_free(bio);
 }


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/15] block copy: initial XCOPY offload support
@ 2015-12-10 17:30   ` Mikulas Patocka
  0 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2015-12-10 17:30 UTC (permalink / raw)
  To: James E.J. Bottomley, Martin K. Petersen, Jens Axboe,
	Mike Snitzer, Jonathan Brassow
  Cc: linux-block, dm-devel, linux-kernel, linux-scsi

This is Martin Petersen's xcopy patch
(https://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=xcopy&id=0bdeed274e16b3038a851552188512071974eea8)
with some bug fixes, ported to the current kernel.

This patch makes it possible to use the SCSI XCOPY command.

We create a bio that has the REQ_COPY flag set in bi_rw and a bi_copy
structure that defines the source device and sector. The target device and
sector are defined by bi_bdev and bi_iter.bi_sector.

There is a new BLKCOPY ioctl that makes it possible to use XCOPY from
userspace. The ioctl argument is a pointer to an array of four uint64_t
values.

The first value is the source byte offset, the second value is the
destination byte offset, and the third value is the byte length. The fourth
value is written by the kernel and reports the number of bytes that the
kernel actually copied.
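
For illustration, here is a minimal userspace sketch of the ioctl (the
device path and offsets are hypothetical, and BLKCOPY only exists once
this patch is applied):

	#include <stdio.h>
	#include <stdint.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>	/* BLKCOPY, with this patch applied */

	int main(void)
	{
		uint64_t range[4];
		int fd = open("/dev/sdx", O_RDWR);	/* hypothetical disk */

		if (fd < 0)
			return 1;

		range[0] = 0;		/* source byte offset */
		range[1] = 1 << 20;	/* destination byte offset */
		range[2] = 1 << 20;	/* length; all three must be multiples of 512 */
		range[3] = 0;		/* written back by the kernel */

		if (ioctl(fd, BLKCOPY, range))
			perror("BLKCOPY");
		else
			printf("copied %llu bytes\n",
			       (unsigned long long)range[3]);

		close(fd);
		return 0;
	}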

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 Documentation/ABI/testing/sysfs-block |    9 +
 block/bio.c                           |    2 
 block/blk-core.c                      |    5 
 block/blk-lib.c                       |   95 +++++++++++
 block/blk-merge.c                     |   11 -
 block/blk-settings.c                  |   13 +
 block/blk-sysfs.c                     |   11 +
 block/compat_ioctl.c                  |    1 
 block/ioctl.c                         |   50 ++++++
 drivers/scsi/scsi.c                   |   57 +++++++
 drivers/scsi/sd.c                     |  271 +++++++++++++++++++++++++++++++++-
 drivers/scsi/sd.h                     |    4 
 include/linux/bio.h                   |    9 -
 include/linux/blk_types.h             |   14 +
 include/linux/blkdev.h                |   15 +
 include/scsi/scsi_device.h            |    3 
 include/uapi/linux/fs.h               |    1 
 17 files changed, 557 insertions(+), 14 deletions(-)

Index: linux-4.4-rc4/Documentation/ABI/testing/sysfs-block
===================================================================
--- linux-4.4-rc4.orig/Documentation/ABI/testing/sysfs-block	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/Documentation/ABI/testing/sysfs-block	2015-12-10 17:04:30.000000000 +0100
@@ -235,3 +235,12 @@ Description:
 		write_same_max_bytes is 0, write same is not supported
 		by the device.
 
+
+What:		/sys/block/<disk>/queue/copy_max_bytes
+Date:		January 2014
+Contact:	Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+		Devices that support copy offloading will set this value
+		to indicate the maximum buffer size in bytes that can be
+		copied in one operation. If copy_max_bytes is 0, the
+		device does not support copy offload.
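
As a quick illustration (the disk name is hypothetical), the limit can be
read back like any other queue attribute; a value of 0 means the device
does not offload copies:

	#include <stdio.h>

	int main(void)
	{
		unsigned long long max_bytes = 0;
		FILE *f = fopen("/sys/block/sdx/queue/copy_max_bytes", "r");

		if (!f)
			return 1;
		if (fscanf(f, "%llu", &max_bytes) == 1)
			printf("copy_max_bytes: %llu\n", max_bytes);
		fclose(f);
		return 0;
	}
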
Index: linux-4.4-rc4/block/blk-core.c
===================================================================
--- linux-4.4-rc4.orig/block/blk-core.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/blk-core.c	2015-12-10 17:04:30.000000000 +0100
@@ -1957,6 +1957,11 @@ generic_make_request_checks(struct bio *
 		goto end_io;
 	}
 
+	if (bio->bi_rw & REQ_COPY && !bdev_copy_offload(bio->bi_bdev)) {
+		err = -EOPNOTSUPP;
+		goto end_io;
+	}
+
 	/*
 	 * Various block parts want %current->io_context and lazy ioc
 	 * allocation ends up trading a lot of pain for a small amount of
Index: linux-4.4-rc4/block/blk-lib.c
===================================================================
--- linux-4.4-rc4.orig/block/blk-lib.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/blk-lib.c	2015-12-10 17:04:30.000000000 +0100
@@ -299,3 +299,98 @@ int blkdev_issue_zeroout(struct block_de
 	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_issue_copy - queue a copy operation
+ * @src_bdev:	source blockdev
+ * @src_sector:	source sector
+ * @dst_bdev:	destination blockdev
+ * @dst_sector: destination sector
+ * @nr_sects:	number of sectors to copy
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *    Copy a block range from source device to target device.
+ */
+int blkdev_issue_copy(struct block_device *src_bdev, sector_t src_sector,
+		      struct block_device *dst_bdev, sector_t dst_sector,
+		      unsigned int nr_sects, gfp_t gfp_mask)
+{
+	DECLARE_COMPLETION_ONSTACK(wait);
+	struct request_queue *sq = bdev_get_queue(src_bdev);
+	struct request_queue *dq = bdev_get_queue(dst_bdev);
+	unsigned int max_copy_sectors;
+	struct bio_batch bb;
+	int ret = 0;
+
+	if (!sq || !dq)
+		return -ENXIO;
+
+	max_copy_sectors = min(sq->limits.max_copy_sectors,
+			       dq->limits.max_copy_sectors);
+
+	if (max_copy_sectors == 0)
+		return -EOPNOTSUPP;
+
+	if (src_sector + nr_sects < src_sector ||
+	    dst_sector + nr_sects < dst_sector)
+		return -EINVAL;
+
+	/* Do not support overlapping copies */
+	if (src_bdev == dst_bdev &&
+	    abs((u64)dst_sector - (u64)src_sector) < nr_sects)
+		return -EOPNOTSUPP;
+
+	atomic_set(&bb.done, 1);
+	bb.error = 0;
+	bb.wait = &wait;
+
+	while (nr_sects) {
+		struct bio *bio;
+		struct bio_copy *bc;
+		unsigned int chunk;
+
+		bc = kmalloc(sizeof(struct bio_copy), gfp_mask);
+		if (!bc) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		bio = bio_alloc(gfp_mask, 1);
+		if (!bio) {
+			kfree(bc);
+			ret = -ENOMEM;
+			break;
+		}
+
+		chunk = min(nr_sects, max_copy_sectors);
+
+		bio->bi_iter.bi_sector = dst_sector;
+		bio->bi_iter.bi_size = chunk << 9;
+		bio->bi_end_io = bio_batch_end_io;
+		bio->bi_bdev = dst_bdev;
+		bio->bi_private = &bb;
+		bio->bi_copy = bc;
+
+		bc->bic_bdev = src_bdev;
+		bc->bic_sector = src_sector;
+
+		atomic_inc(&bb.done);
+		submit_bio(REQ_WRITE | REQ_COPY, bio);
+
+		src_sector += chunk;
+		dst_sector += chunk;
+		nr_sects -= chunk;
+	}
+
+	/* Wait for bios in-flight */
+	if (!atomic_dec_and_test(&bb.done))
+		wait_for_completion_io(&wait);
+
+	if (likely(!ret))
+		ret = bb.error;
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_copy);
+
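
Since blkdev_issue_copy() returns -EOPNOTSUPP when no offload path exists
between the two devices, a hypothetical in-kernel caller would wrap it
with a fallback along these lines (copy_manually() is an illustrative
stand-in, not part of this patch):

	#include <linux/blkdev.h>

	/* hypothetical fallback, e.g. a read/write loop over pages */
	extern int copy_manually(struct block_device *src, sector_t src_sector,
				 struct block_device *dst, sector_t dst_sector,
				 unsigned int nr_sects);

	static int copy_extent(struct block_device *src, sector_t src_sector,
			       struct block_device *dst, sector_t dst_sector,
			       unsigned int nr_sects)
	{
		int r = blkdev_issue_copy(src, src_sector, dst, dst_sector,
					  nr_sects, GFP_KERNEL);

		/* Fall back to copying by hand on any error. */
		if (r)
			r = copy_manually(src, src_sector, dst, dst_sector,
					  nr_sects);
		return r;
	}
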
Index: linux-4.4-rc4/block/blk-merge.c
===================================================================
--- linux-4.4-rc4.orig/block/blk-merge.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/blk-merge.c	2015-12-10 17:04:30.000000000 +0100
@@ -145,7 +145,9 @@ void blk_queue_split(struct request_queu
 	struct bio *split, *res;
 	unsigned nsegs;
 
-	if ((*bio)->bi_rw & REQ_DISCARD)
+	if ((*bio)->bi_rw & REQ_COPY)
+		return;
+	else if ((*bio)->bi_rw & REQ_DISCARD)
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
 	else if ((*bio)->bi_rw & REQ_WRITE_SAME)
 		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
@@ -185,10 +187,7 @@ static unsigned int __blk_recalc_rq_segm
 	 * This should probably be returning 0, but blk_add_request_payload()
 	 * (Christoph!!!!)
 	 */
-	if (bio->bi_rw & REQ_DISCARD)
-		return 1;
-
-	if (bio->bi_rw & REQ_WRITE_SAME)
+	if (bio->bi_rw & (REQ_DISCARD | REQ_WRITE_SAME | REQ_COPY))
 		return 1;
 
 	fbio = bio;
@@ -361,7 +360,7 @@ static int __blk_bios_map_sg(struct requ
 	nsegs = 0;
 	cluster = blk_queue_cluster(q);
 
-	if (bio->bi_rw & REQ_DISCARD) {
+	if (bio->bi_rw & (REQ_DISCARD | REQ_COPY)) {
 		/*
 		 * This is a hack - drivers should be neither modifying the
 		 * biovec, nor relying on bi_vcnt - but because of
Index: linux-4.4-rc4/block/blk-settings.c
===================================================================
--- linux-4.4-rc4.orig/block/blk-settings.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/blk-settings.c	2015-12-10 17:04:30.000000000 +0100
@@ -95,6 +95,7 @@ void blk_set_default_limits(struct queue
 		BLK_SAFE_MAX_SECTORS;
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
+	lim->max_copy_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -298,6 +299,18 @@ void blk_queue_max_write_same_sectors(st
 EXPORT_SYMBOL(blk_queue_max_write_same_sectors);
 
 /**
+ * blk_queue_max_copy_sectors - set max sectors for a single copy operation
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors per copy operation
+ **/
+void blk_queue_max_copy_sectors(struct request_queue *q,
+				unsigned int max_copy_sectors)
+{
+	q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_copy_sectors);
+
+/**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
  * @max_segments:  max number of segments
Index: linux-4.4-rc4/block/blk-sysfs.c
===================================================================
--- linux-4.4-rc4.orig/block/blk-sysfs.c	2015-12-10 17:04:01.000000000 +0100
+++ linux-4.4-rc4/block/blk-sysfs.c	2015-12-10 17:04:30.000000000 +0100
@@ -193,6 +193,11 @@ static ssize_t queue_write_same_max_show
 		(unsigned long long)q->limits.max_write_same_sectors << 9);
 }
 
+static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n",
+		(unsigned long long)q->limits.max_copy_sectors << 9);
+}
 
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
@@ -443,6 +448,11 @@ static struct queue_sysfs_entry queue_wr
 	.show = queue_write_same_max_show,
 };
 
+static struct queue_sysfs_entry queue_copy_max_entry = {
+	.attr = {.name = "copy_max_bytes", .mode = S_IRUGO },
+	.show = queue_copy_max_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_show_nonrot,
@@ -498,6 +508,7 @@ static struct attribute *default_attrs[]
 	&queue_discard_max_hw_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
+	&queue_copy_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
Index: linux-4.4-rc4/block/ioctl.c
===================================================================
--- linux-4.4-rc4.orig/block/ioctl.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/ioctl.c	2015-12-10 17:04:30.000000000 +0100
@@ -249,6 +249,31 @@ static int blk_ioctl_zeroout(struct bloc
 	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
 }
 
+static int blk_ioctl_copy(struct block_device *bdev, uint64_t src_offset,
+			  uint64_t dst_offset, uint64_t len)
+{
+	if (src_offset & 511)
+		return -EINVAL;
+	if (dst_offset & 511)
+		return -EINVAL;
+	if (len & 511)
+		return -EINVAL;
+	src_offset >>= 9;
+	dst_offset >>= 9;
+	len >>= 9;
+
+	if (unlikely(src_offset + len < src_offset) ||
+	    unlikely(src_offset + len > (i_size_read(bdev->bd_inode) >> 9)))
+		return -EINVAL;
+
+	if (unlikely(dst_offset + len < dst_offset) ||
+	    unlikely(dst_offset + len > (i_size_read(bdev->bd_inode) >> 9)))
+		return -EINVAL;
+
+	return blkdev_issue_copy(bdev, src_offset, bdev, dst_offset, len,
+				 GFP_KERNEL);
+}
+
 static int put_ushort(unsigned long arg, unsigned short val)
 {
 	return put_user(val, (unsigned short __user *)arg);
@@ -513,6 +538,31 @@ int blkdev_ioctl(struct block_device *bd
 				BLKDEV_DISCARD_SECURE);
 	case BLKZEROOUT:
 		return blk_ioctl_zeroout(bdev, mode, arg);
+	case BLKCOPY: {
+		uint64_t range[4];
+		int ret;
+
+		range[3] = 0;
+
+		if (copy_to_user((void __user *)(arg + 24), &range[3], 8))
+			return -EFAULT;
+
+		if (!(mode & FMODE_WRITE))
+			return -EBADF;
+
+		if (copy_from_user(range, (void __user *)arg, 24))
+			return -EFAULT;
+
+		ret = blk_ioctl_copy(bdev, range[0], range[1], range[2]);
+		if (!ret) {
+			range[3] = range[2];
+			if (copy_to_user((void __user *)(arg + 24), &range[3], 8))
+				return -EFAULT;
+		}
+
+		return ret;
+	}
+
 	case HDIO_GETGEO:
 		return blkdev_getgeo(bdev, argp);
 	case BLKRAGET:
Index: linux-4.4-rc4/drivers/scsi/scsi.c
===================================================================
--- linux-4.4-rc4.orig/drivers/scsi/scsi.c	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/drivers/scsi/scsi.c	2015-12-10 17:04:30.000000000 +0100
@@ -768,6 +768,62 @@ int scsi_get_vpd_page(struct scsi_device
 EXPORT_SYMBOL_GPL(scsi_get_vpd_page);
 
 /**
+ * scsi_lookup_naa - Lookup NAA descriptor in VPD page 0x83
+ * @sdev: The device to ask
+ *
+ * Copy offloading requires us to know the NAA descriptor for both
+ * source and target device. This descriptor is mandatory in the Device
+ * Identification VPD page. Locate this descriptor in the returned VPD
+ * data so we don't have to do lookups for every copy command.
+ */
+static void scsi_lookup_naa(struct scsi_device *sdev)
+{
+	unsigned char *buf = sdev->vpd_pg83;
+	unsigned int len = sdev->vpd_pg83_len;
+
+	if (buf[1] != 0x83 || get_unaligned_be16(&buf[2]) == 0) {
+		sdev_printk(KERN_ERR, sdev,
+			    "%s: VPD page 0x83 contains no descriptors\n",
+			    __func__);
+		return;
+	}
+
+	buf += 4;
+	len -= 4;
+
+	do {
+		unsigned int desig_len = buf[3] + 4;
+
+		/* Binary code set */
+		if ((buf[0] & 0xf) != 1)
+			goto skip;
+
+		/* Target association */
+		if ((buf[1] >> 4) & 0x3)
+			goto skip;
+
+		/* NAA designator */
+		if ((buf[1] & 0xf) != 0x3)
+			goto skip;
+
+		sdev->naa = buf;
+		sdev->naa_len = desig_len;
+
+		return;
+
+	skip:
+		buf += desig_len;
+		len -= desig_len;
+
+	} while (len > 0);
+
+	sdev_printk(KERN_ERR, sdev,
+		    "%s: VPD page 0x83 NAA descriptor not found\n", __func__);
+
+	return;
+}
+
+/**
  * scsi_attach_vpd - Attach Vital Product Data to a SCSI device structure
  * @sdev: The device to ask
  *
@@ -851,6 +907,7 @@ retry_pg83:
 		}
 		sdev->vpd_pg83_len = result;
 		sdev->vpd_pg83 = vpd_buf;
+		scsi_lookup_naa(sdev);
 	}
 }
 
Index: linux-4.4-rc4/drivers/scsi/sd.c
===================================================================
--- linux-4.4-rc4.orig/drivers/scsi/sd.c	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/drivers/scsi/sd.c	2015-12-10 17:04:30.000000000 +0100
@@ -101,6 +101,7 @@ MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC);
 
 static void sd_config_discard(struct scsi_disk *, unsigned int);
 static void sd_config_write_same(struct scsi_disk *);
+static void sd_config_copy(struct scsi_disk *);
 static int  sd_revalidate_disk(struct gendisk *);
 static void sd_unlock_native_capacity(struct gendisk *disk);
 static int  sd_probe(struct device *);
@@ -479,6 +480,48 @@ max_write_same_blocks_store(struct devic
 }
 static DEVICE_ATTR_RW(max_write_same_blocks);
 
+static ssize_t
+max_copy_blocks_show(struct device *dev, struct device_attribute *attr,
+		     char *buf)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+	return snprintf(buf, 20, "%u\n", sdkp->max_copy_blocks);
+}
+
+static ssize_t
+max_copy_blocks_store(struct device *dev, struct device_attribute *attr,
+		      const char *buf, size_t count)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+	struct scsi_device *sdp = sdkp->device;
+	unsigned long max;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (sdp->type != TYPE_DISK)
+		return -EINVAL;
+
+	err = kstrtoul(buf, 10, &max);
+
+	if (err)
+		return err;
+
+	if (max == 0)
+		sdp->no_copy = 1;
+	else if (max <= SD_MAX_COPY_BLOCKS) {
+		sdp->no_copy = 0;
+		sdkp->max_copy_blocks = max;
+	}
+
+	sd_config_copy(sdkp);
+
+	return count;
+}
+static DEVICE_ATTR_RW(max_copy_blocks);
+
 static struct attribute *sd_disk_attrs[] = {
 	&dev_attr_cache_type.attr,
 	&dev_attr_FUA.attr,
@@ -490,6 +533,7 @@ static struct attribute *sd_disk_attrs[]
 	&dev_attr_thin_provisioning.attr,
 	&dev_attr_provisioning_mode.attr,
 	&dev_attr_max_write_same_blocks.attr,
+	&dev_attr_max_copy_blocks.attr,
 	&dev_attr_max_medium_access_timeouts.attr,
 	NULL,
 };
@@ -879,6 +923,116 @@ static int sd_setup_write_same_cmnd(stru
 	return ret;
 }
 
+static void sd_config_copy(struct scsi_disk *sdkp)
+{
+	struct request_queue *q = sdkp->disk->queue;
+	unsigned int logical_block_size = sdkp->device->sector_size;
+
+	if (sdkp->device->no_copy)
+		sdkp->max_copy_blocks = 0;
+
+	/* Segment descriptor 0x02 has a 64k block limit */
+	sdkp->max_copy_blocks = min(sdkp->max_copy_blocks,
+				    (u32)SD_MAX_CSD2_BLOCKS);
+
+	blk_queue_max_copy_sectors(q, sdkp->max_copy_blocks *
+				   (logical_block_size >> 9));
+}
+
+static int sd_setup_copy_cmnd(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_device *src_sdp, *dst_sdp;
+	struct gendisk *src_disk;
+	struct request_queue *src_queue, *dst_queue;
+	sector_t src_lba, dst_lba;
+	unsigned int nr_blocks, buf_len, nr_bytes = blk_rq_bytes(rq);
+	int ret;
+	struct bio *bio = rq->bio;
+	struct page *page;
+	unsigned char *buf;
+
+	if (!bio->bi_copy)
+		return BLKPREP_KILL;
+
+	dst_sdp = scsi_disk(rq->rq_disk)->device;
+	dst_queue = rq->rq_disk->queue;
+	src_disk = bio->bi_copy->bic_bdev->bd_disk;
+	src_queue = src_disk->queue;
+	if (!src_queue ||
+	    src_queue->make_request_fn != dst_queue->make_request_fn ||
+	    src_queue->request_fn != dst_queue->request_fn ||
+	    *(struct scsi_driver **)rq->rq_disk->private_data !=
+	    *(struct scsi_driver **)src_disk->private_data)
+		return BLKPREP_KILL;
+	src_sdp = scsi_disk(src_disk)->device;
+
+	if (src_sdp->no_copy || dst_sdp->no_copy)
+		return BLKPREP_KILL;
+
+	if (src_sdp->sector_size != dst_sdp->sector_size)
+		return BLKPREP_KILL;
+
+	dst_lba = blk_rq_pos(rq) >> (ilog2(dst_sdp->sector_size) - 9);
+	src_lba = bio->bi_copy->bic_sector >> (ilog2(src_sdp->sector_size) - 9);
+	nr_blocks = blk_rq_sectors(rq) >> (ilog2(dst_sdp->sector_size) - 9);
+
+	page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+	if (!page)
+		return BLKPREP_DEFER;
+
+	buf = page_address(page);
+
+	/* Extended Copy (LID1) Parameter List (16 bytes) */
+	buf[0] = 0;				/* LID */
+	buf[1] = 3 << 3;			/* LID usage 11b */
+	put_unaligned_be16(32 + 32, &buf[2]);	/* 32 bytes per E4 desc. */
+	put_unaligned_be32(28, &buf[8]);	/* 28 bytes per B2B desc. */
+	buf += 16;
+
+	/* Source CSCD (32 bytes) */
+	buf[0] = 0xe4;				/* Identification desc. */
+	memcpy(&buf[4], src_sdp->naa, src_sdp->naa_len);
+	buf += 32;
+
+	/* Destination CSCD (32 bytes) */
+	buf[0] = 0xe4;				/* Identification desc. */
+	memcpy(&buf[4], dst_sdp->naa, dst_sdp->naa_len);
+	buf += 32;
+
+	/* Segment descriptor (28 bytes) */
+	buf[0] = 0x02;				/* Block to block desc. */
+	put_unaligned_be16(0x18, &buf[2]);	/* Descriptor length */
+	put_unaligned_be16(0, &buf[4]);		/* Source is desc. 0 */
+	put_unaligned_be16(1, &buf[6]);		/* Dest. is desc. 1 */
+	put_unaligned_be16(nr_blocks, &buf[10]);
+	put_unaligned_be64(src_lba, &buf[12]);
+	put_unaligned_be64(dst_lba, &buf[20]);
+
+	/* CDB */
+	cmd->cmd_len = 16;
+	memset(cmd->cmnd, 0, cmd->cmd_len);
+	cmd->cmnd[0] = EXTENDED_COPY;
+	cmd->cmnd[1] = 0; /* LID1 */
+	buf_len = 16 + 32 + 32 + 28;
+	put_unaligned_be32(buf_len, &cmd->cmnd[10]);
+	rq->timeout = SD_COPY_TIMEOUT;
+
+	rq->completion_data = page;
+	blk_add_request_payload(rq, page, buf_len);
+
+	cmd->transfersize = buf_len;
+	cmd->allowed = 0;	/* don't retry */
+
+	rq->__data_len = buf_len;
+	ret = scsi_init_io(cmd);
+	rq->__data_len = nr_bytes;
+
+	if (ret != BLKPREP_OK)
+		__free_page(page);
+	return ret;
+}
+
 static int sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
 {
 	struct request *rq = cmd->request;
@@ -1141,6 +1295,8 @@ static int sd_init_command(struct scsi_c
 		return sd_setup_discard_cmnd(cmd);
 	else if (rq->cmd_flags & REQ_WRITE_SAME)
 		return sd_setup_write_same_cmnd(cmd);
+	else if (rq->cmd_flags & REQ_COPY)
+		return sd_setup_copy_cmnd(cmd);
 	else if (rq->cmd_flags & REQ_FLUSH)
 		return sd_setup_flush_cmnd(cmd);
 	else
@@ -1151,7 +1307,7 @@ static void sd_uninit_command(struct scs
 {
 	struct request *rq = SCpnt->request;
 
-	if (rq->cmd_flags & REQ_DISCARD)
+	if (rq->cmd_flags & (REQ_DISCARD | REQ_COPY))
 		__free_page(rq->completion_data);
 
 	if (SCpnt->cmnd != rq->cmd) {
@@ -1768,7 +1924,8 @@ static int sd_done(struct scsi_cmnd *SCp
 	unsigned char op = SCpnt->cmnd[0];
 	unsigned char unmap = SCpnt->cmnd[1] & 8;
 
-	if (req->cmd_flags & REQ_DISCARD || req->cmd_flags & REQ_WRITE_SAME) {
+	if (req->cmd_flags & REQ_DISCARD || req->cmd_flags & REQ_WRITE_SAME ||
+	    req->cmd_flags & REQ_COPY) {
 		if (!result) {
 			good_bytes = blk_rq_bytes(req);
 			scsi_set_resid(SCpnt, 0);
@@ -1815,6 +1972,16 @@ static int sd_done(struct scsi_cmnd *SCp
 		/* INVALID COMMAND OPCODE or INVALID FIELD IN CDB */
 		if (sshdr.asc == 0x20 || sshdr.asc == 0x24) {
 			switch (op) {
+			case EXTENDED_COPY:
+				if ((SCpnt->cmnd[1] & 0x1f) == 0) {
+					sdkp->device->no_copy = 1;
+					sd_config_copy(sdkp);
+
+					good_bytes = 0;
+					req->__data_len = blk_rq_bytes(req);
+					req->cmd_flags |= REQ_QUIET;
+				}
+				break;
 			case UNMAP:
 				sd_config_discard(sdkp, SD_LBP_DISABLE);
 				break;
@@ -2797,6 +2964,105 @@ static void sd_read_write_same(struct sc
 		sdkp->ws10 = 1;
 }
 
+static void sd_read_copy_operations(struct scsi_disk *sdkp,
+				    unsigned char *buffer)
+{
+	struct scsi_device *sdev = sdkp->device;
+	struct scsi_sense_hdr sshdr;
+	unsigned char cdb[16];
+	unsigned int result, len, i;
+	bool b2b_desc = false, id_desc = false;
+
+	if (sdev->naa_len == 0)
+		return;
+
+	/* Verify that the device has 3PC set in INQUIRY response */
+	if (sdev->inquiry_len < 6 || (sdev->inquiry[5] & (1 << 3)) == 0)
+		return;
+
+	/* Receive Copy Operating Parameters */
+	memset(cdb, 0, 16);
+	cdb[0] = RECEIVE_COPY_RESULTS;
+	cdb[1] = 0x3;
+	put_unaligned_be32(SD_BUF_SIZE, &cdb[10]);
+
+	memset(buffer, 0, SD_BUF_SIZE);
+	result = scsi_execute_req(sdev, cdb, DMA_FROM_DEVICE,
+				  buffer, SD_BUF_SIZE, &sshdr,
+				  SD_TIMEOUT, SD_MAX_RETRIES, NULL);
+
+	if (!scsi_status_is_good(result)) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Receive Copy Operating Parameters failed\n",
+			  __func__);
+		return;
+	}
+
+	/* The RCOP response is a minimum of 44 bytes long. First 4
+	 * bytes contain the length of the remaining buffer, i.e. 40+
+	 * bytes. Trailing the defined fields is a list of supported
+	 * descriptors. We need at least 2 descriptors to drive the
+	 * target, hence 42.
+	 */
+	len = get_unaligned_be32(&buffer[0]);
+	if (len < 42) {
+		sd_printk(KERN_ERR, sdkp, "%s: result too short (%u)\n",
+			  __func__, len);
+		return;
+	}
+
+	if ((buffer[4] & 1) == 0) {
+		sd_printk(KERN_ERR, sdkp, "%s: does not support SNLID\n",
+			  __func__);
+		return;
+	}
+
+	if (get_unaligned_be16(&buffer[8]) < 2) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Need 2 or more CSCD descriptors\n", __func__);
+		return;
+	}
+
+	if (get_unaligned_be16(&buffer[10]) < 1) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Need 1 or more segment descriptor\n", __func__);
+		return;
+	}
+
+	if (len - 40 != buffer[43]) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: Buffer len and descriptor count mismatch " \
+			  "(%u vs. %u)\n", __func__, len - 40, buffer[43]);
+		return;
+	}
+
+	for (i = 44 ; i < len + 4 ; i++) {
+		if (buffer[i] == 0x02)
+			b2b_desc = true;
+
+		if (buffer[i] == 0xe4)
+			id_desc = true;
+	}
+
+	if (!b2b_desc) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: No block 2 block descriptor (0x02)\n",
+			  __func__);
+		return;
+	}
+
+	if (!id_desc) {
+		sd_printk(KERN_ERR, sdkp,
+			  "%s: No identification descriptor (0xE4)\n",
+			  __func__);
+		return;
+	}
+
+	sdkp->max_copy_blocks = get_unaligned_be32(&buffer[16])
+		>> ilog2(sdev->sector_size);
+	sd_config_copy(sdkp);
+}
+
 static int sd_try_extended_inquiry(struct scsi_device *sdp)
 {
 	/* Attempt VPD inquiry if the device blacklist explicitly calls
@@ -2868,6 +3134,7 @@ static int sd_revalidate_disk(struct gen
 		sd_read_cache_type(sdkp, buffer);
 		sd_read_app_tag_own(sdkp, buffer);
 		sd_read_write_same(sdkp, buffer);
+		sd_read_copy_operations(sdkp, buffer);
 	}
 
 	sdkp->first_scan = 0;
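
For readability, the 108-byte parameter list assembled in
sd_setup_copy_cmnd() above can be pictured as the following layout (a
hypothetical struct view; the patch itself fills the buffer with raw
offsets):

	#include <linux/types.h>

	/* illustrative only, not part of the patch */
	struct xcopy_lid1_parameters {
		__u8	list_id;		/* 0 */
		__u8	flags;			/* LID usage 11b (3 << 3) */
		__be16	cscd_desc_list_len;	/* 64: two 32-byte CSCDs */
		__u8	reserved[4];
		__be32	seg_desc_list_len;	/* 28: one B2B descriptor */
		__be32	inline_data_len;	/* 0 */
		__u8	cscd_descs[2][32];	/* 0xE4 ident. CSCDs: src, dst */
		__u8	seg_desc[28];		/* 0x02 desc.: LBAs and count */
	} __attribute__((packed));
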
Index: linux-4.4-rc4/drivers/scsi/sd.h
===================================================================
--- linux-4.4-rc4.orig/drivers/scsi/sd.h	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/drivers/scsi/sd.h	2015-12-10 17:04:30.000000000 +0100
@@ -19,6 +19,7 @@
  */
 #define SD_FLUSH_TIMEOUT_MULTIPLIER	2
 #define SD_WRITE_SAME_TIMEOUT	(120 * HZ)
+#define SD_COPY_TIMEOUT		(120 * HZ)
 
 /*
  * Number of allowed retries
@@ -48,6 +49,8 @@ enum {
 	SD_MAX_XFER_BLOCKS = 0xffffffff,
 	SD_MAX_WS10_BLOCKS = 0xffff,
 	SD_MAX_WS16_BLOCKS = 0x7fffff,
+	SD_MAX_CSD2_BLOCKS = 0xffff,
+	SD_MAX_COPY_BLOCKS = 0xffffffff,
 };
 
 enum {
@@ -70,6 +73,7 @@ struct scsi_disk {
 	u32		opt_xfer_blocks;
 	u32		max_ws_blocks;
 	u32		max_unmap_blocks;
+	u32		max_copy_blocks;
 	u32		unmap_granularity;
 	u32		unmap_alignment;
 	u32		index;
Index: linux-4.4-rc4/include/linux/bio.h
===================================================================
--- linux-4.4-rc4.orig/include/linux/bio.h	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/include/linux/bio.h	2015-12-10 17:04:30.000000000 +0100
@@ -106,7 +106,7 @@ static inline bool bio_has_data(struct b
 {
 	if (bio &&
 	    bio->bi_iter.bi_size &&
-	    !(bio->bi_rw & REQ_DISCARD))
+	    !(bio->bi_rw & (REQ_DISCARD | REQ_COPY)))
 		return true;
 
 	return false;
@@ -249,8 +249,8 @@ static inline unsigned bio_segments(stru
 	struct bvec_iter iter;
 
 	/*
-	 * We special case discard/write same, because they interpret bi_size
-	 * differently:
+	 * We special case discard/write same/copy, because they
+	 * interpret bi_size differently:
 	 */
 
 	if (bio->bi_rw & REQ_DISCARD)
@@ -259,6 +259,9 @@ static inline unsigned bio_segments(stru
 	if (bio->bi_rw & REQ_WRITE_SAME)
 		return 1;
 
+	if (bio->bi_rw & REQ_COPY)
+		return 1;
+
 	bio_for_each_segment(bv, bio, iter)
 		segs++;
 
Index: linux-4.4-rc4/include/linux/blk_types.h
===================================================================
--- linux-4.4-rc4.orig/include/linux/blk_types.h	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/include/linux/blk_types.h	2015-12-10 17:04:30.000000000 +0100
@@ -39,6 +39,11 @@ struct bvec_iter {
 						   current bvec */
 };
 
+struct bio_copy {
+	struct block_device	*bic_bdev;
+	sector_t		bic_sector;
+};
+
 /*
  * main unit of I/O for the block layer and lower layers (ie drivers and
  * stacking drivers)
@@ -84,6 +89,7 @@ struct bio {
 		struct bio_integrity_payload *bi_integrity; /* data integrity */
 #endif
 	};
+	struct bio_copy		*bi_copy; 	/* TODO, use bi_integrity */
 
 	unsigned short		bi_vcnt;	/* how many bio_vec's */
 
@@ -156,6 +162,7 @@ enum rq_flag_bits {
 	__REQ_DISCARD,		/* request to discard sectors */
 	__REQ_SECURE,		/* secure discard (used with __REQ_DISCARD) */
 	__REQ_WRITE_SAME,	/* write same block many times */
+	__REQ_COPY,		/* copy block range */
 
 	__REQ_NOIDLE,		/* don't anticipate more IO after this one */
 	__REQ_INTEGRITY,	/* I/O includes block integrity payload */
@@ -201,6 +208,7 @@ enum rq_flag_bits {
 #define REQ_PRIO		(1ULL << __REQ_PRIO)
 #define REQ_DISCARD		(1ULL << __REQ_DISCARD)
 #define REQ_WRITE_SAME		(1ULL << __REQ_WRITE_SAME)
+#define REQ_COPY		(1ULL << __REQ_COPY)
 #define REQ_NOIDLE		(1ULL << __REQ_NOIDLE)
 #define REQ_INTEGRITY		(1ULL << __REQ_INTEGRITY)
 
@@ -209,14 +217,14 @@ enum rq_flag_bits {
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
 	 REQ_DISCARD | REQ_WRITE_SAME | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | \
-	 REQ_SECURE | REQ_INTEGRITY)
+	 REQ_SECURE | REQ_INTEGRITY | REQ_COPY)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
 
-#define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME)
+#define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME|REQ_COPY)
 
 /* This mask is used for both bio and request merge checking */
 #define REQ_NOMERGE_FLAGS \
-	(REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | REQ_FUA | REQ_FLUSH_SEQ)
+	(REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | REQ_FUA | REQ_FLUSH_SEQ | REQ_COPY)
 
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_THROTTLED		(1ULL << __REQ_THROTTLED)
Index: linux-4.4-rc4/include/linux/blkdev.h
===================================================================
--- linux-4.4-rc4.orig/include/linux/blkdev.h	2015-12-10 17:04:01.000000000 +0100
+++ linux-4.4-rc4/include/linux/blkdev.h	2015-12-10 17:04:30.000000000 +0100
@@ -265,6 +265,7 @@ struct queue_limits {
 	unsigned int		max_discard_sectors;
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
+	unsigned int		max_copy_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -968,6 +969,8 @@ extern void blk_queue_max_discard_sector
 		unsigned int max_discard_sectors);
 extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
+extern void blk_queue_max_copy_sectors(struct request_queue *q,
+		unsigned int max_copy_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned short);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
@@ -1137,6 +1140,8 @@ extern int blkdev_issue_discard(struct b
 		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
 extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
+extern int blkdev_issue_copy(struct block_device *, sector_t,
+		struct block_device *, sector_t, unsigned int, gfp_t);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, bool discard);
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
@@ -1340,6 +1345,16 @@ static inline unsigned int bdev_write_sa
 	return 0;
 }
 
+static inline unsigned int bdev_copy_offload(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return q->limits.max_copy_sectors;
+
+	return 0;
+}
+
 static inline int queue_dma_alignment(struct request_queue *q)
 {
 	return q ? q->dma_alignment : 511;
Index: linux-4.4-rc4/include/scsi/scsi_device.h
===================================================================
--- linux-4.4-rc4.orig/include/scsi/scsi_device.h	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/include/scsi/scsi_device.h	2015-12-10 17:04:30.000000000 +0100
@@ -120,6 +120,8 @@ struct scsi_device {
 	unsigned char *vpd_pg83;
 	int vpd_pg80_len;
 	unsigned char *vpd_pg80;
+	unsigned char naa_len;
+	unsigned char *naa;
 	unsigned char current_tag;	/* current tag */
 	struct scsi_target      *sdev_target;   /* used only for single_lun */
 
@@ -150,6 +152,7 @@ struct scsi_device {
 	unsigned use_10_for_ms:1; /* first try 10-byte mode sense/select */
 	unsigned no_report_opcodes:1;	/* no REPORT SUPPORTED OPERATION CODES */
 	unsigned no_write_same:1;	/* no WRITE SAME command */
+	unsigned no_copy:1;		/* no copy offload */
 	unsigned use_16_for_rw:1; /* Use read/write(16) over read/write(10) */
 	unsigned skip_ms_page_8:1;	/* do not use MODE SENSE page 0x08 */
 	unsigned skip_ms_page_3f:1;	/* do not use MODE SENSE page 0x3f */
Index: linux-4.4-rc4/include/uapi/linux/fs.h
===================================================================
--- linux-4.4-rc4.orig/include/uapi/linux/fs.h	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/include/uapi/linux/fs.h	2015-12-10 17:04:30.000000000 +0100
@@ -152,6 +152,7 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+#define BLKCOPY _IO(0x12,128)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
Index: linux-4.4-rc4/block/compat_ioctl.c
===================================================================
--- linux-4.4-rc4.orig/block/compat_ioctl.c	2015-12-10 17:04:00.000000000 +0100
+++ linux-4.4-rc4/block/compat_ioctl.c	2015-12-10 17:04:30.000000000 +0100
@@ -697,6 +697,7 @@ long compat_blkdev_ioctl(struct file *fi
 	 * but we call blkdev_ioctl, which gets the lock for us
 	 */
 	case BLKRRPART:
+	case BLKCOPY:
 		return blkdev_ioctl(bdev, mode, cmd,
 				(unsigned long)compat_ptr(arg));
 	case BLKBSZSET_32:
Index: linux-4.4-rc4/block/bio.c
===================================================================
--- linux-4.4-rc4.orig/block/bio.c	2015-12-10 17:03:59.000000000 +0100
+++ linux-4.4-rc4/block/bio.c	2015-12-10 17:04:30.000000000 +0100
@@ -238,6 +238,8 @@ static void __bio_free(struct bio *bio)
 {
 	bio_disassociate_task(bio);
 
+	kfree(bio->bi_copy);
+
 	if (bio_integrity(bio))
 		bio_integrity_free(bio);
 }

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-12-10 17:30 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-15 19:34 [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mikulas Patocka
2014-07-15 19:34 ` [PATCH 1/15] block copy: initial XCOPY offload support Mikulas Patocka
2014-07-18 13:03   ` Tomas Henzl
2014-07-18 14:35     ` Mikulas Patocka
2014-08-04 14:09   ` Pavel Machek
2014-08-05 22:45     ` Mikulas Patocka
2014-07-15 19:35 ` [PATCH 2/15] block copy: use two bios Mikulas Patocka
2014-07-15 19:35 ` [PATCH 3/15] block copy: report the amount of copied data Mikulas Patocka
2014-07-15 19:36 ` [PATCH 4/15] block copy: use a timer to fix a theoretical deadlock Mikulas Patocka
2014-07-15 19:37 ` [PATCH 5/15] block copy: use merge_bvec_fn for copies Mikulas Patocka
2014-07-15 19:37 ` [PATCH 6/15] block copy: use asynchronous notification Mikulas Patocka
2014-07-15 19:39 ` [PATCH 7/15] dm: remove num_write_bios Mikulas Patocka
2014-07-15 19:39 ` [PATCH 8/15] dm: introduce dm_ask_for_duplicate_bios Mikulas Patocka
2014-07-15 19:40 ` [PATCH 9/15] dm: implement copy Mikulas Patocka
2014-07-15 19:40 ` [PATCH 10/15] dm linear: support copy Mikulas Patocka
2014-07-15 19:41 ` [PATCH 11/15] dm stripe: " Mikulas Patocka
2014-07-15 19:42 ` [PATCH 12/15] dm kcopyd: introduce the function submit_job Mikulas Patocka
2014-07-15 19:43 ` [PATCH 13/15] dm kcopyd: support copy offload Mikulas Patocka
2014-07-15 19:43 ` [PATCH 14/15] dm kcopyd: change mutex to spinlock Mikulas Patocka
2014-07-15 19:44 ` [PATCH 15/15] dm kcopyd: call copy offload with asynchronous callback Mikulas Patocka
2014-08-28 21:37 ` [PATCH 0/15] SCSI XCOPY support for the kernel and device mapper Mike Snitzer
2014-08-29 10:29   ` Martin K. Petersen
2015-12-10 17:29 [PATCH 0/15] copy offload patches Mikulas Patocka
2015-12-10 17:30 ` [PATCH 1/15] block copy: initial XCOPY offload support Mikulas Patocka
2015-12-10 17:30   ` Mikulas Patocka
