linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v8 0/4] block device interposer
@ 2021-04-09 11:48 Sergei Shtepa
  2021-04-09 11:48 ` [PATCH v8 1/4] Adds blk_interposer. It allows to redirect bio requests to another block device Sergei Shtepa
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Sergei Shtepa @ 2021-04-09 11:48 UTC (permalink / raw)
  To: Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Alasdair Kergon, Alexander Viro, Jens Axboe, dm-devel,
	linux-fsdevel, linux-block, linux-kernel
  Cc: sergei.shtepa, pavel.tide

I think I'm ready to suggest the next version of block device interposer
(blk_interposer). It allows to redirect bio requests to other block
devices.

In this series of patches, I reviewed the process of attaching and
detaching device mapper via blk_interposer.

Now the dm-target is attached to the interposed block device when the
interposer dm-target is fully ready to accept requests, and the interposed
block device queue is locked, and the file system on it is frozen.
The detaching is also performed when the file system on the interposed
block device is in a frozen state, the queue is locked, and the interposer
dm-target is suspended.

To make it possible to lock the receipt of new bio requests without locking
the processing of bio requests that the interposer creates, I had to change
the submit_bio_noacct() function and add a lock. To minimize the impact of
locking, I chose percpu_rw_sem. I tried to do without a new lock, but I'm
afraid it's impossible.

Checking the operation of the interposer, I did not limit myself to
a simple dm-linear. When I experimented with dm-era, I noticed that it
accepts two block devices. Since Mike was against changing the logic in
the dm-targets itself to support the interrupter, I decided to add the
[interpose] option to the block device path.

 echo "0 ${DEV_SZ} era ${META} [interpose]${DEV} ${BLK_SZ}" | \
 	dmsetup create dm-era --interpose

I believe this option can replace the DM_INTERPOSE_FLAG flag. Of course,
we can assume that if the device cannot be opened with the FMODE_EXCL,
then it is considered an interposed device, but it seems to me that
algorithm is unsafe. I hope to get Mike's opinion on this.

I have successfully tried taking snapshots. But I ran into a problem
when I removed origin-target:
[   49.031156] ------------[ cut here ]------------
[   49.031180] kernel BUG at block/bio.c:1476!
[   49.031198] invalid opcode: 0000 [#1] SMP NOPTI
[   49.031213] CPU: 9 PID: 636 Comm: dmsetup Tainted: G            E     5.12.0-rc6-ip+ #52
[   49.031235] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   49.031257] RIP: 0010:bio_split+0x74/0x80
[   49.031273] Code: 89 c7 e8 5f 56 03 00 41 8b 74 24 28 48 89 ef e8 12 ea ff ff f6 45 15 01 74 08 66 41 81 4c 24 14 00 01 4c 89 e0 5b 5d 41 5c c3 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb ed 90 0f 1f 44 00 00 39 77 28 76 05
[   49.031322] RSP: 0018:ffff9a6100993ab0 EFLAGS: 00010246
[   49.031337] RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8e26938f96d8
[   49.031357] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff8e26937d1300
[   49.031375] RBP: ffff8e2692ddc000 R08: 0000000000000000 R09: 0000000000000000
[   49.031394] R10: ffff8e2692b1de00 R11: ffff8e2692b1de58 R12: ffff8e26937d1300
[   49.031413] R13: ffff8e2692ddcd18 R14: ffff8e2691d22140 R15: ffff8e26937d1300
[   49.031432] FS:  00007efffa6e7800(0000) GS:ffff8e269bc80000(0000) knlGS:0000000000000000
[   49.031453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   49.031470] CR2: 00007efffa96cda0 CR3: 0000000114bd0000 CR4: 00000000000506e0
[   49.031490] Call Trace:
[   49.031501]  dm_submit_bio+0x383/0x500 [dm_mod]
[   49.031522]  submit_bio_noacct+0x370/0x770
[   49.031537]  submit_bh_wbc+0x160/0x190
[   49.031550]  __sync_dirty_buffer+0x65/0x130
[   49.031564]  ext4_commit_super+0xbc/0x120 [ext4]
[   49.031602]  ext4_freeze+0x54/0x80 [ext4]
[   49.031631]  freeze_super+0xc8/0x160
[   49.031643]  freeze_bdev+0xb2/0xc0
[   49.031654]  lock_bdev_fs+0x1c/0x30 [dm_mod]
[   49.031671]  __dm_suspend+0x2b9/0x3b0 [dm_mod]
[   49.032095]  dm_suspend+0xed/0x160 [dm_mod]
[   49.032496]  ? __find_device_hash_cell+0x5b/0x2a0 [dm_mod]
[   49.032897]  ? remove_all+0x30/0x30 [dm_mod]
[   49.033299]  dev_remove+0x4c/0x1c0 [dm_mod]
[   49.033679]  ctl_ioctl+0x1a5/0x470 [dm_mod]
[   49.034067]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
[   49.034432]  __x64_sys_ioctl+0x83/0xb0
[   49.034785]  do_syscall_64+0x33/0x80
[   49.035139]  entry_SYSCALL_64_after_hwframe+0x44/0xae
When suspend is executed for origin-target before the interposer is
being detached, in the origin_map() function the value of the
o->split_binary variable is zero, since no snapshots were connected to it.
I think that if no snapshots are connected, then it does not make sense
to split the bio request into parts.

Changes summary for this patchset v7:
  * The attaching and detaching to interposed device moved to
    __dm_suspend() and __dm_resume() functions.
  * Redesigned th submit_bio_noacct() function and added a lock for the
    block device interposer.
  * Adds [interpose] option to block device patch in dm table.
  * Fix origin_map() then o->split_binary value is zero.

History:
v7 - https://patchwork.kernel.org/project/linux-block/cover/1615563895-28565-1-git-send-email-sergei.shtepa@veeam.com/
  * the request interception mechanism. Now the interposer is
    a block device that receives requests instead of the original device;
  * code design fixes.

History:
v6 - https://patchwork.kernel.org/project/linux-block/cover/1614774618-22410-1-git-send-email-sergei.shtepa@veeam.com/
  * designed for 5.12;
  * thanks to the new design of the bio structure in v5.12, it is
    possible to perform interception not for the entire disk, but
    for each block device;
  * instead of the new ioctl DM_DEV_REMAP_CMD and the 'noexcl' option,
    the DM_INTERPOSED_FLAG flag for the ioctl DM_TABLE_LOAD_CMD is
    applied.

v5 - https://patchwork.kernel.org/project/linux-block/cover/1612881028-7878-1-git-send-email-sergei.shtepa@veeam.com/
 * rebase for v5.11-rc7;
 * patch set organization;
 * fix defects in documentation;
 * add some comments;
 * change mutex names for better code readability;
 * remove calling bd_unlink_disk_holder() for targets with non-exclusive
   flag;
 * change type for struct dm_remap_param from uint8_t to __u8.

v4 - https://patchwork.kernel.org/project/linux-block/cover/1612367638-3794-1-git-send-email-sergei.shtepa@veeam.com/
Mostly changes were made, due to Damien's comments:
 * on the design of the code;
 * by the patch set organization;
 * bug with passing a wrong parameter to dm_get_device();
 * description of the 'noexcl' parameter in the linear.rst.
Also added remap_and_filter.rst.

v3 - https://patchwork.kernel.org/project/linux-block/cover/1611853955-32167-1-git-send-email-sergei.shtepa@veeam.com/
In this version, I already suggested blk_interposer to apply to dm-linear.
Problems were solved:
 * Interception of bio requests from a specific device on the disk, not
   from the entire disk. To do this, we added the dm_interposed_dev
   structure and an interval tree to store these structures.
 * Implemented ioctl DM_DEV_REMAP_CMD. A patch with changes in the lvm2
   project was sent to the team lvm-devel@redhat.com.
 * Added the 'noexcl' option for dm-linear, which allows you to open
   the underlying block-device without FMODE_EXCL mode.

v2 - https://patchwork.kernel.org/project/linux-block/cover/1607518911-30692-1-git-send-email-sergei.shtepa@veeam.com/
I tried to suggest blk_interposer without using it in device mapper,
but with the addition of a sample of its use. It was then that I learned
about the maintainers' attitudes towards the samples directory :).

v1 - https://lwn.net/ml/linux-block/20201119164924.74401-1-hare@suse.de/
This Hannes's patch can be considered as a starting point, since this is
where the interception mechanism and the term blk_interposer itself
appeared. It became clear that blk_interposer can be useful for
device mapper.

before v1 - https://patchwork.kernel.org/project/linux-block/cover/1603271049-20681-1-git-send-email-sergei.shtepa@veeam.com/
I tried to offer a rather cumbersome blk-filter and a monster-like
blk-snap module for creating snapshots.

Thank you to everyone who was able to take the time to review
the previous versions.
I hope that this time I achieved the required quality.

Thanks,
Sergei.

Sergei Shtepa (4):
  Adds blk_interposer. It allows to redirect bio requests to another
    block device.
  Adds the blk_interposers logic to __submit_bio_noacct().
  Adds blk_interposer to md.
  fix origin_map - don't split a bio for the origin device if it does
    not have registered snapshots.

 block/bio.c                   |   2 +
 block/blk-core.c              | 194 ++++++++++++++------------
 block/genhd.c                 |  51 +++++++
 drivers/md/dm-core.h          |   1 +
 drivers/md/dm-ioctl.c         |  95 ++++++++++---
 drivers/md/dm-snap.c          |  15 +-
 drivers/md/dm-table.c         |  68 ++++++++-
 drivers/md/dm.c               | 254 ++++++++++++++++++++++++++++++----
 drivers/md/dm.h               |   8 +-
 fs/block_dev.c                |   3 +
 include/linux/blk_types.h     |   6 +
 include/linux/blkdev.h        |  32 +++++
 include/linux/device-mapper.h |   1 +
 include/uapi/linux/dm-ioctl.h |   6 +
 14 files changed, 586 insertions(+), 150 deletions(-)

--
2.20.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v8 1/4] Adds blk_interposer. It allows to redirect bio requests to another block device.
  2021-04-09 11:48 [PATCH v8 0/4] block device interposer Sergei Shtepa
@ 2021-04-09 11:48 ` Sergei Shtepa
  2021-04-09 11:48 ` [PATCH v8 2/4] Adds the blk_interposers logic to __submit_bio_noacct() Sergei Shtepa
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Sergei Shtepa @ 2021-04-09 11:48 UTC (permalink / raw)
  To: Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Alasdair Kergon, Alexander Viro, Jens Axboe, dm-devel,
	linux-fsdevel, linux-block, linux-kernel
  Cc: sergei.shtepa, pavel.tide

Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
 block/genhd.c             | 51 +++++++++++++++++++++++++++++++++++++++
 fs/block_dev.c            |  3 +++
 include/linux/blk_types.h |  6 +++++
 include/linux/blkdev.h    | 32 ++++++++++++++++++++++++
 4 files changed, 92 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 8c8f543572e6..533b33897187 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1938,3 +1938,54 @@ static void disk_release_events(struct gendisk *disk)
 	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
 	kfree(disk->ev);
 }
+
+/**
+ * bdev_interposer_attach - Attach interposer block device to original
+ * @original: original block device
+ * @interposer: interposer block device
+ *
+ * Before attaching the interposer, it is necessary to lock the processing
+ * of bio requests of the original device by calling the bdev_interposer_lock().
+ *
+ * The bdev_interposer_detach() function allows to detach the interposer
+ * from original block device.
+ */
+int bdev_interposer_attach(struct block_device *original,
+			   struct block_device *interposer)
+{
+	struct block_device *bdev;
+
+	WARN_ON(!original);
+	if (original->bd_interposer)
+		return -EBUSY;
+
+	bdev = bdgrab(interposer);
+	if (!bdev)
+		return -ENODEV;
+
+	original->bd_interposer = bdev;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(bdev_interposer_attach);
+
+/**
+ * bdev_interposer_detach - Detach interposer from block device
+ * @original: original block device
+ *
+ * Before detaching the interposer, it is necessary to lock the processing
+ * of bio requests of the original device by calling the bdev_interposer_lock().
+ *
+ * The interposer should be attached using function bdev_interposer_attach().
+ */
+void bdev_interposer_detach(struct block_device *original)
+{
+	if (WARN_ON(!original))
+		return;
+
+	if (!original->bd_interposer)
+		return;
+
+	bdput(original->bd_interposer);
+	original->bd_interposer = NULL;
+}
+EXPORT_SYMBOL_GPL(bdev_interposer_detach);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 09d6f7229db9..a98a56cc634f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -809,6 +809,7 @@ static void bdev_free_inode(struct inode *inode)
 {
 	struct block_device *bdev = I_BDEV(inode);
 
+	percpu_free_rwsem(&bdev->bd_interposer_lock);
 	free_percpu(bdev->bd_stats);
 	kfree(bdev->bd_meta_info);
 
@@ -909,6 +910,8 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 		iput(inode);
 		return NULL;
 	}
+	bdev->bd_interposer = NULL;
+	percpu_init_rwsem(&bdev->bd_interposer_lock);
 	return bdev;
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db026b6ec15a..8e4309eb3b18 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -46,6 +46,11 @@ struct block_device {
 	spinlock_t		bd_size_lock; /* for bd_inode->i_size updates */
 	struct gendisk *	bd_disk;
 	struct backing_dev_info *bd_bdi;
+	/* The interposer allows to redirect bio to another device */
+	struct block_device	*bd_interposer;
+	/* Lock the queue of block device to attach or detach interposer.
+	 * Allows to safely suspend and flush interposer. */
+	struct percpu_rw_semaphore bd_interposer_lock;
 
 	/* The counter of freeze processes */
 	int			bd_fsfreeze_count;
@@ -304,6 +309,7 @@ enum {
 	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
 	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
 	BIO_REMAPPED,
+	BIO_INTERPOSED,		/* bio was reassigned to another block device */
 	BIO_FLAG_LAST
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 158aefae1030..9e376ee34e19 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -2029,4 +2029,36 @@ int fsync_bdev(struct block_device *bdev);
 int freeze_bdev(struct block_device *bdev);
 int thaw_bdev(struct block_device *bdev);
 
+/**
+ * bdev_interposer_lock - Lock bio processing
+ * @bdev: locking block device
+ *
+ * Lock the bio processing in submit_bio_noacct() for the new requests in the
+ * original block device. Requests from interposer will not be locked.
+ *
+ * To unlock, use the function bdev_interposer_unlock().
+ *
+ * This lock should be used to attach/detach interposer to the device.
+ */
+static inline void bdev_interposer_lock(struct block_device *bdev)
+{
+	percpu_down_write(&bdev->bd_interposer_lock);
+}
+
+/**
+ * bdev_interposer_unlock - Unlock bio processing
+ * @bdev: locked block device
+ *
+ * Unlock the bio processing that was locked by function bdev_interposer_lock().
+ *
+ * This lock should be used to attach/detach interposer to the device.
+ */
+static inline void bdev_interposer_unlock(struct block_device *bdev)
+{
+	percpu_up_write(&bdev->bd_interposer_lock);
+}
+
+int bdev_interposer_attach(struct block_device *original,
+			   struct block_device *interposer);
+void bdev_interposer_detach(struct block_device *original);
 #endif /* _LINUX_BLKDEV_H */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v8 2/4] Adds the blk_interposers logic to __submit_bio_noacct().
  2021-04-09 11:48 [PATCH v8 0/4] block device interposer Sergei Shtepa
  2021-04-09 11:48 ` [PATCH v8 1/4] Adds blk_interposer. It allows to redirect bio requests to another block device Sergei Shtepa
@ 2021-04-09 11:48 ` Sergei Shtepa
  2021-04-09 11:48 ` [PATCH v8 3/4] Adds blk_interposer to md Sergei Shtepa
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Sergei Shtepa @ 2021-04-09 11:48 UTC (permalink / raw)
  To: Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Alasdair Kergon, Alexander Viro, Jens Axboe, dm-devel,
	linux-fsdevel, linux-block, linux-kernel
  Cc: sergei.shtepa, pavel.tide

* The calling to blk_partition_remap() function has moved
from submit_bio_checks() to submit_bio_noacct().
* The __submit_bio() and __submit_bio_noacct_mq() functions
have been removed and their functionality moved to
submit_bio_noacct().
* Added locking of the block device queue using
the bd_interposer_lock.

Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
 block/bio.c      |   2 +
 block/blk-core.c | 194 ++++++++++++++++++++++++++---------------------
 2 files changed, 108 insertions(+), 88 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 50e579088aca..6fc9e8f395a6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 		bio_set_flag(bio, BIO_THROTTLED);
 	if (bio_flagged(bio_src, BIO_REMAPPED))
 		bio_set_flag(bio, BIO_REMAPPED);
+	if (bio_flagged(bio_src, BIO_INTERPOSED))
+		bio_set_flag(bio, BIO_INTERPOSED);
 	bio->bi_opf = bio_src->bi_opf;
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_write_hint = bio_src->bi_write_hint;
diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..a987daa76a79 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -735,26 +735,27 @@ static inline int bio_check_eod(struct bio *bio)
 		handle_bad_sector(bio, maxsector);
 		return -EIO;
 	}
+
+	if (unlikely(should_fail_request(bio->bi_bdev, bio->bi_iter.bi_size)))
+		return -EIO;
+
 	return 0;
 }
 
 /*
  * Remap block n of partition p to block n+start(p) of the disk.
  */
-static int blk_partition_remap(struct bio *bio)
+static inline void blk_partition_remap(struct bio *bio)
 {
-	struct block_device *p = bio->bi_bdev;
+	struct block_device *bdev = bio->bi_bdev;
 
-	if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
-		return -EIO;
-	if (bio_sectors(bio)) {
-		bio->bi_iter.bi_sector += p->bd_start_sect;
-		trace_block_bio_remap(bio, p->bd_dev,
+	if (bdev->bd_partno && bio_sectors(bio)) {
+		bio->bi_iter.bi_sector += bdev->bd_start_sect;
+		trace_block_bio_remap(bio, bdev->bd_dev,
 				      bio->bi_iter.bi_sector -
-				      p->bd_start_sect);
+				      bdev->bd_start_sect);
 	}
 	bio_set_flag(bio, BIO_REMAPPED);
-	return 0;
 }
 
 /*
@@ -819,8 +820,6 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 	if (!bio_flagged(bio, BIO_REMAPPED)) {
 		if (unlikely(bio_check_eod(bio)))
 			goto end_io;
-		if (bdev->bd_partno && unlikely(blk_partition_remap(bio)))
-			goto end_io;
 	}
 
 	/*
@@ -910,20 +909,6 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 	return false;
 }
 
-static blk_qc_t __submit_bio(struct bio *bio)
-{
-	struct gendisk *disk = bio->bi_bdev->bd_disk;
-	blk_qc_t ret = BLK_QC_T_NONE;
-
-	if (blk_crypto_bio_prep(&bio)) {
-		if (!disk->fops->submit_bio)
-			return blk_mq_submit_bio(bio);
-		ret = disk->fops->submit_bio(bio);
-	}
-	blk_queue_exit(disk->queue);
-	return ret;
-}
-
 /*
  * The loop in this function may be a bit non-obvious, and so deserves some
  * explanation:
@@ -931,7 +916,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
  *  - Before entering the loop, bio->bi_next is NULL (as all callers ensure
  *    that), so we have a list with a single bio.
  *  - We pretend that we have just taken it off a longer list, so we assign
- *    bio_list to a pointer to the bio_list_on_stack, thus initialising the
+ *    bio_list to a pointer to the current->bio_list, thus initialising the
  *    bio_list of new bios to be added.  ->submit_bio() may indeed add some more
  *    bios through a recursive call to submit_bio_noacct.  If it did, we find a
  *    non-NULL value in bio_list and re-enter the loop from the top.
@@ -939,83 +924,75 @@ static blk_qc_t __submit_bio(struct bio *bio)
  *    pretending) and so remove it from bio_list, and call into ->submit_bio()
  *    again.
  *
- * bio_list_on_stack[0] contains bios submitted by the current ->submit_bio.
- * bio_list_on_stack[1] contains bios that were submitted before the current
+ * current->bio_list[0] contains bios submitted by the current ->submit_bio.
+ * current->bio_list[1] contains bios that were submitted before the current
  *	->submit_bio_bio, but that haven't been processed yet.
  */
 static blk_qc_t __submit_bio_noacct(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack[2];
-	blk_qc_t ret = BLK_QC_T_NONE;
-
-	BUG_ON(bio->bi_next);
-
-	bio_list_init(&bio_list_on_stack[0]);
-	current->bio_list = bio_list_on_stack;
-
-	do {
-		struct request_queue *q = bio->bi_bdev->bd_disk->queue;
-		struct bio_list lower, same;
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct bio_list lower, same;
+	blk_qc_t ret;
 
-		if (unlikely(bio_queue_enter(bio) != 0))
-			continue;
+	if (!blk_crypto_bio_prep(&bio)) {
+		blk_queue_exit(disk->queue);
+		return BLK_QC_T_NONE;
+	}
 
-		/*
-		 * Create a fresh bio_list for all subordinate requests.
-		 */
-		bio_list_on_stack[1] = bio_list_on_stack[0];
-		bio_list_init(&bio_list_on_stack[0]);
+	if (queue_is_mq(disk->queue))
+		return blk_mq_submit_bio(bio);
 
-		ret = __submit_bio(bio);
+	/*
+	 * Create a fresh bio_list for all subordinate requests.
+	 */
+	current->bio_list[1] = current->bio_list[0];
+	bio_list_init(&current->bio_list[0]);
 
-		/*
-		 * Sort new bios into those for a lower level and those for the
-		 * same level.
-		 */
-		bio_list_init(&lower);
-		bio_list_init(&same);
-		while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
-			if (q == bio->bi_bdev->bd_disk->queue)
-				bio_list_add(&same, bio);
-			else
-				bio_list_add(&lower, bio);
+	WARN_ON_ONCE(!disk->fops->submit_bio);
+	ret = disk->fops->submit_bio(bio);
+	blk_queue_exit(disk->queue);
+	/*
+	 * Sort new bios into those for a lower level and those
+	 * for the same level.
+	 */
+	bio_list_init(&lower);
+	bio_list_init(&same);
+	while ((bio = bio_list_pop(&current->bio_list[0])) != NULL)
+		if (disk->queue == bio->bi_bdev->bd_disk->queue)
+			bio_list_add(&same, bio);
+		else
+			bio_list_add(&lower, bio);
 
-		/*
-		 * Now assemble so we handle the lowest level first.
-		 */
-		bio_list_merge(&bio_list_on_stack[0], &lower);
-		bio_list_merge(&bio_list_on_stack[0], &same);
-		bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
-	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
+	/*
+	 * Now assemble so we handle the lowest level first.
+	 */
+	bio_list_merge(&current->bio_list[0], &lower);
+	bio_list_merge(&current->bio_list[0], &same);
+	bio_list_merge(&current->bio_list[0], &current->bio_list[1]);
 
-	current->bio_list = NULL;
 	return ret;
 }
 
-static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
+static inline struct block_device *bio_interposer_lock(struct bio *bio)
 {
-	struct bio_list bio_list[2] = { };
-	blk_qc_t ret = BLK_QC_T_NONE;
-
-	current->bio_list = bio_list;
-
-	do {
-		struct gendisk *disk = bio->bi_bdev->bd_disk;
-
-		if (unlikely(bio_queue_enter(bio) != 0))
-			continue;
+	bool locked;
+	struct block_device *bdev = bio->bi_bdev;
 
-		if (!blk_crypto_bio_prep(&bio)) {
-			blk_queue_exit(disk->queue);
-			ret = BLK_QC_T_NONE;
-			continue;
+	if (bio->bi_opf & REQ_NOWAIT) {
+		locked = percpu_down_read_trylock(&bdev->bd_interposer_lock);
+		if (unlikely(!locked)) {
+			bio_wouldblock_error(bio);
+			return NULL;
 		}
+	} else
+		percpu_down_read(&bdev->bd_interposer_lock);
 
-		ret = blk_mq_submit_bio(bio);
-	} while ((bio = bio_list_pop(&bio_list[0])));
+	return bdev;
+}
 
-	current->bio_list = NULL;
-	return ret;
+static inline void bio_interposer_unlock(struct block_device *locked_bdev)
+{
+	percpu_up_read(&locked_bdev->bd_interposer_lock);
 }
 
 /**
@@ -1029,6 +1006,10 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
  */
 blk_qc_t submit_bio_noacct(struct bio *bio)
 {
+	struct block_device *locked_bdev;
+	struct bio_list bio_list_on_stack[2] = { };
+	blk_qc_t ret = BLK_QC_T_NONE;
+
 	if (!submit_bio_checks(bio))
 		return BLK_QC_T_NONE;
 
@@ -1043,9 +1024,46 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
 		return BLK_QC_T_NONE;
 	}
 
-	if (!bio->bi_bdev->bd_disk->fops->submit_bio)
-		return __submit_bio_noacct_mq(bio);
-	return __submit_bio_noacct(bio);
+	BUG_ON(bio->bi_next);
+
+	locked_bdev = bio_interposer_lock(bio);
+	if (!locked_bdev)
+		return BLK_QC_T_NONE;
+
+	current->bio_list = bio_list_on_stack;
+
+	do {
+		if (unlikely(bio_queue_enter(bio) != 0)) {
+			ret = BLK_QC_T_NONE;
+			continue;
+		}
+
+		if (!bio_flagged(bio, BIO_INTERPOSED) &&
+		    bio->bi_bdev->bd_interposer) {
+			struct gendisk *disk = bio->bi_bdev->bd_disk;
+
+			bio_set_dev(bio, bio->bi_bdev->bd_interposer);
+			bio_set_flag(bio, BIO_INTERPOSED);
+
+			bio_list_add(&bio_list_on_stack[0], bio);
+
+			blk_queue_exit(disk->queue);
+			ret = BLK_QC_T_NONE;
+			continue;
+		}
+
+		if (!bio_flagged(bio, BIO_REMAPPED))
+			blk_partition_remap(bio);
+
+		ret = __submit_bio_noacct(bio);
+
+	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
+
+	current->bio_list = NULL;
+
+	bio_interposer_unlock(locked_bdev);
+
+	return ret;
 }
 EXPORT_SYMBOL(submit_bio_noacct);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v8 3/4] Adds blk_interposer to md.
  2021-04-09 11:48 [PATCH v8 0/4] block device interposer Sergei Shtepa
  2021-04-09 11:48 ` [PATCH v8 1/4] Adds blk_interposer. It allows to redirect bio requests to another block device Sergei Shtepa
  2021-04-09 11:48 ` [PATCH v8 2/4] Adds the blk_interposers logic to __submit_bio_noacct() Sergei Shtepa
@ 2021-04-09 11:48 ` Sergei Shtepa
  2021-04-09 14:12   ` kernel test robot
                     ` (3 more replies)
  2021-04-09 11:48 ` [PATCH v8 4/4] fix origin_map - don't split a bio for the origin device if it does not have registered snapshots Sergei Shtepa
  2021-04-09 15:23 ` [PATCH v8 0/4] block device interposer Mike Snitzer
  4 siblings, 4 replies; 11+ messages in thread
From: Sergei Shtepa @ 2021-04-09 11:48 UTC (permalink / raw)
  To: Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Alasdair Kergon, Alexander Viro, Jens Axboe, dm-devel,
	linux-fsdevel, linux-block, linux-kernel
  Cc: sergei.shtepa, pavel.tide

* The new flag DM_INTERPOSE_FLAG allows to specify that the dm target
will be attached using blk_interposer.
* The [interpose] option allows to specify which device will be
attached via the interposer.
* The connection and disconnection of the interrupter is performed in
the functions __dm_suspend() and __dm_resume(). The flag
DM_SUSPEND_DETACH_IP_FLAG was added for this purpose.
* dm_submit_bio() sets BIO_INTERPOSED for each bio from the interposer.

Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
 drivers/md/dm-core.h          |   1 +
 drivers/md/dm-ioctl.c         |  95 ++++++++++---
 drivers/md/dm-table.c         |  68 ++++++++-
 drivers/md/dm.c               | 254 ++++++++++++++++++++++++++++++----
 drivers/md/dm.h               |   8 +-
 include/linux/device-mapper.h |   1 +
 include/uapi/linux/dm-ioctl.h |   6 +
 7 files changed, 375 insertions(+), 58 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 5953ff2bd260..431b82461eae 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -112,6 +112,7 @@ struct mapped_device {
 	/* for blk-mq request-based DM support */
 	struct blk_mq_tag_set *tag_set;
 	bool init_tio_pdu:1;
+	bool interpose:1;
 
 	struct srcu_struct io_barrier;
 };
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 1ca65b434f1f..7ec37526920b 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -294,11 +294,29 @@ static void dm_hash_remove_all(bool keep_open_devices, bool mark_deferred, bool
 			md = hc->md;
 			dm_get(md);
 
-			if (keep_open_devices &&
-			    dm_lock_for_deletion(md, mark_deferred, only_deferred)) {
-				dm_put(md);
-				dev_skipped++;
-				continue;
+			if (md->interpose) {
+				int r;
+
+				/*
+				 * Interposer should be suspended and detached
+				 * from the interposed block device.
+				 */
+				r = dm_suspend(md, DM_SUSPEND_DETACH_IP_FLAG |
+						   DM_SUSPEND_LOCKFS_FLAG);
+				if (r) {
+					DMERR("%s: unable to suspend and detach interposer",
+						dm_device_name(md));
+					dm_put(md);
+					dev_skipped++;
+					continue;
+				}
+			} else {
+				if (keep_open_devices &&
+				    dm_lock_for_deletion(md, mark_deferred, only_deferred)) {
+					dm_put(md);
+					dev_skipped++;
+					continue;
+				}
 			}
 
 			t = __hash_remove(hc);
@@ -732,6 +750,9 @@ static void __dev_status(struct mapped_device *md, struct dm_ioctl *param)
 	if (dm_test_deferred_remove_flag(md))
 		param->flags |= DM_DEFERRED_REMOVE;
 
+	if (dm_interposer_attached(md))
+		param->flags |= DM_INTERPOSE_FLAG;
+
 	param->dev = huge_encode_dev(disk_devt(disk));
 
 	/*
@@ -878,20 +899,37 @@ static int dev_remove(struct file *filp, struct dm_ioctl *param, size_t param_si
 
 	md = hc->md;
 
-	/*
-	 * Ensure the device is not open and nothing further can open it.
-	 */
-	r = dm_lock_for_deletion(md, !!(param->flags & DM_DEFERRED_REMOVE), false);
-	if (r) {
-		if (r == -EBUSY && param->flags & DM_DEFERRED_REMOVE) {
+	if (!md->interpose) {
+		/*
+		 * Ensure the device is not open and nothing further can open it.
+		 */
+		r = dm_lock_for_deletion(md, !!(param->flags & DM_DEFERRED_REMOVE), false);
+		if (r) {
+			if (r == -EBUSY && param->flags & DM_DEFERRED_REMOVE) {
+				up_write(&_hash_lock);
+				dm_put(md);
+				return 0;
+			}
+			DMDEBUG_LIMIT("unable to remove open device %s",
+					hc->name);
 			up_write(&_hash_lock);
 			dm_put(md);
-			return 0;
+			return r;
+		}
+	} else {
+		/*
+		 * Interposer should be suspended and detached from
+		 * the interposed block device.
+		 */
+		r = dm_suspend(md, DM_SUSPEND_DETACH_IP_FLAG |
+				   DM_SUSPEND_LOCKFS_FLAG);
+		if (r) {
+			DMERR("%s: unable to suspend and detach interposer",
+				dm_device_name(md));
+			up_write(&_hash_lock);
+			dm_put(md);
+			return r;
 		}
-		DMDEBUG_LIMIT("unable to remove open device %s", hc->name);
-		up_write(&_hash_lock);
-		dm_put(md);
-		return r;
 	}
 
 	t = __hash_remove(hc);
@@ -1050,6 +1088,7 @@ static int do_resume(struct dm_ioctl *param)
 
 	md = hc->md;
 
+
 	new_map = hc->new_map;
 	hc->new_map = NULL;
 	param->flags &= ~DM_INACTIVE_PRESENT_FLAG;
@@ -1063,8 +1102,14 @@ static int do_resume(struct dm_ioctl *param)
 			suspend_flags &= ~DM_SUSPEND_LOCKFS_FLAG;
 		if (param->flags & DM_NOFLUSH_FLAG)
 			suspend_flags |= DM_SUSPEND_NOFLUSH_FLAG;
-		if (!dm_suspended_md(md))
-			dm_suspend(md, suspend_flags);
+
+		if (md->interpose) {
+			if (!dm_suspended_md(md) || dm_interposer_attached(md))
+				dm_suspend(md, suspend_flags | DM_SUSPEND_DETACH_IP_FLAG);
+		} else {
+			if (!dm_suspended_md(md))
+				dm_suspend(md, suspend_flags);
+		}
 
 		old_map = dm_swap_table(md, new_map);
 		if (IS_ERR(old_map)) {
@@ -1267,6 +1312,11 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
 	return mode;
 }
 
+static inline bool get_interpose_flag(struct dm_ioctl *param)
+{
+	return (param->flags & DM_INTERPOSE_FLAG);
+}
+
 static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
 		       struct dm_target_spec **spec, char **target_params)
 {
@@ -1289,11 +1339,6 @@ static int populate_table(struct dm_table *table,
 	void *end = (void *) param + param_size;
 	char *target_params;
 
-	if (!param->target_count) {
-		DMWARN("populate_table: no targets specified");
-		return -EINVAL;
-	}
-
 	for (i = 0; i < param->target_count; i++) {
 
 		r = next_target(spec, next, end, &spec, &target_params);
@@ -1338,6 +1383,8 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
 	if (!md)
 		return -ENXIO;
 
+	md->interpose = get_interpose_flag(param);
+
 	r = dm_table_create(&t, get_mode(param), param->target_count, md);
 	if (r)
 		goto err;
@@ -2098,6 +2145,8 @@ int __init dm_early_create(struct dm_ioctl *dmi,
 	if (r)
 		goto err_hash_remove;
 
+	md->interpose = get_interpose_flag(dmi);
+
 	/* add targets */
 	for (i = 0; i < dmi->target_count; i++) {
 		r = dm_table_add_target(t, spec_array[i]->target_type,
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index e5f0f1703c5d..23574c727f2b 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -327,14 +327,14 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
  * it is accessed concurrently.
  */
 static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
-			struct mapped_device *md)
+			bool interpose, struct mapped_device *md)
 {
 	int r;
 	struct dm_dev *old_dev, *new_dev;
 
 	old_dev = dd->dm_dev;
 
-	r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev,
+	r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, interpose,
 				dd->dm_dev->mode | new_mode, &new_dev);
 	if (r)
 		return r;
@@ -367,6 +367,8 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
 {
 	int r;
 	dev_t dev;
+	size_t ofs = 0;
+	bool interpose = false;
 	unsigned int major, minor;
 	char dummy;
 	struct dm_dev_internal *dd;
@@ -374,13 +376,40 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
 
 	BUG_ON(!t);
 
-	if (sscanf(path, "%u:%u%c", &major, &minor, &dummy) == 2) {
+	/*
+	 * Extract extended options for device
+	 */
+	if (path[0] == '[') {
+		const char *interpose_opt = "interpose";
+		size_t opt_pos = 1;
+		size_t opt_len;
+
+		/*
+		 * Because only one option is supported yet, the parser
+		 * can be simplest.
+		 */
+		opt_len = strlen(interpose_opt);
+		if ((opt_pos + opt_len) < strlen(path) &&
+		    memcmp(&path[opt_pos], interpose_opt, opt_len) == 0) {
+			interpose = true;
+
+			if (!t->md->interpose)
+				t->md->interpose = true;
+		} else {
+			DMERR("Invalid devices extended options %s", path);
+			return -EINVAL;
+		}
+
+		ofs = opt_pos + opt_len + 1;
+	}
+
+	if (sscanf(&path[ofs], "%u:%u%c", &major, &minor, &dummy) == 2) {
 		/* Extract the major/minor numbers */
 		dev = MKDEV(major, minor);
 		if (MAJOR(dev) != major || MINOR(dev) != minor)
 			return -EOVERFLOW;
 	} else {
-		dev = dm_get_dev_t(path);
+		dev = dm_get_dev_t(&path[ofs]);
 		if (!dev)
 			return -ENODEV;
 	}
@@ -391,7 +420,8 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
 		if (!dd)
 			return -ENOMEM;
 
-		if ((r = dm_get_table_device(t->md, dev, mode, &dd->dm_dev))) {
+		r = dm_get_table_device(t->md, dev, mode, interpose, &dd->dm_dev);
+		if (r) {
 			kfree(dd);
 			return r;
 		}
@@ -401,14 +431,40 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
 		goto out;
 
 	} else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
-		r = upgrade_mode(dd, mode, t->md);
+		r = upgrade_mode(dd, mode, interpose, t->md);
 		if (r)
 			return r;
 	}
 	refcount_inc(&dd->count);
 out:
+	if (interpose) {
+		struct block_device *original = dd->dm_dev->bdev;
+		/*
+		 * Interposer target should cover all underlying device
+		 */
+		if (ti->begin != 0) {
+			DMERR("%s: target offset should be zero for dm interposer",
+			      dm_device_name(t->md));
+			r = -EINVAL;
+			goto fail;
+		}
+		if (bdev_nr_sectors(original) != ti->len) {
+			DMERR("%s: interposer and interposed block device size should be equal",
+			      dm_device_name(t->md));
+			r = -EINVAL;
+			goto fail;
+		}
+	}
+
 	*result = dd->dm_dev;
 	return 0;
+fail:
+	if (refcount_dec_and_test(&dd->count)) {
+		dm_put_table_device(t->md, dd->dm_dev);
+		list_del(&dd->list);
+		kfree(dd);
+	}
+	return r;
 }
 EXPORT_SYMBOL(dm_get_device);
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3f3be9408afa..04142454c4ee 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -149,6 +149,7 @@ EXPORT_SYMBOL_GPL(dm_bio_get_target_bio_nr);
 #define DMF_DEFERRED_REMOVE 6
 #define DMF_SUSPENDED_INTERNALLY 7
 #define DMF_POST_SUSPENDING 8
+#define DMF_INTERPOSER_ATTACHED 9
 
 #define DM_NUMA_NODE NUMA_NO_NODE
 static int dm_numa_node = DM_NUMA_NODE;
@@ -757,18 +758,24 @@ static int open_table_device(struct table_device *td, dev_t dev,
 			     struct mapped_device *md)
 {
 	struct block_device *bdev;
-
+	fmode_t mode = td->dm_dev.mode;
+	void *holder = NULL;
 	int r;
 
 	BUG_ON(td->dm_dev.bdev);
 
-	bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
+	if (!td->dm_dev.interpose) {
+		mode |= FMODE_EXCL;
+		holder = _dm_claim_ptr;
+	}
+
+	bdev = blkdev_get_by_dev(dev, mode, holder);
 	if (IS_ERR(bdev))
 		return PTR_ERR(bdev);
 
 	r = bd_link_disk_holder(bdev, dm_disk(md));
 	if (r) {
-		blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
+		blkdev_put(bdev, mode);
 		return r;
 	}
 
@@ -782,11 +789,16 @@ static int open_table_device(struct table_device *td, dev_t dev,
  */
 static void close_table_device(struct table_device *td, struct mapped_device *md)
 {
+	fmode_t mode = td->dm_dev.mode;
+
 	if (!td->dm_dev.bdev)
 		return;
 
 	bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
-	blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
+	if (!td->dm_dev.interpose)
+		mode |= FMODE_EXCL;
+	blkdev_put(td->dm_dev.bdev, mode);
+
 	put_dax(td->dm_dev.dax_dev);
 	td->dm_dev.bdev = NULL;
 	td->dm_dev.dax_dev = NULL;
@@ -805,7 +817,7 @@ static struct table_device *find_table_device(struct list_head *l, dev_t dev,
 }
 
 int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
-			struct dm_dev **result)
+			bool interpose, struct dm_dev **result)
 {
 	int r;
 	struct table_device *td;
@@ -821,6 +833,7 @@ int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
 
 		td->dm_dev.mode = mode;
 		td->dm_dev.bdev = NULL;
+		td->dm_dev.interpose = interpose;
 
 		if ((r = open_table_device(td, dev, md))) {
 			mutex_unlock(&md->table_devices_lock);
@@ -1496,13 +1509,12 @@ static int __send_empty_flush(struct clone_info *ci)
 static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 				    sector_t sector, unsigned *len)
 {
-	struct bio *bio = ci->bio;
 	struct dm_target_io *tio;
 	int r;
 
 	tio = alloc_tio(ci, ti, 0, GFP_NOIO);
 	tio->len_ptr = len;
-	r = clone_bio(tio, bio, sector, *len);
+	r = clone_bio(tio, ci->bio, sector, *len);
 	if (r < 0) {
 		free_tio(tio);
 		return r;
@@ -1696,6 +1708,13 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 		goto out;
 	}
 
+	/*
+	 * If md is an interposer, then we must set the BIO_INTERPOSE flag
+	 * so that the request is not re-interposed.
+	 */
+	if (md->interpose)
+		bio_set_flag(bio, BIO_INTERPOSED);
+
 	/* If suspended, queue this IO for later */
 	if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
 		if (bio->bi_opf & REQ_NOWAIT)
@@ -2410,7 +2429,8 @@ static void dm_queue_flush(struct mapped_device *md)
  */
 struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *table)
 {
-	struct dm_table *live_map = NULL, *map = ERR_PTR(-EINVAL);
+	struct dm_table *live_map = NULL;
+	struct dm_table *map = ERR_PTR(-EINVAL);
 	struct queue_limits limits;
 	int r;
 
@@ -2453,26 +2473,50 @@ struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *table)
  * Functions to lock and unlock any filesystem running on the
  * device.
  */
-static int lock_fs(struct mapped_device *md)
+static int lock_bdev_fs(struct mapped_device *md, struct block_device *bdev)
 {
 	int r;
 
 	WARN_ON(test_bit(DMF_FROZEN, &md->flags));
 
-	r = freeze_bdev(md->disk->part0);
+	r = freeze_bdev(bdev);
 	if (!r)
 		set_bit(DMF_FROZEN, &md->flags);
 	return r;
 }
 
-static void unlock_fs(struct mapped_device *md)
+static void unlock_bdev_fs(struct mapped_device *md, struct block_device *bdev)
 {
 	if (!test_bit(DMF_FROZEN, &md->flags))
 		return;
-	thaw_bdev(md->disk->part0);
+	thaw_bdev(bdev);
 	clear_bit(DMF_FROZEN, &md->flags);
 }
 
+static inline int lock_fs(struct mapped_device *md)
+{
+	return lock_bdev_fs(md, md->disk->part0);
+}
+
+static inline void unlock_fs(struct mapped_device *md)
+{
+	unlock_bdev_fs(md, md->disk->part0);
+}
+
+static inline struct block_device *get_interposed_bdev(struct dm_table *t)
+{
+	struct dm_dev_internal *dd;
+
+	/*
+	 * For interposer should be only one device in dm table
+	 */
+	list_for_each_entry(dd, dm_table_get_devices(t), list)
+		if (dd->dm_dev->interpose)
+			return bdgrab(dd->dm_dev->bdev);
+
+	return NULL;
+}
+
 /*
  * @suspend_flags: DM_SUSPEND_LOCKFS_FLAG and/or DM_SUSPEND_NOFLUSH_FLAG
  * @task_state: e.g. TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE
@@ -2488,7 +2532,10 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
 {
 	bool do_lockfs = suspend_flags & DM_SUSPEND_LOCKFS_FLAG;
 	bool noflush = suspend_flags & DM_SUSPEND_NOFLUSH_FLAG;
-	int r;
+	bool detach_ip = suspend_flags & DM_SUSPEND_DETACH_IP_FLAG
+			 && md->interpose;
+	struct block_device *original_bdev = NULL;
+	int r = 0;
 
 	lockdep_assert_held(&md->suspend_lock);
 
@@ -2507,18 +2554,50 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
 	 */
 	dm_table_presuspend_targets(map);
 
+	if (!md->interpose) {
+		/*
+		 * Flush I/O to the device.
+		 * Any I/O submitted after lock_fs() may not be flushed.
+		 * noflush takes precedence over do_lockfs.
+		 * (lock_fs() flushes I/Os and waits for them to complete.)
+		 */
+		if (!noflush && do_lockfs)
+			r = lock_fs(md);
+	} else if (map) {
+		/*
+		 * Interposer should not lock mapped device, but
+		 * should freeze interposed device and lock it.
+		 */
+		original_bdev = get_interposed_bdev(map);
+		if (!original_bdev) {
+			r = -EINVAL;
+			DMERR("%s: interposer cannot get interposed device from table",
+				dm_device_name(md));
+			goto presuspend_undo;
+		}
+
+		if (!noflush && do_lockfs) {
+			r = lock_bdev_fs(md, original_bdev);
+			if (r) {
+				DMERR("%s: interposer cannot freeze interposed device",
+					dm_device_name(md));
+				goto presuspend_undo;
+			}
+		}
+
+		bdev_interposer_lock(original_bdev);
+	}
 	/*
-	 * Flush I/O to the device.
-	 * Any I/O submitted after lock_fs() may not be flushed.
-	 * noflush takes precedence over do_lockfs.
-	 * (lock_fs() flushes I/Os and waits for them to complete.)
+	 * If map is not initialized, then we cannot suspend
+	 * interposed device
 	 */
-	if (!noflush && do_lockfs) {
-		r = lock_fs(md);
-		if (r) {
-			dm_table_presuspend_undo_targets(map);
-			return r;
-		}
+
+presuspend_undo:
+	if (r) {
+		if (original_bdev)
+			bdput(original_bdev);
+		dm_table_presuspend_undo_targets(map);
+		return r;
 	}
 
 	/*
@@ -2559,14 +2638,40 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
 	if (map)
 		synchronize_srcu(&md->io_barrier);
 
-	/* were we interrupted ? */
-	if (r < 0) {
+	if (r == 0) { /* the wait ended successfully */
+		if (md->interpose && original_bdev) {
+			if (detach_ip) {
+				bdev_interposer_detach(original_bdev);
+				clear_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
+			}
+
+			bdev_interposer_unlock(original_bdev);
+
+			if (detach_ip) {
+				/*
+				 * If th interposer is detached, then there is
+				 * no reason in keeping the queue of the
+				 * interposed device stopped.
+				 */
+				unlock_bdev_fs(md, original_bdev);
+			}
+
+			bdput(original_bdev);
+		}
+	} else { /* were we interrupted ? */
 		dm_queue_flush(md);
 
 		if (dm_request_based(md))
 			dm_start_queue(md->queue);
 
-		unlock_fs(md);
+		if (md->interpose && original_bdev) {
+			bdev_interposer_unlock(original_bdev);
+			unlock_bdev_fs(md, original_bdev);
+
+			bdput(original_bdev);
+		} else
+			unlock_fs(md);
+
 		dm_table_presuspend_undo_targets(map);
 		/* pushback list is already flushed, so skip flush */
 	}
@@ -2574,6 +2679,88 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
 	return r;
 }
 
+int __dm_attach_interposer(struct mapped_device *md)
+{
+	int r;
+	struct dm_table *map;
+	struct block_device *original_bdev = NULL;
+
+	if (dm_interposer_attached(md))
+		return 0;
+
+	map = rcu_dereference_protected(md->map,
+					lockdep_is_held(&md->suspend_lock));
+	if (!map) {
+		DMERR("%s: interposers table is not initialized",
+			dm_device_name(md));
+		return -EINVAL;
+	}
+
+	original_bdev = get_interposed_bdev(map);
+	if (!original_bdev) {
+		DMERR("%s: interposer cannot get interposed device from table",
+			dm_device_name(md));
+		return -EINVAL;
+	}
+
+	bdev_interposer_lock(original_bdev);
+
+	r = bdev_interposer_attach(original_bdev, dm_disk(md)->part0);
+	if (r)
+		DMERR("%s: failed to attach interposer",
+			dm_device_name(md));
+	else
+		set_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
+
+	bdev_interposer_unlock(original_bdev);
+
+	unlock_bdev_fs(md, original_bdev);
+
+	bdput(original_bdev);
+
+	return r;
+}
+
+int __dm_detach_interposer(struct mapped_device *md)
+{
+	struct dm_table *map = NULL;
+	struct block_device *original_bdev;
+
+	if (!dm_interposer_attached(md))
+		return 0;
+	/*
+	 * If mapped device is suspended, but should be detached
+	 * we just detach without freeze fs on interposed device.
+	 */
+	map = rcu_dereference_protected(md->map,
+			lockdep_is_held(&md->suspend_lock));
+	if (!map) {
+		/*
+		 * If table is not initialized then interposed device
+		 * cannot be attached
+		 */
+		DMERR("%s: table is not initialized for device",
+			dm_device_name(md));
+		return -EINVAL;
+	}
+
+	original_bdev = get_interposed_bdev(map);
+	if (!original_bdev) {
+		DMERR("%s: interposer cannot get interposed device from table",
+			dm_device_name(md));
+		return -EINVAL;
+	}
+
+	bdev_interposer_lock(original_bdev);
+
+	bdev_interposer_detach(original_bdev);
+	clear_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
+
+	bdev_interposer_unlock(original_bdev);
+
+	bdput(original_bdev);
+	return 0;
+}
 /*
  * We need to be able to change a mapping table under a mounted
  * filesystem.  For example we might want to move some data in
@@ -2599,7 +2786,11 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 	mutex_lock_nested(&md->suspend_lock, SINGLE_DEPTH_NESTING);
 
 	if (dm_suspended_md(md)) {
-		r = -EINVAL;
+		if (suspend_flags & DM_SUSPEND_DETACH_IP_FLAG)
+			r = __dm_detach_interposer(md);
+		else
+			r = -EINVAL;
+
 		goto out_unlock;
 	}
 
@@ -2645,8 +2836,10 @@ static int __dm_resume(struct mapped_device *md, struct dm_table *map)
 	if (dm_request_based(md))
 		dm_start_queue(md->queue);
 
-	unlock_fs(md);
+	if (md->interpose)
+		return __dm_attach_interposer(md);
 
+	unlock_fs(md);
 	return 0;
 }
 
@@ -2880,6 +3073,11 @@ int dm_suspended_md(struct mapped_device *md)
 	return test_bit(DMF_SUSPENDED, &md->flags);
 }
 
+int dm_interposer_attached(struct mapped_device *md)
+{
+	return test_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
+}
+
 static int dm_post_suspending_md(struct mapped_device *md)
 {
 	return test_bit(DMF_POST_SUSPENDING, &md->flags);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index b441ad772c18..35f71e48abd1 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -28,6 +28,7 @@
  */
 #define DM_SUSPEND_LOCKFS_FLAG		(1 << 0)
 #define DM_SUSPEND_NOFLUSH_FLAG		(1 << 1)
+#define DM_SUSPEND_DETACH_IP_FLAG	(1 << 2)
 
 /*
  * Status feature flags
@@ -122,6 +123,11 @@ int dm_deleting_md(struct mapped_device *md);
  */
 int dm_suspended_md(struct mapped_device *md);
 
+/*
+ * Is the interposer of this mapped_device is attached?
+ */
+int dm_interposer_attached(struct mapped_device *md);
+
 /*
  * Internal suspend and resume methods.
  */
@@ -180,7 +186,7 @@ int dm_lock_for_deletion(struct mapped_device *md, bool mark_deferred, bool only
 int dm_cancel_deferred_remove(struct mapped_device *md);
 int dm_request_based(struct mapped_device *md);
 int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
-			struct dm_dev **result);
+			bool interpose, struct dm_dev **result);
 void dm_put_table_device(struct mapped_device *md, struct dm_dev *d);
 
 int dm_kobject_uevent(struct mapped_device *md, enum kobject_action action,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 5c641f930caf..3a7abb347702 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -159,6 +159,7 @@ struct dm_dev {
 	struct block_device *bdev;
 	struct dax_device *dax_dev;
 	fmode_t mode;
+	bool interpose;
 	char name[16];
 };
 
diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
index fcff6669137b..73a5b712cd0d 100644
--- a/include/uapi/linux/dm-ioctl.h
+++ b/include/uapi/linux/dm-ioctl.h
@@ -362,4 +362,10 @@ enum {
  */
 #define DM_INTERNAL_SUSPEND_FLAG	(1 << 18) /* Out */
 
+/*
+ * If set, the underlying device should open without FMODE_EXCL
+ * and attach mapped device via bdev_interposer.
+ */
+#define DM_INTERPOSE_FLAG		(1 << 19) /* In/Out */
+
 #endif				/* _LINUX_DM_IOCTL_H */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v8 4/4] fix origin_map - don't split a bio for the origin device if it does not have registered snapshots.
  2021-04-09 11:48 [PATCH v8 0/4] block device interposer Sergei Shtepa
                   ` (2 preceding siblings ...)
  2021-04-09 11:48 ` [PATCH v8 3/4] Adds blk_interposer to md Sergei Shtepa
@ 2021-04-09 11:48 ` Sergei Shtepa
  2021-04-09 15:23 ` [PATCH v8 0/4] block device interposer Mike Snitzer
  4 siblings, 0 replies; 11+ messages in thread
From: Sergei Shtepa @ 2021-04-09 11:48 UTC (permalink / raw)
  To: Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Alasdair Kergon, Alexander Viro, Jens Axboe, dm-devel,
	linux-fsdevel, linux-block, linux-kernel
  Cc: sergei.shtepa, pavel.tide

Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
 drivers/md/dm-snap.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 11890db71f3f..81e8e3bb6d25 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2685,11 +2685,18 @@ static int origin_map(struct dm_target *ti, struct bio *bio)
 	if (bio_data_dir(bio) != WRITE)
 		return DM_MAPIO_REMAPPED;
 
-	available_sectors = o->split_boundary -
-		((unsigned)bio->bi_iter.bi_sector & (o->split_boundary - 1));
+	/*
+	 * If no snapshot is connected to origin, then split_boundary
+	 * will be set to zero. In this case, we don't need to split a bio.
+	 */
+	if (o->split_boundary) {
+		available_sectors = o->split_boundary -
+			((unsigned int)bio->bi_iter.bi_sector &
+				(o->split_boundary - 1));
 
-	if (bio_sectors(bio) > available_sectors)
-		dm_accept_partial_bio(bio, available_sectors);
+		if (bio_sectors(bio) > available_sectors)
+			dm_accept_partial_bio(bio, available_sectors);
+	}
 
 	/* Only tell snapshots if this is a write */
 	return do_origin(o->dev, bio, true);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v8 3/4] Adds blk_interposer to md.
  2021-04-09 11:48 ` [PATCH v8 3/4] Adds blk_interposer to md Sergei Shtepa
@ 2021-04-09 14:12   ` kernel test robot
  2021-04-09 14:39   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2021-04-09 14:12 UTC (permalink / raw)
  To: Sergei Shtepa, Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Alasdair Kergon, Alexander Viro, Jens Axboe, dm-devel,
	linux-fsdevel, linux-block, linux-kernel
  Cc: kbuild-all, clang-built-linux

[-- Attachment #1: Type: text/plain, Size: 6996 bytes --]

Hi Sergei,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on block/for-next]
[also build test WARNING on hch-configfs/for-next v5.12-rc6]
[cannot apply to dm/for-next next-20210409]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Sergei-Shtepa/block-device-interposer/20210409-194943
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: arm-randconfig-r025-20210409 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project dd453a1389b6a7e6d9214b449d3c54981b1a89b6)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/0day-ci/linux/commit/df79fb333cb0a1263a1f03f54de425507e3c2238
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Sergei-Shtepa/block-device-interposer/20210409-194943
        git checkout df79fb333cb0a1263a1f03f54de425507e3c2238
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/md/dm.c:2682:5: warning: no previous prototype for function '__dm_attach_interposer' [-Wmissing-prototypes]
   int __dm_attach_interposer(struct mapped_device *md)
       ^
   drivers/md/dm.c:2682:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int __dm_attach_interposer(struct mapped_device *md)
   ^
   static 
>> drivers/md/dm.c:2724:5: warning: no previous prototype for function '__dm_detach_interposer' [-Wmissing-prototypes]
   int __dm_detach_interposer(struct mapped_device *md)
       ^
   drivers/md/dm.c:2724:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int __dm_detach_interposer(struct mapped_device *md)
   ^
   static 
   2 warnings generated.


vim +/__dm_attach_interposer +2682 drivers/md/dm.c

  2681	
> 2682	int __dm_attach_interposer(struct mapped_device *md)
  2683	{
  2684		int r;
  2685		struct dm_table *map;
  2686		struct block_device *original_bdev = NULL;
  2687	
  2688		if (dm_interposer_attached(md))
  2689			return 0;
  2690	
  2691		map = rcu_dereference_protected(md->map,
  2692						lockdep_is_held(&md->suspend_lock));
  2693		if (!map) {
  2694			DMERR("%s: interposers table is not initialized",
  2695				dm_device_name(md));
  2696			return -EINVAL;
  2697		}
  2698	
  2699		original_bdev = get_interposed_bdev(map);
  2700		if (!original_bdev) {
  2701			DMERR("%s: interposer cannot get interposed device from table",
  2702				dm_device_name(md));
  2703			return -EINVAL;
  2704		}
  2705	
  2706		bdev_interposer_lock(original_bdev);
  2707	
  2708		r = bdev_interposer_attach(original_bdev, dm_disk(md)->part0);
  2709		if (r)
  2710			DMERR("%s: failed to attach interposer",
  2711				dm_device_name(md));
  2712		else
  2713			set_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
  2714	
  2715		bdev_interposer_unlock(original_bdev);
  2716	
  2717		unlock_bdev_fs(md, original_bdev);
  2718	
  2719		bdput(original_bdev);
  2720	
  2721		return r;
  2722	}
  2723	
> 2724	int __dm_detach_interposer(struct mapped_device *md)
  2725	{
  2726		struct dm_table *map = NULL;
  2727		struct block_device *original_bdev;
  2728	
  2729		if (!dm_interposer_attached(md))
  2730			return 0;
  2731		/*
  2732		 * If mapped device is suspended, but should be detached
  2733		 * we just detach without freeze fs on interposed device.
  2734		 */
  2735		map = rcu_dereference_protected(md->map,
  2736				lockdep_is_held(&md->suspend_lock));
  2737		if (!map) {
  2738			/*
  2739			 * If table is not initialized then interposed device
  2740			 * cannot be attached
  2741			 */
  2742			DMERR("%s: table is not initialized for device",
  2743				dm_device_name(md));
  2744			return -EINVAL;
  2745		}
  2746	
  2747		original_bdev = get_interposed_bdev(map);
  2748		if (!original_bdev) {
  2749			DMERR("%s: interposer cannot get interposed device from table",
  2750				dm_device_name(md));
  2751			return -EINVAL;
  2752		}
  2753	
  2754		bdev_interposer_lock(original_bdev);
  2755	
  2756		bdev_interposer_detach(original_bdev);
  2757		clear_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
  2758	
  2759		bdev_interposer_unlock(original_bdev);
  2760	
  2761		bdput(original_bdev);
  2762		return 0;
  2763	}
  2764	/*
  2765	 * We need to be able to change a mapping table under a mounted
  2766	 * filesystem.  For example we might want to move some data in
  2767	 * the background.  Before the table can be swapped with
  2768	 * dm_bind_table, dm_suspend must be called to flush any in
  2769	 * flight bios and ensure that any further io gets deferred.
  2770	 */
  2771	/*
  2772	 * Suspend mechanism in request-based dm.
  2773	 *
  2774	 * 1. Flush all I/Os by lock_fs() if needed.
  2775	 * 2. Stop dispatching any I/O by stopping the request_queue.
  2776	 * 3. Wait for all in-flight I/Os to be completed or requeued.
  2777	 *
  2778	 * To abort suspend, start the request_queue.
  2779	 */
  2780	int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
  2781	{
  2782		struct dm_table *map = NULL;
  2783		int r = 0;
  2784	
  2785	retry:
  2786		mutex_lock_nested(&md->suspend_lock, SINGLE_DEPTH_NESTING);
  2787	
  2788		if (dm_suspended_md(md)) {
  2789			if (suspend_flags & DM_SUSPEND_DETACH_IP_FLAG)
  2790				r = __dm_detach_interposer(md);
  2791			else
  2792				r = -EINVAL;
  2793	
  2794			goto out_unlock;
  2795		}
  2796	
  2797		if (dm_suspended_internally_md(md)) {
  2798			/* already internally suspended, wait for internal resume */
  2799			mutex_unlock(&md->suspend_lock);
  2800			r = wait_on_bit(&md->flags, DMF_SUSPENDED_INTERNALLY, TASK_INTERRUPTIBLE);
  2801			if (r)
  2802				return r;
  2803			goto retry;
  2804		}
  2805	
  2806		map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
  2807	
  2808		r = __dm_suspend(md, map, suspend_flags, TASK_INTERRUPTIBLE, DMF_SUSPENDED);
  2809		if (r)
  2810			goto out_unlock;
  2811	
  2812		set_bit(DMF_POST_SUSPENDING, &md->flags);
  2813		dm_table_postsuspend_targets(map);
  2814		clear_bit(DMF_POST_SUSPENDING, &md->flags);
  2815	
  2816	out_unlock:
  2817		mutex_unlock(&md->suspend_lock);
  2818		return r;
  2819	}
  2820	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29920 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v8 3/4] Adds blk_interposer to md.
  2021-04-09 11:48 ` [PATCH v8 3/4] Adds blk_interposer to md Sergei Shtepa
  2021-04-09 14:12   ` kernel test robot
@ 2021-04-09 14:39   ` kernel test robot
  2021-04-09 17:03   ` kernel test robot
  2021-04-09 17:03   ` [RFC PATCH] __dm_attach_interposer() can be static kernel test robot
  3 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2021-04-09 14:39 UTC (permalink / raw)
  To: Sergei Shtepa, Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Alasdair Kergon, Alexander Viro, Jens Axboe, dm-devel,
	linux-fsdevel, linux-block, linux-kernel
  Cc: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 6260 bytes --]

Hi Sergei,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on block/for-next]
[also build test WARNING on hch-configfs/for-next v5.12-rc6]
[cannot apply to dm/for-next next-20210409]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Sergei-Shtepa/block-device-interposer/20210409-194943
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: i386-randconfig-m021-20210409 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/df79fb333cb0a1263a1f03f54de425507e3c2238
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Sergei-Shtepa/block-device-interposer/20210409-194943
        git checkout df79fb333cb0a1263a1f03f54de425507e3c2238
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/md/dm.c:2682:5: warning: no previous prototype for '__dm_attach_interposer' [-Wmissing-prototypes]
    2682 | int __dm_attach_interposer(struct mapped_device *md)
         |     ^~~~~~~~~~~~~~~~~~~~~~
>> drivers/md/dm.c:2724:5: warning: no previous prototype for '__dm_detach_interposer' [-Wmissing-prototypes]
    2724 | int __dm_detach_interposer(struct mapped_device *md)
         |     ^~~~~~~~~~~~~~~~~~~~~~


vim +/__dm_attach_interposer +2682 drivers/md/dm.c

  2681	
> 2682	int __dm_attach_interposer(struct mapped_device *md)
  2683	{
  2684		int r;
  2685		struct dm_table *map;
  2686		struct block_device *original_bdev = NULL;
  2687	
  2688		if (dm_interposer_attached(md))
  2689			return 0;
  2690	
  2691		map = rcu_dereference_protected(md->map,
  2692						lockdep_is_held(&md->suspend_lock));
  2693		if (!map) {
  2694			DMERR("%s: interposers table is not initialized",
  2695				dm_device_name(md));
  2696			return -EINVAL;
  2697		}
  2698	
  2699		original_bdev = get_interposed_bdev(map);
  2700		if (!original_bdev) {
  2701			DMERR("%s: interposer cannot get interposed device from table",
  2702				dm_device_name(md));
  2703			return -EINVAL;
  2704		}
  2705	
  2706		bdev_interposer_lock(original_bdev);
  2707	
  2708		r = bdev_interposer_attach(original_bdev, dm_disk(md)->part0);
  2709		if (r)
  2710			DMERR("%s: failed to attach interposer",
  2711				dm_device_name(md));
  2712		else
  2713			set_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
  2714	
  2715		bdev_interposer_unlock(original_bdev);
  2716	
  2717		unlock_bdev_fs(md, original_bdev);
  2718	
  2719		bdput(original_bdev);
  2720	
  2721		return r;
  2722	}
  2723	
> 2724	int __dm_detach_interposer(struct mapped_device *md)
  2725	{
  2726		struct dm_table *map = NULL;
  2727		struct block_device *original_bdev;
  2728	
  2729		if (!dm_interposer_attached(md))
  2730			return 0;
  2731		/*
  2732		 * If mapped device is suspended, but should be detached
  2733		 * we just detach without freeze fs on interposed device.
  2734		 */
  2735		map = rcu_dereference_protected(md->map,
  2736				lockdep_is_held(&md->suspend_lock));
  2737		if (!map) {
  2738			/*
  2739			 * If table is not initialized then interposed device
  2740			 * cannot be attached
  2741			 */
  2742			DMERR("%s: table is not initialized for device",
  2743				dm_device_name(md));
  2744			return -EINVAL;
  2745		}
  2746	
  2747		original_bdev = get_interposed_bdev(map);
  2748		if (!original_bdev) {
  2749			DMERR("%s: interposer cannot get interposed device from table",
  2750				dm_device_name(md));
  2751			return -EINVAL;
  2752		}
  2753	
  2754		bdev_interposer_lock(original_bdev);
  2755	
  2756		bdev_interposer_detach(original_bdev);
  2757		clear_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
  2758	
  2759		bdev_interposer_unlock(original_bdev);
  2760	
  2761		bdput(original_bdev);
  2762		return 0;
  2763	}
  2764	/*
  2765	 * We need to be able to change a mapping table under a mounted
  2766	 * filesystem.  For example we might want to move some data in
  2767	 * the background.  Before the table can be swapped with
  2768	 * dm_bind_table, dm_suspend must be called to flush any in
  2769	 * flight bios and ensure that any further io gets deferred.
  2770	 */
  2771	/*
  2772	 * Suspend mechanism in request-based dm.
  2773	 *
  2774	 * 1. Flush all I/Os by lock_fs() if needed.
  2775	 * 2. Stop dispatching any I/O by stopping the request_queue.
  2776	 * 3. Wait for all in-flight I/Os to be completed or requeued.
  2777	 *
  2778	 * To abort suspend, start the request_queue.
  2779	 */
  2780	int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
  2781	{
  2782		struct dm_table *map = NULL;
  2783		int r = 0;
  2784	
  2785	retry:
  2786		mutex_lock_nested(&md->suspend_lock, SINGLE_DEPTH_NESTING);
  2787	
  2788		if (dm_suspended_md(md)) {
  2789			if (suspend_flags & DM_SUSPEND_DETACH_IP_FLAG)
  2790				r = __dm_detach_interposer(md);
  2791			else
  2792				r = -EINVAL;
  2793	
  2794			goto out_unlock;
  2795		}
  2796	
  2797		if (dm_suspended_internally_md(md)) {
  2798			/* already internally suspended, wait for internal resume */
  2799			mutex_unlock(&md->suspend_lock);
  2800			r = wait_on_bit(&md->flags, DMF_SUSPENDED_INTERNALLY, TASK_INTERRUPTIBLE);
  2801			if (r)
  2802				return r;
  2803			goto retry;
  2804		}
  2805	
  2806		map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
  2807	
  2808		r = __dm_suspend(md, map, suspend_flags, TASK_INTERRUPTIBLE, DMF_SUSPENDED);
  2809		if (r)
  2810			goto out_unlock;
  2811	
  2812		set_bit(DMF_POST_SUSPENDING, &md->flags);
  2813		dm_table_postsuspend_targets(map);
  2814		clear_bit(DMF_POST_SUSPENDING, &md->flags);
  2815	
  2816	out_unlock:
  2817		mutex_unlock(&md->suspend_lock);
  2818		return r;
  2819	}
  2820	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 39378 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v8 0/4] block device interposer
  2021-04-09 11:48 [PATCH v8 0/4] block device interposer Sergei Shtepa
                   ` (3 preceding siblings ...)
  2021-04-09 11:48 ` [PATCH v8 4/4] fix origin_map - don't split a bio for the origin device if it does not have registered snapshots Sergei Shtepa
@ 2021-04-09 15:23 ` Mike Snitzer
  2021-04-13 10:12   ` Sergei Shtepa
  4 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2021-04-09 15:23 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Christoph Hellwig, Hannes Reinecke, Alasdair Kergon,
	Alexander Viro, Jens Axboe, dm-devel, linux-fsdevel, linux-block,
	linux-kernel, pavel.tide

On Fri, Apr 09 2021 at  7:48am -0400,
Sergei Shtepa <sergei.shtepa@veeam.com> wrote:

> I think I'm ready to suggest the next version of block device interposer
> (blk_interposer). It allows to redirect bio requests to other block
> devices.
> 
> In this series of patches, I reviewed the process of attaching and
> detaching device mapper via blk_interposer.
> 
> Now the dm-target is attached to the interposed block device when the
> interposer dm-target is fully ready to accept requests, and the interposed
> block device queue is locked, and the file system on it is frozen.
> The detaching is also performed when the file system on the interposed
> block device is in a frozen state, the queue is locked, and the interposer
> dm-target is suspended.
> 
> To make it possible to lock the receipt of new bio requests without locking
> the processing of bio requests that the interposer creates, I had to change
> the submit_bio_noacct() function and add a lock. To minimize the impact of
> locking, I chose percpu_rw_sem. I tried to do without a new lock, but I'm
> afraid it's impossible.
> 
> Checking the operation of the interposer, I did not limit myself to
> a simple dm-linear. When I experimented with dm-era, I noticed that it
> accepts two block devices. Since Mike was against changing the logic in
> the dm-targets itself to support the interrupter, I decided to add the
> [interpose] option to the block device path.
> 
>  echo "0 ${DEV_SZ} era ${META} [interpose]${DEV} ${BLK_SZ}" | \
>  	dmsetup create dm-era --interpose
> 
> I believe this option can replace the DM_INTERPOSE_FLAG flag. Of course,
> we can assume that if the device cannot be opened with the FMODE_EXCL,
> then it is considered an interposed device, but it seems to me that
> algorithm is unsafe. I hope to get Mike's opinion on this.
> 
> I have successfully tried taking snapshots. But I ran into a problem
> when I removed origin-target:
> [   49.031156] ------------[ cut here ]------------
> [   49.031180] kernel BUG at block/bio.c:1476!
> [   49.031198] invalid opcode: 0000 [#1] SMP NOPTI
> [   49.031213] CPU: 9 PID: 636 Comm: dmsetup Tainted: G            E     5.12.0-rc6-ip+ #52
> [   49.031235] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [   49.031257] RIP: 0010:bio_split+0x74/0x80
> [   49.031273] Code: 89 c7 e8 5f 56 03 00 41 8b 74 24 28 48 89 ef e8 12 ea ff ff f6 45 15 01 74 08 66 41 81 4c 24 14 00 01 4c 89 e0 5b 5d 41 5c c3 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb ed 90 0f 1f 44 00 00 39 77 28 76 05
> [   49.031322] RSP: 0018:ffff9a6100993ab0 EFLAGS: 00010246
> [   49.031337] RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8e26938f96d8
> [   49.031357] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff8e26937d1300
> [   49.031375] RBP: ffff8e2692ddc000 R08: 0000000000000000 R09: 0000000000000000
> [   49.031394] R10: ffff8e2692b1de00 R11: ffff8e2692b1de58 R12: ffff8e26937d1300
> [   49.031413] R13: ffff8e2692ddcd18 R14: ffff8e2691d22140 R15: ffff8e26937d1300
> [   49.031432] FS:  00007efffa6e7800(0000) GS:ffff8e269bc80000(0000) knlGS:0000000000000000
> [   49.031453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   49.031470] CR2: 00007efffa96cda0 CR3: 0000000114bd0000 CR4: 00000000000506e0
> [   49.031490] Call Trace:
> [   49.031501]  dm_submit_bio+0x383/0x500 [dm_mod]
> [   49.031522]  submit_bio_noacct+0x370/0x770
> [   49.031537]  submit_bh_wbc+0x160/0x190
> [   49.031550]  __sync_dirty_buffer+0x65/0x130
> [   49.031564]  ext4_commit_super+0xbc/0x120 [ext4]
> [   49.031602]  ext4_freeze+0x54/0x80 [ext4]
> [   49.031631]  freeze_super+0xc8/0x160
> [   49.031643]  freeze_bdev+0xb2/0xc0
> [   49.031654]  lock_bdev_fs+0x1c/0x30 [dm_mod]
> [   49.031671]  __dm_suspend+0x2b9/0x3b0 [dm_mod]
> [   49.032095]  dm_suspend+0xed/0x160 [dm_mod]
> [   49.032496]  ? __find_device_hash_cell+0x5b/0x2a0 [dm_mod]
> [   49.032897]  ? remove_all+0x30/0x30 [dm_mod]
> [   49.033299]  dev_remove+0x4c/0x1c0 [dm_mod]
> [   49.033679]  ctl_ioctl+0x1a5/0x470 [dm_mod]
> [   49.034067]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
> [   49.034432]  __x64_sys_ioctl+0x83/0xb0
> [   49.034785]  do_syscall_64+0x33/0x80
> [   49.035139]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> When suspend is executed for origin-target before the interposer is
> being detached, in the origin_map() function the value of the
> o->split_binary variable is zero, since no snapshots were connected to it.
> I think that if no snapshots are connected, then it does not make sense
> to split the bio request into parts.

The dm-snapshot code requires careful order of operations.  You say you
removed the origin target.. please show exactly what you did.  Your 4th
patch shouldn't be tied to this patchset. Can be dealt with
independently.

> Changes summary for this patchset v7:
>   * The attaching and detaching to interposed device moved to
>     __dm_suspend() and __dm_resume() functions.

Why? Those hooks are inherently more constrained.  And in the case of
resume, failure is not an option.

>   * Redesigned th submit_bio_noacct() function and added a lock for the
>     block device interposer.
>   * Adds [interpose] option to block device patch in dm table.

I'm struggling to see why you need "[interpose]" (never mind that this
idea of device options is a new construct): what are the implications?
Are you saying that a table will have N devices with only a subset that
are interposed?

Just feels very awkward but I'll try to keep an open mind until I can
better understand.

>   * Fix origin_map() then o->split_binary value is zero.

Overall this effort, while appreciated in general, is getting more and
more muddled -- you're having to sprinkle obscure code all over DM. And
your patch headers are severely lacking for a v8 patch
submission. Terse bullet points don't paint a very comprehensive
picture. Please detail how a user is expected to drive this (either in
patch headers and/or some Documentation file).

Mike


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v8 3/4] Adds blk_interposer to md.
  2021-04-09 11:48 ` [PATCH v8 3/4] Adds blk_interposer to md Sergei Shtepa
  2021-04-09 14:12   ` kernel test robot
  2021-04-09 14:39   ` kernel test robot
@ 2021-04-09 17:03   ` kernel test robot
  2021-04-09 17:03   ` [RFC PATCH] __dm_attach_interposer() can be static kernel test robot
  3 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2021-04-09 17:03 UTC (permalink / raw)
  To: Sergei Shtepa, Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Alasdair Kergon, Alexander Viro, Jens Axboe, dm-devel,
	linux-fsdevel, linux-block, linux-kernel
  Cc: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2980 bytes --]

Hi Sergei,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on block/for-next]
[also build test WARNING on hch-configfs/for-next v5.12-rc6]
[cannot apply to dm/for-next next-20210409]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Sergei-Shtepa/block-device-interposer/20210409-194943
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: x86_64-randconfig-s032-20210409 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-279-g6d5d9b42-dirty
        # https://github.com/0day-ci/linux/commit/df79fb333cb0a1263a1f03f54de425507e3c2238
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Sergei-Shtepa/block-device-interposer/20210409-194943
        git checkout df79fb333cb0a1263a1f03f54de425507e3c2238
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> drivers/md/dm-table.c:337:63: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted fmode_t [usertype] mode @@     got bool [usertype] interpose @@
   drivers/md/dm-table.c:337:63: sparse:     expected restricted fmode_t [usertype] mode
   drivers/md/dm-table.c:337:63: sparse:     got bool [usertype] interpose
--
>> drivers/md/dm.c:2682:5: sparse: sparse: symbol '__dm_attach_interposer' was not declared. Should it be static?
>> drivers/md/dm.c:2724:5: sparse: sparse: symbol '__dm_detach_interposer' was not declared. Should it be static?

Please review and possibly fold the followup patch.

vim +337 drivers/md/dm-table.c

   322	
   323	/*
   324	 * This upgrades the mode on an already open dm_dev, being
   325	 * careful to leave things as they were if we fail to reopen the
   326	 * device and not to touch the existing bdev field in case
   327	 * it is accessed concurrently.
   328	 */
   329	static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
   330				bool interpose, struct mapped_device *md)
   331	{
   332		int r;
   333		struct dm_dev *old_dev, *new_dev;
   334	
   335		old_dev = dd->dm_dev;
   336	
 > 337		r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, interpose,
   338					dd->dm_dev->mode | new_mode, &new_dev);
   339		if (r)
   340			return r;
   341	
   342		dd->dm_dev = new_dev;
   343		dm_put_table_device(md, old_dev);
   344	
   345		return 0;
   346	}
   347	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31563 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH] __dm_attach_interposer() can be static
  2021-04-09 11:48 ` [PATCH v8 3/4] Adds blk_interposer to md Sergei Shtepa
                     ` (2 preceding siblings ...)
  2021-04-09 17:03   ` kernel test robot
@ 2021-04-09 17:03   ` kernel test robot
  3 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2021-04-09 17:03 UTC (permalink / raw)
  To: Sergei Shtepa, Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Alasdair Kergon, Alexander Viro, Jens Axboe, dm-devel,
	linux-fsdevel, linux-block, linux-kernel
  Cc: kbuild-all


Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
---
 dm.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 04142454c4eed..2a584c2103f3a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2679,7 +2679,7 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
 	return r;
 }
 
-int __dm_attach_interposer(struct mapped_device *md)
+static int __dm_attach_interposer(struct mapped_device *md)
 {
 	int r;
 	struct dm_table *map;
@@ -2721,7 +2721,7 @@ int __dm_attach_interposer(struct mapped_device *md)
 	return r;
 }
 
-int __dm_detach_interposer(struct mapped_device *md)
+static int __dm_detach_interposer(struct mapped_device *md)
 {
 	struct dm_table *map = NULL;
 	struct block_device *original_bdev;

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v8 0/4] block device interposer
  2021-04-09 15:23 ` [PATCH v8 0/4] block device interposer Mike Snitzer
@ 2021-04-13 10:12   ` Sergei Shtepa
  0 siblings, 0 replies; 11+ messages in thread
From: Sergei Shtepa @ 2021-04-13 10:12 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Hannes Reinecke, Alasdair Kergon,
	Alexander Viro, Jens Axboe, dm-devel, linux-fsdevel, linux-block,
	linux-kernel, Pavel Tide

The 04/09/2021 18:23, Mike Snitzer wrote:
> On Fri, Apr 09 2021 at  7:48am -0400,
> Sergei Shtepa <sergei.shtepa@veeam.com> wrote:
> 
> > I think I'm ready to suggest the next version of block device interposer
> > (blk_interposer). It allows to redirect bio requests to other block
> > devices.
> > 
> > In this series of patches, I reviewed the process of attaching and
> > detaching device mapper via blk_interposer.
> > 
> > Now the dm-target is attached to the interposed block device when the
> > interposer dm-target is fully ready to accept requests, and the interposed
> > block device queue is locked, and the file system on it is frozen.
> > The detaching is also performed when the file system on the interposed
> > block device is in a frozen state, the queue is locked, and the interposer
> > dm-target is suspended.
> > 
> > To make it possible to lock the receipt of new bio requests without locking
> > the processing of bio requests that the interposer creates, I had to change
> > the submit_bio_noacct() function and add a lock. To minimize the impact of
> > locking, I chose percpu_rw_sem. I tried to do without a new lock, but I'm
> > afraid it's impossible.
> > 
> > Checking the operation of the interposer, I did not limit myself to
> > a simple dm-linear. When I experimented with dm-era, I noticed that it
> > accepts two block devices. Since Mike was against changing the logic in
> > the dm-targets itself to support the interrupter, I decided to add the
> > [interpose] option to the block device path.
> > 
> >  echo "0 ${DEV_SZ} era ${META} [interpose]${DEV} ${BLK_SZ}" | \
> >  	dmsetup create dm-era --interpose
> > 
> > I believe this option can replace the DM_INTERPOSE_FLAG flag. Of course,
> > we can assume that if the device cannot be opened with the FMODE_EXCL,
> > then it is considered an interposed device, but it seems to me that
> > algorithm is unsafe. I hope to get Mike's opinion on this.
> > 
> > I have successfully tried taking snapshots. But I ran into a problem
> > when I removed origin-target:
> > [   49.031156] ------------[ cut here ]------------
> > [   49.031180] kernel BUG at block/bio.c:1476!
> > [   49.031198] invalid opcode: 0000 [#1] SMP NOPTI
> > [   49.031213] CPU: 9 PID: 636 Comm: dmsetup Tainted: G            E     5.12.0-rc6-ip+ #52
> > [   49.031235] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> > [   49.031257] RIP: 0010:bio_split+0x74/0x80
> > [   49.031273] Code: 89 c7 e8 5f 56 03 00 41 8b 74 24 28 48 89 ef e8 12 ea ff ff f6 45 15 01 74 08 66 41 81 4c 24 14 00 01 4c 89 e0 5b 5d 41 5c c3 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb ed 90 0f 1f 44 00 00 39 77 28 76 05
> > [   49.031322] RSP: 0018:ffff9a6100993ab0 EFLAGS: 00010246
> > [   49.031337] RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8e26938f96d8
> > [   49.031357] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff8e26937d1300
> > [   49.031375] RBP: ffff8e2692ddc000 R08: 0000000000000000 R09: 0000000000000000
> > [   49.031394] R10: ffff8e2692b1de00 R11: ffff8e2692b1de58 R12: ffff8e26937d1300
> > [   49.031413] R13: ffff8e2692ddcd18 R14: ffff8e2691d22140 R15: ffff8e26937d1300
> > [   49.031432] FS:  00007efffa6e7800(0000) GS:ffff8e269bc80000(0000) knlGS:0000000000000000
> > [   49.031453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   49.031470] CR2: 00007efffa96cda0 CR3: 0000000114bd0000 CR4: 00000000000506e0
> > [   49.031490] Call Trace:
> > [   49.031501]  dm_submit_bio+0x383/0x500 [dm_mod]
> > [   49.031522]  submit_bio_noacct+0x370/0x770
> > [   49.031537]  submit_bh_wbc+0x160/0x190
> > [   49.031550]  __sync_dirty_buffer+0x65/0x130
> > [   49.031564]  ext4_commit_super+0xbc/0x120 [ext4]
> > [   49.031602]  ext4_freeze+0x54/0x80 [ext4]
> > [   49.031631]  freeze_super+0xc8/0x160
> > [   49.031643]  freeze_bdev+0xb2/0xc0
> > [   49.031654]  lock_bdev_fs+0x1c/0x30 [dm_mod]
> > [   49.031671]  __dm_suspend+0x2b9/0x3b0 [dm_mod]
> > [   49.032095]  dm_suspend+0xed/0x160 [dm_mod]
> > [   49.032496]  ? __find_device_hash_cell+0x5b/0x2a0 [dm_mod]
> > [   49.032897]  ? remove_all+0x30/0x30 [dm_mod]
> > [   49.033299]  dev_remove+0x4c/0x1c0 [dm_mod]
> > [   49.033679]  ctl_ioctl+0x1a5/0x470 [dm_mod]
> > [   49.034067]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
> > [   49.034432]  __x64_sys_ioctl+0x83/0xb0
> > [   49.034785]  do_syscall_64+0x33/0x80
> > [   49.035139]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > When suspend is executed for origin-target before the interposer is
> > being detached, in the origin_map() function the value of the
> > o->split_binary variable is zero, since no snapshots were connected to it.
> > I think that if no snapshots are connected, then it does not make sense
> > to split the bio request into parts.
> 
> The dm-snapshot code requires careful order of operations.  You say you
> removed the origin target.. please show exactly what you did.  Your 4th
> patch shouldn't be tied to this patchset. Can be dealt with
> independently.

To create a snapshot, the snapshot-origin from dm-snap must be connected to
the device. I do it like this:
 DEV=/dev/nvme0n1p2
 DEV_SZ=$(blockdev --getsz ${DEV})
 echo "0 ${DEV_SZ} snapshot-origin [interpose]${DEV}" | \
 	dmsetup create dm-origin --interpose

Next, I create the snapshot itself and mount it:
 META=/dev/nvme0n1p1
 ORIGIN=/dev/mapper/dm-origin
 echo "0 ${DEV_SZ} snapshot ${ORIGIN} ${META} N 8" | \
 	dmsetup create dm-snapshot
 mount /dev/mapper/dm-snapshot /mnt/snap

Releasing the snapshot:
 umount /mnt/snap
 dmsetup remove dm-snapshot

Remove snapshot-origin:
 dmsetup remove dm-origin

I think it's hard to make a mistake here, although the documentation describes
creating a snapshot using lvcreate, not dmsetup.

As for the fourth patch - I agree - this should be the next step, after the
idea of the interposer is accepted.

> 
> > Changes summary for this patchset v7:
> >   * The attaching and detaching to interposed device moved to
> >     __dm_suspend() and __dm_resume() functions.
> 
> Why? Those hooks are inherently more constrained.  And in the case of
> resume, failure is not an option.
> 
> >   * Redesigned th submit_bio_noacct() function and added a lock for the
> >     block device interposer.
> >   * Adds [interpose] option to block device patch in dm table.
> 
> I'm struggling to see why you need "[interpose]" (never mind that this
> idea of device options is a new construct): what are the implications?
> Are you saying that a table will have N devices with only a subset that
> are interposed?

I'm analyzing how dmsetup works with strace. I get something like this
diagram:
 ioctl(3, DM_VERSION, ...
 ...
 read(0, "0 14675935 snapshot-origin [inte"..., 4096) = 53
 ...
 ioctl(3, DM_DEV_CREATE, ...
 ioctl(3, DM_TABLE_LOAD, ...
 ioctl(3, DM_DEV_SUSPEND, ...

ioctl DM_DEV_SUSPEND without the DM_SUSPEND_FLAG flag, which means that the
do_resume function is started. It turns out that only after the do_resume
DM target works, the target becomes ready to work.

Before that, we cannot attach the interposer, as the bio will not be
successfully processed by the interposer. It turns out that it is at the
final stage of initialization that we can safely connect the interposer.
Note that a special DMF_INTERPOSER_ATTACHED flag was provided, this allows
to connect the interposer only at the first resume.

When removing a DM target, there is a requirement that the DM target is closed
and not used by anyone. This ensures that no new bio requests to the DM target
will be received. But in the case of the interposer, this is not enough.
We need to lock the queue of the original block device and wait for all
previously created bio requests to complete. To do this, run dm_suspend() with
the DM_SUSPEND_DETACH_IP_FLAG flag.

If we look at the do_resume() function, it can finish with an error code for
various malfunctions and their processing is provided.

> 
> Just feels very awkward but I'll try to keep an open mind until I can
> better understand.

Ok. Let's look at a simple example. We need to attach the dm-era using the
interposer. This target uses two devices ${DEV} and ${META}.
 echo "0 `blockdev --getsz ${DEV}` era ${META} ${DEV} 128" | \
 	dmsetup create dm-era --interpose

The ${DEV} device needs to be attached using a interposer, while ${META} is
used to output the result of the module and must be opened in FMODE_EXCL mode.

It turns out that only the dm-target itself depends on which of the devices
can be connected via the interposer, and which can not. The [interpose] option
allows to explicitly specify this.

I don't really like the design with the [interpret] option either.
I think it's best to change the dm_get_device() call and explicitly specify
which device to open via the interposer. It depends on the DM target itself
whether it supports connection via the interposer and for which devices.
This would make the code more visual. But to do this, we will need to change
one line in each existing dm_target. You have already spoken out against
a similar decision. But maybe can you look at this solution again?
It will look something like this:

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index d9ac7372108c..461fd7656751 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1455,14 +1455,16 @@ static int era_ctr(struct dm_target *ti, unsigned argc, char **argv)

 	era->ti = ti;

-	r = dm_get_device(ti, argv[0], FMODE_READ | FMODE_WRITE, &era->metadata_dev);
+	r = dm_get_device(ti, argv[0], FMODE_READ | FMODE_WRITE, false, &era->metadata_dev);
 	if (r) {
 		ti->error = "Error opening metadata device";
 		era_destroy(era);
 		return -EINVAL;
 	}

-	r = dm_get_device(ti, argv[1], FMODE_READ | FMODE_WRITE, &era->origin_dev);
+	r = dm_get_device(ti, argv[1], FMODE_READ | FMODE_WRITE, ti->table->md->interpose, &era->origin_dev);
 	if (r) {
 		ti->error = "Error opening data device";
 		era_destroy(era);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index e5f0f1703c5d..dc08e9b0c2fc 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -327,14 +327,14 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
  * it is accessed concurrently.
  */
 static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
-			struct mapped_device *md)
+			bool interpose, struct mapped_device *md)
 {
 	int r;
 	struct dm_dev *old_dev, *new_dev;

 	old_dev = dd->dm_dev;

-	r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev,
+	r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, interpose,
 				dd->dm_dev->mode | new_mode, &new_dev);
 	if (r)
 		return r;
@@ -363,7 +363,7 @@ EXPORT_SYMBOL_GPL(dm_get_dev_t);
  * it's already present.
  */
 int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
-		  struct dm_dev **result)
+		  bool interpose, struct dm_dev **result)
 {
 	int r;
 	dev_t dev;
@@ -391,7 +391,7 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
 		if (!dd)
 			return -ENOMEM;

-		if ((r = dm_get_table_device(t->md, dev, mode, &dd->dm_dev))) {
+		if ((r = dm_get_table_device(t->md, dev, mode, interpose, &dd->dm_dev))) {
 			kfree(dd);
 			return r;
 		}
@@ -401,7 +401,7 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
 		goto out;

 	} else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
-		r = upgrade_mode(dd, mode, t->md);
+		r = upgrade_mode(dd, mode, interpose, t->md);
 		if (r)
 			return r;
 	}
> 
> >   * Fix origin_map() then o->split_binary value is zero.
> 
> Overall this effort, while appreciated in general, is getting more and
> more muddled -- you're having to sprinkle obscure code all over DM. And
> your patch headers are severely lacking for a v8 patch
> submission. Terse bullet points don't paint a very comprehensive
> picture. Please detail how a user is expected to drive this (either in
> patch headers and/or some Documentation file).
> 
> Mike
> 

Mike, thank you for appreciating my efforts. I'm getting deeper and deeper
into the DM code and trying to add new functionality to it in a harmonious way.
When discussing the previous patch, you were quite right to say that the
connection and disconnection of the interposer was not safe, although the code
looked simpler.
This time I tried to understand in detail the processes of creating and
removing DM targets. I tried to write the code as simple as possible and added
comments to make it as easy as possible to understand.
I think it would be great if you would indicate which code you found
"more muddled". I will be happy to rewrite it or give additional comments.

How a user is expected to drive this - I'll try to describe it in this email.
If this text suits you, I will create documentation based on it in the future.
At the current stage, I would not want to distract the people who are engaged
in checking the documentation, so as not to throw their work into the trash bin
as it happened with the documentation for the v4 patch.

===================
DM & blk_interposer
===================

Usually LVM should be used for new devices. The administrator have to create
logical volumes for the system partition when installing the operating system.
For a running system with partitioned disk space and mounted file systems,
it is quite difficult to reconfigure to logical volumes. As a result, all
the features that Device Mapper provides are not available for non-LVM systems.
This problem is partially solved by the interposer functionality, which uses
the kernel's blk_interposer.

Blk_interposer it allows to redirect bio requests from ordinal block devices
to DM target. It allows to attach interposer to original device "on the fly"
without stopping the execution of users programs.

Interposer for dm-flakey
========================
In a classic dm-flakey application, the /dev/sda1 device must be unmounted.
We have to create a new block device /dev/mapper/test and mount it. ::
 echo "0 `blockdev --getsz /dev/sda1` flakey /dev/sda1 0 1 3" | \
 	dmsetup create test
 mount /dev/mapper/test /mnt/test

The relationship diagram will look like this:
  +-------------+
  | file system |
  +-------------+
        ||
        \/
  +------------------+
  | /dev/mapper/test |
  +------------------+
        ||
        \/
  +-------------+
  | /dev/sda1   |
  +-------------+

blk_interposer allows to connect the DM target to a device that is already
mounted. Adding the --interpose flag to the command::
 echo "0 `blockdev --getsz /dev/sda1` flakey /dev/sda1 0 1 3" | \
     dmsetup create test --interpose

Now the relationship diagram will look like this:
  +-------------+
  | file system |
  +-------------+
        ||
        \/
  +----------------+
  | blk_interposer |
  +----------------+
        ||
        \/
  +--------------+
  | /dev/mapper/ |
  | test         |
  +--------------+
        ||
        \/
  +-------------+
  | /dev/sda1   |
  +-------------+

At the same time, we do not need to remount the file system. The new DM target
was added to the stack "on the fly" unnoticed by the user-space environment.

Interposer for dm-snap
======================

Suppose we have a file system mounted on the block device /dev/sda1::
  +-------------+
  | file system |
  +-------------+
        ||
        \/
  +-------------+
  | /dev/sda1   |
  +-------------+

To create a snapshot of a block device, we need to connect the dm-snap to this
device. To do this, use the --interpose flag when creating snapshot-origin.
 echo "0 `blockdev --getsz /dev/sda1` snapshot-origin /dev/sda1" | \
 	dmsetup create origin --interpose

In this case, thanks to blk_interposer, all bio requests from the file system
will be redirected to the new device /dev/mapper/origin.
Diagram ::

  +-------------+
  | file system |
  +-------------+
        ||
        \/
 +----------------+
 | blk_interposer |
 +----------------+
        ||
        \/
 +--------------+
 | /dev/mapper/ |
 | origin       |
 +--------------+
        || 
        \/
  +-----------+
  | /dev/sda1 |
  +-----------+

To create a snapshot, just use the new device /dev/mapper/origin:
 echo "0 `blockdev --getsz /dev/sda1` snapshot /dev/mapper/origin ${COW_DEVICE} N 8" | \
 	dmesetup create snapshot

Diagram::
  +-------------+       +--------------+
  | file system |       | backup agent |
  +-------------+       +--------------+
        ||                    ||
        \/                    ||
  +----------------+          ||
  | blk_interposer |          ||
  +----------------+          ||
        ||                    ||
        \/                    \/
  +--------------+     +---------------+
  | /dev/mapper/ | <=> | /dev/mapper/  |
  | origin       |     | snapshot      |
  +--------------+     +---------------+
        ||                    ||
        \/                    \/
  +-----------+        +---------------+
  | /dev/sda1 |        | ${COW_DEVICE} |
  +-----------+        +---------------+

The snapshot device on the /dev/mapper/snapshot device is now available for
mounting or backup.

-- 
Sergei Shtepa
Veeam Software developer.


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-04-13 10:12 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-09 11:48 [PATCH v8 0/4] block device interposer Sergei Shtepa
2021-04-09 11:48 ` [PATCH v8 1/4] Adds blk_interposer. It allows to redirect bio requests to another block device Sergei Shtepa
2021-04-09 11:48 ` [PATCH v8 2/4] Adds the blk_interposers logic to __submit_bio_noacct() Sergei Shtepa
2021-04-09 11:48 ` [PATCH v8 3/4] Adds blk_interposer to md Sergei Shtepa
2021-04-09 14:12   ` kernel test robot
2021-04-09 14:39   ` kernel test robot
2021-04-09 17:03   ` kernel test robot
2021-04-09 17:03   ` [RFC PATCH] __dm_attach_interposer() can be static kernel test robot
2021-04-09 11:48 ` [PATCH v8 4/4] fix origin_map - don't split a bio for the origin device if it does not have registered snapshots Sergei Shtepa
2021-04-09 15:23 ` [PATCH v8 0/4] block device interposer Mike Snitzer
2021-04-13 10:12   ` Sergei Shtepa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).