All of lore.kernel.org
 help / color / mirror / Atom feed
* device mapper and the BLKFLSBUF ioctl
@ 2016-10-21 18:33 Mikulas Patocka
  2016-10-21 20:00 ` Mike Snitzer
  0 siblings, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-21 18:33 UTC (permalink / raw)
  To: dm-devel; +Cc: Mike Snitzer, Alasdair G. Kergon

Hi

I found a bug in dm regarding the BLKFLSBUF ioctl.

The BLKFLSBUF ioctl can be called on a block device and it flushes the 
buffer cache. There is one exception - when it is called on ramdisk, it 
actually destroys all ramdisk data (it works like a discard on the full 
device).

The device mapper passes this ioctl down to the underlying device, so when 
the ioctl is called on a logical volume, it can be used to destroy the 
underlying volume group if the volume group is on ramdisk.

For example:
# modprobe brd rd_size=1048576
# pvcreate /dev/ram0
# vgcreate ram_vg /dev/ram0
# lvcreate -L 16M -n ram_lv ram_vg
# blockdev --flushbufs /dev/ram_vg/ram_lv
	--- and now the whole volume group is gone, all data on the 
		ramdisk were replaced with zeroes

The BLKFLSBUF ioctl is only allowed with CAP_SYS_ADMIN, so there shouldn't 
be security implications with this.

Whan to do with it? The best thing would be to drop this special ramdisk 
behavior and make the BLKFLSBUF ioctl flush the buffer cache on ramdisk 
like on all other block devices. But there may be many users having 
scripts that depend on this special behavior.

Another possibility is to stop the device mapper from passing the 
BLKFLSBUF ioctl down.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: device mapper and the BLKFLSBUF ioctl
  2016-10-21 18:33 device mapper and the BLKFLSBUF ioctl Mikulas Patocka
@ 2016-10-21 20:00 ` Mike Snitzer
  2016-10-21 20:18   ` Mikulas Patocka
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2016-10-21 20:00 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G. Kergon

On Fri, Oct 21 2016 at  2:33pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> Hi
> 
> I found a bug in dm regarding the BLKFLSBUF ioctl.
> 
> The BLKFLSBUF ioctl can be called on a block device and it flushes the 
> buffer cache. There is one exception - when it is called on ramdisk, it 
> actually destroys all ramdisk data (it works like a discard on the full 
> device).
> 
> The device mapper passes this ioctl down to the underlying device, so when 
> the ioctl is called on a logical volume, it can be used to destroy the 
> underlying volume group if the volume group is on ramdisk.
> 
> For example:
> # modprobe brd rd_size=1048576
> # pvcreate /dev/ram0
> # vgcreate ram_vg /dev/ram0
> # lvcreate -L 16M -n ram_lv ram_vg
> # blockdev --flushbufs /dev/ram_vg/ram_lv
> 	--- and now the whole volume group is gone, all data on the 
> 		ramdisk were replaced with zeroes
> 
> The BLKFLSBUF ioctl is only allowed with CAP_SYS_ADMIN, so there shouldn't 
> be security implications with this.
> 
> Whan to do with it? The best thing would be to drop this special ramdisk 
> behavior and make the BLKFLSBUF ioctl flush the buffer cache on ramdisk 
> like on all other block devices. But there may be many users having 
> scripts that depend on this special behavior.
> 
> Another possibility is to stop the device mapper from passing the 
> BLKFLSBUF ioctl down.

If anything DM is being consistent with what the underlying device is
meant to do.

brd_ioctl() destroys the data in response to BLKFLSBUF.. I'm missing why
this is a DM-specific problem.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: device mapper and the BLKFLSBUF ioctl
  2016-10-21 20:00 ` Mike Snitzer
@ 2016-10-21 20:18   ` Mikulas Patocka
  2016-10-24 15:57     ` Mike Snitzer
  0 siblings, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-21 20:18 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Alasdair G. Kergon



On Fri, 21 Oct 2016, Mike Snitzer wrote:

> On Fri, Oct 21 2016 at  2:33pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > I found a bug in dm regarding the BLKFLSBUF ioctl.
> > 
> > The BLKFLSBUF ioctl can be called on a block device and it flushes the 
> > buffer cache. There is one exception - when it is called on ramdisk, it 
> > actually destroys all ramdisk data (it works like a discard on the full 
> > device).
> > 
> > The device mapper passes this ioctl down to the underlying device, so when 
> > the ioctl is called on a logical volume, it can be used to destroy the 
> > underlying volume group if the volume group is on ramdisk.
> > 
> > For example:
> > # modprobe brd rd_size=1048576
> > # pvcreate /dev/ram0
> > # vgcreate ram_vg /dev/ram0
> > # lvcreate -L 16M -n ram_lv ram_vg
> > # blockdev --flushbufs /dev/ram_vg/ram_lv
> > 	--- and now the whole volume group is gone, all data on the 
> > 		ramdisk were replaced with zeroes
> > 
> > The BLKFLSBUF ioctl is only allowed with CAP_SYS_ADMIN, so there shouldn't 
> > be security implications with this.
> > 
> > Whan to do with it? The best thing would be to drop this special ramdisk 
> > behavior and make the BLKFLSBUF ioctl flush the buffer cache on ramdisk 
> > like on all other block devices. But there may be many users having 
> > scripts that depend on this special behavior.
> > 
> > Another possibility is to stop the device mapper from passing the 
> > BLKFLSBUF ioctl down.
> 
> If anything DM is being consistent with what the underlying device is
> meant to do.
> 
> brd_ioctl() destroys the data in response to BLKFLSBUF.. I'm missing why
> this is a DM-specific problem.

The problem is that if we call it on a logical volume, it destroys all 
logical volumes on the give physical volume.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: device mapper and the BLKFLSBUF ioctl
  2016-10-21 20:18   ` Mikulas Patocka
@ 2016-10-24 15:57     ` Mike Snitzer
  2016-10-25 13:07       ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2016-10-24 15:57 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G. Kergon

On Fri, Oct 21 2016 at  4:18P -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Fri, 21 Oct 2016, Mike Snitzer wrote:
> 
> > On Fri, Oct 21 2016 at  2:33pm -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > > Hi
> > > 
> > > I found a bug in dm regarding the BLKFLSBUF ioctl.
> > > 
> > > The BLKFLSBUF ioctl can be called on a block device and it flushes the 
> > > buffer cache. There is one exception - when it is called on ramdisk, it 
> > > actually destroys all ramdisk data (it works like a discard on the full 
> > > device).
> > > 
> > > The device mapper passes this ioctl down to the underlying device, so when 
> > > the ioctl is called on a logical volume, it can be used to destroy the 
> > > underlying volume group if the volume group is on ramdisk.
> > > 
> > > For example:
> > > # modprobe brd rd_size=1048576
> > > # pvcreate /dev/ram0
> > > # vgcreate ram_vg /dev/ram0
> > > # lvcreate -L 16M -n ram_lv ram_vg
> > > # blockdev --flushbufs /dev/ram_vg/ram_lv
> > > 	--- and now the whole volume group is gone, all data on the 
> > > 		ramdisk were replaced with zeroes
> > > 
> > > The BLKFLSBUF ioctl is only allowed with CAP_SYS_ADMIN, so there shouldn't 
> > > be security implications with this.
> > > 
> > > Whan to do with it? The best thing would be to drop this special ramdisk 
> > > behavior and make the BLKFLSBUF ioctl flush the buffer cache on ramdisk 
> > > like on all other block devices. But there may be many users having 
> > > scripts that depend on this special behavior.
> > > 
> > > Another possibility is to stop the device mapper from passing the 
> > > BLKFLSBUF ioctl down.
> > 
> > If anything DM is being consistent with what the underlying device is
> > meant to do.
> > 
> > brd_ioctl() destroys the data in response to BLKFLSBUF.. I'm missing why
> > this is a DM-specific problem.
> 
> The problem is that if we call it on a logical volume, it destroys all 
> logical volumes on the give physical volume.

Yeah, pretty awful.  But this isn't a DM-specific concern.  Could easily
happen with normal block partitions too right?

We _could_ add a DM-specific hack like this but it feels wrong to me:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ec513ee..e91607f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -10,6 +10,8 @@
 #include "dm-uevent.h"
 
 #include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/major.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/blkpg.h>
@@ -470,6 +472,16 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
 		 * a logical partition of the parent bdev; so extra
 		 * validation is needed.
 		 */
+		if (MAJOR(disk_devt(bdev->bd_disk)) == RAMDISK_MAJOR &&
+		    cmd == BLKFLSBUF) {
+			/*
+			 * Disallow BLKFLSBUF on ramdisk because it can easily
+			 * destroy data outside of this logical volume.
+			 */
+			r = -EIO;
+			goto out;
+		}
+
 		r = scsi_verify_blk_ioctl(NULL, cmd);
 		if (r)
 			goto out;

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: device mapper and the BLKFLSBUF ioctl
  2016-10-24 15:57     ` Mike Snitzer
@ 2016-10-25 13:07       ` Christoph Hellwig
  2016-10-25 14:37         ` [PATCH] brd: remove support for BLKFLSBUF Mike Snitzer
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2016-10-25 13:07 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Mikulas Patocka, Alasdair G. Kergon

I think the right fix is to kill off the BLKFLSBUF special case in
brd.  Yes, it break compatibility - but in this case the compatibility
breaks more than it helps.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] brd: remove support for BLKFLSBUF
  2016-10-25 13:07       ` Christoph Hellwig
@ 2016-10-25 14:37         ` Mike Snitzer
  2016-10-25 14:46           ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2016-10-25 14:37 UTC (permalink / raw)
  To: Christoph Hellwig, axboe
  Cc: Mikulas Patocka, dm-devel, Alasdair G. Kergon, linux-block

On Tue, Oct 25 2016 at  9:07P -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> I think the right fix is to kill off the BLKFLSBUF special case in
> brd.  Yes, it break compatibility - but in this case the compatibility
> breaks more than it helps.

Jens, please pick up this patch:

From: Mike Snitzer <snitzer@redhat.com>
Date: Tue, 25 Oct 2016 10:25:07 -0400
Subject: [PATCH] brd: remove support for BLKFLSBUF

Discontinue having the brd driver destructively free all pages in the
ramdisk in response to the BLKFLSBUF ioctl.  Doing so allows a BLKFLSBUF
ioctl issued to a logical partition to destroy pages of the parent brd
device (and all other partitions of that brd device).

This change breaks compatibility - but in this case the compatibility
breaks more than it helps.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/block/brd.c | 35 -----------------------------------
 1 file changed, 35 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0c76d40..45c998a 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -395,44 +395,9 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 #define brd_direct_access NULL
 #endif
 
-static int brd_ioctl(struct block_device *bdev, fmode_t mode,
-			unsigned int cmd, unsigned long arg)
-{
-	int error;
-	struct brd_device *brd = bdev->bd_disk->private_data;
-
-	if (cmd != BLKFLSBUF)
-		return -ENOTTY;
-
-	/*
-	 * ram device BLKFLSBUF has special semantics, we want to actually
-	 * release and destroy the ramdisk data.
-	 */
-	mutex_lock(&brd_mutex);
-	mutex_lock(&bdev->bd_mutex);
-	error = -EBUSY;
-	if (bdev->bd_openers <= 1) {
-		/*
-		 * Kill the cache first, so it isn't written back to the
-		 * device.
-		 *
-		 * Another thread might instantiate more buffercache here,
-		 * but there is not much we can do to close that race.
-		 */
-		kill_bdev(bdev);
-		brd_free_pages(brd);
-		error = 0;
-	}
-	mutex_unlock(&bdev->bd_mutex);
-	mutex_unlock(&brd_mutex);
-
-	return error;
-}
-
 static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		brd_rw_page,
-	.ioctl =		brd_ioctl,
 	.direct_access =	brd_direct_access,
 };
 
-- 
2.8.4 (Apple Git-73)


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] brd: remove support for BLKFLSBUF
  2016-10-25 14:37         ` [PATCH] brd: remove support for BLKFLSBUF Mike Snitzer
@ 2016-10-25 14:46           ` Jens Axboe
  2016-10-26 20:25             ` [PATCH 0/4] brd: support discard Mikulas Patocka
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2016-10-25 14:46 UTC (permalink / raw)
  To: Mike Snitzer, Christoph Hellwig
  Cc: Mikulas Patocka, dm-devel, Alasdair G. Kergon, linux-block

On 10/25/2016 08:37 AM, Mike Snitzer wrote:
> On Tue, Oct 25 2016 at  9:07P -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> I think the right fix is to kill off the BLKFLSBUF special case in
>> brd.  Yes, it break compatibility - but in this case the compatibility
>> breaks more than it helps.
>
> Jens, please pick up this patch:
>
> From: Mike Snitzer <snitzer@redhat.com>
> Date: Tue, 25 Oct 2016 10:25:07 -0400
> Subject: [PATCH] brd: remove support for BLKFLSBUF

Added, thanks.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 0/4] brd: support discard
  2016-10-25 14:46           ` Jens Axboe
@ 2016-10-26 20:25             ` Mikulas Patocka
  2016-10-26 20:26               ` [PATCH 1/4] brd: handle misaligned discard Mikulas Patocka
                                 ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-26 20:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Snitzer, Christoph Hellwig, dm-devel, Alasdair G. Kergon,
	linux-block



On Tue, 25 Oct 2016, Jens Axboe wrote:

> On 10/25/2016 08:37 AM, Mike Snitzer wrote:
> > On Tue, Oct 25 2016 at  9:07P -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > I think the right fix is to kill off the BLKFLSBUF special case in
> > > brd.  Yes, it break compatibility - but in this case the compatibility
> > > breaks more than it helps.
> > 
> > Jens, please pick up this patch:
> > 
> > From: Mike Snitzer <snitzer@redhat.com>
> > Date: Tue, 25 Oct 2016 10:25:07 -0400
> > Subject: [PATCH] brd: remove support for BLKFLSBUF
> 
> Added, thanks.
> 
> -- 
> Jens Axboe

Hi

With the removal of BLKFLSBUF, there is no way to free memory allocated by 
ramdisk, so I created these 4 patches that add DISCARD support for 
ramdisk.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/4] brd: handle misaligned discard
  2016-10-26 20:25             ` [PATCH 0/4] brd: support discard Mikulas Patocka
@ 2016-10-26 20:26               ` Mikulas Patocka
  2016-10-26 20:38                   ` Bart Van Assche
  2016-10-26 20:26               ` [PATCH 2/4] brd: extend rcu read sections Mikulas Patocka
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-26 20:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Snitzer, Christoph Hellwig, dm-devel, Alasdair G. Kergon,
	linux-block

The brd driver refuses misaligned discard requests with an error. However,
this is suboptimal, misaligned requests could be handled by discarding a
part of the request that is aligned on a page boundary. This patch changes
the code so that it handles misaligned requests.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/block/brd.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -213,9 +213,14 @@ static int copy_to_brd_setup(struct brd_
 }
 
 static void discard_from_brd(struct brd_device *brd,
-			sector_t sector, size_t n)
+			sector_t sector, unsigned n_sectors)
 {
-	while (n >= PAGE_SIZE) {
+	unsigned boundary = -sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1);
+	if (unlikely(boundary >= n_sectors))
+		return;
+	sector += boundary;
+	n_sectors -= boundary;
+	while (n_sectors >= PAGE_SIZE >> SECTOR_SHIFT) {
 		/*
 		 * Don't want to actually discard pages here because
 		 * re-allocating the pages can result in writeback
@@ -226,7 +231,7 @@ static void discard_from_brd(struct brd_
 		else
 			brd_zero_page(brd, sector);
 		sector += PAGE_SIZE >> SECTOR_SHIFT;
-		n -= PAGE_SIZE;
+		n_sectors -= PAGE_SIZE >> SECTOR_SHIFT;
 	}
 }
 
@@ -339,10 +344,7 @@ static blk_qc_t brd_make_request(struct
 		goto io_error;
 
 	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
-		if (sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1) ||
-		    bio->bi_iter.bi_size & ~PAGE_MASK)
-			goto io_error;
-		discard_from_brd(brd, sector, bio->bi_iter.bi_size);
+		discard_from_brd(brd, sector, bio->bi_iter.bi_size >> SECTOR_SHIFT);
 		goto out;
 	}
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 2/4] brd: extend rcu read sections
  2016-10-26 20:25             ` [PATCH 0/4] brd: support discard Mikulas Patocka
  2016-10-26 20:26               ` [PATCH 1/4] brd: handle misaligned discard Mikulas Patocka
@ 2016-10-26 20:26               ` Mikulas Patocka
  2016-10-26 20:27               ` [PATCH 3/4] brd: implement discard Mikulas Patocka
  2016-10-26 20:27               ` [PATCH 4/4] brd: remove unused brd_zero_page Mikulas Patocka
  3 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-26 20:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Snitzer, Christoph Hellwig, dm-devel, Alasdair G. Kergon,
	linux-block

This patch extends rcu read sections, so that all manipulations of the
page and its data are within read sections.

This patch is a prerequisite for discarding pages using rcu.

Note that the page pointer escapes the rcu section in the function
brd_direct_access, however, direct access is not supposed to race with
discard.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/block/brd.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -71,10 +71,8 @@ static struct page *brd_lookup_page(stru
 	 * safe here (we don't have total exclusion from radix tree updates
 	 * here, only deletes).
 	 */
-	rcu_read_lock();
 	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
 	page = radix_tree_lookup(&brd->brd_pages, idx);
-	rcu_read_unlock();
 
 	BUG_ON(page && page->index != idx);
 
@@ -92,7 +90,12 @@ static struct page *brd_insert_page(stru
 	struct page *page;
 	gfp_t gfp_flags;
 
+	rcu_read_lock();
+
 	page = brd_lookup_page(brd, sector);
+
+	rcu_read_unlock();
+
 	if (page)
 		return page;
 
@@ -151,9 +154,13 @@ static void brd_zero_page(struct brd_dev
 {
 	struct page *page;
 
+	rcu_read_lock();
+
 	page = brd_lookup_page(brd, sector);
 	if (page)
 		clear_highpage(page);
+
+	rcu_read_unlock();
 }
 
 /*
@@ -246,6 +253,8 @@ static void copy_to_brd(struct brd_devic
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
+	rcu_read_lock();
+
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	page = brd_lookup_page(brd, sector);
 	BUG_ON(!page);
@@ -265,6 +274,8 @@ static void copy_to_brd(struct brd_devic
 		memcpy(dst, src, copy);
 		kunmap_atomic(dst);
 	}
+
+	rcu_read_unlock();
 }
 
 /*
@@ -278,6 +289,8 @@ static void copy_from_brd(void *dst, str
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
+	rcu_read_lock();
+
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	page = brd_lookup_page(brd, sector);
 	if (page) {
@@ -299,6 +312,8 @@ static void copy_from_brd(void *dst, str
 		} else
 			memset(dst, 0, copy);
 	}
+
+	rcu_read_unlock();
 }
 
 /*


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 3/4] brd: implement discard
  2016-10-26 20:25             ` [PATCH 0/4] brd: support discard Mikulas Patocka
  2016-10-26 20:26               ` [PATCH 1/4] brd: handle misaligned discard Mikulas Patocka
  2016-10-26 20:26               ` [PATCH 2/4] brd: extend rcu read sections Mikulas Patocka
@ 2016-10-26 20:27               ` Mikulas Patocka
  2016-10-26 20:27               ` [PATCH 4/4] brd: remove unused brd_zero_page Mikulas Patocka
  3 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-26 20:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Snitzer, Christoph Hellwig, dm-devel, Alasdair G. Kergon,
	linux-block

Implement page discard using rcu. Each page has built-in entry rcu_head
that could be used to free the page from rcu.

Regarding the commant that "re-allocating the pages can result in
writeback deadlocks under heavy load" - if the user is in the risk of such
deadlocks, he should mount the filesystem without "-o discard".

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/block/brd.c |   30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -137,6 +137,12 @@ static struct page *brd_insert_page(stru
 	return page;
 }
 
+static void brd_free_page_rcu(struct rcu_head *head)
+{
+	struct page *page = container_of(head, struct page, rcu_head);
+	__free_page(page);
+}
+
 static void brd_free_page(struct brd_device *brd, sector_t sector)
 {
 	struct page *page;
@@ -147,7 +153,7 @@ static void brd_free_page(struct brd_dev
 	page = radix_tree_delete(&brd->brd_pages, idx);
 	spin_unlock(&brd->brd_lock);
 	if (page)
-		__free_page(page);
+		call_rcu(&page->rcu_head, brd_free_page_rcu);
 }
 
 static void brd_zero_page(struct brd_device *brd, sector_t sector)
@@ -233,7 +239,7 @@ static void discard_from_brd(struct brd_
 		 * re-allocating the pages can result in writeback
 		 * deadlocks under heavy load.
 		 */
-		if (0)
+		if (1)
 			brd_free_page(brd, sector);
 		else
 			brd_zero_page(brd, sector);
@@ -257,22 +263,22 @@ static void copy_to_brd(struct brd_devic
 
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	page = brd_lookup_page(brd, sector);
-	BUG_ON(!page);
-
-	dst = kmap_atomic(page);
-	memcpy(dst + offset, src, copy);
-	kunmap_atomic(dst);
+	if (page) {
+		dst = kmap_atomic(page);
+		memcpy(dst + offset, src, copy);
+		kunmap_atomic(dst);
+	}
 
 	if (copy < n) {
 		src += copy;
 		sector += copy >> SECTOR_SHIFT;
 		copy = n - copy;
 		page = brd_lookup_page(brd, sector);
-		BUG_ON(!page);
-
-		dst = kmap_atomic(page);
-		memcpy(dst, src, copy);
-		kunmap_atomic(dst);
+		if (page) {
+			dst = kmap_atomic(page);
+			memcpy(dst, src, copy);
+			kunmap_atomic(dst);
+		}
 	}
 
 	rcu_read_unlock();

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 4/4] brd: remove unused brd_zero_page
  2016-10-26 20:25             ` [PATCH 0/4] brd: support discard Mikulas Patocka
                                 ` (2 preceding siblings ...)
  2016-10-26 20:27               ` [PATCH 3/4] brd: implement discard Mikulas Patocka
@ 2016-10-26 20:27               ` Mikulas Patocka
  3 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-26 20:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Snitzer, Christoph Hellwig, dm-devel, Alasdair G. Kergon,
	linux-block

Remove the function brd_zero_page. This function was used to zero a page
when the discard request came in.

The discard request is used for performance or space optimization, it
makes no sense to zero pages on discard request, as it neither improves
performance nor saves memory.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/block/brd.c |   23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)

Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -156,19 +156,6 @@ static void brd_free_page(struct brd_dev
 		call_rcu(&page->rcu_head, brd_free_page_rcu);
 }
 
-static void brd_zero_page(struct brd_device *brd, sector_t sector)
-{
-	struct page *page;
-
-	rcu_read_lock();
-
-	page = brd_lookup_page(brd, sector);
-	if (page)
-		clear_highpage(page);
-
-	rcu_read_unlock();
-}
-
 /*
  * Free all backing store pages and radix tree. This must only be called when
  * there are no other users of the device.
@@ -234,15 +221,7 @@ static void discard_from_brd(struct brd_
 	sector += boundary;
 	n_sectors -= boundary;
 	while (n_sectors >= PAGE_SIZE >> SECTOR_SHIFT) {
-		/*
-		 * Don't want to actually discard pages here because
-		 * re-allocating the pages can result in writeback
-		 * deadlocks under heavy load.
-		 */
-		if (1)
-			brd_free_page(brd, sector);
-		else
-			brd_zero_page(brd, sector);
+		brd_free_page(brd, sector);
 		sector += PAGE_SIZE >> SECTOR_SHIFT;
 		n_sectors -= PAGE_SIZE >> SECTOR_SHIFT;
 	}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
  2016-10-26 20:26               ` [PATCH 1/4] brd: handle misaligned discard Mikulas Patocka
@ 2016-10-26 20:38                   ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2016-10-26 20:38 UTC (permalink / raw)
  To: Mikulas Patocka, Jens Axboe
  Cc: Christoph Hellwig, linux-block, dm-devel, Alasdair G. Kergon,
	Mike Snitzer

On 10/26/2016 01:26 PM, Mikulas Patocka wrote:
> The brd driver refuses misaligned discard requests with an error. However,
> this is suboptimal, misaligned requests could be handled by discarding a
> part of the request that is aligned on a page boundary. This patch changes
> the code so that it handles misaligned requests.

Hello Mikulas,

We do not want this kind of discard request processing in every block 
driver. This is why I think that this kind of processing should be added 
to the block layer. See also "[PATCH v3 0/5] Make blkdev_issue_discard() 
submit aligned discard requests" 
(http://www.spinics.net/lists/linux-block/msg02360.html).

Bart.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
@ 2016-10-26 20:38                   ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2016-10-26 20:38 UTC (permalink / raw)
  To: Mikulas Patocka, Jens Axboe
  Cc: Christoph Hellwig, linux-block, dm-devel, Alasdair G. Kergon,
	Mike Snitzer

On 10/26/2016 01:26 PM, Mikulas Patocka wrote:
> The brd driver refuses misaligned discard requests with an error. However,
> this is suboptimal, misaligned requests could be handled by discarding a
> part of the request that is aligned on a page boundary. This patch changes
> the code so that it handles misaligned requests.

Hello Mikulas,

We do not want this kind of discard request processing in every block 
driver. This is why I think that this kind of processing should be added 
to the block layer. See also "[PATCH v3 0/5] Make blkdev_issue_discard() 
submit aligned discard requests" 
(http://www.spinics.net/lists/linux-block/msg02360.html).

Bart.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
  2016-10-26 20:38                   ` Bart Van Assche
  (?)
@ 2016-10-26 21:46                   ` Mikulas Patocka
  2016-10-26 21:50                     ` REQ_OP for zeroing, was " Christoph Hellwig
  2016-10-26 21:57                       ` Bart Van Assche
  -1 siblings, 2 replies; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-26 21:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Christoph Hellwig, linux-block, dm-devel,
	Alasdair G. Kergon, Mike Snitzer



On Wed, 26 Oct 2016, Bart Van Assche wrote:

> On 10/26/2016 01:26 PM, Mikulas Patocka wrote:
> > The brd driver refuses misaligned discard requests with an error. However,
> > this is suboptimal, misaligned requests could be handled by discarding a
> > part of the request that is aligned on a page boundary. This patch changes
> > the code so that it handles misaligned requests.
> 
> Hello Mikulas,
> 
> We do not want this kind of discard request processing in every block driver.
> This is why I think that this kind of processing should be added to the block
> layer. See also "[PATCH v3 0/5] Make blkdev_issue_discard() submit aligned
> discard requests" (http://www.spinics.net/lists/linux-block/msg02360.html).
> 
> Bart.

I don't like the idea of complicating the code by turning discards into 
writes. You can just turn off the flag "discard_zeroes_data" and drop all 
the splitting code.

The flag "discard_zeroes_data" is actually misdesigned, because the 
storage stack can change dynamically while bios are in progress. You can 
send a discard bio to a device mapper device that has 
"discard_zeroes_data" - and while the bio is in progress, the device 
mapper stack can be reconfigured to redirect the bio to another device 
that doesn't have "discard_zeroes_data" - and the bio won't zero data and 
the caller that issued it has no way to find it out.

I think the proper thing would be to move "discard_zeroes_data" flag into 
the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - 
and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio 
and the caller is supposed to do zeroing manually.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* REQ_OP for zeroing, was Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
  2016-10-26 21:46                   ` Mikulas Patocka
@ 2016-10-26 21:50                     ` Christoph Hellwig
  2016-10-28 11:43                       ` Mikulas Patocka
  2016-10-26 21:57                       ` Bart Van Assche
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2016-10-26 21:50 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Bart Van Assche, Jens Axboe, linux-block, dm-devel,
	Alasdair G. Kergon, Mike Snitzer, Martin K. Petersen

On Wed, Oct 26, 2016 at 05:46:11PM -0400, Mikulas Patocka wrote:
> I think the proper thing would be to move "discard_zeroes_data" flag into 
> the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - 
> and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio 
> and the caller is supposed to do zeroing manually.

Yes, Martin and I have come to a similar conclusion recently.  An
additional aspect is that NVMe has a Write Zeroes command which is more
limited than what REQ_OP_WRITE_SAME does.

So I think the right way is to add a REQ_OP_WRITE_ZEROES (or
REQ_OP_ZERO) and have modifies if it may discard or not.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
  2016-10-26 21:46                   ` Mikulas Patocka
@ 2016-10-26 21:57                       ` Bart Van Assche
  2016-10-26 21:57                       ` Bart Van Assche
  1 sibling, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2016-10-26 21:57 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, linux-block, Mike Snitzer, Christoph Hellwig,
	dm-devel, Alasdair G. Kergon

On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
> I don't like the idea of complicating the code by turning discards into
> writes.

That's not what my patch series does. The only writes added by my patch 
series are those for the non-aligned head and tail of the range passed 
to blkdev_issue_zeroout().

> The flag "discard_zeroes_data" is actually misdesigned, because the
> storage stack can change dynamically while bios are in progress. You can
> send a discard bio to a device mapper device that has
> "discard_zeroes_data" - and while the bio is in progress, the device
> mapper stack can be reconfigured to redirect the bio to another device
> that doesn't have "discard_zeroes_data" - and the bio won't zero data and
> the caller that issued it has no way to find it out.
>
> I think the proper thing would be to move "discard_zeroes_data" flag into
> the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
> and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio
> and the caller is supposed to do zeroing manually.

Sorry but I don't like this proposal. I think that a much better 
solution would be to pause I/O before making any changes to the 
discard_zeroes_data flag.

Bart.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
@ 2016-10-26 21:57                       ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2016-10-26 21:57 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, linux-block, Mike Snitzer, Christoph Hellwig,
	dm-devel, Alasdair G. Kergon

On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
> I don't like the idea of complicating the code by turning discards into
> writes.

That's not what my patch series does. The only writes added by my patch 
series are those for the non-aligned head and tail of the range passed 
to blkdev_issue_zeroout().

> The flag "discard_zeroes_data" is actually misdesigned, because the
> storage stack can change dynamically while bios are in progress. You can
> send a discard bio to a device mapper device that has
> "discard_zeroes_data" - and while the bio is in progress, the device
> mapper stack can be reconfigured to redirect the bio to another device
> that doesn't have "discard_zeroes_data" - and the bio won't zero data and
> the caller that issued it has no way to find it out.
>
> I think the proper thing would be to move "discard_zeroes_data" flag into
> the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
> and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio
> and the caller is supposed to do zeroing manually.

Sorry but I don't like this proposal. I think that a much better 
solution would be to pause I/O before making any changes to the 
discard_zeroes_data flag.

Bart.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
  2016-10-26 21:57                       ` Bart Van Assche
  (?)
@ 2016-10-28 11:39                       ` Mikulas Patocka
  2016-10-28 15:55                           ` Bart Van Assche
  -1 siblings, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-28 11:39 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Mike Snitzer, Christoph Hellwig,
	dm-devel, Alasdair G. Kergon



On Wed, 26 Oct 2016, Bart Van Assche wrote:

> On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
> > I don't like the idea of complicating the code by turning discards into
> > writes.
> 
> That's not what my patch series does. The only writes added by my patch series
> are those for the non-aligned head and tail of the range passed to
> blkdev_issue_zeroout().

The purpose of discard is to improve SSD performance and reduce wear.

Generating more write requests for discard does quite the opposite - it 
reduces performance (discard + two small writes is slower than just 
discard) and it also causes more wear.

> > The flag "discard_zeroes_data" is actually misdesigned, because the
> > storage stack can change dynamically while bios are in progress. You can
> > send a discard bio to a device mapper device that has
> > "discard_zeroes_data" - and while the bio is in progress, the device
> > mapper stack can be reconfigured to redirect the bio to another device
> > that doesn't have "discard_zeroes_data" - and the bio won't zero data and
> > the caller that issued it has no way to find it out.
> > 
> > I think the proper thing would be to move "discard_zeroes_data" flag into
> > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
> > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio
> > and the caller is supposed to do zeroing manually.
> 
> Sorry but I don't like this proposal. I think that a much better solution
> would be to pause I/O before making any changes to the discard_zeroes_data
> flag.

The device mapper pauses all bios when it switches the table - but the bio 
was submitted with assumption that it goes to a device withe 
"discard_zeroes_data" set, then the bio is paused, device mapper table is 
switched, the bio is unpaused, and now it can go to a device without 
"discard_zeroes_data".

> Bart.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: REQ_OP for zeroing, was Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
  2016-10-26 21:50                     ` REQ_OP for zeroing, was " Christoph Hellwig
@ 2016-10-28 11:43                       ` Mikulas Patocka
  2016-10-28 13:14                         ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-28 11:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Jens Axboe, linux-block, dm-devel,
	Alasdair G. Kergon, Mike Snitzer, Martin K. Petersen



On Wed, 26 Oct 2016, Christoph Hellwig wrote:

> On Wed, Oct 26, 2016 at 05:46:11PM -0400, Mikulas Patocka wrote:
> > I think the proper thing would be to move "discard_zeroes_data" flag into 
> > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - 
> > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio 
> > and the caller is supposed to do zeroing manually.
> 
> Yes, Martin and I have come to a similar conclusion recently.  An
> additional aspect is that NVMe has a Write Zeroes command which is more
> limited than what REQ_OP_WRITE_SAME does.
> 
> So I think the right way is to add a REQ_OP_WRITE_ZEROES (or
> REQ_OP_ZERO) and have modifies if it may discard or not.

We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and 
if it does, turn it into "Write Zeroes" or TRIM command (if the device 
guarantees zeroing on trim). If it doesn't contain all zeroes and the 
device doesn't support non-zero WRITE SAME, then reject it.

Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure 
which of these two possibilities is better.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: REQ_OP for zeroing, was Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
  2016-10-28 11:43                       ` Mikulas Patocka
@ 2016-10-28 13:14                         ` Christoph Hellwig
  2016-10-31 16:36                           ` Mikulas Patocka
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2016-10-28 13:14 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Christoph Hellwig, Bart Van Assche, Jens Axboe, linux-block,
	dm-devel, Alasdair G. Kergon, Mike Snitzer, Martin K. Petersen,
	Chaitanya Kulkarni

[adding Chaitanya to Cc]

On Fri, Oct 28, 2016 at 07:43:41AM -0400, Mikulas Patocka wrote:
> We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and 
> if it does, turn it into "Write Zeroes" or TRIM command (if the device 
> guarantees zeroing on trim). If it doesn't contain all zeroes and the 
> device doesn't support non-zero WRITE SAME, then reject it.

I don't like this because it's very inefficient - we have to allocate
a payload first and then compare the whole payload for very operation.

> Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure 
> which of these two possibilities is better.

Chaitanya actually did an initial prototype implementation of this for
NVMe that he shared with me.  I liked it a lot and I think he'll be
ready to post it in a few days.  Now that we have the REQ_OP* values
instead of mapping different command types to flags it's actually
surprisingly easy to add new block layer operations.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
  2016-10-28 11:39                       ` Mikulas Patocka
@ 2016-10-28 15:55                           ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2016-10-28 15:55 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Christoph Hellwig, Mike Snitzer, linux-block,
	dm-devel, Alasdair G. Kergon

On 10/28/2016 04:39 AM, Mikulas Patocka wrote:
> On Wed, 26 Oct 2016, Bart Van Assche wrote:
>> On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
>>> I think the proper thing would be to move "discard_zeroes_data" flag into
>>> the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
>>> and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio
>>> and the caller is supposed to do zeroing manually.
>>
>> Sorry but I don't like this proposal. I think that a much better solution
>> would be to pause I/O before making any changes to the discard_zeroes_data
>> flag.
>
> The device mapper pauses all bios when it switches the table - but the bio
> was submitted with assumption that it goes to a device withe
> "discard_zeroes_data" set, then the bio is paused, device mapper table is
> switched, the bio is unpaused, and now it can go to a device without
> "discard_zeroes_data".

Hello Mikulas,

Sorry if I wasn't clear enough. What I meant is to wait until all 
outstanding requests have finished before modifying the 
discard_zeroes_data flag - the kind of operation that is performed by 
e.g. blk_mq_freeze_queue(). Modifying the discard_zeroes_data flag after 
a bio has been submitted and before it has completed could lead to 
several classes of subtle bugs. Functions like __blkdev_issue_discard() 
assume that the value of the discard_zeroes_data flag does not change 
after this function has been called and before the submitted requests 
completed.

Bart.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
@ 2016-10-28 15:55                           ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2016-10-28 15:55 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Christoph Hellwig, Mike Snitzer, linux-block,
	dm-devel, Alasdair G. Kergon

On 10/28/2016 04:39 AM, Mikulas Patocka wrote:
> On Wed, 26 Oct 2016, Bart Van Assche wrote:
>> On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
>>> I think the proper thing would be to move "discard_zeroes_data" flag into
>>> the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
>>> and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio
>>> and the caller is supposed to do zeroing manually.
>>
>> Sorry but I don't like this proposal. I think that a much better solution
>> would be to pause I/O before making any changes to the discard_zeroes_data
>> flag.
>
> The device mapper pauses all bios when it switches the table - but the bio
> was submitted with assumption that it goes to a device withe
> "discard_zeroes_data" set, then the bio is paused, device mapper table is
> switched, the bio is unpaused, and now it can go to a device without
> "discard_zeroes_data".

Hello Mikulas,

Sorry if I wasn't clear enough. What I meant is to wait until all 
outstanding requests have finished before modifying the 
discard_zeroes_data flag - the kind of operation that is performed by 
e.g. blk_mq_freeze_queue(). Modifying the discard_zeroes_data flag after 
a bio has been submitted and before it has completed could lead to 
several classes of subtle bugs. Functions like __blkdev_issue_discard() 
assume that the value of the discard_zeroes_data flag does not change 
after this function has been called and before the submitted requests 
completed.

Bart.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
  2016-10-28 15:55                           ` Bart Van Assche
  (?)
@ 2016-10-31 16:31                           ` Mikulas Patocka
  -1 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-31 16:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Christoph Hellwig, Mike Snitzer, linux-block,
	dm-devel, Alasdair G. Kergon



On Fri, 28 Oct 2016, Bart Van Assche wrote:

> On 10/28/2016 04:39 AM, Mikulas Patocka wrote:
> > On Wed, 26 Oct 2016, Bart Van Assche wrote:
> > > On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
> > > > I think the proper thing would be to move "discard_zeroes_data" flag
> > > > into
> > > > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
> > > > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the
> > > > bio
> > > > and the caller is supposed to do zeroing manually.
> > > 
> > > Sorry but I don't like this proposal. I think that a much better solution
> > > would be to pause I/O before making any changes to the discard_zeroes_data
> > > flag.
> > 
> > The device mapper pauses all bios when it switches the table - but the bio
> > was submitted with assumption that it goes to a device withe
> > "discard_zeroes_data" set, then the bio is paused, device mapper table is
> > switched, the bio is unpaused, and now it can go to a device without
> > "discard_zeroes_data".
> 
> Hello Mikulas,
> 
> Sorry if I wasn't clear enough. What I meant is to wait until all outstanding
> requests have finished

It is possible that the process sends never ending stream of bios (for 
example when reading linear data and using readahead), so waiting until 
there are no oustanding bios never finishes.

> before modifying the discard_zeroes_data flag - the
> kind of operation that is performed by e.g. blk_mq_freeze_queue().

blk_mq_freeze_queue() works on request-based drivers, device mapper works 
with bios, so that function has no effect on device mapper device. Anyway 
- blk_mq_freeze_queue() won't stop the process that issues the I/O 
requests - it will just hold the requests in the queue and not forward 
them to the device.

There is no way to stop the process that issues the bios. We can't stop 
the process that is looping in __blkdev_issue_discard, issuing discard 
requests. All that we can do is to hold the bios that the process issued.

Device mapper can freeze the filesystem with "freeze_bdev", but...
- some filesystems don't support freeze
- if the filesystem is not directly mounted on the frozen device, but 
there is a stack of intermediate block devices between the filesystem and 
the frozen device, then the filesystem will not be frozen
- the user can open the block device directly and he won't be affected by 
the freeze

> Modifying the discard_zeroes_data flag after a bio has been submitted 
> and before it has completed could lead to several classes of subtle 
> bugs. Functions like __blkdev_issue_discard() assume that the value of 
> the discard_zeroes_data flag does not change after this function has 
> been called and before the submitted requests completed.
>
> Bart.

I agree. That's the topic of the discussion - that the discard_zeroes_data 
flag is unreliable and the flag should be moved to the bio.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: REQ_OP for zeroing, was Re: [dm-devel] [PATCH 1/4] brd: handle misaligned discard
  2016-10-28 13:14                         ` Christoph Hellwig
@ 2016-10-31 16:36                           ` Mikulas Patocka
  0 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2016-10-31 16:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Jens Axboe, linux-block, dm-devel,
	Alasdair G. Kergon, Mike Snitzer, Martin K. Petersen,
	Chaitanya Kulkarni



On Fri, 28 Oct 2016, Christoph Hellwig wrote:

> [adding Chaitanya to Cc]
> 
> On Fri, Oct 28, 2016 at 07:43:41AM -0400, Mikulas Patocka wrote:
> > We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and 
> > if it does, turn it into "Write Zeroes" or TRIM command (if the device 
> > guarantees zeroing on trim). If it doesn't contain all zeroes and the 
> > device doesn't support non-zero WRITE SAME, then reject it.
> 
> I don't like this because it's very inefficient - we have to allocate
> a payload first and then compare the whole payload for very operation.
> 
> > Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure 
> > which of these two possibilities is better.
> 
> Chaitanya actually did an initial prototype implementation of this for
> NVMe that he shared with me.  I liked it a lot and I think he'll be
> ready to post it in a few days.  Now that we have the REQ_OP* values
> instead of mapping different command types to flags it's actually
> surprisingly easy to add new block layer operations.

OK - when it is in the kernel, let me know, so that I can write device 
mapper support for that.

We should remove the flag "discard_zeroes_data" afterwards, because it is 
unreliable and impossible to support correctly in the device mapper.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2016-10-31 16:36 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-21 18:33 device mapper and the BLKFLSBUF ioctl Mikulas Patocka
2016-10-21 20:00 ` Mike Snitzer
2016-10-21 20:18   ` Mikulas Patocka
2016-10-24 15:57     ` Mike Snitzer
2016-10-25 13:07       ` Christoph Hellwig
2016-10-25 14:37         ` [PATCH] brd: remove support for BLKFLSBUF Mike Snitzer
2016-10-25 14:46           ` Jens Axboe
2016-10-26 20:25             ` [PATCH 0/4] brd: support discard Mikulas Patocka
2016-10-26 20:26               ` [PATCH 1/4] brd: handle misaligned discard Mikulas Patocka
2016-10-26 20:38                 ` [dm-devel] " Bart Van Assche
2016-10-26 20:38                   ` Bart Van Assche
2016-10-26 21:46                   ` Mikulas Patocka
2016-10-26 21:50                     ` REQ_OP for zeroing, was " Christoph Hellwig
2016-10-28 11:43                       ` Mikulas Patocka
2016-10-28 13:14                         ` Christoph Hellwig
2016-10-31 16:36                           ` Mikulas Patocka
2016-10-26 21:57                     ` Bart Van Assche
2016-10-26 21:57                       ` Bart Van Assche
2016-10-28 11:39                       ` Mikulas Patocka
2016-10-28 15:55                         ` Bart Van Assche
2016-10-28 15:55                           ` Bart Van Assche
2016-10-31 16:31                           ` Mikulas Patocka
2016-10-26 20:26               ` [PATCH 2/4] brd: extend rcu read sections Mikulas Patocka
2016-10-26 20:27               ` [PATCH 3/4] brd: implement discard Mikulas Patocka
2016-10-26 20:27               ` [PATCH 4/4] brd: remove unused brd_zero_page Mikulas Patocka

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.