* [PATCH 0/11 v4] block: Fix block device shutdown related races
@ 2017-03-13 15:13 Jan Kara
From: Jan Kara @ 2017-03-13 15:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

Hello,

this is a series with the remaining patches (on top of 4.11-rc2) to fix several
different races and issues I've found when testing device shutdown and reuse.
The first two patches fix possible (theoretical) problems when opening of a
block device races with shutdown of a gendisk structure. Patches 3-9 fix an
oops that is triggered by __blkdev_put() calling inode_detach_wb() too early
(the problem reported by Thiago). Patches 10 and 11 fix an oops due to a bug
in gendisk code where get_gendisk() can return an already freed gendisk
structure (again triggered by Omar's stress test).

People, please have a look at the patches. They are mostly simple, however
the interactions are rather complex, so I may have missed something. Also
I'm happy for any additional testing these patches can get - I've stressed
them with Omar's script, tested memcg writeback, and tested static (not
udev managed) device inodes.

Changes since v3:
* Rebased on top of 4.11-rc2
* Reworked patch 2 (block: Fix race of bdev open with gendisk shutdown) based
  on Tejun's feedback
* Significantly updated patch 5 (and dropped previous Tejun's ack) to
  accommodate for fixes to SCSI re-registration of BDI that went to 4.11-rc2

Changes since v2:
* Added Tejun's acks
* Rebased on top of 4.11-rc1
* Fixed two possible races between blkdev_open() and del_gendisk()
* Fixed possible race between concurrent shutdown of cgwb spotted by Tejun

Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback

								Honza


* [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However, we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and, as a result, the
bdev inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later, when __blkdev_get() is already guaranteed to succeed.
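
For illustration, the restart path in __blkdev_get() looks roughly like
this (a condensed sketch of the 4.11-era code, cleanup details trimmed) -
note that bdev->bd_bdi is not reset together with the other bdev fields:

	ret = disk->fops->open(bdev, mode);
	if (ret == -ERESTARTSYS) {
		/* Lost a race with 'disk' being deleted, try again */
		disk_put_part(bdev->bd_part);
		bdev->bd_part = NULL;
		bdev->bd_disk = NULL;
		bdev->bd_queue = NULL;
		/* bdev->bd_bdi still points to the old disk's bdi here */
		mutex_unlock(&bdev->bd_mutex);
		put_disk(disk);
		goto restart;
	}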

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..53e2389ae4d4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1556,8 +1556,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
-		if (bdev->bd_bdi == &noop_backing_dev_info)
-			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 
 		if (!partno) {
 			ret = -ENXIO;
@@ -1622,6 +1620,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			}
 			bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
 		}
+
+		if (bdev->bd_bdi == &noop_backing_dev_info)
+			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 	} else {
 		if (bdev->bd_contains == bdev) {
 			ret = 0;
@@ -1653,8 +1654,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	bdev->bd_disk = NULL;
 	bdev->bd_part = NULL;
 	bdev->bd_queue = NULL;
-	bdi_put(bdev->bd_bdi);
-	bdev->bd_bdi = &noop_backing_dev_info;
 	if (bdev != bdev->bd_contains)
 		__blkdev_put(bdev->bd_contains, mode, 1);
 	bdev->bd_contains = NULL;
-- 
2.10.2


* [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

blkdev_open() may race with gendisk shutdown in two different ways.
Either del_gendisk() has already unhashed block device inode (and thus
bd_acquire() will end up creating new block device inode) however
get_gendisk() will still return the gendisk that is being destroyed.
Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
before we get to get_gendisk() and get_gendisk() will return new gendisk
that got allocated for device that reused the device number.

In both cases this will result in possible inconsistencies between
bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
destroyed and device number reused, in the second case immediately).

Fix the problem by checking whether the gendisk is still alive and inode
hashed when associating bdev inode with it and its bdi. That way we are
sure that we will not associate bdev inode with disk that got past
blk_unregister_region() in del_gendisk() (and thus device number can get
reused). Similarly, we will not associate bdev that was once associated
with gendisk that is going away (and thus the corresponding bdev inode
will get unhashed in del_gendisk()) with a new gendisk that is just
reusing the device number.

Also add a warning that will tell us about unexpected inconsistencies
between bdi associated with the bdev inode and bdi associated with the
disk.
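
For illustration, the first of the two races looks roughly like this (the
interleaving is a sketch):

	CPU0: blkdev_open()                 CPU1: del_gendisk()
	                                    bdev_unhash_inode(bdev)
	bd_acquire()
	  inode unhashed -> allocates a
	  new bdev inode
	get_gendisk()
	  still returns the gendisk that
	  is being destroyed
	__blkdev_get()
	  associates the new bdev inode
	  (and its bd_bdi) with a dying
	  gendisk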

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 53e2389ae4d4..5ec8750f5332 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
-			if (!bdev->bd_part)
+			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
+			    inode_unhashed(bdev->bd_inode))
 				goto out_clear;
 
 			ret = 0;
@@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			bdev->bd_contains = whole;
 			bdev->bd_part = disk_get_part(disk, partno);
 			if (!(disk->flags & GENHD_FL_UP) ||
-			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
+			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
+			    inode_unhashed(bdev->bd_inode)) {
 				ret = -ENXIO;
 				goto out_clear;
 			}
@@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 
 		if (bdev->bd_bdi == &noop_backing_dev_info)
 			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
+		else
+			WARN_ON_ONCE(bdev->bd_bdi !=
+				     disk->queue->backing_dev_info);
 	} else {
 		if (bdev->bd_contains == bdev) {
 			ret = 0;
-- 
2.10.2


* [PATCH 03/11] bdi: Mark congested->bdi as internal
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

The congested->bdi pointer is used only to be able to remove the congested
structure from bdi->cgwb_congested_tree on structure release. Moreover the
pointer can become NULL when we unregister the bdi. Rename the field to
__bdi and add a comment to make it more explicit that this is internal
stuff of the memcg writeback code and that people should not use the
field, as such use will likely be race-prone.

We do not bother with converting congested->bdi to a proper refcounted
reference. It would be slightly ugly to special-case bdi->wb.congested to
avoid effectively a cyclic reference of the bdi to itself, and the
reference gets cleared from bdi_unregister() anyway, making it impossible
to reference a freed bdi.
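
In other words, the cycle that a refcounted reference would create looks
like this (a sketch; the bdi_get() call only illustrates what we avoid):

	/*
	 * The bdi embeds its own writeback structure and congested state:
	 *
	 *	bdi->wb.congested->__bdi == bdi
	 *
	 * so a refcounted __bdi would amount to
	 *
	 *	bdi->wb.congested->__bdi = bdi_get(bdi);
	 *
	 * i.e. the bdi would pin itself and could never be freed without
	 * special-casing bdi->wb.congested.
	 */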

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  4 +++-
 mm/backing-dev.c                 | 10 +++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ad955817916d..8fb3dcdebc80 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -54,7 +54,9 @@ struct bdi_writeback_congested {
 	atomic_t refcnt;		/* nr of attached wb's and blkg */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-	struct backing_dev_info *bdi;	/* the associated bdi */
+	struct backing_dev_info *__bdi;	/* the associated bdi, set to NULL
+					 * on bdi unregistration. For memcg-wb
+					 * internal use only! */
 	int blkcg_id;			/* ID of the associated blkcg */
 	struct rb_node rb_node;		/* on bdi->cgwb_congestion_tree */
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c6f2a37028c2..12408f86783c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -438,7 +438,7 @@ wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
 		return NULL;
 
 	atomic_set(&new_congested->refcnt, 0);
-	new_congested->bdi = bdi;
+	new_congested->__bdi = bdi;
 	new_congested->blkcg_id = blkcg_id;
 	goto retry;
 
@@ -466,10 +466,10 @@ void wb_congested_put(struct bdi_writeback_congested *congested)
 	}
 
 	/* bdi might already have been destroyed leaving @congested unlinked */
-	if (congested->bdi) {
+	if (congested->__bdi) {
 		rb_erase(&congested->rb_node,
-			 &congested->bdi->cgwb_congested_tree);
-		congested->bdi = NULL;
+			 &congested->__bdi->cgwb_congested_tree);
+		congested->__bdi = NULL;
 	}
 
 	spin_unlock_irqrestore(&cgwb_lock, flags);
@@ -752,7 +752,7 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 			rb_entry(rbn, struct bdi_writeback_congested, rb_node);
 
 		rb_erase(rbn, &bdi->cgwb_congested_tree);
-		congested->bdi = NULL;	/* mark @congested unlinked */
+		congested->__bdi = NULL;	/* mark @congested unlinked */
 	}
 	spin_unlock_irq(&cgwb_lock);
 }
-- 
2.10.2


* [PATCH 04/11] bdi: Make wb->bdi a proper reference
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 12408f86783c..03d4ba27c133 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 
 	memset(wb, 0, sizeof(*wb));
 
+	if (wb != &bdi->wb)
+		bdi_get(bdi);
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
 	INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	wb->dirty_sleep = jiffies;
 
 	wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
-	if (!wb->congested)
-		return -ENOMEM;
+	if (!wb->congested) {
+		err = -ENOMEM;
+		goto out_put_bdi;
+	}
 
 	err = fprop_local_init_percpu(&wb->completions, gfp);
 	if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	fprop_local_destroy_percpu(&wb->completions);
 out_put_cong:
 	wb_congested_put(wb->congested);
+out_put_bdi:
+	if (wb != &bdi->wb)
+		bdi_put(bdi);
 	return err;
 }
 
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
 
 	fprop_local_destroy_percpu(&wb->completions);
 	wb_congested_put(wb->congested);
+	if (wb != &wb->bdi->wb)
+		bdi_put(wb->bdi);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-- 
2.10.2


* [PATCH 05/11] bdi: Unify bdi->wb_list handling for root bdi_writeback
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

Currently the root bdi_writeback structure is added to bdi->wb_list in
bdi_init() and never removed. That is different from all other
bdi_writeback structures which get added to the list when created and
removed from it before wb_shutdown().

So move the list addition of the root bdi_writeback to bdi_register() and
the list removal of all bdi_writeback structures to wb_shutdown(). That
way a bdi_writeback structure is on bdi->wb_list if and only if it can
handle writeback, and it will make it easier for us to handle shutdown of
all bdi_writeback structures in bdi_unregister().
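
The resulting lifecycle looks like this (a sketch using the names from the
diff below; non-root wbs keep being added by cgwb_create() as before):

	bdi_register()                     wb_shutdown()
	  cgwb_bdi_register()                cgwb_remove_from_bdi_list()
	    list_add_tail_rcu()                list_del_rcu(&wb->bdi_node)

so a bdi_writeback sits on bdi->wb_list exactly between registration and
shutdown.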

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 03d4ba27c133..e3d56dba4da8 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -345,6 +345,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	return err;
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
+
 /*
  * Remove bdi from the global list and shutdown any threads we have running
  */
@@ -358,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	}
 	spin_unlock_bh(&wb->work_lock);
 
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -491,10 +494,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 						release_work);
 	struct backing_dev_info *bdi = wb->bdi;
 
-	spin_lock_irq(&cgwb_lock);
-	list_del_rcu(&wb->bdi_node);
-	spin_unlock_irq(&cgwb_lock);
-
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
@@ -526,6 +525,13 @@ static void cgwb_kill(struct bdi_writeback *wb)
 	percpu_ref_kill(&wb->refcnt);
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+	spin_lock_irq(&cgwb_lock);
+	list_del_rcu(&wb->bdi_node);
+	spin_unlock_irq(&cgwb_lock);
+}
+
 static int cgwb_create(struct backing_dev_info *bdi,
 		       struct cgroup_subsys_state *memcg_css, gfp_t gfp)
 {
@@ -766,6 +772,13 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 	spin_unlock_irq(&cgwb_lock);
 }
 
+static void cgwb_bdi_register(struct backing_dev_info *bdi)
+{
+	spin_lock_irq(&cgwb_lock);
+	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+	spin_unlock_irq(&cgwb_lock);
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static int cgwb_bdi_init(struct backing_dev_info *bdi)
@@ -793,6 +806,16 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 	wb_congested_put(bdi->wb_congested);
 }
 
+static void cgwb_bdi_register(struct backing_dev_info *bdi)
+{
+	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+}
+
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+	list_del_rcu(&wb->bdi_node);
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 int bdi_init(struct backing_dev_info *bdi)
@@ -811,8 +834,6 @@ int bdi_init(struct backing_dev_info *bdi)
 
 	ret = cgwb_bdi_init(bdi);
 
-	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
-
 	return ret;
 }
 EXPORT_SYMBOL(bdi_init);
@@ -848,6 +869,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 	if (IS_ERR(dev))
 		return PTR_ERR(dev);
 
+	cgwb_bdi_register(bdi);
 	bdi->dev = dev;
 
 	bdi_debug_register(bdi, dev_name(dev));
-- 
2.10.2


* [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

Currently we wait for all cgwbs to get freed in cgwb_bdi_destroy(),
which also means that writeback has been shut down on them. Since this
wait is going away, directly shut down writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe with concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shut down.
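
The resulting handshake between two racing wb_shutdown() calls looks like
this (a sketch using the names from the diff below):

	winner of WB_registered               concurrent caller
	---------------------------------     -----------------------------
	test_and_clear_bit(WB_registered)     test_and_clear_bit() fails
	set_bit(WB_shutting_down)             wait_on_bit(WB_shutting_down)
	... drain work list, flush dwork ...
	clear_bit(WB_shutting_down)      -->  done waiting; wb is really
	                                      shut down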

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 +
 mm/backing-dev.c                 | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
  */
 enum wb_state {
 	WB_registered,		/* bdi_register() was done */
+	WB_shutting_down,	/* wb_shutdown() in progress */
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 };
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e3d56dba4da8..b67be4fc12c4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
+		/*
+		 * Wait for wb shutdown to finish if someone else is just
+		 * running wb_shutdown(). Otherwise we could proceed to wb /
+		 * bdi destruction before wb_shutdown() is finished.
+		 */
+		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
 		return;
 	}
+	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
 
 	cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	/*
+	 * Make sure bit gets cleared after shutdown is finished. Matches with
+	 * the barrier provided by test_and_clear_bit() above.
+	 */
+	smp_wmb();
+	clear_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -699,12 +712,21 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
 	void **slot;
+	struct bdi_writeback *wb;
 
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
 	spin_lock_irq(&cgwb_lock);
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
+
+	while (!list_empty(&bdi->wb_list)) {
+		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+				      bdi_node);
+		spin_unlock_irq(&cgwb_lock);
+		wb_shutdown(wb);
+		spin_lock_irq(&cgwb_lock);
+	}
 	spin_unlock_irq(&cgwb_lock);
 
 	/*
-- 
2.10.2


* [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister()
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
(called from bdi_unregister()). That is however unnecessary now that
cgwb->bdi is a proper refcounted reference (thus the bdi cannot get
released before all cgwbs are released) and cgwb_bdi_destroy() shuts
down writeback directly.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 -
 mm/backing-dev.c                 | 22 +---------------------
 2 files changed, 1 insertion(+), 22 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8af720f22a2d..e66d4722db8e 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -164,7 +164,6 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
-	atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
 #else
 	struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b67be4fc12c4..8c30b1a7aae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -406,11 +406,9 @@ static void wb_exit(struct bdi_writeback *wb)
 /*
  * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
  * blkcg->cgwb_list, and memcg->cgwb_list.  bdi->cgwb_tree is also RCU
- * protected.  cgwb_release_wait is used to wait for the completion of cgwb
- * releases from bdi destruction path.
+ * protected.
  */
 static DEFINE_SPINLOCK(cgwb_lock);
-static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
 
 /**
  * wb_congested_get_create - get or create a wb_congested
@@ -505,7 +503,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 {
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
-	struct backing_dev_info *bdi = wb->bdi;
 
 	wb_shutdown(wb);
 
@@ -516,9 +513,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 	percpu_ref_exit(&wb->refcnt);
 	wb_exit(wb);
 	kfree_rcu(wb, rcu);
-
-	if (atomic_dec_and_test(&bdi->usage_cnt))
-		wake_up_all(&cgwb_release_wait);
 }
 
 static void cgwb_release(struct percpu_ref *refcnt)
@@ -608,7 +602,6 @@ static int cgwb_create(struct backing_dev_info *bdi,
 		/* we might have raced another instance of this function */
 		ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
 		if (!ret) {
-			atomic_inc(&bdi->usage_cnt);
 			list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
 			list_add(&wb->memcg_node, memcg_cgwb_list);
 			list_add(&wb->blkcg_node, blkcg_cgwb_list);
@@ -698,7 +691,6 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
 	bdi->cgwb_congested_tree = RB_ROOT;
-	atomic_set(&bdi->usage_cnt, 1);
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
@@ -728,18 +720,6 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);
-
-	/*
-	 * All cgwb's must be shutdown and released before returning.  Drain
-	 * the usage counter to wait for all cgwb's ever created on @bdi.
-	 */
-	atomic_dec(&bdi->usage_cnt);
-	wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
-	/*
-	 * Grab back our reference so that we hold it when @bdi gets
-	 * re-registered.
-	 */
-	atomic_inc(&bdi->usage_cnt);
 }
 
 /**
-- 
2.10.2


* [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
from bdi_unregister() which is not necessarily called from bdi_destroy()
and thus the name is somewhat misleading.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8c30b1a7aae5..3ea3bbd921d6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -700,7 +700,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return ret;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
 	void **slot;
@@ -801,7 +801,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return 0;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { }
 
 static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 {
@@ -925,7 +925,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
 	/* make sure nobody finds us on the bdi_list anymore */
 	bdi_remove_from_list(bdi);
 	wb_shutdown(&bdi->wb);
-	cgwb_bdi_destroy(bdi);
+	cgwb_bdi_unregister(bdi);
 
 	if (bdi->dev) {
 		bdi_debug_unregister(bdi);
-- 
2.10.2


* [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list()
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

When a block device is closed, we call inode_detach_wb() in __blkdev_put(),
which sets inode->i_wb to NULL. That is contrary to the expectation that
inode->i_wb, once set, stays valid for the whole inode's lifetime, and it
leads to an oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.

The reason why we called inode_detach_wb() this early is not valid anymore
though. The bdi is guaranteed to stay around until we call bdi_put() from
bdev_evict_inode(), so we can postpone calling inode_detach_wb() to that
moment.
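
The resulting ordering on the final close then is (a sketch):

	__blkdev_put()
		/* inode->i_wb is left intact, writeback still works */
	...
	bdev_evict_inode()
		inode_detach_wb(inode);	/* last use of inode->i_wb */
		bdi_put(bdev->bd_bdi);	/* may free bdi->wb, so it has
					 * to come after the detach */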

Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.

Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c            | 8 ++------
 include/linux/writeback.h | 1 +
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5ec8750f5332..c66f5dd4a02a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -885,6 +885,8 @@ static void bdev_evict_inode(struct inode *inode)
 	spin_lock(&bdev_lock);
 	list_del_init(&bdev->bd_list);
 	spin_unlock(&bdev_lock);
+	/* Detach inode from wb early as bdi_put() may free bdi->wb */
+	inode_detach_wb(inode);
 	if (bdev->bd_bdi != &noop_backing_dev_info) {
 		bdi_put(bdev->bd_bdi);
 		bdev->bd_bdi = &noop_backing_dev_info;
@@ -1880,12 +1882,6 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		kill_bdev(bdev);
 
 		bdev_write_inode(bdev);
-		/*
-		 * Detaching bdev inode from its wb in __destroy_inode()
-		 * is too late: the queue which embeds its bdi (along with
-		 * root wb) can be gone as soon as we put_disk() below.
-		 */
-		inode_detach_wb(bdev->bd_inode);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a3c0cbd7c888..d5815794416c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@ static inline void inode_attach_wb(struct inode *inode, struct page *page)
 static inline void inode_detach_wb(struct inode *inode)
 {
 	if (inode->i_wb) {
+		WARN_ON_ONCE(!(inode->i_state & I_CLEAR));
 		wb_put(inode->i_wb);
 		inode->i_wb = NULL;
 	}
-- 
2.10.2


* [PATCH 10/11] kobject: Export kobject_get_unless_zero()
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara, Greg Kroah-Hartman

Make the function available for outside use and fortify it against a
NULL kobject pointer.

CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/kobject.h | 2 ++
 lib/kobject.c           | 5 ++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, const char *new_name);
 extern int __must_check kobject_move(struct kobject *, struct kobject *);
 
 extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+						struct kobject *kobj);
 extern void kobject_put(struct kobject *kobj);
 
 extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
 }
 EXPORT_SYMBOL(kobject_get);
 
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
 {
+	if (!kobj)
+		return NULL;
 	if (!kref_get_unless_zero(&kobj->kref))
 		kobj = NULL;
 	return kobj;
 }
+EXPORT_SYMBOL(kobject_get_unless_zero);
 
 /*
  * kobject_cleanup - free kobject resources.
-- 
2.10.2


* [PATCH 11/11] block: Fix oops in scsi_disk_get()
@ 2017-03-13 15:14 ` Jan Kara
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

When device open races with device shutdown, we can get the following
oops in scsi_disk_get():

[11863.044351] general protection fault: 0000 [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W      4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
[11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
[11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
[11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
[11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
[11863.059217] FS:  00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
[11863.059217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
[11863.059217] Call Trace:
[11863.059217]  ? disk_get_part+0x22/0x1f0
[11863.059217]  sd_open+0x39/0x130
[11863.059217]  __blkdev_get+0x69/0x430
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? bd_acquire+0x96/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_get+0x126/0x350
[11863.059217]  ? _raw_spin_unlock+0x2b/0x40
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_open+0x65/0x80
...

As you can see, the RAX value is already poisoned, showing that the gendisk
we got has already been freed. The problem is that get_gendisk() looks up
the device number in ext_devt_idr and then does get_disk(), which does
kobject_get() on the disk's kobject. However, the disk gets removed from
ext_devt_idr only in disk_release() (through blk_free_devt()), at which
moment it already has a 0 refcount and is already on its way to be freed.
Indeed, we got a warning from kobject_get() about the 0 refcount shortly
before the oops.

We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get a reference on a disk that is already being
freed.
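
The same lookup-side discipline applies whenever an object is removed from
its lookup structure only in its release function. A sketch with
hypothetical names (obj_idr and struct obj are illustrative, not existing
kernel APIs):

	rcu_read_lock();
	obj = idr_find(&obj_idr, id);
	/*
	 * The refcount may already have dropped to zero here; a plain
	 * kobject_get() would resurrect the object while its release is
	 * running, kobject_get_unless_zero() safely fails instead.
	 */
	if (obj && !kobject_get_unless_zero(&obj->kobj))
		obj = NULL;
	rcu_read_unlock();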

Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index a9c516a8b37d..510aac1486cb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1352,7 +1352,7 @@ struct kobject *get_disk(struct gendisk *disk)
 	owner = disk->fops->owner;
 	if (owner && !try_module_get(owner))
 		return NULL;
-	kobj = kobject_get(&disk_to_dev(disk)->kobj);
+	kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
 	if (kobj == NULL) {
 		module_put(owner);
 		return NULL;
-- 
2.10.2


* Re: [PATCH 0/11 v4] block: Fix block device shutdown related races
@ 2017-03-13 18:10 ` Dan Williams
From: Dan Williams @ 2017-03-13 18:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval

On Mon, Mar 13, 2017 at 8:13 AM, Jan Kara <jack@suse.cz> wrote:
> Hello,
>
> this is a series with the remaining patches (on top of 4.11-rc2) to fix several
> different races and issues I've found when testing device shutdown and reuse.
> The first two patches fix possible (theoretical) problems when opening of a
> block device races with shutdown of a gendisk structure. Patches 3-9 fix an
> oops that is triggered by __blkdev_put() calling inode_detach_wb() too early
> (the problem reported by Thiago). Patches 10 and 11 fix an oops due to a bug
> in gendisk code where get_gendisk() can return an already freed gendisk
> structure (again triggered by Omar's stress test).
>
> People, please have a look at the patches. They are mostly simple, however
> the interactions are rather complex, so I may have missed something. Also
> I'm happy for any additional testing these patches can get - I've stressed
> them with Omar's script, tested memcg writeback, and tested static (not
> udev managed) device inodes.

Passes testing with the libnvdimm unit tests that have been tripped up
by block-unplug bugs in the past.


* Re: [PATCH 0/11 v4] block: Fix block device shutdown related races
@ 2017-03-21  0:57 ` Thiago Jung Bauermann
From: Thiago Jung Bauermann @ 2017-03-21  0:57 UTC (permalink / raw)
  To: Jan Kara, Lekshmi C. Pillai
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Tejun Heo, Tahsin Erdogan, Omar Sandoval

Hello,

On Monday, March 13, 2017 at 16:13:59 BRT, Jan Kara wrote:
> People, please have a look at the patches. They are mostly simple, however
> the interactions are rather complex, so I may have missed something. Also
> I'm happy for any additional testing these patches can get - I've stressed
> them with Omar's script, tested memcg writeback, and tested static (not
> udev managed) device inodes.

Lekshmi tested these patches on top of an older kernel version that is
afflicted by this bug (I backported the patches on which this series
depends as well), and it fixed the bug there.

Sorry if that's not as helpful, I had unrelated issues with v4.11-rc2 and
wasn't able to test with it yet. :-/

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
@ 2017-03-21 16:19   ` Tejun Heo
From: Tejun Heo @ 2017-03-21 16:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tahsin Erdogan, Omar Sandoval

Hello, Jan.

On Mon, Mar 13, 2017 at 04:14:01PM +0100, Jan Kara wrote:
> blkdev_open() may race with gendisk shutdown in two different ways.
> Either del_gendisk() has already unhashed block device inode (and thus
> bd_acquire() will end up creating new block device inode) however
> get_gendisk() will still return the gendisk that is being destroyed.
> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> before we get to get_gendisk() and get_gendisk() will return new gendisk
> that got allocated for device that reused the device number.
> 
> In both cases this will result in possible inconsistencies between
> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> destroyed and device number reused, in the second case immediately).
> 
> Fix the problem by checking whether the gendisk is still alive and inode
> hashed when associating bdev inode with it and its bdi. That way we are
> sure that we will not associate bdev inode with disk that got past
> blk_unregister_region() in del_gendisk() (and thus device number can get
> reused). Similarly, we will not associate bdev that was once associated
> with gendisk that is going away (and thus the corresponding bdev inode
> will get unhashed in del_gendisk()) with a new gendisk that is just
> reusing the device number.
> 
> Also add a warning that will tell us about unexpected inconsistencies
> between bdi associated with the bdev inode and bdi associated with the
> disk.

Hmm... let's see if I got this straight.

It used to be that blockdevs are essentially stateless while nobody
has them open.  They acquire all their actual associations during the
initial open and lose them on the last release.  The immediate
release semantics got us into trouble because upper layers had nothing
to serve as the proper sever point when the underlying block device
goes away.

So, we decided that bdi should serve that purpose, which makes perfect
sense as it's what upper layers talk to when they wanna reach to the
block device, so we decoupled its lifetime from the request_queue and
implemented the sever point there.

Now that bdis are persistent, we can solve the bdev-access-while-not-open
problem by making the bdev point to the respective bdi from the beginning
until it's released, which means that bdevs are now stateful objects
which are associated with specific bdis and thus request_queues.

Because there are multiple ways that these objects are looked up and
handled, now we can get into situations where the request_queue
(gendisk) we look up during open may not match the bdi that a bdev is
associated with, and this patch solves that problem by detecting the
conditions where these mismatches would take place.  More
specifically, we don't want to be using a dead bdev during open.

If I misunderstood something, please let me know.

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 53e2389ae4d4..5ec8750f5332 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  		if (!partno) {
>  			ret = -ENXIO;
>  			bdev->bd_part = disk_get_part(disk, partno);
> -			if (!bdev->bd_part)
> +			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
> +			    inode_unhashed(bdev->bd_inode))
>  				goto out_clear;

This adds both GENHD_FL_UP and unhashed testing; however, I'm not sure
whether the former is meaningful here.  GENHD flag updates are not
synchronized when seen from outside the party which is updating it.
The actual synchronization against dead device should happen inside
disk->fops->open().

>  			ret = 0;
> @@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  			bdev->bd_contains = whole;
>  			bdev->bd_part = disk_get_part(disk, partno);
>  			if (!(disk->flags & GENHD_FL_UP) ||
> -			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
> +			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
> +			    inode_unhashed(bdev->bd_inode)) {
>  				ret = -ENXIO;
>  				goto out_clear;
>  			}

Which is different from here.  As the device is already open, we're
just incrementing the ref before granting another open without
consulting the driver.  As the device is open already, whether
granting a new ref or not can't break anything.  Block layer is just
doing a best effort thing to not give out new ref on a dead one with
the UP test, which is completely fine.

> @@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  
>  		if (bdev->bd_bdi == &noop_backing_dev_info)
>  			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
> +		else
> +			WARN_ON_ONCE(bdev->bd_bdi !=
> +				     disk->queue->backing_dev_info);

While I can't think of cases where this wouldn't work, it seems a bit
roundabout to me.

* A bdi is the object which knows which request_queue it's associated
  and when that association dies.

* bdev is permanently associated with a bdi.

So, it's kinda weird to look up the request_queue again on each open
and then verify that the bd_bdi and request_queue match.  I think it
would make more sense to skip disk look up and just go through bdi to
determine the associated request_queue as that's the primary nexus
object tying everything up now.

That said, it's totally possible that that would take too much
restructuring to implement right now with everything else, but if we
think that that's the right long term direction, I think it would make
more sense to just test whether bdev->bd_bdi matches the disk right
after looking up the disk and fail the open if not.  That's the
ultimate condition we wanna avoid after all and it also would ease
replacing it with going through bdi instead of looking up again.

Thanks.

-- 
tejun


* Re: [PATCH 05/11] bdi: Unify bdi->wb_list handling for root bdi_writeback
@ 2017-03-21 16:21   ` Tejun Heo
From: Tejun Heo @ 2017-03-21 16:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tahsin Erdogan, Omar Sandoval

On Mon, Mar 13, 2017 at 04:14:04PM +0100, Jan Kara wrote:
> Currently the root bdi_writeback structure is added to bdi->wb_list in
> bdi_init() and never removed. That is different from all other
> bdi_writeback structures which get added to the list when created and
> removed from it before wb_shutdown().
> 
> So move the list addition of the root bdi_writeback to bdi_register() and
> the list removal of all bdi_writeback structures to wb_shutdown(). That
> way a bdi_writeback structure is on bdi->wb_list if and only if it can
> handle writeback, and it will make it easier for us to handle shutdown of
> all bdi_writeback structures in bdi_unregister().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
@ 2017-03-23  0:21     ` Jan Kara
From: Jan Kara @ 2017-03-23  0:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Thiago Jung Bauermann, Tahsin Erdogan,
	Omar Sandoval

Hello Tejun,

> On Mon, Mar 13, 2017 at 04:14:01PM +0100, Jan Kara wrote:
> > blkdev_open() may race with gendisk shutdown in two different ways.
> > Either del_gendisk() has already unhashed block device inode (and thus
> > bd_acquire() will end up creating new block device inode) however
> > get_gendisk() will still return the gendisk that is being destroyed.
> > Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> > before we get to get_gendisk() and get_gendisk() will return new gendisk
> > that got allocated for device that reused the device number.
> > 
> > In both cases this will result in possible inconsistencies between
> > bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> > destroyed and device number reused, in the second case immediately).
> > 
> > Fix the problem by checking whether the gendisk is still alive and inode
> > hashed when associating bdev inode with it and its bdi. That way we are
> > sure that we will not associate bdev inode with disk that got past
> > blk_unregister_region() in del_gendisk() (and thus device number can get
> > reused). Similarly, we will not associate bdev that was once associated
> > with gendisk that is going away (and thus the corresponding bdev inode
> > will get unhashed in del_gendisk()) with a new gendisk that is just
> > reusing the device number.
> > 
> > Also add a warning that will tell us about unexpected inconsistencies
> > between bdi associated with the bdev inode and bdi associated with the
> > disk.
> 
> Hmm... let's see if I got this straight.
> 
> It used to be that blockdevs are essentially stateless while nobody
> has them open.  They acquire all their actual associations during the
> initial open and lose them on the last release.  The immediate
> release semantics got us into trouble because upper layers had nothing
> to serve as the proper sever point when the underlying block device
> goes away.
> 
> So, we decided that bdi should serve that purpose, which makes perfect
> sense as it's what upper layers talk to when they wanna reach to the
> block device, so we decoupled its lifetime from the request_queue and
> implemented the sever point there.
> 
> Now that bdis are persistent, we can solve the bdev-access-while-not-open
> problem by making the bdev point to the respective bdi from the beginning
> until it's released, which means that bdevs are now stateful objects
> which are associated with specific bdis and thus request_queues.

Yes, perfect summary.

> Because there are multiple ways that these objects are looked up and
> handled, now we can get into situations where the request_queue
> (gendisk) we look up during open may not match the bdi that a bdev is
> associated with, and this patch solves that problem by detecting the
> conditions where these mismatches would take place.  More
> specifically, we don't want to be using a dead bdev during open.

Yes.

> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 53e2389ae4d4..5ec8750f5332 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> >  		if (!partno) {
> >  			ret = -ENXIO;
> >  			bdev->bd_part = disk_get_part(disk, partno);
> > -			if (!bdev->bd_part)
> > +			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
> > +			    inode_unhashed(bdev->bd_inode))
> >  				goto out_clear;
> 
> This adds both GENHD_FL_UP and unhashed testing; however, I'm not sure
> whether the former is meaningful here.  GENHD flag updates are not
> synchronized when seen from outside the party which is updating it.
> The actual synchronization against dead device should happen inside
> disk->fops->open().

Hum, now that I look at the code again it seems my patch is still racy: if
we get a fresh inode (lookup_bdev() already allocated a new inode and
returned it to us) but get_gendisk() returned a gendisk that is shutting
down, we'll still see GENHD_FL_UP set and thus associate the new inode
with a gendisk that is being destroyed.

> >  			ret = 0;
> > @@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> >  			bdev->bd_contains = whole;
> >  			bdev->bd_part = disk_get_part(disk, partno);
> >  			if (!(disk->flags & GENHD_FL_UP) ||
> > -			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
> > +			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
> > +			    inode_unhashed(bdev->bd_inode)) {
> >  				ret = -ENXIO;
> >  				goto out_clear;
> >  			}
> 
> Which is different from here.  As the device is already open, we're
> just incrementing the ref before granting another open without
> consulting the driver.  As the device is open already, whether
> granting a new ref or not can't break anything.  Block layer is just
> doing a best effort thing to not give out new ref on a dead one with
> the UP test, which is completely fine.

Yeah.

> > @@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> >  
> >  		if (bdev->bd_bdi == &noop_backing_dev_info)
> >  			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
> > +		else
> > +			WARN_ON_ONCE(bdev->bd_bdi !=
> > +				     disk->queue->backing_dev_info);
> 
> While I can't think of cases where this wouldn't work, it seems a bit
> roundabout to me.
> 
> * A bdi is the object which knows which request_queue it's associated
>   and when that association dies.

So bdi is associated with a request_queue but it does not have any
direct pointer or reference to it. It is the request_queue that points to
bdi. However you are right that bdi_unregister() call is telling bdi that
the device is going away.

> * bdev is permanently associated with a bdi.

Yes.

> So, it's kinda weird to look up the request_queue again on each open
> and then verify that the bd_bdi and request_queue match.  I think it
> would make more sense to skip disk look up and just go through bdi to
> determine the associated request_queue as that's the primary nexus
> object tying everything up now.

So I was thinking about this as well. The problem is that you cannot really
keep a reference to the request_queue (or gendisk) from the block device
inode or the bdi as that could unexpectedly pin the driver in memory after
the user thinks everything should be cleaned up. So I think we really do
have to establish the connection between the block device inode and the
request_queue / gendisk on opening of the block device and drop the
reference when the block device is closed.

> That said, it's totally possible that that would take too much
> restructuring to implement right now with everything else, but if we
> think that that's the right long term direction, I think it would make
> more sense to just test whether bdev->bd_bdi matches the disk right
> after looking up the disk and fail the open if not.  That's the
> ultimate condition we wanna avoid after all and it also would ease
> replacing it with going through bdi instead of looking up again.

The problem with this is that you could still associate a fresh block device
inode with a bdi corresponding to a gendisk that is going away (and that has
already gone through bdev_unhash_inode()). Such a block device inode may then
stay in the cache associated with that bdi even after the device number gets
reused, and that will cause the check for a matching bdi to falsely fail ever
after that moment.

So since I'm not actually able to trigger any of these races in my testing,
I guess I'll just drop this patch (it isn't needed by anything else in the
series) and let the rest of the series be merged. When I come up with some
better solution to this problem, I'll send a fix separately. Thanks for the
review!

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
@ 2017-03-23 15:38       ` Tejun Heo
From: Tejun Heo @ 2017-03-23 15:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tahsin Erdogan, Omar Sandoval

Hello, Jan.

On Thu, Mar 23, 2017 at 01:21:05AM +0100, Jan Kara wrote:
> > * A bdi is the object which knows which request_queue it's associated
> >   and when that association dies.
> 
> So bdi is associated with a request_queue but it does not have any
> direct pointer or reference to it. It is the request_queue that points to
> bdi. However you are right that bdi_unregister() call is telling bdi that
> the device is going away.
> 
> > * bdev is permanently associated with a bdi.
> 
> Yes.
> 
> > So, it's kinda weird to look up the request_queue again on each open
> > and then verify that the bd_bdi and request_queue match.  I think it
> > would make more sense to skip disk look up and just go through bdi to
> > determine the associated request_queue as that's the primary nexus
> > object tying everything up now.
> 
> So I was thinking about this as well. The problem is that you cannot really
> keep a reference to request_queue (or gendisk) from block device inode or
> bdi, as that could unexpectedly pin the driver in memory after the user thinks
> everything should be cleaned up. So I think we really do have to establish
> the connection between block device inode and request_queue / gendisk on
> opening of the block device and drop the reference when the block device is
> closed.

Hmmm... but for bdi to be able to serve as the sever point, it must
know when it gets disassociated (del_gendisk) and be able to safely
clear its pointer and reject further access to the request_queue.  I
think we're trying to shift that synchronization to the open path, but
we have to do it somewhere no matter what, and it'll likely be most
straightforward to do it at the sever point.
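
Roughly something like this (hypothetical -- bdi has no such back-pointer
or lock today, both names below are made up):

/* imagined extra state in struct backing_dev_info:
 *	struct request_queue *queue;	back-pointer, set on alloc
 *	spinlock_t assoc_lock;		guards the back-pointer
 */

/* called from del_gendisk(): sever the bdi -> queue association */
static void bdi_sever_queue(struct backing_dev_info *bdi)
{
	spin_lock(&bdi->assoc_lock);
	bdi->queue = NULL;	/* later lookups see no queue and fail */
	spin_unlock(&bdi->assoc_lock);
}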

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-13 15:14 ` [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown Jan Kara
  2017-03-21 16:19   ` Tejun Heo
@ 2017-11-17  6:51   ` Hou Tao
  2017-11-20 16:43     ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: Hou Tao @ 2017-11-17  6:51 UTC (permalink / raw)
  To: Jan Kara, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	houtao1

Hi Jan,

On 2017/3/13 23:14, Jan Kara wrote:
> blkdev_open() may race with gendisk shutdown in two different ways.
> Either del_gendisk() has already unhashed block device inode (and thus
> bd_acquire() will end up creating new block device inode) however
> get_gendisk() will still return the gendisk that is being destroyed.
> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> before we get to get_gendisk() and get_gendisk() will return new gendisk
> that got allocated for device that reused the device number.
> 
> In both cases this will result in possible inconsistencies between
> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> destroyed and device number reused, in the second case immediately).

> Fix the problem by checking whether the gendisk is still alive and inode
> hashed when associating bdev inode with it and its bdi. That way we are
> sure that we will not associate bdev inode with disk that got past
> blk_unregister_region() in del_gendisk() (and thus device number can get
> reused). Similarly, we will not associate bdev that was once associated
> with gendisk that is going away (and thus the corresponding bdev inode
> will get unhashed in del_gendisk()) with a new gendisk that is just
> reusing the device number.

I have run some extended tests based on Omar Sandoval's script [1] these days,
and found two new problems related to the race between bdev open and gendisk
shutdown.

The first problem lies in __blkdev_get(), and will cause a use-after-free or,
worse, an oops. It happens because there may be two instances of gendisk
related to one bdev. When the process which owns the newer gendisk completes
the invocation of __blkdev_get() first, the other process, which owns the
older gendisk, will put the last reference of the older gendisk and cause a
use-after-free.

The following call sequences illustrate the problem:

unhash the bdev_inode of /dev/sda

process A (blkdev_open()):
	bd_acquire() returns new bdev_inode of /dev/sda
	get_gendisk() returns gendisk (v1 gendisk)

remove gendisk from bdev_map
device number is reused

process B (blkdev_open()):
	bd_acquire() returns new bdev_inode of /dev/sda
	get_gendisk() returns a new gendisk (v2 gendisk)
	increase bdev->bd_openers

process A (blkdev_open()):
	find bdev->bd_openers != 0
	put_disk(disk) // this is the last reference to v1 gendisk
	disk_unblock_events(disk) // use-after-free occurs

The problem can be fixed by an extra check, shown in the following snippet:

static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
	/* ...... */
	if (!bdev->bd_openers) {
		/* first open: existing initialization path, unchanged */
	} else {
		/* bdev already open: reject if we raced and looked up a
		 * different (newer) gendisk than the one bdev is bound to */
		if (bdev->bd_disk != disk) {
			ret = -ENXIO;
			goto out_unlock_bdev;
		}
	}
	/* ...... */
}

And we can also simplify the check in your patch by moving blk_unregister_region()
before bdev_unhash_inode(), so whenever we get a new bdev_inode, the gendisk returned
by get_gendisk() will either be NULL or a new gendisk. And the only check we need to do
in __blkdev_get() is checking the hash status of the bdev_inode after we get a gendisk
from get_gendisk(). The check can be done like the following snippet:

static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
	/* ...... */
	ret = -ENXIO;
	disk = get_gendisk(bdev->bd_dev, &partno);
	if (!disk)
		goto out;
	owner = disk->fops->owner;

	/* bail out if the inode got unhashed while we looked up the disk;
	 * out_unmatched would be a new label dropping the disk reference */
	if (inode_unhashed(bdev->bd_inode))
		goto out_unmatched;
}

The second problem lies in bd_start_claiming(), and will trigger the BUG_ON assert in
blkdev_get(): BUG_ON(!bd_may_claim(bdev, whole, holder)). It occurs when testing the
exclusive open and close of the disk and its partitions.

The cause of the problem is that a bdev_inode of a disk partition corresponds to two
instances of the bdev_inode of the whole disk simultaneously. When one pair of the
partition inode and disk inode completes the claiming, the other pair will trip the
BUG_ON assert in blkdev_get() because bdev->bd_holder is no longer NULL.

The following sequences illustrate the problem:

unhash the bdev_inode of /dev/sda2

process A (blkdev_open()):
	bd_acquire() returns a new bdev_inode for /dev/sda2
	bd_start_claiming() returns bdev_inode of /dev/sda

process B (blkdev_open()):
	bd_acquire() returns the new bdev_inode for /dev/sda2

unhash the bdev_inode of /dev/sda
remove gendisk from bdev_map
device number is reused

process B (blkdev_open()):
	bd_start_claiming() returns a new bdev_inode for /dev/sda

process A (blkdev_open()):
	__blkdev_get() returns successfully
	finish claiming  // bdev->bd_holder = holder

process B (blkdev_open()):
	__blkdev_get() returns successfully
	trigger BUG_ON(!bd_may_claim(bdev, whole, holder))

And the problem can be fixed by adding more checks in bd_start_claiming():

static struct block_device *bd_start_claiming(struct block_device *bdev,
                                              void *holder)
{
	/* ...... */
        disk = get_gendisk(bdev->bd_dev, &partno);
        if (!disk)
                return ERR_PTR(-ENXIO);

        if (inode_unhashed(bdev->bd_inode)) {
                module_put(disk->fops->owner);
                put_disk(disk);
                return ERR_PTR(-ENXIO);
        }

	/* ...... */
        if (!whole)
                return ERR_PTR(-ENOMEM);

        /* re-check both inodes: either may have been unhashed while the
         * whole-disk bdev was being looked up above */
        if (inode_unhashed(bdev->bd_inode) ||
                inode_unhashed(whole->bd_inode)) {
                bdput(whole);
                return ERR_PTR(-ENXIO);
        }
}

Other than adding more and more checks, are there better solutions to the race problem?
I have thought about managing the device number with a ref-counter, so that the device
number will not be reused until the last references of the bdev_inode and gendisk are
released, and I am trying to figure out a way to get/put a reference on the device
number when getting/putting the block_device.
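
Purely as an illustration of that idea (every name below is made up;
nothing like this exists in the tree), the ownership could look roughly
like:

#include <linux/kref.h>
#include <linux/slab.h>

/* hypothetical: one of these per allocated device number */
struct devt_ref {
	dev_t devt;
	struct kref kref;	/* held by the gendisk and each bdev inode */
};

static void devt_ref_release(struct kref *kref)
{
	struct devt_ref *ref = container_of(kref, struct devt_ref, kref);

	/* only now may the device number be handed out again
	 * (release_devt_for_reuse() is a made-up helper) */
	release_devt_for_reuse(ref->devt);
	kfree(ref);
}

static inline void devt_ref_get(struct devt_ref *ref)
{
	kref_get(&ref->kref);
}

static inline void devt_ref_put(struct devt_ref *ref)
{
	kref_put(&ref->kref, devt_ref_release);
}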

So, any suggestions or thoughts?

Regards,
Tao

---
1: https://marc.info/?l=linux-block&m=148554717109098&w=2


> Also add a warning that will tell us about unexpected inconsistencies
> between bdi associated with the bdev inode and bdi associated with the
> disk.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/block_dev.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 53e2389ae4d4..5ec8750f5332 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  		if (!partno) {
>  			ret = -ENXIO;
>  			bdev->bd_part = disk_get_part(disk, partno);
> -			if (!bdev->bd_part)
> +			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
> +			    inode_unhashed(bdev->bd_inode))
>  				goto out_clear;
>  
>  			ret = 0;
> @@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  			bdev->bd_contains = whole;
>  			bdev->bd_part = disk_get_part(disk, partno);
>  			if (!(disk->flags & GENHD_FL_UP) ||
> -			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
> +			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
> +			    inode_unhashed(bdev->bd_inode)) {
>  				ret = -ENXIO;
>  				goto out_clear;
>  			}
> @@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  
>  		if (bdev->bd_bdi == &noop_backing_dev_info)
>  			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
> +		else
> +			WARN_ON_ONCE(bdev->bd_bdi !=
> +				     disk->queue->backing_dev_info);
>  	} else {
>  		if (bdev->bd_contains == bdev) {
>  			ret = 0;
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-11-17  6:51   ` Hou Tao
@ 2017-11-20 16:43     ` Jan Kara
  2017-11-24  3:12       ` Hou Tao
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2017-11-20 16:43 UTC (permalink / raw)
  To: Hou Tao
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan,
	Omar Sandoval

Hi Tao!

On Fri 17-11-17 14:51:18, Hou Tao wrote:
> On 2017/3/13 23:14, Jan Kara wrote:
> > blkdev_open() may race with gendisk shutdown in two different ways.
> > Either del_gendisk() has already unhashed block device inode (and thus
> > bd_acquire() will end up creating new block device inode) however
> > get_gendisk() will still return the gendisk that is being destroyed.
> > Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> > before we get to get_gendisk() and get_gendisk() will return new gendisk
> > that got allocated for device that reused the device number.
> > 
> > In both cases this will result in possible inconsistencies between
> > bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> > destroyed and device number reused, in the second case immediately).
> 
> > Fix the problem by checking whether the gendisk is still alive and inode
> > hashed when associating bdev inode with it and its bdi. That way we are
> > sure that we will not associate bdev inode with disk that got past
> > blk_unregister_region() in del_gendisk() (and thus device number can get
> > reused). Similarly, we will not associate bdev that was once associated
> > with gendisk that is going away (and thus the corresponding bdev inode
> > will get unhashed in del_gendisk()) with a new gendisk that is just
> > reusing the device number.
> 
> I have run some extended tests based on Omar Sandoval's script [1] these days,
> and found two new problems related to the race between bdev open and gendisk
> shutdown.
> 
> The first problem lies in __blkdev_get(), and will cause a use-after-free or,
> worse, an oops. It happens because there may be two instances of gendisk
> related to one bdev. When the process which owns the newer gendisk completes
> the invocation of __blkdev_get() first, the other process, which owns the
> older gendisk, will put the last reference of the older gendisk and cause a
> use-after-free.
> 
> The following call sequences illustrate the problem:
> 
> unhash the bdev_inode of /dev/sda
> 
> process A (blkdev_open()):
> 	bd_acquire() returns new bdev_inode of /dev/sda
> 	get_gendisk() returns gendisk (v1 gendisk)
> 
> remove gendisk from bdev_map
> device number is reused

^^ this is through blk_unregister_region() in del_gendisk(), isn't it?

> process B (blkdev_open()):
> 	bd_acquire() returns new bdev_inode of /dev/sda
> 	get_gendisk() returns a new gendisk (v2 gendisk)
> 	increase bdev->bd_openers

So this needs to be a bit more complex - bd_acquire() must happen before the
bdev_unhash_inode() in del_gendisk() above, and get_gendisk() must happen
after the blk_unregister_region() from del_gendisk(). But yes, this seems
possible.

> process A (blkdev_open()):
> 	find bdev->bd_openers != 0
> 	put_disk(disk) // this is the last reference to v1 gendisk
> 	disk_unblock_events(disk) // use-after-free occurs
> 
> The problem can be fixed by an extra check, shown in the following snippet:
> 
> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> {
> 	if (!bdev->bd_openers) {
> 	} else {
> 		if (bdev->bd_disk != disk) {
> 			ret = -ENXIO;
> 			goto out_unlock_bdev;
> 		}
> 	}
> }
>
> And we can also simplify the check in your patch by moving
> blk_unregister_region() before bdev_unhash_inode(), so whenever we get a
> new bdev_inode, the gendisk returned by get_gendisk() will either be NULL
> or a new gendisk.

I don't think we can do that. If we call blk_unregister_region() before
bdev_unhash_inode(), the device number can get reused while the bdev inode is
still visible. So you can end up with bdev inode v1 associated with gendisk
v2. And an open following the unhashing of bdev inode v1 will get bdev inode
v2 associated with gendisk v2. So you have two different bdev inodes
associated with the same gendisk and that will cause data integrity issues.

But what you suggest above is probably a reasonable fix. Will you submit a
proper patch please?

> And the only check we need to do in __blkdev_get() is
> checking the hash status of the bdev_inode after we get a gendisk from
> get_gendisk(). The check can be done like the following snippet:
> 
> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> {
> 	/* ...... */
> 	ret = -ENXIO;
> 	disk = get_gendisk(bdev->bd_dev, &partno);
> 	if (!disk)
> 		goto out;
> 	owner = disk->fops->owner;
> 
> 	if (inode_unhashed(bdev->bd_inode))
> 		goto out_unmatched;
> }
> 
> The second problem lies in bd_start_claiming(), and will trigger the
> BUG_ON assert in blkdev_get(): BUG_ON(!bd_may_claim(bdev, whole, holder)).
> It occurs when testing the exclusive open and close of the disk and its
> partitions.
> 
> The cause of the problem is that a bdev_inode of a disk partition
> corresponds to two instances of the bdev_inode of the whole disk
> simultaneously. When one pair of the partition inode and disk inode
> completes the claiming, the other pair will trip the BUG_ON assert in
> blkdev_get() because bdev->bd_holder is no longer NULL.
> 
> The following sequences illustrate the problem:
> 
> unhash the bdev_inode of /dev/sda2
> 
> process A (blkdev_open()):
> 	bd_acquire() returns a new bdev_inode for /dev/sda2
> 	bd_start_claiming() returns bdev_inode of /dev/sda
> 
> process B (blkdev_open()):
> 	bd_acquire() returns the new bdev_inode for /dev/sda2
> 
> unhash the bdev_inode of /dev/sda
> remove gendisk from bdev_map
> device number is reused
> 
> process B (blkdev_open()):
> 	bd_start_claiming() returns a new bdev_inode for /dev/sda
> 
> process A (blkdev_open()):
> 	__blkdev_get() returns successfully
> 	finish claiming  // bdev->bd_holder = holder
> 
> process B (blkdev_open()):
> 	__blkdev_get() returns successfully
> 	trigger BUG_ON(!bd_may_claim(bdev, whole, holder))

Ah, good spotting. Essentially the whole->bd_claiming protection for
partition bdev does not work because we've got two different 'whole' bdev
pointers.

> And the problem can be fixed by adding more checks in bd_start_claiming():
> 
> static struct block_device *bd_start_claiming(struct block_device *bdev,
>                                               void *holder)
> {
> 	/* ...... */
>         disk = get_gendisk(bdev->bd_dev, &partno);
>         if (!disk)
>                 return ERR_PTR(-ENXIO);
> 
>         if (inode_unhashed(bdev->bd_inode)) {
>                 module_put(disk->fops->owner);
>                 put_disk(disk);
>                 return ERR_PTR(-ENXIO);
>         }
> 
> 	/* ...... */
>         if (!whole)
>                 return ERR_PTR(-ENOMEM);
> 
>         if (inode_unhashed(bdev->bd_inode) ||
>                 inode_unhashed(whole->bd_inode)) {
>                 bdput(whole);
>                 return ERR_PTR(-ENXIO);
>         }
> }
> 
> Other than adding more and more checks, are there better solutions to the
> race problem?  I have thought about managing the device number with a
> ref-counter, so that the device number will not be reused until the last
> references of the bdev_inode and gendisk are released, and I am trying to
> figure out a way to get/put a reference on the device number when
> getting/putting the block_device.
> 
> So, any suggestions or thoughts?

Yeah, these races are nasty. I will think about whether there's some
reasonably easy way to fix them or whether we'll have to go via some other
route like blocking the device number as you suggest. In either case thanks
for the report and the analysis!

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-11-20 16:43     ` Jan Kara
@ 2017-11-24  3:12       ` Hou Tao
  0 siblings, 0 replies; 26+ messages in thread
From: Hou Tao @ 2017-11-24  3:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval

Hi Jan,

On 2017/11/21 0:43, Jan Kara wrote:
> Hi Tao!
> 
> On Fri 17-11-17 14:51:18, Hou Tao wrote:
>> On 2017/3/13 23:14, Jan Kara wrote:
>>> blkdev_open() may race with gendisk shutdown in two different ways.
>>> Either del_gendisk() has already unhashed block device inode (and thus
>>> bd_acquire() will end up creating new block device inode) however
>>> get_gendisk() will still return the gendisk that is being destroyed.
>>> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
>>> before we get to get_gendisk() and get_gendisk() will return new gendisk
>>> that got allocated for device that reused the device number.
>>>
>>> In both cases this will result in possible inconsistencies between
>>> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
>>> destroyed and device number reused, in the second case immediately).
>>
>>> Fix the problem by checking whether the gendisk is still alive and inode
>>> hashed when associating bdev inode with it and its bdi. That way we are
>>> sure that we will not associate bdev inode with disk that got past
>>> blk_unregister_region() in del_gendisk() (and thus device number can get
>>> reused). Similarly, we will not associate bdev that was once associated
>>> with gendisk that is going away (and thus the corresponding bdev inode
>>> will get unhashed in del_gendisk()) with a new gendisk that is just
>>> reusing the device number.
>>
>> I have run some extended tests based on Omar Sandoval's script [1] these days,
>> and found two new problems related to the race between bdev open and gendisk
>> shutdown.
>>
>> The first problem lies in __blkdev_get(), and will cause a use-after-free or,
>> worse, an oops. It happens because there may be two instances of gendisk
>> related to one bdev. When the process which owns the newer gendisk completes
>> the invocation of __blkdev_get() first, the other process, which owns the
>> older gendisk, will put the last reference of the older gendisk and cause a
>> use-after-free.
>>
>> The following call sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda
>>
>> process A (blkdev_open()):
>> 	bd_acquire() returns new bdev_inode of /dev/sda
>> 	get_gendisk() returns gendisk (v1 gendisk)
>>
>> remove gendisk from bdev_map
>> device number is reused
> 
> ^^ this is through blk_unregister_region() in del_gendisk(), isn't it?

Yes, both the unhashing of the inode and the removal of the gendisk from
bdev_map happen in del_gendisk().

>> process B (blkdev_open()):
>> 	bd_acquire() returns new bdev_inode of /dev/sda
>> 	get_gendisk() returns a new gendisk (v2 gendisk)
>> 	increase bdev->bd_openers
> 
> So this needs to be a bit more complex - bd_acquire() must happen before the
> bdev_unhash_inode() in del_gendisk() above, and get_gendisk() must happen
> after the blk_unregister_region() from del_gendisk(). But yes, this seems
> possible.
> 
>> process A (blkdev_open()):
>> 	find bdev->bd_openers != 0
>> 	put_disk(disk) // this is the last reference to v1 gendisk
>> 	disk_unblock_events(disk) // use-after-free occurs
>>
>> The problem can be fixed by an extra check, shown in the following snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> 	if (!bdev->bd_openers) {
>> 	} else {
>> 		if (bdev->bd_disk != disk) {
>> 			ret = -ENXIO;
>> 			goto out_unlock_bdev;
>> 		}
>> 	}
>> }
>>
>> And we can also simplify the check in your patch by moving
>> blk_unregister_region() before bdev_unhash_inode(), so whenever we get a
>> new bdev_inode, the gendisk returned by get_gendisk() will either be NULL
>> or a new gendisk.
> 
> I don't think we can do that. If we call blk_unregister_region() before
> bdev_unhash_inode(), the device number can get reused while the bdev inode is
> still visible. So you can end up with bdev inode v1 associated with gendisk
> v2. And an open following the unhashing of bdev inode v1 will get bdev inode
> v2 associated with gendisk v2. So you have two different bdev inodes
> associated with the same gendisk and that will cause data integrity issues.

Even if we do not move blk_unregister_region(), the data integrity issue still
exists: bd_acquire() is invoked before bdev_unhash_inode() and get_gendisk()
is invoked after blk_unregister_region(). The move was meant to simplify the
consistency check (the version check between bdev_inode and gendisk), so that
the added check above would be unnecessary, because whenever bd_acquire()
returns a new bdev_inode, the gendisk returned by get_gendisk() will also be
a new gendisk or NULL. Anyway, it's just another workaround, so it's OK if we
do not do that.

> But what you suggest above is probably a reasonable fix. Will you submit a
> proper patch please?

Uh, it's also an incomplete fix for the race problem. Anyhow, I will do it.

>> And the only check we need to do in __blkdev_get() is
>> checking the hash status of the bdev_inode after we get a gendisk from
>> get_gendisk(). The check can be done like the following snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> 	/* ...... */
>> 	ret = -ENXIO;
>> 	disk = get_gendisk(bdev->bd_dev, &partno);
>> 	if (!disk)
>> 		goto out;
>> 	owner = disk->fops->owner;
>>
>> 	if (inode_unhashed(bdev->bd_inode))
>> 		goto out_unmatched;
>> }
>>
>> The second problem lies in bd_start_claiming(), and will trigger the
>> BUG_ON assert in blkdev_get(): BUG_ON(!bd_may_claim(bdev, whole, holder)).
>> It occurs when testing the exclusive open and close of the disk and its
>> partitions.
>>
>> The cause of the problem is that a bdev_inode of a disk partition
>> corresponds to two instances of the bdev_inode of the whole disk
>> simultaneously. When one pair of the partition inode and disk inode
>> completes the claiming, the other pair will trip the BUG_ON assert in
>> blkdev_get() because bdev->bd_holder is no longer NULL.
>>
>> The following sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda2
>>
>> process A (blkdev_open()):
>> 	bd_acquire() returns a new bdev_inode for /dev/sda2
>> 	bd_start_claiming() returns bdev_inode of /dev/sda
>>
>> process B (blkdev_open()):
>> 	bd_acquire() returns the new bdev_inode for /dev/sda2
>>
>> unhash the bdev_inode of /dev/sda
>> remove gendisk from bdev_map
>> device number is reused
>>
>> process B (blkdev_open()):
>> 	bd_start_claiming() returns a new bdev_inode for /dev/sda
>>
>> process A (blkdev_open()):
>> 	__blkdev_get() returns successfully
>> 	finish claiming  // bdev->bd_holder = holder
>>
>> process B (blkdev_open()):
>> 	__blkdev_get() returns successfully
>> 	trigger BUG_ON(!bd_may_claim(bdev, whole, holder))
> 
> Ah, good spotting. Essentially the whole->bd_claiming protection for
> partition bdev does not work because we've got two different 'whole' bdev
> pointers.

Yes, perfect summary.

>> And the problem can be fixed by adding more checks in bd_start_claiming():
>>
>> static struct block_device *bd_start_claiming(struct block_device *bdev,
>>                                               void *holder)
>> {
>> 	/* ...... */
>>         disk = get_gendisk(bdev->bd_dev, &partno);
>>         if (!disk)
>>                 return ERR_PTR(-ENXIO);
>>
>>         if (inode_unhashed(bdev->bd_inode)) {
>>                 module_put(disk->fops->owner);
>>                 put_disk(disk);
>>                 return ERR_PTR(-ENXIO);
>>         }
>>
>> 	/* ...... */
>>         if (!whole)
>>                 return ERR_PTR(-ENOMEM);
>>
>>         if (inode_unhashed(bdev->bd_inode) ||
>>                 inode_unhashed(whole->bd_inode)) {
>>                 bdput(whole);
>>                 return ERR_PTR(-ENXIO);
>>         }
>> }
>>
>> Other than adding more and more checks, are there better solutions to the
>> race problem?  I have thought about managing the device number with a
>> ref-counter, so that the device number will not be reused until the last
>> references of the bdev_inode and gendisk are released, and I am trying to
>> figure out a way to get/put a reference on the device number when
>> getting/putting the block_device.
>>
>> So, any suggestions or thoughts?
> 
> Yeah, these races are nasty. I will think about whether there's some
> reasonably easy way to fix them or whether we'll have to go via some other
> route like blocking the device number as you suggest. In either case thanks
> for the report and the analysis!

Looking forward to reviewing them.

Regards,
Tao

> 								Honza
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-07 19:43       ` Tejun Heo
@ 2017-03-08 14:25         ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2017-03-08 14:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown,
	Omar Sandoval

On Tue 07-03-17 14:43:49, Tejun Heo wrote:
> Hello,
> 
> On Tue, Mar 07, 2017 at 12:14:20PM +0100, Jan Kara wrote:
> > > Given that this isn't a super hot path, I think it'd be far better to
> > > stick to a simpler synchronization mechanism.
> > 
> > I'd be happy to do so however struct gendisk does not have any lock in it
> > and introducing a special lock just for that seems awkward.
> 
> I think even an awkward shared lock is better than memory barriers.
> Barriers are the trickiest locking construct to get right and we
> really shouldn't be using them unless absolutely necessary.

OK, I'll try to come up with something.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-07 11:14     ` Jan Kara
@ 2017-03-07 19:43       ` Tejun Heo
  2017-03-08 14:25         ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-03-07 19:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval

Hello,

On Tue, Mar 07, 2017 at 12:14:20PM +0100, Jan Kara wrote:
> > Given that this isn't a super hot path, I think it'd be far better to
> > stick to a simpler synchronization mechanism.
> 
> I'd be happy to do so however struct gendisk does not have any lock in it
> and introducing a special lock just for that seems awkward.

I think even an awkward shared lock is better than memory barriers.
Barriers are the trickiest locking construct to get right and we
really shouldn't be using them unless absolutely necessary.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-06 22:04   ` Tejun Heo
@ 2017-03-07 11:14     ` Jan Kara
  2017-03-07 19:43       ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2017-03-07 11:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown,
	Omar Sandoval

On Mon 06-03-17 17:04:59, Tejun Heo wrote:
> Hello,
> 
> On Mon, Mar 06, 2017 at 05:33:55PM +0100, Jan Kara wrote:
> > +	disk->flags &= ~GENHD_FL_UP;
> > +	/*
> > +	 * Make sure __blkdev_open() sees the disk is going away before
> > +	 * starting to unhash bdev inodes.
> > +	 */
> > +	smp_wmb();
> 
> But which rmb is this paired with?  Without pairing, memory barriers
> don't do anything.

It is paired with the fact that the test (disk->flags & GENHD_FL_UP) in
__blkdev_get() cannot be reordered before the bd_acquire() call in
blkdev_open() (or however the bdev is looked up). I spent so long thinking
about how and where to sanely comment on that that in the end I did not
write anything.

I was also thinking about adding an explicit barrier before that test just
to make things explicit since, as you say below, the first blkdev open is not
that hot a path. Maybe that would be a better option?
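
I.e. something like this (a sketch only, on top of this patch; untested):

	disk = get_gendisk(bdev->bd_dev, &partno);
	...
	/*
	 * Pairs with the smp_wmb() in del_gendisk(). Makes sure the
	 * GENHD_FL_UP test below cannot be reordered before the bdev
	 * inode lookup done in bd_acquire().
	 */
	smp_rmb();
	if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part)
		goto out_clear;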

> Given that this isn't a super hot path, I think it'd be far better to
> stick to a simpler synchronization mechanism.

I'd be happy to do so however struct gendisk does not have any lock in it
and introducing a special lock just for that seems awkward.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-06 16:33 ` [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown Jan Kara
@ 2017-03-06 22:04   ` Tejun Heo
  2017-03-07 11:14     ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-03-06 22:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval

Hello,

On Mon, Mar 06, 2017 at 05:33:55PM +0100, Jan Kara wrote:
> +	disk->flags &= ~GENHD_FL_UP;
> +	/*
> +	 * Make sure __blkdev_open() sees the disk is going away before
> +	 * starting to unhash bdev inodes.
> +	 */
> +	smp_wmb();

But which rmb is this paired with?  Without pairing, memory barriers
don't do anything.  Given that this isn't a super hot path, I think
it'd be far better to stick to a simpler synchronization mechanism.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-06 16:33 [PATCH 0/11 v3] " Jan Kara
@ 2017-03-06 16:33 ` Jan Kara
  2017-03-06 22:04   ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

blkdev_open() may race with gendisk shutdown in such a way that
del_gendisk() has already unhashed block device inode (and thus
bd_acquire() will end up creating new block device inode) however
get_gendisk() will still return the gendisk that is being destroyed.
This will result in the new bdev inode being associated with bdi of the
request queue that is going away and when the device number gets
eventually reused, the block device will still be pointing to the stale
bdi.

Fix the problem by checking whether the gendisk is still alive when
associating bdev inode with it and its bdi. That way we are sure that
once we are unhashing block device inodes in del_gendisk(), newly
created bdev inodes cannot be associated with bdi of the deleted gendisk
anymore. We actually already do this check when opening a partition so
we need to add it only for the case when the whole device is opened.

Also add a warning that will tell us about unexpected inconsistencies
between bdi associated with the bdev inode and bdi associated with the
disk.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/genhd.c  | 7 ++++++-
 fs/block_dev.c | 5 ++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index b26a5ea115d0..e8df37de03af 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -662,6 +662,12 @@ void del_gendisk(struct gendisk *disk)
 	struct disk_part_iter piter;
 	struct hd_struct *part;
 
+	disk->flags &= ~GENHD_FL_UP;
+	/*
+	 * Make sure __blkdev_open() sees the disk is going away before
+	 * starting to unhash bdev inodes.
+	 */
+	smp_wmb();
 	blk_integrity_del(disk);
 	disk_del_events(disk);
 
@@ -678,7 +684,6 @@ void del_gendisk(struct gendisk *disk)
 	invalidate_partition(disk, 0);
 	bdev_unhash_inode(disk_devt(disk));
 	set_capacity(disk, 0);
-	disk->flags &= ~GENHD_FL_UP;
 
 	sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
 	/*
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 53e2389ae4d4..9e1993a2827f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1560,7 +1560,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
-			if (!bdev->bd_part)
+			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part)
 				goto out_clear;
 
 			ret = 0;
@@ -1623,6 +1623,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 
 		if (bdev->bd_bdi == &noop_backing_dev_info)
 			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
+		else
+			WARN_ON_ONCE(bdev->bd_bdi !=
+				     disk->queue->backing_dev_info);
 	} else {
 		if (bdev->bd_contains == bdev) {
 			ret = 0;
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 26+ messages in thread
