[PATCH 0/11 v3] block: Fix block device shutdown related races

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/11 v3] block: Fix block device shutdown related races
@ 2017-03-06 16:33 Jan Kara
  2017-03-06 16:33 ` [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete Jan Kara
                   ` (10 more replies)
  0 siblings, 11 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

Hello,

this is a series with the remaining patches (on top of 3.11-rc1) to fix several
different races and issues I've found when testing device shutdown and reuse.
The first two patches fix possible (theoretical) problems when opening of a
block device races with shutdown of a gendisk structure. Patches 3-9 fix oops
that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
problem reported by Thiago). Patches 10 and 11 fix oops due to a bug in gendisk
code where get_gendisk() can return already freed gendisk structure (again
triggered by Omar's stress test).

People, please have a look at patches. They are mostly simple however the
interactions are rather complex so I may have missed something. Also I'm
happy for any additional testing these patches can get - I've stressed them
with Omar's script, tested memcg writeback, tested static (not udev managed)
device inodes.

Changes since v2:
* Added Tejun's acks
* Rebased on top of 4.11-rc1
* Fixed two possible races between blkdev_open() and del_gendisk()
* Fixed possible race between concurrent shutdown of cgwb spotted by Tejun

Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback

								Honza

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
@ 2017-03-06 16:33 ` Jan Kara
  2017-03-06 22:01   ` Tejun Heo
  2017-03-07  6:47   ` Hannes Reinecke
  2017-03-06 16:33 ` [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown Jan Kara
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..53e2389ae4d4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1556,8 +1556,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
-		if (bdev->bd_bdi == &noop_backing_dev_info)
-			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 
 		if (!partno) {
 			ret = -ENXIO;
@@ -1622,6 +1620,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			}
 			bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
 		}
+
+		if (bdev->bd_bdi == &noop_backing_dev_info)
+			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 	} else {
 		if (bdev->bd_contains == bdev) {
 			ret = 0;
@@ -1653,8 +1654,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	bdev->bd_disk = NULL;
 	bdev->bd_part = NULL;
 	bdev->bd_queue = NULL;
-	bdi_put(bdev->bd_bdi);
-	bdev->bd_bdi = &noop_backing_dev_info;
 	if (bdev != bdev->bd_contains)
 		__blkdev_put(bdev->bd_contains, mode, 1);
 	bdev->bd_contains = NULL;
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
  2017-03-06 16:33 ` [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete Jan Kara
@ 2017-03-06 16:33 ` Jan Kara
  2017-03-06 22:04   ` Tejun Heo
  2017-03-06 16:33 ` [PATCH 03/11] bdi: Mark congested->bdi as internal Jan Kara
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

blkdev_open() may race with gendisk shutdown in such a way that
del_gendisk() has already unhashed block device inode (and thus
bd_acquire() will end up creating new block device inode) however
gen_gendisk() will still return the gendisk that is being destroyed.
This will result in the new bdev inode being associated with bdi of the
request queue that is going away and when the device number gets
eventually reused, the block device will still be pointing to the stale
bdi.

Fix the problem by checking whether the gendisk is still alive when
associating bdev inode with it and its bdi. That way we are sure that
once we are unhashing block device inodes in del_gendisk(), newly
created bdev inodes cannot be associated with bdi of the deleted gendisk
anymore. We actually already do this check when opening a partition so
we need to add it only for the case when the whole device is opened.

Also add a warning that will tell us about unexpected inconsistencies
between bdi associated with the bdev inode and bdi associated with the
disk.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/genhd.c  | 7 ++++++-
 fs/block_dev.c | 5 ++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index b26a5ea115d0..e8df37de03af 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -662,6 +662,12 @@ void del_gendisk(struct gendisk *disk)
 	struct disk_part_iter piter;
 	struct hd_struct *part;
 
+	disk->flags &= ~GENHD_FL_UP;
+	/*
+	 * Make sure __blkdev_open() sees the disk is going away before
+	 * starting to unhash bdev inodes.
+	 */
+	smp_wmb();
 	blk_integrity_del(disk);
 	disk_del_events(disk);
 
@@ -678,7 +684,6 @@ void del_gendisk(struct gendisk *disk)
 	invalidate_partition(disk, 0);
 	bdev_unhash_inode(disk_devt(disk));
 	set_capacity(disk, 0);
-	disk->flags &= ~GENHD_FL_UP;
 
 	sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
 	/*
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 53e2389ae4d4..9e1993a2827f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1560,7 +1560,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
-			if (!bdev->bd_part)
+			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part)
 				goto out_clear;
 
 			ret = 0;
@@ -1623,6 +1623,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 
 		if (bdev->bd_bdi == &noop_backing_dev_info)
 			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
+		else
+			WARN_ON_ONCE(bdev->bd_bdi !=
+				     disk->queue->backing_dev_info);
 	} else {
 		if (bdev->bd_contains == bdev) {
 			ret = 0;
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 03/11] bdi: Mark congested->bdi as internal
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
  2017-03-06 16:33 ` [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete Jan Kara
  2017-03-06 16:33 ` [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown Jan Kara
@ 2017-03-06 16:33 ` Jan Kara
  2017-03-06 16:33 ` [PATCH 04/11] bdi: Make wb->bdi a proper reference Jan Kara
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

congested->bdi pointer is used only to be able to remove congested
structure from bdi->cgwb_congested_tree on structure release. Moreover
the pointer can become NULL when we unregister the bdi. Rename the field
to __bdi and add a comment to make it more explicit this is internal
stuff of memcg writeback code and people should not use the field as
such use will be likely race prone.

We do not bother with converting congested->bdi to a proper refcounted
reference. It will be slightly ugly to special-case bdi->wb.congested to
avoid effectively a cyclic reference of bdi to itself and the reference
gets cleared from bdi_unregister() making it impossible to reference
a freed bdi.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  4 +++-
 mm/backing-dev.c                 | 10 +++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ad955817916d..8fb3dcdebc80 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -54,7 +54,9 @@ struct bdi_writeback_congested {
 	atomic_t refcnt;		/* nr of attached wb's and blkg */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-	struct backing_dev_info *bdi;	/* the associated bdi */
+	struct backing_dev_info *__bdi;	/* the associated bdi, set to NULL
+					 * on bdi unregistration. For memcg-wb
+					 * internal use only! */
 	int blkcg_id;			/* ID of the associated blkcg */
 	struct rb_node rb_node;		/* on bdi->cgwb_congestion_tree */
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6d861d090e9f..fec27da14aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -438,7 +438,7 @@ wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
 		return NULL;
 
 	atomic_set(&new_congested->refcnt, 0);
-	new_congested->bdi = bdi;
+	new_congested->__bdi = bdi;
 	new_congested->blkcg_id = blkcg_id;
 	goto retry;
 
@@ -466,10 +466,10 @@ void wb_congested_put(struct bdi_writeback_congested *congested)
 	}
 
 	/* bdi might already have been destroyed leaving @congested unlinked */
-	if (congested->bdi) {
+	if (congested->__bdi) {
 		rb_erase(&congested->rb_node,
-			 &congested->bdi->cgwb_congested_tree);
-		congested->bdi = NULL;
+			 &congested->__bdi->cgwb_congested_tree);
+		congested->__bdi = NULL;
 	}
 
 	spin_unlock_irqrestore(&cgwb_lock, flags);
@@ -698,7 +698,7 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 			rb_entry(rbn, struct bdi_writeback_congested, rb_node);
 
 		rb_erase(rbn, &bdi->cgwb_congested_tree);
-		congested->bdi = NULL;	/* mark @congested unlinked */
+		congested->__bdi = NULL;	/* mark @congested unlinked */
 	}
 
 	spin_unlock_irq(&cgwb_lock);
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 04/11] bdi: Make wb->bdi a proper reference
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
                   ` (2 preceding siblings ...)
  2017-03-06 16:33 ` [PATCH 03/11] bdi: Mark congested->bdi as internal Jan Kara
@ 2017-03-06 16:33 ` Jan Kara
  2017-03-06 16:33 ` [PATCH 05/11] bdi: Move removal from bdi->wb_list into wb_shutdown() Jan Kara
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index fec27da14aea..37cd1adae979 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 
 	memset(wb, 0, sizeof(*wb));
 
+	if (wb != &bdi->wb)
+		bdi_get(bdi);
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
 	INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	wb->dirty_sleep = jiffies;
 
 	wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
-	if (!wb->congested)
-		return -ENOMEM;
+	if (!wb->congested) {
+		err = -ENOMEM;
+		goto out_put_bdi;
+	}
 
 	err = fprop_local_init_percpu(&wb->completions, gfp);
 	if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	fprop_local_destroy_percpu(&wb->completions);
 out_put_cong:
 	wb_congested_put(wb->congested);
+out_put_bdi:
+	if (wb != &bdi->wb)
+		bdi_put(bdi);
 	return err;
 }
 
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
 
 	fprop_local_destroy_percpu(&wb->completions);
 	wb_congested_put(wb->congested);
+	if (wb != &wb->bdi->wb)
+		bdi_put(wb->bdi);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 05/11] bdi: Move removal from bdi->wb_list into wb_shutdown()
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
                   ` (3 preceding siblings ...)
  2017-03-06 16:33 ` [PATCH 04/11] bdi: Make wb->bdi a proper reference Jan Kara
@ 2017-03-06 16:33 ` Jan Kara
  2017-03-06 16:33 ` [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() Jan Kara
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

Currently the removal from bdi->wb_list happens directly in
cgwb_release_workfn(). Move it to wb_shutdown() which is functionally
equivalent and more logical (the list gets only used for distributing
writeback works among bdi_writeback structures). It will also allow us
to simplify writeback shutdown in cgwb_bdi_destroy().

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 37cd1adae979..4dffae6e2d7b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -345,6 +345,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	return err;
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
+
 /*
  * Remove bdi from the global list and shutdown any threads we have running
  */
@@ -358,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	}
 	spin_unlock_bh(&wb->work_lock);
 
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -491,10 +494,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 						release_work);
 	struct backing_dev_info *bdi = wb->bdi;
 
-	spin_lock_irq(&cgwb_lock);
-	list_del_rcu(&wb->bdi_node);
-	spin_unlock_irq(&cgwb_lock);
-
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
@@ -526,6 +525,13 @@ static void cgwb_kill(struct bdi_writeback *wb)
 	percpu_ref_kill(&wb->refcnt);
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+	spin_lock_irq(&cgwb_lock);
+	list_del_rcu(&wb->bdi_node);
+	spin_unlock_irq(&cgwb_lock);
+}
+
 static int cgwb_create(struct backing_dev_info *bdi,
 		       struct cgroup_subsys_state *memcg_css, gfp_t gfp)
 {
@@ -783,6 +789,8 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 	wb_congested_put(bdi->wb_congested);
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) { }
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 int bdi_init(struct backing_dev_info *bdi)
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
                   ` (4 preceding siblings ...)
  2017-03-06 16:33 ` [PATCH 05/11] bdi: Move removal from bdi->wb_list into wb_shutdown() Jan Kara
@ 2017-03-06 16:33 ` Jan Kara
  2017-03-06 22:08   ` Tejun Heo
  2017-03-06 16:34 ` [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister() Jan Kara
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
which also means that writeback has been shutdown on them. Since this
wait is going away, directly shutdown writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe with concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shutdown.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 +
 mm/backing-dev.c                 | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
  */
 enum wb_state {
 	WB_registered,		/* bdi_register() was done */
+	WB_shutting_down,	/* wb_shutdown() in progress */
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 };
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4dffae6e2d7b..d084c89922f3 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
+		/*
+		 * Wait for wb shutdown to finish if someone else is just
+		 * running wb_shutdown(). Otherwise we could proceed to wb /
+		 * bdi destruction before wb_shutdown() is finished.
+		 */
+		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
 		return;
 	}
+	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
 
 	cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	/*
+	 * Make sure bit gets cleared after shutdown is finished. Matches with
+	 * the barrier provided by test_and_clear_bit() above.
+	 */
+	smp_wmb();
+	clear_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -700,6 +713,7 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 	struct radix_tree_iter iter;
 	struct rb_node *rbn;
 	void **slot;
+	struct bdi_writeback *wb;
 
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
@@ -716,6 +730,14 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 		congested->__bdi = NULL;	/* mark @congested unlinked */
 	}
 
+	while (!list_empty(&bdi->wb_list)) {
+		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+				      bdi_node);
+		spin_unlock_irq(&cgwb_lock);
+		wb_shutdown(wb);
+		spin_lock_irq(&cgwb_lock);
+	}
+
 	spin_unlock_irq(&cgwb_lock);
 
 	/*
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister()
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
                   ` (5 preceding siblings ...)
  2017-03-06 16:33 ` [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() Jan Kara
@ 2017-03-06 16:34 ` Jan Kara
  2017-03-06 22:08   ` Tejun Heo
  2017-03-06 16:34 ` [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() Jan Kara
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
(called from bdi_unregister()). That is however unnecessary now when
cgwb->bdi is a proper refcounted reference (thus bdi cannot get
released before all cgwbs are released) and when cgwb_bdi_destroy()
shuts down writeback directly.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 -
 mm/backing-dev.c                 | 18 +-----------------
 2 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8af720f22a2d..e66d4722db8e 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -164,7 +164,6 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
-	atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
 #else
 	struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d084c89922f3..7592f4a36776 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -406,11 +406,9 @@ static void wb_exit(struct bdi_writeback *wb)
 /*
  * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
  * blkcg->cgwb_list, and memcg->cgwb_list.  bdi->cgwb_tree is also RCU
- * protected.  cgwb_release_wait is used to wait for the completion of cgwb
- * releases from bdi destruction path.
+ * protected.
  */
 static DEFINE_SPINLOCK(cgwb_lock);
-static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
 
 /**
  * wb_congested_get_create - get or create a wb_congested
@@ -505,7 +503,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 {
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
-	struct backing_dev_info *bdi = wb->bdi;
 
 	wb_shutdown(wb);
 
@@ -516,9 +513,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 	percpu_ref_exit(&wb->refcnt);
 	wb_exit(wb);
 	kfree_rcu(wb, rcu);
-
-	if (atomic_dec_and_test(&bdi->usage_cnt))
-		wake_up_all(&cgwb_release_wait);
 }
 
 static void cgwb_release(struct percpu_ref *refcnt)
@@ -608,7 +602,6 @@ static int cgwb_create(struct backing_dev_info *bdi,
 		/* we might have raced another instance of this function */
 		ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
 		if (!ret) {
-			atomic_inc(&bdi->usage_cnt);
 			list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
 			list_add(&wb->memcg_node, memcg_cgwb_list);
 			list_add(&wb->blkcg_node, blkcg_cgwb_list);
@@ -698,7 +691,6 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
 	bdi->cgwb_congested_tree = RB_ROOT;
-	atomic_set(&bdi->usage_cnt, 1);
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
@@ -739,14 +731,6 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 	}
 
 	spin_unlock_irq(&cgwb_lock);
-
-	/*
-	 * All cgwb's and their congested states must be shutdown and
-	 * released before returning.  Drain the usage counter to wait for
-	 * all cgwb's and cgwb_congested's ever created on @bdi.
-	 */
-	atomic_dec(&bdi->usage_cnt);
-	wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
 }
 
 /**
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
                   ` (6 preceding siblings ...)
  2017-03-06 16:34 ` [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister() Jan Kara
@ 2017-03-06 16:34 ` Jan Kara
  2017-03-06 16:34 ` [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list() Jan Kara
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
from bdi_unregister() which is not necessarily called from bdi_destroy()
and thus the name is somewhat misleading.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7592f4a36776..54c1ff39ea05 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -700,7 +700,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return ret;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
 	struct rb_node *rbn;
@@ -790,7 +790,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return 0;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
 	wb_congested_put(bdi->wb_congested);
 }
@@ -903,7 +903,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
 	/* make sure nobody finds us on the bdi_list anymore */
 	bdi_remove_from_list(bdi);
 	wb_shutdown(&bdi->wb);
-	cgwb_bdi_destroy(bdi);
+	cgwb_bdi_unregister(bdi);
 
 	if (bdi->dev) {
 		bdi_debug_unregister(bdi);
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list()
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
                   ` (7 preceding siblings ...)
  2017-03-06 16:34 ` [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() Jan Kara
@ 2017-03-06 16:34 ` Jan Kara
  2017-03-06 16:34 ` [PATCH 10/11] kobject: Export kobject_get_unless_zero() Jan Kara
  2017-03-06 16:34 ` [PATCH 11/11] block: Fix oops scsi_disk_get() Jan Kara
  10 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.

The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay along until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.

Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.

Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c            | 8 ++------
 include/linux/writeback.h | 1 +
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9e1993a2827f..4b8a0f871f2d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -885,6 +885,8 @@ static void bdev_evict_inode(struct inode *inode)
 	spin_lock(&bdev_lock);
 	list_del_init(&bdev->bd_list);
 	spin_unlock(&bdev_lock);
+	/* Detach inode from wb early as bdi_put() may free bdi->wb */
+	inode_detach_wb(inode);
 	if (bdev->bd_bdi != &noop_backing_dev_info) {
 		bdi_put(bdev->bd_bdi);
 		bdev->bd_bdi = &noop_backing_dev_info;
@@ -1878,12 +1880,6 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		kill_bdev(bdev);
 
 		bdev_write_inode(bdev);
-		/*
-		 * Detaching bdev inode from its wb in __destroy_inode()
-		 * is too late: the queue which embeds its bdi (along with
-		 * root wb) can be gone as soon as we put_disk() below.
-		 */
-		inode_detach_wb(bdev->bd_inode);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a3c0cbd7c888..d5815794416c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@ static inline void inode_attach_wb(struct inode *inode, struct page *page)
 static inline void inode_detach_wb(struct inode *inode)
 {
 	if (inode->i_wb) {
+		WARN_ON_ONCE(!(inode->i_state & I_CLEAR));
 		wb_put(inode->i_wb);
 		inode->i_wb = NULL;
 	}
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 10/11] kobject: Export kobject_get_unless_zero()
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
                   ` (8 preceding siblings ...)
  2017-03-06 16:34 ` [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list() Jan Kara
@ 2017-03-06 16:34 ` Jan Kara
  2017-03-06 16:34 ` [PATCH 11/11] block: Fix oops scsi_disk_get() Jan Kara
  10 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara, Greg Kroah-Hartman

Make the function available for outside use and fortify it against NULL
kobject.

CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/kobject.h | 2 ++
 lib/kobject.c           | 5 ++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, const char *new_name);
 extern int __must_check kobject_move(struct kobject *, struct kobject *);
 
 extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+						struct kobject *kobj);
 extern void kobject_put(struct kobject *kobj);
 
 extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
 }
 EXPORT_SYMBOL(kobject_get);
 
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
 {
+	if (!kobj)
+		return NULL;
 	if (!kref_get_unless_zero(&kobj->kref))
 		kobj = NULL;
 	return kobj;
 }
+EXPORT_SYMBOL(kobject_get_unless_zero);
 
 /*
  * kobject_cleanup - free kobject resources.
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 11/11] block: Fix oops scsi_disk_get()
  2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
                   ` (9 preceding siblings ...)
  2017-03-06 16:34 ` [PATCH 10/11] kobject: Export kobject_get_unless_zero() Jan Kara
@ 2017-03-06 16:34 ` Jan Kara
  10 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-06 16:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval, Jan Kara

When device open races with device shutdown, we can get the following
oops in scsi_disk_get():

[11863.044351] general protection fault: 0000 [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W      4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
[11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
[11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
[11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
[11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
[11863.059217] FS:  00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
[11863.059217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
[11863.059217] Call Trace:
[11863.059217]  ? disk_get_part+0x22/0x1f0
[11863.059217]  sd_open+0x39/0x130
[11863.059217]  __blkdev_get+0x69/0x430
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? bd_acquire+0x96/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_get+0x126/0x350
[11863.059217]  ? _raw_spin_unlock+0x2b/0x40
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_open+0x65/0x80
...

As you can see RAX value is already poisoned showing that gendisk we got
is already freed. The problem is that get_gendisk() looks up device
number in ext_devt_idr and then does get_disk() which does kobject_get()
on the disks kobject. However the disk gets removed from ext_devt_idr
only in disk_release() (through blk_free_devt()) at which moment it has
already 0 refcount and is already on its way to be freed. Indeed we've
got a warning from kobject_get() about 0 refcount shortly before the
oops.

We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get reference on a disk that is already being
freed.

Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index e8df37de03af..7fa59bc231dd 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1374,7 +1374,7 @@ struct kobject *get_disk(struct gendisk *disk)
 	owner = disk->fops->owner;
 	if (owner && !try_module_get(owner))
 		return NULL;
-	kobj = kobject_get(&disk_to_dev(disk)->kobj);
+	kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
 	if (kobj == NULL) {
 		module_put(owner);
 		return NULL;
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete
  2017-03-06 16:33 ` [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete Jan Kara
@ 2017-03-06 22:01   ` Tejun Heo
  2017-03-07  6:47   ` Hannes Reinecke
  1 sibling, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2017-03-06 22:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval

On Mon, Mar 06, 2017 at 05:33:54PM +0100, Jan Kara wrote:
> When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
> restart the process of opening the block device. However we forget to
> switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
> inode will be pointing to a stale bdi. Fix the problem by setting
> bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-06 16:33 ` [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown Jan Kara
@ 2017-03-06 22:04   ` Tejun Heo
  2017-03-07 11:14     ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2017-03-06 22:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval

Hello,

On Mon, Mar 06, 2017 at 05:33:55PM +0100, Jan Kara wrote:
> +	disk->flags &= ~GENHD_FL_UP;
> +	/*
> +	 * Make sure __blkdev_open() sees the disk is going away before
> +	 * starting to unhash bdev inodes.
> +	 */
> +	smp_wmb();

But which rmb is this paired with?  Without paring memory barriers
don't do anything.  Given that this isn't a super hot path, I think
it'd be far better to stick to a simpler synchronization mechanism.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
  2017-03-06 16:33 ` [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() Jan Kara
@ 2017-03-06 22:08   ` Tejun Heo
  0 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2017-03-06 22:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval

On Mon, Mar 06, 2017 at 05:33:59PM +0100, Jan Kara wrote:
> Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
> which also means that writeback has been shutdown on them. Since this
> wait is going away, directly shutdown writeback on cgwbs from
> cgwb_bdi_destroy() to avoid live writeback structures after
> bdi_unregister() has finished. To make that safe with concurrent
> shutdown from cgwb_release_workfn(), we also have to make sure
> wb_shutdown() returns only after the bdi_writeback structure is really
> shutdown.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister()
  2017-03-06 16:34 ` [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister() Jan Kara
@ 2017-03-06 22:08   ` Tejun Heo
  0 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2017-03-06 22:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval

On Mon, Mar 06, 2017 at 05:34:00PM +0100, Jan Kara wrote:
> Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
> (called from bdi_unregister()). That is however unnecessary now when
> cgwb->bdi is a proper refcounted reference (thus bdi cannot get
> released before all cgwbs are released) and when cgwb_bdi_destroy()
> shuts down writeback directly.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete
  2017-03-06 16:33 ` [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete Jan Kara
  2017-03-06 22:01   ` Tejun Heo
@ 2017-03-07  6:47   ` Hannes Reinecke
  1 sibling, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2017-03-07  6:47 UTC (permalink / raw)
  To: Jan Kara, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval

On 03/06/2017 05:33 PM, Jan Kara wrote:
> When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
> restart the process of opening the block device. However we forget to
> switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
> inode will be pointing to a stale bdi. Fix the problem by setting
> bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/block_dev.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: F. Imendï¿½rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nï¿½rnberg)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-06 22:04   ` Tejun Heo
@ 2017-03-07 11:14     ` Jan Kara
  2017-03-07 19:43       ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2017-03-07 11:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown,
	Omar Sandoval

On Mon 06-03-17 17:04:59, Tejun Heo wrote:
> Hello,
> 
> On Mon, Mar 06, 2017 at 05:33:55PM +0100, Jan Kara wrote:
> > +	disk->flags &= ~GENHD_FL_UP;
> > +	/*
> > +	 * Make sure __blkdev_open() sees the disk is going away before
> > +	 * starting to unhash bdev inodes.
> > +	 */
> > +	smp_wmb();
> 
> But which rmb is this paired with?  Without paring memory barriers
> don't do anything.

It is paired with the fact that the test (disk->flags & GENHD_FL_UP) in
__blkdev_get() cannot be reordered before the bd_acquire() call in
blkdev_open() (or however the bdev is looked up). I was thinking so long
about how and where to sanely comment on that that I did not write anything
in the end.

I was also thinking about adding an explicit barrier before that test just
to make things explicit since as you say below the first blkdev open is not
that hot path. Maybe that would be a better option?

> Given that this isn't a super hot path, I think it'd be far better to
> stick to a simpler synchronization mechanism.

I'd be happy to do so however struct gendisk does not have any lock in it
and introducing a special lock just for that seems awkward.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-07 11:14     ` Jan Kara
@ 2017-03-07 19:43       ` Tejun Heo
  2017-03-08 14:25         ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2017-03-07 19:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval

Hello,

On Tue, Mar 07, 2017 at 12:14:20PM +0100, Jan Kara wrote:
> > Given that this isn't a super hot path, I think it'd be far better to
> > stick to a simpler synchronization mechanism.
> 
> I'd be happy to do so however struct gendisk does not have any lock in it
> and introducing a special lock just for that seems awkward.

I think even a awkward shared lock is better than memory barriers.
Barriers are the trickiest locking construct to get right and we
really shouldn't using that unless absolutely necessary.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
  2017-03-07 19:43       ` Tejun Heo
@ 2017-03-08 14:25         ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-08 14:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown,
	Omar Sandoval

On Tue 07-03-17 14:43:49, Tejun Heo wrote:
> Hello,
> 
> On Tue, Mar 07, 2017 at 12:14:20PM +0100, Jan Kara wrote:
> > > Given that this isn't a super hot path, I think it'd be far better to
> > > stick to a simpler synchronization mechanism.
> > 
> > I'd be happy to do so however struct gendisk does not have any lock in it
> > and introducing a special lock just for that seems awkward.
> 
> I think even a awkward shared lock is better than memory barriers.
> Barriers are the trickiest locking construct to get right and we
> really shouldn't using that unless absolutely necessary.

OK, I'll try to come up with something.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
  2017-03-13 15:13 [PATCH 0/11 v4] block: Fix block device shutdown related races Jan Kara
@ 2017-03-13 15:14 ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
which also means that writeback has been shutdown on them. Since this
wait is going away, directly shutdown writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe with concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shutdown.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 +
 mm/backing-dev.c                 | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
  */
 enum wb_state {
 	WB_registered,		/* bdi_register() was done */
+	WB_shutting_down,	/* wb_shutdown() in progress */
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 };
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e3d56dba4da8..b67be4fc12c4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
+		/*
+		 * Wait for wb shutdown to finish if someone else is just
+		 * running wb_shutdown(). Otherwise we could proceed to wb /
+		 * bdi destruction before wb_shutdown() is finished.
+		 */
+		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
 		return;
 	}
+	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
 
 	cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	/*
+	 * Make sure bit gets cleared after shutdown is finished. Matches with
+	 * the barrier provided by test_and_clear_bit() above.
+	 */
+	smp_wmb();
+	clear_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -699,12 +712,21 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
 	void **slot;
+	struct bdi_writeback *wb;
 
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
 	spin_lock_irq(&cgwb_lock);
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
+
+	while (!list_empty(&bdi->wb_list)) {
+		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+				      bdi_node);
+		spin_unlock_irq(&cgwb_lock);
+		wb_shutdown(wb);
+		spin_lock_irq(&cgwb_lock);
+	}
 	spin_unlock_irq(&cgwb_lock);
 
 	/*
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-03-13 15:14 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-06 16:33 [PATCH 0/11 v3] block: Fix block device shutdown related races Jan Kara
2017-03-06 16:33 ` [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete Jan Kara
2017-03-06 22:01   ` Tejun Heo
2017-03-07  6:47   ` Hannes Reinecke
2017-03-06 16:33 ` [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown Jan Kara
2017-03-06 22:04   ` Tejun Heo
2017-03-07 11:14     ` Jan Kara
2017-03-07 19:43       ` Tejun Heo
2017-03-08 14:25         ` Jan Kara
2017-03-06 16:33 ` [PATCH 03/11] bdi: Mark congested->bdi as internal Jan Kara
2017-03-06 16:33 ` [PATCH 04/11] bdi: Make wb->bdi a proper reference Jan Kara
2017-03-06 16:33 ` [PATCH 05/11] bdi: Move removal from bdi->wb_list into wb_shutdown() Jan Kara
2017-03-06 16:33 ` [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() Jan Kara
2017-03-06 22:08   ` Tejun Heo
2017-03-06 16:34 ` [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister() Jan Kara
2017-03-06 22:08   ` Tejun Heo
2017-03-06 16:34 ` [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() Jan Kara
2017-03-06 16:34 ` [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list() Jan Kara
2017-03-06 16:34 ` [PATCH 10/11] kobject: Export kobject_get_unless_zero() Jan Kara
2017-03-06 16:34 ` [PATCH 11/11] block: Fix oops scsi_disk_get() Jan Kara
2017-03-13 15:13 [PATCH 0/11 v4] block: Fix block device shutdown related races Jan Kara
2017-03-13 15:14 ` [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).