linux-block.vger.kernel.org archive mirror
* tear down file system I/O in del_gendisk v2
@ 2021-09-22 17:22 Christoph Hellwig
  2021-09-22 17:22 ` [PATCH 1/4] block: factor out a blk_try_enter_queue helper Christoph Hellwig
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Christoph Hellwig @ 2021-09-22 17:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, linux-block, Ming Lei

Ming reported that for SCSI we have a lifetime problem now that
the BDI moved from the request_queue to the disk as del_gendisk
doesn't finish all outstanding file system I/O.  It turns out
this actually is an older problem, although the case where it could
be hit before was very unusual (unbinding of a SCSI upper driver
while the scsi_device stays around).  This series fixes this by
draining all I/O in del_gendisk.

Changes since v1:
 - fix a commit log typo
 - keep the existing nowait vs queue dying semantics in bio_queue_enter 
 - actually keep q_usage_counter in atomic mode after del_gendisk
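
The central idea of the series -- block new submitters from entering the
queue, then wait for the usage counter to drain before tearing the disk
down -- can be sketched as a single-threaded userspace model.  All
`toy_*` names below are invented for illustration; this is not the
kernel's percpu_ref machinery, just the enter/drain contract it
provides:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Toy model of q_usage_counter: an atomic refcount plus a "draining"
 * flag standing in for percpu_ref kill + queue freeze.  Illustrative
 * only; the real code is in the patches below.
 */
struct toy_queue {
	atomic_int usage;	/* models q->q_usage_counter */
	atomic_bool draining;	/* models blk_freeze_queue_start() having run */
};

/* models blk_try_enter_queue(): refuse new entries once draining started */
static bool toy_enter_queue(struct toy_queue *q)
{
	if (atomic_load(&q->draining))
		return false;
	atomic_fetch_add(&q->usage, 1);
	return true;
}

static void toy_exit_queue(struct toy_queue *q)
{
	atomic_fetch_sub(&q->usage, 1);
}

/* models blk_queue_start_drain(): block new I/O from entering */
static void toy_start_drain(struct toy_queue *q)
{
	atomic_store(&q->draining, true);
}

/* models blk_mq_freeze_queue_wait() succeeding: nothing left in flight */
static bool toy_drained(struct toy_queue *q)
{
	return atomic_load(&q->usage) == 0;
}
```

Once draining has started, new entries fail immediately while in-flight
ones are allowed to finish, which is exactly the ordering del_gendisk
relies on in patch 3.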

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] block: factor out a blk_try_enter_queue helper
  2021-09-22 17:22 tear down file system I/O in del_gendisk v2 Christoph Hellwig
@ 2021-09-22 17:22 ` Christoph Hellwig
  2021-09-22 17:22 ` [PATCH 2/4] block: split bio_queue_enter from blk_queue_enter Christoph Hellwig
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2021-09-22 17:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, linux-block, Ming Lei

Factor out the code to try to get q_usage_counter without blocking into
a separate helper.  Both to improve code readability and to prepare for
splitting bio_queue_enter from blk_queue_enter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c | 60 ++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 28 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5454db2fa263b..4c078681f39b8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -416,6 +416,30 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
+static bool blk_try_enter_queue(struct request_queue *q, bool pm)
+{
+	rcu_read_lock();
+	if (!percpu_ref_tryget_live(&q->q_usage_counter))
+		goto fail;
+
+	/*
+	 * The code that increments the pm_only counter must ensure that the
+	 * counter is globally visible before the queue is unfrozen.
+	 */
+	if (blk_queue_pm_only(q) &&
+	    (!pm || queue_rpm_status(q) == RPM_SUSPENDED))
+		goto fail_put;
+
+	rcu_read_unlock();
+	return true;
+
+fail_put:
+	percpu_ref_put(&q->q_usage_counter);
+fail:
+	rcu_read_unlock();
+	return false;
+}
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
@@ -425,40 +449,18 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
 	const bool pm = flags & BLK_MQ_REQ_PM;
 
-	while (true) {
-		bool success = false;
-
-		rcu_read_lock();
-		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
-			/*
-			 * The code that increments the pm_only counter is
-			 * responsible for ensuring that that counter is
-			 * globally visible before the queue is unfrozen.
-			 */
-			if ((pm && queue_rpm_status(q) != RPM_SUSPENDED) ||
-			    !blk_queue_pm_only(q)) {
-				success = true;
-			} else {
-				percpu_ref_put(&q->q_usage_counter);
-			}
-		}
-		rcu_read_unlock();
-
-		if (success)
-			return 0;
-
+	while (!blk_try_enter_queue(q, pm)) {
 		if (flags & BLK_MQ_REQ_NOWAIT)
 			return -EBUSY;
 
 		/*
-		 * read pair of barrier in blk_freeze_queue_start(),
-		 * we need to order reading __PERCPU_REF_DEAD flag of
-		 * .q_usage_counter and reading .mq_freeze_depth or
-		 * queue dying flag, otherwise the following wait may
-		 * never return if the two reads are reordered.
+		 * read pair of barrier in blk_freeze_queue_start(), we need to
+		 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
+		 * reading .mq_freeze_depth or queue dying flag, otherwise the
+		 * following wait may never return if the two reads are
+		 * reordered.
 		 */
 		smp_rmb();
-
 		wait_event(q->mq_freeze_wq,
 			   (!q->mq_freeze_depth &&
 			    blk_pm_resume_queue(pm, q)) ||
@@ -466,6 +468,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		if (blk_queue_dying(q))
 			return -ENODEV;
 	}
+
+	return 0;
 }
 
 static inline int bio_queue_enter(struct bio *bio)
-- 
2.30.2



* [PATCH 2/4] block: split bio_queue_enter from blk_queue_enter
  2021-09-22 17:22 tear down file system I/O in del_gendisk v2 Christoph Hellwig
  2021-09-22 17:22 ` [PATCH 1/4] block: factor out a blk_try_enter_queue helper Christoph Hellwig
@ 2021-09-22 17:22 ` Christoph Hellwig
  2021-09-22 17:22 ` [PATCH 3/4] block: drain file system I/O on del_gendisk Christoph Hellwig
  2021-09-22 17:22 ` [PATCH 4/4] block: keep q_usage_counter in atomic mode after del_gendisk Christoph Hellwig
  3 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2021-09-22 17:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, linux-block, Ming Lei

To prepare for fixing a gendisk shutdown race, open code the
blk_queue_enter logic in bio_queue_enter.  This also removes the
pointless flags translation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4c078681f39b8..e951378855a02 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -475,18 +475,35 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 static inline int bio_queue_enter(struct bio *bio)
 {
 	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
-	bool nowait = bio->bi_opf & REQ_NOWAIT;
-	int ret;
 
-	ret = blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0);
-	if (unlikely(ret)) {
-		if (nowait && !blk_queue_dying(q))
+	while (!blk_try_enter_queue(q, false)) {
+		if (bio->bi_opf & REQ_NOWAIT) {
+			if (blk_queue_dying(q))
+				goto dead;
 			bio_wouldblock_error(bio);
-		else
-			bio_io_error(bio);
+			return -EBUSY;
+		}
+
+		/*
+		 * read pair of barrier in blk_freeze_queue_start(), we need to
+		 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
+		 * reading .mq_freeze_depth or queue dying flag, otherwise the
+		 * following wait may never return if the two reads are
+		 * reordered.
+		 */
+		smp_rmb();
+		wait_event(q->mq_freeze_wq,
+			   (!q->mq_freeze_depth &&
+			    blk_pm_resume_queue(false, q)) ||
+			   blk_queue_dying(q));
+		if (blk_queue_dying(q))
+			goto dead;
 	}
 
-	return ret;
+	return 0;
+dead:
+	bio_io_error(bio);
+	return -ENODEV;
 }
 
 void blk_queue_exit(struct request_queue *q)
-- 
2.30.2



* [PATCH 3/4] block: drain file system I/O on del_gendisk
  2021-09-22 17:22 tear down file system I/O in del_gendisk v2 Christoph Hellwig
  2021-09-22 17:22 ` [PATCH 1/4] block: factor out a blk_try_enter_queue helper Christoph Hellwig
  2021-09-22 17:22 ` [PATCH 2/4] block: split bio_queue_enter from blk_queue_enter Christoph Hellwig
@ 2021-09-22 17:22 ` Christoph Hellwig
  2021-09-23  1:34   ` Ming Lei
  2021-09-23  6:21   ` Hannes Reinecke
  2021-09-22 17:22 ` [PATCH 4/4] block: keep q_usage_counter in atomic mode after del_gendisk Christoph Hellwig
  3 siblings, 2 replies; 13+ messages in thread
From: Christoph Hellwig @ 2021-09-22 17:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, linux-block, Ming Lei

Instead of delaying draining of file system I/O related items like the
blk-qos queues, the integrity read workqueue and timeouts until the
request_queue is removed, do that when del_gendisk is called.  This is
important for SCSI, where the upper level drivers that control the gendisk
are separate entities, and the disk can be freed much earlier than the
request_queue, or can even be unbound without tearing down the queue.

Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c      | 27 ++++++++++++---------------
 block/blk.h           |  1 +
 block/genhd.c         | 21 +++++++++++++++++++++
 include/linux/genhd.h |  1 +
 4 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e951378855a02..d150d829a53c4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -49,7 +49,6 @@
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
 #include "blk-pm.h"
-#include "blk-rq-qos.h"
 
 struct dentry *blk_debugfs_root;
 
@@ -337,23 +336,25 @@ void blk_put_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_put_queue);
 
-void blk_set_queue_dying(struct request_queue *q)
+void blk_queue_start_drain(struct request_queue *q)
 {
-	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
-
 	/*
 	 * When queue DYING flag is set, we need to block new req
 	 * entering queue, so we call blk_freeze_queue_start() to
 	 * prevent I/O from crossing blk_queue_enter().
 	 */
 	blk_freeze_queue_start(q);
-
 	if (queue_is_mq(q))
 		blk_mq_wake_waiters(q);
-
 	/* Make blk_queue_enter() reexamine the DYING flag. */
 	wake_up_all(&q->mq_freeze_wq);
 }
+
+void blk_set_queue_dying(struct request_queue *q)
+{
+	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
+	blk_queue_start_drain(q);
+}
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
 /**
@@ -385,13 +386,8 @@ void blk_cleanup_queue(struct request_queue *q)
 	 */
 	blk_freeze_queue(q);
 
-	rq_qos_exit(q);
-
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
-	/* for synchronous bio-based driver finish in-flight integrity i/o */
-	blk_flush_integrity();
-
 	blk_sync_queue(q);
 	if (queue_is_mq(q))
 		blk_mq_exit_queue(q);
@@ -474,11 +470,12 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 
 static inline int bio_queue_enter(struct bio *bio)
 {
-	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct request_queue *q = disk->queue;
 
 	while (!blk_try_enter_queue(q, false)) {
 		if (bio->bi_opf & REQ_NOWAIT) {
-			if (blk_queue_dying(q))
+			if (test_bit(GD_DEAD, &disk->state))
 				goto dead;
 			bio_wouldblock_error(bio);
 			return -EBUSY;
@@ -495,8 +492,8 @@ static inline int bio_queue_enter(struct bio *bio)
 		wait_event(q->mq_freeze_wq,
 			   (!q->mq_freeze_depth &&
 			    blk_pm_resume_queue(false, q)) ||
-			   blk_queue_dying(q));
-		if (blk_queue_dying(q))
+			   test_bit(GD_DEAD, &disk->state));
+		if (test_bit(GD_DEAD, &disk->state))
 			goto dead;
 	}
 
diff --git a/block/blk.h b/block/blk.h
index 7d2a0ba7ed21d..e2ed2257709ae 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,6 +51,7 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
 void blk_free_flush_queue(struct blk_flush_queue *q);
 
 void blk_freeze_queue(struct request_queue *q);
+void blk_queue_start_drain(struct request_queue *q);
 
 #define BIO_INLINE_VECS 4
 struct bio_vec *bvec_alloc(mempool_t *pool, unsigned short *nr_vecs,
diff --git a/block/genhd.c b/block/genhd.c
index 7b6e5e1cf9564..b3c33495d7208 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -26,6 +26,7 @@
 #include <linux/badblocks.h>
 
 #include "blk.h"
+#include "blk-rq-qos.h"
 
 static struct kobject *block_depr;
 
@@ -559,6 +560,8 @@ EXPORT_SYMBOL(device_add_disk);
  */
 void del_gendisk(struct gendisk *disk)
 {
+	struct request_queue *q = disk->queue;
+
 	might_sleep();
 
 	if (WARN_ON_ONCE(!disk_live(disk) && !(disk->flags & GENHD_FL_HIDDEN)))
@@ -575,8 +578,26 @@ void del_gendisk(struct gendisk *disk)
 	fsync_bdev(disk->part0);
 	__invalidate_device(disk->part0, true);
 
+	/*
+	 * Fail any new I/O.
+	 */
+	set_bit(GD_DEAD, &disk->state);
 	set_capacity(disk, 0);
 
+	/*
+	 * Prevent new I/O from crossing bio_queue_enter().
+	 */
+	blk_queue_start_drain(q);
+	blk_mq_freeze_queue_wait(q);
+
+	rq_qos_exit(q);
+	blk_sync_queue(q);
+	blk_flush_integrity();
+	/*
+	 * Allow using passthrough request again after the queue is torn down.
+	 */
+	blk_mq_unfreeze_queue(q);
+
 	if (!(disk->flags & GENHD_FL_HIDDEN)) {
 		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
 
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index c68d83c87f83f..0f5315c2b5a34 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -149,6 +149,7 @@ struct gendisk {
 	unsigned long state;
 #define GD_NEED_PART_SCAN		0
 #define GD_READ_ONLY			1
+#define GD_DEAD				2
 
 	struct mutex open_mutex;	/* open/close mutex */
 	unsigned open_partitions;	/* number of open partitions */
-- 
2.30.2



* [PATCH 4/4] block: keep q_usage_counter in atomic mode after del_gendisk
  2021-09-22 17:22 tear down file system I/O in del_gendisk v2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2021-09-22 17:22 ` [PATCH 3/4] block: drain file system I/O on del_gendisk Christoph Hellwig
@ 2021-09-22 17:22 ` Christoph Hellwig
  3 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2021-09-22 17:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, linux-block, Ming Lei

Don't switch back to percpu mode to avoid the double RCU grace period
when tearing down SCSI devices.  After removing the disk only passthrough
commands can be sent anyway.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 9 ++++++++-
 block/blk.h    | 1 +
 block/genhd.c  | 3 ++-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 108a352051be5..bc026372de439 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -188,9 +188,11 @@ void blk_mq_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
-void blk_mq_unfreeze_queue(struct request_queue *q)
+void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
 {
 	mutex_lock(&q->mq_freeze_lock);
+	if (force_atomic)
+		q->q_usage_counter.data->force_atomic = true;
 	q->mq_freeze_depth--;
 	WARN_ON_ONCE(q->mq_freeze_depth < 0);
 	if (!q->mq_freeze_depth) {
@@ -199,6 +201,11 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	}
 	mutex_unlock(&q->mq_freeze_lock);
 }
+
+void blk_mq_unfreeze_queue(struct request_queue *q)
+{
+	__blk_mq_unfreeze_queue(q, false);
+}
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
 /*
diff --git a/block/blk.h b/block/blk.h
index e2ed2257709ae..6c3c00a8fe19d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,6 +51,7 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
 void blk_free_flush_queue(struct blk_flush_queue *q);
 
 void blk_freeze_queue(struct request_queue *q);
+void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
 void blk_queue_start_drain(struct request_queue *q);
 
 #define BIO_INLINE_VECS 4
diff --git a/block/genhd.c b/block/genhd.c
index b3c33495d7208..edd013a820879 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -596,7 +596,8 @@ void del_gendisk(struct gendisk *disk)
 	/*
 	 * Allow using passthrough request again after the queue is torn down.
 	 */
-	blk_mq_unfreeze_queue(q);
+	blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
+	__blk_mq_unfreeze_queue(q, true);
 
 	if (!(disk->flags & GENHD_FL_HIDDEN)) {
 		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
-- 
2.30.2
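
The cost this patch avoids can be sketched with a toy percpu-ref model:
resurrecting the ref into percpu mode on unfreeze means the next kill
has to pay another RCU grace period, so del_gendisk keeps the ref in
atomic mode.  The `toy_*` names are invented; only the grace-period
accounting of percpu_ref is being modeled, under that assumption:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy percpu_ref: the only property modeled is that killing a ref in
 * percpu mode costs an RCU grace period, while a ref already in atomic
 * mode can be killed immediately.  Illustrative names only. */
enum toy_mode { TOY_PERCPU, TOY_ATOMIC };

struct toy_ref {
	enum toy_mode mode;
	int grace_periods;	/* RCU grace periods paid so far */
};

/* models the percpu_ref_kill() done by queue freezing */
static void toy_freeze(struct toy_ref *r)
{
	if (r->mode == TOY_PERCPU)
		r->grace_periods++;	/* must wait out per-CPU readers */
	r->mode = TOY_ATOMIC;
}

/* models __blk_mq_unfreeze_queue(q, force_atomic) */
static void toy_unfreeze(struct toy_ref *r, bool force_atomic)
{
	if (!force_atomic)
		r->mode = TOY_PERCPU;	/* switch back to percpu mode */
}
```

With force_atomic the drain in del_gendisk and the later freeze in
blk_cleanup_queue together cost one grace period instead of two.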



* Re: [PATCH 3/4] block: drain file system I/O on del_gendisk
  2021-09-22 17:22 ` [PATCH 3/4] block: drain file system I/O on del_gendisk Christoph Hellwig
@ 2021-09-23  1:34   ` Ming Lei
  2021-09-23  5:27     ` Christoph Hellwig
  2021-09-23  6:21   ` Hannes Reinecke
  1 sibling, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-09-23  1:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Tejun Heo, linux-block

On Wed, Sep 22, 2021 at 07:22:21PM +0200, Christoph Hellwig wrote:
> Instead of delaying draining of file system I/O related items like the
> blk-qos queues, the integrity read workqueue and timeouts only when the
> request_queue is removed, do that when del_gendisk is called.  This is
> important for SCSI where the upper level drivers that control the gendisk
> are separate entities, and the disk can be freed much earlier than the
> request_queue, or can even be unbound without tearing down the queue.
> 
> Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
> Reported-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-core.c      | 27 ++++++++++++---------------
>  block/blk.h           |  1 +
>  block/genhd.c         | 21 +++++++++++++++++++++
>  include/linux/genhd.h |  1 +
>  4 files changed, 35 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index e951378855a02..d150d829a53c4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -49,7 +49,6 @@
>  #include "blk-mq.h"
>  #include "blk-mq-sched.h"
>  #include "blk-pm.h"
> -#include "blk-rq-qos.h"
>  
>  struct dentry *blk_debugfs_root;
>  
> @@ -337,23 +336,25 @@ void blk_put_queue(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(blk_put_queue);
>  
> -void blk_set_queue_dying(struct request_queue *q)
> +void blk_queue_start_drain(struct request_queue *q)
>  {
> -	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
> -
>  	/*
>  	 * When queue DYING flag is set, we need to block new req
>  	 * entering queue, so we call blk_freeze_queue_start() to
>  	 * prevent I/O from crossing blk_queue_enter().
>  	 */
>  	blk_freeze_queue_start(q);
> -
>  	if (queue_is_mq(q))
>  		blk_mq_wake_waiters(q);
> -
>  	/* Make blk_queue_enter() reexamine the DYING flag. */
>  	wake_up_all(&q->mq_freeze_wq);
>  }
> +
> +void blk_set_queue_dying(struct request_queue *q)
> +{
> +	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
> +	blk_queue_start_drain(q);
> +}
>  EXPORT_SYMBOL_GPL(blk_set_queue_dying);
>  
>  /**
> @@ -385,13 +386,8 @@ void blk_cleanup_queue(struct request_queue *q)
>  	 */
>  	blk_freeze_queue(q);
>  
> -	rq_qos_exit(q);
> -
>  	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
>  
> -	/* for synchronous bio-based driver finish in-flight integrity i/o */
> -	blk_flush_integrity();
> -
>  	blk_sync_queue(q);
>  	if (queue_is_mq(q))
>  		blk_mq_exit_queue(q);
> @@ -474,11 +470,12 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  
>  static inline int bio_queue_enter(struct bio *bio)
>  {
> -	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +	struct gendisk *disk = bio->bi_bdev->bd_disk;
> +	struct request_queue *q = disk->queue;
>  
>  	while (!blk_try_enter_queue(q, false)) {
>  		if (bio->bi_opf & REQ_NOWAIT) {
> -			if (blk_queue_dying(q))
> +			if (test_bit(GD_DEAD, &disk->state))
>  				goto dead;
>  			bio_wouldblock_error(bio);
>  			return -EBUSY;
> @@ -495,8 +492,8 @@ static inline int bio_queue_enter(struct bio *bio)
>  		wait_event(q->mq_freeze_wq,
>  			   (!q->mq_freeze_depth &&
>  			    blk_pm_resume_queue(false, q)) ||
> -			   blk_queue_dying(q));
> -		if (blk_queue_dying(q))
> +			   test_bit(GD_DEAD, &disk->state));
> +		if (test_bit(GD_DEAD, &disk->state))
>  			goto dead;
>  	}
>  
> diff --git a/block/blk.h b/block/blk.h
> index 7d2a0ba7ed21d..e2ed2257709ae 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -51,6 +51,7 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
>  void blk_free_flush_queue(struct blk_flush_queue *q);
>  
>  void blk_freeze_queue(struct request_queue *q);
> +void blk_queue_start_drain(struct request_queue *q);
>  
>  #define BIO_INLINE_VECS 4
>  struct bio_vec *bvec_alloc(mempool_t *pool, unsigned short *nr_vecs,
> diff --git a/block/genhd.c b/block/genhd.c
> index 7b6e5e1cf9564..b3c33495d7208 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -26,6 +26,7 @@
>  #include <linux/badblocks.h>
>  
>  #include "blk.h"
> +#include "blk-rq-qos.h"
>  
>  static struct kobject *block_depr;
>  
> @@ -559,6 +560,8 @@ EXPORT_SYMBOL(device_add_disk);
>   */
>  void del_gendisk(struct gendisk *disk)
>  {
> +	struct request_queue *q = disk->queue;
> +
>  	might_sleep();
>  
>  	if (WARN_ON_ONCE(!disk_live(disk) && !(disk->flags & GENHD_FL_HIDDEN)))
> @@ -575,8 +578,26 @@ void del_gendisk(struct gendisk *disk)
>  	fsync_bdev(disk->part0);
>  	__invalidate_device(disk->part0, true);
>  
> +	/*
> +	 * Fail any new I/O.
> +	 */
> +	set_bit(GD_DEAD, &disk->state);
>  	set_capacity(disk, 0);
>  
> +	/*
> +	 * Prevent new I/O from crossing bio_queue_enter().
> +	 */
> +	blk_queue_start_drain(q);
> +	blk_mq_freeze_queue_wait(q);
> +
> +	rq_qos_exit(q);
> +	blk_sync_queue(q);
> +	blk_flush_integrity();
> +	/*
> +	 * Allow using passthrough request again after the queue is torn down.
> +	 */
> +	blk_mq_unfreeze_queue(q);

After blk_mq_unfreeze_queue() returns, blk_try_enter_queue() will return
true, so new FS I/O from opened bdev still won't be blocked, right?


Thanks,
Ming



* Re: [PATCH 3/4] block: drain file system I/O on del_gendisk
  2021-09-23  1:34   ` Ming Lei
@ 2021-09-23  5:27     ` Christoph Hellwig
  2021-09-23  6:39       ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2021-09-23  5:27 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, Jens Axboe, Tejun Heo, linux-block

On Thu, Sep 23, 2021 at 09:34:03AM +0800, Ming Lei wrote:
> > +	set_bit(GD_DEAD, &disk->state);
> >  	set_capacity(disk, 0);
> >  
> > +	/*
> > +	 * Prevent new I/O from crossing bio_queue_enter().
> > +	 */
> > +	blk_queue_start_drain(q);
> > +	blk_mq_freeze_queue_wait(q);
> > +
> > +	rq_qos_exit(q);
> > +	blk_sync_queue(q);
> > +	blk_flush_integrity();
> > +	/*
> > +	 * Allow using passthrough request again after the queue is torn down.
> > +	 */
> > +	blk_mq_unfreeze_queue(q);
> 
> After blk_mq_unfreeze_queue() returns, blk_try_enter_queue() will return
> true, so new FS I/O from opened bdev still won't be blocked, right?

It won't be blocked by blk_mq_unfreeze_queue, but because the capacity
is set to 0 it still won't make it to the driver.
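
The argument above -- the unfrozen queue admits I/O again, but the zero
capacity rejects it before it reaches the driver -- can be sketched as
follows.  This is an illustrative model of the size check in the
submission path, not the kernel's actual submit code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: once del_gendisk has done set_capacity(disk, 0),
 * any file system read or write falls outside the device and is
 * rejected, even though entering the unfrozen queue would succeed. */
struct toy_disk {
	unsigned long long capacity;	/* in sectors; 0 after del_gendisk */
};

/* models the range check applied to a bio before it is issued */
static bool toy_bio_in_range(const struct toy_disk *d,
			     unsigned long long sector,
			     unsigned long long nr_sectors)
{
	return nr_sectors > 0 && sector + nr_sectors <= d->capacity;
}
```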


* Re: [PATCH 3/4] block: drain file system I/O on del_gendisk
  2021-09-22 17:22 ` [PATCH 3/4] block: drain file system I/O on del_gendisk Christoph Hellwig
  2021-09-23  1:34   ` Ming Lei
@ 2021-09-23  6:21   ` Hannes Reinecke
  1 sibling, 0 replies; 13+ messages in thread
From: Hannes Reinecke @ 2021-09-23  6:21 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Tejun Heo, linux-block, Ming Lei

On 9/22/21 7:22 PM, Christoph Hellwig wrote:
> Instead of delaying draining of file system I/O related items like the
> blk-qos queues, the integrity read workqueue and timeouts only when the
> request_queue is removed, do that when del_gendisk is called.  This is
> important for SCSI where the upper level drivers that control the gendisk
> are separate entities, and the disk can be freed much earlier than the
> request_queue, or can even be unbound without tearing down the queue.
> 
> Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
> Reported-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-core.c      | 27 ++++++++++++---------------
>   block/blk.h           |  1 +
>   block/genhd.c         | 21 +++++++++++++++++++++
>   include/linux/genhd.h |  1 +
>   4 files changed, 35 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index e951378855a02..d150d829a53c4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -49,7 +49,6 @@
>   #include "blk-mq.h"
>   #include "blk-mq-sched.h"
>   #include "blk-pm.h"
> -#include "blk-rq-qos.h"
>   
>   struct dentry *blk_debugfs_root;
>   
> @@ -337,23 +336,25 @@ void blk_put_queue(struct request_queue *q)
>   }
>   EXPORT_SYMBOL(blk_put_queue);
>   
> -void blk_set_queue_dying(struct request_queue *q)
> +void blk_queue_start_drain(struct request_queue *q)
>   {
> -	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
> -
>   	/*
>   	 * When queue DYING flag is set, we need to block new req
>   	 * entering queue, so we call blk_freeze_queue_start() to
>   	 * prevent I/O from crossing blk_queue_enter().
>   	 */
>   	blk_freeze_queue_start(q);
> -
>   	if (queue_is_mq(q))
>   		blk_mq_wake_waiters(q);
> -
>   	/* Make blk_queue_enter() reexamine the DYING flag. */
>   	wake_up_all(&q->mq_freeze_wq);
>   }
> +
> +void blk_set_queue_dying(struct request_queue *q)
> +{
> +	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
> +	blk_queue_start_drain(q);
> +}
>   EXPORT_SYMBOL_GPL(blk_set_queue_dying);
>   
>   /**
> @@ -385,13 +386,8 @@ void blk_cleanup_queue(struct request_queue *q)
>   	 */
>   	blk_freeze_queue(q);
>   
> -	rq_qos_exit(q);
> -
>   	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
>   
> -	/* for synchronous bio-based driver finish in-flight integrity i/o */
> -	blk_flush_integrity();
> -
>   	blk_sync_queue(q);
>   	if (queue_is_mq(q))
>   		blk_mq_exit_queue(q);
> @@ -474,11 +470,12 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>   
>   static inline int bio_queue_enter(struct bio *bio)
>   {
> -	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +	struct gendisk *disk = bio->bi_bdev->bd_disk;
> +	struct request_queue *q = disk->queue;
>   
>   	while (!blk_try_enter_queue(q, false)) {
>   		if (bio->bi_opf & REQ_NOWAIT) {
> -			if (blk_queue_dying(q))
> +			if (test_bit(GD_DEAD, &disk->state))
>   				goto dead;
>   			bio_wouldblock_error(bio);
>   			return -EBUSY;
> @@ -495,8 +492,8 @@ static inline int bio_queue_enter(struct bio *bio)
>   		wait_event(q->mq_freeze_wq,
>   			   (!q->mq_freeze_depth &&
>   			    blk_pm_resume_queue(false, q)) ||
> -			   blk_queue_dying(q));
> -		if (blk_queue_dying(q))
> +			   test_bit(GD_DEAD, &disk->state));
> +		if (test_bit(GD_DEAD, &disk->state))
>   			goto dead;
>   	}
>   
> diff --git a/block/blk.h b/block/blk.h
> index 7d2a0ba7ed21d..e2ed2257709ae 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -51,6 +51,7 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
>   void blk_free_flush_queue(struct blk_flush_queue *q);
>   
>   void blk_freeze_queue(struct request_queue *q);
> +void blk_queue_start_drain(struct request_queue *q);
>   
>   #define BIO_INLINE_VECS 4
>   struct bio_vec *bvec_alloc(mempool_t *pool, unsigned short *nr_vecs,
> diff --git a/block/genhd.c b/block/genhd.c
> index 7b6e5e1cf9564..b3c33495d7208 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -26,6 +26,7 @@
>   #include <linux/badblocks.h>
>   
>   #include "blk.h"
> +#include "blk-rq-qos.h"
>   
>   static struct kobject *block_depr;
>   
> @@ -559,6 +560,8 @@ EXPORT_SYMBOL(device_add_disk);
>    */
>   void del_gendisk(struct gendisk *disk)
>   {
> +	struct request_queue *q = disk->queue;
> +
>   	might_sleep();
>   
>   	if (WARN_ON_ONCE(!disk_live(disk) && !(disk->flags & GENHD_FL_HIDDEN)))
> @@ -575,8 +578,26 @@ void del_gendisk(struct gendisk *disk)
>   	fsync_bdev(disk->part0);
>   	__invalidate_device(disk->part0, true);
>   
> +	/*
> +	 * Fail any new I/O.
> +	 */
> +	set_bit(GD_DEAD, &disk->state);
>   	set_capacity(disk, 0);
>   

I always tend to become a bit nervous about simple 'set_bit' statements.
On the one hand, it _might_ already be set (unlikely, I know).
But more importantly, 'set_bit' implies no memory barrier, so we have an 
inherent race condition.
So don't you need a barrier here to synchronize the 'state' variable 
update to all CPUs?
Or can't we just make it a

if (test_and_set_bit(GD_DEAD, &disk->state))
	WARN()

Hmm?
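
The suggested test_and_set_bit() pattern can be sketched in userspace
with C11 atomics: atomic_fetch_or() both sets the bit and returns
whether it was already set, and as a seq_cst read-modify-write it is
fully ordered (whereas Linux's plain set_bit() is atomic but implies no
memory barrier).  Names below are illustrative only:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_GD_DEAD 2	/* mirrors the GD_DEAD bit number in the patch */

/* Userspace analogue of test_and_set_bit(): set the bit and report
 * whether it was already set, with full ordering around the RMW. */
static bool toy_test_and_set_bit(unsigned int bit,
				 _Atomic unsigned long *state)
{
	unsigned long mask = 1UL << bit;

	return atomic_fetch_or(state, mask) & mask;
}
```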

> +	/*
> +	 * Prevent new I/O from crossing bio_queue_enter().
> +	 */
> +	blk_queue_start_drain(q);
> +	blk_mq_freeze_queue_wait(q);
> +
> +	rq_qos_exit(q);
> +	blk_sync_queue(q);
> +	blk_flush_integrity();
> +	/*
> +	 * Allow using passthrough request again after the queue is torn down.
> +	 */
> +	blk_mq_unfreeze_queue(q);
> +
>   	if (!(disk->flags & GENHD_FL_HIDDEN)) {
>   		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
>   
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index c68d83c87f83f..0f5315c2b5a34 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -149,6 +149,7 @@ struct gendisk {
>   	unsigned long state;
>   #define GD_NEED_PART_SCAN		0
>   #define GD_READ_ONLY			1
> +#define GD_DEAD				2
>   
>   	struct mutex open_mutex;	/* open/close mutex */
>   	unsigned open_partitions;	/* number of open partitions */
> 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 3/4] block: drain file system I/O on del_gendisk
  2021-09-23  5:27     ` Christoph Hellwig
@ 2021-09-23  6:39       ` Ming Lei
  2021-09-27 12:04         ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-09-23  6:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Tejun Heo, linux-block

On Thu, Sep 23, 2021 at 07:27:05AM +0200, Christoph Hellwig wrote:
> On Thu, Sep 23, 2021 at 09:34:03AM +0800, Ming Lei wrote:
> > > +	set_bit(GD_DEAD, &disk->state);
> > >  	set_capacity(disk, 0);
> > >  
> > > +	/*
> > > +	 * Prevent new I/O from crossing bio_queue_enter().
> > > +	 */
> > > +	blk_queue_start_drain(q);
> > > +	blk_mq_freeze_queue_wait(q);
> > > +
> > > +	rq_qos_exit(q);
> > > +	blk_sync_queue(q);
> > > +	blk_flush_integrity();
> > > +	/*
> > > +	 * Allow using passthrough request again after the queue is torn down.
> > > +	 */
> > > +	blk_mq_unfreeze_queue(q);
> > 
> > After blk_mq_unfreeze_queue() returns, blk_try_enter_queue() will return
> > true, so new FS I/O from opened bdev still won't be blocked, right?
> 
> It won't be blocked by blk_mq_unfreeze_queue, but because the capacity
> is set to 0 it still won't make it to the driver.

One bio could be made & checked before set_capacity(0), then submitted
after blk_mq_unfreeze_queue() returns.

blk_mq_freeze_queue_wait() doesn't always imply RCU grace period, for
example, the .q_usage_counter may have been in atomic mode before
calling blk_queue_start_drain() & blk_mq_freeze_queue_wait().


Thanks,
Ming
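
The window described above can be replayed deterministically in a
single-threaded sketch, where each statement stands for one step of one
CPU: the submitter's capacity check happens before set_capacity(0), and
its queue entry happens after the unfreeze, so the stale bio slips
through.  This is purely an illustrative model of the interleaving:

```c
#include <assert.h>
#include <stdbool.h>

/* Single-threaded replay of the racy interleaving.  Illustrative
 * model only; no real kernel APIs are used. */
struct toy_q {
	unsigned long long capacity;
	bool frozen;
	int bios_reaching_driver;
};

static void toy_submit(struct toy_q *q, bool passed_size_check)
{
	/* models entering an unfrozen queue with an already-checked bio */
	if (passed_size_check && !q->frozen)
		q->bios_reaching_driver++;
}

static int toy_replay_race(void)
{
	struct toy_q q = { .capacity = 1024 };

	/* CPU0: bio is built and size-checked against a live disk */
	bool checked = q.capacity > 0;

	/* CPU1: del_gendisk() runs to completion */
	q.capacity = 0;		/* set_capacity(disk, 0) */
	q.frozen = true;	/* blk_queue_start_drain() + freeze wait */
	q.frozen = false;	/* blk_mq_unfreeze_queue() */

	/* CPU0: resumes and enters the queue -- the stale bio gets in */
	toy_submit(&q, checked);
	return q.bios_reaching_driver;
}
```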



* Re: [PATCH 3/4] block: drain file system I/O on del_gendisk
  2021-09-23  6:39       ` Ming Lei
@ 2021-09-27 12:04         ` Christoph Hellwig
  2021-09-27 14:32           ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2021-09-27 12:04 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, Jens Axboe, Tejun Heo, linux-block

On Thu, Sep 23, 2021 at 02:39:59PM +0800, Ming Lei wrote:
> > > After blk_mq_unfreeze_queue() returns, blk_try_enter_queue() will return
> > > true, so new FS I/O from opened bdev still won't be blocked, right?
> > 
> > It won't be blocked by blk_mq_unfreeze_queue, but because the capacity
> > is set to 0 it still won't make it to the driver.
> 
> One bio could be made & checked before set_capacity(0), then is
> submitted after blk_mq_unfreeze_queue() returns.
> 
> blk_mq_freeze_queue_wait() doesn't always imply RCU grace period, for
> example, the .q_usage_counter may have been in atomic mode before
> calling blk_queue_start_drain() & blk_mq_freeze_queue_wait().

True, but this isn't really a new issue as we never had proper quiesce
behavior.  I have a bunch of ideas of how to make this actually solid,
but none of them looks like 5.15 material.



* Re: [PATCH 3/4] block: drain file system I/O on del_gendisk
  2021-09-27 12:04         ` Christoph Hellwig
@ 2021-09-27 14:32           ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2021-09-27 14:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Tejun Heo, linux-block

On Mon, Sep 27, 2021 at 02:04:41PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 23, 2021 at 02:39:59PM +0800, Ming Lei wrote:
> > > > After blk_mq_unfreeze_queue() returns, blk_try_enter_queue() will return
> > > > true, so new FS I/O from opened bdev still won't be blocked, right?
> > > 
> > > It won't be blocked by blk_mq_unfreeze_queue, but because the capacity
> > > is set to 0 it still won't make it to the driver.
> > 
> > One bio could be made & checked before set_capacity(0), then is
> > submitted after blk_mq_unfreeze_queue() returns.
> > 
> > blk_mq_freeze_queue_wait() doesn't always imply RCU grace period, for
> > example, the .q_usage_counter may have been in atomic mode before
> > calling blk_queue_start_drain() & blk_mq_freeze_queue_wait().
> 
> True, but this isn't really a new issue as we never had proper quiesce
> behavior.  I have a bunch of ideas of how to make this actually solid,
> but none of them looks like 5.15 material.

It is a new issue since edb0872f44ec ("block: move the bdi from the
request_queue to the gendisk").

Previously the bdi had a longer lifetime than the request queue, but
that is no longer true, so either queue_to_disk() or queue->disk may
cause a panic.


Thanks,
Ming



* Re: [PATCH 1/4] block: factor out a blk_try_enter_queue helper
  2021-09-20 11:24 ` [PATCH 1/4] block: factor out a blk_try_enter_queue helper Christoph Hellwig
@ 2021-09-21  3:17   ` Bart Van Assche
  0 siblings, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2021-09-21  3:17 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Tejun Heo, linux-block, Ming Lei

On 9/20/21 04:24, Christoph Hellwig wrote:
>   		smp_rmb();
> -
>   		wait_event(q->mq_freeze_wq,
>   			   (!q->mq_freeze_depth &&
>   			    blk_pm_resume_queue(pm, q)) ||

I prefer to keep this blank line. Otherwise this patch looks good to me.

Thanks,

Bart.


* [PATCH 1/4] block: factor out a blk_try_enter_queue helper
  2021-09-20 11:24 tear down file system I/O in del_gendisk Christoph Hellwig
@ 2021-09-20 11:24 ` Christoph Hellwig
  2021-09-21  3:17   ` Bart Van Assche
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2021-09-20 11:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, linux-block, Ming Lei

Factor out the code to try to get q_usage_counter without blocking into
a separate helper.  Both to improve code readability and to prepare for
splitting bio_queue_enter from blk_queue_enter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c | 60 ++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 28 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5454db2fa263b..4c078681f39b8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -416,6 +416,30 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
+static bool blk_try_enter_queue(struct request_queue *q, bool pm)
+{
+	rcu_read_lock();
+	if (!percpu_ref_tryget_live(&q->q_usage_counter))
+		goto fail;
+
+	/*
+	 * The code that increments the pm_only counter must ensure that the
+	 * counter is globally visible before the queue is unfrozen.
+	 */
+	if (blk_queue_pm_only(q) &&
+	    (!pm || queue_rpm_status(q) == RPM_SUSPENDED))
+		goto fail_put;
+
+	rcu_read_unlock();
+	return true;
+
+fail_put:
+	percpu_ref_put(&q->q_usage_counter);
+fail:
+	rcu_read_unlock();
+	return false;
+}
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
@@ -425,40 +449,18 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
 	const bool pm = flags & BLK_MQ_REQ_PM;
 
-	while (true) {
-		bool success = false;
-
-		rcu_read_lock();
-		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
-			/*
-			 * The code that increments the pm_only counter is
-			 * responsible for ensuring that that counter is
-			 * globally visible before the queue is unfrozen.
-			 */
-			if ((pm && queue_rpm_status(q) != RPM_SUSPENDED) ||
-			    !blk_queue_pm_only(q)) {
-				success = true;
-			} else {
-				percpu_ref_put(&q->q_usage_counter);
-			}
-		}
-		rcu_read_unlock();
-
-		if (success)
-			return 0;
-
+	while (!blk_try_enter_queue(q, pm)) {
 		if (flags & BLK_MQ_REQ_NOWAIT)
 			return -EBUSY;
 
 		/*
-		 * read pair of barrier in blk_freeze_queue_start(),
-		 * we need to order reading __PERCPU_REF_DEAD flag of
-		 * .q_usage_counter and reading .mq_freeze_depth or
-		 * queue dying flag, otherwise the following wait may
-		 * never return if the two reads are reordered.
+		 * read pair of barrier in blk_freeze_queue_start(), we need to
+		 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
+		 * reading .mq_freeze_depth or queue dying flag, otherwise the
+		 * following wait may never return if the two reads are
+		 * reordered.
 		 */
 		smp_rmb();
-
 		wait_event(q->mq_freeze_wq,
 			   (!q->mq_freeze_depth &&
 			    blk_pm_resume_queue(pm, q)) ||
@@ -466,6 +468,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		if (blk_queue_dying(q))
 			return -ENODEV;
 	}
+
+	return 0;
 }
 
 static inline int bio_queue_enter(struct bio *bio)
-- 
2.30.2




Thread overview: 13+ messages
2021-09-22 17:22 tear down file system I/O in del_gendisk v2 Christoph Hellwig
2021-09-22 17:22 ` [PATCH 1/4] block: factor out a blk_try_enter_queue helper Christoph Hellwig
2021-09-22 17:22 ` [PATCH 2/4] block: split bio_queue_enter from blk_queue_enter Christoph Hellwig
2021-09-22 17:22 ` [PATCH 3/4] block: drain file system I/O on del_gendisk Christoph Hellwig
2021-09-23  1:34   ` Ming Lei
2021-09-23  5:27     ` Christoph Hellwig
2021-09-23  6:39       ` Ming Lei
2021-09-27 12:04         ` Christoph Hellwig
2021-09-27 14:32           ` Ming Lei
2021-09-23  6:21   ` Hannes Reinecke
2021-09-22 17:22 ` [PATCH 4/4] block: keep q_usage_counter in atomic mode after del_gendisk Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2021-09-20 11:24 tear down file system I/O in del_gendisk Christoph Hellwig
2021-09-20 11:24 ` [PATCH 1/4] block: factor out a blk_try_enter_queue helper Christoph Hellwig
2021-09-21  3:17   ` Bart Van Assche
