All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk
@ 2022-01-22 11:10 Ming Lei
  2022-01-22 11:10 ` [PATCH V2 01/13] block: declare blkcg_[init|exit]_queue in private header Ming Lei
                   ` (12 more replies)
  0 siblings, 13 replies; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

Hello,

Draining FS I/O on del_gendisk() is added for just avoiding to refer to
recently added q->disk in IO path, and it isn't actually reliable:

1) queue freezing can't drain FS I/O for bio based driver, so blk-cgroup
shutdown can't be moved into del_gendisk()

2) freezing queue can't drain in-progress IO issue/dispatch activities,
so elevator shutdown can't be moved into del_gendisk()

3) the added flag of GD_DEAD may not be observed reliably in
__bio_queue_enter() because queue freezing might not imply rcu grace
period.

4) passthrough IO accounting code can refer to q->disk after disk is
released

When releasing disk, all FS IOs are guaranteed to be done, so this
patchset tries to shutdown elevator/blk-cgroup/rq qos there, then
referring to q->disk can be avoided after disk is released.

Also add flag of BLK_MQ_REQ_USER_IO for accounting passthrough IO
only if the IO is from userspace, since disk is live for user
passthrough IO.

V2:
	- take new approach to make current referring to q->disk reliable


Laibin Qiu (1):
  block/wbt: fix negative inflight counter when remove scsi device

Ming Lei (12):
  block: declare blkcg_[init|exit]_queue in private header
  block: move initialization of q->blkg_list into blkcg_init_queue
  block: move blkcg initialization/destroy into disk allocation/release
    handler
  block: only account passthrough IO from userspace
  block: don't remove hctx debugfs dir from blk_mq_exit_queue
  block: move q_usage_counter release into blk_queue_release
  block: export __blk_mq_unfreeze_queue
  scsi: force unfreezing queue into atomic mode
  block: add helper of disk_release_queue for release queue data for
    disk
  block: move blk_exit_queue into disk_release
  block: move rq_qos_exit() into disk_release()
  block: don't drain file system I/O on del_gendisk

 block/bfq-iosched.c        |  2 +
 block/blk-cgroup.c         |  2 +
 block/blk-core.c           | 31 ++++----------
 block/blk-mq.c             | 27 ++++++++++--
 block/blk-sysfs.c          | 25 +-----------
 block/blk.h                | 10 ++++-
 block/elevator.c           |  2 -
 block/genhd.c              | 84 +++++++++++++++++++++++++++-----------
 drivers/nvme/host/ioctl.c  |  2 +-
 drivers/scsi/scsi_ioctl.c  |  3 +-
 drivers/scsi/sd.c          |  2 +-
 include/linux/blk-cgroup.h |  4 --
 include/linux/blk-mq.h     |  3 ++
 include/linux/genhd.h      |  1 -
 14 files changed, 114 insertions(+), 84 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH V2 01/13] block: declare blkcg_[init|exit]_queue in private header
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 12:59   ` Christoph Hellwig
  2022-01-22 11:10 ` [PATCH V2 02/13] block: move initialization of q->blkg_list into blkcg_init_queue Ming Lei
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

Both two functions are used by block core code only, so move the
declaration into private header of block layer.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk.h                | 8 ++++++++
 include/linux/blk-cgroup.h | 4 ----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 8bd43b3ad33d..7b0f12260ae6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -406,6 +406,14 @@ extern int blk_iolatency_init(struct request_queue *q);
 static inline int blk_iolatency_init(struct request_queue *q) { return 0; }
 #endif
 
+#ifdef CONFIG_BLK_CGROUP
+int blkcg_init_queue(struct request_queue *q);
+void blkcg_exit_queue(struct request_queue *q);
+#else
+static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
+static inline void blkcg_exit_queue(struct request_queue *q) { }
+#endif
+
 struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
 
 #ifdef CONFIG_BLK_DEV_ZONED
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index b4de2010fba5..d9dcd18d77f4 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -182,8 +182,6 @@ extern bool blkcg_debug_stats;
 
 struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
 				      struct request_queue *q, bool update_hint);
-int blkcg_init_queue(struct request_queue *q);
-void blkcg_exit_queue(struct request_queue *q);
 
 /* Blkio controller policy registration */
 int blkcg_policy_register(struct blkcg_policy *pol);
@@ -637,8 +635,6 @@ static inline void blkcg_schedule_throttle(struct request_queue *q, bool use_mem
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
 static inline struct blkcg_gq *blk_queue_root_blkg(struct request_queue *q)
 { return NULL; }
-static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
-static inline void blkcg_exit_queue(struct request_queue *q) { }
 static inline int blkcg_policy_register(struct blkcg_policy *pol) { return 0; }
 static inline void blkcg_policy_unregister(struct blkcg_policy *pol) { }
 static inline int blkcg_activate_policy(struct request_queue *q,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 02/13] block: move initialization of q->blkg_list into blkcg_init_queue
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
  2022-01-22 11:10 ` [PATCH V2 01/13] block: declare blkcg_[init|exit]_queue in private header Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:00   ` Christoph Hellwig
  2022-01-24 18:32   ` Bart Van Assche
  2022-01-22 11:10 ` [PATCH V2 03/13] block: move blkcg initialization/destroy into disk allocation/release handler Ming Lei
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

q->blkg_list is only used by blkcg code, so move it into
blkcg_init_queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-cgroup.c | 2 ++
 block/blk-core.c   | 3 ---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 650f7e27989f..498753e2bb73 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1176,6 +1176,8 @@ int blkcg_init_queue(struct request_queue *q)
 	bool preloaded;
 	int ret;
 
+	INIT_LIST_HEAD(&q->blkg_list);
+
 	new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
 	if (!new_blkg)
 		return -ENOMEM;
diff --git a/block/blk-core.c b/block/blk-core.c
index 97f8bc8d3a79..2a400fa8cabd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -475,9 +475,6 @@ struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
 	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
 	INIT_WORK(&q->timeout_work, blk_timeout_work);
 	INIT_LIST_HEAD(&q->icq_list);
-#ifdef CONFIG_BLK_CGROUP
-	INIT_LIST_HEAD(&q->blkg_list);
-#endif
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 03/13] block: move blkcg initialization/destroy into disk allocation/release handler
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
  2022-01-22 11:10 ` [PATCH V2 01/13] block: declare blkcg_[init|exit]_queue in private header Ming Lei
  2022-01-22 11:10 ` [PATCH V2 02/13] block: move initialization of q->blkg_list into blkcg_init_queue Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:02   ` Christoph Hellwig
  2022-01-22 11:10 ` [PATCH V2 04/13] block/wbt: fix negative inflight counter when remove scsi device Ming Lei
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

blkcg works on FS bio level, so it is reasonable to make both blkcg and
gendisk sharing same lifetime. Meantime there won't be any FS IO when
releasing disk, so safe to move blkcg initialization/destroy into disk
allocation/release handler

Long term, we can move blkcg into gendisk completely.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c  |  5 -----
 block/blk-sysfs.c |  7 -------
 block/genhd.c     | 14 ++++++++++++++
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2a400fa8cabd..d9477191b303 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -495,17 +495,12 @@ struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
 				PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto fail_stats;
 
-	if (blkcg_init_queue(q))
-		goto fail_ref;
-
 	blk_queue_dma_alignment(q, 511);
 	blk_set_default_limits(&q->limits);
 	q->nr_requests = BLKDEV_DEFAULT_RQ;
 
 	return q;
 
-fail_ref:
-	percpu_ref_exit(&q->q_usage_counter);
 fail_stats:
 	blk_free_queue_stats(q->stats);
 fail_split:
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e20eadfcf5c8..6f326b44fb00 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -751,13 +751,6 @@ static void blk_exit_queue(struct request_queue *q)
 		ioc_clear_queue(q);
 		elevator_exit(q);
 	}
-
-	/*
-	 * Remove all references to @q from the block cgroup controller before
-	 * restoring @q->queue_lock to avoid that restoring this pointer causes
-	 * e.g. blkcg_print_blkgs() to crash.
-	 */
-	blkcg_exit_queue(q);
 }
 
 /**
diff --git a/block/genhd.c b/block/genhd.c
index 626c8406f21a..b9b0db168ce1 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1112,6 +1112,14 @@ static void disk_release(struct device *dev)
 	disk_release_events(disk);
 	kfree(disk->random);
 	xa_destroy(&disk->part_tbl);
+
+	/*
+	 * Remove all references to @q from the block cgroup controller before
+	 * restoring @q->queue_lock to avoid that restoring this pointer causes
+	 * e.g. blkcg_print_blkgs() to crash.
+	 */
+	blkcg_exit_queue(disk->queue);
+
 	disk->queue->disk = NULL;
 	blk_put_queue(disk->queue);
 	iput(disk->part0->bd_inode);	/* frees the disk */
@@ -1308,6 +1316,10 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 	if (xa_insert(&disk->part_tbl, 0, disk->part0, GFP_KERNEL))
 		goto out_destroy_part_tbl;
 
+	/* todo: move blkcg into gendisk */
+	if (blkcg_init_queue(q))
+		goto out_erase_part0;
+
 	rand_initialize_disk(disk);
 	disk_to_dev(disk)->class = &block_class;
 	disk_to_dev(disk)->type = &disk_type;
@@ -1320,6 +1332,8 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 #endif
 	return disk;
 
+out_erase_part0:
+	xa_erase(&disk->part_tbl, 0);
 out_destroy_part_tbl:
 	xa_destroy(&disk->part_tbl);
 	disk->part0->bd_disk = NULL;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 04/13] block/wbt: fix negative inflight counter when remove scsi device
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
                   ` (2 preceding siblings ...)
  2022-01-22 11:10 ` [PATCH V2 03/13] block: move blkcg initialization/destroy into disk allocation/release handler Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-02-17  7:45   ` Christoph Hellwig
  2022-01-22 11:10 ` [PATCH V2 05/13] block: only account passthrough IO from userspace Ming Lei
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Laibin Qiu, Ming Lei

From: Laibin Qiu <qiulaibin@huawei.com>

Now that we disable wbt by set WBT_STATE_OFF_DEFAULT in
wbt_disable_default() when switch elevator to bfq. And when
we remove scsi device, wbt will be enabled by wbt_enable_default.
If it become false positive between wbt_wait() and wbt_track()
when submit write request.

The following is the scenario that triggered the problem.

T1                          T2                           T3
                            elevator_switch_mq
                            bfq_init_queue
                            wbt_disable_default <= Set
                            rwb->enable_state (OFF)
Submit_bio
blk_mq_make_request
rq_qos_throttle
<= rwb->enable_state (OFF)
                                                         scsi_remove_device
                                                         sd_remove
                                                         del_gendisk
                                                         blk_unregister_queue
                                                         elv_unregister_queue
                                                         wbt_enable_default
                                                         <= Set rwb->enable_state (ON)
q_qos_track
<= rwb->enable_state (ON)
^^^^^^ this request will mark WBT_TRACKED without inflight add and will
lead to drop rqw->inflight to -1 in wbt_done() which will trigger IO hung.

Fix this by move wbt_enable_default() from elv_unregister to
bfq_exit_queue(). Only re-enable wbt when bfq exit.

Fixes: 76a8040817b4b ("blk-wbt: make sure throttle is enabled properly")

Remove oneline stale comment, and kill one oneshot local variable.

Signed-off-by: Ming Lei <ming.lei@rehdat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-block/20211214133103.551813-1-qiulaibin@huawei.com/
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
---
 block/bfq-iosched.c | 2 ++
 block/elevator.c    | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0c612a911696..36a66e97e3c2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7018,6 +7018,8 @@ static void bfq_exit_queue(struct elevator_queue *e)
 	spin_unlock_irq(&bfqd->lock);
 #endif
 
+	wbt_enable_default(bfqd->queue);
+
 	kfree(bfqd);
 }
 
diff --git a/block/elevator.c b/block/elevator.c
index ec98aed39c4f..482df2a350fc 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -525,8 +525,6 @@ void elv_unregister_queue(struct request_queue *q)
 		kobject_del(&e->kobj);
 
 		e->registered = 0;
-		/* Re-enable throttling in case elevator disabled it */
-		wbt_enable_default(q);
 	}
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
                   ` (3 preceding siblings ...)
  2022-01-22 11:10 ` [PATCH V2 04/13] block/wbt: fix negative inflight counter when remove scsi device Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:05   ` Christoph Hellwig
  2022-01-22 11:10 ` [PATCH V2 06/13] block: don't remove hctx debugfs dir from blk_mq_exit_queue Ming Lei
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

Passthrough request from userspace has one active block_device/disk
associated, so they can be accounted via rq->q->disk. For other
passthrough request, there may not be disk/block_device for the queue,
since either the queue has not a disk or the disk may be deleted
already.

Add flag of BLK_MQ_REQ_USER_IO for only accounting passthrough request
from userspace.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c            | 13 ++++++++++++-
 drivers/nvme/host/ioctl.c |  2 +-
 drivers/scsi/scsi_ioctl.c |  3 ++-
 include/linux/blk-mq.h    |  2 ++
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a6d4780580fc..0d25cc5778c9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -336,6 +336,17 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 }
 EXPORT_SYMBOL(blk_rq_init);
 
+static inline bool blk_mq_io_may_account(struct blk_mq_alloc_data *data)
+{
+	if (!blk_op_is_passthrough(data->cmd_flags))
+		return true;
+
+	if (data->flags & BLK_MQ_REQ_USER_IO)
+		return true;
+
+	return false;
+}
+
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		struct blk_mq_tags *tags, unsigned int tag, u64 alloc_time_ns)
 {
@@ -351,7 +362,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 
 	if (data->flags & BLK_MQ_REQ_PM)
 		data->rq_flags |= RQF_PM;
-	if (blk_queue_io_stat(q))
+	if (blk_queue_io_stat(q) && blk_mq_io_may_account(data))
 		data->rq_flags |= RQF_IO_STAT;
 	rq->rq_flags = data->rq_flags;
 
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 22314962842d..f94afc38a6e3 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -66,7 +66,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
 	void *meta = NULL;
 	int ret;
 
-	req = nvme_alloc_request(q, cmd, 0);
+	req = nvme_alloc_request(q, cmd, BLK_MQ_REQ_USER_IO);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
diff --git a/drivers/scsi/scsi_ioctl.c b/drivers/scsi/scsi_ioctl.c
index e13fd380deb6..b262fe06dacc 100644
--- a/drivers/scsi/scsi_ioctl.c
+++ b/drivers/scsi/scsi_ioctl.c
@@ -440,7 +440,8 @@ static int sg_io(struct scsi_device *sdev, struct sg_io_hdr *hdr, fmode_t mode)
 
 	ret = -ENOMEM;
 	rq = scsi_alloc_request(sdev->request_queue, writing ?
-			     REQ_OP_DRV_OUT : REQ_OP_DRV_IN, 0);
+			     REQ_OP_DRV_OUT : REQ_OP_DRV_IN,
+			     BLK_MQ_REQ_USER_IO);
 	if (IS_ERR(rq))
 		return PTR_ERR(rq);
 	req = scsi_req(rq);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d319ffa59354..d2ad2ed11723 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -709,6 +709,8 @@ enum {
 	BLK_MQ_REQ_RESERVED	= (__force blk_mq_req_flags_t)(1 << 1),
 	/* set RQF_PM */
 	BLK_MQ_REQ_PM		= (__force blk_mq_req_flags_t)(1 << 2),
+	/* user IO request */
+	BLK_MQ_REQ_USER_IO	= (__force blk_mq_req_flags_t)(1 << 3),
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 06/13] block: don't remove hctx debugfs dir from blk_mq_exit_queue
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
                   ` (4 preceding siblings ...)
  2022-01-22 11:10 ` [PATCH V2 05/13] block: only account passthrough IO from userspace Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:06   ` Christoph Hellwig
  2022-01-22 11:10 ` [PATCH V2 07/13] block: move q_usage_counter release into blk_queue_release Ming Lei
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

The queue's top debugfs dir is removed from blk_release_queue(), so all
hctx's debugfs dirs are removed from there. Given blk_mq_exit_queue()
is only called from blk_cleanup_queue(), it isn't necessary to remove
hctx debugfs from blk_mq_exit_queue().

So remove it from blk_mq_exit_queue().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0d25cc5778c9..66cc701921c1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3454,7 +3454,6 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (i == nr_queue)
 			break;
-		blk_mq_debugfs_unregister_hctx(hctx);
 		blk_mq_exit_hctx(q, set, hctx, i);
 	}
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 07/13] block: move q_usage_counter release into blk_queue_release
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
                   ` (5 preceding siblings ...)
  2022-01-22 11:10 ` [PATCH V2 06/13] block: don't remove hctx debugfs dir from blk_mq_exit_queue Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:09   ` Christoph Hellwig
  2022-01-24 19:06   ` Bart Van Assche
  2022-01-22 11:10 ` [PATCH V2 08/13] block: export __blk_mq_unfreeze_queue Ming Lei
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

After blk_cleanup_queue() returns, disk may not be released yet, so
probably bio may still be submitted and ->q_usage_counter may be
touched, so far this way seems safe, but not good from API's viewpoint.

Move the release ofq_usage_counter into blk_queue_release().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c  | 2 --
 block/blk-sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d9477191b303..bcb4d982cd80 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -341,8 +341,6 @@ void blk_cleanup_queue(struct request_queue *q)
 		blk_mq_sched_free_rqs(q);
 	mutex_unlock(&q->sysfs_lock);
 
-	percpu_ref_exit(&q->q_usage_counter);
-
 	/* @q is and will stay empty, shutdown and put */
 	blk_put_queue(q);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6f326b44fb00..5f14fd333182 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -780,6 +780,8 @@ static void blk_release_queue(struct kobject *kobj)
 
 	might_sleep();
 
+	percpu_ref_exit(&q->q_usage_counter);
+
 	if (q->poll_stat)
 		blk_stat_remove_callback(q, q->poll_cb);
 	blk_stat_free_callback(q->poll_cb);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 08/13] block: export __blk_mq_unfreeze_queue
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
                   ` (6 preceding siblings ...)
  2022-01-22 11:10 ` [PATCH V2 07/13] block: move q_usage_counter release into blk_queue_release Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:11   ` Christoph Hellwig
  2022-01-22 11:10 ` [PATCH V2 09/13] scsi: force unfreezing queue into atomic mode Ming Lei
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

blk_mq_unfreeze_queue() is used by scsi when releasing disk, so not
necessary to unfreeze into percpu mode, then the following
blk_cleanup_queue doesn't need to freeze queue from percpu mode, and
the implied RCU grace period may be avoided.

Meantime move clearing QUEUE_FLAG_INIT_DONE into this API, so that
when one disk is added, ->q_usage_counter can be switched to percpu
mode again.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 10 +++++++++-
 block/blk.h            |  1 -
 block/genhd.c          |  1 -
 include/linux/blk-mq.h |  1 +
 4 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 66cc701921c1..d51b0aa2e4e4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -215,11 +215,18 @@ void blk_mq_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
+/*
+ * When 'force_atomic' is passed as true, this API is supposed to be
+ * called only in case that disk is removed or released.
+ */
 void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
 {
 	mutex_lock(&q->mq_freeze_lock);
-	if (force_atomic)
+	if (force_atomic) {
+		/* When new disk is added, switch to percpu mode */
+		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
 		q->q_usage_counter.data->force_atomic = true;
+	}
 	q->mq_freeze_depth--;
 	WARN_ON_ONCE(q->mq_freeze_depth < 0);
 	if (!q->mq_freeze_depth) {
@@ -228,6 +235,7 @@ void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
 	}
 	mutex_unlock(&q->mq_freeze_lock);
 }
+EXPORT_SYMBOL_GPL(__blk_mq_unfreeze_queue);
 
 void blk_mq_unfreeze_queue(struct request_queue *q)
 {
diff --git a/block/blk.h b/block/blk.h
index 7b0f12260ae6..a038e25d8637 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -43,7 +43,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
 void blk_free_flush_queue(struct blk_flush_queue *q);
 
 void blk_freeze_queue(struct request_queue *q);
-void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
 void blk_queue_start_drain(struct request_queue *q);
 int __bio_queue_enter(struct request_queue *q, struct bio *bio);
 bool submit_bio_checks(struct bio *bio);
diff --git a/block/genhd.c b/block/genhd.c
index b9b0db168ce1..5bd7bcd6246e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -628,7 +628,6 @@ void del_gendisk(struct gendisk *disk)
 	/*
 	 * Allow using passthrough request again after the queue is torn down.
 	 */
-	blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
 	__blk_mq_unfreeze_queue(q, true);
 
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d2ad2ed11723..1645159d10f3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -869,6 +869,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
+void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
 void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 09/13] scsi: force unfreezing queue into atomic mode
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
                   ` (7 preceding siblings ...)
  2022-01-22 11:10 ` [PATCH V2 08/13] block: export __blk_mq_unfreeze_queue Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:15   ` Christoph Hellwig
  2022-01-22 11:10 ` [PATCH V2 10/13] block: add helper of disk_release_queue for release queue data for disk Ming Lei
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

In scsi_disk_release() request queue is frozen for clearing
disk->private_data, and there can't be any FS IO issued to
this queue, and only private passthrough request will be handled, so
force unfreezing queue into atomic mode.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/sd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0e73c3f2f381..27f04c860f00 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3670,7 +3670,7 @@ static void scsi_disk_release(struct device *dev)
 	 * in case multiple processes open a /dev/sd... node concurrently.
 	 */
 	blk_mq_freeze_queue(q);
-	blk_mq_unfreeze_queue(q);
+	__blk_mq_unfreeze_queue(q, true);
 
 	disk->private_data = NULL;
 	put_disk(disk);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 10/13] block: add helper of disk_release_queue for release queue data for disk
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
                   ` (8 preceding siblings ...)
  2022-01-22 11:10 ` [PATCH V2 09/13] scsi: force unfreezing queue into atomic mode Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:16   ` Christoph Hellwig
  2022-01-22 11:10 ` [PATCH V2 11/13] block: move blk_exit_queue into disk_release Ming Lei
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

Add one helper of disk_release_queue for releasing queue data for disk.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/genhd.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 5bd7bcd6246e..a86027619683 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1085,6 +1085,20 @@ static const struct attribute_group *disk_attr_groups[] = {
 	NULL
 };
 
+static void disk_release_queue(struct gendisk *disk)
+{
+	struct request_queue *q = disk->queue;
+
+	blk_mq_cancel_work_sync(q);
+
+	/*
+	 * Remove all references to @q from the block cgroup controller before
+	 * restoring @q->queue_lock to avoid that restoring this pointer causes
+	 * e.g. blkcg_print_blkgs() to crash.
+	 */
+	blkcg_exit_queue(q);
+}
+
 /**
  * disk_release - releases all allocated resources of the gendisk
  * @dev: the device representing this disk
@@ -1106,19 +1120,12 @@ static void disk_release(struct device *dev)
 	might_sleep();
 	WARN_ON_ONCE(disk_live(disk));
 
-	blk_mq_cancel_work_sync(disk->queue);
+	disk_release_queue(disk);
 
 	disk_release_events(disk);
 	kfree(disk->random);
 	xa_destroy(&disk->part_tbl);
 
-	/*
-	 * Remove all references to @q from the block cgroup controller before
-	 * restoring @q->queue_lock to avoid that restoring this pointer causes
-	 * e.g. blkcg_print_blkgs() to crash.
-	 */
-	blkcg_exit_queue(disk->queue);
-
 	disk->queue->disk = NULL;
 	blk_put_queue(disk->queue);
 	iput(disk->part0->bd_inode);	/* frees the disk */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 11/13] block: move blk_exit_queue into disk_release
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
                   ` (9 preceding siblings ...)
  2022-01-22 11:10 ` [PATCH V2 10/13] block: add helper of disk_release_queue for release queue data for disk Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:22   ` Christoph Hellwig
  2022-01-22 11:10 ` [PATCH V2 12/13] block: move rq_qos_exit() into disk_release() Ming Lei
  2022-01-22 11:10 ` [PATCH V2 13/13] block: don't drain file system I/O on del_gendisk Ming Lei
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

There can't be FS IO in disk_release(), so move blk_exit_queue() there.

We still need to freeze queue here since request is freed after bio is
ended, meantime passthrough request relies on scheduler tags too, but
either q->q_usage_counter is in atomic mode, such as scsi, or it has
been drained already, such as most of other drivers, so the added queue
freeze is pretty fast, and RCU grace period isn't supposed to be
involved.

disk can be released before or after queue is cleaned up, and we have to
free scheduler request pool before blk_cleanup_queue returns, meantime
the request pool has to be freed before exiting the scheduler. So move
scheduler pool freeing into both the two functions.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c    |  3 +++
 block/blk-sysfs.c | 16 ----------------
 block/genhd.c     | 39 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 41 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d51b0aa2e4e4..72ae9955cf27 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3101,6 +3101,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	struct blk_mq_tags *drv_tags;
 	struct page *page;
 
+	if (list_empty(&tags->page_list))
+		return;
+
 	if (blk_mq_is_shared_tags(set->flags))
 		drv_tags = set->shared_tags;
 	else
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5f14fd333182..e0f29b56e8e2 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -739,20 +739,6 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head)
 	kmem_cache_free(blk_get_queue_kmem_cache(blk_queue_has_srcu(q)), q);
 }
 
-/* Unconfigure the I/O scheduler and dissociate from the cgroup controller. */
-static void blk_exit_queue(struct request_queue *q)
-{
-	/*
-	 * Since the I/O scheduler exit code may access cgroup information,
-	 * perform I/O scheduler exit before disassociating from the block
-	 * cgroup controller.
-	 */
-	if (q->elevator) {
-		ioc_clear_queue(q);
-		elevator_exit(q);
-	}
-}
-
 /**
  * blk_release_queue - releases all allocated resources of the request_queue
  * @kobj: pointer to a kobject, whose container is a request_queue
@@ -786,8 +772,6 @@ static void blk_release_queue(struct kobject *kobj)
 		blk_stat_remove_callback(q, q->poll_cb);
 	blk_stat_free_callback(q->poll_cb);
 
-	blk_exit_queue(q);
-
 	blk_free_queue_stats(q->stats);
 	kfree(q->poll_stat);
 
diff --git a/block/genhd.c b/block/genhd.c
index a86027619683..f1aef5d13afa 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1085,11 +1085,48 @@ static const struct attribute_group *disk_attr_groups[] = {
 	NULL
 };
 
+/* Unconfigure the I/O scheduler and dissociate from the cgroup controller. */
+static void blk_exit_queue(struct request_queue *q)
+{
+	/*
+	 * Since the I/O scheduler exit code may access cgroup information,
+	 * perform I/O scheduler exit before disassociating from the block
+	 * cgroup controller.
+	 */
+	if (q->elevator) {
+		ioc_clear_queue(q);
+
+		mutex_lock(&q->sysfs_lock);
+		blk_mq_sched_free_rqs(q);
+		elevator_exit(q);
+		mutex_unlock(&q->sysfs_lock);
+	}
+}
+
 static void disk_release_queue(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
 
-	blk_mq_cancel_work_sync(q);
+	if (queue_is_mq(q)) {
+		blk_mq_cancel_work_sync(q);
+
+		/*
+		 * All FS bios have been done, however FS request may not
+		 * be freed yet since we end bio before freeing request,
+		 * meantime passthrough request replies on scheduler tags,
+		 * so queue needs to be frozen here.
+		 *
+		 * Most of drivers release disk after blk_cleanup_queue()
+		 * returns, and SCSI may release disk before calling
+		 * blk_cleanup_queue, but request queue has been in atomic
+		 * mode already, see scsi_disk_release(), so the following
+		 * queue freeze is pretty fast, and RCU grace period isn't
+		 * supposed to be involved.
+		 */
+		blk_mq_freeze_queue(q);
+		blk_exit_queue(q);
+		__blk_mq_unfreeze_queue(q, true);
+	}
 
 	/*
 	 * Remove all references to @q from the block cgroup controller before
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 12/13] block: move rq_qos_exit() into disk_release()
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
                   ` (10 preceding siblings ...)
  2022-01-22 11:10 ` [PATCH V2 11/13] block: move blk_exit_queue into disk_release Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:28   ` Christoph Hellwig
  2022-01-22 11:10 ` [PATCH V2 13/13] block: don't drain file system I/O on del_gendisk Ming Lei
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

There can't be FS IO in disk_release(), so it is safe to move rq_qos_exit()
there.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index f1aef5d13afa..2f0e92cdcf6d 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -622,7 +622,6 @@ void del_gendisk(struct gendisk *disk)
 
 	blk_mq_freeze_queue_wait(q);
 
-	rq_qos_exit(q);
 	blk_sync_queue(q);
 	blk_flush_integrity();
 	/*
@@ -1125,6 +1124,7 @@ static void disk_release_queue(struct gendisk *disk)
 		 */
 		blk_mq_freeze_queue(q);
 		blk_exit_queue(q);
+		rq_qos_exit(q);
 		__blk_mq_unfreeze_queue(q, true);
 	}
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V2 13/13] block: don't drain file system I/O on del_gendisk
  2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
                   ` (11 preceding siblings ...)
  2022-01-22 11:10 ` [PATCH V2 12/13] block: move rq_qos_exit() into disk_release() Ming Lei
@ 2022-01-22 11:10 ` Ming Lei
  2022-01-24 13:39   ` Christoph Hellwig
  12 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-22 11:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi, Ming Lei

Commit 8e141f9eb803 ("block: drain file system I/O on del_gendisk") is
added for avoiding to reference of q->disk after disk is released, since
disk release can be done before running into blk_cleanup_queue(),
but it can't avoid the issue completely, and the approach is very
fragile:

1) queue freezing can't drain FS I/O for bio based driver, so there is
still inflight bios after disk is deleted, so blk-cgroup shutdown
can't be moved to del_gendisk(); meantime elevator shutdown can't be
moved to del_gendisk() because freezing queue isn't enough to drain
io issue activities. So both block-cgroup and elevator code may still
refer to q->disk somewhere. Same with q->disk referring for bio based
driver.

2) the added flag of GD_DEAD may not be observed reliably in
__bio_queue_enter() because queue freezing might not imply rcu grace
period.

Now we have moved blkcg, rqos and elevator shutdown code into disk_release()
where it is guaranteed that no any FS IO is inflight. Meantime we only
account passthrough IO issued from userspace, where q->disk is always
valid. Then q->disk can be guaranteed to be referred before releasing
disk.

So not necessary to drain FS I/O in del_gendisk() any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c      | 21 +++++++++------------
 block/blk.h           |  1 -
 block/genhd.c         | 21 ---------------------
 include/linux/genhd.h |  1 -
 4 files changed, 9 insertions(+), 35 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bcb4d982cd80..2216360165c9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -270,8 +270,10 @@ void blk_put_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_put_queue);
 
-void blk_queue_start_drain(struct request_queue *q)
+void blk_set_queue_dying(struct request_queue *q)
 {
+	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
+
 	/*
 	 * When queue DYING flag is set, we need to block new req
 	 * entering queue, so we call blk_freeze_queue_start() to
@@ -283,12 +285,6 @@ void blk_queue_start_drain(struct request_queue *q)
 	/* Make blk_queue_enter() reexamine the DYING flag. */
 	wake_up_all(&q->mq_freeze_wq);
 }
-
-void blk_set_queue_dying(struct request_queue *q)
-{
-	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
-	blk_queue_start_drain(q);
-}
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
 /**
@@ -322,6 +318,9 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
+	/* for synchronous bio-based driver finish in-flight integrity i/o */
+	blk_flush_integrity();
+
 	blk_sync_queue(q);
 	if (queue_is_mq(q)) {
 		blk_mq_cancel_work_sync(q);
@@ -381,10 +380,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 int __bio_queue_enter(struct request_queue *q, struct bio *bio)
 {
 	while (!blk_try_enter_queue(q, false)) {
-		struct gendisk *disk = bio->bi_bdev->bd_disk;
-
 		if (bio->bi_opf & REQ_NOWAIT) {
-			if (test_bit(GD_DEAD, &disk->state))
+			if (blk_queue_dying(q))
 				goto dead;
 			bio_wouldblock_error(bio);
 			return -EBUSY;
@@ -401,8 +398,8 @@ int __bio_queue_enter(struct request_queue *q, struct bio *bio)
 		wait_event(q->mq_freeze_wq,
 			   (!q->mq_freeze_depth &&
 			    blk_pm_resume_queue(false, q)) ||
-			   test_bit(GD_DEAD, &disk->state));
-		if (test_bit(GD_DEAD, &disk->state))
+			   blk_queue_dying(q));
+		if (blk_queue_dying(q))
 			goto dead;
 	}
 
diff --git a/block/blk.h b/block/blk.h
index a038e25d8637..6adedf97cba2 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -43,7 +43,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
 void blk_free_flush_queue(struct blk_flush_queue *q);
 
 void blk_freeze_queue(struct request_queue *q);
-void blk_queue_start_drain(struct request_queue *q);
 int __bio_queue_enter(struct request_queue *q, struct bio *bio);
 bool submit_bio_checks(struct bio *bio);
 
diff --git a/block/genhd.c b/block/genhd.c
index 2f0e92cdcf6d..90742679f949 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -569,8 +569,6 @@ EXPORT_SYMBOL(device_add_disk);
  */
 void del_gendisk(struct gendisk *disk)
 {
-	struct request_queue *q = disk->queue;
-
 	might_sleep();
 
 	if (WARN_ON_ONCE(!disk_live(disk) && !(disk->flags & GENHD_FL_HIDDEN)))
@@ -587,17 +585,8 @@ void del_gendisk(struct gendisk *disk)
 	fsync_bdev(disk->part0);
 	__invalidate_device(disk->part0, true);
 
-	/*
-	 * Fail any new I/O.
-	 */
-	set_bit(GD_DEAD, &disk->state);
 	set_capacity(disk, 0);
 
-	/*
-	 * Prevent new I/O from crossing bio_queue_enter().
-	 */
-	blk_queue_start_drain(q);
-
 	if (!(disk->flags & GENHD_FL_HIDDEN)) {
 		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
 
@@ -619,16 +608,6 @@ void del_gendisk(struct gendisk *disk)
 		sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
 	device_del(disk_to_dev(disk));
-
-	blk_mq_freeze_queue_wait(q);
-
-	blk_sync_queue(q);
-	blk_flush_integrity();
-	/*
-	 * Allow using passthrough request again after the queue is torn down.
-	 */
-	__blk_mq_unfreeze_queue(q, true);
-
 }
 EXPORT_SYMBOL(del_gendisk);
 
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 6906a45bc761..3e9f234495e4 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -108,7 +108,6 @@ struct gendisk {
 	unsigned long state;
 #define GD_NEED_PART_SCAN		0
 #define GD_READ_ONLY			1
-#define GD_DEAD				2
 #define GD_NATIVE_CAPACITY		3
 
 	struct mutex open_mutex;	/* open/close mutex */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 01/13] block: declare blkcg_[init|exit]_queue in private header
  2022-01-22 11:10 ` [PATCH V2 01/13] block: declare blkcg_[init|exit]_queue in private header Ming Lei
@ 2022-01-24 12:59   ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 12:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Sat, Jan 22, 2022 at 07:10:42PM +0800, Ming Lei wrote:
> Both two functions are used by block core code only, so move the
> declaration into private header of block layer.

Hmm.  Given how much of blk-cgroup.h is private I wonder if moving most
of it to block/blk-cgroup.h and just keeping a small public header might
be a better idea.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 02/13] block: move initialization of q->blkg_list into blkcg_init_queue
  2022-01-22 11:10 ` [PATCH V2 02/13] block: move initialization of q->blkg_list into blkcg_init_queue Ming Lei
@ 2022-01-24 13:00   ` Christoph Hellwig
  2022-01-24 18:32   ` Bart Van Assche
  1 sibling, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Sat, Jan 22, 2022 at 07:10:43PM +0800, Ming Lei wrote:
> q->blkg_list is only used by blkcg code, so move it into
> blkcg_init_queue.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 03/13] block: move blkcg initialization/destroy into disk allocation/release handler
  2022-01-22 11:10 ` [PATCH V2 03/13] block: move blkcg initialization/destroy into disk allocation/release handler Ming Lei
@ 2022-01-24 13:02   ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

> @@ -1308,6 +1316,10 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
>  	if (xa_insert(&disk->part_tbl, 0, disk->part0, GFP_KERNEL))
>  		goto out_destroy_part_tbl;
>  
> +	/* todo: move blkcg into gendisk */
> +	if (blkcg_init_queue(q))
> +		goto out_erase_part0;
> +

I don't think having this todo comment here is helpful (even if I agree
with the move and actually half old preliminary patches for that).

With this comment dropped:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-22 11:10 ` [PATCH V2 05/13] block: only account passthrough IO from userspace Ming Lei
@ 2022-01-24 13:05   ` Christoph Hellwig
  2022-01-24 23:09     ` Ming Lei
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Sat, Jan 22, 2022 at 07:10:46PM +0800, Ming Lei wrote:
> Passthrough request from userspace has one active block_device/disk
> associated, so they can be accounted via rq->q->disk. For other
> passthrough request, there may not be disk/block_device for the queue,
> since either the queue has not a disk or the disk may be deleted
> already.
> 
> Add flag of BLK_MQ_REQ_USER_IO for only accounting passthrough request
> from userspace.

Please explain why you want to change this.

Also this is missing I/O from /dev/sg, CDROM CDDA BPC reading, the tape
drivers and bsg-lib.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 06/13] block: don't remove hctx debugfs dir from blk_mq_exit_queue
  2022-01-22 11:10 ` [PATCH V2 06/13] block: don't remove hctx debugfs dir from blk_mq_exit_queue Ming Lei
@ 2022-01-24 13:06   ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Sat, Jan 22, 2022 at 07:10:47PM +0800, Ming Lei wrote:
> The queue's top debugfs dir is removed from blk_release_queue(), so all
> hctx's debugfs dirs are removed from there. Given blk_mq_exit_queue()
> is only called from blk_cleanup_queue(), it isn't necessary to remove
> hctx debugfs from blk_mq_exit_queue().
> 
> So remove it from blk_mq_exit_queue().

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 07/13] block: move q_usage_counter release into blk_queue_release
  2022-01-22 11:10 ` [PATCH V2 07/13] block: move q_usage_counter release into blk_queue_release Ming Lei
@ 2022-01-24 13:09   ` Christoph Hellwig
  2022-01-24 19:06   ` Bart Van Assche
  1 sibling, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Sat, Jan 22, 2022 at 07:10:48PM +0800, Ming Lei wrote:
> After blk_cleanup_queue() returns, disk may not be released yet, so
> probably bio may still be submitted and ->q_usage_counter may be
> touched, so far this way seems safe, but not good from API's viewpoint.
> 
> Move the release ofq_usage_counter into blk_queue_release().

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 08/13] block: export __blk_mq_unfreeze_queue
  2022-01-22 11:10 ` [PATCH V2 08/13] block: export __blk_mq_unfreeze_queue Ming Lei
@ 2022-01-24 13:11   ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Sat, Jan 22, 2022 at 07:10:49PM +0800, Ming Lei wrote:
> blk_mq_unfreeze_queue() is used by scsi when releasing disk, so not
> necessary to unfreeze into percpu mode, then the following
> blk_cleanup_queue doesn't need to freeze queue from percpu mode, and
> the implied RCU grace period may be avoided.
> 
> Meantime move clearing QUEUE_FLAG_INIT_DONE into this API, so that
> when one disk is added, ->q_usage_counter can be switched to percpu
> mode again.

Independ of the use case (which I'll need to review in the next patches),
exporting lowlevel-helpers like this just seems like a bad idea.

If we have a use case add a wrapper that sets the force_atomic flag
and document and export that, please.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 09/13] scsi: force unfreezing queue into atomic mode
  2022-01-22 11:10 ` [PATCH V2 09/13] scsi: force unfreezing queue into atomic mode Ming Lei
@ 2022-01-24 13:15   ` Christoph Hellwig
  2022-01-24 23:21     ` Ming Lei
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi, Bart Van Assche

On Sat, Jan 22, 2022 at 07:10:50PM +0800, Ming Lei wrote:
> In scsi_disk_release() request queue is frozen for clearing
> disk->private_data, and there can't be any FS IO issued to
> this queue, and only private passthrough request will be handled, so
> force unfreezing queue into atomic mode.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/sd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 0e73c3f2f381..27f04c860f00 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3670,7 +3670,7 @@ static void scsi_disk_release(struct device *dev)
>  	 * in case multiple processes open a /dev/sd... node concurrently.
>  	 */
>  	blk_mq_freeze_queue(q);
> -	blk_mq_unfreeze_queue(q);
> +	__blk_mq_unfreeze_queue(q, true);

I think the right thing here is to drop the freeze/unfreeze pair.
Now that del_gendisk properly freezes the queue, we don't need this
protection as the issue that Bart fixed with it can't happen any more.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 10/13] block: add helper of disk_release_queue for release queue data for disk
  2022-01-22 11:10 ` [PATCH V2 10/13] block: add helper of disk_release_queue for release queue data for disk Ming Lei
@ 2022-01-24 13:16   ` Christoph Hellwig
  2022-01-24 23:27     ` Ming Lei
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Sat, Jan 22, 2022 at 07:10:51PM +0800, Ming Lei wrote:
> Add one helper of disk_release_queue for releasing queue data for disk.

Please explain what "queue data for disk" is, and why this helper is
useful.  Right now it seems to just move a few things around without a
rationale or explanation of why it is safe.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 11/13] block: move blk_exit_queue into disk_release
  2022-01-22 11:10 ` [PATCH V2 11/13] block: move blk_exit_queue into disk_release Ming Lei
@ 2022-01-24 13:22   ` Christoph Hellwig
  2022-01-24 23:38     ` Ming Lei
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Sat, Jan 22, 2022 at 07:10:52PM +0800, Ming Lei wrote:
>  3 files changed, 41 insertions(+), 17 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d51b0aa2e4e4..72ae9955cf27 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3101,6 +3101,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  	struct blk_mq_tags *drv_tags;
>  	struct page *page;
>  
> +	if (list_empty(&tags->page_list))
> +		return;

Please split this out and document why it is needed.

> +/* Unconfigure the I/O scheduler and dissociate from the cgroup controller. */
> +static void blk_exit_queue(struct request_queue *q)
> +{
> +	/*
> +	 * Since the I/O scheduler exit code may access cgroup information,
> +	 * perform I/O scheduler exit before disassociating from the block
> +	 * cgroup controller.
> +	 */
> +	if (q->elevator) {
> +		ioc_clear_queue(q);
> +
> +		mutex_lock(&q->sysfs_lock);
> +		blk_mq_sched_free_rqs(q);
> +		elevator_exit(q);
> +		mutex_unlock(&q->sysfs_lock);
> +	}
> +}

Given that this all deals with the I/O scheduler the naming seems
wrong.  It almost seems like we'd want to move ioc_clear_queue and
blk_mq_sched_free_rqs into elevator_exit to have somewhat sensible
helpers.  That would add extra locking of q->sysfs_lock for
ioc_clear_queue, but given that elevator_switch_mq already does
that, this seems harmless.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 12/13] block: move rq_qos_exit() into disk_release()
  2022-01-22 11:10 ` [PATCH V2 12/13] block: move rq_qos_exit() into disk_release() Ming Lei
@ 2022-01-24 13:28   ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Sat, Jan 22, 2022 at 07:10:53PM +0800, Ming Lei wrote:
> There can't be FS IO in disk_release(), so it is safe to move rq_qos_exit()
> there.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 13/13] block: don't drain file system I/O on del_gendisk
  2022-01-22 11:10 ` [PATCH V2 13/13] block: don't drain file system I/O on del_gendisk Ming Lei
@ 2022-01-24 13:39   ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-24 13:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Sat, Jan 22, 2022 at 07:10:54PM +0800, Ming Lei wrote:
> So not necessary to drain FS I/O in del_gendisk() any more.

But this changes a few other things as well.

For example the early set GD_DEAD flag is very useful to stop I/O
ASAP when entering del_gendisk.  Also some other work here is moved
later as well that really should be part of del_gendisk.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 02/13] block: move initialization of q->blkg_list into blkcg_init_queue
  2022-01-22 11:10 ` [PATCH V2 02/13] block: move initialization of q->blkg_list into blkcg_init_queue Ming Lei
  2022-01-24 13:00   ` Christoph Hellwig
@ 2022-01-24 18:32   ` Bart Van Assche
  1 sibling, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2022-01-24 18:32 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi

On 1/22/22 03:10, Ming Lei wrote:
> q->blkg_list is only used by blkcg code, so move it into
> blkcg_init_queue.

Should Tejun be Cc-ed for this patch?

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 07/13] block: move q_usage_counter release into blk_queue_release
  2022-01-22 11:10 ` [PATCH V2 07/13] block: move q_usage_counter release into blk_queue_release Ming Lei
  2022-01-24 13:09   ` Christoph Hellwig
@ 2022-01-24 19:06   ` Bart Van Assche
  1 sibling, 0 replies; 49+ messages in thread
From: Bart Van Assche @ 2022-01-24 19:06 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig, Jens Axboe, Martin K . Petersen
  Cc: linux-block, linux-nvme, linux-scsi

On 1/22/22 03:10, Ming Lei wrote:
> After blk_cleanup_queue() returns, disk may not be released yet, so
> probably bio may still be submitted and ->q_usage_counter may be
> touched, so far this way seems safe, but not good from API's viewpoint.
> 
> Move the release ofq_usage_counter into blk_queue_release().
                    ^^

Please change "ofq_usage_counter" into "of q_usage_counter".

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-24 13:05   ` Christoph Hellwig
@ 2022-01-24 23:09     ` Ming Lei
  2022-01-25  6:16       ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-24 23:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K . Petersen, linux-block, linux-nvme, linux-scsi

On Mon, Jan 24, 2022 at 02:05:55PM +0100, Christoph Hellwig wrote:
> On Sat, Jan 22, 2022 at 07:10:46PM +0800, Ming Lei wrote:
> > Passthrough request from userspace has one active block_device/disk
> > associated, so they can be accounted via rq->q->disk. For other
> > passthrough request, there may not be disk/block_device for the queue,
> > since either the queue has not a disk or the disk may be deleted
> > already.
> > 
> > Add flag of BLK_MQ_REQ_USER_IO for only accounting passthrough request
> > from userspace.
> 
> Please explain why you want to change this.

Please see the following code:

        /* passthrough requests can hold bios that do not have ->bi_bdev set */
        if (rq->bio && rq->bio->bi_bdev)
                rq->part = rq->bio->bi_bdev;
        else if (rq->q->disk)
                rq->part = rq->q->disk->part0;

q->disk can be cleared by disk_release() just when referring the above line, then
NULL ptr reference is caused, and similar issue with any reference to rq->part for
passthrough request sent not from userspace.

> 
> Also this is missing I/O from /dev/sg, CDROM CDDA BPC reading, the tape
> drivers and bsg-lib.

Except for CDROM CDDA BPC reading, the others don't have gendisk associated,
so they needn't such change. And it looks easy to do that for CDROM CDDA BPC
reading.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 09/13] scsi: force unfreezing queue into atomic mode
  2022-01-24 13:15   ` Christoph Hellwig
@ 2022-01-24 23:21     ` Ming Lei
  2022-01-25  7:27       ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-24 23:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K . Petersen, linux-block, linux-nvme,
	linux-scsi, Bart Van Assche

On Mon, Jan 24, 2022 at 02:15:16PM +0100, Christoph Hellwig wrote:
> On Sat, Jan 22, 2022 at 07:10:50PM +0800, Ming Lei wrote:
> > In scsi_disk_release() request queue is frozen for clearing
> > disk->private_data, and there can't be any FS IO issued to
> > this queue, and only private passthrough request will be handled, so
> > force unfreezing queue into atomic mode.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/scsi/sd.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > index 0e73c3f2f381..27f04c860f00 100644
> > --- a/drivers/scsi/sd.c
> > +++ b/drivers/scsi/sd.c
> > @@ -3670,7 +3670,7 @@ static void scsi_disk_release(struct device *dev)
> >  	 * in case multiple processes open a /dev/sd... node concurrently.
> >  	 */
> >  	blk_mq_freeze_queue(q);
> > -	blk_mq_unfreeze_queue(q);
> > +	__blk_mq_unfreeze_queue(q, true);
> 
> I think the right thing here is to drop the freeze/unfreeze pair.
> Now that del_gendisk properly freezes the queue, we don't need this
> protection as the issue that Bart fixed with it can't happen any more.

As you see, the last patch removes freeze/unfreeze pair in del_gendisk(),
which looks not very useful: it can't drain IO on bio based driver, and
del_gendisk() is supposed to provide consistent behavior for both request
and bio based driver.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 10/13] block: add helper of disk_release_queue for release queue data for disk
  2022-01-24 13:16   ` Christoph Hellwig
@ 2022-01-24 23:27     ` Ming Lei
  2022-01-25  6:17       ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-24 23:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K . Petersen, linux-block, linux-nvme, linux-scsi

On Mon, Jan 24, 2022 at 02:16:24PM +0100, Christoph Hellwig wrote:
> On Sat, Jan 22, 2022 at 07:10:51PM +0800, Ming Lei wrote:
> > Add one helper of disk_release_queue for releasing queue data for disk.
> 
> Please explain what "queue data for disk" is, and why this helper is

Here it means elevator, blk-cgroup and rq qos, all actually serve FS
IOs, maybe it should be changed to FS IOs.

> useful.  Right now it seems to just move a few things around without a
> rationale or explanation of why it is safe.

This patch just moves two function calls into the helper, and there
doesn't have functional change, so no need rationale and explanation.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 11/13] block: move blk_exit_queue into disk_release
  2022-01-24 13:22   ` Christoph Hellwig
@ 2022-01-24 23:38     ` Ming Lei
  0 siblings, 0 replies; 49+ messages in thread
From: Ming Lei @ 2022-01-24 23:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K . Petersen, linux-block, linux-nvme, linux-scsi

On Mon, Jan 24, 2022 at 02:22:21PM +0100, Christoph Hellwig wrote:
> On Sat, Jan 22, 2022 at 07:10:52PM +0800, Ming Lei wrote:
> >  3 files changed, 41 insertions(+), 17 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index d51b0aa2e4e4..72ae9955cf27 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3101,6 +3101,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> >  	struct blk_mq_tags *drv_tags;
> >  	struct page *page;
> >  
> > +	if (list_empty(&tags->page_list))
> > +		return;
> 
> Please split this out and document why it is needed.

Now blk_mq_sched_free_rqs() can be called from either blk_cleanup_queue
and disk_release(). If sched rq pool is freed in one of them, it can't
to be freed from another one.

> 
> > +/* Unconfigure the I/O scheduler and dissociate from the cgroup controller. */
> > +static void blk_exit_queue(struct request_queue *q)
> > +{
> > +	/*
> > +	 * Since the I/O scheduler exit code may access cgroup information,
> > +	 * perform I/O scheduler exit before disassociating from the block
> > +	 * cgroup controller.
> > +	 */
> > +	if (q->elevator) {
> > +		ioc_clear_queue(q);
> > +
> > +		mutex_lock(&q->sysfs_lock);
> > +		blk_mq_sched_free_rqs(q);
> > +		elevator_exit(q);
> > +		mutex_unlock(&q->sysfs_lock);
> > +	}
> > +}
> 
> Given that this all deals with the I/O scheduler the naming seems
> wrong.  It almost seems like we'd want to move ioc_clear_queue and
> blk_mq_sched_free_rqs into elevator_exit to have somewhat sensible
> helpers.

Looks good idea.



thanks, 
Ming


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-24 23:09     ` Ming Lei
@ 2022-01-25  6:16       ` Christoph Hellwig
  2022-01-25  7:19         ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-25  6:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Tue, Jan 25, 2022 at 07:09:09AM +0800, Ming Lei wrote:
> > Please explain why you want to change this.
> 
> Please see the following code:

This needs to go into the commit log.

> 
>         /* passthrough requests can hold bios that do not have ->bi_bdev set */
>         if (rq->bio && rq->bio->bi_bdev)
>                 rq->part = rq->bio->bi_bdev;
>         else if (rq->q->disk)
>                 rq->part = rq->q->disk->part0;
> 
> q->disk can be cleared by disk_release() just when referring the above line, then
> NULL ptr reference is caused, and similar issue with any reference to rq->part for
> passthrough request sent not from userspace.

So why not key off accouning off "rq->bio && rq->bio->bi_bdev"
and remove the need for the flag and the second half of the assignment
above?  That is much less error probe and removes code size.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 10/13] block: add helper of disk_release_queue for release queue data for disk
  2022-01-24 23:27     ` Ming Lei
@ 2022-01-25  6:17       ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-25  6:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Tue, Jan 25, 2022 at 07:27:37AM +0800, Ming Lei wrote:
> > Please explain what "queue data for disk" is, and why this helper is
> 
> Here it means elevator, blk-cgroup and rq qos, all actually serve FS
> IOs, maybe it should be changed to FS IOs.
> 
> > useful.  Right now it seems to just move a few things around without a
> > rationale or explanation of why it is safe.
> 
> This patch just moves two function calls into the helper, and there
> doesn't have functional change, so no need rationale and explanation.

Yes, it it does.  If you move calls into a "helper" that has a single
caller and a weird name it better have a good explanation, because the
move does not look useful at all.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-25  6:16       ` Christoph Hellwig
@ 2022-01-25  7:19         ` Christoph Hellwig
  2022-01-25  8:35           ` Ming Lei
  2022-01-25  9:09           ` Ming Lei
  0 siblings, 2 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-25  7:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Tue, Jan 25, 2022 at 07:16:34AM +0100, Christoph Hellwig wrote:
> So why not key off accouning off "rq->bio && rq->bio->bi_bdev"
> and remove the need for the flag and the second half of the assignment
> above?  That is much less error probe and removes code size.

Something like this, lightly tested:

---
From 5499d013341b492899d1fecde7680ff8ebd232e9 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 25 Jan 2022 07:29:06 +0100
Subject: block: remove the part field from struct request

All file system I/O and most userspace passthrough bios have bi_bdev set.
Switch I/O accounting to directly use the bio and stop copying it into a
separate struct request field.

This changes behavior in that e.g. /dev/sgX requests are not accounted
to the gendisk for the SCSI disk any more, which is the correct thing to
do as they never went through that gendisk, and fixes a potential race
when the disk driver is unbound while /dev/sgX I/O is in progress.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c      | 12 ++++++------
 block/blk-mq.c         | 32 +++++++++++++-------------------
 block/blk.h            |  6 +++---
 include/linux/blk-mq.h |  1 -
 4 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 4de34a332c9fd..43e46ea2f0152 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -739,11 +739,11 @@ void blk_rq_set_mixed_merge(struct request *rq)
 
 static void blk_account_io_merge_request(struct request *req)
 {
-	if (blk_do_io_stat(req)) {
-		part_stat_lock();
-		part_stat_inc(req->part, merges[op_stat_group(req_op(req))]);
-		part_stat_unlock();
-	}
+	if (!blk_do_io_stat(req))
+		return;
+	part_stat_lock();
+	part_stat_inc(req->bio->bi_bdev, merges[op_stat_group(req_op(req))]);
+	part_stat_unlock();
 }
 
 static enum elv_merge blk_try_req_merge(struct request *req,
@@ -947,7 +947,7 @@ static void blk_account_io_merge_bio(struct request *req)
 		return;
 
 	part_stat_lock();
-	part_stat_inc(req->part, merges[op_stat_group(req_op(req))]);
+	part_stat_inc(req->bio->bi_bdev, merges[op_stat_group(req_op(req))]);
 	part_stat_unlock();
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f3bf3358a3bb2..01b3862347965 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -132,10 +132,12 @@ static bool blk_mq_check_inflight(struct request *rq, void *priv,
 {
 	struct mq_inflight *mi = priv;
 
-	if ((!mi->part->bd_partno || rq->part == mi->part) &&
-	    blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
-		mi->inflight[rq_data_dir(rq)]++;
+	if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
+		return true;
+	if (mi->part->bd_partno && rq->bio && rq->bio->bi_bdev != mi->part)
+		return true;
 
+	mi->inflight[rq_data_dir(rq)]++;
 	return true;
 }
 
@@ -331,7 +333,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->tag = BLK_MQ_NO_TAG;
 	rq->internal_tag = BLK_MQ_NO_TAG;
 	rq->start_time_ns = ktime_get_ns();
-	rq->part = NULL;
 	blk_crypto_rq_set_defaults(rq);
 }
 EXPORT_SYMBOL(blk_rq_init);
@@ -368,7 +369,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->start_time_ns = ktime_get_ns();
 	else
 		rq->start_time_ns = 0;
-	rq->part = NULL;
 #ifdef CONFIG_BLK_RQ_ALLOC_TIME
 	rq->alloc_time_ns = alloc_time_ns;
 #endif
@@ -687,11 +687,11 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 
 static void blk_account_io_completion(struct request *req, unsigned int bytes)
 {
-	if (req->part && blk_do_io_stat(req)) {
+	if (blk_do_io_stat(req)) {
 		const int sgrp = op_stat_group(req_op(req));
 
 		part_stat_lock();
-		part_stat_add(req->part, sectors[sgrp], bytes >> 9);
+		part_stat_add(req->bio->bi_bdev, sectors[sgrp], bytes >> 9);
 		part_stat_unlock();
 	}
 }
@@ -859,11 +859,12 @@ EXPORT_SYMBOL_GPL(blk_update_request);
 static void __blk_account_io_done(struct request *req, u64 now)
 {
 	const int sgrp = op_stat_group(req_op(req));
+	struct block_device *bdev = req->bio->bi_bdev;
 
 	part_stat_lock();
-	update_io_ticks(req->part, jiffies, true);
-	part_stat_inc(req->part, ios[sgrp]);
-	part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
+	update_io_ticks(bdev, jiffies, true);
+	part_stat_inc(bdev, ios[sgrp]);
+	part_stat_add(bdev, nsecs[sgrp], now - req->start_time_ns);
 	part_stat_unlock();
 }
 
@@ -874,21 +875,14 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 	 * normal IO on queueing nor completion.  Accounting the
 	 * containing request is enough.
 	 */
-	if (blk_do_io_stat(req) && req->part &&
-	    !(req->rq_flags & RQF_FLUSH_SEQ))
+	if (blk_do_io_stat(req) && !(req->rq_flags & RQF_FLUSH_SEQ))
 		__blk_account_io_done(req, now);
 }
 
 static void __blk_account_io_start(struct request *rq)
 {
-	/* passthrough requests can hold bios that do not have ->bi_bdev set */
-	if (rq->bio && rq->bio->bi_bdev)
-		rq->part = rq->bio->bi_bdev;
-	else if (rq->q->disk)
-		rq->part = rq->q->disk->part0;
-
 	part_stat_lock();
-	update_io_ticks(rq->part, jiffies, false);
+	update_io_ticks(rq->bio->bi_bdev, jiffies, false);
 	part_stat_unlock();
 }
 
diff --git a/block/blk.h b/block/blk.h
index 8bd43b3ad33d5..a7a5a5435e09d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -320,12 +320,12 @@ int blk_dev_init(void);
 /*
  * Contribute to IO statistics IFF:
  *
- *	a) it's attached to a gendisk, and
- *	b) the queue had IO stats enabled when this request was started
+ *	a) the queue had IO stats enabled when this request was started, and
+ *	b) it has an assigned block_device
  */
 static inline bool blk_do_io_stat(struct request *rq)
 {
-	return (rq->rq_flags & RQF_IO_STAT) && rq->q->disk;
+	return (rq->rq_flags & RQF_IO_STAT) && rq->bio && rq->bio->bi_bdev;
 }
 
 void update_io_ticks(struct block_device *part, unsigned long now, bool end);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d319ffa59354a..81769c01e6e4b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -99,7 +99,6 @@ struct request {
 		struct request *rq_next;
 	};
 
-	struct block_device *part;
 #ifdef CONFIG_BLK_RQ_ALLOC_TIME
 	/* Time that the first bio started allocating this request. */
 	u64 alloc_time_ns;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 09/13] scsi: force unfreezing queue into atomic mode
  2022-01-24 23:21     ` Ming Lei
@ 2022-01-25  7:27       ` Christoph Hellwig
  2022-01-25  8:54         ` Ming Lei
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-25  7:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi, Bart Van Assche

On Tue, Jan 25, 2022 at 07:21:55AM +0800, Ming Lei wrote:
> > > @@ -3670,7 +3670,7 @@ static void scsi_disk_release(struct device *dev)
> > >  	 * in case multiple processes open a /dev/sd... node concurrently.
> > >  	 */
> > >  	blk_mq_freeze_queue(q);
> > > -	blk_mq_unfreeze_queue(q);
> > > +	__blk_mq_unfreeze_queue(q, true);
> > 
> > I think the right thing here is to drop the freeze/unfreeze pair.
> > Now that del_gendisk properly freezes the queue, we don't need this
> > protection as the issue that Bart fixed with it can't happen any more.
> 
> As you see, the last patch removes freeze/unfreeze pair in del_gendisk(),
> which looks not very useful: it can't drain IO on bio based driver, and
> del_gendisk() is supposed to provide consistent behavior for both request
> and bio based driver.

So what is the advantage of trying to remove the freeze from where
it belongs (common unregister code) while keeping it where it is a bandaid
(driver specific unregister code)?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-25  7:19         ` Christoph Hellwig
@ 2022-01-25  8:35           ` Ming Lei
  2022-01-25  9:09           ` Ming Lei
  1 sibling, 0 replies; 49+ messages in thread
From: Ming Lei @ 2022-01-25  8:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K . Petersen, linux-block, linux-nvme, linux-scsi

On Tue, Jan 25, 2022 at 08:19:06AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 25, 2022 at 07:16:34AM +0100, Christoph Hellwig wrote:
> > So why not key off accouning off "rq->bio && rq->bio->bi_bdev"
> > and remove the need for the flag and the second half of the assignment
> > above?  That is much less error probe and removes code size.
> 
> Something like this, lightly tested:
> 
> ---
> From 5499d013341b492899d1fecde7680ff8ebd232e9 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Tue, 25 Jan 2022 07:29:06 +0100
> Subject: block: remove the part field from struct request
> 
> All file system I/O and most userspace passthrough bios have bi_bdev set.
> Switch I/O accounting to directly use the bio and stop copying it into a
> separate struct request field.
> 
> This changes behavior in that e.g. /dev/sgX requests are not accounted
> to the gendisk for the SCSI disk any more, which is the correct thing to
> do as they never went through that gendisk, and fixes a potential race
> when the disk driver is unbound while /dev/sgX I/O is in progress.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-merge.c      | 12 ++++++------
>  block/blk-mq.c         | 32 +++++++++++++-------------------
>  block/blk.h            |  6 +++---
>  include/linux/blk-mq.h |  1 -
>  4 files changed, 22 insertions(+), 29 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 4de34a332c9fd..43e46ea2f0152 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -739,11 +739,11 @@ void blk_rq_set_mixed_merge(struct request *rq)
>  
>  static void blk_account_io_merge_request(struct request *req)
>  {
> -	if (blk_do_io_stat(req)) {
> -		part_stat_lock();
> -		part_stat_inc(req->part, merges[op_stat_group(req_op(req))]);
> -		part_stat_unlock();
> -	}
> +	if (!blk_do_io_stat(req))
> +		return;
> +	part_stat_lock();
> +	part_stat_inc(req->bio->bi_bdev, merges[op_stat_group(req_op(req))]);
> +	part_stat_unlock();
>  }
>  
>  static enum elv_merge blk_try_req_merge(struct request *req,
> @@ -947,7 +947,7 @@ static void blk_account_io_merge_bio(struct request *req)
>  		return;
>  
>  	part_stat_lock();
> -	part_stat_inc(req->part, merges[op_stat_group(req_op(req))]);
> +	part_stat_inc(req->bio->bi_bdev, merges[op_stat_group(req_op(req))]);
>  	part_stat_unlock();
>  }
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f3bf3358a3bb2..01b3862347965 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -132,10 +132,12 @@ static bool blk_mq_check_inflight(struct request *rq, void *priv,
>  {
>  	struct mq_inflight *mi = priv;
>  
> -	if ((!mi->part->bd_partno || rq->part == mi->part) &&
> -	    blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
> -		mi->inflight[rq_data_dir(rq)]++;
> +	if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
> +		return true;
> +	if (mi->part->bd_partno && rq->bio && rq->bio->bi_bdev != mi->part)
> +		return true;
>  
> +	mi->inflight[rq_data_dir(rq)]++;
>  	return true;
>  }
>  
> @@ -331,7 +333,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->tag = BLK_MQ_NO_TAG;
>  	rq->internal_tag = BLK_MQ_NO_TAG;
>  	rq->start_time_ns = ktime_get_ns();
> -	rq->part = NULL;
>  	blk_crypto_rq_set_defaults(rq);
>  }
>  EXPORT_SYMBOL(blk_rq_init);
> @@ -368,7 +369,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  		rq->start_time_ns = ktime_get_ns();
>  	else
>  		rq->start_time_ns = 0;
> -	rq->part = NULL;
>  #ifdef CONFIG_BLK_RQ_ALLOC_TIME
>  	rq->alloc_time_ns = alloc_time_ns;
>  #endif
> @@ -687,11 +687,11 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
>  
>  static void blk_account_io_completion(struct request *req, unsigned int bytes)
>  {
> -	if (req->part && blk_do_io_stat(req)) {
> +	if (blk_do_io_stat(req)) {
>  		const int sgrp = op_stat_group(req_op(req));
>  
>  		part_stat_lock();
> -		part_stat_add(req->part, sectors[sgrp], bytes >> 9);
> +		part_stat_add(req->bio->bi_bdev, sectors[sgrp], bytes >> 9);
>  		part_stat_unlock();
>  	}
>  }
> @@ -859,11 +859,12 @@ EXPORT_SYMBOL_GPL(blk_update_request);
>  static void __blk_account_io_done(struct request *req, u64 now)
>  {
>  	const int sgrp = op_stat_group(req_op(req));
> +	struct block_device *bdev = req->bio->bi_bdev;

Then you need to move account_io_done before calling blk_update_request().


thanks,
Ming


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 09/13] scsi: force unfreezing queue into atomic mode
  2022-01-25  7:27       ` Christoph Hellwig
@ 2022-01-25  8:54         ` Ming Lei
  2022-01-26  8:15           ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-25  8:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K . Petersen, linux-block, linux-nvme,
	linux-scsi, Bart Van Assche

On Tue, Jan 25, 2022 at 08:27:39AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 25, 2022 at 07:21:55AM +0800, Ming Lei wrote:
> > > > @@ -3670,7 +3670,7 @@ static void scsi_disk_release(struct device *dev)
> > > >  	 * in case multiple processes open a /dev/sd... node concurrently.
> > > >  	 */
> > > >  	blk_mq_freeze_queue(q);
> > > > -	blk_mq_unfreeze_queue(q);
> > > > +	__blk_mq_unfreeze_queue(q, true);
> > > 
> > > I think the right thing here is to drop the freeze/unfreeze pair.
> > > Now that del_gendisk properly freezes the queue, we don't need this
> > > protection as the issue that Bart fixed with it can't happen any more.
> > 
> > As you see, the last patch removes freeze/unfreeze pair in del_gendisk(),
> > which looks not very useful: it can't drain IO on bio based driver, and
> > del_gendisk() is supposed to provide consistent behavior for both request
> > and bio based driver.
> 
> So what is the advantage of trying to remove the freeze from where
> it belongs (common unregister code) while keeping it where it is a bandaid
> (driver specific unregister code)?

freeze in common unregister code is actually not good, because it provide
nothing for bio based driver, so we can't move blk-cgroup shutdown into
del_gendisk. Also we can't move elevator shutdown to del_gendisk for
similar reason.

Secondly freeze is pretty slow in percpu mode, so why slow down removing every
disk just for scsi's bandaid?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-25  7:19         ` Christoph Hellwig
  2022-01-25  8:35           ` Ming Lei
@ 2022-01-25  9:09           ` Ming Lei
  2022-01-26  5:50             ` Christoph Hellwig
  1 sibling, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-25  9:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K . Petersen, linux-block, linux-nvme, linux-scsi

On Tue, Jan 25, 2022 at 08:19:06AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 25, 2022 at 07:16:34AM +0100, Christoph Hellwig wrote:
> > So why not key off accouning off "rq->bio && rq->bio->bi_bdev"
> > and remove the need for the flag and the second half of the assignment
> > above?  That is much less error probe and removes code size.
> 
> Something like this, lightly tested:
> 

Follows another simple way by accounting all request with bio attached,
except for requests with kernel buffer.


diff --git a/block/blk-map.c b/block/blk-map.c
index 4526adde0156..1210b51c62ae 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -630,6 +630,8 @@ int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
 	struct bio *bio;
 	int ret;
 
+	rq->rq_flags &= ~RQF_IO_STAT;
+
 	if (len > (queue_max_hw_sectors(q) << 9))
 		return -EINVAL;
 	if (!len || !kbuf)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 72ae9955cf27..eac589d2c340 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -903,7 +903,7 @@ static void __blk_account_io_start(struct request *rq)
 	/* passthrough requests can hold bios that do not have ->bi_bdev set */
 	if (rq->bio && rq->bio->bi_bdev)
 		rq->part = rq->bio->bi_bdev;
-	else if (rq->q->disk)
+	else if (rq->q->disk && rq->bio)
 		rq->part = rq->q->disk->part0;
 
 	part_stat_lock();


Thanks,
Ming


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-25  9:09           ` Ming Lei
@ 2022-01-26  5:50             ` Christoph Hellwig
  2022-01-26  7:21               ` Ming Lei
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-26  5:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Tue, Jan 25, 2022 at 05:09:42PM +0800, Ming Lei wrote:
> Follows another simple way by accounting all request with bio attached,
> except for requests with kernel buffer.

> -	else if (rq->q->disk)
> +	else if (rq->q->disk && rq->bio)
>  		rq->part = rq->q->disk->part0;

Most passthrough requests will have a bio, so you'll still use e.g.
the sd gendisk for sg request here.

I think the right way would be to just remove this branch entirely.
This means we only account bios with a block_device, which implies
they have a gendisk.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-26  5:50             ` Christoph Hellwig
@ 2022-01-26  7:21               ` Ming Lei
  2022-01-26  8:10                 ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-26  7:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K . Petersen, linux-block, linux-nvme, linux-scsi

On Wed, Jan 26, 2022 at 06:50:03AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 25, 2022 at 05:09:42PM +0800, Ming Lei wrote:
> > Follows another simple way by accounting all request with bio attached,
> > except for requests with kernel buffer.
> 
> > -	else if (rq->q->disk)
> > +	else if (rq->q->disk && rq->bio)
> >  		rq->part = rq->q->disk->part0;
> 
> Most passthrough requests will have a bio, so you'll still use e.g.
> the sd gendisk for sg request here.
> 
> I think the right way would be to just remove this branch entirely.
> This means we only account bios with a block_device, which implies
> they have a gendisk.

That will not account userspace IO, and people may complain.

We can just account passthrough request from userspace by the patch
in my last email.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-26  7:21               ` Ming Lei
@ 2022-01-26  8:10                 ` Christoph Hellwig
  2022-01-26  8:33                   ` Ming Lei
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-26  8:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Wed, Jan 26, 2022 at 03:21:04PM +0800, Ming Lei wrote:
> > I think the right way would be to just remove this branch entirely.
> > This means we only account bios with a block_device, which implies
> > they have a gendisk.
> 
> That will not account userspace IO, and people may complain.
> 
> We can just account passthrough request from userspace by the patch
> in my last email.

Let's take a step back:  what I/O do we want to account, and how
do we want to archive that?

Assuming accounting is enabled:

 - current mainline accounts all I/O one queues that have a gendisk
 - your original patch accounts file system I/O and some passthrough I/O
   that has a special flag set

Dropping the conditional to grab a bdev from the queue leaves us with
the following rule:

 - all I/O that has a bio and bdev is accounted.  This requires
   passthrough I/O to explicitly set the bdev in case we haven't
   done so, and it requires them to have a bio at all

I guess you are worried about the latter conditionin that we stop
accounting for no data transfer passthrough commands?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 09/13] scsi: force unfreezing queue into atomic mode
  2022-01-25  8:54         ` Ming Lei
@ 2022-01-26  8:15           ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-26  8:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi, Bart Van Assche

On Tue, Jan 25, 2022 at 04:54:43PM +0800, Ming Lei wrote:
> > So what is the advantage of trying to remove the freeze from where
> > it belongs (common unregister code) while keeping it where it is a bandaid
> > (driver specific unregister code)?
> 
> freeze in common unregister code is actually not good, because it provide
> nothing for bio based driver, so we can't move blk-cgroup shutdown into
> del_gendisk. Also we can't move elevator shutdown to del_gendisk for
> similar reason.
> 
> Secondly freeze is pretty slow in percpu mode, so why slow down removing every
> disk just for scsi's bandaid?

I'd frame this differently:

 - del_gendisk is the right place to freeze the queue, as that is where the
   gendisk is unregistered and all fs I/O needs to stop.  If we don't get
   all aspects right we need to fix.  As mentioned I'm already looking into
   keeping a reference for the bio life time for bio based drivers.
 - SCSI is the only driver that ever submits passthrough I/O without the
   gendisk.  So doing an additional freeze during request_queue teardown
   is only needed for SCSI and we can eventually remove that for everyone
   else.  And most of your series already does really good work towards that
   goal!

> 
> 
> Thanks,
> Ming
---end quoted text---

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-26  8:10                 ` Christoph Hellwig
@ 2022-01-26  8:33                   ` Ming Lei
  2022-01-26  8:49                     ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-26  8:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K . Petersen, linux-block, linux-nvme, linux-scsi

On Wed, Jan 26, 2022 at 09:10:52AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 26, 2022 at 03:21:04PM +0800, Ming Lei wrote:
> > > I think the right way would be to just remove this branch entirely.
> > > This means we only account bios with a block_device, which implies
> > > they have a gendisk.
> > 
> > That will not account userspace IO, and people may complain.
> > 
> > We can just account passthrough request from userspace by the patch
> > in my last email.
> 
> Let's take a step back:  what I/O do we want to account, and how
> do we want to archive that?

FS IO, and passthrough IO from userspace at least, since
that is what user cares.

Also the bdev/disk is guaranteed to be live when this userspace
passthrough IO is inflight.

> 
> Assuming accounting is enabled:
> 
>  - current mainline accounts all I/O one queues that have a gendisk
>  - your original patch accounts file system I/O and some passthrough I/O
>    that has a special flag set

The special flag is just for recognizing userspace passthrough IO.

> 
> Dropping the conditional to grab a bdev from the queue leaves us with
> the following rule:
> 
>  - all I/O that has a bio and bdev is accounted.  This requires
>    passthrough I/O to explicitly set the bdev in case we haven't
>    done so, and it requires them to have a bio at all

That is basically to make bio->bi_bdev points to part0, and the
problem is that you have to make sure that part0 won't be released
when this request is inflight.

> 
> I guess you are worried about the latter conditionin that we stop
> accounting for no data transfer passthrough commands?

No, I meant that bio->bi_bdev isn't setup yet for passthrough request,
and not sure that can be done easily.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-26  8:33                   ` Ming Lei
@ 2022-01-26  8:49                     ` Christoph Hellwig
  2022-01-26  9:59                       ` Ming Lei
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-26  8:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Wed, Jan 26, 2022 at 04:33:54PM +0800, Ming Lei wrote:
> > I guess you are worried about the latter conditionin that we stop
> > accounting for no data transfer passthrough commands?
> 
> No, I meant that bio->bi_bdev isn't setup yet for passthrough request,
> and not sure that can be done easily.

Take a look at e.g. nvme_submit_user_cmd and iblock_get_bio.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-26  8:49                     ` Christoph Hellwig
@ 2022-01-26  9:59                       ` Ming Lei
  2022-01-26 16:37                         ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2022-01-26  9:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K . Petersen, linux-block, linux-nvme, linux-scsi

On Wed, Jan 26, 2022 at 09:49:50AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 26, 2022 at 04:33:54PM +0800, Ming Lei wrote:
> > > I guess you are worried about the latter conditionin that we stop
> > > accounting for no data transfer passthrough commands?
> > 
> > No, I meant that bio->bi_bdev isn't setup yet for passthrough request,
> > and not sure that can be done easily.
> 
> Take a look at e.g. nvme_submit_user_cmd and iblock_get_bio.

nvme just sets part0 to rq->bio, which is fine since nvme doesn't
support partial completion.

The simplest way could be to assign bio->bi_bdev with q->disk->part0 in both
bio_copy_user_iov() and bio_map_user_iov(), which should cover most of cases.
Given user io is always on device instead of partition even though the
command is sent via partition bdev.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 05/13] block: only account passthrough IO from userspace
  2022-01-26  9:59                       ` Ming Lei
@ 2022-01-26 16:37                         ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2022-01-26 16:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi

On Wed, Jan 26, 2022 at 05:59:08PM +0800, Ming Lei wrote:
> nvme just sets part0 to rq->bio, which is fine since nvme doesn't
> support partial completion.
> 
> The simplest way could be to assign bio->bi_bdev with q->disk->part0 in both
> bio_copy_user_iov() and bio_map_user_iov(), which should cover most of cases.
> Given user io is always on device instead of partition even though the
> command is sent via partition bdev.

This would be easiest, but it would also assign them when called from
the SG driver.  And that means these I/Os could be in flight when
detaching a SCSI ULP.  At least without the freeze in del_gendisk
(or the local one in case of the sd driver).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 04/13] block/wbt: fix negative inflight counter when remove scsi device
  2022-01-22 11:10 ` [PATCH V2 04/13] block/wbt: fix negative inflight counter when remove scsi device Ming Lei
@ 2022-02-17  7:45   ` Christoph Hellwig
  2022-02-17 14:53     ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2022-02-17  7:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Ming Lei, Martin K . Petersen, linux-block,
	linux-nvme, linux-scsi, Laibin Qiu, Ming Lei

Jens, can you pick this up for the block-5.17 tree?

On Sat, Jan 22, 2022 at 07:10:45PM +0800, Ming Lei wrote:
> From: Laibin Qiu <qiulaibin@huawei.com>
> 
> Now that we disable wbt by set WBT_STATE_OFF_DEFAULT in
> wbt_disable_default() when switch elevator to bfq. And when
> we remove scsi device, wbt will be enabled by wbt_enable_default.
> If it become false positive between wbt_wait() and wbt_track()
> when submit write request.
> 
> The following is the scenario that triggered the problem.
> 
> T1                          T2                           T3
>                             elevator_switch_mq
>                             bfq_init_queue
>                             wbt_disable_default <= Set
>                             rwb->enable_state (OFF)
> Submit_bio
> blk_mq_make_request
> rq_qos_throttle
> <= rwb->enable_state (OFF)
>                                                          scsi_remove_device
>                                                          sd_remove
>                                                          del_gendisk
>                                                          blk_unregister_queue
>                                                          elv_unregister_queue
>                                                          wbt_enable_default
>                                                          <= Set rwb->enable_state (ON)
> q_qos_track
> <= rwb->enable_state (ON)
> ^^^^^^ this request will mark WBT_TRACKED without inflight add and will
> lead to drop rqw->inflight to -1 in wbt_done() which will trigger IO hung.
> 
> Fix this by move wbt_enable_default() from elv_unregister to
> bfq_exit_queue(). Only re-enable wbt when bfq exit.
> 
> Fixes: 76a8040817b4b ("blk-wbt: make sure throttle is enabled properly")
> 
> Remove oneline stale comment, and kill one oneshot local variable.
> 
> Signed-off-by: Ming Lei <ming.lei@rehdat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Link: https://lore.kernel.org/linux-block/20211214133103.551813-1-qiulaibin@huawei.com/
> Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
> ---
>  block/bfq-iosched.c | 2 ++
>  block/elevator.c    | 2 --
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 0c612a911696..36a66e97e3c2 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -7018,6 +7018,8 @@ static void bfq_exit_queue(struct elevator_queue *e)
>  	spin_unlock_irq(&bfqd->lock);
>  #endif
>  
> +	wbt_enable_default(bfqd->queue);
> +
>  	kfree(bfqd);
>  }
>  
> diff --git a/block/elevator.c b/block/elevator.c
> index ec98aed39c4f..482df2a350fc 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -525,8 +525,6 @@ void elv_unregister_queue(struct request_queue *q)
>  		kobject_del(&e->kobj);
>  
>  		e->registered = 0;
> -		/* Re-enable throttling in case elevator disabled it */
> -		wbt_enable_default(q);
>  	}
>  }
>  
> -- 
> 2.31.1
---end quoted text---

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V2 04/13] block/wbt: fix negative inflight counter when remove scsi device
  2022-02-17  7:45   ` Christoph Hellwig
@ 2022-02-17 14:53     ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2022-02-17 14:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Martin K . Petersen, linux-block, linux-nvme,
	linux-scsi, Laibin Qiu, Ming Lei

On 2/17/22 12:45 AM, Christoph Hellwig wrote:
> Jens, can you pick this up for the block-5.17 tree?

Done

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2022-02-17 14:53 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-22 11:10 [PATCH V2 00/13] block: don't drain file system I/O on del_gendisk Ming Lei
2022-01-22 11:10 ` [PATCH V2 01/13] block: declare blkcg_[init|exit]_queue in private header Ming Lei
2022-01-24 12:59   ` Christoph Hellwig
2022-01-22 11:10 ` [PATCH V2 02/13] block: move initialization of q->blkg_list into blkcg_init_queue Ming Lei
2022-01-24 13:00   ` Christoph Hellwig
2022-01-24 18:32   ` Bart Van Assche
2022-01-22 11:10 ` [PATCH V2 03/13] block: move blkcg initialization/destroy into disk allocation/release handler Ming Lei
2022-01-24 13:02   ` Christoph Hellwig
2022-01-22 11:10 ` [PATCH V2 04/13] block/wbt: fix negative inflight counter when remove scsi device Ming Lei
2022-02-17  7:45   ` Christoph Hellwig
2022-02-17 14:53     ` Jens Axboe
2022-01-22 11:10 ` [PATCH V2 05/13] block: only account passthrough IO from userspace Ming Lei
2022-01-24 13:05   ` Christoph Hellwig
2022-01-24 23:09     ` Ming Lei
2022-01-25  6:16       ` Christoph Hellwig
2022-01-25  7:19         ` Christoph Hellwig
2022-01-25  8:35           ` Ming Lei
2022-01-25  9:09           ` Ming Lei
2022-01-26  5:50             ` Christoph Hellwig
2022-01-26  7:21               ` Ming Lei
2022-01-26  8:10                 ` Christoph Hellwig
2022-01-26  8:33                   ` Ming Lei
2022-01-26  8:49                     ` Christoph Hellwig
2022-01-26  9:59                       ` Ming Lei
2022-01-26 16:37                         ` Christoph Hellwig
2022-01-22 11:10 ` [PATCH V2 06/13] block: don't remove hctx debugfs dir from blk_mq_exit_queue Ming Lei
2022-01-24 13:06   ` Christoph Hellwig
2022-01-22 11:10 ` [PATCH V2 07/13] block: move q_usage_counter release into blk_queue_release Ming Lei
2022-01-24 13:09   ` Christoph Hellwig
2022-01-24 19:06   ` Bart Van Assche
2022-01-22 11:10 ` [PATCH V2 08/13] block: export __blk_mq_unfreeze_queue Ming Lei
2022-01-24 13:11   ` Christoph Hellwig
2022-01-22 11:10 ` [PATCH V2 09/13] scsi: force unfreezing queue into atomic mode Ming Lei
2022-01-24 13:15   ` Christoph Hellwig
2022-01-24 23:21     ` Ming Lei
2022-01-25  7:27       ` Christoph Hellwig
2022-01-25  8:54         ` Ming Lei
2022-01-26  8:15           ` Christoph Hellwig
2022-01-22 11:10 ` [PATCH V2 10/13] block: add helper of disk_release_queue for release queue data for disk Ming Lei
2022-01-24 13:16   ` Christoph Hellwig
2022-01-24 23:27     ` Ming Lei
2022-01-25  6:17       ` Christoph Hellwig
2022-01-22 11:10 ` [PATCH V2 11/13] block: move blk_exit_queue into disk_release Ming Lei
2022-01-24 13:22   ` Christoph Hellwig
2022-01-24 23:38     ` Ming Lei
2022-01-22 11:10 ` [PATCH V2 12/13] block: move rq_qos_exit() into disk_release() Ming Lei
2022-01-24 13:28   ` Christoph Hellwig
2022-01-22 11:10 ` [PATCH V2 13/13] block: don't drain file system I/O on del_gendisk Ming Lei
2022-01-24 13:39   ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.