* move more work to disk_release v2
@ 2022-02-27 17:21 Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 01/14] blk-mq: do not include passthrough requests in I/O accounting Christoph Hellwig
                   ` (15 more replies)
  0 siblings, 16 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

Hi all,

this series resurrects and forward ports larger parts of the
"block: don't drain file system I/O on del_gendisk" series from Ming.
It does not remove the draining in del_gendisk, but instead the one in
the sd driver, which always was a bit ad-hoc.  As part of that, sd and
sr are switched to the new ->free_disk method to avoid having to clear
disk->private_data, and the way to look up the SCSI ULP is cleaned up
as well.

Git branch:

    git://git.infradead.org/users/hch/block.git freeze-5.18

Gitweb:

    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/freeze-5.18

Changes since v1:
 - fix a refcounting bug in sd
 - rename a function

Diffstat:
 block/blk-core.c           |    7 --
 block/blk-mq.c             |   10 +--
 block/blk-sysfs.c          |   25 --------
 block/blk.h                |    2 
 block/elevator.c           |    7 +-
 block/genhd.c              |   38 ++++++++++++-
 drivers/scsi/sd.c          |  114 +++++++++------------------------------
 drivers/scsi/sd.h          |   13 +++-
 drivers/scsi/sr.c          |  129 +++++++++------------------------------------
 drivers/scsi/sr.h          |    5 -
 drivers/scsi/st.c          |    1 
 drivers/scsi/st.h          |    1 
 include/scsi/scsi_cmnd.h   |    9 ---
 include/scsi/scsi_driver.h |    9 ++-
 14 files changed, 117 insertions(+), 253 deletions(-)


* [PATCH 01/14] blk-mq: do not include passthrough requests in I/O accounting
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 02/14] blk-mq: handle already freed tags gracefully in blk_mq_free_rqs Christoph Hellwig
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

I/O accounting buckets I/O into the read/write/discard categories into
which passthrough I/O does not fit at all.  It also accounts to the
block_device, which may not even exist for passthrough I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 6 +-----
 block/blk.h    | 2 +-
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a05ce77250316..ee80853473d1e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -883,11 +883,7 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 
 static void __blk_account_io_start(struct request *rq)
 {
-	/* passthrough requests can hold bios that do not have ->bi_bdev set */
-	if (rq->bio && rq->bio->bi_bdev)
-		rq->part = rq->bio->bi_bdev;
-	else if (rq->q->disk)
-		rq->part = rq->q->disk->part0;
+	rq->part = rq->bio->bi_bdev;
 
 	part_stat_lock();
 	update_io_ticks(rq->part, jiffies, false);
diff --git a/block/blk.h b/block/blk.h
index ebaa59ca46ca6..6f21859c7f0ff 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -325,7 +325,7 @@ int blk_dev_init(void);
  */
 static inline bool blk_do_io_stat(struct request *rq)
 {
-	return (rq->rq_flags & RQF_IO_STAT) && rq->q->disk;
+	return (rq->rq_flags & RQF_IO_STAT) && !blk_rq_is_passthrough(rq);
 }
 
 void update_io_ticks(struct block_device *part, unsigned long now, bool end);
-- 
2.30.2
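
The accounting predicate changed in this patch can be sketched as a small
userspace model.  This is only an illustration under stated assumptions:
every model_* and MODEL_* name below is hypothetical and stands in for the
kernel's request flags, opcodes and blk_rq_is_passthrough(); none of it is
kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's request opcodes and flags. */
enum model_op {
	MODEL_OP_READ,
	MODEL_OP_WRITE,
	MODEL_OP_DRV_IN,	/* passthrough, driver-private command */
	MODEL_OP_DRV_OUT,	/* passthrough, driver-private command */
};

#define MODEL_RQF_IO_STAT	(1u << 0)

struct model_request {
	unsigned int rq_flags;
	enum model_op op;
};

/* Passthrough requests carry driver-private commands, not file system I/O. */
static bool model_rq_is_passthrough(const struct model_request *rq)
{
	return rq->op == MODEL_OP_DRV_IN || rq->op == MODEL_OP_DRV_OUT;
}

/* Mirrors the new check: account only flagged, non-passthrough requests. */
static bool model_do_io_stat(const struct model_request *rq)
{
	assert(rq);
	return (rq->rq_flags & MODEL_RQF_IO_STAT) &&
		!model_rq_is_passthrough(rq);
}
```

With this model a flagged read is accounted, while a flagged passthrough
request and an unflagged write are both skipped.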



* [PATCH 02/14] blk-mq: handle already freed tags gracefully in blk_mq_free_rqs
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 01/14] blk-mq: do not include passthrough requests in I/O accounting Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 03/14] scsi: don't use disk->private_data to find the scsi_driver Christoph Hellwig
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

From: Ming Lei <ming.lei@redhat.com>

To simplify further changes, allow blk_mq_free_rqs to be called twice on
a queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: split out from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee80853473d1e..63e2d3fd60946 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3061,6 +3061,9 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	struct blk_mq_tags *drv_tags;
 	struct page *page;
 
+	if (list_empty(&tags->page_list))
+		return;
+
 	if (blk_mq_is_shared_tags(set->flags))
 		drv_tags = set->shared_tags;
 	else
-- 
2.30.2
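
The "return early if there is nothing left to free" idea can be sketched in
userspace as follows.  All names are hypothetical, and a plain singly linked
list stands in for tags->page_list; the sketch only shows why a second call
becomes a no-op.

```c
#include <assert.h>
#include <stdlib.h>

struct model_page {
	struct model_page *next;
};

struct model_tags {
	struct model_page *page_list;	/* NULL means nothing left to free */
	unsigned int frees;		/* pages released so far, for illustration */
};

static int model_alloc_rqs(struct model_tags *tags, unsigned int nr_pages)
{
	for (unsigned int i = 0; i < nr_pages; i++) {
		struct model_page *page = malloc(sizeof(*page));

		if (!page)
			return -1;
		page->next = tags->page_list;
		tags->page_list = page;
	}
	return 0;
}

static void model_free_rqs(struct model_tags *tags)
{
	assert(tags);

	/*
	 * Mirrors the added check.  In this tiny model the loop below would
	 * cope with an empty list anyway; the real function does more work
	 * before walking the list, which is why it bails out first.
	 */
	if (!tags->page_list)
		return;

	while (tags->page_list) {
		struct model_page *page = tags->page_list;

		tags->page_list = page->next;
		free(page);
		tags->frees++;
	}
}
```

Calling model_free_rqs() twice on the same model_tags releases each page
exactly once; the second call returns immediately.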



* [PATCH 03/14] scsi: don't use disk->private_data to find the scsi_driver
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 01/14] blk-mq: do not include passthrough requests in I/O accounting Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 02/14] blk-mq: handle already freed tags gracefully in blk_mq_free_rqs Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 04/14] sd: rename the scsi_disk.dev field Christoph Hellwig
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

Requiring every ULP to have the scsi_driver as first member of the
private data is rather fragile and not necessary anyway.  Just use
the driver hanging off the SCSI device instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c          | 3 +--
 drivers/scsi/sd.h          | 3 +--
 drivers/scsi/sr.c          | 5 ++---
 drivers/scsi/sr.h          | 1 -
 drivers/scsi/st.c          | 1 -
 drivers/scsi/st.h          | 1 -
 include/scsi/scsi_cmnd.h   | 9 ---------
 include/scsi/scsi_driver.h | 9 +++++++--
 8 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 2d648d27bfd71..2a1e19e871d30 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3515,7 +3515,6 @@ static int sd_probe(struct device *dev)
 	}
 
 	sdkp->device = sdp;
-	sdkp->driver = &sd_template;
 	sdkp->disk = gd;
 	sdkp->index = index;
 	sdkp->max_retries = SD_MAX_RETRIES;
@@ -3548,7 +3547,7 @@ static int sd_probe(struct device *dev)
 	gd->minors = SD_MINORS;
 
 	gd->fops = &sd_fops;
-	gd->private_data = &sdkp->driver;
+	gd->private_data = sdkp;
 
 	/* defaults, until the device tells us otherwise */
 	sdp->sector_size = 512;
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 2e5932bde43d1..303aa1c23aefb 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -68,7 +68,6 @@ enum {
 };
 
 struct scsi_disk {
-	struct scsi_driver *driver;	/* always &sd_template */
 	struct scsi_device *device;
 	struct device	dev;
 	struct gendisk	*disk;
@@ -131,7 +130,7 @@ struct scsi_disk {
 
 static inline struct scsi_disk *scsi_disk(struct gendisk *disk)
 {
-	return container_of(disk->private_data, struct scsi_disk, driver);
+	return disk->private_data;
 }
 
 #define sd_printk(prefix, sdsk, fmt, a...)				\
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index f925b1f1f9ada..569bda76a5175 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -147,7 +147,7 @@ static void sr_kref_release(struct kref *kref);
 
 static inline struct scsi_cd *scsi_cd(struct gendisk *disk)
 {
-	return container_of(disk->private_data, struct scsi_cd, driver);
+	return disk->private_data;
 }
 
 static int sr_runtime_suspend(struct device *dev)
@@ -692,7 +692,6 @@ static int sr_probe(struct device *dev)
 
 	cd->device = sdev;
 	cd->disk = disk;
-	cd->driver = &sr_template;
 	cd->capacity = 0x1fffff;
 	cd->device->changed = 1;	/* force recheck CD type */
 	cd->media_present = 1;
@@ -713,7 +712,7 @@ static int sr_probe(struct device *dev)
 	sr_vendor_init(cd);
 
 	set_capacity(disk, cd->capacity);
-	disk->private_data = &cd->driver;
+	disk->private_data = cd;
 
 	if (register_cdrom(disk, &cd->cdi))
 		goto fail_minor;
diff --git a/drivers/scsi/sr.h b/drivers/scsi/sr.h
index 1609f02ed29ac..d80af3fcb6f97 100644
--- a/drivers/scsi/sr.h
+++ b/drivers/scsi/sr.h
@@ -32,7 +32,6 @@ struct scsi_device;
 
 
 typedef struct scsi_cd {
-	struct scsi_driver *driver;
 	unsigned capacity;	/* size in blocks                       */
 	struct scsi_device *device;
 	unsigned int vendor;	/* vendor code, see sr_vendor.c         */
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index e869e90e05afe..ebe9412c86f43 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4276,7 +4276,6 @@ static int st_probe(struct device *dev)
 		goto out_buffer_free;
 	}
 	kref_init(&tpnt->kref);
-	tpnt->driver = &st_template;
 
 	tpnt->device = SDp;
 	if (SDp->scsi_level <= 2)
diff --git a/drivers/scsi/st.h b/drivers/scsi/st.h
index c0ef0d9aaf8a2..7a68eaba7e810 100644
--- a/drivers/scsi/st.h
+++ b/drivers/scsi/st.h
@@ -117,7 +117,6 @@ struct scsi_tape_stats {
 
 /* The tape drive descriptor */
 struct scsi_tape {
-	struct scsi_driver *driver;
 	struct scsi_device *device;
 	struct mutex lock;	/* For serialization */
 	struct completion wait;	/* For SCSI commands */
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index 6794d7322cbde..e3a4c67794b14 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -13,7 +13,6 @@
 #include <scsi/scsi_request.h>
 
 struct Scsi_Host;
-struct scsi_driver;
 
 /*
  * MAX_COMMAND_SIZE is:
@@ -159,14 +158,6 @@ static inline void *scsi_cmd_priv(struct scsi_cmnd *cmd)
 	return cmd + 1;
 }
 
-/* make sure not to use it with passthrough commands */
-static inline struct scsi_driver *scsi_cmd_to_driver(struct scsi_cmnd *cmd)
-{
-	struct request *rq = scsi_cmd_to_rq(cmd);
-
-	return *(struct scsi_driver **)rq->q->disk->private_data;
-}
-
 void scsi_done(struct scsi_cmnd *cmd);
 
 extern void scsi_finish_command(struct scsi_cmnd *cmd);
diff --git a/include/scsi/scsi_driver.h b/include/scsi/scsi_driver.h
index 6dffa8555a390..4ce1988b2ba01 100644
--- a/include/scsi/scsi_driver.h
+++ b/include/scsi/scsi_driver.h
@@ -4,11 +4,10 @@
 
 #include <linux/blk_types.h>
 #include <linux/device.h>
+#include <scsi/scsi_cmnd.h>
 
 struct module;
 struct request;
-struct scsi_cmnd;
-struct scsi_device;
 
 struct scsi_driver {
 	struct device_driver	gendrv;
@@ -31,4 +30,10 @@ extern int scsi_register_interface(struct class_interface *);
 #define scsi_unregister_interface(intf) \
 	class_interface_unregister(intf)
 
+/* make sure not to use it with passthrough commands */
+static inline struct scsi_driver *scsi_cmd_to_driver(struct scsi_cmnd *cmd)
+{
+	return to_scsi_driver(cmd->device->sdev_gendev.driver);
+}
+
 #endif /* _SCSI_SCSI_DRIVER_H */
-- 
2.30.2
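
The difference between the two lookup schemes can be sketched in userspace.
This is an illustration only: the model_* types are hypothetical and much
smaller than the real scsi_disk, scsi_device and scsi_driver structures, and
model_container_of is a simplified version of the kernel macro.

```c
#include <assert.h>
#include <stddef.h>

#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct model_driver { const char *name; };
struct model_device { struct model_driver *driver; };

/* Old layout: private_data pointed at the embedded driver pointer. */
struct model_disk_old {
	struct model_driver *driver;	/* had to exist in every ULP */
	struct model_device *device;
};

/* New layout: no driver member, private_data points at the struct itself. */
struct model_disk_new {
	struct model_device *device;
};

static struct model_disk_old *old_lookup(void *private_data)
{
	return model_container_of(private_data, struct model_disk_old, driver);
}

static struct model_driver *old_driver(void *private_data)
{
	/* Fragile: silently assumes private_data points at a driver pointer. */
	return *(struct model_driver **)private_data;
}

static struct model_disk_new *new_lookup(void *private_data)
{
	return private_data;
}

static struct model_driver *new_driver(const struct model_device *dev)
{
	assert(dev);
	return dev->driver;	/* the driver hangs off the device instead */
}
```

In the new scheme the private data needs no particular layout, because the
driver is found through the device rather than through a cast.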



* [PATCH 04/14] sd: rename the scsi_disk.dev field
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 03/14] scsi: don't use disk->private_data to find the scsi_driver Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 05/14] sd: call sd_zbc_release_disk before releasing the scsi_device reference Christoph Hellwig
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

dev is very hard to grep for.  Give the field a more descriptive name and
document its purpose.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 22 +++++++++++-----------
 drivers/scsi/sd.h | 10 ++++++++--
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 2a1e19e871d30..7479e7cb36b43 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -672,7 +672,7 @@ static struct scsi_disk *scsi_disk_get(struct gendisk *disk)
 	if (disk->private_data) {
 		sdkp = scsi_disk(disk);
 		if (scsi_device_get(sdkp->device) == 0)
-			get_device(&sdkp->dev);
+			get_device(&sdkp->disk_dev);
 		else
 			sdkp = NULL;
 	}
@@ -685,7 +685,7 @@ static void scsi_disk_put(struct scsi_disk *sdkp)
 	struct scsi_device *sdev = sdkp->device;
 
 	mutex_lock(&sd_ref_mutex);
-	put_device(&sdkp->dev);
+	put_device(&sdkp->disk_dev);
 	scsi_device_put(sdev);
 	mutex_unlock(&sd_ref_mutex);
 }
@@ -3529,14 +3529,14 @@ static int sd_probe(struct device *dev)
 					     SD_MOD_TIMEOUT);
 	}
 
-	device_initialize(&sdkp->dev);
-	sdkp->dev.parent = get_device(dev);
-	sdkp->dev.class = &sd_disk_class;
-	dev_set_name(&sdkp->dev, "%s", dev_name(dev));
+	device_initialize(&sdkp->disk_dev);
+	sdkp->disk_dev.parent = get_device(dev);
+	sdkp->disk_dev.class = &sd_disk_class;
+	dev_set_name(&sdkp->disk_dev, "%s", dev_name(dev));
 
-	error = device_add(&sdkp->dev);
+	error = device_add(&sdkp->disk_dev);
 	if (error) {
-		put_device(&sdkp->dev);
+		put_device(&sdkp->disk_dev);
 		goto out;
 	}
 
@@ -3577,7 +3577,7 @@ static int sd_probe(struct device *dev)
 
 	error = device_add_disk(dev, gd, NULL);
 	if (error) {
-		put_device(&sdkp->dev);
+		put_device(&sdkp->disk_dev);
 		goto out;
 	}
 
@@ -3628,7 +3628,7 @@ static int sd_remove(struct device *dev)
 	sdkp = dev_get_drvdata(dev);
 	scsi_autopm_get_device(sdkp->device);
 
-	device_del(&sdkp->dev);
+	device_del(&sdkp->disk_dev);
 	del_gendisk(sdkp->disk);
 	sd_shutdown(dev);
 
@@ -3636,7 +3636,7 @@ static int sd_remove(struct device *dev)
 
 	mutex_lock(&sd_ref_mutex);
 	dev_set_drvdata(dev, NULL);
-	put_device(&sdkp->dev);
+	put_device(&sdkp->disk_dev);
 	mutex_unlock(&sd_ref_mutex);
 
 	return 0;
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 303aa1c23aefb..7625a90b0fa69 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -69,7 +69,13 @@ enum {
 
 struct scsi_disk {
 	struct scsi_device *device;
-	struct device	dev;
+
+	/*
+	 * This device is mostly just used to show a bunch of attributes in a
+	 * weird place.  In doubt don't add any new users, and most importantly
+	 * don't use it for any actual refcounting.
+	 */
+	struct device	disk_dev;
 	struct gendisk	*disk;
 	struct opal_dev *opal_dev;
 #ifdef CONFIG_BLK_DEV_ZONED
@@ -126,7 +132,7 @@ struct scsi_disk {
 	unsigned	security : 1;
 	unsigned	ignore_medium_access_errors : 1;
 };
-#define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)
+#define to_scsi_disk(obj) container_of(obj, struct scsi_disk, disk_dev)
 
 static inline struct scsi_disk *scsi_disk(struct gendisk *disk)
 {
-- 
2.30.2



* [PATCH 05/14] sd: call sd_zbc_release_disk before releasing the scsi_device reference
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 04/14] sd: rename the scsi_disk.dev field Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 06/14] sd: delay calling free_opal_dev Christoph Hellwig
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

sd_zbc_release_disk accesses sdkp->device, so ensure that we actually still
hold a valid reference to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7479e7cb36b43..7bfebf5b2832d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3672,9 +3672,9 @@ static void scsi_disk_release(struct device *dev)
 
 	disk->private_data = NULL;
 	put_disk(disk);
-	put_device(&sdkp->device->sdev_gendev);
 
 	sd_zbc_release_disk(sdkp);
+	put_device(&sdkp->device->sdev_gendev);
 
 	kfree(sdkp);
 }
-- 
2.30.2
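
The ordering rule behind this one-line move (use an object first, drop the
reference last) can be sketched in userspace.  The model_* names are
hypothetical, and the "released" flag stands in for the object actually
being freed on the last put.

```c
#include <assert.h>
#include <stdbool.h>

struct model_device {
	int refcount;
	bool released;	/* set once the last reference is dropped */
};

struct model_disk {
	struct model_device *device;
	bool cleanup_saw_live_device;
};

static void model_put_device(struct model_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->released = true;	/* a real release would free the object */
}

/* Stand-in for a cleanup helper that dereferences disk->device. */
static void model_zbc_release(struct model_disk *disk)
{
	disk->cleanup_saw_live_device = !disk->device->released;
}

/* Ordering after the patch: use the device first, drop the reference last. */
static void model_disk_release(struct model_disk *disk)
{
	model_zbc_release(disk);
	model_put_device(disk->device);
}
```

Swapping the two calls in model_disk_release() would let the cleanup helper
observe an already released device, which is the situation the patch avoids.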



* [PATCH 06/14] sd: delay calling free_opal_dev
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 05/14] sd: call sd_zbc_release_disk before releasing the scsi_device reference Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 07/14] sd: make use of ->free_disk to simplify refcounting Christoph Hellwig
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

Call free_opal_dev from scsi_disk_release as the opal_dev field is accessed
from the ioctl handler, which isn't synchronized vs sd_release and thus
can be accessed during or after sd_release was called.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7bfebf5b2832d..346b8d62de7d1 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3632,8 +3632,6 @@ static int sd_remove(struct device *dev)
 	del_gendisk(sdkp->disk);
 	sd_shutdown(dev);
 
-	free_opal_dev(sdkp->opal_dev);
-
 	mutex_lock(&sd_ref_mutex);
 	dev_set_drvdata(dev, NULL);
 	put_device(&sdkp->disk_dev);
@@ -3675,6 +3673,7 @@ static void scsi_disk_release(struct device *dev)
 
 	sd_zbc_release_disk(sdkp);
 	put_device(&sdkp->device->sdev_gendev);
+	free_opal_dev(sdkp->opal_dev);
 
 	kfree(sdkp);
 }
-- 
2.30.2



* [PATCH 07/14] sd: make use of ->free_disk to simplify refcounting
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 06/14] sd: delay calling free_opal_dev Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 08/14] sr: implement ->free_disk Christoph Hellwig
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

Implement the ->free_disk method to put struct scsi_disk when the last
gendisk reference count goes away.  This removes the need to clear
->private_data and thus freeze the queue on unbind.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 90 +++++++++--------------------------------------
 1 file changed, 16 insertions(+), 74 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 346b8d62de7d1..498e6fdcf6cfe 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -121,11 +121,6 @@ static void scsi_disk_release(struct device *cdev);
 
 static DEFINE_IDA(sd_index_ida);
 
-/* This semaphore is used to mediate the 0->1 reference get in the
- * face of object destruction (i.e. we can't allow a get on an
- * object after last put) */
-static DEFINE_MUTEX(sd_ref_mutex);
-
 static struct kmem_cache *sd_cdb_cache;
 static mempool_t *sd_cdb_pool;
 static mempool_t *sd_page_pool;
@@ -663,33 +658,6 @@ static int sd_major(int major_idx)
 	}
 }
 
-static struct scsi_disk *scsi_disk_get(struct gendisk *disk)
-{
-	struct scsi_disk *sdkp = NULL;
-
-	mutex_lock(&sd_ref_mutex);
-
-	if (disk->private_data) {
-		sdkp = scsi_disk(disk);
-		if (scsi_device_get(sdkp->device) == 0)
-			get_device(&sdkp->disk_dev);
-		else
-			sdkp = NULL;
-	}
-	mutex_unlock(&sd_ref_mutex);
-	return sdkp;
-}
-
-static void scsi_disk_put(struct scsi_disk *sdkp)
-{
-	struct scsi_device *sdev = sdkp->device;
-
-	mutex_lock(&sd_ref_mutex);
-	put_device(&sdkp->disk_dev);
-	scsi_device_put(sdev);
-	mutex_unlock(&sd_ref_mutex);
-}
-
 #ifdef CONFIG_BLK_SED_OPAL
 static int sd_sec_submit(void *data, u16 spsp, u8 secp, void *buffer,
 		size_t len, bool send)
@@ -1418,17 +1386,15 @@ static bool sd_need_revalidate(struct block_device *bdev,
  **/
 static int sd_open(struct block_device *bdev, fmode_t mode)
 {
-	struct scsi_disk *sdkp = scsi_disk_get(bdev->bd_disk);
-	struct scsi_device *sdev;
+	struct scsi_disk *sdkp = scsi_disk(bdev->bd_disk);
+	struct scsi_device *sdev = sdkp->device;
 	int retval;
 
-	if (!sdkp)
+	if (scsi_device_get(sdev))
 		return -ENXIO;
 
 	SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp, "sd_open\n"));
 
-	sdev = sdkp->device;
-
 	/*
 	 * If the device is in error recovery, wait until it is done.
 	 * If the device is offline, then disallow any access to it.
@@ -1473,7 +1439,7 @@ static int sd_open(struct block_device *bdev, fmode_t mode)
 	return 0;
 
 error_out:
-	scsi_disk_put(sdkp);
+	scsi_device_put(sdkp->device);
 	return retval;	
 }
 
@@ -1502,7 +1468,7 @@ static void sd_release(struct gendisk *disk, fmode_t mode)
 			scsi_set_medium_removal(sdev, SCSI_REMOVAL_ALLOW);
 	}
 
-	scsi_disk_put(sdkp);
+	scsi_device_put(sdkp->device);
 }
 
 static int sd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
@@ -1616,7 +1582,7 @@ static int media_not_present(struct scsi_disk *sdkp,
  **/
 static unsigned int sd_check_events(struct gendisk *disk, unsigned int clearing)
 {
-	struct scsi_disk *sdkp = scsi_disk_get(disk);
+	struct scsi_disk *sdkp = disk->private_data;
 	struct scsi_device *sdp;
 	int retval;
 	bool disk_changed;
@@ -1679,7 +1645,6 @@ static unsigned int sd_check_events(struct gendisk *disk, unsigned int clearing)
 	 */
 	disk_changed = sdp->changed;
 	sdp->changed = 0;
-	scsi_disk_put(sdkp);
 	return disk_changed ? DISK_EVENT_MEDIA_CHANGE : 0;
 }
 
@@ -1887,6 +1852,13 @@ static const struct pr_ops sd_pr_ops = {
 	.pr_clear	= sd_pr_clear,
 };
 
+static void scsi_disk_free_disk(struct gendisk *disk)
+{
+	struct scsi_disk *sdkp = disk->private_data;
+
+	put_device(&sdkp->disk_dev);
+}
+
 static const struct block_device_operations sd_fops = {
 	.owner			= THIS_MODULE,
 	.open			= sd_open,
@@ -1898,6 +1870,7 @@ static const struct block_device_operations sd_fops = {
 	.unlock_native_capacity	= sd_unlock_native_capacity,
 	.report_zones		= sd_zbc_report_zones,
 	.get_unique_id		= sd_get_unique_id,
+	.free_disk		= scsi_disk_free_disk,
 	.pr_ops			= &sd_pr_ops,
 };
 
@@ -3623,54 +3596,23 @@ static int sd_probe(struct device *dev)
  **/
 static int sd_remove(struct device *dev)
 {
-	struct scsi_disk *sdkp;
+	struct scsi_disk *sdkp = dev_get_drvdata(dev);
 
-	sdkp = dev_get_drvdata(dev);
 	scsi_autopm_get_device(sdkp->device);
 
 	device_del(&sdkp->disk_dev);
 	del_gendisk(sdkp->disk);
 	sd_shutdown(dev);
 
-	mutex_lock(&sd_ref_mutex);
-	dev_set_drvdata(dev, NULL);
-	put_device(&sdkp->disk_dev);
-	mutex_unlock(&sd_ref_mutex);
-
+	put_disk(sdkp->disk);
 	return 0;
 }
 
-/**
- *	scsi_disk_release - Called to free the scsi_disk structure
- *	@dev: pointer to embedded class device
- *
- *	sd_ref_mutex must be held entering this routine.  Because it is
- *	called on last put, you should always use the scsi_disk_get()
- *	scsi_disk_put() helpers which manipulate the semaphore directly
- *	and never do a direct put_device.
- **/
 static void scsi_disk_release(struct device *dev)
 {
 	struct scsi_disk *sdkp = to_scsi_disk(dev);
-	struct gendisk *disk = sdkp->disk;
-	struct request_queue *q = disk->queue;
 
 	ida_free(&sd_index_ida, sdkp->index);
-
-	/*
-	 * Wait until all requests that are in progress have completed.
-	 * This is necessary to avoid that e.g. scsi_end_request() crashes
-	 * due to clearing the disk->private_data pointer. Wait from inside
-	 * scsi_disk_release() instead of from sd_release() to avoid that
-	 * freezing and unfreezing the request queue affects user space I/O
-	 * in case multiple processes open a /dev/sd... node concurrently.
-	 */
-	blk_mq_freeze_queue(q);
-	blk_mq_unfreeze_queue(q);
-
-	disk->private_data = NULL;
-	put_disk(disk);
-
 	sd_zbc_release_disk(sdkp);
 	put_device(&sdkp->device->sdev_gendev);
 	free_opal_dev(sdkp->opal_dev);
-- 
2.30.2
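
The lifetime rule this patch relies on can be sketched in userspace: the
driver structure is released from a callback that runs when the last disk
reference goes away, so private_data never has to be cleared while openers
still exist.  Everything below is a hypothetical model; the real gendisk
refcounting goes through the driver core and is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

struct model_gendisk;

struct model_disk_ops {
	void (*free_disk)(struct model_gendisk *disk);
};

struct model_gendisk {
	int refcount;
	void *private_data;
	const struct model_disk_ops *ops;
};

struct model_scsi_disk {
	struct model_gendisk disk;
	bool freed;	/* stands in for the final put of the driver structure */
};

static void model_get_disk(struct model_gendisk *disk)
{
	disk->refcount++;
}

static void model_put_disk(struct model_gendisk *disk)
{
	assert(disk->refcount > 0);
	/* The callback runs only when the last reference is dropped. */
	if (--disk->refcount == 0 && disk->ops->free_disk)
		disk->ops->free_disk(disk);
}

static void model_sd_free_disk(struct model_gendisk *disk)
{
	struct model_scsi_disk *sdkp = disk->private_data;

	sdkp->freed = true;
}

static const struct model_disk_ops model_sd_ops = {
	.free_disk = model_sd_free_disk,
};

static void model_sd_probe(struct model_scsi_disk *sdkp)
{
	sdkp->freed = false;
	sdkp->disk.refcount = 1;	/* reference owned by the driver binding */
	sdkp->disk.private_data = sdkp;
	sdkp->disk.ops = &model_sd_ops;
}

/* Unbind just drops the driver's reference; no mutex, no queue freeze. */
static void model_sd_remove(struct model_scsi_disk *sdkp)
{
	model_put_disk(&sdkp->disk);
}
```

If an opener still holds a reference when model_sd_remove() runs,
private_data stays valid until that opener drops its reference, and only
then does the free_disk callback fire.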



* [PATCH 08/14] sr: implement ->free_disk
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 07/14] sd: make use of ->free_disk to simplify refcounting Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 09/14] block: move blkcg initialization/destroy into disk allocation/release handler Christoph Hellwig
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

Simplify the refcounting and remove the need to clear disk->private_data
by implementing the ->free_disk method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sr.c | 124 ++++++++++------------------------------------
 drivers/scsi/sr.h |   4 --
 2 files changed, 26 insertions(+), 102 deletions(-)

diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 569bda76a5175..11fbdc75bb711 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -109,11 +109,6 @@ static DEFINE_SPINLOCK(sr_index_lock);
 
 static struct lock_class_key sr_bio_compl_lkclass;
 
-/* This semaphore is used to mediate the 0->1 reference get in the
- * face of object destruction (i.e. we can't allow a get on an
- * object after last put) */
-static DEFINE_MUTEX(sr_ref_mutex);
-
 static int sr_open(struct cdrom_device_info *, int);
 static void sr_release(struct cdrom_device_info *);
 
@@ -143,8 +138,6 @@ static const struct cdrom_device_ops sr_dops = {
 	.capability		= SR_CAPABILITIES,
 };
 
-static void sr_kref_release(struct kref *kref);
-
 static inline struct scsi_cd *scsi_cd(struct gendisk *disk)
 {
 	return disk->private_data;
@@ -163,38 +156,6 @@ static int sr_runtime_suspend(struct device *dev)
 		return 0;
 }
 
-/*
- * The get and put routines for the struct scsi_cd.  Note this entity
- * has a scsi_device pointer and owns a reference to this.
- */
-static inline struct scsi_cd *scsi_cd_get(struct gendisk *disk)
-{
-	struct scsi_cd *cd = NULL;
-
-	mutex_lock(&sr_ref_mutex);
-	if (disk->private_data == NULL)
-		goto out;
-	cd = scsi_cd(disk);
-	kref_get(&cd->kref);
-	if (scsi_device_get(cd->device)) {
-		kref_put(&cd->kref, sr_kref_release);
-		cd = NULL;
-	}
- out:
-	mutex_unlock(&sr_ref_mutex);
-	return cd;
-}
-
-static void scsi_cd_put(struct scsi_cd *cd)
-{
-	struct scsi_device *sdev = cd->device;
-
-	mutex_lock(&sr_ref_mutex);
-	kref_put(&cd->kref, sr_kref_release);
-	scsi_device_put(sdev);
-	mutex_unlock(&sr_ref_mutex);
-}
-
 static unsigned int sr_get_events(struct scsi_device *sdev)
 {
 	u8 buf[8];
@@ -522,15 +483,13 @@ static void sr_revalidate_disk(struct scsi_cd *cd)
 
 static int sr_block_open(struct block_device *bdev, fmode_t mode)
 {
-	struct scsi_cd *cd;
-	struct scsi_device *sdev;
+	struct scsi_cd *cd = scsi_cd(bdev->bd_disk);
+	struct scsi_device *sdev = cd->device;
 	int ret = -ENXIO;
 
-	cd = scsi_cd_get(bdev->bd_disk);
-	if (!cd)
-		goto out;
+	if (scsi_device_get(cd->device))
+		return -ENXIO;
 
-	sdev = cd->device;
 	scsi_autopm_get_device(sdev);
 	if (bdev_check_media_change(bdev))
 		sr_revalidate_disk(cd);
@@ -541,9 +500,7 @@ static int sr_block_open(struct block_device *bdev, fmode_t mode)
 
 	scsi_autopm_put_device(sdev);
 	if (ret)
-		scsi_cd_put(cd);
-
-out:
+		scsi_device_put(cd->device);
 	return ret;
 }
 
@@ -555,7 +512,7 @@ static void sr_block_release(struct gendisk *disk, fmode_t mode)
 	cdrom_release(&cd->cdi, mode);
 	mutex_unlock(&cd->lock);
 
-	scsi_cd_put(cd);
+	scsi_device_put(cd->device);
 }
 
 static int sr_block_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
@@ -595,18 +552,24 @@ static int sr_block_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 static unsigned int sr_block_check_events(struct gendisk *disk,
 					  unsigned int clearing)
 {
-	unsigned int ret = 0;
-	struct scsi_cd *cd;
+	struct scsi_cd *cd = disk->private_data;
 
-	cd = scsi_cd_get(disk);
-	if (!cd)
+	if (atomic_read(&cd->device->disk_events_disable_depth))
 		return 0;
+	return cdrom_check_events(&cd->cdi, clearing);
+}
 
-	if (!atomic_read(&cd->device->disk_events_disable_depth))
-		ret = cdrom_check_events(&cd->cdi, clearing);
+static void sr_free_disk(struct gendisk *disk)
+{
+	struct scsi_cd *cd = disk->private_data;
 
-	scsi_cd_put(cd);
-	return ret;
+	spin_lock(&sr_index_lock);
+	clear_bit(MINOR(disk_devt(disk)), sr_index_bits);
+	spin_unlock(&sr_index_lock);
+
+	unregister_cdrom(&cd->cdi);
+	mutex_destroy(&cd->lock);
+	kfree(cd);
 }
 
 static const struct block_device_operations sr_bdops =
@@ -617,6 +580,7 @@ static const struct block_device_operations sr_bdops =
 	.ioctl		= sr_block_ioctl,
 	.compat_ioctl	= blkdev_compat_ptr_ioctl,
 	.check_events	= sr_block_check_events,
+	.free_disk	= sr_free_disk,
 };
 
 static int sr_open(struct cdrom_device_info *cdi, int purpose)
@@ -660,8 +624,6 @@ static int sr_probe(struct device *dev)
 	if (!cd)
 		goto fail;
 
-	kref_init(&cd->kref);
-
 	disk = __alloc_disk_node(sdev->request_queue, NUMA_NO_NODE,
 				 &sr_bio_compl_lkclass);
 	if (!disk)
@@ -727,10 +689,8 @@ static int sr_probe(struct device *dev)
 	sr_revalidate_disk(cd);
 
 	error = device_add_disk(&sdev->sdev_gendev, disk, NULL);
-	if (error) {
-		kref_put(&cd->kref, sr_kref_release);
-		goto fail;
-	}
+	if (error)
+		goto unregister_cdrom;
 
 	sdev_printk(KERN_DEBUG, sdev,
 		    "Attached scsi CD-ROM %s\n", cd->cdi.name);
@@ -738,6 +698,8 @@ static int sr_probe(struct device *dev)
 
 	return 0;
 
+unregister_cdrom:
+	unregister_cdrom(&cd->cdi);
 fail_minor:
 	spin_lock(&sr_index_lock);
 	clear_bit(minor, sr_index_bits);
@@ -1009,36 +971,6 @@ static int sr_read_cdda_bpc(struct cdrom_device_info *cdi, void __user *ubuf,
 	return ret;
 }
 
-
-/**
- *	sr_kref_release - Called to free the scsi_cd structure
- *	@kref: pointer to embedded kref
- *
- *	sr_ref_mutex must be held entering this routine.  Because it is
- *	called on last put, you should always use the scsi_cd_get()
- *	scsi_cd_put() helpers which manipulate the semaphore directly
- *	and never do a direct kref_put().
- **/
-static void sr_kref_release(struct kref *kref)
-{
-	struct scsi_cd *cd = container_of(kref, struct scsi_cd, kref);
-	struct gendisk *disk = cd->disk;
-
-	spin_lock(&sr_index_lock);
-	clear_bit(MINOR(disk_devt(disk)), sr_index_bits);
-	spin_unlock(&sr_index_lock);
-
-	unregister_cdrom(&cd->cdi);
-
-	disk->private_data = NULL;
-
-	put_disk(disk);
-
-	mutex_destroy(&cd->lock);
-
-	kfree(cd);
-}
-
 static int sr_remove(struct device *dev)
 {
 	struct scsi_cd *cd = dev_get_drvdata(dev);
@@ -1046,11 +978,7 @@ static int sr_remove(struct device *dev)
 	scsi_autopm_get_device(cd->device);
 
 	del_gendisk(cd->disk);
-	dev_set_drvdata(dev, NULL);
-
-	mutex_lock(&sr_ref_mutex);
-	kref_put(&cd->kref, sr_kref_release);
-	mutex_unlock(&sr_ref_mutex);
+	put_disk(cd->disk);
 
 	return 0;
 }
diff --git a/drivers/scsi/sr.h b/drivers/scsi/sr.h
index d80af3fcb6f97..1175f2e213b56 100644
--- a/drivers/scsi/sr.h
+++ b/drivers/scsi/sr.h
@@ -18,7 +18,6 @@
 #ifndef _SR_H
 #define _SR_H
 
-#include <linux/kref.h>
 #include <linux/mutex.h>
 
 #define MAX_RETRIES	3
@@ -51,9 +50,6 @@ typedef struct scsi_cd {
 
 	struct cdrom_device_info cdi;
 	struct mutex lock;
-	/* We hold gendisk and scsi_device references on probe and use
-	 * the refs on this kref to decide when to release them */
-	struct kref kref;
 	struct gendisk *disk;
 } Scsi_CD;
 
-- 
2.30.2



* [PATCH 09/14] block: move blkcg initialization/destroy into disk allocation/release handler
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 08/14] sr: implement ->free_disk Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 10/14] block: don't remove hctx debugfs dir from blk_mq_exit_queue Christoph Hellwig
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi, Bart Van Assche

From: Ming Lei <ming.lei@redhat.com>

blkcg works at the FS bio level, so it is reasonable for blkcg and the
gendisk to share the same lifetime. There also won't be any FS I/O when
the disk is released, so it is safe to move blkcg initialization/destroy
into the disk allocation/release handlers.

Long term, we can move blkcg into gendisk completely.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c  | 5 -----
 block/blk-sysfs.c | 7 -------
 block/genhd.c     | 8 ++++++++
 3 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 94bf37f8e61d2..b2f2c65774812 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -496,17 +496,12 @@ struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
 				PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto fail_stats;
 
-	if (blkcg_init_queue(q))
-		goto fail_ref;
-
 	blk_queue_dma_alignment(q, 511);
 	blk_set_default_limits(&q->limits);
 	q->nr_requests = BLKDEV_DEFAULT_RQ;
 
 	return q;
 
-fail_ref:
-	percpu_ref_exit(&q->q_usage_counter);
 fail_stats:
 	blk_free_queue_stats(q->stats);
 fail_split:
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4c6b7dff71e5b..5f723d2ff8948 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -751,13 +751,6 @@ static void blk_exit_queue(struct request_queue *q)
 		ioc_clear_queue(q);
 		elevator_exit(q);
 	}
-
-	/*
-	 * Remove all references to @q from the block cgroup controller before
-	 * restoring @q->queue_lock to avoid that restoring this pointer causes
-	 * e.g. blkcg_print_blkgs() to crash.
-	 */
-	blkcg_exit_queue(q);
 }
 
 /**
diff --git a/block/genhd.c b/block/genhd.c
index d89c35d5f2432..3249d4206f312 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1122,9 +1122,12 @@ static void disk_release(struct device *dev)
 
 	blk_mq_cancel_work_sync(disk->queue);
 
+	blkcg_exit_queue(disk->queue);
+
 	disk_release_events(disk);
 	kfree(disk->random);
 	xa_destroy(&disk->part_tbl);
+
 	disk->queue->disk = NULL;
 	blk_put_queue(disk->queue);
 
@@ -1330,6 +1333,9 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 	if (xa_insert(&disk->part_tbl, 0, disk->part0, GFP_KERNEL))
 		goto out_destroy_part_tbl;
 
+	if (blkcg_init_queue(q))
+		goto out_erase_part0;
+
 	rand_initialize_disk(disk);
 	disk_to_dev(disk)->class = &block_class;
 	disk_to_dev(disk)->type = &disk_type;
@@ -1342,6 +1348,8 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 #endif
 	return disk;
 
+out_erase_part0:
+	xa_erase(&disk->part_tbl, 0);
 out_destroy_part_tbl:
 	xa_destroy(&disk->part_tbl);
 	disk->part0->bd_disk = NULL;
-- 
2.30.2



* [PATCH 10/14] block: don't remove hctx debugfs dir from blk_mq_exit_queue
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 09/14] block: move blkcg initialization/destroy into disk allocation/release handler Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 11/14] block: move q_usage_counter release into blk_queue_release Christoph Hellwig
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

From: Ming Lei <ming.lei@redhat.com>

The queue's top debugfs dir is removed from blk_release_queue(), so all
hctx's debugfs dirs are removed from there. Given blk_mq_exit_queue()
is only called from blk_cleanup_queue(), it isn't necessary to remove
hctx debugfs from blk_mq_exit_queue().

So remove it from blk_mq_exit_queue().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 63e2d3fd60946..540c8da30da72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3425,7 +3425,6 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (i == nr_queue)
 			break;
-		blk_mq_debugfs_unregister_hctx(hctx);
 		blk_mq_exit_hctx(q, set, hctx, i);
 	}
 }
-- 
2.30.2



* [PATCH 11/14] block: move q_usage_counter release into blk_queue_release
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 10/14] block: don't remove hctx debugfs dir from blk_mq_exit_queue Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 12/14] block: move blk_exit_queue into disk_release Christoph Hellwig
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi, Bart Van Assche

From: Ming Lei <ming.lei@redhat.com>

After blk_cleanup_queue() returns, the disk may not have been released
yet, so bios may still be submitted and ->q_usage_counter may still be
touched.  So far this seems safe, but it is not good from an API point
of view.

Move the release of q_usage_counter into blk_release_queue().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c  | 2 --
 block/blk-sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b2f2c65774812..a8c59913dd78d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -342,8 +342,6 @@ void blk_cleanup_queue(struct request_queue *q)
 		blk_mq_sched_free_rqs(q);
 	mutex_unlock(&q->sysfs_lock);
 
-	percpu_ref_exit(&q->q_usage_counter);
-
 	/* @q is and will stay empty, shutdown and put */
 	blk_put_queue(q);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5f723d2ff8948..4ea22169b5186 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -780,6 +780,8 @@ static void blk_release_queue(struct kobject *kobj)
 
 	might_sleep();
 
+	percpu_ref_exit(&q->q_usage_counter);
+
 	if (q->poll_stat)
 		blk_stat_remove_callback(q, q->poll_cb);
 	blk_stat_free_callback(q->poll_cb);
-- 
2.30.2



* [PATCH 12/14] block: move blk_exit_queue into disk_release
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 11/14] block: move q_usage_counter release into blk_queue_release Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 13/14] block: do more work in elevator_exit Christoph Hellwig
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

From: Ming Lei <ming.lei@redhat.com>

There can't be FS IO in disk_release(), so move blk_exit_queue() there.

We still need to freeze the queue here, since a request is only freed
after its bio has completed, and passthrough requests rely on scheduler
tags as well.

The disk can be released before or after the queue is cleaned up, and we
have to free the scheduler request pool before blk_cleanup_queue returns,
while the static request pool has to be freed before exiting the
I/O scheduler.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: rebased]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-sysfs.c | 16 ----------------
 block/genhd.c     | 32 +++++++++++++++++++++++++++++++-
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4ea22169b5186..faf8577578929 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -739,20 +739,6 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head)
 	kmem_cache_free(blk_get_queue_kmem_cache(blk_queue_has_srcu(q)), q);
 }
 
-/* Unconfigure the I/O scheduler and dissociate from the cgroup controller. */
-static void blk_exit_queue(struct request_queue *q)
-{
-	/*
-	 * Since the I/O scheduler exit code may access cgroup information,
-	 * perform I/O scheduler exit before disassociating from the block
-	 * cgroup controller.
-	 */
-	if (q->elevator) {
-		ioc_clear_queue(q);
-		elevator_exit(q);
-	}
-}
-
 /**
  * blk_release_queue - releases all allocated resources of the request_queue
  * @kobj: pointer to a kobject, whose container is a request_queue
@@ -786,8 +772,6 @@ static void blk_release_queue(struct kobject *kobj)
 		blk_stat_remove_callback(q, q->poll_cb);
 	blk_stat_free_callback(q->poll_cb);
 
-	blk_exit_queue(q);
-
 	blk_free_queue_stats(q->stats);
 	kfree(q->poll_stat);
 
diff --git a/block/genhd.c b/block/genhd.c
index 3249d4206f312..a92641911bc1b 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -30,6 +30,7 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 #include "blk-throttle.h"
+#include "blk-cgroup.h"
 
 static struct kobject *block_depr;
 
@@ -1099,6 +1100,34 @@ static const struct attribute_group *disk_attr_groups[] = {
 	NULL
 };
 
+static void disk_release_mq(struct request_queue *q)
+{
+	blk_mq_cancel_work_sync(q);
+
+	/*
+	 * There can't be any non-passthrough bios in flight here, but
+	 * requests stay around longer, including passthrough ones so we
+	 * still need to freeze the queue here.
+	 */
+	blk_mq_freeze_queue(q);
+
+	/*
+	 * Since the I/O scheduler exit code may access cgroup information,
+	 * perform I/O scheduler exit before disassociating from the block
+	 * cgroup controller.
+	 */
+	if (q->elevator) {
+		ioc_clear_queue(q);
+
+		mutex_lock(&q->sysfs_lock);
+		blk_mq_sched_free_rqs(q);
+		elevator_exit(q);
+		mutex_unlock(&q->sysfs_lock);
+	}
+
+	__blk_mq_unfreeze_queue(q, true);
+}
+
 /**
  * disk_release - releases all allocated resources of the gendisk
  * @dev: the device representing this disk
@@ -1120,7 +1149,8 @@ static void disk_release(struct device *dev)
 	might_sleep();
 	WARN_ON_ONCE(disk_live(disk));
 
-	blk_mq_cancel_work_sync(disk->queue);
+	if (queue_is_mq(disk->queue))
+		disk_release_mq(disk->queue);
 
 	blkcg_exit_queue(disk->queue);
 
-- 
2.30.2



* [PATCH 13/14] block: do more work in elevator_exit
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (11 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 12/14] block: move blk_exit_queue into disk_release Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 17:21 ` [PATCH 14/14] block: move rq_qos_exit() into disk_release() Christoph Hellwig
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

Move the calls to ioc_clear_queue and blk_mq_sched_free_rqs into
elevator_exit.  Except for one call where we know we can't have io_cq
structures yet, these always go together, and that extra call in an
error path is harmless.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/elevator.c | 7 +++----
 block/genhd.c    | 3 ---
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 6847ab6e7aa50..4664cae50da86 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -192,6 +192,9 @@ void elevator_exit(struct request_queue *q)
 {
 	struct elevator_queue *e = q->elevator;
 
+	ioc_clear_queue(q);
+	blk_mq_sched_free_rqs(q);
+
 	mutex_lock(&e->sysfs_lock);
 	blk_mq_exit_sched(q, e);
 	mutex_unlock(&e->sysfs_lock);
@@ -595,9 +598,6 @@ int elevator_switch_mq(struct request_queue *q,
 	if (q->elevator) {
 		if (q->elevator->registered)
 			elv_unregister_queue(q);
-
-		ioc_clear_queue(q);
-		blk_mq_sched_free_rqs(q);
 		elevator_exit(q);
 	}
 
@@ -608,7 +608,6 @@ int elevator_switch_mq(struct request_queue *q,
 	if (new_e) {
 		ret = elv_register_queue(q, true);
 		if (ret) {
-			blk_mq_sched_free_rqs(q);
 			elevator_exit(q);
 			goto out;
 		}
diff --git a/block/genhd.c b/block/genhd.c
index a92641911bc1b..5368ec88e485f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1117,10 +1117,7 @@ static void disk_release_mq(struct request_queue *q)
 	 * cgroup controller.
 	 */
 	if (q->elevator) {
-		ioc_clear_queue(q);
-
 		mutex_lock(&q->sysfs_lock);
-		blk_mq_sched_free_rqs(q);
 		elevator_exit(q);
 		mutex_unlock(&q->sysfs_lock);
 	}
-- 
2.30.2



* [PATCH 14/14] block: move rq_qos_exit() into disk_release()
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (12 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 13/14] block: do more work in elevator_exit Christoph Hellwig
@ 2022-02-27 17:21 ` Christoph Hellwig
  2022-02-27 23:18 ` move more work to disk_release v2 Bart Van Assche
  2022-03-01 13:00 ` Christoph Hellwig
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-02-27 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

From: Ming Lei <ming.lei@redhat.com>

There can't be FS IO in disk_release(), so it is safe to move rq_qos_exit()
there.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/genhd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 5368ec88e485f..d78910ef0c893 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -629,7 +629,6 @@ void del_gendisk(struct gendisk *disk)
 	blk_mq_freeze_queue_wait(q);
 
 	blk_throtl_cancel_bios(disk->queue);
-	rq_qos_exit(q);
 	blk_sync_queue(q);
 	blk_flush_integrity();
 	/*
@@ -1121,7 +1120,7 @@ static void disk_release_mq(struct request_queue *q)
 		elevator_exit(q);
 		mutex_unlock(&q->sysfs_lock);
 	}
-
+	rq_qos_exit(q);
 	__blk_mq_unfreeze_queue(q, true);
 }
 
-- 
2.30.2



* Re: move more work to disk_release v2
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (13 preceding siblings ...)
  2022-02-27 17:21 ` [PATCH 14/14] block: move rq_qos_exit() into disk_release() Christoph Hellwig
@ 2022-02-27 23:18 ` Bart Van Assche
  2022-03-01 12:56   ` Christoph Hellwig
  2022-03-01 13:00 ` Christoph Hellwig
  15 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2022-02-27 23:18 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

On 2/27/22 09:21, Christoph Hellwig wrote:
> Hi all,
> 
> this series resurrects and forward ports larger parts of the
> "block: don't drain file system I/O on del_gendisk" series from Ming,
> but does not remove the draining in del_gendisk, but instead the one
> in the sd driver, which always was a bit ad-hoc.  As part of that sd
> and sr are switched to use the new ->free_disk method to avoid having
> to clear disk->private_data and the way to lookup the SCSI ULP is
> cleaned up as well.
> 
> Git branch:
> 
>      git://git.infradead.org/users/hch/block.git freeze-5.18

Hi Christoph,

Thanks for the quick respin. If I run blktests as follows:

$ use_siw=1 ./check -q

then the first report I hit with this branch is a deadlock report in
nvmet_rdma_free_queue(). That issue has already been reported - see also
https://lore.kernel.org/linux-nvme/CAHj4cs93BfTRgWF6PbuZcfq6AARHgYC2g=RQ-7Jgcf1-6h+2SQ@mail.gmail.com/

The second issue I run into with this branch is as follows
(also for nvmeof-mp/002):

==================================================================
BUG: KASAN: null-ptr-deref in __blk_account_io_start+0x28/0xa0
Read of size 8 at addr 0000000000000008 by task kworker/0:1H/159

CPU: 0 PID: 159 Comm: kworker/0:1H Not tainted 5.17.0-rc2-dbg+ #9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
Workqueue: kblockd blk_mq_requeue_work
Call Trace:
  <TASK>
  show_stack+0x52/0x58
  ? __blk_account_io_start+0x28/0xa0
  dump_stack_lvl+0x5b/0x82
  kasan_report.cold+0x64/0xdb
  ? __blk_account_io_start+0x28/0xa0
  __asan_load8+0x69/0x90
  __blk_account_io_start+0x28/0xa0
  blk_insert_cloned_request+0x107/0x3b0
  map_request+0x260/0x3c0 [dm_mod]
  ? dm_requeue_original_request+0x1a0/0x1a0 [dm_mod]
  ? blk_add_timer+0xc3/0x110
  dm_mq_queue_rq+0x207/0x400 [dm_mod]
  ? kasan_set_track+0x25/0x30
  ? kasan_set_free_info+0x24/0x40
  ? map_request+0x3c0/0x3c0 [dm_mod]
  ? nvmet_rdma_release_rsp+0xb3/0x3f0 [nvmet_rdma]
  ? nvmet_rdma_send_done+0x4a/0x70 [nvmet_rdma]
  ? __ib_process_cq+0x11b/0x3c0 [ib_core]
  ? ib_cq_poll_work+0x37/0xb0 [ib_core]
  ? process_one_work+0x594/0xad0
  ? worker_thread+0x2de/0x6b0
  ? kthread+0x15f/0x190
  ? ret_from_fork+0x1f/0x30
  blk_mq_dispatch_rq_list+0x344/0xc00
  ? blk_mq_mark_tag_wait+0x470/0x470
  ? rcu_read_lock_sched_held+0x16/0x80
  __blk_mq_sched_dispatch_requests+0x19b/0x280
  ? blk_mq_do_dispatch_ctx+0x3f0/0x3f0
  ? rcu_read_lock_sched_held+0x16/0x80
  blk_mq_sched_dispatch_requests+0x8a/0xc0
  __blk_mq_run_hw_queue+0x99/0x220
  __blk_mq_delay_run_hw_queue+0x372/0x3a0
  ? blk_mq_run_hw_queue+0xd7/0x2b0
  ? rcu_read_lock_sched_held+0x16/0x80
  blk_mq_run_hw_queue+0x1d6/0x2b0
  blk_mq_run_hw_queues+0xa0/0x1e0
  blk_mq_requeue_work+0x2e4/0x330
  ? blk_mq_try_issue_directly+0x60/0x60
  ? lock_acquire+0x76/0x1a0
  process_one_work+0x594/0xad0
  ? pwq_dec_nr_in_flight+0x120/0x120
  ? do_raw_spin_lock+0x115/0x1b0
  ? lock_acquire+0x76/0x1a0
  worker_thread+0x2de/0x6b0
  ? trace_hardirqs_on+0x2b/0x120
  ? process_one_work+0xad0/0xad0
  kthread+0x15f/0x190
  ? kthread_complete_and_exit+0x30/0x30
  ret_from_fork+0x1f/0x30
  </TASK>
==================================================================


* Re: move more work to disk_release v2
  2022-02-27 23:18 ` move more work to disk_release v2 Bart Van Assche
@ 2022-03-01 12:56   ` Christoph Hellwig
  2022-03-02  5:05     ` Bart Van Assche
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2022-03-01 12:56 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Ming Lei,
	linux-block, linux-scsi

On Sun, Feb 27, 2022 at 03:18:24PM -0800, Bart Van Assche wrote:
> The second issue I run into with this branch is as follows
> (also for nvmeof-mp/002):

You'll need this patch, which is only in mainline but not the
for-5.18/block branch:

fd9f4e62a39f09a7c014d7415c2b9d1390aa0504
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Jan 18 08:04:44 2022 +0100

    block: assign bi_bdev for cloned bios in blk_rq_prep_clone



* Re: move more work to disk_release v2
  2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
                   ` (14 preceding siblings ...)
  2022-02-27 23:18 ` move more work to disk_release v2 Bart Van Assche
@ 2022-03-01 13:00 ` Christoph Hellwig
  15 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-03-01 13:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin K. Petersen, Ming Lei, linux-block, linux-scsi

FYI, this patchset has acquired some trivial conflicts against the
latest for-5.18/block tree.

The git branch below has been rebased to fix that.

> Git branch:
> 
>     git://git.infradead.org/users/hch/block.git freeze-5.18
> 
> Gitweb:
> 
>     http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/freeze-5.18
> 
> Changes since v1:
>  - fix a refcounting bug in sd
>  - rename a function
> 
> Diffstat:
>  block/blk-core.c           |    7 --
>  block/blk-mq.c             |   10 +--
>  block/blk-sysfs.c          |   25 --------
>  block/blk.h                |    2 
>  block/elevator.c           |    7 +-
>  block/genhd.c              |   38 ++++++++++++-
>  drivers/scsi/sd.c          |  114 +++++++++------------------------------
>  drivers/scsi/sd.h          |   13 +++-
>  drivers/scsi/sr.c          |  129 +++++++++------------------------------------
>  drivers/scsi/sr.h          |    5 -
>  drivers/scsi/st.c          |    1 
>  drivers/scsi/st.h          |    1 
>  include/scsi/scsi_cmnd.h   |    9 ---
>  include/scsi/scsi_driver.h |    9 ++-
>  14 files changed, 117 insertions(+), 253 deletions(-)
---end quoted text---


* Re: move more work to disk_release v2
  2022-03-01 12:56   ` Christoph Hellwig
@ 2022-03-02  5:05     ` Bart Van Assche
  2022-03-02 15:03       ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2022-03-02  5:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, Ming Lei, linux-block, linux-scsi

On 3/1/22 04:56, Christoph Hellwig wrote:
> On Sun, Feb 27, 2022 at 03:18:24PM -0800, Bart Van Assche wrote:
>> The second issue I run into with this branch is as follows
>> (also for nvmeof-mp/002):
> 
> You'll need this patch, which is only in mainline but not the
> for-5.18/block branch:
> 
> fd9f4e62a39f09a7c014d7415c2b9d1390aa0504
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Tue Jan 18 08:04:44 2022 +0100
> 
>      block: assign bi_bdev for cloned bios in blk_rq_prep_clone

Hmm ... even with that patch applied, I still see the crash reported in 
my previous email. After I observed that crash I did a clean kernel 
build to make sure that the kernel binaries used in my test match the 
source code.

Bart.

BUG: KASAN: null-ptr-deref in __blk_account_io_start+0x28/0xa0
Read of size 8 at addr 0000000000000008 by task kworker/0:1H/155

CPU: 0 PID: 155 Comm: kworker/0:1H Not tainted 5.17.0-rc2-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
Workqueue: kblockd blk_mq_requeue_work
Call Trace:
[ ... ]


* Re: move more work to disk_release v2
  2022-03-02  5:05     ` Bart Van Assche
@ 2022-03-02 15:03       ` Christoph Hellwig
  2022-03-02 23:33         ` Bart Van Assche
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2022-03-02 15:03 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Ming Lei,
	linux-block, linux-scsi

On Tue, Mar 01, 2022 at 09:05:24PM -0800, Bart Van Assche wrote:
> Hmm ... even with that patch applied, I still see the crash reported in my 
> previous email. After I observed that crash I did a clean kernel build to 
> make sure that the kernel binaries used in my test match the source code.

I still can't reproduce it at all.  With this patchset on Jens'
for-5.18/block branch I do get a pre-existing crash in
nvmf_connect_admin_queue, and on Jens' for-next tree that has all the
latest fixes from Linus' tree I only see the CM lockdep warning you
reported.

FYI, this is a branch with the patches applied on top of the for-next
branch:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/freeze-for-next


* Re: move more work to disk_release v2
  2022-03-02 15:03       ` Christoph Hellwig
@ 2022-03-02 23:33         ` Bart Van Assche
  2022-03-03 10:54           ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2022-03-02 23:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, Ming Lei, linux-block, linux-scsi

On 3/2/22 07:03, Christoph Hellwig wrote:
> On Tue, Mar 01, 2022 at 09:05:24PM -0800, Bart Van Assche wrote:
>> Hmm ... even with that patch applied, I still see the crash reported in my
>> previous email. After I observed that crash I did a clean kernel build to
>> make sure that the kernel binaries used in my test match the source code.
> 
> I still can't reproduce it at all.  With this patchset on Jens'
> for-5.18/block branch I do get a pre-existing crash in
> nvmf_connect_admin_queue, and on Jens' for-next tree that has all the
> latest fixes from Linus' tree I only see the CM lockdep warning you
> reported.
> 
> FYI, this is a branch with the patches applied on top of the for-next
> branch:
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/freeze-for-next

Hi Christoph,

Thanks for having published a merge of block-for-next and the branch 
with this patch series. That makes it easy for me to replicate your 
kernel tree. I can reproduce the null-ptr-deref with the freeze-for-next 
branch but not with Jens' block-for-next branch (commit e70f36e84f9b 
("Merge branch 'for-5.18/block' into for-next")). This is what appears 
in the kernel log on my test setup for the freeze-for-next branch 
(commit acac349e5516 ("block: move rq_qos_exit() into disk_release()")):

BUG: KASAN: null-ptr-deref in __blk_account_io_start+0x28/0xa0

Maybe we are using different kernel configurations? I'm using 
CONFIG_NVME_MULTIPATH=n. I guess that you are using CONFIG_NVME_MULTIPATH=y?

Bart.


* Re: move more work to disk_release v2
  2022-03-02 23:33         ` Bart Van Assche
@ 2022-03-03 10:54           ` Christoph Hellwig
  2022-03-03 18:19             ` Bart Van Assche
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2022-03-03 10:54 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Ming Lei,
	linux-block, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 1228 bytes --]

On Wed, Mar 02, 2022 at 03:33:35PM -0800, Bart Van Assche wrote:
> Thanks for having published a merge of block-for-next and the branch with 
> this patch series. That makes it easy for me to replicate your kernel tree. 
> I can reproduce the null-ptr-deref with the freeze-for-next branch but not 
> with Jens' block-for-next branch (commit e70f36e84f9b ("Merge branch 
> 'for-5.18/block' into for-next")). This is what appears in the kernel log 
> on my test setup for the freeze-for-next branch (commit acac349e5516 
> ("block: move rq_qos_exit() into disk_release()")):
>
> BUG: KASAN: null-ptr-deref in __blk_account_io_start+0x28/0xa0
>
> Maybe we are using different kernel configurations? I'm using 
> CONFIG_NVME_MULTIPATH=n. I guess that you are using 
> CONFIG_NVME_MULTIPATH=y?

Your test case errors out when CONFIG_NVME_MULTIPATH=y is set, and
also requires various things to be built modular, which wasted a lot
of my time yesterday trying to get that test to run.  My .config
is below.  Maybe you can try to figure out what dereference causes
the null-ptr-deref, and what kind of command causes this?  Also
I suspect this is the first patch in the series, so it would be
great to verify the problem with just that.

[-- Attachment #2: config.blktests.gz --]
[-- Type: application/x-gzip, Size: 37564 bytes --]


* Re: move more work to disk_release v2
  2022-03-03 10:54           ` Christoph Hellwig
@ 2022-03-03 18:19             ` Bart Van Assche
  2022-03-03 19:23               ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2022-03-03 18:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, Ming Lei, linux-block, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 2205 bytes --]

On 3/3/22 02:54, Christoph Hellwig wrote:
> Maybe you can try to figure out what dereference causes
> the null-ptr-deref, and what kind of command causes this?  Also
> I suspect this is the first patch in the series, so it would be
> great to verify the problem with just that.

Hi Christoph,

I can reproduce the crash by cherry-picking patch "blk-mq: do not include 
passthrough requests in I/O accounting" on top of Jens' for-next branch.

From the struct request that triggers the crash (the flag names have been 
looked up manually and hence may be wrong):
* cmd_flags 0x44202 = REQ_PREFLUSH | REQ_NOMERGE | REQ_FAILFAST_TRANSPORT |
   REQ_OP_FLUSH.
* rq_flags 0x2000 = RQF_IO_STAT.
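
The manual decode above can be cross-checked with a small userspace
sketch.  The bit positions below are an assumption (they are meant to
mirror the v5.17-era include/linux/blk_types.h and blk-mq.h layout, with
the operation in the low 8 bits), and the MOCK_ names are local to this
example rather than kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed v5.17-era bit layout; MOCK_ names are local to this sketch. */
#define MOCK_REQ_OP_MASK		0xffu		/* low 8 bits: operation */
#define MOCK_REQ_OP_FLUSH		2u
#define MOCK_REQ_FAILFAST_TRANSPORT	(1u << 9)
#define MOCK_REQ_NOMERGE		(1u << 14)
#define MOCK_REQ_PREFLUSH		(1u << 18)
#define MOCK_RQF_IO_STAT		(1u << 13)	/* lives in rq_flags */

/* The four assumed values OR together to the reported cmd_flags. */
static_assert((MOCK_REQ_PREFLUSH | MOCK_REQ_NOMERGE |
	       MOCK_REQ_FAILFAST_TRANSPORT | MOCK_REQ_OP_FLUSH) == 0x44202u,
	      "assumed bit layout does not reproduce 0x44202");

/* Return the operation number encoded in cmd_flags. */
static unsigned int mock_req_op(uint32_t cmd_flags)
{
	return cmd_flags & MOCK_REQ_OP_MASK;
}

/* Return the flag bits of cmd_flags that this sketch does not know. */
static uint32_t mock_unknown_flags(uint32_t cmd_flags)
{
	uint32_t known = MOCK_REQ_FAILFAST_TRANSPORT | MOCK_REQ_NOMERGE |
			 MOCK_REQ_PREFLUSH;

	return (cmd_flags & ~MOCK_REQ_OP_MASK) & ~known;
}

/* Print the operation, the known flag names, and any leftover bits. */
static void mock_print_cmd_flags(uint32_t cmd_flags)
{
	printf("op=%u%s%s%s unknown=0x%x\n",
	       mock_req_op(cmd_flags),
	       (cmd_flags & MOCK_REQ_FAILFAST_TRANSPORT) ? " FAILFAST_TRANSPORT" : "",
	       (cmd_flags & MOCK_REQ_NOMERGE) ? " NOMERGE" : "",
	       (cmd_flags & MOCK_REQ_PREFLUSH) ? " PREFLUSH" : "",
	       (unsigned int)mock_unknown_flags(cmd_flags));
}
```

Under these assumed positions no bits of 0x44202 are left unexplained,
which agrees with the list above; a different kernel version may number
the flags differently.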

The disassembly of the start of the function that triggers the crash is as follows:

Dump of assembler code for function __blk_account_io_start:
block/blk-mq.c:
889     {
    0xffffffff81797710 <+0>:     call   0xffffffff810940a0 <__fentry__>

890             rq->part = rq->bio->bi_bdev;
    0xffffffff81797715 <+5>:     push   %rbp
    0xffffffff81797716 <+6>:     mov    %rsp,%rbp
    0xffffffff81797719 <+9>:     push   %r13
    0xffffffff8179771b <+11>:    push   %r12
    0xffffffff8179771d <+13>:    push   %rbx

889     {
    0xffffffff8179771e <+14>:    mov    %rdi,%rbx

890             rq->part = rq->bio->bi_bdev;
    0xffffffff81797721 <+17>:    add    $0x38,%rdi
    0xffffffff81797725 <+21>:    call   0xffffffff81488d10 <__asan_load8>
    0xffffffff8179772a <+26>:    mov    0x38(%rbx),%r12
    0xffffffff8179772e <+30>:    lea    0x8(%r12),%rdi
    0xffffffff81797733 <+35>:    call   0xffffffff81488d10 <__asan_load8>
    0xffffffff81797738 <+40>:    mov    0x8(%r12),%r13
    0xffffffff8179773d <+45>:    lea    0x58(%rbx),%r12
    0xffffffff81797741 <+49>:    mov    %r12,%rdi
    0xffffffff81797744 <+52>:    call   0xffffffff81488da0 <__asan_store8>

The crash occurs at address __blk_account_io_start+0x28. I assume this means 
that the "mov 0x8(%r12),%r13" instruction triggers the crash and also that it 
crashes because the rq->bio pointer is NULL?
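
A minimal sketch of why a NULL rq->bio would show up as an access at
address 0x8, assuming (this layout is an assumption, not taken from the
report) that bi_bdev is the second pointer-sized member of the bio.  The
mock_bio type is a hypothetical two-field model and not the kernel's
struct bio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-field model of the start of a bio. */
struct mock_bio {
	struct mock_bio *bi_next;	/* first member, offset 0 */
	void *bi_bdev;			/* second member */
};

/* In this model bi_bdev sits one pointer past the start of the bio. */
static_assert(offsetof(struct mock_bio, bi_bdev) == sizeof(void *),
	      "bi_bdev is expected right after bi_next");

/* Address that a load of bio->bi_bdev touches for a given bio pointer. */
static uintptr_t mock_bi_bdev_load_addr(const struct mock_bio *bio)
{
	return (uintptr_t)bio + offsetof(struct mock_bio, bi_bdev);
}
```

On a 64-bit build mock_bi_bdev_load_addr(NULL) evaluates to 8, which
would line up with both the "addr 0000000000000008" in the KASAN report
and the 0x8(%r12) operand in the disassembly.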

I have attached the kernel configuration I use for running blktests to this e-mail.

Please let me know if you need more information.

Bart.

[-- Attachment #2: kernel-config.txt.gz --]
[-- Type: application/gzip, Size: 27951 bytes --]


* Re: move more work to disk_release v2
  2022-03-03 18:19             ` Bart Van Assche
@ 2022-03-03 19:23               ` Christoph Hellwig
  2022-03-03 20:51                 ` Bart Van Assche
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2022-03-03 19:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Ming Lei,
	linux-block, linux-scsi

On Thu, Mar 03, 2022 at 10:19:34AM -0800, Bart Van Assche wrote:
> On 3/3/22 02:54, Christoph Hellwig wrote:
>> Maybe you can try to figure out what dereference causes
>> the null-ptr-deref, and what kind of command causes this?  Also
>> I suspect this is the first patch in the series, so it would be
>> great to verify the problem with just that.
>
> Hi Christoph,
>
> I can reproduce the crash by cherry-picking patch "blk-mq: do not include 
> passthrough requests in I/O accounting" on top of Jens' for-next branch.
>
> From the struct request that triggers the crash (the flag names have been 
> looked up manually and hence may be wrong):
> * cmd_flags 0x44202 = REQ_PREFLUSH | REQ_NOMERGE | REQ_FAILFAST_TRANSPORT |
>   REQ_OP_FLUSH.
> * rq_flags 0x2000 = RQF_IO_STAT.

So this is a flush request issued by the flush state machine.
Normally these don't go through the I/O accounting because the I/O
accounting happens before we call into the flush state machine.  But
with blk-mq we can run the flush state machine on the upper dm-mpath
device and then hand a request with a NULL bio down.

I can't really explain why you hit that path and I don't with the same
test.

Can you try this patch on top of the series?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6a072543bde4d..73b8bc9d67cf6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -883,7 +883,10 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 
 static void __blk_account_io_start(struct request *rq)
 {
-	rq->part = rq->bio->bi_bdev;
+	if (rq->bio)
+		rq->part = rq->bio->bi_bdev;
+	else /* should only happen for dm-mpath flush requests */
+		rq->part = rq->q->disk->part0;
 
 	part_stat_lock();
 	update_io_ticks(rq->part, jiffies, false);


* Re: move more work to disk_release v2
  2022-03-03 19:23               ` Christoph Hellwig
@ 2022-03-03 20:51                 ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2022-03-03 20:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, Ming Lei, linux-block, linux-scsi

On 3/3/22 11:23, Christoph Hellwig wrote:
> Can you try this patch on top of the series?
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6a072543bde4d..73b8bc9d67cf6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -883,7 +883,10 @@ static inline void blk_account_io_done(struct request *req, u64 now)
>   
>   static void __blk_account_io_start(struct request *rq)
>   {
> -	rq->part = rq->bio->bi_bdev;
> +	if (rq->bio)
> +		rq->part = rq->bio->bi_bdev;
> +	else /* should only happen for dm-mpath flush requests */
> +		rq->part = rq->q->disk->part0;
>   
>   	part_stat_lock();
>   	update_io_ticks(rq->part, jiffies, false);

Hi Christoph,

This patch fixes the crash I reported. With this patch applied on top of the 
hch-block/freeze-for-next branch blktests produces the same results on my setup 
as for Jens' for-next branch.

Thanks!

Bart.


Thread overview: 25+ messages
2022-02-27 17:21 move more work to disk_release v2 Christoph Hellwig
2022-02-27 17:21 ` [PATCH 01/14] blk-mq: do not include passthrough requests in I/O accounting Christoph Hellwig
2022-02-27 17:21 ` [PATCH 02/14] blk-mq: handle already freed tags gracefully in blk_mq_free_rqs Christoph Hellwig
2022-02-27 17:21 ` [PATCH 03/14] scsi: don't use disk->private_data to find the scsi_driver Christoph Hellwig
2022-02-27 17:21 ` [PATCH 04/14] sd: rename the scsi_disk.dev field Christoph Hellwig
2022-02-27 17:21 ` [PATCH 05/14] sd: call sd_zbc_release_disk before releasing the scsi_device reference Christoph Hellwig
2022-02-27 17:21 ` [PATCH 06/14] sd: delay calling free_opal_dev Christoph Hellwig
2022-02-27 17:21 ` [PATCH 07/14] sd: make use of ->free_disk to simplify refcounting Christoph Hellwig
2022-02-27 17:21 ` [PATCH 08/14] sr: implement ->free_disk Christoph Hellwig
2022-02-27 17:21 ` [PATCH 09/14] block: move blkcg initialization/destroy into disk allocation/release handler Christoph Hellwig
2022-02-27 17:21 ` [PATCH 10/14] block: don't remove hctx debugfs dir from blk_mq_exit_queue Christoph Hellwig
2022-02-27 17:21 ` [PATCH 11/14] block: move q_usage_counter release into blk_queue_release Christoph Hellwig
2022-02-27 17:21 ` [PATCH 12/14] block: move blk_exit_queue into disk_release Christoph Hellwig
2022-02-27 17:21 ` [PATCH 13/14] block: do more work in elevator_exit Christoph Hellwig
2022-02-27 17:21 ` [PATCH 14/14] block: move rq_qos_exit() into disk_release() Christoph Hellwig
2022-02-27 23:18 ` move more work to disk_release v2 Bart Van Assche
2022-03-01 12:56   ` Christoph Hellwig
2022-03-02  5:05     ` Bart Van Assche
2022-03-02 15:03       ` Christoph Hellwig
2022-03-02 23:33         ` Bart Van Assche
2022-03-03 10:54           ` Christoph Hellwig
2022-03-03 18:19             ` Bart Van Assche
2022-03-03 19:23               ` Christoph Hellwig
2022-03-03 20:51                 ` Bart Van Assche
2022-03-01 13:00 ` Christoph Hellwig
