* [PATCH v5 0/3] blk-mq: Fix a race between iterating over requests and freeing requests
@ 2021-04-05  0:28 Bart Van Assche
  2021-04-05  0:28 ` [PATCH v5 1/3] blk-mq: Move the elevator_exit() definition Bart Van Assche
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Bart Van Assche @ 2021-04-05  0:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche

Hi Jens,

This patch series fixes the race between iterating over requests and
freeing requests that has been reported by multiple different users over
the past two years. Please consider this patch series for kernel v5.13.

Thank you,

Bart.

Changes between v4 and v5:
- Addressed Khazhy's review comments. Note: the changes made in v5 only
  affect behavior when CONFIG_PROVE_RCU=y.

Changes between v3 and v4:
- Fixed support for tag sets shared across hardware queues.
- Renamed blk_mq_wait_for_tag_readers() to blk_mq_wait_for_tag_iter().
- Removed the fourth argument of blk_mq_queue_tag_busy_iter() again.

Changes between v2 and v3:
- Converted the single v2 patch into a series of three patches.
- Switched from SRCU to a combination of RCU and semaphores.

Changes between v1 and v2:
- Reformatted patch description.
- Added Tested-by/Reviewed-by tags.
- Changed srcu_barrier() calls into synchronize_srcu() calls.

Bart Van Assche (3):
  blk-mq: Move the elevator_exit() definition
  blk-mq: Introduce atomic variants of the tag iteration functions
  blk-mq: Fix a race between iterating over requests and freeing
    requests

 block/blk-core.c          | 34 ++++++++++++++++-
 block/blk-mq-tag.c        | 79 ++++++++++++++++++++++++++++++++++-----
 block/blk-mq-tag.h        |  6 ++-
 block/blk-mq.c            | 23 +++++++++---
 block/blk-mq.h            |  1 +
 block/blk.h               | 11 +-----
 block/elevator.c          |  9 +++++
 drivers/scsi/hosts.c      | 16 ++++----
 drivers/scsi/ufs/ufshcd.c |  4 +-
 include/linux/blk-mq.h    |  2 +
 10 files changed, 149 insertions(+), 36 deletions(-)



* [PATCH v5 1/3] blk-mq: Move the elevator_exit() definition
  2021-04-05  0:28 [PATCH v5 0/3] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
@ 2021-04-05  0:28 ` Bart Van Assche
  2021-04-05  0:28 ` [PATCH v5 2/3] blk-mq: Introduce atomic variants of the tag iteration functions Bart Van Assche
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2021-04-05  0:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Shin'ichiro Kawasaki, Ming Lei,
	Hannes Reinecke, Johannes Thumshirn, John Garry, Khazhy Kumykov

Since elevator_exit() has only one caller, move its definition from
block/blk.h into block/elevator.c. Remove the inline keyword since modern
compilers are smart enough to decide when to inline functions that occur
in the same compilation unit.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Khazhy Kumykov <khazhy@google.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk.h      | 9 ---------
 block/elevator.c | 8 ++++++++
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 3b53e44b967e..e0a4a7577f6c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -198,15 +198,6 @@ void __elevator_exit(struct request_queue *, struct elevator_queue *);
 int elv_register_queue(struct request_queue *q, bool uevent);
 void elv_unregister_queue(struct request_queue *q);
 
-static inline void elevator_exit(struct request_queue *q,
-		struct elevator_queue *e)
-{
-	lockdep_assert_held(&q->sysfs_lock);
-
-	blk_mq_sched_free_requests(q);
-	__elevator_exit(q, e);
-}
-
 ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
 		char *buf);
 ssize_t part_stat_show(struct device *dev, struct device_attribute *attr,
diff --git a/block/elevator.c b/block/elevator.c
index 293c5c81397a..4b20d1ab29cc 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -197,6 +197,14 @@ void __elevator_exit(struct request_queue *q, struct elevator_queue *e)
 	kobject_put(&e->kobj);
 }
 
+static void elevator_exit(struct request_queue *q, struct elevator_queue *e)
+{
+	lockdep_assert_held(&q->sysfs_lock);
+
+	blk_mq_sched_free_requests(q);
+	__elevator_exit(q, e);
+}
+
 static inline void __elv_rqhash_del(struct request *rq)
 {
 	hash_del(&rq->hash);


* [PATCH v5 2/3] blk-mq: Introduce atomic variants of the tag iteration functions
  2021-04-05  0:28 [PATCH v5 0/3] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
  2021-04-05  0:28 ` [PATCH v5 1/3] blk-mq: Move the elevator_exit() definition Bart Van Assche
@ 2021-04-05  0:28 ` Bart Van Assche
  2021-04-05  0:28 ` [PATCH v5 3/3] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
  2021-04-06  8:00 ` [PATCH v5 0/3] " John Garry
  3 siblings, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2021-04-05  0:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Shin'ichiro Kawasaki, Ming Lei,
	Hannes Reinecke, Johannes Thumshirn, John Garry, Khazhy Kumykov

The next patch needs to know whether or not sleeping is allowed inside the
tag iteration functions, so pass this context information to these
functions. I have reviewed all callers of the tag iteration functions to
verify these annotations, starting from the output of the following grep
command:

    git grep -nHE 'blk_mq_(all_tag|tagset_busy)_iter'

My conclusions from that analysis are as follows:
- Sleeping is allowed in the blk-mq-debugfs code that iterates over tags.
- Since the blk_mq_tagset_busy_iter() calls in the mtip32xx driver are
  preceded by a function that sleeps (blk_mq_quiesce_queue()), sleeping is
  safe in the context of the blk_mq_tagset_busy_iter() calls.
- The same reasoning also applies to the nbd driver.
- All blk_mq_tagset_busy_iter() calls in the NVMe drivers are followed by a
  call to a function that sleeps, so sleeping inside blk_mq_tagset_busy_iter()
  when called from the NVMe drivers is fine.
- scsi_host_busy(), scsi_host_complete_all_commands() and
  scsi_host_busy_iter() are used by multiple SCSI LLDs, so analyzing whether
  or not these functions may sleep is hard. Instead of performing that
  analysis, make it safe to call these functions from atomic context.
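
As an illustration of how the split is intended to be used (a hypothetical
example, not part of this patch; count_busy() and example_isr() are made-up
names), code that runs in a context that must not sleep, e.g. an interrupt
handler, has to use the new _atomic variant:

    static bool count_busy(struct request *rq, void *priv, bool reserved)
    {
    	/* Must not sleep: may be invoked from atomic context. */
    	(*(unsigned int *)priv)++;
    	return true;
    }

    static irqreturn_t example_isr(int irq, void *dev_id)
    {
    	struct blk_mq_tag_set *set = dev_id;	/* hypothetical dev_id */
    	unsigned int busy = 0;

    	/* Safe in interrupt context because the _atomic variant never sleeps. */
    	blk_mq_tagset_busy_iter_atomic(set, count_busy, &busy);
    	return busy ? IRQ_HANDLED : IRQ_NONE;
    }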

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Khazhy Kumykov <khazhy@google.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq-tag.c        | 38 +++++++++++++++++++++++++++++++++-----
 block/blk-mq-tag.h        |  2 +-
 block/blk-mq.c            |  2 +-
 drivers/scsi/hosts.c      | 16 ++++++++--------
 drivers/scsi/ufs/ufshcd.c |  4 ++--
 include/linux/blk-mq.h    |  2 ++
 6 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index e5bfecf2940d..116c3691b104 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -322,18 +322,19 @@ static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags,
 }
 
 /**
- * blk_mq_all_tag_iter - iterate over all requests in a tag map
+ * blk_mq_all_tag_iter_atomic - iterate over all requests in a tag map
  * @tags:	Tag map to iterate over.
  * @fn:		Pointer to the function that will be called for each
  *		request. @fn will be called as follows: @fn(rq, @priv,
  *		reserved) where rq is a pointer to a request. 'reserved'
  *		indicates whether or not @rq is a reserved request. Return
- *		true to continue iterating tags, false to stop.
+ *		true to continue iterating tags, false to stop. Must not
+ *		sleep.
  * @priv:	Will be passed as second argument to @fn.
  *
- * Caller has to pass the tag map from which requests are allocated.
+ * Does not sleep.
  */
-void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+void blk_mq_all_tag_iter_atomic(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 		void *priv)
 {
 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
@@ -348,6 +349,8 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
  *		indicates whether or not @rq is a reserved request. Return
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
+ *
+ * May sleep.
  */
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
@@ -362,6 +365,31 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
+/**
+ * blk_mq_tagset_busy_iter_atomic - iterate over all started requests in a tag set
+ * @tagset:	Tag set to iterate over.
+ * @fn:		Pointer to the function that will be called for each started
+ *		request. @fn will be called as follows: @fn(rq, @priv,
+ *		reserved) where rq is a pointer to a request. 'reserved'
+ *		indicates whether or not @rq is a reserved request. Return
+ *		true to continue iterating tags, false to stop. Must not sleep.
+ * @priv:	Will be passed as second argument to @fn.
+ *
+ * Does not sleep.
+ */
+void blk_mq_tagset_busy_iter_atomic(struct blk_mq_tag_set *tagset,
+		busy_tag_iter_fn *fn, void *priv)
+{
+	int i;
+
+	for (i = 0; i < tagset->nr_hw_queues; i++) {
+		if (tagset->tags && tagset->tags[i])
+			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
+					      BT_TAG_ITER_STARTED);
+	}
+}
+EXPORT_SYMBOL(blk_mq_tagset_busy_iter_atomic);
+
 static bool blk_mq_tagset_count_completed_rqs(struct request *rq,
 		void *data, bool reserved)
 {
@@ -384,7 +412,7 @@ void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
 	while (true) {
 		unsigned count = 0;
 
-		blk_mq_tagset_busy_iter(tagset,
+		blk_mq_tagset_busy_iter_atomic(tagset,
 				blk_mq_tagset_count_completed_rqs, &count);
 		if (!count)
 			break;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 7d3e6b333a4a..0290c308ece9 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -43,7 +43,7 @@ extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
-void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+void blk_mq_all_tag_iter_atomic(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..5b170faa6318 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2483,7 +2483,7 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
 		.hctx	= hctx,
 	};
 
-	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
+	blk_mq_all_tag_iter_atomic(tags, blk_mq_has_request, &data);
 	return data.has_rq;
 }
 
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 2f162603876f..f09e1520a241 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -578,8 +578,8 @@ int scsi_host_busy(struct Scsi_Host *shost)
 {
 	int cnt = 0;
 
-	blk_mq_tagset_busy_iter(&shost->tag_set,
-				scsi_host_check_in_flight, &cnt);
+	blk_mq_tagset_busy_iter_atomic(&shost->tag_set,
+				       scsi_host_check_in_flight, &cnt);
 	return cnt;
 }
 EXPORT_SYMBOL(scsi_host_busy);
@@ -677,8 +677,8 @@ static bool complete_all_cmds_iter(struct request *rq, void *data, bool rsvd)
  */
 void scsi_host_complete_all_commands(struct Scsi_Host *shost, int status)
 {
-	blk_mq_tagset_busy_iter(&shost->tag_set, complete_all_cmds_iter,
-				&status);
+	blk_mq_tagset_busy_iter_atomic(&shost->tag_set, complete_all_cmds_iter,
+				       &status);
 }
 EXPORT_SYMBOL_GPL(scsi_host_complete_all_commands);
 
@@ -699,11 +699,11 @@ static bool __scsi_host_busy_iter_fn(struct request *req, void *priv,
 /**
  * scsi_host_busy_iter - Iterate over all busy commands
  * @shost:	Pointer to Scsi_Host.
- * @fn:		Function to call on each busy command
+ * @fn:		Function to call on each busy command. Must not sleep.
  * @priv:	Data pointer passed to @fn
  *
  * If locking against concurrent command completions is required
- * ithas to be provided by the caller
+ * it has to be provided by the caller.
  **/
 void scsi_host_busy_iter(struct Scsi_Host *shost,
 			 bool (*fn)(struct scsi_cmnd *, void *, bool),
@@ -714,7 +714,7 @@ void scsi_host_busy_iter(struct Scsi_Host *shost,
 		.priv = priv,
 	};
 
-	blk_mq_tagset_busy_iter(&shost->tag_set, __scsi_host_busy_iter_fn,
-				&iter_data);
+	blk_mq_tagset_busy_iter_atomic(&shost->tag_set,
+				       __scsi_host_busy_iter_fn, &iter_data);
 }
 EXPORT_SYMBOL_GPL(scsi_host_busy_iter);
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index c86760788c72..6d2f8f18e2a3 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -1380,7 +1380,7 @@ static bool ufshcd_any_tag_in_use(struct ufs_hba *hba)
 	struct request_queue *q = hba->cmd_queue;
 	int busy = 0;
 
-	blk_mq_tagset_busy_iter(q->tag_set, ufshcd_is_busy, &busy);
+	blk_mq_tagset_busy_iter_atomic(q->tag_set, ufshcd_is_busy, &busy);
 	return busy;
 }
 
@@ -6269,7 +6269,7 @@ static irqreturn_t ufshcd_tmc_handler(struct ufs_hba *hba)
 		.pending = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL),
 	};
 
-	blk_mq_tagset_busy_iter(q->tag_set, ufshcd_compl_tm, &ci);
+	blk_mq_tagset_busy_iter_atomic(q->tag_set, ufshcd_compl_tm, &ci);
 	return ci.ncpl ? IRQ_HANDLED : IRQ_NONE;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b8990..dfa0114a49fd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -526,6 +526,8 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
+void blk_mq_tagset_busy_iter_atomic(struct blk_mq_tag_set *tagset,
+		busy_tag_iter_fn *fn, void *priv);
 void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);


* [PATCH v5 3/3] blk-mq: Fix a race between iterating over requests and freeing requests
  2021-04-05  0:28 [PATCH v5 0/3] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
  2021-04-05  0:28 ` [PATCH v5 1/3] blk-mq: Move the elevator_exit() definition Bart Van Assche
  2021-04-05  0:28 ` [PATCH v5 2/3] blk-mq: Introduce atomic variants of the tag iteration functions Bart Van Assche
@ 2021-04-05  0:28 ` Bart Van Assche
  2021-04-05 18:08   ` Khazhy Kumykov
  2021-04-06  8:00 ` [PATCH v5 0/3] " John Garry
  3 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-04-05  0:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Shin'ichiro Kawasaki, Ming Lei,
	Hannes Reinecke, Johannes Thumshirn, John Garry, Khazhy Kumykov

When multiple request queues share a tag set and when switching the I/O
scheduler for one of the request queues associated with that tag set, the
following race can happen:
* blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
  a pointer to a scheduler request to the local variable 'rq'.
* blk_mq_sched_free_requests() is called to free hctx->sched_tags.
* blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.

Fix this race as follows:
* Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
  The performance impact of the assignments added to the hot path is minimal
  (about 1% according to one particular test).
* Protect hctx->tags->rqs[] reads with an RCU read-side lock or with a
  semaphore. Which mechanism is used depends on whether or not the calling
  context is allowed to sleep and on whether or not the callback function
  may sleep.
* Wait for all preexisting readers to finish before freeing scheduler
  requests.
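
The sketch below condenses how these three mechanisms fit together. It is
illustrative only: the helper names iter_one_tag() and wait_for_tag_iter()
are placeholders, and the real implementations are in the patch that follows.

    /* Reader side, run for each set bit in the tag bitmap: */
    static bool iter_one_tag(struct blk_mq_tags *tags, unsigned int tag,
    			 busy_tag_iter_fn *fn, void *priv, bool may_sleep)
    {
    	struct request *rq;
    	bool ret = true;

    	if (may_sleep)
    		down_read(&tags->iter_rwsem);
    	else
    		rcu_read_lock();
    	rq = rcu_dereference_check(tags->rqs[tag],
    				   lockdep_is_held(&tags->iter_rwsem));
    	if (rq)
    		ret = fn(rq, priv, false);
    	if (may_sleep)
    		up_read(&tags->iter_rwsem);
    	else
    		rcu_read_unlock();
    	return ret;
    }

    /* Writer side, called before scheduler requests are freed: */
    static void wait_for_tag_iter(struct blk_mq_tags *tags)
    {
    	down_write(&tags->iter_rwsem);	/* wait for sleeping readers */
    	up_write(&tags->iter_rwsem);
    	synchronize_rcu();		/* wait for readers in RCU read-side sections */
    }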

Multiple users have reported use-after-free complaints similar to the
following (from https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/ ):

BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
Read of size 8 at addr ffff88803b335240 by task fio/21412

CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 dump_stack+0x86/0xca
 print_address_description+0x71/0x239
 kasan_report.cold.5+0x242/0x301
 __asan_load8+0x54/0x90
 bt_iter+0x86/0xf0
 blk_mq_queue_tag_busy_iter+0x373/0x5e0
 blk_mq_in_flight+0x96/0xb0
 part_in_flight+0x40/0x140
 part_round_stats+0x18e/0x370
 blk_account_io_start+0x3d7/0x670
 blk_mq_bio_to_request+0x19c/0x3a0
 blk_mq_make_request+0x7a9/0xcb0
 generic_make_request+0x41d/0x960
 submit_bio+0x9b/0x250
 do_blockdev_direct_IO+0x435c/0x4c70
 __blockdev_direct_IO+0x79/0x88
 ext4_direct_IO+0x46c/0xc00
 generic_file_direct_write+0x119/0x210
 __generic_file_write_iter+0x11c/0x280
 ext4_file_write_iter+0x1b8/0x6f0
 aio_write+0x204/0x310
 io_submit_one+0x9d3/0xe80
 __x64_sys_io_submit+0x115/0x340
 do_syscall_64+0x71/0x210

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Khazhy Kumykov <khazhy@google.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-core.c   | 34 +++++++++++++++++++++++++++++++++-
 block/blk-mq-tag.c | 41 +++++++++++++++++++++++++++++++++++++----
 block/blk-mq-tag.h |  4 +++-
 block/blk-mq.c     | 21 +++++++++++++++++----
 block/blk-mq.h     |  1 +
 block/blk.h        |  2 ++
 block/elevator.c   |  1 +
 7 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..400537dde675 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -279,6 +279,36 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
 }
 EXPORT_SYMBOL(blk_dump_rq_flags);
 
+/**
+ * blk_mq_wait_for_tag_iter - wait for preexisting __blk_mq_all_tag_iter() calls to finish
+ * @set: Tag set to wait on.
+ *
+ * Waits for preexisting __blk_mq_all_tag_iter() calls to finish accessing
+ * hctx->tags->rqs[]. New readers may start while this function is in progress
+ * or after this function has returned. Use this function to make sure that
+ * hctx->tags->rqs[] changes have become globally visible.
+ *
+ * Accesses of hctx->tags->rqs[] by blk_mq_queue_tag_busy_iter() calls are out
+ * of scope for this function. The caller can pause blk_mq_queue_tag_busy_iter()
+ * calls for a request queue by freezing that request queue.
+ */
+void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set)
+{
+	struct blk_mq_tags *tags;
+	int i;
+
+	if (set->tags) {
+		for (i = 0; i < set->nr_hw_queues; i++) {
+			tags = set->tags[i];
+			if (!tags)
+				continue;
+			down_write(&tags->iter_rwsem);
+			up_write(&tags->iter_rwsem);
+		}
+	}
+	synchronize_rcu();
+}
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
@@ -412,8 +442,10 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * it is safe to free requests now.
 	 */
 	mutex_lock(&q->sysfs_lock);
-	if (q->elevator)
+	if (q->elevator) {
+		blk_mq_wait_for_tag_iter(q->tag_set);
 		blk_mq_sched_free_requests(q);
+	}
 	mutex_unlock(&q->sysfs_lock);
 
 	percpu_ref_exit(&q->q_usage_counter);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 116c3691b104..e7a6a114d216 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -209,7 +209,11 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
+	/*
+	 * Protected by rq->q->q_usage_counter. See also
+	 * blk_mq_queue_tag_busy_iter().
+	 */
+	rq = rcu_dereference_check(tags->rqs[bitnr], true);
 
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
@@ -254,11 +258,17 @@ struct bt_tags_iter_data {
 	unsigned int flags;
 };
 
+/* Include reserved tags. */
 #define BT_TAG_ITER_RESERVED		(1 << 0)
+/* Only include started requests. */
 #define BT_TAG_ITER_STARTED		(1 << 1)
+/* Iterate over tags->static_rqs[] instead of tags->rqs[]. */
 #define BT_TAG_ITER_STATIC_RQS		(1 << 2)
+/* The callback function may sleep. */
+#define BT_TAG_ITER_MAY_SLEEP		(1 << 3)
 
-static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
+static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
+			   void *data)
 {
 	struct bt_tags_iter_data *iter_data = data;
 	struct blk_mq_tags *tags = iter_data->tags;
@@ -275,7 +285,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
 		rq = tags->static_rqs[bitnr];
 	else
-		rq = tags->rqs[bitnr];
+		rq = rcu_dereference_check(tags->rqs[bitnr],
+					   lockdep_is_held(&tags->iter_rwsem));
 	if (!rq)
 		return true;
 	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
@@ -284,6 +295,25 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	return iter_data->fn(rq, iter_data->data, reserved);
 }
 
+static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
+{
+	struct bt_tags_iter_data *iter_data = data;
+	struct blk_mq_tags *tags = iter_data->tags;
+	bool res;
+
+	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
+		down_read(&tags->iter_rwsem);
+		res = __bt_tags_iter(bitmap, bitnr, data);
+		up_read(&tags->iter_rwsem);
+	} else {
+		rcu_read_lock();
+		res = __bt_tags_iter(bitmap, bitnr, data);
+		rcu_read_unlock();
+	}
+
+	return res;
+}
+
 /**
  * bt_tags_for_each - iterate over the requests in a tag map
  * @tags:	Tag map to iterate over.
@@ -357,10 +387,12 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 {
 	int i;
 
+	might_sleep();
+
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
-					      BT_TAG_ITER_STARTED);
+				BT_TAG_ITER_STARTED | BT_TAG_ITER_MAY_SLEEP);
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
@@ -544,6 +576,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
+	init_rwsem(&tags->iter_rwsem);
 
 	if (flags & BLK_MQ_F_TAG_HCTX_SHARED)
 		return tags;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 0290c308ece9..d1d73d7cc7df 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -17,9 +17,11 @@ struct blk_mq_tags {
 	struct sbitmap_queue __bitmap_tags;
 	struct sbitmap_queue __breserved_tags;
 
-	struct request **rqs;
+	struct request __rcu **rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
+
+	struct rw_semaphore iter_rwsem;
 };
 
 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5b170faa6318..d6c9b655c0f5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -495,8 +495,10 @@ static void __blk_mq_free_request(struct request *rq)
 	blk_crypto_free_request(rq);
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != BLK_MQ_NO_TAG)
+	if (rq->tag != BLK_MQ_NO_TAG) {
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
+		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
+	}
 	if (sched_tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
@@ -838,9 +840,20 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
+	struct request *rq;
+
 	if (tag < tags->nr_tags) {
-		prefetch(tags->rqs[tag]);
-		return tags->rqs[tag];
+		/*
+		 * Freeing tags happens with the request queue frozen so the
+		 * rcu dereference below is protected by the request queue
+		 * usage count. We can only verify that usage count after
+		 * having read the request pointer.
+		 */
+		rq = rcu_dereference_check(tags->rqs[tag], true);
+		WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && rq &&
+			     percpu_ref_is_zero(&rq->q->q_usage_counter));
+		prefetch(rq);
+		return rq;
 	}
 
 	return NULL;
@@ -1111,7 +1124,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
 		rq->rq_flags |= RQF_MQ_INFLIGHT;
 		__blk_mq_inc_active_requests(hctx);
 	}
-	hctx->tags->rqs[rq->tag] = rq;
+	rcu_assign_pointer(hctx->tags->rqs[rq->tag], rq);
 	return true;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3616453ca28c..9ccb1818303b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
 	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
+	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
 	rq->tag = BLK_MQ_NO_TAG;
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
diff --git a/block/blk.h b/block/blk.h
index e0a4a7577f6c..85e0a59ef954 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -184,6 +184,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 void blk_account_io_start(struct request *req);
 void blk_account_io_done(struct request *req, u64 now);
 
+void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set);
+
 /*
  * Internal elevator interface
  */
diff --git a/block/elevator.c b/block/elevator.c
index 4b20d1ab29cc..70a10e31b336 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -201,6 +201,7 @@ static void elevator_exit(struct request_queue *q, struct elevator_queue *e)
 {
 	lockdep_assert_held(&q->sysfs_lock);
 
+	blk_mq_wait_for_tag_iter(q->tag_set);
 	blk_mq_sched_free_requests(q);
 	__elevator_exit(q, e);
 }


* Re: [PATCH v5 3/3] blk-mq: Fix a race between iterating over requests and freeing requests
  2021-04-05  0:28 ` [PATCH v5 3/3] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
@ 2021-04-05 18:08   ` Khazhy Kumykov
  2021-04-05 21:34     ` Bart Van Assche
  2021-04-05 21:34     ` Jens Axboe
  0 siblings, 2 replies; 12+ messages in thread
From: Khazhy Kumykov @ 2021-04-05 18:08 UTC (permalink / raw)
  To: Bart Van Assche, John Garry
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Martin K . Petersen,
	Shin'ichiro Kawasaki, Ming Lei, Hannes Reinecke,
	Johannes Thumshirn


On Sun, Apr 4, 2021 at 5:28 PM Bart Van Assche <bvanassche@acm.org> wrote:
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 116c3691b104..e7a6a114d216 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -209,7 +209,11 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>
>         if (!reserved)
>                 bitnr += tags->nr_reserved_tags;
> -       rq = tags->rqs[bitnr];
> +       /*
> +        * Protected by rq->q->q_usage_counter. See also
> +        * blk_mq_queue_tag_busy_iter().
> +        */
> +       rq = rcu_dereference_check(tags->rqs[bitnr], true);

maybe I'm missing something, but if this tags struct has a shared
sbitmap, what guarantees do we have that while iterating we won't
touch requests from a queue that's tearing down? The check a few lines
below suggests that at the least we may touch requests from a
different queue.

say we enter blk_mq_queue_tag_busy_iter, we're iterating with raised
hctx->q->q_usage_counter, and get to bt_iter

say tagset has 2 shared queues, hctx->q is q1, rq->q is q2
(thread 1)
rq = rcu_deref_check
(rq->q != hctx->q, but we don't know yet)

(thread 2)
elsewhere, blk_cleanup_queue(q2) runs (or elevator_exit), since we
only have raised q_usage_counter on q1, this goes to completion and
frees rq. if we have a preemptible kernel, thread 1 may be paused before we
read rq->q, so synchronize_rcu passes happily by

(thread 1)
we check rq && rq->q == hctx->q, use-after-free since rq was freed above

I think John Garry mentioned observing a similar race in patch 2 of
his series; perhaps his test case can verify this?

"Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
queue usage counter when called, and the queue cannot be frozen to switch
IO scheduler until all refs are dropped. This ensures no stale references
to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().

However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
run for another queue associated with the same tagset, and it seeing
a stale IO scheduler request from the other queue after they are freed."

so, to my understanding, we have a race between reading
tags->rqs[bitnr] and verifying that rq->q == hctx->q, where if we
schedule off we might have a use-after-free? If that's the case, perhaps
we should hold rcu_read_lock until we verify the queues match, then we can
release it and run fn(), as we've verified we no longer need it?
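
Something like this is what I have in mind (a rough, untested sketch of the
idea, only the ordering matters):

    static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
    {
    	struct bt_iter_data *iter_data = data;
    	struct blk_mq_hw_ctx *hctx = iter_data->hctx;
    	struct blk_mq_tags *tags = hctx->tags;
    	bool reserved = iter_data->reserved;
    	struct request *rq;
    	bool ours;

    	if (!reserved)
    		bitnr += tags->nr_reserved_tags;

    	/*
    	 * Hold the RCU read lock only across the dereference and the
    	 * ownership check. Once rq is known to belong to hctx->queue, the
    	 * q_usage_counter reference held by blk_mq_queue_tag_busy_iter()
    	 * keeps rq alive, so fn() can run outside the RCU section.
    	 */
    	rcu_read_lock();
    	rq = rcu_dereference(tags->rqs[bitnr]);
    	ours = rq && rq->q == hctx->queue && rq->mq_hctx == hctx;
    	rcu_read_unlock();

    	if (ours)
    		return iter_data->fn(hctx, rq, iter_data->data, reserved);
    	return true;
    }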

>
>         /*
>          * We can hit rq == NULL here, because the tagging functions
> @@ -254,11 +258,17 @@ struct bt_tags_iter_data {
>         unsigned int flags;
>  };
>

Thanks,
Khazhy



* Re: [PATCH v5 3/3] blk-mq: Fix a race between iterating over requests and freeing requests
  2021-04-05 18:08   ` Khazhy Kumykov
@ 2021-04-05 21:34     ` Bart Van Assche
  2021-04-06  0:24       ` Khazhy Kumykov
  2021-04-05 21:34     ` Jens Axboe
  1 sibling, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-04-05 21:34 UTC (permalink / raw)
  To: Khazhy Kumykov, John Garry
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Martin K . Petersen,
	Shin'ichiro Kawasaki, Ming Lei, Hannes Reinecke,
	Johannes Thumshirn

On 4/5/21 11:08 AM, Khazhy Kumykov wrote:
> On Sun, Apr 4, 2021 at 5:28 PM Bart Van Assche <bvanassche@acm.org> wrote:
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 116c3691b104..e7a6a114d216 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -209,7 +209,11 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>
>>         if (!reserved)
>>                 bitnr += tags->nr_reserved_tags;
>> -       rq = tags->rqs[bitnr];
>> +       /*
>> +        * Protected by rq->q->q_usage_counter. See also
>> +        * blk_mq_queue_tag_busy_iter().
>> +        */
>> +       rq = rcu_dereference_check(tags->rqs[bitnr], true);
> 
> maybe I'm missing something, but if this tags struct has a shared
> sbitmap, what guarantees do we have that while iterating we won't
> touch requests from a queue that's tearing down? The check a few lines
> below suggests that at the least we may touch requests from a
> different queue.
> 
> say we enter blk_mq_queue_tag_busy_iter, we're iterating with raised
> hctx->q->q_usage_counter, and get to bt_iter
> 
> say tagset has 2 shared queues, hctx->q is q1, rq->q is q2
> (thread 1)
> rq = rcu_deref_check
> (rq->q != hctx->q, but we don't know yet)
> 
> (thread 2)
> elsewhere, blk_cleanup_queue(q2) runs (or elevator_exit), since we
> only have raised q_usage_counter on q1, this goes to completion and
> frees rq. if we have preempt kernel, thread 1 may be paused before we
> read rq->q, so synchonrize_rcu passes happily by
> 
> (thread 1)
> we check rq && rq->q == hctx->q, use-after-free since rq was freed above
> 
> I think John Garry mentioned observing a similar race in patch 2 of
> his series, perhaps his test case can verify this?
> 
> "Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
> queue usage counter when called, and the queue cannot be frozen to switch
> IO scheduler until all refs are dropped. This ensures no stale references
> to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().
> 
> However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
> run for another queue associated with the same tagset, and it seeing
> a stale IO scheduler request from the other queue after they are freed."
> 
> so, to my understanding, we have a race between reading
> tags->rq[bitnr], and verifying that rq->q == hctx->q, where if we
> schedule off we might have use-after-free? If that's the case, perhaps
> we should rcu_read_lock until we verify the queues match, then we can
> release and run fn(), as we verified we no longer need it?

Hi Khazhy,

One possibility is indeed to protect the RCU dereference and the rq->q
read with an RCU reader lock. However, that would require an elaborate
comment since that would be a creative way to use RCU: protect the
request pointer dereference with an RCU reader lock until rq->q has been
tested and protect the remaining time during which rq is used with
q_usage_counter.

Another possibility is the patch below (needs to be split). That patch
protects the entire time during which rq is used by bt_iter() with either
an RCU reader lock or with the iter_rwsem semaphores. Do you perhaps have
a preference for one of these two approaches?

Thanks,

Bart.

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index e7a6a114d216..a997fc2aa2bc 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -209,12 +209,8 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)

 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	/*
-	 * Protected by rq->q->q_usage_counter. See also
-	 * blk_mq_queue_tag_busy_iter().
-	 */
-	rq = rcu_dereference_check(tags->rqs[bitnr], true);
-
+	rq = rcu_dereference_check(tags->rqs[bitnr],
+				   lockdep_is_held(&tags->iter_rwsem));
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
@@ -453,6 +449,63 @@ void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
 }
 EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);

+static void __blk_mq_queue_tag_busy_iter(struct request_queue *q,
+					 busy_iter_fn *fn, void *priv)
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		struct blk_mq_tags *tags = hctx->tags;
+
+		/*
+		 * If no software queues are currently mapped to this
+		 * hardware queue, there's nothing to check
+		 */
+		if (!blk_mq_hw_queue_mapped(hctx))
+			continue;
+
+		if (tags->nr_reserved_tags)
+			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
+		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
+	}
+}
+
+/**
+ * blk_mq_queue_tag_busy_iter_atomic - iterate over all requests with a driver tag
+ * @q:		Request queue to examine.
+ * @fn:		Pointer to the function that will be called for each request
+ *		on @q. @fn will be called as follows: @fn(hctx, rq, @priv,
+ *		reserved) where rq is a pointer to a request and hctx points
+ *		to the hardware queue associated with the request. 'reserved'
+ *		indicates whether or not @rq is a reserved request. @fn must
+ *		not sleep.
+ * @priv:	Will be passed as third argument to @fn.
+ *
+ * Note: if @q->tag_set is shared with other request queues then @fn will be
+ * called for all requests on all queues that share that tag set and not only
+ * for requests associated with @q.
+ *
+ * Does not sleep.
+ */
+void blk_mq_queue_tag_busy_iter_atomic(struct request_queue *q,
+				       busy_iter_fn *fn, void *priv)
+{
+	/*
+	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
+	 * while the queue is frozen. So we can use q_usage_counter to avoid
+	 * racing with it.
+	 */
+	if (!percpu_ref_tryget(&q->q_usage_counter))
+		return;
+
+	rcu_read_lock();
+	__blk_mq_queue_tag_busy_iter(q, fn, priv);
+	rcu_read_unlock();
+
+	blk_queue_exit(q);
+}
+
 /**
  * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
  * @q:		Request queue to examine.
@@ -466,13 +519,18 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
  * Note: if @q->tag_set is shared with other request queues then @fn will be
  * called for all requests on all queues that share that tag set and not only
  * for requests associated with @q.
+ *
+ * May sleep.
  */
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_tag_set *set = q->tag_set;
+	struct blk_mq_tags *tags;
 	int i;

+	might_sleep();
+
 	/*
 	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
 	 * while the queue is frozen. So we can use q_usage_counter to avoid
@@ -481,20 +539,19 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;

-	queue_for_each_hw_ctx(q, hctx, i) {
-		struct blk_mq_tags *tags = hctx->tags;
-
-		/*
-		 * If no software queues are currently mapped to this
-		 * hardware queue, there's nothing to check
-		 */
-		if (!blk_mq_hw_queue_mapped(hctx))
-			continue;

-		if (tags->nr_reserved_tags)
-			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
-		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
+	for (i = 0; i < set->nr_hw_queues; i++) {
+		tags = set->tags[i];
+		if (tags)
+			down_read(&tags->iter_rwsem);
 	}
+	__blk_mq_queue_tag_busy_iter(q, fn, priv);
+	for (i = set->nr_hw_queues - 1; i >= 0; i--) {
+		tags = set->tags[i];
+		if (tags)
+			up_read(&tags->iter_rwsem);
+	}
+
 	blk_queue_exit(q);
 }

@@ -576,7 +633,9 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,

 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
-	init_rwsem(&tags->iter_rwsem);
+	lockdep_register_key(&tags->iter_rwsem_key);
+	__init_rwsem(&tags->iter_rwsem, "tags->iter_rwsem",
+		     &tags->iter_rwsem_key);

 	if (flags & BLK_MQ_F_TAG_HCTX_SHARED)
 		return tags;
@@ -594,6 +653,7 @@ void blk_mq_free_tags(struct blk_mq_tags *tags, unsigned int flags)
 		sbitmap_queue_free(tags->bitmap_tags);
 		sbitmap_queue_free(tags->breserved_tags);
 	}
+	lockdep_unregister_key(&tags->iter_rwsem_key);
 	kfree(tags);
 }

diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index d1d73d7cc7df..e37f219bd36a 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -22,6 +22,7 @@ struct blk_mq_tags {
 	struct list_head page_list;

 	struct rw_semaphore iter_rwsem;
+	struct lock_class_key iter_rwsem_key;
 };

 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
@@ -43,6 +44,8 @@ extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
 					     unsigned int size);

 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
+void blk_mq_queue_tag_busy_iter_atomic(struct request_queue *q,
+		busy_iter_fn *fn, void *priv);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
 void blk_mq_all_tag_iter_atomic(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d6c9b655c0f5..f5e1ace273e2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -117,7 +117,7 @@ unsigned int blk_mq_in_flight(struct request_queue *q,
 {
 	struct mq_inflight mi = { .part = part };

-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+	blk_mq_queue_tag_busy_iter_atomic(q, blk_mq_check_inflight, &mi);

 	return mi.inflight[0] + mi.inflight[1];
 }
@@ -127,7 +127,7 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
 {
 	struct mq_inflight mi = { .part = part };

-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+	blk_mq_queue_tag_busy_iter_atomic(q, blk_mq_check_inflight, &mi);
 	inflight[0] = mi.inflight[0];
 	inflight[1] = mi.inflight[1];
 }
@@ -881,7 +881,7 @@ bool blk_mq_queue_inflight(struct request_queue *q)
 {
 	bool busy = false;

-	blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy);
+	blk_mq_queue_tag_busy_iter_atomic(q, blk_mq_rq_inflight, &busy);
 	return busy;
 }
 EXPORT_SYMBOL_GPL(blk_mq_queue_inflight);


* Re: [PATCH v5 3/3] blk-mq: Fix a race between iterating over requests and freeing requests
  2021-04-05 18:08   ` Khazhy Kumykov
  2021-04-05 21:34     ` Bart Van Assche
@ 2021-04-05 21:34     ` Jens Axboe
  2021-04-06  1:06       ` Bart Van Assche
  1 sibling, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2021-04-05 21:34 UTC (permalink / raw)
  To: Khazhy Kumykov, Bart Van Assche, John Garry
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen,
	Shin'ichiro Kawasaki, Ming Lei, Hannes Reinecke,
	Johannes Thumshirn

On 4/5/21 12:08 PM, Khazhy Kumykov wrote:
> On Sun, Apr 4, 2021 at 5:28 PM Bart Van Assche <bvanassche@acm.org> wrote:
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 116c3691b104..e7a6a114d216 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -209,7 +209,11 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>
>>         if (!reserved)
>>                 bitnr += tags->nr_reserved_tags;
>> -       rq = tags->rqs[bitnr];
>> +       /*
>> +        * Protected by rq->q->q_usage_counter. See also
>> +        * blk_mq_queue_tag_busy_iter().
>> +        */
>> +       rq = rcu_dereference_check(tags->rqs[bitnr], true);
> 
> maybe I'm missing something, but if this tags struct has a shared
> sbitmap, what guarantees do we have that while iterating we won't
> touch requests from a queue that's tearing down? The check a few lines
> below suggests that at the least we may touch requests from a
> different queue.
> 
> say we enter blk_mq_queue_tag_busy_iter, we're iterating with raised
> hctx->q->q_usage_counter, and get to bt_iter
> 
> say tagset has 2 shared queues, hctx->q is q1, rq->q is q2
> (thread 1)
> rq = rcu_deref_check
> (rq->q != hctx->q, but we don't know yet)
> 
> (thread 2)
> elsewhere, blk_cleanup_queue(q2) runs (or elevator_exit), since we
> only have raised q_usage_counter on q1, this goes to completion and
> frees rq. if we have preempt kernel, thread 1 may be paused before we
> read rq->q, so synchonrize_rcu passes happily by
> 
> (thread 1)
> we check rq && rq->q == hctx->q, use-after-free since rq was freed above
> 
> I think John Garry mentioned observing a similar race in patch 2 of
> his series, perhaps his test case can verify this?
> 
> "Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
> queue usage counter when called, and the queue cannot be frozen to switch
> IO scheduler until all refs are dropped. This ensures no stale references
> to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().
> 
> However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
> run for another queue associated with the same tagset, and it seeing
> a stale IO scheduler request from the other queue after they are freed."
> 
> so, to my understanding, we have a race between reading
> tags->rq[bitnr], and verifying that rq->q == hctx->q, where if we
> schedule off we might have use-after-free? If that's the case, perhaps
> we should rcu_read_lock until we verify the queues match, then we can
> release and run fn(), as we verified we no longer need it?

For something out of left field, we can check if the page that the rq
belongs to is still part of the tag set. If it isn't, then don't
deref it.

Totally untested.

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index e5bfecf2940d..6209c465e884 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -196,9 +196,35 @@ struct bt_iter_data {
 	struct blk_mq_hw_ctx *hctx;
 	busy_iter_fn *fn;
 	void *data;
+	struct page *last_lookup;
 	bool reserved;
 };
 
+static bool rq_from_queue(struct bt_iter_data *iter_data, struct request *rq)
+{
+	struct blk_mq_hw_ctx *hctx = iter_data->hctx;
+	struct page *rq_page, *page;
+
+	/*
+	 * We can hit rq == NULL here, because the tagging functions
+	 * test and set the bit before assigning ->rqs[].
+	 */
+	if (!rq)
+		return false;
+	rq_page = virt_to_page(rq);
+	if (rq_page == iter_data->last_lookup)
+		goto check_queue;
+	list_for_each_entry(page, &hctx->tags->page_list, lru) {
+		if (page == rq_page) {
+			iter_data->last_lookup = page;
+			goto check_queue;
+		}
+	}
+	return false;
+check_queue:
+	return rq->q == hctx->queue && rq->mq_hctx == hctx;
+}
+
 static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 {
 	struct bt_iter_data *iter_data = data;
@@ -211,11 +237,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 		bitnr += tags->nr_reserved_tags;
 	rq = tags->rqs[bitnr];
 
-	/*
-	 * We can hit rq == NULL here, because the tagging functions
-	 * test and set the bit before assigning ->rqs[].
-	 */
-	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
+	if (rq_from_queue(iter_data, rq))
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }

-- 
Jens Axboe



* Re: [PATCH v5 3/3] blk-mq: Fix a race between iterating over requests and freeing requests
  2021-04-05 21:34     ` Bart Van Assche
@ 2021-04-06  0:24       ` Khazhy Kumykov
  0 siblings, 0 replies; 12+ messages in thread
From: Khazhy Kumykov @ 2021-04-06  0:24 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: John Garry, Jens Axboe, linux-block, Christoph Hellwig,
	Martin K . Petersen, Shin'ichiro Kawasaki, Ming Lei,
	Hannes Reinecke, Johannes Thumshirn


On Mon, Apr 5, 2021 at 2:34 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 4/5/21 11:08 AM, Khazhy Kumykov wrote:
> > On Sun, Apr 4, 2021 at 5:28 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> >> index 116c3691b104..e7a6a114d216 100644
> >> --- a/block/blk-mq-tag.c
> >> +++ b/block/blk-mq-tag.c
> >> @@ -209,7 +209,11 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >>
> >>         if (!reserved)
> >>                 bitnr += tags->nr_reserved_tags;
> >> -       rq = tags->rqs[bitnr];
> >> +       /*
> >> +        * Protected by rq->q->q_usage_counter. See also
> >> +        * blk_mq_queue_tag_busy_iter().
> >> +        */
> >> +       rq = rcu_dereference_check(tags->rqs[bitnr], true);
> >
> > maybe I'm missing something, but if this tags struct has a shared
> > sbitmap, what guarantees do we have that while iterating we won't
> > touch requests from a queue that's tearing down? The check a few lines
> > below suggests that at the least we may touch requests from a
> > different queue.
> >
> > say we enter blk_mq_queue_tag_busy_iter, we're iterating with raised
> > hctx->q->q_usage_counter, and get to bt_iter
> >
> > say tagset has 2 shared queues, hctx->q is q1, rq->q is q2
> > (thread 1)
> > rq = rcu_deref_check
> > (rq->q != hctx->q, but we don't know yet)
> >
> > (thread 2)
> > elsewhere, blk_cleanup_queue(q2) runs (or elevator_exit), since we
> > only have raised q_usage_counter on q1, this goes to completion and
> > frees rq. if we have preempt kernel, thread 1 may be paused before we
> > read rq->q, so synchonrize_rcu passes happily by
> >
> > (thread 1)
> > we check rq && rq->q == hctx->q, use-after-free since rq was freed above
> >
> > I think John Garry mentioned observing a similar race in patch 2 of
> > his series, perhaps his test case can verify this?
> >
> > "Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
> > queue usage counter when called, and the queue cannot be frozen to switch
> > IO scheduler until all refs are dropped. This ensures no stale references
> > to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().
> >
> > However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
> > run for another queue associated with the same tagset, and it seeing
> > a stale IO scheduler request from the other queue after they are freed."
> >
> > so, to my understanding, we have a race between reading
> > tags->rq[bitnr], and verifying that rq->q == hctx->q, where if we
> > schedule off we might have use-after-free? If that's the case, perhaps
> > we should rcu_read_lock until we verify the queues match, then we can
> > release and run fn(), as we verified we no longer need it?
>
> Hi Khazhy,
>
> One possibility is indeed to protect the RCU dereference and the rq->q
> read with an RCU reader lock. However, that would require an elaborate
> comment since that would be a creative way to use RCU: protect the
> request pointer dereference with an RCU reader lock until rq->q has been
> tested and protect the remaining time during which rq is used with
> q_usage_counter.
>
> Another possibility is the patch below (needs to be split). That patch
> protects the entire time during which rq is used by bt_iter() with either
> an RCU reader lock or with the iter_rwsem semaphores. Do you perhaps have
> a preference for one of these two approaches?

to my eye the "creative" rcu_read_lock would be unusual, but it should
require not much more justification than an rcu_dereference_check
without a read lock; I would defer to those more experienced on what
constitutes smelly code :) Part of why I suggested this is that it
seems we can avoid needing to define and reason about an _atomic()
variant, and it results in a smaller critical section (though this isn't
the hot path)





>
> Thanks,
>
> Bart.
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index e7a6a114d216..a997fc2aa2bc 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -209,12 +209,8 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>
>         if (!reserved)
>                 bitnr += tags->nr_reserved_tags;
> -       /*
> -        * Protected by rq->q->q_usage_counter. See also
> -        * blk_mq_queue_tag_busy_iter().
> -        */
> -       rq = rcu_dereference_check(tags->rqs[bitnr], true);
> -
> +       rq = rcu_dereference_check(tags->rqs[bitnr],
> +                                  lockdep_is_held(&tags->iter_rwsem));
>         /*
>          * We can hit rq == NULL here, because the tagging functions
>          * test and set the bit before assigning ->rqs[].
> @@ -453,6 +449,63 @@ void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
>  }
>  EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
>
> +static void __blk_mq_queue_tag_busy_iter(struct request_queue *q,
> +                                        busy_iter_fn *fn, void *priv)
> +{
> +       struct blk_mq_hw_ctx *hctx;
> +       int i;
> +
> +       queue_for_each_hw_ctx(q, hctx, i) {
> +               struct blk_mq_tags *tags = hctx->tags;
> +
> +               /*
> +                * If no software queues are currently mapped to this
> +                * hardware queue, there's nothing to check
> +                */
> +               if (!blk_mq_hw_queue_mapped(hctx))
> +                       continue;
> +
> +               if (tags->nr_reserved_tags)
> +                       bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
> +               bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
> +       }
> +}
> +
> +/**
> + * blk_mq_queue_tag_busy_iter_atomic - iterate over all requests with a driver tag
> + * @q:         Request queue to examine.
> + * @fn:                Pointer to the function that will be called for each request
> + *             on @q. @fn will be called as follows: @fn(hctx, rq, @priv,
> + *             reserved) where rq is a pointer to a request and hctx points
> + *             to the hardware queue associated with the request. 'reserved'
> + *             indicates whether or not @rq is a reserved request. @fn must
> + *             not sleep.
> + * @priv:      Will be passed as third argument to @fn.
> + *
> + * Note: if @q->tag_set is shared with other request queues then @fn will be
> + * called for all requests on all queues that share that tag set and not only
> + * for requests associated with @q.
> + *
> + * Does not sleep.
> + */
> +void blk_mq_queue_tag_busy_iter_atomic(struct request_queue *q,
> +                                      busy_iter_fn *fn, void *priv)
> +{
> +       /*
> +        * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
> +        * while the queue is frozen. So we can use q_usage_counter to avoid
> +        * racing with it.
> +        */
> +       if (!percpu_ref_tryget(&q->q_usage_counter))
> +               return;
> +
> +       rcu_read_lock();
> +       __blk_mq_queue_tag_busy_iter(q, fn, priv);
> +       rcu_read_unlock();
> +
> +       blk_queue_exit(q);
> +}
> +
>  /**
>   * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
>   * @q:         Request queue to examine.
> @@ -466,13 +519,18 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
>   * Note: if @q->tag_set is shared with other request queues then @fn will be
>   * called for all requests on all queues that share that tag set and not only
>   * for requests associated with @q.
> + *
> + * May sleep.
>   */
>  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>                 void *priv)
>  {
> -       struct blk_mq_hw_ctx *hctx;
> +       struct blk_mq_tag_set *set = q->tag_set;
> +       struct blk_mq_tags *tags;
>         int i;
>
> +       might_sleep();
> +
>         /*
>          * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
>          * while the queue is frozen. So we can use q_usage_counter to avoid
> @@ -481,20 +539,19 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>         if (!percpu_ref_tryget(&q->q_usage_counter))
>                 return;
>
> -       queue_for_each_hw_ctx(q, hctx, i) {
> -               struct blk_mq_tags *tags = hctx->tags;
> -
> -               /*
> -                * If no software queues are currently mapped to this
> -                * hardware queue, there's nothing to check
> -                */
> -               if (!blk_mq_hw_queue_mapped(hctx))
> -                       continue;
>
> -               if (tags->nr_reserved_tags)
> -                       bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
> -               bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
> +       for (i = 0; i < set->nr_hw_queues; i++) {
> +               tags = set->tags[i];
> +               if (tags)
> +                       down_read(&tags->iter_rwsem);
>         }
> +       __blk_mq_queue_tag_busy_iter(q, fn, priv);
> +       for (i = set->nr_hw_queues - 1; i >= 0; i--) {
> +               tags = set->tags[i];
> +               if (tags)
> +                       up_read(&tags->iter_rwsem);
> +       }
> +
>         blk_queue_exit(q);
>  }
>
> @@ -576,7 +633,9 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>
>         tags->nr_tags = total_tags;
>         tags->nr_reserved_tags = reserved_tags;
> -       init_rwsem(&tags->iter_rwsem);
> +       lockdep_register_key(&tags->iter_rwsem_key);
> +       __init_rwsem(&tags->iter_rwsem, "tags->iter_rwsem",
> +                    &tags->iter_rwsem_key);
>
>         if (flags & BLK_MQ_F_TAG_HCTX_SHARED)
>                 return tags;
> @@ -594,6 +653,7 @@ void blk_mq_free_tags(struct blk_mq_tags *tags, unsigned int flags)
>                 sbitmap_queue_free(tags->bitmap_tags);
>                 sbitmap_queue_free(tags->breserved_tags);
>         }
> +       lockdep_unregister_key(&tags->iter_rwsem_key);
>         kfree(tags);
>  }
>
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index d1d73d7cc7df..e37f219bd36a 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -22,6 +22,7 @@ struct blk_mq_tags {
>         struct list_head page_list;
>
>         struct rw_semaphore iter_rwsem;
> +       struct lock_class_key iter_rwsem_key;
>  };
>
>  extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
> @@ -43,6 +44,8 @@ extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
>                                              unsigned int size);
>
>  extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
> +void blk_mq_queue_tag_busy_iter_atomic(struct request_queue *q,
> +               busy_iter_fn *fn, void *priv);
>  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>                 void *priv);
>  void blk_mq_all_tag_iter_atomic(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d6c9b655c0f5..f5e1ace273e2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -117,7 +117,7 @@ unsigned int blk_mq_in_flight(struct request_queue *q,
>  {
>         struct mq_inflight mi = { .part = part };
>
> -       blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
> +       blk_mq_queue_tag_busy_iter_atomic(q, blk_mq_check_inflight, &mi);
>
>         return mi.inflight[0] + mi.inflight[1];
>  }
> @@ -127,7 +127,7 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
>  {
>         struct mq_inflight mi = { .part = part };
>
> -       blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
> +       blk_mq_queue_tag_busy_iter_atomic(q, blk_mq_check_inflight, &mi);
>         inflight[0] = mi.inflight[0];
>         inflight[1] = mi.inflight[1];
>  }
> @@ -881,7 +881,7 @@ bool blk_mq_queue_inflight(struct request_queue *q)
>  {
>         bool busy = false;
>
> -       blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy);
> +       blk_mq_queue_tag_busy_iter_atomic(q, blk_mq_rq_inflight, &busy);
>         return busy;
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_queue_inflight);

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3996 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 3/3] blk-mq: Fix a race between iterating over requests and freeing requests
  2021-04-05 21:34     ` Jens Axboe
@ 2021-04-06  1:06       ` Bart Van Assche
  0 siblings, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2021-04-06  1:06 UTC (permalink / raw)
  To: Jens Axboe, Khazhy Kumykov, John Garry
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen,
	Shin'ichiro Kawasaki, Ming Lei, Hannes Reinecke,
	Johannes Thumshirn

On 4/5/21 2:34 PM, Jens Axboe wrote:
> For something out of left field, we can check if the page that the rq
> belongs to is still part of the tag set. If it isn't, then don't
> deref it.
> 
> Totally untested.
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index e5bfecf2940d..6209c465e884 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -196,9 +196,35 @@ struct bt_iter_data {
>  	struct blk_mq_hw_ctx *hctx;
>  	busy_iter_fn *fn;
>  	void *data;
> +	struct page *last_lookup;
>  	bool reserved;
>  };
>  
> +static bool rq_from_queue(struct bt_iter_data *iter_data, struct request *rq)
> +{
> +	struct blk_mq_hw_ctx *hctx = iter_data->hctx;
> +	struct page *rq_page, *page;
> +
> +	/*
> +	 * We can hit rq == NULL here, because the tagging functions
> +	 * test and set the bit before assigning ->rqs[].
> +	 */
> +	if (!rq)
> +		return false;
> +	rq_page = virt_to_page(rq);
> +	if (rq_page == iter_data->last_lookup)
> +		goto check_queue;
> +	list_for_each_entry(page, &hctx->tags->page_list, lru) {
> +		if (page == rq_page) {
> +			iter_data->last_lookup = page;
> +			goto check_queue;
> +		}
> +	}
> +	return false;
> +check_queue:
> +	return rq->q == hctx->queue && rq->mq_hctx == hctx;
> +}
> +
>  static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  {
>  	struct bt_iter_data *iter_data = data;
> @@ -211,11 +237,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  		bitnr += tags->nr_reserved_tags;
>  	rq = tags->rqs[bitnr];
>  
> -	/*
> -	 * We can hit rq == NULL here, because the tagging functions
> -	 * test and set the bit before assigning ->rqs[].
> -	 */
> -	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
> +	if (rq_from_queue(iter_data, rq))
>  		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>  	return true;
>  }

Hi Jens,

That's a very interesting suggestion. However, it seems to me that Khazhy's
suggestion will result in shorter and faster code?

Khazhy pointed out another race to me off-list, namely a race between updating
the number of hardware queues and iterating over the tags in a tag set. I'm
currently analyzing how to fix that race too.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 0/3] blk-mq: Fix a race between iterating over requests and freeing requests
  2021-04-05  0:28 [PATCH v5 0/3] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
                   ` (2 preceding siblings ...)
  2021-04-05  0:28 ` [PATCH v5 3/3] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
@ 2021-04-06  8:00 ` John Garry
  2021-04-06 17:49   ` Bart Van Assche
  3 siblings, 1 reply; 12+ messages in thread
From: John Garry @ 2021-04-06  8:00 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 05/04/2021 01:28, Bart Van Assche wrote:
> Hi Jens,
> 
> This patch series fixes the race between iterating over requests and
> freeing requests that has been reported by multiple different users over
> the past two years. Please consider this patch series for kernel v5.13.
> 
> Thank you,
> 
> Bart.
> 
> Changes between v4 and v5:
> - Addressed Khazhy's review comments. Note: the changes that have been made
>    in v5 only change behavior in case CONFIG_PROVE_RCU=y.
> 
> Changes between v3 and v4:
> - Fixed support for tag sets shared across hardware queues.
> - Renamed blk_mq_wait_for_tag_readers() into blk_mq_wait_for_tag_iter().
> - Removed the fourth argument of blk_mq_queue_tag_busy_iter() again.
> 

Hi Bart,

> Changes between v2 and v3:
> - Converted the single v2 patch into a series of three patches.
> - Switched from SRCU to a combination of RCU and semaphores.
> 

But can you mention why we made the changes from v3 onwards (versus v2)?

The v2 patch just used SRCU as the sync mechanism, and the impression I 
got from Jens was that the marginal performance drop was tolerable. And 
the issues it tries to address seem to be solved. So why change? Maybe 
my impression of the performance drop being acceptable was wrong.

Thanks,
John

PS: I have been on vacation for some time, so I could not support this work.

> Changes between v1 and v2:
> - Reformatted patch description.
> - Added Tested-by/Reviewed-by tags.
> - Changed srcu_barrier() calls into synchronize_srcu() calls.
> 
> Bart Van Assche (3):
>    blk-mq: Move the elevator_exit() definition
>    blk-mq: Introduce atomic variants of the tag iteration functions
>    blk-mq: Fix a race between iterating over requests and freeing
>      requests
> 
>   block/blk-core.c          | 34 ++++++++++++++++-
>   block/blk-mq-tag.c        | 79 ++++++++++++++++++++++++++++++++++-----
>   block/blk-mq-tag.h        |  6 ++-
>   block/blk-mq.c            | 23 +++++++++---
>   block/blk-mq.h            |  1 +
>   block/blk.h               | 11 +-----
>   block/elevator.c          |  9 +++++
>   drivers/scsi/hosts.c      | 16 ++++----
>   drivers/scsi/ufs/ufshcd.c |  4 +-
>   include/linux/blk-mq.h    |  2 +
>   10 files changed, 149 insertions(+), 36 deletions(-)
> 
> .
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 0/3] blk-mq: Fix a race between iterating over requests and freeing requests
  2021-04-06  8:00 ` [PATCH v5 0/3] " John Garry
@ 2021-04-06 17:49   ` Bart Van Assche
  2021-04-07 15:12     ` John Garry
  0 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-04-06 17:49 UTC (permalink / raw)
  To: John Garry, Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 4/6/21 1:00 AM, John Garry wrote:
> Hi Bart,
> 
>> Changes between v2 and v3:
>> - Converted the single v2 patch into a series of three patches.
>> - Switched from SRCU to a combination of RCU and semaphores.
> 
> But can you mention why we made the changes from v3 onwards (versus
> v2)?
> 
> The v2 patch just used SRCU as the sync mechanism, and the impression
> I got from Jens was that the marginal performance drop was tolerable.
> And the issues it tries to address seem to be solved. So why change?
> Maybe my impression of the performance drop being acceptable was
> wrong.

Hi John,

It seems like I should have done a better job of explaining that change.
On v2 I received the following feedback from Hannes: "What I don't
particularly like is the global blk_sched_srcu here; can't
we make it per tagset?". My reply was as follows: "I'm concerned about
the additional memory required for one srcu_struct per tag set." Hence
the switch from SRCU to RCU + rwsem. See also
https://lore.kernel.org/linux-block/d1627890-fb10-7ebe-d805-621f925f80e7@suse.de/.

Regarding the 1% performance drop measured by Jens: with debugging
disabled srcu_dereference() is translated into READ_ONCE() and
rcu_assign_pointer() is translated into smp_store_release(). On x86
smp_store_release() is translated into a compiler barrier +
WRITE_ONCE(). In other words, I do not expect that the performance
difference came from the switch to SRCU but rather from the two new
hctx->tags->rqs[] assignments.

I think that the switch to READ_ONCE() / WRITE_ONCE() is unavoidable.
Even if cmpxchg() were used to clear hctx->tags->rqs[] pointers, we
would still need to convert all other hctx->tags->rqs[] accesses into
READ_ONCE() / WRITE_ONCE() to make that cmpxchg() call safe.
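
To make that concrete, below is a minimal sketch of the accessor pattern
I have in mind. The helper names are made up for illustration and are not
part of this series:

/*
 * Illustrative only: hypothetical helpers showing the READ_ONCE() /
 * WRITE_ONCE() pattern for hctx->tags->rqs[] discussed above.
 */
static inline void blk_mq_tags_set_rq(struct blk_mq_tags *tags,
				      unsigned int tag, struct request *rq)
{
	/* Pairs with the READ_ONCE() in the tag iteration path. */
	WRITE_ONCE(tags->rqs[tag], rq);
}

static inline struct request *blk_mq_tags_get_rq(struct blk_mq_tags *tags,
						 unsigned int tag)
{
	/* May return NULL: the tag bit is set before ->rqs[] is assigned. */
	return READ_ONCE(tags->rqs[tag]);
}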

Bart.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 0/3] blk-mq: Fix a race between iterating over requests and freeing requests
  2021-04-06 17:49   ` Bart Van Assche
@ 2021-04-07 15:12     ` John Garry
  0 siblings, 0 replies; 12+ messages in thread
From: John Garry @ 2021-04-07 15:12 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 06/04/2021 18:49, Bart Van Assche wrote:
> On 4/6/21 1:00 AM, John Garry wrote:
>> Hi Bart,
>>
>>> Changes between v2 and v3:
>>> - Converted the single v2 patch into a series of three patches.
>>> - Switched from SRCU to a combination of RCU and semaphores.
>>
>> But can you mention why we made the changes from v3 onwards (versus
>> v2)?
>>
>> The v2 patch just used SRCU as the sync mechanism, and the impression
>> I got from Jens was that the marginal performance drop was tolerable.
>> And the issues it tries to address seem to be solved. So why change?
>> Maybe my impression of the performance drop being acceptable was
>> wrong.
> 

Hi Bart,

> 
> It seems like I should have done a better job of explaining that change.
> On v2 I received the following feedback from Hannes: "What I don't
> particularly like is the global blk_sched_srcu here; can't
> we make it per tagset?". My reply was as follows: "I'm concerned about
> the additional memory required for one srcu_struct per tag set."

I actually said the same thing about the global blk_sched_srcu, but I
would not want a per-tagset srcu_struct unless it was necessary, for the
same reason you mention (additional memory).

> Hence
> the switch from SRCU to RCU + rwsem. See also
> https://lore.kernel.org/linux-block/d1627890-fb10-7ebe-d805-621f925f80e7@suse.de/.
> 
> Regarding the 1% performance drop measured by Jens: with debugging
> disabled srcu_dereference() is translated into READ_ONCE() and
> rcu_assign_pointer() is translated into smp_store_release(). 

I think it depends on what is being assigned: assigning NULL gives
WRITE_ONCE(), while assigning a non-NULL pointer gives smp_store_release().
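
For reference, rcu_assign_pointer() looks roughly like this (a simplified
paraphrase of include/linux/rcupdate.h, not the exact macro):

#define rcu_assign_pointer(p, v)                                      \
do {                                                                   \
	if (__builtin_constant_p(v) && (v) == NULL)                    \
		WRITE_ONCE((p), NULL);          /* plain store */      \
	else                                                           \
		smp_store_release(&(p), (v));   /* ordered store */    \
} while (0)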

> On x86
> smp_store_release() is translated into a compiler barrier +
> WRITE_ONCE().

Right, asm-generic gives smp_mb() + WRITE_ONCE(), and arm64 has a
dedicated store-release instruction.

> In other words, I do not expect that the performance
> difference came from the switch to SRCU but rather from the two new
> hctx->tags->rqs[] assignments.
> 
> I think that the switch to READ_ONCE() / WRITE_ONCE() is unavoidable.
> Even if cmpxchg() were used to clear hctx->tags->rqs[] pointers, we
> would still need to convert all other hctx->tags->rqs[] accesses into
> READ_ONCE() / WRITE_ONCE() to make that cmpxchg() call safe.
> 

OK, I tend to agree with this. But it seems to me that those READ_ONCE() /
WRITE_ONCE() conversions are needed anyway.

Anyway, I'll have a look at the latest series...

Thanks,
John

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-04-07 15:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-05  0:28 [PATCH v5 0/3] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
2021-04-05  0:28 ` [PATCH v5 1/3] blk-mq: Move the elevator_exit() definition Bart Van Assche
2021-04-05  0:28 ` [PATCH v5 2/3] blk-mq: Introduce atomic variants of the tag iteration functions Bart Van Assche
2021-04-05  0:28 ` [PATCH v5 3/3] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
2021-04-05 18:08   ` Khazhy Kumykov
2021-04-05 21:34     ` Bart Van Assche
2021-04-06  0:24       ` Khazhy Kumykov
2021-04-05 21:34     ` Jens Axboe
2021-04-06  1:06       ` Bart Van Assche
2021-04-06  8:00 ` [PATCH v5 0/3] " John Garry
2021-04-06 17:49   ` Bart Van Assche
2021-04-07 15:12     ` John Garry
