From: Bart Van Assche <bvanassche@acm.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
Bart Van Assche <bvanassche@acm.org>,
"Martin K . Petersen" <martin.petersen@oracle.com>,
Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
Ming Lei <ming.lei@redhat.com>, Hannes Reinecke <hare@suse.de>,
Johannes Thumshirn <johannes.thumshirn@wdc.com>,
John Garry <john.garry@huawei.com>,
Khazhy Kumykov <khazhy@google.com>
Subject: [PATCH v6 3/5] blk-mq: Fix races between iterating over requests and freeing requests
Date: Tue, 6 Apr 2021 14:49:03 -0700
Message-ID: <20210406214905.21622-4-bvanassche@acm.org>
In-Reply-To: <20210406214905.21622-1-bvanassche@acm.org>
When multiple request queues share a tag set and when switching the I/O
scheduler for one of the request queues associated with that tag set, the
following race can happen:
* blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
a pointer to a scheduler request to the local variable 'rq'.
* blk_mq_sched_free_requests() is called to free hctx->sched_tags.
* blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
Fix this race as follows:
* Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
The performance impact of the assignments added to the hot path is minimal
(about 1% according to one particular test).
* Protect hctx->tags->rqs[] reads with an RCU read-side lock or with a
semaphore. Which mechanism is used depends on whether the calling context
and the callback function are allowed to sleep.
* Wait for all preexisting readers to finish before freeing scheduler
requests.
Another race is as follows:
* blk_mq_sched_free_requests() is called to free hctx->sched_tags.
* blk_mq_queue_tag_busy_iter() iterates over the same tag set, but for a
request queue other than the one whose scheduler tags are being freed.
* bt_iter() examines rq->q after *rq has been freed.
Fix this race by protecting the rq->q read in bt_iter() with an RCU read
lock and by calling synchronize_rcu() before freeing scheduler tags.
Multiple users have reported use-after-free complaints similar to the
following (from https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/ ):
BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
Read of size 8 at addr ffff88803b335240 by task fio/21412
CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x86/0xca
print_address_description+0x71/0x239
kasan_report.cold.5+0x242/0x301
__asan_load8+0x54/0x90
bt_iter+0x86/0xf0
blk_mq_queue_tag_busy_iter+0x373/0x5e0
blk_mq_in_flight+0x96/0xb0
part_in_flight+0x40/0x140
part_round_stats+0x18e/0x370
blk_account_io_start+0x3d7/0x670
blk_mq_bio_to_request+0x19c/0x3a0
blk_mq_make_request+0x7a9/0xcb0
generic_make_request+0x41d/0x960
submit_bio+0x9b/0x250
do_blockdev_direct_IO+0x435c/0x4c70
__blockdev_direct_IO+0x79/0x88
ext4_direct_IO+0x46c/0xc00
generic_file_direct_write+0x119/0x210
__generic_file_write_iter+0x11c/0x280
ext4_file_write_iter+0x1b8/0x6f0
aio_write+0x204/0x310
io_submit_one+0x9d3/0xe80
__x64_sys_io_submit+0x115/0x340
do_syscall_64+0x71/0x210
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Khazhy Kumykov <khazhy@google.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-core.c | 34 ++++++++++++++++++++++++++++++-
block/blk-mq-tag.c | 51 ++++++++++++++++++++++++++++++++++++++++------
block/blk-mq-tag.h | 4 +++-
block/blk-mq.c | 21 +++++++++++++++----
block/blk-mq.h | 1 +
block/blk.h | 2 ++
block/elevator.c | 1 +
7 files changed, 102 insertions(+), 12 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 9bcdae93f6d4..ca7f833e25a8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -279,6 +279,36 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
}
EXPORT_SYMBOL(blk_dump_rq_flags);
+/**
+ * blk_mq_wait_for_tag_iter - wait for preexisting tag iteration functions to finish
+ * @set: Tag set to wait on.
+ *
+ * Waits for preexisting calls of blk_mq_all_tag_iter(),
+ * blk_mq_tagset_busy_iter() and also for their atomic variants to finish
+ * accessing hctx->tags->rqs[]. New readers may start while this function is
+ * in progress or after this function has returned. Use this function to make
+ * sure that hctx->tags->rqs[] changes have become globally visible.
+ *
+ * Waits for preexisting blk_mq_queue_tag_busy_iter(q, fn, priv) calls to
+ * finish accessing requests associated with other request queues than 'q'.
+ */
+void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set)
+{
+ struct blk_mq_tags *tags;
+ int i;
+
+ if (set->tags) {
+ for (i = 0; i < set->nr_hw_queues; i++) {
+ tags = set->tags[i];
+ if (!tags)
+ continue;
+ down_write(&tags->iter_rwsem);
+ up_write(&tags->iter_rwsem);
+ }
+ }
+ synchronize_rcu();
+}
+
/**
* blk_sync_queue - cancel any pending callbacks on a queue
* @q: the queue
@@ -412,8 +442,10 @@ void blk_cleanup_queue(struct request_queue *q)
* it is safe to free requests now.
*/
mutex_lock(&q->sysfs_lock);
- if (q->elevator)
+ if (q->elevator) {
+ blk_mq_wait_for_tag_iter(q->tag_set);
blk_mq_sched_free_requests(q);
+ }
mutex_unlock(&q->sysfs_lock);
percpu_ref_exit(&q->q_usage_counter);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d8eaa38a1bd1..a73e81d75fb8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -209,14 +209,24 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
if (!reserved)
bitnr += tags->nr_reserved_tags;
- rq = tags->rqs[bitnr];
-
+ rcu_read_lock();
+ /*
+ * The request 'rq' points at is protected by an RCU read lock until
+ * its queue pointer has been verified and by q_usage_count while the
+ * callback function is being invoked. See also the
+ * percpu_ref_tryget() and blk_queue_exit() calls in
+ * blk_mq_queue_tag_busy_iter().
+ */
+ rq = rcu_dereference(tags->rqs[bitnr]);
/*
* We can hit rq == NULL here, because the tagging functions
* test and set the bit before assigning ->rqs[].
*/
- if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
+ if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) {
+ rcu_read_unlock();
return iter_data->fn(hctx, rq, iter_data->data, reserved);
+ }
+ rcu_read_unlock();
return true;
}
@@ -254,11 +264,17 @@ struct bt_tags_iter_data {
unsigned int flags;
};
+/* Include reserved tags. */
#define BT_TAG_ITER_RESERVED (1 << 0)
+/* Only include started requests. */
#define BT_TAG_ITER_STARTED (1 << 1)
+/* Iterate over tags->static_rqs[] instead of tags->rqs[]. */
#define BT_TAG_ITER_STATIC_RQS (1 << 2)
+/* The callback function may sleep. */
+#define BT_TAG_ITER_MAY_SLEEP (1 << 3)
-static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
+static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
+ void *data)
{
struct bt_tags_iter_data *iter_data = data;
struct blk_mq_tags *tags = iter_data->tags;
@@ -275,7 +291,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
rq = tags->static_rqs[bitnr];
else
- rq = tags->rqs[bitnr];
+ rq = rcu_dereference_check(tags->rqs[bitnr],
+ lockdep_is_held(&tags->iter_rwsem));
if (!rq)
return true;
if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
@@ -284,6 +301,25 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
return iter_data->fn(rq, iter_data->data, reserved);
}
+static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
+{
+ struct bt_tags_iter_data *iter_data = data;
+ struct blk_mq_tags *tags = iter_data->tags;
+ bool res;
+
+ if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
+ down_read(&tags->iter_rwsem);
+ res = __bt_tags_iter(bitmap, bitnr, data);
+ up_read(&tags->iter_rwsem);
+ } else {
+ rcu_read_lock();
+ res = __bt_tags_iter(bitmap, bitnr, data);
+ rcu_read_unlock();
+ }
+
+ return res;
+}
+
/**
* bt_tags_for_each - iterate over the requests in a tag map
* @tags: Tag map to iterate over.
@@ -357,10 +393,12 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
{
int i;
+ might_sleep();
+
for (i = 0; i < tagset->nr_hw_queues; i++) {
if (tagset->tags && tagset->tags[i])
__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
- BT_TAG_ITER_STARTED);
+ BT_TAG_ITER_STARTED | BT_TAG_ITER_MAY_SLEEP);
}
}
EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
@@ -544,6 +582,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
tags->nr_tags = total_tags;
tags->nr_reserved_tags = reserved_tags;
+ init_rwsem(&tags->iter_rwsem);
if (blk_mq_is_sbitmap_shared(flags))
return tags;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 0290c308ece9..d1d73d7cc7df 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -17,9 +17,11 @@ struct blk_mq_tags {
struct sbitmap_queue __bitmap_tags;
struct sbitmap_queue __breserved_tags;
- struct request **rqs;
+ struct request __rcu **rqs;
struct request **static_rqs;
struct list_head page_list;
+
+ struct rw_semaphore iter_rwsem;
};
extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5b170faa6318..d6c9b655c0f5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -495,8 +495,10 @@ static void __blk_mq_free_request(struct request *rq)
blk_crypto_free_request(rq);
blk_pm_mark_last_busy(rq);
rq->mq_hctx = NULL;
- if (rq->tag != BLK_MQ_NO_TAG)
+ if (rq->tag != BLK_MQ_NO_TAG) {
blk_mq_put_tag(hctx->tags, ctx, rq->tag);
+ rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
+ }
if (sched_tag != BLK_MQ_NO_TAG)
blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
blk_mq_sched_restart(hctx);
@@ -838,9 +840,20 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
{
+ struct request *rq;
+
if (tag < tags->nr_tags) {
- prefetch(tags->rqs[tag]);
- return tags->rqs[tag];
+ /*
+ * Freeing tags happens with the request queue frozen so the
+ * rcu dereference below is protected by the request queue
+ * usage count. We can only verify that usage count after
+ * having read the request pointer.
+ */
+ rq = rcu_dereference_check(tags->rqs[tag], true);
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && rq &&
+ percpu_ref_is_zero(&rq->q->q_usage_counter));
+ prefetch(rq);
+ return rq;
}
return NULL;
@@ -1111,7 +1124,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
rq->rq_flags |= RQF_MQ_INFLIGHT;
__blk_mq_inc_active_requests(hctx);
}
- hctx->tags->rqs[rq->tag] = rq;
+ rcu_assign_pointer(hctx->tags->rqs[rq->tag], rq);
return true;
}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3616453ca28c..9ccb1818303b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
+ rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
rq->tag = BLK_MQ_NO_TAG;
if (rq->rq_flags & RQF_MQ_INFLIGHT) {
diff --git a/block/blk.h b/block/blk.h
index 2ed6c684d63a..aadf0d82a028 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -185,6 +185,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
void blk_account_io_start(struct request *req);
void blk_account_io_done(struct request *req, u64 now);
+void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set);
+
/*
* Internal elevator interface
*/
diff --git a/block/elevator.c b/block/elevator.c
index 4b20d1ab29cc..70a10e31b336 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -201,6 +201,7 @@ static void elevator_exit(struct request_queue *q, struct elevator_queue *e)
{
lockdep_assert_held(&q->sysfs_lock);
+ blk_mq_wait_for_tag_iter(q->tag_set);
blk_mq_sched_free_requests(q);
__elevator_exit(q, e);
}