linux-block.vger.kernel.org archive mirror
* [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests
@ 2021-04-25  8:57 Ming Lei
  2021-04-25  8:57 ` [PATCH 1/8] Revert "blk-mq: Fix races between blk_mq_update_nr_hw_queues() and iterating over tags" Ming Lei
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Ming Lei @ 2021-04-25  8:57 UTC (permalink / raw)
  To: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery, Ming Lei

Hi Guys,

Revert the 4 patches from Bart that try to fix the request UAF issue related
to iterating over tagset-wide requests, because:

1) the request UAF caused by normal completion racing with async completion
during iteration isn't covered[1]

2) clearing ->rqs[] is added to the fast path, which causes a ~1% performance
loss according to Bart's test

3) Bart's approach is too complicated, and some of its changes aren't needed,
such as adding two versions of the tagset iterators

This patchset fixes the request UAF issue with a simpler approach, without
any change to the fast path:

1) always complete requests synchronously when the completion is run via
blk_mq_tagset_busy_iter(), done in the first two patches

2) grab a request's reference before calling ->fn in blk_mq_tagset_busy_iter(),
and release it after ->fn returns, so ->fn won't be called for a request
whose queue is frozen, done in the 3rd patch

3) clear any stale request referenced by ->rqs[] before freeing the request
pool; one per-tags spinlock is added to protect grabbing a request reference
vs. clearing ->rqs[tag], so the UAF from refcount_inc_not_zero() in
bt_tags_iter() is avoided, done in the 4th patch.


[1] https://lore.kernel.org/linux-block/YISzLal7Ur7jyuiy@T590/T/#u

Ming Lei (8):
  Revert "blk-mq: Fix races between blk_mq_update_nr_hw_queues() and
    iterating over tags"
  Revert "blk-mq: Make it safe to use RCU to iterate over
    blk_mq_tag_set.tag_list"
  Revert "blk-mq: Fix races between iterating over requests and freeing
    requests"
  Revert "blk-mq: Introduce atomic variants of
    blk_mq_(all_tag|tagset_busy)_iter"
  blk-mq: blk_mq_complete_request_locally
  block: drivers: complete request locally from blk_mq_tagset_busy_iter
  blk-mq: grab rq->refcount before calling ->fn in
    blk_mq_tagset_busy_iter
  blk-mq: clear stale request in tags->rq[] before freeing one request
    pool

 block/blk-core.c                  |  34 +------
 block/blk-mq-tag.c                | 147 ++++++------------------------
 block/blk-mq-tag.h                |   7 +-
 block/blk-mq.c                    | 100 +++++++++++++-------
 block/blk-mq.h                    |   2 +-
 block/blk.h                       |   2 -
 block/elevator.c                  |   1 -
 drivers/block/mtip32xx/mtip32xx.c |   2 +-
 drivers/block/nbd.c               |   2 +-
 drivers/nvme/host/core.c          |   2 +-
 drivers/scsi/hosts.c              |  16 ++--
 drivers/scsi/scsi_lib.c           |   6 +-
 drivers/scsi/ufs/ufshcd.c         |   4 +-
 include/linux/blk-mq.h            |   3 +-
 14 files changed, 119 insertions(+), 209 deletions(-)

-- 
2.29.2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/8] Revert "blk-mq: Fix races between blk_mq_update_nr_hw_queues() and iterating over tags"
  2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
@ 2021-04-25  8:57 ` Ming Lei
  2021-04-25  8:57 ` [PATCH 2/8] Revert "blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list" Ming Lei
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2021-04-25  8:57 UTC (permalink / raw)
  To: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery, Ming Lei

This reverts commit ac81d1ffd022b432d24fe79adf2d31f81a4acdc3.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c | 39 ---------------------------------------
 1 file changed, 39 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index b0e0f074a864..39d5c9190a6b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -376,31 +376,6 @@ void blk_mq_all_tag_iter_atomic(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
 }
 
-/*
- * Iterate over all request queues in a tag set, find the first queue with a
- * non-zero usage count, increment its usage count and return the pointer to
- * that request queue. This prevents that blk_mq_update_nr_hw_queues() will
- * modify @set->nr_hw_queues while iterating over tags since
- * blk_mq_update_nr_hw_queues() only modifies @set->nr_hw_queues while the
- * usage count of all queues associated with a tag set is zero.
- */
-static struct request_queue *
-blk_mq_get_any_tagset_queue(struct blk_mq_tag_set *set)
-{
-	struct request_queue *q;
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list) {
-		if (percpu_ref_tryget(&q->q_usage_counter)) {
-			rcu_read_unlock();
-			return q;
-		}
-	}
-	rcu_read_unlock();
-
-	return NULL;
-}
-
 /**
  * blk_mq_tagset_busy_iter - iterate over all started requests in a tag set
  * @tagset:	Tag set to iterate over.
@@ -416,22 +391,15 @@ blk_mq_get_any_tagset_queue(struct blk_mq_tag_set *set)
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
 {
-	struct request_queue *q;
 	int i;
 
 	might_sleep();
 
-	q = blk_mq_get_any_tagset_queue(tagset);
-	if (!q)
-		return;
-
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
 				BT_TAG_ITER_STARTED | BT_TAG_ITER_MAY_SLEEP);
 	}
-
-	blk_queue_exit(q);
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
@@ -450,20 +418,13 @@ EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 void blk_mq_tagset_busy_iter_atomic(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
 {
-	struct request_queue *q;
 	int i;
 
-	q = blk_mq_get_any_tagset_queue(tagset);
-	if (!q)
-		return;
-
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
 					      BT_TAG_ITER_STARTED);
 	}
-
-	blk_queue_exit(q);
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter_atomic);
 
-- 
2.29.2



* [PATCH 2/8] Revert "blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list"
  2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
  2021-04-25  8:57 ` [PATCH 1/8] Revert "blk-mq: Fix races between blk_mq_update_nr_hw_queues() and iterating over tags" Ming Lei
@ 2021-04-25  8:57 ` Ming Lei
  2021-04-25  8:57 ` [PATCH 3/8] Revert "blk-mq: Fix races between iterating over requests and freeing requests" Ming Lei
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2021-04-25  8:57 UTC (permalink / raw)
  To: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery, Ming Lei

This reverts commit a8a6ac7ad3fb6b84b933ca1ea5110998bdaeee17.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d2ea6357c7d..8b59f6b4ec8e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2947,7 +2947,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 	struct blk_mq_tag_set *set = q->tag_set;
 
 	mutex_lock(&set->tag_list_lock);
-	list_del_rcu(&q->tag_set_list);
+	list_del(&q->tag_set_list);
 	if (list_is_singular(&set->tag_list)) {
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
@@ -2955,11 +2955,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		blk_mq_update_tag_set_shared(set, false);
 	}
 	mutex_unlock(&set->tag_list_lock);
-	/*
-	 * Calling synchronize_rcu() and INIT_LIST_HEAD(&q->tag_set_list) is
-	 * not necessary since blk_mq_del_queue_tag_set() is only called from
-	 * blk_cleanup_queue().
-	 */
+	INIT_LIST_HEAD(&q->tag_set_list);
 }
 
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
-- 
2.29.2



* [PATCH 3/8] Revert "blk-mq: Fix races between iterating over requests and freeing requests"
  2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
  2021-04-25  8:57 ` [PATCH 1/8] Revert "blk-mq: Fix races between blk_mq_update_nr_hw_queues() and iterating over tags" Ming Lei
  2021-04-25  8:57 ` [PATCH 2/8] Revert "blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list" Ming Lei
@ 2021-04-25  8:57 ` Ming Lei
  2021-04-25  8:57 ` [PATCH 4/8] Revert "blk-mq: Introduce atomic variants of blk_mq_(all_tag|tagset_busy)_iter" Ming Lei
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2021-04-25  8:57 UTC (permalink / raw)
  To: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery, Ming Lei

This reverts commit 5ba3f5a6ca7ee2dffcae7fab25a1a1053e3264cb.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c   | 34 +------------------------------
 block/blk-mq-tag.c | 51 ++++++----------------------------------------
 block/blk-mq-tag.h |  4 +---
 block/blk-mq.c     | 21 ++++---------------
 block/blk-mq.h     |  1 -
 block/blk.h        |  2 --
 block/elevator.c   |  1 -
 7 files changed, 12 insertions(+), 102 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ca7f833e25a8..9bcdae93f6d4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -279,36 +279,6 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
 }
 EXPORT_SYMBOL(blk_dump_rq_flags);
 
-/**
- * blk_mq_wait_for_tag_iter - wait for preexisting tag iteration functions to finish
- * @set: Tag set to wait on.
- *
- * Waits for preexisting calls of blk_mq_all_tag_iter(),
- * blk_mq_tagset_busy_iter() and also for their atomic variants to finish
- * accessing hctx->tags->rqs[]. New readers may start while this function is
- * in progress or after this function has returned. Use this function to make
- * sure that hctx->tags->rqs[] changes have become globally visible.
- *
- * Waits for preexisting blk_mq_queue_tag_busy_iter(q, fn, priv) calls to
- * finish accessing requests associated with other request queues than 'q'.
- */
-void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set)
-{
-	struct blk_mq_tags *tags;
-	int i;
-
-	if (set->tags) {
-		for (i = 0; i < set->nr_hw_queues; i++) {
-			tags = set->tags[i];
-			if (!tags)
-				continue;
-			down_write(&tags->iter_rwsem);
-			up_write(&tags->iter_rwsem);
-		}
-	}
-	synchronize_rcu();
-}
-
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
@@ -442,10 +412,8 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * it is safe to free requests now.
 	 */
 	mutex_lock(&q->sysfs_lock);
-	if (q->elevator) {
-		blk_mq_wait_for_tag_iter(q->tag_set);
+	if (q->elevator)
 		blk_mq_sched_free_requests(q);
-	}
 	mutex_unlock(&q->sysfs_lock);
 
 	percpu_ref_exit(&q->q_usage_counter);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 39d5c9190a6b..d8eaa38a1bd1 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -209,24 +209,14 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rcu_read_lock();
-	/*
-	 * The request 'rq' points at is protected by an RCU read lock until
-	 * its queue pointer has been verified and by q_usage_count while the
-	 * callback function is being invoked. See also the
-	 * percpu_ref_tryget() and blk_queue_exit() calls in
-	 * blk_mq_queue_tag_busy_iter().
-	 */
-	rq = rcu_dereference(tags->rqs[bitnr]);
+	rq = tags->rqs[bitnr];
+
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) {
-		rcu_read_unlock();
+	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
-	}
-	rcu_read_unlock();
 	return true;
 }
 
@@ -264,17 +254,11 @@ struct bt_tags_iter_data {
 	unsigned int flags;
 };
 
-/* Include reserved tags. */
 #define BT_TAG_ITER_RESERVED		(1 << 0)
-/* Only include started requests. */
 #define BT_TAG_ITER_STARTED		(1 << 1)
-/* Iterate over tags->static_rqs[] instead of tags->rqs[]. */
 #define BT_TAG_ITER_STATIC_RQS		(1 << 2)
-/* The callback function may sleep. */
-#define BT_TAG_ITER_MAY_SLEEP		(1 << 3)
 
-static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
-			   void *data)
+static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 {
 	struct bt_tags_iter_data *iter_data = data;
 	struct blk_mq_tags *tags = iter_data->tags;
@@ -291,8 +275,7 @@ static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
 	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
 		rq = tags->static_rqs[bitnr];
 	else
-		rq = rcu_dereference_check(tags->rqs[bitnr],
-					   lockdep_is_held(&tags->iter_rwsem));
+		rq = tags->rqs[bitnr];
 	if (!rq)
 		return true;
 	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
@@ -301,25 +284,6 @@ static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
 	return iter_data->fn(rq, iter_data->data, reserved);
 }
 
-static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
-{
-	struct bt_tags_iter_data *iter_data = data;
-	struct blk_mq_tags *tags = iter_data->tags;
-	bool res;
-
-	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
-		down_read(&tags->iter_rwsem);
-		res = __bt_tags_iter(bitmap, bitnr, data);
-		up_read(&tags->iter_rwsem);
-	} else {
-		rcu_read_lock();
-		res = __bt_tags_iter(bitmap, bitnr, data);
-		rcu_read_unlock();
-	}
-
-	return res;
-}
-
 /**
  * bt_tags_for_each - iterate over the requests in a tag map
  * @tags:	Tag map to iterate over.
@@ -393,12 +357,10 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 {
 	int i;
 
-	might_sleep();
-
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
-				BT_TAG_ITER_STARTED | BT_TAG_ITER_MAY_SLEEP);
+					      BT_TAG_ITER_STARTED);
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
@@ -582,7 +544,6 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
-	init_rwsem(&tags->iter_rwsem);
 
 	if (blk_mq_is_sbitmap_shared(flags))
 		return tags;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index d1d73d7cc7df..0290c308ece9 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -17,11 +17,9 @@ struct blk_mq_tags {
 	struct sbitmap_queue __bitmap_tags;
 	struct sbitmap_queue __breserved_tags;
 
-	struct request __rcu **rqs;
+	struct request **rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
-
-	struct rw_semaphore iter_rwsem;
 };
 
 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8b59f6b4ec8e..79c01b1f885c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -496,10 +496,8 @@ static void __blk_mq_free_request(struct request *rq)
 	blk_crypto_free_request(rq);
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != BLK_MQ_NO_TAG) {
+	if (rq->tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
-		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
-	}
 	if (sched_tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
@@ -841,20 +839,9 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
-	struct request *rq;
-
 	if (tag < tags->nr_tags) {
-		/*
-		 * Freeing tags happens with the request queue frozen so the
-		 * rcu dereference below is protected by the request queue
-		 * usage count. We can only verify that usage count after
-		 * having read the request pointer.
-		 */
-		rq = rcu_dereference_check(tags->rqs[tag], true);
-		WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && rq &&
-			     percpu_ref_is_zero(&rq->q->q_usage_counter));
-		prefetch(rq);
-		return rq;
+		prefetch(tags->rqs[tag]);
+		return tags->rqs[tag];
 	}
 
 	return NULL;
@@ -1125,7 +1112,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
 		rq->rq_flags |= RQF_MQ_INFLIGHT;
 		__blk_mq_inc_active_requests(hctx);
 	}
-	rcu_assign_pointer(hctx->tags->rqs[rq->tag], rq);
+	hctx->tags->rqs[rq->tag] = rq;
 	return true;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9ccb1818303b..3616453ca28c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -226,7 +226,6 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
 	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
-	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
 	rq->tag = BLK_MQ_NO_TAG;
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
diff --git a/block/blk.h b/block/blk.h
index d88b0823738c..529233957207 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -185,8 +185,6 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 void blk_account_io_start(struct request *req);
 void blk_account_io_done(struct request *req, u64 now);
 
-void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set);
-
 /*
  * Internal elevator interface
  */
diff --git a/block/elevator.c b/block/elevator.c
index aae9cff6d5ae..7c486ce858e0 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -201,7 +201,6 @@ static void elevator_exit(struct request_queue *q, struct elevator_queue *e)
 {
 	lockdep_assert_held(&q->sysfs_lock);
 
-	blk_mq_wait_for_tag_iter(q->tag_set);
 	blk_mq_sched_free_requests(q);
 	__elevator_exit(q, e);
 }
-- 
2.29.2



* [PATCH 4/8] Revert "blk-mq: Introduce atomic variants of blk_mq_(all_tag|tagset_busy)_iter"
  2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
                   ` (2 preceding siblings ...)
  2021-04-25  8:57 ` [PATCH 3/8] Revert "blk-mq: Fix races between iterating over requests and freeing requests" Ming Lei
@ 2021-04-25  8:57 ` Ming Lei
  2021-04-25  8:57 ` [PATCH 5/8] blk-mq: blk_mq_complete_request_locally Ming Lei
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2021-04-25  8:57 UTC (permalink / raw)
  To: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery, Ming Lei

This reverts commit 5d39098af969f222253036b1b2e7ffc57c734570.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c        | 38 +++++---------------------------------
 block/blk-mq-tag.h        |  2 +-
 block/blk-mq.c            |  2 +-
 drivers/scsi/hosts.c      | 16 ++++++++--------
 drivers/scsi/ufs/ufshcd.c |  4 ++--
 include/linux/blk-mq.h    |  2 --
 6 files changed, 17 insertions(+), 47 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d8eaa38a1bd1..2a37731e8244 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -322,19 +322,18 @@ static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags,
 }
 
 /**
- * blk_mq_all_tag_iter_atomic - iterate over all requests in a tag map
+ * blk_mq_all_tag_iter - iterate over all requests in a tag map
  * @tags:	Tag map to iterate over.
  * @fn:		Pointer to the function that will be called for each
  *		request. @fn will be called as follows: @fn(rq, @priv,
  *		reserved) where rq is a pointer to a request. 'reserved'
  *		indicates whether or not @rq is a reserved request. Return
- *		true to continue iterating tags, false to stop. Must not
- *		sleep.
+ *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
  *
- * Does not sleep.
+ * Caller has to pass the tag map from which requests are allocated.
  */
-void blk_mq_all_tag_iter_atomic(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 		void *priv)
 {
 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
@@ -349,8 +348,6 @@ void blk_mq_all_tag_iter_atomic(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
  *		indicates whether or not @rq is a reserved request. Return
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
- *
- * May sleep.
  */
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
@@ -365,31 +362,6 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
-/**
- * blk_mq_tagset_busy_iter_atomic - iterate over all started requests in a tag set
- * @tagset:	Tag set to iterate over.
- * @fn:		Pointer to the function that will be called for each started
- *		request. @fn will be called as follows: @fn(rq, @priv,
- *		reserved) where rq is a pointer to a request. 'reserved'
- *		indicates whether or not @rq is a reserved request. Return
- *		true to continue iterating tags, false to stop. Must not sleep.
- * @priv:	Will be passed as second argument to @fn.
- *
- * Does not sleep.
- */
-void blk_mq_tagset_busy_iter_atomic(struct blk_mq_tag_set *tagset,
-		busy_tag_iter_fn *fn, void *priv)
-{
-	int i;
-
-	for (i = 0; i < tagset->nr_hw_queues; i++) {
-		if (tagset->tags && tagset->tags[i])
-			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
-					      BT_TAG_ITER_STARTED);
-	}
-}
-EXPORT_SYMBOL(blk_mq_tagset_busy_iter_atomic);
-
 static bool blk_mq_tagset_count_completed_rqs(struct request *rq,
 		void *data, bool reserved)
 {
@@ -412,7 +384,7 @@ void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
 	while (true) {
 		unsigned count = 0;
 
-		blk_mq_tagset_busy_iter_atomic(tagset,
+		blk_mq_tagset_busy_iter(tagset,
 				blk_mq_tagset_count_completed_rqs, &count);
 		if (!count)
 			break;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 0290c308ece9..7d3e6b333a4a 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -43,7 +43,7 @@ extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
-void blk_mq_all_tag_iter_atomic(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 79c01b1f885c..927189a55575 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2484,7 +2484,7 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
 		.hctx	= hctx,
 	};
 
-	blk_mq_all_tag_iter_atomic(tags, blk_mq_has_request, &data);
+	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
 	return data.has_rq;
 }
 
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index f8863aa88642..697c09ef259b 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -573,8 +573,8 @@ int scsi_host_busy(struct Scsi_Host *shost)
 {
 	int cnt = 0;
 
-	blk_mq_tagset_busy_iter_atomic(&shost->tag_set,
-				       scsi_host_check_in_flight, &cnt);
+	blk_mq_tagset_busy_iter(&shost->tag_set,
+				scsi_host_check_in_flight, &cnt);
 	return cnt;
 }
 EXPORT_SYMBOL(scsi_host_busy);
@@ -672,8 +672,8 @@ static bool complete_all_cmds_iter(struct request *rq, void *data, bool rsvd)
  */
 void scsi_host_complete_all_commands(struct Scsi_Host *shost, int status)
 {
-	blk_mq_tagset_busy_iter_atomic(&shost->tag_set, complete_all_cmds_iter,
-				       &status);
+	blk_mq_tagset_busy_iter(&shost->tag_set, complete_all_cmds_iter,
+				&status);
 }
 EXPORT_SYMBOL_GPL(scsi_host_complete_all_commands);
 
@@ -694,11 +694,11 @@ static bool __scsi_host_busy_iter_fn(struct request *req, void *priv,
 /**
  * scsi_host_busy_iter - Iterate over all busy commands
  * @shost:	Pointer to Scsi_Host.
- * @fn:		Function to call on each busy command. Must not sleep.
+ * @fn:		Function to call on each busy command
  * @priv:	Data pointer passed to @fn
  *
  * If locking against concurrent command completions is required
- * it has to be provided by the caller.
+ * ithas to be provided by the caller
  **/
 void scsi_host_busy_iter(struct Scsi_Host *shost,
 			 bool (*fn)(struct scsi_cmnd *, void *, bool),
@@ -709,7 +709,7 @@ void scsi_host_busy_iter(struct Scsi_Host *shost,
 		.priv = priv,
 	};
 
-	blk_mq_tagset_busy_iter_atomic(&shost->tag_set,
-				       __scsi_host_busy_iter_fn, &iter_data);
+	blk_mq_tagset_busy_iter(&shost->tag_set, __scsi_host_busy_iter_fn,
+				&iter_data);
 }
 EXPORT_SYMBOL_GPL(scsi_host_busy_iter);
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 6d2f8f18e2a3..c86760788c72 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -1380,7 +1380,7 @@ static bool ufshcd_any_tag_in_use(struct ufs_hba *hba)
 	struct request_queue *q = hba->cmd_queue;
 	int busy = 0;
 
-	blk_mq_tagset_busy_iter_atomic(q->tag_set, ufshcd_is_busy, &busy);
+	blk_mq_tagset_busy_iter(q->tag_set, ufshcd_is_busy, &busy);
 	return busy;
 }
 
@@ -6269,7 +6269,7 @@ static irqreturn_t ufshcd_tmc_handler(struct ufs_hba *hba)
 		.pending = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL),
 	};
 
-	blk_mq_tagset_busy_iter_atomic(q->tag_set, ufshcd_compl_tm, &ci);
+	blk_mq_tagset_busy_iter(q->tag_set, ufshcd_compl_tm, &ci);
 	return ci.ncpl ? IRQ_HANDLED : IRQ_NONE;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index dfa0114a49fd..2c473c9b8990 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -526,8 +526,6 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
-void blk_mq_tagset_busy_iter_atomic(struct blk_mq_tag_set *tagset,
-		busy_tag_iter_fn *fn, void *priv);
 void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
-- 
2.29.2



* [PATCH 5/8] blk-mq: blk_mq_complete_request_locally
  2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
                   ` (3 preceding siblings ...)
  2021-04-25  8:57 ` [PATCH 4/8] Revert "blk-mq: Introduce atomic variants of blk_mq_(all_tag|tagset_busy)_iter" Ming Lei
@ 2021-04-25  8:57 ` Ming Lei
  2021-04-25  8:57 ` [PATCH 6/8] block: drivers: complete request locally from blk_mq_tagset_busy_iter Ming Lei
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2021-04-25  8:57 UTC (permalink / raw)
  To: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery, Ming Lei

Add blk_mq_complete_request_locally() for completing requests via
blk_mq_tagset_busy_iter(), so that we can avoid request UAF related
to queue releasing or request freeing.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 16 ++++++++++++++++
 include/linux/blk-mq.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 927189a55575..e3d1067b10c3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -681,6 +681,22 @@ void blk_mq_complete_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
+/**
+ * blk_mq_complete_request_locally - end I/O on a request locally
+ * @rq:		the request being processed
+ *
+ * Description:
+ *	Complete a request by calling ->mq_ops->complete() directly;
+ *	it is usually used in error handling via
+ *	blk_mq_tagset_busy_iter().
+ **/
+void blk_mq_complete_request_locally(struct request *rq)
+{
+	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
+	rq->q->mq_ops->complete(rq);
+}
+EXPORT_SYMBOL(blk_mq_complete_request_locally);
+
 static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
 	__releases(hctx->srcu)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b8990..f630bf9e497e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -511,6 +511,7 @@ void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 void blk_mq_complete_request(struct request *rq);
 bool blk_mq_complete_request_remote(struct request *rq);
+void blk_mq_complete_request_locally(struct request *rq);
 bool blk_mq_queue_stopped(struct request_queue *q);
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
 void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx);
-- 
2.29.2



* [PATCH 6/8] block: drivers: complete request locally from blk_mq_tagset_busy_iter
  2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
                   ` (4 preceding siblings ...)
  2021-04-25  8:57 ` [PATCH 5/8] blk-mq: blk_mq_complete_request_locally Ming Lei
@ 2021-04-25  8:57 ` Ming Lei
  2021-04-26  3:02   ` Bart Van Assche
  2021-04-25  8:57 ` [PATCH 7/8] blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter Ming Lei
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2021-04-25  8:57 UTC (permalink / raw)
  To: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery, Ming Lei

It can be a bit hard for a driver to avoid request UAF between normal
completion and completion via blk_mq_tagset_busy_iter() if async
completion is done in blk_mq_tagset_busy_iter(), because request->tag
is only freed after .mq_ops->complete() is called, and request->tag
may still be valid after blk_mq_complete_request() returns from the
normal completion path, so the request is still visible to
blk_mq_tagset_busy_iter().

This patch by itself can't avoid such request UAF completely. The next
patch grabs a request reference when walking requests via
blk_mq_tagset_busy_iter() to fix that race, which is why we have to
convert to blk_mq_complete_request_locally() first.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/mtip32xx/mtip32xx.c | 2 +-
 drivers/block/nbd.c               | 2 +-
 drivers/nvme/host/core.c          | 2 +-
 drivers/scsi/scsi_lib.c           | 6 +++++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 3be0dbc674bd..05f5e36ee608 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3748,7 +3748,7 @@ static bool mtip_no_dev_cleanup(struct request *rq, void *data, bool reserv)
 	struct mtip_cmd *cmd = blk_mq_rq_to_pdu(rq);
 
 	cmd->status = BLK_STS_IOERR;
-	blk_mq_complete_request(rq);
+	blk_mq_complete_request_locally(rq);
 	return true;
 }
 
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 4ff71b579cfc..3dcf3288efa8 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -809,7 +809,7 @@ static bool nbd_clear_req(struct request *req, void *data, bool reserved)
 	cmd->status = BLK_STS_IOERR;
 	mutex_unlock(&cmd->lock);
 
-	blk_mq_complete_request(req);
+	blk_mq_complete_request_locally(req);
 	return true;
 }
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0896e21642be..a605954477da 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -381,7 +381,7 @@ bool nvme_cancel_request(struct request *req, void *data, bool reserved)
 
 	nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
 	nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-	blk_mq_complete_request(req);
+	blk_mq_complete_request_locally(req);
 	return true;
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c289991ffaed..7cbaee282b6d 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1568,7 +1568,11 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
 	if (unlikely(test_and_set_bit(SCMD_STATE_COMPLETE, &cmd->state)))
 		return;
 	trace_scsi_dispatch_cmd_done(cmd);
-	blk_mq_complete_request(cmd->request);
+
+	if (unlikely(host_byte(cmd->result) != DID_OK))
+		blk_mq_complete_request_locally(cmd->request);
+	else
+		blk_mq_complete_request(cmd->request);
 }
 
 static void scsi_mq_put_budget(struct request_queue *q)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 7/8] blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter
  2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
                   ` (5 preceding siblings ...)
  2021-04-25  8:57 ` [PATCH 6/8] block: drivers: complete request locally from blk_mq_tagset_busy_iter Ming Lei
@ 2021-04-25  8:57 ` Ming Lei
  2021-04-25 18:55   ` Bart Van Assche
  2021-04-25  8:57 ` [PATCH 8/8] blk-mq: clear stale request in tags->rq[] before freeing one request pool Ming Lei
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2021-04-25  8:57 UTC (permalink / raw)
  To: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery, Ming Lei

Grab rq->ref before calling ->fn in blk_mq_tagset_busy_iter(); this
prevents the request from being re-used while ->fn is running. The
approach is the same as the one used for timeout handling.

This fixes request UAFs related to completion races and queue releasing:

- If a request is referenced before rq->q is frozen, the queue won't be
frozen before the request reference is released during iteration.

- If a request is referenced after rq->q is frozen,
refcount_inc_not_zero() will return false, and we won't iterate over
this request.

However, one request UAF is still not covered: refcount_inc_not_zero()
may read an already freed request; that is handled in the next patch.
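For readers following the patch, here is a minimal userspace C sketch of
the pattern this change introduces: visit a request only if its reference
count can be raised from a non-zero value, and drop the reference once
->fn returns. The struct and helper names below are illustrative stand-ins,
not the kernel's actual types.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for struct request; not kernel code. */
struct rq_sketch {
	atomic_uint ref;	/* 0 means the request is being freed */
	int tag;
};

/* Analogue of refcount_inc_not_zero(): raise the count unless it is 0. */
static bool ref_inc_not_zero(atomic_uint *ref)
{
	unsigned int old = atomic_load(ref);

	while (old != 0) {
		/* 'old' is reloaded automatically on CAS failure */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

static int visits;

static bool count_fn(struct rq_sketch *rq)
{
	(void)rq;
	visits++;
	return true;
}

/* Visit one tag slot: call @fn only while holding a reference. */
static bool iter_one(struct rq_sketch *rq, bool (*fn)(struct rq_sketch *))
{
	bool ret;

	if (!rq || !ref_inc_not_zero(&rq->ref))
		return true;	/* freed or being freed: skip it */
	ret = fn(rq);
	atomic_fetch_sub(&rq->ref, 1);	/* blk_mq_put_rq_ref() analogue */
	return ret;
}

/* Demo: a live request is visited, a freed one (ref == 0) is skipped. */
static int demo_iter(void)
{
	struct rq_sketch live = { .tag = 1 };
	struct rq_sketch dead = { .tag = 2 };

	atomic_store(&live.ref, 1);
	atomic_store(&dead.ref, 0);
	iter_one(&live, count_fn);
	iter_one(&dead, count_fn);
	return visits;
}
```

The in-kernel code additionally deals with the flush-request special case
in blk_mq_put_rq_ref(); this sketch only shows the grab/skip/release shape.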

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c | 14 +++++++++++---
 block/blk-mq.c     | 14 +++++++++-----
 block/blk-mq.h     |  1 +
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2a37731e8244..489d2db89856 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -264,6 +264,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	struct blk_mq_tags *tags = iter_data->tags;
 	bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
 	struct request *rq;
+	bool ret;
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
@@ -276,12 +277,15 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 		rq = tags->static_rqs[bitnr];
 	else
 		rq = tags->rqs[bitnr];
-	if (!rq)
+	if (!rq || !refcount_inc_not_zero(&rq->ref))
 		return true;
 	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
 	    !blk_mq_request_started(rq))
-		return true;
-	return iter_data->fn(rq, iter_data->data, reserved);
+		ret = true;
+	else
+		ret = iter_data->fn(rq, iter_data->data, reserved);
+	blk_mq_put_rq_ref(rq);
+	return ret;
 }
 
 /**
@@ -348,6 +352,10 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
  *		indicates whether or not @rq is a reserved request. Return
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
+ *
+ * We grab one request reference before calling @fn and release it after
+ * @fn returns. So far we don't support passing the request reference to
+ * one new context in @fn.
  */
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3d1067b10c3..9a4d520740a1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -925,6 +925,14 @@ static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
 	return false;
 }
 
+void blk_mq_put_rq_ref(struct request *rq)
+{
+	if (is_flush_rq(rq, rq->mq_hctx))
+		rq->end_io(rq, 0);
+	else if (refcount_dec_and_test(&rq->ref))
+		__blk_mq_free_request(rq);
+}
+
 static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
@@ -958,11 +966,7 @@ static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 	if (blk_mq_req_expired(rq, next))
 		blk_mq_rq_timed_out(rq, reserved);
 
-	if (is_flush_rq(rq, hctx))
-		rq->end_io(rq, 0);
-	else if (refcount_dec_and_test(&rq->ref))
-		__blk_mq_free_request(rq);
-
+	blk_mq_put_rq_ref(rq);
 	return true;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3616453ca28c..143afe42c63a 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -47,6 +47,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_ctx *start);
+void blk_mq_put_rq_ref(struct request *rq);
 
 /*
  * Internal helpers for allocating/freeing the request map
-- 
2.29.2



* [PATCH 8/8] blk-mq: clear stale request in tags->rq[] before freeing one request pool
  2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
                   ` (6 preceding siblings ...)
  2021-04-25  8:57 ` [PATCH 7/8] blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter Ming Lei
@ 2021-04-25  8:57 ` Ming Lei
  2021-04-25 20:42   ` Bart Van Assche
  2021-04-25  9:27 ` [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
  2021-04-25 16:17 ` Jens Axboe
  9 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2021-04-25  8:57 UTC (permalink / raw)
  To: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery, Ming Lei

refcount_inc_not_zero() in bt_tags_iter() may still read an already
freed request.

Fix the issue with the following approach:

1) hold a per-tags spinlock when reading ->rqs[tag] and calling
refcount_inc_not_zero() in bt_tags_iter()

2) clear any stale request referenced via ->rqs[tag] before freeing the
request pool, holding the per-tags spinlock while clearing the stale
->rqs[tag]

After the stale requests are cleared, bt_tags_iter() can no longer
observe a freed request, and the clearing also waits for any pending
request reference to be dropped.

The idea of clearing ->rqs[] is borrowed from John Garry's earlier
patch and a recent patch from David Jeffery.
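The clearing step described above can be sketched in userspace C as
follows: walk the slot array and clear any pointer that falls inside the
address range of the pool being freed, leaving other slots alone. All
names here are illustrative, not the kernel's.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of blk_mq_clear_rq_mapping()'s core loop; not kernel code. */
static void clear_stale_slots(void *_Atomic *rqs, size_t depth,
			      uintptr_t start, uintptr_t end)
{
	size_t i;

	for (i = 0; i < depth; i++) {
		void *rq = atomic_load(&rqs[i]);
		uintptr_t addr = (uintptr_t)rq;

		/* cmpxchg analogue: clear only if the slot still holds rq */
		if (rq && addr >= start && addr < end)
			atomic_compare_exchange_strong(&rqs[i], &rq,
						       (void *)0);
	}
}

static char pool[4][64];	/* stands in for the request pool pages */
static int other;		/* a request from a different pool */
static void *_Atomic rqs[3];

static int demo_clear(void)
{
	atomic_store(&rqs[0], (void *)&pool[1][0]);	/* stale: inside pool */
	atomic_store(&rqs[1], (void *)&other);		/* live: outside pool */
	atomic_store(&rqs[2], (void *)0);

	clear_stale_slots(rqs, 3, (uintptr_t)pool,
			  (uintptr_t)pool + sizeof(pool));

	/* 1 if the stale slot was cleared, +2 if the live slot survived */
	return (atomic_load(&rqs[0]) == (void *)0) +
	       2 * (atomic_load(&rqs[1]) == (void *)&other);
}
```

In the actual patch this loop runs under the per-tags spinlock so it
cannot race with bt_tags_iter() grabbing a reference.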

Cc: John Garry <john.garry@huawei.com>
Cc: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c |  9 ++++++++-
 block/blk-mq-tag.h |  3 +++
 block/blk-mq.c     | 39 ++++++++++++++++++++++++++++++++++-----
 3 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 489d2db89856..402a7860f6c9 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -265,10 +265,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
 	struct request *rq;
 	bool ret;
+	unsigned long flags;
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
 
+	spin_lock_irqsave(&tags->lock, flags);
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
@@ -277,8 +279,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 		rq = tags->static_rqs[bitnr];
 	else
 		rq = tags->rqs[bitnr];
-	if (!rq || !refcount_inc_not_zero(&rq->ref))
+	if (!rq || !refcount_inc_not_zero(&rq->ref)) {
+		spin_unlock_irqrestore(&tags->lock, flags);
 		return true;
+	}
+	spin_unlock_irqrestore(&tags->lock, flags);
+
 	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
 	    !blk_mq_request_started(rq))
 		ret = true;
@@ -524,6 +530,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
+	spin_lock_init(&tags->lock);
 
 	if (blk_mq_is_sbitmap_shared(flags))
 		return tags;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 7d3e6b333a4a..f942a601b5ef 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -20,6 +20,9 @@ struct blk_mq_tags {
 	struct request **rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
+
+	/* used to clear rqs[] before one request pool is freed */
+	spinlock_t lock;
 };
 
 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9a4d520740a1..1f913f786c1f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2307,6 +2307,38 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
+static size_t order_to_size(unsigned int order)
+{
+	return (size_t)PAGE_SIZE << order;
+}
+
+/* called before freeing request pool in @tags */
+static void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
+		struct blk_mq_tags *tags, unsigned int hctx_idx)
+{
+	struct blk_mq_tags *drv_tags = set->tags[hctx_idx];
+	struct page *page;
+	unsigned long flags;
+
+	spin_lock_irqsave(&drv_tags->lock, flags);
+	list_for_each_entry(page, &tags->page_list, lru) {
+		unsigned long start = (unsigned long)page_address(page);
+		unsigned long end = start + order_to_size(page->private);
+		int i;
+
+		for (i = 0; i < set->queue_depth; i++) {
+			struct request *rq = drv_tags->rqs[i];
+			unsigned long rq_addr = (unsigned long)rq;
+
+			if (rq_addr >= start && rq_addr < end) {
+				WARN_ON_ONCE(refcount_read(&rq->ref) != 0);
+				cmpxchg(&drv_tags->rqs[i], rq, NULL);
+			}
+		}
+	}
+	spin_unlock_irqrestore(&drv_tags->lock, flags);
+}
+
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx)
 {
@@ -2325,6 +2357,8 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		}
 	}
 
+	blk_mq_clear_rq_mapping(set, tags, hctx_idx);
+
 	while (!list_empty(&tags->page_list)) {
 		page = list_first_entry(&tags->page_list, struct page, lru);
 		list_del_init(&page->lru);
@@ -2384,11 +2418,6 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	return tags;
 }
 
-static size_t order_to_size(unsigned int order)
-{
-	return (size_t)PAGE_SIZE << order;
-}
-
 static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			       unsigned int hctx_idx, int node)
 {
-- 
2.29.2



* Re: [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests
  2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
                   ` (7 preceding siblings ...)
  2021-04-25  8:57 ` [PATCH 8/8] blk-mq: clear stale request in tags->rq[] before freeing one request pool Ming Lei
@ 2021-04-25  9:27 ` Ming Lei
  2021-04-25 20:53   ` Bart Van Assche
  2021-04-25 16:17 ` Jens Axboe
  9 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2021-04-25  9:27 UTC (permalink / raw)
  To: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery

On Sun, Apr 25, 2021 at 04:57:45PM +0800, Ming Lei wrote:
> Hi Guys,
> 
> Revert 4 patches from Bart which try to fix request UAF issue related
> with iterating over tagset wide requests, because:
> 
> 1) request UAF caused by normal completion vs. async completion during
> iterating can't be covered[1]
> 
> 2) clearing ->rqs[] is added in fast path, which causes performance loss
> by 1% according to Bart's test
> 
> 3) Bart's approach is too complicated, and some changes aren't needed,
> such as adding two versions of tagset iteration

4) synchronize_rcu() is added before shutting down one request queue,
which may slow down reboot/poweroff significantly on big systems with
lots of HBAs to which lots of LUNs are attached.

5) freeing the request pool when updating nr_requests isn't covered.

Thanks,
Ming



* Re: [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests
  2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
                   ` (8 preceding siblings ...)
  2021-04-25  9:27 ` [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
@ 2021-04-25 16:17 ` Jens Axboe
  2021-04-25 18:39   ` Bart Van Assche
  9 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2021-04-25 16:17 UTC (permalink / raw)
  To: Ming Lei, linux-nvme, linux-scsi, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Bart Van Assche, Khazhy Kumykov, Shin'ichiro Kawasaki,
	Hannes Reinecke, John Garry, David Jeffery

On 4/25/21 2:57 AM, Ming Lei wrote:
> Hi Guys,
> 
> Revert 4 patches from Bart which try to fix request UAF issue related
> with iterating over tagset wide requests, because:
> 
> 1) request UAF caused by normal completion vs. async completion during
> iterating can't be covered[1]
> 
> 2) clearing ->rqs[] is added in fast path, which causes performance loss
> by 1% according to Bart's test
> 
> 3) Bart's approach is too complicated, and some changes aren't needed,
> such as adding two versions of tagset iteration
> 
> This patchset fixes the request UAF issue by one simpler approach,
> without any change in fast path.
> 
> 1) always complete request synchronously when the completing is run
> via blk_mq_tagset_busy_iter(), done in 1st two patches
> 
> 2) grab request's ref before calling ->fn in blk_mq_tagset_busy_iter,
> and release it after calling ->fn, so ->fn won't be called for one
> request if its queue is frozen, done in 3rd patch
> 
> 3) clearing any stale request referred in ->rqs[] before freeing the
> request pool, one per-tags spinlock is added for protecting
> grabbing request ref vs. clearing ->rqs[tag], so UAF by refcount_inc_not_zero
> in bt_tags_iter() is avoided, done in 4th patch.

I'm going to pull the UAF series for now so we don't need to do a series
of reverts if we deem this a better approach. I'll take a further look
at it tomorrow.

-- 
Jens Axboe



* Re: [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests
  2021-04-25 16:17 ` Jens Axboe
@ 2021-04-25 18:39   ` Bart Van Assche
  2021-04-25 20:18     ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2021-04-25 18:39 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, linux-nvme, linux-scsi, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Khazhy Kumykov, Shin'ichiro Kawasaki, Hannes Reinecke,
	John Garry, David Jeffery

On 4/25/21 9:17 AM, Jens Axboe wrote:
> I'm going to pull the UAF series for now so we don't need to do a series
> of reverts if we deem this a better approach. I'll take a further look
> at it tomorrow.

Please give me the time to post review comments. My opinion about this
series is that (a) the analysis on which some patches are based is wrong
and also (b) that some patches introduce new bugs that are worse than
the current state of the for-next branch and also make it worse than the
v5.11 kernel.

Bart.




* Re: [PATCH 7/8] blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter
  2021-04-25  8:57 ` [PATCH 7/8] blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter Ming Lei
@ 2021-04-25 18:55   ` Bart Van Assche
  2021-04-26  0:41     ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2021-04-25 18:55 UTC (permalink / raw)
  To: Ming Lei, linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Khazhy Kumykov, Shin'ichiro Kawasaki, Hannes Reinecke,
	John Garry, David Jeffery

On 4/25/21 1:57 AM, Ming Lei wrote:
> However, still one request UAF not covered: refcount_inc_not_zero() may
> read one freed request, and it will be handled in next patch.

This means that patch "blk-mq: clear stale request in tags->rq[] before
freeing one request pool" should come before this patch.

> @@ -276,12 +277,15 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  		rq = tags->static_rqs[bitnr];
>  	else
>  		rq = tags->rqs[bitnr];
> -	if (!rq)
> +	if (!rq || !refcount_inc_not_zero(&rq->ref))
>  		return true;
>  	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
>  	    !blk_mq_request_started(rq))
> -		return true;
> -	return iter_data->fn(rq, iter_data->data, reserved);
> +		ret = true;
> +	else
> +		ret = iter_data->fn(rq, iter_data->data, reserved);
> +	blk_mq_put_rq_ref(rq);
> +	return ret;
>  }

Even if patches 7/8 and 8/8 would be reordered, the above code
introduces a new use-after-free, a use-after-free that is much worse
than the UAF in kernel v5.11. The following sequence can be triggered by
the above code:
* bt_tags_iter() reads tags->rqs[bitnr] and stores the request pointer
in the 'rq' variable.
* Request 'rq' completes, tags->rqs[bitnr] is cleared and the memory
that backs that request is freed.
* The memory that backs 'rq' is used for another purpose and the request
reference count becomes nonzero.
* bt_tags_iter() increments the request reference count and thereby
corrupts memory.

Bart.


* Re: [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests
  2021-04-25 18:39   ` Bart Van Assche
@ 2021-04-25 20:18     ` Jens Axboe
  0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2021-04-25 20:18 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei, linux-nvme, linux-scsi, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Khazhy Kumykov, Shin'ichiro Kawasaki, Hannes Reinecke,
	John Garry, David Jeffery

On 4/25/21 12:39 PM, Bart Van Assche wrote:
> On 4/25/21 9:17 AM, Jens Axboe wrote:
>> I'm going to pull the UAF series for now so we don't need to do a series
>> of reverts if we deem this a better approach. I'll take a further look
>> at it tomorrow.
> 
> Please give me the time to post review comments. My opinion about this
> series is that (a) the analysis on which some patches are based is wrong
> and also (b) that some patches introduce new bugs that are worse than
> the current state of the for-next branch and also make it worse than the
> v5.11 kernel.

Just to be clear, I did not apply this series, "pull the UAF series" just
meant un-applying your series so we have a sane baseline for something
that everybody likes. There's plenty of time to figure this out for
5.13.

-- 
Jens Axboe



* Re: [PATCH 8/8] blk-mq: clear stale request in tags->rq[] before freeing one request pool
  2021-04-25  8:57 ` [PATCH 8/8] blk-mq: clear stale request in tags->rq[] before freeing one request pool Ming Lei
@ 2021-04-25 20:42   ` Bart Van Assche
  2021-04-26  0:49     ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2021-04-25 20:42 UTC (permalink / raw)
  To: Ming Lei, linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Khazhy Kumykov, Shin'ichiro Kawasaki, Hannes Reinecke,
	John Garry, David Jeffery

On 4/25/21 1:57 AM, Ming Lei wrote:
> +	spin_lock_irqsave(&drv_tags->lock, flags);
> +	list_for_each_entry(page, &tags->page_list, lru) {
> +		unsigned long start = (unsigned long)page_address(page);
> +		unsigned long end = start + order_to_size(page->private);
> +		int i;
> +
> +		for (i = 0; i < set->queue_depth; i++) {
> +			struct request *rq = drv_tags->rqs[i];
> +			unsigned long rq_addr = (unsigned long)rq;
> +
> +			if (rq_addr >= start && rq_addr < end) {
> +				WARN_ON_ONCE(refcount_read(&rq->ref) != 0);
> +				cmpxchg(&drv_tags->rqs[i], rq, NULL);
> +			}
> +		}
> +	}
> +	spin_unlock_irqrestore(&drv_tags->lock, flags);

Using cmpxchg() on set->tags[] is only safe if all other set->tags[]
accesses are changed into WRITE_ONCE() or READ_ONCE().

Bart.


* Re: [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests
  2021-04-25  9:27 ` [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
@ 2021-04-25 20:53   ` Bart Van Assche
  2021-04-26  1:19     ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2021-04-25 20:53 UTC (permalink / raw)
  To: Ming Lei, linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Khazhy Kumykov, Shin'ichiro Kawasaki, Hannes Reinecke,
	John Garry, David Jeffery

On 4/25/21 2:27 AM, Ming Lei wrote:
> On Sun, Apr 25, 2021 at 04:57:45PM +0800, Ming Lei wrote:
>> Revert 4 patches from Bart which try to fix request UAF issue related
>> with iterating over tagset wide requests, because:

Where were you during the four weeks that my patch series was out for
review? I haven't seen any feedback from you on my patch series.

>> 1) request UAF caused by normal completion vs. async completion during
>> iterating can't be covered[1]

I do not agree with the above. Patches 5/8 and 6/8 from this series can
be applied without reverting any of my patches.

> 4) synchronize_rcu() is added before shutting down one request queue,
> which may slow down reboot/poweroff very much on big systems with lots of
> HBAs in which lots of LUNs are attached.

The synchronize_rcu() can be removed by using a semaphore
(<linux/semaphore.h>) instead of an RCU reader lock inside bt_tags_iter().

> 5) freeing request pool in updating nr_requests isn't covered.

This can be addressed easily on top of my patch series.

Bart.


* Re: [PATCH 7/8] blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter
  2021-04-25 18:55   ` Bart Van Assche
@ 2021-04-26  0:41     ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2021-04-26  0:41 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig, Khazhy Kumykov,
	Shin'ichiro Kawasaki, Hannes Reinecke, John Garry,
	David Jeffery

On Sun, Apr 25, 2021 at 11:55:22AM -0700, Bart Van Assche wrote:
> On 4/25/21 1:57 AM, Ming Lei wrote:
> > However, still one request UAF not covered: refcount_inc_not_zero() may
> > read one freed request, and it will be handled in next patch.
> 
> This means that patch "blk-mq: clear stale request in tags->rq[] before
> freeing one request pool" should come before this patch.

It doesn't matter. This patch alone can't avoid the UAF either; we
need to grab req->ref to prevent the queue from being frozen.

> 
> > @@ -276,12 +277,15 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >  		rq = tags->static_rqs[bitnr];
> >  	else
> >  		rq = tags->rqs[bitnr];
> > -	if (!rq)
> > +	if (!rq || !refcount_inc_not_zero(&rq->ref))
> >  		return true;
> >  	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
> >  	    !blk_mq_request_started(rq))
> > -		return true;
> > -	return iter_data->fn(rq, iter_data->data, reserved);
> > +		ret = true;
> > +	else
> > +		ret = iter_data->fn(rq, iter_data->data, reserved);
> > +	blk_mq_put_rq_ref(rq);
> > +	return ret;
> >  }
> 
> Even if patches 7/8 and 8/8 would be reordered, the above code
> introduces a new use-after-free, a use-after-free that is much worse
> than the UAF in kernel v5.11. The following sequence can be triggered by
> the above code:
> * bt_tags_iter() reads tags->rqs[bitnr] and stores the request pointer
> in the 'rq' variable.
> * Request 'rq' completes, tags->rqs[bitnr] is cleared and the memory
> that backs that request is freed.
> * The memory that backs 'rq' is used for another purpose and the request
> reference count becomes nonzero.

That means the 'rq' is re-allocated, and it becomes in-flight again.

> * bt_tags_iter() increments the request reference count and thereby
> corrupts memory.

No. When refcount_inc_not_zero() succeeds in bt_tags_iter(), no one
can free the request any more until ->fn() returns, so why do you
think memory is corrupted? This pattern is no different from the
timeout handling's usage, is it?

If IO activity is allowed while iterating over tagset requests, ->fn()
and in-flight IO can always run concurrently. It is the caller's
responsibility to handle that race, which is why you can see lots of
callers quiesce queues before calling blk_mq_tagset_busy_iter();
quiescing isn't required if ->fn() only READs the request.

Your patch, and the current in-tree code, have the same 'problem' too,
if you consider it a problem. Clearing ->rqs[tag] or holding a lock
before calling ->fn() cannot avoid it either, can it?

Finally, since this walks requests tagset-wide, it should be safe for
->fn() to iterate over requests this way. The only caveat is that
req->tag may no longer be the same as 'bitnr'. We could handle that
simply by checking 'req->tag == bitnr' in bt_tags_iter() after
req->ref is grabbed, though I'm still not sure that is absolutely
necessary.
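The re-check being discussed is small; a hedged userspace sketch of it,
with purely illustrative names, would look like this: after the
reference is grabbed, compare the request's tag against the bit the
iterator landed on, since a re-allocated request may now carry a
different tag.

```c
#include <stdbool.h>

/* Illustrative stand-in; not the kernel's struct request. */
struct rq_tag_sketch {
	int tag;
};

/* Hypothetical recheck from the discussion: only proceed to ->fn()
 * when the request still owns the tag we iterated to. */
static bool tag_still_matches(const struct rq_tag_sketch *rq,
			      unsigned int bitnr)
{
	return rq->tag == (int)bitnr;
}

static int demo_recheck(void)
{
	struct rq_tag_sketch rq = { .tag = 7 };

	/* tag 7: still owned; tag 3: would indicate re-allocation */
	return tag_still_matches(&rq, 7) * 10 + tag_still_matches(&rq, 3);
}
```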


Thanks,
Ming



* Re: [PATCH 8/8] blk-mq: clear stale request in tags->rq[] before freeing one request pool
  2021-04-25 20:42   ` Bart Van Assche
@ 2021-04-26  0:49     ` Ming Lei
  2021-04-26  1:50       ` Bart Van Assche
  0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2021-04-26  0:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig, Khazhy Kumykov,
	Shin'ichiro Kawasaki, Hannes Reinecke, John Garry,
	David Jeffery

On Sun, Apr 25, 2021 at 01:42:59PM -0700, Bart Van Assche wrote:
> On 4/25/21 1:57 AM, Ming Lei wrote:
> > +	spin_lock_irqsave(&drv_tags->lock, flags);
> > +	list_for_each_entry(page, &tags->page_list, lru) {
> > +		unsigned long start = (unsigned long)page_address(page);
> > +		unsigned long end = start + order_to_size(page->private);
> > +		int i;
> > +
> > +		for (i = 0; i < set->queue_depth; i++) {
> > +			struct request *rq = drv_tags->rqs[i];
> > +			unsigned long rq_addr = (unsigned long)rq;
> > +
> > +			if (rq_addr >= start && rq_addr < end) {
> > +				WARN_ON_ONCE(refcount_read(&rq->ref) != 0);
> > +				cmpxchg(&drv_tags->rqs[i], rq, NULL);
> > +			}
> > +		}
> > +	}
> > +	spin_unlock_irqrestore(&drv_tags->lock, flags);
> 
> Using cmpxchg() on set->tags[] is only safe if all other set->tags[]
> accesses are changed into WRITE_ONCE() or READ_ONCE().

Why?

The semantics of cmpxchg() are to modify the value pointed to by the
address only if its old value is the same as the passed 'rq'. That is
exactly what we need.

Writing a 'void *' is always atomic. If someone has touched
'->rqs[tag]', cmpxchg() won't modify the value.
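The compare-and-exchange semantics described here can be demonstrated
with C11 atomics in a userspace sketch (names are illustrative): the
slot is set to NULL only if it still holds the expected old pointer,
while a slot re-assigned in the meantime is left untouched.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* cmpxchg(&slot, old, NULL) analogue using C11 atomics. */
static bool clear_if_unchanged(void *_Atomic *slot, void *old)
{
	return atomic_compare_exchange_strong(slot, &old, (void *)0);
}

static int a_rq, b_rq;		/* stand-ins for two requests */
static void *_Atomic slot;

static int demo_cmpxchg(void)
{
	bool first, second;

	atomic_store(&slot, (void *)&a_rq);
	first = clear_if_unchanged(&slot, &a_rq);	/* matches: cleared */

	atomic_store(&slot, (void *)&b_rq);		/* slot re-assigned */
	second = clear_if_unchanged(&slot, &a_rq);	/* stale: no effect */

	return (first && !second &&
		atomic_load(&slot) == (void *)&b_rq) ? 1 : 0;
}
```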



Thanks, 
Ming



* Re: [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests
  2021-04-25 20:53   ` Bart Van Assche
@ 2021-04-26  1:19     ` Ming Lei
  2021-04-26  1:57       ` Bart Van Assche
  0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2021-04-26  1:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig, Khazhy Kumykov,
	Shin'ichiro Kawasaki, Hannes Reinecke, John Garry,
	David Jeffery

On Sun, Apr 25, 2021 at 01:53:16PM -0700, Bart Van Assche wrote:
> On 4/25/21 2:27 AM, Ming Lei wrote:
> > On Sun, Apr 25, 2021 at 04:57:45PM +0800, Ming Lei wrote:
> >> Revert 4 patches from Bart which try to fix request UAF issue related
> >> with iterating over tagset wide requests, because:
> 
> Where were you during the four weeks that my patch series was out for
> review? I haven't seen any feedback from you on my patch series.

To be honest, it was just two days ago that I had to take a close look
at your patchset, because we may have to backport your patches to
address one high-priority RH report.

David is on the CC list, and Laurence and David are looking into the
report too.

> 
> >> 1) request UAF caused by normal completion vs. async completion during
> >> iterating can't be covered[1]
> 
> I do not agree with the above. Patches 5/8 and 6/8 from this series can
> be applied without reverting any of my patches.

The thing is that patches 5-8 can fix the issue in a simpler way
without adding extra cost to the fast path, and the idea is easier to
prove.

BTW, as a downstream kernel developer, I really hope all fixes are
simple and easy to backport. More importantly, I prefer approaches
that can be proved/verified easily, so further regressions can be
avoided.

> 
> > 4) synchronize_rcu() is added before shutting down one request queue,
> > which may slow down reboot/poweroff very much on big systems with lots of
> > HBAs in which lots of LUNs are attached.
> 
> The synchronize_rcu() can be removed by using a semaphore
> (<linux/semaphore.h>) instead of an RCU reader lock inside bt_tags_iter().

I am not sure you can, because some iterations are done in atomic context.


Thanks,
Ming



* Re: [PATCH 8/8] blk-mq: clear stale request in tags->rq[] before freeing one request pool
  2021-04-26  0:49     ` Ming Lei
@ 2021-04-26  1:50       ` Bart Van Assche
  2021-04-26  2:07         ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2021-04-26  1:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig, Khazhy Kumykov,
	Shin'ichiro Kawasaki, Hannes Reinecke, John Garry,
	David Jeffery

On 4/25/21 5:49 PM, Ming Lei wrote:
> On Sun, Apr 25, 2021 at 01:42:59PM -0700, Bart Van Assche wrote:
>> Using cmpxchg() on set->tags[] is only safe if all other set->tags[]
>> accesses are changed into WRITE_ONCE() or READ_ONCE().
> 
> Why?
> 
> Semantic of cmpxchg() is to modify value pointed by the address if its
> old value is same with passed 'rq'. That is exactly what we need.
> 
> writting 'void *' is always atomic. if someone has touched
> '->rqs[tag]', cmpxchg() won't modify the value.

WRITE_ONCE() supports data types that have the same size as char, short,
int, long and long long. That includes void *. If writes to these data
types were always atomic, we wouldn't need the WRITE_ONCE()
macro. The explanation at the top of the rwonce.h header file is as
follows: "Prevent the compiler from merging or refetching reads or
writes. [ ... ] Ensuring that the compiler does not fold, spindle, or
otherwise mutilate accesses that either do not require ordering or that
interact with an explicit memory barrier or atomic instruction that
provides the required ordering."

Bart.


* Re: [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests
  2021-04-26  1:19     ` Ming Lei
@ 2021-04-26  1:57       ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2021-04-26  1:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig, Khazhy Kumykov,
	Shin'ichiro Kawasaki, Hannes Reinecke, John Garry,
	David Jeffery

On 4/25/21 6:19 PM, Ming Lei wrote:
> On Sun, Apr 25, 2021 at 01:53:16PM -0700, Bart Van Assche wrote:
>> On 4/25/21 2:27 AM, Ming Lei wrote:
>>> 4) synchronize_rcu() is added before shutting down one request queue,
>>> which may slow down reboot/poweroff very much on big systems with lots of
>>> HBAs in which lots of LUNs are attached.
>>
>> The synchronize_rcu() can be removed by using a semaphore
>> (<linux/semaphore.h>) instead of an RCU reader lock inside bt_tags_iter().
> 
> I am not sure you can, because some iteration is done in atomic context.

I meant <linux/rwlock.h>. Functions like read_lock_irq() that are
declared in that header file are appropriate for atomic context.

Bart.


* Re: [PATCH 8/8] blk-mq: clear stale request in tags->rq[] before freeing one request pool
  2021-04-26  1:50       ` Bart Van Assche
@ 2021-04-26  2:07         ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2021-04-26  2:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig, Khazhy Kumykov,
	Shin'ichiro Kawasaki, Hannes Reinecke, John Garry,
	David Jeffery

On Sun, Apr 25, 2021 at 06:50:48PM -0700, Bart Van Assche wrote:
> On 4/25/21 5:49 PM, Ming Lei wrote:
> > On Sun, Apr 25, 2021 at 01:42:59PM -0700, Bart Van Assche wrote:
> >> Using cmpxchg() on set->tags[] is only safe if all other set->tags[]
> >> accesses are changed into WRITE_ONCE() or READ_ONCE().
> > 
> > Why?
> > 
> > The semantics of cmpxchg() are to modify the value pointed to by
> > the address if its old value is the same as the passed 'rq'. That
> > is exactly what we need.
> > 
> > Writing a 'void *' is always atomic. If someone has touched
> > '->rqs[tag]', cmpxchg() won't modify the value.
> 
> WRITE_ONCE() supports data types that have the same size as char, short,
> int, long and long long. That includes void *. If writes to these data
> types would always be atomic then we wouldn't need the WRITE_ONCE()
> macro.

OK, then we don't need WRITE_ONCE(), since writes to tags->rqs[i] are
always atomic.


Thanks, 
Ming



* Re: [PATCH 6/8] block: drivers: complete request locally from blk_mq_tagset_busy_iter
  2021-04-25  8:57 ` [PATCH 6/8] block: drivers: complete request locally from blk_mq_tagset_busy_iter Ming Lei
@ 2021-04-26  3:02   ` Bart Van Assche
  2021-04-26  6:24     ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2021-04-26  3:02 UTC (permalink / raw)
  To: Ming Lei, linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig
  Cc: Khazhy Kumykov, Shin'ichiro Kawasaki, Hannes Reinecke,
	John Garry, David Jeffery

On 4/25/21 1:57 AM, Ming Lei wrote:
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index c289991ffaed..7cbaee282b6d 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1568,7 +1568,11 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
>  	if (unlikely(test_and_set_bit(SCMD_STATE_COMPLETE, &cmd->state)))
>  		return;
>  	trace_scsi_dispatch_cmd_done(cmd);
> -	blk_mq_complete_request(cmd->request);
> +
> +	if (unlikely(host_byte(cmd->result) != DID_OK))
> +		blk_mq_complete_request_locally(cmd->request);
> +	else
> +		blk_mq_complete_request(cmd->request);
>  }

This change is so tricky that it deserves a comment.

An even better approach would be *not* to export
blk_mq_complete_request_locally() from the block layer to block drivers
and instead modify the block layer such that it completes a request on
the same CPU if request completion happens from inside the context of a
tag iteration function. That would save driver writers the trouble of
learning yet another block layer API.

Bart.


* Re: [PATCH 6/8] block: drivers: complete request locally from blk_mq_tagset_busy_iter
  2021-04-26  3:02   ` Bart Van Assche
@ 2021-04-26  6:24     ` Ming Lei
  2021-04-27  8:54       ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2021-04-26  6:24 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig, Khazhy Kumykov,
	Shin'ichiro Kawasaki, Hannes Reinecke, John Garry,
	David Jeffery

On Sun, Apr 25, 2021 at 08:02:10PM -0700, Bart Van Assche wrote:
> On 4/25/21 1:57 AM, Ming Lei wrote:
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index c289991ffaed..7cbaee282b6d 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1568,7 +1568,11 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
> >  	if (unlikely(test_and_set_bit(SCMD_STATE_COMPLETE, &cmd->state)))
> >  		return;
> >  	trace_scsi_dispatch_cmd_done(cmd);
> > -	blk_mq_complete_request(cmd->request);
> > +
> > +	if (unlikely(host_byte(cmd->result) != DID_OK))
> > +		blk_mq_complete_request_locally(cmd->request);
> > +	else
> > +		blk_mq_complete_request(cmd->request);
> >  }
> 
> This change is so tricky that it deserves a comment.
> 
> An even better approach would be *not* to export
> blk_mq_complete_request_locally() from the block layer to block drivers
> and instead modify the block layer such that it completes a request on
> the same CPU if request completion happens from inside the context of a
> tag iteration function. That would save driver writers the trouble of
> learning yet another block layer API.

Yeah, that is possible; one request flag (e.g. RQF_ITERATED) could be
added. The flag would be set before calling ->fn() and evaluated in
blk_mq_complete_request_remote().

Thanks,
Ming



* Re: [PATCH 6/8] block: drivers: complete request locally from blk_mq_tagset_busy_iter
  2021-04-26  6:24     ` Ming Lei
@ 2021-04-27  8:54       ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2021-04-27  8:54 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-nvme, linux-scsi, Jens Axboe, linux-block,
	Martin K . Petersen, Christoph Hellwig, Khazhy Kumykov,
	Shin'ichiro Kawasaki, Hannes Reinecke, John Garry,
	David Jeffery

On Mon, Apr 26, 2021 at 02:24:56PM +0800, Ming Lei wrote:
> On Sun, Apr 25, 2021 at 08:02:10PM -0700, Bart Van Assche wrote:
> > On 4/25/21 1:57 AM, Ming Lei wrote:
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > index c289991ffaed..7cbaee282b6d 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -1568,7 +1568,11 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
> > >  	if (unlikely(test_and_set_bit(SCMD_STATE_COMPLETE, &cmd->state)))
> > >  		return;
> > >  	trace_scsi_dispatch_cmd_done(cmd);
> > > -	blk_mq_complete_request(cmd->request);
> > > +
> > > +	if (unlikely(host_byte(cmd->result) != DID_OK))
> > > +		blk_mq_complete_request_locally(cmd->request);
> > > +	else
> > > +		blk_mq_complete_request(cmd->request);
> > >  }
> > 
> > This change is so tricky that it deserves a comment.
> > 
> > An even better approach would be *not* to export
> > blk_mq_complete_request_locally() from the block layer to block drivers
> > and instead modify the block layer such that it completes a request on
> > the same CPU if request completion happens from inside the context of a
> > tag iteration function. That would save driver writers the trouble of
> > learning yet another block layer API.
> 
> Yeah, that is possible; one request flag (e.g. RQF_ITERATED) could be
> added. The flag would be set before calling ->fn() and evaluated in
> blk_mq_complete_request_remote().

Thinking about the issue further: since we grab rq->refcnt before
calling ->fn(), not only is the request UAF fixed, but the driver also
gets a valid request instance for checking whether the request has been
completed. So we needn't consider the double completion issue[1] any
more; that is supposed to be covered by the driver, e.g. SCSI uses
SCMD_STATE_COMPLETE in scsi_mq_done() to avoid double completion.

[1] https://lore.kernel.org/linux-block/YIdWz8C5eklPvEiV@T590/T/#u

Thanks,
Ming



end of thread, other threads:[~2021-04-27  8:54 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-25  8:57 [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
2021-04-25  8:57 ` [PATCH 1/8] Revert "blk-mq: Fix races between blk_mq_update_nr_hw_queues() and iterating over tags" Ming Lei
2021-04-25  8:57 ` [PATCH 2/8] Revert "blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list" Ming Lei
2021-04-25  8:57 ` [PATCH 3/8] Revert "blk-mq: Fix races between iterating over requests and freeing requests" Ming Lei
2021-04-25  8:57 ` [PATCH 4/8] Revert "blk-mq: Introduce atomic variants of blk_mq_(all_tag|tagset_busy)_iter" Ming Lei
2021-04-25  8:57 ` [PATCH 5/8] blk-mq: blk_mq_complete_request_locally Ming Lei
2021-04-25  8:57 ` [PATCH 6/8] block: drivers: complete request locally from blk_mq_tagset_busy_iter Ming Lei
2021-04-26  3:02   ` Bart Van Assche
2021-04-26  6:24     ` Ming Lei
2021-04-27  8:54       ` Ming Lei
2021-04-25  8:57 ` [PATCH 7/8] blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter Ming Lei
2021-04-25 18:55   ` Bart Van Assche
2021-04-26  0:41     ` Ming Lei
2021-04-25  8:57 ` [PATCH 8/8] blk-mq: clear stale request in tags->rq[] before freeing one request pool Ming Lei
2021-04-25 20:42   ` Bart Van Assche
2021-04-26  0:49     ` Ming Lei
2021-04-26  1:50       ` Bart Van Assche
2021-04-26  2:07         ` Ming Lei
2021-04-25  9:27 ` [PATCH 0/8] blk-mq: fix request UAF related with iterating over tagset requests Ming Lei
2021-04-25 20:53   ` Bart Van Assche
2021-04-26  1:19     ` Ming Lei
2021-04-26  1:57       ` Bart Van Assche
2021-04-25 16:17 ` Jens Axboe
2021-04-25 18:39   ` Bart Van Assche
2021-04-25 20:18     ` Jens Axboe
