linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2 0/2] blk-mq: Avoid use-after-free for accessing old requests
@ 2020-12-17 11:07 John Garry
  2020-12-17 11:07 ` [RFC PATCH v2 1/2] blk-mq: Clean up references to old requests when freeing rqs John Garry
  2020-12-17 11:07 ` [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter " John Garry
  0 siblings, 2 replies; 23+ messages in thread
From: John Garry @ 2020-12-17 11:07 UTC (permalink / raw)
  To: axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, bvanassche,
	kashyap.desai, linuxarm, John Garry

This series aims to tackle the various UAF reports, like:
- https://lore.kernel.org/linux-block/8376443a-ec1b-0cef-8244-ed584b96fa96@huawei.com/
- https://lore.kernel.org/linux-block/5c3ac5af-ed81-11e4-fee3-f92175f14daf@acm.org/T/#m6c1ac11540522716f645d004e2a5a13c9f218908
- https://lore.kernel.org/linux-block/04e2f9e8-79fa-f1cb-ab23-4a15bf3f64cc@kernel.dk/

Details are in the commit messages. The most important detail is that the
fastpath is untouched.

The issue addressed in patch 1/2 is pretty easy to reproduce, 2/2 not so
much.

Differences to v1:
- add 2nd patch

John Garry (2):
  blk-mq: Clean up references to old requests when freeing rqs
  blk-mq: Lockout tagset iter when freeing rqs

 block/blk-mq-sched.c |  2 +-
 block/blk-mq-tag.c   | 22 +++++++++++++++++++---
 block/blk-mq-tag.h   |  3 +++
 block/blk-mq.c       | 22 ++++++++++++++++++++--
 block/blk-mq.h       |  2 ++
 5 files changed, 45 insertions(+), 6 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 1/2] blk-mq: Clean up references to old requests when freeing rqs
  2020-12-17 11:07 [RFC PATCH v2 0/2] blk-mq: Avoid use-after-free for accessing old requests John Garry
@ 2020-12-17 11:07 ` John Garry
  2020-12-17 11:07 ` [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter " John Garry
  1 sibling, 0 replies; 23+ messages in thread
From: John Garry @ 2020-12-17 11:07 UTC (permalink / raw)
  To: axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, bvanassche,
	kashyap.desai, linuxarm, John Garry

It has been reported many times that a use-after-free can be intermittently
found when iterating busy requests:

- https://lore.kernel.org/linux-block/8376443a-ec1b-0cef-8244-ed584b96fa96@huawei.com/
- https://lore.kernel.org/linux-block/5c3ac5af-ed81-11e4-fee3-f92175f14daf@acm.org/T/#m6c1ac11540522716f645d004e2a5a13c9f218908
- https://lore.kernel.org/linux-block/04e2f9e8-79fa-f1cb-ab23-4a15bf3f64cc@kernel.dk/

The issue is that when we switch IO scheduler or change queue depth,
there may be references in the driver tagset to the stale requests; and
since getting a tag and setting the request pointer is not atomic, a
tagset iter may see a stale request for a tag recently allocated.

As a solution, clean up any references to those requests in the driver
tagset. This is done with a cmpxchg so that any race with another queue
setting the driver tagset request pointer is handled safely.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-sched.c |  2 +-
 block/blk-mq-tag.c   |  2 +-
 block/blk-mq.c       | 20 ++++++++++++++++++--
 block/blk-mq.h       |  2 ++
 4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index deff4e826e23..fa1035e1ac6e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -621,7 +621,7 @@ void blk_mq_sched_free_requests(struct request_queue *q)
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (hctx->sched_tags)
-			blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);
+			blk_mq_free_rqs_ext(q->tag_set, hctx->sched_tags, i, hctx->tags);
 	}
 }
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 6cfd6bb00ede..a6df2d5df88a 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -580,7 +580,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 			return -ENOMEM;
 		}
 
-		blk_mq_free_rqs(set, *tagsptr, hctx->queue_num);
+		blk_mq_free_rqs_ext(set, *tagsptr, hctx->queue_num, hctx->tags);
 		blk_mq_free_rq_map(*tagsptr, flags);
 		*tagsptr = new;
 	} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6f207ec9ef83..8465d7c5ebf0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2278,8 +2278,8 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
-void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
-		     unsigned int hctx_idx)
+void __blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+		     unsigned int hctx_idx, struct blk_mq_tags *ref_tags)
 {
 	struct page *page;
 
@@ -2288,10 +2288,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 
 		for (i = 0; i < tags->nr_tags; i++) {
 			struct request *rq = tags->static_rqs[i];
+			int j;
 
 			if (!rq)
 				continue;
 			set->ops->exit_request(set, rq, hctx_idx);
+			/* clean up any references which occur in @ref_tags */
+			for (j = 0; ref_tags && j < ref_tags->nr_tags; j++)
+				cmpxchg(&ref_tags->rqs[j], rq, 0);
 			tags->static_rqs[i] = NULL;
 		}
 	}
@@ -2308,6 +2312,18 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	}
 }
 
+void blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+		     unsigned int hctx_idx, struct blk_mq_tags *ref_tags)
+{
+	__blk_mq_free_rqs_ext(set, tags, hctx_idx, ref_tags);
+}
+			 
+void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+		     unsigned int hctx_idx)
+{
+	__blk_mq_free_rqs_ext(set, tags, hctx_idx, NULL);
+}
+
 void blk_mq_free_rq_map(struct blk_mq_tags *tags, unsigned int flags)
 {
 	kfree(tags->rqs);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index c1458d9502f1..487ca30a6509 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -53,6 +53,8 @@ struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
  */
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx);
+void blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+		     unsigned int hctx_idx, struct blk_mq_tags *references);
 void blk_mq_free_rq_map(struct blk_mq_tags *tags, unsigned int flags);
 struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 					unsigned int hctx_idx,
-- 
2.26.2



* [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-17 11:07 [RFC PATCH v2 0/2] blk-mq: Avoid use-after-free for accessing old requests John Garry
  2020-12-17 11:07 ` [RFC PATCH v2 1/2] blk-mq: Clean up references to old requests when freeing rqs John Garry
@ 2020-12-17 11:07 ` John Garry
  2020-12-18  1:55   ` Bart Van Assche
                     ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: John Garry @ 2020-12-17 11:07 UTC (permalink / raw)
  To: axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, bvanassche,
	kashyap.desai, linuxarm, John Garry

References to old IO sched requests are currently cleared from the
tagset when freeing those requests; switching elevator or changing
request queue depth is such a scenario in which this occurs.

However, this does not stop the potentially racy behaviour of freeing
and clearing a request reference between a tagset iterator getting a
reference to a request and actually dereferencing that request.

Such a use-after-free can be triggered, as follows:

==================================================================
BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
Read of size 8 at addr ffff00108d589300 by task fio/3052

CPU: 32 PID: 3052 Comm: fio Tainted: GW
5.10.0-rc4-64839-g2dcf1ee5054f #693
Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon
D05 IT21 Nemo 2.0 RC0 04/18/2018
Call trace:
dump_backtrace+0x0/0x2d0
show_stack+0x18/0x68
dump_stack+0x100/0x16c
print_address_description.constprop.12+0x6c/0x4e8
kasan_report+0x130/0x200
__asan_load8+0x9c/0xd8
bt_iter+0xa0/0x120
blk_mq_queue_tag_busy_iter+0x2d8/0x540
blk_mq_in_flight+0x80/0xb8
part_stat_show+0xd8/0x238
dev_attr_show+0x44/0x90
sysfs_kf_seq_show+0x128/0x1c8
kernfs_seq_show+0xa0/0xb8
seq_read_iter+0x1ec/0x6a0
seq_read+0x1d0/0x250
kernfs_fop_read+0x70/0x330
vfs_read+0xe4/0x250
ksys_read+0xc8/0x178
__arm64_sys_read+0x44/0x58
el0_svc_common.constprop.2+0xc4/0x1e8
do_el0_svc+0x90/0xa0
el0_sync_handler+0x128/0x178
el0_sync+0x158/0x180

This is found experimentally by running fio on 2x SCSI disks - 1x disk
holds the root partition. Userspace is constantly triggering the tagset
iter from reading the root (gen)disk partition info. And so if the IO
sched is constantly changed on the other disk, eventually the UAF occurs,
as described above.

For experiment's sake, an mdelay() is added to trigger the UAF within a
sane timeframe, as follows:

+++ b/block/blk-mq-tag.c
@@ -215,8 +215,11 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
         * We can hit rq == NULL here, because the tagging functions
         * test and set the bit before assigning ->rqs[].
         */
-       if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
-               return iter_data->fn(hctx, rq, iter_data->data, reserved);
+       if (rq) {
+               mdelay(50);
+               if (rq->q == hctx->queue && rq->mq_hctx == hctx)
+                       return iter_data->fn(hctx, rq, iter_data->data, reserved);
+       }
        return true;
 }

This delay increases the window for the iter in between getting the
request reference and actually dereferencing it.

To solve this problem, lock out the per-hw queue tagset iterators while
freeing IO sched tags.
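The lockout scheme in this patch can be sketched in user space with C11 atomics. All names below are invented for the sketch: the counter is 1 when idle, iterators take references on top of that (like atomic_inc_not_zero()), and the freeing path flips 1 -> 0 so no new iterator can start.

```c
#include <stdatomic.h>

/* 1 == idle; >1 == iterators in flight; 0 == locked out for freeing. */
atomic_int iter_usage_counter = 1;

/* Iterator entry: increment only if nonzero - never revive a dead counter. */
int iter_enter(void)
{
	int old = atomic_load(&iter_usage_counter);

	while (old != 0)
		if (atomic_compare_exchange_weak(&iter_usage_counter,
						 &old, old + 1))
			return 1;
	return 0;
}

void iter_exit(void)
{
	atomic_fetch_sub(&iter_usage_counter, 1);
}

/*
 * Freeing path: spin until every in-flight iterator has dropped its
 * reference (counter back at 1), then park the counter at 0. This
 * mirrors while (atomic_cmpxchg(&...iter_usage_counter, 1, 0) != 1).
 */
void freeing_lockout_begin(void)
{
	int expected = 1;

	while (!atomic_compare_exchange_weak(&iter_usage_counter,
					     &expected, 0))
		expected = 1;
}

void freeing_lockout_end(void)
{
	atomic_store(&iter_usage_counter, 1);
}
```

Note that, unlike a mutex, this gate allows any number of parallel iterators; only the freeing path is exclusive.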

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-tag.c | 20 ++++++++++++++++++--
 block/blk-mq-tag.h |  3 +++
 block/blk-mq.c     |  2 ++
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a6df2d5df88a..853ed5b889aa 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -358,10 +358,19 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 {
 	int i;
 
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
-		if (tagset->tags && tagset->tags[i])
-			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
+		if (tagset->tags && tagset->tags[i]) {
+			struct blk_mq_tags *tags = tagset->tags[i];
+
+			if (!atomic_inc_not_zero(&tags->iter_usage_counter))
+				continue;
+
+			__blk_mq_all_tag_iter(tags, fn, priv,
 					      BT_TAG_ITER_STARTED);
+
+			atomic_dec(&tags->iter_usage_counter);
+		}
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
@@ -435,9 +444,14 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		if (!blk_mq_hw_queue_mapped(hctx))
 			continue;
 
+		if (!atomic_inc_not_zero(&tags->iter_usage_counter))
+			continue;
+
 		if (tags->nr_reserved_tags)
 			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
 		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
+
+		atomic_dec(&tags->iter_usage_counter);
 	}
 	blk_queue_exit(q);
 }
@@ -461,6 +475,8 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 		     round_robin, node))
 		goto free_bitmap_tags;
 
+	atomic_set(&tags->iter_usage_counter, 1);
+
 	tags->bitmap_tags = &tags->__bitmap_tags;
 	tags->breserved_tags = &tags->__breserved_tags;
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 7d3e6b333a4a..563019d60f05 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -11,6 +11,9 @@ struct blk_mq_tags {
 
 	atomic_t active_queues;
 
+	/* Only interesting for driver tags */
+	atomic_t	iter_usage_counter;
+
 	struct sbitmap_queue *bitmap_tags;
 	struct sbitmap_queue *breserved_tags;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8465d7c5ebf0..a61279be0120 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2315,7 +2315,9 @@ void __blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 void blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx, struct blk_mq_tags *ref_tags)
 {
+	while (atomic_cmpxchg(&ref_tags->iter_usage_counter, 1, 0) != 1);
 	__blk_mq_free_rqs_ext(set, tags, hctx_idx, ref_tags);
+	atomic_set(&ref_tags->iter_usage_counter, 1);
 }
 			 
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
-- 
2.26.2



* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-17 11:07 ` [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter " John Garry
@ 2020-12-18  1:55   ` Bart Van Assche
  2020-12-18  9:30     ` John Garry
  2020-12-18  3:31   ` Ming Lei
  2020-12-18 22:43   ` Bart Van Assche
  2 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2020-12-18  1:55 UTC (permalink / raw)
  To: John Garry, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, kashyap.desai, linuxarm

On 12/17/20 3:07 AM, John Garry wrote:
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index a6df2d5df88a..853ed5b889aa 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -358,10 +358,19 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>  {
>  	int i;
>  
>  	for (i = 0; i < tagset->nr_hw_queues; i++) {
> -		if (tagset->tags && tagset->tags[i])
> -			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> +		if (tagset->tags && tagset->tags[i]) {
> +			struct blk_mq_tags *tags = tagset->tags[i];
> +
> +			if (!atomic_inc_not_zero(&tags->iter_usage_counter))
> +				continue;
> +
> +			__blk_mq_all_tag_iter(tags, fn, priv,
>  					      BT_TAG_ITER_STARTED);
> +
> +			atomic_dec(&tags->iter_usage_counter);
> +		}
>  	}
>  }

Atomic operations are (a) more expensive than rcu_read_lock() /
rcu_read_unlock() and (b) do not provide the same guarantees.
rcu_read_lock() has acquire semantics and rcu_read_unlock() has
release semantics. Regular atomic operations do not have these
semantics which means that the CPU is allowed to reorder certain
regular loads and stores against atomic operations. Additionally,
atomic operations are more expensive than the corresponding RCU
primitives. In other words, I would be much happier if this patch
series would use RCU instead of atomics.
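The acquire/release pairing referred to here can be illustrated with C11 atomics in user space (an analogy, not the kernel's RCU API; run_demo and the other names are invented for the sketch). The release store publishes everything written before it, and the acquire load guarantees those writes are visible once the flag is observed:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

int payload;        /* plain data, published by the release store */
atomic_int ready;

void *producer(void *arg)
{
	(void)arg;
	payload = 42;                                           /* A */
	atomic_store_explicit(&ready, 1, memory_order_release); /* B: publishes A */
	return NULL;
}

int consume(void)
{
	/* C: once this acquire load observes B... */
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		;
	return payload;  /* ...this read is guaranteed to observe A */
}

int run_demo(void)
{
	pthread_t t;
	int v;

	pthread_create(&t, NULL, producer, NULL);
	v = consume();
	pthread_join(t, NULL);
	return v;
}
```

Relaxed atomic read-modify-write operations carry no such ordering, which is the distinction being drawn against plain atomic_inc()/atomic_dec().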

Thanks,

Bart.


* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-17 11:07 ` [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter " John Garry
  2020-12-18  1:55   ` Bart Van Assche
@ 2020-12-18  3:31   ` Ming Lei
  2020-12-18 10:01     ` John Garry
  2020-12-18 22:43   ` Bart Van Assche
  2 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2020-12-18  3:31 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, linux-block, linux-kernel, hch, hare, ppvk, bvanassche,
	kashyap.desai, linuxarm

On Thu, Dec 17, 2020 at 07:07:53PM +0800, John Garry wrote:
> References to old IO sched requests are currently cleared from the
> tagset when freeing those requests; switching elevator or changing
> request queue depth is such a scenario in which this occurs.
> 
> However, this does not stop the potentially racy behaviour of freeing
> and clearing a request reference between a tagset iterator getting a
> reference to a request and actually dereferencing that request.
> 
> Such a use-after-free can be triggered, as follows:
> 
> ==================================================================
> BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
> Read of size 8 at addr ffff00108d589300 by task fio/3052
> 
> CPU: 32 PID: 3052 Comm: fio Tainted: GW
> 5.10.0-rc4-64839-g2dcf1ee5054f #693
> Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon
> D05 IT21 Nemo 2.0 RC0 04/18/2018
> Call trace:
> dump_backtrace+0x0/0x2d0
> show_stack+0x18/0x68
> dump_stack+0x100/0x16c
> print_address_description.constprop.12+0x6c/0x4e8
> kasan_report+0x130/0x200
> __asan_load8+0x9c/0xd8
> bt_iter+0xa0/0x120
> blk_mq_queue_tag_busy_iter+0x2d8/0x540
> blk_mq_in_flight+0x80/0xb8
> part_stat_show+0xd8/0x238
> dev_attr_show+0x44/0x90
> sysfs_kf_seq_show+0x128/0x1c8
> kernfs_seq_show+0xa0/0xb8
> seq_read_iter+0x1ec/0x6a0
> seq_read+0x1d0/0x250
> kernfs_fop_read+0x70/0x330
> vfs_read+0xe4/0x250
> ksys_read+0xc8/0x178
> __arm64_sys_read+0x44/0x58
> el0_svc_common.constprop.2+0xc4/0x1e8
> do_el0_svc+0x90/0xa0
> el0_sync_handler+0x128/0x178
> el0_sync+0x158/0x180
> 
> This is found experimentally by running fio on 2x SCSI disks - 1x disk
> holds the root partition. Userspace is constantly triggering the tagset
> iter from reading the root (gen)disk partition info. And so if the IO
> sched is constantly changed on the other disk, eventually the UAF occurs,
> as described above.
> 
> For experiment's sake, an mdelay() is added to trigger the UAF within a
> sane timeframe, as follows:
> 
> +++ b/block/blk-mq-tag.c
> @@ -215,8 +215,11 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>          * We can hit rq == NULL here, because the tagging functions
>          * test and set the bit before assigning ->rqs[].
>          */
> -       if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
> -               return iter_data->fn(hctx, rq, iter_data->data, reserved);
> +       if (rq) {
> +               mdelay(50);
> +               if (rq->q == hctx->queue && rq->mq_hctx == hctx)
> +                       return iter_data->fn(hctx, rq, iter_data->data, reserved);
> +       }
>         return true;
>  }
> 
> This delay increases the window for the iter in between getting the
> request reference and actually dereferencing it.
> 
> To solve this problem, lock out the per-hw queue tagset iterators while
> freeing IO sched tags.
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  block/blk-mq-tag.c | 20 ++++++++++++++++++--
>  block/blk-mq-tag.h |  3 +++
>  block/blk-mq.c     |  2 ++
>  3 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index a6df2d5df88a..853ed5b889aa 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -358,10 +358,19 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>  {
>  	int i;
>  
>  	for (i = 0; i < tagset->nr_hw_queues; i++) {
> -		if (tagset->tags && tagset->tags[i])
> -			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> +		if (tagset->tags && tagset->tags[i]) {
> +			struct blk_mq_tags *tags = tagset->tags[i];
> +
> +			if (!atomic_inc_not_zero(&tags->iter_usage_counter))
> +				continue;

When 'continue' is run, blk_mq_tagset_busy_iter()'s semantics may be
broken.

> +
> +			__blk_mq_all_tag_iter(tags, fn, priv,
>  					      BT_TAG_ITER_STARTED);
> +
> +			atomic_dec(&tags->iter_usage_counter);
> +		}
>  	}
>  }
>  EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
> @@ -435,9 +444,14 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>  		if (!blk_mq_hw_queue_mapped(hctx))
>  			continue;
>  
> +		if (!atomic_inc_not_zero(&tags->iter_usage_counter))
> +			continue;

Same with above comment.

> +
>  		if (tags->nr_reserved_tags)
>  			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
>  		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
> +
> +		atomic_dec(&tags->iter_usage_counter);
>  	}
>  	blk_queue_exit(q);
>  }
> @@ -461,6 +475,8 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>  		     round_robin, node))
>  		goto free_bitmap_tags;
>  
> +	atomic_set(&tags->iter_usage_counter, 1);
> +
>  	tags->bitmap_tags = &tags->__bitmap_tags;
>  	tags->breserved_tags = &tags->__breserved_tags;
>  
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 7d3e6b333a4a..563019d60f05 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -11,6 +11,9 @@ struct blk_mq_tags {
>  
>  	atomic_t active_queues;
>  
> +	/* Only interesting for driver tags */
> +	atomic_t	iter_usage_counter;
> +
>  	struct sbitmap_queue *bitmap_tags;
>  	struct sbitmap_queue *breserved_tags;
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8465d7c5ebf0..a61279be0120 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2315,7 +2315,9 @@ void __blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  void blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  		     unsigned int hctx_idx, struct blk_mq_tags *ref_tags)
>  {
> +	while (atomic_cmpxchg(&ref_tags->iter_usage_counter, 1, 0) != 1);
>  	__blk_mq_free_rqs_ext(set, tags, hctx_idx, ref_tags);
> +	atomic_set(&ref_tags->iter_usage_counter, 1);
>  }

I guess it is simpler to sync the two code paths by adding mutex to 'ref_tags' and
holding it in both __blk_mq_free_rqs_ext() and the above two iterator helpers.
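Ming's mutex suggestion can be sketched as follows (all names invented here): one lock shared by the freeing path and the iterator helpers serializes them completely, at the cost of forbidding parallel iterations and of possibly sleeping.

```c
#include <pthread.h>

/* One mutex shared by the freeing path and the tagset iterators. */
pthread_mutex_t iter_lock = PTHREAD_MUTEX_INITIALIZER;
int requests_freed;

void iterate_busy_tags(void)
{
	pthread_mutex_lock(&iter_lock);
	/* walk the tag -> request map; no free can run concurrently */
	pthread_mutex_unlock(&iter_lock);
}

void free_sched_requests(void)
{
	pthread_mutex_lock(&iter_lock);
	requests_freed = 1;   /* stand-in for __blk_mq_free_rqs_ext() */
	pthread_mutex_unlock(&iter_lock);
}
```

As John points out in his reply, a mutex can sleep, which may not be acceptable in every iter calling context, and it also excludes iterators from running in parallel with one another.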


thanks,
Ming



* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-18  1:55   ` Bart Van Assche
@ 2020-12-18  9:30     ` John Garry
  0 siblings, 0 replies; 23+ messages in thread
From: John Garry @ 2020-12-18  9:30 UTC (permalink / raw)
  To: Bart Van Assche, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, kashyap.desai, linuxarm

On 18/12/2020 01:55, Bart Van Assche wrote:
> On 12/17/20 3:07 AM, John Garry wrote:
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index a6df2d5df88a..853ed5b889aa 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -358,10 +358,19 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>   {
>>   	int i;
>>   
>>   	for (i = 0; i < tagset->nr_hw_queues; i++) {
>> -		if (tagset->tags && tagset->tags[i])
>> -			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>> +		if (tagset->tags && tagset->tags[i]) {
>> +			struct blk_mq_tags *tags = tagset->tags[i];
>> +
>> +			if (!atomic_inc_not_zero(&tags->iter_usage_counter))
>> +				continue;
>> +
>> +			__blk_mq_all_tag_iter(tags, fn, priv,
>>   					      BT_TAG_ITER_STARTED);
>> +
>> +			atomic_dec(&tags->iter_usage_counter);
>> +		}
>>   	}
>>   }
> 
> Atomic operations are (a) more expensive than rcu_read_lock() /
> rcu_read_unlock() and (b) do not provide the same guarantees.
> rcu_read_lock() has acquire semantics and rcu_read_unlock() has
> release semantics. Regular atomic operations do not have these
> semantics which means that the CPU is allowed to reorder certain
> regular loads and stores against atomic operations. Additionally,
> atomic operations are more expensive than the corresponding RCU
> primitives. In other words, I would be much happier if this patch
> series would use RCU instead of atomics.
> 

Hi Bart,

In terms of solving the problem with RCU, can you provide more details 
on how it would actually work?

I saw that you mentioned kfree_rcu() in the following, so I guess it's 
related:
https://lore.kernel.org/linux-block/5c3ac5af-ed81-11e4-fee3-f92175f14daf@acm.org/T/#m830071bca03af31516800c14f8cccbe63661c5db

As for the expense of atomic operations, we're only adding them around 
the iter function, so I can't see much real impact on the fastpath.

Thanks,
John


* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-18  3:31   ` Ming Lei
@ 2020-12-18 10:01     ` John Garry
  0 siblings, 0 replies; 23+ messages in thread
From: John Garry @ 2020-12-18 10:01 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, linux-block, linux-kernel, hch, hare, ppvk, bvanassche,
	kashyap.desai, linuxarm

On 18/12/2020 03:31, Ming Lei wrote:
>>   3 files changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index a6df2d5df88a..853ed5b889aa 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -358,10 +358,19 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>   {
>>   	int i;
>>   
>>   	for (i = 0; i < tagset->nr_hw_queues; i++) {
>> -		if (tagset->tags && tagset->tags[i])
>> -			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>> +		if (tagset->tags && tagset->tags[i]) {
>> +			struct blk_mq_tags *tags = tagset->tags[i];
>> +
>> +			if (!atomic_inc_not_zero(&tags->iter_usage_counter))
>> +				continue;

Hi Ming,

> When 'continue' is run, blk_mq_tagset_busy_iter()'s semantics may be
> broken.

Yeah, I did consider this, and thought that we're ok since we only 
iterate started tags here (and there would be none for this queue). But 
that would not hold for other queues with active tags.

> 
>> +
>> +			__blk_mq_all_tag_iter(tags, fn, priv,
>>   					      BT_TAG_ITER_STARTED);
>> +
>> +			atomic_dec(&tags->iter_usage_counter);
>> +		}
>>   	}
>>   }
>>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>> @@ -435,9 +444,14 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>>   		if (!blk_mq_hw_queue_mapped(hctx))
>>   			continue;
>>   
>> +		if (!atomic_inc_not_zero(&tags->iter_usage_counter))
>> +			continue;
> Same with above comment.

Right, similar to above.

> 
>> +
>>   		if (tags->nr_reserved_tags)
>>   			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
>>   		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
>> +
>> +		atomic_dec(&tags->iter_usage_counter);
>>   	}
>>   	blk_queue_exit(q);
>>   }
>> @@ -461,6 +475,8 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>>   		     round_robin, node))
>>   		goto free_bitmap_tags;
>>   
>> +	atomic_set(&tags->iter_usage_counter, 1);
>> +
>>   	tags->bitmap_tags = &tags->__bitmap_tags;
>>   	tags->breserved_tags = &tags->__breserved_tags;
>>   
>> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
>> index 7d3e6b333a4a..563019d60f05 100644
>> --- a/block/blk-mq-tag.h
>> +++ b/block/blk-mq-tag.h
>> @@ -11,6 +11,9 @@ struct blk_mq_tags {
>>   
>>   	atomic_t active_queues;
>>   
>> +	/* Only interesting for driver tags */
>> +	atomic_t	iter_usage_counter;
>> +
>>   	struct sbitmap_queue *bitmap_tags;
>>   	struct sbitmap_queue *breserved_tags;
>>   
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 8465d7c5ebf0..a61279be0120 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2315,7 +2315,9 @@ void __blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>   void blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>   		     unsigned int hctx_idx, struct blk_mq_tags *ref_tags)
>>   {
>> +	while (atomic_cmpxchg(&ref_tags->iter_usage_counter, 1, 0) != 1);
>>   	__blk_mq_free_rqs_ext(set, tags, hctx_idx, ref_tags);
>> +	atomic_set(&ref_tags->iter_usage_counter, 1);
>>   }
> I guess it is simpler to sync the two code paths by adding mutex to 'ref_tags' and
> holding it in both __blk_mq_free_rqs_ext() and the above two iterator helpers.

But are we always allowed to sleep in the iter calling context?

And that would also lock out parallel iterations, which is less than ideal.

So I could look to address the issues you mention above with atomics 
still, but maybe Bart has some better idea regarding RCU.

Thanks,
John


* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-17 11:07 ` [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter " John Garry
  2020-12-18  1:55   ` Bart Van Assche
  2020-12-18  3:31   ` Ming Lei
@ 2020-12-18 22:43   ` Bart Van Assche
  2020-12-21 12:06     ` John Garry
  2 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2020-12-18 22:43 UTC (permalink / raw)
  To: John Garry, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, kashyap.desai, linuxarm

On 12/17/20 3:07 AM, John Garry wrote:
> References to old IO sched requests are currently cleared from the
> tagset when freeing those requests; switching elevator or changing
> request queue depth is such a scenario in which this occurs.
> 
> However, this does not stop the potentially racy behaviour of freeing
> and clearing a request reference between a tagset iterator getting a
> reference to a request and actually dereferencing that request.
> 
> Such a use-after-free can be triggered, as follows:
> 
> ==================================================================
> BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
> Read of size 8 at addr ffff00108d589300 by task fio/3052
> 
> CPU: 32 PID: 3052 Comm: fio Tainted: GW
> 5.10.0-rc4-64839-g2dcf1ee5054f #693
> Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon
> D05 IT21 Nemo 2.0 RC0 04/18/2018
> Call trace:
> dump_backtrace+0x0/0x2d0
> show_stack+0x18/0x68
> dump_stack+0x100/0x16c
> print_address_description.constprop.12+0x6c/0x4e8
> kasan_report+0x130/0x200
> __asan_load8+0x9c/0xd8
> bt_iter+0xa0/0x120
> blk_mq_queue_tag_busy_iter+0x2d8/0x540
> blk_mq_in_flight+0x80/0xb8
> part_stat_show+0xd8/0x238
> dev_attr_show+0x44/0x90
> sysfs_kf_seq_show+0x128/0x1c8
> kernfs_seq_show+0xa0/0xb8
> seq_read_iter+0x1ec/0x6a0
> seq_read+0x1d0/0x250
> kernfs_fop_read+0x70/0x330
> vfs_read+0xe4/0x250
> ksys_read+0xc8/0x178
> __arm64_sys_read+0x44/0x58
> el0_svc_common.constprop.2+0xc4/0x1e8
> do_el0_svc+0x90/0xa0
> el0_sync_handler+0x128/0x178
> el0_sync+0x158/0x180
> 
> This is found experimentally by running fio on 2x SCSI disks - 1x disk
> holds the root partition. Userspace is constantly triggering the tagset
> iter from reading the root (gen)disk partition info. And so if the IO
> sched is constantly changed on the other disk, eventually the UAF occurs,
> as described above.

Hi John,

Something is not clear to me. The above call stack includes
blk_mq_queue_tag_busy_iter(). That function starts with
percpu_ref_tryget(&q->q_usage_counter) and ends with calling
percpu_ref_put(&q->q_usage_counter). So it will only iterate over a tag set
if q->q_usage_counter is live. However, both blk_mq_update_nr_requests()
and elevator_switch() start with freezing the request queue.
blk_mq_freeze_queue() starts with killing q->q_usage_counter and waits
until that counter has dropped to zero. In other words,
blk_mq_queue_tag_busy_iter() should not iterate over a tag set while a tag
set is being freed or reallocated. Does this mean that we do not yet have
a full explanation about why the above call stack can be triggered?
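The q->q_usage_counter behaviour described here can be modelled with a toy percpu_ref-like gate (names invented for the sketch): tryget succeeds only while the ref is live, and freezing marks it dead and waits for outstanding users to drain. This shows why an iterator on the *frozen* queue cannot race; the reported UAF involves an iterator on a different, unfrozen queue, as the follow-up discussion explains.

```c
#include <stdatomic.h>

#define REF_DEAD (1 << 30)

atomic_int q_usage_counter;   /* 0 == live with no users */

/* percpu_ref_tryget() analogue: take a reference only while live. */
int ref_tryget(void)
{
	int old = atomic_load(&q_usage_counter);

	while (!(old & REF_DEAD))
		if (atomic_compare_exchange_weak(&q_usage_counter,
						 &old, old + 1))
			return 1;
	return 0;
}

void ref_put(void)
{
	atomic_fetch_sub(&q_usage_counter, 1);
}

/* blk_mq_freeze_queue() analogue: kill the ref, then wait for zero users. */
void ref_kill_and_drain(void)
{
	atomic_fetch_or(&q_usage_counter, REF_DEAD);
	while ((atomic_load(&q_usage_counter) & ~REF_DEAD) != 0)
		;
}
```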

Thanks,

Bart.




* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-18 22:43   ` Bart Van Assche
@ 2020-12-21 12:06     ` John Garry
  2020-12-21 18:09       ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2020-12-21 12:06 UTC (permalink / raw)
  To: Bart Van Assche, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, kashyap.desai, linuxarm

On 18/12/2020 22:43, Bart Van Assche wrote:
> On 12/17/20 3:07 AM, John Garry wrote:
>> References to old IO sched requests are currently cleared from the
>> tagset when freeing those requests; switching elevator or changing
>> request queue depth is such a scenario in which this occurs.
>>
>> However, this does not stop the potentially racy behaviour of freeing
>> and clearing a request reference between a tagset iterator getting a
>> reference to a request and actually dereferencing that request.
>>
>> Such a use-after-free can be triggered, as follows:
>>
>> ==================================================================
>> BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
>> Read of size 8 at addr ffff00108d589300 by task fio/3052
>>
>> CPU: 32 PID: 3052 Comm: fio Tainted: GW
>> 5.10.0-rc4-64839-g2dcf1ee5054f #693
>> Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon
>> D05 IT21 Nemo 2.0 RC0 04/18/2018
>> Call trace:
>> dump_backtrace+0x0/0x2d0
>> show_stack+0x18/0x68
>> dump_stack+0x100/0x16c
>> print_address_description.constprop.12+0x6c/0x4e8
>> kasan_report+0x130/0x200
>> __asan_load8+0x9c/0xd8
>> bt_iter+0xa0/0x120
>> blk_mq_queue_tag_busy_iter+0x2d8/0x540
>> blk_mq_in_flight+0x80/0xb8
>> part_stat_show+0xd8/0x238
>> dev_attr_show+0x44/0x90
>> sysfs_kf_seq_show+0x128/0x1c8
>> kernfs_seq_show+0xa0/0xb8
>> seq_read_iter+0x1ec/0x6a0
>> seq_read+0x1d0/0x250
>> kernfs_fop_read+0x70/0x330
>> vfs_read+0xe4/0x250
>> ksys_read+0xc8/0x178
>> __arm64_sys_read+0x44/0x58
>> el0_svc_common.constprop.2+0xc4/0x1e8
>> do_el0_svc+0x90/0xa0
>> el0_sync_handler+0x128/0x178
>> el0_sync+0x158/0x180
>>
>> This is found experimentally by running fio on 2x SCSI disks - 1x disk
>> holds the root partition. Userspace is constantly triggering the tagset
>> iter from reading the root (gen)disk partition info. And so if the IO
>> sched is constantly changed on the other disk, eventually the UAF occurs,
>> as described above.
> 
> Hi John,

Hi Bart,

> 
> Something is not clear to me. The above call stack includes
> blk_mq_queue_tag_busy_iter(). That function starts with
> percpu_ref_tryget(&q->q_usage_counter) and ends with calling
> percpu_ref_put(&q->q_usage_counter). So it will only iterate over a tag set
> if q->q_usage_counter is live. However, both blk_mq_update_nr_requests()
> and elevator_switch() start with freezing the request queue.
> blk_mq_freeze_queue() starts with killing q->q_usage_counter and waits
> until that counter has dropped to zero. In other words,
> blk_mq_queue_tag_busy_iter() should not iterate over a tag set while a tag
> set is being freed or reallocated. 

Right, this is what I thought, but Ming reminded me 
"blk_mq_queue_tag_busy_iter() can be run on another request queue just
between one driver tag is allocated and updating the request map, so one
extra request reference still can be grabbed."

> Does this mean that we do not yet have
> a full explanation about why the above call stack can be triggered?

We understand it, and I'll describe my experiment in detail:
a. fio runs on 2x disks, sda (root partition disk) and sdb.
b. for sda, userspace triggers blk_mq_queue_tag_busy_iter(), as in the 
stackframe above. Since its request queue is not frozen, it will iterate 
the busy tags.
c. on sdb, I continuously change the IO scheduler.

So sdb request queue gets frozen as we switch IO sched, but we could 
have this sequence of events:
- blk_mq_queue_tag_busy_iter() on sda takes reference to a sdb request
    - Getting a tag and updating ->rqs[] in tagset is not atomic
- requests for sdb cleared in tagset and request memory is freed
- blk_mq_queue_tag_busy_iter() on sda still holds reference to sdb 
request and dereferences it -> UAF

Hope that's clear. It is a bit unlikely, I will admit, but it can still 
happen, and a UAF is never good. So please let me know if you have 
another idea for solving it.

Thanks,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-21 12:06     ` John Garry
@ 2020-12-21 18:09       ` Bart Van Assche
  2020-12-21 18:47         ` John Garry
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2020-12-21 18:09 UTC (permalink / raw)
  To: John Garry, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, kashyap.desai, linuxarm

On 12/21/20 4:06 AM, John Garry wrote:
> On 18/12/2020 22:43, Bart Van Assche wrote:
>> Does this mean that we do not yet have
>> a full explanation about why the above call stack can be triggered?
> 
> We understand it, and I'll describe my experiment in detail:
> a. fio runs on 2x disks, sda (root partition disk) and sdb.
> b. for sda, userspace triggers blk_mq_queue_tag_busy_iter(), as in
> stackframe above. Since its request queue is not frozen, it will iter
> the busy tags.
> c. on sdb, I continuously change the IO scheduler.
> 
> So sdb request queue gets frozen as we switch IO sched, but we could
> have this sequence of events:
> - blk_mq_queue_tag_busy_iter() on sda takes reference to a sdb request
>    - Getting a tag and updating ->rqs[] in tagset is not atomic
> - requests for sdb cleared in tagset and request memory is freed
> - blk_mq_queue_tag_busy_iter() on sda still holds reference to sdb
> request and dereferences it -> UAF
> 
> Hope that's clear. It is a bit unlikely, I will admit, but it can still
> happen, and a UAF is never good. So please let me know if you have
> another idea for solving it.

Hi John,

Do you agree that all partitions (struct block_device) of a disk share a
request queue (block_device.bd_disk->queue)? I'm asking this because it
seems like in the above explanation it has been assumed that different
partitions use different request queues.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-21 18:09       ` Bart Van Assche
@ 2020-12-21 18:47         ` John Garry
  2020-12-22  2:13           ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2020-12-21 18:47 UTC (permalink / raw)
  To: Bart Van Assche, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, kashyap.desai, linuxarm

On 21/12/2020 18:09, Bart Van Assche wrote:
> On 12/21/20 4:06 AM, John Garry wrote:
>> On 18/12/2020 22:43, Bart Van Assche wrote:
>>> Does this mean that we do not yet have
>>> a full explanation about why the above call stack can be triggered?
>>
>> We understand it, and I'll describe my experiment in detail:
>> a. fio runs on 2x disks, sda (root partition disk) and sdb.
>> b. for sda, userspace triggers blk_mq_queue_tag_busy_iter(), as in
>> stackframe above. Since its request queue is not frozen, it will iter
>> the busy tags.
>> c. on sdb, I continuously change the IO scheduler.
>>
>> So sdb request queue gets frozen as we switch IO sched, but we could
>> have this sequence of events:
>> - blk_mq_queue_tag_busy_iter() on sda takes reference to a sdb request
>>     - Getting a tag and updating ->rqs[] in tagset is not atomic
>> - requests for sdb cleared in tagset and request memory is freed
>> - blk_mq_queue_tag_busy_iter() on sda still holds reference to sdb
>> request and dereferences it -> UAF
>>
>> Hope that's clear. It is a bit unlikely, I will admit, but it can still
>> happen, and a UAF is never good. So please let me know if you have
>> another idea for solving it.
> 
> Hi John,
> 
> Do you agree that all partitions (struct block_device) of a disk share a
> request queue (block_device.bd_disk->queue)? I'm asking this because it
> seems like in the above explanation it has been assumed that different
> partitions use different request queues.
> 
> Thanks,
> 
> Bart.
> 

Hi Bart,

Yes, I agree, and I'm not sure what I wrote to give that impression.

About "root partition", above, I'm just saying that / is mounted on a 
sda partition:

root@ubuntu:/home/john# mount | grep sda
/dev/sda2 on / type ext4 (rw,relatime,errors=remount-ro,stripe=32)
/dev/sda1 on /boot/efi type vfat 
(rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)

Thanks,
John


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-21 18:47         ` John Garry
@ 2020-12-22  2:13           ` Bart Van Assche
  2020-12-22 11:15             ` John Garry
  2020-12-22 11:22             ` John Garry
  0 siblings, 2 replies; 23+ messages in thread
From: Bart Van Assche @ 2020-12-22  2:13 UTC (permalink / raw)
  To: John Garry, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, kashyap.desai, linuxarm

On 12/21/20 10:47 AM, John Garry wrote:
> Yes, I agree, and I'm not sure what I wrote to give that impression.
> 
> About "root partition", above, I'm just saying that / is mounted on a
> sda partition:
> 
> root@ubuntu:/home/john# mount | grep sda
> /dev/sda2 on / type ext4 (rw,relatime,errors=remount-ro,stripe=32)
> /dev/sda1 on /boot/efi type vfat
> (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)

Hi John,

Thanks for the clarification. I want to take back my suggestion about
adding rcu_read_lock() / rcu_read_unlock() in blk_mq_tagset_busy_iter()
since it is not allowed to sleep inside an RCU read-side critical
section, since blk_mq_tagset_busy_iter() is used in request timeout
handling and since there may be blk_mq_ops.timeout implementations that
sleep.

Ming's suggestion to serialize blk_mq_tagset_busy_iter() and
blk_mq_free_rqs() looks interesting to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-22  2:13           ` Bart Van Assche
@ 2020-12-22 11:15             ` John Garry
  2020-12-22 16:16               ` Bart Van Assche
  2020-12-22 11:22             ` John Garry
  1 sibling, 1 reply; 23+ messages in thread
From: John Garry @ 2020-12-22 11:15 UTC (permalink / raw)
  To: Bart Van Assche, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, kashyap.desai, linuxarm

On 22/12/2020 02:13, Bart Van Assche wrote:
> On 12/21/20 10:47 AM, John Garry wrote:
>> Yes, I agree, and I'm not sure what I wrote to give that impression.
>>
>> About "root partition", above, I'm just saying that / is mounted on a
>> sda partition:
>>
>> root@ubuntu:/home/john# mount | grep sda
>> /dev/sda2 on / type ext4 (rw,relatime,errors=remount-ro,stripe=32)
>> /dev/sda1 on /boot/efi type vfat
>> (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
> Hi John,
> 

Hi Bart, Ming,

> Thanks for the clarification. I want to take back my suggestion about
> adding rcu_read_lock() / rcu_read_unlock() in blk_mq_tagset_busy_iter()
> since it is not allowed to sleep inside an RCU read-side critical
> section, since blk_mq_tagset_busy_iter() is used in request timeout
> handling and since there may be blk_mq_ops.timeout implementations that
> sleep.

Yes, that's why I was going with atomic, rather than some 
synchronization primitive which may sleep.

> 
> Ming's suggestion to serialize blk_mq_tagset_busy_iter() and
> blk_mq_free_rqs() looks interesting to me.
> 

So then we could have something like this:

---8<---

@@ -435,9 +444,13 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		if (!blk_mq_hw_queue_mapped(hctx))
 			continue;
 
+		while (!atomic_inc_not_zero(&tags->iter_usage_counter));
+
 		if (tags->nr_reserved_tags)
 			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
 		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
 
+		atomic_dec(&tags->iter_usage_counter);
 	}
 
 	blk_queue_exit(q);

--->8---

And similar for blk_mq_tagset_busy_iter(). How about it?

Thanks,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-22  2:13           ` Bart Van Assche
  2020-12-22 11:15             ` John Garry
@ 2020-12-22 11:22             ` John Garry
  2020-12-22 13:24               ` Ming Lei
  1 sibling, 1 reply; 23+ messages in thread
From: John Garry @ 2020-12-22 11:22 UTC (permalink / raw)
  To: Bart Van Assche, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, kashyap.desai, linuxarm

Resend without ppvk@codeaurora.org, which bounces for me

On 22/12/2020 02:13, Bart Van Assche wrote:
 > On 12/21/20 10:47 AM, John Garry wrote:
 >> Yes, I agree, and I'm not sure what I wrote to give that impression.
 >>
 >> About "root partition", above, I'm just saying that / is mounted on a
 >> sda partition:
 >>
 >> root@ubuntu:/home/john# mount | grep sda
 >> /dev/sda2 on / type ext4 (rw,relatime,errors=remount-ro,stripe=32)
 >> /dev/sda1 on /boot/efi type vfat
 >> 
(rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
 > Hi John,
 >

Hi Bart, Ming,

 > Thanks for the clarification. I want to take back my suggestion about
 > adding rcu_read_lock() / rcu_read_unlock() in blk_mq_tagset_busy_iter()
 > since it is not allowed to sleep inside an RCU read-side critical
 > section, since blk_mq_tagset_busy_iter() is used in request timeout
 > handling and since there may be blk_mq_ops.timeout implementations that
 > sleep.

Yes, that's why I was going with atomic, rather than some 
synchronization primitive which may sleep.

 >
 > Ming's suggestion to serialize blk_mq_tagset_busy_iter() and
 > blk_mq_free_rqs() looks interesting to me.
 >

So then we could have something like this:

---8<---

@@ -435,9 +444,13 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		if (!blk_mq_hw_queue_mapped(hctx))
 			continue;
 
+		while (!atomic_inc_not_zero(&tags->iter_usage_counter));
+
 		if (tags->nr_reserved_tags)
 			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
 		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
 
+		atomic_dec(&tags->iter_usage_counter);
 	}
 
 	blk_queue_exit(q);

--->8---

And similar for blk_mq_tagset_busy_iter(). How about it?

Thanks,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-22 11:22             ` John Garry
@ 2020-12-22 13:24               ` Ming Lei
  0 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2020-12-22 13:24 UTC (permalink / raw)
  To: John Garry
  Cc: Bart Van Assche, axboe, linux-block, linux-kernel, hch, hare,
	kashyap.desai, linuxarm

On Tue, Dec 22, 2020 at 11:22:19AM +0000, John Garry wrote:
> Resend without ppvk@codeaurora.org, which bounces for me
> 
> On 22/12/2020 02:13, Bart Van Assche wrote:
> > On 12/21/20 10:47 AM, John Garry wrote:
> >> Yes, I agree, and I'm not sure what I wrote to give that impression.
> >>
> >> About "root partition", above, I'm just saying that / is mounted on a
> >> sda partition:
> >>
> >> root@ubuntu:/home/john# mount | grep sda
> >> /dev/sda2 on / type ext4 (rw,relatime,errors=remount-ro,stripe=32)
> >> /dev/sda1 on /boot/efi type vfat
> >> (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
> > Hi John,
> >
> 
> Hi Bart, Ming,
> 
> > Thanks for the clarification. I want to take back my suggestion about
> > adding rcu_read_lock() / rcu_read_unlock() in blk_mq_tagset_busy_iter()
> > since it is not allowed to sleep inside an RCU read-side critical
> > section, since blk_mq_tagset_busy_iter() is used in request timeout
> > handling and since there may be blk_mq_ops.timeout implementations that
> > sleep.
> 
> Yes, that's why I was going with atomic, rather than some synchronization
> primitive which may sleep.
> 
> >
> > Ming's suggestion to serialize blk_mq_tagset_busy_iter() and
> > blk_mq_free_rqs() looks interesting to me.
> >
> 
> So then we could have something like this:
> 
> ---8<---
> 
>  -435,9 +444,13 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q,
> busy_iter_fn *fn,
>     if (!blk_mq_hw_queue_mapped(hctx))
>             continue;
> 
> +    while (!atomic_inc_not_zero(&tags->iter_usage_counter));
> +
>     if (tags->nr_reserved_tags)
>         bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
>     bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
> 
> +    atomic_dec(&tags->iter_usage_counter);
> }

Then this is just a spin_lock variant, and you may have to consider
lock validation.

For example, scsi_host_busy() is called from scsi_log_completion() <- scsi_softirq_done(),
which may run in irq context, so a deadlock can be triggered if the irq
fires while the requests are being freed.

thanks,
Ming


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-22 11:15             ` John Garry
@ 2020-12-22 16:16               ` Bart Van Assche
  2020-12-23 11:10                 ` John Garry
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2020-12-22 16:16 UTC (permalink / raw)
  To: John Garry, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, kashyap.desai, linuxarm

On 12/22/20 3:15 AM, John Garry wrote:
> So then we could have something like this:
> 
> ---8<---
> 
>   -435,9 +444,13 @@ void blk_mq_queue_tag_busy_iter(struct request_queue 
> *q, busy_iter_fn *fn,
>      if (!blk_mq_hw_queue_mapped(hctx))
>              continue;
> 
> +    while (!atomic_inc_not_zero(&tags->iter_usage_counter));
> +
>      if (tags->nr_reserved_tags)
>          bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
>      bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
> 
> +    atomic_dec(&tags->iter_usage_counter);
> }
> 
> blk_queue_exit(q);
> 
> --->8---
> 
> And similar for blk_mq_tagset_busy_iter(). How about it?

Are there any blk_mq_tagset_busy_iter() calls that happen from a context 
where the tag set can disappear while that function is in progress?

Some blk_mq_tagset_busy_iter() calls happen from a context where it is 
not allowed to sleep but also where it is guaranteed that the tag set 
won't disappear, e.g. the call from inside sdk_mq_queue_rq().

How about using a mutex inside blk_mq_queue_tag_busy_iter() instead? As 
far as I can see all blk_mq_queue_tag_busy_iter() happen from a context 
where it is allowed to sleep.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-22 16:16               ` Bart Van Assche
@ 2020-12-23 11:10                 ` John Garry
  2020-12-23 11:40                   ` John Garry
  0 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2020-12-23 11:10 UTC (permalink / raw)
  To: Bart Van Assche, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, ppvk, kashyap.desai, linuxarm

On 22/12/2020 16:16, Bart Van Assche wrote:
> On 12/22/20 3:15 AM, John Garry wrote:
>> So then we could have something like this:
>>
>> ---8<---
>>
>>   -435,9 +444,13 @@ void blk_mq_queue_tag_busy_iter(struct 
>> request_queue *q, busy_iter_fn *fn,
>>      if (!blk_mq_hw_queue_mapped(hctx))
>>              continue;
>>
>> +    while (!atomic_inc_not_zero(&tags->iter_usage_counter));
>> +
>>      if (tags->nr_reserved_tags)
>>          bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
>>      bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
>>
>> +    atomic_dec(&tags->iter_usage_counter);
>> }
>>
>> blk_queue_exit(q);
>>
>> --->8---
>>
>> And similar for blk_mq_tagset_busy_iter(). How about it?
> 

Hi Bart,

> Are there any blk_mq_tagset_busy_iter() calls that happen from a context 
> where the tag set can disappear while that function is in progress?
> 

So isn't the blk_mq_tag_set always a member of the host driver data in 
those cases? And since blk_mq_tagset_busy_iter() is for iterating block 
driver tags and is called from the block driver or hctx_busy_show(), 
the tag set would exist for the lifetime of the host device.

> Some blk_mq_tagset_busy_iter() calls happen from a context where it is 
> not allowed to sleep but also where it is guaranteed that the tag set 
> won't disappear, e.g. the call from inside sdk_mq_queue_rq().

You're talking about skd_mq_queue_rq() -> skd_in_flight() -> 
blk_mq_tagset_busy_iter(), right?

So I would expect any .queue_rq calls to complete before the associated 
request queue and tagset may be unregistered.

> 
> How about using a mutex inside blk_mq_queue_tag_busy_iter() instead? As 
> far as I can see all blk_mq_queue_tag_busy_iter() happen from a context 
> where it is allowed to sleep.

Well then it seems sensible to add might_sleep() also.

And we still have the blk_mq_queue_tag_busy_iter() problem. As Ming 
mentioned yesterday, we know of contexts from which it is called that 
may not sleep.

Thanks,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-23 11:10                 ` John Garry
@ 2020-12-23 11:40                   ` John Garry
  2020-12-23 15:47                     ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2020-12-23 11:40 UTC (permalink / raw)
  To: Bart Van Assche, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, kashyap.desai, linuxarm

- ppvk@codeaurora.org

> 
>> Are there any blk_mq_tagset_busy_iter() calls that happen from a 
>> context where the tag set can disappear while that function is in 
>> progress?
>>
> 
> So isn't the blk_mq_tag_set always a member of the host driver data for 
> those cases, and, since blk_mq_tagset_busy_iter() is for iter'ing block 
> driver tags and called from block driver or hctx_busy_show(), it would 
> exist for the lifetime of the host device.
> 
>> Some blk_mq_tagset_busy_iter() calls happen from a context where it is 
>> not allowed to sleep but also where it is guaranteed that the tag set 
>> won't disappear, e.g. the call from inside sdk_mq_queue_rq().
> 
> You're talking about skd_mq_queue_rq() -> skd_in_flight() -> 
> blk_mq_tagset_busy_iter(), right?
> 
> So I would expect any .queue_rq calls to complete before the associated 
> request queue and tagset may be unregistered.
> 
>>
>> How about using a mutex inside blk_mq_queue_tag_busy_iter() instead? 
>> As far as I can see all blk_mq_queue_tag_busy_iter() happen from a 
>> context where it is allowed to sleep.
> 
> Well then it seems sensible to add might_sleep() also.
> 
> And we still have the blk_mq_queue_tag_busy_iter() problem. As Ming 
> mentioned yesterday, we know of contexts from which it is called that 
> may not sleep.

Sorry, I got the 2x iter functions mixed up.

So if we use a mutex to solve the blk_mq_queue_tag_busy_iter() problem, 
then we still have this issue in blk_mq_tagset_busy_iter(), which I 
reported previously [0]:

[  319.771745] BUG: KASAN: use-after-free in bt_tags_iter+0xe0/0x128
[  319.777832] Read of size 4 at addr ffff0010b6bd27cc by task more/1866
[  319.784262]
[  319.785753] CPU: 61 PID: 1866 Comm: more Tainted: G        W
5.10.0-rc4-18118-gaa7b9c30d8ff #1070
[  319.795312] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon
D05 IT21 Nemo 2.0 RC0 04/18/2018
[  319.804437] Call trace:
[  319.806892]  dump_backtrace+0x0/0x2d0
[  319.810552]  show_stack+0x18/0x68
[  319.813865]  dump_stack+0x100/0x16c
[  319.817348]  print_address_description.constprop.12+0x6c/0x4e8
[  319.823176]  kasan_report+0x130/0x200
[  319.826831]  __asan_load4+0x9c/0xd8
[  319.830315]  bt_tags_iter+0xe0/0x128
[  319.833884]  __blk_mq_all_tag_iter+0x320/0x3a8
[  319.838320]  blk_mq_tagset_busy_iter+0x8c/0xd8
[  319.842760]  scsi_host_busy+0x88/0xb8
[  319.846418]  show_host_busy+0x1c/0x48
[  319.850079]  dev_attr_show+0x44/0x90
[  319.853655]  sysfs_kf_seq_show+0x128/0x1c8
[  319.857744]  kernfs_seq_show+0xa0/0xb8
[  319.861489]  seq_read_iter+0x1ec/0x6a0
[  319.865230]  seq_read+0x1d0/0x250
[  319.868539]  kernfs_fop_read+0x70/0x330
[  319.872369]  vfs_read+0xe4/0x250
[  319.875590]  ksys_read+0xc8/0x178
[  319.878898]  __arm64_sys_read+0x44/0x58
[  319.882730]  el0_svc_common.constprop.2+0xc4/0x1e8
[  319.887515]  do_el0_svc+0x90/0xa0
[  319.890824]  el0_sync_handler+0x128/0x178
[  319.894825]  el0_sync+0x158/0x180
[  319.898131]
[  319.899614] The buggy address belongs to the page:
[  319.904403] page:000000004e9e6864 refcount:0 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x10b6bd2
[  319.913876] flags: 0xbfffc0000000000()
[  319.917626] raw: 0bfffc0000000000 0000000000000000 fffffe0000000000
0000000000000000
[  319.925363] raw: 0000000000000000 0000000000000000 00000000ffffffff
0000000000000000
[  319.933096] page dumped because: kasan: bad access detected
[  319.938658]
[  319.940141] Memory state around the buggy address:
[  319.944925]  ffff0010b6bd2680: ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff
[  319.952139]  ffff0010b6bd2700: ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff
[  319.959354] >ffff0010b6bd2780: ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff
[  319.966566] ^
[  319.972131]  ffff0010b6bd2800: ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff
[  319.979344]  ffff0010b6bd2880: ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff
[  319.986557]
==================================================================
[  319.993770] Disabling lock debugging due to kernel taint

So to trigger this, I start fio on a disk, and then have one script
which constantly enables and disables an IO scheduler for that disk, and
another script which constantly reads /sys/class/scsi_host/host0/host_busy .

And in this problem, the driver tag we iterate may point to a stale IO 
sched request.

Thanks,
John

[0] 
https://lore.kernel.org/linux-block/9d4124ea-dbab-41cf-63bd-b17ef3e5037a@huawei.com/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-23 11:40                   ` John Garry
@ 2020-12-23 15:47                     ` Bart Van Assche
  2021-01-04 15:33                       ` John Garry
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2020-12-23 15:47 UTC (permalink / raw)
  To: John Garry, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, kashyap.desai, linuxarm

On 12/23/20 3:40 AM, John Garry wrote:
> Sorry, I got the 2x iter functions mixed up.
> 
> So if we use mutex to solve blk_mq_queue_tag_busy_iter() problem, then we
> still have this issue in blk_mq_tagset_busy_iter() which I report previously
> [0]:
> 
> [  319.771745] BUG: KASAN: use-after-free in bt_tags_iter+0xe0/0x128
> [  319.777832] Read of size 4 at addr ffff0010b6bd27cc by task more/1866
> [  319.784262]
> [  319.785753] CPU: 61 PID: 1866 Comm: more Tainted: G        W
> 5.10.0-rc4-18118-gaa7b9c30d8ff #1070
> [  319.795312] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon
> D05 IT21 Nemo 2.0 RC0 04/18/2018
> [  319.804437] Call trace:
> [  319.806892]  dump_backtrace+0x0/0x2d0
> [  319.810552]  show_stack+0x18/0x68
> [  319.813865]  dump_stack+0x100/0x16c
> [  319.817348]  print_address_description.constprop.12+0x6c/0x4e8
> [  319.823176]  kasan_report+0x130/0x200
> [  319.826831]  __asan_load4+0x9c/0xd8
> [  319.830315]  bt_tags_iter+0xe0/0x128
> [  319.833884]  __blk_mq_all_tag_iter+0x320/0x3a8
> [  319.838320]  blk_mq_tagset_busy_iter+0x8c/0xd8
> [  319.842760]  scsi_host_busy+0x88/0xb8
> [  319.846418]  show_host_busy+0x1c/0x48
> [  319.850079]  dev_attr_show+0x44/0x90
> [  319.853655]  sysfs_kf_seq_show+0x128/0x1c8
> [  319.857744]  kernfs_seq_show+0xa0/0xb8
> [  319.861489]  seq_read_iter+0x1ec/0x6a0
> [  319.865230]  seq_read+0x1d0/0x250
> [  319.868539]  kernfs_fop_read+0x70/0x330
> [  319.872369]  vfs_read+0xe4/0x250
> [  319.875590]  ksys_read+0xc8/0x178
> [  319.878898]  __arm64_sys_read+0x44/0x58
> [  319.882730]  el0_svc_common.constprop.2+0xc4/0x1e8
> [  319.887515]  do_el0_svc+0x90/0xa0
> [  319.890824]  el0_sync_handler+0x128/0x178
> [  319.894825]  el0_sync+0x158/0x180
> [  319.898131]
> [  319.899614] The buggy address belongs to the page:
> [  319.904403] page:000000004e9e6864 refcount:0 mapcount:0
> mapping:0000000000000000 index:0x0 pfn:0x10b6bd2
> [  319.913876] flags: 0xbfffc0000000000()
> [  319.917626] raw: 0bfffc0000000000 0000000000000000 fffffe0000000000
> 0000000000000000
> [  319.925363] raw: 0000000000000000 0000000000000000 00000000ffffffff
> 0000000000000000
> [  319.933096] page dumped because: kasan: bad access detected
> [  319.938658]
> [  319.940141] Memory state around the buggy address:
> [  319.944925]  ffff0010b6bd2680: ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff
> [  319.952139]  ffff0010b6bd2700: ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff
> [  319.959354] >ffff0010b6bd2780: ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff
> [  319.966566] ^
> [  319.972131]  ffff0010b6bd2800: ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff
> [  319.979344]  ffff0010b6bd2880: ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff
> [  319.986557]
> ==================================================================
> [  319.993770] Disabling lock debugging due to kernel taint
> 
> So to trigger this, I start fio on a disk, and then have one script
> which constantly enables and disables an IO scheduler for that disk, and
> another script which constantly reads /sys/class/scsi_host/host0/host_busy .
> 
> And in this problem, the driver tag we iterate may point to a stale IO sched
> request.

Hi John,

I propose to change the order in which blk_mq_sched_free_requests(q) and
blk_mq_debugfs_unregister(q) are called. Today blk_mq_sched_free_requests(q)
is called by blk_cleanup_queue() before blk_put_queue() is called.
blk_put_queue() calls blk_release_queue() if the last reference is dropped.
blk_release_queue() calls blk_mq_debugfs_unregister(). I prefer removing the
debugfs attributes earlier over modifying the tag iteration functions
because I think removing the debugfs attributes earlier is less risky.
Although this will make it harder to debug lockups that happen while
removing a request queue, kernel developers who are analyzing such an issue
can undo this change in their development kernel tree.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2020-12-23 15:47                     ` Bart Van Assche
@ 2021-01-04 15:33                       ` John Garry
  2021-01-04 17:22                         ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2021-01-04 15:33 UTC (permalink / raw)
  To: Bart Van Assche, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, kashyap.desai, linuxarm

On 23/12/2020 15:47, Bart Van Assche wrote:
> On 12/23/20 3:40 AM, John Garry wrote:
>> Sorry, I got the 2x iter functions mixed up.
>>
>> So if we use mutex to solve blk_mq_queue_tag_busy_iter() problem, then we
>> still have this issue in blk_mq_tagset_busy_iter() which I report previously
>> [0]:
>>
>> [  319.771745] BUG: KASAN: use-after-free in bt_tags_iter+0xe0/0x128
>> [  319.777832] Read of size 4 at addr ffff0010b6bd27cc by task more/1866
>> [  319.784262]
>> [  319.785753] CPU: 61 PID: 1866 Comm: more Tainted: G        W
>> 5.10.0-rc4-18118-gaa7b9c30d8ff #1070
>> [  319.795312] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon
>> D05 IT21 Nemo 2.0 RC0 04/18/2018
>> [  319.804437] Call trace:
>> [  319.806892]  dump_backtrace+0x0/0x2d0
>> [  319.810552]  show_stack+0x18/0x68
>> [  319.813865]  dump_stack+0x100/0x16c
>> [  319.817348]  print_address_description.constprop.12+0x6c/0x4e8
>> [  319.823176]  kasan_report+0x130/0x200
>> [  319.826831]  __asan_load4+0x9c/0xd8
>> [  319.830315]  bt_tags_iter+0xe0/0x128
>> [  319.833884]  __blk_mq_all_tag_iter+0x320/0x3a8
>> [  319.838320]  blk_mq_tagset_busy_iter+0x8c/0xd8
>> [  319.842760]  scsi_host_busy+0x88/0xb8
>> [  319.846418]  show_host_busy+0x1c/0x48
>> [  319.850079]  dev_attr_show+0x44/0x90
>> [  319.853655]  sysfs_kf_seq_show+0x128/0x1c8
>> [  319.857744]  kernfs_seq_show+0xa0/0xb8
>> [  319.861489]  seq_read_iter+0x1ec/0x6a0
>> [  319.865230]  seq_read+0x1d0/0x250
>> [  319.868539]  kernfs_fop_read+0x70/0x330
>> [  319.872369]  vfs_read+0xe4/0x250
>> [  319.875590]  ksys_read+0xc8/0x178
>> [  319.878898]  __arm64_sys_read+0x44/0x58
>> [  319.882730]  el0_svc_common.constprop.2+0xc4/0x1e8
>> [  319.887515]  do_el0_svc+0x90/0xa0
>> [  319.890824]  el0_sync_handler+0x128/0x178
>> [  319.894825]  el0_sync+0x158/0x180
>> [  319.898131]
>> [  319.899614] The buggy address belongs to the page:
>> [  319.904403] page:000000004e9e6864 refcount:0 mapcount:0
>> mapping:0000000000000000 index:0x0 pfn:0x10b6bd2
>> [  319.913876] flags: 0xbfffc0000000000()
>> [  319.917626] raw: 0bfffc0000000000 0000000000000000 fffffe0000000000
>> 0000000000000000
>> [  319.925363] raw: 0000000000000000 0000000000000000 00000000ffffffff
>> 0000000000000000
>> [  319.933096] page dumped because: kasan: bad access detected
>> [  319.938658]
>> [  319.940141] Memory state around the buggy address:
>> [  319.944925]  ffff0010b6bd2680: ff ff ff ff ff ff ff ff ff ff ff ff ff
>> ff ff ff
>> [  319.952139]  ffff0010b6bd2700: ff ff ff ff ff ff ff ff ff ff ff ff ff
>> ff ff ff
>> [  319.959354] >ffff0010b6bd2780: ff ff ff ff ff ff ff ff ff ff ff ff ff
>> ff ff ff
>> [  319.966566] ^
>> [  319.972131]  ffff0010b6bd2800: ff ff ff ff ff ff ff ff ff ff ff ff ff
>> ff ff ff
>> [  319.979344]  ffff0010b6bd2880: ff ff ff ff ff ff ff ff ff ff ff ff ff
>> ff ff ff
>> [  319.986557]
>> ==================================================================
>> [  319.993770] Disabling lock debugging due to kernel taint
>>
>> So to trigger this, I start fio on a disk, and then have one script
>> which constantly enables and disables an IO scheduler for that disk, and
>> another script which constantly reads /sys/class/scsi_host/host0/host_busy .
>>
>> And in this problem, the driver tag that we iterate may point to a
>> stale IO scheduler request.
> 
> Hi John,

Hi Bart,

Sorry for the slow reply - I have been on vacation since before you sent 
this mail.

> 
> I propose to change the order in which blk_mq_sched_free_requests(q) and
> blk_mq_debugfs_unregister(q) are called. Today blk_mq_sched_free_requests(q)
> is called by blk_cleanup_queue() before blk_put_queue() is called.
> blk_put_queue() calls blk_release_queue() if the last reference is dropped.
> blk_release_queue() calls blk_mq_debugfs_unregister(). I prefer removing the
> debugfs attributes earlier over modifying the tag iteration functions
> because I think removing the debugfs attributes earlier is less risky.

But don't we already have this following path to remove the per-hctx 
debugfs dir earlier than blk_mq_sched_free_requests() or 
blk_release_queue():

blk_cleanup_queue() -> blk_mq_exit_queue() -> blk_mq_exit_hw_queues() -> 
blk_mq_debugfs_unregister_hctx() -> 
blk_mq_debugfs_unregister_hctx(hctx->debugfs_dir)

Having said that, I am not sure how this is related directly to the 
problem I mentioned. In that problem, above, we trigger the 
blk_mq_tagset_busy_iter() from the SCSI host sysfs file, and the 
use-after-free comes about from disabling the elevator (and freeing the 
sched requests) in parallel.

> Although this will make it harder to debug lockups that happen while
> removing a request queue, kernel developers who are analyzing such an issue
> can undo this change in their development kernel tree.
> 

Thanks,
John


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2021-01-04 15:33                       ` John Garry
@ 2021-01-04 17:22                         ` Bart Van Assche
  2021-01-04 18:43                           ` John Garry
       [not found]                           ` <760304b3-dcbc-5b9d-0c70-627b7ff5b4eb@huawei.com>
  0 siblings, 2 replies; 23+ messages in thread
From: Bart Van Assche @ 2021-01-04 17:22 UTC (permalink / raw)
  To: John Garry, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, kashyap.desai, linuxarm

On 1/4/21 7:33 AM, John Garry wrote:
> On 23/12/2020 15:47, Bart Van Assche wrote:
>> I propose to change the order in which blk_mq_sched_free_requests(q) and
>> blk_mq_debugfs_unregister(q) are called. Today blk_mq_sched_free_requests(q)
>> is called by blk_cleanup_queue() before blk_put_queue() is called.
>> blk_put_queue() calls blk_release_queue() if the last reference is dropped.
>> blk_release_queue() calls blk_mq_debugfs_unregister(). I prefer removing the
>> debugfs attributes earlier over modifying the tag iteration functions
>> because I think removing the debugfs attributes earlier is less risky.
> 
> But don't we already have this following path to remove the per-hctx debugfs
> dir earlier than blk_mq_sched_free_requests() or blk_release_queue():
> 
> blk_cleanup_queue() -> blk_mq_exit_queue() -> blk_mq_exit_hw_queues() ->
> blk_mq_debugfs_unregister_hctx() ->
> blk_mq_debugfs_unregister_hctx(hctx->debugfs_dir)
> 
> Having said that, I am not sure how this is related directly to the problem
> I mentioned. In that problem, above, we trigger the
> blk_mq_tagset_busy_iter() from the SCSI host sysfs file, and the
> use-after-free comes about from disabling the elevator (and freeing the
> sched requests) in parallel.

Hi John,

Right, what I proposed is unrelated to the use-after-free triggered by
disabling I/O scheduling.

Regarding the races triggered by disabling I/O scheduling: can these be
fixed by quiescing all request queues associated with a tag set before
changing the I/O scheduler instead of only the request queue for which the
I/O scheduler is changed? I think we already do this before updating the
number of hardware queues.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
  2021-01-04 17:22                         ` Bart Van Assche
@ 2021-01-04 18:43                           ` John Garry
       [not found]                           ` <760304b3-dcbc-5b9d-0c70-627b7ff5b4eb@huawei.com>
  1 sibling, 0 replies; 23+ messages in thread
From: John Garry @ 2021-01-04 18:43 UTC (permalink / raw)
  To: Bart Van Assche, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, kashyap.desai, linuxarm

On 04/01/2021 17:22, Bart Van Assche wrote:
> On 1/4/21 7:33 AM, John Garry wrote:
>> On 23/12/2020 15:47, Bart Van Assche wrote:
>>> I propose to change the order in which blk_mq_sched_free_requests(q) and
>>> blk_mq_debugfs_unregister(q) are called. Today blk_mq_sched_free_requests(q)
>>> is called by blk_cleanup_queue() before blk_put_queue() is called.
>>> blk_put_queue() calls blk_release_queue() if the last reference is dropped.
>>> blk_release_queue() calls blk_mq_debugfs_unregister(). I prefer removing the
>>> debugfs attributes earlier over modifying the tag iteration functions
>>> because I think removing the debugfs attributes earlier is less risky.
>> But don't we already have this following path to remove the per-hctx debugfs
>> dir earlier than blk_mq_sched_free_requests() or blk_release_queue():
>>
>> blk_cleanup_queue() -> blk_mq_exit_queue() -> blk_mq_exit_hw_queues() ->
>> blk_mq_debugfs_unregister_hctx() ->
>> blk_mq_debugfs_unregister_hctx(hctx->debugfs_dir)
>>
>> Having said that, I am not sure how this is related directly to the problem
>> I mentioned. In that problem, above, we trigger the
>> blk_mq_tagset_busy_iter() from the SCSI host sysfs file, and the
>> use-after-free comes about from disabling the elevator (and freeing the
>> sched requests) in parallel.
> Hi John,

Hi Bart,

> 
> Right, what I proposed is unrelated to the use-after-free triggered by
> disabling I/O scheduling.
> 
> Regarding the races triggered by disabling I/O scheduling: can these be
> fixed by quiescing all request queues associated with a tag set before
> changing the I/O scheduler instead of only the request queue for which the
> I/O scheduler is changed? I think we already do this before updating the
> number of hardware queues.

For changing the number of HW queues, I figure that we have no choice 
but to quiesce all associated request queues.

But maybe there is still some locking we could use to avoid that here.

Please let me consider it a bit more.

Thanks,
John


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs
       [not found]                           ` <760304b3-dcbc-5b9d-0c70-627b7ff5b4eb@huawei.com>
@ 2021-02-10 14:39                             ` John Garry
  0 siblings, 0 replies; 23+ messages in thread
From: John Garry @ 2021-02-10 14:39 UTC (permalink / raw)
  To: Bart Van Assche, axboe, ming.lei
  Cc: linux-block, linux-kernel, hch, hare, kashyap.desai, linuxarm, pragalla

+

On 04/01/2021 18:37, John Garry wrote:

Hi Bart,

> 
>> Right, what I proposed is unrelated to the use-after-free triggered by
>> disabling I/O scheduling.
>>
>> Regarding the races triggered by disabling I/O scheduling: can these be
>> fixed by quiescing all request queues associated with a tag set before
>> changing the I/O scheduler instead of only the request queue for which 
>> the
>> I/O scheduler is changed? I think we already do this before updating the
>> number of hardware queues.
> 
> Maybe quiescing all request queues could solve the issue in 
> blk_mq_queue_tag_busy_iter(), as that issue involves interaction of 2x 
> request queues.
> 
> But the blk_mq_tagset_busy_iter() issue, above, it is related to only a 
> single request queue, so not sure how it could help.
> 
> But let me consider this more.

I have looked at this proposal again, that being to quiesce all (other) 
request queues prior to freeing the IO scheduler tags+requests. I tried 
this and it seems to work (changes not shown), in terms of solving the 
issue of blk_mq_queue_tag_busy_iter() and freeing the requests+tags 
racing. However, I am still not sure if it is acceptable to quiesce all 
request queues like this just when changing IO scheduler, even if it is 
done elsewhere.

In addition, this still leaves the blk_mq_tagset_busy_iter() issue: 
freeing the requests+tags can race against that same function. There is 
nothing to stop blk_mq_tagset_busy_iter() from taking a reference to a 
request while that request is freed in parallel when switching the IO 
scheduler, however unlikely.

Solving that becomes trickier. As Ming pointed out, that function can be 
called from softirq and even hard irq context - like 
ufshcd_tmv_handler() -> blk_mq_tagset_busy_iter() -  so there is a 
locking context issue.

Thanks,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-02-10 14:42 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-17 11:07 [RFC PATCH v2 0/2] blk-mq: Avoid use-after-free for accessing old requests John Garry
2020-12-17 11:07 ` [RFC PATCH v2 1/2] blk-mq: Clean up references to old requests when freeing rqs John Garry
2020-12-17 11:07 ` [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter " John Garry
2020-12-18  1:55   ` Bart Van Assche
2020-12-18  9:30     ` John Garry
2020-12-18  3:31   ` Ming Lei
2020-12-18 10:01     ` John Garry
2020-12-18 22:43   ` Bart Van Assche
2020-12-21 12:06     ` John Garry
2020-12-21 18:09       ` Bart Van Assche
2020-12-21 18:47         ` John Garry
2020-12-22  2:13           ` Bart Van Assche
2020-12-22 11:15             ` John Garry
2020-12-22 16:16               ` Bart Van Assche
2020-12-23 11:10                 ` John Garry
2020-12-23 11:40                   ` John Garry
2020-12-23 15:47                     ` Bart Van Assche
2021-01-04 15:33                       ` John Garry
2021-01-04 17:22                         ` Bart Van Assche
2021-01-04 18:43                           ` John Garry
     [not found]                           ` <760304b3-dcbc-5b9d-0c70-627b7ff5b4eb@huawei.com>
2021-02-10 14:39                             ` John Garry
2020-12-22 11:22             ` John Garry
2020-12-22 13:24               ` Ming Lei
