Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v4] block: fix null pointer dereference in blk_mq_rq_timed_out()
@ 2019-09-25 12:20 Yufen Yu
  2019-09-26 10:04 ` Jens Axboe
  2019-10-09  8:26 ` Jack Wang
  0 siblings, 2 replies; 5+ messages in thread
From: Yufen Yu @ 2019-09-25 12:20 UTC (permalink / raw)
  To: axboe
  Cc: linux-block, Yufen Yu, Ming Lei, Christoph Hellwig, Keith Busch,
	Bart Van Assche, stable

We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:

[  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[  108.827059] PGD 0 P4D 0
[  108.827313] Oops: 0000 [#1] SMP PTI
[  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[  108.829503] Workqueue: kblockd blk_mq_timeout_work
[  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[  108.838191] Call Trace:
[  108.838406]  bt_iter+0x74/0x80
[  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
[  108.839074]  ? __switch_to_asm+0x34/0x70
[  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
[  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
[  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
[  108.840732]  blk_mq_timeout_work+0x74/0x200
[  108.841151]  process_one_work+0x297/0x680
[  108.841550]  worker_thread+0x29c/0x6f0
[  108.841926]  ? rescuer_thread+0x580/0x580
[  108.842344]  kthread+0x16a/0x1a0
[  108.842666]  ? kthread_flush_work+0x170/0x170
[  108.843100]  ret_from_fork+0x35/0x40

The bug is caused by the race between timeout handle and completion for
flush request.

When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.

After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.

However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.

We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().

Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Yufen Yu <yuyufen@huawei.com>

-------
v2:
 - move rq_status from struct request to struct blk_flush_queue
v3:
 - remove unnecessary '{}' pair.
v4:
 - let spinlock to protect 'fq->rq_status'
---
 block/blk-flush.c | 10 ++++++++++
 block/blk-mq.c    |  5 ++++-
 block/blk.h       |  7 +++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index aedd9320e605..1eec9cbe5a0a 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -214,6 +214,16 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 
 	/* release the tag's ownership to the req cloned from */
 	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+
+	if (!refcount_dec_and_test(&flush_rq->ref)) {
+		fq->rq_status = error;
+		spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+		return;
+	}
+
+	if (fq->rq_status != BLK_STS_OK)
+		error = fq->rq_status;
+
 	hctx = flush_rq->mq_hctx;
 	if (!q->elevator) {
 		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 20a49be536b5..e04fa9ab5574 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -912,7 +912,10 @@ static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 	 */
 	if (blk_mq_req_expired(rq, next))
 		blk_mq_rq_timed_out(rq, reserved);
-	if (refcount_dec_and_test(&rq->ref))
+
+	if (is_flush_rq(rq, hctx))
+		rq->end_io(rq, 0);
+	else if (refcount_dec_and_test(&rq->ref))
 		__blk_mq_free_request(rq);
 
 	return true;
diff --git a/block/blk.h b/block/blk.h
index ed347f7a97b1..de258e7b9db8 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -30,6 +30,7 @@ struct blk_flush_queue {
 	 */
 	struct request		*orig_rq;
 	spinlock_t		mq_flush_lock;
+	blk_status_t 		rq_status;
 };
 
 extern struct kmem_cache *blk_requestq_cachep;
@@ -47,6 +48,12 @@ static inline void __blk_get_queue(struct request_queue *q)
 	kobject_get(&q->kobj);
 }
 
+static inline bool
+is_flush_rq(struct request *req, struct blk_mq_hw_ctx *hctx)
+{
+	return hctx->fq->flush_rq == req;
+}
+
 struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
 		int node, int cmd_size, gfp_t flags);
 void blk_free_flush_queue(struct blk_flush_queue *q);
-- 
2.17.2


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-25 12:20 [PATCH v4] block: fix null pointer dereference in blk_mq_rq_timed_out() Yufen Yu
@ 2019-09-26 10:04 ` Jens Axboe
  2019-09-26 13:52   ` Yufen Yu
  2019-10-09  8:26 ` Jack Wang
  1 sibling, 1 reply; 5+ messages in thread
From: Jens Axboe @ 2019-09-26 10:04 UTC (permalink / raw)
  To: Yufen Yu
  Cc: linux-block, Ming Lei, Christoph Hellwig, Keith Busch,
	Bart Van Assche, stable

On 9/25/19 2:20 PM, Yufen Yu wrote:
> diff --git a/block/blk.h b/block/blk.h
> index ed347f7a97b1..de258e7b9db8 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -30,6 +30,7 @@ struct blk_flush_queue {
>   	 */
>   	struct request		*orig_rq;
>   	spinlock_t		mq_flush_lock;
> +	blk_status_t 		rq_status;
>   };

Patch looks fine to me, but you should move rq_status to after the
flush_running_idx member of struct blk_flush_queue, since then it'll
fill a padding hole instead of adding new ones.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-26 10:04 ` Jens Axboe
@ 2019-09-26 13:52   ` Yufen Yu
  0 siblings, 0 replies; 5+ messages in thread
From: Yufen Yu @ 2019-09-26 13:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Christoph Hellwig, Keith Busch,
	Bart Van Assche, stable



On 2019/9/26 18:04, Jens Axboe wrote:
> On 9/25/19 2:20 PM, Yufen Yu wrote:
>> diff --git a/block/blk.h b/block/blk.h
>> index ed347f7a97b1..de258e7b9db8 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -30,6 +30,7 @@ struct blk_flush_queue {
>>    	 */
>>    	struct request		*orig_rq;
>>    	spinlock_t		mq_flush_lock;
>> +	blk_status_t 		rq_status;
>>    };
> Patch looks fine to me, but you should move rq_status to after the
> flush_running_idx member of struct blk_flush_queue, since then it'll
> fill a padding hole instead of adding new ones.
>

Thanks for you good suggestion.

Thanks,
Yufen


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-25 12:20 [PATCH v4] block: fix null pointer dereference in blk_mq_rq_timed_out() Yufen Yu
  2019-09-26 10:04 ` Jens Axboe
@ 2019-10-09  8:26 ` Jack Wang
  2019-10-11  3:16   ` Yufen Yu
  1 sibling, 1 reply; 5+ messages in thread
From: Jack Wang @ 2019-10-09  8:26 UTC (permalink / raw)
  To: Yufen Yu
  Cc: Jens Axboe, linux-block, Ming Lei, Christoph Hellwig,
	Keith Busch, Bart Van Assche, stable, guoqing.jiang, jinpu.wang

Yufen Yu <yuyufen@huawei.com> 于2019年9月26日周四 上午11:30写道:
>
> We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
> as following:
>
> [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
> [  108.827059] PGD 0 P4D 0
> [  108.827313] Oops: 0000 [#1] SMP PTI
> [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
> [  108.829503] Workqueue: kblockd blk_mq_timeout_work
> [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
> [  108.838191] Call Trace:
> [  108.838406]  bt_iter+0x74/0x80
> [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
> [  108.839074]  ? __switch_to_asm+0x34/0x70
> [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
> [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
> [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
> [  108.840732]  blk_mq_timeout_work+0x74/0x200
> [  108.841151]  process_one_work+0x297/0x680
> [  108.841550]  worker_thread+0x29c/0x6f0
> [  108.841926]  ? rescuer_thread+0x580/0x580
> [  108.842344]  kthread+0x16a/0x1a0
> [  108.842666]  ? kthread_flush_work+0x170/0x170
> [  108.843100]  ret_from_fork+0x35/0x40
>
> The bug is caused by the race between timeout handle and completion for
> flush request.
>
> When timeout handle function blk_mq_rq_timed_out() try to read
> 'req->q->mq_ops', the 'req' have completed and reinitiated by next
> flush request, which would call blk_rq_init() to clear 'req' as 0.
>
> After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
> normal requests lifetime are protected by refcount. Until 'rq->ref'
> drop to zero, the request can really be free. Thus, these requests
> cannot been reused before timeout handle finish.
>
> However, flush request has defined .end_io and rq->end_io() is still
> called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
> can be reused by the next flush request handle, resulting in null
> pointer deference BUG ON.
>
> We fix this problem by covering flush request with 'rq->ref'.
> If the refcount is not zero, flush_end_io() return and wait the
> last holder recall it. To record the request status, we add a new
> entry 'rq_status', which will be used in flush_end_io().
>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: stable@vger.kernel.org # v4.18+
> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
>
Hi Yufen,

Can you share your reproducer, I think the bug was there for long
time, we hit it in kernel 4.4.
We also need to fix it for older LTS kernel.

Do you have an idea, how should we fix it for older LTS kernel?

Regards,
Jack Wang

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-10-09  8:26 ` Jack Wang
@ 2019-10-11  3:16   ` Yufen Yu
  0 siblings, 0 replies; 5+ messages in thread
From: Yufen Yu @ 2019-10-11  3:16 UTC (permalink / raw)
  To: Jack Wang
  Cc: Jens Axboe, linux-block, Ming Lei, Christoph Hellwig,
	Keith Busch, Bart Van Assche, stable, guoqing.jiang, jinpu.wang



On 2019/10/9 16:26, Jack Wang wrote:
>
>> We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
>> as following:
>>
>> [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
>> [  108.827059] PGD 0 P4D 0
>> [  108.827313] Oops: 0000 [#1] SMP PTI
>> [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
>> [  108.829503] Workqueue: kblockd blk_mq_timeout_work
>> [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
>> [  108.838191] Call Trace:
>> [  108.838406]  bt_iter+0x74/0x80
>> [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
>> [  108.839074]  ? __switch_to_asm+0x34/0x70
>> [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
>> [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
>> [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
>> [  108.840732]  blk_mq_timeout_work+0x74/0x200
>> [  108.841151]  process_one_work+0x297/0x680
>> [  108.841550]  worker_thread+0x29c/0x6f0
>> [  108.841926]  ? rescuer_thread+0x580/0x580
>> [  108.842344]  kthread+0x16a/0x1a0
>> [  108.842666]  ? kthread_flush_work+0x170/0x170
>> [  108.843100]  ret_from_fork+0x35/0x40
>>
>> The bug is caused by the race between timeout handle and completion for
>> flush request.
>>
>> When timeout handle function blk_mq_rq_timed_out() try to read
>> 'req->q->mq_ops', the 'req' have completed and reinitiated by next
>> flush request, which would call blk_rq_init() to clear 'req' as 0.
>>
>> After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
>> normal requests lifetime are protected by refcount. Until 'rq->ref'
>> drop to zero, the request can really be free. Thus, these requests
>> cannot been reused before timeout handle finish.
>>
>> However, flush request has defined .end_io and rq->end_io() is still
>> called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
>> can be reused by the next flush request handle, resulting in null
>> pointer deference BUG ON.
>>
>> We fix this problem by covering flush request with 'rq->ref'.
>> If the refcount is not zero, flush_end_io() return and wait the
>> last holder recall it. To record the request status, we add a new
>> entry 'rq_status', which will be used in flush_end_io().
>>
>> Cc: Ming Lei <ming.lei@redhat.com>
>> Cc: Christoph Hellwig <hch@infradead.org>
>> Cc: Keith Busch <keith.busch@intel.com>
>> Cc: Bart Van Assche <bvanassche@acm.org>
>> Cc: stable@vger.kernel.org # v4.18+
>> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
>>
> Hi Yufen,
>
> Can you share your reproducer, I think the bug was there for long
> time, we hit it in kernel 4.4.
> We also need to fix it for older LTS kernel.
>
> Do you have an idea, how should we fix it for older LTS kernel?
>

I have reproduced the bug by increasing delay after doing memset()
in blk_rq_init() and before calling blk_mq_rq_timed_out() in 
blk_mq_check_expired().

To make sure the request will be timeout, I have also increase delay
for flush request after blk_mq_start_request() in virtio_queue_rq() for
my virtio disk. Then, we just issue a flush request for the disk by fio.
The BUG_ON will be triggered.

For LTS 4.4 or older kernel, the race between timeout handle and completion
for normal request have not yet resolved.
So, IMO, we should fix the bug first by merging commit 12f5b93145
("blk-mq: Remove generation seqeunce") and its related patches.

After that, this patch can also be merged.

Thanks,
Yufen


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, back to index

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-25 12:20 [PATCH v4] block: fix null pointer dereference in blk_mq_rq_timed_out() Yufen Yu
2019-09-26 10:04 ` Jens Axboe
2019-09-26 13:52   ` Yufen Yu
2019-10-09  8:26 ` Jack Wang
2019-10-11  3:16   ` Yufen Yu

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org
	public-inbox-index linux-block

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git