Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
@ 2019-09-07 10:24 Yufen Yu
  2019-09-12  2:46 ` Ming Lei
  0 siblings, 1 reply; 10+ messages in thread
From: Yufen Yu @ 2019-09-07 10:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, hch, keith.busch, tj, yuyufen, zhangxiaoxu5

There is a race condition between timeout check and completion for
flush request as follow:

timeout_work    issue flush      issue flush
                blk_insert_flush
                                 blk_insert_flush
blk_mq_timeout_work
                blk_kick_flush

blk_mq_queue_tag_busy_iter
blk_mq_check_expired(flush_rq)

                __blk_mq_end_request
               flush_end_io
               blk_kick_flush
               blk_rq_init(flush_rq)
               memset(flush_rq, 0)

blk_mq_timed_out
BUG_ON flush_rq->q->mq_ops

For normal request, we need to get a tag and then allocate corresponding request.
Thus, the request cannot be reallocated before the tag have been free.
Commit 1d9bd5161ba ("blk-mq: replace timeout synchronization with a RCU and
generation based scheme") and commit 12f5b93145 ("blk-mq: Remove generation
seqeunce") can guarantee the consistency of timeout handle and completion.

However, 'flush_rq' have been forgotten. 'flush_rq' allocation management
dependents on flush implemention mechanism. Each hctx has only one 'flush_rq'.
When a flush request have completed, the next flush request will hold the 'flush_rq'.
In the end, timeout handle may access the cleared 'flush_rq'.

We fix this problem by checking request refcount 'rq->ref', as normal request.
If the refcount is not zero, flush_end_io() return and wait the last holder
recall it. To record the request status, we add a new entry 'rq_status',
which will be used in flush_end_io().

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
 block/blk-flush.c      | 8 ++++++++
 block/blk-mq.c         | 7 +++++--
 block/blk.h            | 5 +++++
 include/linux/blkdev.h | 2 ++
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index aedd9320e605..359a7e1a0925 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -212,6 +212,14 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
 	struct blk_mq_hw_ctx *hctx;
 
+	if (!refcount_dec_and_test(&flush_rq->ref)) {
+		flush_rq->rq_status = error;
+		return;
+	}
+
+	if (flush_rq->rq_status != BLK_STS_OK)
+		error = flush_rq->rq_status;
+
 	/* release the tag's ownership to the req cloned from */
 	spin_lock_irqsave(&fq->mq_flush_lock, flags);
 	hctx = flush_rq->mq_hctx;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0835f4d8d42e..3d2b2c2e9cdf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -905,9 +905,12 @@ static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 	 */
 	if (blk_mq_req_expired(rq, next))
 		blk_mq_rq_timed_out(rq, reserved);
-	if (refcount_dec_and_test(&rq->ref))
-		__blk_mq_free_request(rq);
 
+	if (is_flush_rq(rq, hctx)) {
+		rq->end_io(rq, 0);
+	} else if (refcount_dec_and_test(&rq->ref)) {
+		__blk_mq_free_request(rq);
+	}
 	return true;
 }
 
diff --git a/block/blk.h b/block/blk.h
index de6b2e146d6e..f503ef9ad3e6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -47,6 +47,11 @@ static inline void __blk_get_queue(struct request_queue *q)
 	kobject_get(&q->kobj);
 }
 
+static inline bool
+is_flush_rq(struct request *req, struct blk_mq_hw_ctx *hctx) {
+	return hctx->fq->flush_rq == req;
+}
+
 struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
 		int node, int cmd_size, gfp_t flags);
 void blk_free_flush_queue(struct blk_flush_queue *q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1ef375dafb1c..b1d05077e03f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -237,6 +237,8 @@ struct request {
 	 */
 	rq_end_io_fn *end_io;
 	void *end_io_data;
+
+	blk_status_t rq_status;
 };
 
 static inline bool blk_op_is_scsi(unsigned int op)
-- 
2.17.2


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-07 10:24 [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out() Yufen Yu
@ 2019-09-12  2:46 ` Ming Lei
  2019-09-12  3:29   ` Yufen Yu
  0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2019-09-12  2:46 UTC (permalink / raw)
  To: Yufen Yu; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5

On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
> There is a race condition between timeout check and completion for
> flush request as follow:
> 
> timeout_work    issue flush      issue flush
>                 blk_insert_flush
>                                  blk_insert_flush
> blk_mq_timeout_work
>                 blk_kick_flush
> 
> blk_mq_queue_tag_busy_iter
> blk_mq_check_expired(flush_rq)
> 
>                 __blk_mq_end_request
>                flush_end_io
>                blk_kick_flush
>                blk_rq_init(flush_rq)
>                memset(flush_rq, 0)

Not see there is memset(flush_rq, 0) in block/blk-flush.c

> 
> blk_mq_timed_out
> BUG_ON flush_rq->q->mq_ops

flush_rq->q won't be changed by blk_rq_init(), and either READ or WRITE
on variable with machine WORD length are atomic, so how can the BUG_ON()
be triggered? Do you have the actual BUG log?

Also now it is driver's responsibility for avoiding race between normal
completion and timeout.

> 
> For normal request, we need to get a tag and then allocate corresponding request.
> Thus, the request cannot be reallocated before the tag have been free.
> Commit 1d9bd5161ba ("blk-mq: replace timeout synchronization with a RCU and
> generation based scheme") and commit 12f5b93145 ("blk-mq: Remove generation
> seqeunce") can guarantee the consistency of timeout handle and completion.
> 
> However, 'flush_rq' have been forgotten. 'flush_rq' allocation management
> dependents on flush implemention mechanism. Each hctx has only one 'flush_rq'.
> When a flush request have completed, the next flush request will hold the 'flush_rq'.
> In the end, timeout handle may access the cleared 'flush_rq'.
> 
> We fix this problem by checking request refcount 'rq->ref', as normal request.
> If the refcount is not zero, flush_end_io() return and wait the last holder
> recall it. To record the request status, we add a new entry 'rq_status',
> which will be used in flush_end_io().
> 
> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
> ---
>  block/blk-flush.c      | 8 ++++++++
>  block/blk-mq.c         | 7 +++++--
>  block/blk.h            | 5 +++++
>  include/linux/blkdev.h | 2 ++
>  4 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index aedd9320e605..359a7e1a0925 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -212,6 +212,14 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
>  	struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
>  	struct blk_mq_hw_ctx *hctx;
>  
> +	if (!refcount_dec_and_test(&flush_rq->ref)) {
> +		flush_rq->rq_status = error;
> +		return;
> +	}
> +
> +	if (flush_rq->rq_status != BLK_STS_OK)
> +		error = flush_rq->rq_status;
> +
>  	/* release the tag's ownership to the req cloned from */
>  	spin_lock_irqsave(&fq->mq_flush_lock, flags);
>  	hctx = flush_rq->mq_hctx;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0835f4d8d42e..3d2b2c2e9cdf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -905,9 +905,12 @@ static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>  	 */
>  	if (blk_mq_req_expired(rq, next))
>  		blk_mq_rq_timed_out(rq, reserved);
> -	if (refcount_dec_and_test(&rq->ref))
> -		__blk_mq_free_request(rq);
>  
> +	if (is_flush_rq(rq, hctx)) {
> +		rq->end_io(rq, 0);
> +	} else if (refcount_dec_and_test(&rq->ref)) {
> +		__blk_mq_free_request(rq);
> +	}
>  	return true;
>  }
>  
> diff --git a/block/blk.h b/block/blk.h
> index de6b2e146d6e..f503ef9ad3e6 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -47,6 +47,11 @@ static inline void __blk_get_queue(struct request_queue *q)
>  	kobject_get(&q->kobj);
>  }
>  
> +static inline bool
> +is_flush_rq(struct request *req, struct blk_mq_hw_ctx *hctx) {
> +	return hctx->fq->flush_rq == req;
> +}
> +
>  struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
>  		int node, int cmd_size, gfp_t flags);
>  void blk_free_flush_queue(struct blk_flush_queue *q);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1ef375dafb1c..b1d05077e03f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -237,6 +237,8 @@ struct request {
>  	 */
>  	rq_end_io_fn *end_io;
>  	void *end_io_data;
> +
> +	blk_status_t rq_status;
>  };

'rq_status' is only for flush request, so it may be added to 'struct
blk_flush_queue' instead of 'struct request'.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-12  2:46 ` Ming Lei
@ 2019-09-12  3:29   ` Yufen Yu
  2019-09-12  4:16     ` Ming Lei
  2019-09-12  8:59     ` Guoqing Jiang
  0 siblings, 2 replies; 10+ messages in thread
From: Yufen Yu @ 2019-09-12  3:29 UTC (permalink / raw)
  To: Ming Lei; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5



On 2019/9/12 10:46, Ming Lei wrote:
> On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
>> There is a race condition between timeout check and completion for
>> flush request as follow:
>>
>> timeout_work    issue flush      issue flush
>>                  blk_insert_flush
>>                                   blk_insert_flush
>> blk_mq_timeout_work
>>                  blk_kick_flush
>>
>> blk_mq_queue_tag_busy_iter
>> blk_mq_check_expired(flush_rq)
>>
>>                  __blk_mq_end_request
>>                 flush_end_io
>>                 blk_kick_flush
>>                 blk_rq_init(flush_rq)
>>                 memset(flush_rq, 0)
> Not see there is memset(flush_rq, 0) in block/blk-flush.c

Call path as follow:

blk_kick_flush
     blk_rq_init
         memset(rq, 0, sizeof(*rq));

>> blk_mq_timed_out
>> BUG_ON flush_rq->q->mq_ops
> flush_rq->q won't be changed by blk_rq_init(), and either READ or WRITE
> on variable with machine WORD length are atomic, so how can the BUG_ON()
> be triggered? Do you have the actual BUG log?
>
> Also now it is driver's responsibility for avoiding race between normal
> completion and timeout.

I have reproduced the bug by adding time delay in timeout handle and memset.
BUG_ON log as follow:

[  108.825472] BUG: kernel NULL pointer dereference, address: 
0000000000000040
[  108.826091] #PF: supervisor read access in kernel mode
[  108.826543] #PF: error_code(0x0000) - not-present page
[  108.827059] PGD 0 P4D 0
[  108.827313] Oops: 0000 [#1] SMP PTI
[  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ 
#431
[  108.828326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[  108.829503] Workqueue: kblockd blk_mq_timeout_work
[  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[  108.830439] Code: 01 e9 0a ff ff ff 48 83 05 34 45 dd 02 01 4c 39 63 
40 0f 84 8a 00 00 00 0d 00 00 20 00 40 0f b6 f5 41 89 44 24 1c 49 8b 04 
24 <48> 8b 40 40 48 8b 40 20 48 85 c0 0f 84 90 00 00 00 48 83 05 2f 44
[  108.832246] RSP: 0018:ffffbf7ac18b7db0 EFLAGS: 00010206
[  108.832756] RAX: 0000000000000000 RBX: ffffffffb56e0250 RCX: 
0000000000000000
[  108.833444] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
ffff9ab7fbb96538
[  108.834149] RBP: 0000000000000000 R08: 000000000000024b R09: 
0000000000000030
[  108.834840] R10: 000000000000004e R11: ffffbf7ac18b7c40 R12: 
ffff9ab7f756e000
[  108.835531] R13: ffffbf7ac18b7e70 R14: 0000000000000017 R15: 
ffff9ab7f6ead0a0
[  108.836228] FS:  0000000000000000(0000) GS:ffff9ab7fbb80000(0000) 
knlGS:0000000000000000
[  108.837026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  108.837588] CR2: 0000000000000040 CR3: 000000013544c000 CR4: 
00000000000006e0
[  108.838191] Call Trace:
[  108.838406]  bt_iter+0x74/0x80
[  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
[  108.839074]  ? __switch_to_asm+0x34/0x70
[  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
[  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
[  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
[  108.840732]  blk_mq_timeout_work+0x74/0x200
[  108.841151]  process_one_work+0x297/0x680
[  108.841550]  worker_thread+0x29c/0x6f0
[  108.841926]  ? rescuer_thread+0x580/0x580
[  108.842344]  kthread+0x16a/0x1a0
[  108.842666]  ? kthread_flush_work+0x170/0x170
[  108.843100]  ret_from_fork+0x35/0x40
[  108.843455] Modules linked in:
[  108.843758] CR2: 0000000000000040
[  108.844090] ---[ end trace e0ac552505fa1b95 ]---

blk_mq_rq_timed_out() attempt to read 'req->q->mq_ops->timeout', but 'q 
== 0' currently,
which triggers BUG_ON.

>> For normal request, we need to get a tag and then allocate corresponding request.
>> Thus, the request cannot be reallocated before the tag have been free.
>> Commit 1d9bd5161ba ("blk-mq: replace timeout synchronization with a RCU and
>> generation based scheme") and commit 12f5b93145 ("blk-mq: Remove generation
>> seqeunce") can guarantee the consistency of timeout handle and completion.
>>
>> However, 'flush_rq' have been forgotten. 'flush_rq' allocation management
>> dependents on flush implemention mechanism. Each hctx has only one 'flush_rq'.
>> When a flush request have completed, the next flush request will hold the 'flush_rq'.
>> In the end, timeout handle may access the cleared 'flush_rq'.
>>
>> We fix this problem by checking request refcount 'rq->ref', as normal request.
>> If the refcount is not zero, flush_end_io() return and wait the last holder
>> recall it. To record the request status, we add a new entry 'rq_status',
>> which will be used in flush_end_io().
>>
>> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
>> ---
>>   block/blk-flush.c      | 8 ++++++++
>>   block/blk-mq.c         | 7 +++++--
>>   block/blk.h            | 5 +++++
>>   include/linux/blkdev.h | 2 ++
>>   4 files changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-flush.c b/block/blk-flush.c
>> index aedd9320e605..359a7e1a0925 100644
>> --- a/block/blk-flush.c
>> +++ b/block/blk-flush.c
>> @@ -212,6 +212,14 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
>>   	struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
>>   	struct blk_mq_hw_ctx *hctx;
>>   
>> +	if (!refcount_dec_and_test(&flush_rq->ref)) {
>> +		flush_rq->rq_status = error;
>> +		return;
>> +	}
>> +
>> +	if (flush_rq->rq_status != BLK_STS_OK)
>> +		error = flush_rq->rq_status;
>> +
>>   	/* release the tag's ownership to the req cloned from */
>>   	spin_lock_irqsave(&fq->mq_flush_lock, flags);
>>   	hctx = flush_rq->mq_hctx;
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 0835f4d8d42e..3d2b2c2e9cdf 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -905,9 +905,12 @@ static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>>   	 */
>>   	if (blk_mq_req_expired(rq, next))
>>   		blk_mq_rq_timed_out(rq, reserved);
>> -	if (refcount_dec_and_test(&rq->ref))
>> -		__blk_mq_free_request(rq);
>>   
>> +	if (is_flush_rq(rq, hctx)) {
>> +		rq->end_io(rq, 0);
>> +	} else if (refcount_dec_and_test(&rq->ref)) {
>> +		__blk_mq_free_request(rq);
>> +	}
>>   	return true;
>>   }
>>   
>> diff --git a/block/blk.h b/block/blk.h
>> index de6b2e146d6e..f503ef9ad3e6 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -47,6 +47,11 @@ static inline void __blk_get_queue(struct request_queue *q)
>>   	kobject_get(&q->kobj);
>>   }
>>   
>> +static inline bool
>> +is_flush_rq(struct request *req, struct blk_mq_hw_ctx *hctx) {
>> +	return hctx->fq->flush_rq == req;
>> +}
>> +
>>   struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
>>   		int node, int cmd_size, gfp_t flags);
>>   void blk_free_flush_queue(struct blk_flush_queue *q);
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 1ef375dafb1c..b1d05077e03f 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -237,6 +237,8 @@ struct request {
>>   	 */
>>   	rq_end_io_fn *end_io;
>>   	void *end_io_data;
>> +
>> +	blk_status_t rq_status;
>>   };
> 'rq_status' is only for flush request, so it may be added to 'struct
> blk_flush_queue' instead of 'struct request'.

That's a good idea.

Thanks,
Yufen


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-12  3:29   ` Yufen Yu
@ 2019-09-12  4:16     ` Ming Lei
  2019-09-12  8:49       ` Yufen Yu
  2019-09-12  8:59     ` Guoqing Jiang
  1 sibling, 1 reply; 10+ messages in thread
From: Ming Lei @ 2019-09-12  4:16 UTC (permalink / raw)
  To: Yufen Yu; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5

On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
> 
> 
> On 2019/9/12 10:46, Ming Lei wrote:
> > On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
> > > There is a race condition between timeout check and completion for
> > > flush request as follow:
> > > 
> > > timeout_work    issue flush      issue flush
> > >                  blk_insert_flush
> > >                                   blk_insert_flush
> > > blk_mq_timeout_work
> > >                  blk_kick_flush
> > > 
> > > blk_mq_queue_tag_busy_iter
> > > blk_mq_check_expired(flush_rq)
> > > 
> > >                  __blk_mq_end_request
> > >                 flush_end_io
> > >                 blk_kick_flush
> > >                 blk_rq_init(flush_rq)
> > >                 memset(flush_rq, 0)
> > Not see there is memset(flush_rq, 0) in block/blk-flush.c
> 
> Call path as follow:
> 
> blk_kick_flush
>     blk_rq_init
>         memset(rq, 0, sizeof(*rq));

Looks I miss this one in blk_rq_init(), sorry for that.

Given there are only two users of blk_rq_init(), one simple fix could be
not clearing queue in blk_rq_init(), something like below?

diff --git a/block/blk-core.c b/block/blk-core.c
index 77807a5d7f9e..25e6a045c821 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
 
 void blk_rq_init(struct request_queue *q, struct request *rq)
 {
-	memset(rq, 0, sizeof(*rq));
+	const int offset = offsetof(struct request, q);
+
+	memset((void *)rq + offset, 0, sizeof(*rq) - offset);
 
 	INIT_LIST_HEAD(&rq->queuelist);
 	rq->q = q;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1ac790178787..382e71b8787d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -130,7 +130,7 @@ enum mq_rq_state {
  * especially blk_mq_rq_ctx_init() to take care of the added fields.
  */
 struct request {
-	struct request_queue *q;
+	struct request_queue *q;	/* Must be the 1st field */
 	struct blk_mq_ctx *mq_ctx;
 	struct blk_mq_hw_ctx *mq_hctx;
 

Thanks,
Ming

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-12  4:16     ` Ming Lei
@ 2019-09-12  8:49       ` Yufen Yu
  2019-09-12 10:07         ` Ming Lei
  0 siblings, 1 reply; 10+ messages in thread
From: Yufen Yu @ 2019-09-12  8:49 UTC (permalink / raw)
  To: Ming Lei; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5



On 2019/9/12 12:16, Ming Lei wrote:
> On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
>>
>> On 2019/9/12 10:46, Ming Lei wrote:
>>> On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
>>>> There is a race condition between timeout check and completion for
>>>> flush request as follow:
>>>>
>>>> timeout_work    issue flush      issue flush
>>>>                   blk_insert_flush
>>>>                                    blk_insert_flush
>>>> blk_mq_timeout_work
>>>>                   blk_kick_flush
>>>>
>>>> blk_mq_queue_tag_busy_iter
>>>> blk_mq_check_expired(flush_rq)
>>>>
>>>>                   __blk_mq_end_request
>>>>                  flush_end_io
>>>>                  blk_kick_flush
>>>>                  blk_rq_init(flush_rq)
>>>>                  memset(flush_rq, 0)
>>> Not see there is memset(flush_rq, 0) in block/blk-flush.c
>> Call path as follow:
>>
>> blk_kick_flush
>>      blk_rq_init
>>          memset(rq, 0, sizeof(*rq));
> Looks I miss this one in blk_rq_init(), sorry for that.
>
> Given there are only two users of blk_rq_init(), one simple fix could be
> not clearing queue in blk_rq_init(), something like below?
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 77807a5d7f9e..25e6a045c821 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
>   
>   void blk_rq_init(struct request_queue *q, struct request *rq)
>   {
> -	memset(rq, 0, sizeof(*rq));
> +	const int offset = offsetof(struct request, q);
> +
> +	memset((void *)rq + offset, 0, sizeof(*rq) - offset);
>   
>   	INIT_LIST_HEAD(&rq->queuelist);
>   	rq->q = q;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1ac790178787..382e71b8787d 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -130,7 +130,7 @@ enum mq_rq_state {
>    * especially blk_mq_rq_ctx_init() to take care of the added fields.
>    */
>   struct request {
> -	struct request_queue *q;
> +	struct request_queue *q;	/* Must be the 1st field */
>   	struct blk_mq_ctx *mq_ctx;
>   	struct blk_mq_hw_ctx *mq_hctx;

Not set req->q as '0' can just avoid BUG_ON for NULL pointer deference.

However, the root problem is that 'flush_rq' have been reused while
timeout function handle it currently. That means mq_ops->timeout() may
access old values remained by the last flush request and make the wrong 
decision.

Take the race condition in the patch as an example.

blk_mq_check_expired
     blk_mq_rq_timed_out
         req->q->mq_ops->timeout  // Driver timeout handle may read old data
     refcount_dec_and_test(&rq)
     __blk_mq_free_request   // If rq have been reset has '1' in 
blk_rq_init(), it will be free here.

So, I think we should solve this problem completely. Just like normal 
request,
we can prevent flush request to call end_io when timeout handle the request.

Thanks,
Yufen




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-12  3:29   ` Yufen Yu
  2019-09-12  4:16     ` Ming Lei
@ 2019-09-12  8:59     ` Guoqing Jiang
  1 sibling, 0 replies; 10+ messages in thread
From: Guoqing Jiang @ 2019-09-12  8:59 UTC (permalink / raw)
  To: Yufen Yu, Ming Lei; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5



On 9/12/19 5:29 AM, Yufen Yu wrote:
>
>
> On 2019/9/12 10:46, Ming Lei wrote:
>> On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
>>> There is a race condition between timeout check and completion for
>>> flush request as follow:
>>>
>>> timeout_work    issue flush      issue flush
>>>                  blk_insert_flush
>>>                                   blk_insert_flush
>>> blk_mq_timeout_work
>>>                  blk_kick_flush
>>>
>>> blk_mq_queue_tag_busy_iter
>>> blk_mq_check_expired(flush_rq)
>>>
>>>                  __blk_mq_end_request
>>>                 flush_end_io
>>>                 blk_kick_flush
>>>                 blk_rq_init(flush_rq)
>>>                 memset(flush_rq, 0)
>> Not see there is memset(flush_rq, 0) in block/blk-flush.c
>
> Call path as follow:
>
> blk_kick_flush
>     blk_rq_init
>         memset(rq, 0, sizeof(*rq));
>
>>> blk_mq_timed_out
>>> BUG_ON flush_rq->q->mq_ops
>> flush_rq->q won't be changed by blk_rq_init(), and either READ or WRITE
>> on variable with machine WORD length are atomic, so how can the BUG_ON()
>> be triggered? Do you have the actual BUG log?
>>
>> Also now it is driver's responsibility for avoiding race between normal
>> completion and timeout.
>
> I have reproduced the bug by adding time delay in timeout handle and 
> memset.
> BUG_ON log as follow:
>
> [  108.825472] BUG: kernel NULL pointer dereference, address: 
> 0000000000000040
> [  108.826091] #PF: supervisor read access in kernel mode
> [  108.826543] #PF: error_code(0x0000) - not-present page
> [  108.827059] PGD 0 P4D 0
> [  108.827313] Oops: 0000 [#1] SMP PTI
> [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 
> 5.3.0-rc8+ #431
> [  108.828326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 
> 04/01/2014
> [  108.829503] Workqueue: kblockd blk_mq_timeout_work
> [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
> [  108.830439] Code: 01 e9 0a ff ff ff 48 83 05 34 45 dd 02 01 4c 39 
> 63 40 0f 84 8a 00 00 00 0d 00 00 20 00 40 0f b6 f5 41 89 44 24 1c 49 
> 8b 04 24 <48> 8b 40 40 48 8b 40 20 48 85 c0 0f 84 90 00 00 00 48 83 05 
> 2f 44
> [  108.832246] RSP: 0018:ffffbf7ac18b7db0 EFLAGS: 00010206
> [  108.832756] RAX: 0000000000000000 RBX: ffffffffb56e0250 RCX: 
> 0000000000000000
> [  108.833444] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
> ffff9ab7fbb96538
> [  108.834149] RBP: 0000000000000000 R08: 000000000000024b R09: 
> 0000000000000030
> [  108.834840] R10: 000000000000004e R11: ffffbf7ac18b7c40 R12: 
> ffff9ab7f756e000
> [  108.835531] R13: ffffbf7ac18b7e70 R14: 0000000000000017 R15: 
> ffff9ab7f6ead0a0
> [  108.836228] FS:  0000000000000000(0000) GS:ffff9ab7fbb80000(0000) 
> knlGS:0000000000000000
> [  108.837026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  108.837588] CR2: 0000000000000040 CR3: 000000013544c000 CR4: 
> 00000000000006e0
> [  108.838191] Call Trace:
> [  108.838406]  bt_iter+0x74/0x80
> [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
> [  108.839074]  ? __switch_to_asm+0x34/0x70
> [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
> [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
> [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
> [  108.840732]  blk_mq_timeout_work+0x74/0x200
> [  108.841151]  process_one_work+0x297/0x680
> [  108.841550]  worker_thread+0x29c/0x6f0
> [  108.841926]  ? rescuer_thread+0x580/0x580
> [  108.842344]  kthread+0x16a/0x1a0
> [  108.842666]  ? kthread_flush_work+0x170/0x170
> [  108.843100]  ret_from_fork+0x35/0x40
> [  108.843455] Modules linked in:
> [  108.843758] CR2: 0000000000000040
> [  108.844090] ---[ end trace e0ac552505fa1b95 ]---
>
> blk_mq_rq_timed_out() attempt to read 'req->q->mq_ops->timeout', but 
> 'q == 0' currently,
> which triggers BUG_ON.

We have a similar calltrace which happened in older kernel (4.4.62), not 
sure if it is the same one.

[32353526.224059] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G           O    4.4.62-1-storage #4.4.62-1.3
[32353526.224343] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
[...]
[32353526.224840] RIP: 0010:[<ffffffff812df5a1>] [<ffffffff812df5a1>] blk_mq_rq_timed_out+0x11/0x70
[32353526.285015]  [<ffffffff812df63d>] blk_mq_check_expired+0x3d/0x60
[32353526.301997]  [<ffffffff812e1f74>] bt_for_each+0xd4/0xe0
[32353526.310730]  [<ffffffff812df600>] ? blk_mq_rq_timed_out+0x70/0x70
[32353526.319579]  [<ffffffff812df600>] ? blk_mq_rq_timed_out+0x70/0x70
[32353526.328329]  [<ffffffff812e2753>] blk_mq_queue_tag_busy_iter+0x43/0xc0
[32353526.336995]  [<ffffffff812de7c0>] ? blk_mq_bio_to_request+0x40/0x40
[32353526.345566]  [<ffffffff812de7f2>] blk_mq_rq_timer+0x32/0xd0
[32353526.354094]  [<ffffffff810b2e45>] call_timer_fn+0x35/0x130
[32353526.362542]  [<ffffffff812de7c0>] ? blk_mq_bio_to_request+0x40/0x40
[32353526.370894]  [<ffffffff810b31b7>] run_timer_softirq+0x157/0x280

Thanks,
Guoqing


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-12  8:49       ` Yufen Yu
@ 2019-09-12 10:07         ` Ming Lei
  2019-09-16  2:40           ` Yufen Yu
  2019-09-16  9:27           ` Yufen Yu
  0 siblings, 2 replies; 10+ messages in thread
From: Ming Lei @ 2019-09-12 10:07 UTC (permalink / raw)
  To: Yufen Yu; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5

On Thu, Sep 12, 2019 at 04:49:15PM +0800, Yufen Yu wrote:
> 
> 
> On 2019/9/12 12:16, Ming Lei wrote:
> > On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
> > > 
> > > On 2019/9/12 10:46, Ming Lei wrote:
> > > > On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
> > > > > There is a race condition between timeout check and completion for
> > > > > flush request as follow:
> > > > > 
> > > > > timeout_work    issue flush      issue flush
> > > > >                   blk_insert_flush
> > > > >                                    blk_insert_flush
> > > > > blk_mq_timeout_work
> > > > >                   blk_kick_flush
> > > > > 
> > > > > blk_mq_queue_tag_busy_iter
> > > > > blk_mq_check_expired(flush_rq)
> > > > > 
> > > > >                   __blk_mq_end_request
> > > > >                  flush_end_io
> > > > >                  blk_kick_flush
> > > > >                  blk_rq_init(flush_rq)
> > > > >                  memset(flush_rq, 0)
> > > > Not see there is memset(flush_rq, 0) in block/blk-flush.c
> > > Call path as follow:
> > > 
> > > blk_kick_flush
> > >      blk_rq_init
> > >          memset(rq, 0, sizeof(*rq));
> > Looks I miss this one in blk_rq_init(), sorry for that.
> > 
> > Given there are only two users of blk_rq_init(), one simple fix could be
> > not clearing queue in blk_rq_init(), something like below?
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 77807a5d7f9e..25e6a045c821 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
> >   void blk_rq_init(struct request_queue *q, struct request *rq)
> >   {
> > -	memset(rq, 0, sizeof(*rq));
> > +	const int offset = offsetof(struct request, q);
> > +
> > +	memset((void *)rq + offset, 0, sizeof(*rq) - offset);
> >   	INIT_LIST_HEAD(&rq->queuelist);
> >   	rq->q = q;
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 1ac790178787..382e71b8787d 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -130,7 +130,7 @@ enum mq_rq_state {
> >    * especially blk_mq_rq_ctx_init() to take care of the added fields.
> >    */
> >   struct request {
> > -	struct request_queue *q;
> > +	struct request_queue *q;	/* Must be the 1st field */
> >   	struct blk_mq_ctx *mq_ctx;
> >   	struct blk_mq_hw_ctx *mq_hctx;
> 
> Not set req->q as '0' can just avoid BUG_ON for NULL pointer deference.
> 
> However, the root problem is that 'flush_rq' have been reused while
> timeout function handle it currently. That means mq_ops->timeout() may
> access old values remained by the last flush request and make the wrong
> decision.
> 
> Take the race condition in the patch as an example.
> 
> blk_mq_check_expired
>     blk_mq_rq_timed_out
>         req->q->mq_ops->timeout  // Driver timeout handle may read old data
>     refcount_dec_and_test(&rq)
>     __blk_mq_free_request   // If rq have been reset has '1' in
> blk_rq_init(), it will be free here.
> 
> So, I think we should solve this problem completely. Just like normal
> request,
> we can prevent flush request to call end_io when timeout handle the request.

Seems it isn't specific for 'flush_rq', and it should be one generic issue
for any request which implements .end_io.

For requests without defining .end_io, rq->ref is applied for protecting
its lifetime. However, rq->end_io() is still called even if rq->ref doesn't
drop to zero.

If the above is correct, we need to let rq->ref to cover rq->end_io().


Thanks,
Ming

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-12 10:07         ` Ming Lei
@ 2019-09-16  2:40           ` Yufen Yu
  2019-09-16  9:27           ` Yufen Yu
  1 sibling, 0 replies; 10+ messages in thread
From: Yufen Yu @ 2019-09-16  2:40 UTC (permalink / raw)
  To: Ming Lei; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5



On 2019/9/12 18:07, Ming Lei wrote:
> On Thu, Sep 12, 2019 at 04:49:15PM +0800, Yufen Yu wrote:
>>
>> On 2019/9/12 12:16, Ming Lei wrote:
>>> On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
>>>> On 2019/9/12 10:46, Ming Lei wrote:
>>>>> On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
>>>>>> There is a race condition between timeout check and completion for
>>>>>> flush request as follow:
>>>>>>
>>>>>> timeout_work    issue flush      issue flush
>>>>>>                    blk_insert_flush
>>>>>>                                     blk_insert_flush
>>>>>> blk_mq_timeout_work
>>>>>>                    blk_kick_flush
>>>>>>
>>>>>> blk_mq_queue_tag_busy_iter
>>>>>> blk_mq_check_expired(flush_rq)
>>>>>>
>>>>>>                    __blk_mq_end_request
>>>>>>                   flush_end_io
>>>>>>                   blk_kick_flush
>>>>>>                   blk_rq_init(flush_rq)
>>>>>>                   memset(flush_rq, 0)
>>>>> Not see there is memset(flush_rq, 0) in block/blk-flush.c
>>>> Call path as follow:
>>>>
>>>> blk_kick_flush
>>>>       blk_rq_init
>>>>           memset(rq, 0, sizeof(*rq));
>>> Looks I miss this one in blk_rq_init(), sorry for that.
>>>
>>> Given there are only two users of blk_rq_init(), one simple fix could be
>>> not clearing queue in blk_rq_init(), something like below?
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 77807a5d7f9e..25e6a045c821 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
>>>    void blk_rq_init(struct request_queue *q, struct request *rq)
>>>    {
>>> -	memset(rq, 0, sizeof(*rq));
>>> +	const int offset = offsetof(struct request, q);
>>> +
>>> +	memset((void *)rq + offset, 0, sizeof(*rq) - offset);
>>>    	INIT_LIST_HEAD(&rq->queuelist);
>>>    	rq->q = q;
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 1ac790178787..382e71b8787d 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -130,7 +130,7 @@ enum mq_rq_state {
>>>     * especially blk_mq_rq_ctx_init() to take care of the added fields.
>>>     */
>>>    struct request {
>>> -	struct request_queue *q;
>>> +	struct request_queue *q;	/* Must be the 1st field */
>>>    	struct blk_mq_ctx *mq_ctx;
>>>    	struct blk_mq_hw_ctx *mq_hctx;
>> Not set req->q as '0' can just avoid BUG_ON for NULL pointer deference.
>>
>> However, the root problem is that 'flush_rq' have been reused while
>> timeout function handle it currently. That means mq_ops->timeout() may
>> access old values remained by the last flush request and make the wrong
>> decision.
>>
>> Take the race condition in the patch as an example.
>>
>> blk_mq_check_expired
>>      blk_mq_rq_timed_out
>>          req->q->mq_ops->timeout  // Driver timeout handle may read old data
>>      refcount_dec_and_test(&rq)
>>      __blk_mq_free_request   // If rq have been reset has '1' in
>> blk_rq_init(), it will be free here.
>>
>> So, I think we should solve this problem completely. Just like normal
>> request,
>> we can prevent flush request to call end_io when timeout handle the request.
> Seems it isn't specific for 'flush_rq', and it should be one generic issue
> for any request which implements .end_io.
>
> For requests without defining .end_io, rq->ref is applied for protecting
> its lifetime. However, rq->end_io() is still called even if rq->ref doesn't
> drop to zero.
>
> If the above is correct, we need to let rq->ref to cover rq->end_io().

Thanks for catching what I have ignored. I am trying to fix the problem.

Thanks,
Yufen


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-12 10:07         ` Ming Lei
  2019-09-16  2:40           ` Yufen Yu
@ 2019-09-16  9:27           ` Yufen Yu
  2019-09-17  0:50             ` Ming Lei
  1 sibling, 1 reply; 10+ messages in thread
From: Yufen Yu @ 2019-09-16  9:27 UTC (permalink / raw)
  To: Ming Lei; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5



On 2019/9/12 18:07, Ming Lei wrote:
> On Thu, Sep 12, 2019 at 04:49:15PM +0800, Yufen Yu wrote:
>>
>> On 2019/9/12 12:16, Ming Lei wrote:
>>> On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
>>>> On 2019/9/12 10:46, Ming Lei wrote:
>>>>> On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
>>>>>> There is a race condition between timeout check and completion for
>>>>>> flush request as follow:
>>>>>>
>>>>>> timeout_work    issue flush      issue flush
>>>>>>                    blk_insert_flush
>>>>>>                                     blk_insert_flush
>>>>>> blk_mq_timeout_work
>>>>>>                    blk_kick_flush
>>>>>>
>>>>>> blk_mq_queue_tag_busy_iter
>>>>>> blk_mq_check_expired(flush_rq)
>>>>>>
>>>>>>                    __blk_mq_end_request
>>>>>>                   flush_end_io
>>>>>>                   blk_kick_flush
>>>>>>                   blk_rq_init(flush_rq)
>>>>>>                   memset(flush_rq, 0)
>>>>> Not see there is memset(flush_rq, 0) in block/blk-flush.c
>>>> Call path as follow:
>>>>
>>>> blk_kick_flush
>>>>       blk_rq_init
>>>>           memset(rq, 0, sizeof(*rq));
>>> Looks I miss this one in blk_rq_init(), sorry for that.
>>>
>>> Given there are only two users of blk_rq_init(), one simple fix could be
>>> not clearing queue in blk_rq_init(), something like below?
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 77807a5d7f9e..25e6a045c821 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
>>>    void blk_rq_init(struct request_queue *q, struct request *rq)
>>>    {
>>> -	memset(rq, 0, sizeof(*rq));
>>> +	const int offset = offsetof(struct request, q);
>>> +
>>> +	memset((void *)rq + offset, 0, sizeof(*rq) - offset);
>>>    	INIT_LIST_HEAD(&rq->queuelist);
>>>    	rq->q = q;
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 1ac790178787..382e71b8787d 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -130,7 +130,7 @@ enum mq_rq_state {
>>>     * especially blk_mq_rq_ctx_init() to take care of the added fields.
>>>     */
>>>    struct request {
>>> -	struct request_queue *q;
>>> +	struct request_queue *q;	/* Must be the 1st field */
>>>    	struct blk_mq_ctx *mq_ctx;
>>>    	struct blk_mq_hw_ctx *mq_hctx;
>> Not set req->q as '0' can just avoid BUG_ON for NULL pointer deference.
>>
>> However, the root problem is that 'flush_rq' have been reused while
>> timeout function handle it currently. That means mq_ops->timeout() may
>> access old values remained by the last flush request and make the wrong
>> decision.
>>
>> Take the race condition in the patch as an example.
>>
>> blk_mq_check_expired
>>      blk_mq_rq_timed_out
>>          req->q->mq_ops->timeout  // Driver timeout handle may read old data
>>      refcount_dec_and_test(&rq)
>>      __blk_mq_free_request   // If rq have been reset has '1' in
>> blk_rq_init(), it will be free here.
>>
>> So, I think we should solve this problem completely. Just like normal
>> request,
>> we can prevent flush request to call end_io when timeout handle the request.
> Seems it isn't specific for 'flush_rq', and it should be one generic issue
> for any request which implements .end_io.
>
> For requests without defining .end_io, rq->ref is applied for protecting
> its lifetime. However, rq->end_io() is still called even if rq->ref doesn't
> drop to zero.
>
> If the above is correct, we need to let rq->ref to cover rq->end_io().

We ignore the fact that we may also need to free 'rq' after calling 
rq->end_io(),
such as end_clone_request(), mq_flush_data_end_io().

If  we let 'rq->ref' to cover rq->end_io(), 'rq->ref' have been 
decreased to '0'
before calling __blk_mq_free_request(). Then, the function will never be 
called.

So, I think flush request may need to be fixed individually.

Thanks,
Yufen






^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
  2019-09-16  9:27           ` Yufen Yu
@ 2019-09-17  0:50             ` Ming Lei
  0 siblings, 0 replies; 10+ messages in thread
From: Ming Lei @ 2019-09-17  0:50 UTC (permalink / raw)
  To: Yufen Yu; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5

On Mon, Sep 16, 2019 at 05:27:39PM +0800, Yufen Yu wrote:
> 
> 
> On 2019/9/12 18:07, Ming Lei wrote:
> > On Thu, Sep 12, 2019 at 04:49:15PM +0800, Yufen Yu wrote:
> > > 
> > > On 2019/9/12 12:16, Ming Lei wrote:
> > > > On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
> > > > > On 2019/9/12 10:46, Ming Lei wrote:
> > > > > > On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
> > > > > > > There is a race condition between timeout check and completion for
> > > > > > > flush request as follow:
> > > > > > > 
> > > > > > > timeout_work    issue flush      issue flush
> > > > > > >                    blk_insert_flush
> > > > > > >                                     blk_insert_flush
> > > > > > > blk_mq_timeout_work
> > > > > > >                    blk_kick_flush
> > > > > > > 
> > > > > > > blk_mq_queue_tag_busy_iter
> > > > > > > blk_mq_check_expired(flush_rq)
> > > > > > > 
> > > > > > >                    __blk_mq_end_request
> > > > > > >                   flush_end_io
> > > > > > >                   blk_kick_flush
> > > > > > >                   blk_rq_init(flush_rq)
> > > > > > >                   memset(flush_rq, 0)
> > > > > > Not see there is memset(flush_rq, 0) in block/blk-flush.c
> > > > > Call path as follow:
> > > > > 
> > > > > blk_kick_flush
> > > > >       blk_rq_init
> > > > >           memset(rq, 0, sizeof(*rq));
> > > > Looks I miss this one in blk_rq_init(), sorry for that.
> > > > 
> > > > Given there are only two users of blk_rq_init(), one simple fix could be
> > > > not clearing queue in blk_rq_init(), something like below?
> > > > 
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 77807a5d7f9e..25e6a045c821 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
> > > >    void blk_rq_init(struct request_queue *q, struct request *rq)
> > > >    {
> > > > -	memset(rq, 0, sizeof(*rq));
> > > > +	const int offset = offsetof(struct request, q);
> > > > +
> > > > +	memset((void *)rq + offset, 0, sizeof(*rq) - offset);
> > > >    	INIT_LIST_HEAD(&rq->queuelist);
> > > >    	rq->q = q;
> > > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > > index 1ac790178787..382e71b8787d 100644
> > > > --- a/include/linux/blkdev.h
> > > > +++ b/include/linux/blkdev.h
> > > > @@ -130,7 +130,7 @@ enum mq_rq_state {
> > > >     * especially blk_mq_rq_ctx_init() to take care of the added fields.
> > > >     */
> > > >    struct request {
> > > > -	struct request_queue *q;
> > > > +	struct request_queue *q;	/* Must be the 1st field */
> > > >    	struct blk_mq_ctx *mq_ctx;
> > > >    	struct blk_mq_hw_ctx *mq_hctx;
> > > Not set req->q as '0' can just avoid BUG_ON for NULL pointer deference.
> > > 
> > > However, the root problem is that 'flush_rq' have been reused while
> > > timeout function handle it currently. That means mq_ops->timeout() may
> > > access old values remained by the last flush request and make the wrong
> > > decision.
> > > 
> > > Take the race condition in the patch as an example.
> > > 
> > > blk_mq_check_expired
> > >      blk_mq_rq_timed_out
> > >          req->q->mq_ops->timeout  // Driver timeout handle may read old data
> > >      refcount_dec_and_test(&rq)
> > >      __blk_mq_free_request   // If rq have been reset has '1' in
> > > blk_rq_init(), it will be free here.
> > > 
> > > So, I think we should solve this problem completely. Just like normal
> > > request,
> > > we can prevent flush request to call end_io when timeout handle the request.
> > Seems it isn't specific for 'flush_rq', and it should be one generic issue
> > for any request which implements .end_io.
> > 
> > For requests without defining .end_io, rq->ref is applied for protecting
> > its lifetime. However, rq->end_io() is still called even if rq->ref doesn't
> > drop to zero.
> > 
> > If the above is correct, we need to let rq->ref to cover rq->end_io().
> 
> We ignore the fact that we may also need to free 'rq' after calling
> rq->end_io(),
> such as end_clone_request(), mq_flush_data_end_io().
> 
> If  we let 'rq->ref' to cover rq->end_io(), 'rq->ref' have been decreased to
> '0'
> before calling __blk_mq_free_request(). Then, the function will never be
> called.
> 
> So, I think flush request may need to be fixed individually.

Thinking of this issue further, given other cases of .end_io() still
depends on blk_mq_free_request() for freeing request, it is fine
to just fix flush request.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, back to index

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-07 10:24 [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out() Yufen Yu
2019-09-12  2:46 ` Ming Lei
2019-09-12  3:29   ` Yufen Yu
2019-09-12  4:16     ` Ming Lei
2019-09-12  8:49       ` Yufen Yu
2019-09-12 10:07         ` Ming Lei
2019-09-16  2:40           ` Yufen Yu
2019-09-16  9:27           ` Yufen Yu
2019-09-17  0:50             ` Ming Lei
2019-09-12  8:59     ` Guoqing Jiang

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org linux-block@archiver.kernel.org
	public-inbox-index linux-block


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/ public-inbox