* [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
@ 2019-09-07 10:24 Yufen Yu
2019-09-12 2:46 ` Ming Lei
0 siblings, 1 reply; 10+ messages in thread
From: Yufen Yu @ 2019-09-07 10:24 UTC (permalink / raw)
To: axboe; +Cc: linux-block, hch, keith.busch, tj, yuyufen, zhangxiaoxu5
There is a race condition between timeout check and completion for
flush request as follow:
timeout_work issue flush issue flush
blk_insert_flush
blk_insert_flush
blk_mq_timeout_work
blk_kick_flush
blk_mq_queue_tag_busy_iter
blk_mq_check_expired(flush_rq)
__blk_mq_end_request
flush_end_io
blk_kick_flush
blk_rq_init(flush_rq)
memset(flush_rq, 0)
blk_mq_timed_out
BUG_ON flush_rq->q->mq_ops
For normal request, we need to get a tag and then allocate corresponding request.
Thus, the request cannot be reallocated before the tag have been free.
Commit 1d9bd5161ba ("blk-mq: replace timeout synchronization with a RCU and
generation based scheme") and commit 12f5b93145 ("blk-mq: Remove generation
seqeunce") can guarantee the consistency of timeout handle and completion.
However, 'flush_rq' have been forgotten. 'flush_rq' allocation management
dependents on flush implemention mechanism. Each hctx has only one 'flush_rq'.
When a flush request have completed, the next flush request will hold the 'flush_rq'.
In the end, timeout handle may access the cleared 'flush_rq'.
We fix this problem by checking request refcount 'rq->ref', as normal request.
If the refcount is not zero, flush_end_io() return and wait the last holder
recall it. To record the request status, we add a new entry 'rq_status',
which will be used in flush_end_io().
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
block/blk-flush.c | 8 ++++++++
block/blk-mq.c | 7 +++++--
block/blk.h | 5 +++++
include/linux/blkdev.h | 2 ++
4 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index aedd9320e605..359a7e1a0925 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -212,6 +212,14 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
struct blk_mq_hw_ctx *hctx;
+ if (!refcount_dec_and_test(&flush_rq->ref)) {
+ flush_rq->rq_status = error;
+ return;
+ }
+
+ if (flush_rq->rq_status != BLK_STS_OK)
+ error = flush_rq->rq_status;
+
/* release the tag's ownership to the req cloned from */
spin_lock_irqsave(&fq->mq_flush_lock, flags);
hctx = flush_rq->mq_hctx;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0835f4d8d42e..3d2b2c2e9cdf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -905,9 +905,12 @@ static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
*/
if (blk_mq_req_expired(rq, next))
blk_mq_rq_timed_out(rq, reserved);
- if (refcount_dec_and_test(&rq->ref))
- __blk_mq_free_request(rq);
+ if (is_flush_rq(rq, hctx)) {
+ rq->end_io(rq, 0);
+ } else if (refcount_dec_and_test(&rq->ref)) {
+ __blk_mq_free_request(rq);
+ }
return true;
}
diff --git a/block/blk.h b/block/blk.h
index de6b2e146d6e..f503ef9ad3e6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -47,6 +47,11 @@ static inline void __blk_get_queue(struct request_queue *q)
kobject_get(&q->kobj);
}
+static inline bool
+is_flush_rq(struct request *req, struct blk_mq_hw_ctx *hctx) {
+ return hctx->fq->flush_rq == req;
+}
+
struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
int node, int cmd_size, gfp_t flags);
void blk_free_flush_queue(struct blk_flush_queue *q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1ef375dafb1c..b1d05077e03f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -237,6 +237,8 @@ struct request {
*/
rq_end_io_fn *end_io;
void *end_io_data;
+
+ blk_status_t rq_status;
};
static inline bool blk_op_is_scsi(unsigned int op)
--
2.17.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
2019-09-07 10:24 [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out() Yufen Yu
@ 2019-09-12 2:46 ` Ming Lei
2019-09-12 3:29 ` Yufen Yu
0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2019-09-12 2:46 UTC (permalink / raw)
To: Yufen Yu; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5
On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
> There is a race condition between timeout check and completion for
> flush request as follow:
>
> timeout_work issue flush issue flush
> blk_insert_flush
> blk_insert_flush
> blk_mq_timeout_work
> blk_kick_flush
>
> blk_mq_queue_tag_busy_iter
> blk_mq_check_expired(flush_rq)
>
> __blk_mq_end_request
> flush_end_io
> blk_kick_flush
> blk_rq_init(flush_rq)
> memset(flush_rq, 0)
Not see there is memset(flush_rq, 0) in block/blk-flush.c
>
> blk_mq_timed_out
> BUG_ON flush_rq->q->mq_ops
flush_rq->q won't be changed by blk_rq_init(), and either READ or WRITE
on variable with machine WORD length are atomic, so how can the BUG_ON()
be triggered? Do you have the actual BUG log?
Also now it is driver's responsibility for avoiding race between normal
completion and timeout.
>
> For normal request, we need to get a tag and then allocate corresponding request.
> Thus, the request cannot be reallocated before the tag have been free.
> Commit 1d9bd5161ba ("blk-mq: replace timeout synchronization with a RCU and
> generation based scheme") and commit 12f5b93145 ("blk-mq: Remove generation
> seqeunce") can guarantee the consistency of timeout handle and completion.
>
> However, 'flush_rq' have been forgotten. 'flush_rq' allocation management
> dependents on flush implemention mechanism. Each hctx has only one 'flush_rq'.
> When a flush request have completed, the next flush request will hold the 'flush_rq'.
> In the end, timeout handle may access the cleared 'flush_rq'.
>
> We fix this problem by checking request refcount 'rq->ref', as normal request.
> If the refcount is not zero, flush_end_io() return and wait the last holder
> recall it. To record the request status, we add a new entry 'rq_status',
> which will be used in flush_end_io().
>
> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
> ---
> block/blk-flush.c | 8 ++++++++
> block/blk-mq.c | 7 +++++--
> block/blk.h | 5 +++++
> include/linux/blkdev.h | 2 ++
> 4 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index aedd9320e605..359a7e1a0925 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -212,6 +212,14 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
> struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
> struct blk_mq_hw_ctx *hctx;
>
> + if (!refcount_dec_and_test(&flush_rq->ref)) {
> + flush_rq->rq_status = error;
> + return;
> + }
> +
> + if (flush_rq->rq_status != BLK_STS_OK)
> + error = flush_rq->rq_status;
> +
> /* release the tag's ownership to the req cloned from */
> spin_lock_irqsave(&fq->mq_flush_lock, flags);
> hctx = flush_rq->mq_hctx;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0835f4d8d42e..3d2b2c2e9cdf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -905,9 +905,12 @@ static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> */
> if (blk_mq_req_expired(rq, next))
> blk_mq_rq_timed_out(rq, reserved);
> - if (refcount_dec_and_test(&rq->ref))
> - __blk_mq_free_request(rq);
>
> + if (is_flush_rq(rq, hctx)) {
> + rq->end_io(rq, 0);
> + } else if (refcount_dec_and_test(&rq->ref)) {
> + __blk_mq_free_request(rq);
> + }
> return true;
> }
>
> diff --git a/block/blk.h b/block/blk.h
> index de6b2e146d6e..f503ef9ad3e6 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -47,6 +47,11 @@ static inline void __blk_get_queue(struct request_queue *q)
> kobject_get(&q->kobj);
> }
>
> +static inline bool
> +is_flush_rq(struct request *req, struct blk_mq_hw_ctx *hctx) {
> + return hctx->fq->flush_rq == req;
> +}
> +
> struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
> int node, int cmd_size, gfp_t flags);
> void blk_free_flush_queue(struct blk_flush_queue *q);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1ef375dafb1c..b1d05077e03f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -237,6 +237,8 @@ struct request {
> */
> rq_end_io_fn *end_io;
> void *end_io_data;
> +
> + blk_status_t rq_status;
> };
'rq_status' is only for flush request, so it may be added to 'struct
blk_flush_queue' instead of 'struct request'.
Thanks,
Ming
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
2019-09-12 2:46 ` Ming Lei
@ 2019-09-12 3:29 ` Yufen Yu
2019-09-12 4:16 ` Ming Lei
2019-09-12 8:59 ` Guoqing Jiang
0 siblings, 2 replies; 10+ messages in thread
From: Yufen Yu @ 2019-09-12 3:29 UTC (permalink / raw)
To: Ming Lei; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5
On 2019/9/12 10:46, Ming Lei wrote:
> On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
>> There is a race condition between timeout check and completion for
>> flush request as follow:
>>
>> timeout_work issue flush issue flush
>> blk_insert_flush
>> blk_insert_flush
>> blk_mq_timeout_work
>> blk_kick_flush
>>
>> blk_mq_queue_tag_busy_iter
>> blk_mq_check_expired(flush_rq)
>>
>> __blk_mq_end_request
>> flush_end_io
>> blk_kick_flush
>> blk_rq_init(flush_rq)
>> memset(flush_rq, 0)
> Not see there is memset(flush_rq, 0) in block/blk-flush.c
Call path as follow:
blk_kick_flush
blk_rq_init
memset(rq, 0, sizeof(*rq));
>> blk_mq_timed_out
>> BUG_ON flush_rq->q->mq_ops
> flush_rq->q won't be changed by blk_rq_init(), and either READ or WRITE
> on variable with machine WORD length are atomic, so how can the BUG_ON()
> be triggered? Do you have the actual BUG log?
>
> Also now it is driver's responsibility for avoiding race between normal
> completion and timeout.
I have reproduced the bug by adding time delay in timeout handle and memset.
BUG_ON log as follow:
[ 108.825472] BUG: kernel NULL pointer dereference, address:
0000000000000040
[ 108.826091] #PF: supervisor read access in kernel mode
[ 108.826543] #PF: error_code(0x0000) - not-present page
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+
#431
[ 108.828326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.830439] Code: 01 e9 0a ff ff ff 48 83 05 34 45 dd 02 01 4c 39 63
40 0f 84 8a 00 00 00 0d 00 00 20 00 40 0f b6 f5 41 89 44 24 1c 49 8b 04
24 <48> 8b 40 40 48 8b 40 20 48 85 c0 0f 84 90 00 00 00 48 83 05 2f 44
[ 108.832246] RSP: 0018:ffffbf7ac18b7db0 EFLAGS: 00010206
[ 108.832756] RAX: 0000000000000000 RBX: ffffffffb56e0250 RCX:
0000000000000000
[ 108.833444] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff9ab7fbb96538
[ 108.834149] RBP: 0000000000000000 R08: 000000000000024b R09:
0000000000000030
[ 108.834840] R10: 000000000000004e R11: ffffbf7ac18b7c40 R12:
ffff9ab7f756e000
[ 108.835531] R13: ffffbf7ac18b7e70 R14: 0000000000000017 R15:
ffff9ab7f6ead0a0
[ 108.836228] FS: 0000000000000000(0000) GS:ffff9ab7fbb80000(0000)
knlGS:0000000000000000
[ 108.837026] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 108.837588] CR2: 0000000000000040 CR3: 000000013544c000 CR4:
00000000000006e0
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40
[ 108.843455] Modules linked in:
[ 108.843758] CR2: 0000000000000040
[ 108.844090] ---[ end trace e0ac552505fa1b95 ]---
blk_mq_rq_timed_out() attempt to read 'req->q->mq_ops->timeout', but 'q
== 0' currently,
which triggers BUG_ON.
>> For normal request, we need to get a tag and then allocate corresponding request.
>> Thus, the request cannot be reallocated before the tag have been free.
>> Commit 1d9bd5161ba ("blk-mq: replace timeout synchronization with a RCU and
>> generation based scheme") and commit 12f5b93145 ("blk-mq: Remove generation
>> seqeunce") can guarantee the consistency of timeout handle and completion.
>>
>> However, 'flush_rq' have been forgotten. 'flush_rq' allocation management
>> dependents on flush implemention mechanism. Each hctx has only one 'flush_rq'.
>> When a flush request have completed, the next flush request will hold the 'flush_rq'.
>> In the end, timeout handle may access the cleared 'flush_rq'.
>>
>> We fix this problem by checking request refcount 'rq->ref', as normal request.
>> If the refcount is not zero, flush_end_io() return and wait the last holder
>> recall it. To record the request status, we add a new entry 'rq_status',
>> which will be used in flush_end_io().
>>
>> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
>> ---
>> block/blk-flush.c | 8 ++++++++
>> block/blk-mq.c | 7 +++++--
>> block/blk.h | 5 +++++
>> include/linux/blkdev.h | 2 ++
>> 4 files changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-flush.c b/block/blk-flush.c
>> index aedd9320e605..359a7e1a0925 100644
>> --- a/block/blk-flush.c
>> +++ b/block/blk-flush.c
>> @@ -212,6 +212,14 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
>> struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
>> struct blk_mq_hw_ctx *hctx;
>>
>> + if (!refcount_dec_and_test(&flush_rq->ref)) {
>> + flush_rq->rq_status = error;
>> + return;
>> + }
>> +
>> + if (flush_rq->rq_status != BLK_STS_OK)
>> + error = flush_rq->rq_status;
>> +
>> /* release the tag's ownership to the req cloned from */
>> spin_lock_irqsave(&fq->mq_flush_lock, flags);
>> hctx = flush_rq->mq_hctx;
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 0835f4d8d42e..3d2b2c2e9cdf 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -905,9 +905,12 @@ static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>> */
>> if (blk_mq_req_expired(rq, next))
>> blk_mq_rq_timed_out(rq, reserved);
>> - if (refcount_dec_and_test(&rq->ref))
>> - __blk_mq_free_request(rq);
>>
>> + if (is_flush_rq(rq, hctx)) {
>> + rq->end_io(rq, 0);
>> + } else if (refcount_dec_and_test(&rq->ref)) {
>> + __blk_mq_free_request(rq);
>> + }
>> return true;
>> }
>>
>> diff --git a/block/blk.h b/block/blk.h
>> index de6b2e146d6e..f503ef9ad3e6 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -47,6 +47,11 @@ static inline void __blk_get_queue(struct request_queue *q)
>> kobject_get(&q->kobj);
>> }
>>
>> +static inline bool
>> +is_flush_rq(struct request *req, struct blk_mq_hw_ctx *hctx) {
>> + return hctx->fq->flush_rq == req;
>> +}
>> +
>> struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
>> int node, int cmd_size, gfp_t flags);
>> void blk_free_flush_queue(struct blk_flush_queue *q);
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 1ef375dafb1c..b1d05077e03f 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -237,6 +237,8 @@ struct request {
>> */
>> rq_end_io_fn *end_io;
>> void *end_io_data;
>> +
>> + blk_status_t rq_status;
>> };
> 'rq_status' is only for flush request, so it may be added to 'struct
> blk_flush_queue' instead of 'struct request'.
That's a good idea.
Thanks,
Yufen
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
2019-09-12 3:29 ` Yufen Yu
@ 2019-09-12 4:16 ` Ming Lei
2019-09-12 8:49 ` Yufen Yu
2019-09-12 8:59 ` Guoqing Jiang
1 sibling, 1 reply; 10+ messages in thread
From: Ming Lei @ 2019-09-12 4:16 UTC (permalink / raw)
To: Yufen Yu; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5
On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
>
>
> On 2019/9/12 10:46, Ming Lei wrote:
> > On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
> > > There is a race condition between timeout check and completion for
> > > flush request as follow:
> > >
> > > timeout_work issue flush issue flush
> > > blk_insert_flush
> > > blk_insert_flush
> > > blk_mq_timeout_work
> > > blk_kick_flush
> > >
> > > blk_mq_queue_tag_busy_iter
> > > blk_mq_check_expired(flush_rq)
> > >
> > > __blk_mq_end_request
> > > flush_end_io
> > > blk_kick_flush
> > > blk_rq_init(flush_rq)
> > > memset(flush_rq, 0)
> > Not see there is memset(flush_rq, 0) in block/blk-flush.c
>
> Call path as follow:
>
> blk_kick_flush
> blk_rq_init
> memset(rq, 0, sizeof(*rq));
Looks I miss this one in blk_rq_init(), sorry for that.
Given there are only two users of blk_rq_init(), one simple fix could be
not clearing queue in blk_rq_init(), something like below?
diff --git a/block/blk-core.c b/block/blk-core.c
index 77807a5d7f9e..25e6a045c821 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
void blk_rq_init(struct request_queue *q, struct request *rq)
{
- memset(rq, 0, sizeof(*rq));
+ const int offset = offsetof(struct request, q);
+
+ memset((void *)rq + offset, 0, sizeof(*rq) - offset);
INIT_LIST_HEAD(&rq->queuelist);
rq->q = q;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1ac790178787..382e71b8787d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -130,7 +130,7 @@ enum mq_rq_state {
* especially blk_mq_rq_ctx_init() to take care of the added fields.
*/
struct request {
- struct request_queue *q;
+ struct request_queue *q; /* Must be the 1st field */
struct blk_mq_ctx *mq_ctx;
struct blk_mq_hw_ctx *mq_hctx;
Thanks,
Ming
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
2019-09-12 4:16 ` Ming Lei
@ 2019-09-12 8:49 ` Yufen Yu
2019-09-12 10:07 ` Ming Lei
0 siblings, 1 reply; 10+ messages in thread
From: Yufen Yu @ 2019-09-12 8:49 UTC (permalink / raw)
To: Ming Lei; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5
On 2019/9/12 12:16, Ming Lei wrote:
> On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
>>
>> On 2019/9/12 10:46, Ming Lei wrote:
>>> On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
>>>> There is a race condition between timeout check and completion for
>>>> flush request as follow:
>>>>
>>>> timeout_work issue flush issue flush
>>>> blk_insert_flush
>>>> blk_insert_flush
>>>> blk_mq_timeout_work
>>>> blk_kick_flush
>>>>
>>>> blk_mq_queue_tag_busy_iter
>>>> blk_mq_check_expired(flush_rq)
>>>>
>>>> __blk_mq_end_request
>>>> flush_end_io
>>>> blk_kick_flush
>>>> blk_rq_init(flush_rq)
>>>> memset(flush_rq, 0)
>>> Not see there is memset(flush_rq, 0) in block/blk-flush.c
>> Call path as follow:
>>
>> blk_kick_flush
>> blk_rq_init
>> memset(rq, 0, sizeof(*rq));
> Looks I miss this one in blk_rq_init(), sorry for that.
>
> Given there are only two users of blk_rq_init(), one simple fix could be
> not clearing queue in blk_rq_init(), something like below?
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 77807a5d7f9e..25e6a045c821 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
>
> void blk_rq_init(struct request_queue *q, struct request *rq)
> {
> - memset(rq, 0, sizeof(*rq));
> + const int offset = offsetof(struct request, q);
> +
> + memset((void *)rq + offset, 0, sizeof(*rq) - offset);
>
> INIT_LIST_HEAD(&rq->queuelist);
> rq->q = q;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1ac790178787..382e71b8787d 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -130,7 +130,7 @@ enum mq_rq_state {
> * especially blk_mq_rq_ctx_init() to take care of the added fields.
> */
> struct request {
> - struct request_queue *q;
> + struct request_queue *q; /* Must be the 1st field */
> struct blk_mq_ctx *mq_ctx;
> struct blk_mq_hw_ctx *mq_hctx;
Not set req->q as '0' can just avoid BUG_ON for NULL pointer deference.
However, the root problem is that 'flush_rq' have been reused while
timeout function handle it currently. That means mq_ops->timeout() may
access old values remained by the last flush request and make the wrong
decision.
Take the race condition in the patch as an example.
blk_mq_check_expired
blk_mq_rq_timed_out
req->q->mq_ops->timeout // Driver timeout handle may read old data
refcount_dec_and_test(&rq)
__blk_mq_free_request // If rq have been reset has '1' in
blk_rq_init(), it will be free here.
So, I think we should solve this problem completely. Just like normal
request,
we can prevent flush request to call end_io when timeout handle the request.
Thanks,
Yufen
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
2019-09-12 3:29 ` Yufen Yu
2019-09-12 4:16 ` Ming Lei
@ 2019-09-12 8:59 ` Guoqing Jiang
1 sibling, 0 replies; 10+ messages in thread
From: Guoqing Jiang @ 2019-09-12 8:59 UTC (permalink / raw)
To: Yufen Yu, Ming Lei; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5
On 9/12/19 5:29 AM, Yufen Yu wrote:
>
>
> On 2019/9/12 10:46, Ming Lei wrote:
>> On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
>>> There is a race condition between timeout check and completion for
>>> flush request as follow:
>>>
>>> timeout_work issue flush issue flush
>>> blk_insert_flush
>>> blk_insert_flush
>>> blk_mq_timeout_work
>>> blk_kick_flush
>>>
>>> blk_mq_queue_tag_busy_iter
>>> blk_mq_check_expired(flush_rq)
>>>
>>> __blk_mq_end_request
>>> flush_end_io
>>> blk_kick_flush
>>> blk_rq_init(flush_rq)
>>> memset(flush_rq, 0)
>> Not see there is memset(flush_rq, 0) in block/blk-flush.c
>
> Call path as follow:
>
> blk_kick_flush
> blk_rq_init
> memset(rq, 0, sizeof(*rq));
>
>>> blk_mq_timed_out
>>> BUG_ON flush_rq->q->mq_ops
>> flush_rq->q won't be changed by blk_rq_init(), and either READ or WRITE
>> on variable with machine WORD length are atomic, so how can the BUG_ON()
>> be triggered? Do you have the actual BUG log?
>>
>> Also now it is driver's responsibility for avoiding race between normal
>> completion and timeout.
>
> I have reproduced the bug by adding time delay in timeout handle and
> memset.
> BUG_ON log as follow:
>
> [ 108.825472] BUG: kernel NULL pointer dereference, address:
> 0000000000000040
> [ 108.826091] #PF: supervisor read access in kernel mode
> [ 108.826543] #PF: error_code(0x0000) - not-present page
> [ 108.827059] PGD 0 P4D 0
> [ 108.827313] Oops: 0000 [#1] SMP PTI
> [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted
> 5.3.0-rc8+ #431
> [ 108.828326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28
> 04/01/2014
> [ 108.829503] Workqueue: kblockd blk_mq_timeout_work
> [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
> [ 108.830439] Code: 01 e9 0a ff ff ff 48 83 05 34 45 dd 02 01 4c 39
> 63 40 0f 84 8a 00 00 00 0d 00 00 20 00 40 0f b6 f5 41 89 44 24 1c 49
> 8b 04 24 <48> 8b 40 40 48 8b 40 20 48 85 c0 0f 84 90 00 00 00 48 83 05
> 2f 44
> [ 108.832246] RSP: 0018:ffffbf7ac18b7db0 EFLAGS: 00010206
> [ 108.832756] RAX: 0000000000000000 RBX: ffffffffb56e0250 RCX:
> 0000000000000000
> [ 108.833444] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ab7fbb96538
> [ 108.834149] RBP: 0000000000000000 R08: 000000000000024b R09:
> 0000000000000030
> [ 108.834840] R10: 000000000000004e R11: ffffbf7ac18b7c40 R12:
> ffff9ab7f756e000
> [ 108.835531] R13: ffffbf7ac18b7e70 R14: 0000000000000017 R15:
> ffff9ab7f6ead0a0
> [ 108.836228] FS: 0000000000000000(0000) GS:ffff9ab7fbb80000(0000)
> knlGS:0000000000000000
> [ 108.837026] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 108.837588] CR2: 0000000000000040 CR3: 000000013544c000 CR4:
> 00000000000006e0
> [ 108.838191] Call Trace:
> [ 108.838406] bt_iter+0x74/0x80
> [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
> [ 108.839074] ? __switch_to_asm+0x34/0x70
> [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
> [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
> [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
> [ 108.840732] blk_mq_timeout_work+0x74/0x200
> [ 108.841151] process_one_work+0x297/0x680
> [ 108.841550] worker_thread+0x29c/0x6f0
> [ 108.841926] ? rescuer_thread+0x580/0x580
> [ 108.842344] kthread+0x16a/0x1a0
> [ 108.842666] ? kthread_flush_work+0x170/0x170
> [ 108.843100] ret_from_fork+0x35/0x40
> [ 108.843455] Modules linked in:
> [ 108.843758] CR2: 0000000000000040
> [ 108.844090] ---[ end trace e0ac552505fa1b95 ]---
>
> blk_mq_rq_timed_out() attempt to read 'req->q->mq_ops->timeout', but
> 'q == 0' currently,
> which triggers BUG_ON.
We have a similar calltrace which happened in older kernel (4.4.62), not
sure if it is the same one.
[32353526.224059] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G O 4.4.62-1-storage #4.4.62-1.3
[32353526.224343] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
[...]
[32353526.224840] RIP: 0010:[<ffffffff812df5a1>] [<ffffffff812df5a1>] blk_mq_rq_timed_out+0x11/0x70
[32353526.285015] [<ffffffff812df63d>] blk_mq_check_expired+0x3d/0x60
[32353526.301997] [<ffffffff812e1f74>] bt_for_each+0xd4/0xe0
[32353526.310730] [<ffffffff812df600>] ? blk_mq_rq_timed_out+0x70/0x70
[32353526.319579] [<ffffffff812df600>] ? blk_mq_rq_timed_out+0x70/0x70
[32353526.328329] [<ffffffff812e2753>] blk_mq_queue_tag_busy_iter+0x43/0xc0
[32353526.336995] [<ffffffff812de7c0>] ? blk_mq_bio_to_request+0x40/0x40
[32353526.345566] [<ffffffff812de7f2>] blk_mq_rq_timer+0x32/0xd0
[32353526.354094] [<ffffffff810b2e45>] call_timer_fn+0x35/0x130
[32353526.362542] [<ffffffff812de7c0>] ? blk_mq_bio_to_request+0x40/0x40
[32353526.370894] [<ffffffff810b31b7>] run_timer_softirq+0x157/0x280
Thanks,
Guoqing
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
2019-09-12 8:49 ` Yufen Yu
@ 2019-09-12 10:07 ` Ming Lei
2019-09-16 2:40 ` Yufen Yu
2019-09-16 9:27 ` Yufen Yu
0 siblings, 2 replies; 10+ messages in thread
From: Ming Lei @ 2019-09-12 10:07 UTC (permalink / raw)
To: Yufen Yu; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5
On Thu, Sep 12, 2019 at 04:49:15PM +0800, Yufen Yu wrote:
>
>
> On 2019/9/12 12:16, Ming Lei wrote:
> > On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
> > >
> > > On 2019/9/12 10:46, Ming Lei wrote:
> > > > On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
> > > > > There is a race condition between timeout check and completion for
> > > > > flush request as follow:
> > > > >
> > > > > timeout_work issue flush issue flush
> > > > > blk_insert_flush
> > > > > blk_insert_flush
> > > > > blk_mq_timeout_work
> > > > > blk_kick_flush
> > > > >
> > > > > blk_mq_queue_tag_busy_iter
> > > > > blk_mq_check_expired(flush_rq)
> > > > >
> > > > > __blk_mq_end_request
> > > > > flush_end_io
> > > > > blk_kick_flush
> > > > > blk_rq_init(flush_rq)
> > > > > memset(flush_rq, 0)
> > > > Not see there is memset(flush_rq, 0) in block/blk-flush.c
> > > Call path as follow:
> > >
> > > blk_kick_flush
> > > blk_rq_init
> > > memset(rq, 0, sizeof(*rq));
> > Looks I miss this one in blk_rq_init(), sorry for that.
> >
> > Given there are only two users of blk_rq_init(), one simple fix could be
> > not clearing queue in blk_rq_init(), something like below?
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 77807a5d7f9e..25e6a045c821 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
> > void blk_rq_init(struct request_queue *q, struct request *rq)
> > {
> > - memset(rq, 0, sizeof(*rq));
> > + const int offset = offsetof(struct request, q);
> > +
> > + memset((void *)rq + offset, 0, sizeof(*rq) - offset);
> > INIT_LIST_HEAD(&rq->queuelist);
> > rq->q = q;
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 1ac790178787..382e71b8787d 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -130,7 +130,7 @@ enum mq_rq_state {
> > * especially blk_mq_rq_ctx_init() to take care of the added fields.
> > */
> > struct request {
> > - struct request_queue *q;
> > + struct request_queue *q; /* Must be the 1st field */
> > struct blk_mq_ctx *mq_ctx;
> > struct blk_mq_hw_ctx *mq_hctx;
>
> Not set req->q as '0' can just avoid BUG_ON for NULL pointer deference.
>
> However, the root problem is that 'flush_rq' have been reused while
> timeout function handle it currently. That means mq_ops->timeout() may
> access old values remained by the last flush request and make the wrong
> decision.
>
> Take the race condition in the patch as an example.
>
> blk_mq_check_expired
> blk_mq_rq_timed_out
> req->q->mq_ops->timeout // Driver timeout handle may read old data
> refcount_dec_and_test(&rq)
> __blk_mq_free_request // If rq have been reset has '1' in
> blk_rq_init(), it will be free here.
>
> So, I think we should solve this problem completely. Just like normal
> request,
> we can prevent flush request to call end_io when timeout handle the request.
Seems it isn't specific for 'flush_rq', and it should be one generic issue
for any request which implements .end_io.
For requests without defining .end_io, rq->ref is applied for protecting
its lifetime. However, rq->end_io() is still called even if rq->ref doesn't
drop to zero.
If the above is correct, we need to let rq->ref to cover rq->end_io().
Thanks,
Ming
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
2019-09-12 10:07 ` Ming Lei
@ 2019-09-16 2:40 ` Yufen Yu
2019-09-16 9:27 ` Yufen Yu
1 sibling, 0 replies; 10+ messages in thread
From: Yufen Yu @ 2019-09-16 2:40 UTC (permalink / raw)
To: Ming Lei; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5
On 2019/9/12 18:07, Ming Lei wrote:
> On Thu, Sep 12, 2019 at 04:49:15PM +0800, Yufen Yu wrote:
>>
>> On 2019/9/12 12:16, Ming Lei wrote:
>>> On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
>>>> On 2019/9/12 10:46, Ming Lei wrote:
>>>>> On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
>>>>>> There is a race condition between timeout check and completion for
>>>>>> flush request as follow:
>>>>>>
>>>>>> timeout_work issue flush issue flush
>>>>>> blk_insert_flush
>>>>>> blk_insert_flush
>>>>>> blk_mq_timeout_work
>>>>>> blk_kick_flush
>>>>>>
>>>>>> blk_mq_queue_tag_busy_iter
>>>>>> blk_mq_check_expired(flush_rq)
>>>>>>
>>>>>> __blk_mq_end_request
>>>>>> flush_end_io
>>>>>> blk_kick_flush
>>>>>> blk_rq_init(flush_rq)
>>>>>> memset(flush_rq, 0)
>>>>> Not see there is memset(flush_rq, 0) in block/blk-flush.c
>>>> Call path as follow:
>>>>
>>>> blk_kick_flush
>>>> blk_rq_init
>>>> memset(rq, 0, sizeof(*rq));
>>> Looks I miss this one in blk_rq_init(), sorry for that.
>>>
>>> Given there are only two users of blk_rq_init(), one simple fix could be
>>> not clearing queue in blk_rq_init(), something like below?
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 77807a5d7f9e..25e6a045c821 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
>>> void blk_rq_init(struct request_queue *q, struct request *rq)
>>> {
>>> - memset(rq, 0, sizeof(*rq));
>>> + const int offset = offsetof(struct request, q);
>>> +
>>> + memset((void *)rq + offset, 0, sizeof(*rq) - offset);
>>> INIT_LIST_HEAD(&rq->queuelist);
>>> rq->q = q;
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 1ac790178787..382e71b8787d 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -130,7 +130,7 @@ enum mq_rq_state {
>>> * especially blk_mq_rq_ctx_init() to take care of the added fields.
>>> */
>>> struct request {
>>> - struct request_queue *q;
>>> + struct request_queue *q; /* Must be the 1st field */
>>> struct blk_mq_ctx *mq_ctx;
>>> struct blk_mq_hw_ctx *mq_hctx;
>> Not set req->q as '0' can just avoid BUG_ON for NULL pointer deference.
>>
>> However, the root problem is that 'flush_rq' have been reused while
>> timeout function handle it currently. That means mq_ops->timeout() may
>> access old values remained by the last flush request and make the wrong
>> decision.
>>
>> Take the race condition in the patch as an example.
>>
>> blk_mq_check_expired
>> blk_mq_rq_timed_out
>> req->q->mq_ops->timeout // Driver timeout handle may read old data
>> refcount_dec_and_test(&rq)
>> __blk_mq_free_request // If rq have been reset has '1' in
>> blk_rq_init(), it will be free here.
>>
>> So, I think we should solve this problem completely. Just like normal
>> request,
>> we can prevent flush request to call end_io when timeout handle the request.
> Seems it isn't specific for 'flush_rq', and it should be one generic issue
> for any request which implements .end_io.
>
> For requests without defining .end_io, rq->ref is applied for protecting
> its lifetime. However, rq->end_io() is still called even if rq->ref doesn't
> drop to zero.
>
> If the above is correct, we need to let rq->ref to cover rq->end_io().
Thanks for catching what I have ignored. I am trying to fix the problem.
Thanks,
Yufen
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
2019-09-12 10:07 ` Ming Lei
2019-09-16 2:40 ` Yufen Yu
@ 2019-09-16 9:27 ` Yufen Yu
2019-09-17 0:50 ` Ming Lei
1 sibling, 1 reply; 10+ messages in thread
From: Yufen Yu @ 2019-09-16 9:27 UTC (permalink / raw)
To: Ming Lei; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5
On 2019/9/12 18:07, Ming Lei wrote:
> On Thu, Sep 12, 2019 at 04:49:15PM +0800, Yufen Yu wrote:
>>
>> On 2019/9/12 12:16, Ming Lei wrote:
>>> On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
>>>> On 2019/9/12 10:46, Ming Lei wrote:
>>>>> On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
>>>>>> There is a race condition between timeout check and completion for
>>>>>> flush request as follow:
>>>>>>
>>>>>> timeout_work issue flush issue flush
>>>>>> blk_insert_flush
>>>>>> blk_insert_flush
>>>>>> blk_mq_timeout_work
>>>>>> blk_kick_flush
>>>>>>
>>>>>> blk_mq_queue_tag_busy_iter
>>>>>> blk_mq_check_expired(flush_rq)
>>>>>>
>>>>>> __blk_mq_end_request
>>>>>> flush_end_io
>>>>>> blk_kick_flush
>>>>>> blk_rq_init(flush_rq)
>>>>>> memset(flush_rq, 0)
>>>>> Not see there is memset(flush_rq, 0) in block/blk-flush.c
>>>> Call path as follow:
>>>>
>>>> blk_kick_flush
>>>> blk_rq_init
>>>> memset(rq, 0, sizeof(*rq));
>>> Looks I miss this one in blk_rq_init(), sorry for that.
>>>
>>> Given there are only two users of blk_rq_init(), one simple fix could be
>>> not clearing queue in blk_rq_init(), something like below?
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 77807a5d7f9e..25e6a045c821 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
>>> void blk_rq_init(struct request_queue *q, struct request *rq)
>>> {
>>> - memset(rq, 0, sizeof(*rq));
>>> + const int offset = offsetof(struct request, q);
>>> +
>>> + memset((void *)rq + offset, 0, sizeof(*rq) - offset);
>>> INIT_LIST_HEAD(&rq->queuelist);
>>> rq->q = q;
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 1ac790178787..382e71b8787d 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -130,7 +130,7 @@ enum mq_rq_state {
>>> * especially blk_mq_rq_ctx_init() to take care of the added fields.
>>> */
>>> struct request {
>>> - struct request_queue *q;
>>> + struct request_queue *q; /* Must be the 1st field */
>>> struct blk_mq_ctx *mq_ctx;
>>> struct blk_mq_hw_ctx *mq_hctx;
>> Not set req->q as '0' can just avoid BUG_ON for NULL pointer deference.
>>
>> However, the root problem is that 'flush_rq' have been reused while
>> timeout function handle it currently. That means mq_ops->timeout() may
>> access old values remained by the last flush request and make the wrong
>> decision.
>>
>> Take the race condition in the patch as an example.
>>
>> blk_mq_check_expired
>> blk_mq_rq_timed_out
>> req->q->mq_ops->timeout // Driver timeout handle may read old data
>> refcount_dec_and_test(&rq)
>> __blk_mq_free_request // If rq have been reset has '1' in
>> blk_rq_init(), it will be free here.
>>
>> So, I think we should solve this problem completely. Just like normal
>> request,
>> we can prevent flush request to call end_io when timeout handle the request.
> Seems it isn't specific for 'flush_rq', and it should be one generic issue
> for any request which implements .end_io.
>
> For requests without defining .end_io, rq->ref is applied for protecting
> its lifetime. However, rq->end_io() is still called even if rq->ref doesn't
> drop to zero.
>
> If the above is correct, we need to let rq->ref to cover rq->end_io().
We ignore the fact that we may also need to free 'rq' after calling
rq->end_io(),
such as end_clone_request(), mq_flush_data_end_io().
If we let 'rq->ref' to cover rq->end_io(), 'rq->ref' have been
decreased to '0'
before calling __blk_mq_free_request(). Then, the function will never be
called.
So, I think flush request may need to be fixed individually.
Thanks,
Yufen
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out()
2019-09-16 9:27 ` Yufen Yu
@ 2019-09-17 0:50 ` Ming Lei
0 siblings, 0 replies; 10+ messages in thread
From: Ming Lei @ 2019-09-17 0:50 UTC (permalink / raw)
To: Yufen Yu; +Cc: axboe, linux-block, hch, keith.busch, tj, zhangxiaoxu5
On Mon, Sep 16, 2019 at 05:27:39PM +0800, Yufen Yu wrote:
>
>
> On 2019/9/12 18:07, Ming Lei wrote:
> > On Thu, Sep 12, 2019 at 04:49:15PM +0800, Yufen Yu wrote:
> > >
> > > On 2019/9/12 12:16, Ming Lei wrote:
> > > > On Thu, Sep 12, 2019 at 11:29:18AM +0800, Yufen Yu wrote:
> > > > > On 2019/9/12 10:46, Ming Lei wrote:
> > > > > > On Sat, Sep 07, 2019 at 06:24:50PM +0800, Yufen Yu wrote:
> > > > > > > There is a race condition between timeout check and completion for
> > > > > > > flush request as follow:
> > > > > > >
> > > > > > > timeout_work issue flush issue flush
> > > > > > > blk_insert_flush
> > > > > > > blk_insert_flush
> > > > > > > blk_mq_timeout_work
> > > > > > > blk_kick_flush
> > > > > > >
> > > > > > > blk_mq_queue_tag_busy_iter
> > > > > > > blk_mq_check_expired(flush_rq)
> > > > > > >
> > > > > > > __blk_mq_end_request
> > > > > > > flush_end_io
> > > > > > > blk_kick_flush
> > > > > > > blk_rq_init(flush_rq)
> > > > > > > memset(flush_rq, 0)
> > > > > > Not see there is memset(flush_rq, 0) in block/blk-flush.c
> > > > > Call path as follow:
> > > > >
> > > > > blk_kick_flush
> > > > > blk_rq_init
> > > > > memset(rq, 0, sizeof(*rq));
> > > > Looks I miss this one in blk_rq_init(), sorry for that.
> > > >
> > > > Given there are only two users of blk_rq_init(), one simple fix could be
> > > > not clearing queue in blk_rq_init(), something like below?
> > > >
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 77807a5d7f9e..25e6a045c821 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -107,7 +107,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
> > > > void blk_rq_init(struct request_queue *q, struct request *rq)
> > > > {
> > > > - memset(rq, 0, sizeof(*rq));
> > > > + const int offset = offsetof(struct request, q);
> > > > +
> > > > + memset((void *)rq + offset, 0, sizeof(*rq) - offset);
> > > > INIT_LIST_HEAD(&rq->queuelist);
> > > > rq->q = q;
> > > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > > index 1ac790178787..382e71b8787d 100644
> > > > --- a/include/linux/blkdev.h
> > > > +++ b/include/linux/blkdev.h
> > > > @@ -130,7 +130,7 @@ enum mq_rq_state {
> > > > * especially blk_mq_rq_ctx_init() to take care of the added fields.
> > > > */
> > > > struct request {
> > > > - struct request_queue *q;
> > > > + struct request_queue *q; /* Must be the 1st field */
> > > > struct blk_mq_ctx *mq_ctx;
> > > > struct blk_mq_hw_ctx *mq_hctx;
> > > Not set req->q as '0' can just avoid BUG_ON for NULL pointer deference.
> > >
> > > However, the root problem is that 'flush_rq' have been reused while
> > > timeout function handle it currently. That means mq_ops->timeout() may
> > > access old values remained by the last flush request and make the wrong
> > > decision.
> > >
> > > Take the race condition in the patch as an example.
> > >
> > > blk_mq_check_expired
> > > blk_mq_rq_timed_out
> > > req->q->mq_ops->timeout // Driver timeout handle may read old data
> > > refcount_dec_and_test(&rq)
> > > __blk_mq_free_request // If rq have been reset has '1' in
> > > blk_rq_init(), it will be free here.
> > >
> > > So, I think we should solve this problem completely. Just like normal
> > > request,
> > > we can prevent flush request to call end_io when timeout handle the request.
> > Seems it isn't specific for 'flush_rq', and it should be one generic issue
> > for any request which implements .end_io.
> >
> > For requests without defining .end_io, rq->ref is applied for protecting
> > its lifetime. However, rq->end_io() is still called even if rq->ref doesn't
> > drop to zero.
> >
> > If the above is correct, we need to let rq->ref to cover rq->end_io().
>
> We ignore the fact that we may also need to free 'rq' after calling
> rq->end_io(),
> such as end_clone_request(), mq_flush_data_end_io().
>
> If we let 'rq->ref' to cover rq->end_io(), 'rq->ref' have been decreased to
> '0'
> before calling __blk_mq_free_request(). Then, the function will never be
> called.
>
> So, I think flush request may need to be fixed individually.
Thinking of this issue further, given other cases of .end_io() still
depends on blk_mq_free_request() for freeing request, it is fine
to just fix flush request.
Thanks,
Ming
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2019-09-17 0:50 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-07 10:24 [PATCH] block: fix null pointer dereference in blk_mq_rq_timed_out() Yufen Yu
2019-09-12 2:46 ` Ming Lei
2019-09-12 3:29 ` Yufen Yu
2019-09-12 4:16 ` Ming Lei
2019-09-12 8:49 ` Yufen Yu
2019-09-12 10:07 ` Ming Lei
2019-09-16 2:40 ` Yufen Yu
2019-09-16 9:27 ` Yufen Yu
2019-09-17 0:50 ` Ming Lei
2019-09-12 8:59 ` Guoqing Jiang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).