* Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
From: Ming Lei @ 2018-12-07 8:24 UTC
To: Jens Axboe; +Cc: linux-block, Mike Snitzer, Bart Van Assche
On Thu, Dec 06, 2018 at 10:17:44PM -0700, Jens Axboe wrote:
> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non-read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for blk_insert_cloned_request(), which only DM uses to
> bypass the bottom-level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed. This results in a
> livelock.
>
> Instead of making special cases for what we can direct issue, and now
> having to deal with DM solving the livelock while still retaining a BUSY
> condition feedback loop, always just add a request that has been through
> ->queue_rq() to the hardware queue dispatch list. These are safe to add
> there, as no merging can take place on that list. Additionally, requests
> that do have prepped data from drivers never reach the IO scheduler lists
> this way, so we no longer depend on that data not sharing space in the
> request structure.
>
> This basically reverts ffe81d45322c and is based on a patch from Ming,
> but with the list insert case covered as well.
>
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Cc: stable@vger.kernel.org
> Suggested-by: Ming Lei <ming.lei@redhat.com>
> Reported-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>
> I've thrown the initial hang test reported by Bart at it, works fine.
> My reproducer for the corruption case is also happy, as expected.
>
> I'm running blktests and xfstests on it overnight. If that passes as
> expected, this quells my initial worries about using ->dispatch as a
> holding place for these types of requests.
>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3262d83b9e07..6a7566244de3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1715,15 +1715,6 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>  		break;
>  	case BLK_STS_RESOURCE:
>  	case BLK_STS_DEV_RESOURCE:
> -		/*
> -		 * If direct dispatch fails, we cannot allow any merging on
> -		 * this IO. Drivers (like SCSI) may have set up permanent state
> -		 * for this request, like SG tables and mappings, and if we
> -		 * merge to it later on then we'll still only do IO to the
> -		 * original part.
> -		 */
> -		rq->cmd_flags |= REQ_NOMERGE;
> -
>  		blk_mq_update_dispatch_busy(hctx, true);
>  		__blk_mq_requeue_request(rq);
>  		break;
> @@ -1736,18 +1727,6 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>  	return ret;
>  }
>  
> -/*
> - * Don't allow direct dispatch of anything but regular reads/writes,
> - * as some of the other commands can potentially share request space
> - * with data we need for the IO scheduler. If we attempt a direct dispatch
> - * on those and fail, we can't safely add it to the scheduler afterwards
> - * without potentially overwriting data that the driver has already written.
> - */
> -static bool blk_rq_can_direct_dispatch(struct request *rq)
> -{
> -	return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE;
> -}
> -
>  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  						struct request *rq,
>  						blk_qc_t *cookie,
> @@ -1769,7 +1748,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  		goto insert;
>  	}
>  
> -	if (!blk_rq_can_direct_dispatch(rq) || (q->elevator && !bypass_insert))
> +	if (q->elevator && !bypass_insert)
>  		goto insert;
>  
>  	if (!blk_mq_get_dispatch_budget(hctx))
> @@ -1785,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  	if (bypass_insert)
>  		return BLK_STS_RESOURCE;
>  
> -	blk_mq_sched_insert_request(rq, false, run_queue, false);
> +	blk_mq_request_bypass_insert(rq, run_queue);
>  	return BLK_STS_OK;
>  }
>  
> @@ -1801,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  
>  	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
>  	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
> -		blk_mq_sched_insert_request(rq, false, true, false);
> +		blk_mq_request_bypass_insert(rq, true);
>  	else if (ret != BLK_STS_OK)
>  		blk_mq_end_request(rq, ret);
>  
> @@ -1831,15 +1810,13 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
>  		struct request *rq = list_first_entry(list, struct request,
>  				queuelist);
>  
> -		if (!blk_rq_can_direct_dispatch(rq))
> -			break;
> -
>  		list_del_init(&rq->queuelist);
>  		ret = blk_mq_request_issue_directly(rq);
>  		if (ret != BLK_STS_OK) {
>  			if (ret == BLK_STS_RESOURCE ||
>  					ret == BLK_STS_DEV_RESOURCE) {
> -				list_add(&rq->queuelist, list);
> +				blk_mq_request_bypass_insert(rq,
> +						list_empty(list));
>  				break;
>  			}
>  			blk_mq_end_request(rq, ret);
Looks fine, I will run my test with this patch first.
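
Btw, for anyone following the thread: blk_mq_request_bypass_insert() just
parks the request on the hctx dispatch list and optionally kicks the queue,
so nothing can ever merge with it there. Roughly like below (a simplified
sketch of the current helper, not the verbatim source):

void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
{
	struct blk_mq_ctx *ctx = rq->mq_ctx;
	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);

	/* no elevator involved: straight onto the hctx dispatch list */
	spin_lock(&hctx->lock);
	list_add_tail(&rq->queuelist, &hctx->dispatch);
	spin_unlock(&hctx->lock);

	if (run_queue)
		blk_mq_run_hw_queue(hctx, false);
}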
thanks,
Ming
* Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
From: Ming Lei @ 2018-12-07 10:50 UTC
To: Jens Axboe; +Cc: linux-block, Mike Snitzer, Bart Van Assche
On Thu, Dec 06, 2018 at 10:17:44PM -0700, Jens Axboe wrote:
> [ full patch snipped ]
Tested-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
* Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
From: Mike Snitzer @ 2018-12-07 15:15 UTC
To: Jens Axboe; +Cc: linux-block, Bart Van Assche, Ming Lei
On Fri, Dec 07 2018 at 12:17am -0500,
Jens Axboe <axboe@kernel.dk> wrote:
> [ patch description snipped ]
Looks good, thanks!
Acked-by: Mike Snitzer <snitzer@redhat.com>
* Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
From: Bart Van Assche @ 2018-12-07 16:19 UTC
To: Jens Axboe, linux-block; +Cc: Mike Snitzer, Ming Lei
On Thu, 2018-12-06 at 22:17 -0700, Jens Axboe wrote:
> Instead of making special cases for what we can direct issue, and now
> having to deal with DM solving the livelock while still retaining a BUSY
> condition feedback loop, always just add a request that has been through
> ->queue_rq() to the hardware queue dispatch list. These are safe to add
> there, as no merging can take place on that list. Additionally, requests
> that do have prepped data from drivers never reach the IO scheduler lists
> this way, so we no longer depend on that data not sharing space in the
> request structure.
How about making blk_mq_sched_insert_request() complain if it is passed a
request that has the RQF_DONTPREP flag set, to avoid this problem being
reintroduced in the future? Otherwise this patch looks fine to me.
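
Something along these lines, at the top of blk_mq_sched_insert_request(), is
what I have in mind (untested, just to illustrate the idea):

	/* already been through ->queue_rq(), keep it off the elevator */
	if (WARN_ON_ONCE(rq->rq_flags & RQF_DONTPREP)) {
		blk_mq_request_bypass_insert(rq, false);
		return;
	}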
Bart.
* Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
From: Jens Axboe @ 2018-12-07 16:24 UTC
To: Bart Van Assche, linux-block; +Cc: Mike Snitzer, Ming Lei
On 12/7/18 9:19 AM, Bart Van Assche wrote:
> How about making blk_mq_sched_insert_request() complain if it is passed a
> request that has the RQF_DONTPREP flag set, to avoid this problem being
> reintroduced in the future? Otherwise this patch looks fine to me.
I agree, but I think we should do that as a follow-up patch. I don't want to
touch this one if we can avoid it. The thought did cross my mind, too. It
should be impossible now that everything goes to the dispatch list.
--
Jens Axboe
* Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
From: Jens Axboe @ 2018-12-07 16:35 UTC
To: Bart Van Assche, linux-block; +Cc: Mike Snitzer, Ming Lei
On 12/7/18 9:24 AM, Jens Axboe wrote:
> On 12/7/18 9:19 AM, Bart Van Assche wrote:
>> How about making blk_mq_sched_insert_request() complain if it is passed a
>> request that has the RQF_DONTPREP flag set, to avoid this problem being
>> reintroduced in the future? Otherwise this patch looks fine to me.
>
> I agree, but I think we should do that as a follow-up patch. I don't want to
> touch this one if we can avoid it. The thought did cross my mind, too. It
> should be impossible now that everything goes to the dispatch list.
Something like the below.
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 29bfe8017a2d..9e5bda8800f8 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -377,6 +377,16 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 
 	WARN_ON(e && (rq->tag != -1));
 
+	/*
+	 * It's illegal to insert a request into the scheduler that has
+	 * been through ->queue_rq(). Warn for that case, and use a bypass
+	 * insert to be safe.
+	 */
+	if (WARN_ON_ONCE(rq->rq_flags & RQF_DONTPREP)) {
+		blk_mq_request_bypass_insert(rq, false);
+		goto run;
+	}
+
 	if (blk_mq_sched_bypass_insert(hctx, !!e, rq))
 		goto run;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6a7566244de3..d5f890d5c814 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1595,15 +1595,25 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 			    struct list_head *list)
 {
-	struct request *rq;
+	struct request *rq, *tmp;
 
 	/*
 	 * preemption doesn't flush plug list, so it's possible ctx->cpu is
 	 * offline now
 	 */
-	list_for_each_entry(rq, list, queuelist) {
+	list_for_each_entry_safe(rq, tmp, list, queuelist) {
 		BUG_ON(rq->mq_ctx != ctx);
 		trace_block_rq_insert(hctx->queue, rq);
+
+		/*
+		 * It's illegal to insert a request into the scheduler that has
+		 * been through ->queue_rq(). Warn for that case, and use a
+		 * bypass insert to be safe.
+		 */
+		if (WARN_ON_ONCE(rq->rq_flags & RQF_DONTPREP)) {
+			list_del_init(&rq->queuelist);
+			blk_mq_request_bypass_insert(rq, false);
+		}
 	}
 
 	spin_lock(&ctx->lock);
--
Jens Axboe
* Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
From: Bart Van Assche @ 2018-12-07 16:41 UTC
To: Jens Axboe, linux-block; +Cc: Mike Snitzer, Ming Lei
On Fri, 2018-12-07 at 09:35 -0700, Jens Axboe wrote:
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 29bfe8017a2d..9e5bda8800f8 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -377,6 +377,16 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
> 
>  	WARN_ON(e && (rq->tag != -1));
> 
> +	/*
> +	 * It's illegal to insert a request into the scheduler that has
> +	 * been through ->queue_rq(). Warn for that case, and use a bypass
> +	 * insert to be safe.
> +	 */
Shouldn't this refer to requests that have been prepared instead of requests
that have been through ->queue_rq()? I think this function is called for
requests that are requeued. Requeued requests have been through ->queue_rq()
but are unprepared before being requeued.
Thanks,
Bart.
* Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
From: Jens Axboe @ 2018-12-07 16:45 UTC
To: Bart Van Assche, linux-block; +Cc: Mike Snitzer, Ming Lei
On 12/7/18 9:41 AM, Bart Van Assche wrote:
> On Fri, 2018-12-07 at 09:35 -0700, Jens Axboe wrote:
>> [ hunk with the new comment snipped ]
>
> Shouldn't this refer to requests that have been prepared instead of requests
> that have been through ->queue_rq()? I think this function is called for
> requests that are requeued. Requeued requests have been through ->queue_rq()
> but are unprepared before being requeued.
If they are unprepared, RQF_DONTPREP should have been cleared. But that needs
testing and verification, which is exactly why I didn't want to bundle it with
the fix.
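
The pattern I'd expect a driver to follow is roughly the below; my_prep() and
my_unprep() are made up for illustration, this isn't lifted from any real
driver:

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	if (!(rq->rq_flags & RQF_DONTPREP)) {
		/* one-time setup: SG tables, mappings, etc */
		if (my_prep(rq))
			return BLK_STS_RESOURCE;
		/* prepped state must survive a requeue */
		rq->rq_flags |= RQF_DONTPREP;
	}

	/* hand the request to the hardware here */
	return BLK_STS_OK;
}

and if the driver tears the prep down before requeueing, it has to clear the
flag again:

	my_unprep(rq);
	/* request is unprepared again, scheduler insert would be fine */
	rq->rq_flags &= ~RQF_DONTPREP;
	blk_mq_requeue_request(rq, true);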
I'll test it later today.
--
Jens Axboe