* [RFC PATCH 0/3] blk-mq: Timeout rework @ 2018-05-21 23:11 ` Keith Busch 0 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-21 23:11 UTC (permalink / raw) To: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche Cc: Keith Busch The current blk-mq code potentially locks requests out of completion by the thousands, making drivers jump through hoops to handle them. This patch set allows drivers to complete their requests whenever they complete, with minimal synchronization and without requiring drivers to know anything about the timeout code. Other proposals under current consideration still have moments that prevent a driver from progressing a request to the completed state. The timeout is ultimately made safe by reference counting the request when timeout handling claims it. By holding a reference, we don't need any tricks to prevent a driver from completing the request out from under the timeout handler; the recorded state can change in step with the true state, and drivers don't need to be aware that any of this is happening. To keep the overhead minimal, the request's reference is taken only when it appears that actual timeout handling needs to be done. Keith Busch (3): blk-mq: Reference count request usage blk-mq: Fix timeout and state order blk-mq: Remove generation sequence block/blk-core.c | 6 - block/blk-mq-debugfs.c | 1 - block/blk-mq.c | 291 +++++++++++++------------------------------ block/blk-mq.h | 20 +--- block/blk-timeout.c | 1 - include/linux/blkdev.h | 26 +---- 6 files changed, 83 insertions(+), 262 deletions(-) -- 2.14.3 ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 1/3] blk-mq: Reference count request usage 2018-05-21 23:11 ` Keith Busch @ 2018-05-21 23:11 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-21 23:11 UTC (permalink / raw) To: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche Cc: Keith Busch This patch adds a struct kref to struct request so that request users can be sure they're operating on the same request without it changing while they're processing it. The request's tag won't be released for reuse until the last user is done with it. Signed-off-by: Keith Busch <keith.busch@intel.com> --- block/blk-mq.c | 30 +++++++++++++++++++++++------- include/linux/blkdev.h | 2 ++ 2 files changed, 25 insertions(+), 7 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 4cbfd784e837..8b370ed75605 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -332,6 +332,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, #endif data->ctx->rq_dispatched[op_is_sync(op)]++; + kref_init(&rq->ref); return rq; } @@ -465,13 +466,33 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, } EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx); +static void blk_mq_exit_request(struct kref *ref) +{ + struct request *rq = container_of(ref, struct request, ref); + struct request_queue *q = rq->q; + struct blk_mq_ctx *ctx = rq->mq_ctx; + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); + const int sched_tag = rq->internal_tag; + + if (rq->tag != -1) + blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); + if (sched_tag != -1) + blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag); + blk_mq_sched_restart(hctx); + blk_queue_exit(q); +} + +static void blk_mq_put_request(struct request *rq) +{ + kref_put(&rq->ref, blk_mq_exit_request); +} + void blk_mq_free_request(struct request *rq) { struct request_queue *q = rq->q; struct elevator_queue *e = q->elevator; struct blk_mq_ctx *ctx = rq->mq_ctx; struct blk_mq_hw_ctx 
*hctx = blk_mq_map_queue(q, ctx->cpu); - const int sched_tag = rq->internal_tag; if (rq->rq_flags & RQF_ELVPRIV) { if (e && e->type->ops.mq.finish_request) @@ -495,12 +516,7 @@ void blk_mq_free_request(struct request *rq) blk_put_rl(blk_rq_rl(rq)); blk_mq_rq_update_state(rq, MQ_RQ_IDLE); - if (rq->tag != -1) - blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); - if (sched_tag != -1) - blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag); - blk_mq_sched_restart(hctx); - blk_queue_exit(q); + blk_mq_put_request(rq); } EXPORT_SYMBOL_GPL(blk_mq_free_request); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index f3999719f828..26bf2c1e3502 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -257,6 +257,8 @@ struct request { struct u64_stats_sync aborted_gstate_sync; u64 aborted_gstate; + struct kref ref; + /* access through blk_rq_set_deadline, blk_rq_deadline */ unsigned long __deadline; -- 2.14.3 ^ permalink raw reply related [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 1/3] blk-mq: Reference count request usage 2018-05-21 23:11 ` Keith Busch @ 2018-05-22 2:27 ` Ming Lei -1 siblings, 0 replies; 128+ messages in thread From: Ming Lei @ 2018-05-22 2:27 UTC (permalink / raw) To: Keith Busch Cc: Jens Axboe, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche On Mon, May 21, 2018 at 05:11:29PM -0600, Keith Busch wrote: > This patch adds a struct kref to struct request so that request users > can be sure they're operating on the same request without it changing > while they're processing it. The request's tag won't be released for > reuse until the last user is done with it. > > Signed-off-by: Keith Busch <keith.busch@intel.com> > --- > block/blk-mq.c | 30 +++++++++++++++++++++++------- > include/linux/blkdev.h | 2 ++ > 2 files changed, 25 insertions(+), 7 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 4cbfd784e837..8b370ed75605 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -332,6 +332,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, > #endif > > data->ctx->rq_dispatched[op_is_sync(op)]++; > + kref_init(&rq->ref); > return rq; > } > > @@ -465,13 +466,33 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, > } > EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx); > > +static void blk_mq_exit_request(struct kref *ref) > +{ > + struct request *rq = container_of(ref, struct request, ref); > + struct request_queue *q = rq->q; > + struct blk_mq_ctx *ctx = rq->mq_ctx; > + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); > + const int sched_tag = rq->internal_tag; > + > + if (rq->tag != -1) > + blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); > + if (sched_tag != -1) > + blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag); > + blk_mq_sched_restart(hctx); > + blk_queue_exit(q); > +} > + > +static void blk_mq_put_request(struct request *rq) > +{ > + kref_put(&rq->ref, blk_mq_exit_request); > +} > + > void blk_mq_free_request(struct 
request *rq) > { > struct request_queue *q = rq->q; > struct elevator_queue *e = q->elevator; > struct blk_mq_ctx *ctx = rq->mq_ctx; > struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); > - const int sched_tag = rq->internal_tag; > > if (rq->rq_flags & RQF_ELVPRIV) { > if (e && e->type->ops.mq.finish_request) > @@ -495,12 +516,7 @@ void blk_mq_free_request(struct request *rq) > blk_put_rl(blk_rq_rl(rq)); > > blk_mq_rq_update_state(rq, MQ_RQ_IDLE); > - if (rq->tag != -1) > - blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); > - if (sched_tag != -1) > - blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag); > - blk_mq_sched_restart(hctx); > - blk_queue_exit(q); > + blk_mq_put_request(rq); Both the above line (an atomic_try_cmpxchg_release is implied) and the kref_init() in blk_mq_rq_ctx_init() run in the fast path and may introduce some cost; you may have to run some benchmarks to show whether there is an effect. Also, given the cost isn't free, could you describe a bit in the commit log what we get for that cost? Thanks, Ming ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 1/3] blk-mq: Reference count request usage 2018-05-21 23:11 ` Keith Busch @ 2018-05-22 15:19 ` Christoph Hellwig -1 siblings, 0 replies; 128+ messages in thread From: Christoph Hellwig @ 2018-05-22 15:19 UTC (permalink / raw) To: Keith Busch Cc: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche On Mon, May 21, 2018 at 05:11:29PM -0600, Keith Busch wrote: > This patch adds a struct kref to struct request so that request users > can be sure they're operating on the same request without it changing > while they're processing it. The request's tag won't be released for > reuse until the last user is done with it. Can you please use a raw refcount_t instead of the kref? That avoids a possible indirect call in the fast path, which has become really painful with the Spectre mitigations, and is also easier to follow to start with. I also think this should be merged into patch 3, as it isn't very useful on its own. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 2/3] blk-mq: Fix timeout and state order 2018-05-21 23:11 ` Keith Busch @ 2018-05-21 23:11 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-21 23:11 UTC (permalink / raw) To: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche Cc: Keith Busch The block layer had been setting the state to in-flight prior to updating the timer. This is the wrong order since the timeout handler could observe the in-flight state with the older timeout, believing the request had expired when in fact it is just getting started. Signed-off-by: Keith Busch <keith.busch@intel.com> --- block/blk-mq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 8b370ed75605..66e5c768803f 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -713,8 +713,8 @@ void blk_mq_start_request(struct request *rq) preempt_disable(); write_seqcount_begin(&rq->gstate_seq); - blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT); blk_add_timer(rq); + blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT); write_seqcount_end(&rq->gstate_seq); preempt_enable(); -- 2.14.3 ^ permalink raw reply related [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 2/3] blk-mq: Fix timeout and state order 2018-05-21 23:11 ` Keith Busch @ 2018-05-22 2:28 ` Ming Lei -1 siblings, 0 replies; 128+ messages in thread From: Ming Lei @ 2018-05-22 2:28 UTC (permalink / raw) To: Keith Busch Cc: Jens Axboe, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche On Mon, May 21, 2018 at 05:11:30PM -0600, Keith Busch wrote: > The block layer had been setting the state to in-flight prior to updating > the timer. This is the wrong order since the timeout handler could observe > the in-flight state with the older timeout, believing the request had > expired when in fact it is just getting started. > > Signed-off-by: Keith Busch <keith.busch@intel.com> > --- > block/blk-mq.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 8b370ed75605..66e5c768803f 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -713,8 +713,8 @@ void blk_mq_start_request(struct request *rq) > preempt_disable(); > write_seqcount_begin(&rq->gstate_seq); > > - blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT); > blk_add_timer(rq); > + blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT); > > write_seqcount_end(&rq->gstate_seq); > preempt_enable(); > -- > 2.14.3 > Looks fine, Reviewed-by: Ming Lei <ming.lei@redhat.com> Thanks, Ming ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 2/3] blk-mq: Fix timeout and state order 2018-05-21 23:11 ` Keith Busch @ 2018-05-22 15:24 ` Christoph Hellwig -1 siblings, 0 replies; 128+ messages in thread From: Christoph Hellwig @ 2018-05-22 15:24 UTC (permalink / raw) To: Keith Busch Cc: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche On Mon, May 21, 2018 at 05:11:30PM -0600, Keith Busch wrote: > The block layer had been setting the state to in-flight prior to updating > the timer. This is the wrong order since the timeout handler could observe > the in-flight state with the older timeout, believing the request had > expired when in fact it is just getting started. > > Signed-off-by: Keith Busch <keith.busch@intel.com> The way I understood Bart's comments to my comments on his patch, we actually need the two updates to be atomic. I haven't had much time to follow up, but I'd like to hear Bart's opinion. Either way we clearly need to document our assumptions here in comments. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 2/3] blk-mq: Fix timeout and state order 2018-05-22 15:24 ` Christoph Hellwig @ 2018-05-22 16:27 ` Bart Van Assche -1 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-05-22 16:27 UTC (permalink / raw) To: hch, keith.busch; +Cc: linux-nvme, linux-block, axboe, ming.lei On Tue, 2018-05-22 at 17:24 +0200, Christoph Hellwig wrote: > On Mon, May 21, 2018 at 05:11:30PM -0600, Keith Busch wrote: > > The block layer had been setting the state to in-flight prior to updating > > the timer. This is the wrong order since the timeout handler could observe > > the in-flight state with the older timeout, believing the request had > > expired when in fact it is just getting started. > > > > Signed-off-by: Keith Busch <keith.busch@intel.com> > > The way I understood Bart's comments to my comments on his patch we > actually need the two updates to be atomic. I haven't had much time > to follow up, but I'd like to hear Bart's opinion. Either way we > clearly need to document our assumptions here in comments. After we discussed request state updating I have been thinking further about this. I think now that it is safe to update the request deadline first, because the timeout code ignores requests anyway that have a state other than MQ_RQ_IN_FLIGHT. Additionally, it is unlikely that the request timer will fire before the request state has been updated. And if that were to happen, the request timeout will be handled the next time request timeouts are examined. Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation sequence 2018-05-21 23:11 ` Keith Busch @ 2018-05-21 23:11 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-21 23:11 UTC (permalink / raw) To: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche Cc: Keith Busch This patch simplifies the timeout handling by relying on the request reference counting to ensure the iterator is operating on an in-flight and truly timed out request. Since the reference counting prevents the tag from being reallocated, the block layer no longer needs to prevent drivers from completing their requests while the timeout handler is operating on them: a driver completing a request is allowed to proceed to the next state without additional synchronization with the block layer. This also removes any need for generation sequence numbers, since the request cannot be reallocated as a new instance while timeout handling is operating on it. Signed-off-by: Keith Busch <keith.busch@intel.com> --- block/blk-core.c | 6 -- block/blk-mq-debugfs.c | 1 - block/blk-mq.c | 259 ++++++++++--------------------------------- block/blk-mq.h | 20 +--- block/blk-timeout.c | 1 - include/linux/blkdev.h | 26 +---- 6 files changed, 58 insertions(+), 255 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 43370faee935..cee03cad99f2 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq) rq->internal_tag = -1; rq->start_time_ns = ktime_get_ns(); rq->part = NULL; - seqcount_init(&rq->gstate_seq); - u64_stats_init(&rq->aborted_gstate_sync); - /* - * See comment of blk_mq_init_request - */ - WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC); } EXPORT_SYMBOL(blk_rq_init); diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index 3080e18cb859..ffa622366922 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -344,7 +344,6 @@ static
const char *const rqf_name[] = { RQF_NAME(STATS), RQF_NAME(SPECIAL_PAYLOAD), RQF_NAME(ZONE_WRITE_LOCKED), - RQF_NAME(MQ_TIMEOUT_EXPIRED), RQF_NAME(MQ_POLL_SLEPT), }; #undef RQF_NAME diff --git a/block/blk-mq.c b/block/blk-mq.c index 66e5c768803f..4858876fd364 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -589,56 +589,6 @@ static void __blk_mq_complete_request(struct request *rq) put_cpu(); } -static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx) - __releases(hctx->srcu) -{ - if (!(hctx->flags & BLK_MQ_F_BLOCKING)) - rcu_read_unlock(); - else - srcu_read_unlock(hctx->srcu, srcu_idx); -} - -static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx) - __acquires(hctx->srcu) -{ - if (!(hctx->flags & BLK_MQ_F_BLOCKING)) { - /* shut up gcc false positive */ - *srcu_idx = 0; - rcu_read_lock(); - } else - *srcu_idx = srcu_read_lock(hctx->srcu); -} - -static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate) -{ - unsigned long flags; - - /* - * blk_mq_rq_aborted_gstate() is used from the completion path and - * can thus be called from irq context. u64_stats_fetch in the - * middle of update on the same CPU leads to lockup. Disable irq - * while updating. 
- */ - local_irq_save(flags); - u64_stats_update_begin(&rq->aborted_gstate_sync); - rq->aborted_gstate = gstate; - u64_stats_update_end(&rq->aborted_gstate_sync); - local_irq_restore(flags); -} - -static u64 blk_mq_rq_aborted_gstate(struct request *rq) -{ - unsigned int start; - u64 aborted_gstate; - - do { - start = u64_stats_fetch_begin(&rq->aborted_gstate_sync); - aborted_gstate = rq->aborted_gstate; - } while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start)); - - return aborted_gstate; -} - /** * blk_mq_complete_request - end I/O on a request * @rq: the request being processed @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq) void blk_mq_complete_request(struct request *rq) { struct request_queue *q = rq->q; - struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu); - int srcu_idx; if (unlikely(blk_should_fake_timeout(q))) return; - - /* - * If @rq->aborted_gstate equals the current instance, timeout is - * claiming @rq and we lost. This is synchronized through - * hctx_lock(). See blk_mq_timeout_work() for details. - * - * Completion path never blocks and we can directly use RCU here - * instead of hctx_lock() which can be either RCU or SRCU. - * However, that would complicate paths which want to synchronize - * against us. Let stay in sync with the issue path so that - * hctx_lock() covers both issue and completion paths. - */ - hctx_lock(hctx, &srcu_idx); - if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) - __blk_mq_complete_request(rq); - hctx_unlock(hctx, srcu_idx); + __blk_mq_complete_request(rq); } EXPORT_SYMBOL(blk_mq_complete_request); @@ -699,26 +632,9 @@ void blk_mq_start_request(struct request *rq) WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE); - /* - * Mark @rq in-flight which also advances the generation number, - * and register for timeout. Protect with a seqcount to allow the - * timeout path to read both @rq->gstate and @rq->deadline - * coherently. 
- * - * This is the only place where a request is marked in-flight. If - * the timeout path reads an in-flight @rq->gstate, the - * @rq->deadline it reads together under @rq->gstate_seq is - * guaranteed to be the matching one. - */ - preempt_disable(); - write_seqcount_begin(&rq->gstate_seq); - blk_add_timer(rq); blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT); - write_seqcount_end(&rq->gstate_seq); - preempt_enable(); - if (q->dma_drain_size && blk_rq_bytes(rq)) { /* * Make sure space for the drain appears. We know we can do @@ -730,11 +646,6 @@ void blk_mq_start_request(struct request *rq) } EXPORT_SYMBOL(blk_mq_start_request); -/* - * When we reach here because queue is busy, it's safe to change the state - * to IDLE without checking @rq->aborted_gstate because we should still be - * holding the RCU read lock and thus protected against timeout. - */ static void __blk_mq_requeue_request(struct request *rq) { struct request_queue *q = rq->q; @@ -843,33 +754,24 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag) } EXPORT_SYMBOL(blk_mq_tag_to_rq); -struct blk_mq_timeout_data { - unsigned long next; - unsigned int next_set; - unsigned int nr_expired; -}; - static void blk_mq_rq_timed_out(struct request *req, bool reserved) { const struct blk_mq_ops *ops = req->q->mq_ops; enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER; - req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED; - if (ops->timeout) ret = ops->timeout(req, reserved); switch (ret) { case BLK_EH_HANDLED: - __blk_mq_complete_request(req); - break; - case BLK_EH_RESET_TIMER: /* - * As nothing prevents from completion happening while - * ->aborted_gstate is set, this may lead to ignored - * completions and further spurious timeouts. + * If the request is still in flight, the driver is requesting + * blk-mq complete it. 
*/ - blk_mq_rq_update_aborted_gstate(req, 0); + if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT) + __blk_mq_complete_request(req); + break; + case BLK_EH_RESET_TIMER: blk_add_timer(req); break; case BLK_EH_NOT_HANDLED: @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) } } -static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, - struct request *rq, void *priv, bool reserved) +static bool blk_mq_req_expired(struct request *rq, unsigned long *next) { - struct blk_mq_timeout_data *data = priv; - unsigned long gstate, deadline; - int start; + unsigned long deadline; - might_sleep(); - - if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) - return; + if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT) + return false; - /* read coherent snapshots of @rq->state_gen and @rq->deadline */ - while (true) { - start = read_seqcount_begin(&rq->gstate_seq); - gstate = READ_ONCE(rq->gstate); - deadline = blk_rq_deadline(rq); - if (!read_seqcount_retry(&rq->gstate_seq, start)) - break; - cond_resched(); - } + deadline = blk_rq_deadline(rq); + if (time_after_eq(jiffies, deadline)) + return true; - /* if in-flight && overdue, mark for abortion */ - if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT && - time_after_eq(jiffies, deadline)) { - blk_mq_rq_update_aborted_gstate(rq, gstate); - data->nr_expired++; - hctx->nr_expired++; - } else if (!data->next_set || time_after(data->next, deadline)) { - data->next = deadline; - data->next_set = 1; - } + if (*next == 0) + *next = deadline; + else if (time_after(*next, deadline)) + *next = deadline; + return false; } -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx, +static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, struct request *rq, void *priv, bool reserved) { + unsigned long *next = priv; + /* - * We marked @rq->aborted_gstate and waited for RCU. 
If there were - * completions that we lost to, they would have finished and - * updated @rq->gstate by now; otherwise, the completion path is - * now guaranteed to see @rq->aborted_gstate and yield. If - * @rq->aborted_gstate still matches @rq->gstate, @rq is ours. + * Just do a quick check if it is expired before locking the request in + * so we're not unnecessarilly synchronizing across CPUs. */ - if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) && - READ_ONCE(rq->gstate) == rq->aborted_gstate) + if (!blk_mq_req_expired(rq, next)) + return; + + /* + * We have reason to believe the request may be expired. Take a + * reference on the request to lock this request lifetime into its + * currently allocated context to prevent it from being reallocated in + * the event the completion by-passes this timeout handler. + * + * If the reference was already released, then the driver beat the + * timeout handler to posting a natural completion. + */ + if (!kref_get_unless_zero(&rq->ref)) + return; + + /* + * The request is now locked and cannot be reallocated underneath the + * timeout handler's processing. Re-verify this exact request is truly + * expired; if it is not expired, then the request was completed and + * reallocated as a new request. 
+ */ + if (blk_mq_req_expired(rq, next)) blk_mq_rq_timed_out(rq, reserved); + blk_mq_put_request(rq); } static void blk_mq_timeout_work(struct work_struct *work) { struct request_queue *q = container_of(work, struct request_queue, timeout_work); - struct blk_mq_timeout_data data = { - .next = 0, - .next_set = 0, - .nr_expired = 0, - }; + unsigned long next = 0; struct blk_mq_hw_ctx *hctx; int i; @@ -957,39 +859,10 @@ static void blk_mq_timeout_work(struct work_struct *work) if (!percpu_ref_tryget(&q->q_usage_counter)) return; - /* scan for the expired ones and set their ->aborted_gstate */ - blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data); + blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next); - if (data.nr_expired) { - bool has_rcu = false; - - /* - * Wait till everyone sees ->aborted_gstate. The - * sequential waits for SRCUs aren't ideal. If this ever - * becomes a problem, we can add per-hw_ctx rcu_head and - * wait in parallel. - */ - queue_for_each_hw_ctx(q, hctx, i) { - if (!hctx->nr_expired) - continue; - - if (!(hctx->flags & BLK_MQ_F_BLOCKING)) - has_rcu = true; - else - synchronize_srcu(hctx->srcu); - - hctx->nr_expired = 0; - } - if (has_rcu) - synchronize_rcu(); - - /* terminate the ones we won */ - blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL); - } - - if (data.next_set) { - data.next = blk_rq_timeout(round_jiffies_up(data.next)); - mod_timer(&q->timeout, data.next); + if (next != 0) { + mod_timer(&q->timeout, next); } else { /* * Request timeouts are handled as a forward rolling timer. If @@ -1334,8 +1207,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) { - int srcu_idx; - /* * We should be running this queue from one of the CPUs that * are mapped to it. 
@@ -1369,9 +1240,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING); - hctx_lock(hctx, &srcu_idx); blk_mq_sched_dispatch_requests(hctx); - hctx_unlock(hctx, srcu_idx); } static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx) @@ -1458,7 +1327,6 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue); bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) { - int srcu_idx; bool need_run; /* @@ -1469,10 +1337,8 @@ bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) * And queue will be rerun in blk_mq_unquiesce_queue() if it is * quiesced. */ - hctx_lock(hctx, &srcu_idx); need_run = !blk_queue_quiesced(hctx->queue) && blk_mq_hctx_has_pending(hctx); - hctx_unlock(hctx, srcu_idx); if (need_run) { __blk_mq_delay_run_hw_queue(hctx, async, 0); @@ -1828,34 +1694,23 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, struct request *rq, blk_qc_t *cookie) { blk_status_t ret; - int srcu_idx; might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING); - hctx_lock(hctx, &srcu_idx); - ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false); if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) blk_mq_sched_insert_request(rq, false, true, false); else if (ret != BLK_STS_OK) blk_mq_end_request(rq, ret); - - hctx_unlock(hctx, srcu_idx); } blk_status_t blk_mq_request_issue_directly(struct request *rq) { - blk_status_t ret; - int srcu_idx; blk_qc_t unused_cookie; struct blk_mq_ctx *ctx = rq->mq_ctx; struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu); - hctx_lock(hctx, &srcu_idx); - ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true); - hctx_unlock(hctx, srcu_idx); - - return ret; + return __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true); } static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) @@ -2065,15 +1920,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq, return ret; } - 
seqcount_init(&rq->gstate_seq); - u64_stats_init(&rq->aborted_gstate_sync); - /* - * start gstate with gen 1 instead of 0, otherwise it will be equal - * to aborted_gstate, and be identified timed out by - * blk_mq_terminate_expired. - */ - WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC); - + WRITE_ONCE(rq->state, MQ_RQ_IDLE); return 0; } diff --git a/block/blk-mq.h b/block/blk-mq.h index e1bb420dc5d6..53452df16343 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -31,17 +31,12 @@ struct blk_mq_ctx { } ____cacheline_aligned_in_smp; /* - * Bits for request->gstate. The lower two bits carry MQ_RQ_* state value - * and the upper bits the generation number. + * Request state. */ enum mq_rq_state { MQ_RQ_IDLE = 0, MQ_RQ_IN_FLIGHT = 1, MQ_RQ_COMPLETE = 2, - - MQ_RQ_STATE_BITS = 2, - MQ_RQ_STATE_MASK = (1 << MQ_RQ_STATE_BITS) - 1, - MQ_RQ_GEN_INC = 1 << MQ_RQ_STATE_BITS, }; void blk_mq_freeze_queue(struct request_queue *q); @@ -109,7 +104,7 @@ void blk_mq_release(struct request_queue *q); */ static inline int blk_mq_rq_state(struct request *rq) { - return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK; + return READ_ONCE(rq->state); } /** @@ -124,16 +119,7 @@ static inline int blk_mq_rq_state(struct request *rq) static inline void blk_mq_rq_update_state(struct request *rq, enum mq_rq_state state) { - u64 old_val = READ_ONCE(rq->gstate); - u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state; - - if (state == MQ_RQ_IN_FLIGHT) { - WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE); - new_val += MQ_RQ_GEN_INC; - } - - /* avoid exposing interim values */ - WRITE_ONCE(rq->gstate, new_val); + WRITE_ONCE(rq->state, state); } static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q, diff --git a/block/blk-timeout.c b/block/blk-timeout.c index 652d4d4d3e97..f95d6e6cbc96 100644 --- a/block/blk-timeout.c +++ b/block/blk-timeout.c @@ -214,7 +214,6 @@ void blk_add_timer(struct request *req) req->timeout = q->rq_timeout; blk_rq_set_deadline(req, jiffies + req->timeout); - 
req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED; /* * Only the non-mq case needs to add the request to a protected list. diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 26bf2c1e3502..8812a7e3c0a3 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -125,10 +125,8 @@ typedef __u32 __bitwise req_flags_t; #define RQF_SPECIAL_PAYLOAD ((__force req_flags_t)(1 << 18)) /* The per-zone write lock is held for this request */ #define RQF_ZONE_WRITE_LOCKED ((__force req_flags_t)(1 << 19)) -/* timeout is expired */ -#define RQF_MQ_TIMEOUT_EXPIRED ((__force req_flags_t)(1 << 20)) /* already slept for hybrid poll */ -#define RQF_MQ_POLL_SLEPT ((__force req_flags_t)(1 << 21)) +#define RQF_MQ_POLL_SLEPT ((__force req_flags_t)(1 << 20)) /* flags that prevent us from merging requests: */ #define RQF_NOMERGE_FLAGS \ @@ -236,27 +234,7 @@ struct request { unsigned int extra_len; /* length of alignment and padding */ - /* - * On blk-mq, the lower bits of ->gstate (generation number and - * state) carry the MQ_RQ_* state value and the upper bits the - * generation number which is monotonically incremented and used to - * distinguish the reuse instances. - * - * ->gstate_seq allows updates to ->gstate and other fields - * (currently ->deadline) during request start to be read - * atomically from the timeout path, so that it can operate on a - * coherent set of information. - */ - seqcount_t gstate_seq; - u64 gstate; - - /* - * ->aborted_gstate is used by the timeout to claim a specific - * recycle instance of this request. See blk_mq_timeout_work(). - */ - struct u64_stats_sync aborted_gstate_sync; - u64 aborted_gstate; - + u64 state; struct kref ref; /* access through blk_rq_set_deadline, blk_rq_deadline */ -- 2.14.3 ^ permalink raw reply related [flat|nested] 128+ messages in thread
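The core of the new timeout claim in blk_mq_check_expired() above is the kref_get_unless_zero() pattern: the timeout path may only touch a request it can pin with a non-zero reference count; otherwise the completion path has already dropped the last reference and the tag may be recycled. A rough userspace model of that pattern (illustrative only: struct fake_request, ref_get_unless_zero(), and timeout_claim() are made-up stand-ins, not kernel APIs):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Made-up stand-in for struct request: a bare atomic reference count. */
struct fake_request {
    atomic_int ref;
};

/* Model of kref_get_unless_zero(): take a reference only if the count is
 * still non-zero, via a compare-exchange loop.  A return of false means
 * the final reference was already dropped and the object may be gone. */
static bool ref_get_unless_zero(struct fake_request *rq)
{
    int old = atomic_load(&rq->ref);
    do {
        if (old == 0)
            return false;
    } while (!atomic_compare_exchange_weak(&rq->ref, &old, old + 1));
    return true;
}

/* Model of blk_mq_put_request(): drop our reference; returns true on the
 * final put, where the real code would free the tag for reallocation. */
static bool ref_put(struct fake_request *rq)
{
    return atomic_fetch_sub(&rq->ref, 1) == 1;
}

/* The timeout path's claim step: pin the request before inspecting it,
 * and back off if the driver's completion beat us to the final put. */
static bool timeout_claim(struct fake_request *rq)
{
    if (!ref_get_unless_zero(rq))
        return false;   /* completion won; do not touch the request */
    /* ... re-verify expiry and run the timeout handler here ... */
    ref_put(rq);
    return true;
}
```

The second expiry check in the patch plays the same role as the re-verification comment in timeout_claim(): once the reference is pinned, the request cannot be recycled underneath us, so a still-expired deadline really belongs to this instance.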
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-21 23:11 ` Keith Busch @ 2018-05-21 23:29 ` Bart Van Assche -1 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-05-21 23:29 UTC (permalink / raw) To: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> -	int srcu_idx;
>
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -	[ ... ]
> +	__blk_mq_complete_request(rq);
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);
> [ ... ]
>  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  {
>  	const struct blk_mq_ops *ops = req->q->mq_ops;
>  	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
>
> -	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
> -
>  	if (ops->timeout)
>  		ret = ops->timeout(req, reserved);
>
>  	switch (ret) {
>  	case BLK_EH_HANDLED:
> -		__blk_mq_complete_request(req);
> -		break;
> -	case BLK_EH_RESET_TIMER:
>  		/*
> -		 * As nothing prevents from completion happening while
> -		 * ->aborted_gstate is set, this may lead to ignored
> -		 * completions and further spurious timeouts.
> +		 * If the request is still in flight, the driver is requesting
> +		 * blk-mq complete it.
>  		 */
> -		blk_mq_rq_update_aborted_gstate(req, 0);
> +		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> +			__blk_mq_complete_request(req);
> +		break;
> +	case BLK_EH_RESET_TIMER:
>  		blk_add_timer(req);
>  		break;
>  	case BLK_EH_NOT_HANDLED:
> @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  	}
>  }

I think the above changes can lead to concurrent calls of __blk_mq_complete_request() from the regular completion path and the timeout path. That's wrong: the __blk_mq_complete_request() caller should guarantee that no concurrent calls from another context to that function can occur.

Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-05-21 23:29 ` Bart Van Assche 0 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-05-21 23:29 UTC (permalink / raw) On Mon, 2018-05-21@17:11 -0600, Keith Busch wrote: > @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq) > void blk_mq_complete_request(struct request *rq) > { > struct request_queue *q = rq->q; > - struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu); > - int srcu_idx; > > if (unlikely(blk_should_fake_timeout(q))) > return; > - [ ... ] > + __blk_mq_complete_request(rq); > } > EXPORT_SYMBOL(blk_mq_complete_request); > [ ... ] > static void blk_mq_rq_timed_out(struct request *req, bool reserved) > { > const struct blk_mq_ops *ops = req->q->mq_ops; > enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER; > > - req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED; > - > if (ops->timeout) > ret = ops->timeout(req, reserved); > > switch (ret) { > case BLK_EH_HANDLED: > - __blk_mq_complete_request(req); > - break; > - case BLK_EH_RESET_TIMER: > /* > - * As nothing prevents from completion happening while > - * ->aborted_gstate is set, this may lead to ignored > - * completions and further spurious timeouts. > + * If the request is still in flight, the driver is requesting > + * blk-mq complete it. > */ > - blk_mq_rq_update_aborted_gstate(req, 0); > + if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT) > + __blk_mq_complete_request(req); > + break; > + case BLK_EH_RESET_TIMER: > blk_add_timer(req); > break; > case BLK_EH_NOT_HANDLED: > @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) > } > } I think the above changes can lead to concurrent calls of __blk_mq_complete_request() from the regular completion path and the timeout path. That's wrong: the __blk_mq_complete_request() caller should guarantee that no concurrent calls from another context to that function can occur. Bart. 
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-21 23:29 ` Bart Van Assche
@ 2018-05-22 14:15 ` Keith Busch
  -1 siblings, 0 replies; 128+ messages in thread
From: Keith Busch @ 2018-05-22 14:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Mon, May 21, 2018 at 11:29:06PM +0000, Bart Van Assche wrote:
> On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> >  	switch (ret) {
> >  	case BLK_EH_HANDLED:
> > -		__blk_mq_complete_request(req);
> > -		break;
> > -	case BLK_EH_RESET_TIMER:
> >  		/*
> > -		 * As nothing prevents from completion happening while
> > -		 * ->aborted_gstate is set, this may lead to ignored
> > -		 * completions and further spurious timeouts.
> > +		 * If the request is still in flight, the driver is requesting
> > +		 * blk-mq complete it.
> >  		 */
> > -		blk_mq_rq_update_aborted_gstate(req, 0);
> > +		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> > +			__blk_mq_complete_request(req);
> > +		break;
> > +	case BLK_EH_RESET_TIMER:
> >  		blk_add_timer(req);
> >  		break;
> >  	case BLK_EH_NOT_HANDLED:
> > @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  	}
> >  }
>
> I think the above changes can lead to concurrent calls of
> __blk_mq_complete_request() from the regular completion path and the timeout
> path. That's wrong: the __blk_mq_complete_request() caller should guarantee
> that no concurrent calls from another context to that function can occur.

Hi Bart,

This shouldn't be introducing any new concurrent calls to
__blk_mq_complete_request() if they don't already exist. The timeout work
calls it only if the driver's timeout returns BLK_EH_HANDLED, which means
the driver is claiming the command is now done, but the driver didn't call
blk_mq_complete_request(), as indicated by the request's IN_FLIGHT state.

In order to get a second call to __blk_mq_complete_request(), then,
the driver would have to end up calling blk_mq_complete_request()
in another context.
But there's nothing stopping that from happening today, and it would be bad
if any driver actually did that: the request may have been re-allocated and
issued as a new command, and calling blk_mq_complete_request() a second time
introduces data corruption.

Thanks,
Keith

^ permalink raw reply [flat|nested] 128+ messages in thread
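Keith's point above — that on BLK_EH_HANDLED the timeout path only completes a request whose state is still IN_FLIGHT, so a request can't be completed a second time by blk-mq itself — can be modeled in a few lines. This is an illustrative userspace sketch; the type and function names are stand-ins, not the kernel code:

```c
#include <assert.h>

/* Toy model of the proposed BLK_EH_HANDLED handling: completing a
 * request moves it out of IN_FLIGHT, and the timeout verdict only
 * completes a request the driver has left IN_FLIGHT. */
enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

struct rq_model {
	enum mq_rq_state state;
	int completions;	/* times __blk_mq_complete_request() ran */
};

static void model_complete(struct rq_model *rq)
{
	rq->state = MQ_RQ_COMPLETE;
	rq->completions++;
}

/* What blk_mq_rq_timed_out() does for the BLK_EH_HANDLED case in this
 * series: complete only if the driver hasn't already done so. */
static void model_timed_out_handled(struct rq_model *rq)
{
	if (rq->state == MQ_RQ_IN_FLIGHT)
		model_complete(rq);
}
```

Whichever path runs second sees the state already moved past IN_FLIGHT and does nothing, so blk-mq itself never completes the same instance twice; the remaining hazard is a driver calling the completion entry point twice on its own, which is the case discussed above.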
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22 14:15 ` Keith Busch
@ 2018-05-22 16:29 ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-05-22 16:29 UTC (permalink / raw)
  To: keith.busch; +Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Tue, 2018-05-22 at 08:15 -0600, Keith Busch wrote:
> This shouldn't be introducing any new concurrent calls to
> __blk_mq_complete_request if they don't already exist. The timeout work
> calls it only if the driver's timeout returns BLK_EH_HANDLED, which means
> the driver is claiming the command is now done, but the driver didn't call
> blk_mq_complete_request as indicated by the request's IN_FLIGHT state.
>
> In order to get a second call to __blk_mq_complete_request(), then,
> the driver would have to end up calling blk_mq_complete_request()
> in another context. But there's nothing stopping that from happening
> today, and would be bad if any driver actually did that: the request
> may have been re-allocated and issued as a new command, and calling
> blk_mq_complete_request() the second time introduces data corruption.

Hello Keith,

Please have another look at the current code that handles request timeouts
and completions. The current implementation guarantees that no double
completions can occur but your patch removes essential aspects of that
implementation.

Thanks,

Bart.

^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 16:29 ` Bart Van Assche @ 2018-05-22 16:34 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-22 16:34 UTC (permalink / raw) To: Bart Van Assche Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei On Tue, May 22, 2018 at 04:29:07PM +0000, Bart Van Assche wrote: > Please have another look at the current code that handles request timeouts > and completions. The current implementation guarantees that no double > completions can occur but your patch removes essential aspects of that > implementation. How does the current implementation guarantee a double completion doesn't happen when the request is allocated for a new command? ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-22 16:34 ` Keith Busch
@ 2018-05-22 16:48 ` Bart Van Assche
  -1 siblings, 0 replies; 128+ messages in thread
From: Bart Van Assche @ 2018-05-22 16:48 UTC (permalink / raw)
  To: keith.busch; +Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei

On Tue, 2018-05-22 at 10:34 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 04:29:07PM +0000, Bart Van Assche wrote:
> > Please have another look at the current code that handles request timeouts
> > and completions. The current implementation guarantees that no double
> > completions can occur but your patch removes essential aspects of that
> > implementation.
>
> How does the current implementation guarantee a double completion doesn't
> happen when the request is allocated for a new command?

Hello Keith,

If a request completes and is reused after the timeout handler has set
aborted_gstate and before blk_mq_terminate_expired() is called, then the
latter function will skip the request because restarting a request causes
the generation number in rq->gstate to be incremented. From
blk_mq_rq_update_state():

	if (state == MQ_RQ_IN_FLIGHT) {
		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
		new_val += MQ_RQ_GEN_INC;
	}

Bart.

^ permalink raw reply [flat|nested] 128+ messages in thread
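Bart's argument rests on how ->gstate packs a state and a generation into one value. A minimal standalone sketch of that encoding (the constants mirror the pre-patch blk-mq definitions, but this is an illustration, not the kernel source):

```c
#include <assert.h>
#include <stdint.h>

/* ->gstate sketch: the low two bits carry the MQ_RQ_* state, the upper
 * bits carry a generation that is bumped on every IDLE -> IN_FLIGHT
 * transition. A timeout-path snapshot (aborted_gstate) of an earlier
 * instance therefore never matches a recycled request. */
enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

#define MQ_RQ_STATE_BITS 2
#define MQ_RQ_STATE_MASK ((1ULL << MQ_RQ_STATE_BITS) - 1)
#define MQ_RQ_GEN_INC    (1ULL << MQ_RQ_STATE_BITS)

static uint64_t gstate_update(uint64_t gstate, enum mq_rq_state state)
{
	uint64_t new_val = gstate & ~MQ_RQ_STATE_MASK;

	if (state == MQ_RQ_IN_FLIGHT)
		new_val += MQ_RQ_GEN_INC;	/* new recycle instance */
	return new_val | state;
}
```

A timeout handler that recorded the old gstate and later compares it against the current one sees a mismatch as soon as the request has been recycled into a new IN_FLIGHT instance, which is the skip Bart describes.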
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
  2018-05-21 23:11 ` Keith Busch
@ 2018-05-22  2:49 ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2018-05-22 2:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche

On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> This patch simplifies the timeout handling by relying on the request
> reference counting to ensure the iterator is operating on an inflight

The reference counting isn't free; what is the real benefit of doing it
this way?

> and truly timed out request. Since the reference counting prevents the
> tag from being reallocated, the block layer no longer needs to prevent
> drivers from completing their requests while the timeout handler is
> operating on it: a driver completing a request is allowed to proceed to
> the next state without additional synchronization with the block layer.

This might cause trouble for drivers: previously a request was only ever
completed from one path, and this changes that behaviour.

> 
> This also removes any need for generation sequence numbers since the
> request lifetime is prevented from being reallocated as a new sequence
> while timeout handling is operating on it.
> > Signed-off-by: Keith Busch <keith.busch@intel.com> > --- > block/blk-core.c | 6 -- > block/blk-mq-debugfs.c | 1 - > block/blk-mq.c | 259 ++++++++++--------------------------------------- > block/blk-mq.h | 20 +--- > block/blk-timeout.c | 1 - > include/linux/blkdev.h | 26 +---- > 6 files changed, 58 insertions(+), 255 deletions(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index 43370faee935..cee03cad99f2 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq) > rq->internal_tag = -1; > rq->start_time_ns = ktime_get_ns(); > rq->part = NULL; > - seqcount_init(&rq->gstate_seq); > - u64_stats_init(&rq->aborted_gstate_sync); > - /* > - * See comment of blk_mq_init_request > - */ > - WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC); > } > EXPORT_SYMBOL(blk_rq_init); > > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c > index 3080e18cb859..ffa622366922 100644 > --- a/block/blk-mq-debugfs.c > +++ b/block/blk-mq-debugfs.c > @@ -344,7 +344,6 @@ static const char *const rqf_name[] = { > RQF_NAME(STATS), > RQF_NAME(SPECIAL_PAYLOAD), > RQF_NAME(ZONE_WRITE_LOCKED), > - RQF_NAME(MQ_TIMEOUT_EXPIRED), > RQF_NAME(MQ_POLL_SLEPT), > }; > #undef RQF_NAME > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 66e5c768803f..4858876fd364 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -589,56 +589,6 @@ static void __blk_mq_complete_request(struct request *rq) > put_cpu(); > } > > -static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx) > - __releases(hctx->srcu) > -{ > - if (!(hctx->flags & BLK_MQ_F_BLOCKING)) > - rcu_read_unlock(); > - else > - srcu_read_unlock(hctx->srcu, srcu_idx); > -} > - > -static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx) > - __acquires(hctx->srcu) > -{ > - if (!(hctx->flags & BLK_MQ_F_BLOCKING)) { > - /* shut up gcc false positive */ > - *srcu_idx = 0; > - rcu_read_lock(); > - } else > - *srcu_idx = srcu_read_lock(hctx->srcu); 
> -} > - > -static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate) > -{ > - unsigned long flags; > - > - /* > - * blk_mq_rq_aborted_gstate() is used from the completion path and > - * can thus be called from irq context. u64_stats_fetch in the > - * middle of update on the same CPU leads to lockup. Disable irq > - * while updating. > - */ > - local_irq_save(flags); > - u64_stats_update_begin(&rq->aborted_gstate_sync); > - rq->aborted_gstate = gstate; > - u64_stats_update_end(&rq->aborted_gstate_sync); > - local_irq_restore(flags); > -} > - > -static u64 blk_mq_rq_aborted_gstate(struct request *rq) > -{ > - unsigned int start; > - u64 aborted_gstate; > - > - do { > - start = u64_stats_fetch_begin(&rq->aborted_gstate_sync); > - aborted_gstate = rq->aborted_gstate; > - } while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start)); > - > - return aborted_gstate; > -} > - > /** > * blk_mq_complete_request - end I/O on a request > * @rq: the request being processed > @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq) > void blk_mq_complete_request(struct request *rq) > { > struct request_queue *q = rq->q; > - struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu); > - int srcu_idx; > > if (unlikely(blk_should_fake_timeout(q))) > return; > - > - /* > - * If @rq->aborted_gstate equals the current instance, timeout is > - * claiming @rq and we lost. This is synchronized through > - * hctx_lock(). See blk_mq_timeout_work() for details. > - * > - * Completion path never blocks and we can directly use RCU here > - * instead of hctx_lock() which can be either RCU or SRCU. > - * However, that would complicate paths which want to synchronize > - * against us. Let stay in sync with the issue path so that > - * hctx_lock() covers both issue and completion paths. 
> - */ > - hctx_lock(hctx, &srcu_idx); > - if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) > - __blk_mq_complete_request(rq); > - hctx_unlock(hctx, srcu_idx); > + __blk_mq_complete_request(rq); > } > EXPORT_SYMBOL(blk_mq_complete_request); > > @@ -699,26 +632,9 @@ void blk_mq_start_request(struct request *rq) > > WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE); > > - /* > - * Mark @rq in-flight which also advances the generation number, > - * and register for timeout. Protect with a seqcount to allow the > - * timeout path to read both @rq->gstate and @rq->deadline > - * coherently. > - * > - * This is the only place where a request is marked in-flight. If > - * the timeout path reads an in-flight @rq->gstate, the > - * @rq->deadline it reads together under @rq->gstate_seq is > - * guaranteed to be the matching one. > - */ > - preempt_disable(); > - write_seqcount_begin(&rq->gstate_seq); > - > blk_add_timer(rq); > blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT); > > - write_seqcount_end(&rq->gstate_seq); > - preempt_enable(); > - > if (q->dma_drain_size && blk_rq_bytes(rq)) { > /* > * Make sure space for the drain appears. We know we can do > @@ -730,11 +646,6 @@ void blk_mq_start_request(struct request *rq) > } > EXPORT_SYMBOL(blk_mq_start_request); > > -/* > - * When we reach here because queue is busy, it's safe to change the state > - * to IDLE without checking @rq->aborted_gstate because we should still be > - * holding the RCU read lock and thus protected against timeout. 
> - */ > static void __blk_mq_requeue_request(struct request *rq) > { > struct request_queue *q = rq->q; > @@ -843,33 +754,24 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag) > } > EXPORT_SYMBOL(blk_mq_tag_to_rq); > > -struct blk_mq_timeout_data { > - unsigned long next; > - unsigned int next_set; > - unsigned int nr_expired; > -}; > - > static void blk_mq_rq_timed_out(struct request *req, bool reserved) > { > const struct blk_mq_ops *ops = req->q->mq_ops; > enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER; > > - req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED; > - > if (ops->timeout) > ret = ops->timeout(req, reserved); > > switch (ret) { > case BLK_EH_HANDLED: > - __blk_mq_complete_request(req); > - break; > - case BLK_EH_RESET_TIMER: > /* > - * As nothing prevents from completion happening while > - * ->aborted_gstate is set, this may lead to ignored > - * completions and further spurious timeouts. > + * If the request is still in flight, the driver is requesting > + * blk-mq complete it. 
> */ > - blk_mq_rq_update_aborted_gstate(req, 0); > + if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT) > + __blk_mq_complete_request(req); > + break; > + case BLK_EH_RESET_TIMER: > blk_add_timer(req); > break; > case BLK_EH_NOT_HANDLED: > @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) > } > } > > -static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, > - struct request *rq, void *priv, bool reserved) > +static bool blk_mq_req_expired(struct request *rq, unsigned long *next) > { > - struct blk_mq_timeout_data *data = priv; > - unsigned long gstate, deadline; > - int start; > + unsigned long deadline; > > - might_sleep(); > - > - if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) > - return; > + if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT) > + return false; > > - /* read coherent snapshots of @rq->state_gen and @rq->deadline */ > - while (true) { > - start = read_seqcount_begin(&rq->gstate_seq); > - gstate = READ_ONCE(rq->gstate); > - deadline = blk_rq_deadline(rq); > - if (!read_seqcount_retry(&rq->gstate_seq, start)) > - break; > - cond_resched(); > - } > + deadline = blk_rq_deadline(rq); > + if (time_after_eq(jiffies, deadline)) > + return true; > > - /* if in-flight && overdue, mark for abortion */ > - if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT && > - time_after_eq(jiffies, deadline)) { > - blk_mq_rq_update_aborted_gstate(rq, gstate); > - data->nr_expired++; > - hctx->nr_expired++; > - } else if (!data->next_set || time_after(data->next, deadline)) { > - data->next = deadline; > - data->next_set = 1; > - } > + if (*next == 0) > + *next = deadline; > + else if (time_after(*next, deadline)) > + *next = deadline; > + return false; > } > > -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx, > +static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, > struct request *rq, void *priv, bool reserved) > { > + unsigned long *next = priv; > + > /* > - * We marked @rq->aborted_gstate and waited for RCU. 
If there were
> -	 * completions that we lost to, they would have finished and
> -	 * updated @rq->gstate by now; otherwise, the completion path is
> -	 * now guaranteed to see @rq->aborted_gstate and yield. If
> -	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> +	 * Just do a quick check if it is expired before locking the request in
> +	 * so we're not unnecessarily synchronizing across CPUs.
>  	 */
> -	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> -	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
> +	if (!blk_mq_req_expired(rq, next))
> +		return;
> +
> +	/*
> +	 * We have reason to believe the request may be expired. Take a
> +	 * reference on the request to lock this request lifetime into its
> +	 * currently allocated context to prevent it from being reallocated in
> +	 * the event the completion by-passes this timeout handler.
> +	 *
> +	 * If the reference was already released, then the driver beat the
> +	 * timeout handler to posting a natural completion.
> +	 */
> +	if (!kref_get_unless_zero(&rq->ref))
> +		return;

If this request has just been completed in the normal path but its state
isn't updated yet, the timeout will hold the request and may complete it
again, so the request can be completed twice.

Thanks,
Ming

^ permalink raw reply [flat|nested] 128+ messages in thread
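The reference-counting claim that Ming is questioning can be modeled with plain C11 atomics. This is a hypothetical single-threaded sketch of the scheme: the block layer holds one reference for the request's lifetime, and the timeout path only acts on a request it managed to take an extra reference on (kref_get_unless_zero() in the patch). The names here are stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the claiming scheme in this series. */
struct rq_ref_model {
	atomic_int ref;			/* 1 while the request is allocated */
	bool can_be_reallocated;	/* true once the last ref is dropped */
};

/* Analogue of kref_get_unless_zero(): succeed only if the count is
 * still nonzero, i.e. the request hasn't been freed for reuse. */
static bool ref_get_unless_zero(struct rq_ref_model *rq)
{
	int v = atomic_load(&rq->ref);

	while (v != 0)
		if (atomic_compare_exchange_weak(&rq->ref, &v, v + 1))
			return true;
	return false;	/* completion already dropped the last reference */
}

static void ref_put(struct rq_ref_model *rq)
{
	if (atomic_fetch_sub(&rq->ref, 1) == 1)
		rq->can_be_reallocated = true;	/* tag may be reused now */
}
```

The model shows what the reference does and does not guarantee: while the timeout path holds its extra reference the request cannot be reallocated out from under it, but — as Ming notes — a successful ref_get_unless_zero() alone says nothing about whether the normal completion path is concurrently in progress; that ordering still depends on the state check.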
If there were > - * completions that we lost to, they would have finished and > - * updated @rq->gstate by now; otherwise, the completion path is > - * now guaranteed to see @rq->aborted_gstate and yield. If > - * @rq->aborted_gstate still matches @rq->gstate, @rq is ours. > + * Just do a quick check if it is expired before locking the request in > + * so we're not unnecessarilly synchronizing across CPUs. > */ > - if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) && > - READ_ONCE(rq->gstate) == rq->aborted_gstate) > + if (!blk_mq_req_expired(rq, next)) > + return; > + > + /* > + * We have reason to believe the request may be expired. Take a > + * reference on the request to lock this request lifetime into its > + * currently allocated context to prevent it from being reallocated in > + * the event the completion by-passes this timeout handler. > + * > + * If the reference was already released, then the driver beat the > + * timeout handler to posting a natural completion. > + */ > + if (!kref_get_unless_zero(&rq->ref)) > + return; If this request has just been completed in the normal path but its state isn't updated yet, the timeout handler will hold the request and may complete it again, so this request can be completed twice. Thanks, Ming ^ permalink raw reply [flat|nested] 128+ messages in thread
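For readers tracing the lifetime rules in the hunk above, the guarantee the patch leans on from kref_get_unless_zero()/kref_put() can be modeled with C11 atomics. This is a userspace sketch with made-up names, not the kernel implementation: a reference may only be taken while the count is still non-zero, so the timeout path backs off cleanly if the request's last reference was already dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the kernel's kref_get_unless_zero():
 * atomically take a reference only if the count is observed non-zero. */
struct ref { atomic_int count; };

static bool ref_get_unless_zero(struct ref *r)
{
    int old = atomic_load(&r->count);
    while (old != 0) {
        /* CAS loop: succeed only if the count was still non-zero;
         * on failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;   /* request already freed: do not touch it */
}

static bool ref_put(struct ref *r)
{
    /* returns true when the last reference is dropped */
    return atomic_fetch_sub(&r->count, 1) == 1;
}
```

With this model, a timeout worker that wins the get can safely inspect the request; whichever side performs the final put (completion or timeout) is the one that may release resources.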
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 2:49 ` Ming Lei @ 2018-05-22 3:16 ` Jens Axboe -1 siblings, 0 replies; 128+ messages in thread From: Jens Axboe @ 2018-05-22 3:16 UTC (permalink / raw) To: Ming Lei, Keith Busch Cc: linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche On 5/21/18 8:49 PM, Ming Lei wrote: > On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote: >> This patch simplifies the timeout handling by relying on the request >> reference counting to ensure the iterator is operating on an inflight > > The reference counting isn't free, what is the real benefit in this way? Neither is the current scheme and locking, and this is a hell of a lot simpler. I'd get rid of the kref stuff and just do a simple atomic_dec_and_test(). Most of the time we should be uncontended on that, which would make it pretty darn cheap. I'd be surprised if it wasn't better than the current alternatives. -- Jens Axboe ^ permalink raw reply [flat|nested] 128+ messages in thread
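Jens's atomic_dec_and_test() alternative could look roughly like the following userspace model. The names are illustrative only: atomic_fetch_sub stands in for the kernel's atomic_dec_and_test(), and a boolean stands in for returning the tag via blk_mq_put_tag().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the suggestion: drop the kref wrapper and use a bare atomic
 * counter with dec-and-test semantics for the final put. */
struct request_model {
    atomic_int ref;
    bool tag_released;   /* stands in for blk_mq_put_tag() */
};

static bool atomic_dec_and_test_model(atomic_int *v)
{
    /* atomic_dec_and_test() returns true iff the new value is zero */
    return atomic_fetch_sub(v, 1) - 1 == 0;
}

static void blk_mq_put_request_model(struct request_model *rq)
{
    if (atomic_dec_and_test_model(&rq->ref))
        rq->tag_released = true;   /* only the last put frees the tag */
}
```

In the uncontended common case this is a single atomic decrement per request completion, which is the basis of the "pretty darn cheap" argument.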
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 3:16 ` Jens Axboe @ 2018-05-22 3:47 ` Ming Lei -1 siblings, 0 replies; 128+ messages in thread From: Ming Lei @ 2018-05-22 3:47 UTC (permalink / raw) To: Jens Axboe Cc: Keith Busch, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote: > On 5/21/18 8:49 PM, Ming Lei wrote: > > On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote: > >> This patch simplifies the timeout handling by relying on the request > >> reference counting to ensure the iterator is operating on an inflight > > > > The reference counting isn't free, what is the real benefit in this way? > > Neither is the current scheme and locking, and this is a hell of a lot > simpler. I'd get rid of the kref stuff and just do a simple > atomic_dec_and_test(). Most of the time we should be uncontended on > that, which would make it pretty darn cheap. I'd be surprised if it > wasn't better than the current alternatives. The explicit memory barriers implied by atomic_dec_and_test() aren't free either. Also, the double completion issue needs to be fixed before discussing this approach any further. Thanks, Ming ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 3:47 ` Ming Lei @ 2018-05-22 3:51 ` Jens Axboe -1 siblings, 0 replies; 128+ messages in thread From: Jens Axboe @ 2018-05-22 3:51 UTC (permalink / raw) To: Ming Lei Cc: Keith Busch, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche On May 21, 2018, at 9:47 PM, Ming Lei <ming.lei@redhat.com> wrote: > >> On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote: >>> On 5/21/18 8:49 PM, Ming Lei wrote: >>>> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote: >>>> This patch simplifies the timeout handling by relying on the request >>>> reference counting to ensure the iterator is operating on an inflight >>> >>> The reference counting isn't free, what is the real benefit in this way? >> >> Neither is the current scheme and locking, and this is a hell of a lot >> simpler. I'd get rid of the kref stuff and just do a simple >> atomic_dec_and_test(). Most of the time we should be uncontended on >> that, which would make it pretty darn cheap. I'd be surprised if it >> wasn't better than the current alternatives. > > The explicit memory barriers by atomic_dec_and_test() isn't free. I'm not saying it's free. Neither is our current synchronization. > Also the double completion issue need to be fixed before discussing > this approach further. Certainly. Also not saying that the current patch is perfect. But it's a lot more palatable than the alternatives, hence my interest. And I'd like for this issue to get solved, we seem to be a bit stuck atm. I'll take a closer look tomorrow. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 3:51 ` Jens Axboe @ 2018-05-22 8:51 ` Ming Lei -1 siblings, 0 replies; 128+ messages in thread From: Ming Lei @ 2018-05-22 8:51 UTC (permalink / raw) To: Jens Axboe Cc: Keith Busch, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche On Mon, May 21, 2018 at 09:51:19PM -0600, Jens Axboe wrote: > On May 21, 2018, at 9:47 PM, Ming Lei <ming.lei@redhat.com> wrote: > > > >> On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote: > >>> On 5/21/18 8:49 PM, Ming Lei wrote: > >>>> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote: > >>>> This patch simplifies the timeout handling by relying on the request > >>>> reference counting to ensure the iterator is operating on an inflight > >>> > >>> The reference counting isn't free, what is the real benefit in this way? > >> > >> Neither is the current scheme and locking, and this is a hell of a lot > >> simpler. I'd get rid of the kref stuff and just do a simple > >> atomic_dec_and_test(). Most of the time we should be uncontended on > >> that, which would make it pretty darn cheap. I'd be surprised if it > >> wasn't better than the current alternatives. > > > > The explicit memory barriers by atomic_dec_and_test() isn't free. > > I’m not saying it’s free. Neither is our current synchronization. > > > Also the double completion issue need to be fixed before discussing > > this approach further. > > Certainly. Also not saying that the current patch is perfect. But it’s a lot more palatable than the alternatives, hence my interest. And I’d like for this issue to get solved, we seem to be a bit stuck atm. > We may not really be stuck, and there seem to be no alternatives to this patchset. It is a new requirement from NVMe: Keith wants the driver to be able to complete a timed-out request in .timeout(). We have never supported that before, in either the mq or the non-mq code path. Thanks, Ming ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 8:51 ` Ming Lei @ 2018-05-22 14:35 ` Jens Axboe -1 siblings, 0 replies; 128+ messages in thread From: Jens Axboe @ 2018-05-22 14:35 UTC (permalink / raw) To: Ming Lei Cc: Keith Busch, linux-nvme, linux-block, Christoph Hellwig, Bart Van Assche On 5/22/18 2:51 AM, Ming Lei wrote: > On Mon, May 21, 2018 at 09:51:19PM -0600, Jens Axboe wrote: >> On May 21, 2018, at 9:47 PM, Ming Lei <ming.lei@redhat.com> wrote: >>> >>>> On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote: >>>>> On 5/21/18 8:49 PM, Ming Lei wrote: >>>>>> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote: >>>>>> This patch simplifies the timeout handling by relying on the request >>>>>> reference counting to ensure the iterator is operating on an inflight >>>>> >>>>> The reference counting isn't free, what is the real benefit in this way? >>>> >>>> Neither is the current scheme and locking, and this is a hell of a lot >>>> simpler. I'd get rid of the kref stuff and just do a simple >>>> atomic_dec_and_test(). Most of the time we should be uncontended on >>>> that, which would make it pretty darn cheap. I'd be surprised if it >>>> wasn't better than the current alternatives. >>> >>> The explicit memory barriers by atomic_dec_and_test() isn't free. >> >> I’m not saying it’s free. Neither is our current synchronization. >> >>> Also the double completion issue need to be fixed before discussing >>> this approach further. >> >> Certainly. Also not saying that the current patch is perfect. But >> it’s a lot more palatable than the alternatives, hence my interest. >> And I’d like for this issue to get solved, we seem to be a bit stuck >> atm. >> > > It may not be something we are stuck at, and seems no alternatives for > this patchset. > > It is a new requirement from NVMe, and Keith wants driver to complete > timed-out request in .timeout(). We never support that before for both > mq and non-mq code path. 
No, that's not what he wants to do. He wants to use referencing to release the final resources of the request (the tag), to prevent premature reuse of the request. -- Jens Axboe ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 2:49 ` Ming Lei @ 2018-05-22 14:20 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-22 14:20 UTC (permalink / raw) To: Ming Lei Cc: Keith Busch, Jens Axboe, linux-block, Bart Van Assche, Christoph Hellwig, linux-nvme On Tue, May 22, 2018 at 10:49:11AM +0800, Ming Lei wrote: > On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote: > > -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx, > > +static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, > > struct request *rq, void *priv, bool reserved) > > { > > + unsigned long *next = priv; > > + > > /* > > - * We marked @rq->aborted_gstate and waited for RCU. If there were > > - * completions that we lost to, they would have finished and > > - * updated @rq->gstate by now; otherwise, the completion path is > > - * now guaranteed to see @rq->aborted_gstate and yield. If > > - * @rq->aborted_gstate still matches @rq->gstate, @rq is ours. > > + * Just do a quick check if it is expired before locking the request in > > + * so we're not unnecessarilly synchronizing across CPUs. > > */ > > - if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) && > > - READ_ONCE(rq->gstate) == rq->aborted_gstate) > > + if (!blk_mq_req_expired(rq, next)) > > + return; > > + > > + /* > > + * We have reason to believe the request may be expired. Take a > > + * reference on the request to lock this request lifetime into its > > + * currently allocated context to prevent it from being reallocated in > > + * the event the completion by-passes this timeout handler. > > + * > > + * If the reference was already released, then the driver beat the > > + * timeout handler to posting a natural completion. 
> > + */ > > + if (!kref_get_unless_zero(&rq->ref)) > > + return; > > If this request is just completed in normal path and its state isn't > updated yet, timeout will hold the request, and may complete this > request again, then this req can be completed two times. Hi Ming, In the event the driver requests a normal completion, the timeout work releasing the last reference doesn't do a second completion: it only releases the request's tag back for re-allocation. Thanks, Keith ^ permalink raw reply [flat|nested] 128+ messages in thread
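Keith's point can be sketched as a small model: when the driver's normal completion wins, the timeout worker's final reference drop only returns the tag and never re-runs the completion. All names below are illustrative, not blk-mq's real API; the refcount starts at 2 to stand for the request's base reference plus the one the timeout worker took.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative model of "the final put only releases the tag". */
struct rq_model {
    atomic_int ref;
    int completions;   /* how many times the completion handler ran */
    int tag_freed;     /* stands in for returning the tag */
};

/* Driver's normal completion path: completes once, then drops its ref. */
static void normal_complete(struct rq_model *rq)
{
    rq->completions++;
    if (atomic_fetch_sub(&rq->ref, 1) == 1)
        rq->tag_freed = 1;
}

/* Timeout worker's final put: no second completion, just the tag. */
static void timeout_put(struct rq_model *rq)
{
    if (atomic_fetch_sub(&rq->ref, 1) == 1)
        rq->tag_freed = 1;
}
```

Either ordering of the two puts yields exactly one completion and one tag release, which is the behaviour Keith is describing.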
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 14:20 ` Keith Busch @ 2018-05-22 14:37 ` Ming Lei -1 siblings, 0 replies; 128+ messages in thread From: Ming Lei @ 2018-05-22 14:37 UTC (permalink / raw) To: Keith Busch Cc: Ming Lei, Keith Busch, Jens Axboe, linux-block, Bart Van Assche, Christoph Hellwig, linux-nvme On Tue, May 22, 2018 at 10:20 PM, Keith Busch <keith.busch@linux.intel.com> wrote: > On Tue, May 22, 2018 at 10:49:11AM +0800, Ming Lei wrote: >> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote: >> > -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx, >> > +static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, >> > struct request *rq, void *priv, bool reserved) >> > { >> > + unsigned long *next = priv; >> > + >> > /* >> > - * We marked @rq->aborted_gstate and waited for RCU. If there were >> > - * completions that we lost to, they would have finished and >> > - * updated @rq->gstate by now; otherwise, the completion path is >> > - * now guaranteed to see @rq->aborted_gstate and yield. If >> > - * @rq->aborted_gstate still matches @rq->gstate, @rq is ours. >> > + * Just do a quick check if it is expired before locking the request in >> > + * so we're not unnecessarilly synchronizing across CPUs. >> > */ >> > - if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) && >> > - READ_ONCE(rq->gstate) == rq->aborted_gstate) >> > + if (!blk_mq_req_expired(rq, next)) >> > + return; >> > + >> > + /* >> > + * We have reason to believe the request may be expired. Take a >> > + * reference on the request to lock this request lifetime into its >> > + * currently allocated context to prevent it from being reallocated in >> > + * the event the completion by-passes this timeout handler. >> > + * >> > + * If the reference was already released, then the driver beat the >> > + * timeout handler to posting a natural completion. 
>> > + */ > >> > + if (!kref_get_unless_zero(&rq->ref)) > >> > + return; > >> > >> If this request is just completed in normal path and its state isn't > >> updated yet, timeout will hold the request, and may complete this > >> request again, then this req can be completed two times. > > > > Hi Ming, > > > > In the event the driver requests a normal completion, the timeout work > > releasing the last reference doesn't do a second completion: it only The reference only covers the request's lifetime; it is not related to completion, and it isn't the last reference. When the driver returns EH_HANDLED, blk-mq will complete this request one extra time. Yes, if the driver's timeout code and normal completion code can synchronize on this completion, that should be fine, but the current behaviour doesn't depend on the driver synchronizing, since the request is always completed atomically in the following way: 1) timeout if (mark_completed(rq)) timed_out(rq) 2) normal completion if (mark_completed(rq)) complete(rq) For example, just before nvme_timeout() tries to run nvme_dev_disable(), an irq comes in and this request is completed from the normal completion path; nvme_timeout() still returns EH_HANDLED, and blk-mq may complete the request one extra time because the normal completion path may not have updated the request's state yet. Thanks, Ming Lei ^ permalink raw reply [flat|nested] 128+ messages in thread
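The "mark_completed" pattern Ming describes is essentially an atomic state transition that only one path can win: both the timeout path and the normal completion path try to move the request from IN_FLIGHT to COMPLETE, and only the side whose compare-exchange succeeds is allowed to complete it. A minimal C11 model of that idea (names are illustrative, not the real blk-mq state machine):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

/* Only one caller can win the IN_FLIGHT -> COMPLETE transition;
 * the loser sees false and must not complete the request. */
static bool mark_completed(atomic_int *state)
{
    int expected = MQ_RQ_IN_FLIGHT;
    return atomic_compare_exchange_strong(state, &expected,
                                          MQ_RQ_COMPLETE);
}
```

Under this scheme a second completion attempt always fails the transition, which is why Ming argues the double-completion safety must not silently rely on the driver's own synchronization.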
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 14:37 ` Ming Lei @ 2018-05-22 14:46 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-22 14:46 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch, Bart Van Assche, Christoph Hellwig On Tue, May 22, 2018 at 10:37:32PM +0800, Ming Lei wrote: > On Tue, May 22, 2018 at 10:20 PM, Keith Busch > <keith.busch@linux.intel.com> wrote: > > In the event the driver requests a normal completion, the timeout work > > releasing the last reference doesn't do a second completion: it only > > The reference only covers request's lifetime, not related with completion. > > It isn't the last reference. When driver returns EH_HANDLED, blk-mq > will complete this request on extra time. > > Yes, if driver's timeout code and normal completion code can sync > about this completion, that should be fine, but the current behaviour > doesn't depend driver's sync since the req is always completed atomically > via the following way: > > 1) timeout > > if (mark_completed(rq)) > timed_out(rq) > > 2) normal completion > if (mark_completed(rq)) > complete(rq) > > For example, before nvme_timeout() is trying to run nvme_dev_disable(), > irq comes and this req is completed from normal completion path, but > nvme_timeout() still returns EH_HANDLED, and blk-mq may complete > the req one extra time since the normal completion path may not update > req's state yet. nvme_dev_disable tears down irqs, meaning their handling is already synced before nvme_dev_disable can proceed. Whether the completion comes through nvme_irq or through nvme_dev_disable, there is no way for nvme's timeout to return EH_HANDLED if the state was not updated prior to returning that status. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 14:46 ` Keith Busch @ 2018-05-22 14:57 ` Ming Lei -1 siblings, 0 replies; 128+ messages in thread From: Ming Lei @ 2018-05-22 14:57 UTC (permalink / raw) To: Keith Busch Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch, Bart Van Assche, Christoph Hellwig On Tue, May 22, 2018 at 10:46 PM, Keith Busch <keith.busch@linux.intel.com> wrote: > On Tue, May 22, 2018 at 10:37:32PM +0800, Ming Lei wrote: >> On Tue, May 22, 2018 at 10:20 PM, Keith Busch >> <keith.busch@linux.intel.com> wrote: >> > In the event the driver requests a normal completion, the timeout work >> > releasing the last reference doesn't do a second completion: it only >> >> The reference only covers request's lifetime, not related with completion. >> >> It isn't the last reference. When driver returns EH_HANDLED, blk-mq >> will complete this request on extra time. >> >> Yes, if driver's timeout code and normal completion code can sync >> about this completion, that should be fine, but the current behaviour >> doesn't depend driver's sync since the req is always completed atomically >> via the following way: >> >> 1) timeout >> >> if (mark_completed(rq)) >> timed_out(rq) >> >> 2) normal completion >> if (mark_completed(rq)) >> complete(rq) >> >> For example, before nvme_timeout() is trying to run nvme_dev_disable(), >> irq comes and this req is completed from normal completion path, but >> nvme_timeout() still returns EH_HANDLED, and blk-mq may complete >> the req one extra time since the normal completion path may not update >> req's state yet. > > nvme_dev_disable tears down irq's, meaing their handling is already > sync'ed before nvme_dev_disable can proceed. Whether the completion > comes through nvme_irq, or through nvme_dev_disable, there is no way > possible for nvme's timeout to return EH_HANDLED if the state was not > updated prior to returning that status. 
OK, that still depends on the driver's behaviour. Even though it is true for NVMe, you still have to audit other drivers for this synchronization, because your patch requires it. thanks, Ming Lei ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 14:57 ` Ming Lei @ 2018-05-22 15:01 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-22 15:01 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch, Bart Van Assche, Christoph Hellwig On Tue, May 22, 2018 at 10:57:40PM +0800, Ming Lei wrote: > OK, that still depends on driver's behaviour, even though it is true > for NVMe, you still have to audit other drivers about this sync > because it is required by your patch. Okay, forget about nvme for a moment here. Please run this thought experiment to decide if what you're saying is even plausible for any block driver with the existing implementation: Your scenario has a driver return EH_HANDLED for a timed out request. The timeout work then completes the request. The request may then be reallocated for a new command and dispatched. At approximately the same time, you're saying the driver that returned EH_HANDLED may then call blk_mq_complete_request() in reference to the *old* request that it returned EH_HANDLED for, incorrectly completing the new request before it is done. That will inevitably lead to data corruption. Is that happening today in any driver? ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 15:01 ` Keith Busch @ 2018-05-22 15:07 ` Ming Lei -1 siblings, 0 replies; 128+ messages in thread From: Ming Lei @ 2018-05-22 15:07 UTC (permalink / raw) To: Keith Busch Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch, Bart Van Assche, Christoph Hellwig On Tue, May 22, 2018 at 11:01 PM, Keith Busch <keith.busch@linux.intel.com> wrote: > On Tue, May 22, 2018 at 10:57:40PM +0800, Ming Lei wrote: >> OK, that still depends on driver's behaviour, even though it is true >> for NVMe, you still have to audit other drivers about this sync >> because it is required by your patch. > > Okay, forget about nvme for a moment here. Please run this thought > experiment to decide if what you're saying is even plausible for any > block driver with the existing implementation: > > Your scenario has a driver return EH_HANDLED for a timed out request. The > timeout work then completes the request. The request may then be > reallocated for a new command and dispatched. Yes. > > At approximately the same time, you're saying the driver that returned > EH_HANDLED may then call blk_mq_complete_request() in reference to the > *old* request that it returned EH_HANDLED for, incorrectly completing Because this request has already been completed by the blk-mq timeout, this old request won't be completed any more via blk_mq_complete_request(), either from the normal path or anywhere else, such as cancel. > the new request before it is done. That will inevitably lead to data > corruption. Is that happening today in any driver? No such issue, since the current blk-mq makes sure one req is only completed once, but your patch changes that to depend on the driver to make sure of it. thanks, Ming Lei ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 15:07 ` Ming Lei @ 2018-05-22 15:17 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-22 15:17 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch, Bart Van Assche, Christoph Hellwig On Tue, May 22, 2018 at 11:07:07PM +0800, Ming Lei wrote: > > At approximately the same time, you're saying the driver that returned > > EH_HANDLED may then call blk_mq_complete_request() in reference to the > > *old* request that it returned EH_HANDLED for, incorrectly completing > > Because this request has been completed above by blk-mq timeout, > then this old request won't be completed any more via blk_mq_complete_request() > either from normal path or what ever, such as cancel. > > the new request before it is done. That will inevitably lead to data > > corruption. Is that happening today in any driver? > > No such issue since current blk-mq makes sure one req is only completed > once, but your patch changes to depend on driver to make sure that. The blk-mq timeout complete makes the request available for allocation as a new command, at which point blk_mq_complete_request can be called again. If a driver is somehow relying on blk-mq to prevent a double completion for a previously completed request context, they're already in a lot of trouble. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 15:17 ` Keith Busch @ 2018-05-22 15:23 ` Ming Lei -1 siblings, 0 replies; 128+ messages in thread From: Ming Lei @ 2018-05-22 15:23 UTC (permalink / raw) To: Keith Busch Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch, Bart Van Assche, Christoph Hellwig On Tue, May 22, 2018 at 11:17 PM, Keith Busch <keith.busch@linux.intel.com> wrote: > On Tue, May 22, 2018 at 11:07:07PM +0800, Ming Lei wrote: >> > At approximately the same time, you're saying the driver that returned >> > EH_HANDLED may then call blk_mq_complete_request() in reference to the >> > *old* request that it returned EH_HANDLED for, incorrectly completing >> >> Because this request has been completed above by blk-mq timeout, >> then this old request won't be completed any more via blk_mq_complete_request() >> either from normal path or what ever, such as cancel. > >> > the new request before it is done. That will inevitably lead to data >> > corruption. Is that happening today in any driver? >> >> No such issue since current blk-mq makes sure one req is only completed >> once, but your patch changes to depend on driver to make sure that. > > The blk-mq timeout complete makes the request available for allocation > as a new command, at which point blk_mq_complete_request can be called > again. If a driver is somehow relying on blk-mq to prevent a double > completion for a previously completed request context, they're already > in a lot of trouble. Yes, previously there was the atomic flag REQ_ATOM_COMPLETE covering the atomic completion, and recently Tejun changed it to an aborted state with a generation counter, but both provide a sort of atomic completion. So even though it is much simplified by using the request refcount, atomic completion should be provided by blk-mq, or drivers have to be audited to avoid double completion. Thanks, Ming Lei ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-21 23:11 ` Keith Busch @ 2018-05-22 16:17 ` Christoph Hellwig -1 siblings, 0 replies; 128+ messages in thread From: Christoph Hellwig @ 2018-05-22 16:17 UTC (permalink / raw) To: Keith Busch Cc: Jens Axboe, linux-nvme, linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche Hi Keith, I like this series a lot. One comment that is probably close to the big discussion in the thread: > switch (ret) { > case BLK_EH_HANDLED: > /* > + * If the request is still in flight, the driver is requesting > + * blk-mq complete it. > */ > + if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT) > + __blk_mq_complete_request(req); > + break; The state check here really irked me, and from the thread it seems like I'm not the only one. At least for the NVMe case I think it is perfectly safe, although I agree I'd rather audit what other drivers do carefully. That being said I think BLK_EH_HANDLED seems like a fundamentally broken idea, and I'd actually prefer to get rid of it over adding things like the MQ_RQ_IN_FLIGHT check above. E.g. if we look at the cases where nvme-pci returns it: - if we did call nvme_dev_disable, we already canceled all requests, so we might as well just return BLK_EH_NOT_HANDLED - the poll for completion case already completed the command, so we should return BLK_EH_NOT_HANDLED So I think we need to fix up nvme and if needed any other driver to return the right value and then assert that the request is still in in-flight status for the BLK_EH_HANDLED case. > @@ -124,16 +119,7 @@ static inline int blk_mq_rq_state(struct request *rq) > static inline void blk_mq_rq_update_state(struct request *rq, > enum mq_rq_state state) > { > + WRITE_ONCE(rq->state, state); > } I think this helper can go away now. But we should have a comment near the state field documenting the concurrency implications. > + u64 state; This should probably be a mq_rq_state instead. 
Which means it needs to be moved to blkdev.h, but that should be ok. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 16:17 ` Christoph Hellwig @ 2018-05-23 0:34 ` Ming Lei -1 siblings, 0 replies; 128+ messages in thread From: Ming Lei @ 2018-05-23 0:34 UTC (permalink / raw) To: Christoph Hellwig Cc: Keith Busch, Jens Axboe, linux-nvme, linux-block, Bart Van Assche On Tue, May 22, 2018 at 06:17:04PM +0200, Christoph Hellwig wrote: > Hi Keith, > > I like this series a lot. One comment that is probably close > to the big discussion in the thread: > > > switch (ret) { > > case BLK_EH_HANDLED: > > /* > > + * If the request is still in flight, the driver is requesting > > + * blk-mq complete it. > > */ > > + if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT) > > + __blk_mq_complete_request(req); > > + break; > > The state check here really irked me, and from the thread it seems like > I'm not the only one. At least for the NVMe case I think it is perfectly > safe, although I agree I'd rather audit what other drivers do carefully. Let's consider the normal NVMe timeout code path: 1) one request is timed out; 2) controller is shutdown, this timed-out request is requeued from nvme_cancel_request(), but can't dispatch because queues are quiesced 3) reset is done from another context, and this request is dispatched again, and completed exactly before returning EH_HANDLED to blk-mq, but its state isn't updated to COMPLETE yet. 4) then double completions are done from both normal completion and timeout path. Seems same issue exists on poll path. Thanks, Ming ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-23 0:34 ` Ming Lei @ 2018-05-23 14:35 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-23 14:35 UTC (permalink / raw) To: Ming Lei Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Bart Van Assche, linux-block, linux-nvme On Wed, May 23, 2018 at 08:34:48AM +0800, Ming Lei wrote: > Let's consider the normal NVMe timeout code path: > > 1) one request is timed out; > > 2) controller is shutdown, this timed-out request is requeued from > nvme_cancel_request(), but can't dispatch because queues are quiesced > > 3) reset is done from another context, and this request is dispatched > again, and completed exactly before returning EH_HANDLED to blk-mq, but > its state isn't updated to COMPLETE yet. > > 4) then double completions are done from both normal completion and timeout > path. We're definitely fixing this, but I must admit that's an impressive cognitive traversal across 5 thread contexts to arrive at that race. :) ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-23 14:35 ` Keith Busch @ 2018-05-24 1:52 ` Ming Lei -1 siblings, 0 replies; 128+ messages in thread From: Ming Lei @ 2018-05-24 1:52 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Bart Van Assche, linux-block, linux-nvme On Wed, May 23, 2018 at 08:35:40AM -0600, Keith Busch wrote: > On Wed, May 23, 2018 at 08:34:48AM +0800, Ming Lei wrote: > > Let's consider the normal NVMe timeout code path: > > > > 1) one request is timed out; > > > > 2) controller is shutdown, this timed-out request is requeued from > > nvme_cancel_request(), but can't dispatch because queues are quiesced > > > > 3) reset is done from another context, and this request is dispatched > > again, and completed exactly before returning EH_HANDLED to blk-mq, but > > its state isn't updated to COMPLETE yet. > > > > 4) then double completions are done from both normal completion and timeout > > path. > > We're definitely fixing this, but I must admit that's an impressive > cognitive traversal across 5 thread contexts to arrive at that race. :) It can be only 2 thread contexts if requeue is done on polled request from nvme_timeout(), :-) Thanks, Ming ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-05-22 16:17 ` Christoph Hellwig @ 2018-05-23 5:48 ` Hannes Reinecke -1 siblings, 0 replies; 128+ messages in thread From: Hannes Reinecke @ 2018-05-23 5:48 UTC (permalink / raw) To: Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, linux-block, Ming Lei, Bart Van Assche On 05/22/2018 06:17 PM, Christoph Hellwig wrote: > Hi Keith, > > I like this series a lot. One comment that is probably close > to the big discussion in the thread: > >> switch (ret) { >> case BLK_EH_HANDLED: >> /* >> + * If the request is still in flight, the driver is requesting >> + * blk-mq complete it. >> */ >> + if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT) >> + __blk_mq_complete_request(req); >> + break; > > The state check here really irked me, and from the thread it seems like > I'm not the only one. At least for the NVMe case I think it is perfectly > safe, although I agree I'd rather audit what other drivers do carefully. > > That being said I think BLK_EH_HANDLED seems like a fundamentally broken > idea, and I'd actually prefer to get rid of it over adding things like > the MQ_RQ_IN_FLIGHT check above. > I can't agree more here. BLK_EH_HANDLED is breaking all sorts of assumptions, and I'd be glad to see it go. > E.g. if we look at the cases where nvme-pci returns it: > > - if we did call nvme_dev_disable, we already canceled all requests, > so we might as well just return BLK_EH_NOT_HANDLED > - the poll for completion case already completed the command, > so we should return BLK_EH_NOT_HANDLED > > So I think we need to fix up nvme and if needed any other driver > to return the right value and then assert that the request is > still in in-flight status for the BLK_EH_HANDLED case. > ... and then kill BLK_EH_HANDLED :-) Cheers, Hannes ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-12 18:16 ` Bart Van Assche 0 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-07-12 18:16 UTC (permalink / raw) On Mon, 2018-05-21@17:11 -0600, Keith Busch wrote: > /* > - * We marked @rq->aborted_gstate and waited for RCU. If there were > - * completions that we lost to, they would have finished and > - * updated @rq->gstate by now; otherwise, the completion path is > - * now guaranteed to see @rq->aborted_gstate and yield. If > - * @rq->aborted_gstate still matches @rq->gstate, @rq is ours. > + * Just do a quick check if it is expired before locking the request in > + * so we're not unnecessarilly synchronizing across CPUs. > */ > - if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) && > - READ_ONCE(rq->gstate) == rq->aborted_gstate) > + if (!blk_mq_req_expired(rq, next)) > + return; > + > + /* > + * We have reason to believe the request may be expired. Take a > + * reference on the request to lock this request lifetime into its > + * currently allocated context to prevent it from being reallocated in > + * the event the completion by-passes this timeout handler. > + * > + * If the reference was already released, then the driver beat the > + * timeout handler to posting a natural completion. > + */ > + if (!kref_get_unless_zero(&rq->ref)) > + return; > + > + /* > + * The request is now locked and cannot be reallocated underneath the > + * timeout handler's processing. Re-verify this exact request is truly > + * expired; if it is not expired, then the request was completed and > + * reallocated as a new request. > + */ > + if (blk_mq_req_expired(rq, next)) > blk_mq_rq_timed_out(rq, reserved); > + blk_mq_put_request(rq); > } Hello Keith and Christoph, What prevents that a request finishes and gets reused after the blk_mq_req_expired() call has finished and before kref_get_unless_zero() is called? 
Is this perhaps a race condition that has not yet been triggered by any existing block layer test? Please note that there is no such race condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling again" - https://www.spinics.net/lists/linux-block/msg26489.html). Thanks, Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-12 18:16 ` Bart Van Assche @ 2018-07-12 19:24 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-12 19:24 UTC (permalink / raw) To: Bart Van Assche Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei On Thu, Jul 12, 2018 at 06:16:12PM +0000, Bart Van Assche wrote: > On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote: > > /* > > - * We marked @rq->aborted_gstate and waited for RCU. If there were > > - * completions that we lost to, they would have finished and > > - * updated @rq->gstate by now; otherwise, the completion path is > > - * now guaranteed to see @rq->aborted_gstate and yield. If > > - * @rq->aborted_gstate still matches @rq->gstate, @rq is ours. > > + * Just do a quick check if it is expired before locking the request in > > + * so we're not unnecessarilly synchronizing across CPUs. > > */ > > - if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) && > > - READ_ONCE(rq->gstate) == rq->aborted_gstate) > > + if (!blk_mq_req_expired(rq, next)) > > + return; > > + > > + /* > > + * We have reason to believe the request may be expired. Take a > > + * reference on the request to lock this request lifetime into its > > + * currently allocated context to prevent it from being reallocated in > > + * the event the completion by-passes this timeout handler. > > + * > > + * If the reference was already released, then the driver beat the > > + * timeout handler to posting a natural completion. > > + */ > > + if (!kref_get_unless_zero(&rq->ref)) > > + return; > > + > > + /* > > + * The request is now locked and cannot be reallocated underneath the > > + * timeout handler's processing. Re-verify this exact request is truly > > + * expired; if it is not expired, then the request was completed and > > + * reallocated as a new request. 
> > + */ > > + if (blk_mq_req_expired(rq, next)) > > blk_mq_rq_timed_out(rq, reserved); > > + blk_mq_put_request(rq); > > } > > Hello Keith and Christoph, > > What prevents that a request finishes and gets reused after the > blk_mq_req_expired() call has finished and before kref_get_unless_zero() is > called? Is this perhaps a race condition that has not yet been triggered by > any existing block layer test? Please note that there is no such race > condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling > again" - https://www.spinics.net/lists/linux-block/msg26489.html). I don't think there's any such race in the merged implementation either. If the request is reallocated, then the kref check may succeed, but the blk_mq_req_expired() check would surely fail, allowing the request to proceed as normal. The code comments at least say as much. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-12 22:24 ` Bart Van Assche 0 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-07-12 22:24 UTC (permalink / raw) On Thu, 2018-07-12@13:24 -0600, Keith Busch wrote: > On Thu, Jul 12, 2018@06:16:12PM +0000, Bart Van Assche wrote: > > What prevents that a request finishes and gets reused after the > > blk_mq_req_expired() call has finished and before kref_get_unless_zero() is > > called? Is this perhaps a race condition that has not yet been triggered by > > any existing block layer test? Please note that there is no such race > > condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling > > again" - https://www.spinics.net/lists/linux-block/msg26489.html). > > I don't think there's any such race in the merged implementation > either. If the request is reallocated, then the kref check may succeed, > but the blk_mq_req_expired() check would surely fail, allowing the > request to proceed as normal. The code comments at least say as much. Hello Keith, Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a request completion was reported after request timeout processing had started, completion handling was skipped. The following code in blk_mq_complete_request() realized that: if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) __blk_mq_complete_request(rq); Since commit 12f5b9314545, if a completion occurs after request timeout processing has started, that completion is processed if the request has the state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request state unless the block driver timeout handler modifies it, e.g. by calling blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical behavior of scsi_times_out() is to queue sending of a SCSI abort and hence not to change the request state immediately. 
In other words, if a request completion occurs during or shortly after a timeout occurred then blk_mq_complete_request() will call __blk_mq_complete_request() and will complete the request, although that is not allowed because timeout handling has already started. Do you agree with this analysis? Thanks, Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-12 22:24 ` Bart Van Assche @ 2018-07-13 1:12 ` jianchao.wang -1 siblings, 0 replies; 128+ messages in thread From: jianchao.wang @ 2018-07-13 1:12 UTC (permalink / raw) To: Bart Van Assche, keith.busch Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei On 07/13/2018 06:24 AM, Bart Van Assche wrote: > On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote: >> On Thu, Jul 12, 2018 at 06:16:12PM +0000, Bart Van Assche wrote: >>> What prevents that a request finishes and gets reused after the >>> blk_mq_req_expired() call has finished and before kref_get_unless_zero() is >>> called? Is this perhaps a race condition that has not yet been triggered by >>> any existing block layer test? Please note that there is no such race >>> condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling >>> again" - https://www.spinics.net/lists/linux-block/msg26489.html). >> >> I don't think there's any such race in the merged implementation >> either. If the request is reallocated, then the kref check may succeed, >> but the blk_mq_req_expired() check would surely fail, allowing the >> request to proceed as normal. The code comments at least say as much. > > Hello Keith, > > Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a > request completion was reported after request timeout processing had > started, completion handling was skipped. 
The following code in > blk_mq_complete_request() realized that: > > if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) > __blk_mq_complete_request(rq); > > Since commit 12f5b9314545, if a completion occurs after request timeout > processing has started, that completion is processed if the request has the > state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request > state unless the block driver timeout handler modifies it, e.g. by calling > blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical > behavior of scsi_times_out() is to queue sending of a SCSI abort and hence > not to change the request state immediately. In other words, if a request > completion occurs during or shortly after a timeout occurred then > blk_mq_complete_request() will call __blk_mq_complete_request() and will > complete the request, although that is not allowed because timeout handling > has already started. Do you agree with this analysis? > Oh, thanks gods for hearing Bart said this. I was always saying the same thing in the mail https://marc.info/?l=linux-block&m=152950093831738&w=2 Even though my voice is tiny, I support Bart's point definitely. Thanks Jianchao > Thanks, > > Bart. > > > > > > > > > ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-12 22:24 ` Bart Van Assche @ 2018-07-13 2:40 ` jianchao.wang -1 siblings, 0 replies; 128+ messages in thread From: jianchao.wang @ 2018-07-13 2:40 UTC (permalink / raw) To: Bart Van Assche, keith.busch Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei On 07/13/2018 06:24 AM, Bart Van Assche wrote: > Hello Keith, > > Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a > request completion was reported after request timeout processing had > started, completion handling was skipped. The following code in > blk_mq_complete_request() realized that: > > if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) > __blk_mq_complete_request(rq); Even if before tejun's patch, we also have this for both blk-mq and blk-legacy code. blk_rq_check_expired if (time_after_eq(jiffies, rq->deadline)) { list_del_init(&rq->timeout_list); /* * Check if we raced with end io completion */ if (!blk_mark_rq_complete(rq)) blk_rq_timed_out(rq); } blk_complete_request if (!blk_mark_rq_complete(req)) __blk_complete_request(req); blk_mq_check_expired if (time_after_eq(jiffies, rq->deadline)) { if (!blk_mark_rq_complete(rq)) blk_mq_rq_timed_out(rq, reserved); } blk_mq_complete_request if (!blk_mark_rq_complete(rq)) __blk_mq_complete_request(rq); Thanks Jianchao > > Since commit 12f5b9314545, if a completion occurs after request timeout > processing has started, that completion is processed if the request has the > state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request > state unless the block driver timeout handler modifies it, e.g. by calling > blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical > behavior of scsi_times_out() is to queue sending of a SCSI abort and hence > not to change the request state immediately. 
In other words, if a request > completion occurs during or shortly after a timeout occurred then > blk_mq_complete_request() will call __blk_mq_complete_request() and will > complete the request, although that is not allowed because timeout handling > has already started. Do you agree with this analysis? ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-12 22:24 ` Bart Van Assche @ 2018-07-13 15:43 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-13 15:43 UTC (permalink / raw) To: Bart Van Assche Cc: axboe, linux-block, linux-nvme, ming.lei, keith.busch, hch On Thu, Jul 12, 2018 at 10:24:42PM +0000, Bart Van Assche wrote: > Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a > request completion was reported after request timeout processing had > started, completion handling was skipped. The following code in > blk_mq_complete_request() realized that: > > if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) > __blk_mq_complete_request(rq); > > Since commit 12f5b9314545, if a completion occurs after request timeout > processing has started, that completion is processed if the request has the > state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request > state unless the block driver timeout handler modifies it, e.g. by calling > blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical > behavior of scsi_times_out() is to queue sending of a SCSI abort and hence > not to change the request state immediately. In other words, if a request > completion occurs during or shortly after a timeout occurred then > blk_mq_complete_request() will call __blk_mq_complete_request() and will > complete the request, although that is not allowed because timeout handling > has already started. Do you agree with this analysis? Yes, it's different, and that was the whole point. No one made that a secret either. Are you saying you want timeout software to take priority over handling hardware events? ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-13 15:52 ` Bart Van Assche 0 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-07-13 15:52 UTC (permalink / raw) On Fri, 2018-07-13@09:43 -0600, Keith Busch wrote: > On Thu, Jul 12, 2018@10:24:42PM +0000, Bart Van Assche wrote: > > Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a > > request completion was reported after request timeout processing had > > started, completion handling was skipped. The following code in > > blk_mq_complete_request() realized that: > > > > if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) > > __blk_mq_complete_request(rq); > > > > Since commit 12f5b9314545, if a completion occurs after request timeout > > processing has started, that completion is processed if the request has the > > state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request > > state unless the block driver timeout handler modifies it, e.g. by calling > > blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical > > behavior of scsi_times_out() is to queue sending of a SCSI abort and hence > > not to change the request state immediately. In other words, if a request > > completion occurs during or shortly after a timeout occurred then > > blk_mq_complete_request() will call __blk_mq_complete_request() and will > > complete the request, although that is not allowed because timeout handling > > has already started. Do you agree with this analysis? > > Yes, it's different, and that was the whole point. No one made that a > secret either. Are you saying you want timeout software to take priority > over handling hardware events? No. What I'm saying is that a behavior change has been introduced in the block layer that was not documented in the patch description. I think that behavior change even can trigger a kernel crash. Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
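The race Bart describes can be sketched in a few lines outside the kernel. The model below is hypothetical (C11 atomics stand in for the kernel's cmpxchg(), and the types are simplified stand-ins for blk-mq's): because blk_mq_rq_timed_out() leaves the state at MQ_RQ_IN_FLIGHT, a completion arriving after timeout handling has started still transitions the request to MQ_RQ_COMPLETE and is processed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-ins for blk-mq's request states and struct request. */
enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

struct request { _Atomic int state; };

/*
 * Model of blk_mq_complete_request() after commit 12f5b9314545: the
 * completion is processed iff the request atomically moves from
 * MQ_RQ_IN_FLIGHT to MQ_RQ_COMPLETE.
 */
static bool complete_request(struct request *rq)
{
	int expected = MQ_RQ_IN_FLIGHT;

	return atomic_compare_exchange_strong(&rq->state, &expected,
					      MQ_RQ_COMPLETE);
}

/*
 * Model of the timeout path in question: like scsi_times_out() queueing
 * an abort, it starts error handling without touching rq->state.
 */
static void timeout_started(struct request *rq)
{
	(void)rq;	/* state intentionally left at MQ_RQ_IN_FLIGHT */
}
```

With the old gstate comparison, a completion in this window would have been skipped; here nothing stops it, which is the undocumented behavior change being debated.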
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-13 15:52 ` Bart Van Assche @ 2018-07-13 18:47 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-13 18:47 UTC (permalink / raw) To: Bart Van Assche Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei On Fri, Jul 13, 2018 at 03:52:38PM +0000, Bart Van Assche wrote: > No. What I'm saying is that a behavior change has been introduced in the block > layer that was not documented in the patch description. Did you read the changelog? > I think that behavior change even can trigger a kernel crash. I think we are past acknowledging issues exist with timeouts. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-13 18:47 ` Keith Busch @ 2018-07-13 23:03 ` Bart Van Assche -1 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-07-13 23:03 UTC (permalink / raw) To: keith.busch; +Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei On Fri, 2018-07-13 at 12:47 -0600, Keith Busch wrote: > On Fri, Jul 13, 2018 at 03:52:38PM +0000, Bart Van Assche wrote: > > I think that behavior change even can trigger a kernel crash. > > I think we are past acknowledging issues exist with timeouts. Hello Keith, How do you want to go forward from here? Do you prefer the approach of the patch I had posted (https://www.spinics.net/lists/linux-block/msg26489.html), Jianchao's approach (https://marc.info/?l=linux-block&m=152950093831738) or perhaps yet another approach? Note: I think Jianchao's patch is a good start but also that it needs further improvement. Thanks, Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-13 23:03 ` Bart Van Assche @ 2018-07-13 23:58 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-13 23:58 UTC (permalink / raw) To: Bart Van Assche Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei On Fri, Jul 13, 2018 at 11:03:18PM +0000, Bart Van Assche wrote: > How do you want to go forward from here? Do you prefer the approach of the > patch I had posted (https://www.spinics.net/lists/linux-block/msg26489.html), > Jianchao's approach (https://marc.info/?l=linux-block&m=152950093831738) or > perhaps yet another approach? Note: I think Jianchao's patch is a good start > but also that it needs further improvement. Of the two you mentioned, yours is preferable IMO. While I appreciate Jianchao's detailed analysis, it's hard to take a proposal seriously that so colourfully calls everyone else "dangerous" while advocating for silently losing requests on purpose. But where's the option that fixes scsi to handle hardware completions concurrently with arbitrary timeout software? Propping up that house of cards can't be the only recourse. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-13 23:58 ` Keith Busch @ 2018-07-18 19:56 ` hch -1 siblings, 0 replies; 128+ messages in thread From: hch @ 2018-07-18 19:56 UTC (permalink / raw) To: Keith Busch Cc: Bart Van Assche, hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote: > Of the two you mentioned, yours is preferable IMO. While I appreciate > Jianchao's detailed analysis, it's hard to take a proposal seriously > that so colourfully calls everyone else "dangerous" while advocating > for silently losing requests on purpose. > > But where's the option that fixes scsi to handle hardware completions > concurrently with arbitrary timeout software? Propping up that house of > cards can't be the only recourse. The important bit is that we need to fix this issue quickly. We are past -rc5 so I'm rather concerned about anything too complicated. I'm not even sure SCSI has a problem with multiple completions happening at the same time, but it certainly has a problem with bypassing blk_mq_complete_request from the EH path. I think we can solve this properly, but I also think we are way too late in the 4.18 cycle to fix it properly. For now I fear we'll just have to revert the changes and try again for 4.19 or even 4.20 if we don't act quickly enough. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-18 19:56 ` hch @ 2018-07-18 20:39 ` hch -1 siblings, 0 replies; 128+ messages in thread From: hch @ 2018-07-18 20:39 UTC (permalink / raw) To: Keith Busch Cc: Bart Van Assche, hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei, jianchao.w.wang On Wed, Jul 18, 2018 at 09:56:50PM +0200, hch@lst.de wrote: > On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote: > > Of the two you mentioned, yours is preferable IMO. While I appreciate > > Jianchao's detailed analysis, it's hard to take a proposal seriously > > that so colourfully calls everyone else "dangerous" while advocating > > for silently losing requests on purpose. > > > > But where's the option that fixes scsi to handle hardware completions > > concurrently with arbitrary timeout software? Propping up that house of > > cards can't be the only recourse. > > The important bit is that we need to fix this issue quickly. We are > past -rc5 so I'm rather concerned about anything too complicated. > > I'm not even sure SCSI has a problem with multiple completions happening > at the same time, but it certainly has a problem with bypassing > blk_mq_complete_request from the EH path. > > I think we can solve this properly, but I also think we are way to late > in the 4.18 cycle to fix it properly. For now I fear we'll just have > to revert the changes and try again for 4.19 or even 4.20 if we don't > act quickly enough. So here is a quick attempt at the revert while also trying to keep nvme working. Keith, Bart, Jianchao - does this look reasonable as a 4.18 band aid? http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-revert ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-18 20:39 ` hch @ 2018-07-18 21:05 ` Bart Van Assche -1 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-07-18 21:05 UTC (permalink / raw) To: hch, keith.busch Cc: jianchao.w.wang, keith.busch, linux-block, linux-nvme, axboe, ming.lei On Wed, 2018-07-18 at 22:39 +0200, hch@lst.de wrote: > On Wed, Jul 18, 2018 at 09:56:50PM +0200, hch@lst.de wrote: > > The important bit is that we need to fix this issue quickly. We are > > past -rc5 so I'm rather concerned about anything too complicated. > > > > I'm not even sure SCSI has a problem with multiple completions happening > > at the same time, but it certainly has a problem with bypassing > > blk_mq_complete_request from the EH path. > > > > I think we can solve this properly, but I also think we are way to late > > in the 4.18 cycle to fix it properly. For now I fear we'll just have > > to revert the changes and try again for 4.19 or even 4.20 if we don't > > act quickly enough. > > So here is a quick attempt at the revert while also trying to keep > nvme working. Keith, Bart, Jianchao - does this looks reasonable > as a 4.18 band aid? > > http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-revert Hello Christoph, A patch series that first reverts the following patches: * blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE * block: fix timeout changes for legacy request drivers * blk-mq: don't time out requests again that are in the timeout handler * blk-mq: simplify blk_mq_rq_timed_out * block: remove BLK_EH_HANDLED * block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE * blk-mq: Remove generation seqeunce and next renames BLK_EH_NOT_HANDLED again into BLK_EH_DONE would probably be a lot easier to review. Thanks, Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-18 20:39 ` hch @ 2018-07-18 22:53 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-18 22:53 UTC (permalink / raw) To: hch Cc: Bart Van Assche, keith.busch, linux-block, linux-nvme, axboe, ming.lei, jianchao.w.wang On Wed, Jul 18, 2018 at 10:39:36PM +0200, hch@lst.de wrote: > On Wed, Jul 18, 2018 at 09:56:50PM +0200, hch@lst.de wrote: > > On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote: > > > Of the two you mentioned, yours is preferable IMO. While I appreciate > > > Jianchao's detailed analysis, it's hard to take a proposal seriously > > > that so colourfully calls everyone else "dangerous" while advocating > > > for silently losing requests on purpose. > > > > > > But where's the option that fixes scsi to handle hardware completions > > > concurrently with arbitrary timeout software? Propping up that house of > > > cards can't be the only recourse. > > > > The important bit is that we need to fix this issue quickly. We are > > past -rc5 so I'm rather concerned about anything too complicated. > > > > I'm not even sure SCSI has a problem with multiple completions happening > > at the same time, but it certainly has a problem with bypassing > > blk_mq_complete_request from the EH path. > > > > I think we can solve this properly, but I also think we are way to late > > in the 4.18 cycle to fix it properly. For now I fear we'll just have > > to revert the changes and try again for 4.19 or even 4.20 if we don't > > act quickly enough. > > So here is a quick attempt at the revert while also trying to keep > nvme working. Keith, Bart, Jianchao - does this looks reasonable > as a 4.18 band aid? > > http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-revert Hm, I'm not really a fan.
The vast majority of blk-mq drivers don't even implement a timeout, and reverting it will lose their requests forever if they complete at the same time as a timeout. Of the remaining drivers, most of those don't want the reverted behavior either. It actually looks like scsi is the only mq driver that wants to block completions. In the short term, scsi can make that happen with just three lines of code.

---
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..03986af3076c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -286,6 +286,10 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 	enum blk_eh_timer_return rtn = BLK_EH_DONE;
 	struct Scsi_Host *host = scmd->device->host;
 
+	if (req->q->mq_ops && cmpxchg(&req->state, MQ_RQ_IN_FLIGHT,
+				MQ_RQ_COMPLETE) != MQ_RQ_IN_FLIGHT)
+		return rtn;
+
 	trace_scsi_dispatch_cmd_timeout(scmd);
 	scsi_log_completion(scmd, TIMEOUT_ERROR);
--

^ permalink raw reply related [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-18 19:56 ` hch @ 2018-07-18 20:53 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-18 20:53 UTC (permalink / raw) To: hch Cc: Bart Van Assche, keith.busch, linux-block, linux-nvme, axboe, ming.lei On Wed, Jul 18, 2018 at 09:56:50PM +0200, hch@lst.de wrote: > On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote: > > Of the two you mentioned, yours is preferable IMO. While I appreciate > > Jianchao's detailed analysis, it's hard to take a proposal seriously > > that so colourfully calls everyone else "dangerous" while advocating > > for silently losing requests on purpose. > > > > But where's the option that fixes scsi to handle hardware completions > > concurrently with arbitrary timeout software? Propping up that house of > > cards can't be the only recourse. > > The important bit is that we need to fix this issue quickly. We are > past -rc5 so I'm rather concerned about anything too complicated. > > I'm not even sure SCSI has a problem with multiple completions happening > at the same time, but it certainly has a problem with bypassing > blk_mq_complete_request from the EH path. > > I think we can solve this properly, but I also think we are way to late > in the 4.18 cycle to fix it properly. For now I fear we'll just have > to revert the changes and try again for 4.19 or even 4.20 if we don't > act quickly enough. If scsi needs this behavior, why not just put that behavior in scsi? It can set the state to complete and then everything can play out as before. 
---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 22326612a5d3..f50559718b71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -558,10 +558,8 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	if (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
-			MQ_RQ_IN_FLIGHT)
+	if (blk_mq_mark_complete(rq))
 		return;
-
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..a5d05fab24a7 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -286,6 +286,14 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 	enum blk_eh_timer_return rtn = BLK_EH_DONE;
 	struct Scsi_Host *host = scmd->device->host;
 
+	/*
+	 * Mark complete now so lld can't post a completion during error
+	 * handling. Return immediately if it was already marked complete, as
+	 * that means the lld posted a completion already.
+	 */
+	if (req->q->mq_ops && blk_mq_mark_complete(req))
+		return rtn;
+
 	trace_scsi_dispatch_cmd_timeout(scmd);
 	scsi_log_completion(scmd, TIMEOUT_ERROR);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 9b0fd11ce89a..0ce587c9c27b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -289,6 +289,15 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
+/*
+ * Returns true if request is not in flight.
+ */
+static inline bool blk_mq_mark_complete(struct request *rq)
+{
+	return (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
+			MQ_RQ_IN_FLIGHT);
+}
+
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
--

^ permalink raw reply related [flat|nested] 128+ messages in thread
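To make the claim semantics of the proposed blk_mq_mark_complete() concrete, here is a userspace sketch (C11 atomics instead of the kernel's cmpxchg(); complete_side() and timeout_side() are hypothetical names for the two call sites, not kernel functions): whichever side marks the request complete first wins, and the loser observes that and backs off.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

struct request { _Atomic int state; };

/*
 * Userspace stand-in for the proposed helper: returns true if the
 * request was no longer in flight, i.e. the other side already
 * claimed it.
 */
static bool blk_mq_mark_complete(struct request *rq)
{
	int expected = MQ_RQ_IN_FLIGHT;

	return !atomic_compare_exchange_strong(&rq->state, &expected,
					       MQ_RQ_COMPLETE);
}

/* Completion path, as in __blk_mq_complete_request(): returns true if
 * this call actually completed the request. */
static bool complete_side(struct request *rq)
{
	return !blk_mq_mark_complete(rq);
}

/* Timeout path, as in the scsi_times_out() hunk: returns true if it
 * claimed the request and error handling may proceed, false if the
 * lld already posted a completion and the handler must back off. */
static bool timeout_side(struct request *rq)
{
	return !blk_mq_mark_complete(rq);
}
```

Either ordering resolves to exactly one winner through a single atomic transition, so neither path needs any further synchronization with the other.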
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-18 20:53 ` Keith Busch @ 2018-07-18 20:58 ` Bart Van Assche -1 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-07-18 20:58 UTC (permalink / raw) To: hch, keith.busch; +Cc: keith.busch, linux-block, linux-nvme, axboe, ming.lei On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote: > If scsi needs this behavior, why not just put that behavior in scsi? It > can set the state to complete and then everything can play out as > before. > [ ... ] There may be other drivers that need the same protection the SCSI core needs so I think the patch at the end of your previous e-mail is a step in the wrong direction. Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-18 20:58 ` Bart Van Assche @ 2018-07-18 21:17 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-18 21:17 UTC (permalink / raw) To: Bart Van Assche Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei On Wed, Jul 18, 2018 at 08:58:48PM +0000, Bart Van Assche wrote: > On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote: > > If scsi needs this behavior, why not just put that behavior in scsi? It > > can set the state to complete and then everything can play out as > > before. > > [ ... ] > > There may be other drivers that need the same protection the SCSI core needs > so I think the patch at the end of your previous e-mail is a step in the wrong > direction. > > Bart. And there may be other drivers that don't want their completions ignored, so breaking them again is also a step in the wrong direction. There are not that many blk-mq drivers, so we can go through them all. Most don't even implement .timeout, so they never know that condition ever happened. Others always return BLK_EH_RESET_TIMER without doing anything else, so the 'new' behavior would have to be better for those, too. The following don't implement .timeout: loop, rbd, virtio, xen, dm, ubi, scm The following always return RESET_TIMER: null, skd The following is safe to the new way: mtip And now ones I am not sure about: nbd, mmc, dasd I don't know, reverting looks worse than just fixing the drivers. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-18 21:17 ` Keith Busch 0 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-18 21:17 UTC (permalink / raw) On Wed, Jul 18, 2018@08:58:48PM +0000, Bart Van Assche wrote: > On Wed, 2018-07-18@14:53 -0600, Keith Busch wrote: > > If scsi needs this behavior, why not just put that behavior in scsi? It > > can set the state to complete and then everything can play out as > > before. > > [ ... ] > > There may be other drivers that need the same protection the SCSI core needs > so I think the patch at the end of your previous e-mail is a step in the wrong > direction. > > Bart. And there may be other drivers that don't want their completions ignored, so breaking them again is also a step in the wrong direction. There are not that many blk-mq drivers, so we can go through them all. Most don't even implement .timeout, so they never know that condition ever happened. Others always return BLK_EH_RESET_TIMER without doing anything else, so the 'new' behavior would have to be better for those, too. The following don't implement .timeout: loop, rbd, virtio, xen, dm, ubi, scm The following always return RESET_TIMER: null, skd The following is safe with the new way: mtip And now ones I am not sure about: nbd, mmc, dasd I don't know, reverting looks worse than just fixing the drivers. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-18 21:17 ` Keith Busch @ 2018-07-18 21:30 ` Bart Van Assche -1 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-07-18 21:30 UTC (permalink / raw) To: keith.busch; +Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote: > There are not that many blk-mq drivers, so we can go through them all. I'm not sure that's the right approach. I think it is the responsibility of the block layer to handle races between completion handling and timeout handling and that this is not the responsibility of e.g. a block driver. If you look at e.g. the legacy block layer then you will see that it takes care of this race and that legacy block drivers do not have to worry about this race. Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-18 21:30 ` Bart Van Assche 0 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-07-18 21:30 UTC (permalink / raw) On Wed, 2018-07-18@15:17 -0600, Keith Busch wrote: > There are not that many blk-mq drivers, so we can go through them all. I'm not sure that's the right approach. I think it is the responsibility of the block layer to handle races between completion handling and timeout handling and that this is not the responsibility of e.g. a block driver. If you look at e.g. the legacy block layer then you will see that it takes care of this race and that legacy block drivers do not have to worry about this race. Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-18 21:30 ` Bart Van Assche @ 2018-07-18 21:33 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-18 21:33 UTC (permalink / raw) To: Bart Van Assche Cc: hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei On Wed, Jul 18, 2018 at 09:30:11PM +0000, Bart Van Assche wrote: > On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote: > > There are not that many blk-mq drivers, so we can go through them all. > > I'm not sure that's the right approach. I think it is the responsibility of > the block layer to handle races between completion handling and timeout > handling and that this is not the responsibility of e.g. a block driver. If > you look at e.g. the legacy block layer then you will see that it takes care > of this race and that legacy block drivers do not have to worry about this > race. Reverting doesn't handle the race at all. It just ignores completions and puts the responsibility on the drivers to handle the race because that's what scsi wants to happen. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-18 21:33 ` Keith Busch 0 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-18 21:33 UTC (permalink / raw) On Wed, Jul 18, 2018@09:30:11PM +0000, Bart Van Assche wrote: > On Wed, 2018-07-18@15:17 -0600, Keith Busch wrote: > > There are not that many blk-mq drivers, so we can go through them all. > > I'm not sure that's the right approach. I think it is the responsibility of > the block layer to handle races between completion handling and timeout > handling and that this is not the responsibility of e.g. a block driver. If > you look at e.g. the legacy block layer then you will see that it takes care > of this race and that legacy block drivers do not have to worry about this > race. Reverting doesn't handle the race at all. It just ignores completions and puts the responsibility on the drivers to handle the race because that's what scsi wants to happen. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-18 21:17 ` Keith Busch @ 2018-07-19 13:19 ` hch -1 siblings, 0 replies; 128+ messages in thread From: hch @ 2018-07-19 13:19 UTC (permalink / raw) To: Keith Busch Cc: Bart Van Assche, hch, keith.busch, linux-block, linux-nvme, axboe, ming.lei On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote: > And there may be other drivers that don't want their completions > ignored, so breaking them again is also a step in the wrong direction. > > There are not that many blk-mq drivers, so we can go through them all. I think the point is that SCSI is the biggest user by both the number of low-level drivers sitting under the midlayer, and also by usage. We need to be very careful not to break it. Note that this doesn't mean that I don't want to eventually move away from just ignoring completions in timeout state for SCSI. I'd just rather revert 4.18 to a clean known state instead of doctoring around late in the rc phase. > Most don't even implement .timeout, so they never know that condition > ever happened. Others always return BLK_EH_RESET_TIMER without doing > anything else, so the 'new' behavior would have to be better for those, > too. And we should never even hit the timeout handler for those as it is rather pointless (although it looks we currently do..). ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-19 13:19 ` hch 0 siblings, 0 replies; 128+ messages in thread From: hch @ 2018-07-19 13:19 UTC (permalink / raw) On Wed, Jul 18, 2018@03:17:11PM -0600, Keith Busch wrote: > And there may be other drivers that don't want their completions > ignored, so breaking them again is also a step in the wrong direction. > > There are not that many blk-mq drivers, so we can go through them all. I think the point is that SCSI is the biggest user by both the number of low-level drivers sitting under the midlayer, and also by usage. We need to be very careful not to break it. Note that this doesn't mean that I don't want to eventually move away from just ignoring completions in timeout state for SCSI. I'd just rather revert 4.18 to a clean known state instead of doctoring around late in the rc phase. > Most don't even implement .timeout, so they never know that condition > ever happened. Others always return BLK_EH_RESET_TIMER without doing > anything else, so the 'new' behavior would have to be better for those, > too. And we should never even hit the timeout handler for those as it is rather pointless (although it looks we currently do..). ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-19 13:19 ` hch @ 2018-07-19 14:59 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-19 14:59 UTC (permalink / raw) To: hch Cc: axboe, linux-block, linux-nvme, ming.lei, keith.busch, Bart Van Assche On Thu, Jul 19, 2018 at 03:19:04PM +0200, hch@lst.de wrote: > On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote: > > And there may be other drivers that don't want their completions > > ignored, so breaking them again is also a step in the wrong direction. > > > > There are not that many blk-mq drivers, so we can go through them all. > > I think the point is that SCSI is the biggest user by both the number > of low-level drivers sitting under the midlayer, and also by usage. > > We need to be very careful not to break it. Note that this doesn't > mean that I don't want to eventually move away from just ignoring > completions in timeout state for SCSI. I'd just rather revert 4.18 > to a clean known state instead of doctoring around late in the rc > phase. I definitely do not want to break scsi. I just don't want to break everyone else either, and I think scsi can get the behavior it wants without forcing others to subscribe to it. > > Most don't even implement .timeout, so they never know that condition > > ever happened. Others always return BLK_EH_RESET_TIMER without doing > > anything else, so the 'new' behavior would have to be better for those, > > too. > > And we should never even hit the timeout handler for those as it > is rather pointless (although it looks we currently do..). I don't see why we'd expect to never hit timeout for at least some of these. It's not a stretch to see, for example, that virtio-blk or loop could have their requests lost with no way to recover if we revert. I've wasted too much time debugging hardware for such lost commands when it was in fact functioning perfectly fine. So reintroducing that behavior is a bit distressing.
^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-19 14:59 ` Keith Busch 0 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-19 14:59 UTC (permalink / raw) On Thu, Jul 19, 2018@03:19:04PM +0200, hch@lst.de wrote: > On Wed, Jul 18, 2018@03:17:11PM -0600, Keith Busch wrote: > > And there may be other drivers that don't want their completions > > ignored, so breaking them again is also a step in the wrong direction. > > > > There are not that many blk-mq drivers, so we can go through them all. > > I think the point is that SCSI is the biggest user by both the number > of low-level drivers sitting under the midlayer, and also by usage. > > We need to be very careful not to break it. Note that this doesn't > mean that I don't want to eventually move away from just ignoring > completions in timeout state for SCSI. I'd just rather revert 4.18 > to a clean known state instead of doctoring around late in the rc > phase. I definitely do not want to break scsi. I just don't want to break everyone else either, and I think scsi can get the behavior it wants without forcing others to subscribe to it. > > Most don't even implement .timeout, so they never know that condition > > ever happened. Others always return BLK_EH_RESET_TIMER without doing > > anything else, so the 'new' behavior would have to be better for those, > > too. > > And we should never even hit the timeout handler for those as it > is rather pointless (although it looks we currently do..). I don't see why we'd expect to never hit timeout for at least some of these. It's not a stretch to see, for example, that virtio-blk or loop could have their requests lost with no way to recover if we revert. I've wasted too much time debugging hardware for such lost commands when it was in fact functioning perfectly fine. So reintroducing that behavior is a bit distressing. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-19 14:59 ` Keith Busch @ 2018-07-19 15:56 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-19 15:56 UTC (permalink / raw) To: hch Cc: axboe, keith.busch, linux-nvme, ming.lei, linux-block, Bart Van Assche On Thu, Jul 19, 2018 at 08:59:31AM -0600, Keith Busch wrote: > > And we should never even hit the timeout handler for those as it > > is rather pointless (although it looks we currently do..). > > I don't see why we'd expect to never hit timeout for at least some of > these. It's not a stretch to see, for example, that virtio-blk or loop > could have their requests lost with no way to recover if we revert. I've > wasted too much time debugging hardware for such lost commands when it > was in fact functioning perfectly fine. So reintroducing that behavior > is a bit distressing. Even some scsi drivers are susceptible to losing their requests with the reverted behavior: take virtio-scsi for example, which returns RESET_TIMER from its timeout handler. With the behavior everyone seems to want, a natural completion at or around the same time is lost forever because it was blocked from completion with no way to recover. While the timing for when requests may be lost is quite narrow, I've seen it enough with very difficult-to-reproduce scenarios that hardware devs no longer trust that IO timeouts are their problem because Linux loses their completions. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-19 15:56 ` Keith Busch 0 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-19 15:56 UTC (permalink / raw) On Thu, Jul 19, 2018@08:59:31AM -0600, Keith Busch wrote: > > And we should never even hit the timeout handler for those as it > > is rather pointless (although it looks we currently do..). > > I don't see why we'd expect to never hit timeout for at least some of > these. It's not a stretch to see, for example, that virtio-blk or loop > could have their requests lost with no way to recover if we revert. I've > wasted too much time debugging hardware for such lost commands when it > was in fact functioning perfectly fine. So reintroducing that behavior > is a bit distressing. Even some scsi drivers are susceptible to losing their requests with the reverted behavior: take virtio-scsi for example, which returns RESET_TIMER from its timeout handler. With the behavior everyone seems to want, a natural completion at or around the same time is lost forever because it was blocked from completion with no way to recover. While the timing for when requests may be lost is quite narrow, I've seen it enough with very difficult-to-reproduce scenarios that hardware devs no longer trust that IO timeouts are their problem because Linux loses their completions. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-19 15:56 ` Keith Busch @ 2018-07-19 16:04 ` Bart Van Assche -1 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-07-19 16:04 UTC (permalink / raw) To: hch, keith.busch; +Cc: keith.busch, linux-nvme, linux-block, axboe, ming.lei On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote: > Even some scsi drivers are susceptible to losing their requests with the > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER > from its timeout handler. With the behavior everyone seems to want, > a natural completion at or around the same time is lost forever because > it was blocked from completion with no way to recover. The patch I had posted handles a completion that occurs while a timeout is being handled properly. From https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html: void blk_mq_complete_request(struct request *rq) [ ... ] + if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, + MQ_RQ_COMPLETE)) { + __blk_mq_complete_request(rq); + break; + } + if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE)) + break; [ ... ] @@ -838,25 +838,42 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) [ ... ] case BLK_EH_RESET_TIMER: [ ... ] + if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) { + __blk_mq_complete_request(req); + break; + } Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-19 16:04 ` Bart Van Assche 0 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-07-19 16:04 UTC (permalink / raw) On Thu, 2018-07-19@09:56 -0600, Keith Busch wrote: > Even some scsi drivers are susceptible to losing their requests with the > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER > from it's timeout handler. With the behavior everyone seems to want, > a natural completion at or around the same time is lost forever because > it was blocked from completion with no way to recover. The patch I had posted handles a completion that occurs while a timeout is being handled properly. From https://www.mail-archive.com/linux-block at vger.kernel.org/msg22196.html: void blk_mq_complete_request(struct request *rq) [ ... ] + if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, + MQ_RQ_COMPLETE)) { + __blk_mq_complete_request(rq); + break; + } + if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE)) + break; [ ... ] @@ -838,25 +838,42 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) [ ... ] case BLK_EH_RESET_TIMER: [ ... ] + if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) { + __blk_mq_complete_request(req); + break; + } Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-19 16:04 ` Bart Van Assche @ 2018-07-19 16:22 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-19 16:22 UTC (permalink / raw) To: Bart Van Assche Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei On Thu, Jul 19, 2018 at 04:04:46PM +0000, Bart Van Assche wrote: > On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote: > > Even some scsi drivers are susceptible to losing their requests with the > > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER > > from its timeout handler. With the behavior everyone seems to want, > > a natural completion at or around the same time is lost forever because > > it was blocked from completion with no way to recover. > > The patch I had posted handles a completion that occurs while a timeout is > being handled properly. From https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html: Sounds like a win-win to me. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-19 16:22 ` Keith Busch 0 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-19 16:22 UTC (permalink / raw) On Thu, Jul 19, 2018@04:04:46PM +0000, Bart Van Assche wrote: > On Thu, 2018-07-19@09:56 -0600, Keith Busch wrote: > > Even some scsi drivers are susceptible to losing their requests with the > > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER > > from its timeout handler. With the behavior everyone seems to want, > > a natural completion at or around the same time is lost forever because > > it was blocked from completion with no way to recover. > > The patch I had posted handles a completion that occurs while a timeout is > being handled properly. From https://www.mail-archive.com/linux-block at vger.kernel.org/msg22196.html: Sounds like a win-win to me. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-19 16:22 ` Keith Busch @ 2018-07-19 16:29 ` hch -1 siblings, 0 replies; 128+ messages in thread From: hch @ 2018-07-19 16:29 UTC (permalink / raw) To: Keith Busch Cc: Bart Van Assche, hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei On Thu, Jul 19, 2018 at 10:22:40AM -0600, Keith Busch wrote: > On Thu, Jul 19, 2018 at 04:04:46PM +0000, Bart Van Assche wrote: > > On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote: > > > Even some scsi drivers are susceptible to losing their requests with the > > > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER > > > from its timeout handler. With the behavior everyone seems to want, > > > a natural completion at or around the same time is lost forever because > > > it was blocked from completion with no way to recover. > > > > The patch I had posted handles a completion that occurs while a timeout is > > being handled properly. From https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html: > > Sounds like a win-win to me. How do we get a fix into 4.18 at this part of the cycle? I think that is the most important priority right now. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-19 16:29 ` hch 0 siblings, 0 replies; 128+ messages in thread From: hch @ 2018-07-19 16:29 UTC (permalink / raw) On Thu, Jul 19, 2018@10:22:40AM -0600, Keith Busch wrote: > On Thu, Jul 19, 2018@04:04:46PM +0000, Bart Van Assche wrote: > > On Thu, 2018-07-19@09:56 -0600, Keith Busch wrote: > > > Even some scsi drivers are susceptible to losing their requests with the > > > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER > > > from its timeout handler. With the behavior everyone seems to want, > > > a natural completion at or around the same time is lost forever because > > > it was blocked from completion with no way to recover. > > > > The patch I had posted handles a completion that occurs while a timeout is > > being handled properly. From https://www.mail-archive.com/linux-block at vger.kernel.org/msg22196.html: > > Sounds like a win-win to me. How do we get a fix into 4.18 at this part of the cycle? I think that is the most important priority right now. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-19 16:29 ` hch @ 2018-07-19 20:18 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-19 20:18 UTC (permalink / raw) To: hch Cc: Bart Van Assche, keith.busch, linux-nvme, linux-block, axboe, ming.lei On Thu, Jul 19, 2018 at 06:29:24PM +0200, hch@lst.de wrote: > How do we get a fix into 4.18 at this part of the cycle? I think that > is the most important priority right now. Even if you were okay at this point to incorporate the concepts from Bart's patch, it still looks like trouble for scsi (will elaborate separately). But reverting breaks other things we finally got working, so I'd still vote for isolating the old behavior to scsi if that isn't too unpalatable. I'll send a small patch shortly and see what happens. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-19 20:18 ` Keith Busch 0 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-07-19 20:18 UTC (permalink / raw) On Thu, Jul 19, 2018@06:29:24PM +0200, hch@lst.de wrote: > How do we get a fix into 4.18 at this part of the cycle? I think that > is the most important priority right now. Even if you were okay at this point to incorporate the concepts from Bart's patch, it still looks like trouble for scsi (will elaborate separately). But reverting breaks other things we finally got working, so I'd still vote for isolating the old behavior to scsi if that isn't too unpalatable. I'll send a small patch shortly and see what happens. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce 2018-07-18 20:53 ` Keith Busch @ 2018-07-19 13:22 ` hch -1 siblings, 0 replies; 128+ messages in thread From: hch @ 2018-07-19 13:22 UTC (permalink / raw) To: Keith Busch Cc: hch, Bart Van Assche, keith.busch, linux-block, linux-nvme, axboe, ming.lei, linux-scsi On Wed, Jul 18, 2018 at 02:53:21PM -0600, Keith Busch wrote: > If scsi needs this behavior, why not just put that behavior in scsi? It > can set the state to complete and then everything can play out as > before. I think even with this we are missing handling for the somewhat degenerate blk_abort_request case. But most importantly we'll need some good test coverage. Please do some basic testing (e.g. with a version of the hack from Jianchao, who seems to keep getting dropped from this thread for some reason) and send it out to the block and scsi lists. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 3/3] blk-mq: Remove generation seqeunce @ 2018-07-19 13:22 ` hch 0 siblings, 0 replies; 128+ messages in thread From: hch @ 2018-07-19 13:22 UTC (permalink / raw) On Wed, Jul 18, 2018@02:53:21PM -0600, Keith Busch wrote: > If scsi needs this behavior, why not just put that behavior in scsi? It > can set the state to complete and then everything can play out as > before. I think even with this we are missing handling for the somewhat degenerate blk_abort_request case. But most importantly we'll need some good test coverage. Please do some basic testing (e.g. with a version of the hack from Jianchao, who seems to keep getting dropped from this thread for some reason) and send it out to the block and scsi lists. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 0/3] blk-mq: Timeout rework 2018-05-21 23:11 ` Keith Busch @ 2018-05-21 23:29 ` Bart Van Assche -1 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-05-21 23:29 UTC (permalink / raw) To: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote: > The current blk-mq code potentially locks requests out of completion by > the thousands, making drivers jump through hoops to handle them. This > patch set allows drivers to complete their requests whenever they're > completed without requiring drivers know anything about the timeout code > with minimal syncronization. > > Other proposals under current consideration still have moments that > prevent a driver from progressing a request to the completed state. > > The timeout is ultimatley made safe by reference counting the request > when timeout handling claims the request. By holding the reference count, > we don't need to do any tricks to prevent a driver from completing the > request out from under the timeout handler, allowing the actual state > to be changed inline with the true state, and drivers don't need to be > aware any of this is happening. > > In order to make the overhead as minimal as possible, the request's > reference is taken only when it appears that actual timeout handling > needs to be done. Hello Keith, Can you explain why the NVMe driver needs reference counting of requests but no other block driver needs this? Additionally, why is it that for all block drivers except NVMe the current block layer API is sufficient (blk_get_request()/blk_execute_rq()/blk_mq_start_request()/ blk_mq_complete_request()/blk_mq_end_request())? Thanks, Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 0/3] blk-mq: Timeout rework @ 2018-05-21 23:29 ` Bart Van Assche 0 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-05-21 23:29 UTC (permalink / raw) On Mon, 2018-05-21@17:11 -0600, Keith Busch wrote: > The current blk-mq code potentially locks requests out of completion by > the thousands, making drivers jump through hoops to handle them. This > patch set allows drivers to complete their requests whenever they're > completed without requiring drivers know anything about the timeout code > with minimal syncronization. > > Other proposals under current consideration still have moments that > prevent a driver from progressing a request to the completed state. > > The timeout is ultimatley made safe by reference counting the request > when timeout handling claims the request. By holding the reference count, > we don't need to do any tricks to prevent a driver from completing the > request out from under the timeout handler, allowing the actual state > to be changed inline with the true state, and drivers don't need to be > aware any of this is happening. > > In order to make the overhead as minimal as possible, the request's > reference is taken only when it appears that actual timeout handling > needs to be done. Hello Keith, Can you explain why the NVMe driver needs reference counting of requests but no other block driver needs this? Additionally, why is it that for all block drivers except NVMe the current block layer API is sufficient (blk_get_request()/blk_execute_rq()/blk_mq_start_request()/ blk_mq_complete_request()/blk_mq_end_request())? Thanks, Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 0/3] blk-mq: Timeout rework 2018-05-21 23:29 ` Bart Van Assche @ 2018-05-22 14:06 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-22 14:06 UTC (permalink / raw) To: Bart Van Assche Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei On Mon, May 21, 2018 at 11:29:21PM +0000, Bart Van Assche wrote: > Can you explain why the NVMe driver needs reference counting of requests but > no other block driver needs this? Additionally, why is it that for all block > drivers except NVMe the current block layer API is sufficient > (blk_get_request()/blk_execute_rq()/blk_mq_start_request()/ > blk_mq_complete_request()/blk_mq_end_request())? Hi Bart, I'm pretty sure NVMe isn't the only driver where a call to blk_mq_complete_request silently fails to transition the request to COMPLETE, forcing unnecessary error handling. This patch isn't so much about NVMe as it is about removing that silent exception from the block API. Thanks, Keith ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 0/3] blk-mq: Timeout rework @ 2018-05-22 14:06 ` Keith Busch 0 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-22 14:06 UTC (permalink / raw) On Mon, May 21, 2018@11:29:21PM +0000, Bart Van Assche wrote: > Can you explain why the NVMe driver needs reference counting of requests but > no other block driver needs this? Additionally, why is it that for all block > drivers except NVMe the current block layer API is sufficient > (blk_get_request()/blk_execute_rq()/blk_mq_start_request()/ > blk_mq_complete_request()/blk_mq_end_request())? Hi Bart, I'm pretty sure NVMe isn't the only driver where a call to blk_mq_complete_request silently fails to transition the request to COMPLETE, forcing unnecessary error handling. This patch isn't so much about NVMe as it is about removing that silent exception from the block API. Thanks, Keith ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 0/3] blk-mq: Timeout rework 2018-05-22 14:06 ` Keith Busch @ 2018-05-22 16:30 ` Bart Van Assche -1 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-05-22 16:30 UTC (permalink / raw) To: keith.busch; +Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei On Tue, 2018-05-22 at 08:06 -0600, Keith Busch wrote: > On Mon, May 21, 2018 at 11:29:21PM +0000, Bart Van Assche wrote: > > Can you explain why the NVMe driver needs reference counting of requests but > > no other block driver needs this? Additionally, why is it that for all block > > drivers except NVMe the current block layer API is sufficient > > (blk_get_request()/blk_execute_rq()/blk_mq_start_request()/ > > blk_mq_complete_request()/blk_mq_end_request())? > > I'm pretty sure NVMe isn't the only driver where a call to > blk_mq_complete_request silently fails to transition the request to > COMPLETE, forcing unnecessary error handling. This patch isn't so > much about NVMe as it is about removing that silent exception from the > block API. Hello Keith, Please have a look at v13 of the timeout handling rework patch that I posted. That patch should not introduce any new race conditions and should also handle the scenario fine in which blk_mq_complete_request() is called while the NVMe timeout handling function is in progress. Thanks, Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 0/3] blk-mq: Timeout rework @ 2018-05-22 16:30 ` Bart Van Assche 0 siblings, 0 replies; 128+ messages in thread From: Bart Van Assche @ 2018-05-22 16:30 UTC (permalink / raw) On Tue, 2018-05-22@08:06 -0600, Keith Busch wrote: > On Mon, May 21, 2018@11:29:21PM +0000, Bart Van Assche wrote: > > Can you explain why the NVMe driver needs reference counting of requests but > > no other block driver needs this? Additionally, why is it that for all block > > drivers except NVMe the current block layer API is sufficient > > (blk_get_request()/blk_execute_rq()/blk_mq_start_request()/ > > blk_mq_complete_request()/blk_mq_end_request())? > > I'm pretty sure NVMe isn't the only driver where a call to > blk_mq_complete_request silently fails to transition the request to > COMPLETE, forcing unnecessary error handling. This patch isn't so > much about NVMe as it is about removing that silent exception from the > block API. Hello Keith, Please have a look at v13 of the timeout handling rework patch that I posted. That patch should not introduce any new race conditions and should also handle the scenario fine in which blk_mq_complete_request() is called while the NVMe timeout handling function is in progress. Thanks, Bart. ^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [RFC PATCH 0/3] blk-mq: Timeout rework 2018-05-22 16:30 ` Bart Van Assche @ 2018-05-22 16:44 ` Keith Busch -1 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-22 16:44 UTC (permalink / raw) To: Bart Van Assche Cc: hch, keith.busch, linux-nvme, linux-block, axboe, ming.lei On Tue, May 22, 2018 at 04:30:41PM +0000, Bart Van Assche wrote: > > Please have a look at v13 of the timeout handling rework patch that I posted. > That patch should not introduce any new race conditions and should also handle > the scenario fine in which blk_mq_complete_request() is called while the NVMe > timeout handling function is in progress. Thanks for the notice. That sounds very interesting, and I will be happy to take a look. ^ permalink raw reply [flat|nested] 128+ messages in thread
* [RFC PATCH 0/3] blk-mq: Timeout rework @ 2018-05-22 16:44 ` Keith Busch 0 siblings, 0 replies; 128+ messages in thread From: Keith Busch @ 2018-05-22 16:44 UTC (permalink / raw) On Tue, May 22, 2018 at 04:30:41PM +0000, Bart Van Assche wrote: > > Please have a look at v13 of the timeout handling rework patch that I posted. > That patch should not introduce any new race conditions and should also handle > the scenario fine in which blk_mq_complete_request() is called while the NVMe > timeout handling function is in progress. Thanks for the notice. That sounds very interesting, and I will be happy to take a look. ^ permalink raw reply [flat|nested] 128+ messages in thread
end of thread, other threads:[~2018-07-19 21:03 UTC | newest] Thread overview: 128+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-05-21 23:11 [RFC PATCH 0/3] blk-mq: Timeout rework Keith Busch 2018-05-21 23:11 ` Keith Busch 2018-05-21 23:11 ` [RFC PATCH 1/3] blk-mq: Reference count request usage Keith Busch 2018-05-21 23:11 ` Keith Busch 2018-05-22 2:27 ` Ming Lei 2018-05-22 2:27 ` Ming Lei 2018-05-22 15:19 ` Christoph Hellwig 2018-05-22 15:19 ` Christoph Hellwig 2018-05-21 23:11 ` [RFC PATCH 2/3] blk-mq: Fix timeout and state order Keith Busch 2018-05-21 23:11 ` Keith Busch 2018-05-22 2:28 ` Ming Lei 2018-05-22 2:28 ` Ming Lei 2018-05-22 15:24 ` Christoph Hellwig 2018-05-22 15:24 ` Christoph Hellwig 2018-05-22 16:27 ` Bart Van Assche 2018-05-22 16:27 ` Bart Van Assche 2018-05-21 23:11 ` [RFC PATCH 3/3] blk-mq: Remove generation seqeunce Keith Busch 2018-05-21 23:11 ` Keith Busch 2018-05-21 23:29 ` Bart Van Assche 2018-05-21 23:29 ` Bart Van Assche 2018-05-22 14:15 ` Keith Busch 2018-05-22 14:15 ` Keith Busch 2018-05-22 16:29 ` Bart Van Assche 2018-05-22 16:29 ` Bart Van Assche 2018-05-22 16:34 ` Keith Busch 2018-05-22 16:34 ` Keith Busch 2018-05-22 16:48 ` Bart Van Assche 2018-05-22 16:48 ` Bart Van Assche 2018-05-22 2:49 ` Ming Lei 2018-05-22 2:49 ` Ming Lei 2018-05-22 3:16 ` Jens Axboe 2018-05-22 3:16 ` Jens Axboe 2018-05-22 3:47 ` Ming Lei 2018-05-22 3:47 ` Ming Lei 2018-05-22 3:51 ` Jens Axboe 2018-05-22 3:51 ` Jens Axboe 2018-05-22 8:51 ` Ming Lei 2018-05-22 8:51 ` Ming Lei 2018-05-22 14:35 ` Jens Axboe 2018-05-22 14:35 ` Jens Axboe 2018-05-22 14:20 ` Keith Busch 2018-05-22 14:20 ` Keith Busch 2018-05-22 14:37 ` Ming Lei 2018-05-22 14:37 ` Ming Lei 2018-05-22 14:46 ` Keith Busch 2018-05-22 14:46 ` Keith Busch 2018-05-22 14:57 ` Ming Lei 2018-05-22 14:57 ` Ming Lei 2018-05-22 15:01 ` Keith Busch 2018-05-22 15:01 ` Keith Busch 2018-05-22 15:07 ` Ming Lei 2018-05-22 15:07 ` Ming Lei 2018-05-22 15:17 ` 
Keith Busch 2018-05-22 15:17 ` Keith Busch 2018-05-22 15:23 ` Ming Lei 2018-05-22 15:23 ` Ming Lei 2018-05-22 16:17 ` Christoph Hellwig 2018-05-22 16:17 ` Christoph Hellwig 2018-05-23 0:34 ` Ming Lei 2018-05-23 0:34 ` Ming Lei 2018-05-23 14:35 ` Keith Busch 2018-05-23 14:35 ` Keith Busch 2018-05-24 1:52 ` Ming Lei 2018-05-24 1:52 ` Ming Lei 2018-05-23 5:48 ` Hannes Reinecke 2018-05-23 5:48 ` Hannes Reinecke 2018-07-12 18:16 ` Bart Van Assche 2018-07-12 18:16 ` Bart Van Assche 2018-07-12 19:24 ` Keith Busch 2018-07-12 19:24 ` Keith Busch 2018-07-12 22:24 ` Bart Van Assche 2018-07-12 22:24 ` Bart Van Assche 2018-07-13 1:12 ` jianchao.wang 2018-07-13 1:12 ` jianchao.wang 2018-07-13 2:40 ` jianchao.wang 2018-07-13 2:40 ` jianchao.wang 2018-07-13 15:43 ` Keith Busch 2018-07-13 15:43 ` Keith Busch 2018-07-13 15:52 ` Bart Van Assche 2018-07-13 15:52 ` Bart Van Assche 2018-07-13 18:47 ` Keith Busch 2018-07-13 18:47 ` Keith Busch 2018-07-13 23:03 ` Bart Van Assche 2018-07-13 23:03 ` Bart Van Assche 2018-07-13 23:58 ` Keith Busch 2018-07-13 23:58 ` Keith Busch 2018-07-18 19:56 ` hch 2018-07-18 19:56 ` hch 2018-07-18 20:39 ` hch 2018-07-18 20:39 ` hch 2018-07-18 21:05 ` Bart Van Assche 2018-07-18 21:05 ` Bart Van Assche 2018-07-18 22:53 ` Keith Busch 2018-07-18 22:53 ` Keith Busch 2018-07-18 20:53 ` Keith Busch 2018-07-18 20:53 ` Keith Busch 2018-07-18 20:58 ` Bart Van Assche 2018-07-18 20:58 ` Bart Van Assche 2018-07-18 21:17 ` Keith Busch 2018-07-18 21:17 ` Keith Busch 2018-07-18 21:30 ` Bart Van Assche 2018-07-18 21:30 ` Bart Van Assche 2018-07-18 21:33 ` Keith Busch 2018-07-18 21:33 ` Keith Busch 2018-07-19 13:19 ` hch 2018-07-19 13:19 ` hch 2018-07-19 14:59 ` Keith Busch 2018-07-19 14:59 ` Keith Busch 2018-07-19 15:56 ` Keith Busch 2018-07-19 15:56 ` Keith Busch 2018-07-19 16:04 ` Bart Van Assche 2018-07-19 16:04 ` Bart Van Assche 2018-07-19 16:22 ` Keith Busch 2018-07-19 16:22 ` Keith Busch 2018-07-19 16:29 ` hch 2018-07-19 16:29 ` hch 2018-07-19 20:18 ` Keith Busch 
2018-07-19 20:18 ` Keith Busch 2018-07-19 13:22 ` hch 2018-07-19 13:22 ` hch 2018-05-21 23:29 ` [RFC PATCH 0/3] blk-mq: Timeout rework Bart Van Assche 2018-05-21 23:29 ` Bart Van Assche 2018-05-22 14:06 ` Keith Busch 2018-05-22 14:06 ` Keith Busch 2018-05-22 16:30 ` Bart Van Assche 2018-05-22 16:30 ` Bart Van Assche 2018-05-22 16:44 ` Keith Busch 2018-05-22 16:44 ` Keith Busch