* [PATCH] nvme: unquiesce the queue before cleaning it up
@ 2018-04-19  8:29 ` Jianchao Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Jianchao Wang @ 2018-04-19  8:29 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel

There is a race between nvme_remove and nvme_reset_work that can
lead to an io hang.

nvme_remove                     nvme_reset_work
-> change state to DELETING
                                -> fail to change state to LIVE
                                -> nvme_remove_dead_ctrl
                                  -> nvme_dev_disable
                                    -> quiesce request_queue
                                  -> queue remove_work
-> cancel_work_sync reset_work
-> nvme_remove_namespaces
  -> splice ctrl->namespaces
                                nvme_remove_dead_ctrl_work
                                -> nvme_kill_queues
  -> nvme_ns_remove               does nothing
    -> blk_cleanup_queue
      -> blk_freeze_queue

In the end, the request_queue is still in the quiesced state when we
wait for the freeze, so we get an io hang here.

To fix it, unquiesce the request_queue directly before nvme_ns_remove.
We have already spliced ctrl->namespaces, so nobody can access the
namespaces or quiesce the queue any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/core.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9df4f71..0e95082 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 	list_splice_init(&ctrl->namespaces, &ns_list);
 	up_write(&ctrl->namespaces_rwsem);
 
-	list_for_each_entry_safe(ns, next, &ns_list, list)
+	/*
+	 * After splicing the namespaces list from ctrl->namespaces,
+	 * nobody can get at them anymore, so unquiesce the request_queue
+	 * forcibly to avoid an io hang.
+	 */
+	list_for_each_entry_safe(ns, next, &ns_list, list) {
+		blk_mq_unquiesce_queue(ns->queue);
 		nvme_ns_remove(ns);
+	}
 }
 EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread
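The hang mechanism is worth spelling out. Freezing a queue waits for its
usage counter to drain; quiescing a queue stops the hardware contexts from
dispatching. The sketch below is a simplified paraphrase of the v4.17
blk-mq paths (illustration only — locking, RCU protection and several
helpers are elided, so this is not the literal kernel source), showing why
a freeze can never finish while the queue stays quiesced:

```c
/*
 * Simplified paraphrase of the v4.17 blk-mq freeze and run paths
 * (illustration only; locking, RCU and helper layers are elided).
 */
void blk_freeze_queue(struct request_queue *q)
{
	/* Stop new submitters from entering the queue... */
	percpu_ref_kill(&q->q_usage_counter);
	/* ...and kick the hw queues to flush requests that already entered. */
	blk_mq_run_hw_queues(q, false);

	/* Sleep until every request that entered has completed. */
	wait_event(q->mq_freeze_wq,
		   percpu_ref_is_zero(&q->q_usage_counter));
}

void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
{
	/*
	 * A quiesced queue refuses to dispatch, so the kick above is a
	 * no-op: requests stay parked on the ctx rq_lists, never
	 * complete, q_usage_counter never drains, and the wait_event()
	 * above sleeps forever -- unless someone unquiesces first.
	 */
	if (blk_queue_quiesced(hctx->queue))
		return;
	__blk_mq_delay_run_hw_queue(hctx, async, 0);
}
```

This is exactly what the patch exploits in reverse: calling
blk_mq_unquiesce_queue() before nvme_ns_remove() lets the parked requests
dispatch, so the freeze inside blk_cleanup_queue() can drain.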
* Re: [PATCH] nvme: unquiesce the queue before cleaning it up
  2018-04-19  8:29 ` Jianchao Wang
@ 2018-04-22 13:32   ` jianchao.wang
  0 siblings, 0 replies; 18+ messages in thread
From: jianchao.wang @ 2018-04-22 13:32 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel

Hi Keith

Would you please take a look at this patch?

This issue can be reproduced easily with a driver bind/unbind loop,
a reset loop and an IO loop running at the same time.

Thanks
Jianchao

On 04/19/2018 04:29 PM, Jianchao Wang wrote:
> There is a race between nvme_remove and nvme_reset_work that can
> lead to an io hang.
> 
> nvme_remove                     nvme_reset_work
> -> change state to DELETING
>                                 -> fail to change state to LIVE
>                                 -> nvme_remove_dead_ctrl
>                                   -> nvme_dev_disable
>                                     -> quiesce request_queue
>                                   -> queue remove_work
> -> cancel_work_sync reset_work
> -> nvme_remove_namespaces
>   -> splice ctrl->namespaces
>                                 nvme_remove_dead_ctrl_work
>                                 -> nvme_kill_queues
>   -> nvme_ns_remove               does nothing
>     -> blk_cleanup_queue
>       -> blk_freeze_queue
> 
> In the end, the request_queue is still in the quiesced state when we
> wait for the freeze, so we get an io hang here.
> 
> To fix it, unquiesce the request_queue directly before nvme_ns_remove.
> We have already spliced ctrl->namespaces, so nobody can access the
> namespaces or quiesce the queue any more.
> 
> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
> ---
>  drivers/nvme/host/core.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 9df4f71..0e95082 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>  	list_splice_init(&ctrl->namespaces, &ns_list);
>  	up_write(&ctrl->namespaces_rwsem);
>  
> -	list_for_each_entry_safe(ns, next, &ns_list, list)
> +	/*
> +	 * After splicing the namespaces list from ctrl->namespaces,
> +	 * nobody can get at them anymore, so unquiesce the request_queue
> +	 * forcibly to avoid an io hang.
> +	 */
> +	list_for_each_entry_safe(ns, next, &ns_list, list) {
> +		blk_mq_unquiesce_queue(ns->queue);
>  		nvme_ns_remove(ns);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] nvme: unquiesce the queue before cleaning it up
  2018-04-22 13:32 ` jianchao.wang
@ 2018-04-22 14:18   ` Max Gurtovoy
  2018-04-22 14:25     ` jianchao.wang
  0 siblings, 1 reply; 18+ messages in thread
From: Max Gurtovoy @ 2018-04-22 14:18 UTC (permalink / raw)

Hi Jianchao,
Since this patch is in the core, have you tested it with some fabrics
drivers too? RDMA/FC?

thanks,
Max.

On 4/22/2018 4:32 PM, jianchao.wang wrote:
> Hi Keith
> 
> Would you please take a look at this patch?
> 
> This issue can be reproduced easily with a driver bind/unbind loop,
> a reset loop and an IO loop running at the same time.
> 
> Thanks
> Jianchao
> 
> On 04/19/2018 04:29 PM, Jianchao Wang wrote:
>> There is a race between nvme_remove and nvme_reset_work that can
>> lead to an io hang.
>>
>> nvme_remove                     nvme_reset_work
>> -> change state to DELETING
>>                                 -> fail to change state to LIVE
>>                                 -> nvme_remove_dead_ctrl
>>                                   -> nvme_dev_disable
>>                                     -> quiesce request_queue
>>                                   -> queue remove_work
>> -> cancel_work_sync reset_work
>> -> nvme_remove_namespaces
>>   -> splice ctrl->namespaces
>>                                 nvme_remove_dead_ctrl_work
>>                                 -> nvme_kill_queues
>>   -> nvme_ns_remove               does nothing
>>     -> blk_cleanup_queue
>>       -> blk_freeze_queue
>>
>> In the end, the request_queue is still in the quiesced state when we
>> wait for the freeze, so we get an io hang here.
>>
>> To fix it, unquiesce the request_queue directly before nvme_ns_remove.
>> We have already spliced ctrl->namespaces, so nobody can access the
>> namespaces or quiesce the queue any more.
>>
>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>> ---
>>  drivers/nvme/host/core.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 9df4f71..0e95082 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>  	list_splice_init(&ctrl->namespaces, &ns_list);
>>  	up_write(&ctrl->namespaces_rwsem);
>>  
>> -	list_for_each_entry_safe(ns, next, &ns_list, list)
>> +	/*
>> +	 * After splicing the namespaces list from ctrl->namespaces,
>> +	 * nobody can get at them anymore, so unquiesce the request_queue
>> +	 * forcibly to avoid an io hang.
>> +	 */
>> +	list_for_each_entry_safe(ns, next, &ns_list, list) {
>> +		blk_mq_unquiesce_queue(ns->queue);
>>  		nvme_ns_remove(ns);
>> +	}
>>  }
>>  EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
>>
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] nvme: unquiesce the queue before cleaning it up
  2018-04-22 14:18 ` Max Gurtovoy
@ 2018-04-22 14:25   ` jianchao.wang
  0 siblings, 0 replies; 18+ messages in thread
From: jianchao.wang @ 2018-04-22 14:25 UTC (permalink / raw)
  To: Max Gurtovoy, keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel

Hi Max

No, I only tested it on a PCIe one.
Sorry that I didn't state that.

Thanks
Jianchao

On 04/22/2018 10:18 PM, Max Gurtovoy wrote:
> Hi Jianchao,
> Since this patch is in the core, have you tested it with some fabrics
> drivers too? RDMA/FC?
> 
> thanks,
> Max.
> 
> On 4/22/2018 4:32 PM, jianchao.wang wrote:
>> Hi Keith
>>
>> Would you please take a look at this patch?
>>
>> This issue can be reproduced easily with a driver bind/unbind loop,
>> a reset loop and an IO loop running at the same time.
>>
>> Thanks
>> Jianchao
>>
>> On 04/19/2018 04:29 PM, Jianchao Wang wrote:
>>> There is a race between nvme_remove and nvme_reset_work that can
>>> lead to an io hang.
>>>
>>> nvme_remove                     nvme_reset_work
>>> -> change state to DELETING
>>>                                 -> fail to change state to LIVE
>>>                                 -> nvme_remove_dead_ctrl
>>>                                   -> nvme_dev_disable
>>>                                     -> quiesce request_queue
>>>                                   -> queue remove_work
>>> -> cancel_work_sync reset_work
>>> -> nvme_remove_namespaces
>>>   -> splice ctrl->namespaces
>>>                                 nvme_remove_dead_ctrl_work
>>>                                 -> nvme_kill_queues
>>>   -> nvme_ns_remove               does nothing
>>>     -> blk_cleanup_queue
>>>       -> blk_freeze_queue
>>>
>>> In the end, the request_queue is still in the quiesced state when we
>>> wait for the freeze, so we get an io hang here.
>>>
>>> To fix it, unquiesce the request_queue directly before nvme_ns_remove.
>>> We have already spliced ctrl->namespaces, so nobody can access the
>>> namespaces or quiesce the queue any more.
>>>
>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>>> ---
>>>  drivers/nvme/host/core.c | 9 ++++++++-
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 9df4f71..0e95082 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>>  	list_splice_init(&ctrl->namespaces, &ns_list);
>>>  	up_write(&ctrl->namespaces_rwsem);
>>>  
>>> -	list_for_each_entry_safe(ns, next, &ns_list, list)
>>> +	/*
>>> +	 * After splicing the namespaces list from ctrl->namespaces,
>>> +	 * nobody can get at them anymore, so unquiesce the request_queue
>>> +	 * forcibly to avoid an io hang.
>>> +	 */
>>> +	list_for_each_entry_safe(ns, next, &ns_list, list) {
>>> +		blk_mq_unquiesce_queue(ns->queue);
>>>  		nvme_ns_remove(ns);
>>> +	}
>>>  }
>>>  EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
>>>
>>
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] nvme: unquiesce the queue before cleaning it up
  2018-04-22 14:25 ` jianchao.wang
@ 2018-04-22 14:48   ` Max Gurtovoy
  2018-04-22 15:00     ` jianchao.wang
  0 siblings, 1 reply; 18+ messages in thread
From: Max Gurtovoy @ 2018-04-22 14:48 UTC (permalink / raw)

On 4/22/2018 5:25 PM, jianchao.wang wrote:
> Hi Max
> 
> No, I only tested it on a PCIe one.
> Sorry that I didn't state that.

Please send your exact test steps and we'll run them using the RDMA
transport. I also want to run a mini regression on this one since it may
affect other flows.

> 
> Thanks
> Jianchao
> 
> On 04/22/2018 10:18 PM, Max Gurtovoy wrote:
>> Hi Jianchao,
>> Since this patch is in the core, have you tested it with some fabrics
>> drivers too? RDMA/FC?
>>
>> thanks,
>> Max.
>>
>> On 4/22/2018 4:32 PM, jianchao.wang wrote:
>>> Hi Keith
>>>
>>> Would you please take a look at this patch?
>>>
>>> This issue can be reproduced easily with a driver bind/unbind loop,
>>> a reset loop and an IO loop running at the same time.
>>>
>>> Thanks
>>> Jianchao
>>>
>>> On 04/19/2018 04:29 PM, Jianchao Wang wrote:
>>>> There is a race between nvme_remove and nvme_reset_work that can
>>>> lead to an io hang.
>>>>
>>>> nvme_remove                     nvme_reset_work
>>>> -> change state to DELETING
>>>>                                 -> fail to change state to LIVE
>>>>                                 -> nvme_remove_dead_ctrl
>>>>                                   -> nvme_dev_disable
>>>>                                     -> quiesce request_queue
>>>>                                   -> queue remove_work
>>>> -> cancel_work_sync reset_work
>>>> -> nvme_remove_namespaces
>>>>   -> splice ctrl->namespaces
>>>>                                 nvme_remove_dead_ctrl_work
>>>>                                 -> nvme_kill_queues
>>>>   -> nvme_ns_remove               does nothing
>>>>     -> blk_cleanup_queue
>>>>       -> blk_freeze_queue
>>>>
>>>> In the end, the request_queue is still in the quiesced state when we
>>>> wait for the freeze, so we get an io hang here.
>>>>
>>>> To fix it, unquiesce the request_queue directly before nvme_ns_remove.
>>>> We have already spliced ctrl->namespaces, so nobody can access the
>>>> namespaces or quiesce the queue any more.
>>>>
>>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>>>> ---
>>>>  drivers/nvme/host/core.c | 9 ++++++++-
>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 9df4f71..0e95082 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>>>  	list_splice_init(&ctrl->namespaces, &ns_list);
>>>>  	up_write(&ctrl->namespaces_rwsem);
>>>>  
>>>> -	list_for_each_entry_safe(ns, next, &ns_list, list)
>>>> +	/*
>>>> +	 * After splicing the namespaces list from ctrl->namespaces,
>>>> +	 * nobody can get at them anymore, so unquiesce the request_queue
>>>> +	 * forcibly to avoid an io hang.
>>>> +	 */
>>>> +	list_for_each_entry_safe(ns, next, &ns_list, list) {
>>>> +		blk_mq_unquiesce_queue(ns->queue);
>>>>  		nvme_ns_remove(ns);
>>>> +	}
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
>>>>
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>
>>
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] nvme: unquiesce the queue before cleaning it up
  2018-04-22 14:48 ` Max Gurtovoy
@ 2018-04-22 15:00   ` jianchao.wang
  0 siblings, 0 replies; 18+ messages in thread
From: jianchao.wang @ 2018-04-22 15:00 UTC (permalink / raw)
  To: Max Gurtovoy, keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel

Hi Max

That's really appreciated!
Here are my test scripts.

loop_reset_controller.sh
#!/bin/bash
while true
do
	echo 1 > /sys/block/nvme0n1/device/reset_controller
	sleep 1
done

loop_unbind_driver.sh
#!/bin/bash
while true
do
	echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/unbind
	sleep 2
	echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/bind
	sleep 2
done

loop_io.sh
#!/bin/bash

file="/dev/nvme0n1"
echo $file
while true;
do
	if [ -e $file ];then
		fio fio_job_rand_read.ini
	else
		echo "Not found"
		sleep 1
	fi
done

The fio job is as below:
size=512m
rw=randread
bs=4k
ioengine=libaio
iodepth=64
direct=1
numjobs=16
filename=/dev/nvme0n1
group_reporting

I started them in sequence: loop_io.sh, loop_reset_controller.sh,
loop_unbind_driver.sh. If lucky, I will get an io hang within 3 minutes. ;)
Such as:

[  142.858074] nvme nvme0: pci function 0000:02:00.0
[  144.972256] nvme nvme0: failed to mark controller state 1
[  144.972289] nvme nvme0: Removing after probe failure status: 0
[  185.312344] INFO: task bash:1673 blocked for more than 30 seconds.
[  185.312889]       Not tainted 4.17.0-rc1+ #6
[  185.312950] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  185.313049] bash            D    0  1673   1629 0x00000080
[  185.313061] Call Trace:
[  185.313083]  ? __schedule+0x3de/0xac0
[  185.313103]  schedule+0x3c/0x90
[  185.313111]  blk_mq_freeze_queue_wait+0x44/0x90
[  185.313123]  ? wait_woken+0x90/0x90
[  185.313133]  blk_cleanup_queue+0xe1/0x280
[  185.313145]  nvme_ns_remove+0x1c8/0x260
[  185.313159]  nvme_remove_namespaces+0x7f/0xa0
[  185.313170]  nvme_remove+0x6c/0x130
[  185.313181]  pci_device_remove+0x36/0xb0
[  185.313193]  device_release_driver_internal+0x160/0x230
[  185.313205]  unbind_store+0xfe/0x150
[  185.313219]  kernfs_fop_write+0x114/0x190
[  185.313234]  __vfs_write+0x23/0x150
[  185.313246]  ? rcu_read_lock_sched_held+0x3f/0x70
[  185.313252]  ? preempt_count_sub+0x92/0xd0
[  185.313259]  ? __sb_start_write+0xf8/0x200
[  185.313271]  vfs_write+0xc5/0x1c0
[  185.313284]  ksys_write+0x45/0xa0
[  185.313298]  do_syscall_64+0x5a/0x1a0
[  185.313308]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

And I get the following information in block debugfs:
root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat hctx6/cpu6/rq_list
000000001192d19b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=69, .internal_tag=-1}
00000000c33c8a5b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat state
DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED

We can see there were reqs on the ctx rq_list while the request_queue was
QUIESCED.

Thanks again!!
Jianchao

On 04/22/2018 10:48 PM, Max Gurtovoy wrote:
> 
> 
> On 4/22/2018 5:25 PM, jianchao.wang wrote:
>> Hi Max
>>
>> No, I only tested it on a PCIe one.
>> Sorry that I didn't state that.
> 
> Please send your exact test steps and we'll run them using the RDMA
> transport. I also want to run a mini regression on this one since it may
> affect other flows.
> 
>>
>> Thanks
>> Jianchao
>>
>> On 04/22/2018 10:18 PM, Max Gurtovoy wrote:
>>> Hi Jianchao,
>>> Since this patch is in the core, have you tested it with some fabrics
>>> drivers too? RDMA/FC?
>>>
>>> thanks,
>>> Max.
>>>
>>> On 4/22/2018 4:32 PM, jianchao.wang wrote:
>>>> Hi Keith
>>>>
>>>> Would you please take a look at this patch?
>>>>
>>>> This issue can be reproduced easily with a driver bind/unbind loop,
>>>> a reset loop and an IO loop running at the same time.
>>>>
>>>> Thanks
>>>> Jianchao
>>>>
>>>> On 04/19/2018 04:29 PM, Jianchao Wang wrote:
>>>>> There is a race between nvme_remove and nvme_reset_work that can
>>>>> lead to an io hang.
>>>>>
>>>>> nvme_remove                     nvme_reset_work
>>>>> -> change state to DELETING
>>>>>                                 -> fail to change state to LIVE
>>>>>                                 -> nvme_remove_dead_ctrl
>>>>>                                   -> nvme_dev_disable
>>>>>                                     -> quiesce request_queue
>>>>>                                   -> queue remove_work
>>>>> -> cancel_work_sync reset_work
>>>>> -> nvme_remove_namespaces
>>>>>   -> splice ctrl->namespaces
>>>>>                                 nvme_remove_dead_ctrl_work
>>>>>                                 -> nvme_kill_queues
>>>>>   -> nvme_ns_remove               does nothing
>>>>>     -> blk_cleanup_queue
>>>>>       -> blk_freeze_queue
>>>>>
>>>>> In the end, the request_queue is still in the quiesced state when we
>>>>> wait for the freeze, so we get an io hang here.
>>>>>
>>>>> To fix it, unquiesce the request_queue directly before nvme_ns_remove.
>>>>> We have already spliced ctrl->namespaces, so nobody can access the
>>>>> namespaces or quiesce the queue any more.
>>>>>
>>>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>>>>> ---
>>>>>  drivers/nvme/host/core.c | 9 ++++++++-
>>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>> index 9df4f71..0e95082 100644
>>>>> --- a/drivers/nvme/host/core.c
>>>>> +++ b/drivers/nvme/host/core.c
>>>>> @@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>>>>  	list_splice_init(&ctrl->namespaces, &ns_list);
>>>>>  	up_write(&ctrl->namespaces_rwsem);
>>>>>  
>>>>> -	list_for_each_entry_safe(ns, next, &ns_list, list)
>>>>> +	/*
>>>>> +	 * After splicing the namespaces list from ctrl->namespaces,
>>>>> +	 * nobody can get at them anymore, so unquiesce the request_queue
>>>>> +	 * forcibly to avoid an io hang.
>>>>> +	 */
>>>>> +	list_for_each_entry_safe(ns, next, &ns_list, list) {
>>>>> +		blk_mq_unquiesce_queue(ns->queue);
>>>>>  		nvme_ns_remove(ns);
>>>>> +	}
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
>>>>>
>>>>
>>>> _______________________________________________
>>>> Linux-nvme mailing list
>>>> Linux-nvme@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>>
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread
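The debugfs dump above is the signature of the deadlock: two READ requests
parked on a ctx rq_list while the queue state shows both DYING and
QUIESCED. For reference, the helper the patch calls is tiny; the following
is a close paraphrase of the v4.17 implementation (for illustration, not
quoted verbatim from the tree — see block/blk-mq.c for the real thing):

```c
/* Close paraphrase of the v4.17 blk_mq_unquiesce_queue(), for illustration. */
void blk_mq_unquiesce_queue(struct request_queue *q)
{
	/* Allow the hardware queues to dispatch again... */
	blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);

	/*
	 * ...and re-run them so requests parked on the per-cpu ctx
	 * rq_lists (like the two READs in the dump above) are finally
	 * dispatched and completed, letting q_usage_counter drain and
	 * the freeze inside blk_cleanup_queue() finish.
	 */
	blk_mq_run_hw_queues(q, true);
}
```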
* Re: [PATCH] nvme: unquiesce the queue before cleaning it up
  2018-04-22 15:00 ` jianchao.wang
@ 2018-04-26  8:27   ` jianchao.wang
  0 siblings, 0 replies; 18+ messages in thread
From: jianchao.wang @ 2018-04-26  8:27 UTC (permalink / raw)
  To: Max Gurtovoy, keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel

Hi Max

I did a similar test on nvme-rdma; the underlying fabric is soft-RoCE.
An io loop, a reset controller loop and a delete/create controller loop
ran at the same time, and I found the io hang below:

[  230.884590] WARNING: CPU: 0 PID: 150 at /home/will/u04/source_code/linux-stable/drivers/nvme/host/rdma.c:1755 nvme_rdma_reset_ctrl_work+0x3d/0xb0 [nvme_rdma]
[  230.884689] CPU: 0 PID: 150 Comm: kworker/u16:3 Not tainted 4.17.0-rc1+ #16
[  230.884690] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  230.884693] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma]
[  230.884696] RIP: 0010:nvme_rdma_reset_ctrl_work+0x3d/0xb0 [nvme_rdma]
[  230.884697] RSP: 0018:ffffa9e30255be40 EFLAGS: 00010246
[  230.884699] RAX: 0000000000000000 RBX: ffff9633b64c2000 RCX: 0000000000000000
[  230.884700] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff9ac05516
[  230.884701] RBP: ffff9633b64c23b8 R08: 0000000000000001 R09: 0000000000000000
[  230.884702] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9633b64c2970
[  230.884703] R13: 0000000000000000 R14: 0ffff96340ffcc80 R15: ffff9634102af9c0
[  230.884705] FS:  0000000000000000(0000) GS:ffff963422c00000(0000) knlGS:0000000000000000
[  230.884706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  230.884707] CR2: 0000000001ce9ca4 CR3: 000000038040f005 CR4: 00000000003606f0
[  230.884708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  230.884709] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  230.884710] Call Trace:
[  230.884717]  process_one_work+0x1c0/0x680
[  230.884722]  worker_thread+0x22b/0x430
[  230.884728]  kthread+0x100/0x140
[  230.884730]  ? rescuer_thread+0x370/0x370
[  230.884731]  ? kthread_delayed_work_timer_fn+0x80/0x80
[  230.884736]  ret_from_fork+0x24/0x30
[  230.884744] Code: ff 48 8d ab b8 03 00 00 48 89 ef e8 1e 25 d7 d9 31 f6 48 89 df e8 44 f9 ff ff be 04 00 00 00 48 89 ef e8 b7 13 d7 d9 84 c0 75 07 <0f> 0b 5b 5d 41 5c c3 31 f6 48 89 df e8 b2 fa ff ff 85 c0 75 39
[  230.884799] irq event stamp: 88150
[  230.884802] hardirqs last  enabled at (88149): [<ffffffff9ac05529>] _raw_spin_unlock_irqrestore+0x59/0x70
[  230.884803] hardirqs last disabled at (88150): [<ffffffff9ae0116c>] error_entry+0x6c/0xc0
[  230.884805] softirqs last  enabled at (87998): [<ffffffff9aa603a0>] sk_common_release+0x60/0xd0
[  230.884807] softirqs last disabled at (87996): [<ffffffff9aa60375>] sk_common_release+0x35/0xd0
[  230.884808] ---[ end trace c03fac253b80d77d ]---
[  277.472094] INFO: task kworker/u16:0:6 blocked for more than 30 seconds.
[  277.473482]       Tainted: G        W         4.17.0-rc1+ #16
[  277.474825] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  277.476293] kworker/u16:0   D    0     6      2 0x80000000
[  277.476317] Workqueue: nvme-delete-wq nvme_delete_ctrl_work
[  277.476335] Call Trace:
[  277.476365]  ? __schedule+0x3de/0xac0
[  277.476393]  schedule+0x3c/0x90
[  277.476405]  blk_mq_freeze_queue_wait+0x44/0x90
[  277.476419]  ? wait_woken+0x90/0x90
[  277.476437]  blk_cleanup_queue+0xe1/0x280
[  277.476453]  nvme_ns_remove+0x1c8/0x260
[  277.476475]  nvme_remove_namespaces+0x7f/0xa0
[  277.476495]  nvme_delete_ctrl_work+0x4b/0x80
[  277.476508]  process_one_work+0x1c0/0x680
[  277.476535]  worker_thread+0x22b/0x430
[  277.476562]  kthread+0x100/0x140
[  277.476572]  ? rescuer_thread+0x370/0x370
[  277.476579]  ? kthread_delayed_work_timer_fn+0x80/0x80
[  277.476598]  ret_from_fork+0x24/0x30

00000000ea33345b {.op=READ, .cmd_flags=23, .rq_flags=MQ_INFLIGHT|IO_STAT, .state=idle, .tag=19, .internal_tag=-1}
0000000071218c5a {.op=READ, .cmd_flags=23, .rq_flags=MQ_INFLIGHT|IO_STAT, .state=idle, .tag=45, .internal_tag=-1}
root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme1c1n1# cat state
DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED

This is due to the following race:

nvme_delete_ctrl                nvme_rdma_reset_ctrl_work
-> change state to DELETING
-> queue delete_work
                                -> nvme_rdma_shutdown_ctrl
                                  -> nvme_stop_queues
                                    -> blk_mq_quiesce_queue
                                -> change to LIVE state fails
                                -> return
nvme_delete_ctrl_work
-> nvme_remove_namespaces
  -> nvme_ns_remove
    -> blk_cleanup_queue
      -> blk_freeze_queue

This patch also fixes this issue.

Thanks
Jianchao

On 04/22/2018 11:00 PM, jianchao.wang wrote:
> Hi Max
> 
> That's really appreciated!
> Here are my test scripts.
> 
> loop_reset_controller.sh
> #!/bin/bash
> while true
> do
> 	echo 1 > /sys/block/nvme0n1/device/reset_controller
> 	sleep 1
> done
> 
> loop_unbind_driver.sh
> #!/bin/bash
> while true
> do
> 	echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/unbind
> 	sleep 2
> 	echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/bind
> 	sleep 2
> done
> 
> loop_io.sh
> #!/bin/bash
> 
> file="/dev/nvme0n1"
> echo $file
> while true;
> do
> 	if [ -e $file ];then
> 		fio fio_job_rand_read.ini
> 	else
> 		echo "Not found"
> 		sleep 1
> 	fi
> done
> 
> The fio job is as below:
> size=512m
> rw=randread
> bs=4k
> ioengine=libaio
> iodepth=64
> direct=1
> numjobs=16
> filename=/dev/nvme0n1
> group_reporting
> 
> I started them in sequence: loop_io.sh, loop_reset_controller.sh,
> loop_unbind_driver.sh. If lucky, I will get an io hang within 3 minutes. ;)
> Such as:
> 
> [  142.858074] nvme nvme0: pci function 0000:02:00.0
> [  144.972256] nvme nvme0: failed to mark controller state 1
> [  144.972289] nvme nvme0: Removing after probe failure status: 0
> [  185.312344] INFO: task bash:1673 blocked for more than 30 seconds.
> [  185.312889]       Not tainted 4.17.0-rc1+ #6
> [  185.312950] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  185.313049] bash            D    0  1673   1629 0x00000080
> [  185.313061] Call Trace:
> [  185.313083]  ? __schedule+0x3de/0xac0
> [  185.313103]  schedule+0x3c/0x90
> [  185.313111]  blk_mq_freeze_queue_wait+0x44/0x90
> [  185.313123]  ? wait_woken+0x90/0x90
> [  185.313133]  blk_cleanup_queue+0xe1/0x280
> [  185.313145]  nvme_ns_remove+0x1c8/0x260
> [  185.313159]  nvme_remove_namespaces+0x7f/0xa0
> [  185.313170]  nvme_remove+0x6c/0x130
> [  185.313181]  pci_device_remove+0x36/0xb0
> [  185.313193]  device_release_driver_internal+0x160/0x230
> [  185.313205]  unbind_store+0xfe/0x150
> [  185.313219]  kernfs_fop_write+0x114/0x190
> [  185.313234]  __vfs_write+0x23/0x150
> [  185.313246]  ? rcu_read_lock_sched_held+0x3f/0x70
> [  185.313252]  ? preempt_count_sub+0x92/0xd0
> [  185.313259]  ? __sb_start_write+0xf8/0x200
> [  185.313271]  vfs_write+0xc5/0x1c0
> [  185.313284]  ksys_write+0x45/0xa0
> [  185.313298]  do_syscall_64+0x5a/0x1a0
> [  185.313308]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> And I get the following information in block debugfs:
> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat hctx6/cpu6/rq_list
> 000000001192d19b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=69, .internal_tag=-1}
> 00000000c33c8a5b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat state
> DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED
> 
> We can see there were reqs on the ctx rq_list while the request_queue was
> QUIESCED.
> 
> Thanks again!!
> Jianchao
> 
> On 04/22/2018 10:48 PM, Max Gurtovoy wrote:
>> 
>> 
>> On 4/22/2018 5:25 PM, jianchao.wang wrote:
>>> Hi Max
>>>
>>> No, I only tested it on a PCIe one.
>>> Sorry that I didn't state that.
>> 
>> Please send your exact test steps and we'll run them using the RDMA
>> transport. I also want to run a mini regression on this one since it may
>> affect other flows.
>> 
>>>
>>> Thanks
>>> Jianchao
>>>
>>> On 04/22/2018 10:18 PM, Max Gurtovoy wrote:
>>>> Hi Jianchao,
>>>> Since this patch is in the core, have you tested it with some fabrics
>>>> drivers too? RDMA/FC?
>>>>
>>>> thanks,
>>>> Max.
>>>>
>>>> On 4/22/2018 4:32 PM, jianchao.wang wrote:
>>>>> Hi Keith
>>>>>
>>>>> Would you please take a look at this patch?
>>>>>
>>>>> This issue can be reproduced easily with a driver bind/unbind loop,
>>>>> a reset loop and an IO loop running at the same time.
>>>>>
>>>>> Thanks
>>>>> Jianchao
>>>>>
>>>>> On 04/19/2018 04:29 PM, Jianchao Wang wrote:
>>>>>> There is a race between nvme_remove and nvme_reset_work that can
>>>>>> lead to an io hang.
>>>>>>
>>>>>> nvme_remove                     nvme_reset_work
>>>>>> -> change state to DELETING
>>>>>>                                 -> fail to change state to LIVE
>>>>>>                                 -> nvme_remove_dead_ctrl
>>>>>>                                   -> nvme_dev_disable
>>>>>>                                     -> quiesce request_queue
>>>>>>                                   -> queue remove_work
>>>>>> -> cancel_work_sync reset_work
>>>>>> -> nvme_remove_namespaces
>>>>>>   -> splice ctrl->namespaces
>>>>>>                                 nvme_remove_dead_ctrl_work
>>>>>>                                 -> nvme_kill_queues
>>>>>>   -> nvme_ns_remove               does nothing
>>>>>>     -> blk_cleanup_queue
>>>>>>       -> blk_freeze_queue
>>>>>>
>>>>>> In the end, the request_queue is still in the quiesced state when we
>>>>>> wait for the freeze, so we get an io hang here.
>>>>>>
>>>>>> To fix it, unquiesce the request_queue directly before nvme_ns_remove.
>>>>>> We have already spliced ctrl->namespaces, so nobody can access the
>>>>>> namespaces or quiesce the queue any more.
>>>>>>
>>>>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>>>>>> ---
>>>>>>  drivers/nvme/host/core.c | 9 ++++++++-
>>>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>>> index 9df4f71..0e95082 100644
>>>>>> --- a/drivers/nvme/host/core.c
>>>>>> +++ b/drivers/nvme/host/core.c
>>>>>> @@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>>>>>  	list_splice_init(&ctrl->namespaces, &ns_list);
>>>>>>  	up_write(&ctrl->namespaces_rwsem);
>>>>>>  
>>>>>> -	list_for_each_entry_safe(ns, next, &ns_list, list)
>>>>>> +	/*
>>>>>> +	 * After splicing the namespaces list from ctrl->namespaces,
>>>>>> +	 * nobody can get at them anymore, so unquiesce the request_queue
>>>>>> +	 * forcibly to avoid an io hang.
>>>>>> +	 */
>>>>>> +	list_for_each_entry_safe(ns, next, &ns_list, list) {
>>>>>> +		blk_mq_unquiesce_queue(ns->queue);
>>>>>>  		nvme_ns_remove(ns);
>>>>>> +	}
>>>>>>  }
>>>>>>  EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Linux-nvme mailing list
>>>>> Linux-nvme@lists.infradead.org
>>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>>>
>>>>
>>>> _______________________________________________
>>>> Linux-nvme mailing list
>>>> Linux-nvme@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>>
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread
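Both the PCIe and the RDMA hang begin the same way: once the controller
has entered DELETING, the reset path can no longer transition it back to
LIVE, yet the quiesce it has already performed is never undone. Below is a
condensed sketch of the relevant state-machine rules, paraphrasing the
v4.17 nvme_change_ctrl_state() transition table (most states and
transitions, and the state lock, are omitted; the helper name here is
invented for the sketch):

```c
/*
 * Condensed paraphrase of the v4.17 nvme controller state machine,
 * showing only the transitions relevant to this thread.
 */
static bool nvme_state_change_allowed(enum nvme_ctrl_state old,
				      enum nvme_ctrl_state new)
{
	switch (new) {
	case NVME_CTRL_LIVE:
		/*
		 * A reset can only re-enter LIVE from RESETTING or
		 * CONNECTING; once remove/delete has set DELETING, this
		 * fails and the reset path bails out -- after it has
		 * already quiesced the namespace queues.
		 */
		return old == NVME_CTRL_NEW ||
		       old == NVME_CTRL_RESETTING ||
		       old == NVME_CTRL_CONNECTING;
	case NVME_CTRL_DELETING:
		/* Deletion is allowed from almost any state, so it wins. */
		return old != NVME_CTRL_DELETING &&
		       old != NVME_CTRL_DEAD;
	default:
		return false;	/* other cases omitted from this sketch */
	}
}
```

The WARNING at rdma.c:1755 in the log above is the reset work tripping
over exactly this: its state change fails because the delete path already
moved the controller to DELETING.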
* [PATCH] nvme: unquiesce the queue before cleaup it @ 2018-04-26 8:27 ` jianchao.wang 0 siblings, 0 replies; 18+ messages in thread From: jianchao.wang @ 2018-04-26 8:27 UTC (permalink / raw) Hi Max I did a similar test on nvme-rdma, the underlying fabric is soft-RoCE. A io loop, reset controller loop and a delete/create controller loop. And found io hang below: [ 230.884590] WARNING: CPU: 0 PID: 150 at /home/will/u04/source_code/linux-stable/drivers/nvme/host/rdma.c:1755 nvme_rdma_reset_ctrl_work+0x3d/0xb0 [nvme_rdma] [ 230.884689] CPU: 0 PID: 150 Comm: kworker/u16:3 Not tainted 4.17.0-rc1+ #16 [ 230.884690] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017 [ 230.884693] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma] [ 230.884696] RIP: 0010:nvme_rdma_reset_ctrl_work+0x3d/0xb0 [nvme_rdma] [ 230.884697] RSP: 0018:ffffa9e30255be40 EFLAGS: 00010246 [ 230.884699] RAX: 0000000000000000 RBX: ffff9633b64c2000 RCX: 0000000000000000 [ 230.884700] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff9ac05516 [ 230.884701] RBP: ffff9633b64c23b8 R08: 0000000000000001 R09: 0000000000000000 [ 230.884702] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9633b64c2970 [ 230.884703] R13: 0000000000000000 R14: 0ffff96340ffcc80 R15: ffff9634102af9c0 [ 230.884705] FS: 0000000000000000(0000) GS:ffff963422c00000(0000) knlGS:0000000000000000 [ 230.884706] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 230.884707] CR2: 0000000001ce9ca4 CR3: 000000038040f005 CR4: 00000000003606f0 [ 230.884708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 230.884709] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 230.884710] Call Trace: [ 230.884717] process_one_work+0x1c0/0x680 [ 230.884722] worker_thread+0x22b/0x430 [ 230.884728] kthread+0x100/0x140 [ 230.884730] ? rescuer_thread+0x370/0x370 [ 230.884731] ? kthread_delayed_work_timer_fn+0x80/0x80 [ 230.884736] ret_from_fork+0x24/0x30 [ 230.884744] Code: ff 48 8d ab b8 03 00 00 48 89 ef e8 1e 25 d7 d9 31 f6 48 89 df e8 44 f9 ff ff be 04 00 00 00 48 89 ef e8 b7 13 d7 d9 84 c0 75 07 <0f> 0b 5b 5d 41 5c c3 31 f6 48 89 df e8 b2 fa ff ff 85 c0 75 39 [ 230.884799] irq event stamp: 88150 [ 230.884802] hardirqs last enabled at (88149): [<ffffffff9ac05529>] _raw_spin_unlock_irqrestore+0x59/0x70 [ 230.884803] hardirqs last disabled at (88150): [<ffffffff9ae0116c>] error_entry+0x6c/0xc0 [ 230.884805] softirqs last enabled at (87998): [<ffffffff9aa603a0>] sk_common_release+0x60/0xd0 [ 230.884807] softirqs last disabled at (87996): [<ffffffff9aa60375>] sk_common_release+0x35/0xd0 [ 230.884808] ---[ end trace c03fac253b80d77d ]--- [ 277.472094] INFO: task kworker/u16:0:6 blocked for more than 30 seconds. [ 277.473482] Tainted: G W 4.17.0-rc1+ #16 [ 277.474825] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 277.476293] kworker/u16:0 D 0 6 2 0x80000000 [ 277.476317] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [ 277.476335] Call Trace: [ 277.476365] ? __schedule+0x3de/0xac0 [ 277.476393] schedule+0x3c/0x90 [ 277.476405] blk_mq_freeze_queue_wait+0x44/0x90 [ 277.476419] ? wait_woken+0x90/0x90 [ 277.476437] blk_cleanup_queue+0xe1/0x280 [ 277.476453] nvme_ns_remove+0x1c8/0x260 [ 277.476475] nvme_remove_namespaces+0x7f/0xa0 [ 277.476495] nvme_delete_ctrl_work+0x4b/0x80 [ 277.476508] process_one_work+0x1c0/0x680 [ 277.476535] worker_thread+0x22b/0x430 [ 277.476562] kthread+0x100/0x140 [ 277.476572] ? rescuer_thread+0x370/0x370 [ 277.476579] ? 
kthread_delayed_work_timer_fn+0x80/0x80
[  277.476598]  ret_from_fork+0x24/0x30

00000000ea33345b {.op=READ, .cmd_flags=23, .rq_flags=MQ_INFLIGHT|IO_STAT, .state=idle, .tag=19, .internal_tag=-1}
0000000071218c5a {.op=READ, .cmd_flags=23, .rq_flags=MQ_INFLIGHT|IO_STAT, .state=idle, .tag=45, .internal_tag=-1}
root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme1c1n1# cat state
DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED

This is due to the following race:

nvme_delete_ctrl                     nvme_rdma_reset_ctrl_work
 -> change state to DELETING
 -> queue delete_work
                                      -> nvme_rdma_shutdown_ctrl
                                        -> nvme_stop_queues
                                          -> blk_mq_quiesce_queue
                                      -> change to LIVE state fails
                                      -> return
nvme_delete_ctrl_work
 -> nvme_remove_namespaces
   -> nvme_ns_remove
     -> blk_cleanup_queue
       -> blk_freeze_queue

This patch could also fix this issue.

Thanks
Jianchao

On 04/22/2018 11:00 PM, jianchao.wang wrote:
> Hi Max
>
> That's really appreciated!
> Here is my test script.
>
> loop_reset_controller.sh
> #!/bin/bash
> while true
> do
>     echo 1 > /sys/block/nvme0n1/device/reset_controller
>     sleep 1
> done
>
> loop_unbind_driver.sh
> #!/bin/bash
> while true
> do
>     echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/unbind
>     sleep 2
>     echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/bind
>     sleep 2
> done
>
> loop_io.sh
> #!/bin/bash
>
> file="/dev/nvme0n1"
> echo $file
> while true;
> do
>     if [ -e $file ];then
>         fio fio_job_rand_read.ini
>     else
>         echo "Not found"
>         sleep 1
>     fi
> done
>
> The fio job is as below:
> size=512m
> rw=randread
> bs=4k
> ioengine=libaio
> iodepth=64
> direct=1
> numjobs=16
> filename=/dev/nvme0n1
> group_reporting
>
> I started them in sequence: loop_io.sh, loop_reset_controller.sh, loop_unbind_driver.sh.
> And if lucky, I will get an io hang in 3 minutes. ;)
> Such as:
>
> [  142.858074] nvme nvme0: pci function 0000:02:00.0
> [  144.972256] nvme nvme0: failed to mark controller state 1
> [  144.972289] nvme nvme0: Removing after probe failure status: 0
> [  185.312344] INFO: task bash:1673 blocked for more than 30 seconds.
> [  185.312889]       Not tainted 4.17.0-rc1+ #6
> [  185.312950] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  185.313049] bash            D    0  1673   1629 0x00000080
> [  185.313061] Call Trace:
> [  185.313083]  ? __schedule+0x3de/0xac0
> [  185.313103]  schedule+0x3c/0x90
> [  185.313111]  blk_mq_freeze_queue_wait+0x44/0x90
> [  185.313123]  ? wait_woken+0x90/0x90
> [  185.313133]  blk_cleanup_queue+0xe1/0x280
> [  185.313145]  nvme_ns_remove+0x1c8/0x260
> [  185.313159]  nvme_remove_namespaces+0x7f/0xa0
> [  185.313170]  nvme_remove+0x6c/0x130
> [  185.313181]  pci_device_remove+0x36/0xb0
> [  185.313193]  device_release_driver_internal+0x160/0x230
> [  185.313205]  unbind_store+0xfe/0x150
> [  185.313219]  kernfs_fop_write+0x114/0x190
> [  185.313234]  __vfs_write+0x23/0x150
> [  185.313246]  ? rcu_read_lock_sched_held+0x3f/0x70
> [  185.313252]  ? preempt_count_sub+0x92/0xd0
> [  185.313259]  ? __sb_start_write+0xf8/0x200
> [  185.313271]  vfs_write+0xc5/0x1c0
> [  185.313284]  ksys_write+0x45/0xa0
> [  185.313298]  do_syscall_64+0x5a/0x1a0
> [  185.313308]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> And get the following information in block debugfs:
> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat hctx6/cpu6/rq_list
> 000000001192d19b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=69, .internal_tag=-1}
> 00000000c33c8a5b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat state
> DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED
>
> We can see there were reqs on the ctx rq_list and the request_queue is QUIESCED.
>
> Thanks again !!
> Jianchao
>
> On 04/22/2018 10:48 PM, Max Gurtovoy wrote:
>> On 4/22/2018 5:25 PM, jianchao.wang wrote:
>>> Hi Max
>>>
>>> No, I only tested it on a PCIe one.
>>> And sorry for that I didn't state that.
>>
>> Please send your exact test steps and we'll run it using the RDMA transport.
>> I also want to run a mini regression on this one since it may affect other flows.
>>
>>> Thanks
>>> Jianchao
>>>
>>> On 04/22/2018 10:18 PM, Max Gurtovoy wrote:
>>>> Hi Jianchao,
>>>> Since this patch is in the core, have you tested it using some fabrics drivers too ? RDMA/FC ?
>>>>
>>>> thanks,
>>>> Max.
>>>>
>>>> On 4/22/2018 4:32 PM, jianchao.wang wrote:
>>>>> Hi Keith
>>>>>
>>>>> Would you please take a look at this patch?
>>>>>
>>>>> This issue could be reproduced easily with a driver bind/unbind loop,
>>>>> a reset loop and an IO loop at the same time.
>>>>>
>>>>> Thanks
>>>>> Jianchao
>>>>>
>>>>> On 04/19/2018 04:29 PM, Jianchao Wang wrote:
>>>>>> There is race between nvme_remove and nvme_reset_work that can
>>>>>> lead to io hang.
>>>>>>
>>>>>> nvme_remove                      nvme_reset_work
>>>>>> -> change state to DELETING
>>>>>>                                   -> fail to change state to LIVE
>>>>>>                                   -> nvme_remove_dead_ctrl
>>>>>>                                     -> nvme_dev_disable
>>>>>>                                       -> quiesce request_queue
>>>>>>                                     -> queue remove_work
>>>>>> -> cancel_work_sync reset_work
>>>>>> -> nvme_remove_namespaces
>>>>>>   -> splice ctrl->namespaces
>>>>>>                                   nvme_remove_dead_ctrl_work
>>>>>>                                   -> nvme_kill_queues
>>>>>>   -> nvme_ns_remove                  do nothing
>>>>>>     -> blk_cleanup_queue
>>>>>>       -> blk_freeze_queue
>>>>>> Finally, the request_queue is quiesced state when wait freeze,
>>>>>> we will get io hang here.
>>>>>>
>>>>>> To fix it, unquiesce the request_queue directly before nvme_ns_remove.
>>>>>> We have spliced the ctrl->namespaces, so nobody could access them
>>>>>> and quiesce the queue any more.
>>>>>>
>>>>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>>>>>> ---
>>>>>>  drivers/nvme/host/core.c | 9 ++++++++-
>>>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>>> index 9df4f71..0e95082 100644
>>>>>> --- a/drivers/nvme/host/core.c
>>>>>> +++ b/drivers/nvme/host/core.c
>>>>>> @@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>>>>>  	list_splice_init(&ctrl->namespaces, &ns_list);
>>>>>>  	up_write(&ctrl->namespaces_rwsem);
>>>>>>
>>>>>> -	list_for_each_entry_safe(ns, next, &ns_list, list)
>>>>>> +	/*
>>>>>> +	 * After splice the namespaces list from the ctrl->namespaces,
>>>>>> +	 * nobody could get them anymore, let's unquiesce the request_queue
>>>>>> +	 * forcibly to avoid io hang.
>>>>>> +	 */
>>>>>> +	list_for_each_entry_safe(ns, next, &ns_list, list) {
>>>>>> +		blk_mq_unquiesce_queue(ns->queue);
>>>>>>  		nvme_ns_remove(ns);
>>>>>> +	}
>>>>>>  }
>>>>>>  EXPORT_SYMBOL_GPL(nvme_remove_namespaces);

^ permalink raw reply	[flat|nested] 18+ messages in thread
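To make the hang mechanism above concrete: blk_mq_freeze_queue_wait() blocks until the queue's usage counter drains to zero, but a quiesced queue never dispatches the requests parked on the ctx rq_list, so the references they hold are never dropped. The following stand-alone C program is a minimal sketch of that interaction, not kernel code -- usage_counter, quiesced and dispatcher are invented stand-ins for q_usage_counter, the QUIESCED flag and the blk-mq dispatch path. Commenting out the unquiesce step makes the wait loop spin forever, which is the io hang the patch avoids by calling blk_mq_unquiesce_queue() before nvme_ns_remove().

/* build: gcc -pthread -o hang_model hang_model.c  (filename is arbitrary) */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Two requests are parked on the ctx rq_list; each holds a reference
 * on the queue, modelled here as a plain usage counter. */
static atomic_int usage_counter = 2;

/* nvme_stop_queues()/nvme_dev_disable() left the queue quiesced. */
static atomic_bool quiesced = true;

/* Stand-in for the blk-mq dispatch side: while the queue is quiesced
 * it refuses to touch the parked requests, so their references are
 * never dropped. */
static void *dispatcher(void *arg)
{
	(void)arg;
	while (atomic_load(&quiesced))
		usleep(1000);
	/* Unquiesced: the parked requests get dispatched (or failed)
	 * and drop their queue references. */
	atomic_fetch_sub(&usage_counter, 1);
	atomic_fetch_sub(&usage_counter, 1);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, dispatcher, NULL);

	/* The fix: blk_mq_unquiesce_queue() before cleanup.
	 * Comment out the next line to reproduce the hang. */
	atomic_store(&quiesced, false);

	/* blk_mq_freeze_queue_wait(): blocks until every reference on
	 * the queue is gone.  With the queue still quiesced this loop
	 * never terminates -- the io hang seen in the traces above. */
	while (atomic_load(&usage_counter) > 0)
		usleep(1000);

	pthread_join(t, NULL);
	printf("queue drained, blk_cleanup_queue() can proceed\n");
	return 0;
}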
* Re: [PATCH] nvme: unquiesce the queue before cleaup it 2018-04-26 8:27 ` jianchao.wang @ 2018-04-26 10:23 ` Max Gurtovoy 2018-04-27 8:51 ` jianchao.wang 1 sibling, 1 reply; 18+ messages in thread From: Max Gurtovoy @ 2018-04-26 10:23 UTC (permalink / raw)

Hi Jianchao,
I actually tried this scenario with real HW and was able to repro the hang.
Unfortunately, after applying your patch I got a NULL deref:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
WARNING: CPU: 5 PID: 0 at block/blk-mq.c:526 __blk_mq_complete_request+0x154/0x160

I guess it's the same issue that IsraelR/Bart/Ming are trying to fix in the block layer regarding completing requests.
I'll add IsraelR's proposed fix to nvme-rdma that is currently on hold and see what happens.
Nonetheless, I don't like the situation that the reset and delete flows can run concurrently.

-Max.

On 4/26/2018 11:27 AM, jianchao.wang wrote:
> Hi Max
>
> I did a similar test on nvme-rdma, the underlying fabric is soft-RoCE.
> An io loop, a reset controller loop and a delete/create controller loop.
> And found an io hang below:
>
> [  230.884590] WARNING: CPU: 0 PID: 150 at /home/will/u04/source_code/linux-stable/drivers/nvme/host/rdma.c:1755 nvme_rdma_reset_ctrl_work+0x3d/0xb0 [nvme_rdma]
> [  230.884689] CPU: 0 PID: 150 Comm: kworker/u16:3 Not tainted 4.17.0-rc1+ #16
> [  230.884690] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
> [  230.884693] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma]
> [  230.884696] RIP: 0010:nvme_rdma_reset_ctrl_work+0x3d/0xb0 [nvme_rdma]
> [  230.884697] RSP: 0018:ffffa9e30255be40 EFLAGS: 00010246
> [  230.884699] RAX: 0000000000000000 RBX: ffff9633b64c2000 RCX: 0000000000000000
> [  230.884700] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff9ac05516
> [  230.884701] RBP: ffff9633b64c23b8 R08: 0000000000000001 R09: 0000000000000000
> [  230.884702] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9633b64c2970
> [  230.884703] R13: 0000000000000000 R14: 0ffff96340ffcc80 R15: ffff9634102af9c0
> [  230.884705] FS:  0000000000000000(0000) GS:ffff963422c00000(0000) knlGS:0000000000000000
> [  230.884706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  230.884707] CR2: 0000000001ce9ca4 CR3: 000000038040f005 CR4: 00000000003606f0
> [  230.884708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  230.884709] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  230.884710] Call Trace:
> [  230.884717]  process_one_work+0x1c0/0x680
> [  230.884722]  worker_thread+0x22b/0x430
> [  230.884728]  kthread+0x100/0x140
> [  230.884730]  ? rescuer_thread+0x370/0x370
> [  230.884731]  ? kthread_delayed_work_timer_fn+0x80/0x80
> [  230.884736]  ret_from_fork+0x24/0x30
> [  230.884744] Code: ff 48 8d ab b8 03 00 00 48 89 ef e8 1e 25 d7 d9 31 f6 48 89 df e8 44 f9 ff ff be 04 00 00 00 48 89 ef e8 b7 13 d7 d9 84 c0 75 07 <0f> 0b 5b 5d 41 5c c3 31 f6 48 89 df e8 b2 fa ff ff 85 c0 75 39
> [  230.884799] irq event stamp: 88150
> [  230.884802] hardirqs last  enabled at (88149): [<ffffffff9ac05529>] _raw_spin_unlock_irqrestore+0x59/0x70
> [  230.884803] hardirqs last disabled at (88150): [<ffffffff9ae0116c>] error_entry+0x6c/0xc0
> [  230.884805] softirqs last  enabled at (87998): [<ffffffff9aa603a0>] sk_common_release+0x60/0xd0
> [  230.884807] softirqs last disabled at (87996): [<ffffffff9aa60375>] sk_common_release+0x35/0xd0
> [  230.884808] ---[ end trace c03fac253b80d77d ]---
> [  277.472094] INFO: task kworker/u16:0:6 blocked for more than 30 seconds.
> [  277.473482]       Tainted: G        W         4.17.0-rc1+ #16
> [  277.474825] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  277.476293] kworker/u16:0   D    0     6      2 0x80000000
> [  277.476317] Workqueue: nvme-delete-wq nvme_delete_ctrl_work
> [  277.476335] Call Trace:
> [  277.476365]  ? __schedule+0x3de/0xac0
> [  277.476393]  schedule+0x3c/0x90
> [  277.476405]  blk_mq_freeze_queue_wait+0x44/0x90
> [  277.476419]  ? wait_woken+0x90/0x90
> [  277.476437]  blk_cleanup_queue+0xe1/0x280
> [  277.476453]  nvme_ns_remove+0x1c8/0x260
> [  277.476475]  nvme_remove_namespaces+0x7f/0xa0
> [  277.476495]  nvme_delete_ctrl_work+0x4b/0x80
> [  277.476508]  process_one_work+0x1c0/0x680
> [  277.476535]  worker_thread+0x22b/0x430
> [  277.476562]  kthread+0x100/0x140
> [  277.476572]  ? rescuer_thread+0x370/0x370
> [  277.476579]  ? kthread_delayed_work_timer_fn+0x80/0x80
> [  277.476598]  ret_from_fork+0x24/0x30
>
> 00000000ea33345b {.op=READ, .cmd_flags=23, .rq_flags=MQ_INFLIGHT|IO_STAT, .state=idle, .tag=19, .internal_tag=-1}
> 0000000071218c5a {.op=READ, .cmd_flags=23, .rq_flags=MQ_INFLIGHT|IO_STAT, .state=idle, .tag=45, .internal_tag=-1}
> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme1c1n1# cat state
> DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED
>
> This is due to the following race:
>
> nvme_delete_ctrl                     nvme_rdma_reset_ctrl_work
>  -> change state to DELETING
>  -> queue delete_work
>                                       -> nvme_rdma_shutdown_ctrl
>                                         -> nvme_stop_queues
>                                           -> blk_mq_quiesce_queue
>                                       -> change to LIVE state fails
>                                       -> return
> nvme_delete_ctrl_work
>  -> nvme_remove_namespaces
>    -> nvme_ns_remove
>      -> blk_cleanup_queue
>        -> blk_freeze_queue
>
> This patch could also fix this issue.
>
> Thanks
> Jianchao
>
> On 04/22/2018 11:00 PM, jianchao.wang wrote:
>> [snip]

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] nvme: unquiesce the queue before cleaup it 2018-04-26 10:23 ` Max Gurtovoy @ 2018-04-27 8:51 ` jianchao.wang 0 siblings, 0 replies; 18+ messages in thread From: jianchao.wang @ 2018-04-27 8:51 UTC (permalink / raw) To: Max Gurtovoy, keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel

Hi Max

On 04/26/2018 06:23 PM, Max Gurtovoy wrote:
> Hi Jianchao,
> I actually tried this scenario with real HW and was able to repro the hang.
> Unfortunately, after applying your patch I got a NULL deref:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
> WARNING: CPU: 5 PID: 0 at block/blk-mq.c:526 __blk_mq_complete_request+0x154/0x160

Could you please share the whole backtrace here ?

Thanks
Jianchao

> I guess it's the same issue that IsraelR/Bart/Ming are trying to fix in the block layer regarding completing requests.
> I'll add IsraelR's proposed fix to nvme-rdma that is currently on hold and see what happens.
> Nonetheless, I don't like the situation that the reset and delete flows can run concurrently.
>
> -Max.
>
> On 4/26/2018 11:27 AM, jianchao.wang wrote:
>> [snip]

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] nvme: unquiesce the queue before cleaup it 2018-04-27 8:51 ` jianchao.wang @ 2018-04-28 15:38 ` jianchao.wang 0 siblings, 0 replies; 18+ messages in thread From: jianchao.wang @ 2018-04-28 15:38 UTC (permalink / raw) To: Max Gurtovoy, keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel

Hi Max

On 04/27/2018 04:51 PM, jianchao.wang wrote:
> Hi Max
>
> On 04/26/2018 06:23 PM, Max Gurtovoy wrote:
>> Hi Jianchao,
>> I actually tried this scenario with real HW and was able to repro the hang.
>> Unfortunately, after applying your patch I got a NULL deref:
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
>> WARNING: CPU: 5 PID: 0 at block/blk-mq.c:526 __blk_mq_complete_request+0x154/0x160
>
> Could you please share the whole backtrace here ?

I have reproduced the WARNING in __blk_mq_complete_request:

[107612.367131] RIP: 0010:__blk_mq_complete_request+0x2aa/0x2e0
[107612.367133] RSP: 0018:ffff88036e627b68 EFLAGS: 00010202
[107612.367137] RAX: 0000000000000002 RBX: ffff88034708a440 RCX: ffffffff9452df6c
[107612.367140] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88034708a540
[107612.367142] RBP: 1ffff1006dcc4f73 R08: ffffed007901472f R09: ffffed007901472f
[107612.367144] R10: 0000000000000001 R11: ffffed007901472e R12: ffffe8ffffdc8940
[107612.367146] R13: ffff88034708a540 R14: ffff8803c51c33c0 R15: ffff88032d091ea0
[107612.367149] FS:  0000000000000000(0000) GS:ffff8803c8080000(0000) knlGS:0000000000000000
[107612.367151] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[107612.367153] CR2: 00000000018ee000 CR3: 00000003c1e13001 CR4: 00000000003606e0
[107612.367155] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[107612.367157] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[107612.367159] Call Trace:
[107612.367166]  blk_mq_complete_request+0x10a/0x140
[107612.367170]  ? hctx_lock+0xe0/0xe0
[107612.367176]  ? __wait_rcu_gp+0x1b6/0x210
[107612.367184]  ? nvme_start_ctrl+0x190/0x190
[107612.367188]  bt_tags_iter+0x87/0xa0
[107612.367194]  blk_mq_tagset_busy_iter+0x1da/0x3e0
[107612.367201]  ? nvme_start_ctrl+0x190/0x190
[107612.367206]  ? __blk_mq_get_tag+0x140/0x140
[107612.367210]  ? nvme_start_ctrl+0x190/0x190
[107612.367221]  nvme_rdma_shutdown_ctrl+0x50/0xe0 [nvme_rdma]
[107612.367228]  nvme_rdma_reset_ctrl_work+0x2c/0xd0 [nvme_rdma]
[107612.367233]  process_one_work+0x3ca/0xaa0
[107612.367243]  ? pwq_dec_nr_in_flight+0x110/0x110
[107612.367246]  ? worker_thread+0x19d/0x6c0
[107612.367257]  worker_thread+0x4e2/0x6c0
[107612.367283]  kthread+0x18d/0x1e0
[107612.367286]  ? rescuer_thread+0x740/0x740
[107612.367289]  ? kthread_delayed_work_timer_fn+0xe0/0xe0
[107612.367293]  ? kthread_delayed_work_timer_fn+0xe0/0xe0
[107612.367296]  ret_from_fork+0x24/0x30

In my test there is _no_ timeout, just a controller reset, and the state of the request is COMPLETE. This is a race between the reset work and the normal rdma completion path:

nvme_rdma_reset_ctrl_work
  -> nvme_rdma_shutdown_ctrl
    -> nvme_cancel_request
      -> blk_mq_complete_request

We don't stop the rdma completion path from being triggered before we invoke nvme_cancel_request.

I think it is related to a change introduced by the new blk-mq timeout implementation. Before it, blk_mq_complete_request used blk_mark_rq_complete to sync with the timeout path; obviously, this also made blk_mq_complete_request able to sync with itself, so the blk_mq_complete_request calls from the reset work and from the rdma completion path could dance with each other well. However, after the new mechanism was introduced, blk_mq_complete_request cannot sync with itself anymore, so we indeed risk completing a request twice when we reset the controller forcibly in the rdma driver.

Thanks
Jianchao

>
> Thanks
> Jianchao
>
>> I guess it's the same issue that IsraelR/Bart/Ming are trying to fix in the block layer regarding completing requests.
>> I'll add IsraelR's proposed fix to nvme-rdma that is currently on hold and see what happens.
>> Nonetheless, I don't like the situation that the reset and delete flows can run concurrently.
>>
>> -Max.
>>
>> On 4/26/2018 11:27 AM, jianchao.wang wrote:
>>> [snip]

^ permalink raw reply	[flat|nested] 18+ messages in thread
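To illustrate the self-synchronization described above: before the 4.17 timeout rework, blk_mark_rq_complete() was an atomic test-and-set on a per-request bit, so of two concurrent completers only one proceeded. The snippet below is a user-space sketch of that guard, not the kernel's code -- this struct request, mark_rq_complete() and complete_request() are simplified stand-ins. Without such a guard, the cancel path in nvme_rdma_shutdown_ctrl() and the rdma completion path can both complete the same request, which is the double completion suspected here.

#include <stdatomic.h>
#include <stdio.h>

/* Toy request: the flag stands in for the old REQ_ATOM_COMPLETE bit. */
struct request {
	atomic_flag completed;
};

/* Roughly what the old blk_mark_rq_complete() did: an atomic
 * test-and-set, returning nonzero if the request was already
 * marked complete. */
static int mark_rq_complete(struct request *rq)
{
	return atomic_flag_test_and_set(&rq->completed);
}

static void complete_request(struct request *rq, const char *who)
{
	if (!mark_rq_complete(rq))
		printf("%s: wins, runs __blk_mq_complete_request()\n", who);
	else
		printf("%s: request already completed, backs off\n", who);
}

int main(void)
{
	struct request rq = { ATOMIC_FLAG_INIT };

	/* The two racing callers from the analysis above; they run
	 * sequentially here for determinism, but the test-and-set is
	 * what keeps them safe when they really race. */
	complete_request(&rq, "nvme_cancel_request (reset work)");
	complete_request(&rq, "rdma completion path");
	return 0;
}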
* [PATCH] nvme: unquiesce the queue before cleaup it
@ 2018-04-28 15:38 ` jianchao.wang
  0 siblings, 0 replies; 18+ messages in thread
From: jianchao.wang @ 2018-04-28 15:38 UTC (permalink / raw)

Hi Max

On 04/27/2018 04:51 PM, jianchao.wang wrote:
> Hi Max
>
> On 04/26/2018 06:23 PM, Max Gurtovoy wrote:
>> Hi Jianchao,
>> I actually tried this scenario with real HW and was able to repro the hang.
>> Unfortunately, after applying your patch I got a NULL deref:
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
>> WARNING: CPU: 5 PID: 0 at block/blk-mq.c:526 __blk_mq_complete_request+0x154/0x160
>>
>
> Could you please share the whole backtrace here?

I have reproduced the WARNING in __blk_mq_complete_request.

[107612.367131] RIP: 0010:__blk_mq_complete_request+0x2aa/0x2e0
[107612.367133] RSP: 0018:ffff88036e627b68 EFLAGS: 00010202
[107612.367137] RAX: 0000000000000002 RBX: ffff88034708a440 RCX: ffffffff9452df6c
[107612.367140] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88034708a540
[107612.367142] RBP: 1ffff1006dcc4f73 R08: ffffed007901472f R09: ffffed007901472f
[107612.367144] R10: 0000000000000001 R11: ffffed007901472e R12: ffffe8ffffdc8940
[107612.367146] R13: ffff88034708a540 R14: ffff8803c51c33c0 R15: ffff88032d091ea0
[107612.367149] FS:  0000000000000000(0000) GS:ffff8803c8080000(0000) knlGS:0000000000000000
[107612.367151] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[107612.367153] CR2: 00000000018ee000 CR3: 00000003c1e13001 CR4: 00000000003606e0
[107612.367155] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[107612.367157] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[107612.367159] Call Trace:
[107612.367166]  blk_mq_complete_request+0x10a/0x140
[107612.367170]  ? hctx_lock+0xe0/0xe0
[107612.367176]  ? __wait_rcu_gp+0x1b6/0x210
[107612.367184]  ? nvme_start_ctrl+0x190/0x190
[107612.367188]  bt_tags_iter+0x87/0xa0
[107612.367194]  blk_mq_tagset_busy_iter+0x1da/0x3e0
[107612.367201]  ? nvme_start_ctrl+0x190/0x190
[107612.367206]  ? __blk_mq_get_tag+0x140/0x140
[107612.367210]  ? nvme_start_ctrl+0x190/0x190
[107612.367221]  nvme_rdma_shutdown_ctrl+0x50/0xe0 [nvme_rdma]
[107612.367228]  nvme_rdma_reset_ctrl_work+0x2c/0xd0 [nvme_rdma]
[107612.367233]  process_one_work+0x3ca/0xaa0
[107612.367243]  ? pwq_dec_nr_in_flight+0x110/0x110
[107612.367246]  ? worker_thread+0x19d/0x6c0
[107612.367257]  worker_thread+0x4e2/0x6c0
[107612.367283]  kthread+0x18d/0x1e0
[107612.367286]  ? rescuer_thread+0x740/0x740
[107612.367289]  ? kthread_delayed_work_timer_fn+0xe0/0xe0
[107612.367293]  ? kthread_delayed_work_timer_fn+0xe0/0xe0
[107612.367296]  ret_from_fork+0x24/0x30

In my test there is _no_ timeout, just a controller reset.

The state of the request is COMPLETE.

This is a race between the reset work and the normal rdma completion path:

nvme_rdma_reset_ctrl_work
  -> nvme_rdma_shutdown_ctrl
    -> nvme_cancel_request
      -> blk_mq_complete_request

We don't stop the rdma completion path from being triggered before we invoke
nvme_cancel_request.

I think it is related to a change introduced by the new blk-mq timeout
implementation. Before it, blk_mq_complete_request used blk_mark_rq_complete
to sync with the timeout path. Obviously, this also made
blk_mq_complete_request able to sync with itself: the blk_mq_complete_request
calls from the reset work and from the rdma completion path could dance with
each other well. However, after the new mechanism was introduced,
blk_mq_complete_request cannot sync with itself anymore.
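As an illustration, here is a minimal userspace sketch (hypothetical names
and layout, not the actual blk-mq code) of the atomic gate that
blk_mark_rq_complete effectively provided: the first path to set the mark
owns the completion, and any concurrent caller backs off.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Sketch only: "req" and "complete_once" are made-up names. */
struct req {
	atomic_flag completed;
};

/* Whichever caller sets the flag first performs the completion;
 * the loser sees the flag already set and simply returns. */
static bool complete_once(struct req *rq)
{
	if (atomic_flag_test_and_set(&rq->completed))
		return false;	/* already completed on another path */
	/* ...the real completion work would happen here... */
	return true;
}

int main(void)
{
	struct req rq = { ATOMIC_FLAG_INIT };

	printf("normal completion path: %d\n", complete_once(&rq));	/* 1 */
	printf("cancel-on-reset path:   %d\n", complete_once(&rq));	/* 0 */
	return 0;
}

Without such a gate, the two paths above can both run the completion for the
same request.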
So we indeed have the risk of completing a request twice when we reset the
controller forcibly in the rdma driver.

Thanks
Jianchao

>
> Thanks
> Jianchao
>
>> I guess it's the same issue that IsraelR/Bart/Ming are trying to fix in the block layer regarding completing requests.
>> I'll add IsraelR's proposed fix to nvme-rdma, which is currently on hold, and see what happens.
>> Nonetheless, I don't like the situation that the reset and delete flows can run concurrently.
>>
>> -Max.
>>
>> On 4/26/2018 11:27 AM, jianchao.wang wrote:
>>> Hi Max
>>>
>>> I did a similar test on nvme-rdma; the underlying fabric is soft-RoCE.
>>> An io loop, a reset controller loop and a delete/create controller loop.
>>> And found the io hang below:
>>>
>>> [  230.884590] WARNING: CPU: 0 PID: 150 at /home/will/u04/source_code/linux-stable/drivers/nvme/host/rdma.c:1755 nvme_rdma_reset_ctrl_work+0x3d/0xb0 [nvme_rdma]
>>> [  230.884689] CPU: 0 PID: 150 Comm: kworker/u16:3 Not tainted 4.17.0-rc1+ #16
>>> [  230.884690] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
>>> [  230.884693] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma]
>>> [  230.884696] RIP: 0010:nvme_rdma_reset_ctrl_work+0x3d/0xb0 [nvme_rdma]
>>> [  230.884697] RSP: 0018:ffffa9e30255be40 EFLAGS: 00010246
>>> [  230.884699] RAX: 0000000000000000 RBX: ffff9633b64c2000 RCX: 0000000000000000
>>> [  230.884700] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff9ac05516
>>> [  230.884701] RBP: ffff9633b64c23b8 R08: 0000000000000001 R09: 0000000000000000
>>> [  230.884702] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9633b64c2970
>>> [  230.884703] R13: 0000000000000000 R14: 0ffff96340ffcc80 R15: ffff9634102af9c0
>>> [  230.884705] FS:  0000000000000000(0000) GS:ffff963422c00000(0000) knlGS:0000000000000000
>>> [  230.884706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  230.884707] CR2: 0000000001ce9ca4 CR3: 000000038040f005 CR4: 00000000003606f0
>>> [  230.884708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [  230.884709] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [  230.884710] Call Trace:
>>> [  230.884717]  process_one_work+0x1c0/0x680
>>> [  230.884722]  worker_thread+0x22b/0x430
>>> [  230.884728]  kthread+0x100/0x140
>>> [  230.884730]  ? rescuer_thread+0x370/0x370
>>> [  230.884731]  ? kthread_delayed_work_timer_fn+0x80/0x80
>>> [  230.884736]  ret_from_fork+0x24/0x30
>>> [  230.884744] Code: ff 48 8d ab b8 03 00 00 48 89 ef e8 1e 25 d7 d9 31 f6 48 89 df e8 44 f9 ff ff be 04 00 00 00 48 89 ef e8 b7 13 d7 d9 84 c0 75 07 <0f> 0b 5b 5d 41 5c c3 31 f6 48 89 df e8 b2 fa ff ff 85 c0 75 39
>>> [  230.884799] irq event stamp: 88150
>>> [  230.884802] hardirqs last  enabled at (88149): [<ffffffff9ac05529>] _raw_spin_unlock_irqrestore+0x59/0x70
>>> [  230.884803] hardirqs last disabled at (88150): [<ffffffff9ae0116c>] error_entry+0x6c/0xc0
>>> [  230.884805] softirqs last  enabled at (87998): [<ffffffff9aa603a0>] sk_common_release+0x60/0xd0
>>> [  230.884807] softirqs last disabled at (87996): [<ffffffff9aa60375>] sk_common_release+0x35/0xd0
>>> [  230.884808] ---[ end trace c03fac253b80d77d ]---
>>> [  277.472094] INFO: task kworker/u16:0:6 blocked for more than 30 seconds.
>>> [  277.473482]       Tainted: G        W         4.17.0-rc1+ #16
>>> [  277.474825] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [  277.476293] kworker/u16:0   D    0     6      2 0x80000000
>>> [  277.476317] Workqueue: nvme-delete-wq nvme_delete_ctrl_work
>>> [  277.476335] Call Trace:
>>> [  277.476365]  ? __schedule+0x3de/0xac0
>>> [  277.476393]  schedule+0x3c/0x90
>>> [  277.476405]  blk_mq_freeze_queue_wait+0x44/0x90
>>> [  277.476419]  ? wait_woken+0x90/0x90
>>> [  277.476437]  blk_cleanup_queue+0xe1/0x280
>>> [  277.476453]  nvme_ns_remove+0x1c8/0x260
>>> [  277.476475]  nvme_remove_namespaces+0x7f/0xa0
>>> [  277.476495]  nvme_delete_ctrl_work+0x4b/0x80
>>> [  277.476508]  process_one_work+0x1c0/0x680
>>> [  277.476535]  worker_thread+0x22b/0x430
>>> [  277.476562]  kthread+0x100/0x140
>>> [  277.476572]  ? rescuer_thread+0x370/0x370
>>> [  277.476579]  ? kthread_delayed_work_timer_fn+0x80/0x80
>>> [  277.476598]  ret_from_fork+0x24/0x30
>>>
>>> 00000000ea33345b {.op=READ, .cmd_flags=23, .rq_flags=MQ_INFLIGHT|IO_STAT, .state=idle, .tag=19, .internal_tag=-1}
>>> 0000000071218c5a {.op=READ, .cmd_flags=23, .rq_flags=MQ_INFLIGHT|IO_STAT, .state=idle, .tag=45, .internal_tag=-1}
>>> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme1c1n1# cat state
>>> DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED
>>>
>>> This is due to the following race:
>>>
>>> nvme_delete_ctrl                    nvme_rdma_reset_ctrl_work
>>>   -> change state to DELETING
>>>   -> queue delete_work
>>>                                       -> nvme_rdma_shutdown_ctrl
>>>                                         -> nvme_stop_queues
>>>                                           -> blk_mq_quiesce_queue
>>>                                       -> change to LIVE state fails
>>>                                       -> return
>>> nvme_delete_ctrl_work
>>>   -> nvme_remove_namespaces
>>>     -> nvme_ns_remove
>>>       -> blk_cleanup_queue
>>>         -> blk_freeze_queue
>>>
>>> This patch could also fix this issue.
>>>
>>> Thanks
>>> Jianchao
>>>
>>> On 04/22/2018 11:00 PM, jianchao.wang wrote:
>>>> Hi Max
>>>>
>>>> That's really appreciated!
>>>> Here are my test scripts.
>>>>
>>>> loop_reset_controller.sh
>>>> #!/bin/bash
>>>> while true
>>>> do
>>>>     echo 1 > /sys/block/nvme0n1/device/reset_controller
>>>>     sleep 1
>>>> done
>>>>
>>>> loop_unbind_driver.sh
>>>> #!/bin/bash
>>>> while true
>>>> do
>>>>     echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/unbind
>>>>     sleep 2
>>>>     echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/bind
>>>>     sleep 2
>>>> done
>>>>
>>>> loop_io.sh
>>>> #!/bin/bash
>>>>
>>>> file="/dev/nvme0n1"
>>>> echo $file
>>>> while true;
>>>> do
>>>>     if [ -e $file ];then
>>>>         fio fio_job_rand_read.ini
>>>>     else
>>>>         echo "Not found"
>>>>         sleep 1
>>>>     fi
>>>> done
>>>>
>>>> The fio job is as below:
>>>> size=512m
>>>> rw=randread
>>>> bs=4k
>>>> ioengine=libaio
>>>> iodepth=64
>>>> direct=1
>>>> numjobs=16
>>>> filename=/dev/nvme0n1
>>>> group_reporting
>>>>
>>>> I started them in sequence: loop_io.sh, loop_reset_controller.sh, loop_unbind_driver.sh.
>>>> And if lucky, I will get an io hang within 3 minutes. ;)
>>>> Such as:
>>>>
>>>> [  142.858074] nvme nvme0: pci function 0000:02:00.0
>>>> [  144.972256] nvme nvme0: failed to mark controller state 1
>>>> [  144.972289] nvme nvme0: Removing after probe failure status: 0
>>>> [  185.312344] INFO: task bash:1673 blocked for more than 30 seconds.
>>>> [  185.312889]       Not tainted 4.17.0-rc1+ #6
>>>> [  185.312950] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [  185.313049] bash            D    0  1673   1629 0x00000080
>>>> [  185.313061] Call Trace:
>>>> [  185.313083]  ? __schedule+0x3de/0xac0
>>>> [  185.313103]  schedule+0x3c/0x90
>>>> [  185.313111]  blk_mq_freeze_queue_wait+0x44/0x90
>>>> [  185.313123]  ? wait_woken+0x90/0x90
>>>> [  185.313133]  blk_cleanup_queue+0xe1/0x280
>>>> [  185.313145]  nvme_ns_remove+0x1c8/0x260
>>>> [  185.313159]  nvme_remove_namespaces+0x7f/0xa0
>>>> [  185.313170]  nvme_remove+0x6c/0x130
>>>> [  185.313181]  pci_device_remove+0x36/0xb0
>>>> [  185.313193]  device_release_driver_internal+0x160/0x230
>>>> [  185.313205]  unbind_store+0xfe/0x150
>>>> [  185.313219]  kernfs_fop_write+0x114/0x190
>>>> [  185.313234]  __vfs_write+0x23/0x150
>>>> [  185.313246]  ? rcu_read_lock_sched_held+0x3f/0x70
>>>> [  185.313252]  ? preempt_count_sub+0x92/0xd0
>>>> [  185.313259]  ? __sb_start_write+0xf8/0x200
>>>> [  185.313271]  vfs_write+0xc5/0x1c0
>>>> [  185.313284]  ksys_write+0x45/0xa0
>>>> [  185.313298]  do_syscall_64+0x5a/0x1a0
>>>> [  185.313308]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>
>>>> And we get the following information in block debugfs:
>>>> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat hctx6/cpu6/rq_list
>>>> 000000001192d19b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=69, .internal_tag=-1}
>>>> 00000000c33c8a5b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
>>>> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat state
>>>> DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED
>>>>
>>>> We can see there were still reqs on the ctx rq_list while the request_queue was QUIESCED.
>>>>
>>>> Thanks again !!
>>>> Jianchao
>>>>
>>>> On 04/22/2018 10:48 PM, Max Gurtovoy wrote:
>>>>> [...]
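For completeness, a toy model of the hang itself (made-up names, not the
block layer code): blk_freeze_queue waits for the queue's usage counter to
drain, but a quiesced queue never dispatches, so the inflight requests on
the ctx rq_list keep the counter from ever reaching zero.

#include <stdbool.h>
#include <stdio.h>

/* Toy model only: inflight requests hold references; freeze waits for
 * the count to reach zero; quiesce stops dispatch, so the count can
 * never drop and the freeze would wait forever. */
struct toy_queue {
	int usage_count;	/* roughly, q_usage_counter */
	bool quiesced;
};

static void dispatch_all(struct toy_queue *q)
{
	if (q->quiesced)
		return;		/* nothing moves while quiesced */
	q->usage_count = 0;	/* completing requests drops the refs */
}

static void freeze(struct toy_queue *q)
{
	while (q->usage_count > 0) {
		dispatch_all(q);
		if (q->quiesced) {
			printf("hang: quiesced with %d inflight\n",
			       q->usage_count);
			return;	/* the real code would sleep forever here */
		}
	}
	printf("frozen cleanly\n");
}

int main(void)
{
	struct toy_queue q = { .usage_count = 2, .quiesced = true };

	freeze(&q);	/* reproduces the reported io hang in miniature */
	return 0;
}

That is exactly the DYING|...|QUIESCED state plus the populated rq_list seen
in the debugfs output above.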
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH] nvme: unquiesce the queue before cleaup it
  2018-04-28 15:38 ` jianchao.wang
@ 2018-04-29 15:47 ` Max Gurtovoy
  0 siblings, 0 replies; 18+ messages in thread
From: Max Gurtovoy @ 2018-04-29 15:47 UTC (permalink / raw)

On 4/28/2018 6:38 PM, jianchao.wang wrote:
> Hi Max
>
> [...]
>
> I have reproduced the WARNING in __blk_mq_complete_request.
>
> [...]
>
> In my test there is _no_ timeout, just a controller reset.
>
> The state of the request is COMPLETE.
>
> This is a race between the reset work and the normal rdma completion path:
>
> nvme_rdma_reset_ctrl_work
>   -> nvme_rdma_shutdown_ctrl
>     -> nvme_cancel_request
>       -> blk_mq_complete_request
>
> We don't stop the rdma completion path from being triggered before we
> invoke nvme_cancel_request.
We fixed that in this way (on hold, since the block layer fix has not been
pushed yet):

 drivers/nvme/host/rdma.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index c1abfc8..56fccac 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -734,7 +734,6 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
-	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	if (remove) {
 		blk_cleanup_queue(ctrl->ctrl.admin_q);
 		nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
@@ -817,7 +816,6 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 static void nvme_rdma_destroy_io_queues(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
-	nvme_rdma_stop_io_queues(ctrl);
 	if (remove) {
 		blk_cleanup_queue(ctrl->ctrl.connect_q);
 		nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.tagset);
@@ -947,6 +945,7 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	return;
 
 destroy_admin:
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	nvme_rdma_destroy_admin_queue(ctrl, false);
 requeue:
 	dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
@@ -963,12 +962,14 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
+		nvme_rdma_stop_io_queues(ctrl);
 		blk_mq_tagset_busy_iter(&ctrl->tag_set,
 					nvme_cancel_request, &ctrl->ctrl);
 		nvme_rdma_destroy_io_queues(ctrl, false);
 	}
 
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
 				nvme_cancel_request, &ctrl->ctrl);
 	nvme_rdma_destroy_admin_queue(ctrl, false);
@@ -1725,6 +1726,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 {
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
+		nvme_rdma_stop_io_queues(ctrl);
 		blk_mq_tagset_busy_iter(&ctrl->tag_set,
 					nvme_cancel_request, &ctrl->ctrl);
 		nvme_rdma_destroy_io_queues(ctrl, shutdown);
@@ -1736,6 +1738,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 	nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
 
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
 				nvme_cancel_request, &ctrl->ctrl);
 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
@@ -2001,6 +2004,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 	return &ctrl->ctrl;
 
 out_remove_admin_queue:
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	nvme_rdma_destroy_admin_queue(ctrl, true);
 out_kfree_queues:
 	kfree(ctrl->queues);
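In other words, a rough userspace sketch of the ordering rule the diff above
enforces (illustrative names only, not the literal driver flow): stop the
transport queues, and with them the completion path, before the
nvme_cancel_request sweep, so each inflight request has exactly one
completer.

#include <stdio.h>

/* Sketch with made-up names: once the transport completion path is
 * stopped, the cancel sweep is the sole completer of inflight requests,
 * so nothing can be completed twice. */
static int completion_path_live = 1;

static void stop_transport_queues(void)
{
	/* stands in for the added nvme_rdma_stop_io_queues() calls */
	completion_path_live = 0;
}

static void cancel_inflight_requests(void)
{
	/* stands in for blk_mq_tagset_busy_iter(..., nvme_cancel_request) */
	if (completion_path_live)
		printf("unsafe: would race with transport completions\n");
	else
		printf("safe: cancel sweep is the only completer\n");
}

int main(void)
{
	stop_transport_queues();	/* must come first */
	cancel_inflight_requests();
	return 0;
}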
> [...]

^ permalink raw reply related [flat|nested] 18+ messages in thread
end of thread, other threads:[~2018-04-29 15:47 UTC | newest]

Thread overview: 18+ messages
2018-04-19  8:29 [PATCH] nvme: unquiesce the queue before cleaup it Jianchao Wang
2018-04-22 13:32 ` jianchao.wang
2018-04-22 14:18 ` Max Gurtovoy
2018-04-22 14:25 ` jianchao.wang
2018-04-22 14:48 ` Max Gurtovoy
2018-04-22 15:00 ` jianchao.wang
2018-04-26  8:27 ` jianchao.wang
2018-04-26 10:23 ` Max Gurtovoy
2018-04-27  8:51 ` jianchao.wang
2018-04-28 15:38 ` jianchao.wang
2018-04-29 15:47 ` Max Gurtovoy