From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH] nvme: unquiesce the queue before cleanup it
From: "jianchao.wang" <jianchao.w.wang@oracle.com>
To: Max Gurtovoy, keith.busch@intel.com, axboe@fb.com, hch@lst.de, sagi@grimberg.me,
 linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Date: Sun, 22 Apr 2018 23:00:53 +0800
Message-ID: <74ea389f-499f-5162-b9c0-14d02e273455@oracle.com>
In-Reply-To: <09a72ec6-117c-e31f-5aa6-546e74c3c20b@mellanox.com>
References: <1524126553-16290-1-git-send-email-jianchao.w.wang@oracle.com>
 <1d51898e-cb3b-1178-c0b6-0716acbb9564@mellanox.com>
 <4ef5f9f2-b61d-89a0-f619-d15c40587f03@oracle.com>
 <09a72ec6-117c-e31f-5aa6-546e74c3c20b@mellanox.com>
Content-Type: text/plain; charset=utf-8
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Max

That's really appreciated! Here are my test scripts.

loop_reset_controller.sh
#!/bin/bash
while true
do
    echo 1 > /sys/block/nvme0n1/device/reset_controller
    sleep 1
done

loop_unbind_driver.sh
#!/bin/bash
while true
do
    echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/unbind
    sleep 2
    echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/bind
    sleep 2
done

loop_io.sh
#!/bin/bash
file="/dev/nvme0n1"
echo $file
while true; do
    if [ -e $file ]; then
        fio fio_job_rand_read.ini
    else
        echo "Not found"
        sleep 1
    fi
done

The fio job file (fio_job_rand_read.ini) is as below:
size=512m
rw=randread
bs=4k
ioengine=libaio
iodepth=64
direct=1
numjobs=16
filename=/dev/nvme0n1
group_reporting

I started them in this order: loop_io.sh, loop_reset_controller.sh, loop_unbind_driver.sh.
With a bit of luck, I get an io hang within 3 minutes. ;)
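If it is easier to run, the three loops can also be started from a single wrapper,
roughly like this (only a rough convenience sketch; the script name and the trap-based
cleanup are assumptions, not part of my original setup):

run_repro.sh
#!/bin/bash
# Sketch: start the three loops in the same order as above and keep them
# running until interrupted. Assumes the scripts live in the current
# directory and the device under test is nvme0, as in the scripts above.
set -m    # give each background job its own process group

./loop_io.sh &
pid_io=$!
./loop_reset_controller.sh &
pid_reset=$!
./loop_unbind_driver.sh &
pid_unbind=$!

# On Ctrl-C / kill, stop each loop together with its children (fio etc.).
trap 'kill -- -$pid_io -$pid_reset -$pid_unbind 2>/dev/null; exit 0' INT TERM

wait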
When the hang triggers, it looks like this:

[  142.858074] nvme nvme0: pci function 0000:02:00.0
[  144.972256] nvme nvme0: failed to mark controller state 1
[  144.972289] nvme nvme0: Removing after probe failure status: 0
[  185.312344] INFO: task bash:1673 blocked for more than 30 seconds.
[  185.312889]       Not tainted 4.17.0-rc1+ #6
[  185.312950] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  185.313049] bash            D    0  1673   1629 0x00000080
[  185.313061] Call Trace:
[  185.313083]  ? __schedule+0x3de/0xac0
[  185.313103]  schedule+0x3c/0x90
[  185.313111]  blk_mq_freeze_queue_wait+0x44/0x90
[  185.313123]  ? wait_woken+0x90/0x90
[  185.313133]  blk_cleanup_queue+0xe1/0x280
[  185.313145]  nvme_ns_remove+0x1c8/0x260
[  185.313159]  nvme_remove_namespaces+0x7f/0xa0
[  185.313170]  nvme_remove+0x6c/0x130
[  185.313181]  pci_device_remove+0x36/0xb0
[  185.313193]  device_release_driver_internal+0x160/0x230
[  185.313205]  unbind_store+0xfe/0x150
[  185.313219]  kernfs_fop_write+0x114/0x190
[  185.313234]  __vfs_write+0x23/0x150
[  185.313246]  ? rcu_read_lock_sched_held+0x3f/0x70
[  185.313252]  ? preempt_count_sub+0x92/0xd0
[  185.313259]  ? __sb_start_write+0xf8/0x200
[  185.313271]  vfs_write+0xc5/0x1c0
[  185.313284]  ksys_write+0x45/0xa0
[  185.313298]  do_syscall_64+0x5a/0x1a0
[  185.313308]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

And the block debugfs shows the following:

root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat hctx6/cpu6/rq_list
000000001192d19b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=69, .internal_tag=-1}
00000000c33c8a5b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat state
DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED

We can see that there are still requests on the ctx rq_list while the request_queue
is QUIESCED, so they can never be dispatched and the freeze never completes.
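To repeat that check quickly, the queue state and any requests still sitting on the
per-cpu rq_lists can be dumped with something like the following (just a convenience
sketch, not part of the original debugging session; adjust the device name as needed):

#!/bin/bash
# Sketch: dump the blk-mq debugfs state of a block device and print every
# software-queue rq_list that still has requests queued on it.
dev=${1:-nvme0n1}
dbg=/sys/kernel/debug/block/$dev

echo "=== $dev queue state ==="
cat "$dbg/state"

echo "=== pending requests on ctx rq_lists ==="
for f in "$dbg"/hctx*/cpu*/rq_list; do
    entries=$(cat "$f" 2>/dev/null)
    # debugfs prints one line per pending request; skip empty lists.
    if [ -n "$entries" ]; then
        echo "-- $f"
        echo "$entries"
    fi
done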
Thanks again !!

Jianchao

On 04/22/2018 10:48 PM, Max Gurtovoy wrote:
>
>
> On 4/22/2018 5:25 PM, jianchao.wang wrote:
>> Hi Max
>>
>> No, I only tested it on PCIe one.
>> And sorry for that I didn't state that.
>
> Please send your exact test steps and we'll run it using RDMA transport.
> I also want to run a mini regression on this one since it may effect other flows.
>
>>
>> Thanks
>> Jianchao
>>
>> On 04/22/2018 10:18 PM, Max Gurtovoy wrote:
>>> Hi Jianchao,
>>> Since this patch is in the core, have you tested it using some fabrics drives too ? RDMA/FC ?
>>>
>>> thanks,
>>> Max.
>>>
>>> On 4/22/2018 4:32 PM, jianchao.wang wrote:
>>>> Hi keith
>>>>
>>>> Would you please take a look at this patch.
>>>>
>>>> This issue could be reproduced easily with a driver bind/unbind loop,
>>>> a reset loop and a IO loop at the same time.
>>>>
>>>> Thanks
>>>> Jianchao
>>>>
>>>> On 04/19/2018 04:29 PM, Jianchao Wang wrote:
>>>>> There is race between nvme_remove and nvme_reset_work that can
>>>>> lead to io hang.
>>>>>
>>>>> nvme_remove                    nvme_reset_work
>>>>> -> change state to DELETING
>>>>>                                -> fail to change state to LIVE
>>>>>                                -> nvme_remove_dead_ctrl
>>>>>                                  -> nvme_dev_disable
>>>>>                                    -> quiesce request_queue
>>>>>                                  -> queue remove_work
>>>>> -> cancel_work_sync reset_work
>>>>> -> nvme_remove_namespaces
>>>>>   -> splice ctrl->namespaces
>>>>>                                nvme_remove_dead_ctrl_work
>>>>>                                -> nvme_kill_queues
>>>>>   -> nvme_ns_remove             do nothing
>>>>>     -> blk_cleanup_queue
>>>>>       -> blk_freeze_queue
>>>>> Finally, the request_queue is quiesced state when wait freeze,
>>>>> we will get io hang here.
>>>>>
>>>>> To fix it, unquiesce the request_queue directly before nvme_ns_remove.
>>>>> We have spliced the ctrl->namespaces, so nobody could access them
>>>>> and quiesce the queue any more.
>>>>>
>>>>> Signed-off-by: Jianchao Wang
>>>>> ---
>>>>>  drivers/nvme/host/core.c | 9 ++++++++-
>>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>> index 9df4f71..0e95082 100644
>>>>> --- a/drivers/nvme/host/core.c
>>>>> +++ b/drivers/nvme/host/core.c
>>>>> @@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>>>>      list_splice_init(&ctrl->namespaces, &ns_list);
>>>>>      up_write(&ctrl->namespaces_rwsem);
>>>>>
>>>>> -    list_for_each_entry_safe(ns, next, &ns_list, list)
>>>>> +    /*
>>>>> +     * After splice the namespaces list from the ctrl->namespaces,
>>>>> +     * nobody could get them anymore, let's unquiesce the request_queue
>>>>> +     * forcibly to avoid io hang.
>>>>> +     */
>>>>> +    list_for_each_entry_safe(ns, next, &ns_list, list) {
>>>>> +        blk_mq_unquiesce_queue(ns->queue);
>>>>>          nvme_ns_remove(ns);
>>>>> +    }
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
>>>>>
>>>>
>>>> _______________________________________________
>>>> Linux-nvme mailing list
>>>> Linux-nvme@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>