From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Mon, 14 May 2018 17:38:56 +0800
From: Ming Lei
To: "jianchao.wang"
Cc: Keith Busch, Jens Axboe, linux-block@vger.kernel.org, James Smart,
	Christoph Hellwig, Sagi Grimberg, linux-nvme@lists.infradead.org,
	Laurence Oberman
Subject: Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Message-ID: <20180514093850.GA807@ming.t460p>
References: <20180511122933.27155-1-ming.lei@redhat.com>
 <776f21e1-dc19-1b77-9ba4-44f0b8366625@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <776f21e1-dc19-1b77-9ba4-44f0b8366625@oracle.com>
List-ID:

On Mon, May 14, 2018 at 04:21:04PM +0800, jianchao.wang wrote:
> Hi ming
>
> Please refer to my test log and analysis.
>
> [ 229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller
> [ 229.872649] nvme nvme0: EH 0: before shutdown
> [ 229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller
> [ 229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller
> [ 229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller
> [ 229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller
> [ 229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller
> [ 229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller
> [ 229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller
> [ 229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting
> [ 229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting
> [ 229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller
> [ 229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting
> [ 229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting
> [ 229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting
> [ 229.874041] nvme nvme0: Abort status: 0x0
> [ 229.874064] nvme nvme0: Abort status: 0x0
> [ 229.874078] nvme nvme0: Abort status: 0x0
> [ 229.874092] nvme nvme0: Abort status: 0x0
> [ 229.874106] nvme nvme0: Abort status: 0x0
> [ 230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1
> [ 290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller
> [ 290.840746] nvme nvme0: EH 1: before shutdown
>
> [ 290.992650] ------------[ cut here ]------------
> [ 290.992751] Trying to free already-free IRQ 133
> [ 290.992783] WARNING: CPU: 6 PID: 227 at /home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 __free_irq+0xf6/0x420
> [ 290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 4.17.0-rc3+ #37
> [ 290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
> [ 290.993418] Workqueue: nvme-reset-wq nvme_eh_work
> [ 290.993436] RIP: 0010:__free_irq+0xf6/0x420
> ...
> [ 290.993541] Call Trace:
> [ 290.993581]  free_irq+0x4c/0xa0
> [ 290.993600]  pci_free_irq+0x18/0x30
> [ 290.993615]  nvme_dev_disable+0x20b/0x720
> [ 290.993745]  nvme_eh_work+0xdd/0x4b0
> [ 290.993866]  process_one_work+0x3ca/0xaa0
> [ 290.993960]  worker_thread+0x89/0x6c0
> [ 290.994017]  kthread+0x18d/0x1e0
> [ 290.994061]  ret_from_fork+0x24/0x30
> [ 338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds.
> [ 338.913348] Tainted: G W 4.17.0-rc3+ #37
> [ 338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 338.913809] kworker/u16:4 D 0 227 2 0x80000000
> [ 338.913842] Workqueue: nvme-reset-wq nvme_eh_work
> [ 338.913857] Call Trace:
> [ 338.914002]  schedule+0x52/0xe0
> [ 338.914019]  schedule_preempt_disabled+0x14/0x20
> [ 338.914029]  __mutex_lock+0x5b4/0xca0
> [ 338.914228]  nvme_eh_work+0x11a/0x4b0
> [ 338.914344]  process_one_work+0x3ca/0xaa0
> [ 338.914440]  worker_thread+0x89/0x6c0
> [ 338.914496]  kthread+0x18d/0x1e0
> [ 338.914542]  ret_from_fork+0x24/0x30
> [ 338.914648]

The above isn't a new issue either: even without this patchset, nvme_dev_disable()
can run before the controller is recovered (that is, before nvme_setup_io_queues()
has completed). This can happen in several cases, such as the one you listed below,
or when two or more timed-out requests are triggered from different queues. The
underlying issue is that genirq won't work well if the same IRQ is freed twice.

>
> Here is the deadlock scenario.
>
> nvme_eh_work                                 // EH0
>   -> nvme_reset_dev                          // hold reset_lock
>     -> nvme_setup_io_queues
>       -> nvme_create_io_queues
>         -> nvme_create_queue
>           -> set nvmeq->cq_vector
>           -> adapter_alloc_cq
>           -> adapter_alloc_sq
>              irq has not been requested
>              io timeout
>                                 nvme_eh_work // EH1
>                                   -> nvme_dev_disable
>                                     -> quiesce the adminq //----> here !
>                                     -> nvme_suspend_queue
>                                        print out warning Trying to free already-free IRQ 133
>                                     -> nvme_cancel_request // complete the timeout admin request
>                                   -> require reset_lock
>           -> adapter_delete_cq

If the admin I/O submitted in adapter_alloc_sq() times out, nvme_dev_disable() in
EH1 will complete it; since it is marked REQ_FAILFAST_DRIVER it won't be retried,
so adapter_alloc_sq() should return an error and the whole reset in EH0 should
have been terminated immediately.

My guess is that nvme_create_io_queues() ignores the failure.

Could you dump the stack trace of the EH0 reset task, so we can see where the EH0
reset kthread hangs?

>             -> adapter_delete_queue // submit to the adminq which has been quiesced.
>               -> nvme_submit_sync_cmd
>                 -> blk_execute_rq
>                   -> wait_for_completion_io_timeout
>                      hang_check is true, so there is no hung task warning for this context
>
> EH0 submits the CQ delete admin command, but it will never be completed or time
> out, because the admin request queue has been quiesced; so the reset_lock cannot
> be released, and EH1 cannot get the reset_lock and move things forward.

The nvme_dev_disable() in the outer EH (EH1 in the above log) will complete all
admin commands. They won't be retried because they are marked REQ_FAILFAST_DRIVER,
so nvme_cancel_request() completes them inside nvme_dev_disable().

Thanks,
Ming
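
P.S. To make the last point concrete, below is a rough, untested sketch of the kind
of check I mean for nvme_create_io_queues(). If I read the current code correctly,
a failing Create SQ/CQ command is deliberately ignored so the controller can still
come up with fewer I/O queues; the idea here is to stop ignoring it when the command
was only "failed" because another EH is shutting the controller down.
nvme_ctrl_shutting_down() is just a placeholder for whatever state check fits the EH
design, and the signatures come from my reading of recent nvme/pci.c, so treat them
as approximate:

static int nvme_create_io_queues(struct nvme_dev *dev)
{
	unsigned i, max;
	int ret = 0;

	for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
		if (nvme_alloc_queue(dev, i, dev->q_depth)) {
			ret = -ENOMEM;
			break;
		}
	}

	max = min(dev->max_qid, dev->ctrl.queue_count - 1);
	for (i = dev->online_queues; i <= max; i++) {
		ret = nvme_create_queue(&dev->queues[i], i);
		if (ret)
			break;
	}

	/*
	 * A failing Create SQ/CQ command is normally ignored so that we
	 * can continue with fewer I/O queues.  But if the command only
	 * failed because nvme_dev_disable() cancelled it, carrying on is
	 * pointless: fail the reset so EH0 drops reset_lock instead of
	 * submitting more commands to the quiesced admin queue.
	 */
	if (ret && nvme_ctrl_shutting_down(dev))	/* placeholder helper */
		return ret > 0 ? -EIO : ret;

	return ret >= 0 ? 0 : ret;
}

This only covers the "ignored failure" part, of course; where EH0 actually hangs is
what the stack trace should tell us.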