From: TomK <tk@mdevsys.com>
To: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: linux-scsi@vger.kernel.org,
	Himanshu Madhani <himanshu.madhani@qlogic.com>,
	Quinn Tran <quinn.tran@qlogic.com>,
	Giridhar Malavali <giridhar.malavali@qlogic.com>,
	"Gurumurthy, Anil" <Anil.Gurumurthy@cavium.com>
Subject: Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
Date: Sun, 30 Oct 2016 14:50:15 -0400
Message-ID: <ea03080a-a16a-3dcb-cb47-3b72a9ddaff4@mdevsys.com>
In-Reply-To: <1477777488.22703.37.camel@haakon3.risingtidesystems.com>

On 10/29/2016 5:44 PM, Nicholas A. Bellinger wrote:
> On Sat, 2016-10-29 at 14:10 -0400, TomK wrote:
>> On 10/29/2016 3:50 AM, Nicholas A. Bellinger wrote:
>>> Hi TomK & Co,
>>>
>>> On Fri, 2016-10-28 at 02:01 -0400, TomK wrote:
>>>> On 10/26/2016 8:08 AM, TomK wrote:
>>>>> On 10/26/2016 3:20 AM, Nicholas A. Bellinger wrote:
>>>
>
> <SNIP>
>
>>>>
>>>> 1) As soon as I re-add the bad disk without the patch, I lose the LUN
>>>> off the ESXi hosts.  Same thing happens with the patch.  No change.  The
>>>> disk is pulling things down.  Worse, the kernel panics and locks me out
>>>> of the system (http://microdevsys.com/linux-lio/messages-oct-27-2016.txt) :
>>>>
>>>
>>> So after grokking these logs, the point where the failing ata6 scsi_device
>>> is holding outstanding I/O beyond ESX FC host-side timeouts manifests
>>> itself as ABORT_TASK tag=1122276:
>>>
>>> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077e
>>> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077f
>>> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 0780
>>> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0
>>> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
>>> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80f:20: qla_target(0): task abort (tag=1122276)
>>> Oct 28 00:42:57 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1122276
>>>
>>> and eventually:
>>>
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f881:20: unable to find cmd in driver or LIO for tag 0x111fe4
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f854:20: qla_target(0): __qlt_24xx_handle_abts() failed: -2
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a40, status=4
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a80, status=0
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0
>>>
>>> The outstanding se_cmd with tag=1122276 for ata6 completes back to
>>> target-core, allowing ABORT_TASK + TMR_FUNCTION_COMPLETE status to
>>> progress:
>>>
>>> Oct 28 00:44:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Oct 28 00:44:29 mbpc-pc kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
>>> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: configured for UDMA/133
>>> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: retrying FLUSH 0xea Emask 0x4
>>> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: device reported invalid CHS sector 0
>>> Oct 28 00:44:29 mbpc-pc kernel: ata6: EH complete
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20: qla_target(0): terminating exchange for aborted cmd=ffff880099392fa8 (se_cmd=ffff880099392fa8, tag=1122276)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20: qla_target(0): Terminating cmd ffff880099392fa8 with incorrect state 2
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 6)
>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1122276
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 8)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f0f0) status 0x0 state 0x0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f300, status=0
>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122996
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f340) status 0x2 state 0x0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f550, status=2
>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122204
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f590) status 0x2 state 0x0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f7a0, status=2
>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122240
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f7e0) status 0x2 state 0x0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f124280, status=0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f1242c0, status=0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f9f0, status=2
>>>
>>> and continues with TMR_TASK_DOES_NOT_EXIST for other outstanding tags..
>>>
>>> Until target-core session release happens for the two tcm_qla2xxx host ports:
>>>
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff880099a80a80 / sess ffff880111665cc0 from port 21:03:00:1b:32:74:b6:cb loop_id 0x03 s_id 01:05:00 logout 1 keep 1 els_logo 0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff8800993849f8] ox_id 0773
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff88009e8f9040 / sess ffff8800abe95480 from port 50:01:43:80:16:77:99:38 loop_id 0x02 s_id 01:00:00 logout 1 keep 1 els_logo 0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0.
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=2 loop-id=3 portid=010500.
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=3 loop-id=2 portid=010000.
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping.
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>>
>>> From there, ELS packets with "Unexpected NOTIFY_ACK received" start to occur:
>>>
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0.
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping.
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
>>>
>>> ELS packets for the same two host ports continue:
>>>
>>> Oct 28 00:46:40 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
>>> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05b1
>>> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a9977d8]
>>> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a9977d8] ox_id 05b1
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:05:01 ELS opcode: 0x03
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff880111665cc0 [0] wwn 21:03:00:1b:32:74:b6:cb with PLOGI ACK to wwn 21:03:00:1b:32:74:b6:cb s_id 01:05:00, ref=1
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
>>>
>>> And eventually, the 360 second hung task timeout warnings appear:
>>>
>>> Oct 28 00:49:48 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05ba
>>> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
>>> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05ba
>>> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05bb
>>> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
>>> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05bb
>>> Jan  3 15:34:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon successfully started
>>> Oct 28 00:50:33 mbpc-pc kernel: INFO: task kworker/0:2:31731 blocked for more than 360 seconds.
>>> Oct 28 00:50:33 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>> Oct 28 00:50:33 mbpc-pc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Oct 28 00:50:33 mbpc-pc kernel: kworker/0:2     D ffff88011affb968     0 31731      2 0x00000080
>>> Oct 28 00:50:33 mbpc-pc kernel: Workqueue: events qlt_free_session_done [qla2xxx]
>>> Oct 28 00:50:33 mbpc-pc kernel: ffff88011affb968 ffff88011affb8d8 ffff880013514940 0000000000000006
>>> Oct 28 00:50:33 mbpc-pc kernel: ffff8801140fe880 ffffffff81f998c2 0000000000000000 ffff880100000000
>>> Oct 28 00:50:33 mbpc-pc kernel: ffffffff810bdaaa ffffffff00000000 ffffffff00000051 ffff880100000000
>>> Oct 28 00:50:33 mbpc-pc kernel: Call Trace:
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810bdaaa>] ? vprintk_emit+0x27a/0x4d0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162e7ec>] schedule_timeout+0x9c/0xe0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162cfa0>] wait_for_completion+0xc0/0xf0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810923e0>] ? try_to_wake_up+0x260/0x260
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81161d73>] ? mempool_free+0x33/0x90
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa08f76ad>] target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa00e7188>] ? qla2x00_post_work+0x58/0x70 [qla2xxx]
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa0286f69>] tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa01447e9>] qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81080639>] process_one_work+0x189/0x4e0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8107d915>] ? wq_worker_waking_up+0x15/0x70
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109eb59>] ? idle_balance+0x79/0x290
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8108150d>] worker_thread+0x16d/0x520
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162bb3d>] ? __schedule+0x2fd/0x6a0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109130e>] ? schedule_tail+0x1e/0xc0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085f20>] ? kthread_freezable_should_stop+0x70/0x70
>>>
>>>>
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>> Intermediate CTIO received (status 6)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20:
>>>> qla_target(0): terminating exchange for aborted cmd=ffff88009af9f488
>>>> (se_cmd=ffff88009af9f488, tag=1131312)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20:
>>>> qla_target(0): Terminating cmd ffff88009af9f488 with incorrect state 2
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>> Intermediate CTIO received (status 6)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>> Intermediate CTIO received (status 8)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>> Intermediate CTIO received (status 6)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
>>>> qlt_free_cmd: se_cmd[ffff88009af9f488] ox_id 0673
>>>> Oct 28 01:19:59 mbpc-pc kernel: ------------[ cut here ]------------
>>>> Oct 28 01:19:59 mbpc-pc kernel: kernel BUG at
>>>> drivers/scsi/qla2xxx/qla_target.c:3319!
>>>> Oct 28 01:19:59 mbpc-pc kernel: invalid opcode: 0000 [#1] SMP
>>>> Oct 28 01:19:59 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>>>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>>>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
>>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>>>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
>>>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>>>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>>>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>>>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>>>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
>>>> libcrc32c joydev sg serio_raw e1000 r8169 mii kvm_amd kvm
>>>> snd_hda_codec_realtek snd_hda_codec_generic irqbypass pcspkr k10temp
>>>> snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq
>>>> snd_seq_device snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core wmi
>>>> shpchp acpi_cpufreq ext4 mbcache jbd2 qla2xxx scsi_transport_fc floppy
>>>> firewire_ohci f
>>>> Oct 28 01:19:59 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
>>>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>>>> dm_region_hash dm_log dm_mod
>>>> Oct 28 01:19:59 mbpc-pc kernel: CPU: 0 PID: 296 Comm: kworker/u16:6 Not
>>>> tainted 4.8.4 #2
>>>> Oct 28 01:19:59 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
>>>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>>>> Oct 28 01:19:59 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>>> [target_core_mod]
>>>> Oct 28 01:19:59 mbpc-pc kernel: task: ffff8801109623c0 task.stack:
>>>> ffff880110968000
>>>> Oct 28 01:19:59 mbpc-pc kernel: RIP: 0010:[<ffffffffa0143810>]
>>>> [<ffffffffa0143810>] qlt_free_cmd+0x160/0x180 [qla2xxx]
>>>> Oct 28 01:19:59 mbpc-pc kernel: RSP: 0018:ffff88011096bb18  EFLAGS: 00010202
>>>> Oct 28 01:19:59 mbpc-pc kernel: RAX: 0000000000000051 RBX:
>>>> ffff88009af9f488 RCX: 0000000000000006
>>>> Oct 28 01:19:59 mbpc-pc kernel: RDX: 0000000000000007 RSI:
>>>> 0000000000000007 RDI: ffff88011fc0cb40
>>>> Oct 28 01:19:59 mbpc-pc kernel: RBP: ffff88011096bb48 R08:
>>>> 0000000000000000 R09: ffffffff81fa4765
>>>> Oct 28 01:19:59 mbpc-pc kernel: R10: 0000000000000074 R11:
>>>> 0000000000000002 R12: ffff8801137770c0
>>>> Oct 28 01:19:59 mbpc-pc kernel: R13: ffff8800a126eaf0 R14:
>>>> ffff88009af9f510 R15: 0000000000000296
>>>> Oct 28 01:19:59 mbpc-pc kernel: FS:  0000000000000000(0000)
>>>> GS:ffff88011fc00000(0000) knlGS:0000000000000000
>>>> Oct 28 01:19:59 mbpc-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>>>> 0000000080050033
>>>> Oct 28 01:19:59 mbpc-pc kernel: CR2: 00007f8eef58d000 CR3:
>>>> 00000000cabad000 CR4: 00000000000006f0
>>>> Oct 28 01:19:59 mbpc-pc kernel: Stack:
>>>> Oct 28 01:19:59 mbpc-pc kernel: ffff880000000673 ffff88009af9f488
>>>> ffff8800a126eaf0 ffff88009af9f59c
>>>> Oct 28 01:19:59 mbpc-pc kernel: ffff88009af9f488 ffff8800a126eaf0
>>>> ffff88011096bb58 ffffffffa027f7f4
>>>> Oct 28 01:19:59 mbpc-pc kernel: ffff88011096bbb8 ffffffffa08f758c
>>>> ffff88009af9f510 ffff88009af9f488
>>>> Oct 28 01:19:59 mbpc-pc kernel: Call Trace:
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa027f7f4>]
>>>> tcm_qla2xxx_release_cmd+0x14/0x30 [tcm_qla2xxx]
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f758c>]
>>>> target_release_cmd_kref+0xac/0x110 [target_core_mod]
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f7627>]
>>>> target_put_sess_cmd+0x37/0x70 [target_core_mod]
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f46f7>]
>>>> core_tmr_abort_task+0x107/0x160 [target_core_mod]
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>>>> target_tmr_work+0x154/0x160 [target_core_mod]
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81080639>]
>>>> process_one_work+0x189/0x4e0
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810d060c>] ?
>>>> del_timer_sync+0x4c/0x60
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108131e>] ?
>>>> maybe_create_worker+0x8e/0x110
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108150d>]
>>>> worker_thread+0x16d/0x520
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>> default_wake_function+0x12/0x20
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>> __wake_up_common+0x56/0x90
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>> schedule_tail+0x1e/0xc0
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>> kthread_freezable_should_stop+0x70/0x70
>>>> Oct 28 01:19:59 mbpc-pc kernel: Code: 0d 00 00 48 c7 c7 00 2a 16 a0 e8
>>>> ac 32 f2 e0 48 83 c4 18 5b 41 5c 41 5d c9 c3 48 8b bb 90 02 00 00 e8 85
>>>> da 07 e1 e9 30 ff ff ff <0f> 0b eb fe 0f 0b 66 2e 0f 1f 84 00 00 00 00
>>>> 00 eb f4 66 66 66
>>>> Oct 28 01:19:59 mbpc-pc kernel: RIP  [<ffffffffa0143810>]
>>>> qlt_free_cmd+0x160/0x180 [qla2xxx]
>>>> Oct 28 01:19:59 mbpc-pc kernel: RSP <ffff88011096bb18>
>>>> Oct 28 01:19:59 mbpc-pc kernel: ---[ end trace 2551bf47a19dbe2e ]---
>>>>
>>>
>>> Mmm.
>>>
>>> This BUG_ON signals that a qla_tgt_cmd descriptor is being freed while
>>> qlt_handle_cmd_for_atio() has queued it for backend execution, but
>>> qla_tgt_cmd->work -> __qlt_do_work() has not yet executed.
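
As far as I can read the 4.8 qla_target.c here, the check that fires is the
cmd_in_wq sanity test at the top of qlt_free_cmd(); roughly (paraphrased from
my reading of the source, not a verbatim copy):

    void qlt_free_cmd(struct qla_tgt_cmd *cmd)
    {
            /*
             * cmd_in_wq is set when qlt_handle_cmd_for_atio() queues the
             * command onto the qla_tgt_wq workqueue and is only cleared
             * once __qlt_do_work() has picked it up, so freeing a command
             * that is still sitting in the workqueue trips this check
             * (qla_target.c:3319 in the trace above).
             */
            BUG_ON(cmd->cmd_in_wq);

            /* ... unmap SG, drop the pending-command count, free the tag ... */
    }
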
>>>
>>>>
>>>>
>>>> 2) This works with a new disk that's just been inserted.  No issues.
>>>>
>>>>
>>>
>>> Thanks for verifying this scenario.
>>>
>>>>
>>>>
>>>> The kernel had the patch in both scenarios.  So it appears we can't
>>>> function on a degraded array that loses 1 (RAID 5 / 6) or 2 (RAID 6)
>>>> disks at the moment, even though the array itself is fine.  Perhaps
>>>> it's the nature of the failed disk.
>>>>
>>>
>>> AFAICT, the hung task involves ABORT_TASK across tcm_qla2xxx session
>>> reinstatement, when backend I/O latency is high enough to cause
>>> ABORT_TASK operations across FC fabric host side session reset.
>>>
>>> From the logs alone, it's unclear if the failing backend ata6 is leaking
>>> I/O (indefinitely) when the hung task warnings happen, but the preceding
>>> ata6 failures + device resets seem to indicate I/O completions are just
>>> taking a really long time.
>>>
>>> Also, it's unclear if the BUG_ON(cmd->cmd_in_wq) in qlt_free_cmd() is a
>>> side effect of the earlier hung task, or a separate tcm_qla2xxx session
>>> reinstatement bug.
>>>
>>>
>>
>>
>> Thanks Nicholas.
>>
>> Is it possible the RAID 6 array, after attempting to write sectors to a
>> bad disk, received an error and returned a status message that the QLA
>> could not interpret?  Thereby sending the QLA into a routine expecting a
>> recognized message and simply timing out?  All the while the array
>> simply 'moved on' along its merry way.  (For example, a message akin
>> to "Unexpected xyz received, ignoring" that should be interpreted and
>> acted on.)  Or perhaps the software RAID 6 isn't returning anything
>> meaningful to the QLA driver.  (If it isn't, though, it can be argued
>> that the QLA / Target Drivers shouldn't have to care whether the array
>> is working.)
>
> So target-core waits for all outstanding backend I/O to complete during
> session reinstatement.
>
> The two hung task warnings observed here mean target-core I/O
> descriptors from each of the two tcm_qla2xxx ports are waiting to be
> completed during session reinstatement, but this never happens.
>
> Which means one of three things:
>
> 1) The backend below target-core is leaking I/O completions, which
>    is a bug outside of target-core and needs to be identified.
> 2) There is a target-core bug leaking se_cmd->cmd_kref, preventing
>    the final reference release from occurring.
> 3) There is a tcm_qla2xxx bug leaking se_cmd->cmd_kref, preventing
>    the final reference release from occurring.
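
Just to make sure I follow 2) and 3): every in-flight command pins
se_cmd->cmd_kref, and session reinstatement can only finish once the last
reference on each command is dropped.  A minimal sketch of that life cycle as
I understand the v4.8 target-core code -- the function names are the real API,
but the flow is simplified and error handling is omitted:

    /* Submission: the fabric driver takes a cmd_kref reference and the
     * command is tracked on se_sess->sess_cmd_list.
     */
    target_get_sess_cmd(se_cmd, true);

    /* Completion: the matching put.  The final put is what lets
     * target_release_cmd_kref() run and, during session shutdown,
     * complete cmd_wait_comp.
     */
    target_put_sess_cmd(se_cmd);

    /* Session reinstatement: tcm_qla2xxx_free_session() ends up in
     * target_wait_for_sess_cmds(), which blocks on cmd_wait_comp for
     * every command still on sess_cmd_list -- the wait_for_completion()
     * in the hung-task backtrace above.  A leaked reference anywhere
     * keeps cmd_kref from reaching zero and that wait never returns.
     */
    target_wait_for_sess_cmds(se_sess);

If that reading is right, the extra printk output from the patch further down
should at least show which descriptors are stuck and in what state.
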
>
>>
>> The 100% util on /dev/sdf lasted for less than 30 seconds.
>
> Based on the above, the time between when tag=1122276 got ABORT_TASK and
> when ata6 finally gave back the descriptor to allow TMR_FUNCTION_COMPLETE
> was at least 90 seconds.
>
> This does not include the extra time from initial I/O submission to when
> ESX SCSI timeouts fire to generate ABORT_TASK, which IIRC depends upon
> the FC host LLD.
>
> Btw, I don't recall libata device timeouts being 90+ seconds, which
> looks a little strange...
>
> This timeout is in /sys/class/scsi_device/$HCTL/device/eh_timeout.
>
> So assuming MD is not holding onto I/O much beyond backend device
> eh_timeout during failure, a simple work-around is to keep the combined
> backend eh_timeout and MD consumer I/O timeout lower than ESX FC SCSI
> timeouts.
>
> From the logs above, though, this work-around would likely mask this
> specific bug if it's related to #2 or #3.
>
>> Well below
>> the 120s default timeout (I've upped this to 360 for the purpose of
>> testing this scenario)
>>
>> I do see these:
>>
>> http://microdevsys.com/linux-lio/messages-mailing-list
>> Oct 23 19:50:26 mbpc-pc kernel: qla2xxx [0000:04:00.0]-5811:20:
>> Asynchronous PORT UPDATE ignored 0000/0004/0600.
>>
>> but I don't think it's that one as it doesn't correlate well at all.
>>
>
> These are unrelated to the hung task warnings.
>
>> If there's another test that I can do to get more info, please feel free
>> to suggest.
>>
>
> Ideally, being able to generate a vmcore crashdump is the most helpful.
> This involves manually triggering a crash after you observe the hung
> tasks.  It's reasonably easy to set up once CONFIG_KEXEC=y +
> CONFIG_CRASH_DUMP=y are enabled:
>
> https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
>
> I don't know if you'll run into problems on v4.8.y as per Anil's earlier
> email, but getting a proper vmcore for analysis is usually the fastest
> path to root cause.
>
> Short of that, it would be helpful to identify the state of the se_cmd
> descriptors getting leaked.  Here's a quick patch to add some more
> verbosity:
>
> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index 7dfefd6..9b93a2c 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -2657,9 +2657,13 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
>
>         list_for_each_entry_safe(se_cmd, tmp_cmd,
>                                 &se_sess->sess_wait_list, se_cmd_list) {
> -               pr_debug("Waiting for se_cmd: %p t_state: %d, fabric state:"
> -                       " %d\n", se_cmd, se_cmd->t_state,
> -                       se_cmd->se_tfo->get_cmd_state(se_cmd));
> +               printk("Waiting for se_cmd: %p t_state: %d, fabric state:"
> +                       " %d se_cmd_flags: 0x%08x transport_state: 0x%08x"
> +                       " CDB: 0x%02x\n",
> +                       se_cmd, se_cmd->t_state,
> +                       se_cmd->se_tfo->get_cmd_state(se_cmd),
> +                       se_cmd->se_cmd_flags, se_cmd->transport_state,
> +                       se_cmd->t_task_cdb[0]);
>
>                 spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>                 tas = (se_cmd->transport_state & CMD_T_TAS);
> @@ -2671,9 +2675,13 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
>                 }
>
>                 wait_for_completion(&se_cmd->cmd_wait_comp);
> -               pr_debug("After cmd_wait_comp: se_cmd: %p t_state: %d"
> -                       " fabric state: %d\n", se_cmd, se_cmd->t_state,
> -                       se_cmd->se_tfo->get_cmd_state(se_cmd));
> +               printk("After cmd_wait_comp: se_cmd: %p t_state: %d"
> +                       " fabric state: %d se_cmd_flags: 0x%08x transport_state:"
> +                       " 0x%08x CDB: 0x%02x\n",
> +                       se_cmd, se_cmd->t_state,
> +                       se_cmd->se_tfo->get_cmd_state(se_cmd),
> +                       se_cmd->se_cmd_flags, se_cmd->transport_state,
> +                       se_cmd->t_task_cdb[0]);
>
>                 se_cmd->se_tfo->release_cmd(se_cmd);
>         }
>
>

I've had kdump configured, but the Seagate 2TB has failed to the point
where it isn't even detected and a device node like /dev/sdf is no longer
created for it.  Only some errors are printed indicating that something was
found on the SATA connector, but commands against it error out.  (See
below.)  (It might be the PCB going.  Once out of the system, or with the
voltage cut, it got to this point very quickly.)

I did add the debug messages above to catch more in the future.  I'm
also checking the mdadm changelogs for possibly related updates, since
I'm running 3.3.2-5.el6 and the latest is 3.3.4 or 3.4.0.

I'm leaning toward case 1) above, as the RAID 6 did not pick up on the
failed disk even though smartctl -A could not access any SMART
information.  I'll also post a new thread on linux-raid.

Again, thanks very much for the help.  Appreciated.

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.

