* Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
@ 2016-10-24  2:03 TomK
  2016-10-24  4:32 ` TomK
  0 siblings, 1 reply; 14+ messages in thread
From: TomK @ 2016-10-24  2:03 UTC (permalink / raw)
  To: linux-scsi

Hey,

Has anyone seen this, and is there a workaround?  It seems to be more 
kernel related, affecting various apps and not just target, but I'm 
wondering if there is an interim solution 
(https://access.redhat.com/solutions/408833).
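For reference, the interim mitigation that article describes is tuning the hung-task watchdog itself. A minimal sketch of such an /etc/sysctl.conf fragment (example values only; this silences the warning rather than fixing the hang) might look like:

```shell
# Hypothetical /etc/sysctl.conf fragment -- interim mitigation only.
# It suppresses the "blocked for more than 120 seconds" warnings; it
# does NOT fix the underlying qla2xxx/target_core_mod hang.
# 0 disables the hung-task check entirely; a larger value (e.g. 600)
# just raises the reporting threshold from the default 120 seconds.
kernel.hung_task_timeout_secs = 0
```

Load it with `sysctl -p` (as root).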

I'm getting this message from the qla2xxx driver after a few minutes of 
use.  This is after some activity on an ESXi server (15 VMs) that I'm 
connecting to this HBA.  I've tried the following tuning parameters, but 
there was no change in behaviour:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
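As a sketch, an /etc/sysctl.conf fragment carrying those two settings (the values quoted above, not a recommendation) would be:

```shell
# /etc/sysctl.conf fragment with the writeback tuning tried above.
# Lower ratios make the kernel start background writeback earlier
# (at 5% dirty memory) and throttle writers sooner (at 10%).
# Load with: sysctl -p
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
```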

Details:


Oct 23 21:28:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Oct 23 21:28:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx 
task_tag: 1128612
Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Sending 
TMR_FUNCTION_COMPLETE for ref_tag: 1128612
Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx 
task_tag: 1129116
Jan  6 23:52:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon 
successfully started
Oct 23 21:30:18 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Jan  6 23:54:01 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon 
successfully started
Oct 23 21:32:16 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for 
more than 120 seconds.
Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
Oct 23 21:32:24 mbpc-pc kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 23 21:32:24 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0 
   289      2 0x00000000
Oct 23 21:32:24 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work 
[target_core_mod]
Oct 23 21:32:24 mbpc-pc kernel: ffff88011113ba18 0000000000000400 
ffff880049e926c0 ffff88011113b998
Oct 23 21:32:24 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0 
ffffffff81f998ef ffff880100000000
Oct 23 21:32:24 mbpc-pc kernel: ffffffff812f27d9 0000000000000000 
ffffe8ffffc9a000 ffff880000000000
Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080169>] ? 
start_flush_work+0x49/0x180
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>] 
schedule_timeout+0x9c/0xe0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bd15c>] ? 
console_unlock+0x35c/0x380
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>] 
wait_for_completion+0xc0/0xf0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ? 
try_to_wake_up+0x260/0x260
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6f84>] 
__transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bdd1f>] ? 
vprintk_default+0x1f/0x30
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f70c4>] 
transport_wait_for_tasks+0x44/0x60 [target_core_mod]
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f46e2>] 
core_tmr_abort_task+0xf2/0x160 [target_core_mod]
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6aa4>] 
target_tmr_work+0x154/0x160 [target_core_mod]
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>] 
process_one_work+0x189/0x4e0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>] 
worker_thread+0x16d/0x520
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ? 
default_wake_function+0x12/0x20
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
__wake_up_common+0x56/0x90
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ? 
schedule_tail+0x1e/0xc0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ? 
kthread_freezable_should_stop+0x70/0x70
Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/1:48:6089 blocked for 
more than 120 seconds.
Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
Oct 23 21:32:24 mbpc-pc kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 23 21:32:24 mbpc-pc kernel: kworker/1:48    D ffff88004017f968     0 
  6089      2 0x00000080
Oct 23 21:32:24 mbpc-pc kernel: Workqueue: events qlt_free_session_done 
[qla2xxx]
Oct 23 21:32:24 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8 
ffff88011a83a300 0000000000000004
Oct 23 21:32:24 mbpc-pc kernel: ffff88004017a600 ffff88004017f938 
ffffffff810a0bb6 ffff880100000000
Oct 23 21:32:24 mbpc-pc kernel: ffff880110fd0840 ffff880000000000 
ffffffff81090728 ffff880100000000
Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a0bb6>] ? 
enqueue_task_fair+0x66/0x410
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090728>] ? 
check_preempt_curr+0x78/0x90
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109075d>] ? 
ttwu_do_wakeup+0x1d/0xf0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090de0>] ? 
ttwu_queue+0x180/0x190
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>] 
schedule_timeout+0x9c/0xe0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>] 
wait_for_completion+0xc0/0xf0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ? 
try_to_wake_up+0x260/0x260
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f76ad>] 
target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa00e7188>] ? 
qla2x00_post_work+0x58/0x70 [qla2xxx]
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa0286f69>] 
tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa01447e9>] 
qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff815092fc>] ? 
dbs_work_handler+0x5c/0x90
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8107f960>] ? 
pwq_dec_nr_in_flight+0x50/0xa0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>] 
process_one_work+0x189/0x4e0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810d060c>] ? 
del_timer_sync+0x4c/0x60
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108131e>] ? 
maybe_create_worker+0x8e/0x110
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>] 
worker_thread+0x16d/0x520
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ? 
default_wake_function+0x12/0x20
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
__wake_up_common+0x56/0x90
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ? 
schedule_tail+0x1e/0xc0
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ? 
kthread_freezable_should_stop+0x70/0x70
Jan  6 23:56:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon 
successfully started
Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for 
more than 120 seconds.
Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
Oct 23 21:34:27 mbpc-pc kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 23 21:34:27 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0 
   289      2 0x00000000
Oct 23 21:34:27 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work 
[target_core_mod]
Oct 23 21:34:27 mbpc-pc kernel: ffff88011113ba18 0000000000000400 
ffff880049e926c0 ffff88011113b998
Oct 23 21:34:27 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0 
ffffffff81f998ef ffff880100000000
Oct 23 21:34:27 mbpc-pc kernel: ffffffff812f27d9 0000000000000000 
ffffe8ffffc9a000 ffff880000000000
Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080169>] ? 
start_flush_work+0x49/0x180
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>] 
schedule_timeout+0x9c/0xe0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bd15c>] ? 
console_unlock+0x35c/0x380
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>] 
wait_for_completion+0xc0/0xf0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ? 
try_to_wake_up+0x260/0x260
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6f84>] 
__transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bdd1f>] ? 
vprintk_default+0x1f/0x30
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f70c4>] 
transport_wait_for_tasks+0x44/0x60 [target_core_mod]
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f46e2>] 
core_tmr_abort_task+0xf2/0x160 [target_core_mod]
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6aa4>] 
target_tmr_work+0x154/0x160 [target_core_mod]
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>] 
process_one_work+0x189/0x4e0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>] 
worker_thread+0x16d/0x520
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ? 
default_wake_function+0x12/0x20
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
__wake_up_common+0x56/0x90
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ? 
schedule_tail+0x1e/0xc0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ? 
kthread_freezable_should_stop+0x70/0x70
Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/1:48:6089 blocked for 
more than 120 seconds.
Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
Oct 23 21:34:27 mbpc-pc kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 23 21:34:27 mbpc-pc kernel: kworker/1:48    D ffff88004017f968     0 
  6089      2 0x00000080
Oct 23 21:34:27 mbpc-pc kernel: Workqueue: events qlt_free_session_done 
[qla2xxx]
Oct 23 21:34:27 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8 
ffff88011a83a300 0000000000000004
Oct 23 21:34:27 mbpc-pc kernel: ffff88004017a600 ffff88004017f938 
ffffffff810a0bb6 ffff880100000000
Oct 23 21:34:27 mbpc-pc kernel: ffff880110fd0840 ffff880000000000 
ffffffff81090728 ffff880100000000
Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a0bb6>] ? 
enqueue_task_fair+0x66/0x410
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090728>] ? 
check_preempt_curr+0x78/0x90
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109075d>] ? 
ttwu_do_wakeup+0x1d/0xf0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090de0>] ? 
ttwu_queue+0x180/0x190
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>] 
schedule_timeout+0x9c/0xe0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>] 
wait_for_completion+0xc0/0xf0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ? 
try_to_wake_up+0x260/0x260
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f76ad>] 
target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa00e7188>] ? 
qla2x00_post_work+0x58/0x70 [qla2xxx]
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa0286f69>] 
tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa01447e9>] 
qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff815092fc>] ? 
dbs_work_handler+0x5c/0x90
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8107f960>] ? 
pwq_dec_nr_in_flight+0x50/0xa0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>] 
process_one_work+0x189/0x4e0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810d060c>] ? 
del_timer_sync+0x4c/0x60
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108131e>] ? 
maybe_create_worker+0x8e/0x110
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>] 
worker_thread+0x16d/0x520
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ? 
default_wake_function+0x12/0x20
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
__wake_up_common+0x56/0x90
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ? 
schedule_tail+0x1e/0xc0
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ? 
kthread_freezable_should_stop+0x70/0x70
Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
Jan  6 23:58:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon 
successfully started
Oct 23 21:36:30 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for 
more than 120 seconds.
Oct 23 21:36:30 mbpc-pc kernel:      Not tainted 4.8.4 #2
Oct 23 21:36:30 mbpc-pc kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 23 21:36:30 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0 
   289      2 0x00000000
Oct 23 21:36:30 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work 
[target_core_mod]
Oct 23 21:36:30 mbpc-pc kernel: ffff88011113ba18 0000000000000400 
ffff880049e926c0 ffff88011113b998
Oct 23 21:36:30 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0 
ffffffff81f998ef ffff880100000000
Oct 23 21:36:30 mbpc-pc kernel: ffffffff812f27d9 0000000000000000 
ffffe8ffffc9a000 ffff880000000000
Oct 23 21:36:30 mbpc-pc kernel: Call Trace:
Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff81080169>] ? 
start_flush_work+0x49/0x180
Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162e7ec>] 
schedule_timeout+0x9c/0xe0
Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40
Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bd15c>] ? 
console_unlock+0x35c/0x380
Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162cfa0>] 
wait_for_completion+0xc0/0xf0
Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810923e0>] ? 
try_to_wake_up+0x260/0x260
Oct 23 21:36:30 mbpc-pc kernel: [<ffffffffa08f6f84>] 
__transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bdd1f>] ? 
vprintk_default+0x1f/0x30


-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-24  2:03 Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds TomK
@ 2016-10-24  4:32 ` TomK
  2016-10-24  4:45   ` TomK
  0 siblings, 1 reply; 14+ messages in thread
From: TomK @ 2016-10-24  4:32 UTC (permalink / raw)
  To: linux-scsi

On 10/23/2016 10:03 PM, TomK wrote:
> Hey,
>
> Has anyone seen this, and is there a workaround?  It seems to be more
> kernel related, affecting various apps and not just target, but I'm
> wondering if there is an interim solution
> (https://access.redhat.com/solutions/408833).
>
> I'm getting this message from the qla2xxx driver after a few minutes of
> use.  This is after some activity on an ESXi server (15 VMs) that I'm
> connecting to this HBA.  I've tried the following tuning parameters, but
> there was no change in behaviour:
>
> vm.dirty_background_ratio = 5
> vm.dirty_ratio = 10
>
> Details:
>
>
> Oct 23 21:28:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Oct 23 21:28:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
> task_tag: 1128612
> Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Sending
> TMR_FUNCTION_COMPLETE for ref_tag: 1128612
> Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
> task_tag: 1129116
> Jan  6 23:52:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> successfully started
> Oct 23 21:30:18 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Jan  6 23:54:01 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> successfully started
> Oct 23 21:32:16 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
> more than 120 seconds.
> Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
> Oct 23 21:32:24 mbpc-pc kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 23 21:32:24 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
>   289      2 0x00000000
> Oct 23 21:32:24 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
> [target_core_mod]
> Oct 23 21:32:24 mbpc-pc kernel: ffff88011113ba18 0000000000000400
> ffff880049e926c0 ffff88011113b998
> Oct 23 21:32:24 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
> ffffffff81f998ef ffff880100000000
> Oct 23 21:32:24 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
> ffffe8ffffc9a000 ffff880000000000
> Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080169>] ?
> start_flush_work+0x49/0x180
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>]
> schedule_timeout+0x9c/0xe0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bd15c>] ?
> console_unlock+0x35c/0x380
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>]
> wait_for_completion+0xc0/0xf0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ?
> try_to_wake_up+0x260/0x260
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6f84>]
> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
> vprintk_default+0x1f/0x30
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f70c4>]
> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f46e2>]
> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6aa4>]
> target_tmr_work+0x154/0x160 [target_core_mod]
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>]
> process_one_work+0x189/0x4e0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>]
> worker_thread+0x16d/0x520
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ?
> default_wake_function+0x12/0x20
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> __wake_up_common+0x56/0x90
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
> maybe_create_worker+0x110/0x110
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
> maybe_create_worker+0x110/0x110
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ?
> schedule_tail+0x1e/0xc0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>]
> ret_from_fork+0x1f/0x40
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ?
> kthread_freezable_should_stop+0x70/0x70
> Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/1:48:6089 blocked for
> more than 120 seconds.
> Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
> Oct 23 21:32:24 mbpc-pc kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 23 21:32:24 mbpc-pc kernel: kworker/1:48    D ffff88004017f968     0
>  6089      2 0x00000080
> Oct 23 21:32:24 mbpc-pc kernel: Workqueue: events qlt_free_session_done
> [qla2xxx]
> Oct 23 21:32:24 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8
> ffff88011a83a300 0000000000000004
> Oct 23 21:32:24 mbpc-pc kernel: ffff88004017a600 ffff88004017f938
> ffffffff810a0bb6 ffff880100000000
> Oct 23 21:32:24 mbpc-pc kernel: ffff880110fd0840 ffff880000000000
> ffffffff81090728 ffff880100000000
> Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a0bb6>] ?
> enqueue_task_fair+0x66/0x410
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090728>] ?
> check_preempt_curr+0x78/0x90
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109075d>] ?
> ttwu_do_wakeup+0x1d/0xf0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090de0>] ?
> ttwu_queue+0x180/0x190
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>]
> schedule_timeout+0x9c/0xe0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>]
> wait_for_completion+0xc0/0xf0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ?
> try_to_wake_up+0x260/0x260
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f76ad>]
> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa00e7188>] ?
> qla2x00_post_work+0x58/0x70 [qla2xxx]
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa0286f69>]
> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa01447e9>]
> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff815092fc>] ?
> dbs_work_handler+0x5c/0x90
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8107f960>] ?
> pwq_dec_nr_in_flight+0x50/0xa0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>]
> process_one_work+0x189/0x4e0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810d060c>] ?
> del_timer_sync+0x4c/0x60
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108131e>] ?
> maybe_create_worker+0x8e/0x110
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>]
> worker_thread+0x16d/0x520
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ?
> default_wake_function+0x12/0x20
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> __wake_up_common+0x56/0x90
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
> maybe_create_worker+0x110/0x110
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
> maybe_create_worker+0x110/0x110
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ?
> schedule_tail+0x1e/0xc0
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>]
> ret_from_fork+0x1f/0x40
> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ?
> kthread_freezable_should_stop+0x70/0x70
> Jan  6 23:56:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> successfully started
> Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
> Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
> more than 120 seconds.
> Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
> Oct 23 21:34:27 mbpc-pc kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 23 21:34:27 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
>   289      2 0x00000000
> Oct 23 21:34:27 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
> [target_core_mod]
> Oct 23 21:34:27 mbpc-pc kernel: ffff88011113ba18 0000000000000400
> ffff880049e926c0 ffff88011113b998
> Oct 23 21:34:27 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
> ffffffff81f998ef ffff880100000000
> Oct 23 21:34:27 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
> ffffe8ffffc9a000 ffff880000000000
> Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080169>] ?
> start_flush_work+0x49/0x180
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>]
> schedule_timeout+0x9c/0xe0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bd15c>] ?
> console_unlock+0x35c/0x380
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>]
> wait_for_completion+0xc0/0xf0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ?
> try_to_wake_up+0x260/0x260
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6f84>]
> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
> vprintk_default+0x1f/0x30
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f70c4>]
> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f46e2>]
> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6aa4>]
> target_tmr_work+0x154/0x160 [target_core_mod]
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>]
> process_one_work+0x189/0x4e0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>]
> worker_thread+0x16d/0x520
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ?
> default_wake_function+0x12/0x20
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> __wake_up_common+0x56/0x90
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
> maybe_create_worker+0x110/0x110
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
> maybe_create_worker+0x110/0x110
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ?
> schedule_tail+0x1e/0xc0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>]
> ret_from_fork+0x1f/0x40
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ?
> kthread_freezable_should_stop+0x70/0x70
> Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/1:48:6089 blocked for
> more than 120 seconds.
> Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
> Oct 23 21:34:27 mbpc-pc kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 23 21:34:27 mbpc-pc kernel: kworker/1:48    D ffff88004017f968     0
>  6089      2 0x00000080
> Oct 23 21:34:27 mbpc-pc kernel: Workqueue: events qlt_free_session_done
> [qla2xxx]
> Oct 23 21:34:27 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8
> ffff88011a83a300 0000000000000004
> Oct 23 21:34:27 mbpc-pc kernel: ffff88004017a600 ffff88004017f938
> ffffffff810a0bb6 ffff880100000000
> Oct 23 21:34:27 mbpc-pc kernel: ffff880110fd0840 ffff880000000000
> ffffffff81090728 ffff880100000000
> Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a0bb6>] ?
> enqueue_task_fair+0x66/0x410
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090728>] ?
> check_preempt_curr+0x78/0x90
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109075d>] ?
> ttwu_do_wakeup+0x1d/0xf0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090de0>] ?
> ttwu_queue+0x180/0x190
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>]
> schedule_timeout+0x9c/0xe0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>]
> wait_for_completion+0xc0/0xf0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ?
> try_to_wake_up+0x260/0x260
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f76ad>]
> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa00e7188>] ?
> qla2x00_post_work+0x58/0x70 [qla2xxx]
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa0286f69>]
> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa01447e9>]
> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff815092fc>] ?
> dbs_work_handler+0x5c/0x90
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8107f960>] ?
> pwq_dec_nr_in_flight+0x50/0xa0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>]
> process_one_work+0x189/0x4e0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810d060c>] ?
> del_timer_sync+0x4c/0x60
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108131e>] ?
> maybe_create_worker+0x8e/0x110
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>]
> worker_thread+0x16d/0x520
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ?
> default_wake_function+0x12/0x20
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> __wake_up_common+0x56/0x90
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
> maybe_create_worker+0x110/0x110
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
> maybe_create_worker+0x110/0x110
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ?
> schedule_tail+0x1e/0xc0
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>]
> ret_from_fork+0x1f/0x40
> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ?
> kthread_freezable_should_stop+0x70/0x70
> Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
> Jan  6 23:58:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> successfully started
> Oct 23 21:36:30 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
> more than 120 seconds.
> Oct 23 21:36:30 mbpc-pc kernel:      Not tainted 4.8.4 #2
> Oct 23 21:36:30 mbpc-pc kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 23 21:36:30 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
>   289      2 0x00000000
> Oct 23 21:36:30 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
> [target_core_mod]
> Oct 23 21:36:30 mbpc-pc kernel: ffff88011113ba18 0000000000000400
> ffff880049e926c0 ffff88011113b998
> Oct 23 21:36:30 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
> ffffffff81f998ef ffff880100000000
> Oct 23 21:36:30 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
> ffffe8ffffc9a000 ffff880000000000
> Oct 23 21:36:30 mbpc-pc kernel: Call Trace:
> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff81080169>] ?
> start_flush_work+0x49/0x180
> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162e7ec>]
> schedule_timeout+0x9c/0xe0
> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40
> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bd15c>] ?
> console_unlock+0x35c/0x380
> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162cfa0>]
> wait_for_completion+0xc0/0xf0
> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810923e0>] ?
> try_to_wake_up+0x260/0x260
> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffffa08f6f84>]
> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
> vprintk_default+0x1f/0x30
>
>


Including the full log:

http://microdevsys.com/linux-lio/messages-mailing-list

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-24  4:32 ` TomK
@ 2016-10-24  4:45   ` TomK
  2016-10-24  6:36     ` Nicholas A. Bellinger
  0 siblings, 1 reply; 14+ messages in thread
From: TomK @ 2016-10-24  4:45 UTC (permalink / raw)
  To: linux-scsi

On 10/24/2016 12:32 AM, TomK wrote:
> On 10/23/2016 10:03 PM, TomK wrote:
>> Hey,
>>
>> Has anyone seen this and could have a workaround?  Seems like it is more
>> Kernel related with various apps not just target apparently not but
>> wondering if there is an interim solution
>> (https://access.redhat.com/solutions/408833)
>>
>> Getting this message after few minutes of usage from the QLA2xxx driver.
>>  This is after some activity on an ESXi server (15 VM's) that I'm
>> connecting to this HBA.  I've tried the following tuning parameters but
>> there was no change in behaviour:
>>
>> vm.dirty_background_ratio = 5
>> vm.dirty_ratio = 10
>>
>> Details:
>>
>>
>> Oct 23 21:28:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>> Oct 23 21:28:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>> task_tag: 1128612
>> Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Sending
>> TMR_FUNCTION_COMPLETE for ref_tag: 1128612
>> Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>> task_tag: 1129116
>> Jan  6 23:52:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>> successfully started
>> Oct 23 21:30:18 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>> Jan  6 23:54:01 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>> successfully started
>> Oct 23 21:32:16 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>> Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
>> more than 120 seconds.
>> Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
>> Oct 23 21:32:24 mbpc-pc kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Oct 23 21:32:24 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
>>   289      2 0x00000000
>> Oct 23 21:32:24 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>> [target_core_mod]
>> Oct 23 21:32:24 mbpc-pc kernel: ffff88011113ba18 0000000000000400
>> ffff880049e926c0 ffff88011113b998
>> Oct 23 21:32:24 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
>> ffffffff81f998ef ffff880100000000
>> Oct 23 21:32:24 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>> ffffe8ffffc9a000 ffff880000000000
>> Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080169>] ?
>> start_flush_work+0x49/0x180
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>]
>> schedule_timeout+0x9c/0xe0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810802ba>] ?
>> flush_work+0x1a/0x40
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>> console_unlock+0x35c/0x380
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>]
>> wait_for_completion+0xc0/0xf0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ?
>> try_to_wake_up+0x260/0x260
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6f84>]
>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>> vprintk_default+0x1f/0x30
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f70c4>]
>> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f46e2>]
>> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>> target_tmr_work+0x154/0x160 [target_core_mod]
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>]
>> process_one_work+0x189/0x4e0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>]
>> worker_thread+0x16d/0x520
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ?
>> default_wake_function+0x12/0x20
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>> __wake_up_common+0x56/0x90
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ?
>> schedule_tail+0x1e/0xc0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>]
>> ret_from_fork+0x1f/0x40
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ?
>> kthread_freezable_should_stop+0x70/0x70
>> Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/1:48:6089 blocked for
>> more than 120 seconds.
>> Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
>> Oct 23 21:32:24 mbpc-pc kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Oct 23 21:32:24 mbpc-pc kernel: kworker/1:48    D ffff88004017f968     0
>>  6089      2 0x00000080
>> Oct 23 21:32:24 mbpc-pc kernel: Workqueue: events qlt_free_session_done
>> [qla2xxx]
>> Oct 23 21:32:24 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8
>> ffff88011a83a300 0000000000000004
>> Oct 23 21:32:24 mbpc-pc kernel: ffff88004017a600 ffff88004017f938
>> ffffffff810a0bb6 ffff880100000000
>> Oct 23 21:32:24 mbpc-pc kernel: ffff880110fd0840 ffff880000000000
>> ffffffff81090728 ffff880100000000
>> Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a0bb6>] ?
>> enqueue_task_fair+0x66/0x410
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090728>] ?
>> check_preempt_curr+0x78/0x90
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109075d>] ?
>> ttwu_do_wakeup+0x1d/0xf0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090de0>] ?
>> ttwu_queue+0x180/0x190
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>]
>> schedule_timeout+0x9c/0xe0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>]
>> wait_for_completion+0xc0/0xf0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ?
>> try_to_wake_up+0x260/0x260
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f76ad>]
>> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa00e7188>] ?
>> qla2x00_post_work+0x58/0x70 [qla2xxx]
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa0286f69>]
>> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa01447e9>]
>> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff815092fc>] ?
>> dbs_work_handler+0x5c/0x90
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8107f960>] ?
>> pwq_dec_nr_in_flight+0x50/0xa0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>]
>> process_one_work+0x189/0x4e0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810d060c>] ?
>> del_timer_sync+0x4c/0x60
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108131e>] ?
>> maybe_create_worker+0x8e/0x110
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>]
>> worker_thread+0x16d/0x520
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ?
>> default_wake_function+0x12/0x20
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>> __wake_up_common+0x56/0x90
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ?
>> schedule_tail+0x1e/0xc0
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>]
>> ret_from_fork+0x1f/0x40
>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ?
>> kthread_freezable_should_stop+0x70/0x70
>> Jan  6 23:56:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>> successfully started
>> Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>> Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
>> Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
>> more than 120 seconds.
>> Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
>> Oct 23 21:34:27 mbpc-pc kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Oct 23 21:34:27 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
>>   289      2 0x00000000
>> Oct 23 21:34:27 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>> [target_core_mod]
>> Oct 23 21:34:27 mbpc-pc kernel: ffff88011113ba18 0000000000000400
>> ffff880049e926c0 ffff88011113b998
>> Oct 23 21:34:27 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
>> ffffffff81f998ef ffff880100000000
>> Oct 23 21:34:27 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>> ffffe8ffffc9a000 ffff880000000000
>> Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080169>] ?
>> start_flush_work+0x49/0x180
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>]
>> schedule_timeout+0x9c/0xe0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810802ba>] ?
>> flush_work+0x1a/0x40
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>> console_unlock+0x35c/0x380
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>]
>> wait_for_completion+0xc0/0xf0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ?
>> try_to_wake_up+0x260/0x260
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6f84>]
>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>> vprintk_default+0x1f/0x30
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f70c4>]
>> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f46e2>]
>> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>> target_tmr_work+0x154/0x160 [target_core_mod]
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>]
>> process_one_work+0x189/0x4e0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>]
>> worker_thread+0x16d/0x520
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ?
>> default_wake_function+0x12/0x20
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>> __wake_up_common+0x56/0x90
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ?
>> schedule_tail+0x1e/0xc0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>]
>> ret_from_fork+0x1f/0x40
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ?
>> kthread_freezable_should_stop+0x70/0x70
>> Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/1:48:6089 blocked for
>> more than 120 seconds.
>> Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
>> Oct 23 21:34:27 mbpc-pc kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Oct 23 21:34:27 mbpc-pc kernel: kworker/1:48    D ffff88004017f968     0
>>  6089      2 0x00000080
>> Oct 23 21:34:27 mbpc-pc kernel: Workqueue: events qlt_free_session_done
>> [qla2xxx]
>> Oct 23 21:34:27 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8
>> ffff88011a83a300 0000000000000004
>> Oct 23 21:34:27 mbpc-pc kernel: ffff88004017a600 ffff88004017f938
>> ffffffff810a0bb6 ffff880100000000
>> Oct 23 21:34:27 mbpc-pc kernel: ffff880110fd0840 ffff880000000000
>> ffffffff81090728 ffff880100000000
>> Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a0bb6>] ?
>> enqueue_task_fair+0x66/0x410
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090728>] ?
>> check_preempt_curr+0x78/0x90
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109075d>] ?
>> ttwu_do_wakeup+0x1d/0xf0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090de0>] ?
>> ttwu_queue+0x180/0x190
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>]
>> schedule_timeout+0x9c/0xe0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>]
>> wait_for_completion+0xc0/0xf0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ?
>> try_to_wake_up+0x260/0x260
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f76ad>]
>> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa00e7188>] ?
>> qla2x00_post_work+0x58/0x70 [qla2xxx]
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa0286f69>]
>> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa01447e9>]
>> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff815092fc>] ?
>> dbs_work_handler+0x5c/0x90
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8107f960>] ?
>> pwq_dec_nr_in_flight+0x50/0xa0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>]
>> process_one_work+0x189/0x4e0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810d060c>] ?
>> del_timer_sync+0x4c/0x60
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108131e>] ?
>> maybe_create_worker+0x8e/0x110
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>]
>> worker_thread+0x16d/0x520
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ?
>> default_wake_function+0x12/0x20
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>> __wake_up_common+0x56/0x90
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ?
>> schedule_tail+0x1e/0xc0
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>]
>> ret_from_fork+0x1f/0x40
>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ?
>> kthread_freezable_should_stop+0x70/0x70
>> Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>> Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
>> Jan  6 23:58:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>> successfully started
>> Oct 23 21:36:30 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
>> more than 120 seconds.
>> Oct 23 21:36:30 mbpc-pc kernel:      Not tainted 4.8.4 #2
>> Oct 23 21:36:30 mbpc-pc kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Oct 23 21:36:30 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
>>   289      2 0x00000000
>> Oct 23 21:36:30 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>> [target_core_mod]
>> Oct 23 21:36:30 mbpc-pc kernel: ffff88011113ba18 0000000000000400
>> ffff880049e926c0 ffff88011113b998
>> Oct 23 21:36:30 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
>> ffffffff81f998ef ffff880100000000
>> Oct 23 21:36:30 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>> ffffe8ffffc9a000 ffff880000000000
>> Oct 23 21:36:30 mbpc-pc kernel: Call Trace:
>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff81080169>] ?
>> start_flush_work+0x49/0x180
>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162e7ec>]
>> schedule_timeout+0x9c/0xe0
>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810802ba>] ?
>> flush_work+0x1a/0x40
>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>> console_unlock+0x35c/0x380
>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162cfa0>]
>> wait_for_completion+0xc0/0xf0
>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810923e0>] ?
>> try_to_wake_up+0x260/0x260
>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffffa08f6f84>]
>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>> vprintk_default+0x1f/0x30
>>
>>
>
>
> Including the full log:
>
> http://microdevsys.com/linux-lio/messages-mailing-list
>


When trying to shut down target using /etc/init.d/target stop, the 
following is printed repeatedly:

Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: 
ABTS_RECV_24XX: instance 0
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: 
qla_target(0): task abort (s_id=1:5:0, tag=1177068, param=0)
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f812:20: 
qla_target(0): task abort for non-existant session
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80e:20: 
Scheduling work (type 1, prm ffff880093365680) to find session for param 
ffff88010f8c7680 (size 64, tgt ffff880111f06600)
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f800:20: Sess 
work (tgt ffff880111f06600)
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending 
task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff880093365694, 
status=4
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: 
ABTS_RESP_24XX: compl_status 31
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending 
retry TERM EXCH CTIO7 (ha=ffff88010fae0000)
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending 
task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff88010f8c76c0, 
status=0
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: 
ABTS_RESP_24XX: compl_status 0
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: 
qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 029c
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-3861:20: New 
command while device ffff880111f06600 is shutting down
Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e859:20: 
qla_target: Unable to send command to target for req, ignoring.
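
For anyone reproducing this, the hung-task watchdog quoted in the traces
above can be quieted while collecting data, and SysRq can dump every
blocked task for a fuller picture. A debugging sketch (run as root; the
procfs paths are the standard ones, and the first knob is the one the
kernel message itself points at):

```shell
# Disable the 120-second hung-task warning while debugging:
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# Dump all uninterruptible (D-state) tasks to the kernel log via
# SysRq 'w' (enable SysRq first if kernel.sysrq restricts it):
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger

# Inspect the resulting stack dumps:
dmesg | tail -n 100
```

Remember to restore hung_task_timeout_secs (default 120) afterwards so
future stalls are still reported.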



Also, when I disable the ports on the Brocade switch we're using and 
then try to stop target, the following is printed:



Oct 24 00:41:31 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop 
down - seconds remaining 231.
Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop 
down - seconds remaining 153.
Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at 
lib/list_debug.c:33 __list_add+0xbe/0xd0
Oct 24 00:41:32 mbpc-pc kernel: list_add corruption. prev->next should 
be next (ffff88009e83b330), but was ffff88011fc972a0. 
(prev=ffff880118ada4c0).
Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc 
tcm_loop target_core_file target_core_iblock target_core_pscsi 
target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM 
iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87 
hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev 
parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt 
ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse 
vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456 
async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq 
libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii 
pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel 
snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm 
snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4 
mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi 
ata_generic pata_jmicron ahci libahci usb_storage dm_mirror 
dm_region_hash dm_log dm_mod
Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3 Not 
tainted 4.8.4 #2
Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co., 
Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48 
ffffffff812e88e9 ffffffff8130753e
Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8 
0000000000000000 ffff880092b83b98
Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952 
0000002100000046 ffffffff8101eae8
Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>] dump_stack+0x51/0x78
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] ? __list_add+0xbe/0xd0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ? 
__switch_to+0x398/0x7e0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>] 
warn_slowpath_fmt+0x49/0x50
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] __list_add+0xbe/0xd0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>] 
move_linked_works+0x62/0x90
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>] 
process_one_work+0x25c/0x4e0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>] 
worker_thread+0x16d/0x520
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ? 
__schedule+0x2fd/0x6a0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ? 
default_wake_function+0x12/0x20
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
__wake_up_common+0x56/0x90
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ? 
schedule_tail+0x1e/0xc0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ? 
kthread_freezable_should_stop+0x70/0x70
Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f464 ]---
Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at 
lib/list_debug.c:36 __list_add+0x9c/0xd0
Oct 24 00:41:32 mbpc-pc kernel: list_add double add: 
new=ffff880118ada4c0, prev=ffff880118ada4c0, next=ffff88009e83b330.
Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc 
tcm_loop target_core_file target_core_iblock target_core_pscsi 
target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM 
iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87 
hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev 
parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt 
ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse 
vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456 
async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq 
libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii 
pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel 
snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm 
snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4 
mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi 
ata_generic pata_jmicron ahci libahci usb_storage dm_mirror 
dm_region_hash dm_log dm_mod
Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3 
Tainted: G        W       4.8.4 #2
Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co., 
Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48 
ffffffff812e88e9 ffffffff8130751c
Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8 
0000000000000000 ffff880092b83b98
Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952 
0000002400000046 ffffffff8101eae8
Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>] dump_stack+0x51/0x78
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] ? __list_add+0x9c/0xd0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ? 
__switch_to+0x398/0x7e0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>] 
warn_slowpath_fmt+0x49/0x50
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] __list_add+0x9c/0xd0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>] 
move_linked_works+0x62/0x90
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>] 
process_one_work+0x25c/0x4e0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>] 
worker_thread+0x16d/0x520
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ? 
__schedule+0x2fd/0x6a0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ? 
default_wake_function+0x12/0x20
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
__wake_up_common+0x56/0x90
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ? 
schedule_tail+0x1e/0xc0
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ? 
kthread_freezable_should_stop+0x70/0x70
Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f465 ]---
Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop 
down - seconds remaining 230.
Oct 24 00:41:33 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop 
down - seconds remaining 152.


-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-24  4:45   ` TomK
@ 2016-10-24  6:36     ` Nicholas A. Bellinger
  2016-10-25  5:28       ` TomK
  0 siblings, 1 reply; 14+ messages in thread
From: Nicholas A. Bellinger @ 2016-10-24  6:36 UTC (permalink / raw)
  To: TomK; +Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali

Hi TomK,

Thanks for reporting this bug.  Comments inline below.

On Mon, 2016-10-24 at 00:45 -0400, TomK wrote:
> On 10/24/2016 12:32 AM, TomK wrote:
> > On 10/23/2016 10:03 PM, TomK wrote:
> >> Hey,
> >>
> >> Has anyone seen this and could have a workaround?  Seems like it is more
> >> Kernel related with various apps not just target apparently not but
> >> wondering if there is an interim solution
> >> (https://access.redhat.com/solutions/408833)
> >>
> >> Getting this message after few minutes of usage from the QLA2xxx driver.
> >>  This is after some activity on an ESXi server (15 VM's) that I'm
> >> connecting to this HBA.  I've tried the following tuning parameters but
> >> there was no change in behaviour:
> >>
> >> vm.dirty_background_ratio = 5
> >> vm.dirty_ratio = 10
> >>
> >> Details:
> >>
> >>
> >> Oct 23 21:28:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> >> Oct 23 21:28:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
> >> task_tag: 1128612
> >> Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Sending
> >> TMR_FUNCTION_COMPLETE for ref_tag: 1128612
> >> Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
> >> task_tag: 1129116

You are likely hitting a known v4.1+ regression, not yet merged up to
v4.8.y code:

https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e
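
If you want to try that fix ahead of it landing in a stable release, a
rough workflow would be to cherry-pick the commit onto a v4.8.y tree.
This is an untested sketch (assumes a local clone of the stable tree;
the commit id is taken from the URL above, and any conflicts must be
resolved by hand):

```shell
# Clone the stable tree and branch off the running kernel's tag:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git checkout -b target-abort-fix v4.8.4

# Pull in mainline history so the commit is reachable, then apply it:
git fetch https://github.com/torvalds/linux.git master
git cherry-pick 527268df31e57cf2b6d417198717c6d6afdb1e3e
```

Then rebuild target_core_mod (or the whole kernel) as usual and retest
the ABORT_TASK workload.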

> >> Jan  6 23:52:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> >> successfully started
> >> Oct 23 21:30:18 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> >> Jan  6 23:54:01 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> >> successfully started
> >> Oct 23 21:32:16 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> >> Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
> >> more than 120 seconds.
> >> Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
> >> Oct 23 21:32:24 mbpc-pc kernel: "echo 0 >
> >> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Oct 23 21:32:24 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
> >>   289      2 0x00000000
> >> Oct 23 21:32:24 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
> >> [target_core_mod]
> >> Oct 23 21:32:24 mbpc-pc kernel: ffff88011113ba18 0000000000000400
> >> ffff880049e926c0 ffff88011113b998
> >> Oct 23 21:32:24 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
> >> ffffffff81f998ef ffff880100000000
> >> Oct 23 21:32:24 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
> >> ffffe8ffffc9a000 ffff880000000000
> >> Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080169>] ?
> >> start_flush_work+0x49/0x180
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>]
> >> schedule_timeout+0x9c/0xe0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810802ba>] ?
> >> flush_work+0x1a/0x40
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bd15c>] ?
> >> console_unlock+0x35c/0x380
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>]
> >> wait_for_completion+0xc0/0xf0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ?
> >> try_to_wake_up+0x260/0x260
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6f84>]
> >> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
> >> vprintk_default+0x1f/0x30
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f70c4>]
> >> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f46e2>]
> >> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6aa4>]
> >> target_tmr_work+0x154/0x160 [target_core_mod]
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>]
> >> process_one_work+0x189/0x4e0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>]
> >> worker_thread+0x16d/0x520
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ?
> >> default_wake_function+0x12/0x20
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> >> __wake_up_common+0x56/0x90
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >> maybe_create_worker+0x110/0x110
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >> maybe_create_worker+0x110/0x110
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ?
> >> schedule_tail+0x1e/0xc0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>]
> >> ret_from_fork+0x1f/0x40
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ?
> >> kthread_freezable_should_stop+0x70/0x70
> >> Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/1:48:6089 blocked for
> >> more than 120 seconds.
> >> Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
> >> Oct 23 21:32:24 mbpc-pc kernel: "echo 0 >
> >> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Oct 23 21:32:24 mbpc-pc kernel: kworker/1:48    D ffff88004017f968     0
> >>  6089      2 0x00000080
> >> Oct 23 21:32:24 mbpc-pc kernel: Workqueue: events qlt_free_session_done
> >> [qla2xxx]
> >> Oct 23 21:32:24 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8
> >> ffff88011a83a300 0000000000000004
> >> Oct 23 21:32:24 mbpc-pc kernel: ffff88004017a600 ffff88004017f938
> >> ffffffff810a0bb6 ffff880100000000
> >> Oct 23 21:32:24 mbpc-pc kernel: ffff880110fd0840 ffff880000000000
> >> ffffffff81090728 ffff880100000000
> >> Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a0bb6>] ?
> >> enqueue_task_fair+0x66/0x410
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090728>] ?
> >> check_preempt_curr+0x78/0x90
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109075d>] ?
> >> ttwu_do_wakeup+0x1d/0xf0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090de0>] ?
> >> ttwu_queue+0x180/0x190
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>]
> >> schedule_timeout+0x9c/0xe0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>]
> >> wait_for_completion+0xc0/0xf0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ?
> >> try_to_wake_up+0x260/0x260
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f76ad>]
> >> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa00e7188>] ?
> >> qla2x00_post_work+0x58/0x70 [qla2xxx]
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa0286f69>]
> >> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa01447e9>]
> >> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff815092fc>] ?
> >> dbs_work_handler+0x5c/0x90
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8107f960>] ?
> >> pwq_dec_nr_in_flight+0x50/0xa0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>]
> >> process_one_work+0x189/0x4e0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810d060c>] ?
> >> del_timer_sync+0x4c/0x60
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108131e>] ?
> >> maybe_create_worker+0x8e/0x110
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>]
> >> worker_thread+0x16d/0x520
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ?
> >> default_wake_function+0x12/0x20
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> >> __wake_up_common+0x56/0x90
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >> maybe_create_worker+0x110/0x110
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >> maybe_create_worker+0x110/0x110
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ?
> >> schedule_tail+0x1e/0xc0
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>]
> >> ret_from_fork+0x1f/0x40
> >> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ?
> >> kthread_freezable_should_stop+0x70/0x70
> >> Jan  6 23:56:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> >> successfully started
> >> Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> >> Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
> >> Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
> >> more than 120 seconds.
> >> Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
> >> Oct 23 21:34:27 mbpc-pc kernel: "echo 0 >
> >> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Oct 23 21:34:27 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
> >>   289      2 0x00000000
> >> Oct 23 21:34:27 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
> >> [target_core_mod]
> >> Oct 23 21:34:27 mbpc-pc kernel: ffff88011113ba18 0000000000000400
> >> ffff880049e926c0 ffff88011113b998
> >> Oct 23 21:34:27 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
> >> ffffffff81f998ef ffff880100000000
> >> Oct 23 21:34:27 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
> >> ffffe8ffffc9a000 ffff880000000000
> >> Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080169>] ?
> >> start_flush_work+0x49/0x180
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>]
> >> schedule_timeout+0x9c/0xe0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810802ba>] ?
> >> flush_work+0x1a/0x40
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bd15c>] ?
> >> console_unlock+0x35c/0x380
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>]
> >> wait_for_completion+0xc0/0xf0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ?
> >> try_to_wake_up+0x260/0x260
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6f84>]
> >> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
> >> vprintk_default+0x1f/0x30
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f70c4>]
> >> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f46e2>]
> >> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6aa4>]
> >> target_tmr_work+0x154/0x160 [target_core_mod]
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>]
> >> process_one_work+0x189/0x4e0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>]
> >> worker_thread+0x16d/0x520
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ?
> >> default_wake_function+0x12/0x20
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> >> __wake_up_common+0x56/0x90
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >> maybe_create_worker+0x110/0x110
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >> maybe_create_worker+0x110/0x110
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ?
> >> schedule_tail+0x1e/0xc0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>]
> >> ret_from_fork+0x1f/0x40
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ?
> >> kthread_freezable_should_stop+0x70/0x70
> >> Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/1:48:6089 blocked for
> >> more than 120 seconds.
> >> Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
> >> Oct 23 21:34:27 mbpc-pc kernel: "echo 0 >
> >> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Oct 23 21:34:27 mbpc-pc kernel: kworker/1:48    D ffff88004017f968     0
> >>  6089      2 0x00000080
> >> Oct 23 21:34:27 mbpc-pc kernel: Workqueue: events qlt_free_session_done
> >> [qla2xxx]
> >> Oct 23 21:34:27 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8
> >> ffff88011a83a300 0000000000000004
> >> Oct 23 21:34:27 mbpc-pc kernel: ffff88004017a600 ffff88004017f938
> >> ffffffff810a0bb6 ffff880100000000
> >> Oct 23 21:34:27 mbpc-pc kernel: ffff880110fd0840 ffff880000000000
> >> ffffffff81090728 ffff880100000000
> >> Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a0bb6>] ?
> >> enqueue_task_fair+0x66/0x410
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090728>] ?
> >> check_preempt_curr+0x78/0x90
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109075d>] ?
> >> ttwu_do_wakeup+0x1d/0xf0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090de0>] ?
> >> ttwu_queue+0x180/0x190
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>]
> >> schedule_timeout+0x9c/0xe0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>]
> >> wait_for_completion+0xc0/0xf0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ?
> >> try_to_wake_up+0x260/0x260
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f76ad>]
> >> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa00e7188>] ?
> >> qla2x00_post_work+0x58/0x70 [qla2xxx]
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa0286f69>]
> >> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa01447e9>]
> >> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff815092fc>] ?
> >> dbs_work_handler+0x5c/0x90
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8107f960>] ?
> >> pwq_dec_nr_in_flight+0x50/0xa0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>]
> >> process_one_work+0x189/0x4e0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810d060c>] ?
> >> del_timer_sync+0x4c/0x60
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108131e>] ?
> >> maybe_create_worker+0x8e/0x110
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>]
> >> worker_thread+0x16d/0x520
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ?
> >> default_wake_function+0x12/0x20
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> >> __wake_up_common+0x56/0x90
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >> maybe_create_worker+0x110/0x110
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >> maybe_create_worker+0x110/0x110
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ?
> >> schedule_tail+0x1e/0xc0
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>]
> >> ret_from_fork+0x1f/0x40
> >> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ?
> >> kthread_freezable_should_stop+0x70/0x70
> >> Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> >> Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
> >> Jan  6 23:58:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> >> successfully started
> >> Oct 23 21:36:30 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
> >> more than 120 seconds.
> >> Oct 23 21:36:30 mbpc-pc kernel:      Not tainted 4.8.4 #2
> >> Oct 23 21:36:30 mbpc-pc kernel: "echo 0 >
> >> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Oct 23 21:36:30 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
> >>   289      2 0x00000000
> >> Oct 23 21:36:30 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
> >> [target_core_mod]
> >> Oct 23 21:36:30 mbpc-pc kernel: ffff88011113ba18 0000000000000400
> >> ffff880049e926c0 ffff88011113b998
> >> Oct 23 21:36:30 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
> >> ffffffff81f998ef ffff880100000000
> >> Oct 23 21:36:30 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
> >> ffffe8ffffc9a000 ffff880000000000
> >> Oct 23 21:36:30 mbpc-pc kernel: Call Trace:
> >> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
> >> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> >> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff81080169>] ?
> >> start_flush_work+0x49/0x180
> >> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162e7ec>]
> >> schedule_timeout+0x9c/0xe0
> >> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810802ba>] ?
> >> flush_work+0x1a/0x40
> >> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bd15c>] ?
> >> console_unlock+0x35c/0x380
> >> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162cfa0>]
> >> wait_for_completion+0xc0/0xf0
> >> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810923e0>] ?
> >> try_to_wake_up+0x260/0x260
> >> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffffa08f6f84>]
> >> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
> >> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
> >> vprintk_default+0x1f/0x30
> >>
> >>
> >
> >
> > Including the full log:
> >
> > http://microdevsys.com/linux-lio/messages-mailing-list
> >
> 

Thanks for posting with qla2xxx verbose debug enabled on your setup.

> 
> When trying to shut down the target using /etc/init.d/target stop, the
> following is printed repeatedly:
> 
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: 
> ABTS_RECV_24XX: instance 0
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: 
> qla_target(0): task abort (s_id=1:5:0, tag=1177068, param=0)
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f812:20: 
> qla_target(0): task abort for non-existant session
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80e:20: 
> Scheduling work (type 1, prm ffff880093365680) to find session for param 
> ffff88010f8c7680 (size 64, tgt ffff880111f06600)
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f800:20: Sess 
> work (tgt ffff880111f06600)
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending 
> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff880093365694, 
> status=4
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: 
> ABTS_RESP_24XX: compl_status 31
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending 
> retry TERM EXCH CTIO7 (ha=ffff88010fae0000)
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending 
> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff88010f8c76c0, 
> status=0
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: 
> ABTS_RESP_24XX: compl_status 0
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: 
> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 029c
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-3861:20: New 
> command while device ffff880111f06600 is shutting down
> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e859:20: 
> qla_target: Unable to send command to target for req, ignoring.
> 
> 

At your earliest convenience, please verify the patch using v4.8.y with
the above ABORT_TASK + shutdown scenario.
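For reference, the verification could look something like the sketch below (my own illustration, not commands from the thread; the commit id is the one from the github link quoted later in this thread, and the preview-mode `run` wrapper is purely illustrative):

```shell
# Sketch only: cherry-pick the suspected target-core fix onto a v4.8.y
# stable tree and rebuild.  Runs in preview mode by default; set DRY_RUN=
# (empty) to actually execute the commands.
: "${DRY_RUN:=1}"
FIX=527268df31e57cf2b6d417198717c6d6afdb1e3e   # commit id from the github link
STABLE=https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git

run() { echo "+ $*"; [ -n "$DRY_RUN" ] || "$@"; }   # print, then maybe execute

run git clone --branch v4.8.4 "$STABLE" linux-4.8.4
run cd linux-4.8.4
# the fix is in mainline but not yet in v4.8.y, so fetch it from there first
run git fetch https://github.com/torvalds/linux.git "$FIX"
run git cherry-pick FETCH_HEAD
run make olddefconfig
run make -j"$(nproc)"
```
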

Also, it would be helpful to understand why this ESX FC host is
generating ABORT_TASKs.

E.g.: Is ABORT_TASK generated due to FC target response packet loss..?
Or due to target backend I/O latency that ultimately triggers FC host
side timeouts...?
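One data point from the lines quoted at the top of this thread: the abort for tag 1128612 was found at 21:28:29 but its TMR_FUNCTION_COMPLETE only went out at 21:28:42, i.e. 13 seconds later, which already leans toward slow backend I/O rather than pure packet loss. A throwaway script (my own illustration, not part of the thread) that pairs aborts with their completions across a log:

```python
import re
from datetime import datetime

# Sample lines quoted earlier in this thread.
LOG = """\
Oct 23 21:28:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1128612
Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1128612
Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1129116
"""

def abort_delays(text):
    """Return ({tag: seconds until TMR_FUNCTION_COMPLETE},
               {tag: start time} for aborts never seen completing)."""
    started, delays = {}, {}
    for line in text.splitlines():
        ts = datetime.strptime(line[:15], "%b %d %H:%M:%S")  # syslog timestamp
        if m := re.search(r"Found referenced qla2xxx task_tag: (\d+)", line):
            started[m.group(1)] = ts
        elif m := re.search(r"TMR_FUNCTION_COMPLETE for ref_tag: (\d+)", line):
            if m.group(1) in started:
                delays[m.group(1)] = (ts - started.pop(m.group(1))).total_seconds()
    return delays, started

delays, pending = abort_delays(LOG)
print(delays)         # {'1128612': 13.0} -- 13s to complete an abort
print(list(pending))  # ['1129116'] -- abort with no completion in the sample
```

Run against the full messages log linked above, consistently large delays (or aborts that never complete) would point at backend latency; aborts that complete quickly but keep recurring would point more toward response loss on the wire.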

> 
> Also, when I disable the ports on the Brocade switch that we're using and
> then try to stop the target, the following is printed:
> 
> 
> 
> Oct 24 00:41:31 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop 
> down - seconds remaining 231.
> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop 
> down - seconds remaining 153.
> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at 
> lib/list_debug.c:33 __list_add+0xbe/0xd0
> Oct 24 00:41:32 mbpc-pc kernel: list_add corruption. prev->next should 
> be next (ffff88009e83b330), but was ffff88011fc972a0. 
> (prev=ffff880118ada4c0).
> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc 
> tcm_loop target_core_file target_core_iblock target_core_pscsi 
> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM 
> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87 
> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev 
> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt 
> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse 
> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456 
> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq 
> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii 
> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel 
> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm 
> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4 
> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi 
> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror 
> dm_region_hash dm_log dm_mod
> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3 Not 
> tainted 4.8.4 #2
> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co., 
> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48 
> ffffffff812e88e9 ffffffff8130753e
> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8 
> 0000000000000000 ffff880092b83b98
> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952 
> 0000002100000046 ffffffff8101eae8
> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>] dump_stack+0x51/0x78
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] ? __list_add+0xbe/0xd0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ? 
> __switch_to+0x398/0x7e0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>] 
> warn_slowpath_fmt+0x49/0x50
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] __list_add+0xbe/0xd0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>] 
> move_linked_works+0x62/0x90
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>] 
> process_one_work+0x25c/0x4e0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>] 
> worker_thread+0x16d/0x520
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ? 
> __schedule+0x2fd/0x6a0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ? 
> default_wake_function+0x12/0x20
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
> __wake_up_common+0x56/0x90
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? 
> maybe_create_worker+0x110/0x110
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? 
> maybe_create_worker+0x110/0x110
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ? 
> schedule_tail+0x1e/0xc0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ? 
> kthread_freezable_should_stop+0x70/0x70
> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f464 ]---
> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at 
> lib/list_debug.c:36 __list_add+0x9c/0xd0
> Oct 24 00:41:32 mbpc-pc kernel: list_add double add: 
> new=ffff880118ada4c0, prev=ffff880118ada4c0, next=ffff88009e83b330.
> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc 
> tcm_loop target_core_file target_core_iblock target_core_pscsi 
> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM 
> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87 
> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev 
> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt 
> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse 
> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456 
> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq 
> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii 
> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel 
> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm 
> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4 
> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi 
> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror 
> dm_region_hash dm_log dm_mod
> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3 
> Tainted: G        W       4.8.4 #2
> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co., 
> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48 
> ffffffff812e88e9 ffffffff8130751c
> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8 
> 0000000000000000 ffff880092b83b98
> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952 
> 0000002400000046 ffffffff8101eae8
> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>] dump_stack+0x51/0x78
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] ? __list_add+0x9c/0xd0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ? 
> __switch_to+0x398/0x7e0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>] 
> warn_slowpath_fmt+0x49/0x50
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] __list_add+0x9c/0xd0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>] 
> move_linked_works+0x62/0x90
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>] 
> process_one_work+0x25c/0x4e0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>] 
> worker_thread+0x16d/0x520
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ? 
> __schedule+0x2fd/0x6a0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ? 
> default_wake_function+0x12/0x20
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
> __wake_up_common+0x56/0x90
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? 
> maybe_create_worker+0x110/0x110
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? 
> maybe_create_worker+0x110/0x110
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ? 
> schedule_tail+0x1e/0xc0
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ? 
> kthread_freezable_should_stop+0x70/0x70
> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f465 ]---
> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop 
> down - seconds remaining 230.
> Oct 24 00:41:33 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop 
> down - seconds remaining 152.
> 
> 

Mmmm.  Could be a side effect of the target-core regression, but not
completely sure...

Adding QLogic folks to CC.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-24  6:36     ` Nicholas A. Bellinger
@ 2016-10-25  5:28       ` TomK
  2016-10-26  2:05         ` TomK
  0 siblings, 1 reply; 14+ messages in thread
From: TomK @ 2016-10-25  5:28 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali

On 10/24/2016 2:36 AM, Nicholas A. Bellinger wrote:
> Hi TomK,
>
> Thanks for reporting this bug.  Comments inline below.
>
> On Mon, 2016-10-24 at 00:45 -0400, TomK wrote:
>> On 10/24/2016 12:32 AM, TomK wrote:
>>> On 10/23/2016 10:03 PM, TomK wrote:
>>>> Hey,
>>>>
>>>> Has anyone seen this, and is there a workaround?  It seems to be more
>>>> kernel-related, affecting various apps and not just target, but I'm
>>>> wondering if there is an interim solution
>>>> (https://access.redhat.com/solutions/408833)
>>>>
>>>> I'm getting this message from the QLA2xxx driver after a few minutes of
>>>> use.  This is after some activity on an ESXi server (15 VMs) that I'm
>>>> connecting to this HBA.  I've tried the following tuning parameters, but
>>>> there was no change in behaviour:
>>>>
>>>> vm.dirty_background_ratio = 5
>>>> vm.dirty_ratio = 10
>>>>
>>>> Details:
>>>>
>>>>
>>>> Oct 23 21:28:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Oct 23 21:28:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>>>> task_tag: 1128612
>>>> Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Sending
>>>> TMR_FUNCTION_COMPLETE for ref_tag: 1128612
>>>> Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>>>> task_tag: 1129116
>
> You are likely hitting a known v4.1+ regression, not yet merged up to
> v4.8.y code:
>
> https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e
>
>>>> Jan  6 23:52:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>> successfully started
>>>> Oct 23 21:30:18 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Jan  6 23:54:01 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>> successfully started
>>>> Oct 23 21:32:16 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
>>>> more than 120 seconds.
>>>> Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>> Oct 23 21:32:24 mbpc-pc kernel: "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Oct 23 21:32:24 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
>>>>   289      2 0x00000000
>>>> Oct 23 21:32:24 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>>> [target_core_mod]
>>>> Oct 23 21:32:24 mbpc-pc kernel: ffff88011113ba18 0000000000000400
>>>> ffff880049e926c0 ffff88011113b998
>>>> Oct 23 21:32:24 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
>>>> ffffffff81f998ef ffff880100000000
>>>> Oct 23 21:32:24 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>>>> ffffe8ffffc9a000 ffff880000000000
>>>> Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080169>] ?
>>>> start_flush_work+0x49/0x180
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>> schedule_timeout+0x9c/0xe0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810802ba>] ?
>>>> flush_work+0x1a/0x40
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>>>> console_unlock+0x35c/0x380
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>> wait_for_completion+0xc0/0xf0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>> try_to_wake_up+0x260/0x260
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6f84>]
>>>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>>>> vprintk_default+0x1f/0x30
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f70c4>]
>>>> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f46e2>]
>>>> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>>>> target_tmr_work+0x154/0x160 [target_core_mod]
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>]
>>>> process_one_work+0x189/0x4e0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>]
>>>> worker_thread+0x16d/0x520
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>> default_wake_function+0x12/0x20
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>> __wake_up_common+0x56/0x90
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>> schedule_tail+0x1e/0xc0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>> ret_from_fork+0x1f/0x40
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>> kthread_freezable_should_stop+0x70/0x70
>>>> Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/1:48:6089 blocked for
>>>> more than 120 seconds.
>>>> Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>> Oct 23 21:32:24 mbpc-pc kernel: "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Oct 23 21:32:24 mbpc-pc kernel: kworker/1:48    D ffff88004017f968     0
>>>>  6089      2 0x00000080
>>>> Oct 23 21:32:24 mbpc-pc kernel: Workqueue: events qlt_free_session_done
>>>> [qla2xxx]
>>>> Oct 23 21:32:24 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8
>>>> ffff88011a83a300 0000000000000004
>>>> Oct 23 21:32:24 mbpc-pc kernel: ffff88004017a600 ffff88004017f938
>>>> ffffffff810a0bb6 ffff880100000000
>>>> Oct 23 21:32:24 mbpc-pc kernel: ffff880110fd0840 ffff880000000000
>>>> ffffffff81090728 ffff880100000000
>>>> Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a0bb6>] ?
>>>> enqueue_task_fair+0x66/0x410
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090728>] ?
>>>> check_preempt_curr+0x78/0x90
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109075d>] ?
>>>> ttwu_do_wakeup+0x1d/0xf0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090de0>] ?
>>>> ttwu_queue+0x180/0x190
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>> schedule_timeout+0x9c/0xe0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>> wait_for_completion+0xc0/0xf0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>> try_to_wake_up+0x260/0x260
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f76ad>]
>>>> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa00e7188>] ?
>>>> qla2x00_post_work+0x58/0x70 [qla2xxx]
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa0286f69>]
>>>> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa01447e9>]
>>>> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff815092fc>] ?
>>>> dbs_work_handler+0x5c/0x90
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8107f960>] ?
>>>> pwq_dec_nr_in_flight+0x50/0xa0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>]
>>>> process_one_work+0x189/0x4e0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810d060c>] ?
>>>> del_timer_sync+0x4c/0x60
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108131e>] ?
>>>> maybe_create_worker+0x8e/0x110
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>]
>>>> worker_thread+0x16d/0x520
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>> default_wake_function+0x12/0x20
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>> __wake_up_common+0x56/0x90
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>> schedule_tail+0x1e/0xc0
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>> ret_from_fork+0x1f/0x40
>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>> kthread_freezable_should_stop+0x70/0x70
>>>> Jan  6 23:56:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>> successfully started
>>>> Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
>>>> Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
>>>> more than 120 seconds.
>>>> Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>> Oct 23 21:34:27 mbpc-pc kernel: "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Oct 23 21:34:27 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
>>>>   289      2 0x00000000
>>>> Oct 23 21:34:27 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>>> [target_core_mod]
>>>> Oct 23 21:34:27 mbpc-pc kernel: ffff88011113ba18 0000000000000400
>>>> ffff880049e926c0 ffff88011113b998
>>>> Oct 23 21:34:27 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
>>>> ffffffff81f998ef ffff880100000000
>>>> Oct 23 21:34:27 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>>>> ffffe8ffffc9a000 ffff880000000000
>>>> Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080169>] ?
>>>> start_flush_work+0x49/0x180
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>> schedule_timeout+0x9c/0xe0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810802ba>] ?
>>>> flush_work+0x1a/0x40
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>>>> console_unlock+0x35c/0x380
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>> wait_for_completion+0xc0/0xf0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>> try_to_wake_up+0x260/0x260
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6f84>]
>>>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>>>> vprintk_default+0x1f/0x30
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f70c4>]
>>>> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f46e2>]
>>>> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>>>> target_tmr_work+0x154/0x160 [target_core_mod]
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>]
>>>> process_one_work+0x189/0x4e0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>]
>>>> worker_thread+0x16d/0x520
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>> default_wake_function+0x12/0x20
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>> __wake_up_common+0x56/0x90
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>> schedule_tail+0x1e/0xc0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>> ret_from_fork+0x1f/0x40
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>> kthread_freezable_should_stop+0x70/0x70
>>>> Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/1:48:6089 blocked for
>>>> more than 120 seconds.
>>>> Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>> Oct 23 21:34:27 mbpc-pc kernel: "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Oct 23 21:34:27 mbpc-pc kernel: kworker/1:48    D ffff88004017f968     0
>>>>  6089      2 0x00000080
>>>> Oct 23 21:34:27 mbpc-pc kernel: Workqueue: events qlt_free_session_done
>>>> [qla2xxx]
>>>> Oct 23 21:34:27 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8
>>>> ffff88011a83a300 0000000000000004
>>>> Oct 23 21:34:27 mbpc-pc kernel: ffff88004017a600 ffff88004017f938
>>>> ffffffff810a0bb6 ffff880100000000
>>>> Oct 23 21:34:27 mbpc-pc kernel: ffff880110fd0840 ffff880000000000
>>>> ffffffff81090728 ffff880100000000
>>>> Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a0bb6>] ?
>>>> enqueue_task_fair+0x66/0x410
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090728>] ?
>>>> check_preempt_curr+0x78/0x90
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109075d>] ?
>>>> ttwu_do_wakeup+0x1d/0xf0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090de0>] ?
>>>> ttwu_queue+0x180/0x190
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>> schedule_timeout+0x9c/0xe0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>> wait_for_completion+0xc0/0xf0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>> try_to_wake_up+0x260/0x260
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f76ad>]
>>>> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa00e7188>] ?
>>>> qla2x00_post_work+0x58/0x70 [qla2xxx]
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa0286f69>]
>>>> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa01447e9>]
>>>> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff815092fc>] ?
>>>> dbs_work_handler+0x5c/0x90
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8107f960>] ?
>>>> pwq_dec_nr_in_flight+0x50/0xa0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>]
>>>> process_one_work+0x189/0x4e0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810d060c>] ?
>>>> del_timer_sync+0x4c/0x60
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108131e>] ?
>>>> maybe_create_worker+0x8e/0x110
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>]
>>>> worker_thread+0x16d/0x520
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>> default_wake_function+0x12/0x20
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>> __wake_up_common+0x56/0x90
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>> schedule_tail+0x1e/0xc0
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>> ret_from_fork+0x1f/0x40
>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>> kthread_freezable_should_stop+0x70/0x70
>>>> Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
>>>> Jan  6 23:58:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>> successfully started
>>>> Oct 23 21:36:30 mbpc-pc kernel: INFO: task kworker/u16:8:289 blocked for
>>>> more than 120 seconds.
>>>> Oct 23 21:36:30 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>> Oct 23 21:36:30 mbpc-pc kernel: "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Oct 23 21:36:30 mbpc-pc kernel: kworker/u16:8   D ffff88011113ba18     0
>>>>   289      2 0x00000000
>>>> Oct 23 21:36:30 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>>> [target_core_mod]
>>>> Oct 23 21:36:30 mbpc-pc kernel: ffff88011113ba18 0000000000000400
>>>> ffff880049e926c0 ffff88011113b998
>>>> Oct 23 21:36:30 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
>>>> ffffffff81f998ef ffff880100000000
>>>> Oct 23 21:36:30 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>>>> ffffe8ffffc9a000 ffff880000000000
>>>> Oct 23 21:36:30 mbpc-pc kernel: Call Trace:
>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff81080169>] ?
>>>> start_flush_work+0x49/0x180
>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>> schedule_timeout+0x9c/0xe0
>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810802ba>] ?
>>>> flush_work+0x1a/0x40
>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>>>> console_unlock+0x35c/0x380
>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>> wait_for_completion+0xc0/0xf0
>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>> try_to_wake_up+0x260/0x260
>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffffa08f6f84>]
>>>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>>>> vprintk_default+0x1f/0x30
>>>>
>>>>
>>>
>>>
>>> Including the full log:
>>>
>>> http://microdevsys.com/linux-lio/messages-mailing-list
>>>
>>
>
> Thanks for posting with qla2xxx verbose debug enabled on your setup.
>
>>
>> When trying to shut down the target using /etc/init.d/target stop, the
>> following is printed repeatedly:
>>
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20:
>> ABTS_RECV_24XX: instance 0
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20:
>> qla_target(0): task abort (s_id=1:5:0, tag=1177068, param=0)
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f812:20:
>> qla_target(0): task abort for non-existant session
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80e:20:
>> Scheduling work (type 1, prm ffff880093365680) to find session for param
>> ffff88010f8c7680 (size 64, tgt ffff880111f06600)
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f800:20: Sess
>> work (tgt ffff880111f06600)
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending
>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff880093365694,
>> status=4
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>> ABTS_RESP_24XX: compl_status 31
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending
>> retry TERM EXCH CTIO7 (ha=ffff88010fae0000)
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending
>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff88010f8c76c0,
>> status=0
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>> ABTS_RESP_24XX: compl_status 0
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 029c
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-3861:20: New
>> command while device ffff880111f06600 is shutting down
>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e859:20:
>> qla_target: Unable to send command to target for req, ignoring.
>>
>>
>
> At your earliest convenience, please verify the patch using v4.8.y with
> the above ABORT_TASK + shutdown scenario.
>
> Also, it would be helpful to understand why this ESX FC host is
> generating ABORT_TASKs.
>
> Eg: Is ABORT_TASK generated due to FC target response packet loss..?
> Or due to target backend I/O latency, that ultimately triggers FC host
> side timeouts...?
>
>>
>> + when I disable the ports on the brocade switch that we're using then
>> try to stop target, the following is printed:
>>
>>
>>
>> Oct 24 00:41:31 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop
>> down - seconds remaining 231.
>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop
>> down - seconds remaining 153.
>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at
>> lib/list_debug.c:33 __list_add+0xbe/0xd0
>> Oct 24 00:41:32 mbpc-pc kernel: list_add corruption. prev->next should
>> be next (ffff88009e83b330), but was ffff88011fc972a0.
>> (prev=ffff880118ada4c0).
>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii
>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm
>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4
>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>> dm_region_hash dm_log dm_mod
>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3 Not
>> tainted 4.8.4 #2
>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48
>> ffffffff812e88e9 ffffffff8130753e
>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8
>> 0000000000000000 ffff880092b83b98
>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952
>> 0000002100000046 ffffffff8101eae8
>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>] dump_stack+0x51/0x78
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] ? __list_add+0xbe/0xd0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ?
>> __switch_to+0x398/0x7e0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>]
>> warn_slowpath_fmt+0x49/0x50
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] __list_add+0xbe/0xd0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>]
>> move_linked_works+0x62/0x90
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>]
>> process_one_work+0x25c/0x4e0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>]
>> worker_thread+0x16d/0x520
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
>> __schedule+0x2fd/0x6a0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ?
>> default_wake_function+0x12/0x20
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>> __wake_up_common+0x56/0x90
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ?
>> schedule_tail+0x1e/0xc0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ?
>> kthread_freezable_should_stop+0x70/0x70
>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f464 ]---
>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at
>> lib/list_debug.c:36 __list_add+0x9c/0xd0
>> Oct 24 00:41:32 mbpc-pc kernel: list_add double add:
>> new=ffff880118ada4c0, prev=ffff880118ada4c0, next=ffff88009e83b330.
>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii
>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm
>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4
>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>> dm_region_hash dm_log dm_mod
>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3
>> Tainted: G        W       4.8.4 #2
>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48
>> ffffffff812e88e9 ffffffff8130751c
>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8
>> 0000000000000000 ffff880092b83b98
>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952
>> 0000002400000046 ffffffff8101eae8
>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>] dump_stack+0x51/0x78
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] ? __list_add+0x9c/0xd0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ?
>> __switch_to+0x398/0x7e0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>]
>> warn_slowpath_fmt+0x49/0x50
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] __list_add+0x9c/0xd0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>]
>> move_linked_works+0x62/0x90
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>]
>> process_one_work+0x25c/0x4e0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>]
>> worker_thread+0x16d/0x520
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
>> __schedule+0x2fd/0x6a0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ?
>> default_wake_function+0x12/0x20
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>> __wake_up_common+0x56/0x90
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ?
>> schedule_tail+0x1e/0xc0
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ?
>> kthread_freezable_should_stop+0x70/0x70
>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f465 ]---
>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop
>> down - seconds remaining 230.
>> Oct 24 00:41:33 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop
>> down - seconds remaining 152.
>>
>>
>
> Mmmm.  Could be a side effect of the target-core regression, but not
> completely sure..
>
> Adding QLOGIC folks CC'.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Hey Nicholas,


 > At your earliest convenience, please verify the patch using v4.8.y with
 > the above ABORT_TASK + shutdown scenario.
 >
 > Also, it would be helpful to understand why this ESX FC host is
 > generating ABORT_TASKs.
 >
 > Eg: Is ABORT_TASK generated due to FC target response packet loss..?
 > Or due to target backend I/O latency, that ultimately triggers FC host
 > side timeouts...?


Here is where it gets interesting, and it speaks to your question above. 
Take for example this log snippet 
(http://microdevsys.com/linux-lio/messages-recent):

Oct 23 22:12:51 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Jan  7 00:36:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon 
successfully started
Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 1 rtc interrupts
Oct 23 22:15:02 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Oct 23 22:15:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Jan  7 00:38:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon 
successfully started
Oct 23 22:16:29 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
Oct 23 22:17:30 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Jan  7 00:40:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon 
successfully started
Oct 23 22:18:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx 
task_tag: 1195032
Oct 23 22:18:33 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Sending 
TMR_FUNCTION_COMPLETE for ref_tag: 1195032
Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx 
task_tag: 1122276
Oct 23 22:19:35 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Jan  7 00:42:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon 
successfully started
Oct 23 22:20:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Oct 23 22:21:07 mbpc-pc kernel: INFO: task kworker/u16:8:308 blocked for 
more than 120 seconds.
Oct 23 22:21:07 mbpc-pc kernel:      Not tainted 4.8.4 #2
Oct 23 22:21:07 mbpc-pc kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 23 22:21:07 mbpc-pc kernel: kworker/u16:8   D ffff880111b8fa18     0 
   308      2 0x00000000
Oct 23 22:21:07 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work 
[target_core_mod]
Oct 23 22:21:07 mbpc-pc kernel: ffff880111b8fa18 0000000000000400 
ffff880112180480 ffff880111b8f998
Oct 23 22:21:07 mbpc-pc kernel: ffff88011107a380 ffffffff81f99ca0 
ffffffff81f998ef ffff880100000000
Oct 23 22:21:07 mbpc-pc kernel: ffffffff812f27d9 0000000000000000 
ffffe8ffffcda000 ffff880000000000
Oct 23 22:21:07 mbpc-pc kernel: Call Trace:
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080169>] ? 
start_flush_work+0x49/0x180
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162e7ec>] 
schedule_timeout+0x9c/0xe0
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bd15c>] ? 
console_unlock+0x35c/0x380
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162cfa0>] 
wait_for_completion+0xc0/0xf0
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923e0>] ? 
try_to_wake_up+0x260/0x260
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6f84>] 
__transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bdd1f>] ? 
vprintk_default+0x1f/0x30
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f70c4>] 
transport_wait_for_tasks+0x44/0x60 [target_core_mod]
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f46e2>] 
core_tmr_abort_task+0xf2/0x160 [target_core_mod]
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6aa4>] 
target_tmr_work+0x154/0x160 [target_core_mod]
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080639>] 
process_one_work+0x189/0x4e0
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810d060c>] ? 
del_timer_sync+0x4c/0x60
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108131e>] ? 
maybe_create_worker+0x8e/0x110
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108150d>] 
worker_thread+0x16d/0x520
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923f2>] ? 
default_wake_function+0x12/0x20
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
__wake_up_common+0x56/0x90
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8109130e>] ? 
schedule_tail+0x1e/0xc0
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085f20>] ? 
kthread_freezable_should_stop+0x70/0x70
Oct 23 22:21:52 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Jan  7 00:44:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon 
successfully started
Oct 23 22:23:03 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts


And compare it to the following snippet 
(http://microdevsys.com/linux-lio/iostat-tkx-interesting-bit.txt) taken 
from this bigger iostat session 
(http://microdevsys.com/linux-lio/iostat-tkx.txt):




10/23/2016 10:18:19 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.25    0.00    0.50   15.83    0.00   83.42

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    2.00    0.00    80.00     0.00 
80.00     0.02   12.00  12.00   2.40
sdc               0.00     0.00    1.00    0.00    64.00     0.00 
128.00     0.00    2.00   2.00   0.20
sdd               0.00     0.00    1.00    0.00    48.00     0.00 
96.00     0.00    0.00   0.00   0.00
sde               0.00     0.00    2.00    0.00    64.00     0.00 
64.00     0.00    1.50   1.50   0.30
sdf               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.60    0.00   0.00  60.10
sdg               0.00     3.00    0.00    3.00     0.00    20.00 
13.33     0.03   10.00  10.00   3.00
sda               0.00     0.00    2.00    0.00    64.00     0.00 
64.00     0.00    2.00   2.00   0.40
sdh               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
fd0               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    5.00     0.00    20.00 
8.00     0.03    6.40   6.00   3.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    3.00    0.00   384.00     0.00 
256.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    3.00    0.00   384.00     0.00 
256.00     0.60    1.33 201.67  60.50

10/23/2016 10:18:20 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.25   25.19    0.00   74.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    1.00     0.00     2.50 
5.00     0.03   27.00  27.00   2.70
sdc               0.00     0.00    0.00    1.00     0.00     2.50 
5.00     0.01   15.00  15.00   1.50
sdd               0.00     0.00    0.00    1.00     0.00     2.50 
5.00     0.02   18.00  18.00   1.80
sde               0.00     0.00    0.00    1.00     0.00     2.50 
5.00     0.02   23.00  23.00   2.30
sdf               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     1.15    0.00   0.00 100.00
sdg               0.00     2.00    1.00    4.00     4.00   172.00 
70.40     0.04    8.40   2.80   1.40
sda               0.00     0.00    0.00    1.00     0.00     2.50 
5.00     0.04   37.00  37.00   3.70
sdh               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
fd0               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    1.00    6.00     4.00   172.00 
50.29     0.05    7.29   2.00   1.40
dm-1              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00 
0.00     1.00    0.00   0.00 100.00

10/23/2016 10:18:21 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.25   24.81    0.00   74.94

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00   0.00 100.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
fd0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00   0.00 100.00


We can see that /dev/sdf ramps up to 100% utilization starting at around 
10:18:18 PM (10/23/2016) and stays that way until about the 10:18:42 PM 
mark, when something occurs and it drops back below 100%.

So I checked the array which shows all clean, even across reboots:

[root@mbpc-pc ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8]
      3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      bitmap: 1/8 pages [4KB], 65536KB chunk

unused devices: <none>
[root@mbpc-pc ~]#


Then I ran smartctl across all disks and, sure enough, /dev/sdf printed this:

[root@mbpc-pc ~]# smartctl -A /dev/sdf
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Error SMART Values Read failed: scsi error badly formed scsi parameters
Smartctl: SMART Read Values failed.

=== START OF READ SMART DATA SECTION ===
[root@mbpc-pc ~]#

So it would appear we found the root cause: a bad disk.  True, the disk 
is bad and I'll be replacing it.  However, even with a degraded disk 
(checking now) the array functions just fine and I have no data loss.  I 
only lost one disk; I would have to lose three to get a catastrophic 
failure on this RAID6:

[root@mbpc-pc ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf[6](F) sda[5] sde[8] sdb[7] sdd[3] sdc[1]
      3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5] [UUUU_U]
      bitmap: 6/8 pages [24KB], 65536KB chunk

unused devices: <none>
[root@mbpc-pc ~]# mdadm --detail /dev/md0
/dev/md0:
         Version : 1.2
   Creation Time : Mon Mar 26 00:06:24 2012
      Raid Level : raid6
      Array Size : 3907045632 (3726.05 GiB 4000.81 GB)
   Used Dev Size : 976761408 (931.51 GiB 1000.20 GB)
    Raid Devices : 6
   Total Devices : 6
     Persistence : Superblock is persistent

   Intent Bitmap : Internal

     Update Time : Tue Oct 25 00:31:13 2016
           State : clean, degraded
  Active Devices : 5
Working Devices : 5
  Failed Devices : 1
   Spare Devices : 0

          Layout : left-symmetric
      Chunk Size : 64K

            Name : mbpc:0
            UUID : 2f36ac48:5e3e4c54:72177c53:bea3e41e
          Events : 118368

     Number   Major   Minor   RaidDevice State
        8       8       64        0      active sync   /dev/sde
        1       8       32        1      active sync   /dev/sdc
        7       8       16        2      active sync   /dev/sdb
        3       8       48        3      active sync   /dev/sdd
        8       0        0        8      removed
        5       8        0        5      active sync   /dev/sda

        6       8       80        -      faulty   /dev/sdf
[root@mbpc-pc ~]#

Last night I cut power to the /dev/sdf disk to spin it down, then 
removed it and reinserted it.  The array resynced without issue; 
however, the smartctl -A command still failed on it.  Today I checked 
and bad blocks had been recorded on the disk, and the array has since 
removed /dev/sdf (per above).  Also, I have to say that these ESXi hosts 
worked in this configuration, without any hiccup, for about 4 months.  
No LUN failure on the ESXi side.  I haven't changed the LUN in that time 
(had no reason to do so).
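When the replacement disk arrives, the usual md sequence is remove, swap, re-add. A dry-run sketch (device and array names assumed from above; the `echo` prefix keeps it from touching anything, and partitioning of the new disk is omitted):

```shell
# Planned replacement sequence for the faulty RAID6 member.
# Remove the echo prefix to execute for real once the new disk is installed.
failed=/dev/sdf
array=/dev/md0
echo "mdadm $array --remove $failed"    # drop the faulty member from md0
echo "mdadm $array --add $failed"       # add the replacement disk back in
echo "mdadm --detail $array"            # confirm the resync has started
```

With RAID6 and a write-intent bitmap, the resync after `--add` only has to rebuild one member's worth of data.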

So now here's the real question that I have.  Why would the array 
continue to function as intended with only one disk failure, yet the 
QLogic / Target drivers stop and error out?  The software RAID6 array 
is what should care about the failure, and it should handle it.  The 
QLogic / Target drivers shouldn't really be too impacted (aside from 
read speed, maybe) by a disk failing inside the array.  That would be my 
thinking.  The Target / QLogic software seems to have picked up on a 
failure ahead of the software RAID6 detecting it.  I've had this RAID6 
for over 6 years now; aside from the occasional disk replacement, it's 
been quite rock solid.

So anyway, I added the fix you pointed out to the 4.8.4 kernel and 
recompiled.  I restarted it, with the RAID6 degraded as it is.  
Everything mounted fine and I checked the LUNs from the ESXi side:

[root@mbpc-pc ~]# /etc/init.d/target start
The Linux SCSI Target is already stopped                   [  OK  ]
[info] The Linux SCSI Target looks properly installed.
The configfs filesystem was not mounted, consider adding it[WARNING]
[info] Loaded core module target_core_mod.
[info] Loaded core module target_core_pscsi.
[info] Loaded core module target_core_iblock.
[info] Loaded core module target_core_file.
Failed to load fabric module ib_srpt                       [WARNING]
Failed to load fabric module tcm_usb_gadget                [WARNING]
[info] Loaded fabric module tcm_loop.
[info] Loaded fabric module tcm_fc.
Failed to load fabric module vhost_scsi                    [WARNING]
[info] Loaded fabric module tcm_qla2xxx.
Failed to load fabric module iscsi_target_mod              [WARNING]
[info] Loading config from /etc/target/scsi_target.lio, this may take 
several minutes for FC adapters.
[info] Loaded /etc/target/scsi_target.lio.
Started The Linux SCSI Target                              [  OK  ]
[root@mbpc-pc ~]#


Enabled the brocade ports:


  18  18   011200   id    N4   No_Light    FC
  19  19   011300   id    N4   No_Sync     FC  Disabled (Persistent)
  20  20   011400   id    N4   No_Light    FC
  21  21   011500   id    N4   No_Light    FC
  22  22   011600   id    N4   No_Light    FC
  23  23   011700   id    N4   No_Light    FC  Disabled (Persistent)
  24  24   011800   --    N4   No_Module   FC  (No POD License) Disabled
  25  25   011900   --    N4   No_Module   FC  (No POD License) Disabled
  26  26   011a00   --    N4   No_Module   FC  (No POD License) Disabled
  27  27   011b00   --    N4   No_Module   FC  (No POD License) Disabled
  28  28   011c00   --    N4   No_Module   FC  (No POD License) Disabled
  29  29   011d00   --    N4   No_Module   FC  (No POD License) Disabled
  30  30   011e00   --    N4   No_Module   FC  (No POD License) Disabled
  31  31   011f00   --    N4   No_Module   FC  (No POD License) Disabled
sw0:admin> portcfgpersistentenable 19
sw0:admin> portcfgpersistentenable 23
sw0:admin> date
Tue Oct 25 04:03:42 UTC 2016
sw0:admin>

And still after 30 minutes, there is no failure.  This run includes the 
fix you asked me to add 
(https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e).  
If everything keeps working, I will revert the patch and see if I can 
reproduce the issue.  If I can reproduce it, then the disk might not 
have been the culprit, but the missing patch was.  I'll keep you posted 
on that when I get a new disk tomorrow.


Right now this is a POC setup so I have lots of room to experiment.

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.



* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-25  5:28       ` TomK
@ 2016-10-26  2:05         ` TomK
  2016-10-26  7:20           ` Nicholas A. Bellinger
  0 siblings, 1 reply; 14+ messages in thread
From: TomK @ 2016-10-26  2:05 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali

On 10/25/2016 1:28 AM, TomK wrote:
> On 10/24/2016 2:36 AM, Nicholas A. Bellinger wrote:
>> Hi TomK,
>>
>> Thanks for reporting this bug.  Comments inline below.
>>
>> On Mon, 2016-10-24 at 00:45 -0400, TomK wrote:
>>> On 10/24/2016 12:32 AM, TomK wrote:
>>>> On 10/23/2016 10:03 PM, TomK wrote:
>>>>> Hey,
>>>>>
>>>>> Has anyone seen this and could have a workaround?  Seems like it is
>>>>> more
>>>>> Kernel related with various apps not just target apparently not but
>>>>> wondering if there is an interim solution
>>>>> (https://access.redhat.com/solutions/408833)
>>>>>
>>>>> Getting this message after few minutes of usage from the QLA2xxx
>>>>> driver.
>>>>>  This is after some activity on an ESXi server (15 VM's) that I'm
>>>>> connecting to this HBA.  I've tried the following tuning parameters
>>>>> but
>>>>> there was no change in behaviour:
>>>>>
>>>>> vm.dirty_background_ratio = 5
>>>>> vm.dirty_ratio = 10
>>>>>
>>>>> Details:
>>>>>
>>>>>
>>>>> Oct 23 21:28:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>>> Oct 23 21:28:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>>>>> task_tag: 1128612
>>>>> Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Sending
>>>>> TMR_FUNCTION_COMPLETE for ref_tag: 1128612
>>>>> Oct 23 21:28:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>>>>> task_tag: 1129116
>>
>> You are likely hitting a known v4.1+ regression, not yet merged up to
>> v4.8.y code:
>>
>> https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e
>>
>>
>>>>> Jan  6 23:52:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>>> successfully started
>>>>> Oct 23 21:30:18 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>>> Jan  6 23:54:01 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>>> successfully started
>>>>> Oct 23 21:32:16 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>>> Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/u16:8:289
>>>>> blocked for
>>>>> more than 120 seconds.
>>>>> Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>>> Oct 23 21:32:24 mbpc-pc kernel: "echo 0 >
>>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> Oct 23 21:32:24 mbpc-pc kernel: kworker/u16:8   D
>>>>> ffff88011113ba18     0
>>>>>   289      2 0x00000000
>>>>> Oct 23 21:32:24 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>>>> [target_core_mod]
>>>>> Oct 23 21:32:24 mbpc-pc kernel: ffff88011113ba18 0000000000000400
>>>>> ffff880049e926c0 ffff88011113b998
>>>>> Oct 23 21:32:24 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
>>>>> ffffffff81f998ef ffff880100000000
>>>>> Oct 23 21:32:24 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>>>>> ffffe8ffffc9a000 ffff880000000000
>>>>> Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff812f27d9>] ?
>>>>> number+0x2e9/0x310
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>]
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080169>] ?
>>>>> start_flush_work+0x49/0x180
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>>> schedule_timeout+0x9c/0xe0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810802ba>] ?
>>>>> flush_work+0x1a/0x40
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>>>>> console_unlock+0x35c/0x380
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>>> wait_for_completion+0xc0/0xf0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>>> try_to_wake_up+0x260/0x260
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6f84>]
>>>>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>>>>> vprintk_default+0x1f/0x30
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8115cc5c>] ?
>>>>> printk+0x46/0x48
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f70c4>]
>>>>> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f46e2>]
>>>>> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>>>>> target_tmr_work+0x154/0x160 [target_core_mod]
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>]
>>>>> process_one_work+0x189/0x4e0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>]
>>>>> worker_thread+0x16d/0x520
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>>> default_wake_function+0x12/0x20
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>>> __wake_up_common+0x56/0x90
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>>> schedule_tail+0x1e/0xc0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>>> ret_from_fork+0x1f/0x40
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>>> kthread_freezable_should_stop+0x70/0x70
>>>>> Oct 23 21:32:24 mbpc-pc kernel: INFO: task kworker/1:48:6089
>>>>> blocked for
>>>>> more than 120 seconds.
>>>>> Oct 23 21:32:24 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>>> Oct 23 21:32:24 mbpc-pc kernel: "echo 0 >
>>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> Oct 23 21:32:24 mbpc-pc kernel: kworker/1:48    D
>>>>> ffff88004017f968     0
>>>>>  6089      2 0x00000080
>>>>> Oct 23 21:32:24 mbpc-pc kernel: Workqueue: events
>>>>> qlt_free_session_done
>>>>> [qla2xxx]
>>>>> Oct 23 21:32:24 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8
>>>>> ffff88011a83a300 0000000000000004
>>>>> Oct 23 21:32:24 mbpc-pc kernel: ffff88004017a600 ffff88004017f938
>>>>> ffffffff810a0bb6 ffff880100000000
>>>>> Oct 23 21:32:24 mbpc-pc kernel: ffff880110fd0840 ffff880000000000
>>>>> ffffffff81090728 ffff880100000000
>>>>> Oct 23 21:32:24 mbpc-pc kernel: Call Trace:
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a0bb6>] ?
>>>>> enqueue_task_fair+0x66/0x410
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090728>] ?
>>>>> check_preempt_curr+0x78/0x90
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109075d>] ?
>>>>> ttwu_do_wakeup+0x1d/0xf0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>]
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81090de0>] ?
>>>>> ttwu_queue+0x180/0x190
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>>> schedule_timeout+0x9c/0xe0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>>> wait_for_completion+0xc0/0xf0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>>> try_to_wake_up+0x260/0x260
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa08f76ad>]
>>>>> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa00e7188>] ?
>>>>> qla2x00_post_work+0x58/0x70 [qla2xxx]
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa0286f69>]
>>>>> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffffa01447e9>]
>>>>> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff815092fc>] ?
>>>>> dbs_work_handler+0x5c/0x90
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8107f960>] ?
>>>>> pwq_dec_nr_in_flight+0x50/0xa0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81080639>]
>>>>> process_one_work+0x189/0x4e0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810d060c>] ?
>>>>> del_timer_sync+0x4c/0x60
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108131e>] ?
>>>>> maybe_create_worker+0x8e/0x110
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8108150d>]
>>>>> worker_thread+0x16d/0x520
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>>> default_wake_function+0x12/0x20
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>>> __wake_up_common+0x56/0x90
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>>> schedule_tail+0x1e/0xc0
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>>> ret_from_fork+0x1f/0x40
>>>>> Oct 23 21:32:24 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>>> kthread_freezable_should_stop+0x70/0x70
>>>>> Jan  6 23:56:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>>> successfully started
>>>>> Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>>> Oct 23 21:34:22 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
>>>>> Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/u16:8:289
>>>>> blocked for
>>>>> more than 120 seconds.
>>>>> Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>>> Oct 23 21:34:27 mbpc-pc kernel: "echo 0 >
>>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> Oct 23 21:34:27 mbpc-pc kernel: kworker/u16:8   D
>>>>> ffff88011113ba18     0
>>>>>   289      2 0x00000000
>>>>> Oct 23 21:34:27 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>>>> [target_core_mod]
>>>>> Oct 23 21:34:27 mbpc-pc kernel: ffff88011113ba18 0000000000000400
>>>>> ffff880049e926c0 ffff88011113b998
>>>>> Oct 23 21:34:27 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
>>>>> ffffffff81f998ef ffff880100000000
>>>>> Oct 23 21:34:27 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>>>>> ffffe8ffffc9a000 ffff880000000000
>>>>> Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff812f27d9>] ?
>>>>> number+0x2e9/0x310
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>]
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080169>] ?
>>>>> start_flush_work+0x49/0x180
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>>> schedule_timeout+0x9c/0xe0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810802ba>] ?
>>>>> flush_work+0x1a/0x40
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>>>>> console_unlock+0x35c/0x380
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>>> wait_for_completion+0xc0/0xf0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>>> try_to_wake_up+0x260/0x260
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6f84>]
>>>>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>>>>> vprintk_default+0x1f/0x30
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8115cc5c>] ?
>>>>> printk+0x46/0x48
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f70c4>]
>>>>> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f46e2>]
>>>>> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>>>>> target_tmr_work+0x154/0x160 [target_core_mod]
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>]
>>>>> process_one_work+0x189/0x4e0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>]
>>>>> worker_thread+0x16d/0x520
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>>> default_wake_function+0x12/0x20
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>>> __wake_up_common+0x56/0x90
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>>> schedule_tail+0x1e/0xc0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>>> ret_from_fork+0x1f/0x40
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>>> kthread_freezable_should_stop+0x70/0x70
>>>>> Oct 23 21:34:27 mbpc-pc kernel: INFO: task kworker/1:48:6089
>>>>> blocked for
>>>>> more than 120 seconds.
>>>>> Oct 23 21:34:27 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>>> Oct 23 21:34:27 mbpc-pc kernel: "echo 0 >
>>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> Oct 23 21:34:27 mbpc-pc kernel: kworker/1:48    D
>>>>> ffff88004017f968     0
>>>>>  6089      2 0x00000080
>>>>> Oct 23 21:34:27 mbpc-pc kernel: Workqueue: events
>>>>> qlt_free_session_done
>>>>> [qla2xxx]
>>>>> Oct 23 21:34:27 mbpc-pc kernel: ffff88004017f968 ffff88004017f8f8
>>>>> ffff88011a83a300 0000000000000004
>>>>> Oct 23 21:34:27 mbpc-pc kernel: ffff88004017a600 ffff88004017f938
>>>>> ffffffff810a0bb6 ffff880100000000
>>>>> Oct 23 21:34:27 mbpc-pc kernel: ffff880110fd0840 ffff880000000000
>>>>> ffffffff81090728 ffff880100000000
>>>>> Oct 23 21:34:27 mbpc-pc kernel: Call Trace:
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a0bb6>] ?
>>>>> enqueue_task_fair+0x66/0x410
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090728>] ?
>>>>> check_preempt_curr+0x78/0x90
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109075d>] ?
>>>>> ttwu_do_wakeup+0x1d/0xf0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>]
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81090de0>] ?
>>>>> ttwu_queue+0x180/0x190
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>>> schedule_timeout+0x9c/0xe0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>>> wait_for_completion+0xc0/0xf0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>>> try_to_wake_up+0x260/0x260
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa08f76ad>]
>>>>> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa00e7188>] ?
>>>>> qla2x00_post_work+0x58/0x70 [qla2xxx]
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa0286f69>]
>>>>> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffffa01447e9>]
>>>>> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff815092fc>] ?
>>>>> dbs_work_handler+0x5c/0x90
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8107f960>] ?
>>>>> pwq_dec_nr_in_flight+0x50/0xa0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81080639>]
>>>>> process_one_work+0x189/0x4e0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810d060c>] ?
>>>>> del_timer_sync+0x4c/0x60
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108131e>] ?
>>>>> maybe_create_worker+0x8e/0x110
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8108150d>]
>>>>> worker_thread+0x16d/0x520
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>>> default_wake_function+0x12/0x20
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>>> __wake_up_common+0x56/0x90
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>>> schedule_tail+0x1e/0xc0
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>>> ret_from_fork+0x1f/0x40
>>>>> Oct 23 21:34:27 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>>> kthread_freezable_should_stop+0x70/0x70
>>>>> Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>>> Oct 23 21:36:04 mbpc-pc kernel: hpet1: lost 3 rtc interrupts
>>>>> Jan  6 23:58:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>>> successfully started
>>>>> Oct 23 21:36:30 mbpc-pc kernel: INFO: task kworker/u16:8:289
>>>>> blocked for
>>>>> more than 120 seconds.
>>>>> Oct 23 21:36:30 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>>> Oct 23 21:36:30 mbpc-pc kernel: "echo 0 >
>>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> Oct 23 21:36:30 mbpc-pc kernel: kworker/u16:8   D
>>>>> ffff88011113ba18     0
>>>>>   289      2 0x00000000
>>>>> Oct 23 21:36:30 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>>>> [target_core_mod]
>>>>> Oct 23 21:36:30 mbpc-pc kernel: ffff88011113ba18 0000000000000400
>>>>> ffff880049e926c0 ffff88011113b998
>>>>> Oct 23 21:36:30 mbpc-pc kernel: ffff880111134600 ffffffff81f99ca0
>>>>> ffffffff81f998ef ffff880100000000
>>>>> Oct 23 21:36:30 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>>>>> ffffe8ffffc9a000 ffff880000000000
>>>>> Oct 23 21:36:30 mbpc-pc kernel: Call Trace:
>>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff812f27d9>] ?
>>>>> number+0x2e9/0x310
>>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162c040>]
>>>>> schedule+0x40/0xb0
>>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff81080169>] ?
>>>>> start_flush_work+0x49/0x180
>>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>>> schedule_timeout+0x9c/0xe0
>>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810802ba>] ?
>>>>> flush_work+0x1a/0x40
>>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>>>>> console_unlock+0x35c/0x380
>>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>>> wait_for_completion+0xc0/0xf0
>>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>>> try_to_wake_up+0x260/0x260
>>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffffa08f6f84>]
>>>>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>>>>> Oct 23 21:36:30 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>>>>> vprintk_default+0x1f/0x30
>>>>>
>>>>>
>>>>
>>>>
>>>> Including the full log:
>>>>
>>>> http://microdevsys.com/linux-lio/messages-mailing-list
>>>>
>>>
>>
>> Thanks for posting with qla2xxx verbose debug enabled on your setup.
>>
>>>
>>> When trying to shut down the target using /etc/init.d/target stop, the
>>> following is printed repeatedly:
>>>
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20:
>>> ABTS_RECV_24XX: instance 0
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20:
>>> qla_target(0): task abort (s_id=1:5:0, tag=1177068, param=0)
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f812:20:
>>> qla_target(0): task abort for non-existant session
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80e:20:
>>> Scheduling work (type 1, prm ffff880093365680) to find session for param
>>> ffff88010f8c7680 (size 64, tgt ffff880111f06600)
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f800:20: Sess
>>> work (tgt ffff880111f06600)
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending
>>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff880093365694,
>>> status=4
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>> ABTS_RESP_24XX: compl_status 31
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending
>>> retry TERM EXCH CTIO7 (ha=ffff88010fae0000)
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending
>>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff88010f8c76c0,
>>> status=0
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>> ABTS_RESP_24XX: compl_status 0
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 029c
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-3861:20: New
>>> command while device ffff880111f06600 is shutting down
>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e859:20:
>>> qla_target: Unable to send command to target for req, ignoring.
>>>
>>>
>>
>> At your earliest convenience, please verify the patch using v4.8.y with
>> the above ABORT_TASK + shutdown scenario.
>>
>> Also, it would be helpful to understand why this ESX FC host is
>> generating ABORT_TASKs.
>>
>> Eg: Is ABORT_TASK generated due to FC target response packet loss..?
>> Or due to target backend I/O latency, that ultimately triggers FC host
>> side timeouts...?
>>
>>>
>>> + when I disable the ports on the brocade switch that we're using then
>>> try to stop target, the following is printed:
>>>
>>>
>>>
>>> Oct 24 00:41:31 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop
>>> down - seconds remaining 231.
>>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop
>>> down - seconds remaining 153.
>>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
>>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at
>>> lib/list_debug.c:33 __list_add+0xbe/0xd0
>>> Oct 24 00:41:32 mbpc-pc kernel: list_add corruption. prev->next should
>>> be next (ffff88009e83b330), but was ffff88011fc972a0.
>>> (prev=ffff880118ada4c0).
>>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
>>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
>>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii
>>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
>>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm
>>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4
>>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
>>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
>>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>>> dm_region_hash dm_log dm_mod
>>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3 Not
>>> tainted 4.8.4 #2
>>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
>>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48
>>> ffffffff812e88e9 ffffffff8130753e
>>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8
>>> 0000000000000000 ffff880092b83b98
>>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952
>>> 0000002100000046 ffffffff8101eae8
>>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>]
>>> dump_stack+0x51/0x78
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] ?
>>> __list_add+0xbe/0xd0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ?
>>> __switch_to+0x398/0x7e0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>]
>>> warn_slowpath_fmt+0x49/0x50
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>]
>>> __list_add+0xbe/0xd0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>]
>>> move_linked_works+0x62/0x90
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>]
>>> process_one_work+0x25c/0x4e0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>> schedule+0x40/0xb0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>]
>>> worker_thread+0x16d/0x520
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
>>> __schedule+0x2fd/0x6a0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>> default_wake_function+0x12/0x20
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>> __wake_up_common+0x56/0x90
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>> maybe_create_worker+0x110/0x110
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>> schedule+0x40/0xb0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>> maybe_create_worker+0x110/0x110
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>> schedule_tail+0x1e/0xc0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>]
>>> ret_from_fork+0x1f/0x40
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>> kthread_freezable_should_stop+0x70/0x70
>>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f464 ]---
>>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
>>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at
>>> lib/list_debug.c:36 __list_add+0x9c/0xd0
>>> Oct 24 00:41:32 mbpc-pc kernel: list_add double add:
>>> new=ffff880118ada4c0, prev=ffff880118ada4c0, next=ffff88009e83b330.
>>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
>>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
>>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii
>>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
>>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm
>>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4
>>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
>>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
>>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>>> dm_region_hash dm_log dm_mod
>>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3
>>> Tainted: G        W       4.8.4 #2
>>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
>>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48
>>> ffffffff812e88e9 ffffffff8130751c
>>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8
>>> 0000000000000000 ffff880092b83b98
>>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952
>>> 0000002400000046 ffffffff8101eae8
>>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>]
>>> dump_stack+0x51/0x78
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] ?
>>> __list_add+0x9c/0xd0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ?
>>> __switch_to+0x398/0x7e0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>]
>>> warn_slowpath_fmt+0x49/0x50
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>]
>>> __list_add+0x9c/0xd0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>]
>>> move_linked_works+0x62/0x90
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>]
>>> process_one_work+0x25c/0x4e0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>> schedule+0x40/0xb0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>]
>>> worker_thread+0x16d/0x520
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
>>> __schedule+0x2fd/0x6a0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>> default_wake_function+0x12/0x20
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>> __wake_up_common+0x56/0x90
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>> maybe_create_worker+0x110/0x110
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>> schedule+0x40/0xb0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>> maybe_create_worker+0x110/0x110
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>> schedule_tail+0x1e/0xc0
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>]
>>> ret_from_fork+0x1f/0x40
>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>> kthread_freezable_should_stop+0x70/0x70
>>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f465 ]---
>>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop
>>> down - seconds remaining 230.
>>> Oct 24 00:41:33 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop
>>> down - seconds remaining 152.
>>>
>>>
>>
>> Mmmm.  Could be a side effect of the target-core regression, but not
>> completely sure..
>>
>> Adding QLOGIC folks CC'.
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> Hey Nicholas,
>
>
>> At your earliest convenience, please verify the patch using v4.8.y with
>> the above ABORT_TASK + shutdown scenario.
>>
>> Also, it would be helpful to understand why this ESX FC host is
>> generating ABORT_TASKs.
>>
>> Eg: Is ABORT_TASK generated due to FC target response packet loss..?
>> Or due to target backend I/O latency, that ultimately triggers FC host
>> side timeouts...?
>
>
> Here is where it gets interesting and to your thought above.  Take for
> example this log snippet
> (http://microdevsys.com/linux-lio/messages-recent):
>
> Oct 23 22:12:51 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Jan  7 00:36:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> successfully started
> Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
> Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 1 rtc interrupts
> Oct 23 22:15:02 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Oct 23 22:15:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Jan  7 00:38:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> successfully started
> Oct 23 22:16:29 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
> Oct 23 22:17:30 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Jan  7 00:40:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> successfully started
> Oct 23 22:18:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
> task_tag: 1195032
> Oct 23 22:18:33 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Sending
> TMR_FUNCTION_COMPLETE for ref_tag: 1195032
> Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
> task_tag: 1122276
> Oct 23 22:19:35 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Jan  7 00:42:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> successfully started
> Oct 23 22:20:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Oct 23 22:21:07 mbpc-pc kernel: INFO: task kworker/u16:8:308 blocked for
> more than 120 seconds.
> Oct 23 22:21:07 mbpc-pc kernel:      Not tainted 4.8.4 #2
> Oct 23 22:21:07 mbpc-pc kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 23 22:21:07 mbpc-pc kernel: kworker/u16:8   D ffff880111b8fa18     0
>   308      2 0x00000000
> Oct 23 22:21:07 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
> [target_core_mod]
> Oct 23 22:21:07 mbpc-pc kernel: ffff880111b8fa18 0000000000000400
> ffff880112180480 ffff880111b8f998
> Oct 23 22:21:07 mbpc-pc kernel: ffff88011107a380 ffffffff81f99ca0
> ffffffff81f998ef ffff880100000000
> Oct 23 22:21:07 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
> ffffe8ffffcda000 ffff880000000000
> Oct 23 22:21:07 mbpc-pc kernel: Call Trace:
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080169>] ?
> start_flush_work+0x49/0x180
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162e7ec>]
> schedule_timeout+0x9c/0xe0
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bd15c>] ?
> console_unlock+0x35c/0x380
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162cfa0>]
> wait_for_completion+0xc0/0xf0
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923e0>] ?
> try_to_wake_up+0x260/0x260
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6f84>]
> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
> vprintk_default+0x1f/0x30
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f70c4>]
> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f46e2>]
> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6aa4>]
> target_tmr_work+0x154/0x160 [target_core_mod]
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080639>]
> process_one_work+0x189/0x4e0
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810d060c>] ?
> del_timer_sync+0x4c/0x60
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108131e>] ?
> maybe_create_worker+0x8e/0x110
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108150d>]
> worker_thread+0x16d/0x520
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923f2>] ?
> default_wake_function+0x12/0x20
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> __wake_up_common+0x56/0x90
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ?
> maybe_create_worker+0x110/0x110
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ?
> maybe_create_worker+0x110/0x110
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8109130e>] ?
> schedule_tail+0x1e/0xc0
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162f60f>]
> ret_from_fork+0x1f/0x40
> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085f20>] ?
> kthread_freezable_should_stop+0x70/0x70
> Oct 23 22:21:52 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Jan  7 00:44:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> successfully started
> Oct 23 22:23:03 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>
>
> And compare it to the following snippet
> (http://microdevsys.com/linux-lio/iostat-tkx-interesting-bit.txt) taken
> from this bigger iostat session
> (http://microdevsys.com/linux-lio/iostat-tkx.txt):
>
>
>
>
> 10/23/2016 10:18:19 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.25    0.00    0.50   15.83    0.00   83.42
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
> avgrq-sz avgqu-sz   await  svctm  %util
> sdb               0.00     0.00    2.00    0.00    80.00     0.00
> 80.00     0.02   12.00  12.00   2.40
> sdc               0.00     0.00    1.00    0.00    64.00     0.00
> 128.00     0.00    2.00   2.00   0.20
> sdd               0.00     0.00    1.00    0.00    48.00     0.00
> 96.00     0.00    0.00   0.00   0.00
> sde               0.00     0.00    2.00    0.00    64.00     0.00
> 64.00     0.00    1.50   1.50   0.30
> sdf               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.60    0.00   0.00  60.10
> sdg               0.00     3.00    0.00    3.00     0.00    20.00
> 13.33     0.03   10.00  10.00   3.00
> sda               0.00     0.00    2.00    0.00    64.00     0.00
> 64.00     0.00    2.00   2.00   0.40
> sdh               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdk               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdi               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> fd0               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.00    5.00     0.00    20.00
> 8.00     0.03    6.40   6.00   3.00
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-2              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> md0               0.00     0.00    3.00    0.00   384.00     0.00
> 256.00     0.00    0.00   0.00   0.00
> dm-3              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-4              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-5              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-6              0.00     0.00    3.00    0.00   384.00     0.00
> 256.00     0.60    1.33 201.67  60.50
>
> 10/23/2016 10:18:20 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    0.25   25.19    0.00   74.56
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
> avgrq-sz avgqu-sz   await  svctm  %util
> sdb               0.00     0.00    0.00    1.00     0.00     2.50
> 5.00     0.03   27.00  27.00   2.70
> sdc               0.00     0.00    0.00    1.00     0.00     2.50
> 5.00     0.01   15.00  15.00   1.50
> sdd               0.00     0.00    0.00    1.00     0.00     2.50
> 5.00     0.02   18.00  18.00   1.80
> sde               0.00     0.00    0.00    1.00     0.00     2.50
> 5.00     0.02   23.00  23.00   2.30
> sdf               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     1.15    0.00   0.00 100.00
> sdg               0.00     2.00    1.00    4.00     4.00   172.00
> 70.40     0.04    8.40   2.80   1.40
> sda               0.00     0.00    0.00    1.00     0.00     2.50
> 5.00     0.04   37.00  37.00   3.70
> sdh               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdk               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdi               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> fd0               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    1.00    6.00     4.00   172.00
> 50.29     0.05    7.29   2.00   1.40
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-2              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> md0               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-3              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-4              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-5              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-6              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     1.00    0.00   0.00 100.00
>
> 10/23/2016 10:18:21 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    0.25   24.81    0.00   74.94
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
> avgrq-sz avgqu-sz   await  svctm  %util
> sdb               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdc               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdd               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sde               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdf               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     2.00    0.00   0.00 100.00
> sdg               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sda               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdh               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdk               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> sdi               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> fd0               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-2              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> md0               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-3              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-4              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-5              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
> dm-6              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     1.00    0.00   0.00 100.00
>
>
> We can see that /dev/sdf ramps up to 100% utilization at around 10/23/2016
> 10:18:18 PM and stays there until about the 10/23/2016 10:18:42 PM mark,
> when something occurs and it drops back below 100%.
>
> So I checked the array, which shows all clean, even across reboots:
>
> [root@mbpc-pc ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8]
>       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6]
> [UUUUUU]
>       bitmap: 1/8 pages [4KB], 65536KB chunk
>
> unused devices: <none>
> [root@mbpc-pc ~]#
>
>
> Then I ran smartctl across all disks and, sure enough, /dev/sdf prints this:
>
> [root@mbpc-pc ~]# smartctl -A /dev/sdf
> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build)
> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Error SMART Values Read failed: scsi error badly formed scsi parameters
> Smartctl: SMART Read Values failed.
>
> === START OF READ SMART DATA SECTION ===
> [root@mbpc-pc ~]#
>
> So it would appear we found the root cause: a bad disk.  True, the disk
> is bad and I'll be replacing it; however, even with a degraded disk
> (checking now) the array functions just fine and I have no data loss.  I
> only lost one disk.  I would have to lose three to get a catastrophic
> failure on this RAID6:
>
> [root@mbpc-pc ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdf[6](F) sda[5] sde[8] sdb[7] sdd[3] sdc[1]
>       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5]
> [UUUU_U]
>       bitmap: 6/8 pages [24KB], 65536KB chunk
>
> unused devices: <none>
> [root@mbpc-pc ~]# mdadm --detail /dev/md0
> /dev/md0:
>         Version : 1.2
>   Creation Time : Mon Mar 26 00:06:24 2012
>      Raid Level : raid6
>      Array Size : 3907045632 (3726.05 GiB 4000.81 GB)
>   Used Dev Size : 976761408 (931.51 GiB 1000.20 GB)
>    Raid Devices : 6
>   Total Devices : 6
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Tue Oct 25 00:31:13 2016
>           State : clean, degraded
>  Active Devices : 5
> Working Devices : 5
>  Failed Devices : 1
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>            Name : mbpc:0
>            UUID : 2f36ac48:5e3e4c54:72177c53:bea3e41e
>          Events : 118368
>
>     Number   Major   Minor   RaidDevice State
>        8       8       64        0      active sync   /dev/sde
>        1       8       32        1      active sync   /dev/sdc
>        7       8       16        2      active sync   /dev/sdb
>        3       8       48        3      active sync   /dev/sdd
>        8       0        0        8      removed
>        5       8        0        5      active sync   /dev/sda
>
>        6       8       80        -      faulty   /dev/sdf
> [root@mbpc-pc ~]#
>
> Last night I cut power to the /dev/sdf disk to spin it down, then
> removed it and reinserted it.  The array resynced without issue;
> however, the smartctl -A command still failed on it.  Today I checked,
> and bad blocks were recorded on the disk, and the array has since
> removed /dev/sdf (per above).  Also, I have to say that these ESXi
> hosts worked in this configuration, without any hiccup, for about 4
> months.  No LUN failure on the ESXi side.  I haven't changed the LUN in
> that time (had no reason to do so).
>
> So now here's the real question that I have.  Why would the array
> continue to function as intended with only one disk failure, yet the
> QLogic / Target drivers stop and error out?  The RAID6 (software) array
> should care about the failure, and it should handle it.  The QLogic /
> Target drivers shouldn't really be too impacted (aside from read speed,
> maybe) by a disk failing inside the array.  That would be my thinking.
> The Target / QLogic software seems to have picked up on a failure ahead
> of the software RAID6 detecting it.  I've had this RAID6 for over 6
> years now.  Aside from the occasional disk replacement, it has been
> quite rock solid.
>
> So anyway, I added the fix you pointed out to the 4.8.4 kernel and
> recompiled.  I restarted it with the RAID6 degraded as it is.  All
> mounted fine, and I checked the LUNs from the ESXi side:
>
> [root@mbpc-pc ~]# /etc/init.d/target start
> The Linux SCSI Target is already stopped                   [  OK  ]
> [info] The Linux SCSI Target looks properly installed.
> The configfs filesystem was not mounted, consider adding it[WARNING]
> [info] Loaded core module target_core_mod.
> [info] Loaded core module target_core_pscsi.
> [info] Loaded core module target_core_iblock.
> [info] Loaded core module target_core_file.
> Failed to load fabric module ib_srpt                       [WARNING]
> Failed to load fabric module tcm_usb_gadget                [WARNING]
> [info] Loaded fabric module tcm_loop.
> [info] Loaded fabric module tcm_fc.
> Failed to load fabric module vhost_scsi                    [WARNING]
> [info] Loaded fabric module tcm_qla2xxx.
> Failed to load fabric module iscsi_target_mod              [WARNING]
> [info] Loading config from /etc/target/scsi_target.lio, this may take
> several minutes for FC adapters.
> [info] Loaded /etc/target/scsi_target.lio.
> Started The Linux SCSI Target                              [  OK  ]
> [root@mbpc-pc ~]#
>
>
> Enabled the brocade ports:
>
>
>  18  18   011200   id    N4   No_Light    FC
>  19  19   011300   id    N4   No_Sync     FC  Disabled (Persistent)
>  20  20   011400   id    N4   No_Light    FC
>  21  21   011500   id    N4   No_Light    FC
>  22  22   011600   id    N4   No_Light    FC
>  23  23   011700   id    N4   No_Light    FC  Disabled (Persistent)
>  24  24   011800   --    N4   No_Module   FC  (No POD License) Disabled
>  25  25   011900   --    N4   No_Module   FC  (No POD License) Disabled
>  26  26   011a00   --    N4   No_Module   FC  (No POD License) Disabled
>  27  27   011b00   --    N4   No_Module   FC  (No POD License) Disabled
>  28  28   011c00   --    N4   No_Module   FC  (No POD License) Disabled
>  29  29   011d00   --    N4   No_Module   FC  (No POD License) Disabled
>  30  30   011e00   --    N4   No_Module   FC  (No POD License) Disabled
>  31  31   011f00   --    N4   No_Module   FC  (No POD License) Disabled
> sw0:admin> portcfgpersistentenable 19
> sw0:admin> portcfgpersistentenable 23
> sw0:admin> date
> Tue Oct 25 04:03:42 UTC 2016
> sw0:admin>
>
> And still, after 30 minutes there is no failure.  This run includes the
> fix you asked me to add
> (https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e).
> If everything works, I will revert the patch and see if I can reproduce
> the issue.  If I can reproduce it, then the disk might not have been the
> cause, but rather the patch.  I'll keep you posted on that when I get a
> new disk tomorrow.
>
>
> Right now this is a POC setup so I have lots of room to experiment.
>

Hey Nicholas,

I've done some testing up till now.  With or without the patch above, as 
long as the faulty disk is removed from the RAID 6 software array, 
everything works fine with the Target driver and the ESXi hosts.  This 
holds even on a degraded array:

[root@mbpc-pc ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdc[1] sda[5] sdd[3] sde[8] sdb[7]
       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5] 
[UUUU_U]
       bitmap: 6/8 pages [24KB], 65536KB chunk

unused devices: <none>
[root@mbpc-pc ~]#
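As an aside, the failed-member state can also be read programmatically
out of /proc/mdstat text.  A minimal sketch, purely hypothetical and not
part of any tool discussed in this thread; it assumes the member layout
shown in the pastes above, where a failed member is marked "(F)":

```python
import re

# Hypothetical sketch: parse an md array's member list out of
# /proc/mdstat text and separate active members from failed "(F)" ones.

def md_members(mdstat_text, array="md0"):
    """Return (active, failed) device-name lists for the given array."""
    active, failed = [], []
    for line in mdstat_text.splitlines():
        if not line.startswith(array + " :"):
            continue
        # Member entries look like "sdf[6](F)" or "sdc[1]".
        for name, flag in re.findall(r"(\w+)\[\d+\](\(F\))?", line):
            (failed if flag else active).append(name)
    return active, failed

sample = "md0 : active raid6 sdf[6](F) sda[5] sde[8] sdb[7] sdd[3] sdc[1]"
print(md_members(sample))  # -> (['sda', 'sde', 'sdb', 'sdd', 'sdc'], ['sdf'])
```

Something like this would have flagged /dev/sdf the moment the kernel
marked it faulty, without waiting for a manual `cat /proc/mdstat`.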

So at the moment, the data points to the single failed disk (/dev/sdf) 
as the cause of the Target drivers or QLogic cards throwing an 
exception.

Tomorrow I will insert the failed disk back in to see if a) the array 
takes it back, and b) it causes a failure with the patch applied.

Looks like the failed disk /dev/sdf was limping along for months and, 
until I removed the power, it didn't collapse on itself.
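Incidentally, the saturation pattern in the iostat capture earlier (one
member disk pinned at 100% %util while the array still reported clean)
is easy to pick out automatically.  A minimal sketch, hypothetical and
assuming only the default `iostat -x` extended layout where %util is the
last column of each device line:

```python
# Hypothetical helper: flag devices that an iostat -x capture reports
# as saturated.  Assumes %util is the last column of each device line,
# as in the captures posted earlier in this thread.

def saturated_devices(iostat_text, threshold=99.0):
    """Return device names whose %util meets or exceeds threshold."""
    hits = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            util = float(fields[-1])  # header lines fail this and are skipped
        except ValueError:
            continue
        if util >= threshold:
            hits.append(fields[0])
    return hits

sample = """\
Device:  rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sde      0.00 0.00 2.00 0.00 64.00 0.00 64.00 0.00 1.50 1.50 0.30
sdf      0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 100.00
dm-6     0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.00
"""

print(saturated_devices(sample))  # -> ['sdf', 'dm-6']
```

Run against the full iostat-tkx.txt session, a filter like this would
have singled out /dev/sdf (and the dm-6 device stacked on the array)
well before the ABORT_TASKs started.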

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-26  2:05         ` TomK
@ 2016-10-26  7:20           ` Nicholas A. Bellinger
  2016-10-26 12:08             ` TomK
  0 siblings, 1 reply; 14+ messages in thread
From: Nicholas A. Bellinger @ 2016-10-26  7:20 UTC (permalink / raw)
  To: TomK
  Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali,
	Gurumurthy, Anil

Hello TomK & Co,

Comments below.

On Tue, 2016-10-25 at 22:05 -0400, TomK wrote:
> On 10/25/2016 1:28 AM, TomK wrote:
> > On 10/24/2016 2:36 AM, Nicholas A. Bellinger wrote:
> >> Hi TomK,
> >>
> >> Thanks for reporting this bug.  Comments inline below.
> >>
> >> On Mon, 2016-10-24 at 00:45 -0400, TomK wrote:
> >>> On 10/24/2016 12:32 AM, TomK wrote:
> >>>> On 10/23/2016 10:03 PM, TomK wrote:

<SNIP>

> >>>> Including the full log:
> >>>>
> >>>> http://microdevsys.com/linux-lio/messages-mailing-list
> >>>>
> >>>
> >>
> >> Thanks for posting with qla2xxx verbose debug enabled on your setup.
> >>
> >>>
> >>> When trying to shut down the target using /etc/init.d/target stop, the
> >>> following is printed repeatedly:
> >>>
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20:
> >>> ABTS_RECV_24XX: instance 0
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20:
> >>> qla_target(0): task abort (s_id=1:5:0, tag=1177068, param=0)
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f812:20:
> >>> qla_target(0): task abort for non-existant session
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80e:20:
> >>> Scheduling work (type 1, prm ffff880093365680) to find session for param
> >>> ffff88010f8c7680 (size 64, tgt ffff880111f06600)
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f800:20: Sess
> >>> work (tgt ffff880111f06600)
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending
> >>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff880093365694,
> >>> status=4
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
> >>> ABTS_RESP_24XX: compl_status 31
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending
> >>> retry TERM EXCH CTIO7 (ha=ffff88010fae0000)
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending
> >>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff88010f8c76c0,
> >>> status=0
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
> >>> ABTS_RESP_24XX: compl_status 0
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
> >>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 029c
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-3861:20: New
> >>> command while device ffff880111f06600 is shutting down
> >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e859:20:
> >>> qla_target: Unable to send command to target for req, ignoring.
> >>>
> >>>
> >>
> >> At your earliest convenience, please verify the patch using v4.8.y with
> >> the above ABORT_TASK + shutdown scenario.
> >>
> >> Also, it would be helpful to understand why this ESX FC host is
> >> generating ABORT_TASKs.
> >>
> >> Eg: Is ABORT_TASK generated due to FC target response packet loss..?
> >> Or due to target backend I/O latency, that ultimately triggers FC host
> >> side timeouts...?
> >>

Ok, so the specific hung task warnings reported earlier above are
ABORT_TASK due to the target-core backend md array holding onto
outstanding I/O long enough for ESX host side SCSI timeouts to begin to
trigger.

> >>>
> >>> + when I disable the ports on the brocade switch that we're using then
> >>> try to stop target, the following is printed:
> >>>
> >>>
> >>>
> >>> Oct 24 00:41:31 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop
> >>> down - seconds remaining 231.
> >>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop
> >>> down - seconds remaining 153.
> >>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
> >>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at
> >>> lib/list_debug.c:33 __list_add+0xbe/0xd0
> >>> Oct 24 00:41:32 mbpc-pc kernel: list_add corruption. prev->next should
> >>> be next (ffff88009e83b330), but was ffff88011fc972a0.
> >>> (prev=ffff880118ada4c0).
> >>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
> >>> tcm_loop target_core_file target_core_iblock target_core_pscsi
> >>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
> >>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
> >>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
> >>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
> >>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
> >>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
> >>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
> >>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
> >>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii
> >>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
> >>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm
> >>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4
> >>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
> >>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
> >>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
> >>> dm_region_hash dm_log dm_mod
> >>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3 Not
> >>> tainted 4.8.4 #2
> >>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
> >>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
> >>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48
> >>> ffffffff812e88e9 ffffffff8130753e
> >>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8
> >>> 0000000000000000 ffff880092b83b98
> >>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952
> >>> 0000002100000046 ffffffff8101eae8
> >>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>]
> >>> dump_stack+0x51/0x78
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] ?
> >>> __list_add+0xbe/0xd0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ?
> >>> __switch_to+0x398/0x7e0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>]
> >>> warn_slowpath_fmt+0x49/0x50
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>]
> >>> __list_add+0xbe/0xd0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>]
> >>> move_linked_works+0x62/0x90
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>]
> >>> process_one_work+0x25c/0x4e0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
> >>> schedule+0x40/0xb0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>]
> >>> worker_thread+0x16d/0x520
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
> >>> __schedule+0x2fd/0x6a0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ?
> >>> default_wake_function+0x12/0x20
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> >>> __wake_up_common+0x56/0x90
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >>> maybe_create_worker+0x110/0x110
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
> >>> schedule+0x40/0xb0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >>> maybe_create_worker+0x110/0x110
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ?
> >>> schedule_tail+0x1e/0xc0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>]
> >>> ret_from_fork+0x1f/0x40
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ?
> >>> kthread_freezable_should_stop+0x70/0x70
> >>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f464 ]---
> >>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
> >>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at
> >>> lib/list_debug.c:36 __list_add+0x9c/0xd0
> >>> Oct 24 00:41:32 mbpc-pc kernel: list_add double add:
> >>> new=ffff880118ada4c0, prev=ffff880118ada4c0, next=ffff88009e83b330.
> >>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
> >>> tcm_loop target_core_file target_core_iblock target_core_pscsi
> >>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
> >>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
> >>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
> >>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
> >>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
> >>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
> >>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
> >>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
> >>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii
> >>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
> >>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm
> >>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4
> >>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
> >>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
> >>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
> >>> dm_region_hash dm_log dm_mod
> >>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3
> >>> Tainted: G        W       4.8.4 #2
> >>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
> >>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
> >>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48
> >>> ffffffff812e88e9 ffffffff8130751c
> >>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8
> >>> 0000000000000000 ffff880092b83b98
> >>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952
> >>> 0000002400000046 ffffffff8101eae8
> >>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>]
> >>> dump_stack+0x51/0x78
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] ?
> >>> __list_add+0x9c/0xd0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ?
> >>> __switch_to+0x398/0x7e0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>]
> >>> warn_slowpath_fmt+0x49/0x50
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>]
> >>> __list_add+0x9c/0xd0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>]
> >>> move_linked_works+0x62/0x90
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>]
> >>> process_one_work+0x25c/0x4e0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
> >>> schedule+0x40/0xb0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>]
> >>> worker_thread+0x16d/0x520
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
> >>> __schedule+0x2fd/0x6a0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ?
> >>> default_wake_function+0x12/0x20
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> >>> __wake_up_common+0x56/0x90
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >>> maybe_create_worker+0x110/0x110
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
> >>> schedule+0x40/0xb0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >>> maybe_create_worker+0x110/0x110
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ?
> >>> schedule_tail+0x1e/0xc0
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>]
> >>> ret_from_fork+0x1f/0x40
> >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ?
> >>> kthread_freezable_should_stop+0x70/0x70
> >>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f465 ]---
> >>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop
> >>> down - seconds remaining 230.
> >>> Oct 24 00:41:33 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop
> >>> down - seconds remaining 152.
> >>>
> >>>
> >>
> >> Mmmm.  Could be a side effect of the target-core regression, but not
> >> completely sure..
> >>
> >> Adding QLOGIC folks CC'.
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>

Adding Anil CC'

> >
> > Hey Nicholas,
> >
> >
> >> At your earliest convenience, please verify the patch using v4.8.y with
> >> the above ABORT_TASK + shutdown scenario.
> >>
> >> Also, it would be helpful to understand why this ESX FC host is
> >> generating ABORT_TASKs.
> >>
> >> Eg: Is ABORT_TASK generated due to FC target response packet loss..?
> >> Or due to target backend I/O latency, that ultimately triggers FC host
> >> side timeouts...?
> >
> >
> > Here is where it gets interesting, and it speaks to your thought above.
> > Take for example this log snippet
> > (http://microdevsys.com/linux-lio/messages-recent):
> >
> > Oct 23 22:12:51 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> > Jan  7 00:36:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> > successfully started
> > Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
> > Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 1 rtc interrupts
> > Oct 23 22:15:02 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> > Oct 23 22:15:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> > Jan  7 00:38:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> > successfully started
> > Oct 23 22:16:29 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
> > Oct 23 22:17:30 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> > Jan  7 00:40:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> > successfully started
> > Oct 23 22:18:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
> > task_tag: 1195032
> > Oct 23 22:18:33 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> > Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Sending
> > TMR_FUNCTION_COMPLETE for ref_tag: 1195032
> > Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
> > task_tag: 1122276
> > Oct 23 22:19:35 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> > Jan  7 00:42:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> > successfully started
> > Oct 23 22:20:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> > Oct 23 22:21:07 mbpc-pc kernel: INFO: task kworker/u16:8:308 blocked for
> > more than 120 seconds.
> > Oct 23 22:21:07 mbpc-pc kernel:      Not tainted 4.8.4 #2
> > Oct 23 22:21:07 mbpc-pc kernel: "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct 23 22:21:07 mbpc-pc kernel: kworker/u16:8   D ffff880111b8fa18     0
> >   308      2 0x00000000
> > Oct 23 22:21:07 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
> > [target_core_mod]
> > Oct 23 22:21:07 mbpc-pc kernel: ffff880111b8fa18 0000000000000400
> > ffff880112180480 ffff880111b8f998
> > Oct 23 22:21:07 mbpc-pc kernel: ffff88011107a380 ffffffff81f99ca0
> > ffffffff81f998ef ffff880100000000
> > Oct 23 22:21:07 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
> > ffffe8ffffcda000 ffff880000000000
> > Oct 23 22:21:07 mbpc-pc kernel: Call Trace:
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080169>] ?
> > start_flush_work+0x49/0x180
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162e7ec>]
> > schedule_timeout+0x9c/0xe0
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bd15c>] ?
> > console_unlock+0x35c/0x380
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162cfa0>]
> > wait_for_completion+0xc0/0xf0
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923e0>] ?
> > try_to_wake_up+0x260/0x260
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6f84>]
> > __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
> > vprintk_default+0x1f/0x30
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f70c4>]
> > transport_wait_for_tasks+0x44/0x60 [target_core_mod]
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f46e2>]
> > core_tmr_abort_task+0xf2/0x160 [target_core_mod]
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6aa4>]
> > target_tmr_work+0x154/0x160 [target_core_mod]
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080639>]
> > process_one_work+0x189/0x4e0
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810d060c>] ?
> > del_timer_sync+0x4c/0x60
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108131e>] ?
> > maybe_create_worker+0x8e/0x110
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108150d>]
> > worker_thread+0x16d/0x520
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923f2>] ?
> > default_wake_function+0x12/0x20
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> > __wake_up_common+0x56/0x90
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ?
> > maybe_create_worker+0x110/0x110
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ?
> > maybe_create_worker+0x110/0x110
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8109130e>] ?
> > schedule_tail+0x1e/0xc0
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162f60f>]
> > ret_from_fork+0x1f/0x40
> > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085f20>] ?
> > kthread_freezable_should_stop+0x70/0x70
> > Oct 23 22:21:52 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> > Jan  7 00:44:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
> > successfully started
> > Oct 23 22:23:03 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> >
> >
> > And compare it to the following snippet
> > (http://microdevsys.com/linux-lio/iostat-tkx-interesting-bit.txt) taken
> > from this bigger iostat session
> > (http://microdevsys.com/linux-lio/iostat-tkx.txt):
> >

<SNIP>

> >
> >
> > We can see that /dev/sdf ramps up to 100% utilization starting at around
> > (10/23/2016 10:18:18 PM) and stays that way till about the (10/23/2016
> > 10:18:42 PM) mark, when something occurs and it drops back below 100%.
> >
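A minimal sketch of how a limping member like /dev/sdf can be pulled out of a
long capture such as the iostat-tkx.txt session above; the `high_util` helper
name and the 95% threshold are assumptions, and %util is taken as the last
column of `iostat -x` extended output.

```shell
# Hypothetical helper: flag devices whose %util (last column of `iostat -x`
# extended output) exceeds a threshold, so a disk pinned near 100% while its
# RAID siblings idle stands out in a long capture.
high_util() {
    awk -v limit="${1:-95}" '/^sd/ && $NF + 0 > limit { print $1, $NF }'
}
# usage on a saved capture:
#   high_util 95 < iostat-tkx.txt
```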
> > So I checked the array which shows all clean, even across reboots:
> >
> > [root@mbpc-pc ~]# cat /proc/mdstat
> > Personalities : [raid6] [raid5] [raid4]
> > md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8]
> >       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6]
> > [UUUUUU]
> >       bitmap: 1/8 pages [4KB], 65536KB chunk
> >
> > unused devices: <none>
> > [root@mbpc-pc ~]#
> >
> >
> > Then I run smartctl across all disks and sure enough /dev/sdf prints this:
> >
> > [root@mbpc-pc ~]# smartctl -A /dev/sdf
> > smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build)
> > Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
> >
> > Error SMART Values Read failed: scsi error badly formed scsi parameters
> > Smartctl: SMART Read Values failed.
> >
> > === START OF READ SMART DATA SECTION ===
> > [root@mbpc-pc ~]#
> >
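One hedged way to sweep the remaining members for the same symptom is sketched
below; the `list_md_members` helper is an assumption, written against the md0
line format shown in the /proc/mdstat output above.

```shell
# Hypothetical helper: pull member device names out of the /proc/mdstat md0
# line (fields 5..NF are "sdX[n]" entries in the output shown above; the
# sub() also strips a trailing "(F)" faulty marker).
list_md_members() {
    awk '/^md0 :/ { for (i = 5; i <= NF; i++) { sub(/\[.*/, "", $i); print "/dev/" $i } }'
}
# usage on the target host: health-check every member, not just /dev/sdf
#   list_md_members < /proc/mdstat | xargs -n 1 smartctl -H
```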
> > So it would appear we found the root cause: a bad disk.  True, the disk
> > is bad and I'll be replacing it; however, even with a degraded disk
> > (checking now) the array functions just fine and I have no data loss.  I
> > only lost one.  I would have to lose three to get a catastrophic failure on
> > this RAID6:
> >
> > [root@mbpc-pc ~]# cat /proc/mdstat
> > Personalities : [raid6] [raid5] [raid4]
> > md0 : active raid6 sdf[6](F) sda[5] sde[8] sdb[7] sdd[3] sdc[1]
> >       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5]
> > [UUUU_U]
> >       bitmap: 6/8 pages [24KB], 65536KB chunk
> >
> > unused devices: <none>
> > [root@mbpc-pc ~]# mdadm --detail /dev/md0
> > /dev/md0:
> >         Version : 1.2
> >   Creation Time : Mon Mar 26 00:06:24 2012
> >      Raid Level : raid6
> >      Array Size : 3907045632 (3726.05 GiB 4000.81 GB)
> >   Used Dev Size : 976761408 (931.51 GiB 1000.20 GB)
> >    Raid Devices : 6
> >   Total Devices : 6
> >     Persistence : Superblock is persistent
> >
> >   Intent Bitmap : Internal
> >
> >     Update Time : Tue Oct 25 00:31:13 2016
> >           State : clean, degraded
> >  Active Devices : 5
> > Working Devices : 5
> >  Failed Devices : 1
> >   Spare Devices : 0
> >
> >          Layout : left-symmetric
> >      Chunk Size : 64K
> >
> >            Name : mbpc:0
> >            UUID : 2f36ac48:5e3e4c54:72177c53:bea3e41e
> >          Events : 118368
> >
> >     Number   Major   Minor   RaidDevice State
> >        8       8       64        0      active sync   /dev/sde
> >        1       8       32        1      active sync   /dev/sdc
> >        7       8       16        2      active sync   /dev/sdb
> >        3       8       48        3      active sync   /dev/sdd
> >        8       0        0        8      removed
> >        5       8        0        5      active sync   /dev/sda
> >
> >        6       8       80        -      faulty   /dev/sdf
> > [root@mbpc-pc ~]#
> >
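Since the disk is due to be swapped anyway, the usual mdadm member-replacement
flow is sketched below as a dry run; `/dev/sdg` is an assumed name for the new
disk, so substitute whatever the replacement actually enumerates as.

```shell
# Dry-run sketch of the standard mdadm member-replacement flow. Nothing here
# touches the array; it only prints the commands to review before running
# them as root. /dev/sdg is an assumed device name for the new disk.
FAILED=/dev/sdf
NEW=/dev/sdg   # assumption: adjust to the replacement disk's actual name
replace_cmds() {
    printf 'mdadm /dev/md0 --fail %s\n'   "$FAILED"
    printf 'mdadm /dev/md0 --remove %s\n' "$FAILED"
    printf 'mdadm /dev/md0 --add %s\n'    "$NEW"
}
replace_cmds    # review the output, then run each line as root
```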
> > Last night I cut power to the /dev/sdf disk to spin it down, then removed
> > it and reinserted it.  The array resynced without issue; however, the
> > smartctl -A command still failed on it.  Today I checked, and bad blocks
> > had been recorded on the disk and the array has since removed /dev/sdf (per
> > above).  Also, I have to say that these ESXi hosts worked in this
> > configuration, without any hiccup, for about 4 months.  No LUN failure
> > on the ESXi side.  I haven't changed the LUN in that time (had no reason
> > to do so).
> >
> > So now here's the real question that I have.  Why would the array
> > continue to function as intended with only one disk failure, yet the
> > QLogic / Target drivers stop and error out?  The RAID6 (software) array
> > should care about the failure, and it should handle it.  The QLogic /
> > Target drivers shouldn't really be much impacted (aside from read speed,
> > maybe) by a disk failing inside the array.  That would be my thinking.
> > The Target / QLogic software seems to have picked up on a failure ahead of
> > the software RAID6 detecting it.  I've had this RAID6 for over 6 years
> > now.  Aside from the occasional disk replacement, it's been quite rock solid.

The earlier hung task warnings after ABORT_TASK w/ TMR_FUNCTION_COMPLETE
and after explicit configfs shutdown are likely due to the missing
SCF_ACK_KREF bit assignment.  Note the bug is specific to high backend I/O
latency with v4.1+ code, so you'll want to include the patch in all future
builds.

AFAICT thus far, the list corruption bug reported here and also from Anil
& Co looks like a separate bug involving tcm_qla2xxx ports.

> >
> > So anyway, I added the fix you pointed out to the 4.8.4 kernel and
> > recompiled.  I restarted it, with the RAID6 degraded as it is.  All
> > mounted fine and I checked the LUN's from the ESXi side:
> >
> > [root@mbpc-pc ~]# /etc/init.d/target start
> > The Linux SCSI Target is already stopped                   [  OK  ]
> > [info] The Linux SCSI Target looks properly installed.
> > The configfs filesystem was not mounted, consider adding it[WARNING]
> > [info] Loaded core module target_core_mod.
> > [info] Loaded core module target_core_pscsi.
> > [info] Loaded core module target_core_iblock.
> > [info] Loaded core module target_core_file.
> > Failed to load fabric module ib_srpt                       [WARNING]
> > Failed to load fabric module tcm_usb_gadget                [WARNING]
> > [info] Loaded fabric module tcm_loop.
> > [info] Loaded fabric module tcm_fc.
> > Failed to load fabric module vhost_scsi                    [WARNING]
> > [info] Loaded fabric module tcm_qla2xxx.
> > Failed to load fabric module iscsi_target_mod              [WARNING]
> > [info] Loading config from /etc/target/scsi_target.lio, this may take
> > several minutes for FC adapters.
> > [info] Loaded /etc/target/scsi_target.lio.
> > Started The Linux SCSI Target                              [  OK  ]
> > [root@mbpc-pc ~]#
> >
> >
> > Enabled the brocade ports:
> >
> >
> >  18  18   011200   id    N4   No_Light    FC
> >  19  19   011300   id    N4   No_Sync     FC  Disabled (Persistent)
> >  20  20   011400   id    N4   No_Light    FC
> >  21  21   011500   id    N4   No_Light    FC
> >  22  22   011600   id    N4   No_Light    FC
> >  23  23   011700   id    N4   No_Light    FC  Disabled (Persistent)
> >  24  24   011800   --    N4   No_Module   FC  (No POD License) Disabled
> >  25  25   011900   --    N4   No_Module   FC  (No POD License) Disabled
> >  26  26   011a00   --    N4   No_Module   FC  (No POD License) Disabled
> >  27  27   011b00   --    N4   No_Module   FC  (No POD License) Disabled
> >  28  28   011c00   --    N4   No_Module   FC  (No POD License) Disabled
> >  29  29   011d00   --    N4   No_Module   FC  (No POD License) Disabled
> >  30  30   011e00   --    N4   No_Module   FC  (No POD License) Disabled
> >  31  31   011f00   --    N4   No_Module   FC  (No POD License) Disabled
> > sw0:admin> portcfgpersistentenable 19
> > sw0:admin> portcfgpersistentenable 23
> > sw0:admin> date
> > Tue Oct 25 04:03:42 UTC 2016
> > sw0:admin>
> >
> > And still after 30 minutes, there is no failure.  This run includes the
> > fix you asked me to add
> > (https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e)
> >  .  If everything works, I will revert the patch and see if I can
> > reproduce the issue.  If I can still reproduce it, then the disk might
> > not have been the cause; the patch was.  I'll keep you posted on that
> > when I get a new disk tomorrow.
> >
> >
> > Right now this is a POC setup so I have lots of room to experiment.
> >
> 
> Hey Nicholas,
> 
> I've done some testing up till now.  With or without the patch above, as 
> long as the faulty disk is removed from the RAID 6 software array, 
> everything works fine with the Target Driver and ESXi hosts.

Thanks again for the extra debug + feedback, and confirming the earlier
hung task warnings with md disk failure.

>   This is 
> even on a degraded array:
> 
> [root@mbpc-pc ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdc[1] sda[5] sdd[3] sde[8] sdb[7]
>        3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5] 
> [UUUU_U]
>        bitmap: 6/8 pages [24KB], 65536KB chunk
> 
> unused devices: <none>
> [root@mbpc-pc ~]#
> 
> So at the moment, the data points to the single failed disk
> ( /dev/sdf ) as causing the Target Drivers or QLogic cards to throw an
> exception.
> 
> Tomorrow I will insert the failed disk back in to see if a) the array
> takes it back, b) it causes a failure with the patch applied.
> 
> Looks like the failed disk /dev/sdf was limping along for months and 
> until I removed the power, it didn't collapse on itself.
> 

AFAICT, the list corruption observed is a separate bug from the hung
tasks during ABORT_TASK w/ TMR_FUNCTION_COMPLETE with explicit target
shutdown.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-26  7:20           ` Nicholas A. Bellinger
@ 2016-10-26 12:08             ` TomK
  2016-10-28  6:01               ` TomK
  0 siblings, 1 reply; 14+ messages in thread
From: TomK @ 2016-10-26 12:08 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali,
	Gurumurthy, Anil

On 10/26/2016 3:20 AM, Nicholas A. Bellinger wrote:
> Hello TomK & Co,
>
> Comments below.
>
> On Tue, 2016-10-25 at 22:05 -0400, TomK wrote:
>> On 10/25/2016 1:28 AM, TomK wrote:
>>> On 10/24/2016 2:36 AM, Nicholas A. Bellinger wrote:
>>>> Hi TomK,
>>>>
>>>> Thanks for reporting this bug.  Comments inline below.
>>>>
>>>> On Mon, 2016-10-24 at 00:45 -0400, TomK wrote:
>>>>> On 10/24/2016 12:32 AM, TomK wrote:
>>>>>> On 10/23/2016 10:03 PM, TomK wrote:
>
> <SNIP>
>
>>>>>> Including the full log:
>>>>>>
>>>>>> http://microdevsys.com/linux-lio/messages-mailing-list
>>>>>>
>>>>>
>>>>
>>>> Thanks for posting with qla2xxx verbose debug enabled on your setup.
>>>>
>>>>>
>>>>> When trying to shut down target using /etc/init.d/target stop, the
>>>>> following is printed repeatedly:
>>>>>
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20:
>>>>> ABTS_RECV_24XX: instance 0
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20:
>>>>> qla_target(0): task abort (s_id=1:5:0, tag=1177068, param=0)
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f812:20:
>>>>> qla_target(0): task abort for non-existant session
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80e:20:
>>>>> Scheduling work (type 1, prm ffff880093365680) to find session for param
>>>>> ffff88010f8c7680 (size 64, tgt ffff880111f06600)
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f800:20: Sess
>>>>> work (tgt ffff880111f06600)
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending
>>>>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff880093365694,
>>>>> status=4
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>>>> ABTS_RESP_24XX: compl_status 31
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending
>>>>> retry TERM EXCH CTIO7 (ha=ffff88010fae0000)
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending
>>>>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff88010f8c76c0,
>>>>> status=0
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>>>> ABTS_RESP_24XX: compl_status 0
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 029c
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-3861:20: New
>>>>> command while device ffff880111f06600 is shutting down
>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e859:20:
>>>>> qla_target: Unable to send command to target for req, ignoring.
>>>>>
>>>>>
>>>>
>>>> At your earliest convenience, please verify the patch using v4.8.y with
>>>> the above ABORT_TASK + shutdown scenario.
>>>>
>>>> Also, it would be helpful to understand why this ESX FC host is
>>>> generating ABORT_TASKs.
>>>>
>>>> Eg: Is ABORT_TASK generated due to FC target response packet loss..?
>>>> Or due to target backend I/O latency, that ultimately triggers FC host
>>>> side timeouts...?
>>>>
>
> Ok, so the specific hung task warnings reported earlier above are from
> ABORT_TASK, due to the target-core backend md array holding onto
> outstanding I/O long enough for ESX host-side SCSI timeouts to begin to
> trigger.
>
>>>>>
>>>>> + when I disable the ports on the brocade switch that we're using then
>>>>> try to stop target, the following is printed:
>>>>>
>>>>>
>>>>>
>>>>> Oct 24 00:41:31 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop
>>>>> down - seconds remaining 231.
>>>>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop
>>>>> down - seconds remaining 153.
>>>>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
>>>>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at
>>>>> lib/list_debug.c:33 __list_add+0xbe/0xd0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: list_add corruption. prev->next should
>>>>> be next (ffff88009e83b330), but was ffff88011fc972a0.
>>>>> (prev=ffff880118ada4c0).
>>>>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>>>>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>>>>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
>>>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>>>>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
>>>>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>>>>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>>>>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>>>>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>>>>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
>>>>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii
>>>>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
>>>>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm
>>>>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4
>>>>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
>>>>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
>>>>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>>>>> dm_region_hash dm_log dm_mod
>>>>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3 Not
>>>>> tainted 4.8.4 #2
>>>>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
>>>>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>>>>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48
>>>>> ffffffff812e88e9 ffffffff8130753e
>>>>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8
>>>>> 0000000000000000 ffff880092b83b98
>>>>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952
>>>>> 0000002100000046 ffffffff8101eae8
>>>>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>]
>>>>> dump_stack+0x51/0x78
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] ?
>>>>> __list_add+0xbe/0xd0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ?
>>>>> __switch_to+0x398/0x7e0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>]
>>>>> warn_slowpath_fmt+0x49/0x50
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>]
>>>>> __list_add+0xbe/0xd0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>]
>>>>> move_linked_works+0x62/0x90
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>]
>>>>> process_one_work+0x25c/0x4e0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>]
>>>>> worker_thread+0x16d/0x520
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
>>>>> __schedule+0x2fd/0x6a0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>>> default_wake_function+0x12/0x20
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>>> __wake_up_common+0x56/0x90
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>>> schedule_tail+0x1e/0xc0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>>> ret_from_fork+0x1f/0x40
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>>> kthread_freezable_should_stop+0x70/0x70
>>>>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f464 ]---
>>>>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
>>>>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at
>>>>> lib/list_debug.c:36 __list_add+0x9c/0xd0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: list_add double add:
>>>>> new=ffff880118ada4c0, prev=ffff880118ada4c0, next=ffff88009e83b330.
>>>>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>>>>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>>>>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
>>>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>>>>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
>>>>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>>>>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>>>>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>>>>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>>>>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
>>>>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii
>>>>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
>>>>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm
>>>>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4
>>>>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
>>>>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
>>>>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>>>>> dm_region_hash dm_log dm_mod
>>>>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3
>>>>> Tainted: G        W       4.8.4 #2
>>>>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
>>>>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>>>>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48
>>>>> ffffffff812e88e9 ffffffff8130751c
>>>>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8
>>>>> 0000000000000000 ffff880092b83b98
>>>>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952
>>>>> 0000002400000046 ffffffff8101eae8
>>>>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>]
>>>>> dump_stack+0x51/0x78
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] ?
>>>>> __list_add+0x9c/0xd0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ?
>>>>> __switch_to+0x398/0x7e0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>]
>>>>> warn_slowpath_fmt+0x49/0x50
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>]
>>>>> __list_add+0x9c/0xd0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>]
>>>>> move_linked_works+0x62/0x90
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>]
>>>>> process_one_work+0x25c/0x4e0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>]
>>>>> worker_thread+0x16d/0x520
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
>>>>> __schedule+0x2fd/0x6a0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>>> default_wake_function+0x12/0x20
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>>> __wake_up_common+0x56/0x90
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>>> schedule_tail+0x1e/0xc0
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>>> ret_from_fork+0x1f/0x40
>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>>> kthread_freezable_should_stop+0x70/0x70
>>>>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f465 ]---
>>>>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop
>>>>> down - seconds remaining 230.
>>>>> Oct 24 00:41:33 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop
>>>>> down - seconds remaining 152.
>>>>>
>>>>>
>>>>
>>>> Mmmm.  Could be a side effect of the target-core regression, but not
>>>> completely sure..
>>>>
>>>> Adding QLOGIC folks CC'.
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>
> Adding Anil CC'
>
>>>
>>> Hey Nicholas,
>>>
>>>
>>>> At your earliest convenience, please verify the patch using v4.8.y with
>>>> the above ABORT_TASK + shutdown scenario.
>>>>
>>>> Also, it would be helpful to understand why this ESX FC host is
>>>> generating ABORT_TASKs.
>>>>
>>>> Eg: Is ABORT_TASK generated due to FC target response packet loss..?
>>>> Or due to target backend I/O latency, that ultimately triggers FC host
>>>> side timeouts...?
>>>
>>>
>>> Here is where it gets interesting and to your thought above.  Take for
>>> example this log snippet
>>> (http://microdevsys.com/linux-lio/messages-recent):
>>>
>>> Oct 23 22:12:51 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Jan  7 00:36:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>> successfully started
>>> Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
>>> Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 1 rtc interrupts
>>> Oct 23 22:15:02 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Oct 23 22:15:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Jan  7 00:38:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>> successfully started
>>> Oct 23 22:16:29 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
>>> Oct 23 22:17:30 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Jan  7 00:40:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>> successfully started
>>> Oct 23 22:18:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>>> task_tag: 1195032
>>> Oct 23 22:18:33 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Sending
>>> TMR_FUNCTION_COMPLETE for ref_tag: 1195032
>>> Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>>> task_tag: 1122276
>>> Oct 23 22:19:35 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Jan  7 00:42:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>> successfully started
>>> Oct 23 22:20:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Oct 23 22:21:07 mbpc-pc kernel: INFO: task kworker/u16:8:308 blocked for
>>> more than 120 seconds.
>>> Oct 23 22:21:07 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>> Oct 23 22:21:07 mbpc-pc kernel: "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Oct 23 22:21:07 mbpc-pc kernel: kworker/u16:8   D ffff880111b8fa18     0
>>>   308      2 0x00000000
>>> Oct 23 22:21:07 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>> [target_core_mod]
>>> Oct 23 22:21:07 mbpc-pc kernel: ffff880111b8fa18 0000000000000400
>>> ffff880112180480 ffff880111b8f998
>>> Oct 23 22:21:07 mbpc-pc kernel: ffff88011107a380 ffffffff81f99ca0
>>> ffffffff81f998ef ffff880100000000
>>> Oct 23 22:21:07 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>>> ffffe8ffffcda000 ffff880000000000
>>> Oct 23 22:21:07 mbpc-pc kernel: Call Trace:
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080169>] ?
>>> start_flush_work+0x49/0x180
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>> schedule_timeout+0x9c/0xe0
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>>> console_unlock+0x35c/0x380
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>> wait_for_completion+0xc0/0xf0
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>> try_to_wake_up+0x260/0x260
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6f84>]
>>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>>> vprintk_default+0x1f/0x30
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f70c4>]
>>> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f46e2>]
>>> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>>> target_tmr_work+0x154/0x160 [target_core_mod]
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080639>]
>>> process_one_work+0x189/0x4e0
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810d060c>] ?
>>> del_timer_sync+0x4c/0x60
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108131e>] ?
>>> maybe_create_worker+0x8e/0x110
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108150d>]
>>> worker_thread+0x16d/0x520
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>> default_wake_function+0x12/0x20
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>> __wake_up_common+0x56/0x90
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>> maybe_create_worker+0x110/0x110
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>> maybe_create_worker+0x110/0x110
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>> schedule_tail+0x1e/0xc0
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162f60f>]
>>> ret_from_fork+0x1f/0x40
>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>> kthread_freezable_should_stop+0x70/0x70
>>> Oct 23 22:21:52 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Jan  7 00:44:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>> successfully started
>>> Oct 23 22:23:03 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>
>>>
>>> And compare it to the following snippet
>>> (http://microdevsys.com/linux-lio/iostat-tkx-interesting-bit.txt) taken
>>> from this bigger iostat session
>>> (http://microdevsys.com/linux-lio/iostat-tkx.txt):
>>>
>
> <SNIP>
>
>>>
>>>
>>> We can see that /dev/sdf ramps up to 100% utilization starting at around
>>> 10:18:18 PM (10/23/2016) and stays there until about the 10:18:42 PM
>>> mark, when something occurs and it drops back below 100%.
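For reference, a quick way to pull the saturated devices out of a saved
`iostat -tkx` capture like the one linked above is to filter on the %util
column, which is the last field of each device line in the extended (-x)
format. A rough sketch; the capture filename is hypothetical:

```shell
# Print any sd* device whose %util (last column of `iostat -x` device
# lines) is pegged at or near 100 in a saved capture file.
flag_saturated() {
    awk '$1 ~ /^sd/ && $NF+0 >= 99 {print $1, $NF}' "$1"
}

# Usage (hypothetical capture file):
# flag_saturated iostat-tkx.txt
```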
>>>
>>> So I checked the array which shows all clean, even across reboots:
>>>
>>> [root@mbpc-pc ~]# cat /proc/mdstat
>>> Personalities : [raid6] [raid5] [raid4]
>>> md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8]
>>>       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6]
>>> [UUUUUU]
>>>       bitmap: 1/8 pages [4KB], 65536KB chunk
>>>
>>> unused devices: <none>
>>> [root@mbpc-pc ~]#
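As an aside, the clean/degraded state can be checked mechanically by
looking for a `_` inside the `[UUUUUU]` status string of /proc/mdstat. A
minimal sketch; the path argument lets it run against a saved copy:

```shell
# Report DEGRADED if any md status string like [UUUU_U] contains a '_'
# (a missing or failed member), CLEAN otherwise.
check_md() {
    if grep -q '\[U*_[U_]*\]' "$1"; then
        echo "DEGRADED"
    else
        echo "CLEAN"
    fi
}

# Usage on a live system:
# check_md /proc/mdstat
```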
>>>
>>>
>>> Then I run smartctl across all disks and sure enough /dev/sdf prints this:
>>>
>>> [root@mbpc-pc ~]# smartctl -A /dev/sdf
>>> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build)
>>> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
>>>
>>> Error SMART Values Read failed: scsi error badly formed scsi parameters
>>> Smartctl: SMART Read Values failed.
>>>
>>> === START OF READ SMART DATA SECTION ===
>>> [root@mbpc-pc ~]#
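To catch a member in this state earlier, one option is to loop a SMART
health check over every member of the array. A sketch, assuming
smartmontools is installed and the array is md0; the awk simply strips
the `[n]`/`(F)` suffixes from the member names in /proc/mdstat:

```shell
# List the member devices of md0, then run `smartctl -H` against each
# one and report any device that fails (or cannot be queried at all).
md_members() {
    awk '/^md0 / {for (i = 5; i <= NF; i++) {sub(/\[.*/, "", $i); print "/dev/" $i}}' "${1:-/proc/mdstat}"
}

for dev in $(md_members); do
    smartctl -H "$dev" > /dev/null 2>&1 || echo "SMART check failed on $dev"
done
```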
>>>
>>> So it would appear we found the root cause: a bad disk.  True, the disk
>>> is bad and I'll be replacing it; however, even with a degraded array
>>> (checking now) everything functions just fine and I have no data loss.  I
>>> only lost 1 disk.  I would have to lose 3 to get a catastrophic failure on
>>> this RAID6:
>>>
>>> [root@mbpc-pc ~]# cat /proc/mdstat
>>> Personalities : [raid6] [raid5] [raid4]
>>> md0 : active raid6 sdf[6](F) sda[5] sde[8] sdb[7] sdd[3] sdc[1]
>>>       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5]
>>> [UUUU_U]
>>>       bitmap: 6/8 pages [24KB], 65536KB chunk
>>>
>>> unused devices: <none>
>>> [root@mbpc-pc ~]# mdadm --detail /dev/md0
>>> /dev/md0:
>>>         Version : 1.2
>>>   Creation Time : Mon Mar 26 00:06:24 2012
>>>      Raid Level : raid6
>>>      Array Size : 3907045632 (3726.05 GiB 4000.81 GB)
>>>   Used Dev Size : 976761408 (931.51 GiB 1000.20 GB)
>>>    Raid Devices : 6
>>>   Total Devices : 6
>>>     Persistence : Superblock is persistent
>>>
>>>   Intent Bitmap : Internal
>>>
>>>     Update Time : Tue Oct 25 00:31:13 2016
>>>           State : clean, degraded
>>>  Active Devices : 5
>>> Working Devices : 5
>>>  Failed Devices : 1
>>>   Spare Devices : 0
>>>
>>>          Layout : left-symmetric
>>>      Chunk Size : 64K
>>>
>>>            Name : mbpc:0
>>>            UUID : 2f36ac48:5e3e4c54:72177c53:bea3e41e
>>>          Events : 118368
>>>
>>>     Number   Major   Minor   RaidDevice State
>>>        8       8       64        0      active sync   /dev/sde
>>>        1       8       32        1      active sync   /dev/sdc
>>>        7       8       16        2      active sync   /dev/sdb
>>>        3       8       48        3      active sync   /dev/sdd
>>>        8       0        0        8      removed
>>>        5       8        0        5      active sync   /dev/sda
>>>
>>>        6       8       80        -      faulty   /dev/sdf
>>> [root@mbpc-pc ~]#
>>>
>>> Last night I cut power to the /dev/sdf disk to spin it down, then removed
>>> it and reinserted it.  The array resynced without issue; however, the
>>> smartctl -A command still failed on it.  Today I checked and bad blocks
>>> were recorded on the disk, and the array has since removed /dev/sdf (per
>>> above).  Also, I have to say that these ESXi hosts worked in this
>>> configuration, without any hiccup, for about 4 months.  No LUN failure on
>>> the ESXi side.  I haven't changed the LUN in that time (had no reason
>>> to do so).
>>>
>>> So now here's the real question I have.  Why would the array continue to
>>> function as intended with only one disk failure, yet the QLogic / Target
>>> drivers stop and error out?  The (software) RAID6 array is the layer that
>>> should care about the failure, and it should handle it.  The QLogic /
>>> Target drivers shouldn't really be too impacted (aside from read speed,
>>> maybe) by a disk failing inside the array.  That would be my thinking.
>>> The Target / QLogic software seems to have picked up on the failure ahead
>>> of the software RAID6 detecting it.  I've had this RAID6 for over 6 years
>>> now.  Aside from the occasional disk replacement, it's been quite rock
>>> solid.
>
> The earlier hung task warnings after ABORT_TASK w/ TMR_FUNCTION_COMPLETE
> and after explicit configfs shutdown are likely the missing SCF_ACK_KREF
> bit assignment.  Note the bug is specific to high backend I/O latency
> with v4.1+ code, so you'll want to include it for all future builds.
>
> AFAICT thus far the list corruption bug reported here and also from Anil
> & Co looks like a separate bug using tcm_qla2xxx ports.
>
>>>
>>> So anyway, I added the fix you pointed out to the 4.8.4 kernel and
>>> recompiled.  I restarted it, with the RAID6 degraded as it is.  Everything
>>> mounted fine and I checked the LUNs from the ESXi side:
>>>
>>> [root@mbpc-pc ~]# /etc/init.d/target start
>>> The Linux SCSI Target is already stopped                   [  OK  ]
>>> [info] The Linux SCSI Target looks properly installed.
>>> The configfs filesystem was not mounted, consider adding it[WARNING]
>>> [info] Loaded core module target_core_mod.
>>> [info] Loaded core module target_core_pscsi.
>>> [info] Loaded core module target_core_iblock.
>>> [info] Loaded core module target_core_file.
>>> Failed to load fabric module ib_srpt                       [WARNING]
>>> Failed to load fabric module tcm_usb_gadget                [WARNING]
>>> [info] Loaded fabric module tcm_loop.
>>> [info] Loaded fabric module tcm_fc.
>>> Failed to load fabric module vhost_scsi                    [WARNING]
>>> [info] Loaded fabric module tcm_qla2xxx.
>>> Failed to load fabric module iscsi_target_mod              [WARNING]
>>> [info] Loading config from /etc/target/scsi_target.lio, this may take
>>> several minutes for FC adapters.
>>> [info] Loaded /etc/target/scsi_target.lio.
>>> Started The Linux SCSI Target                              [  OK  ]
>>> [root@mbpc-pc ~]#
>>>
>>>
>>> Enabled the brocade ports:
>>>
>>>
>>>  18  18   011200   id    N4   No_Light    FC
>>>  19  19   011300   id    N4   No_Sync     FC  Disabled (Persistent)
>>>  20  20   011400   id    N4   No_Light    FC
>>>  21  21   011500   id    N4   No_Light    FC
>>>  22  22   011600   id    N4   No_Light    FC
>>>  23  23   011700   id    N4   No_Light    FC  Disabled (Persistent)
>>>  24  24   011800   --    N4   No_Module   FC  (No POD License) Disabled
>>>  25  25   011900   --    N4   No_Module   FC  (No POD License) Disabled
>>>  26  26   011a00   --    N4   No_Module   FC  (No POD License) Disabled
>>>  27  27   011b00   --    N4   No_Module   FC  (No POD License) Disabled
>>>  28  28   011c00   --    N4   No_Module   FC  (No POD License) Disabled
>>>  29  29   011d00   --    N4   No_Module   FC  (No POD License) Disabled
>>>  30  30   011e00   --    N4   No_Module   FC  (No POD License) Disabled
>>>  31  31   011f00   --    N4   No_Module   FC  (No POD License) Disabled
>>> sw0:admin> portcfgpersistentenable 19
>>> sw0:admin> portcfgpersistentenable 23
>>> sw0:admin> date
>>> Tue Oct 25 04:03:42 UTC 2016
>>> sw0:admin>
>>>
>>> And still after 30 minutes, there is no failure.  This run includes the
>>> fix you asked me to add
>>> (https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e)
>>>  .  If everything works, I will revert the patch and see if I can
>>> reproduce the issue.  If I can reproduce it, then the disk might not have
>>> been the cause, but the missing patch was.  I'll keep you posted on that
>>> when I get a new disk tomorrow.
>>>
>>>
>>> Right now this is a POC setup so I have lots of room to experiment.
>>>
>>
>> Hey Nicholas,
>>
>> I've done some testing up till now.  With or without the patch above, as
>> long as the faulty disk is removed from the RAID 6 software array,
>> everything works fine with the Target Driver and ESXi hosts.
>
> Thanks again for the extra debug + feedback, and confirming the earlier
> hung task warnings with md disk failure.
>
>>   This is
>> even on a degraded array:
>>
>> [root@mbpc-pc ~]# cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid6 sdc[1] sda[5] sdd[3] sde[8] sdb[7]
>>        3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5]
>> [UUUU_U]
>>        bitmap: 6/8 pages [24KB], 65536KB chunk
>>
>> unused devices: <none>
>> [root@mbpc-pc ~]#
>>
>> So at the moment, the data points to the single failed single disk (
>> /dev/sdf ) as causing the Target Drivers or QLogic cards to throw an
>> exception.
>>
>> Tomorrow I will insert the failed disk back in to see if a) the array
>> takes it back, and b) it causes a failure with the patch applied.
>>
>> Looks like the failed disk /dev/sdf was limping along for months; it
>> didn't collapse on itself until I removed the power.
>>
>
> AFAICT, the list corruption observed is a separate bug from the hung
> tasks during ABORT_TASK w/ TMR_FUNCTION_COMPLETE with explicit target
> shutdown.
>

Correct.  I will be including the fix either way, but it will take some 
time to test whether I can reproduce the failure by reinserting this bad 
disk and then a new one.  I want to see if this hang can be reproduced by 
doing these actions on the RAID6 software array, to determine whether the 
failure is isolated to this particular bad disk or is triggered by any 
disk add/remove action in a RAID6.

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-26 12:08             ` TomK
@ 2016-10-28  6:01               ` TomK
  2016-10-29  7:50                 ` Nicholas A. Bellinger
  0 siblings, 1 reply; 14+ messages in thread
From: TomK @ 2016-10-28  6:01 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali,
	Gurumurthy, Anil

On 10/26/2016 8:08 AM, TomK wrote:
> On 10/26/2016 3:20 AM, Nicholas A. Bellinger wrote:
>> Hello TomK & Co,
>>
>> Comments below.
>>
>> On Tue, 2016-10-25 at 22:05 -0400, TomK wrote:
>>> On 10/25/2016 1:28 AM, TomK wrote:
>>>> On 10/24/2016 2:36 AM, Nicholas A. Bellinger wrote:
>>>>> Hi TomK,
>>>>>
>>>>> Thanks for reporting this bug.  Comments inline below.
>>>>>
>>>>> On Mon, 2016-10-24 at 00:45 -0400, TomK wrote:
>>>>>> On 10/24/2016 12:32 AM, TomK wrote:
>>>>>>> On 10/23/2016 10:03 PM, TomK wrote:
>>
>> <SNIP>
>>
>>>>>>> Including the full log:
>>>>>>>
>>>>>>> http://microdevsys.com/linux-lio/messages-mailing-list
>>>>>>>
>>>>>>
>>>>>
>>>>> Thanks for posting with qla2xxx verbose debug enabled on your setup.
>>>>>
>>>>>>
>>>>>> When tryint to shut down target using /etc/init.d/target stop, the
>>>>>> following is printed repeatedly:
>>>>>>
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20:
>>>>>> ABTS_RECV_24XX: instance 0
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20:
>>>>>> qla_target(0): task abort (s_id=1:5:0, tag=1177068, param=0)
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f812:20:
>>>>>> qla_target(0): task abort for non-existant session
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80e:20:
>>>>>> Scheduling work (type 1, prm ffff880093365680) to find session for
>>>>>> param
>>>>>> ffff88010f8c7680 (size 64, tgt ffff880111f06600)
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f800:20: Sess
>>>>>> work (tgt ffff880111f06600)
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20:
>>>>>> Sending
>>>>>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff880093365694,
>>>>>> status=4
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>>>>> ABTS_RESP_24XX: compl_status 31
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20:
>>>>>> Sending
>>>>>> retry TERM EXCH CTIO7 (ha=ffff88010fae0000)
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20:
>>>>>> Sending
>>>>>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff88010f8c76c0,
>>>>>> status=0
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>>>>> ABTS_RESP_24XX: compl_status 0
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 029c
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-3861:20: New
>>>>>> command while device ffff880111f06600 is shutting down
>>>>>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e859:20:
>>>>>> qla_target: Unable to send command to target for req, ignoring.
>>>>>>
>>>>>>
>>>>>
>>>>> At your earliest convenience, please verify the patch using v4.8.y
>>>>> with
>>>>> the above ABORT_TASK + shutdown scenario.
>>>>>
>>>>> Also, it would be helpful to understand why this ESX FC host is
>>>>> generating ABORT_TASKs.
>>>>>
>>>>> Eg: Is ABORT_TASK generated due to FC target response packet loss..?
>>>>> Or due to target backend I/O latency, that ultimately triggers FC host
>>>>> side timeouts...?
>>>>>
>>
>> Ok, so the specific hung task warnings reported earlier above are
>> ABORT_TASK due to the target-core backend md array holding onto
>> outstanding I/O long enough for ESX host-side SCSI timeouts to begin to
>> trigger.
>>
>>>>>>
>>>>>> + when I disable the ports on the brocade switch that we're using
>>>>>> then
>>>>>> try to stop target, the following is printed:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Oct 24 00:41:31 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop
>>>>>> down - seconds remaining 231.
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop
>>>>>> down - seconds remaining 153.
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at
>>>>>> lib/list_debug.c:33 __list_add+0xbe/0xd0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: list_add corruption. prev->next
>>>>>> should
>>>>>> be next (ffff88009e83b330), but was ffff88011fc972a0.
>>>>>> (prev=ffff880118ada4c0).
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>>>>>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>>>>>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat
>>>>>> ebtables
>>>>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>>>>>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4
>>>>>> it87
>>>>>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>>>>>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>>>>>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>>>>>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>>>>>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx
>>>>>> raid6_pq
>>>>>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii
>>>>>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic
>>>>>> snd_hda_intel
>>>>>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm
>>>>>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp
>>>>>> ext4
>>>>>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod
>>>>>> pata_acpi
>>>>>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>>>>>> dm_region_hash dm_log dm_mod
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3
>>>>>> Not
>>>>>> tainted 4.8.4 #2
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology
>>>>>> Co.,
>>>>>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48
>>>>>> ffffffff812e88e9 ffffffff8130753e
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8
>>>>>> 0000000000000000 ffff880092b83b98
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952
>>>>>> 0000002100000046 ffffffff8101eae8
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>]
>>>>>> dump_stack+0x51/0x78
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] ?
>>>>>> __list_add+0xbe/0xd0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>]
>>>>>> __warn+0xfd/0x120
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ?
>>>>>> __switch_to+0x398/0x7e0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>]
>>>>>> warn_slowpath_fmt+0x49/0x50
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>]
>>>>>> __list_add+0xbe/0xd0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>]
>>>>>> move_linked_works+0x62/0x90
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>]
>>>>>> process_one_work+0x25c/0x4e0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>>> schedule+0x40/0xb0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>]
>>>>>> worker_thread+0x16d/0x520
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
>>>>>> __schedule+0x2fd/0x6a0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>>>> default_wake_function+0x12/0x20
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>>>> __wake_up_common+0x56/0x90
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>>> maybe_create_worker+0x110/0x110
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>>> schedule+0x40/0xb0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>>> maybe_create_worker+0x110/0x110
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>]
>>>>>> kthread+0xcc/0xf0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>>>> schedule_tail+0x1e/0xc0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>>>> ret_from_fork+0x1f/0x40
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>>>> kthread_freezable_should_stop+0x70/0x70
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f464 ]---
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at
>>>>>> lib/list_debug.c:36 __list_add+0x9c/0xd0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: list_add double add:
>>>>>> new=ffff880118ada4c0, prev=ffff880118ada4c0, next=ffff88009e83b330.
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>>>>>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>>>>>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat
>>>>>> ebtables
>>>>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>>>>>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4
>>>>>> it87
>>>>>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>>>>>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>>>>>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>>>>>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>>>>>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx
>>>>>> raid6_pq
>>>>>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii
>>>>>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic
>>>>>> snd_hda_intel
>>>>>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm
>>>>>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp
>>>>>> ext4
>>>>>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod
>>>>>> pata_acpi
>>>>>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>>>>>> dm_region_hash dm_log dm_mod
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3
>>>>>> Tainted: G        W       4.8.4 #2
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology
>>>>>> Co.,
>>>>>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48
>>>>>> ffffffff812e88e9 ffffffff8130751c
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8
>>>>>> 0000000000000000 ffff880092b83b98
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952
>>>>>> 0000002400000046 ffffffff8101eae8
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace:
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>]
>>>>>> dump_stack+0x51/0x78
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] ?
>>>>>> __list_add+0x9c/0xd0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>]
>>>>>> __warn+0xfd/0x120
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ?
>>>>>> __switch_to+0x398/0x7e0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>]
>>>>>> warn_slowpath_fmt+0x49/0x50
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>]
>>>>>> __list_add+0x9c/0xd0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>]
>>>>>> move_linked_works+0x62/0x90
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>]
>>>>>> process_one_work+0x25c/0x4e0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>>> schedule+0x40/0xb0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>]
>>>>>> worker_thread+0x16d/0x520
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
>>>>>> __schedule+0x2fd/0x6a0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>>>> default_wake_function+0x12/0x20
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>>>> __wake_up_common+0x56/0x90
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>>> maybe_create_worker+0x110/0x110
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>>> schedule+0x40/0xb0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>>> maybe_create_worker+0x110/0x110
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>]
>>>>>> kthread+0xcc/0xf0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>>>> schedule_tail+0x1e/0xc0
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>>>> ret_from_fork+0x1f/0x40
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>>>> kthread_freezable_should_stop+0x70/0x70
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f465 ]---
>>>>>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop
>>>>>> down - seconds remaining 230.
>>>>>> Oct 24 00:41:33 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop
>>>>>> down - seconds remaining 152.
>>>>>>
>>>>>>
>>>>>
>>>>> Mmmm.  Could be a side effect of the target-core regression, but not
>>>>> completely sure..
>>>>>
>>>>> Adding QLOGIC folks CC'.
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>> linux-scsi" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>
>> Adding Anil CC'
>>
>>>>
>>>> Hey Nicholas,
>>>>
>>>>
>>>>> At your earliest convenience, please verify the patch using v4.8.y
>>>>> with
>>>>> the above ABORT_TASK + shutdown scenario.
>>>>>
>>>>> Also, it would be helpful to understand why this ESX FC host is
>>>>> generating ABORT_TASKs.
>>>>>
>>>>> Eg: Is ABORT_TASK generated due to FC target response packet loss..?
>>>>> Or due to target backend I/O latency, that ultimately triggers FC host
>>>>> side timeouts...?
>>>>
>>>>
>>>> Here is where it gets interesting, and speaks to your thought above.
>>>> Take for example this log snippet
>>>> (http://microdevsys.com/linux-lio/messages-recent):
>>>>
>>>> Oct 23 22:12:51 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Jan  7 00:36:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>> successfully started
>>>> Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
>>>> Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 1 rtc interrupts
>>>> Oct 23 22:15:02 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Oct 23 22:15:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Jan  7 00:38:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>> successfully started
>>>> Oct 23 22:16:29 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
>>>> Oct 23 22:17:30 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Jan  7 00:40:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>> successfully started
>>>> Oct 23 22:18:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>>>> task_tag: 1195032
>>>> Oct 23 22:18:33 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Sending
>>>> TMR_FUNCTION_COMPLETE for ref_tag: 1195032
>>>> Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>>>> task_tag: 1122276
>>>> Oct 23 22:19:35 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Jan  7 00:42:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>> successfully started
>>>> Oct 23 22:20:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Oct 23 22:21:07 mbpc-pc kernel: INFO: task kworker/u16:8:308 blocked
>>>> for
>>>> more than 120 seconds.
>>>> Oct 23 22:21:07 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>> Oct 23 22:21:07 mbpc-pc kernel: "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Oct 23 22:21:07 mbpc-pc kernel: kworker/u16:8   D
>>>> ffff880111b8fa18     0
>>>>   308      2 0x00000000
>>>> Oct 23 22:21:07 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>>> [target_core_mod]
>>>> Oct 23 22:21:07 mbpc-pc kernel: ffff880111b8fa18 0000000000000400
>>>> ffff880112180480 ffff880111b8f998
>>>> Oct 23 22:21:07 mbpc-pc kernel: ffff88011107a380 ffffffff81f99ca0
>>>> ffffffff81f998ef ffff880100000000
>>>> Oct 23 22:21:07 mbpc-pc kernel: ffffffff812f27d9 0000000000000000
>>>> ffffe8ffffcda000 ffff880000000000
>>>> Oct 23 22:21:07 mbpc-pc kernel: Call Trace:
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff812f27d9>] ?
>>>> number+0x2e9/0x310
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080169>] ?
>>>> start_flush_work+0x49/0x180
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>> schedule_timeout+0x9c/0xe0
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810802ba>] ?
>>>> flush_work+0x1a/0x40
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bd15c>] ?
>>>> console_unlock+0x35c/0x380
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>> wait_for_completion+0xc0/0xf0
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>> try_to_wake_up+0x260/0x260
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6f84>]
>>>> __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod]
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bdd1f>] ?
>>>> vprintk_default+0x1f/0x30
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f70c4>]
>>>> transport_wait_for_tasks+0x44/0x60 [target_core_mod]
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f46e2>]
>>>> core_tmr_abort_task+0xf2/0x160 [target_core_mod]
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>>>> target_tmr_work+0x154/0x160 [target_core_mod]
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080639>]
>>>> process_one_work+0x189/0x4e0
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810d060c>] ?
>>>> del_timer_sync+0x4c/0x60
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108131e>] ?
>>>> maybe_create_worker+0x8e/0x110
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>> schedule+0x40/0xb0
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108150d>]
>>>> worker_thread+0x16d/0x520
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>> default_wake_function+0x12/0x20
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>> __wake_up_common+0x56/0x90
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>> schedule+0x40/0xb0
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>> schedule_tail+0x1e/0xc0
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>> ret_from_fork+0x1f/0x40
>>>> Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>> kthread_freezable_should_stop+0x70/0x70
>>>> Oct 23 22:21:52 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Jan  7 00:44:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>> successfully started
>>>> Oct 23 22:23:03 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>>
>>>>
>>>> And compare it to the following snippet
>>>> (http://microdevsys.com/linux-lio/iostat-tkx-interesting-bit.txt) taken
>>>> from this bigger iostat session
>>>> (http://microdevsys.com/linux-lio/iostat-tkx.txt):
>>>>
>>
>> <SNIP>
>>
>>>>
>>>>
>>>> We can see that /dev/sdf ramps up to 100% utilization starting at
>>>> around 10/23/2016 10:18:18 PM and stays pegged until about the
>>>> 10/23/2016 10:18:42 PM mark, when something occurs and it drops back
>>>> below 100%.
>>>>
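[Editor's aside: the sort of scan used above to spot the saturated member disk can be sketched as a small shell filter. This is a rough sketch only; iostat's column layout varies by sysstat version, and %util is assumed here to be the last field.]

```shell
# flag_busy THRESHOLD: read `iostat -x`-style output on stdin and print
# any sdX device whose %util (assumed to be the last column) meets the
# threshold -- the pattern /dev/sdf showed before the aborts started.
flag_busy() {
  awk -v thr="${1:-95}" '$1 ~ /^sd[a-z]+$/ && $NF + 0 >= thr {print $1, $NF}'
}
```

Usage on the target host would be something like `iostat -tkx 5 | flag_busy 95`.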
>>>> So I checked the array which shows all clean, even across reboots:
>>>>
>>>> [root@mbpc-pc ~]# cat /proc/mdstat
>>>> Personalities : [raid6] [raid5] [raid4]
>>>> md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8]
>>>>       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6]
>>>> [UUUUUU]
>>>>       bitmap: 1/8 pages [4KB], 65536KB chunk
>>>>
>>>> unused devices: <none>
>>>> [root@mbpc-pc ~]#
>>>>
>>>>
>>>> Then I run smartctl across all disks and sure enough /dev/sdf prints
>>>> this:
>>>>
>>>> [root@mbpc-pc ~]# smartctl -A /dev/sdf
>>>> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build)
>>>> Copyright (C) 2002-12 by Bruce Allen,
>>>> http://smartmontools.sourceforge.net
>>>>
>>>> Error SMART Values Read failed: scsi error badly formed scsi parameters
>>>> Smartctl: SMART Read Values failed.
>>>>
>>>> === START OF READ SMART DATA SECTION ===
>>>> [root@mbpc-pc ~]#
>>>>
>>>> So it would appear we found the root cause: a bad disk.  True, the disk
>>>> is bad and I'll be replacing it; however, even with a degraded disk
>>>> (checking now) the array functions just fine and I have no data loss.
>>>> I only lost 1 disk.  I would have to lose 3 to get a catastrophic
>>>> failure on this RAID6:
>>>>
>>>> [root@mbpc-pc ~]# cat /proc/mdstat
>>>> Personalities : [raid6] [raid5] [raid4]
>>>> md0 : active raid6 sdf[6](F) sda[5] sde[8] sdb[7] sdd[3] sdc[1]
>>>>       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5]
>>>> [UUUU_U]
>>>>       bitmap: 6/8 pages [24KB], 65536KB chunk
>>>>
>>>> unused devices: <none>
>>>> [root@mbpc-pc ~]# mdadm --detail /dev/md0
>>>> /dev/md0:
>>>>         Version : 1.2
>>>>   Creation Time : Mon Mar 26 00:06:24 2012
>>>>      Raid Level : raid6
>>>>      Array Size : 3907045632 (3726.05 GiB 4000.81 GB)
>>>>   Used Dev Size : 976761408 (931.51 GiB 1000.20 GB)
>>>>    Raid Devices : 6
>>>>   Total Devices : 6
>>>>     Persistence : Superblock is persistent
>>>>
>>>>   Intent Bitmap : Internal
>>>>
>>>>     Update Time : Tue Oct 25 00:31:13 2016
>>>>           State : clean, degraded
>>>>  Active Devices : 5
>>>> Working Devices : 5
>>>>  Failed Devices : 1
>>>>   Spare Devices : 0
>>>>
>>>>          Layout : left-symmetric
>>>>      Chunk Size : 64K
>>>>
>>>>            Name : mbpc:0
>>>>            UUID : 2f36ac48:5e3e4c54:72177c53:bea3e41e
>>>>          Events : 118368
>>>>
>>>>     Number   Major   Minor   RaidDevice State
>>>>        8       8       64        0      active sync   /dev/sde
>>>>        1       8       32        1      active sync   /dev/sdc
>>>>        7       8       16        2      active sync   /dev/sdb
>>>>        3       8       48        3      active sync   /dev/sdd
>>>>        8       0        0        8      removed
>>>>        5       8        0        5      active sync   /dev/sda
>>>>
>>>>        6       8       80        -      faulty   /dev/sdf
>>>> [root@mbpc-pc ~]#
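[Editor's aside: the per-member smartctl sweep described above can be scripted. A minimal sketch follows; the /proc/mdstat parsing assumes the one-line member-list format shown above, and the smartctl invocation in the comment is illustrative.]

```shell
# md_members MDNAME: print the member devices of an md array, parsed
# from /proc/mdstat-style input on stdin (strips the [N] and (F)
# suffixes from each member name).
md_members() {
  awk -v md="$1" '$1 == md {for (i = 5; i <= NF; i++) {sub(/\[.*/, "", $i); print "/dev/" $i}}'
}

# Sweep each member and report any disk whose SMART data cannot be
# read, as /dev/sdf's could not:
#   for dev in $(md_members md0 < /proc/mdstat); do
#     smartctl -A "$dev" > /dev/null 2>&1 || echo "SMART read failed: $dev"
#   done
```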
>>>>
>>>> Last night I cut power to the /dev/sdf disk to spin it down, then
>>>> removed it and reinserted it.  The array resynced without issue;
>>>> however, the smartctl -A command still failed on it.  Today I checked
>>>> and bad blocks had been recorded on the disk, and the array has since
>>>> removed /dev/sdf (per above).  Also, I have to say that these ESXi
>>>> hosts worked in this configuration, without any hiccup, for about 4
>>>> months.  No LUN failure on the ESXi side.  I haven't changed the LUN
>>>> in that time (had no reason to do so).
>>>>
>>>> So now here's the real question I have.  Why would the array continue
>>>> to function as intended with only one disk failure, yet the QLogic /
>>>> Target drivers stop and error out?  The RAID6 (software) array is what
>>>> should care about the failure, and it should handle it.  The QLogic /
>>>> Target drivers shouldn't really be impacted much (aside from read
>>>> speed, maybe) by a disk failing inside the array.  That would be my
>>>> thinking.  The Target / QLogic software seems to have picked up on a
>>>> failure ahead of the software RAID6 detecting it.  I've had this RAID6
>>>> for over 6 years now.  Aside from the occasional disk replacement, it
>>>> has been quite rock solid.
>>
>> The earlier hung task warnings after ABORT_TASK w/ TMR_FUNCTION_COMPLETE
>> and after explicit configfs shutdown are likely due to the missing
>> SCF_ACK_KREF bit assignment.  Note the bug is specific to high backend
>> I/O latency with v4.1+ code, so you'll want to include the fix in all
>> future builds.
>>
>> AFAICT thus far the list corruption bug reported here and also from Anil
>> & Co looks like a separate bug using tcm_qla2xxx ports.
>>
>>>>
>>>> So anyway, I added the fix you pointed out to the 4.8.4 kernel and
>>>> recompiled.  I restarted it with the RAID6 degraded as it is.  All
>>>> mounted fine and I checked the LUNs from the ESXi side:
>>>>
>>>> [root@mbpc-pc ~]# /etc/init.d/target start
>>>> The Linux SCSI Target is already stopped                   [  OK  ]
>>>> [info] The Linux SCSI Target looks properly installed.
>>>> The configfs filesystem was not mounted, consider adding it[WARNING]
>>>> [info] Loaded core module target_core_mod.
>>>> [info] Loaded core module target_core_pscsi.
>>>> [info] Loaded core module target_core_iblock.
>>>> [info] Loaded core module target_core_file.
>>>> Failed to load fabric module ib_srpt                       [WARNING]
>>>> Failed to load fabric module tcm_usb_gadget                [WARNING]
>>>> [info] Loaded fabric module tcm_loop.
>>>> [info] Loaded fabric module tcm_fc.
>>>> Failed to load fabric module vhost_scsi                    [WARNING]
>>>> [info] Loaded fabric module tcm_qla2xxx.
>>>> Failed to load fabric module iscsi_target_mod              [WARNING]
>>>> [info] Loading config from /etc/target/scsi_target.lio, this may take
>>>> several minutes for FC adapters.
>>>> [info] Loaded /etc/target/scsi_target.lio.
>>>> Started The Linux SCSI Target                              [  OK  ]
>>>> [root@mbpc-pc ~]#
>>>>
>>>>
>>>> Enabled the brocade ports:
>>>>
>>>>
>>>>  18  18   011200   id    N4   No_Light    FC
>>>>  19  19   011300   id    N4   No_Sync     FC  Disabled (Persistent)
>>>>  20  20   011400   id    N4   No_Light    FC
>>>>  21  21   011500   id    N4   No_Light    FC
>>>>  22  22   011600   id    N4   No_Light    FC
>>>>  23  23   011700   id    N4   No_Light    FC  Disabled (Persistent)
>>>>  24  24   011800   --    N4   No_Module   FC  (No POD License) Disabled
>>>>  25  25   011900   --    N4   No_Module   FC  (No POD License) Disabled
>>>>  26  26   011a00   --    N4   No_Module   FC  (No POD License) Disabled
>>>>  27  27   011b00   --    N4   No_Module   FC  (No POD License) Disabled
>>>>  28  28   011c00   --    N4   No_Module   FC  (No POD License) Disabled
>>>>  29  29   011d00   --    N4   No_Module   FC  (No POD License) Disabled
>>>>  30  30   011e00   --    N4   No_Module   FC  (No POD License) Disabled
>>>>  31  31   011f00   --    N4   No_Module   FC  (No POD License) Disabled
>>>> sw0:admin> portcfgpersistentenable 19
>>>> sw0:admin> portcfgpersistentenable 23
>>>> sw0:admin> date
>>>> Tue Oct 25 04:03:42 UTC 2016
>>>> sw0:admin>
>>>>
>>>> And still after 30 minutes, there is no failure.  This run includes the
>>>> fix you asked me to add
>>>> (https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e).
>>>> If everything works, I will revert the patch and see if I can
>>>> reproduce the issue.  If I can reproduce it, then the cause may not
>>>> have been the disk after all, but the missing patch.  I'll keep you
>>>> posted on that when I get a new disk tomorrow.
>>>>
>>>>
>>>> Right now this is a POC setup so I have lots of room to experiment.
>>>>
>>>
>>> Hey Nicholas,
>>>
>>> I've done some testing up till now.  With or without the patch above, as
>>> long as the faulty disk is removed from the RAID 6 software array,
>>> everything works fine with the Target Driver and ESXi hosts.
>>
>> Thanks again for the extra debug + feedback, and confirming the earlier
>> hung task warnings with md disk failure.
>>
>>>   This is
>>> even on a degraded array:
>>>
>>> [root@mbpc-pc ~]# cat /proc/mdstat
>>> Personalities : [raid6] [raid5] [raid4]
>>> md0 : active raid6 sdc[1] sda[5] sdd[3] sde[8] sdb[7]
>>>        3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5]
>>> [UUUU_U]
>>>        bitmap: 6/8 pages [24KB], 65536KB chunk
>>>
>>> unused devices: <none>
>>> [root@mbpc-pc ~]#
>>>
>>> So at the moment, the data points to the single failed disk (/dev/sdf)
>>> as causing the Target drivers or QLogic cards to throw an exception.
>>>
>>> Tomorrow I will insert the failed disk back in to see if a) the array
>>> takes it back, and b) it causes a failure with the patch applied.
>>>
>>> Looks like the failed disk /dev/sdf was limping along for months and
>>> until I removed the power, it didn't collapse on itself.
>>>
>>
>> AFAICT, the list corruption observed is a separate bug from the hung
>> tasks during ABORT_TASK w/ TMR_COMPLETE_FUNCTION with explicit target
>> shutdown.
>>
>
> Correct, I will be including the fix either way but will take some time
> to test out if I can reproduce the failure by reinserting this bad disk
> then a new disk.  I want to see if this hang will be reproduced by doing
> these actions to the RAID6 Software Raid to determine if this failure is
> isolated to a particularly bad disk or any sort of add or remove disk
> actions in a RAID6.
>



1) As soon as I re-add the bad disk without the patch, I lose the LUN 
off the ESXi hosts.  Same thing happens with the patch.  No change.  The 
disk is pulling things down.  Worse, the kernel panics and locks me out 
of the system (http://microdevsys.com/linux-lio/messages-oct-27-2016.txt):


Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: 
Intermediate CTIO received (status 6)
Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending 
TERM EXCH CTIO (ha=ffff88010ecb0000)
Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20: 
qla_target(0): terminating exchange for aborted cmd=ffff88009af9f488 
(se_cmd=ffff88009af9f488, tag=1131312)
Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending 
TERM EXCH CTIO (ha=ffff88010ecb0000)
Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20: 
qla_target(0): Terminating cmd ffff88009af9f488 with incorrect state 2
Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending 
TERM EXCH CTIO (ha=ffff88010ecb0000)
Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: 
Intermediate CTIO received (status 6)
Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: 
Intermediate CTIO received (status 8)
Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending 
TERM EXCH CTIO (ha=ffff88010ecb0000)
Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: 
Intermediate CTIO received (status 6)
Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: 
qlt_free_cmd: se_cmd[ffff88009af9f488] ox_id 0673
Oct 28 01:19:59 mbpc-pc kernel: ------------[ cut here ]------------
Oct 28 01:19:59 mbpc-pc kernel: kernel BUG at 
drivers/scsi/qla2xxx/qla_target.c:3319!
Oct 28 01:19:59 mbpc-pc kernel: invalid opcode: 0000 [#1] SMP
Oct 28 01:19:59 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc 
tcm_loop target_core_file target_core_iblock target_core_pscsi 
target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM 
iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87 
hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev 
parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt 
ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse 
vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456 
async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq 
libcrc32c joydev sg serio_raw e1000 r8169 mii kvm_amd kvm 
snd_hda_codec_realtek snd_hda_codec_generic irqbypass pcspkr k10temp 
snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq 
snd_seq_device snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core wmi 
shpchp acpi_cpufreq ext4 mbcache jbd2 qla2xxx scsi_transport_fc floppy 
firewire_ohci f
Oct 28 01:19:59 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi 
ata_generic pata_jmicron ahci libahci usb_storage dm_mirror 
dm_region_hash dm_log dm_mod
Oct 28 01:19:59 mbpc-pc kernel: CPU: 0 PID: 296 Comm: kworker/u16:6 Not 
tainted 4.8.4 #2
Oct 28 01:19:59 mbpc-pc kernel: Hardware name: Gigabyte Technology Co., 
Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
Oct 28 01:19:59 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work 
[target_core_mod]
Oct 28 01:19:59 mbpc-pc kernel: task: ffff8801109623c0 task.stack: 
ffff880110968000
Oct 28 01:19:59 mbpc-pc kernel: RIP: 0010:[<ffffffffa0143810>] 
[<ffffffffa0143810>] qlt_free_cmd+0x160/0x180 [qla2xxx]
Oct 28 01:19:59 mbpc-pc kernel: RSP: 0018:ffff88011096bb18  EFLAGS: 00010202
Oct 28 01:19:59 mbpc-pc kernel: RAX: 0000000000000051 RBX: 
ffff88009af9f488 RCX: 0000000000000006
Oct 28 01:19:59 mbpc-pc kernel: RDX: 0000000000000007 RSI: 
0000000000000007 RDI: ffff88011fc0cb40
Oct 28 01:19:59 mbpc-pc kernel: RBP: ffff88011096bb48 R08: 
0000000000000000 R09: ffffffff81fa4765
Oct 28 01:19:59 mbpc-pc kernel: R10: 0000000000000074 R11: 
0000000000000002 R12: ffff8801137770c0
Oct 28 01:19:59 mbpc-pc kernel: R13: ffff8800a126eaf0 R14: 
ffff88009af9f510 R15: 0000000000000296
Oct 28 01:19:59 mbpc-pc kernel: FS:  0000000000000000(0000) 
GS:ffff88011fc00000(0000) knlGS:0000000000000000
Oct 28 01:19:59 mbpc-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Oct 28 01:19:59 mbpc-pc kernel: CR2: 00007f8eef58d000 CR3: 
00000000cabad000 CR4: 00000000000006f0
Oct 28 01:19:59 mbpc-pc kernel: Stack:
Oct 28 01:19:59 mbpc-pc kernel: ffff880000000673 ffff88009af9f488 
ffff8800a126eaf0 ffff88009af9f59c
Oct 28 01:19:59 mbpc-pc kernel: ffff88009af9f488 ffff8800a126eaf0 
ffff88011096bb58 ffffffffa027f7f4
Oct 28 01:19:59 mbpc-pc kernel: ffff88011096bbb8 ffffffffa08f758c 
ffff88009af9f510 ffff88009af9f488
Oct 28 01:19:59 mbpc-pc kernel: Call Trace:
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa027f7f4>] 
tcm_qla2xxx_release_cmd+0x14/0x30 [tcm_qla2xxx]
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f758c>] 
target_release_cmd_kref+0xac/0x110 [target_core_mod]
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f7627>] 
target_put_sess_cmd+0x37/0x70 [target_core_mod]
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f46f7>] 
core_tmr_abort_task+0x107/0x160 [target_core_mod]
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f6aa4>] 
target_tmr_work+0x154/0x160 [target_core_mod]
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81080639>] 
process_one_work+0x189/0x4e0
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810d060c>] ? 
del_timer_sync+0x4c/0x60
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108131e>] ? 
maybe_create_worker+0x8e/0x110
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108150d>] 
worker_thread+0x16d/0x520
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810923f2>] ? 
default_wake_function+0x12/0x20
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
__wake_up_common+0x56/0x90
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ? 
maybe_create_worker+0x110/0x110
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8109130e>] ? 
schedule_tail+0x1e/0xc0
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085f20>] ? 
kthread_freezable_should_stop+0x70/0x70
Oct 28 01:19:59 mbpc-pc kernel: Code: 0d 00 00 48 c7 c7 00 2a 16 a0 e8 
ac 32 f2 e0 48 83 c4 18 5b 41 5c 41 5d c9 c3 48 8b bb 90 02 00 00 e8 85 
da 07 e1 e9 30 ff ff ff <0f> 0b eb fe 0f 0b 66 2e 0f 1f 84 00 00 00 00 
00 eb f4 66 66 66
Oct 28 01:19:59 mbpc-pc kernel: RIP  [<ffffffffa0143810>] 
qlt_free_cmd+0x160/0x180 [qla2xxx]
Oct 28 01:19:59 mbpc-pc kernel: RSP <ffff88011096bb18>
Oct 28 01:19:59 mbpc-pc kernel: ---[ end trace 2551bf47a19dbe2e ]---



2) This works with a new disk that's just been inserted.  No issues.




The kernel had the patch in both scenarios.  So it appears we can't 
function on a degraded array that loses 1 (RAID 5/6) or 2 (RAID 6) 
disks at the moment, even though the array itself is fine.  Perhaps 
it's the nature of the failed disk.
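For anyone repeating this exercise, the fail/remove/re-add cycle against the md array can be driven with something like the following sketch. Commands are printed by default and only executed with RUN=1, since they act on a live array; the flags are standard mdadm but should be verified against your mdadm version.

```shell
# md_cycle_disk MD DEV: fail, remove, then re-add a member disk to
# reproduce the degraded-array scenario.  Dry-run by default; set
# RUN=1 to actually execute against the array.
md_cycle_disk() {
  md="$1"; dev="$2"
  for cmd in "mdadm --manage $md --fail $dev" \
             "mdadm --manage $md --remove $dev" \
             "mdadm --manage $md --re-add $dev"; do
    if [ "${RUN:-0}" = "1" ]; then $cmd; else echo "$cmd"; fi
  done
}
```

E.g. `RUN=1 md_cycle_disk /dev/md0 /dev/sdf` on the target host, after confirming the array can tolerate the failure.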

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.



* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-28  6:01               ` TomK
@ 2016-10-29  7:50                 ` Nicholas A. Bellinger
  2016-10-29 18:10                   ` TomK
  0 siblings, 1 reply; 14+ messages in thread
From: Nicholas A. Bellinger @ 2016-10-29  7:50 UTC (permalink / raw)
  To: TomK
  Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali,
	Gurumurthy, Anil

Hi TomK & Co,

On Fri, 2016-10-28 at 02:01 -0400, TomK wrote:
> On 10/26/2016 8:08 AM, TomK wrote:
> > On 10/26/2016 3:20 AM, Nicholas A. Bellinger wrote:

<SNIP>

> >>>>
> >>>> Right now this is a POC setup so I have lots of room to experiment.
> >>>>
> >>>
> >>> Hey Nicholas,
> >>>
> >>> I've done some testing up till now.  With or without the patch above, as
> >>> long as the faulty disk is removed from the RAID 6 software array,
> >>> everything works fine with the Target Driver and ESXi hosts.
> >>
> >> Thanks again for the extra debug + feedback, and confirming the earlier
> >> hung task warnings with md disk failure.
> >>
> >>>   This is
> >>> even on a degraded array:
> >>>
> >>> [root@mbpc-pc ~]# cat /proc/mdstat
> >>> Personalities : [raid6] [raid5] [raid4]
> >>> md0 : active raid6 sdc[1] sda[5] sdd[3] sde[8] sdb[7]
> >>>        3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5]
> >>> [UUUU_U]
> >>>        bitmap: 6/8 pages [24KB], 65536KB chunk
> >>>
> >>> unused devices: <none>
> >>> [root@mbpc-pc ~]#
> >>>
> >>> So at the moment, the data points to the single failed disk (/dev/sdf)
> >>> as causing the Target drivers or QLogic cards to throw an exception.
> >>>
> >>> Tomorrow I will insert the failed disk back in to see if the a) array
> >>> takes it back, b) it causes a failure with the patch applied.
> >>>
> >>> Looks like the failed disk /dev/sdf was limping along for months and
> >>> until I removed the power, it didn't collapse on itself.
> >>>
> >>
> >> AFAICT, the list corruption observed is a separate bug from the hung
> >> tasks during ABORT_TASK w/ TMR_COMPLETE_FUNCTION with explicit target
> >> shutdown.
> >>
> >
> > Correct, I will be including the fix either way but will take some time
> > to test out if I can reproduce the failure by reinserting this bad disk
> > then a new disk.  I want to see if this hang will be reproduced by doing
> > these actions to the RAID6 Software Raid to determine if this failure is
> > isolated to a particularly bad disk or any sort of add or remove disk
> > actions in a RAID6.
> >
> 
> 
> 
> 1) As soon as I re-add the bad disk without the patch, I lose the LUN 
> off the ESXi hosts.  Same thing happens with the patch.  No change.  The 
> disk is pulling things down.  Worse, the kernel panics and locks me out 
> of the system (http://microdevsys.com/linux-lio/messages-oct-27-2016.txt) :
> 

So after grokking these logs: the point at which the failing ata6
scsi_device holds outstanding I/O beyond the ESX FC host-side timeouts
manifests itself as ABORT_TASK tag=1122276:

Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077e
Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077f
Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 0780
Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0
Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80f:20: qla_target(0): task abort (tag=1122276)
Oct 28 00:42:57 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1122276

and eventually:

Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0
Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f881:20: unable to find cmd in driver or LIO for tag 0x111fe4
Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f854:20: qla_target(0): __qlt_24xx_handle_abts() failed: -2
Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a40, status=4
Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a80, status=0
Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0

The outstanding se_cmd with tag=1122276 for ata6 completes back to
target-core, allowing ABORT_TASK + TMR_FUNCTION_COMPLETE status to
progress:

Oct 28 00:44:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Oct 28 00:44:29 mbpc-pc kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 28 00:44:29 mbpc-pc kernel: ata6.00: configured for UDMA/133
Oct 28 00:44:29 mbpc-pc kernel: ata6.00: retrying FLUSH 0xea Emask 0x4
Oct 28 00:44:29 mbpc-pc kernel: ata6.00: device reported invalid CHS sector 0
Oct 28 00:44:29 mbpc-pc kernel: ata6: EH complete
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20: qla_target(0): terminating exchange for aborted cmd=ffff880099392fa8 (se_cmd=ffff880099392fa8, tag=1122276)
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20: qla_target(0): Terminating cmd ffff880099392fa8 with incorrect state 2
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 6)
Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1122276
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 8)
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f0f0) status 0x0 state 0x0
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f300, status=0
Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122996
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f340) status 0x2 state 0x0
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f550, status=2
Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122204
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f590) status 0x2 state 0x0
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f7a0, status=2
Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122240
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f7e0) status 0x2 state 0x0
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f124280, status=0
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f1242c0, status=0
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f9f0, status=2

and continues with TMR_TASK_DOES_NOT_EXIST for other outstanding tags...

Until target-core session release occurs for the two tcm_qla2xxx host ports:

Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff880099a80a80 / sess ffff880111665cc0 from port 21:03:00:1b:32:74:b6:cb loop_id 0x03 s_id 01:05:00 logout 1 keep 1 els_logo 0
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff8800993849f8] ox_id 0773
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff88009e8f9040 / sess ffff8800abe95480 from port 50:01:43:80:16:77:99:38 loop_id 0x02 s_id 01:00:00 logout 1 keep 1 els_logo 0
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0.
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=2 loop-id=3 portid=010500.
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=3 loop-id=2 portid=010000.
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping.
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)

From there, ELS with unexpected NOTIFY_ACK received start to occur:

Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0.
Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping.
Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received

ELS packets for the same two host ports continue:

Oct 28 00:46:40 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05b1
Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a9977d8]
Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a9977d8] ox_id 05b1
Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:05:01 ELS opcode: 0x03
Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff880111665cc0 [0] wwn 21:03:00:1b:32:74:b6:cb with PLOGI ACK to wwn 21:03:00:1b:32:74:b6:cb s_id 01:05:00, ref=1
Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received

And eventually, the 360 second hung task timeout warnings appear:

Oct 28 00:49:48 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05ba
Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05ba
Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05bb
Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05bb
Jan  3 15:34:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon successfully started
Oct 28 00:50:33 mbpc-pc kernel: INFO: task kworker/0:2:31731 blocked for more than 360 seconds.
Oct 28 00:50:33 mbpc-pc kernel:      Not tainted 4.8.4 #2
Oct 28 00:50:33 mbpc-pc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 28 00:50:33 mbpc-pc kernel: kworker/0:2     D ffff88011affb968     0 31731      2 0x00000080
Oct 28 00:50:33 mbpc-pc kernel: Workqueue: events qlt_free_session_done [qla2xxx]
Oct 28 00:50:33 mbpc-pc kernel: ffff88011affb968 ffff88011affb8d8 ffff880013514940 0000000000000006
Oct 28 00:50:33 mbpc-pc kernel: ffff8801140fe880 ffffffff81f998c2 0000000000000000 ffff880100000000
Oct 28 00:50:33 mbpc-pc kernel: ffffffff810bdaaa ffffffff00000000 ffffffff00000051 ffff880100000000
Oct 28 00:50:33 mbpc-pc kernel: Call Trace:
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810bdaaa>] ? vprintk_emit+0x27a/0x4d0
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162e7ec>] schedule_timeout+0x9c/0xe0
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162cfa0>] wait_for_completion+0xc0/0xf0
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810923e0>] ? try_to_wake_up+0x260/0x260
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81161d73>] ? mempool_free+0x33/0x90
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa08f76ad>] target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa00e7188>] ? qla2x00_post_work+0x58/0x70 [qla2xxx]
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa0286f69>] tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa01447e9>] qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81080639>] process_one_work+0x189/0x4e0
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8107d915>] ? wq_worker_waking_up+0x15/0x70
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109eb59>] ? idle_balance+0x79/0x290
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8108150d>] worker_thread+0x16d/0x520
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162bb3d>] ? __schedule+0x2fd/0x6a0
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109130e>] ? schedule_tail+0x1e/0xc0
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085f20>] ? kthread_freezable_should_stop+0x70/0x70
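As an aside on the 360-second figure above: the first report in this thread showed the stock 120-second watchdog, suggesting the period was raised at some point. The knob is the same hung_task_timeout_secs sysctl named in the kernel's own hint; a configuration sketch follows (values illustrative — tuning only quiets the warning, the stalled kworker remains stuck):

```shell
# Watchdog that produced the "blocked for more than N seconds" lines.
# The kernel's own hint shows that writing 0 disables it entirely.
cat /proc/sys/kernel/hung_task_timeout_secs        # current period (seconds)

# Raise it while debugging (illustrative; quiets the warning only):
#   sysctl -w kernel.hung_task_timeout_secs=360
# Persist across reboots in /etc/sysctl.conf:
#   kernel.hung_task_timeout_secs = 360
```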

> 
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: 
> Intermediate CTIO received (status 6)
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending 
> TERM EXCH CTIO (ha=ffff88010ecb0000)
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20: 
> qla_target(0): terminating exchange for aborted cmd=ffff88009af9f488 
> (se_cmd=ffff88009af9f488, tag=1131312)
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending 
> TERM EXCH CTIO (ha=ffff88010ecb0000)
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20: 
> qla_target(0): Terminating cmd ffff88009af9f488 with incorrect state 2
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending 
> TERM EXCH CTIO (ha=ffff88010ecb0000)
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: 
> Intermediate CTIO received (status 6)
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: 
> Intermediate CTIO received (status 8)
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending 
> TERM EXCH CTIO (ha=ffff88010ecb0000)
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: 
> Intermediate CTIO received (status 6)
> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: 
> qlt_free_cmd: se_cmd[ffff88009af9f488] ox_id 0673
> Oct 28 01:19:59 mbpc-pc kernel: ------------[ cut here ]------------
> Oct 28 01:19:59 mbpc-pc kernel: kernel BUG at 
> drivers/scsi/qla2xxx/qla_target.c:3319!
> Oct 28 01:19:59 mbpc-pc kernel: invalid opcode: 0000 [#1] SMP
> Oct 28 01:19:59 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc 
> tcm_loop target_core_file target_core_iblock target_core_pscsi 
> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM 
> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87 
> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev 
> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt 
> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse 
> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456 
> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq 
> libcrc32c joydev sg serio_raw e1000 r8169 mii kvm_amd kvm 
> snd_hda_codec_realtek snd_hda_codec_generic irqbypass pcspkr k10temp 
> snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq 
> snd_seq_device snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core wmi 
> shpchp acpi_cpufreq ext4 mbcache jbd2 qla2xxx scsi_transport_fc floppy 
> firewire_ohci f
> Oct 28 01:19:59 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi 
> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror 
> dm_region_hash dm_log dm_mod
> Oct 28 01:19:59 mbpc-pc kernel: CPU: 0 PID: 296 Comm: kworker/u16:6 Not 
> tainted 4.8.4 #2
> Oct 28 01:19:59 mbpc-pc kernel: Hardware name: Gigabyte Technology Co., 
> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
> Oct 28 01:19:59 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work 
> [target_core_mod]
> Oct 28 01:19:59 mbpc-pc kernel: task: ffff8801109623c0 task.stack: 
> ffff880110968000
> Oct 28 01:19:59 mbpc-pc kernel: RIP: 0010:[<ffffffffa0143810>] 
> [<ffffffffa0143810>] qlt_free_cmd+0x160/0x180 [qla2xxx]
> Oct 28 01:19:59 mbpc-pc kernel: RSP: 0018:ffff88011096bb18  EFLAGS: 00010202
> Oct 28 01:19:59 mbpc-pc kernel: RAX: 0000000000000051 RBX: 
> ffff88009af9f488 RCX: 0000000000000006
> Oct 28 01:19:59 mbpc-pc kernel: RDX: 0000000000000007 RSI: 
> 0000000000000007 RDI: ffff88011fc0cb40
> Oct 28 01:19:59 mbpc-pc kernel: RBP: ffff88011096bb48 R08: 
> 0000000000000000 R09: ffffffff81fa4765
> Oct 28 01:19:59 mbpc-pc kernel: R10: 0000000000000074 R11: 
> 0000000000000002 R12: ffff8801137770c0
> Oct 28 01:19:59 mbpc-pc kernel: R13: ffff8800a126eaf0 R14: 
> ffff88009af9f510 R15: 0000000000000296
> Oct 28 01:19:59 mbpc-pc kernel: FS:  0000000000000000(0000) 
> GS:ffff88011fc00000(0000) knlGS:0000000000000000
> Oct 28 01:19:59 mbpc-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
> 0000000080050033
> Oct 28 01:19:59 mbpc-pc kernel: CR2: 00007f8eef58d000 CR3: 
> 00000000cabad000 CR4: 00000000000006f0
> Oct 28 01:19:59 mbpc-pc kernel: Stack:
> Oct 28 01:19:59 mbpc-pc kernel: ffff880000000673 ffff88009af9f488 
> ffff8800a126eaf0 ffff88009af9f59c
> Oct 28 01:19:59 mbpc-pc kernel: ffff88009af9f488 ffff8800a126eaf0 
> ffff88011096bb58 ffffffffa027f7f4
> Oct 28 01:19:59 mbpc-pc kernel: ffff88011096bbb8 ffffffffa08f758c 
> ffff88009af9f510 ffff88009af9f488
> Oct 28 01:19:59 mbpc-pc kernel: Call Trace:
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa027f7f4>] 
> tcm_qla2xxx_release_cmd+0x14/0x30 [tcm_qla2xxx]
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f758c>] 
> target_release_cmd_kref+0xac/0x110 [target_core_mod]
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f7627>] 
> target_put_sess_cmd+0x37/0x70 [target_core_mod]
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f46f7>] 
> core_tmr_abort_task+0x107/0x160 [target_core_mod]
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f6aa4>] 
> target_tmr_work+0x154/0x160 [target_core_mod]
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81080639>] 
> process_one_work+0x189/0x4e0
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810d060c>] ? 
> del_timer_sync+0x4c/0x60
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108131e>] ? 
> maybe_create_worker+0x8e/0x110
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108150d>] 
> worker_thread+0x16d/0x520
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810923f2>] ? 
> default_wake_function+0x12/0x20
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810a6f06>] ? 
> __wake_up_common+0x56/0x90
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ? 
> maybe_create_worker+0x110/0x110
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ? 
> maybe_create_worker+0x110/0x110
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8109130e>] ? 
> schedule_tail+0x1e/0xc0
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085f20>] ? 
> kthread_freezable_should_stop+0x70/0x70
> Oct 28 01:19:59 mbpc-pc kernel: Code: 0d 00 00 48 c7 c7 00 2a 16 a0 e8 
> ac 32 f2 e0 48 83 c4 18 5b 41 5c 41 5d c9 c3 48 8b bb 90 02 00 00 e8 85 
> da 07 e1 e9 30 ff ff ff <0f> 0b eb fe 0f 0b 66 2e 0f 1f 84 00 00 00 00 
> 00 eb f4 66 66 66
> Oct 28 01:19:59 mbpc-pc kernel: RIP  [<ffffffffa0143810>] 
> qlt_free_cmd+0x160/0x180 [qla2xxx]
> Oct 28 01:19:59 mbpc-pc kernel: RSP <ffff88011096bb18>
> Oct 28 01:19:59 mbpc-pc kernel: ---[ end trace 2551bf47a19dbe2e ]---
> 

Mmm.

This BUG_ON is signaling that a qla_tgt_cmd descriptor is being freed
while qlt_handle_cmd_for_atio() has queued it for backend execution, but
qla_tgt_cmd->work -> __qlt_do_work() has not yet executed.

> 
> 
> 2) This works with a new disk that's just been inserted.  No issues.
> 
> 

Thanks for verifying this scenario.

> 
> 
> The kernel had the patch in both scenarios.  So it appears we can't 
> function on a degraded array that loses 1 (RAID 5/6) or 2 (RAID 6) 
> disks at the moment, even though the array itself is fine.  Perhaps 
> it's the nature of the failed disk.
> 

AFAICT, the hung task involves ABORT_TASK during tcm_qla2xxx session
reinstatement, when backend I/O latency is high enough that ABORT_TASK
operations escalate into an FC fabric host side session reset.

From the logs alone, it's unclear whether the failing backend ata6 is
leaking I/O (indefinitely) when the hung task warnings happen, but the
preceding ata6 failures + device resets seem to indicate I/O completions
are simply taking a really long time to complete.

Also, it's unclear if the BUG_ON(cmd->cmd_in_wq) in qlt_free_cmd() is a
side effect of the earlier hung task, or a separate tcm_qla2xxx session
reinstatement bug.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-29  7:50                 ` Nicholas A. Bellinger
@ 2016-10-29 18:10                   ` TomK
  2016-10-29 21:44                     ` Nicholas A. Bellinger
  0 siblings, 1 reply; 14+ messages in thread
From: TomK @ 2016-10-29 18:10 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali,
	Gurumurthy, Anil

On 10/29/2016 3:50 AM, Nicholas A. Bellinger wrote:
> Hi TomK & Co,
>
> On Fri, 2016-10-28 at 02:01 -0400, TomK wrote:
>> On 10/26/2016 8:08 AM, TomK wrote:
>>> On 10/26/2016 3:20 AM, Nicholas A. Bellinger wrote:
>
> <SNIP>
>
>>>>>>
>>>>>> Right now this is a POC setup so I have lots of room to experiment.
>>>>>>
>>>>>
>>>>> Hey Nicholas,
>>>>>
>>>>> I've done some testing up till now.  With or without the patch above, as
>>>>> long as the faulty disk is removed from the RAID 6 software array,
>>>>> everything works fine with the Target Driver and ESXi hosts.
>>>>
>>>> Thanks again for the extra debug + feedback, and confirming the earlier
>>>> hung task warnings with md disk failure.
>>>>
>>>>>   This is
>>>>> even on a degraded array:
>>>>>
>>>>> [root@mbpc-pc ~]# cat /proc/mdstat
>>>>> Personalities : [raid6] [raid5] [raid4]
>>>>> md0 : active raid6 sdc[1] sda[5] sdd[3] sde[8] sdb[7]
>>>>>        3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5]
>>>>> [UUUU_U]
>>>>>        bitmap: 6/8 pages [24KB], 65536KB chunk
>>>>>
>>>>> unused devices: <none>
>>>>> [root@mbpc-pc ~]#
>>>>>
>>>>> So at the moment, the data points to the single failed single disk (
>>>>> /dev/sdf ) as causing the Target Drivers or QLogic cards to throw an
>>>>> exception.
>>>>>
>>>>> Tomorrow I will insert the failed disk back in to see if the a) array
>>>>> takes it back, b) it causes a failure with the patch applied.
>>>>>
>>>>> Looks like the failed disk /dev/sdf was limping along for months and
>>>>> until I removed the power, it didn't collapse on itself.
>>>>>
>>>>
>>>> AFAICT, the list corruption observed is a separate bug from the hung
>>>> tasks during ABORT_TASK w/ TMR_COMPLETE_FUNCTION with explicit target
>>>> shutdown.
>>>>
>>>
>>> Correct, I will be including the fix either way but will take some time
>>> to test out if I can reproduce the failure by reinserting this bad disk
>>> then a new disk.  I want to see if this hang will be reproduced by doing
>>> these actions to the RAID6 Software Raid to determine if this failure is
>>> isolated to a particularly bad disk or any sort of add or remove disk
>>> actions in a RAID6.
>>>
>>
>>
>>
>> 1) As soon as I re-add the bad disk without the patch, I lose the LUN
>> off the ESXi hosts.  Same thing happens with the patch.  No change.  The
>> disk is pulling things down.  Worse, the kernel panics and locks me out
>> of the system (http://microdevsys.com/linux-lio/messages-oct-27-2016.txt) :
>>
>
> So after groking these logs, the point when ata6 failing scsi_device is
> holding outstanding I/O beyond ESX FC host side timeouts, manifests
> itself as ABORT_TASK tag=1122276:
>
> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077e
> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077f
> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 0780
> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0
> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80f:20: qla_target(0): task abort (tag=1122276)
> Oct 28 00:42:57 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1122276
>
> and eventually:
>
> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0
> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f881:20: unable to find cmd in driver or LIO for tag 0x111fe4
> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f854:20: qla_target(0): __qlt_24xx_handle_abts() failed: -2
> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a40, status=4
> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a80, status=0
> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0
>
> The outstanding se_cmd with tag=1122276 for ata6 completes back to
> target-core, allowing ABORT_TASK + TMR_FUNCTION_COMPLETE status to
> progress:
>
> Oct 28 00:44:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Oct 28 00:44:29 mbpc-pc kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: configured for UDMA/133
> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: retrying FLUSH 0xea Emask 0x4
> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: device reported invalid CHS sector 0
> Oct 28 00:44:29 mbpc-pc kernel: ata6: EH complete
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20: qla_target(0): terminating exchange for aborted cmd=ffff880099392fa8 (se_cmd=ffff880099392fa8, tag=1122276)
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20: qla_target(0): Terminating cmd ffff880099392fa8 with incorrect state 2
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 6)
> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1122276
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 8)
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f0f0) status 0x0 state 0x0
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f300, status=0
> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122996
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f340) status 0x2 state 0x0
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f550, status=2
> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122204
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f590) status 0x2 state 0x0
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f7a0, status=2
> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122240
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f7e0) status 0x2 state 0x0
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f124280, status=0
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f1242c0, status=0
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f9f0, status=2
>
> and continues with TMR_TASK_DOES_NOT_EXIST for other outstanding tags...
>
> Until target-core session release occurs for the two tcm_qla2xxx host ports:
>
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff880099a80a80 / sess ffff880111665cc0 from port 21:03:00:1b:32:74:b6:cb loop_id 0x03 s_id 01:05:00 logout 1 keep 1 els_logo 0
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff8800993849f8] ox_id 0773
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff88009e8f9040 / sess ffff8800abe95480 from port 50:01:43:80:16:77:99:38 loop_id 0x02 s_id 01:00:00 logout 1 keep 1 els_logo 0
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0.
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=2 loop-id=3 portid=010500.
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=3 loop-id=2 portid=010000.
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping.
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
>
> From there, ELS with unexpected NOTIFY_ACK received start to occur:
>
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0.
> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping.
> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
>
> ELS packets for the same two host ports continue:
>
> Oct 28 00:46:40 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05b1
> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a9977d8]
> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a9977d8] ox_id 05b1
> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:05:01 ELS opcode: 0x03
> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff880111665cc0 [0] wwn 21:03:00:1b:32:74:b6:cb with PLOGI ACK to wwn 21:03:00:1b:32:74:b6:cb s_id 01:05:00, ref=1
> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
>
> And eventually, the 360 second hung task timeout warnings appear:
>
> Oct 28 00:49:48 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05ba
> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05ba
> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05bb
> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05bb
> Jan  3 15:34:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon successfully started
> Oct 28 00:50:33 mbpc-pc kernel: INFO: task kworker/0:2:31731 blocked for more than 360 seconds.
> Oct 28 00:50:33 mbpc-pc kernel:      Not tainted 4.8.4 #2
> Oct 28 00:50:33 mbpc-pc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 28 00:50:33 mbpc-pc kernel: kworker/0:2     D ffff88011affb968     0 31731      2 0x00000080
> Oct 28 00:50:33 mbpc-pc kernel: Workqueue: events qlt_free_session_done [qla2xxx]
> Oct 28 00:50:33 mbpc-pc kernel: ffff88011affb968 ffff88011affb8d8 ffff880013514940 0000000000000006
> Oct 28 00:50:33 mbpc-pc kernel: ffff8801140fe880 ffffffff81f998c2 0000000000000000 ffff880100000000
> Oct 28 00:50:33 mbpc-pc kernel: ffffffff810bdaaa ffffffff00000000 ffffffff00000051 ffff880100000000
> Oct 28 00:50:33 mbpc-pc kernel: Call Trace:
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810bdaaa>] ? vprintk_emit+0x27a/0x4d0
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162e7ec>] schedule_timeout+0x9c/0xe0
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162cfa0>] wait_for_completion+0xc0/0xf0
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810923e0>] ? try_to_wake_up+0x260/0x260
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81161d73>] ? mempool_free+0x33/0x90
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa08f76ad>] target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa00e7188>] ? qla2x00_post_work+0x58/0x70 [qla2xxx]
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa0286f69>] tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa01447e9>] qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81080639>] process_one_work+0x189/0x4e0
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8107d915>] ? wq_worker_waking_up+0x15/0x70
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109eb59>] ? idle_balance+0x79/0x290
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8108150d>] worker_thread+0x16d/0x520
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162bb3d>] ? __schedule+0x2fd/0x6a0
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109130e>] ? schedule_tail+0x1e/0xc0
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085f20>] ? kthread_freezable_should_stop+0x70/0x70
>
>>
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>> Intermediate CTIO received (status 6)
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20:
>> qla_target(0): terminating exchange for aborted cmd=ffff88009af9f488
>> (se_cmd=ffff88009af9f488, tag=1131312)
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20:
>> qla_target(0): Terminating cmd ffff88009af9f488 with incorrect state 2
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>> Intermediate CTIO received (status 6)
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>> Intermediate CTIO received (status 8)
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>> Intermediate CTIO received (status 6)
>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
>> qlt_free_cmd: se_cmd[ffff88009af9f488] ox_id 0673
>> Oct 28 01:19:59 mbpc-pc kernel: ------------[ cut here ]------------
>> Oct 28 01:19:59 mbpc-pc kernel: kernel BUG at
>> drivers/scsi/qla2xxx/qla_target.c:3319!
>> Oct 28 01:19:59 mbpc-pc kernel: invalid opcode: 0000 [#1] SMP
>> Oct 28 01:19:59 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
>> libcrc32c joydev sg serio_raw e1000 r8169 mii kvm_amd kvm
>> snd_hda_codec_realtek snd_hda_codec_generic irqbypass pcspkr k10temp
>> snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq
>> snd_seq_device snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core wmi
>> shpchp acpi_cpufreq ext4 mbcache jbd2 qla2xxx scsi_transport_fc floppy
>> firewire_ohci f
>> Oct 28 01:19:59 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>> dm_region_hash dm_log dm_mod
>> Oct 28 01:19:59 mbpc-pc kernel: CPU: 0 PID: 296 Comm: kworker/u16:6 Not
>> tainted 4.8.4 #2
>> Oct 28 01:19:59 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>> Oct 28 01:19:59 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>> [target_core_mod]
>> Oct 28 01:19:59 mbpc-pc kernel: task: ffff8801109623c0 task.stack:
>> ffff880110968000
>> Oct 28 01:19:59 mbpc-pc kernel: RIP: 0010:[<ffffffffa0143810>]
>> [<ffffffffa0143810>] qlt_free_cmd+0x160/0x180 [qla2xxx]
>> Oct 28 01:19:59 mbpc-pc kernel: RSP: 0018:ffff88011096bb18  EFLAGS: 00010202
>> Oct 28 01:19:59 mbpc-pc kernel: RAX: 0000000000000051 RBX:
>> ffff88009af9f488 RCX: 0000000000000006
>> Oct 28 01:19:59 mbpc-pc kernel: RDX: 0000000000000007 RSI:
>> 0000000000000007 RDI: ffff88011fc0cb40
>> Oct 28 01:19:59 mbpc-pc kernel: RBP: ffff88011096bb48 R08:
>> 0000000000000000 R09: ffffffff81fa4765
>> Oct 28 01:19:59 mbpc-pc kernel: R10: 0000000000000074 R11:
>> 0000000000000002 R12: ffff8801137770c0
>> Oct 28 01:19:59 mbpc-pc kernel: R13: ffff8800a126eaf0 R14:
>> ffff88009af9f510 R15: 0000000000000296
>> Oct 28 01:19:59 mbpc-pc kernel: FS:  0000000000000000(0000)
>> GS:ffff88011fc00000(0000) knlGS:0000000000000000
>> Oct 28 01:19:59 mbpc-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>> 0000000080050033
>> Oct 28 01:19:59 mbpc-pc kernel: CR2: 00007f8eef58d000 CR3:
>> 00000000cabad000 CR4: 00000000000006f0
>> Oct 28 01:19:59 mbpc-pc kernel: Stack:
>> Oct 28 01:19:59 mbpc-pc kernel: ffff880000000673 ffff88009af9f488
>> ffff8800a126eaf0 ffff88009af9f59c
>> Oct 28 01:19:59 mbpc-pc kernel: ffff88009af9f488 ffff8800a126eaf0
>> ffff88011096bb58 ffffffffa027f7f4
>> Oct 28 01:19:59 mbpc-pc kernel: ffff88011096bbb8 ffffffffa08f758c
>> ffff88009af9f510 ffff88009af9f488
>> Oct 28 01:19:59 mbpc-pc kernel: Call Trace:
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa027f7f4>]
>> tcm_qla2xxx_release_cmd+0x14/0x30 [tcm_qla2xxx]
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f758c>]
>> target_release_cmd_kref+0xac/0x110 [target_core_mod]
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f7627>]
>> target_put_sess_cmd+0x37/0x70 [target_core_mod]
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f46f7>]
>> core_tmr_abort_task+0x107/0x160 [target_core_mod]
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>> target_tmr_work+0x154/0x160 [target_core_mod]
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81080639>]
>> process_one_work+0x189/0x4e0
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810d060c>] ?
>> del_timer_sync+0x4c/0x60
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108131e>] ?
>> maybe_create_worker+0x8e/0x110
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108150d>]
>> worker_thread+0x16d/0x520
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810923f2>] ?
>> default_wake_function+0x12/0x20
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>> __wake_up_common+0x56/0x90
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ?
>> maybe_create_worker+0x110/0x110
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8109130e>] ?
>> schedule_tail+0x1e/0xc0
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085f20>] ?
>> kthread_freezable_should_stop+0x70/0x70
>> Oct 28 01:19:59 mbpc-pc kernel: Code: 0d 00 00 48 c7 c7 00 2a 16 a0 e8
>> ac 32 f2 e0 48 83 c4 18 5b 41 5c 41 5d c9 c3 48 8b bb 90 02 00 00 e8 85
>> da 07 e1 e9 30 ff ff ff <0f> 0b eb fe 0f 0b 66 2e 0f 1f 84 00 00 00 00
>> 00 eb f4 66 66 66
>> Oct 28 01:19:59 mbpc-pc kernel: RIP  [<ffffffffa0143810>]
>> qlt_free_cmd+0x160/0x180 [qla2xxx]
>> Oct 28 01:19:59 mbpc-pc kernel: RSP <ffff88011096bb18>
>> Oct 28 01:19:59 mbpc-pc kernel: ---[ end trace 2551bf47a19dbe2e ]---
>>
>
> Mmm.
>
> This BUG_ON is signaling a qla_tgt_cmd descriptor is being freed while
> qlt_handle_cmd_for_atio() has queued it for backend execution, but
> qla_tgt_cmd->work -> __qlt_do_work() has not executed.
>
>>
>>
>> 2) This works with a new disk that's just been inserted.  No issues.
>>
>>
>
> Thanks for verifying this scenario.
>
>>
>>
>> The kernel had the patch in both scenarios.  So it appears we can't
>> function on a degraded array that loses 1 (RAID 5/6) or 2 (RAID 6)
>> disks, at the moment, even though the array itself is fine.  Perhaps
>> it's the nature of the failed disk.
>>
>
> AFAICT, the hung task involves ABORT_TASK across tcm_qla2xxx session
> reinstatement, when backend I/O latency is high enough to cause
> ABORT_TASK operations across FC fabric host side session reset.
>
> From logs alone, it's unclear if the failing backend ata6 is leaking I/O
> (indefinitely) when hung task warning happen, but the preceding ata6
> failures + device resets seem to indicate I/O completions are just
> taking a really long time to complete.
>
> Also, it's unclear if the BUG_ON(cmd->cmd_in_wq) in qlt_free_cmd() is a
> side effect of the earlier hung task, or separate tcm_qla2xxx session
> reinstatement bug.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


Thanks Nicholas.

Is it possible the RAID 6 array, after attempting to write sectors to a 
bad disk, received an error and returned a status message that the QLA 
could not interpret, thereby sending the QLA into a routine that expects 
a recognized message and simply times out, while the array itself just 
moved on along its merry way?  (For example, a message akin to 
"Unexpected xyz received, ignoring" that should instead be interpreted 
and acted on.)  Or perhaps the software RAID 6 isn't returning anything 
meaningful to the QLA driver.  (If it isn't, though, it can be argued 
that the QLA / target drivers shouldn't have to care whether the array 
is working.)

The 100% util on /dev/sdf lasted for less than 30 seconds, well below 
the 120s default timeout (I've upped this to 360s for the purpose of 
testing this scenario).

I do see these:

http://microdevsys.com/linux-lio/messages-mailing-list
Oct 23 19:50:26 mbpc-pc kernel: qla2xxx [0000:04:00.0]-5811:20: 
Asynchronous PORT UPDATE ignored 0000/0004/0600.

but I don't think that's the one, as it doesn't correlate well at all.

If there's another test that I can do to get more info, please feel free 
to suggest it.
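One thing I could also try on my side for more detail: the qla2xxx 
driver has an extended error logging module parameter (I believe the 
knob is ql2xextended_error_logging; the mask value below is just "turn 
everything on" and is very chatty):

```shell
# Enable verbose qla2xxx debug logging at runtime (needs root);
# the value is a debug mask, so only leave it on for short windows
echo 0x7fffffff > /sys/module/qla2xxx/parameters/ql2xextended_error_logging

# Or set it at module load time instead:
#   modprobe qla2xxx ql2xextended_error_logging=0x7fffffff

# Turn it back off afterwards
echo 0 > /sys/module/qla2xxx/parameters/ql2xextended_error_logging
```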

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-29 18:10                   ` TomK
@ 2016-10-29 21:44                     ` Nicholas A. Bellinger
  2016-10-30 18:50                       ` TomK
  0 siblings, 1 reply; 14+ messages in thread
From: Nicholas A. Bellinger @ 2016-10-29 21:44 UTC (permalink / raw)
  To: TomK
  Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali,
	Gurumurthy, Anil

On Sat, 2016-10-29 at 14:10 -0400, TomK wrote:
> On 10/29/2016 3:50 AM, Nicholas A. Bellinger wrote:
> > Hi TomK & Co,
> >
> > On Fri, 2016-10-28 at 02:01 -0400, TomK wrote:
> >> On 10/26/2016 8:08 AM, TomK wrote:
> >>> On 10/26/2016 3:20 AM, Nicholas A. Bellinger wrote:
> >

<SNIP>

> >>
> >> 1) As soon as I re-add the bad disk without the patch, I lose the LUN
> >> off the ESXi hosts.  Same thing happens with the patch.  No change.  The
> >> disk is pulling things down.  Worse, the kernel panics and locks me out
> >> of the system (http://microdevsys.com/linux-lio/messages-oct-27-2016.txt) :
> >>
> >
> > So after groking these logs, the point when ata6 failing scsi_device is
> > holding outstanding I/O beyond ESX FC host side timeouts, manifests
> > itself as ABORT_TASK tag=1122276:
> >
> > Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077e
> > Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077f
> > Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 0780
> > Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0
> > Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
> > Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80f:20: qla_target(0): task abort (tag=1122276)
> > Oct 28 00:42:57 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1122276
> >
> > and eventually:
> >
> > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0
> > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
> > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f881:20: unable to find cmd in driver or LIO for tag 0x111fe4
> > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f854:20: qla_target(0): __qlt_24xx_handle_abts() failed: -2
> > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a40, status=4
> > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
> > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
> > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a80, status=0
> > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0
> >
> > The outstanding se_cmd with tag=1122276 for ata6 completes back to
> > target-core, allowing ABORT_TASK + TMR_FUNCTION_COMPLETE status to
> > progress:
> >
> > Oct 28 00:44:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> > Oct 28 00:44:29 mbpc-pc kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> > Oct 28 00:44:29 mbpc-pc kernel: ata6.00: configured for UDMA/133
> > Oct 28 00:44:29 mbpc-pc kernel: ata6.00: retrying FLUSH 0xea Emask 0x4
> > Oct 28 00:44:29 mbpc-pc kernel: ata6.00: device reported invalid CHS sector 0
> > Oct 28 00:44:29 mbpc-pc kernel: ata6: EH complete
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20: qla_target(0): terminating exchange for aborted cmd=ffff880099392fa8 (se_cmd=ffff880099392fa8, tag=1122276)
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20: qla_target(0): Terminating cmd ffff880099392fa8 with incorrect state 2
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 6)
> > Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1122276
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 8)
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f0f0) status 0x0 state 0x0
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f300, status=0
> > Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122996
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f340) status 0x2 state 0x0
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f550, status=2
> > Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122204
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f590) status 0x2 state 0x0
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f7a0, status=2
> > Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122240
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f7e0) status 0x2 state 0x0
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f124280, status=0
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f1242c0, status=0
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f9f0, status=2
> >
> > and continues with TMR_TASK_DOES_NOT_EXIST for other oustanding tags..
> >
> > Until, target-core session release for two tcm_qla2xxx host ports:
> >
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff880099a80a80 / sess ffff880111665cc0 from port 21:03:00:1b:32:74:b6:cb loop_id 0x03 s_id 01:05:00 logout 1 keep 1 els_logo 0
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff8800993849f8] ox_id 0773
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff88009e8f9040 / sess ffff8800abe95480 from port 50:01:43:80:16:77:99:38 loop_id 0x02 s_id 01:00:00 logout 1 keep 1 els_logo 0
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0.
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=2 loop-id=3 portid=010500.
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=3 loop-id=2 portid=010000.
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping.
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
> >
> > From there, ELS with unexpected NOTIFY_ACK received start to occur:
> >
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0.
> > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping.
> > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
> > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
> > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
> > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
> > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
> > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
> >
> > ELS packets for the same two host ports continue:
> >
> > Oct 28 00:46:40 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
> > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
> > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
> > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
> > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
> > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
> > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
> > Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05b1
> > Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a9977d8]
> > Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a9977d8] ox_id 05b1
> > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
> > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
> > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:05:01 ELS opcode: 0x03
> > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
> > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff880111665cc0 [0] wwn 21:03:00:1b:32:74:b6:cb with PLOGI ACK to wwn 21:03:00:1b:32:74:b6:cb s_id 01:05:00, ref=1
> > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
> >
> > And eventually, the 360 second hung task timeout warnings appear:
> >
> > Oct 28 00:49:48 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
> > Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05ba
> > Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
> > Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05ba
> > Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05bb
> > Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
> > Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05bb
> > Jan  3 15:34:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon successfully started
> > Oct 28 00:50:33 mbpc-pc kernel: INFO: task kworker/0:2:31731 blocked for more than 360 seconds.
> > Oct 28 00:50:33 mbpc-pc kernel:      Not tainted 4.8.4 #2
> > Oct 28 00:50:33 mbpc-pc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct 28 00:50:33 mbpc-pc kernel: kworker/0:2     D ffff88011affb968     0 31731      2 0x00000080
> > Oct 28 00:50:33 mbpc-pc kernel: Workqueue: events qlt_free_session_done [qla2xxx]
> > Oct 28 00:50:33 mbpc-pc kernel: ffff88011affb968 ffff88011affb8d8 ffff880013514940 0000000000000006
> > Oct 28 00:50:33 mbpc-pc kernel: ffff8801140fe880 ffffffff81f998c2 0000000000000000 ffff880100000000
> > Oct 28 00:50:33 mbpc-pc kernel: ffffffff810bdaaa ffffffff00000000 ffffffff00000051 ffff880100000000
> > Oct 28 00:50:33 mbpc-pc kernel: Call Trace:
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810bdaaa>] ? vprintk_emit+0x27a/0x4d0
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162e7ec>] schedule_timeout+0x9c/0xe0
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162cfa0>] wait_for_completion+0xc0/0xf0
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810923e0>] ? try_to_wake_up+0x260/0x260
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81161d73>] ? mempool_free+0x33/0x90
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa08f76ad>] target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa00e7188>] ? qla2x00_post_work+0x58/0x70 [qla2xxx]
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa0286f69>] tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa01447e9>] qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81080639>] process_one_work+0x189/0x4e0
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8107d915>] ? wq_worker_waking_up+0x15/0x70
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109eb59>] ? idle_balance+0x79/0x290
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8108150d>] worker_thread+0x16d/0x520
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162bb3d>] ? __schedule+0x2fd/0x6a0
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109130e>] ? schedule_tail+0x1e/0xc0
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
> > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085f20>] ? kthread_freezable_should_stop+0x70/0x70
> >
> >>
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
> >> Intermediate CTIO received (status 6)
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
> >> TERM EXCH CTIO (ha=ffff88010ecb0000)
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20:
> >> qla_target(0): terminating exchange for aborted cmd=ffff88009af9f488
> >> (se_cmd=ffff88009af9f488, tag=1131312)
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
> >> TERM EXCH CTIO (ha=ffff88010ecb0000)
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20:
> >> qla_target(0): Terminating cmd ffff88009af9f488 with incorrect state 2
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
> >> TERM EXCH CTIO (ha=ffff88010ecb0000)
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
> >> Intermediate CTIO received (status 6)
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
> >> Intermediate CTIO received (status 8)
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
> >> TERM EXCH CTIO (ha=ffff88010ecb0000)
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
> >> Intermediate CTIO received (status 6)
> >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
> >> qlt_free_cmd: se_cmd[ffff88009af9f488] ox_id 0673
> >> Oct 28 01:19:59 mbpc-pc kernel: ------------[ cut here ]------------
> >> Oct 28 01:19:59 mbpc-pc kernel: kernel BUG at
> >> drivers/scsi/qla2xxx/qla_target.c:3319!
> >> Oct 28 01:19:59 mbpc-pc kernel: invalid opcode: 0000 [#1] SMP
> >> Oct 28 01:19:59 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
> >> tcm_loop target_core_file target_core_iblock target_core_pscsi
> >> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
> >> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
> >> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
> >> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
> >> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
> >> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
> >> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
> >> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
> >> libcrc32c joydev sg serio_raw e1000 r8169 mii kvm_amd kvm
> >> snd_hda_codec_realtek snd_hda_codec_generic irqbypass pcspkr k10temp
> >> snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq
> >> snd_seq_device snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core wmi
> >> shpchp acpi_cpufreq ext4 mbcache jbd2 qla2xxx scsi_transport_fc floppy
> >> firewire_ohci f
> >> Oct 28 01:19:59 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
> >> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
> >> dm_region_hash dm_log dm_mod
> >> Oct 28 01:19:59 mbpc-pc kernel: CPU: 0 PID: 296 Comm: kworker/u16:6 Not
> >> tainted 4.8.4 #2
> >> Oct 28 01:19:59 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
> >> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
> >> Oct 28 01:19:59 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
> >> [target_core_mod]
> >> Oct 28 01:19:59 mbpc-pc kernel: task: ffff8801109623c0 task.stack:
> >> ffff880110968000
> >> Oct 28 01:19:59 mbpc-pc kernel: RIP: 0010:[<ffffffffa0143810>]
> >> [<ffffffffa0143810>] qlt_free_cmd+0x160/0x180 [qla2xxx]
> >> Oct 28 01:19:59 mbpc-pc kernel: RSP: 0018:ffff88011096bb18  EFLAGS: 00010202
> >> Oct 28 01:19:59 mbpc-pc kernel: RAX: 0000000000000051 RBX:
> >> ffff88009af9f488 RCX: 0000000000000006
> >> Oct 28 01:19:59 mbpc-pc kernel: RDX: 0000000000000007 RSI:
> >> 0000000000000007 RDI: ffff88011fc0cb40
> >> Oct 28 01:19:59 mbpc-pc kernel: RBP: ffff88011096bb48 R08:
> >> 0000000000000000 R09: ffffffff81fa4765
> >> Oct 28 01:19:59 mbpc-pc kernel: R10: 0000000000000074 R11:
> >> 0000000000000002 R12: ffff8801137770c0
> >> Oct 28 01:19:59 mbpc-pc kernel: R13: ffff8800a126eaf0 R14:
> >> ffff88009af9f510 R15: 0000000000000296
> >> Oct 28 01:19:59 mbpc-pc kernel: FS:  0000000000000000(0000)
> >> GS:ffff88011fc00000(0000) knlGS:0000000000000000
> >> Oct 28 01:19:59 mbpc-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> >> 0000000080050033
> >> Oct 28 01:19:59 mbpc-pc kernel: CR2: 00007f8eef58d000 CR3:
> >> 00000000cabad000 CR4: 00000000000006f0
> >> Oct 28 01:19:59 mbpc-pc kernel: Stack:
> >> Oct 28 01:19:59 mbpc-pc kernel: ffff880000000673 ffff88009af9f488
> >> ffff8800a126eaf0 ffff88009af9f59c
> >> Oct 28 01:19:59 mbpc-pc kernel: ffff88009af9f488 ffff8800a126eaf0
> >> ffff88011096bb58 ffffffffa027f7f4
> >> Oct 28 01:19:59 mbpc-pc kernel: ffff88011096bbb8 ffffffffa08f758c
> >> ffff88009af9f510 ffff88009af9f488
> >> Oct 28 01:19:59 mbpc-pc kernel: Call Trace:
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa027f7f4>]
> >> tcm_qla2xxx_release_cmd+0x14/0x30 [tcm_qla2xxx]
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f758c>]
> >> target_release_cmd_kref+0xac/0x110 [target_core_mod]
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f7627>]
> >> target_put_sess_cmd+0x37/0x70 [target_core_mod]
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f46f7>]
> >> core_tmr_abort_task+0x107/0x160 [target_core_mod]
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f6aa4>]
> >> target_tmr_work+0x154/0x160 [target_core_mod]
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81080639>]
> >> process_one_work+0x189/0x4e0
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810d060c>] ?
> >> del_timer_sync+0x4c/0x60
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108131e>] ?
> >> maybe_create_worker+0x8e/0x110
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108150d>]
> >> worker_thread+0x16d/0x520
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810923f2>] ?
> >> default_wake_function+0x12/0x20
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810a6f06>] ?
> >> __wake_up_common+0x56/0x90
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >> maybe_create_worker+0x110/0x110
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ?
> >> maybe_create_worker+0x110/0x110
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8109130e>] ?
> >> schedule_tail+0x1e/0xc0
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
> >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085f20>] ?
> >> kthread_freezable_should_stop+0x70/0x70
> >> Oct 28 01:19:59 mbpc-pc kernel: Code: 0d 00 00 48 c7 c7 00 2a 16 a0 e8
> >> ac 32 f2 e0 48 83 c4 18 5b 41 5c 41 5d c9 c3 48 8b bb 90 02 00 00 e8 85
> >> da 07 e1 e9 30 ff ff ff <0f> 0b eb fe 0f 0b 66 2e 0f 1f 84 00 00 00 00
> >> 00 eb f4 66 66 66
> >> Oct 28 01:19:59 mbpc-pc kernel: RIP  [<ffffffffa0143810>]
> >> qlt_free_cmd+0x160/0x180 [qla2xxx]
> >> Oct 28 01:19:59 mbpc-pc kernel: RSP <ffff88011096bb18>
> >> Oct 28 01:19:59 mbpc-pc kernel: ---[ end trace 2551bf47a19dbe2e ]---
> >>
> >
> > Mmm.
> >
> > This BUG_ON is signaling a qla_tgt_cmd descriptor is being freed while
> > qlt_handle_cmd_for_atio() has queued it for backend execution, but
> > qla_tgt_cmd->work -> __qlt_do_work() has not executed.
> >
> >>
> >>
> >> 2) This works with a new disk that's just been inserted.  No issues.
> >>
> >>
> >
> > Thanks for verifying this scenario.
> >
> >>
> >>
> >> The kernel had the patch in both scenarios.  So it appears we can't
> >> function on a degraded array that loses 1 (RAID 5 / 6) or 2 (RAID 6)
> >> disks, at the moment, even though the array itself is fine.  Perhaps
> >> it's the nature of the failed disk.
> >>
> >
> > AFAICT, the hung task involves ABORT_TASK across tcm_qla2xxx session
> > reinstatement, when backend I/O latency is high enough to cause
> > ABORT_TASK operations across FC fabric host side session reset.
> >
> > From the logs alone, it's unclear if the failing backend ata6 is leaking
> > I/O (indefinitely) when the hung task warnings happen, but the preceding
> > ata6 failures + device resets seem to indicate I/O completions are just
> > taking a really long time.
> >
> > Also, it's unclear if the BUG_ON(cmd->cmd_in_wq) in qlt_free_cmd() is a
> > side effect of the earlier hung task, or a separate tcm_qla2xxx session
> > reinstatement bug.
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
> 
> Thanks Nicholas.
> 
> Is it possible the RAID 6 array, after attempting to write sectors to a 
> bad disk, received an error and returned a status message that the QLA 
> could not interpret?  Thereby sending the QLA into a routine expecting a 
> recognized message and simply timing out?  All the while the array 
> simply 'moved on' along its merry way.  (For example, a message akin 
> to "Unexpected xyz received, ignoring" type of message that should be 
> interpreted and actioned on.)  Or perhaps the software RAID 6 isn't 
> returning anything meaningful to the QLA driver.  (If it isn't though, 
> it can be argued that the QLA / Target Drivers shouldn't have to care if 
> the array is working.)

So target-core waits for all outstanding backend I/O to complete during
session reinstatement.

The two hung task warnings observed here mean target-core I/O
descriptors from each of the two tcm_qla2xxx ports are waiting to be
completed during session reinstatement, but this never happens.

Which means one of three things:

1) The backend below target-core is leaking I/O completions, which
   is a bug outside of target-core and needs to be identified.
2) There is a target-core bug leaking se_cmd->cmd_kref, preventing
   the final reference release to occur.
3) There is a tcm_qla2xxx bug leaking se_cmd->cmd_kref, preventing
   the final reference release to occur.

> 
> The 100% util on /dev/sdf lasted for less than 30 seconds.

Based on the above, the time between when tag=1122276 got ABORT_TASK and
when ata6 finally gave back the descriptor to allow TMR_FUNCTION_COMPLETE
was at least 90 seconds.

This does not include the extra time from initial I/O submission to when
ESX SCSI timeouts fire to generate ABORT_TASK, which IIRC depends upon
the FC host LLD.

Btw, I don't recall libata device timeouts being 90+ seconds, which
looks a little strange..

This timeout is in /sys/class/scsi_device/$HCTL/device/eh_timeout.

So assuming MD is not holding onto I/O much beyond backend device
eh_timeout during failure, a simple work-around is to keep the combined
backend eh_timeout and MD consumer I/O timeout lower than ESX FC SCSI
timeouts.
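For reference, that timeout can be read and lowered from userspace; a
rough sketch (the 20-second value and the H:C:T:L address below are
illustrative, not taken from your setup):

```shell
# Show the current SCSI error-handling timeout (in seconds) per device.
for f in /sys/class/scsi_device/*/device/eh_timeout; do
    printf '%s: %s\n' "$f" "$(cat "$f")"
done

# Lower it on the backend device behind MD (address is illustrative),
# so backend EH resolves well inside the ESX FC SCSI timeout window.
echo 20 > /sys/class/scsi_device/5:0:0:0/device/eh_timeout
```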

From the logs above, the workaround would likely mask this specific bug
if it's related to #2 or #3.

> Well below 
> the 120s default timeout (I've upped this to 360 for the purpose of 
> testing this scenario).
> 
> I do see these:
> 
> http://microdevsys.com/linux-lio/messages-mailing-list
> Oct 23 19:50:26 mbpc-pc kernel: qla2xxx [0000:04:00.0]-5811:20: 
> Asynchronous PORT UPDATE ignored 0000/0004/0600.
> 
> but I don't think it's that one as it doesn't correlate well at all.
> 

These are unrelated to the hung task warnings.

> If there's another test that I can do to get more info, please feel free 
> to suggest.
> 

Ideally, being able to generate a vmcore crashdump is the most helpful.
This involves manually triggering a crash after you observe the hung
tasks.  It's reasonably easy to set up once CONFIG_KEXEC=y +
CONFIG_CRASH_DUMP=y are enabled:

https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes

I don't know if you'll run into problems on v4.8.y as per Anil's earlier
email, but getting a proper vmcore for analysis is usually the fastest
path to root cause.
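Once kdump from the wiki above is in place, the capture sequence after
the hung tasks appear looks roughly like this (note the last command
panics the box on purpose):

```shell
# Confirm the crash kernel was loaded by kexec-tools at boot.
cat /sys/kernel/kexec_crash_loaded    # 1 means a vmcore can be captured

# When the hung task warnings show up, force a panic so kdump
# reboots into the crash kernel and writes out the vmcore:
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
```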

Short of that, it would be helpful to identify the state of the se_cmd
descriptors getting leaked.  Here's a quick patch to add some more
verbosity:

diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 7dfefd6..9b93a2c 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -2657,9 +2657,13 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
 
        list_for_each_entry_safe(se_cmd, tmp_cmd,
                                &se_sess->sess_wait_list, se_cmd_list) {
-               pr_debug("Waiting for se_cmd: %p t_state: %d, fabric state:"
-                       " %d\n", se_cmd, se_cmd->t_state,
-                       se_cmd->se_tfo->get_cmd_state(se_cmd));
+               printk("Waiting for se_cmd: %p t_state: %d, fabric state:"
+                       " %d se_cmd_flags: 0x%08x transport_state: 0x%08x"
+                       " CDB: 0x%02x\n",
+                       se_cmd, se_cmd->t_state,
+                       se_cmd->se_tfo->get_cmd_state(se_cmd),
+                       se_cmd->se_cmd_flags, se_cmd->transport_state,
+                       se_cmd->t_task_cdb[0]);
 
                spin_lock_irqsave(&se_cmd->t_state_lock, flags);
                tas = (se_cmd->transport_state & CMD_T_TAS);
@@ -2671,9 +2675,13 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
                }
 
                wait_for_completion(&se_cmd->cmd_wait_comp);
-               pr_debug("After cmd_wait_comp: se_cmd: %p t_state: %d"
-                       " fabric state: %d\n", se_cmd, se_cmd->t_state,
-                       se_cmd->se_tfo->get_cmd_state(se_cmd));
+               printk("After cmd_wait_comp: se_cmd: %p t_state: %d"
+                       " fabric state: %d se_cmd_flags: 0x%08x transport_state:"
+                       " 0x%08x CDB: 0x%02x\n",
+                       se_cmd, se_cmd->t_state,
+                       se_cmd->se_tfo->get_cmd_state(se_cmd),
+                       se_cmd->se_cmd_flags, se_cmd->transport_state,
+                       se_cmd->t_task_cdb[0]);
 
                se_cmd->se_tfo->release_cmd(se_cmd);
        }
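As an aside, if your kernel has CONFIG_DYNAMIC_DEBUG=y, the existing
pr_debug() calls in target_wait_for_sess_cmds() can be switched on at
runtime instead of rebuilding, though they print less state than the
patch above:

```shell
# Enable the pr_debug() lines in target_wait_for_sess_cmds() at runtime
# (requires CONFIG_DYNAMIC_DEBUG=y and debugfs mounted).
mount -t debugfs none /sys/kernel/debug 2>/dev/null
echo 'func target_wait_for_sess_cmds +p' > /sys/kernel/debug/dynamic_debug/control

# Then watch for the "Waiting for se_cmd" messages during session
# reinstatement.
dmesg -w | grep 'se_cmd'
```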



* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-29 21:44                     ` Nicholas A. Bellinger
@ 2016-10-30 18:50                       ` TomK
  2016-11-01  2:44                         ` TomK
  0 siblings, 1 reply; 14+ messages in thread
From: TomK @ 2016-10-30 18:50 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali,
	Gurumurthy, Anil

On 10/29/2016 5:44 PM, Nicholas A. Bellinger wrote:
> On Sat, 2016-10-29 at 14:10 -0400, TomK wrote:
>> On 10/29/2016 3:50 AM, Nicholas A. Bellinger wrote:
>>> Hi TomK & Co,
>>>
>>> On Fri, 2016-10-28 at 02:01 -0400, TomK wrote:
>>>> On 10/26/2016 8:08 AM, TomK wrote:
>>>>> On 10/26/2016 3:20 AM, Nicholas A. Bellinger wrote:
>>>
>
> <SNIP>
>
>>>>
>>>> 1) As soon as I re-add the bad disk without the patch, I lose the LUN
>>>> off the ESXi hosts.  Same thing happens with the patch.  No change.  The
>>>> disk is pulling things down.  Worse, the kernel panics and locks me out
>>>> of the system (http://microdevsys.com/linux-lio/messages-oct-27-2016.txt) :
>>>>
>>>
>>> So after grokking these logs, the point where the failing ata6
>>> scsi_device holds outstanding I/O beyond the ESX FC host-side timeouts
>>> manifests itself as ABORT_TASK tag=1122276:
>>>
>>> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077e
>>> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077f
>>> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 0780
>>> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0
>>> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
>>> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80f:20: qla_target(0): task abort (tag=1122276)
>>> Oct 28 00:42:57 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1122276
>>>
>>> and eventually:
>>>
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f881:20: unable to find cmd in driver or LIO for tag 0x111fe4
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f854:20: qla_target(0): __qlt_24xx_handle_abts() failed: -2
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a40, status=4
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a80, status=0
>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0
>>>
>>> The outstanding se_cmd with tag=1122276 for ata6 completes back to
>>> target-core, allowing ABORT_TASK + TMR_FUNCTION_COMPLETE status to
>>> progress:
>>>
>>> Oct 28 00:44:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Oct 28 00:44:29 mbpc-pc kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
>>> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: configured for UDMA/133
>>> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: retrying FLUSH 0xea Emask 0x4
>>> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: device reported invalid CHS sector 0
>>> Oct 28 00:44:29 mbpc-pc kernel: ata6: EH complete
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20: qla_target(0): terminating exchange for aborted cmd=ffff880099392fa8 (se_cmd=ffff880099392fa8, tag=1122276)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20: qla_target(0): Terminating cmd ffff880099392fa8 with incorrect state 2
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 6)
>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1122276
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 8)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f0f0) status 0x0 state 0x0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f300, status=0
>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122996
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f340) status 0x2 state 0x0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f550, status=2
>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122204
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f590) status 0x2 state 0x0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f7a0, status=2
>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122240
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f7e0) status 0x2 state 0x0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f124280, status=0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f1242c0, status=0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f9f0, status=2
>>>
>>> and continues with TMR_TASK_DOES_NOT_EXIST for other outstanding tags..
>>>
>>> Until, target-core session release for two tcm_qla2xxx host ports:
>>>
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff880099a80a80 / sess ffff880111665cc0 from port 21:03:00:1b:32:74:b6:cb loop_id 0x03 s_id 01:05:00 logout 1 keep 1 els_logo 0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff8800993849f8] ox_id 0773
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff88009e8f9040 / sess ffff8800abe95480 from port 50:01:43:80:16:77:99:38 loop_id 0x02 s_id 01:00:00 logout 1 keep 1 els_logo 0
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0.
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=2 loop-id=3 portid=010500.
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=3 loop-id=2 portid=010000.
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping.
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>>
>>> From there, ELS packets triggering "Unexpected NOTIFY_ACK received" start to occur:
>>>
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0.
>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping.
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
>>>
>>> ELS packets for the same two host ports continue:
>>>
>>> Oct 28 00:46:40 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
>>> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05b1
>>> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a9977d8]
>>> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a9977d8] ox_id 05b1
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:05:01 ELS opcode: 0x03
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000)
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff880111665cc0 [0] wwn 21:03:00:1b:32:74:b6:cb with PLOGI ACK to wwn 21:03:00:1b:32:74:b6:cb s_id 01:05:00, ref=1
>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received
>>>
>>> And eventually, the 360 second hung task timeout warnings appear:
>>>
>>> Oct 28 00:49:48 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05ba
>>> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
>>> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05ba
>>> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05bb
>>> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
>>> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05bb
>>> Jan  3 15:34:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon successfully started
>>> Oct 28 00:50:33 mbpc-pc kernel: INFO: task kworker/0:2:31731 blocked for more than 360 seconds.
>>> Oct 28 00:50:33 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>> Oct 28 00:50:33 mbpc-pc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Oct 28 00:50:33 mbpc-pc kernel: kworker/0:2     D ffff88011affb968     0 31731      2 0x00000080
>>> Oct 28 00:50:33 mbpc-pc kernel: Workqueue: events qlt_free_session_done [qla2xxx]
>>> Oct 28 00:50:33 mbpc-pc kernel: ffff88011affb968 ffff88011affb8d8 ffff880013514940 0000000000000006
>>> Oct 28 00:50:33 mbpc-pc kernel: ffff8801140fe880 ffffffff81f998c2 0000000000000000 ffff880100000000
>>> Oct 28 00:50:33 mbpc-pc kernel: ffffffff810bdaaa ffffffff00000000 ffffffff00000051 ffff880100000000
>>> Oct 28 00:50:33 mbpc-pc kernel: Call Trace:
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810bdaaa>] ? vprintk_emit+0x27a/0x4d0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162e7ec>] schedule_timeout+0x9c/0xe0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162cfa0>] wait_for_completion+0xc0/0xf0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810923e0>] ? try_to_wake_up+0x260/0x260
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81161d73>] ? mempool_free+0x33/0x90
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa08f76ad>] target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa00e7188>] ? qla2x00_post_work+0x58/0x70 [qla2xxx]
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa0286f69>] tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa01447e9>] qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81080639>] process_one_work+0x189/0x4e0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8107d915>] ? wq_worker_waking_up+0x15/0x70
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109eb59>] ? idle_balance+0x79/0x290
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8108150d>] worker_thread+0x16d/0x520
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162bb3d>] ? __schedule+0x2fd/0x6a0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109130e>] ? schedule_tail+0x1e/0xc0
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085f20>] ? kthread_freezable_should_stop+0x70/0x70
>>>
>>>>
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>> Intermediate CTIO received (status 6)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20:
>>>> qla_target(0): terminating exchange for aborted cmd=ffff88009af9f488
>>>> (se_cmd=ffff88009af9f488, tag=1131312)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20:
>>>> qla_target(0): Terminating cmd ffff88009af9f488 with incorrect state 2
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>> Intermediate CTIO received (status 6)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>> Intermediate CTIO received (status 8)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending
>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>> Intermediate CTIO received (status 6)
>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
>>>> qlt_free_cmd: se_cmd[ffff88009af9f488] ox_id 0673
>>>> Oct 28 01:19:59 mbpc-pc kernel: ------------[ cut here ]------------
>>>> Oct 28 01:19:59 mbpc-pc kernel: kernel BUG at
>>>> drivers/scsi/qla2xxx/qla_target.c:3319!
>>>> Oct 28 01:19:59 mbpc-pc kernel: invalid opcode: 0000 [#1] SMP
>>>> Oct 28 01:19:59 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>>>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>>>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables
>>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>>>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87
>>>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>>>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>>>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>>>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>>>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq
>>>> libcrc32c joydev sg serio_raw e1000 r8169 mii kvm_amd kvm
>>>> snd_hda_codec_realtek snd_hda_codec_generic irqbypass pcspkr k10temp
>>>> snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq
>>>> snd_seq_device snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core wmi
>>>> shpchp acpi_cpufreq ext4 mbcache jbd2 qla2xxx scsi_transport_fc floppy
>>>> firewire_ohci f
>>>> Oct 28 01:19:59 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi
>>>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>>>> dm_region_hash dm_log dm_mod
>>>> Oct 28 01:19:59 mbpc-pc kernel: CPU: 0 PID: 296 Comm: kworker/u16:6 Not
>>>> tainted 4.8.4 #2
>>>> Oct 28 01:19:59 mbpc-pc kernel: Hardware name: Gigabyte Technology Co.,
>>>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>>>> Oct 28 01:19:59 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>>> [target_core_mod]
>>>> Oct 28 01:19:59 mbpc-pc kernel: task: ffff8801109623c0 task.stack:
>>>> ffff880110968000
>>>> Oct 28 01:19:59 mbpc-pc kernel: RIP: 0010:[<ffffffffa0143810>]
>>>> [<ffffffffa0143810>] qlt_free_cmd+0x160/0x180 [qla2xxx]
>>>> Oct 28 01:19:59 mbpc-pc kernel: RSP: 0018:ffff88011096bb18  EFLAGS: 00010202
>>>> Oct 28 01:19:59 mbpc-pc kernel: RAX: 0000000000000051 RBX:
>>>> ffff88009af9f488 RCX: 0000000000000006
>>>> Oct 28 01:19:59 mbpc-pc kernel: RDX: 0000000000000007 RSI:
>>>> 0000000000000007 RDI: ffff88011fc0cb40
>>>> Oct 28 01:19:59 mbpc-pc kernel: RBP: ffff88011096bb48 R08:
>>>> 0000000000000000 R09: ffffffff81fa4765
>>>> Oct 28 01:19:59 mbpc-pc kernel: R10: 0000000000000074 R11:
>>>> 0000000000000002 R12: ffff8801137770c0
>>>> Oct 28 01:19:59 mbpc-pc kernel: R13: ffff8800a126eaf0 R14:
>>>> ffff88009af9f510 R15: 0000000000000296
>>>> Oct 28 01:19:59 mbpc-pc kernel: FS:  0000000000000000(0000)
>>>> GS:ffff88011fc00000(0000) knlGS:0000000000000000
>>>> Oct 28 01:19:59 mbpc-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>>>> 0000000080050033
>>>> Oct 28 01:19:59 mbpc-pc kernel: CR2: 00007f8eef58d000 CR3:
>>>> 00000000cabad000 CR4: 00000000000006f0
>>>> Oct 28 01:19:59 mbpc-pc kernel: Stack:
>>>> Oct 28 01:19:59 mbpc-pc kernel: ffff880000000673 ffff88009af9f488
>>>> ffff8800a126eaf0 ffff88009af9f59c
>>>> Oct 28 01:19:59 mbpc-pc kernel: ffff88009af9f488 ffff8800a126eaf0
>>>> ffff88011096bb58 ffffffffa027f7f4
>>>> Oct 28 01:19:59 mbpc-pc kernel: ffff88011096bbb8 ffffffffa08f758c
>>>> ffff88009af9f510 ffff88009af9f488
>>>> Oct 28 01:19:59 mbpc-pc kernel: Call Trace:
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa027f7f4>]
>>>> tcm_qla2xxx_release_cmd+0x14/0x30 [tcm_qla2xxx]
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f758c>]
>>>> target_release_cmd_kref+0xac/0x110 [target_core_mod]
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f7627>]
>>>> target_put_sess_cmd+0x37/0x70 [target_core_mod]
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f46f7>]
>>>> core_tmr_abort_task+0x107/0x160 [target_core_mod]
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>>>> target_tmr_work+0x154/0x160 [target_core_mod]
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81080639>]
>>>> process_one_work+0x189/0x4e0
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810d060c>] ?
>>>> del_timer_sync+0x4c/0x60
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108131e>] ?
>>>> maybe_create_worker+0x8e/0x110
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108150d>]
>>>> worker_thread+0x16d/0x520
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>> default_wake_function+0x12/0x20
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>> __wake_up_common+0x56/0x90
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>> schedule_tail+0x1e/0xc0
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40
>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>> kthread_freezable_should_stop+0x70/0x70
>>>> Oct 28 01:19:59 mbpc-pc kernel: Code: 0d 00 00 48 c7 c7 00 2a 16 a0 e8
>>>> ac 32 f2 e0 48 83 c4 18 5b 41 5c 41 5d c9 c3 48 8b bb 90 02 00 00 e8 85
>>>> da 07 e1 e9 30 ff ff ff <0f> 0b eb fe 0f 0b 66 2e 0f 1f 84 00 00 00 00
>>>> 00 eb f4 66 66 66
>>>> Oct 28 01:19:59 mbpc-pc kernel: RIP  [<ffffffffa0143810>]
>>>> qlt_free_cmd+0x160/0x180 [qla2xxx]
>>>> Oct 28 01:19:59 mbpc-pc kernel: RSP <ffff88011096bb18>
>>>> Oct 28 01:19:59 mbpc-pc kernel: ---[ end trace 2551bf47a19dbe2e ]---
>>>>
>>>
>>> Mmm.
>>>
>>> This BUG_ON is signaling a qla_tgt_cmd descriptor is being freed while
>>> qlt_handle_cmd_for_atio() has queued it for backend execution, but
>>> qla_tgt_cmd->work -> __qlt_do_work() has not executed.
>>>
>>>>
>>>>
>>>> 2) This works with a new disk that's just been inserted.  No issues.
>>>>
>>>>
>>>
>>> Thanks for verifying this scenario.
>>>
>>>>
>>>>
>>>> The kernel had the patch in both scenarios.  So it appears we can't
>>>> function on a degraded array that loses 1 (RAID 5 / 6) or 2 (RAID 6)
>>>> disks at the moment, even though the array itself is fine.  Perhaps
>>>> it's the nature of the failed disk.
>>>>
>>>
>>> AFAICT, the hung task involves ABORT_TASK across tcm_qla2xxx session
>>> reinstatement, when backend I/O latency is high enough to cause
>>> ABORT_TASK operations across FC fabric host side session reset.
>>>
>>> From logs alone, it's unclear if the failing backend ata6 is leaking I/O
>>> (indefinitely) when the hung task warnings happen, but the preceding ata6
>>> failures + device resets seem to indicate I/O completions are just
>>> taking a really long time to complete.
>>>
>>> Also, it's unclear if the BUG_ON(cmd->cmd_in_wq) in qlt_free_cmd() is a
>>> side effect of the earlier hung task, or separate tcm_qla2xxx session
>>> reinstatement bug.
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
>> Thanks Nicholas.
>>
>> Is it possible the RAID 6 array, after attempting to write sectors to a
>> bad disk, received an error and returned a status message that the QLA
>> driver could not interpret?  Thereby sending the QLA into a routine
>> expecting a recognized message and simply timing out, while the array
>> itself 'moved on' along its merry way?   (For example, a message akin
>> to "Unexpected xyz received, ignoring" that should have been
>> interpreted and acted on.)  Or perhaps the software RAID 6 isn't
>> returning anything meaningful to the QLA driver.  (If it isn't, though,
>> it can be argued that the QLA / target drivers shouldn't have to care
>> whether the array is working.)
>
> So target-core waits for all outstanding backend I/O to complete during
> session reinstatement.
>
> The two hung task warnings observed here mean target-core I/O
> descriptors from each two tcm_qla2xxx ports are waiting to be completed
> during session reinstatement, but this never happens.
>
> Which means one of three things:
>
> 1) The backend below target-core is leaking I/O completions, which
>    is a bug outside of target-core and needs to be identified.
> 2) There is a target-core bug leaking se_cmd->cmd_kref, preventing
>    the final reference release to occur.
> 3) There is a tcm_qla2xxx bug leaking se_cmd->cmd_kref, preventing
>    the final reference release to occur.
>
>>
>> The 100% util on /dev/sdf lasted for less than 30 seconds.
>
> Based on the above, between when tag=1122276 got ABORT_TASK and ata6
> finally gave back the descriptor to allow TMR_FUNCTION_COMPLETE was at
> least 90 seconds.
>
> This does not include the extra time from initial I/O submission to when
> ESX SCSI timeouts fire to generate ABORT_TASK, which IIRC depends upon
> the FC host LLD.
>
> Btw, I don't recall libata device timeouts being 90+ seconds, which
> looks a little strange...
>
> This timeout is in /sys/class/scsi_device/$HCTL/device/eh_timeout.
>
> So assuming MD is not holding onto I/O much beyond backend device
> eh_timeout during failure, a simple work-around is to keep the combined
> backend eh_timeout and MD consumer I/O timeout lower than ESX FC SCSI
> timeouts.
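
[A minimal sketch of that workaround check, in shell.  The device paths are
whatever SCSI devices exist on the host, and the 10-second value in the
comment is only an illustration, not a recommendation from this thread.]

```shell
#!/bin/sh
# List the SCSI error-handling timeout (eh_timeout) for every device, so the
# combined backend + MD timeout can be compared against the ESX FC SCSI timeout.
show_eh_timeouts() {
    found=0
    for f in /sys/class/scsi_device/*/device/eh_timeout; do
        [ -e "$f" ] || continue
        found=1
        printf '%s -> %ss\n' "$f" "$(cat "$f")"
    done
    [ "$found" -eq 1 ] || echo "no SCSI devices found"
}
show_eh_timeouts

# To lower eh_timeout on one backend device (root required; the H:C:T:L and
# the 10s value below are example placeholders):
#   echo 10 > /sys/class/scsi_device/5:0:0:0/device/eh_timeout
```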
>
> From the logs above, it would likely mask this specific bug if it's
> related to #2 or #3.
>
>> Well below
>> the 120s default timeout (I've upped this to 360 for the purpose of
>> testing this scenario)
>>
>> I do see these:
>>
>> http://microdevsys.com/linux-lio/messages-mailing-list
>> Oct 23 19:50:26 mbpc-pc kernel: qla2xxx [0000:04:00.0]-5811:20:
>> Asynchronous PORT UPDATE ignored 0000/0004/0600.
>>
>> but I don't think it's that one as it doesn't correlate well at all.
>>
>
> These are unrelated to the hung task warnings.
>
>> If there's another test that I can do to get more info, please feel free
>> to suggest.
>>
>
> Ideally, being able to generate a vmcore crashdump is the most helpful.
> This involves manually triggering a crash after you observe the hung
> tasks.  It's reasonably easy to set up once CONFIG_KEXEC=y +
> CONFIG_CRASH_DUMP=y are enabled:
>
> https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
>
> I don't know if you'll run into problems on v4.8.y as per Anil's earlier
> email, but getting a proper vmcore for analysis is usually the fastest
> path to root cause.
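
[A hedged sketch of that kdump flow.  The config file location varies by
distro, and the crash-trigger lines stay commented out because they panic
the box.]

```shell
#!/bin/sh
# Verify the running kernel has the options needed to kexec into a crash
# kernel; the actual crash trigger is left as commented-out manual steps.
check_kdump_prereqs() {
    cfg=/proc/config.gz
    if [ -r "$cfg" ]; then
        gzip -dc "$cfg" | grep -E '^CONFIG_(KEXEC|CRASH_DUMP)=' \
            || echo "CONFIG_KEXEC / CONFIG_CRASH_DUMP not set"
    else
        echo "no $cfg on this kernel; check /boot/config-$(uname -r) instead"
    fi
}
check_kdump_prereqs

# Once the hung tasks appear, trigger the crash by hand so kdump writes the
# vmcore (root required; this panics the host):
#   echo 1 > /proc/sys/kernel/sysrq
#   echo c > /proc/sysrq-trigger
```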
>
> Short of that, it would be helpful to identify the state of the se_cmd
> descriptors getting leaked.  Here's a quick patch to add some more
> verbosity:
>
> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index 7dfefd6..9b93a2c 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -2657,9 +2657,13 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
>
>         list_for_each_entry_safe(se_cmd, tmp_cmd,
>                                 &se_sess->sess_wait_list, se_cmd_list) {
> -               pr_debug("Waiting for se_cmd: %p t_state: %d, fabric state:"
> -                       " %d\n", se_cmd, se_cmd->t_state,
> -                       se_cmd->se_tfo->get_cmd_state(se_cmd));
> +               printk("Waiting for se_cmd: %p t_state: %d, fabric state:"
> +                       " %d se_cmd_flags: 0x%08x transport_state: 0x%08x"
> +                       " CDB: 0x%02x\n",
> +                       se_cmd, se_cmd->t_state,
> +                       se_cmd->se_tfo->get_cmd_state(se_cmd),
> +                       se_cmd->se_cmd_flags, se_cmd->transport_state,
> +                       se_cmd->t_task_cdb[0]);
>
>                 spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>                 tas = (se_cmd->transport_state & CMD_T_TAS);
> @@ -2671,9 +2675,13 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
>                 }
>
>                 wait_for_completion(&se_cmd->cmd_wait_comp);
> -               pr_debug("After cmd_wait_comp: se_cmd: %p t_state: %d"
> -                       " fabric state: %d\n", se_cmd, se_cmd->t_state,
> -                       se_cmd->se_tfo->get_cmd_state(se_cmd));
> +               printk("After cmd_wait_comp: se_cmd: %p t_state: %d"
> +                       " fabric state: %d se_cmd_flags: 0x%08x transport_state:"
> +                       " 0x%08x CDB: 0x%02x\n",
> +                       se_cmd, se_cmd->t_state,
> +                       se_cmd->se_tfo->get_cmd_state(se_cmd),
> +                       se_cmd->se_cmd_flags, se_cmd->transport_state,
> +                       se_cmd->t_task_cdb[0]);
>
>                 se_cmd->se_tfo->release_cmd(se_cmd);
>         }
>
>

I've had kdump configured, but the Seagate 2TB failed to the point
where it isn't even detected and a device like /dev/sdf is no longer
created for it.  There are only some errors printed indicating that
something was found on the SATA connector, but commands against it
errored out.  (See below.)  (Might be the PCB going; once out of the
system or with the voltage cut, it got to this point very quickly.)

I did add the debug messages above to catch more in the future.  I'm 
also checking the mdadm changelogs for possibly related updates, since 
I'm running 3.3.2-5.el6 and the latest is 3.3.4 or 3.4.0.

I'm leaning toward case 1) above, as the RAID 6 did not flag the 
failed disk even though smartctl -A could not access any SMART 
information.  I'll also post a new thread on linux-raid.
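
For case 1), a quick way to see what MD itself believes about the array is
sketched below.  /dev/md0 and /dev/sdf are placeholders for the actual array
and the bad disk, and the destructive mdadm calls stay commented out.

```shell
#!/bin/sh
# Report what MD thinks about an array's state; prints a message instead of
# failing when the array or mdadm itself is absent on this host.
check_md_state() {
    md=$1
    if ! command -v mdadm >/dev/null 2>&1; then
        echo "mdadm not installed"
    elif [ -b "$md" ]; then
        mdadm --detail "$md" | grep -E 'State :|Failed Devices|faulty' || true
    else
        echo "array $md not present on this host"
    fi
}
check_md_state /dev/md0

# If the dead disk is still listed as active/sync, fail and remove it by hand
# (placeholder device names; double-check them first):
#   mdadm --manage /dev/md0 --fail /dev/sdf --remove /dev/sdf
```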

Again thanks very much for the help.  Appreciated.

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
  2016-10-30 18:50                       ` TomK
@ 2016-11-01  2:44                         ` TomK
  0 siblings, 0 replies; 14+ messages in thread
From: TomK @ 2016-11-01  2:44 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: linux-scsi, Himanshu Madhani, Quinn Tran, Giridhar Malavali,
	Gurumurthy, Anil

On 10/30/2016 2:50 PM, TomK wrote:
> On 10/29/2016 5:44 PM, Nicholas A. Bellinger wrote:
>> On Sat, 2016-10-29 at 14:10 -0400, TomK wrote:
>>> On 10/29/2016 3:50 AM, Nicholas A. Bellinger wrote:
>>>> Hi TomK & Co,
>>>>
>>>> On Fri, 2016-10-28 at 02:01 -0400, TomK wrote:
>>>>> On 10/26/2016 8:08 AM, TomK wrote:
>>>>>> On 10/26/2016 3:20 AM, Nicholas A. Bellinger wrote:
>>>>
>>
>> <SNIP>
>>
>>>>>
>>>>> 1) As soon as I re-add the bad disk without the patch, I loose the LUN
>>>>> off the ESXi hosts.  Same thing happens with the patch.  No
>>>>> change.  The
>>>>> disk is pulling things down.  Worse, the kernel panics and locks me
>>>>> out
>>>>> of the system
>>>>> (http://microdevsys.com/linux-lio/messages-oct-27-2016.txt) :
>>>>>
>>>>
>>>> So after groking these logs, the point when ata6 failing scsi_device is
>>>> holding outstanding I/O beyond ESX FC host side timeouts, manifests
>>>> itself as ABORT_TASK tag=1122276:
>>>>
>>>> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077e
>>>> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077f
>>>> Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 0780
>>>> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20:
>>>> ABTS_RECV_24XX: instance 0
>>>> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20:
>>>> qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
>>>> Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80f:20:
>>>> qla_target(0): task abort (tag=1122276)
>>>> Oct 28 00:42:57 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx
>>>> task_tag: 1122276
>>>>
>>>> and eventually:
>>>>
>>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20:
>>>> ABTS_RECV_24XX: instance 0
>>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20:
>>>> qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0)
>>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f881:20:
>>>> unable to find cmd in driver or LIO for tag 0x111fe4
>>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f854:20:
>>>> qla_target(0): __qlt_24xx_handle_abts() failed: -2
>>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20:
>>>> Sending task mgmt ABTS response (ha=ffff88010f110000,
>>>> atio=ffff88010f123a40, status=4
>>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>>> ABTS_RESP_24XX: compl_status 31
>>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20:
>>>> Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
>>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20:
>>>> Sending task mgmt ABTS response (ha=ffff88010f110000,
>>>> atio=ffff88010f123a80, status=0
>>>> Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>>> ABTS_RESP_24XX: compl_status 0
>>>>
>>>> The outstanding se_cmd with tag=1122276 for ata6 completes back to
>>>> target-core, allowing ABORT_TASK + TMR_FUNCTION_COMPLETE status to
>>>> progress:
>>>>
>>>> Oct 28 00:44:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Oct 28 00:44:29 mbpc-pc kernel: ata6: SATA link up 1.5 Gbps (SStatus
>>>> 113 SControl 310)
>>>> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: configured for UDMA/133
>>>> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: retrying FLUSH 0xea Emask 0x4
>>>> Oct 28 00:44:29 mbpc-pc kernel: ata6.00: device reported invalid CHS
>>>> sector 0
>>>> Oct 28 00:44:29 mbpc-pc kernel: ata6: EH complete
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>> Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20:
>>>> qla_target(0): terminating exchange for aborted cmd=ffff880099392fa8
>>>> (se_cmd=ffff880099392fa8, tag=1122276)
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>> Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20:
>>>> qla_target(0): Terminating cmd ffff880099392fa8 with incorrect state 2
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>> Intermediate CTIO received (status 6)
>>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending
>>>> TMR_FUNCTION_COMPLETE for ref_tag: 1122276
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>> Intermediate CTIO received (status 8)
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM
>>>> response mcmd (ffff8800b1e2f0f0) status 0x0 state 0x0
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20:
>>>> Sending task mgmt ABTS response (ha=ffff88010f110000,
>>>> atio=ffff8800b1e2f300, status=0
>>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending
>>>> TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122996
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM
>>>> response mcmd (ffff8800b1e2f340) status 0x2 state 0x0
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20:
>>>> Sending task mgmt ABTS response (ha=ffff88010f110000,
>>>> atio=ffff8800b1e2f550, status=2
>>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending
>>>> TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122204
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM
>>>> response mcmd (ffff8800b1e2f590) status 0x2 state 0x0
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20:
>>>> Sending task mgmt ABTS response (ha=ffff88010f110000,
>>>> atio=ffff8800b1e2f7a0, status=2
>>>> Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending
>>>> TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122240
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>>> ABTS_RESP_24XX: compl_status 0
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM
>>>> response mcmd (ffff8800b1e2f7e0) status 0x2 state 0x0
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>>> ABTS_RESP_24XX: compl_status 31
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20:
>>>> Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20:
>>>> Sending task mgmt ABTS response (ha=ffff88010f110000,
>>>> atio=ffff88010f124280, status=0
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20:
>>>> ABTS_RESP_24XX: compl_status 31
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20:
>>>> Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000)
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20:
>>>> Sending task mgmt ABTS response (ha=ffff88010f110000,
>>>> atio=ffff88010f1242c0, status=0
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20:
>>>> Sending task mgmt ABTS response (ha=ffff88010f110000,
>>>> atio=ffff8800b1e2f9f0, status=2
>>>>
>>>> and continues with TMR_TASK_DOES_NOT_EXIST for other oustanding tags..
>>>>
>>>> Until, target-core session release for two tcm_qla2xxx host ports:
>>>>
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20:
>>>> qlt_free_session_done: se_sess ffff880099a80a80 / sess
>>>> ffff880111665cc0 from port 21:03:00:1b:32:74:b6:cb loop_id 0x03 s_id
>>>> 01:05:00 logout 1 keep 1 els_logo 0
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
>>>> qlt_free_cmd: se_cmd[ffff8800993849f8] ox_id 0773
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20:
>>>> qlt_free_session_done: se_sess ffff88009e8f9040 / sess
>>>> ffff8800abe95480 from port 50:01:43:80:16:77:99:38 loop_id 0x02 s_id
>>>> 01:00:00 logout 1 keep 1 els_logo 0
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
>>>> qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
>>>> qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC
>>>> handler waking up, dpc_flags=0x0.
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20:
>>>> Async-logout - hdl=2 loop-id=3 portid=010500.
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20:
>>>> Async-logout - hdl=3 loop-id=2 portid=010000.
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC
>>>> handler sleeping.
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>> Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>> Sending TERM EXCH CTIO (ha=ffff88010f110000)
>>>>
>>>> From there, ELS with unexpected NOTIFY_ACK received start to occur:
>>>>
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC
>>>> handler waking up, dpc_flags=0x0.
>>>> Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC
>>>> handler sleeping.
>>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
>>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20:
>>>> IMMED_NOTIFY ATIO
>>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20:
>>>> qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
>>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>> Sending TERM ELS CTIO (ha=ffff88010f110000)
>>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20:
>>>> Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with
>>>> PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
>>>> Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20:
>>>> qla_target(0): Unexpected NOTIFY_ACK received
>>>>
>>>> ELS packets for the same two host ports continue:
>>>>
>>>> Oct 28 00:46:40 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts
>>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
>>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20:
>>>> IMMED_NOTIFY ATIO
>>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20:
>>>> qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03
>>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>> Sending TERM ELS CTIO (ha=ffff88010f110000)
>>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20:
>>>> Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with
>>>> PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1
>>>> Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20:
>>>> qla_target(0): Unexpected NOTIFY_ACK received
>>>> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05b1
>>>> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20:
>>>> is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1,
>>>> cmd->dma_data_direction=1 se_cmd[ffff88009a9977d8]
>>>> Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
>>>> qlt_free_cmd: se_cmd[ffff88009a9977d8] ox_id 05b1
>>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000
>>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20:
>>>> IMMED_NOTIFY ATIO
>>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20:
>>>> qla_target(0): Port ID: 0x00:05:01 ELS opcode: 0x03
>>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>> Sending TERM ELS CTIO (ha=ffff88010f110000)
>>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20:
>>>> Linking sess ffff880111665cc0 [0] wwn 21:03:00:1b:32:74:b6:cb with
>>>> PLOGI ACK to wwn 21:03:00:1b:32:74:b6:cb s_id 01:05:00, ref=1
>>>> Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20:
>>>> qla_target(0): Unexpected NOTIFY_ACK received
>>>>
>>>> And eventually, the 360 second hung task timeout warnings appear:
>>>>
>>>> Oct 28 00:49:48 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts
>>>> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05ba
>>>> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20:
>>>> is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1,
>>>> cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
>>>> Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
>>>> qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05ba
>>>> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20:
>>>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05bb
>>>> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20:
>>>> is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1,
>>>> cmd->dma_data_direction=1 se_cmd[ffff88009a988508]
>>>> Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
>>>> qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05bb
>>>> Jan  3 15:34:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon
>>>> successfully started
>>>> Oct 28 00:50:33 mbpc-pc kernel: INFO: task kworker/0:2:31731 blocked
>>>> for more than 360 seconds.
>>>> Oct 28 00:50:33 mbpc-pc kernel:      Not tainted 4.8.4 #2
>>>> Oct 28 00:50:33 mbpc-pc kernel: "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Oct 28 00:50:33 mbpc-pc kernel: kworker/0:2     D
>>>> ffff88011affb968     0 31731      2 0x00000080
>>>> Oct 28 00:50:33 mbpc-pc kernel: Workqueue: events
>>>> qlt_free_session_done [qla2xxx]
>>>> Oct 28 00:50:33 mbpc-pc kernel: ffff88011affb968 ffff88011affb8d8
>>>> ffff880013514940 0000000000000006
>>>> Oct 28 00:50:33 mbpc-pc kernel: ffff8801140fe880 ffffffff81f998c2
>>>> 0000000000000000 ffff880100000000
>>>> Oct 28 00:50:33 mbpc-pc kernel: ffffffff810bdaaa ffffffff00000000
>>>> ffffffff00000051 ffff880100000000
>>>> Oct 28 00:50:33 mbpc-pc kernel: Call Trace:
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810bdaaa>] ?
>>>> vprintk_emit+0x27a/0x4d0
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162e7ec>]
>>>> schedule_timeout+0x9c/0xe0
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162cfa0>]
>>>> wait_for_completion+0xc0/0xf0
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810923e0>] ?
>>>> try_to_wake_up+0x260/0x260
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81161d73>] ?
>>>> mempool_free+0x33/0x90
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa08f76ad>]
>>>> target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod]
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa00e7188>] ?
>>>> qla2x00_post_work+0x58/0x70 [qla2xxx]
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa0286f69>]
>>>> tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx]
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa01447e9>]
>>>> qlt_free_session_done+0xf9/0x3d0 [qla2xxx]
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81080639>]
>>>> process_one_work+0x189/0x4e0
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8107d915>] ?
>>>> wq_worker_waking_up+0x15/0x70
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109eb59>] ?
>>>> idle_balance+0x79/0x290
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>> schedule+0x40/0xb0
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8108150d>]
>>>> worker_thread+0x16d/0x520
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162bb3d>] ?
>>>> __schedule+0x2fd/0x6a0
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>> schedule+0x40/0xb0
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>> maybe_create_worker+0x110/0x110
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>> schedule_tail+0x1e/0xc0
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>> ret_from_fork+0x1f/0x40
>>>> Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>> kthread_freezable_should_stop+0x70/0x70
>>>>
>>>>>
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>>> Intermediate CTIO received (status 6)
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>>> Sending
>>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20:
>>>>> qla_target(0): terminating exchange for aborted cmd=ffff88009af9f488
>>>>> (se_cmd=ffff88009af9f488, tag=1131312)
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>>> Sending
>>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20:
>>>>> qla_target(0): Terminating cmd ffff88009af9f488 with incorrect state 2
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>>> Sending
>>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>>> Intermediate CTIO received (status 6)
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>>> Intermediate CTIO received (status 8)
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20:
>>>>> Sending
>>>>> TERM EXCH CTIO (ha=ffff88010ecb0000)
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20:
>>>>> Intermediate CTIO received (status 6)
>>>>> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20:
>>>>> qlt_free_cmd: se_cmd[ffff88009af9f488] ox_id 0673
>>>>> Oct 28 01:19:59 mbpc-pc kernel: ------------[ cut here ]------------
>>>>> Oct 28 01:19:59 mbpc-pc kernel: kernel BUG at
>>>>> drivers/scsi/qla2xxx/qla_target.c:3319!
>>>>> Oct 28 01:19:59 mbpc-pc kernel: invalid opcode: 0000 [#1] SMP
>>>>> Oct 28 01:19:59 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc
>>>>> tcm_loop target_core_file target_core_iblock target_core_pscsi
>>>>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat
>>>>> ebtables
>>>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM
>>>>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4
>>>>> it87
>>>>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev
>>>>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt
>>>>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse
>>>>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456
>>>>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx
>>>>> raid6_pq
>>>>> libcrc32c joydev sg serio_raw e1000 r8169 mii kvm_amd kvm
>>>>> snd_hda_codec_realtek snd_hda_codec_generic irqbypass pcspkr k10temp
>>>>> snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq
>>>>> snd_seq_device snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core wmi
>>>>> shpchp acpi_cpufreq ext4 mbcache jbd2 qla2xxx scsi_transport_fc floppy
>>>>> firewire_ohci f
>>>>> Oct 28 01:19:59 mbpc-pc kernel: irewire_core crc_itu_t sd_mod
>>>>> pata_acpi
>>>>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror
>>>>> dm_region_hash dm_log dm_mod
>>>>> Oct 28 01:19:59 mbpc-pc kernel: CPU: 0 PID: 296 Comm: kworker/u16:6
>>>>> Not
>>>>> tainted 4.8.4 #2
>>>>> Oct 28 01:19:59 mbpc-pc kernel: Hardware name: Gigabyte Technology
>>>>> Co.,
>>>>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
>>>>> Oct 28 01:19:59 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work
>>>>> [target_core_mod]
>>>>> Oct 28 01:19:59 mbpc-pc kernel: task: ffff8801109623c0 task.stack:
>>>>> ffff880110968000
>>>>> Oct 28 01:19:59 mbpc-pc kernel: RIP: 0010:[<ffffffffa0143810>]
>>>>> [<ffffffffa0143810>] qlt_free_cmd+0x160/0x180 [qla2xxx]
>>>>> Oct 28 01:19:59 mbpc-pc kernel: RSP: 0018:ffff88011096bb18  EFLAGS:
>>>>> 00010202
>>>>> Oct 28 01:19:59 mbpc-pc kernel: RAX: 0000000000000051 RBX:
>>>>> ffff88009af9f488 RCX: 0000000000000006
>>>>> Oct 28 01:19:59 mbpc-pc kernel: RDX: 0000000000000007 RSI:
>>>>> 0000000000000007 RDI: ffff88011fc0cb40
>>>>> Oct 28 01:19:59 mbpc-pc kernel: RBP: ffff88011096bb48 R08:
>>>>> 0000000000000000 R09: ffffffff81fa4765
>>>>> Oct 28 01:19:59 mbpc-pc kernel: R10: 0000000000000074 R11:
>>>>> 0000000000000002 R12: ffff8801137770c0
>>>>> Oct 28 01:19:59 mbpc-pc kernel: R13: ffff8800a126eaf0 R14:
>>>>> ffff88009af9f510 R15: 0000000000000296
>>>>> Oct 28 01:19:59 mbpc-pc kernel: FS:  0000000000000000(0000)
>>>>> GS:ffff88011fc00000(0000) knlGS:0000000000000000
>>>>> Oct 28 01:19:59 mbpc-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>>>>> 0000000080050033
>>>>> Oct 28 01:19:59 mbpc-pc kernel: CR2: 00007f8eef58d000 CR3:
>>>>> 00000000cabad000 CR4: 00000000000006f0
>>>>> Oct 28 01:19:59 mbpc-pc kernel: Stack:
>>>>> Oct 28 01:19:59 mbpc-pc kernel: ffff880000000673 ffff88009af9f488
>>>>> ffff8800a126eaf0 ffff88009af9f59c
>>>>> Oct 28 01:19:59 mbpc-pc kernel: ffff88009af9f488 ffff8800a126eaf0
>>>>> ffff88011096bb58 ffffffffa027f7f4
>>>>> Oct 28 01:19:59 mbpc-pc kernel: ffff88011096bbb8 ffffffffa08f758c
>>>>> ffff88009af9f510 ffff88009af9f488
>>>>> Oct 28 01:19:59 mbpc-pc kernel: Call Trace:
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa027f7f4>]
>>>>> tcm_qla2xxx_release_cmd+0x14/0x30 [tcm_qla2xxx]
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f758c>]
>>>>> target_release_cmd_kref+0xac/0x110 [target_core_mod]
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f7627>]
>>>>> target_put_sess_cmd+0x37/0x70 [target_core_mod]
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f46f7>]
>>>>> core_tmr_abort_task+0x107/0x160 [target_core_mod]
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f6aa4>]
>>>>> target_tmr_work+0x154/0x160 [target_core_mod]
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81080639>]
>>>>> process_one_work+0x189/0x4e0
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810d060c>] ?
>>>>> del_timer_sync+0x4c/0x60
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108131e>] ?
>>>>> maybe_create_worker+0x8e/0x110
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108150d>]
>>>>> worker_thread+0x16d/0x520
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810923f2>] ?
>>>>> default_wake_function+0x12/0x20
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810a6f06>] ?
>>>>> __wake_up_common+0x56/0x90
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ?
>>>>> schedule+0x40/0xb0
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ?
>>>>> maybe_create_worker+0x110/0x110
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8109130e>] ?
>>>>> schedule_tail+0x1e/0xc0
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162f60f>]
>>>>> ret_from_fork+0x1f/0x40
>>>>> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085f20>] ?
>>>>> kthread_freezable_should_stop+0x70/0x70
>>>>> Oct 28 01:19:59 mbpc-pc kernel: Code: 0d 00 00 48 c7 c7 00 2a 16 a0 e8
>>>>> ac 32 f2 e0 48 83 c4 18 5b 41 5c 41 5d c9 c3 48 8b bb 90 02 00 00
>>>>> e8 85
>>>>> da 07 e1 e9 30 ff ff ff <0f> 0b eb fe 0f 0b 66 2e 0f 1f 84 00 00 00 00
>>>>> 00 eb f4 66 66 66
>>>>> Oct 28 01:19:59 mbpc-pc kernel: RIP  [<ffffffffa0143810>]
>>>>> qlt_free_cmd+0x160/0x180 [qla2xxx]
>>>>> Oct 28 01:19:59 mbpc-pc kernel: RSP <ffff88011096bb18>
>>>>> Oct 28 01:19:59 mbpc-pc kernel: ---[ end trace 2551bf47a19dbe2e ]---
>>>>>
>>>>
>>>> Mmm.
>>>>
>>>> This BUG_ON is signaling a qla_tgt_cmd descriptor is being freed while
>>>> qlt_handle_cmd_for_atio() has queued it for backend execution, but
>>>> qla_tgt_cmd->work -> __qlt_do_work() has not executed.
>>>>
>>>>>
>>>>>
>>>>> 2) This works with a new disk that's just been inserted.  No issues.
>>>>>
>>>>>
>>>>
>>>> Thanks for verifying this scenario.
>>>>
>>>>>
>>>>>
>>>>> The kernel had the patch in both scenarios.  So it appears we can't
>>>>> function on a degraded array that loses 1 (RAID 5/6) or 2 (RAID 6)
>>>>> disks at the moment, even though the array itself is fine.  Perhaps
>>>>> it's the nature of the failed disk.
>>>>>
>>>>
>>>> AFAICT, the hung task involves ABORT_TASK across tcm_qla2xxx session
>>>> reinstatement, when backend I/O latency is high enough to cause
>>>> ABORT_TASK operations across FC fabric host side session reset.
>>>>
>>>> From the logs alone, it's unclear if the failing backend ata6 is
>>>> leaking I/O (indefinitely) when the hung task warnings happen, but
>>>> the preceding ata6 failures + device resets seem to indicate I/O
>>>> completions are just taking a really long time.
>>>>
>>>> Also, it's unclear if the BUG_ON(cmd->cmd_in_wq) in qlt_free_cmd() is a
>>>> side effect of the earlier hung task, or a separate tcm_qla2xxx session
>>>> reinstatement bug.
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> linux-scsi" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>>
>>> Thanks Nicholas.
>>>
>>> Is it possible the RAID 6 array, after attempting to write sectors to a
>>> bad disk, received an error and returned a status message that the QLA
>>> could not interpret, thereby sending the QLA into a routine expecting a
>>> recognized message and simply timing out, all the while the array
>>> 'moved on' along its merry way?  (For example, a message akin to
>>> "Unexpected xyz received, ignoring" that should have been interpreted
>>> and actioned on.)  Or perhaps the software RAID 6 isn't returning
>>> anything meaningful to the QLA driver.  (If it isn't though, it can be
>>> argued that the QLA / target drivers shouldn't have to care whether
>>> the array is working.)
>>
>> So target-core waits for all outstanding backend I/O to complete during
>> session reinstatement.
>>
>> The two hung task warnings observed here mean target-core I/O
>> descriptors from each two tcm_qla2xxx ports are waiting to be completed
>> during session reinstatement, but this never happens.
>>
>> Which means one of three things:
>>
>> 1) The backend below target-core is leaking I/O completions, which
>>    is a bug outside of target-core and needs to be identified.
>> 2) There is a target-core bug leaking se_cmd->cmd_kref, preventing
>>    the final reference release to occur.
>> 3) There is a tcm_qla2xxx bug leaking se_cmd->cmd_kref, preventing
>>    the final reference release to occur.
>>
>>>
>>> The 100% util on /dev/sdf lasted for less than 30 seconds.
>>
>> Based on the above, between when tag=1122276 got ABORT_TASK and ata6
>> finally gave back the descriptor to allow TMR_FUNCTION_COMPLETE was at
>> least 90 seconds.
>>
>> This does not include the extra time from initial I/O submission to when
>> ESX SCSI timeouts fire to generate ABORT_TASK, which IIRC depends upon
>> the FC host LLD.
>>
>> Btw, I don't recall libata device timeouts being 90+ seconds, which
>> looks a little strange...
>>
>> This timeout is in /sys/class/scsi_device/$HCTL/device/eh_timeout.
>>
>> So assuming MD is not holding onto I/O much beyond backend device
>> eh_timeout during failure, a simple work-around is to keep the combined
>> backend eh_timeout and MD consumer I/O timeout lower than ESX FC SCSI
>> timeouts.
>>
>> From the logs above, it would likely mask this specific bug if it's
>> related to #2 or #3.
>>
>>> Well below
>>> the 120s default timeout (I've upped this to 360 for the purpose of
>>> testing this scenario)
>>>
>>> I do see these:
>>>
>>> http://microdevsys.com/linux-lio/messages-mailing-list
>>> Oct 23 19:50:26 mbpc-pc kernel: qla2xxx [0000:04:00.0]-5811:20:
>>> Asynchronous PORT UPDATE ignored 0000/0004/0600.
>>>
>>> but I don't think it's that one as it doesn't correlate well at all.
>>>
>>
>> These are unrelated to the hung task warnings.
>>
>>> If there's another test that I can do to get more info, please feel free
>>> to suggest.
>>>
>>
>> Ideally, being able to generate a vmcore crashdump is the most helpful.
>> This involves manually triggering a crash after you observe the hung
>> tasks.  It's reasonably easy to set up once CONFIG_KEXEC=y +
>> CONFIG_CRASH_DUMP=y are enabled:
>>
>> https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
>>
>> I don't know if you'll run into problems on v4.8.y as per Anil's earlier
>> email, but getting a proper vmcore for analysis is usually the fastest
>> path to root cause.
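The manual trigger step can be sketched as below; this assumes kdump is already armed per the wiki page above.  Note the final command deliberately crashes and reboots the machine, so it is shown commented out:

```shell
#!/bin/sh
# Capture a vmcore of a hung-task situation by crashing the kernel via sysrq.
# WARNING: the sysrq-trigger write reboots the box; run it only after the
# hung task warnings appear, with kdump loaded.

# Make sure the sysrq interface is enabled:
echo 1 > /proc/sys/kernel/sysrq

# Trigger the crash; kdump then writes the vmcore (commonly under /var/crash):
# echo c > /proc/sysrq-trigger
```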
>>
>> Short of that, it would be helpful to identify the state of the se_cmd
>> descriptors getting leaked.  Here's a quick patch to add some more
>> verbosity:
>>
>> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
>> index 7dfefd6..9b93a2c 100644
>> --- a/drivers/target/target_core_transport.c
>> +++ b/drivers/target/target_core_transport.c
>> @@ -2657,9 +2657,13 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
>>
>>         list_for_each_entry_safe(se_cmd, tmp_cmd,
>>                                 &se_sess->sess_wait_list, se_cmd_list) {
>> -               pr_debug("Waiting for se_cmd: %p t_state: %d, fabric state:"
>> -                       " %d\n", se_cmd, se_cmd->t_state,
>> -                       se_cmd->se_tfo->get_cmd_state(se_cmd));
>> +               printk("Waiting for se_cmd: %p t_state: %d, fabric state:"
>> +                       " %d se_cmd_flags: 0x%08x transport_state: 0x%08x"
>> +                       " CDB: 0x%02x\n",
>> +                       se_cmd, se_cmd->t_state,
>> +                       se_cmd->se_tfo->get_cmd_state(se_cmd),
>> +                       se_cmd->se_cmd_flags, se_cmd->transport_state,
>> +                       se_cmd->t_task_cdb[0]);
>>
>>                 spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>>                 tas = (se_cmd->transport_state & CMD_T_TAS);
>> @@ -2671,9 +2675,13 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
>>                 }
>>
>>                 wait_for_completion(&se_cmd->cmd_wait_comp);
>> -               pr_debug("After cmd_wait_comp: se_cmd: %p t_state: %d"
>> -                       " fabric state: %d\n", se_cmd, se_cmd->t_state,
>> -                       se_cmd->se_tfo->get_cmd_state(se_cmd));
>> +               printk("After cmd_wait_comp: se_cmd: %p t_state: %d"
>> +                       " fabric state: %d se_cmd_flags: 0x%08x transport_state:"
>> +                       " 0x%08x CDB: 0x%02x\n",
>> +                       se_cmd, se_cmd->t_state,
>> +                       se_cmd->se_tfo->get_cmd_state(se_cmd),
>> +                       se_cmd->se_cmd_flags, se_cmd->transport_state,
>> +                       se_cmd->t_task_cdb[0]);
>>
>>                 se_cmd->se_tfo->release_cmd(se_cmd);
>>         }
>>
>>
>
> I had kdump configured, but the Seagate 2TB failed to the point
> where it isn't even detected, and a device like /dev/sdf is no longer
> created for it.  There are only some errors printed that something was
> found on the SATA connector, but commands errored out against it.  (See
> below.)  (Might be the PCB going.  Once out of the system, or with the
> voltage cut, it got to this point very quickly.)
>
> I did add the debug messages above to catch more in the future.  I'm
> also checking the mdadm changelogs for possibly related updates, since
> I'm running 3.3.2-5.el6 and the latest is 3.3.4 or 3.4.0.
>
> I'm leaning to the 1) case above as the RAID 6 did not pick up the
> failed disk even though smartctl -A could not access any smart
> information.  I'll also post a new thread on linux-raid.
>
> Again thanks very much for the help.  Appreciated.
>

Ok, so I got an answer from the RAID community confirming it's most
likely case 1).

MD RAID does not care about certain drive failures; it cares only that
the drive reported a successful write.

That opens the door to many issues RAID won't cover, which is why we
see problems show up elsewhere, such as in applications sitting on top
of the array, but not in the RAID itself.

What I basically need is solid monitoring in place on the disks, to
alert me ahead of time so I can replace potentially problematic disks.
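A minimal sketch of such monitoring, assuming smartmontools is installed and run periodically from cron; the /dev/sd[a-f] device list is a placeholder for the actual array member disks:

```shell
#!/bin/sh
# Periodically check overall SMART health and flag disks that fail the
# check or stop answering SMART queries (both happened here before the
# array trouble started).
for d in /dev/sd[a-f]; do
    [ -b "$d" ] || continue
    if ! smartctl -H "$d" | grep -q 'PASSED'; then
        logger -t disk-monitor "SMART health check failed or unreadable on $d"
    fi
done
```

A disk whose PCB is dying will often stop responding to SMART queries before the array notices anything, so alerting on an unreadable health status is as important as alerting on a FAILED one.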

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.




Thread overview: 14+ messages
-- links below jump to the message on this page --
2016-10-24  2:03 Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds TomK
2016-10-24  4:32 ` TomK
2016-10-24  4:45   ` TomK
2016-10-24  6:36     ` Nicholas A. Bellinger
2016-10-25  5:28       ` TomK
2016-10-26  2:05         ` TomK
2016-10-26  7:20           ` Nicholas A. Bellinger
2016-10-26 12:08             ` TomK
2016-10-28  6:01               ` TomK
2016-10-29  7:50                 ` Nicholas A. Bellinger
2016-10-29 18:10                   ` TomK
2016-10-29 21:44                     ` Nicholas A. Bellinger
2016-10-30 18:50                       ` TomK
2016-11-01  2:44                         ` TomK
