* Lock recursion seen on qla2xxx client when rebooting the target server
@ 2019-04-01  0:44 Laurence Oberman
From: Laurence Oberman @ 2019-04-01  0:44 UTC (permalink / raw)
  To: linux-scsi, linux-block, Himanshu Madhani, Madhani, Himanshu,
	Hannes Reinecke, Ewan Milne
  Cc: Marco Patalano, Dutile, Don, Van Assche, Bart

Those who have been following my trials and tribulations with SRP and
block-mq panics (see "Re: Panic when rebooting target server testing srp
on 5.0.0-rc2") know I was going to run the same test with qla2xxx and
F/C.

Anyway, rebooting the target server (LIO), which triggers the
still-undiagnosed block-mq race when SRP is the client, causes issues
with 5.1-rc2 as well.

The issue here is different: I was seeing a total lockup with no console
messages, and had to enable lock debugging to capture the report below.

Anyway, Hannes, how have you folks not seen these issues at SUSE with
5.1+ testing? Here I caught two different problems that are now latent
in 5.1-x (maybe earlier too). This is a generic array reboot test that
sadly reflects a common failure our customers hit when they have fabric
or array issues.

Kernel 5.1.0-rc2+ on an x86_64

localhost login: [  301.752492] BUG: spinlock cpu recursion on CPU#38,
kworker/38:0/204
[  301.782364]  lock: 0xffff90ddb2e43430, .magic: dead4ead, .owner:
kworker/38:1/271, .owner_cpu: 38
[  301.825496] CPU: 38 PID: 204 Comm: kworker/38:0 Kdump: loaded Not
tainted 5.1.0-rc2+ #1
[  301.863052] Hardware name: HP ProLiant ML150 Gen9/ProLiant ML150
Gen9, BIOS P95 05/21/2018
[  301.903614] Workqueue: qla2xxx_wq qla24xx_delete_sess_fn [qla2xxx]
[  301.933561] Call Trace:
[  301.945950]  dump_stack+0x5a/0x73
[  301.962080]  do_raw_spin_lock+0x83/0xa0
[  301.980287]  _raw_spin_lock_irqsave+0x66/0x80
[  302.001726]  ? qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
[  302.028111]  qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
[  302.052864]  process_one_work+0x215/0x4c0
[  302.071940]  ? process_one_work+0x18c/0x4c0
[  302.092228]  worker_thread+0x46/0x3e0
[  302.110313]  kthread+0xfb/0x130
[  302.125274]  ? process_one_work+0x4c0/0x4c0
[  302.146054]  ? kthread_bind+0x10/0x10
[  302.163789]  ret_from_fork+0x35/0x40

Just an FYI: with only 100 LUNs and 4 paths, I cannot boot the host
without adding watchdog_thresh=60 to the kernel command line.
I hit a hard lockup during LUN discovery, so that issue is also out there.
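For anyone wanting to reproduce with the same workaround, this is roughly
how the parameter can be applied on a RHEL/Fedora-style host (grubby is
assumed to be installed; on other distributions edit GRUB_CMDLINE_LINUX in
/etc/default/grub and regenerate the grub config instead):

```shell
# Persist watchdog_thresh=60 on the kernel command line for all
# installed kernels, so the lockup watchdogs fire at 60s instead of 10s.
sudo grubby --update-kernel=ALL --args="watchdog_thresh=60"

# The threshold can also be raised at runtime to confirm it helps:
echo 60 | sudo tee /proc/sys/kernel/watchdog_thresh
```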

So far, 5.x has been problematic for regression testing.

Regards
Laurence



* Re: Lock recursion seen on qla2xxx client when rebooting the target server
From: Laurence Oberman @ 2019-04-02 21:30 UTC (permalink / raw)
  To: linux-scsi, linux-block, Himanshu Madhani, Madhani, Himanshu,
	Hannes Reinecke, Ewan Milne
  Cc: Marco Patalano, Dutile, Don, Van Assche, Bart

On Sun, 2019-03-31 at 20:44 -0400, Laurence Oberman wrote:
> Those who have been following my trials and tribulations with SRP and
> block-mq panics (see "Re: Panic when rebooting target server testing
> srp on 5.0.0-rc2") know I was going to run the same test with qla2xxx
> and F/C.
> 
> Anyway, rebooting the target server (LIO), which triggers the
> still-undiagnosed block-mq race when SRP is the client, causes issues
> with 5.1-rc2 as well.
> 
> The issue here is different: I was seeing a total lockup with no
> console messages, and had to enable lock debugging to capture the
> report below.
> 
> Anyway, Hannes, how have you folks not seen these issues at SUSE with
> 5.1+ testing? Here I caught two different problems that are now latent
> in 5.1-x (maybe earlier too). This is a generic array reboot test that
> sadly reflects a common failure our customers hit when they have
> fabric or array issues.
> 
> Kernel 5.1.0-rc2+ on an x86_64
> 
> localhost login: [  301.752492] BUG: spinlock cpu recursion on
> CPU#38,
> kworker/38:0/204
> [  301.782364]  lock: 0xffff90ddb2e43430, .magic: dead4ead, .owner:
> kworker/38:1/271, .owner_cpu: 38
> [  301.825496] CPU: 38 PID: 204 Comm: kworker/38:0 Kdump: loaded Not
> tainted 5.1.0-rc2+ #1
> [  301.863052] Hardware name: HP ProLiant ML150 Gen9/ProLiant ML150
> Gen9, BIOS P95 05/21/2018
> [  301.903614] Workqueue: qla2xxx_wq qla24xx_delete_sess_fn [qla2xxx]
> [  301.933561] Call Trace:
> [  301.945950]  dump_stack+0x5a/0x73
> [  301.962080]  do_raw_spin_lock+0x83/0xa0
> [  301.980287]  _raw_spin_lock_irqsave+0x66/0x80
> [  302.001726]  ? qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
> [  302.028111]  qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
> [  302.052864]  process_one_work+0x215/0x4c0
> [  302.071940]  ? process_one_work+0x18c/0x4c0
> [  302.092228]  worker_thread+0x46/0x3e0
> [  302.110313]  kthread+0xfb/0x130
> [  302.125274]  ? process_one_work+0x4c0/0x4c0
> [  302.146054]  ? kthread_bind+0x10/0x10
> [  302.163789]  ret_from_fork+0x35/0x40
> 
> Just an FYI: with only 100 LUNs and 4 paths, I cannot boot the host
> without adding watchdog_thresh=60 to the kernel command line.
> I hit a hard lockup during LUN discovery, so that issue is also out
> there.
> 
> So far, 5.x has been problematic for regression testing.
> 
> Regards
> Laurence

I chatted with Himanshu about this and he will be sending me a test
patch. He thinks he knows what is going on here.
I will report back when tested.

Note: to reiterate, this is not the block-mq issue I uncovered with SRP
testing; the investigation of that is still ongoing.

Thanks
Laurence



