* Lock recursion seen on qla2xxx client when rebooting the target server
@ 2019-04-01  0:44 Laurence Oberman
From: Laurence Oberman @ 2019-04-01  0:44 UTC (permalink / raw)
  To: linux-scsi, linux-block, Himanshu Madhani, Madhani, Himanshu,
	Hannes Reinecke, Ewan Milne
  Cc: Marco Patalano, Dutile, Don, Van Assche, Bart

Those who have been following my trials and tribulations with SRP and
block-mq panics (see "Re: Panic when rebooting target server testing srp
on 5.0.0-rc2") know I was going to run the same test with qla2xxx and
F/C.

Anyway, rebooting the target server (LIO), which triggers the
still-undiagnosed block-mq race when SRP is the client, causes issues
with 5.1-rc2 as well.

The issue here is different: I was seeing a total lockup with no console
messages, and had to enable lock debugging to capture the report below.

Anyway, Hannes, how have you folks not seen these issues at SUSE with
5.1+ testing? Here I caught two different problems that are now latent
in 5.1-x (maybe earlier too). This is a generic array reboot test that
sadly reflects a common failure our customers hit when they have fabric
or array issues.

Kernel 5.1.0-rc2+ on an x86_64

localhost login: [  301.752492] BUG: spinlock cpu recursion on CPU#38,
kworker/38:0/204
[  301.782364]  lock: 0xffff90ddb2e43430, .magic: dead4ead, .owner:
kworker/38:1/271, .owner_cpu: 38
[  301.825496] CPU: 38 PID: 204 Comm: kworker/38:0 Kdump: loaded Not
tainted 5.1.0-rc2+ #1
[  301.863052] Hardware name: HP ProLiant ML150 Gen9/ProLiant ML150
Gen9, BIOS P95 05/21/2018
[  301.903614] Workqueue: qla2xxx_wq qla24xx_delete_sess_fn [qla2xxx]
[  301.933561] Call Trace:
[  301.945950]  dump_stack+0x5a/0x73
[  301.962080]  do_raw_spin_lock+0x83/0xa0
[  301.980287]  _raw_spin_lock_irqsave+0x66/0x80
[  302.001726]  ? qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
[  302.028111]  qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
[  302.052864]  process_one_work+0x215/0x4c0
[  302.071940]  ? process_one_work+0x18c/0x4c0
[  302.092228]  worker_thread+0x46/0x3e0
[  302.110313]  kthread+0xfb/0x130
[  302.125274]  ? process_one_work+0x4c0/0x4c0
[  302.146054]  ? kthread_bind+0x10/0x10
[  302.163789]  ret_from_fork+0x35/0x40

Just an FYI: with only 100 LUNs and 4 paths, I cannot boot the host
without adding watchdog_thresh=60 to the kernel command line.
I hit a hard lockup during LUN discovery, so that issue is also out there.
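For anyone wanting to reproduce with the same workaround, this is roughly
how the parameter can be applied on a RHEL/Fedora-style host (grubby is
assumed to be installed; on other distributions edit GRUB_CMDLINE_LINUX in
/etc/default/grub and regenerate the grub config instead):

```shell
# Persist watchdog_thresh=60 on the kernel command line for all
# installed kernels, so the lockup watchdogs fire at 60s instead of 10s.
sudo grubby --update-kernel=ALL --args="watchdog_thresh=60"

# The threshold can also be raised at runtime to confirm it helps:
echo 60 | sudo tee /proc/sys/kernel/watchdog_thresh
```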

So far, 5.x has been problematic for regression testing.

Regards
Laurence



* Re: Lock recursion seen on qla2xxx client when rebooting the target server
From: Laurence Oberman @ 2019-04-02 21:30 UTC (permalink / raw)
  To: linux-scsi, linux-block, Himanshu Madhani, Madhani, Himanshu,
	Hannes Reinecke, Ewan Milne
  Cc: Marco Patalano, Dutile, Don, Van Assche, Bart

On Sun, 2019-03-31 at 20:44 -0400, Laurence Oberman wrote:
> Those who have been following my trials and tribulations with SRP and
> block-mq panics (see "Re: Panic when rebooting target server testing
> srp on 5.0.0-rc2") know I was going to run the same test with qla2xxx
> and F/C.
> 
> Anyway, rebooting the target server (LIO), which triggers the
> still-undiagnosed block-mq race when SRP is the client, causes issues
> with 5.1-rc2 as well.
> 
> The issue here is different: I was seeing a total lockup with no
> console messages, and had to enable lock debugging to capture the
> report below.
> 
> Anyway, Hannes, how have you folks not seen these issues at SUSE with
> 5.1+ testing? Here I caught two different problems that are now latent
> in 5.1-x (maybe earlier too). This is a generic array reboot test that
> sadly reflects a common failure our customers hit when they have
> fabric or array issues.
> 
> Kernel 5.1.0-rc2+ on an x86_64
> 
> localhost login: [  301.752492] BUG: spinlock cpu recursion on
> CPU#38,
> kworker/38:0/204
> [  301.782364]  lock: 0xffff90ddb2e43430, .magic: dead4ead, .owner:
> kworker/38:1/271, .owner_cpu: 38
> [  301.825496] CPU: 38 PID: 204 Comm: kworker/38:0 Kdump: loaded Not
> tainted 5.1.0-rc2+ #1
> [  301.863052] Hardware name: HP ProLiant ML150 Gen9/ProLiant ML150
> Gen9, BIOS P95 05/21/2018
> [  301.903614] Workqueue: qla2xxx_wq qla24xx_delete_sess_fn [qla2xxx]
> [  301.933561] Call Trace:
> [  301.945950]  dump_stack+0x5a/0x73
> [  301.962080]  do_raw_spin_lock+0x83/0xa0
> [  301.980287]  _raw_spin_lock_irqsave+0x66/0x80
> [  302.001726]  ? qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
> [  302.028111]  qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
> [  302.052864]  process_one_work+0x215/0x4c0
> [  302.071940]  ? process_one_work+0x18c/0x4c0
> [  302.092228]  worker_thread+0x46/0x3e0
> [  302.110313]  kthread+0xfb/0x130
> [  302.125274]  ? process_one_work+0x4c0/0x4c0
> [  302.146054]  ? kthread_bind+0x10/0x10
> [  302.163789]  ret_from_fork+0x35/0x40
> 
> Just an FYI: with only 100 LUNs and 4 paths, I cannot boot the host
> without adding watchdog_thresh=60 to the kernel command line.
> I hit a hard lockup during LUN discovery, so that issue is also out
> there.
> 
> So far, 5.x has been problematic for regression testing.
> 
> Regards
> Laurence

I chatted with Himanshu about this and he will be sending me a test
patch. He thinks he knows what is going on here.
I will report back when tested.

Note: to reiterate, this is not the block-mq issue I uncovered with SRP
testing; the investigation of that is still ongoing.

Thanks
Laurence



