* QLA2200 causes kernel bug
@ 2009-08-06 15:28 Thomas Georgiou
  2009-08-06 16:49 ` Andrew Vasquez
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Georgiou @ 2009-08-06 15:28 UTC (permalink / raw)
  To: linux-scsi

Whenever I have the qla2xxx module loaded, some kernel problem
eventually occurs.  Sometimes it's an oops, sometimes a BUG, and once
a crash.  The same thing has happened on 3 different machines, all
with qla2200 cards in them.

Here is the latest backtrace:
[42151.610011] kernel BUG at drivers/scsi/scsi_transport_fc.c:3022!
[42151.610011] invalid opcode: 0000 [#1] SMP
[42151.610011] last sysfs file: /sys/devices/pci0000:00/0000:00:06.0/0000:05:00.2/0000:0a:01.0/host1/rport-1:0-25/target1:0:25/fc_transport/target1:0:25/port_name
[42151.610011] CPU 3
[42151.610011] Modules linked in: iscsi_scst scst_disk scst_vdisk scst qla2xxx
[42151.610011] Pid: 4846, comm: fc_dl_1 Not tainted 2.6.30.4dl380 #3 ProLiant DL380 G4
[42151.610011] RIP: 0010:[<ffffffff812e3c1a>]  [<ffffffff812e3c1a>] fc_timeout_deleted_rport+0x250/0x2df
[42151.610011] RSP: 0018:ffff8801965c7e70  EFLAGS: 00010202
[42151.610011] RAX: ffff8801967cb648 RBX: ffff88019a837ba0 RCX: ffff8801965c7de0
[42151.610011] RDX: 0000000000000003 RSI: ffff8801968b89c0 RDI: ffff8801965c7e60
[42152.490635] RBP: ffff8801965c7eb0 R08: ffffffff8185dd50 R09: ffff8801965c7d70
[42152.490635] R10: ffff8801965c7da0 R11: ffff88019530ea80 R12: ffff880196dd3400
[42152.490635] R13: ffff880196dd3598 R14: ffff880196cb4000 R15: ffff88019a890800
[42152.490635] FS:  0000000000000000(0000) GS:ffff88002807f000(0000) knlGS:0000000000000000
[42152.490635] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[42152.490635] CR2: 0000000001c6b628 CR3: 00000001974af000 CR4: 00000000000006e0
[42152.490635] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42152.490635] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[42152.490635] Process fc_dl_1 (pid: 4846, threadinfo ffff8801965c6000, task ffff8801967cb2c0)
[42152.490635] Stack:
[42152.490635]  0000000000000000 0000000000000202 ffff8801975d2000 ffffc20000082b00
[42152.490635]  ffff880196dd3598 ffffc20000082b00 ffff8801967cb2c0 ffffffff812e39ca
[42152.490635]  ffff8801965c7f20 ffffffff8104708c 0000000000000000 ffff8801967cb2c0
[42152.490635] Call Trace:
[42152.490635]  [<ffffffff812e39ca>] ? fc_timeout_deleted_rport+0x0/0x2df
[42152.490635]  [<ffffffff8104708c>] worker_thread+0x113/0x1ac
[42152.490635]  [<ffffffff81049f58>] ? autoremove_wake_function+0x0/0x38
[42152.490635]  [<ffffffff81046f79>] ? worker_thread+0x0/0x1ac
[42152.490635]  [<ffffffff81046f79>] ? worker_thread+0x0/0x1ac
[42152.490635]  [<ffffffff81049e05>] kthread+0x56/0x85
[42152.490635]  [<ffffffff8100caba>] child_rip+0xa/0x20
[42152.490635]  [<ffffffff81049daf>] ? kthread+0x0/0x85
[42152.490635]  [<ffffffff8100cab0>] ? child_rip+0x0/0x20
[42152.490635] Code: e0 fb 83 c8 08 41 88 85 b0 fe ff ff 49 8b 7e 58 48 8b 75 c8 e8 e4 86 22 00 4c 89 e7 e8 7c d1 ff ff 41 83 bd 90 fe ff ff 01 74 04 <0f> 0b eb fe 41 8b 87 d0 02 00 00 83 f8 02 74 16 83 f8 03 74 29
[42152.490635] RIP  [<ffffffff812e3c1a>] fc_timeout_deleted_rport+0x250/0x2df
[42152.490635]  RSP <ffff8801965c7e70>

and

[43410.037930] qla2xxx 0000:0a:01.0: LIP reset occurred (f7ca).
[43414.941958] qla2xxx 0000:0a:01.0: LOOP DOWN detected (b88f 0 9581).
[43418.341988] qla2xxx 0000:0a:01.0: LIP occurred (f7b1).
[43418.403690] qla2xxx 0000:0a:01.0: LOOP UP detected (1 Gbps).
[43423.541948] qla2xxx 0000:0a:01.0: LIP reset occurred (f5b5).
[43424.584083] qla2xxx 0000:06:02.0: LIP reset occurred (f7f7).
[43425.650042]  rport-0:0-0: blocked FC remote port time out: removing target and saving binding
[43425.672504] qla2xxx 0000:06:02.0: LIP occurred (f7f7).
[43425.815119] ------------[ cut here ]------------
[43425.825014] kernel BUG at drivers/scsi/scsi_transport_fc.c:3022!
[43425.825014] invalid opcode: 0000 [#2] SMP
[43425.825014] last sysfs file: /sys/devices/pci0000:00/0000:00:06.0/0000:05:00.2/0000:0a:01.0/host1/rport-1:0-22/target1:0:22/1:0:22:0/block/sdco/uevent
[43425.825014] CPU 1
[43425.825014] Modules linked in: iscsi_scst scst_disk scst_vdisk scst qla2xxx
[43425.825014] Pid: 4211, comm: fc_dl_0 Tainted: G      D 2.6.30.4dl380 #3 ProLiant DL380 G4
[43425.825014] RIP: 0010:[<ffffffff812e3c1a>]  [<ffffffff812e3c1a>] fc_timeout_deleted_rport+0x250/0x2df
[43425.825014] RSP: 0018:ffff88019840be70  EFLAGS: 00010202
[43425.825014] RAX: ffff88019840be60 RBX: ffff880197db9920 RCX: ffff88019840bde0
[43425.825014] RDX: 0000000000000001 RSI: ffff880196cb9000 RDI: ffffffffffffff10
[43425.825014] RBP: ffff88019840beb0 R08: ffffffff8185dd50 R09: ffff88019840bd70
[43425.825014] R10: ffff88019840bda0 R11: ffff88008a7c61a0 R12: ffff88019a885c00
[43425.825014] R13: ffff88019a885d98 R14: ffff880196cb0000 R15: ffff88019a894400
[43425.825014] FS:  0000000000000000(0000) GS:ffff88002804d000(0000) knlGS:0000000000000000
[43425.825014] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[43425.825014] CR2: 00007fe26b7a8098 CR3: 0000000198589000 CR4: 00000000000006e0
[43425.825014] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[43425.825014] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[43425.825014] Process fc_dl_0 (pid: 4211, threadinfo ffff88019840a000, task ffff880198458740)
[43425.825014] Stack:
[43425.825014]  0000000100000000 0000000000000202 ffff8801975d2000 ffffc20000081180
[43425.825014]  ffff88019a885d98 ffffc20000081180 ffff880198458740 ffffffff812e39ca
[43425.825014]  ffff88019840bf20 ffffffff8104708c 0000000000000000 ffff880198458740
[43425.825014] Call Trace:
[43425.825014]  [<ffffffff812e39ca>] ? fc_timeout_deleted_rport+0x0/0x2df
[43425.825014]  [<ffffffff8104708c>] worker_thread+0x113/0x1ac
[43425.825014]  [<ffffffff81049f58>] ? autoremove_wake_function+0x0/0x38
[43425.825014]  [<ffffffff81046f79>] ? worker_thread+0x0/0x1ac
[43425.825014]  [<ffffffff81046f79>] ? worker_thread+0x0/0x1ac
[43425.825014]  [<ffffffff81049e05>] kthread+0x56/0x85
[43425.825014]  [<ffffffff8100caba>] child_rip+0xa/0x20
[43425.825014]  [<ffffffff81049daf>] ? kthread+0x0/0x85
[43425.825014]  [<ffffffff8100cab0>] ? child_rip+0x0/0x20
[43425.825014] Code: e0 fb 83 c8 08 41 88 85 b0 fe ff ff 49 8b 7e 58 48 8b 75 c8 e8 e4 86 22 00 4c 89 e7 e8 7c d1 ff ff 41 83 bd 90 fe ff ff 01 74 04 <0f> 0b eb fe 41 8b 87 d0 02 00 00 83 f8 02 74 16 83 f8 03 74 29
[43425.825014] RIP  [<ffffffff812e3c1a>] fc_timeout_deleted_rport+0x250/0x2df
[43425.825014]  RSP <ffff88019840be70>
[43425.871192] ---[ end trace dc9543ab95173f0f ]---

The bug has been occurring without scst compiled in as well.

I have filed a bug collecting other backtraces here:
http://bugzilla.kernel.org/show_bug.cgi?id=13873

Any ideas as to what might be causing this?


* Re: QLA2200 causes kernel bug
  2009-08-06 15:28 QLA2200 causes kernel bug Thomas Georgiou
@ 2009-08-06 16:49 ` Andrew Vasquez
  2009-08-06 17:12   ` Thomas Georgiou
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Vasquez @ 2009-08-06 16:49 UTC (permalink / raw)
  To: Thomas Georgiou; +Cc: linux-scsi

On Thu, 06 Aug 2009, Thomas Georgiou wrote:

> Whenever I have the qla2xxx module loaded, some kernel problem
> eventually occurs.  Sometimes it's an oops, a bug, or once, a crash.
> The same thing has been seen to happen on 3 different machines all
> with qla2200 cards in them.
> 
> Here is the latest backtrace:
> [42151.610011] kernel BUG at drivers/scsi/scsi_transport_fc.c:3022!
> [42151.610011] invalid opcode: 0000 [#1] SMP

This looks similar to:

	http://thread.gmane.org/gmane.linux.scsi/49853/focus=50297

with a proposed (though not-liked) solution here:

	http://article.gmane.org/gmane.linux.scsi/50297

Could someone refresh my memory: why was there an issue with re-adding
rports after dev-loss-tmo triggered?


Thomas, could you forward the full messages file?  I'm interested in
seeing what series of events led up to the BUG_ON().  The snippets
here and in the bugzilla only document the failing point BUG_ON().
Also, could you get a test run with the driver error-logging enabled
as well:

	$ modprobe -v qla2xxx ql2xextended_error_logging=1
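
To confirm that the option actually took effect after loading the
module, the value can be read back from sysfs.  The following is only a
minimal sketch: it assumes the driver exports the parameter under
/sys/module/qla2xxx/parameters/, and the helper name read_module_param
is made up for illustration.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sysfs_root: str = "/sys") -> Optional[str]:
    """Return the current value of a loaded module's parameter.

    Returns None if the module is not loaded or the parameter is not
    exported through sysfs (hypothetical helper, not part of any tool).
    """
    path = Path(sysfs_root) / "module" / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    value = read_module_param("qla2xxx", "ql2xextended_error_logging")
    if value is None:
        print("qla2xxx not loaded, or parameter not exported")
    else:
        print(f"ql2xextended_error_logging = {value}")
```

The sysfs_root argument exists only so the helper can be pointed at a
test directory instead of the live /sys tree.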

-- av


* Re: QLA2200 causes kernel bug
  2009-08-06 16:49 ` Andrew Vasquez
@ 2009-08-06 17:12   ` Thomas Georgiou
  2009-08-07  3:40     ` Thomas Georgiou
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Georgiou @ 2009-08-06 17:12 UTC (permalink / raw)
  To: Andrew Vasquez; +Cc: linux-scsi

Here is the dmesg log: http://www.tjhsst.edu/~2010tgeorgio/fclogs/dmesg3.log
Here is the complete syslog file, with more junk in it:
http://www.tjhsst.edu/~2010tgeorgio/fclogs/messages3.log
I will reboot the server and enable error logging.

Could this issue be caused by enclosures having conflicting SCSI IDs?

On Thu, Aug 6, 2009 at 12:49 PM, Andrew
Vasquez<andrew.vasquez@qlogic.com> wrote:
> Thomas, could you forward the full messages file?  I'm interested in
> seeing what series of events led up to the BUG_ON().  The snippets
> here and in the bugzilla only document the failing point BUG_ON().
> Also, could you get a test run with the driver error-logging enabled
> as well:
>
>        $ modprobe -v qla2xxx ql2xextended_error_logging=1
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: QLA2200 causes kernel bug
  2009-08-06 17:12   ` Thomas Georgiou
@ 2009-08-07  3:40     ` Thomas Georgiou
  2009-08-07  7:01       ` Andrew Vasquez
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Georgiou @ 2009-08-07  3:40 UTC (permalink / raw)
  To: Andrew Vasquez; +Cc: linux-scsi

Here is a new log with error logging enabled:
http://www.tjhsst.edu/~2010tgeorgio/fclogs/dmesg4.log

The enclosures do not have conflicting scsi ids.

On Thu, Aug 6, 2009 at 1:12 PM, Thomas Georgiou<tageorgiou@gmail.com> wrote:
> Here is the dmesg log: http://www.tjhsst.edu/~2010tgeorgio/fclogs/dmesg3.log
> Here is the complete syslog file, with more junk in it:
> http://www.tjhsst.edu/~2010tgeorgio/fclogs/messages3.log
> I will reboot the server and enable error logging.
>
> Could this issue be caused by enclosures having conflicting SCSI IDs?
>
> On Thu, Aug 6, 2009 at 12:49 PM, Andrew
> Vasquez<andrew.vasquez@qlogic.com> wrote:
>> Thomas, could you forward the full messages file?  I'm interested in
>> seeing what series of events led up to the BUG_ON().  The snippets
>> here and in the bugzilla only document the failing point BUG_ON().
>> Also, could you get a test run with the driver error-logging enabled
>> as well:
>>
>>        $ modprobe -v qla2xxx ql2xextended_error_logging=1
>


* Re: QLA2200 causes kernel bug
  2009-08-07  3:40     ` Thomas Georgiou
@ 2009-08-07  7:01       ` Andrew Vasquez
  2009-08-07 19:19         ` Thomas Georgiou
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Vasquez @ 2009-08-07  7:01 UTC (permalink / raw)
  To: Thomas Georgiou; +Cc: linux-scsi

On Thu, 06 Aug 2009, Thomas Georgiou wrote:

> Here is a new log with error logging enabled:
> http://www.tjhsst.edu/~2010tgeorgio/fclogs/dmesg4.log
> 
> The enclosures do not have conflicting scsi ids.
> 

Could you let me know what's going on at timestamp 1840:

	[  883.659657] bonnie++ used greatest stack depth: 3472 bytes left
	[ 1840.109135] scsi(1): Asynchronous LIP RESET (f8f7).
	[ 1840.109141] qla2xxx 0000:0a:01.0: LIP reset occurred (f8f7).
	[ 1840.109477] scsi(1): qla2x00_reset_marker()
	[ 1840.150025] scsi(1): fcport-0 - port retry count: 0 remaining
	[ 1840.150030] scsi(1): fcport-1 - port retry count: 0 remaining
	[ 1840.150034] scsi(1): fcport-2 - port retry count: 0 remaining
	[ 1840.150037] scsi(1): fcport-3 - port retry count: 0 remaining
	...
	[ 1840.150113] scsi(1): fcport-26 - port retry count: 0 remaining
	[ 1840.150116] scsi(1): fcport-27 - port retry count: 0 remaining
	[ 1840.203564] scsi(1): LIP occurred (f7f7).
	[ 1840.203569] qla2xxx 0000:0a:01.0: LIP occurred (f7f7).
	[ 1840.234308] scsi(1): Asynchronous PORT UPDATE.
	[ 1840.234313] scsi(1): Port database changed b88f 0000 003f.
	[ 1841.100044]  rport-1:0-0: blocked FC remote port time out: removing target and saving binding
	[ 1841.100101] qla2x00_mailbox_command(1): **** FAILED. mbx0=4006, mbx1=12, mbx2=ffff, cmd=71 ****
	[ 1841.100107] qla2x00_fabric_logout(1): failed=102 mbx1=12.

There seems to be a great deal of fabric disruption occurring on the
fibre.  This seems to be occurring on both 22xx ports.  It also
appears that you've set port-down-timeout (dev-loss-tmo) to 0 seconds
(for faster failovers)?
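
One way to check the effective dev-loss-tmo per remote port is to read
it from the FC transport's sysfs class.  This is a rough sketch, not a
supported tool: it assumes the usual /sys/class/fc_remote_ports layout
with port_name, port_state and dev_loss_tmo attributes, and the
function name list_rports is invented here.

```python
from pathlib import Path
from typing import Dict, List


def list_rports(sysfs_root: str = "/sys") -> List[Dict[str, str]]:
    """Collect selected attributes of every FC remote port.

    Attributes that cannot be read are reported as "?".
    """
    base = Path(sysfs_root) / "class" / "fc_remote_ports"
    rports: List[Dict[str, str]] = []
    if not base.is_dir():
        return rports  # FC transport class not present
    for entry in sorted(base.iterdir()):
        info = {"rport": entry.name}
        for attr in ("port_name", "port_state", "dev_loss_tmo"):
            try:
                info[attr] = (entry / attr).read_text().strip()
            except OSError:
                info[attr] = "?"
        rports.append(info)
    return rports


if __name__ == "__main__":
    for r in list_rports():
        print(f"{r['rport']}: state={r['port_state']} "
              f"dev_loss_tmo={r['dev_loss_tmo']}s wwpn={r['port_name']}")
```

A very small dev_loss_tmo on every rport would be consistent with the
quick "blocked FC remote port time out" messages seen in the log.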

Could you describe the topology?  Would it be possible to isolate the
faults?  I take it the constant stream of RESETs is not expected?

-- av


* Re: QLA2200 causes kernel bug
  2009-08-07  7:01       ` Andrew Vasquez
@ 2009-08-07 19:19         ` Thomas Georgiou
  2009-08-07 21:11           ` Andrew Vasquez
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Georgiou @ 2009-08-07 19:19 UTC (permalink / raw)
  To: Andrew Vasquez; +Cc: linux-scsi

I am not sure what is happening at 1840.

The current topology is royal (the machine in this backtrace)
connected via 2 fibre channel connections directly to a Powervault
224F jbod.  This is then connected via 2 connections again to another
224F, which is then connected to another machine, fiord (which also
has had problems).

I had royal connected to one 224f with 2 connections and did not
connect that jbod to anything else, and it worked with no problems for
the time it was connected like that (2 days).

I have also tried connecting fiord and royal to two powervault 51f
switches in a redundant configuration and then the switches to the
224Fs.  This also generated problems and was where most of the
backtraces in the bug reports came from.

I have set qlport_down_retry=1 for faster failover.  Should I unset
it?  A constant stream of RESETs is not expected.

On Fri, Aug 7, 2009 at 3:01 AM, Andrew Vasquez<andrew.vasquez@qlogic.com> wrote:
> Could you let me know what's going on at timestamp - 1840:
> There seems to be a great deal of fabric disruption occurring on the
> fibre.  This seems to be occurring on both 22xx ports.  It also
> appears that you've set port-down-timeout (dev-loss-tmo) to 0 seconds
> (for faster failovers)?
>
> Could you describe the topology?  Would it be possible to isolate the
> faults, I take it the constant stream of RESETs are not expected?


* Re: QLA2200 causes kernel bug
  2009-08-07 19:19         ` Thomas Georgiou
@ 2009-08-07 21:11           ` Andrew Vasquez
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Vasquez @ 2009-08-07 21:11 UTC (permalink / raw)
  To: Thomas Georgiou; +Cc: linux-scsi

On Fri, 07 Aug 2009, Thomas Georgiou wrote:

> I am not sure what is happening at 1840.
> 
> The current topology is royal (the machine in this backtrace)
> connected via 2 fibre channel connections directly to a Powervault
> 224F jbod.  This is then connected via 2 connections again to another
> 224F, which is then connected to another machine, fiord (which also
> has had problems).
> 
> I had royal connected to one 224f with 2 connections and did not
> connect that jbod to anything else, and it worked with no problems for
> the time it was connected like that (2 days).
> 

Ok, so it looks like there are two problems.  First, I'd suggest you
talk with your JBOD vendor to see if this daisy-chained configuration
is supported.  Is the JBOD acting as a mini-hub in this configuration?
Either way, as can be seen from the logs, your storage device is
continually LIP/LIP-resetting, causing intermittent loss of visibility
to your storage, oftentimes for long enough to have the midlayer
begin its reaping of scsi-devices.  Given the low value for
dev-loss-tmo (set via your qlport_down_retry usage), after numerous
LIPs you run into the second issue: the BUG_ON() triggering within
the FC-transport, in the deferred execution of rport reaping in
fc_timeout_deleted_rport().
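
To get a feel for how often each port sees loop events, the driver
messages can be tallied per PCI device.  The sketch below is
illustrative only: the regular expression is written against the
message wording seen earlier in this thread (which may differ in other
driver versions), and count_loop_events is a made-up name.

```python
import re
from collections import Counter, defaultdict
from typing import Dict

# Matches lines such as:
#   [43410.037930] qla2xxx 0000:0a:01.0: LIP reset occurred (f7ca).
EVENT_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+qla2xxx (?P<dev>\S+): "
    r"(?P<event>LIP reset occurred|LIP occurred|"
    r"LOOP DOWN detected|LOOP UP detected)"
)


def count_loop_events(log_text: str) -> Dict[str, Counter]:
    """Count loop-related qla2xxx messages per PCI device."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for match in EVENT_RE.finditer(log_text):
        counts[match.group("dev")][match.group("event")] += 1
    return dict(counts)


# Excerpt taken from the log posted at the start of this thread.
SAMPLE = """\
[43410.037930] qla2xxx 0000:0a:01.0: LIP reset occurred (f7ca).
[43414.941958] qla2xxx 0000:0a:01.0: LOOP DOWN detected (b88f 0 9581).
[43418.341988] qla2xxx 0000:0a:01.0: LIP occurred (f7b1).
[43418.403690] qla2xxx 0000:0a:01.0: LOOP UP detected (1 Gbps).
[43423.541948] qla2xxx 0000:0a:01.0: LIP reset occurred (f5b5).
[43424.584083] qla2xxx 0000:06:02.0: LIP reset occurred (f7f7).
[43425.672504] qla2xxx 0000:06:02.0: LIP occurred (f7f7).
"""

if __name__ == "__main__":
    for dev, events in sorted(count_loop_events(SAMPLE).items()):
        for event, n in sorted(events.items()):
            print(f"{dev}  {event}: {n}")
```

Running it over a full dmesg capture instead of SAMPLE would show
whether the disruptions are concentrated on one HBA port or spread
across both.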

> I have also tried connecting fiord and royal to two powervault 51f
> switches in a redundant configuration and then the switches to the
> 224Fs.  This also generated problems and was where most of the
> backtraces in the bug reports came from.

Just for completeness, could you gather a similar set of driver logs
with error-logging enabled within this configuration?

> I have set qlport_down_retry=1 for faster failover.

Increasing it may help to avoid the second problem (the BUG_ON()).

> Should I unset
> it?  A constant stream of RESETs is not expected.

Regards,
Andrew Vasquez

