* Delayed FC rport removal causes errors on other device
@ 2014-12-17 22:41 Tony Battersby
  2014-12-17 23:17 ` Giridhar Malavali
  0 siblings, 1 reply; 3+ messages in thread
From: Tony Battersby @ 2014-12-17 22:41 UTC (permalink / raw)
  To: qla2xxx-upstream, linux-scsi

Initiator-mode problem summary:

FC initiator HBA cabled directly (no FC switch) to FC target device #1. 
Unplug cable from FC target device #1 and quickly plug it into FC target
device #2, keeping the other end of the cable plugged into the same
initiator HBA port.  The new device shows up immediately; begin sending
commands to it.  The old device stays visible for 30 - 60 seconds after
the cable was moved, and then it disappears with the message
"rport-7:0-0: blocked FC remote port time out: removing rport".  When
the old device disappears, commands outstanding to the new device are
aborted or lock up.
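
For reference, the "blocked FC remote port time out" message is printed by the FC transport class when an rport's dev_loss_tmo expires, and that timeout can be inspected through sysfs. The following is only a minimal sketch for checking the value on a test machine; the helper names rport_dev_loss_path() and read_dev_loss_tmo() are made up for this example, and the rport numbers are taken from the message above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of the dev_loss_tmo attribute for one FC remote port.
 * The FC transport class names rports "rport-<host>:<channel>-<number>".
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int rport_dev_loss_path(int host, int channel, int number,
                               char *buf, size_t len)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len,
                 "/sys/class/fc_remote_ports/rport-%d:%d-%d/dev_loss_tmo",
                 host, channel, number);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the timeout in seconds from the given path.
 * Returns -1 if the file cannot be opened or does not hold a number.
 */
static long read_dev_loss_tmo(const char *path)
{
    FILE *f = fopen(path, "r");
    long secs = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &secs) != 1)
        secs = -1;
    fclose(f);
    return secs;
}
```

Calling rport_dev_loss_path(7, 0, 0, buf, sizeof(buf)) and passing the result to read_dev_loss_tmo() would report the timeout for rport-7:0-0 while that rport still exists.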


Initiator-mode problem details:

vanilla kernel version 3.18.1

I have three types of FC HBAs:
QLogic QLE2672 16Gb FC HBA using qla2xxx
QLogic QLE2562  8Gb FC HBA using qla2xxx
LSI    7204EP   4Gb FC HBA using mptfc

I have seen the problem with both qla2xxx and mptfc.  With qla2xxx,
commands active during the old rport removal are aborted with host
status 0x0E, but then it recovers and additional commands work fine. 
With mptfc, active commands lock up.
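
To make the host status value easier to read: 0x0E is DID_TRANSPORT_DISRUPTED in the SCSI midlayer's host byte codes. The small lookup below is a sketch for decoding the host_status field returned through sg; host_byte_name() is a name invented for this example, and the table covers only the codes 0x00 through 0x0F.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Map a SCSI host status byte (as returned in sg_io_hdr.host_status)
 * to its symbolic name. Codes above 0x0F are reported as "unknown".
 */
static const char *host_byte_name(unsigned int host_byte)
{
    static const char *const names[] = {
        "DID_OK",                  /* 0x00 */
        "DID_NO_CONNECT",          /* 0x01 */
        "DID_BUS_BUSY",            /* 0x02 */
        "DID_TIME_OUT",            /* 0x03 */
        "DID_BAD_TARGET",          /* 0x04 */
        "DID_ABORT",               /* 0x05 */
        "DID_PARITY",              /* 0x06 */
        "DID_ERROR",               /* 0x07 */
        "DID_RESET",               /* 0x08 */
        "DID_BAD_INTR",            /* 0x09 */
        "DID_PASSTHROUGH",         /* 0x0a */
        "DID_SOFT_ERROR",          /* 0x0b */
        "DID_IMM_RETRY",           /* 0x0c */
        "DID_REQUEUE",             /* 0x0d */
        "DID_TRANSPORT_DISRUPTED", /* 0x0e */
        "DID_TRANSPORT_FAILFAST",  /* 0x0f */
    };
    const size_t count = sizeof(names) / sizeof(names[0]);

    assert(count == 16);
    if (host_byte >= count)
        return "unknown";
    return names[host_byte];
}
```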

---

The problem I have described above happens using just the initiator-mode
drivers in the mainline kernel, but my real interest is a related
problem that happens with the QLogic target-mode drivers from
"git://git.qlogic.com/scst-qla2xxx.git".  I will describe that problem
with more details below.

---

Target-mode problem summary:

Two separate PCs, each with a QLogic QLE2672 16Gb FC HBA (firmware 7.04.01).
One PC uses the FC HBA in initiator mode; the other PC uses the FC HBA
in target mode.
The target-mode PC presents a disk drive to the initiator-mode PC over FC.
The FC HBAs are directly connected with a FC cable; no FC switch involved.

With the cable plugged in, the initiator PC sees the target-mode FC disk
at /dev/sg2.  When I unplug the cable from one port on the initiator and
quickly plug it into the other port on the initiator, a new disk shows
up for the new path at /dev/sg3.  The old disk at /dev/sg2 stays around
for about 30 seconds and then disappears.  That is all fine and
expected.  The problem is that when the old disk at /dev/sg2 disappears,
the new disk at /dev/sg3 stops responding to commands (but doesn't
disappear).

Note: in the initiator-mode test described at the beginning of this
message, the cable is moved from one target device to another target
device while keeping the same initiator port.  In contrast, in this
target-mode test, the cable is moved from one initiator port to another
initiator port while keeping the same target port.  That is what
distinguishes the two tests.  Whichever port remains the same (whether
initiator or target) is the port that causes the problem when the old
rport is removed.


Target-mode problem details:

vanilla kernel version 3.18.1

Before unplugging the cable, the target-mode PC creates a session with
portid 00:00:e8, loop_id 0, and the wwn of the first initiator port. 
When the cable is unplugged, the session is scheduled for deletion. 
When the cable is plugged into the other initiator port, the target-mode
PC creates another session with the same portid and loop_id as the first
session (which is now scheduled for deletion) but with a different wwn
corresponding to the second initiator port.

The new disk at /dev/sg3 stops responding to commands when the
target-mode PC calls isp_ops->fabric_logout() from
qla2x00_terminate_rport_io() in qla_attr.c.  If I disable that call to
fabric_logout() then the new disk at /dev/sg3 continues to work as
expected.  It looks like qla2x00_terminate_rport_io() is being called to
clean up the old removed fcport, but ends up messing up the new
still-present fcport instead (I am guessing because the old and new
fcports share the same portid and loop_id).  After fabric_logout() messes
up the new fcport, the target-mode HBA returns CTIO_PORT_LOGGED_OUT for
any new incoming commands from the initiator.

The problem can be avoided by waiting 30 seconds after unplugging the
cable before plugging it back in.  But that is not a good solution for
me since these HBAs are to be used in a product sold by my company, and
we want it to "just work" for our customers.

I am very familiar with SCSI but only a little bit with FC, so I am not
exactly sure of the correct fix.  So bear with me while I ask a few
questions: Is it correct for the old and new fcports to share the same
portid and loop_id?  When creating the new fcport, should qla2xxx have
detected that the lost-but-not-yet-dead fcport was using the same portid
and loop_id, and chosen to use different values for the new fcport
instead?  Or should it have invalidated the portid and/or loop_id of the
lost-but-not-yet-dead fcport somewhere (LOOP UP/LOOP DOWN/LIP
reset/etc.)?  Or is it OK for them to share the same portid/loop_id
values, but instead qla2x00_terminate_rport_io() needs more checks
before calling fabric_logout()?
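
To illustrate the last option, here is a stand-alone user-space model of the kind of check I have in mind. It is only a sketch and not a patch: struct model_fcport, handle_reused(), and should_logout() are names invented for this example, and the structure is a simplified stand-in for the driver's real fcport, so the actual fix (if this is even the right approach) would look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified port states for the model; not the driver's real state machine. */
enum port_state { PORT_ONLINE, PORT_LOST, PORT_DEAD };

/* Hypothetical stand-in for the driver's per-port bookkeeping. */
struct model_fcport {
    uint64_t wwpn;          /* world wide port name of the remote port */
    uint32_t port_id;       /* 24-bit FC address, e.g. 0x0000e8 */
    uint16_t loop_id;       /* firmware handle for the login */
    enum port_state state;
};

/*
 * Return true if a different, online port is now using the same
 * loop_id and port_id as the port being torn down.
 */
static bool handle_reused(const struct model_fcport *old,
                          const struct model_fcport *ports, size_t n)
{
    assert(old != NULL && (ports != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        const struct model_fcport *p = &ports[i];

        if (p == old || p->wwpn == old->wwpn)
            continue;   /* same port, nothing to protect */
        if (p->state == PORT_ONLINE &&
            p->loop_id == old->loop_id &&
            p->port_id == old->port_id)
            return true;
    }
    return false;
}

/*
 * Decide whether tearing down the old port may issue a logout.
 * The logout is skipped when it would hit a handle that now
 * belongs to a live port.
 */
static bool should_logout(const struct model_fcport *old,
                          const struct model_fcport *ports, size_t n)
{
    return !handle_reused(old, ports, n);
}
```

In the failing case from my test, the old port (lost, loop_id 0, portid 00:00:e8) and the new port (online, same loop_id and portid, different wwn) make handle_reused() return true, so the model would skip the logout.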

I would be happy to test any patches that anyone can provide.  Or if
someone can provide answers to my questions above or other guidance,
then I can try to come up with a fix myself.


Thanks,
Tony Battersby
Cybernetics



* Re: Delayed FC rport removal causes errors on other device
  2014-12-17 22:41 Delayed FC rport removal causes errors on other device Tony Battersby
@ 2014-12-17 23:17 ` Giridhar Malavali
       [not found]   ` <549302DA.7060006@cybernetics.com>
  0 siblings, 1 reply; 3+ messages in thread
From: Giridhar Malavali @ 2014-12-17 23:17 UTC (permalink / raw)
  To: Tony Battersby, Dept-Eng QLA2xxx Upstream, linux-scsi,
	Saurav Kashyap, Chad Dupuis


Tony, 

We will look into this further and get back to you.

Do you have driver logs with extended error logging enabled for this
failure?  If not, could you please capture them and send them across?

Thanks,
Giridhar



* Re: Delayed FC rport removal causes errors on other device
       [not found]   ` <549302DA.7060006@cybernetics.com>
@ 2014-12-19  1:54     ` Giridhar Malavali
  0 siblings, 0 replies; 3+ messages in thread
From: Giridhar Malavali @ 2014-12-19  1:54 UTC (permalink / raw)
  To: Tony Battersby, Dept-Eng QLA2xxx Upstream, linux-scsi,
	Saurav Kashyap, Chad Dupuis


Tony, 

We will get back to you after looking at the logs.

-- Giri

On 12/18/14 8:37 AM, "Tony Battersby" <tonyb@cybernetics.com> wrote:

>On 12/17/2014 06:17 PM, Giridhar Malavali wrote:
>> Tony, 
>>
>> We will look into this further and get back to you.
>>
>> Do you have driver logs with extended error logging for this failure. If
>> not, can you please capture one and send it across.
>>
>>
>
>I have attached two logs - one log from the initiator during the
>initiator test that I described, and one log from the target during the
>target test that I described.  Please note that these are logs from two
>different test procedures, so you cannot correlate events from the
>initiator log with events from the target log.
>
>Thanks,
>Tony Battersby
>Cybernetics
>
>

