* mpt2sas losing reset events with cable pulls?
@ 2011-08-31  0:51 Roland Dreier
  2011-08-31  2:41 ` Douglas Gilbert
  0 siblings, 1 reply; 4+ messages in thread
From: Roland Dreier @ 2011-08-31  0:51 UTC
  To: Kashyap Desai, Eric Moore; +Cc: eric, linux-scsi

Hi!

We have a system with the mpt2sas driver from the upstream kernel --

    #define MPT2SAS_DRIVER_VERSION              "09.100.00.00"

and hardware:

    mpt2sas1: LSISAS2008: FWVersion(09.00.00.00), ChipRevision(0x03), BiosVersion(07.17.00.00)

We have a SAS JBOD with a bunch of SSDs in it, connected with two wide
SAS ports, running Linux multipathing.  If we pull one of the cables
with IO running, then occasionally (say, 1 in 100 cable pulls) some of
the IO gets "stuck" -- we continually hit

	else if (sas_device_priv_data->block || sas_target_priv_data->tm_busy)
		return SCSI_MLQUEUE_DEVICE_BUSY;

in the mpt2sas _scsih_qcmd() function, where tm_busy never gets
cleared.  We added some debugging to _scsih_sas_device_status_change_event()
and we found that when things go wrong, we get an event of type
MPI2_EVENT_SAS_DEV_STAT_RC_INTERNAL_DEVICE_RESET for each one of the
SSDs (which sets tm_busy for each target), but then
MPI2_EVENT_SAS_DEV_STAT_RC_CMP_INTERNAL_DEV_RESET never comes for one
of the targets (so tm_busy is never cleared).
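
To make the failure mode concrete, here is a minimal user-space sketch of
the state machine described above. It is only a model: the enum, the
struct, and both functions are hypothetical stand-ins, not the driver's
own definitions, and the real code keys off the
MPI2_EVENT_SAS_DEV_STAT_RC_* reason codes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two firmware reason codes. */
enum reset_event {
	EV_INTERNAL_DEVICE_RESET,	/* firmware started an internal reset */
	EV_CMP_INTERNAL_DEV_RESET	/* firmware finished the internal reset */
};

/* Hypothetical, reduced version of the per-target private data. */
struct target_model {
	bool tm_busy;
};

/* Reset sets tm_busy; only the matching reset complete clears it. */
static void model_status_change(struct target_model *t, enum reset_event ev)
{
	assert(t != NULL);
	t->tm_busy = (ev == EV_INTERNAL_DEVICE_RESET);
}

/* True when a queued command would be bounced as "device busy". */
static bool model_qcmd_would_block(const struct target_model *t)
{
	assert(t != NULL);
	return t->tm_busy;
}
```

In this model, a target that sees EV_INTERNAL_DEVICE_RESET and never sees
EV_CMP_INTERNAL_DEV_RESET blocks every later command, which matches what
we observe.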

In other words, every handle in the range 0x24, 0x25, ..., 0x3a (and
hence every target) gets a reset event; but in the broken case, the
reset complete event arrives for all of the handles *except* one (for
example 0x39).

This leads to the system getting wedged waiting for a SCSI command
that will never finish, which is not what we're after when one of our
paths to the JBOD goes down (given that we have a second path and are
aiming at fault tolerance here!).

This feels to me like it is probably a firmware race or some other
bug, but perhaps the driver is losing events somehow.  Anyway, how can
we fix this?  Please let me know if there's any further debugging
information I can collect to help make progress on this.

Thanks!
  Roland


* Re: mpt2sas losing reset events with cable pulls?
  2011-08-31  0:51 mpt2sas losing reset events with cable pulls? Roland Dreier
@ 2011-08-31  2:41 ` Douglas Gilbert
       [not found]   ` <CAL1RGDU9QtuXGhMQN_cT6+RPh0NTZexYwE8HK1fpw5vuX_FLOw@mail.gmail.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Douglas Gilbert @ 2011-08-31  2:41 UTC
  To: Roland Dreier; +Cc: Kashyap Desai, Eric Moore, eric, linux-scsi

On 11-08-30 08:51 PM, Roland Dreier wrote:
> Hi!
>
> We have a system with mpt2sas driver from the upstream kernel --
>
>      #define MPT2SAS_DRIVER_VERSION              "09.100.00.00"
>
> and hardware:
>
>      mpt2sas1: LSISAS2008: FWVersion(09.00.00.00), ChipRevision(0x03), BiosVersion(07.17.00.00)

Most LSI HBAs based on the LSISAS2008 chip are now at
firmware version 10.00. Perhaps you could retest with
that firmware.

Doug Gilbert


* Re: mpt2sas losing reset events with cable pulls?
       [not found]   ` <CAL1RGDU9QtuXGhMQN_cT6+RPh0NTZexYwE8HK1fpw5vuX_FLOw@mail.gmail.com>
@ 2011-08-31  3:16     ` Eric Seppanen
  2011-08-31  5:41     ` Roland Dreier
  1 sibling, 0 replies; 4+ messages in thread
From: Eric Seppanen @ 2011-08-31  3:16 UTC
  To: Roland Dreier; +Cc: dgilbert, linux-scsi, Kashyap Desai, Eric Moore

On Tue, Aug 30, 2011 at 7:47 PM, Roland Dreier <roland@kernel.org> wrote:
> On Aug 30, 2011 7:41 PM, "Douglas Gilbert" <dgilbert@interlog.com> wrote:
>> Most LSI HBAs based on the LSISAS2008 chip are now at
>> firmware version 10.00. Perhaps you could retest with
>> that firmware.
>
> Eric S. can correct me if I'm wrong, but I'm pretty sure we've seen the
> same issue with fw 10 as well.

Yes.  We've hit it on both version 9 and version 10.


* Re: mpt2sas losing reset events with cable pulls?
       [not found]   ` <CAL1RGDU9QtuXGhMQN_cT6+RPh0NTZexYwE8HK1fpw5vuX_FLOw@mail.gmail.com>
  2011-08-31  3:16     ` Eric Seppanen
@ 2011-08-31  5:41     ` Roland Dreier
  1 sibling, 0 replies; 4+ messages in thread
From: Roland Dreier @ 2011-08-31  5:41 UTC
  To: Douglas Gilbert; +Cc: linux-scsi, eric, Kashyap Desai, Eric Moore

>> Most LSI HBAs based on the LSISAS2008 chip are now at
>> firmware version 10.00. Perhaps you could retest with
>> that firmware.

Just for grins, I updated a test system to 10.00 and retested, and was
able to reproduce the issue with:

    mpt2sas1: LSISAS2008: FWVersion(10.00.02.00), ChipRevision(0x03), BiosVersion(07.17.00.00)

I added the code below to the end of _scsih_sas_device_status_change_event():

        /* Debug only: log every reset / reset complete event with its handle. */
        printk(KERN_ERR "%s: %s handle 0x%04x sas address 0x%016llx dev %p priv %p\n",
               __func__,
               event_data->ReasonCode ==
                   MPI2_EVENT_SAS_DEV_STAT_RC_CMP_INTERNAL_DEV_RESET ?
                   "internal device reset complete" : "internal device reset",
               le16_to_cpu(event_data->DevHandle),
               (unsigned long long)le64_to_cpu(event_data->SASAddress),
               sas_device, target_priv_data);

and when I hit the issue, I got the output below -- the thing to
notice is that we get "internal device reset" events for every handle
in the range 0x24 ... 0x3a inclusive, but the "internal device reset
complete" event for handle 0x2b never appears in this case.

So if this is a firmware bug, it is still present at least in the
10.00 firmware I got from the LSI web site...

This reproduction took about 50 loops of a script that turns off and
on the links between the HBA and the JBOD every 15 seconds, so it's
not too hard to hit (you can see from the kernel timestamps that the
system was up less than half an hour total).  If there's any other
debug data to collect or patches to try, I'm happy to do so.

 - R.

    [ 1319.730954] _scsih_sas_device_status_change_event: internal device reset handle 0x0024 sas address 0x500605ba004afb49 dev ffff880619c9c880 priv ffff880615397800
    [ 1319.731019] _scsih_sas_device_status_change_event: internal device reset handle 0x0025 sas address 0x500605ba004afe21 dev ffff880616a8f280 priv ffff880615164c00
    [ 1319.731026] _scsih_sas_device_status_change_event: internal device reset handle 0x0026 sas address 0x500605ba002e1189 dev ffff88061ad03400 priv ffff8806153f0c00
    [ 1319.731034] _scsih_sas_device_status_change_event: internal device reset handle 0x0027 sas address 0x500605ba002e0ea9 dev ffff880614e84100 priv ffff88061516ec00
    [ 1319.731041] _scsih_sas_device_status_change_event: internal device reset handle 0x0028 sas address 0x500605ba004af915 dev ffff88061ace4d80 priv ffff8806138c1800
    [ 1319.731048] _scsih_sas_device_status_change_event: internal device reset handle 0x0029 sas address 0x500605ba002e14d5 dev ffff880619c8aa00 priv ffff880614efec00
    [ 1319.731055] _scsih_sas_device_status_change_event: internal device reset handle 0x002a sas address 0x500605ba002e1201 dev ffff880619c8a200 priv ffff8806138c5c00
    [ 1319.731064] _scsih_sas_device_status_change_event: internal device reset handle 0x002b sas address 0x500605ba002e1049 dev ffff880619c8a800 priv ffff880613d2cc00
    [ 1319.731073] _scsih_sas_device_status_change_event: internal device reset handle 0x002c sas address 0x500605ba002e1615 dev ffff880616bd1e80 priv ffff8806147b4c00
    [ 1319.731082] _scsih_sas_device_status_change_event: internal device reset handle 0x002d sas address 0x500605ba002e1519 dev ffff880616bd1f00 priv ffff880614cdd800
    [ 1319.731088] _scsih_sas_device_status_change_event: internal device reset handle 0x002e sas address 0x500605ba002e15ed dev ffff880613598f80 priv ffff88061a5e6400
    [ 1319.731096] _scsih_sas_device_status_change_event: internal device reset handle 0x002f sas address 0x500605ba002e1371 dev ffff880616bd1280 priv ffff8806137b4800
    [ 1319.731102] _scsih_sas_device_status_change_event: internal device reset handle 0x0030 sas address 0x500605ba002e0f21 dev ffff88061b3f5680 priv ffff880613568800
    [ 1319.731106] _scsih_sas_device_status_change_event: internal device reset handle 0x0031 sas address 0x500605ba002e0ec1 dev ffff880613598980 priv ffff880616b9c400
    [ 1319.731109] _scsih_sas_device_status_change_event: internal device reset handle 0x0032 sas address 0x500605ba004afbd5 dev ffff88061ac07580 priv ffff880619e84400
    [ 1319.731116] _scsih_sas_device_status_change_event: internal device reset handle 0x0033 sas address 0x500605ba002e1129 dev ffff8806169dd080 priv ffff880613da7000
    [ 1319.731119] _scsih_sas_device_status_change_event: internal device reset handle 0x0034 sas address 0x500605ba002e1051 dev ffff8806169ddf80 priv ffff880619e86800
    [ 1319.731124] _scsih_sas_device_status_change_event: internal device reset handle 0x0035 sas address 0x500605ba002e1339 dev ffff8806151a0c00 priv ffff8806151fc800
    [ 1319.731127] _scsih_sas_device_status_change_event: internal device reset handle 0x0036 sas address 0x500605ba002e1551 dev ffff880614e38080 priv ffff880616b99000
    [ 1319.731131] _scsih_sas_device_status_change_event: internal device reset handle 0x0037 sas address 0x500605ba002e118d dev ffff88061b09c400 priv ffff880619e83c00
    [ 1319.731136] _scsih_sas_device_status_change_event: internal device reset handle 0x0038 sas address 0x500605ba002e1285 dev ffff88061b3f5980 priv ffff880613da1c00
    [ 1319.731168] _scsih_sas_device_status_change_event: internal device reset handle 0x0039 sas address 0x500605ba002e1429 dev ffff88061ac07d80 priv ffff880614630800
    [ 1319.731173] _scsih_sas_device_status_change_event: internal device reset handle 0x003a sas address 0x50050cc10ac3dc7e dev ffff880619c96080 priv ffff8806148e6800
    [ 1319.733351] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0024 sas address 0x500605ba004afb49 dev ffff880619c9c880 priv ffff880615397800
    [ 1319.733360] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0025 sas address 0x500605ba004afe21 dev ffff880616a8f280 priv ffff880615164c00
    [ 1319.733363] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0026 sas address 0x500605ba002e1189 dev ffff88061ad03400 priv ffff8806153f0c00
    [ 1319.733366] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0027 sas address 0x500605ba002e0ea9 dev ffff880614e84100 priv ffff88061516ec00
    [ 1319.733370] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0028 sas address 0x500605ba004af915 dev ffff88061ace4d80 priv ffff8806138c1800
    [ 1319.733373] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0029 sas address 0x500605ba002e14d5 dev ffff880619c8aa00 priv ffff880614efec00
    [ 1319.733378] _scsih_sas_device_status_change_event: internal device reset complete handle 0x002a sas address 0x500605ba002e1201 dev ffff880619c8a200 priv ffff8806138c5c00
    [ 1319.733383] _scsih_sas_device_status_change_event: internal device reset complete handle 0x002c sas address 0x500605ba002e1615 dev ffff880616bd1e80 priv ffff8806147b4c00
    [ 1319.733386] _scsih_sas_device_status_change_event: internal device reset complete handle 0x002d sas address 0x500605ba002e1519 dev ffff880616bd1f00 priv ffff880614cdd800
    [ 1319.733389] _scsih_sas_device_status_change_event: internal device reset complete handle 0x002e sas address 0x500605ba002e15ed dev ffff880613598f80 priv ffff88061a5e6400
    [ 1319.733392] _scsih_sas_device_status_change_event: internal device reset complete handle 0x002f sas address 0x500605ba002e1371 dev ffff880616bd1280 priv ffff8806137b4800
    [ 1319.733395] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0030 sas address 0x500605ba002e0f21 dev ffff88061b3f5680 priv ffff880613568800
    [ 1319.733400] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0031 sas address 0x500605ba002e0ec1 dev ffff880613598980 priv ffff880616b9c400
    [ 1319.733406] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0032 sas address 0x500605ba004afbd5 dev ffff88061ac07580 priv ffff880619e84400
    [ 1319.733479] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0033 sas address 0x500605ba002e1129 dev ffff8806169dd080 priv ffff880613da7000
    [ 1319.733486] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0034 sas address 0x500605ba002e1051 dev ffff8806169ddf80 priv ffff880619e86800
    [ 1319.733490] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0035 sas address 0x500605ba002e1339 dev ffff8806151a0c00 priv ffff8806151fc800
    [ 1319.733495] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0036 sas address 0x500605ba002e1551 dev ffff880614e38080 priv ffff880616b99000
    [ 1319.733498] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0037 sas address 0x500605ba002e118d dev ffff88061b09c400 priv ffff880619e83c00
    [ 1319.733501] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0038 sas address 0x500605ba002e1285 dev ffff88061b3f5980 priv ffff880613da1c00
    [ 1319.733508] _scsih_sas_device_status_change_event: internal device reset complete handle 0x0039 sas address 0x500605ba002e1429 dev ffff88061ac07d80 priv ffff880614630800
    [ 1319.733514] _scsih_sas_device_status_change_event: internal device reset complete handle 0x003a sas address 0x50050cc10ac3dc7e dev ffff880619c96080 priv ffff8806148e6800
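
To pick the stuck handle out of a log like this without reading it by
eye, something along the following lines should work. This is a rough
user-space sketch, not driver code: find_stuck_handles() and MAX_HANDLE
are hypothetical names, and the parsing assumes exactly the line format
produced by the printk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_HANDLE 0x10000u	/* DevHandle is a 16-bit value */

/*
 * Scan debug lines in the format shown above and collect every handle
 * whose last event was "internal device reset" with no later
 * "internal device reset complete". Stuck handles are written to out[]
 * in ascending order; the return value is how many were written.
 */
static size_t find_stuck_handles(const char *const *lines, size_t nlines,
				 unsigned *out, size_t max_out)
{
	static const char reset[] = "internal device reset";
	static const char done[] = "internal device reset complete";
	static bool pending[MAX_HANDLE];
	size_t i, count = 0;
	unsigned h;

	assert(lines != NULL && out != NULL);
	memset(pending, 0, sizeof(pending));

	for (i = 0; i < nlines; i++) {
		const char *ev = strstr(lines[i], reset);
		const char *hp;
		unsigned handle;

		if (!ev)
			continue;	/* not one of our debug lines */
		hp = strstr(ev, "handle 0x");
		if (!hp || sscanf(hp, "handle 0x%x", &handle) != 1 ||
		    handle >= MAX_HANDLE)
			continue;	/* malformed line, skip it */
		/* A "complete" line clears the handle, a plain reset sets it. */
		pending[handle] = strncmp(ev, done, sizeof(done) - 1) != 0;
	}

	for (h = 0; h < MAX_HANDLE && count < max_out; h++)
		if (pending[h])
			out[count++] = h;
	return count;
}
```

Fed the lines above, this should report only handle 0x002b, the one
target whose reset complete event never showed up.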
