* mpt2sas: dma error?
@ 2010-03-07  5:00 Ilia Mirkin
  2010-03-07  6:05 ` James Bottomley
  0 siblings, 1 reply; 3+ messages in thread
From: Ilia Mirkin @ 2010-03-07  5:00 UTC (permalink / raw)
  To: linux-scsi

Hi,

I have an LSI 9211-4i card (aka SAS2004) with 4 drives attached. No
RAID-related setup in the card's BIOS, I'm just using the drives
directly. This is with kernel 2.6.33. The card starts up with

[    1.714458] mpt2sas version 03.100.03.00 loaded
[    1.714757] scsi0 : Fusion MPT SAS Host
[    1.715174]   alloc irq_desc for 16 on node -1
[    1.715175]   alloc kstat_irqs on node -1
[    1.715178] alloc irq_2_iommu on node -1
[    1.715184] mpt2sas 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    1.715431] mpt2sas 0000:05:00.0: setting latency timer to 64
[    1.715435] mpt2sas0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (12387344 kB)
[    1.715939]   alloc irq_desc for 31 on node -1
[    1.715941]   alloc kstat_irqs on node -1
[    1.715943] alloc irq_2_iommu on node -1
[    1.715947] mpt2sas 0000:05:00.0: irq 31 for MSI/MSI-X
[    1.715960] mpt2sas0: PCI-MSI-X enabled: IRQ 31
[    1.716199] mpt2sas0: iomem(0xfaefc000), mapped(0xffffc90001878000), size(16384)
[    1.716643] mpt2sas0: ioport(0xd000), size(256)
[    1.788476] mpt2sas0: sending diag reset !!
[    2.726738] mpt2sas0: diag reset: SUCCESS
[    2.772789] mpt2sas0: Allocated physical memory: size(839 kB)
[    2.773034] mpt2sas0: Current Controller Queue Depth(339), Max Controller Queue Depth(2015)
[    2.773481] mpt2sas0: Scatter Gather Elements per IO(128)
[    2.831901] mpt2sas0: LSISAS2008: FWVersion(02.00.50.00), ChipRevision(0x02), BiosVersion(07.01.00.00)
[    2.832360] mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[    2.833261] mpt2sas0: sending port enable !!
[    4.478515] mpt2sas0: host_add: handle(0x0001), sas_addr(0x500605b0001d5848), phys(8)
[   11.712582] mpt2sas0: port enable: SUCCESS

which all looks fine. However, running SMART commands (e.g. smartctl
-a, smartmontools 5.39) on the attached drives semi-reliably results
in the following:

[ 7069.168433] DRHD: handling fault status reg 2
[ 7069.168440] DMAR:[DMA Read] Request device [05:00.0] fault addr e0000
[ 7069.168442] DMAR:[fault reason 06] PTE Read access is not set
[ 7069.815775] mpt2sas0: fault_state(0x2665)!
[ 7069.815778] mpt2sas0: sending diag reset !!
[ 7070.754176] mpt2sas0: diag reset: SUCCESS
[ 7070.823523] mpt2sas0: LSISAS2008: FWVersion(02.00.50.00), ChipRevision(0x02), BiosVersion(07.01.00.00)
[ 7070.823526] mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[ 7070.823818] mpt2sas0: sending port enable !!
[ 7079.740367] mpt2sas0: port enable: SUCCESS
[ 7079.740446] mpt2sas0: _scsih_search_responding_sas_devices
[ 7079.741023] scsi target0:0:0: handle(0x0009), sas_addr(0x4433221100000000), enclosure logical id(0x500605b0001d5848), slot(0)
[ 7079.741089] scsi target0:0:1: handle(0x000a), sas_addr(0x4433221101000000), enclosure logical id(0x500605b0001d5848), slot(1)
[ 7079.741154] scsi target0:0:2: handle(0x000b), sas_addr(0x4433221103000000), enclosure logical id(0x500605b0001d5848), slot(3)
[ 7079.741220] scsi target0:0:3: handle(0x000c), sas_addr(0x4433221102000000), enclosure logical id(0x500605b0001d5848), slot(2)
[ 7079.741287] mpt2sas0: _scsih_search_responding_raid_devices
[ 7079.741289] mpt2sas0: _scsih_search_responding_expanders
[ 7079.741291] mpt2sas0: _base_fault_reset_work: hard reset: success
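
To compare faults across attempts, the DMAR lines can be pulled out of
the log mechanically. This is only a minimal sketch: the regular
expressions are written against the three lines shown above, the
function name is made up for illustration, and other kernel versions
may format these messages differently.

```python
import re

# Sample lines copied from the log above.
LOG = """\
[ 7069.168433] DRHD: handling fault status reg 2
[ 7069.168440] DMAR:[DMA Read] Request device [05:00.0] fault addr e0000
[ 7069.168442] DMAR:[fault reason 06] PTE Read access is not set
"""

# Patterns are assumptions based on the lines above only.
REQUEST_RE = re.compile(
    r"DMAR:\[DMA (?P<direction>Read|Write)\] Request device "
    r"\[(?P<device>[0-9a-f:.]+)\] fault addr (?P<addr>[0-9a-f]+)"
)
REASON_RE = re.compile(r"DMAR:\[fault reason (?P<code>\d+)\] (?P<text>.+)")


def parse_dmar_faults(log_text):
    """Return one dict per DMAR request line, with its reason attached."""
    faults = []
    for line in log_text.splitlines():
        m = REQUEST_RE.search(line)
        if m:
            faults.append({
                "direction": m["direction"],
                "device": m["device"],
                "addr": int(m["addr"], 16),  # fault address is printed in hex
            })
            continue
        m = REASON_RE.search(line)
        if m and faults:
            # The reason line follows the request line it belongs to.
            faults[-1]["reason_code"] = int(m["code"])
            faults[-1]["reason"] = m["text"].strip()
    return faults


for fault in parse_dmar_faults(LOG):
    print(fault)
```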

I can just avoid doing any SMART-related stuff on here, but that seems
suboptimal. Anything I can do to debug this? Should I turn DMAR off?
The fault status reg changes with each attempt (2, 102, 202), but the
fault address is always e0000.
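
The 2, 102, 202 progression can be read as a bit field. The sketch
below is hedged: it assumes the kernel prints the value in hex and
that the register follows the VT-d Fault Status Register layout as I
understand it (bit 0 = primary fault overflow, bit 1 = primary pending
fault, bits 15:8 = fault record index). The function name is
hypothetical.

```python
def decode_fault_status(value):
    """Split a fault status value into the fields assumed above."""
    return {
        "PFO": bool(value & 0x1),    # primary fault overflow (assumed bit 0)
        "PPF": bool(value & 0x2),    # primary pending fault (assumed bit 1)
        "FRI": (value >> 8) & 0xFF,  # fault record index (assumed bits 15:8)
    }


# Values as they appear in the log messages, interpreted as hex.
for raw in ("2", "102", "202"):
    print(raw, decode_fault_status(int(raw, 16)))
```

Under that reading, each message has the pending-fault bit set and
only the fault record index advances (0, 1, 2), which would account
for the value changing while the fault address stays the same.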

Actually, it only happened 3 times, and I can't get it to happen a 4th
time... perhaps it wasn't SMART, or harder to reproduce than I thought
originally. This still seems bad though.

Thanks,

-- 
Ilia Mirkin
imirkin@alum.mit.edu


* Re: mpt2sas: dma error?
  2010-03-07  5:00 mpt2sas: dma error? Ilia Mirkin
@ 2010-03-07  6:05 ` James Bottomley
  2010-03-07  7:00   ` Ilia Mirkin
  0 siblings, 1 reply; 3+ messages in thread
From: James Bottomley @ 2010-03-07  6:05 UTC (permalink / raw)
  To: Ilia Mirkin; +Cc: linux-scsi

On Sun, 2010-03-07 at 00:00 -0500, Ilia Mirkin wrote:
> I can just avoid doing any SMART-related stuff on here, but that seems
> suboptimal. Anything I can do to debug this? Should I turn DMAR off?
> The fault status reg changes with each attempt (2, 102, 202), but the
> fault address is always e0000.
> 
> Actually, it only happened 3 times, and I can't get it to happen a 4th
> time... perhaps it wasn't SMART, or harder to reproduce than I thought
> originally. This still seems bad though.

So this is likely a firmware bug inside the mpt2sas.  All of the mpt
cards use a fat firmware model meaning they take in pure SCSI commands
and do the translation to SATA if necessary all within the firmware, so
the first step would be to make sure your card has the latest firmware.

Then, there are two methods of wrapping smart commands in SCSI: ATA_12
and ATA_16.  Try getting smartctl to use ATA_12, which is more widely
supported, by using the -d sat,12 option to the command.
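
To compare the two wrappings side by side, the command lines can be
generated as follows. This is a minimal sketch: the helper name is made
up, /dev/sdd is only an example device, and it builds the argument
lists without running anything. It relies on smartctl's -d sat,12 and
-d sat,16 device types.

```python
import shlex


def smartctl_cmd(device, cdb_len=None):
    """Build argv for `smartctl -a` with an explicit SAT CDB length."""
    if cdb_len not in (None, 12, 16):
        raise ValueError("cdb_len must be 12, 16, or None")
    # Plain "sat" leaves the pass-through CDB length to smartctl's default.
    dev_type = "sat" if cdb_len is None else f"sat,{cdb_len}"
    return ["smartctl", "-d", dev_type, "-a", device]


# Print both variants so they can be run by hand, one after the other.
for cdb_len in (12, 16):
    print(shlex.join(smartctl_cmd("/dev/sdd", cdb_len)))
```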

James




* Re: mpt2sas: dma error?
  2010-03-07  6:05 ` James Bottomley
@ 2010-03-07  7:00   ` Ilia Mirkin
  0 siblings, 0 replies; 3+ messages in thread
From: Ilia Mirkin @ 2010-03-07  7:00 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi

On Sun, Mar 7, 2010 at 1:05 AM, James Bottomley <James.Bottomley@suse.de> wrote:
> So this is likely a firmware bug inside the mpt2sas.  All of the mpt
> cards use a fat firmware model meaning they take in pure SCSI commands
> and do the translation to SATA if necessary all within the firmware, so
> the first step would be to make sure your card has the latest firmware.

Just upgraded the FW to 04.00.00.00, and the BIOS to 7.03, same issue.
(These are the latest available on LSI's website.)

>
> Then, there are two methods of wrapping smart commands in SCSI: ATA_12
> and ATA_16.  Try getting smartctl to use ATA_12, which is more widely
> supported, by using the -d sat,12 option to the command.

Hm, well, first I ran it without the sat,12 option and it had the
issue, and then I added -d sat,12 and got the same thing again (that's
2 for 2). I had trouble reproducing it for a while after that, but
running smartctl a few times in quick succession triggered 2 more
faults in a row. This time the fault address was d50e0000; the fault
status reg still goes up by 100 with each message. No other disk I/O
was happening during this run, unlike when I wrote my initial e-mail.
To be clear, all but the very first error were with smartctl -d
sat,12 -a.

It just seems surprising that retrieving SMART data would be _so_
fragile off of a major manufacturer's controllers... Oh well.

Thanks for taking a look.

  -ilia

P.S. I'm also getting messages like
[  813.174509] sd 0:0:3:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
[  813.174514] Descriptor sense data with sense descriptors (in hex):
[  813.174517]         72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
[  813.174526]         00 4f 00 c2 00 50
[  813.174530] sd 0:0:3:0: [sdd] Add. Sense: ATA pass through information available

on every smartctl command, but I'm fairly sure this effect has been
discussed before.
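
The sense bytes above can be unpacked by hand. This is a hedged sketch,
assuming descriptor-format sense data (response code 72h) and the SAT
ATA Status Return descriptor layout as I recall it (descriptor code
09h, error in byte 3, LBA mid in byte 9, LBA high in byte 11, device in
byte 12, status in byte 13). The function name is made up for
illustration.

```python
# The 22 sense bytes copied from the log above.
SENSE_HEX = "72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 00 4f 00 c2 00 50"


def decode_ata_return(sense):
    """Decode the sense header and the first ATA Status Return descriptor."""
    if sense[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    info = {"sense_key": sense[1] & 0x0F, "asc": sense[2], "ascq": sense[3]}
    pos, end = 8, 8 + sense[7]  # descriptors start after the 8-byte header
    while pos + 2 <= min(end, len(sense)):
        code, length = sense[pos], sense[pos + 1]
        desc = sense[pos:pos + 2 + length]
        if code == 0x09 and len(desc) >= 14:
            # Field offsets are the assumed SAT layout described above.
            info.update(error=desc[3], lba_mid=desc[9], lba_high=desc[11],
                        device=desc[12], status=desc[13])
            break
        pos += 2 + length
    return info


info = decode_ata_return(bytes.fromhex(SENSE_HEX))
print({name: hex(value) for name, value in info.items()})
```

If the layout assumption holds, the error field is 0 and LBA mid/high
are 4Fh/C2h, which as far as I know is the signature a drive returns
from SMART RETURN STATUS when no threshold has been exceeded. That
would make the message informational rather than a sign of a failing
disk.
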