Subject: MD sets failing under heavy load in a DRBD/Pacemaker Cluster
From: Caspar Smit
Date: 2011-10-04 12:01 UTC
To: General Linux-HA mailing list, linux-scsi, drbd-user,
	iscsitarget-devel, Support, Software

[-- Attachment #1: Type: text/plain, Size: 9510 bytes --]

Hi all,

We are having a major problem with one of our clusters.

Here's a description of the setup:

Two Supermicro servers containing the following hardware:

Chassis: SC846E1-R1200B
Mainboard: X8DTH-6F rev 2.01 (onboard LSI2008 controller disabled
through jumper)
CPU: Intel Xeon E5606 @ 2.13 GHz, 4 cores
Memory: 4x KVR1333D3D4R9S/4G (16 GB total)
Backplane: SAS846EL1 rev 1.1
Ethernet: 2x Intel Pro/1000 PT Quad Port Low Profile
SAS/SATA Controller: LSI 3081E-R (P20, BIOS: 6.34.00.00, Firmware 1.32.00.00-IT)
SAS/SATA JBOD Controller: LSI 3801E (P20, BIOS: 6.34.00.00, Firmware 1.32.00.00-IT)
OS Disk: 30 GB SSD
Hard disks: 24x Western Digital 2 TB 7200 RPM RE4-GP (WD2002FYPS)

Both machines run Debian Lenny (5.0). Here are the versions of the
packages involved (DRBD, Heartbeat, and Pacemaker are installed from
the backports repository):

linux-image-2.6.26-2-amd64   2.6.26-26lenny3
mdadm                        2.6.7.2-3
drbd8-2.6.26-2-amd64         2:8.3.7-1~bpo50+1+2.6.26-26lenny3
drbd8-source                 2:8.3.7-1~bpo50+1
drbd8-utils                  2:8.3.7-1~bpo50+1
heartbeat                    1:3.0.3-2~bpo50+1
pacemaker                    1.0.9.1+hg15626-1~bpo50+1
iscsitarget                  1.4.20.2 (compiled from tar.gz)


We created 4 MD sets out of the 24 hard disks (/dev/md0 through /dev/md3).

Each is a RAID5 of 5 disks plus 1 hot spare (8 TB net per MD); the MD
sets use metadata version 0.90.
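
Roughly how each set was created, as a sketch (the exact invocation is
from memory and the member devices are illustrative, not our exact
ones):

  mdadm --create /dev/md0 --level=5 --raid-devices=5 \
        --spare-devices=1 --metadata=0.90 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg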

For each MD we created a DRBD device replicating to the second node
(/dev/drbd4 through /dev/drbd7; minors 0 through 3 were used by disks
from a JBOD that has since been disconnected, see below).
(See the attached drbd.conf.txt, which is the individual *.res files
combined.)
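
The DRBD resources were then brought up in the usual 8.3 fashion,
roughly like this (shown for r4 only; treat it as a sketch):

  drbdadm create-md r4
  drbdadm up r4
  # on the node holding the good data, force the initial sync:
  drbdadm -- --overwrite-data-of-peer primary r4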

Each DRBD device has its own dedicated 1 GbE NIC port.

Each DRBD device is then exported over iSCSI using IET, managed by
Pacemaker (see the attached crm-config.txt for the full Pacemaker
configuration).


Now for the symptoms we are seeing:

After a number of days (sometimes weeks) the disks in the MD sets start
failing, one after another.

See the attached syslog.txt for the full details, but here are the main
entries.

It starts with:

Oct  2 11:01:59 node03 kernel: [7370143.421999] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptbase_reply
Oct  2 11:01:59 node03 kernel: [7370143.435220] mptbase: ioc0: LogInfo(0x31181000): Originator={PL}, Code={IO Cancelled Due to Recieve Error}, SubCode(0x1000) cb_idx mptbase_reply
Oct  2 11:01:59 node03 kernel: [7370143.442141] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000) cb_idx mptbase_reply
Oct  2 11:01:59 node03 kernel: [7370143.442783] end_request: I/O error, dev sdf, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.442783] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.442783] raid5: Disk failure on sdf, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.442783] raid5: Operation continuing on 4 devices.
Oct  2 11:01:59 node03 kernel: [7370143.442820] end_request: I/O error, dev sdb, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.442820] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Disk failure on sdb, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Operation continuing on 3 devices.
Oct  2 11:01:59 node03 kernel: [7370143.442820] end_request: I/O error, dev sdd, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.442820] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Disk failure on sdd, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Operation continuing on 2 devices.
Oct  2 11:01:59 node03 kernel: [7370143.470791] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptbase_reply
<snip>
Oct  2 11:02:00 node03 kernel: [7370143.968976] Buffer I/O error on device drbd4, logical block 1651581030
Oct  2 11:02:00 node03 kernel: [7370143.969056] block drbd4: p write: error=-5
Oct  2 11:02:00 node03 kernel: [7370143.969126] block drbd4: Local WRITE failed sec=21013680s size=4096
Oct  2 11:02:00 node03 kernel: [7370143.969203] block drbd4: disk( UpToDate -> Failed )
Oct  2 11:02:00 node03 kernel: [7370143.969276] block drbd4: Local IO failed in __req_mod.Detaching...
Oct  2 11:02:00 node03 kernel: [7370143.969492] block drbd4: disk( Failed -> Diskless )
Oct  2 11:02:00 node03 kernel: [7370143.969492] block drbd4: Notified peer that my disk is broken.
Oct  2 11:02:00 node03 kernel: [7370143.970120] block drbd4: Should have called drbd_al_complete_io(, 21013680), but my Disk seems to have failed :(
Oct  2 11:02:00 node03 kernel: [7370144.003730] iscsi_trgt: fileio_make_request(63) I/O error 4096, -5
Oct  2 11:02:00 node03 kernel: [7370144.004931] iscsi_trgt: fileio_make_request(63) I/O error 4096, -5
Oct  2 11:02:00 node03 kernel: [7370144.006820] iscsi_trgt: fileio_make_request(63) I/O error 4096, -5
Oct  2 11:02:01 node03 kernel: [7370144.849344] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.849451] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.849709] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.849814] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.850077] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
<snip>
Oct  2 11:02:07 node03 kernel: [7370150.918849] mptbase: ioc0: WARNING - IOC is in FAULT state (7810h)!!!
Oct  2 11:02:07 node03 kernel: [7370150.918929] mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
Oct  2 11:02:07 node03 kernel: [7370150.919027] mptbase: ioc0: Initiating recovery
Oct  2 11:02:07 node03 kernel: [7370150.919098] mptbase: ioc0: WARNING - IOC is in FAULT state!!!
Oct  2 11:02:07 node03 kernel: [7370150.919171] mptbase: ioc0: WARNING -            FAULT code = 7810h
Oct  2 11:02:10 node03 kernel: [7370154.041934] mptbase: ioc0: Recovered from IOC FAULT
Oct  2 11:02:16 node03 cib: [5734]: WARN: send_ipc_message: IPC Channel to 23559 is not connected
Oct  2 11:02:21 node03 iSCSITarget[9060]: [9069]: WARNING: Configuration parameter "portals" is not supported by the iSCSI implementation and will be ignored.
Oct  2 11:02:22 node03 kernel: [7370166.353087] mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success


This results in 3 MDs where all disks have failed [_____] and 1 MD that
survives and rebuilds onto its spare.
3 DRBD devices end up Diskless/UpToDate and the survivor is
UpToDate/UpToDate.
The weird thing in all of this is that there is always exactly 1 MD set
that "survives" the FAULT state of the controller!
Luckily DRBD redirects all reads/writes to the second node, so there is
no downtime.
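
For completeness, the state above is what the usual status files
report, e.g.:

  cat /proc/mdstat   # the dead arrays show all members faulty: [_____]
  cat /proc/drbd     # detached resources show ds:Diskless/UpToDate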


Our findings:

1) It seems to happen only under heavy load

2) It seems to happen only when DRBD is connected (we have not had any
failing MDs yet while DRBD was disconnected, luckily!)

3) It seems to only happen on the primary node

4) It does not look like a hardware problem, because there is always
one MD that survives. If this were hardware related I would expect ALL
disks/MDs to fail.
Furthermore, the disks themselves are not broken: we can reassemble the
arrays after it happens and they resync just fine (a rough sketch of
the reassembly follows this list).

5) I see that there is a new kernel version (2.6.26-27) available, and
its changelog lists a fair number of MD-related fixes. The symptoms we
are seeing differ from the fixes described, but they could be related.
Can anyone tell whether these issues are related to the fixes in the
newest kernel image?

6) In the past we had a Dell MD1000 JBOD connected to the LSI 3801E
controller on both nodes and saw the same problem: every disk (only
those in the JBOD) failed, so we disconnected the JBOD. The controller
stayed inside the server.
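
As mentioned under finding 4, recovery boils down to reassembling the
array and re-attaching DRBD. Roughly what we do (device names
illustrative, assuming no real media errors):

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
  drbdadm attach r4   # DRBD then resyncs the stale blocks from the peer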


Things we tried so far:

1) We swapped the LSI 3081E-R controller for another, to no avail (and
we have another identical cluster suffering from this problem).

2) Instead of the stock Lenny mptsas driver (v3.04.06) we used the
latest official LSI mptsas driver (v4.26.00.00) from the LSI website,
following KB article 16387 (kb.lsi.com/KnowledgebaseArticle16387.aspx).
Still to no avail; it happens with that driver too.
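
To be sure which driver build is actually loaded after such a swap, we
check something like this (the /proc entry depends on the kernel
exposing it):

  modinfo mptsas | grep -i '^version'
  cat /proc/mpt/version    # Fusion MPT driver versions, if present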


Things that might be related:

1) We are using the deadline I/O scheduler, as recommended by IETD (set
per disk as shown in the snippet after this list).

2) We suspect that the LSI 3801E controller might interfere with the
LSI 3081E-R, so we are planning to remove the now-unused LSI 3801E
controllers.
Is there a known issue when both controllers are used in the same
machine? They have the same firmware/BIOS version, and the Linux driver
(mptsas) is the same for both controllers.
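
Regarding point 1, the scheduler is set per block device along these
lines (device name illustrative):

  echo deadline > /sys/block/sdb/queue/scheduler
  cat /sys/block/sdb/queue/scheduler   # e.g. "noop anticipatory [deadline] cfq"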

Kind regards,

Caspar Smit
Systemengineer
True Bit Resources B.V.
Ampèrestraat 13E
1446 TP  Purmerend

T: +31(0)299 410 475
F: +31(0)299 410 476
@: c.smit@truebit.nl
W: www.truebit.nl

[-- Attachment #2: drbd.conf.txt --]
[-- Type: text/plain, Size: 4234 bytes --]

resource r4 {

  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "echo r > /proc/sysrq-trigger; echo e > /proc/sysrq-trigger; echo i > /proc/sysrq-trigger; echo s > /proc/sysrq-trigger; echo u > /proc/sysrq-trigger; echo b > /proc/sysrq-trigger;";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    wfc-timeout 120;
    degr-wfc-timeout 120;
  }

  disk {
    on-io-error   detach;
  }

  net {
     sndbuf-size 512k;
     max-buffers     16384;
     max-epoch-size 16384;
     after-sb-0pri discard-older-primary;
     after-sb-1pri consensus;
     after-sb-2pri call-pri-lost-after-sb;
     rr-conflict disconnect;
    }

  syncer {
    rate 50M;
    al-extents 1801;
    verify-alg sha1;
  }

  device /dev/drbd4;
  disk /dev/md0;
  meta-disk internal;

  on node03 {
    address 10.0.4.1:7788;
  }
  on node04 {
    address 10.0.4.2:7788;
  }
}

resource r5 {

  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "echo r > /proc/sysrq-trigger; echo e > /proc/sysrq-trigger; echo i > /proc/sysrq-trigger; echo s > /proc/sysrq-trigger; echo u > /proc/sysrq-trigger; echo b > /proc/sysrq-trigger;";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    wfc-timeout 120;
    degr-wfc-timeout 120;
  }

  disk {
    on-io-error   detach;
  }

  net {
     sndbuf-size 512k;
     max-buffers     16384;
     max-epoch-size 16384;
     after-sb-0pri discard-older-primary;
     after-sb-1pri consensus;
     after-sb-2pri call-pri-lost-after-sb;
     rr-conflict disconnect;
    }

  syncer {
    rate 50M;
    al-extents 1801;
    verify-alg sha1;
  }

  device /dev/drbd5;
  disk /dev/md1;
  meta-disk internal;

  on node03 {
    address 10.0.5.1:7788;
  }
  on node04 {
    address 10.0.5.2:7788;
  }
}


resource r6 {

  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "echo r > /proc/sysrq-trigger; echo e > /proc/sysrq-trigger; echo i > /proc/sysrq-trigger; echo s > /proc/sysrq-trigger; echo u > /proc/sysrq-trigger; echo b > /proc/sysrq-trigger;";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    wfc-timeout 120;
    degr-wfc-timeout 120;
  }

  disk {
    on-io-error   detach;
  }

  net {
     sndbuf-size 512k;
     max-buffers     16384;
     max-epoch-size 16384;
     after-sb-0pri discard-older-primary;
     after-sb-1pri consensus;
     after-sb-2pri call-pri-lost-after-sb;
     rr-conflict disconnect;
    }

  syncer {
    rate 50M;
    al-extents 1801;
    verify-alg sha1;
  }

  device /dev/drbd6;
  disk /dev/md2;
  meta-disk internal;

  on node03 {
    address 10.0.6.1:7788;
  }
  on node04 {
    address 10.0.6.2:7788;
  }
}


resource r7 {

  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "echo r > /proc/sysrq-trigger; echo e > /proc/sysrq-trigger; echo i > /proc/sysrq-trigger; echo s > /proc/sysrq-trigger; echo u > /proc/sysrq-trigger; echo b > /proc/sysrq-trigger;";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    wfc-timeout 120;
    degr-wfc-timeout 120;
  }

  disk {
    on-io-error   detach;
  }

  net {
     sndbuf-size 512k;
     max-buffers     16384;
     max-epoch-size 16384;
     after-sb-0pri discard-older-primary;
     after-sb-1pri consensus;
     after-sb-2pri call-pri-lost-after-sb;
     rr-conflict disconnect;
    }

  syncer {
    rate 50M;
    al-extents 1801;
    verify-alg sha1;
  }

  device /dev/drbd7;
  disk /dev/md3;
  meta-disk internal;

  on node03 {
    address 10.0.7.1:7788;
  }
  on node04 {
    address 10.0.7.2:7788;
  }
}

[-- Attachment #3: crm-config.txt --]
[-- Type: text/plain, Size: 7898 bytes --]

node $id="4f6742db-c9a1-4968-8be6-06d6bad45dd9" node03 \
        attributes standby="off"
node $id="827e0857-9263-43d0-b348-7139b9f37a8f" node04 \
        attributes standby="off"
primitive drbd4 ocf:linbit:drbd \
        params drbd_resource="r4" \
        op monitor interval="15s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s" \
        meta target-role="started"
primitive drbd5 ocf:linbit:drbd \
        params drbd_resource="r5" \
        op monitor interval="15s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s" \
        meta target-role="started"
primitive drbd6 ocf:linbit:drbd \
        params drbd_resource="r6" \
        op monitor interval="15s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s" \
        meta target-role="started"
primitive drbd7 ocf:linbit:drbd \
        params drbd_resource="r7" \
        op monitor interval="15s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s" \
        meta target-role="started"
primitive failover-ip4 ocf:heartbeat:IPaddr2 \
        params ip="10.0.4.3" iflabel="0" \
        op monitor interval="10s" \
        meta target-role="started"
primitive failover-ip5 ocf:heartbeat:IPaddr2 \
        params ip="10.0.5.3" iflabel="0" \
        op monitor interval="10s" \
        meta target-role="started"
primitive failover-ip6 ocf:heartbeat:IPaddr2 \
        params ip="10.0.6.3" iflabel="0" \
        op monitor interval="10s" \
        meta target-role="started"
primitive failover-ip7 ocf:heartbeat:IPaddr2 \
        params ip="10.0.7.3" iflabel="0" \
        op monitor interval="10s"
primitive iscsilun4 ocf:heartbeat:iSCSILogicalUnit \
        params implementation="iet" target_iqn="iqn.2011-07.nl.test:test.storage4" lun="0" path="/dev/drbd4" scsi_id="storage4" scsi_sn="storage4" additional_parameters="IOMode=wt,Type=fileio" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="120" timeout="60" start-delay="0" depth="0" \
        meta target-role="started"
primitive iscsilun5 ocf:heartbeat:iSCSILogicalUnit \
        params implementation="iet" target_iqn="iqn.2011-07.nl.test:test.storage5" lun="0" path="/dev/drbd5" scsi_id="storage5" scsi_sn="storage5" additional_parameters="IOMode=wt,Type=fileio" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="120" timeout="60" start-delay="0" depth="0" \
        meta target-role="started"
primitive iscsilun6 ocf:heartbeat:iSCSILogicalUnit \
        params implementation="iet" target_iqn="iqn.2011-07.nl.test:test.storage6" lun="0" path="/dev/drbd6" scsi_id="storage6" scsi_sn="storage6" additional_parameters="IOMode=wt,Type=fileio" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="120" timeout="60" start-delay="0" depth="0" \
        meta target-role="started"
primitive iscsilun7 ocf:heartbeat:iSCSILogicalUnit \
        params implementation="iet" target_iqn="iqn.2011-07.nl.test:test.storage7" lun="0" path="/dev/drbd7" scsi_id="storage7" scsi_sn="storage7" additional_parameters="IOMode=wt,Type=fileio" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="120" timeout="60" start-delay="0" depth="0"
primitive iscsitarget4 ocf:heartbeat:iSCSITarget \
        params implementation="iet" iqn="iqn.2011-07.nl.test:test.storage4" tid="4" additional_parameters="InitialR2T=No,QueuedCommands=64,MaxRecvDataSegmentLength=65536,MaxXmitDataSegmentLength=65536,MaxOutstandingR2T=8,DefaultTime2Wait=10" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="120" timeout="60" start-delay="0" depth="0" \
        meta target-role="started"
primitive iscsitarget5 ocf:heartbeat:iSCSITarget \
        params implementation="iet" iqn="iqn.2011-07.nl.test:test.storage5" tid="5" additional_parameters="InitialR2T=No,QueuedCommands=64,MaxRecvDataSegmentLength=65536,MaxXmitDataSegmentLength=65536,MaxOutstandingR2T=8,DefaultTime2Wait=10" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="120" timeout="60" start-delay="0" depth="0" \
        meta target-role="started"
primitive iscsitarget6 ocf:heartbeat:iSCSITarget \
        params implementation="iet" iqn="iqn.2011-07.nl.test:test.storage6" tid="6" additional_parameters="InitialR2T=No,QueuedCommands=64,MaxRecvDataSegmentLength=65536,MaxXmitDataSegmentLength=65536,MaxOutstandingR2T=8,DefaultTime2Wait=10" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="120" timeout="60" start-delay="0" depth="0" \
        meta target-role="started"
primitive iscsitarget7 ocf:heartbeat:iSCSITarget \
        params implementation="iet" iqn="iqn.2011-07.nl.test:test.storage7" tid="7" additional_parameters="InitialR2T=No,QueuedCommands=64,MaxRecvDataSegmentLength=65536,MaxXmitDataSegmentLength=65536,MaxOutstandingR2T=8,DefaultTime2Wait=10" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="120" timeout="60" start-delay="0" depth="0"
group iscsi-group4 iscsitarget4 iscsilun4 failover-ip4 \
        meta target-role="Started"
group iscsi-group5 iscsitarget5 iscsilun5 failover-ip5 \
        meta target-role="Started"
group iscsi-group6 iscsitarget6 iscsilun6 failover-ip6 \
        meta target-role="Started"
group iscsi-group7 iscsitarget7 iscsilun7 failover-ip7 \
        meta target-role="Started"
ms ms-drbd4 drbd4 \
        meta clone-max="2" master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd5 drbd5 \
        meta clone-max="2" master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Stopped"
ms ms-drbd6 drbd6 \
        meta clone-max="2" master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Stopped"
ms ms-drbd7 drbd7 \
        meta clone-max="2" master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Stopped"
location ms-drbd4-master-on-node03 ms-drbd4 \
        rule $id="ms-drbd4-master-on-node04-rule" $role="master" 10: #uname eq node04
location ms-drbd5-master-on-node03 ms-drbd5 \
        rule $id="ms-drbd5-master-on-node04-rule" $role="master" 10: #uname eq node04
location ms-drbd6-master-on-node03 ms-drbd6 \
        rule $id="ms-drbd6-master-on-node04-rule" $role="master" 10: #uname eq node04
location ms-drbd7-master-on-node03 ms-drbd7 \
        rule $id="ms-drbd7-master-on-node04-rule" $role="master" 10: #uname eq node04
colocation iscsi-group4-on-ms-drbd4 inf: iscsi-group4 ms-drbd4:Master
colocation iscsi-group5-on-ms-drbd5 inf: iscsi-group5 ms-drbd5:Master
colocation iscsi-group6-on-ms-drbd6 inf: iscsi-group6 ms-drbd6:Master
colocation iscsi-group7-on-ms-drbd7 inf: iscsi-group7 ms-drbd7:Master
order ms-drbd4-before-iscsi-group4 inf: ms-drbd4:promote iscsi-group4:start
order ms-drbd5-before-iscsi-group5 inf: ms-drbd5:promote iscsi-group5:start
order ms-drbd6-before-iscsi-group6 inf: ms-drbd6:promote iscsi-group6:start
order ms-drbd7-before-iscsi-group7 inf: ms-drbd7:promote iscsi-group7:start
property $id="cib-bootstrap-options" \
        stonith-enabled="false" \
        dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
        no-quorum-policy="ignore" \
        cluster-recheck-interval="0" \
        cluster-infrastructure="Heartbeat"
rsc_defaults $id="rsc-options" \
        resource-stickiness="200"

[-- Attachment #4: syslog.txt --]
[-- Type: text/plain, Size: 40626 bytes --]

Oct  2 11:01:59 node03 kernel: [7370143.421999] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptbase_reply
Oct  2 11:01:59 node03 kernel: [7370143.435220] mptbase: ioc0: LogInfo(0x31181000): Originator={PL}, Code={IO Cancelled Due to Recieve Error}, SubCode(0x1000) cb_idx mptbase_reply
Oct  2 11:01:59 node03 kernel: [7370143.442141] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000) cb_idx mptbase_reply
Oct  2 11:01:59 node03 kernel: [7370143.442783] end_request: I/O error, dev sdf, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.442783] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.442783] raid5: Disk failure on sdf, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.442783] raid5: Operation continuing on 4 devices.
Oct  2 11:01:59 node03 kernel: [7370143.442820] end_request: I/O error, dev sdb, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.442820] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Disk failure on sdb, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Operation continuing on 3 devices.
Oct  2 11:01:59 node03 kernel: [7370143.442820] end_request: I/O error, dev sdd, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.442820] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Disk failure on sdd, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Operation continuing on 2 devices.
Oct  2 11:01:59 node03 kernel: [7370143.470791] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptbase_reply
Oct  2 11:01:59 node03 kernel: [7370143.472553] end_request: I/O error, dev sdv, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.472553] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.472553] raid5: Disk failure on sdv, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.472553] raid5: Operation continuing on 4 devices.
Oct  2 11:01:59 node03 kernel: [7370143.472553] end_request: I/O error, dev sdt, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.472553] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.472553] raid5: Disk failure on sdt, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.472553] raid5: Operation continuing on 3 devices.
Oct  2 11:01:59 node03 kernel: [7370143.480564] mptbase: ioc0: LogInfo(0x31181000): Originator={PL}, Code={IO Cancelled Due to Recieve Error}, SubCode(0x1000) cb_idx mptbase_reply
Oct  2 11:01:59 node03 kernel: [7370143.481982] end_request: I/O error, dev sde, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.481982] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.481982] raid5: Disk failure on sde, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.481982] raid5: Operation continuing on 1 devices.
Oct  2 11:01:59 node03 kernel: [7370143.481996] end_request: I/O error, dev sdc, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.481996] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.481996] raid5: Disk failure on sdc, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.481996] raid5: Operation continuing on 0 devices.
Oct  2 11:01:59 node03 kernel: [7370143.482463] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000) cb_idx mptbase_reply
Oct  2 11:01:59 node03 kernel: [7370143.482746] end_request: I/O error, dev sdw, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.482746] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.482746] raid5: Disk failure on sdw, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.482746] raid5: Operation continuing on 2 devices.
Oct  2 11:01:59 node03 kernel: [7370143.490729] end_request: I/O error, dev sdo, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.490803] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.490875] raid5: Disk failure on sdo, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.490876] raid5: Operation continuing on 4 devices.
Oct  2 11:01:59 node03 kernel: [7370143.491112] end_request: I/O error, dev sdp, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.491186] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.491258] raid5: Disk failure on sdp, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.491259] raid5: Operation continuing on 3 devices.
Oct  2 11:01:59 node03 kernel: [7370143.491978] end_request: I/O error, dev sdu, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.492053] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.492125] raid5: Disk failure on sdu, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.492127] raid5: Operation continuing on 1 devices.
Oct  2 11:01:59 node03 kernel: [7370143.492606] end_request: I/O error, dev sdq, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.492681] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.492753] raid5: Disk failure on sdq, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.492754] raid5: Operation continuing on 2 devices.
Oct  2 11:01:59 node03 kernel: [7370143.493859] end_request: I/O error, dev sdr, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.493944] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.494019] raid5: Disk failure on sdr, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.494021] raid5: Operation continuing on 1 devices.
Oct  2 11:01:59 node03 kernel: [7370143.536631] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptbase_reply
Oct  2 11:01:59 node03 kernel: [7370143.537009] end_request: I/O error, dev sdn, sector 3907028992
Oct  2 11:01:59 node03 kernel: [7370143.537009] md: super_written gets error=-5, uptodate=0
Oct  2 11:01:59 node03 kernel: [7370143.537009] raid5: Disk failure on sdn, disabling device.
Oct  2 11:01:59 node03 kernel: [7370143.537009] raid5: Operation continuing on 0 devices.
Oct  2 11:02:00 node03 kernel: [7370143.892190] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptbase_reply
Oct  2 11:02:00 node03 kernel: [7370143.893494] end_request: I/O error, dev sdx, sector 3907028992
Oct  2 11:02:00 node03 kernel: [7370143.893494] md: super_written gets error=-5, uptodate=0
Oct  2 11:02:00 node03 kernel: [7370143.893494] raid5: Disk failure on sdx, disabling device.
Oct  2 11:02:00 node03 kernel: [7370143.893494] raid5: Operation continuing on 0 devices.
Oct  2 11:02:00 node03 kernel: [7370143.895936] end_request: I/O error, dev sdj, sector 3907028992
Oct  2 11:02:00 node03 kernel: [7370143.896014] md: super_written gets error=-5, uptodate=0
Oct  2 11:02:00 node03 kernel: [7370143.896091] raid5: Disk failure on sdj, disabling device.
Oct  2 11:02:00 node03 kernel: [7370143.896092] raid5: Operation continuing on 4 devices.
Oct  2 11:02:00 node03 kernel: [7370143.905293] RAID5 conf printout:
Oct  2 11:02:00 node03 kernel: [7370143.905293]  --- rd:5 wd:0
Oct  2 11:02:00 node03 kernel: [7370143.905293]  disk 0, o:0, dev:sdb
Oct  2 11:02:00 node03 kernel: [7370143.905293]  disk 1, o:0, dev:sdc
Oct  2 11:02:00 node03 kernel: [7370143.905356]  disk 2, o:0, dev:sdd
Oct  2 11:02:00 node03 kernel: [7370143.905426]  disk 3, o:0, dev:sde
Oct  2 11:02:00 node03 kernel: [7370143.905494]  disk 4, o:0, dev:sdf
Oct  2 11:02:00 node03 kernel: [7370143.917078] mptbase: ioc0: LogInfo(0x31181000): Originator={PL}, Code={IO Cancelled Due to Recieve Error}, SubCode(0x1000) cb_idx mptbase_reply
Oct  2 11:02:00 node03 kernel: [7370143.920798] RAID5 conf printout:
Oct  2 11:02:00 node03 kernel: [7370143.920801]  --- rd:5 wd:0
Oct  2 11:02:00 node03 kernel: [7370143.920802]  disk 1, o:0, dev:sdc
Oct  2 11:02:00 node03 kernel: [7370143.920804]  disk 2, o:0, dev:sdd
Oct  2 11:02:00 node03 kernel: [7370143.920805]  disk 3, o:0, dev:sde
Oct  2 11:02:00 node03 kernel: [7370143.920807]  disk 4, o:0, dev:sdf
Oct  2 11:02:00 node03 kernel: [7370143.920814] RAID5 conf printout:
Oct  2 11:02:00 node03 kernel: [7370143.920816]  --- rd:5 wd:0
Oct  2 11:02:00 node03 kernel: [7370143.920817]  disk 1, o:0, dev:sdc
Oct  2 11:02:00 node03 kernel: [7370143.920818]  disk 2, o:0, dev:sdd
Oct  2 11:02:00 node03 kernel: [7370143.920819]  disk 3, o:0, dev:sde
Oct  2 11:02:00 node03 kernel: [7370143.920821]  disk 4, o:0, dev:sdf
Oct  2 11:02:00 node03 kernel: [7370143.932837] RAID5 conf printout:
Oct  2 11:02:00 node03 kernel: [7370143.932853]  --- rd:5 wd:0
Oct  2 11:02:00 node03 kernel: [7370143.932853]  disk 1, o:0, dev:sdc
Oct  2 11:02:00 node03 kernel: [7370143.932853]  disk 2, o:0, dev:sdd
Oct  2 11:02:00 node03 kernel: [7370143.932853]  disk 3, o:0, dev:sde
Oct  2 11:02:00 node03 kernel: [7370143.932853] RAID5 conf printout:
Oct  2 11:02:00 node03 kernel: [7370143.932853]  --- rd:5 wd:0
Oct  2 11:02:00 node03 kernel: [7370143.932853]  disk 1, o:0, dev:sdc
Oct  2 11:02:00 node03 kernel: [7370143.932853]  disk 2, o:0, dev:sdd
Oct  2 11:02:00 node03 kernel: [7370143.932853]  disk 3, o:0, dev:sde
Oct  2 11:02:00 node03 kernel: [7370143.938454] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptbase_reply
Oct  2 11:02:00 node03 kernel: [7370143.944595] RAID5 conf printout:
Oct  2 11:02:00 node03 kernel: [7370143.944666]  --- rd:5 wd:0
Oct  2 11:02:00 node03 kernel: [7370143.944737]  disk 1, o:0, dev:sdc
Oct  2 11:02:00 node03 kernel: [7370143.944804]  disk 2, o:0, dev:sdd
Oct  2 11:02:00 node03 kernel: [7370143.944875] RAID5 conf printout:
Oct  2 11:02:00 node03 kernel: [7370143.944936]  --- rd:5 wd:0
Oct  2 11:02:00 node03 kernel: [7370143.944936]  disk 1, o:0, dev:sdc
Oct  2 11:02:00 node03 kernel: [7370143.944936]  disk 2, o:0, dev:sdd
Oct  2 11:02:00 node03 kernel: [7370143.951160] mptbase: ioc0: LogInfo(0x31181000): Originator={PL}, Code={IO Cancelled Due to Recieve Error}, SubCode(0x1000) cb_idx mptbase_reply
Oct  2 11:02:00 node03 kernel: [7370143.953848] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000) cb_idx mptbase_reply
Oct  2 11:02:00 node03 kernel: [7370143.956930] RAID5 conf printout:
Oct  2 11:02:00 node03 kernel: [7370143.957004]  --- rd:5 wd:0
Oct  2 11:02:00 node03 kernel: [7370143.957071]  disk 1, o:0, dev:sdc
Oct  2 11:02:00 node03 kernel: [7370143.957141] RAID5 conf printout:
Oct  2 11:02:00 node03 kernel: [7370143.957208]  --- rd:5 wd:0
Oct  2 11:02:00 node03 kernel: [7370143.957274]  disk 1, o:0, dev:sdc
Oct  2 11:02:00 node03 kernel: [7370143.968752] RAID5 conf printout:
Oct  2 11:02:00 node03 kernel: [7370143.968826]  --- rd:5 wd:0
Oct  2 11:02:00 node03 kernel: [7370143.968903] block drbd4: p read: error=-5
Oct  2 11:02:00 node03 kernel: [7370143.968976] Buffer I/O error on device drbd4, logical block 1651581030
Oct  2 11:02:00 node03 kernel: [7370143.969056] block drbd4: p write: error=-5
Oct  2 11:02:00 node03 kernel: [7370143.969126] block drbd4: Local WRITE failed sec=21013680s size=4096
Oct  2 11:02:00 node03 kernel: [7370143.969203] block drbd4: disk( UpToDate -> Failed )
Oct  2 11:02:00 node03 kernel: [7370143.969276] block drbd4: Local IO failed in __req_mod.Detaching...
Oct  2 11:02:00 node03 kernel: [7370143.969492] block drbd4: disk( Failed -> Diskless )
Oct  2 11:02:00 node03 kernel: [7370143.969492] block drbd4: Notified peer that my disk is broken.
Oct  2 11:02:00 node03 kernel: [7370143.970120] block drbd4: Should have called drbd_al_complete_io(, 21013680), but my Disk seems to have failed :(
Oct  2 11:02:00 node03 kernel: [7370144.003730] iscsi_trgt: fileio_make_request(63) I/O error 4096, -5
Oct  2 11:02:00 node03 kernel: [7370144.004931] iscsi_trgt: fileio_make_request(63) I/O error 4096, -5
Oct  2 11:02:00 node03 kernel: [7370144.006820] iscsi_trgt: fileio_make_request(63) I/O error 4096, -5
Oct  2 11:02:01 node03 kernel: [7370144.849344] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.849451] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.849709] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.849814] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.850077] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.850077] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.850206] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.850965] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 last message repeated 3 times
Oct  2 11:02:01 node03 kernel: [7370144.851149] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.853339] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.853525] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.853711] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.853897] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.854084] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.854272] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.854468] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.854672] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.854839] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.855050] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.855182] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.855400] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.856258] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.856282] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 last message repeated 2 times
Oct  2 11:02:01 node03 kernel: [7370144.856470] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.857335] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.857524] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.857714] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.858480] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.858701] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.858841] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.859023] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.859203] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.859383] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.859564] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.859744] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.859925] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.860105] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.860286] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.860469] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.860650] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.860650] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.860845] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.861027] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.861603] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.861782] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.861963] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.862143] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.862324] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.862504] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.862685] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.862866] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.863046] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.863228] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.863408] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.863589] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.863771] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.863957] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.869335] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.869437] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:01 node03 kernel: [7370144.870135] end_request: I/O error, dev sds, sector 3907028992
Oct  2 11:02:01 node03 kernel: [7370144.870211] md: super_written gets error=-5, uptodate=0
Oct  2 11:02:01 node03 kernel: [7370144.870284] raid5: Disk failure on sds, disabling device.
Oct  2 11:02:01 node03 kernel: [7370144.870285] raid5: Operation continuing on 0 devices.
Oct  2 11:02:01 node03 kernel: [7370144.870460] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.870528]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.870594]  disk 0, o:0, dev:sdn
Oct  2 11:02:01 node03 kernel: [7370144.870661]  disk 1, o:0, dev:sdo
Oct  2 11:02:01 node03 kernel: [7370144.870729]  disk 2, o:0, dev:sdp
Oct  2 11:02:01 node03 kernel: [7370144.870796]  disk 3, o:0, dev:sdq
Oct  2 11:02:01 node03 kernel: [7370144.870864]  disk 4, o:0, dev:sdr
Oct  2 11:02:01 node03 kernel: [7370144.873685] end_request: I/O error, dev sdy, sector 3907028992
Oct  2 11:02:01 node03 kernel: [7370144.873685] md: super_written gets error=-5, uptodate=0
Oct  2 11:02:01 node03 kernel: [7370144.873685] raid5: Disk failure on sdy, disabling device.
Oct  2 11:02:01 node03 kernel: [7370144.873685] raid5: Operation continuing on 0 devices.
Oct  2 11:02:01 node03 kernel: [7370144.873685] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.873685]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.873685]  disk 0, o:0, dev:sdt
Oct  2 11:02:01 node03 kernel: [7370144.873685]  disk 1, o:0, dev:sdu
Oct  2 11:02:01 node03 kernel: [7370144.873685]  disk 2, o:0, dev:sdv
Oct  2 11:02:01 node03 kernel: [7370144.873685]  disk 3, o:0, dev:sdw
Oct  2 11:02:01 node03 kernel: [7370144.873685]  disk 4, o:0, dev:sdx
Oct  2 11:02:01 node03 kernel: [7370144.881974] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.882045]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.884314]  disk 1, o:0, dev:sdo
Oct  2 11:02:01 node03 kernel: [7370144.884382]  disk 2, o:0, dev:sdp
Oct  2 11:02:01 node03 kernel: [7370144.884449]  disk 3, o:0, dev:sdq
Oct  2 11:02:01 node03 kernel: [7370144.884517]  disk 4, o:0, dev:sdr
Oct  2 11:02:01 node03 kernel: [7370144.884590] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.884657]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.884723]  disk 1, o:0, dev:sdo
Oct  2 11:02:01 node03 kernel: [7370144.884790]  disk 2, o:0, dev:sdp
Oct  2 11:02:01 node03 kernel: [7370144.884858]  disk 3, o:0, dev:sdq
Oct  2 11:02:01 node03 kernel: [7370144.884925]  disk 4, o:0, dev:sdr
Oct  2 11:02:01 node03 kernel: [7370144.884999] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.885066]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.885132]  disk 1, o:0, dev:sdu
Oct  2 11:02:01 node03 kernel: [7370144.885200]  disk 2, o:0, dev:sdv
Oct  2 11:02:01 node03 kernel: [7370144.885267]  disk 3, o:0, dev:sdw
Oct  2 11:02:01 node03 kernel: [7370144.885334]  disk 4, o:0, dev:sdx
Oct  2 11:02:01 node03 kernel: [7370144.885404] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.885471]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.885537]  disk 1, o:0, dev:sdu
Oct  2 11:02:01 node03 kernel: [7370144.885604]  disk 2, o:0, dev:sdv
Oct  2 11:02:01 node03 kernel: [7370144.885676]  disk 3, o:0, dev:sdw
Oct  2 11:02:01 node03 kernel: [7370144.885743]  disk 4, o:0, dev:sdx
Oct  2 11:02:01 node03 kernel: [7370144.896334] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.895996] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.895999]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.896001]  disk 1, o:0, dev:sdo
Oct  2 11:02:01 node03 kernel: [7370144.896003]  disk 2, o:0, dev:sdp
Oct  2 11:02:01 node03 kernel: [7370144.896005]  disk 3, o:0, dev:sdq
Oct  2 11:02:01 node03 kernel: [7370144.896010] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.896011]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.896012]  disk 1, o:0, dev:sdo
Oct  2 11:02:01 node03 kernel: [7370144.896014]  disk 2, o:0, dev:sdp
Oct  2 11:02:01 node03 kernel: [7370144.896015]  disk 3, o:0, dev:sdq
Oct  2 11:02:01 node03 kernel: [7370144.897068]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.897135]  disk 1, o:0, dev:sdu
Oct  2 11:02:01 node03 kernel: [7370144.897202]  disk 2, o:0, dev:sdv
Oct  2 11:02:01 node03 kernel: [7370144.897270]  disk 3, o:0, dev:sdw
Oct  2 11:02:01 node03 kernel: [7370144.897340] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.897408]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.897474]  disk 1, o:0, dev:sdu
Oct  2 11:02:01 node03 kernel: [7370144.897541]  disk 2, o:0, dev:sdv
Oct  2 11:02:01 node03 kernel: [7370144.897609]  disk 3, o:0, dev:sdw
Oct  2 11:02:01 node03 kernel: [7370144.912316] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.912316]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.912316]  disk 1, o:0, dev:sdu
Oct  2 11:02:01 node03 kernel: [7370144.912316]  disk 2, o:0, dev:sdv
Oct  2 11:02:01 node03 kernel: [7370144.912316] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.912316]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.912316]  disk 1, o:0, dev:sdu
Oct  2 11:02:01 node03 kernel: [7370144.912316]  disk 2, o:0, dev:sdv
Oct  2 11:02:01 node03 kernel: [7370144.919420] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.919494]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.919725]  disk 1, o:0, dev:sdo
Oct  2 11:02:01 node03 kernel: [7370144.919792]  disk 2, o:0, dev:sdp
Oct  2 11:02:01 node03 kernel: [7370144.919862] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.919929]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.920000]  disk 1, o:0, dev:sdo
Oct  2 11:02:01 node03 kernel: [7370144.920067]  disk 2, o:0, dev:sdp
Oct  2 11:02:01 node03 kernel: [7370144.928158] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.928230]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.928301]  disk 1, o:0, dev:sdu
Oct  2 11:02:01 node03 kernel: [7370144.928323] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.928323]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.928323]  disk 1, o:0, dev:sdu
Oct  2 11:02:01 node03 kernel: [7370144.935392] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.935463]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.935859]  disk 1, o:0, dev:sdo
Oct  2 11:02:01 node03 kernel: [7370144.935929] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.935996]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.936062]  disk 1, o:0, dev:sdo
Oct  2 11:02:01 node03 kernel: [7370144.944270] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.944315]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=9920016s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: disk( UpToDate -> Failed )
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local IO failed in __req_mod.Detaching...
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=9920024s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784665984s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784665992s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666000s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666008s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666016s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666024s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666032s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666040s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666048s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666056s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666064s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666072s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666080s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666088s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666096s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Local WRITE failed sec=7784666104s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: disk( Failed -> Diskless )
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Notified peer that my disk is broken.
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Should have called drbd_al_complete_io(, 7784665984), but my Disk seems to have failed :(
Oct  2 11:02:01 node03 kernel: [7370144.944315] block drbd7: Should have called drbd_al_complete_io(, 9920016), but my Disk seems to have failed :(
Oct  2 11:02:01 node03 kernel: [7370144.951282] RAID5 conf printout:
Oct  2 11:02:01 node03 kernel: [7370144.951651]  --- rd:5 wd:0
Oct  2 11:02:01 node03 kernel: [7370144.951996] block drbd6: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.952066] block drbd6: Local WRITE failed sec=10703863424s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.952143] block drbd6: disk( UpToDate -> Failed )
Oct  2 11:02:01 node03 kernel: [7370144.952216] block drbd6: Local IO failed in __req_mod.Detaching...
Oct  2 11:02:01 node03 kernel: [7370144.952321] block drbd6: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.952391] block drbd6: Local WRITE failed sec=10703863432s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.952471] block drbd6: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.952843] block drbd6: Local WRITE failed sec=10703863440s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.952929] block drbd6: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.952999] block drbd6: Local WRITE failed sec=9849888s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.953079] block drbd6: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.953149] block drbd6: Local WRITE failed sec=11687077376s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.953229] block drbd6: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.953298] block drbd6: Local WRITE failed sec=9920096s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.953378] block drbd6: p write: error=-5
Oct  2 11:02:01 node03 kernel: [7370144.953448] block drbd6: Local WRITE failed sec=3903394728s size=4096
Oct  2 11:02:01 node03 kernel: [7370144.959512] block drbd6: disk( Failed -> Diskless )
Oct  2 11:02:01 node03 kernel: [7370144.959512] block drbd6: Notified peer that my disk is broken.
Oct  2 11:02:01 node03 kernel: [7370145.147343] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptbase_reply
Oct  2 11:02:01 node03 kernel: [7370145.159666] mptbase: ioc0: LogInfo(0x31181000): Originator={PL}, Code={IO Cancelled Due to Recieve Error}, SubCode(0x1000) cb_idx mptbase_reply
Oct  2 11:02:01 node03 kernel: [7370145.183461] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000) cb_idx mptbase_reply
Oct  2 11:02:02 node03 kernel: [7370146.666198] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.666198] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.666313] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 last message repeated 2 times
Oct  2 11:02:02 node03 kernel: [7370146.670772] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.670874] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.670975] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.671077] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.671179] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.671280] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.671381] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.671482] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.671583] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.671685] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.671788] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.671891] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.671993] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.672094] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:02 node03 kernel: [7370146.672195] mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Oct  2 11:02:05 node03 cib: [5734]: WARN: send_ipc_message: IPC Channel to 23559 is not connected
Oct  2 11:02:07 node03 kernel: [7370150.918849] mptbase: ioc0: WARNING - IOC is in FAULT state (7810h)!!!
Oct  2 11:02:07 node03 kernel: [7370150.918929] mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
Oct  2 11:02:07 node03 kernel: [7370150.919027] mptbase: ioc0: Initiating recovery
Oct  2 11:02:07 node03 kernel: [7370150.919098] mptbase: ioc0: WARNING - IOC is in FAULT state!!!
Oct  2 11:02:07 node03 kernel: [7370150.919171] mptbase: ioc0: WARNING -            FAULT code = 7810h
Oct  2 11:02:10 node03 kernel: [7370154.041934] mptbase: ioc0: Recovered from IOC FAULT
Oct  2 11:02:16 node03 cib: [5734]: WARN: send_ipc_message: IPC Channel to 23559 is not connected
Oct  2 11:02:21 node03 iSCSITarget[9060]: [9069]: WARNING: Configuration parameter "portals" is not supported by the iSCSI implementation and will be ignored.
Oct  2 11:02:22 node03 kernel: [7370166.353087] mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
Oct  2 11:02:22 node03 kernel: [7370166.794782] RAID5 conf printout:
Oct  2 11:02:22 node03 kernel: [7370166.794782]  --- rd:5 wd:4
Oct  2 11:02:22 node03 kernel: [7370166.794782]  disk 0, o:1, dev:sdh
Oct  2 11:02:22 node03 kernel: [7370166.794782]  disk 1, o:1, dev:sdi
Oct  2 11:02:22 node03 kernel: [7370166.794782]  disk 2, o:0, dev:sdj
Oct  2 11:02:22 node03 kernel: [7370166.794782]  disk 3, o:1, dev:sdk
Oct  2 11:02:22 node03 kernel: [7370166.794782]  disk 4, o:1, dev:sdl
Oct  2 11:02:22 node03 kernel: [7370166.812949] RAID5 conf printout:
Oct  2 11:02:22 node03 kernel: [7370166.812967]  --- rd:5 wd:4
Oct  2 11:02:22 node03 kernel: [7370166.812967]  disk 0, o:1, dev:sdh
Oct  2 11:02:22 node03 kernel: [7370166.812967]  disk 1, o:1, dev:sdi
Oct  2 11:02:22 node03 kernel: [7370166.812967]  disk 3, o:1, dev:sdk
Oct  2 11:02:22 node03 kernel: [7370166.812967]  disk 4, o:1, dev:sdl
Oct  2 11:02:22 node03 kernel: [7370166.812967] RAID5 conf printout:
Oct  2 11:02:22 node03 kernel: [7370166.812967]  --- rd:5 wd:4
Oct  2 11:02:22 node03 kernel: [7370166.812967]  disk 0, o:1, dev:sdh
Oct  2 11:02:22 node03 kernel: [7370166.812967]  disk 1, o:1, dev:sdi
Oct  2 11:02:22 node03 kernel: [7370166.812967]  disk 2, o:1, dev:sdm
Oct  2 11:02:22 node03 kernel: [7370166.812967]  disk 3, o:1, dev:sdk
Oct  2 11:02:22 node03 kernel: [7370166.812967]  disk 4, o:1, dev:sdl
Oct  2 11:02:22 node03 kernel: [7370166.812967] md: recovery of RAID array md1
Oct  2 11:02:22 node03 kernel: [7370166.812967] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Oct  2 11:02:22 node03 kernel: [7370166.812967] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  2 11:02:22 node03 kernel: [7370166.812967] md: using 128k window, over a total of 1953514496 blocks.
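
For reference, the rebuild above can be watched while it runs, and once the
controller has recovered, arrays whose members were all kicked out can
usually be brought back by hand. A minimal sketch, assuming the disks
themselves are healthy; the md0 member names are illustrative:

# Watch the spare rebuilding md1; the "minimum _guaranteed_ speed" and the
# 200000 KB/sec ceiling from the log are the md rate tunables:
cat /proc/mdstat
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# An array whose members were all failed by transient I/O errors can usually
# be force-assembled again once the bus is back (member list illustrative):
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf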

* Re: MD sets failing under heavy load in a DRBD/Pacemaker Cluster
  2011-10-04 12:01 MD sets failing under heavy load in a DRBD/Pacemaker Cluster Caspar Smit
@ 2011-10-04 13:22 ` CoolCold
  2011-10-04 15:28 ` Thakkar, Vishal
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: CoolCold @ 2011-10-04 13:22 UTC (permalink / raw)
  To: Caspar Smit
  Cc: General Linux-HA mailing list, linux-scsi, drbd-user,
	iscsitarget-devel, Support, Software, linux-raid

From my point of view this looks like a driver/hardware error, since you
have records like:
Oct  2 11:01:59 node03 kernel: [7370143.442783] end_request: I/O
error, dev sdf, sector 3907028992
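
A quick way to tell disk-level faults from controller/link faults, as a
sketch (assumes smartmontools is installed; device names are illustrative):

# Ask each member disk for its own view; SATA drives behind a SAS HBA
# usually need the SAT pass-through, hence "-d sat":
for d in /dev/sd[b-f]; do
    echo "== $d =="
    smartctl -d sat -H -l error "$d"
done

# Controller-side resets and aborted commands land in the kernel log:
dmesg | grep -iE 'mptbase|mptsas|end_request'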


On Tue, Oct 4, 2011 at 4:01 PM, Caspar Smit <c.smit@truebit.nl> wrote:
> <snip>



-- 
Best regards,
[COOLCOLD-RIPN]

* RE: MD sets failing under heavy load in a DRBD/Pacemaker Cluster
  2011-10-04 12:01 MD sets failing under heavy load in a DRBD/Pacemaker Cluster Caspar Smit
  2011-10-04 13:22 ` CoolCold
@ 2011-10-04 15:28 ` Thakkar, Vishal
  2011-10-04 16:36 ` James Bottomley
  2011-10-04 17:25 ` MD sets failing under heavy load in a DRBD/Pacemaker Cluster (115893302) Support
  3 siblings, 0 replies; 5+ messages in thread
From: Thakkar, Vishal @ 2011-10-04 15:28 UTC (permalink / raw)
  To: Caspar Smit, General Linux-HA mailing list, linux-scsi
  Cc: Nandigama, Nagalakshmi

Hi Caspar,

What firmware (FW) version is on the LSI 3081E-R HBA? It is printed as a debug message when the driver loads. If the version is old, you may want to update it from http://www.lsi.com/channel/products/storagecomponents/Pages/HBAs.aspx

On your question - "Is there a known issue when both controllers are used in the same machine?" - We aren't aware of any such known issues with the latest versions of the driver and HBA FW.
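
That load-time message can be fished out of the kernel log; a sketch (the
exact wording varies between driver versions, so the patterns are loose):

# The Fusion-MPT driver logs adapter and firmware details when it binds
# to the HBA:
dmesg | grep -i -e mptbase -e mptsas

# If those boot messages have rotated out of the ring buffer, check the
# persisted kernel log instead:
grep -i -e mptbase -e mptsas /var/log/kern.log*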

Thanks,
Vishal


* Re: MD sets failing under heavy load in a DRBD/Pacemaker Cluster
  2011-10-04 12:01 MD sets failing under heavy load in a DRBD/Pacemaker Cluster Caspar Smit
  2011-10-04 13:22 ` CoolCold
  2011-10-04 15:28 ` Thakkar, Vishal
@ 2011-10-04 16:36 ` James Bottomley
  2011-10-04 17:25 ` MD sets failing under heavy load in a DRBD/Pacemaker Cluster (115893302) Support
  3 siblings, 0 replies; 5+ messages in thread
From: James Bottomley @ 2011-10-04 16:36 UTC (permalink / raw)
  To: Caspar Smit
  Cc: General Linux-HA mailing list, linux-scsi, drbd-user,
	iscsitarget-devel, Support, Software, linux-raid

On Tue, 2011-10-04 at 14:01 +0200, Caspar Smit wrote:
> Oct  2 11:01:59 node03 kernel: [7370143.421999] mptbase: ioc0:
> LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00)
> cb_idx mptbase_reply
> Oct  2 11:01:59 node03 kernel: [7370143.435220] mptbase: ioc0:
> LogInfo(0x31181000): Originator={PL}, Code={IO Cancelled Due to
> Recieve Error}, SubCode(0x1000) cb_idx mptbase_reply
> Oct  2 11:01:59 node03 kernel: [7370143.442141] mptbase: ioc0:
> LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000)
> cb_idx mptbase_reply 

This indicates some type of hardware error in the LSI controller.  LSI
should be able to tell you what the error means.  My best guess (and it
is only a guess since I can't interpret the codes) is a link error, so
checking the SATA cables would be the first thing I'd do.
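
Link problems of that sort often show up as PHY/CRC event counters on the
drives themselves; a sketch (assumes smartmontools; device list illustrative):

# SATA PHY event counters climb when a cable or backplane link is flaky:
for d in /dev/sd[b-f]; do
    echo "== $d =="
    smartctl -d sat -l sataphy "$d"
done

# The UDMA_CRC_Error_Count SMART attribute tells a similar story:
smartctl -d sat -A /dev/sdb | grep -i crc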

James



* RE: MD sets failing under heavy load in a DRBD/Pacemaker Cluster (115893302)
  2011-10-04 12:01 MD sets failing under heavy load in a DRBD/Pacemaker Cluster Caspar Smit
                   ` (2 preceding siblings ...)
  2011-10-04 16:36 ` James Bottomley
@ 2011-10-04 17:25 ` Support
  3 siblings, 0 replies; 5+ messages in thread
From: Support @ 2011-10-04 17:25 UTC (permalink / raw)
  To: Caspar Smit, General Linux-HA mailing list, linux-scsi

Hi Caspar,

It is difficult to say for sure what the issue is. If you can run a utility of ours called lsiget, it will collect logs, and I will be able to see what is causing the errors from the controller's standpoint.

You can download the utility from the link below. Run the batch file and send the zip file back.
http://kb.lsi.com/KnowledgebaseArticle12278.aspx?Keywords=linux+lsiget


Regards,

Drew Cohen
Technical Support Engineer
Global Support Services

LSI Corporation
4165 Shakleford Road
Norcross, GA 30093
Phone: 1-800-633-4545
Email: support@lsi.com




