high volume of disk-writes causes disk to 'disappear'

From: Leif Sawyer @ 2010-04-19 18:38 UTC
  To: linux-scsi

All -

I just joined the list so I could better track this particular issue
that I've been experiencing.

This issue is repeatable on kernels from 2.6.27 through 2.6.33, whether
vanilla or distro.

Summary:
        high volume of disk-writes causes disk to 'disappear'

Setup:
        The system is a dual-Xeon (2.40GHz) machine with an embedded
Adaptec AIC-7902 U320 controller, 2GiB RAM, two 1000baseT interfaces,
and one 100baseT interface.

This system is built as a remote network sniffer, and streams all
captured data using tshark with rotating capture-files.  The files are
automatically rotated at 512MiB.

       The system has two Seagate drives installed:
            system: ST336706LC  -  36GB
            data:   ST3146855LC - 146GB

Root is on the system drive, formatted as ext3. Swap is also on the
system drive.
The data drive is formatted as ext2 across the full disk and mounted
noexec,nodev,noatime.

       A web-based interface starts and stops the sniffer, which writes
data from either or both gigabit interfaces (depending on link status) to
the data disk.

Symptom:

       After a variable length of time, the system starts logging
errors and becomes unresponsive.

----

<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115076.989521]
sd 3:0:1:0: [sdb] Attempting to queue an ABORT message:CDB: 0x0 0x0
0x0 0x0 0x0 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115076.993289]
sd 3:0:1:0: [sdb] Attempting to queue an ABORT message:CDB: 0x2a 0x0
0x2 0xc0 0xdd 0x8f 0x0 0x4 0x0 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115076.993308]
sd 3:0:1:0: [sdb] Command not found
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115086.060044]
INFO: task kjournald:1007 blocked for more than 120 seconds.
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115086.760828]
INFO: task rsyslogd:26910 blocked for more than 120 seconds.
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115086.842049]
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115086.936773]
rsyslogd      D 00000000     0 26910      1 0x00000000
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115086.936957]
INFO: task cron:9106 blocked for more than 120 seconds.
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115087.013001]
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115187.040452]
sd 3:0:0:0: [sda] Command not found
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115197.040033]
sd 3:0:0:0: [sda] Attempting to queue an ABORT message:CDB: 0x0 0x0
0x0 0x0 0x0 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115197.041579]
sd 3:0:0:0: [sda] Attempting to queue an ABORT message:CDB: 0x2a 0x0
0x1 0xca 0xc0 0x4f 0x0 0x0 0x8 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115197.041624]
sd 3:0:0:0: [sda] Command not found
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115207.040034]
sd 3:0:0:0: [sda] Attempting to queue an ABORT message:CDB: 0x0 0x0
0x0 0x0 0x0 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115207.041854]
sd 3:0:0:0: [sda] Attempting to queue a TARGET RESET message:CDB: 0x2a
0x0 0x1 0xca 0xc0 0x97 0x0 0x0 0x8 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115212.040034]
sd 3:0:1:0: [sdb] Attempting to queue a TARGET RESET message:CDB: 0x2a
0x0 0x2 0xc0 0xa3 0x5f 0x0 0x4 0x0 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.264287]
sd 3:0:1:0: Device offlined - not ready after error recovery
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.264308]
sd 3:0:1:0: [sdb] Unhandled error code
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.264312]
sd 3:0:1:0: [sdb] Result: hostbyte=DID_REQUEUE driverbyte=DRIVER_OK
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.264319]
sd 3:0:1:0: [sdb] CDB: Write(10): 2a 00 02 c0 a7 5f 00 04 00 00
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.264340]
end_request: I/O error, dev sdb, sector 46180191
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.334682] sd
3:0:1:0: rejecting I/O to offline device
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.337071] sd
3:0:1:0: rejecting I/O to offline device

-----
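
As a cross-check on the log above, the Write(10) CDB in the final error
can be decoded to confirm that it matches the reported sector. The
following is a minimal Python sketch, not part of the original report.
`decode_write10` is a hypothetical helper name, and the 512-byte block
size mentioned afterwards is an assumption about these drives.

```python
def decode_write10(cdb_text: str) -> dict:
    """Decode the LBA and transfer length from a WRITE(10) CDB.

    Accepts space-separated hex bytes, with or without a 0x prefix,
    as they appear in the kernel messages.
    """
    cdb = bytes(int(token, 16) for token in cdb_text.split())
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {"lba": lba, "blocks": blocks}


# CDB from the "Unhandled error code" entry for sdb
print(decode_write10("2a 00 02 c0 a7 5f 00 04 00 00"))
# {'lba': 46180191, 'blocks': 1024}

# CDB from the first ABORT attempt on sdb
print(decode_write10("0x2a 0x0 0x2 0xc0 0xdd 0x8f 0x0 0x4 0x0 0x0"))
# {'lba': 46194063, 'blocks': 1024}
```

The first decoded LBA equals the sector in the end_request line, and
each of these writes covers 1024 blocks (512 KiB if the blocks are
512 bytes).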

At this point, via console, I have attempted to use
scsiadd/partprobe/sdparm to "re-discover" the lost disk:
        scsiadd -s > ${logfile} 2>&1
        partprobe -s >> ${logfile} 2>&1
        sdparm -al /dev/sdb >> ${logfile} 2>&1

scsiadd finds the device, but the kernel doesn't seem to register it:

        Attached devices:
        Host: scsi3 Channel: 00 Id: 00 Lun: 00
          Vendor: SEAGATE  Model: ST336706LC       Rev: 0108
          Type:   Direct-Access                    ANSI  SCSI revision: 03
        Host: scsi3 Channel: 00 Id: 01 Lun: 00
          Vendor: SEAGATE  Model: ST3146855LC      Rev: 0003
          Type:   Direct-Access                    ANSI  SCSI revision: 03
        Host: scsi3 Channel: 00 Id: 06 Lun: 00
          Vendor: ESG-TSD  Model: SCA HSBP M23     Rev: 1.05
          Type:   Processor                        ANSI  SCSI revision: 02
        /dev/sda: msdos partitions 1 2
        open error: /dev/sdb [read only]: No such device or address

At this point, I have to reboot in order to see the disk again.
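
The "Device offlined" and "rejecting I/O to offline device" messages
suggest that the SCSI midlayer has marked the device offline rather than
removed it. One way to check this before rebooting is to read the
device's `state` attribute in sysfs. A minimal Python sketch follows,
assuming the kernel exposes /sys/block/<disk>/device/state;
`scsi_device_state` and its `sysfs_root` parameter are hypothetical names
introduced here for illustration.

```python
from pathlib import Path


def scsi_device_state(disk: str, sysfs_root: str = "/sys") -> str:
    """Return the SCSI midlayer state of a disk, e.g. 'running' or 'offline'.

    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a live system the default applies.
    """
    state_file = Path(sysfs_root) / "block" / disk / "device" / "state"
    return state_file.read_text().strip()


# Example on the affected machine (raises OSError if the path is absent):
#   scsi_device_state("sdb")
```

If this reports offline, writing `running` to the same file as root is
the usual way to ask the kernel to accept I/O to the device again.
Whether the drive then responds in this failure mode is untested here.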

I have more logging data, but no kernel-debug data at this time.

I would appreciate any help or pointers.

Thanks,
   Leif

-- 
"It's pronounced Layf...you know, like Leif Garrett? Don't you watch
 'I Love the 70's'? What kind of retro lover are you, anyway?"
