All of lore.kernel.org
 help / color / mirror / Atom feed
* high volume of disk-writes causes disk to 'disappear'
@ 2010-04-19 18:38 Leif Sawyer
       [not found] ` <w2od4bc8f581004271446md75f0dcbt537956a2e3f9abb7@mail.gmail.com>
  0 siblings, 1 reply; 6+ messages in thread
From: Leif Sawyer @ 2010-04-19 18:38 UTC (permalink / raw)
  To: linux-scsi

All -

Just joined the list so I could better track this particular issue
that's I've been experiencing.

This issue is repeatable using kernels from 2.6.27 - 2.6.33, whether
vanilla or distro.

Summary:
        high volume of disk-writes causes disk to 'disappear'

Setup:
        I have a dual-Xeon CPU 2.40GHz with embedded Adaptec AIC-7902
U320 controller, 2GiB ram, and 2 1000baseT and 1 100baseT interfaces.

This system is built as a remote network sniffer, and streams all
captured data using tshark with rotating capture-files.  The files are
automatically rotated at 512MiB.

       The system has two seagate drives installed:
            system: ST336706LC -  36Gb
            data:  ST3146855LC - 146Gb

root is formatted on the system drive as ext3.  swap is also on the
system drive.
data is (full-disk) formatted as ext2, mounted  noexec,nodev,noatime

       A web-based interface starts/stops the sniffer, which writes
data from either/both GiB interfaces (depending on link status)  to
the data disk.

Symptom:

       After a variable length of time, the system will start logging
errors, and become unresponsive.

----

<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115076.989521]
sd 3:0:1:0: [sdb] Attempting to queue an ABORT message:CDB: 0x0 0x0
0x0 0x0 0x0 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115076.993289]
sd 3:0:1:0: [sdb] Attempting to queue an ABORT message:CDB: 0x2a 0x0
0x2 0xc0 0xdd 0x8f 0x0 0x4 0x0 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115076.993308]
sd 3:0:1:0: [sdb] Command not found
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115086.060044]
INFO: task kjournald:1007 blocked for more than 120 seconds.
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115086.760828]
INFO: task rsyslogd:26910 blocked for more than 120 seconds.
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115086.842049]
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115086.936773]
rsyslogd      D 00000000     0 26910      1 0x00000000
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115086.936957]
INFO: task cron:9106 blocked for more than 120 seconds.
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115087.013001]
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115187.040452]
sd 3:0:0:0: [sda] Command not found
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115197.040033]
sd 3:0:0:0: [sda] Attempting to queue an ABORT message:CDB: 0x0 0x0
0x0 0x0 0x0 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115197.041579]
sd 3:0:0:0: [sda] Attempting to queue an ABORT message:CDB: 0x2a 0x0
0x1 0xca 0xc0 0x4f 0x0 0x0 0x8 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115197.041624]
sd 3:0:0:0: [sda] Command not found
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115207.040034]
sd 3:0:0:0: [sda] Attempting to queue an ABORT message:CDB: 0x0 0x0
0x0 0x0 0x0 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115207.041854]
sd 3:0:0:0: [sda] Attempting to queue a TARGET RESET message:CDB: 0x2a
0x0 0x1 0xca 0xc0 0x97 0x0 0x0 0x8 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115212.040034]
sd 3:0:1:0: [sdb] Attempting to queue a TARGET RESET message:CDB: 0x2a
0x0 0x2 0xc0 0xa3 0x5f 0x0 0x4 0x0 0x0
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.264287]
sd 3:0:1:0: Device offlined - not ready after error recovery
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.264308]
sd 3:0:1:0: [sdb] Unhandled error code
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.264312]
sd 3:0:1:0: [sdb] Result: hostbyte=DID_REQUEUE driverbyte=DRIVER_OK
<kern.info<6>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.264319]
sd 3:0:1:0: [sdb] CDB: Write(10): 2a 00 02 c0 a7 5f 00 04 00 00
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.264340]
end_request: I/O error, dev sdb, sector 46180191
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.334682] sd
3:0:1:0: rejecting I/O to offline device
<kern.err<3>>Apr  8 19:28:41 websniff-6036a5 kernel:[115227.337071] sd
3:0:1:0: rejecting I/O to offline device

-----

At this point, via console, I have attempted to use
scsiadd/partprobe/sdparm to "re-discover" the lost disk,
        scsiadd -s > ${logfile} 2>&1
        partprobe -s >> ${logfile} 2>&1
        sdparm -al /dev/sdb >> ${logfile} 2>&1

scsiadd finds the device, but the kernel doesn't seem to register it:

        Attached devices:
        Host: scsi3 Channel: 00 Id: 00 Lun: 00
          Vendor: SEAGATE  Model: ST336706LC       Rev: 0108
          Type:   Direct-Access                    ANSI  SCSI revision: 03
        Host: scsi3 Channel: 00 Id: 01 Lun: 00
          Vendor: SEAGATE  Model: ST3146855LC      Rev: 0003
          Type:   Direct-Access                    ANSI  SCSI revision: 03
        Host: scsi3 Channel: 00 Id: 06 Lun: 00
          Vendor: ESG-TSD  Model: SCA HSBP M23     Rev: 1.05
          Type:   Processor                        ANSI  SCSI revision: 02
        /dev/sda: msdos partitions 1 2
        open error: /dev/sdb [read only]: No such device or address


At this point, I have to reboot in order to see the disk.

I have more logging data, but no kernel-debug data at this time.

I would appreciate any help or pointers.

Thanks,
   Leif

-- 
"It's pronounced Layf...you know, like Leif Garrett? Don't you watch
 'I Love the 70's'? What kind of retro lover are you, anyway?"

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Fwd: high volume of disk-writes causes disk to 'disappear'
       [not found]           ` <i2yd4bc8f581005031143r638b8037ra3778f97e6d2e25d@mail.gmail.com>
@ 2010-05-15 19:09             ` Leif Sawyer
  2010-05-17 23:40               ` Leif Sawyer
  0 siblings, 1 reply; 6+ messages in thread
From: Leif Sawyer @ 2010-05-15 19:09 UTC (permalink / raw)
  To: linux-scsi; +Cc: James.Bottomley, Hannes Reinecke

On a whim, I applied the following patches:
   master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git

to my linux 2.6.34-rc5 kernel for testing.

Without the patches, my system would stay running for no more than 2
or three days
before erroring out and requiring a reboot before visibility of the
disk returned.

After applying the SCSI rc fixes from James' git repo, my system is
now at 5+ days
of streaming to the scsi disk.

I'll keep watching this and post another status update in a few days.

Thanks for all the hard work.


-- 
"It's pronounced Layf...you know, like Leif Garrett? Don't you watch
 'I Love the 70's'? What kind of retro lover are you, anyway?"

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Fwd: high volume of disk-writes causes disk to 'disappear'
  2010-05-15 19:09             ` Fwd: " Leif Sawyer
@ 2010-05-17 23:40               ` Leif Sawyer
  2010-05-19 13:23                 ` Leif Sawyer
  0 siblings, 1 reply; 6+ messages in thread
From: Leif Sawyer @ 2010-05-17 23:40 UTC (permalink / raw)
  To: linux-scsi; +Cc: James.Bottomley, Hannes Reinecke

AH, of course, I spoke too soon.

Also, I discovered something interesting about this problem:

   As long as I keep my disk 'fullness' below 75%, I don't see the
problem as often.

  Once my web-captures try to fill the disk past about 75/80%, then
the churn rate of new
data ( about 1GiB every 4min )  combined with the purge rate of the
old data  will
cause the scsi device to become lost very quickly.





On Sat, May 15, 2010 at 11:09 AM, Leif Sawyer <ak.hepcat+scsi@gmail.com> wrote:
> On a whim, I applied the following patches:
>   master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git
>
> to my linux 2.6.34-rc5 kernel for testing.
>
> Without the patches, my system would stay running for no more than 2
> or three days
> before erroring out and requiring a reboot before visibility of the
> disk returned.
>
> After applying the SCSI rc fixes from James' git repo, my system is
> now at 5+ days
> of streaming to the scsi disk.
>
> I'll keep watching this and post another status update in a few days.
>
> Thanks for all the hard work.
>
>
> --
> "It's pronounced Layf...you know, like Leif Garrett? Don't you watch
>  'I Love the 70's'? What kind of retro lover are you, anyway?"
>



-- 
"It's pronounced Layf...you know, like Leif Garrett? Don't you watch
 'I Love the 70's'? What kind of retro lover are you, anyway?"
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Fwd: high volume of disk-writes causes disk to 'disappear'
  2010-05-17 23:40               ` Leif Sawyer
@ 2010-05-19 13:23                 ` Leif Sawyer
  2010-05-25 22:58                   ` Leif Sawyer
  0 siblings, 1 reply; 6+ messages in thread
From: Leif Sawyer @ 2010-05-19 13:23 UTC (permalink / raw)
  To: linux-scsi; +Cc: James.Bottomley, Hannes Reinecke

looks like the 75% mark might have been too high of an estimate.
Whipped up a quick logger to show me when i was failing:

<user.info<14>>May 18 17:08:01 websniff-6036a5 logger: disk: /data at
59% utilization
<user.info<14>>May 18 17:09:01 websniff-6036a5 logger: disk: /data at
59% utilization
<user.info<14>>May 18 17:10:01 websniff-6036a5 logger: disk: /data at
60% utilization
<user.info<14>>May 18 17:11:01 websniff-6036a5 logger: disk: /data at
60% utilization
[22563.204037] INFO: task flush-8:16:2430 blocked for more than 120 seconds.
[22563.224392] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[22563.248117] INFO: task dumpcap:4004 blocked for more than 120 seconds.
[22563.267662] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[22563.291359] INFO: task df:14714 blocked for more than 120 seconds.
[22563.309874] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[22563.333593] INFO: task websniff.cgi:14717 blocked for more than 120 seconds.
[22563.354690] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[22682.229526] end_request: I/O error, dev sdb, sector 169922743
[22682.247345] Buffer I/O error on device sdb1, logical block 21240335
[22682.266781] end_request: I/O error, dev sdb, sector 170131519
[22682.284445] Buffer I/O error on device sdb1, logical block 21266432
[....... repeats until......]
[22682.782577] sd 3:0:1:0: rejecting I/O to offline device
[22682.798907] sd 3:0:1:0: rejecting I/O to offline device

And from here on out, the device is no longer recognized by the system
until a reboot.

I need some help with scsi debugging in order to provide more useful
information.

I do have a 512mb logfile (text)  with lots of scsi dump card state
logs and such, though.

-- 
"It's pronounced Layf...you know, like Leif Garrett? Don't you watch
 'I Love the 70's'? What kind of retro lover are you, anyway?"

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Fwd: high volume of disk-writes causes disk to 'disappear'
  2010-05-19 13:23                 ` Leif Sawyer
@ 2010-05-25 22:58                   ` Leif Sawyer
  2010-06-01 16:44                     ` Leif Sawyer
  0 siblings, 1 reply; 6+ messages in thread
From: Leif Sawyer @ 2010-05-25 22:58 UTC (permalink / raw)
  To: linux-scsi; +Cc: James.Bottomley, Hannes Reinecke

On Wed, May 19, 2010 at 5:23 AM, Leif Sawyer <ak.hepcat+scsi@gmail.com> wrote:
> looks like the 75% mark might have been too high of an estimate.
> Whipped up a quick logger to show me when i was failing:
>
> <user.info<14>>May 18 17:08:01 websniff-6036a5 logger: disk: /data at
> 59% utilization
> <user.info<14>>May 18 17:09:01 websniff-6036a5 logger: disk: /data at
> 59% utilization
> <user.info<14>>May 18 17:10:01 websniff-6036a5 logger: disk: /data at
> 60% utilization
> <user.info<14>>May 18 17:11:01 websniff-6036a5 logger: disk: /data at
> 60% utilization
> [22563.204037] INFO: task flush-8:16:2430 blocked for more than 120 seconds.
> [22563.224392] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [22563.248117] INFO: task dumpcap:4004 blocked for more than 120 seconds.
> [22563.267662] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [22563.291359] INFO: task df:14714 blocked for more than 120 seconds.
> [22563.309874] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [22563.333593] INFO: task websniff.cgi:14717 blocked for more than 120 seconds.
> [22563.354690] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [22682.229526] end_request: I/O error, dev sdb, sector 169922743
> [22682.247345] Buffer I/O error on device sdb1, logical block 21240335
> [22682.266781] end_request: I/O error, dev sdb, sector 170131519
> [22682.284445] Buffer I/O error on device sdb1, logical block 21266432
> [....... repeats until......]
> [22682.782577] sd 3:0:1:0: rejecting I/O to offline device
> [22682.798907] sd 3:0:1:0: rejecting I/O to offline device
>
> And from here on out, the device is no longer recognized by the system
> until a reboot.
>
> I need some help with scsi debugging in order to provide more useful
> information.
>
> I do have a 512mb logfile (text)  with lots of scsi dump card state
> logs and such, though.
>


Okay, so on a whim, I applied some patches that were recently posted
here that I thought
might have an impact on my particular system (anything generic scsi,
or adaptec-related)

My system has been up since yesterday with those patches applied, and my disk
has been churning at the 100% utilized (with between 600Mb and 75Mb
free at any given time)
with tshark continuously rolling over new capture files  for over 6h.
(which it never did before)

the following patches were applied which were not cosmetic or debug related:

     lct_data->tid assignment
     io_dev->iop assignment
     usg use after kfree
     gdth  goto out_free_ccb_phys  instead of  out_free_coal_stat


If there's interest, i'll back out the patches one at a time and see
which one(s)
cause/bring-back the most instability.



-- 
"It's pronounced Layf...you know, like Leif Garrett? Don't you watch
 'I Love the 70's'? What kind of retro lover are you, anyway?"
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Fwd: high volume of disk-writes causes disk to 'disappear'
  2010-05-25 22:58                   ` Leif Sawyer
@ 2010-06-01 16:44                     ` Leif Sawyer
  0 siblings, 0 replies; 6+ messages in thread
From: Leif Sawyer @ 2010-06-01 16:44 UTC (permalink / raw)
  To: linux-scsi; +Cc: James.Bottomley, Hannes Reinecke

Well, good thing I left it running longer.   It triggered multiple
times over the past week.

So I guess it's not fixed with the simple cleanup patches that had
been posted previously.

I really need some help with scsi debugging to get valid log data out
of this, in order
to isolate the issue.


-L

On Tue, May 25, 2010 at 2:58 PM, Leif Sawyer <ak.hepcat+scsi@gmail.com> wrote:
> On Wed, May 19, 2010 at 5:23 AM, Leif Sawyer <ak.hepcat+scsi@gmail.com> wrote:
>> looks like the 75% mark might have been too high of an estimate.
>> Whipped up a quick logger to show me when i was failing:
>>
>> <user.info<14>>May 18 17:08:01 websniff-6036a5 logger: disk: /data at
>> 59% utilization
>> <user.info<14>>May 18 17:09:01 websniff-6036a5 logger: disk: /data at
>> 59% utilization
>> <user.info<14>>May 18 17:10:01 websniff-6036a5 logger: disk: /data at
>> 60% utilization
>> <user.info<14>>May 18 17:11:01 websniff-6036a5 logger: disk: /data at
>> 60% utilization
>> [22563.204037] INFO: task flush-8:16:2430 blocked for more than 120 seconds.
>> [22563.224392] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [22563.248117] INFO: task dumpcap:4004 blocked for more than 120 seconds.
>> [22563.267662] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [22563.291359] INFO: task df:14714 blocked for more than 120 seconds.
>> [22563.309874] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [22563.333593] INFO: task websniff.cgi:14717 blocked for more than 120 seconds.
>> [22563.354690] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [22682.229526] end_request: I/O error, dev sdb, sector 169922743
>> [22682.247345] Buffer I/O error on device sdb1, logical block 21240335
>> [22682.266781] end_request: I/O error, dev sdb, sector 170131519
>> [22682.284445] Buffer I/O error on device sdb1, logical block 21266432
>> [....... repeats until......]
>> [22682.782577] sd 3:0:1:0: rejecting I/O to offline device
>> [22682.798907] sd 3:0:1:0: rejecting I/O to offline device
>>
>> And from here on out, the device is no longer recognized by the system
>> until a reboot.
>>
>> I need some help with scsi debugging in order to provide more useful
>> information.
>>
>> I do have a 512mb logfile (text)  with lots of scsi dump card state
>> logs and such, though.
>>
>
>
> Okay, so on a whim, I applied some patches that were recently posted
> here that I thought
> might have an impact on my particular system (anything generic scsi,
> or adaptec-related)
>
> My system has been up since yesterday with those patches applied, and my disk
> has been churning at the 100% utilized (with between 600Mb and 75Mb
> free at any given time)
> with tshark continuously rolling over new capture files  for over 6h.
> (which it never did before)
>
> the following patches were applied which were not cosmetic or debug related:
>
>     lct_data->tid assignment
>     io_dev->iop assignment
>     usg use after kfree
>     gdth  goto out_free_ccb_phys  instead of  out_free_coal_stat
>
>
> If there's interest, i'll back out the patches one at a time and see
> which one(s)
> cause/bring-back the most instability.
>
>
>
> --
> "It's pronounced Layf...you know, like Leif Garrett? Don't you watch
>  'I Love the 70's'? What kind of retro lover are you, anyway?"
>



-- 
"It's pronounced Layf...you know, like Leif Garrett? Don't you watch
 'I Love the 70's'? What kind of retro lover are you, anyway?"
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2010-06-01 16:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-19 18:38 high volume of disk-writes causes disk to 'disappear' Leif Sawyer
     [not found] ` <w2od4bc8f581004271446md75f0dcbt537956a2e3f9abb7@mail.gmail.com>
     [not found]   ` <4BD99BFD.9090202@suse.de>
     [not found]     ` <q2md4bc8f581004291011k43d5c03cpb7361473f6de4350@mail.gmail.com>
     [not found]       ` <4BDAE41D.6030707@suse.de>
     [not found]         ` <m2gd4bc8f581004301627jea793d0ax1e63bc3796458662@mail.gmail.com>
     [not found]           ` <i2yd4bc8f581005031143r638b8037ra3778f97e6d2e25d@mail.gmail.com>
2010-05-15 19:09             ` Fwd: " Leif Sawyer
2010-05-17 23:40               ` Leif Sawyer
2010-05-19 13:23                 ` Leif Sawyer
2010-05-25 22:58                   ` Leif Sawyer
2010-06-01 16:44                     ` Leif Sawyer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.