linux-kernel.vger.kernel.org archive mirror
* Disk errors and Reiserfs
From: Brian @ 2001-09-16 23:29 UTC
  To: linux-kernel

Device 08:11 not ready.
 I/O error: dev 08:11, sector 26908624
Device 08:11 not ready.
 I/O error: dev 08:11, sector 121208
Device 08:11 not ready.
 I/O error: dev 08:11, sector 26908624
Device 08:11 not ready.
 I/O error: dev 08:11, sector 278936
vs-13050: reiserfs_update_sd: i/o failure occurred trying to update [487 
175497 0x0 SD] stat data<6>Device 08:11 not ready.
 I/O error: dev 08:11, sector 75432
vs-13050: reiserfs_update_sd: i/o failure occurred trying to update [260 
487 0x0 SD] stat data<6>Device 08:11 not ready.
 I/O error: dev 08:11, sector 65680
journal-712: buffer write failed
kernel BUG at prints.c:332!

Basically, one of the server's drives (not the root one, though) stopped 
responding.  It seems better after a power cycle, but it definitely 
appeared to be a hardware problem.
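
For anyone reading the log above: 2.4-era kernels printed the failing device
as major:minor, and the sketch below assumes both fields are hexadecimal.
Under that assumption 08:11 is major 8, minor 17, which in the conventional
SCSI disk layout (major 8, 16 minors per disk) would be the first partition
of the second disk.  The function names are hypothetical helpers for
illustration, not kernel or library APIs.

```python
def decode_kdev(name):
    """Split an 'MM:mm' device string into (major, minor), assuming hex fields."""
    major_hex, minor_hex = name.split(":")
    return int(major_hex, 16), int(minor_hex, 16)


def scsi_disk_name(major, minor):
    """Map a SCSI disk minor number (major 8) to an sdXN-style name."""
    if major != 8:
        raise ValueError("this sketch only handles SCSI disk major 8")
    # Assumption: 16 minors per disk, partition 0 meaning the whole disk.
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    return name if part == 0 else name + str(part)


major, minor = decode_kdev("08:11")
print(major, minor, scsi_disk_name(major, minor))  # 8 17 sdb1
```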

My issue, though, is Linux did not handle it well.  Userspace actually has 
an 'EIO' error code for this situation but, instead, any program touching 
the mounted partition hung in a D state.
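
A minimal sketch of what the EIO path could look like from userspace,
assuming the filesystem propagates the error to the caller rather than
blocking it.  write_and_sync is a hypothetical helper name; the only real
interfaces used are Python's os.open, os.write, os.fsync and the errno
module.

```python
import errno
import os


def write_and_sync(path, data):
    """Write data to path and fsync it.

    Returns 0 on success, or the errno value (for example errno.EIO)
    if the kernel reported a failure.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as exc:
        return exc.errno
    try:
        remaining = data
        while remaining:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, remaining)
            remaining = remaining[written:]
        # fsync is where a failed write-back is typically reported.
        os.fsync(fd)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)
```

A caller would compare the result against errno.EIO and then decide whether
to retry, log, or give up; none of that is possible if the process never
gets the error back.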

You can't kill the processes; you can't unmount the partition; you 
consequently can't reboot the box in any normal manner.  The box was in a 
pretty broken, unusable state.

Is it possible for the kernel to handle this with enough grace that you 
can kill the processes and unmount the partition?  (Thus allowing the box 
to continue in a hobbled, but functional manner.)  Failing that, is it 
possible for the kernel to handle it well enough for 'shutdown' to cleanly 
shut down the box?

Thank you
	-- Brian

* Re: Disk errors and Reiserfs
From: Alan Cox @ 2001-09-17  0:40 UTC
  To: Brian; +Cc: linux-kernel

> My issue, though, is Linux did not handle it well.  Userspace actually has 
> an 'EIO' error code for this situation but, instead, any program touching 
> the mounted partition hung in a D state.

That's a reiserfs property and one you'll find in pretty much any other
fs.

> Is it possible for the kernel to handle this with enough grace that you 
> can kill the processes and unmount the partition?  (Thus allowing the box 
> to continue in a hobbled, but functional manner.)  Failing that, is it 
> possible for the kernel to handle it well enough for 'shutdown' to cleanly 
> shut down the box?

Killing the process isn't necessary; it's been halted in its tracks. As to
a clean shutdown - no chance. You've just hit a disk failure, the on disk
state is not precisely known, writes have been lost. Nothing is going to
make a clean shutdown possible under such circumstances.

Alan

* Re: Disk errors and Reiserfs
From: Brian @ 2001-09-17  2:32 UTC
  To: Alan Cox; +Cc: linux-kernel

On Sunday 16 September 2001 08:40 pm, Alan Cox wrote:
> > My issue, though, is Linux did not handle it well.  Userspace actually
> > has an 'EIO' error code for this situation but, instead, any program
> > touching the mounted partition hung in a D state.
>
> That's a reiserfs property and one you'll find in pretty much any other
> fs.

Is that the best approach?  Userspace doesn't always handle EIO well, but 
it can't do much worse than a permanent deadlock.

> > Is it possible for the kernel to handle this with enough grace that
> > you can kill the processes and unmount the partition?  (Thus allowing
> > the box to continue in a hobbled, but functional manner.)  Failing that,
> > is it possible for the kernel to handle it well enough for 'shutdown'
> > to cleanly shut down the box?
>
> Killing the process isn't necessary; it's been halted in its tracks. As
> to a clean shutdown - no chance. You've just hit a disk failure, the on
> disk state is not precisely known, writes have been lost. Nothing is
> going to make a clean shutdown possible under such circumstances.

Let me rephrase: cleanish.  At least enough to let 'shutdown' actually 
shut down and surviving filesystems to unmount cleanly.  I guess this would 
require all file ops to that filesystem to return EIO and umount would 
drop the filesystem as-is.

	-- Brian

* Re: Disk errors and Reiserfs
From: Guus Sliepen @ 2001-09-17 10:45 UTC
  To: Alan Cox; +Cc: linux-kernel

On Mon, Sep 17, 2001 at 01:40:36AM +0100, Alan Cox wrote:

> > Is it possible for the kernel to handle this with enough grace that you 
> > can kill the processes and unmount the partition?  (Thus allowing the box 
> > to continue in a hobbled, but functional manner.)  Failing that, is it 
> > possible for the kernel to handle it well enough for 'shutdown' to cleanly 
> > shut down the box?
> 
> Killing the process isn't necessary; it's been halted in its tracks. As to
> a clean shutdown - no chance. You've just hit a disk failure, the on disk
> state is not precisely known, writes have been lost. Nothing is going to
> make a clean shutdown possible under such circumstances.

Of course. But I did notice that (for ext2) the filesystem dirty flag is not
set if there are errors from the underlying block device, only when it actually
detects some corruption. So these errors will not trigger an appropriate
response like remounting read-only or fscking on reboot.

-- 
Met vriendelijke groet / with kind regards,
  Guus Sliepen <guus@sliepen.warande.net>


* Re: Disk errors and Reiserfs
From: Stephen C. Tweedie @ 2001-09-18 10:17 UTC
  To: Alan Cox; +Cc: Brian, linux-kernel, Stephen Tweedie

Hi,

On Mon, Sep 17, 2001 at 01:40:36AM +0100, Alan Cox wrote:
> > My issue, though, is Linux did not handle it well.  Userspace actually has 
> > an 'EIO' error code for this situation but, instead, any program touching 
> > the mounted partition hung in a D state.
> 
> That's a reiserfs property and one you'll find in pretty much any other
> fs.

No --- ext2 and ext3 will propagate EIO up to the application.  We've
also spent a lot of effort making sure that ext2 won't ever panic even
if the IO succeeds but returns bogus data (disk, cable or controller
faults).  Disk failures should never cause process kernel hangs, any
more than bogus network packets should.

> Killing the process isn't necessary; it's been halted in its tracks. As to
> a clean shutdown - no chance. You've just hit a disk failure, the on disk
> state is not precisely known, writes have been lost. Nothing is going to
> make a clean shutdown possible under such circumstances.

Why not?  ext2 lets you select between three behaviours on detecting
such an error: continue (the fs is marked as having errors and will be
fscked on the next boot, as long as we can write the error flag to the
superblock); remount-readonly (we fail the IO and force the fs
readonly, but otherwise continue as above); or panic immediately.  As
long as you've selected continue or continue-ro, you should be able to
unmount the disk as soon as you've killed any processes still
accessing it.  I've also spent a lot of effort making sure that the
backoff-and-remount-readonly code in ext3 is solid, too.  I don't 
regard a kernel lockup as a necessary response to disk failure.
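
The three behaviours described above correspond to the ext2/ext3 mount
options errors=continue, errors=remount-ro and errors=panic (a default can
also be stored in the superblock with tune2fs -e).  As a rough sketch,
assuming a /proc/mounts-style line, the selected policy can be read back
like this.  errors_policy is a hypothetical helper and the sample line is
made up for illustration.

```python
def errors_policy(mounts_line):
    """Return the errors= value from one /proc/mounts-style line, or None."""
    fields = mounts_line.split()
    if len(fields) < 4:
        return None
    # Field 4 is the comma-separated mount option list.
    for option in fields[3].split(","):
        key, _, value = option.partition("=")
        if key == "errors":
            return value
    return None


# Hypothetical entry; a real one would come from /proc/mounts.
sample = "/dev/sdb1 /data ext3 rw,errors=remount-ro 0 0"
print(errors_policy(sample))  # remount-ro
```

A result of None only means the option was not listed on that line; in that
case the default recorded in the superblock would apply.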

Cheers,
 Stephen
