linux-kernel.vger.kernel.org archive mirror
* [BUG] Failed writes marked clean?
@ 2002-11-08 20:29 Ross Biro
  2002-11-08 20:53 ` Linus Torvalds
  2002-11-08 20:57 ` Andrew Morton
  0 siblings, 2 replies; 7+ messages in thread
From: Ross Biro @ 2002-11-08 20:29 UTC
  To: linux-kernel


Perhaps I'm reading the code incorrectly, but in kernel versions 2.4.18 
and 2.5.46 it looks to me like in the case of a write, ll_rw_block 
always clears the dirty bit.  In the event of an error, nothing resets 
the dirty bit and the uptodate flag is cleared.  This means that if the 
same block needs to be read again, the buffer cache will see that the 
buffer is not uptodate and attempt to read the old contents of the 
buffer off of the device.  If the read succeeds, the kernel ends up 
corrupting data.

It seems to me that a better solution would be to mark the buffer as 
dirty and uptodate and then attempt to propagate the error as far back 
as possible.  Ideally something can be done to correct the problem at a 
higher level.  Before I dive in and attempt to do something about this, 
I wanted to make sure I was not missing anything important.  So am I 
full of it, or could this really be a problem?

    Ross




* Re: [BUG] Failed writes marked clean?
  2002-11-08 20:29 [BUG] Failed writes marked clean? Ross Biro
@ 2002-11-08 20:53 ` Linus Torvalds
  2002-11-08 20:57 ` Andrew Morton
  1 sibling, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2002-11-08 20:53 UTC
  To: linux-kernel

In article <3DCC1EB5.4020303@google.com>, Ross Biro <rossb@google.com> wrote:
>
>Perhaps I'm reading the code incorrectly, but in kernel versions 2.4.18 
>and 2.5.46 it looks to me like in the case of a write, ll_rw_block 
>always clears the dirty bit.  In the event of an error, nothing resets 
>the dirty bit and the uptodate flag is cleared.

Correct. 

There's not all that much else it could do. Keeping the dirty bit set is
not an option - that would bring the whole system down on IO errors.

As it is, higher layers that care _can_ figure the IO error out, simply
by noticing that the page is not up-to-date after the write. It's then
totally up to the higher layers (ie user space) to write the thing anew
if it cares about the data.

(In other words: this is why we have fsync() and error codes).

		Linus
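
A minimal userspace sketch of the point about fsync() and error codes, not part of the original message. It assumes a POSIX system; the names write_all and durable_write are hypothetical helpers invented for this illustration. The idea is only that an application which cares about its data checks the return value of every write() and of fsync(), and treats a failure as "the data may not be on disk".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes, retrying short writes and EINTR.
 * Returns 0 on success, -1 with errno set on failure. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: write buf to path and only report success once
 * fsync() has also succeeded.  Returns 0 on success, -1 on any error. */
int durable_write(const char *path, const char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* An I/O error may only surface at fsync(), so both calls are checked. */
    if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

What the caller does on failure (rewrite the data, pick another device, alert someone) is the policy decision the message leaves to the higher layers.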


* Re: [BUG] Failed writes marked clean?
  2002-11-08 20:29 [BUG] Failed writes marked clean? Ross Biro
  2002-11-08 20:53 ` Linus Torvalds
@ 2002-11-08 20:57 ` Andrew Morton
  2002-11-08 21:30   ` Ross Biro
  2002-11-08 23:35   ` Theodore Ts'o
  1 sibling, 2 replies; 7+ messages in thread
From: Andrew Morton @ 2002-11-08 20:57 UTC
  To: Ross Biro; +Cc: linux-kernel

Ross Biro wrote:
> 
> Perhaps I'm reading the code incorrectly, but in kernel versions 2.4.18
> and 2.5.46 it looks to me like in the case of a write, ll_rw_block
> always clears the dirty bit.  In the event of an error, nothing resets
> the dirty bit and the uptodate flag is cleared.  This means that if the
> same block needs to be read again, the buffer cache will see that the
> buffer is not uptodate and attempt to read the old contents of the
> buffer off of the device.  If the read succeeds, the kernel ends up
> corrupting data.

That's correct, for metadata.  It may not be fully accurate for
file data, where the page state comes into play as well.

The handling of IO errors is very weird.  Especially for writes.
And poorly tested.  It needs a big revamp and testing.

> It seems to me that a better solution would be to mark the buffer as
> dirty and uptodate and then attempt to propagate the error as far back
> as possible.  Ideally something can be done to correct the problem at a
> higher level.  Before I dive in and attempt to do something about this,
> I wanted to make sure I was not missing anything important.  So am I
> full of it, or could this really be a problem?
> 

Well before going and changing stuff, we need to decide what to
change it _to_.  What do we want to happen if there's a read error?
And a write error?

For reads, it makes sense for the page/buffer to be left not uptodate,
and return an error.

For write errors, marking the page/buffer not uptodate doesn't make
a lot of sense.  Marking it clean makes sense if we're not going to retry
the write.  Marking it dirty, uptodate and unmapped would make sense
if we want to go and try a different part of the disk.  But it
doesn't make sense if the whole disk is dead.

Also, think about what a write error _means_.  Unless the disk is truly
ancient, it means that the device has run out of alternate space for
the block, or all writes are failing.  ie: it is a serious failure.

So perhaps the appropriate strategy on write errors is to mark the
device readonly and to drop all write data on the floor.  That means
clean+mapped+uptodate.

So yes, I think I agree with myself.  Write errors should leave the
page/buffer clean, uptodate, mapped, PageError (whatever the latter
means...)
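
The three outcomes discussed above can be laid side by side in a small sketch, which is not part of the original message. It is plain userspace C; the flag and policy names are hypothetical and chosen for this illustration, and they are not the kernel's buffer_head bit definitions. It assumes the behaviour as described in this thread: the dirty bit is cleared when the write is submitted and the uptodate flag is cleared on error.

```c
#include <assert.h>

/* Hypothetical flag names; they mirror the states named in the thread. */
enum buf_flag {
    BUF_UPTODATE = 1 << 0,
    BUF_DIRTY    = 1 << 1,
    BUF_MAPPED   = 1 << 2,
    BUF_ERROR    = 1 << 3
};

/* Hypothetical names for the policies discussed in the thread. */
enum write_error_policy {
    POLICY_AS_REPORTED,          /* clean and not uptodate */
    POLICY_RETRY_ELSEWHERE,      /* dirty, uptodate, unmapped */
    POLICY_DROP_AND_GO_READONLY  /* clean, uptodate, mapped, error */
};

/* Return the flags a buffer would carry after a failed write. */
unsigned buf_after_write_error(unsigned flags, enum write_error_policy policy)
{
    switch (policy) {
    case POLICY_AS_REPORTED:
        flags &= ~(unsigned)(BUF_DIRTY | BUF_UPTODATE);
        break;
    case POLICY_RETRY_ELSEWHERE:
        flags |= BUF_DIRTY | BUF_UPTODATE;
        flags &= ~(unsigned)BUF_MAPPED;
        break;
    case POLICY_DROP_AND_GO_READONLY:
        flags &= ~(unsigned)BUF_DIRTY;
        flags |= BUF_UPTODATE | BUF_MAPPED | BUF_ERROR;
        break;
    default:
        assert(0 && "unknown policy");
    }
    return flags;
}

/* A later read goes back to the device only if the cached copy is not
 * uptodate, which is how stale on-disk contents can replace newer data. */
int buf_needs_device_read(unsigned flags)
{
    return !(flags & BUF_UPTODATE);
}
```

Under POLICY_AS_REPORTED a later read is sent to the device, which is the case Ross describes; under the other two policies the cached contents stay uptodate and no re-read happens.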


* Re: [BUG] Failed writes marked clean?
  2002-11-08 20:57 ` Andrew Morton
@ 2002-11-08 21:30   ` Ross Biro
  2002-11-08 23:35   ` Theodore Ts'o
  1 sibling, 0 replies; 7+ messages in thread
From: Ross Biro @ 2002-11-08 21:30 UTC
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton wrote:

>Also, think about what a write error _means_.  Unless the disk is truly
>ancient, it means that the device has run out of alternate space for
>the block, or all writes are failing.  ie: it is a serious failure.
>  
>
I've seen all sorts of interesting drive failure modes, including losing 
communications with the drive for a short period and then having it come 
back almost as good as new.  We've had some data corruption on flaky 
drives and I'm guessing this has something to do with it.

I'm going to sit down with our application developers and see what they 
want to see from their end and see what I can do.

    Ross




* Re: [BUG] Failed writes marked clean?
  2002-11-08 20:57 ` Andrew Morton
  2002-11-08 21:30   ` Ross Biro
@ 2002-11-08 23:35   ` Theodore Ts'o
  2002-11-09  1:29     ` Bernd Eckenfels
  2002-11-12 20:04     ` Pavel Machek
  1 sibling, 2 replies; 7+ messages in thread
From: Theodore Ts'o @ 2002-11-08 23:35 UTC
  To: Andrew Morton; +Cc: Ross Biro, linux-kernel

On Fri, Nov 08, 2002 at 12:57:19PM -0800, Andrew Morton wrote:
> Well before going and changing stuff, we need to decide what to
> change it _to_.  What do we want to happen if there's a read error?
> And a write error?
> 
> For reads, it makes sense for the page/buffer to be left not uptodate,
> and return an error.

In some circumstances, it may actually make sense to try writing a
random block of data to the disk, since that may force the disk to
remap the block.  (Disks generally only remap a block from the pool of
spare blocks on writes, not on reads.)

Unfortunately, if the error was just a transient one, you might end up
smashing the block when you write random garbage in an attempt to
remap the block.  So perhaps the answer is to retry the read, and if
that fails, *then* try to do a forced rewrite of the block.

The next question is whether to do this in userspace or in the kernel.
And if in the kernel, whether it should be done at the device driver
layer, or in the block I/O layer, or in the filesystem?  

I can make a case for doing it in userspace, since that gives us the
most flexibility, and it gives us ample opportunity to do
special things, such as paging an operator for help, etc.  On the
other hand, there are arguments for doing it in the kernel.  It may be
that an appropriately clever filesystem might be able to do more
intelligent recovery while keeping the filesystem mounted.  

						- Ted
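
The retry-first ordering suggested above could look roughly like this from userspace. This is a sketch that is not part of the original message: it assumes a POSIX system, the names read_block_with_retry and read_outcome are hypothetical, and it deliberately stops short of the forced rewrite, only reporting that the retries were exhausted so the caller can decide.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result type for this sketch. */
enum read_outcome {
    READ_OK,                      /* a full block was read */
    READ_FAILED_CONSIDER_REWRITE  /* every attempt failed */
};

/* Try to read one block up to max_tries times.  Any error or short read
 * (including one at end of file) counts as a failed attempt. */
enum read_outcome read_block_with_retry(int fd, void *buf, size_t len,
                                        off_t offset, int max_tries)
{
    assert(buf != NULL && max_tries > 0);

    for (int attempt = 0; attempt < max_tries; attempt++) {
        ssize_t n = pread(fd, buf, len, offset);
        if (n == (ssize_t)len)
            return READ_OK;
    }
    /* Only now would a forced rewrite of the block be considered. */
    return READ_FAILED_CONSIDER_REWRITE;
}
```

Keeping the rewrite decision out of the helper leaves room for the operator-in-the-loop handling that the userspace argument depends on.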


* Re: [BUG] Failed writes marked clean?
  2002-11-08 23:35   ` Theodore Ts'o
@ 2002-11-09  1:29     ` Bernd Eckenfels
  2002-11-12 20:04     ` Pavel Machek
  1 sibling, 0 replies; 7+ messages in thread
From: Bernd Eckenfels @ 2002-11-09  1:29 UTC
  To: linux-kernel

In article <20021108233530.GA23888@think.thunk.org> you wrote:
> The next question is whether to do this in userspace or in the kernel.

One idea would be to lock/mark the block in the buffer so it won't be used by
the kernel. Userspace could then read out the locked buffers and decide what
to do (like writing to them). It would be especially good if userspace could
get all the details about the expected content (like inode/redir/dentry/data
block of file x).

Greetings
Bernd


* Re: [BUG] Failed writes marked clean?
  2002-11-08 23:35   ` Theodore Ts'o
  2002-11-09  1:29     ` Bernd Eckenfels
@ 2002-11-12 20:04     ` Pavel Machek
  1 sibling, 0 replies; 7+ messages in thread
From: Pavel Machek @ 2002-11-12 20:04 UTC
  To: Theodore Ts'o, Andrew Morton, Ross Biro, linux-kernel

Hi!

> In some circumstances, it may actually make sense to try writing a
> random block of data to the disk, since that may force the disk to
> remap the block.  (Disks generally only remap a block from the pool of
> spare blocks on writes, not on reads.)
> 
> Unfortunately, if the error was just a transient one, you might end up
> smashing the block when you write random garbage in an attempt to
> remap the block.  So perhaps the answer is to retry the read, and if
> that fails, *then* try to do a forced rewrite of the block.
> 

Retrying is not enough. I've seen a notebook
overheating: its CPU was still okay, but the HDD
was too hot and started acting crazy.  I got
away with 2 bad blocks and the FS survived. If
the kernel had tried to do something clever, it
would probably have made the corruption much worse.
				Pavel

-- 
				Pavel
My velo broke, so I got Zaurus. If you have Philips Velo 1 you don't need...

