Linux-ext4 Archive on lore.kernel.org
 help / color / Atom feed
From: "Theodore Y. Ts'o" <tytso@mit.edu>
To: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Cc: adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org
Subject: Re: [PATCH] ext4: Fix entry corruption when disk online and offline frequently
Date: Thu, 22 Aug 2019 23:02:07 -0400
Message-ID: <20190823030207.GC8130@mit.edu> (raw)
In-Reply-To: <20190517225940.GC21961@mit.edu>

I've applied this patch with the modified patch summary:

ext4: treat buffers with write errors as containing valid data

Cheers,

					- Ted

On Fri, May 17, 2019 at 06:59:40PM -0400, Theodore Ts'o wrote:
> On Tue, May 14, 2019 at 12:23:37PM +0800, ZhangXiaoxu wrote:
> > I got some errors when I repair an ext4 volume which stacked by an
> > iscsi target:
> >     Entry 'test60' in / (2) has deleted/unused inode 73750.  Clear?
> > It can be reproduced when the network not good enough.
> > 
> > When I debug this I found ext4 will read entry buffer from disk and
> > the buffer is marked with write_io_error.
> > 
> > If the buffer is marked with write_io_error, it means it already
> > wroten to journal, and not checked out to disk. IOW, the journal
> > is newer than the data in disk.
> > If this journal record 'delete test60', it means the 'test60' still
> > on the disk metadata.
> > 
> > In this case, if we read the buffer from disk successfully and create
> > file continue, the new journal record will overwrite the journal
> > which record 'delete test60', then the entry corruptioned.
> > 
> > So, use the buffer rather than read from disk if the buffer marked
> > with write_io_error
> 
> You've raised a number of issues about how we handle write errors,
> especially when they occur due to a flaky transport --- in your case,
> due to iSCSI.  As such, your patch isn't wrong, so much as it is
> incomplete.
> 
> For example, your assumption that if the buffer is marked
> write_io_error, it's safe to clear write_io_error and reset
> buffer_uptodate assumes that journalling is enabled.  If the file
> system does not have the journal, there is no journal to fall back
> upon.  For file systems which do have a journal, if you are using a
> flaky iSCSI transport, there is no protection from write errors which
> occur when the journal is replayed.  (fs/jbd2/recovery.c simply marks
> the buffer dirty and allows the writeback code take care of writing
> the buffer.)  This means that the buffer could have write_io_error set
> due to a failure to write the buffer during recovery, in which case
> relying on the journal having a uptodate copy block is invalid.
> 
> Also, this patch only patches the ex4_bread() path, which is only used
> by directories.  It doesn't deal with metadata reads for allocation
> bitmaps or extent tree blocks.  We are doing this hack for inode table
> blocks, already; perhaps you got the idea to do this for ext4_bread()
> from __ext4_get_inode_loc()?
> 
> We could add some kind of callback from the buffer cache layer when an
> aysnchronous writeback fails --- or we could use a synchronous write
> in the journal recovery code (which would be bad from a performance
> perspective, but ignore that for the moment) --- however, what do we
> do when we discover that there is an error?  Right now, we do nothing
> until we try to read the inode table block (and after your patch,
> reading a directory block).  Under memory pressure, though, the data
> will get lost and we don't even mark the file system as needing to be
> checked.  We could retry the write, but if it's due to a flaky iSCSI
> or FC transport, this write could fail yet again --- and then what?
> 
> So while I could apply this patch, since it doesn't make things worse,
> I want to make sure you are aware that if you have problems with your
> iSCSI device, this patch is far from a complete solution.  At the very
> least, we should handle reads for other metadata block.  
> 
>       	      	   	    	       		  - Ted

      reply index

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-14  4:23 ZhangXiaoxu
2019-05-17 22:59 ` Theodore Ts'o
2019-08-23  3:02   ` Theodore Y. Ts'o [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190823030207.GC8130@mit.edu \
    --to=tytso@mit.edu \
    --cc=adilger.kernel@dilger.ca \
    --cc=linux-ext4@vger.kernel.org \
    --cc=zhangxiaoxu5@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-ext4 Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-ext4/0 linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ https://lore.kernel.org/linux-ext4 \
		linux-ext4@vger.kernel.org
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-ext4


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git