Re: [PATCH] ext4: Fix entry corruption when disk online and offline frequently

From: "Theodore Y. Ts'o" <tytso@mit.edu>
To: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Cc: adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org
Subject: Re: [PATCH] ext4: Fix entry corruption when disk online and offline frequently
Date: Thu, 22 Aug 2019 23:02:07 -0400	[thread overview]
Message-ID: <20190823030207.GC8130@mit.edu> (raw)
In-Reply-To: <20190517225940.GC21961@mit.edu>

I've applied this patch with the modified patch summary:

ext4: treat buffers with write errors as containing valid data

Cheers,

					- Ted

On Fri, May 17, 2019 at 06:59:40PM -0400, Theodore Ts'o wrote:
> On Tue, May 14, 2019 at 12:23:37PM +0800, ZhangXiaoxu wrote:
> > I got some errors when I repair an ext4 volume which stacked by an
> > iscsi target:
> >     Entry 'test60' in / (2) has deleted/unused inode 73750.  Clear?
> > It can be reproduced when the network not good enough.
> > 
> > When I debug this I found ext4 will read entry buffer from disk and
> > the buffer is marked with write_io_error.
> > 
> > If the buffer is marked with write_io_error, it means it already
> > wroten to journal, and not checked out to disk. IOW, the journal
> > is newer than the data in disk.
> > If this journal record 'delete test60', it means the 'test60' still
> > on the disk metadata.
> > 
> > In this case, if we read the buffer from disk successfully and create
> > file continue, the new journal record will overwrite the journal
> > which record 'delete test60', then the entry corruptioned.
> > 
> > So, use the buffer rather than read from disk if the buffer marked
> > with write_io_error
> 
> You've raised a number of issues about how we handle write errors,
> especially when they occur due to a flaky transport --- in your case,
> due to iSCSI.  As such, your patch isn't wrong, so much as it is
> incomplete.
> 
> For example, your assumption that if the buffer is marked
> write_io_error, it's safe to clear write_io_error and reset
> buffer_uptodate assumes that journalling is enabled.  If the file
> system does not have the journal, there is no journal to fall back
> upon.  For file systems which do have a journal, if you are using a
> flaky iSCSI transport, there is no protection from write errors which
> occur when the journal is replayed.  (fs/jbd2/recovery.c simply marks
> the buffer dirty and allows the writeback code take care of writing
> the buffer.)  This means that the buffer could have write_io_error set
> due to a failure to write the buffer during recovery, in which case
> relying on the journal having a uptodate copy block is invalid.
> 
> Also, this patch only patches the ex4_bread() path, which is only used
> by directories.  It doesn't deal with metadata reads for allocation
> bitmaps or extent tree blocks.  We are doing this hack for inode table
> blocks, already; perhaps you got the idea to do this for ext4_bread()
> from __ext4_get_inode_loc()?
> 
> We could add some kind of callback from the buffer cache layer when an
> aysnchronous writeback fails --- or we could use a synchronous write
> in the journal recovery code (which would be bad from a performance
> perspective, but ignore that for the moment) --- however, what do we
> do when we discover that there is an error?  Right now, we do nothing
> until we try to read the inode table block (and after your patch,
> reading a directory block).  Under memory pressure, though, the data
> will get lost and we don't even mark the file system as needing to be
> checked.  We could retry the write, but if it's due to a flaky iSCSI
> or FC transport, this write could fail yet again --- and then what?
> 
> So while I could apply this patch, since it doesn't make things worse,
> I want to make sure you are aware that if you have problems with your
> iSCSI device, this patch is far from a complete solution.  At the very
> least, we should handle reads for other metadata block.  
> 
>       	      	   	    	       		  - Ted