Linux-ext4 Archive on lore.kernel.org
 help / color / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: Ted Tso <tytso@mit.edu>,
	linux-ext4@vger.kernel.org, Eric Whitney <enwlinux@gmail.com>,
	linux-fsdevel@vger.kernel.org,
	"Darrick J . Wong" <djwong@kernel.org>
Subject: Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure
Date: Wed, 14 Apr 2021 08:45:31 +1000
Message-ID: <20210413224531.GE63242@dread.disaster.area> (raw)
In-Reply-To: <20210413091122.GA15752@quack2.suse.cz>

On Tue, Apr 13, 2021 at 11:11:22AM +0200, Jan Kara wrote:
> On Tue 13-04-21 07:50:24, Dave Chinner wrote:
> > On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> > > Eric has noticed that after pagecache read rework, generic/418 is
> > > occasionally failing for ext4 when blocksize < pagesize. In fact, the
> > > pagecache rework just made hard to hit race in ext4 more likely. The
> > > problem is that since ext4 conversion of direct IO writes to iomap
> > > framework (commit 378f32bab371), we update inode size after direct IO
> > > write only after invalidating page cache. Thus if buffered read sneaks
> > > at unfortunate moment like:
> > > 
> > > CPU1 - write at offset 1k                       CPU2 - read from offset 0
> > > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
> > >                                                 ext4_readpage();
> > > ext4_handle_inode_extension()
> > > 
> > > the read will zero out tail of the page as it still sees smaller inode
> > > size and thus page cache becomes inconsistent with on-disk contents with
> > > all the consequences.
> > > 
> > > Fix the problem by moving inode size update into end_io handler which
> > > gets called before the page cache is invalidated.
> > 
> > Confused.
> > 
> > This moves all the inode extension stuff into the completion
> > handler, when all that really needs to be done is extending
> > inode->i_size to tell the world there is data up to where the
> > IO completed. Actually removing the inode from the orphan list
> > does not need to be done in the IO completion callback, because...
> > 
> > >  	if (ilock_shared)
> > >  		iomap_ops = &ext4_iomap_overwrite_ops;
> > > -	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > > -			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> > > -	if (ret == -ENOTBLK)
> > > -		ret = 0;
> > > -
> > >  	if (extend)
> > > -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > > +		dio_ops = &ext4_dio_extending_write_ops;
> > >  
> > > +	ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> > > +			   (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
> >                             ^^^^^^                    ^^^^^^^^^^^^^^^^^^^ 
> > 
> > .... if we are doing an extending write, we force DIO to complete
> > before returning. Hence even AIO will block here on an extending
> > write, and hence we can -always- do the correct post-IO completion
> > orphan list cleanup here because we know a) the original IO size and
> > b) the amount of data that was actually written.
> > 
> > Hence all that remains is closing the buffered read vs invalidation
> > race. All this requires is for the dio write completion to behave
> > like XFS where it just does the inode->i_size update for extending
> > writes. THis means the size is updated before the invalidation, and
> > hence any read that occurs after the invalidation but before the
> > post-eof blocks have been removed will see the correct size and read
> > the tail page(s) correctly. This closes the race window, and the
> > caller can still handle the post-eof block cleanup as it does now.
> > 
> > Hence I don't see any need for changing the iomap infrastructure to
> > solve this problem. This seems like the obvious solution to me, so
> > what am I missing?
> 
> All that you write above is correct. The missing piece is: If everything
> succeeded and all the cleanup we need is removing inode from the orphan
> list (common case), we want to piggyback that orphan list removal into the
> same transaction handle as the update of the inode size. This is just a
> performance thing, you are absolutely right we could also do the orphan
> cleanup unconditionally in ext4_dio_write_iter() and thus avoid any changes
> to the iomap framework.

Doesn't ext4, like XFS, keep two copies of the inode size? One for
the on-disk size and one for the in-memory size?

/me looks...

Yeah, there's ei->i_disksize that reflects the on-disk size.

And I note that the first thing that ext4_handle_inode_extension()
is already checking that the write is extending past the current
on-disk inode size before running the extension transaction.

The page cache only cares about the inode->i_size value, not the
ei->i_disksize value, so you can update them independently and still
have things work correctly. That's what XFS does in
xfs_dio_write_end_io - it updates the in-memory inode->i_size, then
runs a transaction to atomically update the inode on-disk inode
size. Updating the VFS inode size first protects against buffered
read races while updating the on-disk size...

So for ext4, the two separate size updates don't need to be done at
the same time - you have all the state you need in the ext4 dio
write path to extend the on-disk file size on successful extending
write, and it is not dependent in any way on the current in-memory
VFS inode size that you'd update in the ->end_io callback....

> OK, now that I write about this, maybe I was just too hung up on the
> performance improvement. Probably a better way forward is that I just fix
> the data corruption bug only inside ext4 (that will be also much easier to
> backport) and then submit the performance improvement modifying iomap if I
> can actually get performance data justifying it. Thanks for poking into
> this :)

Sounds like a good plan :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply index

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-12 10:23 [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read Jan Kara
2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara
2021-04-12 14:07   ` kernel test robot
2021-04-12 14:12   ` kernel test robot
2021-04-12 16:37   ` kernel test robot
2021-04-12 10:23 ` [PATCH 2/3] ext4: Fix occasional generic/418 failure Jan Kara
2021-04-12 21:50   ` Dave Chinner
2021-04-13  9:11     ` Jan Kara
2021-04-13 22:45       ` Dave Chinner [this message]
2021-04-14 11:56         ` Jan Kara
2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara
2021-04-12 11:30   ` Matthew Wilcox
2021-06-16 12:22   ` Theodore Ts'o

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210413224531.GE63242@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=enwlinux@gmail.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-ext4 Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-ext4/0 linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ https://lore.kernel.org/linux-ext4 \
		linux-ext4@vger.kernel.org
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-ext4


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git