From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: Ted Tso <tytso@mit.edu>,
	linux-ext4@vger.kernel.org, Eric Whitney <enwlinux@gmail.com>,
	linux-fsdevel@vger.kernel.org,
	"Darrick J . Wong" <djwong@kernel.org>
Subject: Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure
Date: Wed, 14 Apr 2021 08:45:31 +1000
Message-ID: <20210413224531.GE63242@dread.disaster.area>
In-Reply-To: <20210413091122.GA15752@quack2.suse.cz>

On Tue, Apr 13, 2021 at 11:11:22AM +0200, Jan Kara wrote:
> On Tue 13-04-21 07:50:24, Dave Chinner wrote:
> > On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> > > Eric has noticed that after the pagecache read rework, generic/418 is
> > > occasionally failing for ext4 when blocksize < pagesize. In fact, the
> > > pagecache rework just made a hard-to-hit race in ext4 more likely. The
> > > problem is that since the ext4 conversion of direct IO writes to the
> > > iomap framework (commit 378f32bab371), we update the inode size after a
> > > direct IO write only after invalidating the page cache. Thus if a
> > > buffered read sneaks in at an unfortunate moment like:
> > > 
> > > CPU1 - write at offset 1k                       CPU2 - read from offset 0
> > > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
> > >                                                 ext4_readpage();
> > > ext4_handle_inode_extension()
> > > 
> > > the read will zero out the tail of the page as it still sees the
> > > smaller inode size, and thus the page cache becomes inconsistent with
> > > the on-disk contents, with all the consequences that entails.
> > > 
> > > Fix the problem by moving the inode size update into the end_io
> > > handler, which gets called before the page cache is invalidated.
> > 
> > Confused.
> > 
> > This moves all the inode extension stuff into the completion
> > handler, when all that really needs to be done is extending
> > inode->i_size to tell the world there is data up to where the
> > IO completed. Actually removing the inode from the orphan list
> > does not need to be done in the IO completion callback, because...
> > 
> > >  	if (ilock_shared)
> > >  		iomap_ops = &ext4_iomap_overwrite_ops;
> > > -	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > > -			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> > > -	if (ret == -ENOTBLK)
> > > -		ret = 0;
> > > -
> > >  	if (extend)
> > > -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > > +		dio_ops = &ext4_dio_extending_write_ops;
> > >  
> > > +	ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> > > +			   (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
> >                             ^^^^^^                    ^^^^^^^^^^^^^^^^^^^ 
> > 
> > .... if we are doing an extending write, we force DIO to complete
> > before returning. Hence even AIO will block here on an extending
> > write, and hence we can -always- do the correct post-IO completion
> > orphan list cleanup here because we know a) the original IO size and
> > b) the amount of data that was actually written.
> > 
> > Hence all that remains is closing the buffered read vs invalidation
> > race. All this requires is for the dio write completion to behave
> > like XFS where it just does the inode->i_size update for extending
> > writes. This means the size is updated before the invalidation, and
> > hence any read that occurs after the invalidation but before the
> > post-eof blocks have been removed will see the correct size and read
> > the tail page(s) correctly. This closes the race window, and the
> > caller can still handle the post-eof block cleanup as it does now.
> > 
> > Hence I don't see any need for changing the iomap infrastructure to
> > solve this problem. This seems like the obvious solution to me, so
> > what am I missing?
> 
> All that you write above is correct. The missing piece is: if everything
> succeeded and all the cleanup we need is removing the inode from the
> orphan list (the common case), we want to piggyback that orphan list
> removal into the same transaction handle as the update of the inode size.
> This is just a performance thing; you are absolutely right that we could
> also do the orphan cleanup unconditionally in ext4_dio_write_iter() and
> thus avoid any changes to the iomap framework.

Doesn't ext4, like XFS, keep two copies of the inode size? One for
the on-disk size and one for the in-memory size?

/me looks...

Yeah, there's ei->i_disksize that reflects the on-disk size.

And I note that ext4_handle_inode_extension() already checks that
the write extends past the current on-disk inode size before
running the extension transaction.

The page cache only cares about the inode->i_size value, not the
ei->i_disksize value, so you can update them independently and still
have things work correctly. That's what XFS does in
xfs_dio_write_end_io - it updates the in-memory inode->i_size, then
runs a transaction to atomically update the on-disk inode size.
Updating the VFS inode size first protects against buffered
read races while updating the on-disk size...
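
Stripped right down, the pattern in xfs_dio_write_end_io() looks
something like this (a simplified sketch from memory, not the
verbatim XFS code):

	if (offset + size > i_size_read(inode)) {
		spin_lock(&ip->i_flags_lock);
		if (offset + size > i_size_read(inode)) {
			/* publish the new VFS size first... */
			i_size_write(inode, offset + size);
			spin_unlock(&ip->i_flags_lock);
			/* ...then log the new on-disk size */
			error = xfs_setfilesize(ip, offset, size);
		} else {
			spin_unlock(&ip->i_flags_lock);
		}
	}

The recheck under ip->i_flags_lock is there because completions can
run concurrently and the size should only ever grow.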

So for ext4, the two separate size updates don't need to be done at
the same time - you have all the state you need in the ext4 dio
write path to extend the on-disk file size on a successful extending
write, and it is not dependent in any way on the current in-memory
VFS inode size that you'd update in the ->end_io callback....
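
i.e. the ext4 ->end_io callback only needs to do the in-memory size
update. Something like this quick, untested sketch (function name
and locking details made up for illustration, not actual ext4 code):

static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
				 int error, unsigned int flags)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	loff_t pos = iocb->ki_pos;

	if (error || !size)
		return error;

	/*
	 * Publish the new in-memory size before iomap invalidates
	 * the page cache, so a racing buffered read sees the full
	 * tail page. i_disksize and the orphan list cleanup can
	 * still be done back in ext4_dio_write_iter(), because
	 * extending writes run with IOMAP_DIO_FORCE_WAIT.
	 */
	if (pos + size > i_size_read(inode)) {
		spin_lock(&inode->i_lock);
		if (pos + size > i_size_read(inode))
			i_size_write(inode, pos + size);
		spin_unlock(&inode->i_lock);
	}
	return 0;
}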

> OK, now that I write about this, maybe I was just too hung up on the
> performance improvement. Probably a better way forward is that I just fix
> the data corruption bug only inside ext4 (that will be also much easier to
> backport) and then submit the performance improvement modifying iomap if I
> can actually get performance data justifying it. Thanks for poking into
> this :)

Sounds like a good plan :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 16+ messages
2021-04-12 10:23 [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read Jan Kara
2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara
2021-04-12 14:07   ` kernel test robot
2021-04-12 14:12   ` kernel test robot
2021-04-12 16:37   ` kernel test robot
2021-04-12 10:23 ` [PATCH 2/3] ext4: Fix occasional generic/418 failure Jan Kara
2021-04-12 21:50   ` Dave Chinner
2021-04-13  9:11     ` Jan Kara
2021-04-13 22:45       ` Dave Chinner [this message]
2021-04-14 11:56         ` Jan Kara
2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara
2021-04-12 11:30   ` Matthew Wilcox
2021-06-16 12:22   ` Theodore Ts'o
