All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eryu Guan <eguan@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: skip free eofblocks if inode is under written back
Date: Wed, 30 Aug 2017 17:12:06 +0800	[thread overview]
Message-ID: <20170830091206.GB27835@eguan.usersys.redhat.com> (raw)
In-Reply-To: <20170830072950.GJ10621@dastard>

On Wed, Aug 30, 2017 at 05:29:50PM +1000, Dave Chinner wrote:
> On Wed, Aug 30, 2017 at 12:26:28PM +0800, Eryu Guan wrote:
> > xfs_vm_writepages() saves a cached imap in a writeback context
> > structure and passes it to xfs_do_writepage() for every page in this
> > writeback, so xfs_writepage_map() (called by xfs_do_writepage())
> > doesn't have to lookup & recalculate the mapping for every buffer it
> > writes if the cached imap is considered as valid.
> > 
> > But there's a chance that the cached imap becomes stale (doesn't
> > match the real extent entry) due to the removing of speculative
> > preallocated blocks, in this case xfs_imap_valid() still reports
> > imap as valid, because the buffer it's writing is still within the
> > cached imap range. This could result in fs corruption and data
> > corruption (data written to wrong block).
> > 
> > For example, the following sequences make it possible (assuming 4k
> > block size XFS):
> > 
> >   1. delalloc write 1072 blocks, with speculative preallocation,
> >   2032 blocks got allocated
> > 
> >   2. started a WB_SYNC_NONE writeback (so it wrote all
> >   PAGECACHE_TAG_DIRTY pages, additional dirty page from step 4 can
> >   be picked up, this could be a background writeback, or a bare
> >   SYNC_FILE_RANGE_WRITE sync_file_range() call) on this inode,
> >   filled in br_startblock, converted delayed allocation to real
> >   allocation, but br_blockcount was unchanged, still 2032, and this
> >   imap got cached in writeback context structure for writing
> >   subsequent pages
> 
> Excellent work in tracking this down, Eryu! This has been a
> longstanding issue - we've been caching the writeback map for
> multi-page writeback since, well, longer than I've been working on
> XFS....

If I read the git history correctly, previously the imap was only cached
for buffer heads within the same page, commit fbcc02561359 ("xfs:
Introduce writeback context for writepages") made the cached imap live
longer, across the whole xfs_vm_writepages() call, which made the window
wider.

But I can't reproduce the bug until commit bb18782aa47d ("xfs: build bios
directly in xfs_add_to_ioend"), I didn't dig into it deeply, perhaps it
made the window even wider?

> 
> First thing - checking for PAGECACHE_TAG_WRITEBACK in xfs_release()
> is racy as writeback can start between the check and the EOF blocks
> being trimmed in xfs_free_eofblocks(), so it's not really a solution
> to the problem.

Ah, you're right, I missed that. 

> 
> The root cause of the problem is that xfs_imap_valid() code doesn't
> return false when the extent tree is modified and the map is
> rendered potentially incorrect. If this was to notice we changed the
> extent tree, it would trigger the writeback code to look up a new
> map.
> 
> Hmmmm, that means the scope is probably wider than
> xfs_free_eofblocks() - that's a xfs_bunmapi call, but we've also got
> to consider that other IO operations might change the extent map,
> too.

Agreed, I did think about a way to notify writeback code an extent
change was made, but I didn't know the relevant code well enough to do
this nicely. And I thought free eofblocks in xfs_release() was the only
place that could race with writeback to cause problems, so I thought
simply skipping free eoflblocks was easier.

> 
> Quick patch below, doesn't work on reflink filesystems yet because
> the way it handles cached COW mapping invalidation is unnecessarily
> special and so doesn't work.

I've hit this problem and the same assert failure when I was testing
another version of fix, the usage of xfs_map_cow() cut the imap
validation work flow :)

(The basic idea of another version of my fix is comparing timestamp
saved in struct xfs_inode at free eofblocks time with timestamp saved in
struct xfs_writepage_ctx, if timestamp in xfs_inode is newer than that
in xfs_writepage_ctx, invalidate the cached imap and map it again. But
that still only focused on free of eofblocks, ignored other paths that
might change the map.)

> Can you test it and see if it fixes the
> problem?

Yes, patch works for me, lustre-racer test survived 50 iterations,
previously it would fail within 5 runs.

Thanks!

Eryu

  parent reply	other threads:[~2017-08-30  9:12 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-30  4:26 [PATCH] xfs: skip free eofblocks if inode is under written back Eryu Guan
2017-08-30  7:29 ` Dave Chinner
2017-08-30  7:51   ` Christoph Hellwig
2017-08-30  8:10     ` Dave Chinner
2017-08-30  9:12   ` Eryu Guan [this message]
2017-08-30 14:26     ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170830091206.GB27835@eguan.usersys.redhat.com \
    --to=eguan@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.