All of lore.kernel.org
 help / color / mirror / Atom feed
From: Matthew Wilcox <willy@infradead.org>
To: Brian Foster <bfoster@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	xfs <linux-xfs@vger.kernel.org>,
	fstests <fstests@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: generic/068 crash on 5.18-rc2?
Date: Mon, 2 May 2022 14:00:16 +0100	[thread overview]
Message-ID: <Ym/V4G2RcQd/RmHZ@casper.infradead.org> (raw)
In-Reply-To: <Ym/MEBfa0szil3hW@bfoster>

On Mon, May 02, 2022 at 08:18:24AM -0400, Brian Foster wrote:
> On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 28, 2022 at 11:53:18AM -0400, Brian Foster wrote:
> > > The above is the variant of generic/068 failure I was reproducing and
> > > used to bisect [1]. With some additional tracing added to ioend
> > > completion, what I'm seeing is that the bio_for_each_folio_all() bvec
> > > iteration basically seems to go off the rails. What happens more
> > > specifically is that at some point during the loop, bio_next_folio()
> > > actually lands into the second page of the just processed folio instead
> > > of the actual next folio (i.e. as if it's walking to the next page from
> > > the head page of the folio instead of to the next 16k folio). I suspect
> > > completion is racing with some form of truncation/reclaim/invalidation
> > > here, what exactly I don't know, that perhaps breaks down the folio and
> > > renders the iteration (bio_next_folio() -> folio_next()) unsafe. To test
> > > that theory, I open coded and modified the loop to something like the
> > > following:
> > > 
> > >                 for (bio_first_folio(&fi, bio, 0); fi.folio; ) {
> > >                         f = fi.folio;
> > >                         l = fi.length;
> > >                         bio_next_folio(&fi, bio);
> > >                         iomap_finish_folio_write(inode, f, l, error);
> > >                         folio_count++;
> > >                 }
> > > 
> > > ... to avoid accessing folio metadata after writeback is cleared on it
> > > and this seems to make the problem disappear (so far, I'll need to let
> > > this spin for a while longer to be completely confident in that).
> > 
> > _Oh_.
> > 
> > It's not even a terribly weird race, then.  It's just this:
> > 
> > CPU 0				CPU 1
> > 				truncate_inode_partial_folio()
> > 				folio_wait_writeback();
> > bio_next_folio(&fi, bio)
> > iomap_finish_folio_write(fi.folio)
> > folio_end_writeback(folio)
> > 				split_huge_page()
> > bio_next_folio()
> > ... oops, now we only walked forward one page instead of the entire folio.
> > 
> 
> Yep, though once I noticed and turned on the mm_page_free tracepoint, it
> looked like it was actually the I/O completion path breaking down the
> compound folio:
> 
>    kworker/10:1-440     [010] .....   355.369899: iomap_finish_ioend: 1090: bio 00000000bc8445c7 index 192 fi (00000000dc8c03bd 0 16384 32768 27)
>    ...
>     kworker/10:1-440     [010] .....   355.369905: mm_page_free: page=00000000dc8c03bd pfn=0x182190 order=2
>     kworker/10:1-440     [010] .....   355.369907: iomap_finish_ioend: 1090: bio 00000000bc8445c7 index 1 fi (00000000f8b5d9b3 0 4096 16384 27)
> 
> I take that to mean the truncate path executes while the completion side
> holds a reference, folio_end_writeback() ends up dropping the last
> reference, falls into the free/split path and the iteration breaks from
> there. Same idea either way, I think.

Absolutely.  That's probably the more common path anyway; we truncate
an entire folio instead of a partial one, so it could be:

truncate_inode_partial_folio():
        folio_wait_writeback(folio);
        if (length == folio_size(folio)) {
                truncate_inode_folio(folio->mapping, folio);

or basically the same code in truncate_inode_pages_range()
or invalidate_inode_pages2_range().


      reply	other threads:[~2022-05-02 13:00 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-13  3:34 generic/068 crash on 5.18-rc2? Darrick J. Wong
2022-04-13 14:50 ` Matthew Wilcox
2022-04-13 16:23   ` Darrick J. Wong
2022-04-13 16:35     ` Matthew Wilcox
2022-04-18 18:44       ` Darrick J. Wong
2022-04-18 17:47   ` Darrick J. Wong
2022-04-20  0:37     ` Darrick J. Wong
2022-04-22 21:59     ` Darrick J. Wong
2022-04-28 15:53       ` Brian Foster
2022-04-30  3:10         ` Darrick J. Wong
2022-04-30  3:44         ` Matthew Wilcox
2022-04-30 21:40           ` Matthew Wilcox
2022-05-02 12:20             ` Brian Foster
2022-05-03  3:25               ` Darrick J. Wong
2022-05-03  4:31                 ` Matthew Wilcox
2022-05-03 17:25                   ` Darrick J. Wong
2022-05-05  2:40                     ` Darrick J. Wong
2022-05-05  4:18                       ` Matthew Wilcox
2022-05-05  4:24                         ` Darrick J. Wong
2022-05-06 17:03                           ` Darrick J. Wong
2022-05-02 12:18           ` Brian Foster
2022-05-02 13:00             ` Matthew Wilcox [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Ym/V4G2RcQd/RmHZ@casper.infradead.org \
    --to=willy@infradead.org \
    --cc=bfoster@redhat.com \
    --cc=djwong@kernel.org \
    --cc=fstests@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.