From: Matthew Wilcox <willy@infradead.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Foster <bfoster@redhat.com>, Linux-MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	Hugh Dickins <hughd@google.com>
Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()
Date: Thu, 17 Mar 2022 15:04:05 +0000
Message-ID: <YjNN5SzHELGig+U4@casper.infradead.org>
In-Reply-To: <CAHk-=wgPTWoXCa=JembExs8Y7fw7YUi9XR0zn1xaxWLSXBN_vg@mail.gmail.com>

On Wed, Mar 16, 2022 at 04:35:10PM -0700, Linus Torvalds wrote:
> On Wed, Mar 16, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > As I recall, the bookmark hack was introduced in order to handle
> > lock_page() problems.  It wasn't really supposed to handle writeback,
> > but nobody thought it would cause any harm (and indeed, it didn't at the
> > time).  So how about we only use bookmarks for lock_page(), since
> > lock_page() usually doesn't have the multiple-waker semantics that
> > writeback has?
> 
> I was hoping that some of the page lock problems are gone and we could
> maybe try to get rid of the bookmarks entirely.
> 
> But the page lock issues only ever showed up on some private
> proprietary load and machine, so we never really got confirmation that
> they are fixed. There were lots of strong signs to them being related
> to the migration page locking, and it may be that the bookmark code is
> only hurting these days.

Ah, I found Tim's mail describing the workload:
https://lore.kernel.org/all/a9e74f64-dee6-dc23-128e-8ef8c7383d77@linux.intel.com/

Relevant part:

: They have a parent process that spawns off 10 children per core and
: kicked them to run. The child processes all access a common library.
: We have 384 cores so 3840 child processes running.  When migration occur on
: a page in the common library, the first child that access the page will
: page fault and lock the page, with the other children also page faulting
: quickly and pile up in the page wait list, till the first child is done.

Seems like someone should be able to write a test app that does
something along those lines fairly easily.  The trick is finding a
giant system to run it on, although doing it on a smaller system with
more SW threads per CPU would probably be enough.
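Something like the sketch below might do as a starting point.  It's
untested and makes some guesses at the workload: an anonymous shared
mapping stands in for the common library, move_pages() is used to force
the migrations, and it needs libnuma, at least two NUMA nodes and
CAP_SYS_NICE for MPOL_MF_MOVE_ALL.

/* cc -O2 -o pileup pileup.c -lnuma */
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHILDREN_PER_CPU 10

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	long pagesize = sysconf(_SC_PAGESIZE);

	/* Stand-in for the page of the common library everybody touches */
	volatile char *map = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
				  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	map[0] = 1;

	for (long i = 0; i < ncpus * CHILDREN_PER_CPU; i++) {
		if (fork() == 0) {
			/* Children hammer the page so they all fault and
			 * pile up on the page lock whenever it's unmapped
			 * for migration */
			for (;;)
				(void)map[0];
		}
	}

	/* Parent bounces the page between nodes 0 and 1 forever;
	 * kill the process group to stop the test */
	for (int node = 0; ; node = !node) {
		void *pages[1] = { (void *)map };
		int nodes[1] = { node };
		int status[1];

		if (move_pages(0, 1, pages, nodes, status,
			       MPOL_MF_MOVE_ALL) < 0)
			perror("move_pages");
	}
}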

> See for example commit 9a1ea439b16b ("mm:
> put_and_wait_on_page_locked() while page is migrated") which doesn't
> actually change the *locking* side, but drops the page reference when
> waiting for the locked page to be unlocked, which in turn removes a
> "loop and try again when migration". And that may have been the real
> _fix_ for the problem.

Actually, I think it fixes a livelock.  If you have hundreds (let alone
thousands) of threads all trying to migrate the same page, they're all
going to fail with the original code because migration can't succeed with
an elevated refcount, and each thread that is waiting holds a refcount.
I had a similar problem trying to split a folio in ->readpage with
one of the xfstests that has 500 threads all trying to read() from the
same page.  That was solved by also using put_and_wait_on_page_locked()
in bd8a1f3655a7 ("mm/filemap: support readpage splitting a page").
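
Roughly the difference, in kernel-style shorthand (not literal code;
function names are the ones from around that commit):

	/* Before: each waiter slept on the page with its reference
	 * still held, then looped and refaulted.  With thousands of
	 * waiters, migration's expected-refcount check can never
	 * pass, so everyone keeps looping: a livelock. */
	wait_on_page_locked(page);	/* reference still held */
	put_page(page);
	/* ... return and retry the fault ... */

	/* After 9a1ea439b16b: drop the reference _before_ sleeping,
	 * so sleeping waiters no longer contribute to the refcount
	 * that blocks the migration. */
	put_and_wait_on_page_locked(page);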

> Because while the bookmark thing avoids the NMI lockup detector firing
> due to excessive hold times, the bookmarking also _causes_ that "we
> now will see the same page multiple times because we dropped the lock
> and somebody re-added it at the end of the queue" issue. Which seems
> to be the problem here.
> 
> Ugh. I wish we had some way to test "could we just remove the bookmark
> code entirely again".

I think we should be safe to remove it entirely now.  Of course, that
may show somewhere else that now suffers from a similar livelock ...
but if it does, we ought to fix that anyway, because the bookmark
code is stopping the livelock problem from being seen.

> Of course, the PG_lock case also works fairly hard to not actually
> remove and re-add the lock waiter to the queue, but having an actual
> "wait for and get the lock" operation. The writeback bit isn't done
> that way.
> 
> I do hate how we had to make folio_wait_writeback{_killable}() use
> "while" rather than an "if". It *almost* works with just a "wait for
> current writeback", but not quite. See commit c2407cf7d22d ("mm: make
> wait_on_page_writeback() wait for multiple pending writebacks") for
> why we have to loop. Ugly, ugly.
> 
> Because I do think that "while" in the writeback waiting is a problem.
> Maybe _the_ problem.
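
(For reference, the wait in question after c2407cf7d22d is, roughly
from memory:

void wait_on_page_writeback(struct page *page)
{
	while (PageWriteback(page)) {
		trace_wait_on_page_writeback(page, page_mapping(page));
		wait_on_page_bit(page, PG_writeback);
	}
}

i.e. the "if" became a "while".)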

I do like the idea of wait-and-set the writeback flag in the VFS instead
of leaving it to the individual filesystems.  I'll put it on my list of
things to do.

Actually ... I don't think that calling folio_start_writeback() twice
is a problem per se.  It's inefficient (cycling the i_pages lock,
walking the xarray, atomic bit operations on the flag), but nothing
goes wrong.  Except that NFS and AFS check the return value and
squawk.

So how about we do something like this:

 - Make folio_start_writeback() and set_page_writeback() return void,
   fixing up AFS and NFS.
 - Add a folio_wait_start_writeback() to use in the VFS.
 - Remove the calls to set_page_writeback() in the filesystems.

That keepwrite thing troubles me, and hints it might be more complicated
than that.
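
Ignoring keepwrite for the moment, the wait-and-set helper might look
something like this (untested sketch, assuming the caller holds the
folio lock so nobody else can start writeback between the wait and
the set):

static void folio_wait_start_writeback(struct folio *folio)
{
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

	/* folio_wait_writeback() already loops, per c2407cf7d22d */
	folio_wait_writeback(folio);
	folio_start_writeback(folio);
}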
