Re: [RFC PATCH] mm: silence soft lockups from unlock_page

From: Hugh Dickins <hughd@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
	 Michal Hocko <mhocko@kernel.org>, Linux-MM <linux-mm@kvack.org>,
	 LKML <linux-kernel@vger.kernel.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	 Tim Chen <tim.c.chen@linux.intel.com>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: [RFC PATCH] mm: silence soft lockups from unlock_page
Date: Fri, 24 Jul 2020 19:08:24 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LSU.2.11.2007241848300.8192@eggly.anvils> (raw)
In-Reply-To: <CAHk-=wjYHvbOs9i39EnUsC6VEJiuJ2e_5gZB5-J5CRKxq80B_Q@mail.gmail.com>

On Fri, 24 Jul 2020, Linus Torvalds wrote:

> On Fri, Jul 24, 2020 at 10:32 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > Ok, that makes sense. Except you did it on top of the original patch
> > without the fix to set WQ_FLAG_WOKEN for the non-wakeup case.
> 
> Hmm.
> 
> I just realized that one thing we could do is to not even test the
> page bit for the shared case in the wakeup path.
> 
> Because anybody who uses the non-exclusive "wait_on_page_locked()" or
> "wait_on_page_writeback()" isn't actually interested in the bit state
> any more at that point. All they care about is that somebody cleared
> it - not whether it was then re-taken again.
> 
> So instead of keeping them on the list - or stopping the waitqueue
> walk because somebody else got the bit - we could just mark them
> successfully done, wake them up, and remove those entries from the
> list.
> 
> That would be better for everybody - less pointless waiting for a new
> lock or writeback event, but also fewer entries on the wait queues as
> we get rid of them more quickly instead of walking them over and over
> just because somebody else re-took the page lock.
> 
> Generally "wait_on_page_locked()" is used for two things
> 
>  - either wait for the IO to then check if it's now uptodate
> 
>  - throttle things that can't afford to lock the page (eg page faults
> that dropped the mm lock, and as such need to go through the whole
> restart logic, but that don't want to lock the page because it's now
> too late, but also the page migration things)
> 
> In the first case, waiting to actually seeing the locked bit clear is
> pointless - the code only cared about the "wait for IO in progress"
> not about the lock bit itself.
> 
> And that second case generally might want to retry, but doesn't want
> to busy-loop.
> 
> And "wait_on_page_writeback()" is basically used for similar reasons
> (ie check if there were IO errors, but also possibly to throttle
> writeback traffic).
> 
> Saying "stop walking, keep it on the list" seems wrong. It makes IO
> error handling and retries much worse, for example.
> 
> So it turns out that the wakeup logic and the initial wait logic don't
> have so much in common after all, and there is a fundamental
> conceptual difference between that "check bit one last time" case, and
> the "we got woken up, now what" case..
> 
> End result: one final (yes, hopefully - I think I'm done) version of
> this patch-series.
> 
> This not only makes the code cleaner (the generated code for
> wake_up_page() is really quite nice now), but getting rid of extra
> waiting might help the load that Michal reported.
> 
> Because a lot of page waiting might be that non-exclusive
> "wait_on_page_locked()" kind, particularly in the thundering herd kind
> of situation where one process starts IO, and then other processes
> wait for it to finish.
> 
> Those users don't even care if somebody else then did a "lock_page()"
> for some other reason (maybe for writeback). They are generally
> perfectly happy with a locked page, as long as it's now up-to-date.
> 
> So this not only simplifies the code, it really might avoid some problems too.

That set of tests I started yesterday has now completed: no crashes
due to your changes (though, one machine crashed with an entirely
unrelated list_del corruption: over in driverland, just a coincidence).

I do see more of these stresstests reporting failure than I remember
from the last time I ran them, and I wasn't quickly able to work out
why (usually I just care about not crashing or hanging, rarely delve
deeper into what they actually call success).  The most likely cause
would be some internal infrastructural oddity, and nothing for you
to worry about; but there is a possibility that it's meaningful.

But whatever, what happens on the next run, with these latest patches,
will be more important; and I'll follow this next run with a run on
the baseline without them, to compare results.

Hugh