Re: [RFC PATCH] mm: silence soft lockups from unlock_page

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Michal Hocko <mhocko@kernel.org>
Cc: Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: [RFC PATCH] mm: silence soft lockups from unlock_page
Date: Tue, 21 Jul 2020 08:33:33 -0700	[thread overview]
Message-ID: <CAHk-=whewL14RgwLZTXcNAnrDPt0H+sRJS6iDq0oGb6zwaBMxg@mail.gmail.com> (raw)
In-Reply-To: <20200721063258.17140-1-mhocko@kernel.org>

On Mon, Jul 20, 2020 at 11:33 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> The lockup is in page_unlock in do_read_fault and I suspect that this is
> yet another effect of a very long waitqueue chain which has been
> addresses by 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
> wake_up_page_bit") previously.

Hmm.

I do not believe that you can actually get to the point where you have
a million waiters and it takes 20+ seconds to wake everybody up.

More likely, it's actually *caused* by that commit 11a19c7b099f, and
what might be happening is that other CPU's are just adding new
waiters to the list *while* we're waking things up, because somebody
else already got the page lock again.

Humor me.. Does something like this work instead? It's
whitespace-damaged because of just a cut-and-paste, but it's entirely
untested, and I haven't really thought about any memory ordering
issues, but I think it's ok.

The logic is that anybody who called wake_up_page_bit() _must_ have
cleared that bit before that. So if we ever see it set again (and
memory ordering doesn't matter), then clearly somebody else got access
to the page bit (whichever it was), and we should not

 (a) waste time waking up people who can't get the bit anyway

 (b) be in a  livelock where other CPU's continually add themselves to
the wait queue because somebody else got the bit.

and it's that (b) case that I think happens for you.

NOTE! Totally UNTESTED patch follows. I think it's good, but maybe
somebody sees some problem with this approach?

I realize that people can wait for other bits than the unlocked, and
if you're waiting for writeback to complete maybe you don't care if
somebody else then started writeback *AGAIN* on the page and you'd
actually want to be woken up regardless, but honestly, I don't think
it really matters.

                Linus

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1054,6 +1054,15 @@ static void wake_up_page_bit(struct page *page,
int bit_nr)
                 * from wait queue
                 */
                spin_unlock_irqrestore(&q->lock, flags);
+
+               /*
+                * If somebody else set the bit again, stop waking
+                * people up. It's now the responsibility of that
+                * other page bit owner to do so.
+                */
+               if (test_bit(bit_nr, &page->flags))
+                       return;
+
                cpu_relax();
                spin_lock_irqsave(&q->lock, flags);
                __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);