Re: [RFC PATCH] mm: silence soft lockups from unlock_page

From: Hugh Dickins <hughd@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>,
	Hugh Dickins <hughd@google.com>,  Oleg Nesterov <oleg@redhat.com>,
	Michal Hocko <mhocko@kernel.org>,  Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	 Tim Chen <tim.c.chen@linux.intel.com>,
	Michal Hocko <mhocko@suse.com>,
	 Greg KH <gregkh@linuxfoundation.org>
Subject: Re: [RFC PATCH] mm: silence soft lockups from unlock_page
Date: Fri, 7 Aug 2020 11:41:09 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LSU.2.11.2008071047510.1239@eggly.anvils> (raw)
In-Reply-To: <CAHk-=wihTRHMm1LC4AfidZptT9ZuT-wBjE2VhYzKBy66e4iwQw@mail.gmail.com>

On Thu, 6 Aug 2020, Linus Torvalds wrote:
> On Thu, Aug 6, 2020 at 11:00 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > It wasn't clear to me whether Hugh thought it was an improvement or
> > not that trylock was now less likely to jump the queue.  There're
> > the usual "fair is the opposite of throughput" kind of arguments.

I don't know.  I'm inclined to think that keeping just a small chance
of jumping the queue is probably good; but pay no attention to me,
that's an opinion based on ignorance.

Thanks for mentioning the lucky lock_page(): I was quite wrong to
single out trylock_page() - I suppose its lack of a slow path just
helped it spring to mind more easily.

> 
> Yeah, it could go either way. But on the whole, if the lock bit is
> getting any contention, I think we'd rather have it be fair for
> latency reasons.
> 
> That said, I'm not convinced about my patch, and I actually threw it
> away without even testing it (sometimes I keep patches around in my
> private tree for testing, and they can live there for months or even
> years when I wonder if they are worth it, but this time I didn't
> bother to go to the trouble).
> 
> If somebody is interested in pursuing this, I think that patch might
> be a good starting point (and it _might_ even work), but it seemed to
> be too subtle to really worry about unless somebody finds an actual
> acute reason for it.

It is an attractive idea, passing the baton from one to the next;
and a logical destination from where you had already arrived.
Maybe somebody, not me, should pursue it.

I tried to ignore it, but eventually succumbed to temptation, and ran
it overnight at home (my usual tmpfs swapping), and on one machine at
work (fork/exit stress etc).  It did need one fixup, below: then home
testing went fine.  But the fork/exit stress hit that old familiar
unlock_page wake_up_page_bit softlockup that your existing patch
has appeared to fix.  I can't afford to investigate further (though
might regret that: I keep wondering if the small dose of unfairness
in your existing patch is enough to kick the test when it otherwise
would hang, and this new stricter patch be pointing to that).

My home testing was, effectively, on top of c6fe44d96fc1 (v5.8 plus
your two patches): I did not have in the io_uring changes from the
current tree. In glancing there, I noticed one new and one previous
instance of SetPageWaiters() *after* __add_wait_queue_entry_tail():
are those correct?
 
> 
> I think the existing patch narrowing the window is good, and it
> clearly didn't hurt throughput (although that was almost certainly for
> other reasons).

Agreed.

> 
>                 Linus
>

--- linus/mm/filemap.c	2020-08-07 02:08:13.727709186 -0700
+++ hughd/mm/filemap.c	2020-08-07 02:16:10.960331473 -0700
@@ -1044,7 +1044,7 @@ static int wake_page_function(wait_queue
 	return ret;
 }
 
-static int wake_up_page_bit(struct page *page, int bit_nr)
+static void wake_up_page_bit(struct page *page, int bit_nr, bool exclude)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
 	struct wait_page_key key;
@@ -1096,15 +1096,28 @@ static int wake_up_page_bit(struct page
 		 * That's okay, it's a rare case. The next waker will clear it.
 		 */
 	}
+
+	/*
+	 * If we hoped to pass PG_locked on to the next locker, but found
+	 * no exclusive taker, then we must clear it before dropping q->lock.
+	 * Otherwise unlock_page() can race trylock_page_bit_common(), and if
+	 * PageWaiters was already set from before, then cmpxchg sees no
+	 * difference to send it back to wake_up_page_bit().
+	 *
+	 * We may have already dropped and retaken q->lock on the way here, but
+	 * I think this works out because new entries are always added at tail.
+	 */
+	if (exclude && !exclusive)
+		clear_bit(bit_nr, &page->flags);
+
 	spin_unlock_irqrestore(&q->lock, flags);
-	return exclusive;
 }
 
 static void wake_up_page(struct page *page, int bit)
 {
 	if (!PageWaiters(page))
 		return;
-	wake_up_page_bit(page, bit);
+	wake_up_page_bit(page, bit, false);
 }
 
 /*
@@ -1339,8 +1352,8 @@ void unlock_page(struct page *page)
 			 * spinlock wake_up_page_bit() will do.
 			 */
 			smp_mb__before_atomic();
-			if (wake_up_page_bit(page, PG_locked))
-				return;
+			wake_up_page_bit(page, PG_locked, true);
+			return;
 		}
 		new = cmpxchg_release(&page->flags, flags, flags & ~(1 << PG_locked));
 		if (likely(new == flags))