Re: writeback completion soft lockup BUG in folio_wake_bit()

From: Brian Foster <bfoster@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>
Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()
Date: Thu, 17 Mar 2022 09:51:44 -0400	[thread overview]
Message-ID: <YjM88OwoccZOKp86@bfoster> (raw)
In-Reply-To: <YjJPu/3tYnuKK888@casper.infradead.org>

On Wed, Mar 16, 2022 at 08:59:39PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote:
> > What seems to happen is that the majority of the fsync calls end up
> > waiting on writeback of a particular page, the wakeup of the writeback
> > bit on that page wakes a task that immediately resets PG_writeback on
> > the page such that N other folio_wait_writeback() waiters see the bit
> > still set and immediately place themselves back onto the tail of the
> > wait queue.  Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop
> > in folio_wake_bit() (backing off the lock for a cycle or so in each
> > iteration) only to find the same bunch of tasks in the queue. This
> > process repeats for a long enough amount of time to trigger the soft
> > lockup warning. I've confirmed this spinning behavior with a tracepoint
> > in the bookmark loop that indicates we're stuck for many hundreds of
> > thousands of iterations (at least) of this loop when the soft lockup
> > warning triggers.
> 
> [...]
> 
> > I've run a few quick experiments to try and corroborate this analysis.
> > The problem goes away completely if I either back out the loop change in
> > folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something
> > like 128 (i.e. greater than the total possible number of waiter tasks in
> > this test). I've also played a few games with bookmark behavior mostly
> > out of curiosity, but usually end up introducing other problems like
> > missed wakeups, etc.
> 
> As I recall, the bookmark hack was introduced in order to handle
> lock_page() problems.  It wasn't really supposed to handle writeback,
> but nobody thought it would cause any harm (and indeed, it didn't at the
> time).  So how about we only use bookmarks for lock_page(), since
> lock_page() usually doesn't have the multiple-waker semantics that
> writeback has?
> 

Oh, interesting. I wasn't aware of the tenuous status of the bookmark
code. This is indeed much nicer than anything I was playing with. I
suspect it will address the problem, but I'll throw it at my test env
for a while and follow up.. thanks!

Brian

> (this is more in the spirit of "minimal patch" -- I think initialising
> the bookmark should be moved to folio_unlock()).
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b2728eb52407..9ee3c5f1f489 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
>  }
>  
> -static void folio_wake_bit(struct folio *folio, int bit_nr)
> +static void folio_wake_bit(struct folio *folio, int bit_nr,
> +		wait_queue_entry_t *bookmark)
>  {
>  	wait_queue_head_t *q = folio_waitqueue(folio);
>  	struct wait_page_key key;
>  	unsigned long flags;
> -	wait_queue_entry_t bookmark;
>  
>  	key.folio = folio;
>  	key.bit_nr = bit_nr;
>  	key.page_match = 0;
>  
> -	bookmark.flags = 0;
> -	bookmark.private = NULL;
> -	bookmark.func = NULL;
> -	INIT_LIST_HEAD(&bookmark.entry);
> +	if (bookmark) {
> +		bookmark->flags = 0;
> +		bookmark->private = NULL;
> +		bookmark->func = NULL;
> +		INIT_LIST_HEAD(&bookmark->entry);
> +	}
>  
>  	spin_lock_irqsave(&q->lock, flags);
> -	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
> +	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark);
>  
> -	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
> +	while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
>  		/*
>  		 * Take a breather from holding the lock,
>  		 * allow pages that finish wake up asynchronously
> @@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr)
>  		spin_unlock_irqrestore(&q->lock, flags);
>  		cpu_relax();
>  		spin_lock_irqsave(&q->lock, flags);
> -		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
> +		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark);
>  	}
>  
>  	/*
> @@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit)
>  {
>  	if (!folio_test_waiters(folio))
>  		return;
> -	folio_wake_bit(folio, bit);
> +	folio_wake_bit(folio, bit, NULL);
>  }
>  
>  /*
> @@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
>   */
>  void folio_unlock(struct folio *folio)
>  {
> +	wait_queue_entry_t bookmark;
> +
>  	/* Bit 7 allows x86 to check the byte's sign bit */
>  	BUILD_BUG_ON(PG_waiters != 7);
>  	BUILD_BUG_ON(PG_locked > 7);
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> +
>  	if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
> -		folio_wake_bit(folio, PG_locked);
> +		folio_wake_bit(folio, PG_locked, &bookmark);
>  }
>  EXPORT_SYMBOL(folio_unlock);
>  
> @@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio)
>  {
>  	VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
>  	clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
> -	folio_wake_bit(folio, PG_private_2);
> +	folio_wake_bit(folio, PG_private_2, NULL);
>  	folio_put(folio);
>  }
>  EXPORT_SYMBOL(folio_end_private_2);
>