All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>
Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()
Date: Thu, 17 Mar 2022 09:51:44 -0400	[thread overview]
Message-ID: <YjM88OwoccZOKp86@bfoster> (raw)
In-Reply-To: <YjJPu/3tYnuKK888@casper.infradead.org>

On Wed, Mar 16, 2022 at 08:59:39PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote:
> > What seems to happen is that the majority of the fsync calls end up
> > waiting on writeback of a particular page, the wakeup of the writeback
> > bit on that page wakes a task that immediately resets PG_writeback on
> > the page such that N other folio_wait_writeback() waiters see the bit
> > still set and immediately place themselves back onto the tail of the
> > wait queue.  Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop
> > in folio_wake_bit() (backing off the lock for a cycle or so in each
> > iteration) only to find the same bunch of tasks in the queue. This
> > process repeats for a long enough amount of time to trigger the soft
> > lockup warning. I've confirmed this spinning behavior with a tracepoint
> > in the bookmark loop that indicates we're stuck for many hundreds of
> > thousands of iterations (at least) of this loop when the soft lockup
> > warning triggers.
> 
> [...]
> 
> > I've run a few quick experiments to try and corroborate this analysis.
> > The problem goes away completely if I either back out the loop change in
> > folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something
> > like 128 (i.e. greater than the total possible number of waiter tasks in
> > this test). I've also played a few games with bookmark behavior mostly
> > out of curiosity, but usually end up introducing other problems like
> > missed wakeups, etc.
> 
> As I recall, the bookmark hack was introduced in order to handle
> lock_page() problems.  It wasn't really supposed to handle writeback,
> but nobody thought it would cause any harm (and indeed, it didn't at the
> time).  So how about we only use bookmarks for lock_page(), since
> lock_page() usually doesn't have the multiple-waker semantics that
> writeback has?
> 

Oh, interesting. I wasn't aware of the tenuous status of the bookmark
code. This is indeed much nicer than anything I was playing with. I
suspect it will address the problem, but I'll throw it at my test env
for a while and follow up.. thanks!

Brian

> (this is more in the spirit of "minimal patch" -- I think initialising
> the bookmark should be moved to folio_unlock()).
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b2728eb52407..9ee3c5f1f489 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
>  }
>  
> -static void folio_wake_bit(struct folio *folio, int bit_nr)
> +static void folio_wake_bit(struct folio *folio, int bit_nr,
> +		wait_queue_entry_t *bookmark)
>  {
>  	wait_queue_head_t *q = folio_waitqueue(folio);
>  	struct wait_page_key key;
>  	unsigned long flags;
> -	wait_queue_entry_t bookmark;
>  
>  	key.folio = folio;
>  	key.bit_nr = bit_nr;
>  	key.page_match = 0;
>  
> -	bookmark.flags = 0;
> -	bookmark.private = NULL;
> -	bookmark.func = NULL;
> -	INIT_LIST_HEAD(&bookmark.entry);
> +	if (bookmark) {
> +		bookmark->flags = 0;
> +		bookmark->private = NULL;
> +		bookmark->func = NULL;
> +		INIT_LIST_HEAD(&bookmark->entry);
> +	}
>  
>  	spin_lock_irqsave(&q->lock, flags);
> -	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
> +	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark);
>  
> -	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
> +	while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
>  		/*
>  		 * Take a breather from holding the lock,
>  		 * allow pages that finish wake up asynchronously
> @@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr)
>  		spin_unlock_irqrestore(&q->lock, flags);
>  		cpu_relax();
>  		spin_lock_irqsave(&q->lock, flags);
> -		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
> +		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark);
>  	}
>  
>  	/*
> @@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit)
>  {
>  	if (!folio_test_waiters(folio))
>  		return;
> -	folio_wake_bit(folio, bit);
> +	folio_wake_bit(folio, bit, NULL);
>  }
>  
>  /*
> @@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
>   */
>  void folio_unlock(struct folio *folio)
>  {
> +	wait_queue_entry_t bookmark;
> +
>  	/* Bit 7 allows x86 to check the byte's sign bit */
>  	BUILD_BUG_ON(PG_waiters != 7);
>  	BUILD_BUG_ON(PG_locked > 7);
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> +
>  	if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
> -		folio_wake_bit(folio, PG_locked);
> +		folio_wake_bit(folio, PG_locked, &bookmark);
>  }
>  EXPORT_SYMBOL(folio_unlock);
>  
> @@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio)
>  {
>  	VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
>  	clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
> -	folio_wake_bit(folio, PG_private_2);
> +	folio_wake_bit(folio, PG_private_2, NULL);
>  	folio_put(folio);
>  }
>  EXPORT_SYMBOL(folio_end_private_2);
> 


  parent reply	other threads:[~2022-03-17 13:51 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-15 19:07 writeback completion soft lockup BUG in folio_wake_bit() Brian Foster
2022-03-16 20:59 ` Matthew Wilcox
2022-03-16 23:35   ` Linus Torvalds
2022-03-17 15:04     ` Matthew Wilcox
2022-03-17 19:26       ` Linus Torvalds
2022-03-17 21:16         ` Matthew Wilcox
2022-03-17 22:52           ` Dave Chinner
2022-03-18 13:16           ` Jan Kara
2022-03-18 18:56             ` Linus Torvalds
2022-03-19 16:23               ` Theodore Ts'o
2022-03-30 15:55                 ` Christoph Hellwig
2022-03-17 15:31     ` Brian Foster
2022-03-17 13:51   ` Brian Foster [this message]
2022-03-18 14:14     ` Brian Foster
2022-03-18 14:45       ` Matthew Wilcox
2022-03-18 18:58         ` Linus Torvalds
2022-10-20  1:35           ` Dan Williams
2022-10-23 22:38             ` Linus Torvalds
2022-10-24 19:39               ` Tim Chen
2022-10-24 19:43                 ` Linus Torvalds
2022-10-24 20:14                   ` Dan Williams
2022-10-24 20:13               ` Dan Williams
2022-10-24 20:28                 ` Linus Torvalds
2022-10-24 20:35                   ` Dan Williams
2022-10-25 15:58                     ` Arechiga Lopez, Jesus A
2022-10-25 19:19                   ` Matthew Wilcox
2022-10-25 19:20                     ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YjM88OwoccZOKp86@bfoster \
    --to=bfoster@redhat.com \
    --cc=hughd@google.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.