All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@suse.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tim Chen <tim.c.chen@linux.intel.com>
Subject: Re: [RFC PATCH] mm: silence soft lockups from unlock_page
Date: Mon, 3 Aug 2020 15:14:53 +0200	[thread overview]
Message-ID: <20200803131453.GL5174@dhcp22.suse.cz> (raw)
In-Reply-To: <CAHk-=wjYHvbOs9i39EnUsC6VEJiuJ2e_5gZB5-J5CRKxq80B_Q@mail.gmail.com>

Hi,
sorry for being late in discussion (was offline or very busy with other
stuff).

I hope I got it right and this is the latest version of your patches. Btw.
do you still think that increasing PAGE_WAIT_TABLE_BITS is reasonable.
In the meantime I have learned that the customer suffering from the
issue is very unlikely to reboot the machine anytime soon or even
willing to test a non-trivial patch. We do not own any machine which
exhibit this problem unfortunately. So it is quite unlikely I can
help with testing.

Also does it make sense to put this into mmotm tree for a while to get a
larger testing coverage?

I didn't get to review these patch and it will take some time to do that
because this is really subtle area.

Thanks for helping out and sorry that I cannot really help much more
now.

On Fri 24-07-20 16:25:56, Linus Torvalds wrote:
> From 0bccb60841cc52a9aa6e9cc6b7eff59d1983e8fa Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Thu, 23 Jul 2020 10:16:49 -0700
> Subject: [PATCH 1/2] mm: rewrite wait_on_page_bit_common() logic
> 
> It turns out that wait_on_page_bit_common() had several problems,
> ranging from just unfair behavioe due to re-queueing at the end of the
> wait queue when re-trying, and an outright bug that could result in
> missed wakeups (but probably never happened in practice).
> 
> This rewrites the whole logic to avoid both issues, by simply moving the
> logic to check (and possibly take) the bit lock into the wakeup path
> instead.
> 
> That makes everything much more straightforward, and means that we never
> need to re-queue the wait entry: if we get woken up, we'll be notified
> through WQ_FLAG_WOKEN, and the wait queue entry will have been removed,
> and everything will have been done for us.
> 
> Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/
> Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/
> Reported-by: Oleg Nesterov <oleg@redhat.com>
> Reported-by: Hugh Dickins <hughd@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  mm/filemap.c | 132 +++++++++++++++++++++++++++++++++------------------
>  1 file changed, 85 insertions(+), 47 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 385759c4ce4b..8c3d3e233d37 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1002,6 +1002,7 @@ struct wait_page_queue {
>  
>  static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
>  {
> +	int ret;
>  	struct wait_page_key *key = arg;
>  	struct wait_page_queue *wait_page
>  		= container_of(wait, struct wait_page_queue, wait);
> @@ -1014,17 +1015,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  		return 0;
>  
>  	/*
> -	 * Stop walking if it's locked.
> -	 * Is this safe if put_and_wait_on_page_locked() is in use?
> -	 * Yes: the waker must hold a reference to this page, and if PG_locked
> -	 * has now already been set by another task, that task must also hold
> -	 * a reference to the *same usage* of this page; so there is no need
> -	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
> +	 * If it's an exclusive wait, we get the bit for it, and
> +	 * stop walking if we can't.
> +	 *
> +	 * If it's a non-exclusive wait, then the fact that this
> +	 * wake function was called means that the bit already
> +	 * was cleared, and we don't care if somebody then
> +	 * re-took it.
>  	 */
> -	if (test_bit(key->bit_nr, &key->page->flags))
> -		return -1;
> +	ret = 0;
> +	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> +		if (test_and_set_bit(key->bit_nr, &key->page->flags))
> +			return -1;
> +		ret = 1;
> +	}
> +	wait->flags |= WQ_FLAG_WOKEN;
>  
> -	return autoremove_wake_function(wait, mode, sync, key);
> +	wake_up_state(wait->private, mode);
> +
> +	/*
> +	 * Ok, we have successfully done what we're waiting for,
> +	 * and we can unconditionally remove the wait entry.
> +	 *
> +	 * Note that this has to be the absolute last thing we do,
> +	 * since after list_del_init(&wait->entry) the wait entry
> +	 * might be de-allocated and the process might even have
> +	 * exited.
> +	 *
> +	 * We _really_ should have a "list_del_init_careful()" to
> +	 * properly pair with the unlocked "list_empty_careful()"
> +	 * in finish_wait().
> +	 */
> +	smp_mb();
> +	list_del_init(&wait->entry);
> +	return ret;
>  }
>  
>  static void wake_up_page_bit(struct page *page, int bit_nr)
> @@ -1103,16 +1127,31 @@ enum behavior {
>  			 */
>  };
>  
> +/*
> + * Attempt to check (or get) the page bit, and mark the
> + * waiter woken if successful.
> + */
> +static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
> +					struct wait_queue_entry *wait)
> +{
> +	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> +		if (test_and_set_bit(bit_nr, &page->flags))
> +			return false;
> +	} else if (test_bit(bit_nr, &page->flags))
> +		return false;
> +
> +	wait->flags |= WQ_FLAG_WOKEN;
> +	return true;
> +}
> +
>  static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>  	struct page *page, int bit_nr, int state, enum behavior behavior)
>  {
>  	struct wait_page_queue wait_page;
>  	wait_queue_entry_t *wait = &wait_page.wait;
> -	bool bit_is_set;
>  	bool thrashing = false;
>  	bool delayacct = false;
>  	unsigned long pflags;
> -	int ret = 0;
>  
>  	if (bit_nr == PG_locked &&
>  	    !PageUptodate(page) && PageWorkingset(page)) {
> @@ -1130,48 +1169,47 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>  	wait_page.page = page;
>  	wait_page.bit_nr = bit_nr;
>  
> -	for (;;) {
> -		spin_lock_irq(&q->lock);
> +	/*
> +	 * Do one last check whether we can get the
> +	 * page bit synchronously.
> +	 *
> +	 * Do the SetPageWaiters() marking before that
> +	 * to let any waker we _just_ missed know they
> +	 * need to wake us up (otherwise they'll never
> +	 * even go to the slow case that looks at the
> +	 * page queue), and add ourselves to the wait
> +	 * queue if we need to sleep.
> +	 *
> +	 * This part needs to be done under the queue
> +	 * lock to avoid races.
> +	 */
> +	spin_lock_irq(&q->lock);
> +	SetPageWaiters(page);
> +	if (!trylock_page_bit_common(page, bit_nr, wait))
> +		__add_wait_queue_entry_tail(q, wait);
> +	spin_unlock_irq(&q->lock);
>  
> -		if (likely(list_empty(&wait->entry))) {
> -			__add_wait_queue_entry_tail(q, wait);
> -			SetPageWaiters(page);
> -		}
> +	/*
> +	 * From now on, all the logic will be based on
> +	 * the WQ_FLAG_WOKEN flag, and the and the page
> +	 * bit testing (and setting) will be - or has
> +	 * already been - done by the wake function.
> +	 *
> +	 * We can drop our reference to the page.
> +	 */
> +	if (behavior == DROP)
> +		put_page(page);
>  
> +	for (;;) {
>  		set_current_state(state);
>  
> -		spin_unlock_irq(&q->lock);
> -
> -		bit_is_set = test_bit(bit_nr, &page->flags);
> -		if (behavior == DROP)
> -			put_page(page);
> -
> -		if (likely(bit_is_set))
> -			io_schedule();
> -
> -		if (behavior == EXCLUSIVE) {
> -			if (!test_and_set_bit_lock(bit_nr, &page->flags))
> -				break;
> -		} else if (behavior == SHARED) {
> -			if (!test_bit(bit_nr, &page->flags))
> -				break;
> -		}
> -
> -		if (signal_pending_state(state, current)) {
> -			ret = -EINTR;
> +		if (signal_pending_state(state, current))
>  			break;
> -		}
>  
> -		if (behavior == DROP) {
> -			/*
> -			 * We can no longer safely access page->flags:
> -			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> -			 * there is a risk of waiting forever on a page reused
> -			 * for something that keeps it locked indefinitely.
> -			 * But best check for -EINTR above before breaking.
> -			 */
> +		if (wait->flags & WQ_FLAG_WOKEN)
>  			break;
> -		}
> +
> +		io_schedule();
>  	}
>  
>  	finish_wait(q, wait);
> @@ -1190,7 +1228,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>  	 * bother with signals either.
>  	 */
>  
> -	return ret;
> +	return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
>  }
>  
>  void wait_on_page_bit(struct page *page, int bit_nr)
> -- 
> 2.28.0.rc0.3.g1e25d3a62f
> 

> From 93f0263b9b952a1c449cec56a6aadf6320e821f9 Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Thu, 23 Jul 2020 12:33:41 -0700
> Subject: [PATCH 2/2] list: add "list_del_init_careful()" to go with
>  "list_empty_careful()"
> 
> That gives us ordering guarantees around the pair.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  include/linux/list.h | 20 +++++++++++++++++++-
>  kernel/sched/wait.c  |  2 +-
>  mm/filemap.c         |  7 +------
>  3 files changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index aff44d34f4e4..0d0d17a10d25 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -282,6 +282,24 @@ static inline int list_empty(const struct list_head *head)
>  	return READ_ONCE(head->next) == head;
>  }
>  
> +/**
> + * list_del_init_careful - deletes entry from list and reinitialize it.
> + * @entry: the element to delete from the list.
> + *
> + * This is the same as list_del_init(), except designed to be used
> + * together with list_empty_careful() in a way to guarantee ordering
> + * of other memory operations.
> + *
> + * Any memory operations done before a list_del_init_careful() are
> + * guaranteed to be visible after a list_empty_careful() test.
> + */
> +static inline void list_del_init_careful(struct list_head *entry)
> +{
> +	__list_del_entry(entry);
> +	entry->prev = entry;
> +	smp_store_release(&entry->next, entry);
> +}
> +
>  /**
>   * list_empty_careful - tests whether a list is empty and not being modified
>   * @head: the list to test
> @@ -297,7 +315,7 @@ static inline int list_empty(const struct list_head *head)
>   */
>  static inline int list_empty_careful(const struct list_head *head)
>  {
> -	struct list_head *next = head->next;
> +	struct list_head *next = smp_load_acquire(&head->next);
>  	return (next == head) && (next == head->prev);
>  }
>  
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index ba059fbfc53a..01f5d3020589 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -389,7 +389,7 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
>  	int ret = default_wake_function(wq_entry, mode, sync, key);
>  
>  	if (ret)
> -		list_del_init(&wq_entry->entry);
> +		list_del_init_careful(&wq_entry->entry);
>  
>  	return ret;
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8c3d3e233d37..991503bbf922 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1041,13 +1041,8 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  	 * since after list_del_init(&wait->entry) the wait entry
>  	 * might be de-allocated and the process might even have
>  	 * exited.
> -	 *
> -	 * We _really_ should have a "list_del_init_careful()" to
> -	 * properly pair with the unlocked "list_empty_careful()"
> -	 * in finish_wait().
>  	 */
> -	smp_mb();
> -	list_del_init(&wait->entry);
> +	list_del_init_careful(&wait->entry);
>  	return ret;
>  }
>  
> -- 
> 2.28.0.rc0.3.g1e25d3a62f
> 


-- 
Michal Hocko
SUSE Labs

  parent reply	other threads:[~2020-08-03 13:14 UTC|newest]

Thread overview: 91+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-21  6:32 [RFC PATCH] mm: silence soft lockups from unlock_page Michal Hocko
2020-07-21 11:10 ` Qian Cai
2020-07-21 11:25   ` Michal Hocko
2020-07-21 11:44     ` Qian Cai
2020-07-21 12:17       ` Michal Hocko
2020-07-21 13:23         ` Qian Cai
2020-07-21 13:38           ` Michal Hocko
2020-07-21 14:15             ` Qian Cai
2020-07-21 14:17 ` Chris Down
2020-07-21 15:00   ` Michal Hocko
2020-07-21 15:33 ` Linus Torvalds
2020-07-21 15:33   ` Linus Torvalds
2020-07-21 15:49   ` Michal Hocko
2020-07-22 18:29   ` Linus Torvalds
2020-07-22 18:29     ` Linus Torvalds
2020-07-22 21:29     ` Hugh Dickins
2020-07-22 21:29       ` Hugh Dickins
2020-07-22 22:10       ` Linus Torvalds
2020-07-22 22:10         ` Linus Torvalds
2020-07-22 23:42         ` Linus Torvalds
2020-07-22 23:42           ` Linus Torvalds
2020-07-23  0:23           ` Linus Torvalds
2020-07-23  0:23             ` Linus Torvalds
2020-07-23 12:47           ` Oleg Nesterov
2020-07-23 17:32             ` Linus Torvalds
2020-07-23 17:32               ` Linus Torvalds
2020-07-23 18:01               ` Oleg Nesterov
2020-07-23 18:22                 ` Linus Torvalds
2020-07-23 18:22                   ` Linus Torvalds
2020-07-23 19:03                   ` Linus Torvalds
2020-07-23 19:03                     ` Linus Torvalds
2020-07-24 14:45                     ` Oleg Nesterov
2020-07-23 20:03               ` Linus Torvalds
2020-07-23 20:03                 ` Linus Torvalds
2020-07-23 23:11                 ` Hugh Dickins
2020-07-23 23:11                   ` Hugh Dickins
2020-07-23 23:43                   ` Linus Torvalds
2020-07-23 23:43                     ` Linus Torvalds
2020-07-24  0:07                     ` Hugh Dickins
2020-07-24  0:07                       ` Hugh Dickins
2020-07-24  0:46                       ` Linus Torvalds
2020-07-24  0:46                         ` Linus Torvalds
2020-07-24  3:45                         ` Hugh Dickins
2020-07-24  3:45                           ` Hugh Dickins
2020-07-24 15:24                     ` Oleg Nesterov
2020-07-24 17:32                       ` Linus Torvalds
2020-07-24 17:32                         ` Linus Torvalds
2020-07-24 23:25                         ` Linus Torvalds
2020-07-24 23:25                           ` Linus Torvalds
2020-07-25  2:08                           ` Hugh Dickins
2020-07-25  2:08                             ` Hugh Dickins
2020-07-25  2:46                             ` Linus Torvalds
2020-07-25  2:46                               ` Linus Torvalds
2020-07-25 10:14                           ` Oleg Nesterov
2020-07-25 18:48                             ` Linus Torvalds
2020-07-25 18:48                               ` Linus Torvalds
2020-07-25 19:27                               ` Oleg Nesterov
2020-07-25 19:51                                 ` Linus Torvalds
2020-07-25 19:51                                   ` Linus Torvalds
2020-07-26 13:57                                   ` Oleg Nesterov
2020-07-25 21:19                               ` Hugh Dickins
2020-07-25 21:19                                 ` Hugh Dickins
2020-07-26  4:22                                 ` Hugh Dickins
2020-07-26  4:22                                   ` Hugh Dickins
2020-07-26 20:30                                   ` Hugh Dickins
2020-07-26 20:30                                     ` Hugh Dickins
2020-07-26 20:41                                     ` Linus Torvalds
2020-07-26 20:41                                       ` Linus Torvalds
2020-07-26 22:09                                       ` Hugh Dickins
2020-07-26 22:09                                         ` Hugh Dickins
2020-07-27 19:35                                     ` Greg KH
2020-08-06  5:46                                       ` Hugh Dickins
2020-08-06  5:46                                         ` Hugh Dickins
2020-08-18 13:50                                         ` Greg KH
2020-08-06  5:21                                     ` Hugh Dickins
2020-08-06  5:21                                       ` Hugh Dickins
2020-08-06 17:07                                       ` Linus Torvalds
2020-08-06 17:07                                         ` Linus Torvalds
2020-08-06 18:00                                         ` Matthew Wilcox
2020-08-06 18:32                                           ` Linus Torvalds
2020-08-06 18:32                                             ` Linus Torvalds
2020-08-07 18:41                                             ` Hugh Dickins
2020-08-07 18:41                                               ` Hugh Dickins
2020-08-07 19:07                                               ` Linus Torvalds
2020-08-07 19:07                                                 ` Linus Torvalds
2020-08-07 19:35                                               ` Matthew Wilcox
2020-08-03 13:14                           ` Michal Hocko [this message]
2020-08-03 17:56                             ` Linus Torvalds
2020-08-03 17:56                               ` Linus Torvalds
2020-07-25  9:39                         ` Oleg Nesterov
2020-07-23  8:03     ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200803131453.GL5174@dhcp22.suse.cz \
    --to=mhocko@suse.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=oleg@redhat.com \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.