From: Vlastimil Babka <vbabka@suse.cz>
To: Mel Gorman <mgorman@techsingularity.net>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Rik van Riel <riel@redhat.com>, Michal Hocko <mhocko@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
Date: Mon, 10 Apr 2017 12:03:20 +0200
Message-ID: <e3a97c78-0016-318c-27b7-e6fc3a76c5a6@suse.cz>
In-Reply-To: <20170410094825.2yfo5zehn7pchg6a@techsingularity.net>

On 04/10/2017 11:48 AM, Mel Gorman wrote:
> A user reported a bug against a distribution kernel while running
> a proprietary workload described as "memory intensive that is not
> swapping". The bug is expected to apply to mainline kernels. The workload
> reads, writes and modifies ranges of memory and checks the contents. They
> reported that within a few hours a bad PMD would be reported, followed
> by memory corruption where the expected data was all zeros.  A partial report
> of the bad PMD looked like
> 
> [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> [ 5195.341184] ------------[ cut here ]------------
> [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> ....
> [ 5195.410033] Call Trace:
> [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
> [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> 
> Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> was a false detection. The bug does not trigger if automatic NUMA balancing
> or transparent huge pages is disabled.
> 
> The bug is due to a race in change_pmd_range between a pmd_trans_huge and
> pmd_none_or_clear_bad check without any locks held. During the pmd_trans_huge
> check, a parallel protection update under lock can have cleared the PMD
> and filled it with a prot_numa entry between the transhuge check and the
> pmd_none_or_clear_bad check.
> 
> While this could be fixed with heavy locking, it's only necessary to
> make a copy of the PMD on the stack during change_pmd_range and avoid
> races. A new helper is created for this as the check is quite subtle and the
> existing similar helper is not suitable. This passed 154 hours of testing
> (usually triggers between 20 minutes and 24 hours) without detecting bad
> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> no significant change in behaviour.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: stable@vger.kernel.org

It would be better if there was a Fixes: tag, or at least a version hint. I
assume the bug has been present since automatic NUMA balancing was merged?

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/asm-generic/pgtable.h | 25 +++++++++++++++++++++++++
>  mm/mprotect.c                 | 12 ++++++++++--
>  2 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 1fad160f35de..597fa482cd4a 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -819,6 +819,31 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>  }
>  
>  /*
> + * Used when setting automatic NUMA hinting protection where it is
> + * critical that a numa hinting PMD is not confused with a bad PMD.
> + */
> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> +{
> +	pmd_t pmdval = pmd_read_atomic(pmd);
> +
> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	barrier();
> +#endif
> +
> +	if (pmd_none(pmdval))
> +		return 1;
> +	if (pmd_trans_huge(pmdval))
> +		return 0;
> +	if (unlikely(pmd_bad(pmdval))) {
> +		pmd_clear_bad(pmd);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +
> +/*
>   * This is a noop if Transparent Hugepage Support is not built into
>   * the kernel. Otherwise it is equivalent to
>   * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..821ff2904cdb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -150,8 +150,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		unsigned long this_pages;
>  
>  		next = pmd_addr_end(addr, end);
> -		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> -				&& pmd_none_or_clear_bad(pmd))
> +
> +		/*
> +		 * Automatic NUMA balancing walks the tables with mmap_sem
> +		 * held for read. It's possible for a parallel update
> +		 * to occur between pmd_trans_huge and a pmd_none_or_clear_bad
> +		 * check leading to a false positive and clearing. Hence, it's
> +		 * necessary to atomically read the PMD value for all the
> +		 * checks.
> +		 */
> +		if (!pmd_devmap(*pmd) && pmd_none_or_clear_bad_unless_trans_huge(pmd))
>  			continue;
>  
>  		/* invoke the mmu notifier if the pmd is populated */
> 
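
To make the window described in the changelog concrete, here is a minimal
userspace sketch. It is not kernel code and only models the idea: model_pmd_t,
the MODEL_* flag values and both model_*_check() helpers are invented for
illustration, the "bad" test is a simplifying assumption (anything that is not
a plain table pointer), and pmd_devmap is left out. The two reads done by the
old check are passed in explicitly so the interleaving is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pmd_t; all flag values below are invented. */
typedef uint64_t model_pmd_t;

#define MODEL_PRESENT  UINT64_C(0x1)
#define MODEL_TABLE    UINT64_C(0x2)	/* entry points to a page table */
#define MODEL_HUGE     UINT64_C(0x4)	/* entry maps a huge page */
#define MODEL_PROTNUMA UINT64_C(0x8)	/* NUMA hinting protection */
#define MODEL_FLAGS    UINT64_C(0xf)

static inline bool model_none(model_pmd_t v)
{
	return v == 0;
}

static inline bool model_trans_huge(model_pmd_t v)
{
	return (v & MODEL_HUGE) != 0;
}

/* Assumption: anything that is not a plain table pointer looks "bad". */
static inline bool model_bad(model_pmd_t v)
{
	return (v & MODEL_FLAGS) != (MODEL_PRESENT | MODEL_TABLE);
}

/*
 * Old logic: the huge check and the none/bad check each read the entry,
 * so they may see different values. Returns 1 if the entry is skipped.
 */
static inline int model_old_check(model_pmd_t first_read,
				  model_pmd_t second_read,
				  bool *cleared_as_bad)
{
	*cleared_as_bad = false;
	if (model_trans_huge(first_read))
		return 0;
	if (model_none(second_read))
		return 1;
	if (model_bad(second_read)) {
		*cleared_as_bad = true;	/* stands in for pmd_clear_bad() */
		return 1;
	}
	return 0;
}

/* New logic: every check is made against one snapshot of the entry. */
static inline int model_new_check(model_pmd_t snapshot, bool *cleared_as_bad)
{
	*cleared_as_bad = false;
	if (model_none(snapshot))
		return 1;
	if (model_trans_huge(snapshot))
		return 0;
	if (model_bad(snapshot)) {
		*cleared_as_bad = true;
		return 1;
	}
	return 0;
}
```

In this model, if the first read sees a cleared entry and the second sees
MODEL_HUGE | MODEL_PROTNUMA, model_old_check() flags a valid hinting entry as
bad. model_new_check() sees either the cleared value or the huge value, never
a mix of the two, so neither snapshot is flagged.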

