linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
@ 2020-02-16 19:18 Rafael Aquini
  2020-02-16 23:32 ` Mel Gorman
  2020-03-07  2:40 ` Qian Cai
  0 siblings, 2 replies; 24+ messages in thread
From: Rafael Aquini @ 2020-02-16 19:18 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, mgorman, akpm, mhocko, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
  A user reported a bug against a distribution kernel while running
  a proprietary workload described as "memory intensive that is not
  swapping" that is expected to apply to mainline kernels. The workload
  is read/write/modifying ranges of memory and checking the contents. They
  reported that within a few hours that a bad PMD would be reported followed
  by a memory corruption where expected data was all zeros.  A partial report
  of the bad PMD looked like

  [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
  [ 5195.341184] ------------[ cut here ]------------
  [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
  ....
  [ 5195.410033] Call Trace:
  [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
  [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
  [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
  [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
  [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
  [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
  [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10

  Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
  was a false detection. The bug does not trigger if automatic NUMA balancing
  or transparent huge pages is disabled.

  The bug is due a race in change_pmd_range between a pmd_trans_huge and
  pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
  check, a parallel protection update under lock can have cleared the PMD
  and filled it with a prot_numa entry between the transhuge check and the
  pmd_none_or_clear_bad check.

  While this could be fixed with heavy locking, it's only necessary to
  make a copy of the PMD on the stack during change_pmd_range and avoid
  races. A new helper is created for this as the check if quite subtle and the
  existing similar helpful is not suitable. This passed 154 hours of testing
  (usually triggers between 20 minutes and 24 hours) without detecting bad
  PMDs or corruption. A basic test of an autonuma-intensive workload showed
  no significant change in behaviour.

Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922
the race window aforementioned is still open, and we have reports of Linpack test reporting bad
residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and
non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD.

 host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
 ....
 host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
 host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096

The issue is observed on a v4.18-based distribution kernel, but the race window is
expected to be applicable to mainline kernels, as well.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: stable@vger.kernel.org
Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7a8e84f86831..9ea8cc0ab2fd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return pages;
 }
 
+/*
+ * Used when setting automatic NUMA hinting protection where it is
+ * critical that a numa hinting PMD is not confused with a bad PMD.
+ */
+static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
+{
+	pmd_t pmdval = pmd_read_atomic(pmd);
+
+	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	barrier();
+#endif
+
+	if (pmd_none(pmdval))
+		return 1;
+	if (pmd_trans_huge(pmdval))
+		return 0;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		return 1;
+	}
+
+	return 0;
+}
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
@@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
-		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
-				&& pmd_none_or_clear_bad(pmd))
+
+		/*
+		 * Automatic NUMA balancing walks the tables with mmap_sem
+		 * held for read. It's possible a parallel update to occur
+		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
+		 * check leading to a false positive and clearing.
+		 * Hence, it's ecessary to atomically read the PMD value
+		 * for all the checks.
+		 */
+		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
+		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			goto next;
 
 		/* invoke the mmu notifier if the pmd is populated */
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread
* [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
@ 2017-04-10  9:48 Mel Gorman
  2017-04-10 10:03 ` Vlastimil Babka
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Mel Gorman @ 2017-04-10  9:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Rik van Riel, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

A user reported a bug against a distribution kernel while running
a proprietary workload described as "memory intensive that is not
swapping" that is expected to apply to mainline kernels. The workload
is read/write/modifying ranges of memory and checking the contents. They
reported that within a few hours that a bad PMD would be reported followed
by a memory corruption where expected data was all zeros.  A partial report
of the bad PMD looked like

[ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
[ 5195.341184] ------------[ cut here ]------------
[ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
....
[ 5195.410033] Call Trace:
[ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
[ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
[ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
[ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
[ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
[ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
[ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10

Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
was a false detection. The bug does not trigger if automatic NUMA balancing
or transparent huge pages is disabled.

The bug is due a race in change_pmd_range between a pmd_trans_huge and
pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
check, a parallel protection update under lock can have cleared the PMD
and filled it with a prot_numa entry between the transhuge check and the
pmd_none_or_clear_bad check.

While this could be fixed with heavy locking, it's only necessary to
make a copy of the PMD on the stack during change_pmd_range and avoid
races. A new helper is created for this as the check if quite subtle and the
existing similar helpful is not suitable. This passed 154 hours of testing
(usually triggers between 20 minutes and 24 hours) without detecting bad
PMDs or corruption. A basic test of an autonuma-intensive workload showed
no significant change in behaviour.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: stable@vger.kernel.org
---
 include/asm-generic/pgtable.h | 25 +++++++++++++++++++++++++
 mm/mprotect.c                 | 12 ++++++++++--
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 1fad160f35de..597fa482cd4a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -819,6 +819,31 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 }
 
 /*
+ * Used when setting automatic NUMA hinting protection where it is
+ * critical that a numa hinting PMD is not confused with a bad PMD.
+ */
+static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
+{
+	pmd_t pmdval = pmd_read_atomic(pmd);
+
+	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	barrier();
+#endif
+
+	if (pmd_none(pmdval))
+		return 1;
+	if (pmd_trans_huge(pmdval))
+		return 0;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		return 1;
+	}
+	return 0;
+}
+
+
+/*
  * This is a noop if Transparent Hugepage Support is not built into
  * the kernel. Otherwise it is equivalent to
  * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..821ff2904cdb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -150,8 +150,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
-		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
-				&& pmd_none_or_clear_bad(pmd))
+
+		/*
+		 * Automatic NUMA balancing walks the tables with mmap_sem
+		 * held for read. It's possible a parallel update
+		 * to occur between pmd_trans_huge and a pmd_none_or_clear_bad
+		 * check leading to a false positive and clearing. Hence, it's
+		 * necessary to atomically read the PMD value for all the
+		 * checks.
+		 */
+		if (!pmd_devmap(*pmd) && pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			continue;
 
 		/* invoke the mmu notifier if the pmd is populated */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-03-11  0:04 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-16 19:18 [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa Rafael Aquini
2020-02-16 23:32 ` Mel Gorman
2020-03-07  2:40 ` Qian Cai
2020-03-07  3:05   ` Rafael Aquini
2020-03-08  3:20     ` Qian Cai
2020-03-08 23:14       ` Rafael Aquini
2020-03-09  3:27         ` Qian Cai
2020-03-09 15:05           ` Rafael Aquini
2020-03-11  0:04             ` Qian Cai
  -- strict thread matches above, loose matches on Subject: below --
2017-04-10  9:48 [PATCH] mm, numa: Fix " Mel Gorman
2017-04-10 10:03 ` Vlastimil Babka
2017-04-10 12:19   ` Mel Gorman
2017-04-10 12:38 ` Rik van Riel
2017-04-10 13:53 ` Michal Hocko
2017-04-10 17:38   ` Mel Gorman
2017-04-10 16:45 ` Zi Yan
2017-04-10 17:20   ` Mel Gorman
2017-04-10 17:49     ` Zi Yan
2017-04-10 18:07       ` Mel Gorman
2017-04-10 22:09         ` Andrew Morton
2017-04-10 22:28           ` Zi Yan
2017-04-11  6:35             ` Vlastimil Babka
2017-04-11 21:44               ` Andrew Morton
2017-04-11  8:29           ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).