* [PATCH] mm: read_pmd_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
@ 2012-05-17 14:13 Andrea Arcangeli
  2012-05-17 19:35 ` Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: Andrea Arcangeli @ 2012-05-17 14:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Larry Woodman,
	Petr Matousek, Ulrich Obergfell, Rik van Riel

When holding the mmap_sem for reading, pte_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer;
otherwise we may read only half of it, leading to this crash.

PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
 #0 [f06a9dd8] crash_kexec at c049b5ec
 #1 [f06a9e2c] oops_end at c083d1c2
 #2 [f06a9e40] no_context at c0433ded
 #3 [f06a9e64] bad_area_nosemaphore at c043401a
 #4 [f06a9e6c] __do_page_fault at c0434493
 #5 [f06a9eec] do_page_fault at c083eb45
 #6 [f06a9f04] error_code (via page_fault) at c083c5d5
    EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
    00000000
    DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
    CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
 #7 [f06a9f38] _spin_lock at c083bc14
 #8 [f06a9f44] sys_mincore at c0507b7d
 #9 [f06a9fb0] system_call at c083becd
                         start           len
    EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
    SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
    CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286

This is likely a longstanding bug affecting x86 32bit PAE without
THP. Only archs with a 64bit large pmd_t and a 32bit unsigned long
should be affected.

With THP enabled, the barrier() in
pmd_none_or_trans_huge_or_clear_bad() would partly hide the bug when
the pmd transitions from none to stable, by forcing a re-read of the
*pmd in pte_offset_map_lock; but with THP enabled a new set of
problems arises, because the pmd could then transition freely between
the none, pmd_trans_huge and pmd_trans_stable states. So making the
barrier in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't a
good idea and would be a flaky solution.

This is fully fixed by introducing read_pmd_atomic, which reads the
two pmd halves in order with THP disabled, and reads the pmd
atomically with cmpxchg8b with THP enabled.

Luckily this new race condition can only trigger in the places that
must already be covered by pmd_none_or_trans_huge_or_clear_bad(), so
the fix is localized there; but note that the bug itself is not
related to THP.

NOTE: this can only trigger on x86 32bit systems with PAE enabled and
more than 4G of RAM; otherwise the high part of the pmd would be zero
at all times and never risk being truncated, which in turn hides the
SMP race.

This bug was discovered and fully debugged by Ulrich, quote:

----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.

    496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
    *pmd)
    497 {
    498         /* depend on compiler for an atomic pmd read */
    499         pmd_t pmdval = *pmd;

                                // edi = pmd pointer
0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
...
                                // edx = PTE page table high address
0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
...
                                // eax = PTE page table low address
0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax

[..]

Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.

-  The PMD entry {high|low} is 0x0000000000000000.
   The "mov" at 0xc0507a84 loads 0x00000000 into edx.

-  A page fault (on another CPU) sneaks in between the two "mov"
   instructions and instantiates the PMD.

-  The PMD entry {high|low} is now 0x00000003fda38067.
   The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable-3level.h |   50 +++++++++++++++++++++++++++++++++
 include/asm-generic/pgtable.h         |   22 +++++++++++++-
 2 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index effff47..15a8ea3 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -31,6 +31,56 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
 	ptep->pte_low = pte.pte_low;
 }
 
+#define  __HAVE_ARCH_READ_PMD_ATOMIC
+/*
+ * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
+ * a plain "*pmdp" dereference generated by gcc. The problem is that in
+ * certain places where pte_offset_map_lock is called, concurrent page
+ * faults are allowed if the mmap_sem is held for reading. An example
+ * is mincore vs page faults vs MADV_DONTNEED. On the page fault side
+ * pmd_populate rightfully does a set_64bit, but if we're reading the
+ * pmd_t with a plain "*pmdp" on the mincore side, an SMP race can
+ * happen because gcc will not read the 64bit pmd atomically. To fix
+ * this, all places running pte_offset_map_lock() while holding the
+ * mmap_sem in read mode shall read the pmdp pointer using this
+ * function, to know whether the pmd is null or not, and in turn to
+ * know whether they can run pte_offset_map_lock, pmd_trans_huge or
+ * other pmd operations.
+ *
+ * Without THP, if the mmap_sem is held for reading, the pmd can only
+ * transition from null to not null while read_pmd_atomic runs, so
+ * there's no need to literally read it atomically.
+ *
+ * With THP, if the mmap_sem is held for reading, the pmd can become
+ * THP or null or point to a pte (and in turn become "stable") at any
+ * time under read_pmd_atomic, so it's mandatory to read it atomically
+ * with cmpxchg8b.
+ */
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+static inline pmd_t read_pmd_atomic(pmd_t *pmdp)
+{
+	pmdval_t ret;
+	u32 *tmp = (u32 *)pmdp;
+
+	ret = (pmdval_t) (*tmp);
+	if (ret) {
+		/*
+		 * If the low part is null, we must not read the high part
+		 * or we can end up with a partial pmd.
+		 */
+		smp_rmb();
+		ret |= ((pmdval_t)*(tmp + 1)) << 32;
+	}
+
+	return __pmd(ret);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline pmd_t read_pmd_atomic(pmd_t *pmdp)
+{
+	return __pmd(atomic64_read((atomic64_t *)pmdp));
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
 	set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 125c54e..0d8ee48 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -446,6 +446,18 @@ static inline int pmd_write(pmd_t pmd)
 #endif /* __HAVE_ARCH_PMD_WRITE */
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifndef  __HAVE_ARCH_READ_PMD_ATOMIC
+static inline pmd_t read_pmd_atomic(pmd_t *pmdp)
+{
+	/*
+	 * Depend on compiler for an atomic pmd read. NOTE: this is
+	 * only going to work if the pmdval_t isn't larger than
+	 * an unsigned long.
+	 */
+	return *pmdp;
+}
+#endif /* __HAVE_ARCH_READ_PMD_ATOMIC */
+
 /*
  * This function is meant to be used by sites walking pagetables with
  * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
@@ -459,11 +471,17 @@ static inline int pmd_write(pmd_t pmd)
  * undefined so behaving like if the pmd was none is safe (because it
  * can return none anyway). The compiler level barrier() is critically
  * important to compute the two checks atomically on the same pmdval.
+ *
+ * For 32bit kernels with a 64bit large pmd_t this automatically takes
+ * care of reading the pmd atomically to avoid SMP race conditions
+ * against pmd_populate() when the mmap_sem is held for reading by the
+ * caller (a special atomic read, not generated by gcc as in the
+ * generic version above, is needed even when THP is disabled, because
+ * the page fault can populate the pmd from under us).
  */
 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 {
-	/* depend on compiler for an atomic pmd read */
-	pmd_t pmdval = *pmd;
+	pmd_t pmdval = read_pmd_atomic(pmd);
 	/*
 	 * The barrier will stabilize the pmdval in a register or on
 	 * the stack so that it will stop changing under the code.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH] mm: read_pmd_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
  2012-05-17 14:13 [PATCH] mm: read_pmd_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition Andrea Arcangeli
@ 2012-05-17 19:35 ` Andrew Morton
  2012-05-18 23:00   ` Andrea Arcangeli
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2012-05-17 19:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Larry Woodman, Petr Matousek,
	Ulrich Obergfell, Rik van Riel

On Thu, 17 May 2012 16:13:56 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> When holding the mmap_sem for reading, pmd_offset_map_lock should only
> run on a pmd_t that has been read atomically from the pmdp
> pointer, otherwise we may read only half of it leading to this crash.

Do you think this is serious enough to warrant backporting the fix into
-stable?  The patch is pretty simple..

>
> ...
>
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -31,6 +31,56 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
>  	ptep->pte_low = pte.pte_low;
>  }
>  
> +#define  __HAVE_ARCH_READ_PMD_ATOMIC

A couple of nits:

- read_pmd_atomic() should be called pmd_read_atomic() - check out
  "grep pmd include/asm-generic/pgtable.h".

- A somewhat neat convention we use is to do

	static inline void foo(...)
	{
		...
	}
	#define foo foo

  so then other code can do

	#ifndef foo
	...
	#endif

  This avoids having to create (and remember!) a second identifier.

> +/*
> + * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
> + * a "*pmdp" dereference done by gcc.

I spent some time trying to find exactly where pte_offset_map_lock does
this dereference then gave up, because it shouldn't have been this
hard!

Can we be specific here, so that others can more easily work out what's
going on?

> Problem is, in certain places
> + * where pte_offset_map_lock is called, concurrent page faults are
> + * allowed, if the mmap_sem is hold for reading. An example is mincore
> + * vs page faults vs MADV_DONTNEED. On the page fault side
> + * pmd_populate rightfully does a set_64bit, but if we're reading the
> + * pmd_t with a "*pmdp" on the mincore side, a SMP race can happen
> + * because gcc will not read the 64bit of the pmd atomically. To fix
> + * this all places running pmd_offset_map_lock() while holding the
> + * mmap_sem in read mode, shall read the pmdp pointer using this
> + * function to know if the pmd is null nor not, and in turn to know if
> + * they can run pmd_offset_map_lock or pmd_trans_huge or other pmd
> + * operations.
> + *
> + * Without THP if the mmap_sem is hold for reading, the
> + * pmd can only transition from null to not null while read_pmd_atomic runs.
> + * So there's no need of literally reading it atomically.
> + *
> + * With THP if the mmap_sem is hold for reading, the pmd can become
> + * THP or null or point to a pte (and in turn become "stable") at any
> + * time under read_pmd_atomic, so it's mandatory to read it atomically
> + * with cmpxchg8b.

This all seems terribly subtle and fragile.

> + */
> +#ifndef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline pmd_t read_pmd_atomic(pmd_t *pmdp)
> +{
> +	pmdval_t ret;
> +	u32 *tmp = (u32 *)pmdp;
> +
> +	ret = (pmdval_t) (*tmp);
> +	if (ret) {
> +		/*
> +		 * If the low part is null, we must not read the high part
> +		 * or we can end up with a partial pmd.

This is the core part of the fix, and I don't understand it :( What is
the significance of the zeroness of the lower half of the pmdval_t? 
How, exactly, does this prevent races?

At a guess, I'd say that we're making three assumptions here:

a) that gcc will write lower-word-first when doing a 64-bit write

b) that a valid pmdval_t never has all zeroes in the lower 32 bits and

c) that any code which this function is racing against will only
   ever be writing to a pmdval_t which has the all-zeroes pattern.  ie:
   that we never can race against code which is modifying an existing
   pmdval_t.

Can we spell out and justify all the assumptions here?


* Re: [PATCH] mm: read_pmd_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
  2012-05-17 19:35 ` Andrew Morton
@ 2012-05-18 23:00   ` Andrea Arcangeli
  2012-05-23 23:39     ` [PATCH] mm: pmd_read_atomic: " Andrea Arcangeli
  0 siblings, 1 reply; 4+ messages in thread
From: Andrea Arcangeli @ 2012-05-18 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Larry Woodman, Petr Matousek,
	Ulrich Obergfell, Rik van Riel

Hi Andrew,

On Thu, May 17, 2012 at 12:35:31PM -0700, Andrew Morton wrote:
> On Thu, 17 May 2012 16:13:56 +0200
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > When holding the mmap_sem for reading, pmd_offset_map_lock should only
> > run on a pmd_t that has been read atomically from the pmdp
> > pointer, otherwise we may read only half of it leading to this crash.
> 
> Do you think this is serious enough to warrant backporting the fix into
> -stable?  The patch is pretty simple..

Considering it's simple, it's probably worth backporting; it only
affects 32bit PAE kernels with more than 4G of ram installed.

> A couple of nits:
> 
> - read_pmd_atomic() should be called pmd_read_atomic() - check out
>   "grep pmd include/asm-generic/pgtable.h".
> 

Agreed.

I've also been wondering if _consistent would be better suffix.

> - A somewhat neat convention we use is to do
> 
> 	static inline void foo(...)
> 	{
> 		...
> 	}
> 	#define foo foo
> 
>   so then other code can do
> 
> 	#ifndef foo
> 	...
> 	#endif
> 
>   This avoids having to create (and remember!) a second identifier.

Ok, I assume from your reply that __HAVE_ARCH_WHATEVER is being
deprecated in favour of #define foo foo.

> > +/*
> > + * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
> > + * a "*pmdp" dereference done by gcc.
> 
> I spent some time trying to find exactly where pte_offset_map_lock does
> this dereference then gave up, because it shouldn't have been this
> hard!
> 
> Can we be specific here, so that others can more easily work out what's
> going on?

Sure. The problem has been reproduced in mincore here:

static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, unsigned long end,
			unsigned char *vec)
{
	unsigned long next;
	spinlock_t *ptl;
	pte_t *ptep;

	ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);

[..]
		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
			mincore_unmapped_range(vma, addr, next, vec);
		else
			mincore_pte_range(vma, pmd, addr, next, vec);

or 3.3 version:

                if (pmd_none_or_clear_bad(pmd))
			mincore_unmapped_range(vma, addr, next, vec);
		else
			mincore_pte_range(vma, pmd, addr, next, vec);

So pmd_none_or_clear_bad does:

static inline int pmd_none_or_clear_bad(pmd_t *pmd)
{
	if (pmd_none(*pmd))
		return 1;
	if (unlikely(pmd_bad(*pmd))) {
		pmd_clear_bad(pmd);
		return 1;
	}
	return 0;
}

With 3.4 there's pmd_none_or_trans_huge_or_clear_bad instead, and it's
identical with THP=n, so let's assume THP=n here; this isn't related
to THP anyway, and it was reproduced with THP=n at build time.

static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
{
	pmd_t pmdval = *pmd;
	/*
	 * The barrier will stabilize the pmdval in a register or on
	 * the stack so that it will stop changing under the code.
	 */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	barrier();
#endif
	if (pmd_none(pmdval))
		return 1;
	if (unlikely(pmd_bad(pmdval))) {
		if (!pmd_trans_huge(pmdval))
			pmd_clear_bad(pmd);
		return 1;
	}
	return 0;
}


This is all inline, so pmd_none_or_... reads *pmd just once and caches
it in a register or on the local stack. mincore_pte_range is static and
used just once, so gcc can decide to inline it too.

So when pte_offset_map_lock runs on top of it, it does:

#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
({							\
	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
	pte_t *__pte = pte_offset_map(pmd, address);	\
	*(ptlp) = __ptl;				\
	spin_lock(__ptl);				\
	__pte;						\
})

which calls pte_offset_map:

#define pte_offset_map(dir, address)					\
	((pte_t *)page_address(pmd_page(*(dir))) + pte_index((address)))

Finally *dir (i.e. *pmd) is dereferenced, and if it picks up the
previous value of *pmd read by pmd_none_or_..., we're screwed, because
that value may have the high 32bits zeroed by mistake.

The bug is: the *pmd _read_ is not atomic. Making that read atomic
fixes it. You also need the bad luck that gcc decides to read the high
part before the low part, and to cache the *pmd read across an
auto-inline.

                                // edx = PTE page table high address
0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
...
                                // eax = PTE page table low address
0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax

The race is against pmd_populate from a page fault. If the pmd is none
and we hold the mmap_sem in read mode (like mincore does), pmd_populate
can run under us (through the page fault).

All writes are already atomic: pmd_populate (or set_pmd_at) will set
the pmd with an atomic 64bit write on 32bit PAE.

The race would be:

    mincore                     page fault
    =======                     =========
    read pmd.high
				pmd_populate 64bit atomic set
    read pmd.low -> non zero

So we end up with a *pmd whose upper 32bits have been truncated by
mistake, and because it's non-zero, it goes ahead and crashes in
pte_offset_map_lock.

If gcc decides to read *pmd twice (first in pmd_none_or_... and then
again in pte_offset_map), the trouble only remains with THP enabled,
because the second read would pick up the correct pmd high bits too.

So this may have been hidden by gcc not caching *pmd aggressively. In
particular, if gcc doesn't inline static functions (or makes no
assumptions about memory clobbered by static functions), it can't
happen.

With THP enabled things are more complicated, because the pmd can also
go back to null at any time, so we need a real 64bit atomic read.
Looping like gup_fast does until pmd.low/high stabilize isn't safe,
because we don't have irqs disabled to stop the TLB flush of
MADV_DONTNEED.

> This all seems terribly subtle and fragile.

To make it simpler, and in turn more robust, we could just do an
atomic read for THP=n too. But that would require a cmpxchg8b for
every *pmd read (one every 2M).

> > + */
> > +#ifndef CONFIG_TRANSPARENT_HUGEPAGE
> > +static inline pmd_t read_pmd_atomic(pmd_t *pmdp)
> > +{
> > +	pmdval_t ret;
> > +	u32 *tmp = (u32 *)pmdp;
> > +
> > +	ret = (pmdval_t) (*tmp);
> > +	if (ret) {
> > +		/*
> > +		 * If the low part is null, we must not read the high part
> > +		 * or we can end up with a partial pmd.
> 
> This is the core part of the fix, and I don't understand it :( What is
> the significance of the zeroness of the lower half of the pmdval_t? 
> How, exactly, does this prevent races?

The only transition we have to protect against in the above case is a
pmd going from pmd_none to !pmd_none.

So the idea is: read the low part of the pmd first, and if it's zero
don't even bother reading the high part (everything holding the
mmap_sem in read mode is an inaccurate walk anyway; if the high part
wasn't zero we hit the race, but we don't care).

If instead the low part of the pmd is not zero, we issue a memory
barrier and then read the high part. We're guaranteed the high part is
correct if we read it after pmd.low, because pmd_populate writes it
with a single atomic 64bit store. And with THP off, holding the
mmap_sem prevents a pmd from going away or changing from under us
after it has been established (pagetables are released or modified
only with the mmap_sem held in write mode).

> At a guess, I'd say that we're making three assumptions here:
> 
> a) that gcc will write lower-word-first when doing a 64-bit write

The assumption is that gcc will read pmd.high before pmd.low, and that
it will also cache the *pmd read.

All writes to pmds are already 64bit atomic.

> b) that a valid pmdval_t never has all zeroes in the lower 32 bits and

Correct.

> 
> c) that any code which this function is racing against will only
>    ever be writing to a pmdval_t which has the all-zeroes pattern.  ie:
>    that we never can race against code which is modifying an existing
>    pmdval_t.

Correct.

I'll resend with the cleanups if you agree.

Andrea


* [PATCH] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
  2012-05-18 23:00   ` Andrea Arcangeli
@ 2012-05-23 23:39     ` Andrea Arcangeli
  0 siblings, 0 replies; 4+ messages in thread
From: Andrea Arcangeli @ 2012-05-23 23:39 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Larry Woodman,
	Petr Matousek, Ulrich Obergfell, Rik van Riel

When holding the mmap_sem for reading, pte_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer;
otherwise we may read only half of it, leading to this crash.

PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
 #0 [f06a9dd8] crash_kexec at c049b5ec
 #1 [f06a9e2c] oops_end at c083d1c2
 #2 [f06a9e40] no_context at c0433ded
 #3 [f06a9e64] bad_area_nosemaphore at c043401a
 #4 [f06a9e6c] __do_page_fault at c0434493
 #5 [f06a9eec] do_page_fault at c083eb45
 #6 [f06a9f04] error_code (via page_fault) at c083c5d5
    EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
    00000000
    DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
    CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
 #7 [f06a9f38] _spin_lock at c083bc14
 #8 [f06a9f44] sys_mincore at c0507b7d
 #9 [f06a9fb0] system_call at c083becd
                         start           len
    EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
    SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
    CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286

This is likely a longstanding bug affecting x86 32bit PAE without
THP. Only archs with a 64bit large pmd_t and a 32bit unsigned long
should be affected.

With THP enabled, the barrier() in
pmd_none_or_trans_huge_or_clear_bad() would partly hide the bug when
the pmd transitions from none to stable, by forcing a re-read of the
*pmd in pte_offset_map_lock; but with THP enabled a new set of
problems arises, because the pmd could then transition freely between
the none, pmd_trans_huge and pmd_trans_stable states. So making the
barrier in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't a
good idea and would be a flaky solution.

This is fully fixed by introducing pmd_read_atomic, which reads the
two pmd halves in order with THP disabled, and reads the pmd
atomically with cmpxchg8b with THP enabled.

Luckily this new race condition can only trigger in the places that
must already be covered by pmd_none_or_trans_huge_or_clear_bad(), so
the fix is localized there; but note that the bug itself is not
related to THP.

NOTE: this can only trigger on x86 32bit systems with PAE enabled and
more than 4G of RAM; otherwise the high part of the pmd would be zero
at all times and never risk being truncated, which in turn hides the
SMP race.

This bug was discovered and fully debugged by Ulrich, quote:

----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.

    496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
    *pmd)
    497 {
    498         /* depend on compiler for an atomic pmd read */
    499         pmd_t pmdval = *pmd;

                                // edi = pmd pointer
0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
...
                                // edx = PTE page table high address
0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
...
                                // eax = PTE page table low address
0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax

[..]

Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.

-  The PMD entry {high|low} is 0x0000000000000000.
   The "mov" at 0xc0507a84 loads 0x00000000 into edx.

-  A page fault (on another CPU) sneaks in between the two "mov"
   instructions and instantiates the PMD.

-  The PMD entry {high|low} is now 0x00000003fda38067.
   The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable-3level.h |   50 +++++++++++++++++++++++++++++++++
 include/asm-generic/pgtable.h         |   22 +++++++++++++-
 2 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index effff47..43876f1 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -31,6 +31,56 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
 	ptep->pte_low = pte.pte_low;
 }
 
+#define pmd_read_atomic pmd_read_atomic
+/*
+ * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
+ * a plain "*pmdp" dereference generated by gcc. The problem is that in
+ * certain places where pte_offset_map_lock is called, concurrent page
+ * faults are allowed if the mmap_sem is held for reading. An example
+ * is mincore vs page faults vs MADV_DONTNEED. On the page fault side
+ * pmd_populate rightfully does a set_64bit, but if we're reading the
+ * pmd_t with a plain "*pmdp" on the mincore side, an SMP race can
+ * happen because gcc will not read the 64bit pmd atomically. To fix
+ * this, all places running pte_offset_map_lock() while holding the
+ * mmap_sem in read mode shall read the pmdp pointer using this
+ * function, to know whether the pmd is null or not, and in turn to
+ * know whether they can run pte_offset_map_lock, pmd_trans_huge or
+ * other pmd operations.
+ *
+ * Without THP, if the mmap_sem is held for reading, the pmd can only
+ * transition from null to not null while pmd_read_atomic runs, so
+ * there's no need to literally read it atomically.
+ *
+ * With THP, if the mmap_sem is held for reading, the pmd can become
+ * THP or null or point to a pte (and in turn become "stable") at any
+ * time under pmd_read_atomic, so it's mandatory to read it atomically
+ * with cmpxchg8b.
+ */
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
+{
+	pmdval_t ret;
+	u32 *tmp = (u32 *)pmdp;
+
+	ret = (pmdval_t) (*tmp);
+	if (ret) {
+		/*
+		 * If the low part is null, we must not read the high part
+		 * or we can end up with a partial pmd.
+		 */
+		smp_rmb();
+		ret |= ((pmdval_t)*(tmp + 1)) << 32;
+	}
+
+	return (pmd_t) { ret };
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
+{
+	return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
 	set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 125c54e..fa596d9 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -446,6 +446,18 @@ static inline int pmd_write(pmd_t pmd)
 #endif /* __HAVE_ARCH_PMD_WRITE */
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifndef pmd_read_atomic
+static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
+{
+	/*
+	 * Depend on compiler for an atomic pmd read. NOTE: this is
+	 * only going to work if the pmdval_t isn't larger than
+	 * an unsigned long.
+	 */
+	return *pmdp;
+}
+#endif
+
 /*
  * This function is meant to be used by sites walking pagetables with
  * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
@@ -459,11 +471,17 @@ static inline int pmd_write(pmd_t pmd)
  * undefined so behaving like if the pmd was none is safe (because it
  * can return none anyway). The compiler level barrier() is critically
  * important to compute the two checks atomically on the same pmdval.
+ *
+ * For 32bit kernels with a 64bit large pmd_t this automatically takes
+ * care of reading the pmd atomically to avoid SMP race conditions
+ * against pmd_populate() when the mmap_sem is held for reading by the
+ * caller (a special atomic read, not generated by gcc as in the
+ * generic version above, is needed even when THP is disabled, because
+ * the page fault can populate the pmd from under us).
  */
 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 {
-	/* depend on compiler for an atomic pmd read */
-	pmd_t pmdval = *pmd;
+	pmd_t pmdval = pmd_read_atomic(pmd);
 	/*
 	 * The barrier will stabilize the pmdval in a register or on
 	 * the stack so that it will stop changing under the code.

