LKML Archive on lore.kernel.org
* [PATCH v2 0/2] x86/mm/64: vmalloc pgd synchronization cleanups/fixes
@ 2018-01-25 21:12 Andy Lutomirski
  2018-01-25 21:12 ` [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems Andy Lutomirski
  2018-01-25 21:12 ` [PATCH v2 2/2] x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels Andy Lutomirski
  0 siblings, 2 replies; 12+ messages in thread
From: Andy Lutomirski @ 2018-01-25 21:12 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Dave Hansen, X86 ML, Borislav Petkov
  Cc: Neil Berrington, LKML, Andy Lutomirski

Hi all-

Patch 1 is a regression fix and should go to Linus and -stable.  (Not
necessarily x86/pti.  It's needed in 4.14, but if anyone backports
real PTI earlier than 4.14, this patch will *not* be needed.  The
regression doesn't really have anything to do with PTI.)

Patch 2 should probably go to normal -tip or even just wait for
Konstantin's ack.

Andy Lutomirski (2):
  x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level
    systems
  x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels

 arch/x86/mm/fault.c | 22 +++++++++-------------
 arch/x86/mm/tlb.c   | 34 +++++++++++++++++++++++++++++-----
 2 files changed, 38 insertions(+), 18 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
  2018-01-25 21:12 [PATCH v2 0/2] x86/mm/64: vmalloc pgd synchronization cleanups/fixes Andy Lutomirski
@ 2018-01-25 21:12 ` Andy Lutomirski
  2018-01-25 21:49   ` Dave Hansen
                     ` (2 more replies)
  2018-01-25 21:12 ` [PATCH v2 2/2] x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels Andy Lutomirski
  1 sibling, 3 replies; 12+ messages in thread
From: Andy Lutomirski @ 2018-01-25 21:12 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Dave Hansen, X86 ML, Borislav Petkov
  Cc: Neil Berrington, LKML, Andy Lutomirski, stable

Neil Berrington reported a double-fault on a VM with 768GB of RAM that
uses large amounts of vmalloc space with PTI enabled.

The cause is that load_new_mm_cr3() was never fixed to take the
5-level pgd folding code into account, so, on a 4-level kernel, the
pgd synchronization logic compiles away to exactly nothing.

Interestingly, the problem doesn't trigger with nopti.  I assume this
is because the kernel is mapped with global pages if we boot with
nopti.  The sequence of operations when we create a new task is that
we first load its mm while still running on the old stack (which
crashes if the old stack is unmapped in the new mm unless the TLB
saves us), then we call prepare_switch_to(), and then we switch to the
new stack.  prepare_switch_to() pokes the new stack directly, which
will populate the mapping through vmalloc_fault().  I assume that
we're getting lucky on non-PTI systems -- the old stack's TLB entry
stays alive long enough to make it all the way through
prepare_switch_to() and switch_to() so that we make it to a valid
stack.

Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Cc: stable@vger.kernel.org
Reported-and-tested-by: Neil Berrington <neil.berrington@datacore.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a1561957dccb..5bfe61a5e8e3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
+		if (unlikely(pgd_none(*pgd))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+
+			set_pgd(pgd, *pgd_ref);
+		}
+	} else {
+		/*
+		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
+		 * the p4d.  This compiles to approximately the same code as
+		 * the 5-level case.
+		 */
+		p4d_t *p4d = p4d_offset(pgd, sp);
+
+		if (unlikely(p4d_none(*p4d))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
+
+			set_p4d(p4d, *p4d_ref);
+		}
+	}
+}
+
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
@@ -226,11 +254,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			 * mapped in the new pgd, we'll double-fault.  Forcibly
 			 * map it.
 			 */
-			unsigned int index = pgd_index(current_stack_pointer);
-			pgd_t *pgd = next->pgd + index;
-
-			if (unlikely(pgd_none(*pgd)))
-				set_pgd(pgd, init_mm.pgd[index]);
+			sync_current_stack_to_mm(next);
 		}
 
 		/* Stop remote flushes for the previous mm */
-- 
2.14.3


* [PATCH v2 2/2] x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels
  2018-01-25 21:12 [PATCH v2 0/2] x86/mm/64: vmalloc pgd synchronization cleanups/fixes Andy Lutomirski
  2018-01-25 21:12 ` [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems Andy Lutomirski
@ 2018-01-25 21:12 ` Andy Lutomirski
  2018-01-26 15:07   ` [tip:x86/urgent] " tip-bot for Andy Lutomirski
  1 sibling, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2018-01-25 21:12 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Dave Hansen, X86 ML, Borislav Petkov
  Cc: Neil Berrington, LKML, Andy Lutomirski

On a 5-level kernel, if a non-init mm has a top-level entry, it needs
to match init_mm's, but the vmalloc_fault() code skipped over the
BUG_ON() that would have checked it.

While we're at it, get rid of the rather confusing 4-level folded
"pgd" logic.

Cleans-up: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 06fe3d51d385..aaeb3862a5b4 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -438,18 +438,13 @@ static noinline int vmalloc_fault(unsigned long address)
 	if (pgd_none(*pgd_ref))
 		return -1;
 
-	if (pgd_none(*pgd)) {
-		set_pgd(pgd, *pgd_ref);
-		arch_flush_lazy_mmu_mode();
-	} else if (CONFIG_PGTABLE_LEVELS > 4) {
-		/*
-		 * With folded p4d, pgd_none() is always false, so the pgd may
-		 * point to an empty page table entry and pgd_page_vaddr()
-		 * will return garbage.
-		 *
-		 * We will do the correct sanity check on the p4d level.
-		 */
-		BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+	if (CONFIG_PGTABLE_LEVELS > 4) {
+		if (pgd_none(*pgd)) {
+			set_pgd(pgd, *pgd_ref);
+			arch_flush_lazy_mmu_mode();
+		} else {
+			BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+		}
 	}
 
 	/* With 4-level paging, copying happens on the p4d level. */
@@ -458,7 +453,7 @@ static noinline int vmalloc_fault(unsigned long address)
 	if (p4d_none(*p4d_ref))
 		return -1;
 
-	if (p4d_none(*p4d)) {
+	if (p4d_none(*p4d) && CONFIG_PGTABLE_LEVELS == 4) {
 		set_p4d(p4d, *p4d_ref);
 		arch_flush_lazy_mmu_mode();
 	} else {
@@ -469,6 +464,7 @@ static noinline int vmalloc_fault(unsigned long address)
 	 * Below here mismatches are bugs because these lower tables
 	 * are shared:
 	 */
+	BUILD_BUG_ON(CONFIG_PGTABLE_LEVELS < 4);
 
 	pud = pud_offset(p4d, address);
 	pud_ref = pud_offset(p4d_ref, address);
-- 
2.14.3


* Re: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
  2018-01-25 21:12 ` [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems Andy Lutomirski
@ 2018-01-25 21:49   ` Dave Hansen
  2018-01-25 22:00     ` Andy Lutomirski
  2018-01-26 15:06   ` [tip:x86/urgent] " tip-bot for Andy Lutomirski
  2018-01-26 18:51   ` [PATCH v2 1/2] " Kirill A. Shutemov
  2 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2018-01-25 21:49 UTC (permalink / raw)
  To: Andy Lutomirski, Konstantin Khlebnikov, X86 ML, Borislav Petkov
  Cc: Neil Berrington, LKML, stable, Kirill A. Shutemov

On 01/25/2018 01:12 PM, Andy Lutomirski wrote:
> Neil Berrington reported a double-fault on a VM with 768GB of RAM that
> uses large amounts of vmalloc space with PTI enabled.
> 
> The cause is that load_new_mm_cr3() was never fixed to take the
> 5-level pgd folding code into account, so, on a 4-level kernel, the
> pgd synchronization logic compiles away to exactly nothing.

You don't mention it, but we can normally handle vmalloc() faults in the
kernel that are due to unsynchronized page tables.  The thing that kills
us here is that we have an unmapped stack and we try to use that stack
when entering the page fault handler, which double faults.  The double
fault handler gets a new stack and saves us enough to get an oops out.

Right?

> +static void sync_current_stack_to_mm(struct mm_struct *mm)
> +{
> +	unsigned long sp = current_stack_pointer;
> +	pgd_t *pgd = pgd_offset(mm, sp);
> +
> +	if (CONFIG_PGTABLE_LEVELS > 4) {
> +		if (unlikely(pgd_none(*pgd))) {
> +			pgd_t *pgd_ref = pgd_offset_k(sp);
> +
> +			set_pgd(pgd, *pgd_ref);
> +		}
> +	} else {
> +		/*
> +		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
> +		 * the p4d.  This compiles to approximately the same code as
> +		 * the 5-level case.
> +		 */
> +		p4d_t *p4d = p4d_offset(pgd, sp);
> +
> +		if (unlikely(p4d_none(*p4d))) {
> +			pgd_t *pgd_ref = pgd_offset_k(sp);
> +			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
> +
> +			set_p4d(p4d, *p4d_ref);
> +		}
> +	}
> +}

We keep having to add these.  It seems like a real deficiency in the
mechanism that we're using for pgd folding.  Can't we get a warning or
something when we try to do a set_pgd() that's (silently) not doing
anything?  This exact same pattern bit me more than once with the
KPTI/KAISER patches.


* Re: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
  2018-01-25 21:49   ` Dave Hansen
@ 2018-01-25 22:00     ` Andy Lutomirski
  2018-01-26  9:30       ` Ingo Molnar
  2018-01-26 18:54       ` Kirill A. Shutemov
  0 siblings, 2 replies; 12+ messages in thread
From: Andy Lutomirski @ 2018-01-25 22:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Konstantin Khlebnikov, X86 ML, Borislav Petkov,
	Neil Berrington, LKML, stable, Kirill A. Shutemov

On Thu, Jan 25, 2018 at 1:49 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 01/25/2018 01:12 PM, Andy Lutomirski wrote:
>> Neil Berrington reported a double-fault on a VM with 768GB of RAM that
>> uses large amounts of vmalloc space with PTI enabled.
>>
>> The cause is that load_new_mm_cr3() was never fixed to take the
>> 5-level pgd folding code into account, so, on a 4-level kernel, the
>> pgd synchronization logic compiles away to exactly nothing.
>
> You don't mention it, but we can normally handle vmalloc() faults in the
> kernel that are due to unsynchronized page tables.  The thing that kills
> us here is that we have an unmapped stack and we try to use that stack
> when entering the page fault handler, which double faults.  The double
> fault handler gets a new stack and saves us enough to get an oops out.
>
> Right?

Exactly.

There are two special code paths that can't use vmalloc_fault(): this
one and switch_to().  The latter avoids explicit page table fiddling
and just touches the new stack before loading it into rsp.

>
>> +static void sync_current_stack_to_mm(struct mm_struct *mm)
>> +{
>> +     unsigned long sp = current_stack_pointer;
>> +     pgd_t *pgd = pgd_offset(mm, sp);
>> +
>> +     if (CONFIG_PGTABLE_LEVELS > 4) {
>> +             if (unlikely(pgd_none(*pgd))) {
>> +                     pgd_t *pgd_ref = pgd_offset_k(sp);
>> +
>> +                     set_pgd(pgd, *pgd_ref);
>> +             }
>> +     } else {
>> +             /*
>> +              * "pgd" is faked.  The top level entries are "p4d"s, so sync
>> +              * the p4d.  This compiles to approximately the same code as
>> +              * the 5-level case.
>> +              */
>> +             p4d_t *p4d = p4d_offset(pgd, sp);
>> +
>> +             if (unlikely(p4d_none(*p4d))) {
>> +                     pgd_t *pgd_ref = pgd_offset_k(sp);
>> +                     p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
>> +
>> +                     set_p4d(p4d, *p4d_ref);
>> +             }
>> +     }
>> +}
>
> We keep having to add these.  It seems like a real deficiency in the
> mechanism that we're using for pgd folding.  Can't we get a warning or
> something when we try to do a set_pgd() that's (silently) not doing
> anything?  This exact same pattern bit me more than once with the
> KPTI/KAISER patches.

Hmm, maybe.

What I'd really like to see is an entirely different API.  Maybe:

typedef struct {
  opaque, but probably includes:
  int depth;  /* 0 is root */
  void *table;
} ptbl_ptr;

ptbl_ptr root_table = mm_root_ptbl(mm);

set_ptbl_entry(root_table, pa, prot);

/* walk tables */
ptbl_ptr pt = ...;
ptentry_ptr entry;
while (ptbl_has_children(pt)) {
  pt = pt_next(pt, addr);
}
entry = pt_entry_at(pt, addr);
/* do something with entry */

etc.

Now someone can add a sixth level without changing every code path in
the kernel that touches page tables.

--Andy


* Re: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
  2018-01-25 22:00     ` Andy Lutomirski
@ 2018-01-26  9:30       ` Ingo Molnar
  2018-01-26 18:54       ` Kirill A. Shutemov
  1 sibling, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2018-01-26  9:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Konstantin Khlebnikov, X86 ML, Borislav Petkov,
	Neil Berrington, LKML, stable, Kirill A. Shutemov

* Andy Lutomirski <luto@kernel.org> wrote:

> What I'd really like to see is an entirely different API.  Maybe:
> 
> typedef struct {
>   opaque, but probably includes:
>   int depth;  /* 0 is root */
>   void *table;
> } ptbl_ptr;
> 
> ptbl_ptr root_table = mm_root_ptbl(mm);
> 
> set_ptbl_entry(root_table, pa, prot);
> 
> /* walk tables */
> ptbl_ptr pt = ...;
> ptentry_ptr entry;
> while (ptbl_has_children(pt)) {
>   pt = pt_next(pt, addr);
> }
> entry = pt_entry_at(pt, addr);
> /* do something with entry */
> 
> etc.
> 
> Now someone can add a sixth level without changing every code path in
> the kernel that touches page tables.

Iteration-based page table lookups would be neat.

A sixth level is unavoidable on x86-64 I think - we'll get there in a decade or 
so? The sixth level will also use up the last ~8 bits of virtual address space 
available on 64-bit.

Thanks,

	Ingo


* [tip:x86/urgent] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
  2018-01-25 21:12 ` [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems Andy Lutomirski
  2018-01-25 21:49   ` Dave Hansen
@ 2018-01-26 15:06   ` tip-bot for Andy Lutomirski
  2018-01-26 18:51   ` [PATCH v2 1/2] " Kirill A. Shutemov
  2 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-01-26 15:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: neil.berrington, luto, khlebnikov, dave.hansen, mingo, hpa, bp,
	tglx, linux-kernel

Commit-ID:  5beda7d54eafece4c974cfa9fbb9f60fb18fd20a
Gitweb:     https://git.kernel.org/tip/5beda7d54eafece4c974cfa9fbb9f60fb18fd20a
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Thu, 25 Jan 2018 13:12:14 -0800
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 26 Jan 2018 15:56:23 +0100

x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems

Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses
large amounts of vmalloc space with PTI enabled.

The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd
folding code into account, so, on a 4-level kernel, the pgd synchronization
logic compiles away to exactly nothing.

Interestingly, the problem doesn't trigger with nopti.  I assume this is
because the kernel is mapped with global pages if we boot with nopti.  The
sequence of operations when we create a new task is that we first load its
mm while still running on the old stack (which crashes if the old stack is
unmapped in the new mm unless the TLB saves us), then we call
prepare_switch_to(), and then we switch to the new stack.
prepare_switch_to() pokes the new stack directly, which will populate the
mapping through vmalloc_fault().  I assume that we're getting lucky on
non-PTI systems -- the old stack's TLB entry stays alive long enough to
make it all the way through prepare_switch_to() and switch_to() so that we
make it to a valid stack.

Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Reported-and-tested-by: Neil Berrington <neil.berrington@datacore.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: stable@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/346541c56caed61abbe693d7d2742b4a380c5001.1516914529.git.luto@kernel.org

---
 arch/x86/mm/tlb.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a156195..5bfe61a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
+		if (unlikely(pgd_none(*pgd))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+
+			set_pgd(pgd, *pgd_ref);
+		}
+	} else {
+		/*
+		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
+		 * the p4d.  This compiles to approximately the same code as
+		 * the 5-level case.
+		 */
+		p4d_t *p4d = p4d_offset(pgd, sp);
+
+		if (unlikely(p4d_none(*p4d))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
+
+			set_p4d(p4d, *p4d_ref);
+		}
+	}
+}
+
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
@@ -226,11 +254,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			 * mapped in the new pgd, we'll double-fault.  Forcibly
 			 * map it.
 			 */
-			unsigned int index = pgd_index(current_stack_pointer);
-			pgd_t *pgd = next->pgd + index;
-
-			if (unlikely(pgd_none(*pgd)))
-				set_pgd(pgd, init_mm.pgd[index]);
+			sync_current_stack_to_mm(next);
 		}
 
 		/* Stop remote flushes for the previous mm */


* [tip:x86/urgent] x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels
  2018-01-25 21:12 ` [PATCH v2 2/2] x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels Andy Lutomirski
@ 2018-01-26 15:07   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-01-26 15:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, mingo, tglx, dave.hansen, luto, neil.berrington, khlebnikov,
	linux-kernel, hpa

Commit-ID:  36b3a7726886f24c4209852a58e64435bde3af98
Gitweb:     https://git.kernel.org/tip/36b3a7726886f24c4209852a58e64435bde3af98
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Thu, 25 Jan 2018 13:12:15 -0800
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 26 Jan 2018 15:56:23 +0100

x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels

On a 5-level kernel, if a non-init mm has a top-level entry, it needs to
match init_mm's, but the vmalloc_fault() code skipped over the BUG_ON()
that would have checked it.

While we're at it, get rid of the rather confusing 4-level folded "pgd"
logic.

Cleans-up: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Neil Berrington <neil.berrington@datacore.com>
Link: https://lkml.kernel.org/r/2ae598f8c279b0a29baf75df207e6f2fdddc0a1b.1516914529.git.luto@kernel.org

---
 arch/x86/mm/fault.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b3e4077..800de81 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -439,18 +439,13 @@ static noinline int vmalloc_fault(unsigned long address)
 	if (pgd_none(*pgd_ref))
 		return -1;
 
-	if (pgd_none(*pgd)) {
-		set_pgd(pgd, *pgd_ref);
-		arch_flush_lazy_mmu_mode();
-	} else if (CONFIG_PGTABLE_LEVELS > 4) {
-		/*
-		 * With folded p4d, pgd_none() is always false, so the pgd may
-		 * point to an empty page table entry and pgd_page_vaddr()
-		 * will return garbage.
-		 *
-		 * We will do the correct sanity check on the p4d level.
-		 */
-		BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+	if (CONFIG_PGTABLE_LEVELS > 4) {
+		if (pgd_none(*pgd)) {
+			set_pgd(pgd, *pgd_ref);
+			arch_flush_lazy_mmu_mode();
+		} else {
+			BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+		}
 	}
 
 	/* With 4-level paging, copying happens on the p4d level. */
@@ -459,7 +454,7 @@ static noinline int vmalloc_fault(unsigned long address)
 	if (p4d_none(*p4d_ref))
 		return -1;
 
-	if (p4d_none(*p4d)) {
+	if (p4d_none(*p4d) && CONFIG_PGTABLE_LEVELS == 4) {
 		set_p4d(p4d, *p4d_ref);
 		arch_flush_lazy_mmu_mode();
 	} else {
@@ -470,6 +465,7 @@ static noinline int vmalloc_fault(unsigned long address)
 	 * Below here mismatches are bugs because these lower tables
 	 * are shared:
 	 */
+	BUILD_BUG_ON(CONFIG_PGTABLE_LEVELS < 4);
 
 	pud = pud_offset(p4d, address);
 	pud_ref = pud_offset(p4d_ref, address);


* Re: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
  2018-01-25 21:12 ` [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems Andy Lutomirski
  2018-01-25 21:49   ` Dave Hansen
  2018-01-26 15:06   ` [tip:x86/urgent] " tip-bot for Andy Lutomirski
@ 2018-01-26 18:51   ` Kirill A. Shutemov
  2018-01-26 19:02     ` Andy Lutomirski
  2 siblings, 1 reply; 12+ messages in thread
From: Kirill A. Shutemov @ 2018-01-26 18:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Konstantin Khlebnikov, Dave Hansen, X86 ML, Borislav Petkov,
	Neil Berrington, LKML, stable

On Thu, Jan 25, 2018 at 01:12:14PM -0800, Andy Lutomirski wrote:
> Neil Berrington reported a double-fault on a VM with 768GB of RAM that
> uses large amounts of vmalloc space with PTI enabled.
> 
> The cause is that load_new_mm_cr3() was never fixed to take the
> 5-level pgd folding code into account, so, on a 4-level kernel, the
> pgd synchronization logic compiles away to exactly nothing.

Ouch. Sorry for this.

> 
> Interestingly, the problem doesn't trigger with nopti.  I assume this
> is because the kernel is mapped with global pages if we boot with
> nopti.  The sequence of operations when we create a new task is that
> we first load its mm while still running on the old stack (which
> crashes if the old stack is unmapped in the new mm unless the TLB
> saves us), then we call prepare_switch_to(), and then we switch to the
> new stack.  prepare_switch_to() pokes the new stack directly, which
> will populate the mapping through vmalloc_fault().  I assume that
> we're getting lucky on non-PTI systems -- the old stack's TLB entry
> stays alive long enough to make it all the way through
> prepare_switch_to() and switch_to() so that we make it to a valid
> stack.
> 
> Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
> Cc: stable@vger.kernel.org
> Reported-and-tested-by: Neil Berrington <neil.berrington@datacore.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/mm/tlb.c | 34 +++++++++++++++++++++++++++++-----
>  1 file changed, 29 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index a1561957dccb..5bfe61a5e8e3 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	local_irq_restore(flags);
>  }
>  
> +static void sync_current_stack_to_mm(struct mm_struct *mm)
> +{
> +	unsigned long sp = current_stack_pointer;
> +	pgd_t *pgd = pgd_offset(mm, sp);
> +
> +	if (CONFIG_PGTABLE_LEVELS > 4) {

Can we have

	if (PTRS_PER_P4D > 1)

here instead? This way I wouldn't need to touch the code again for
boot-time switching support.

-- 
 Kirill A. Shutemov


* Re: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
  2018-01-25 22:00     ` Andy Lutomirski
  2018-01-26  9:30       ` Ingo Molnar
@ 2018-01-26 18:54       ` Kirill A. Shutemov
  1 sibling, 0 replies; 12+ messages in thread
From: Kirill A. Shutemov @ 2018-01-26 18:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Konstantin Khlebnikov, X86 ML, Borislav Petkov,
	Neil Berrington, LKML, stable, Kirill A. Shutemov

On Thu, Jan 25, 2018 at 02:00:22PM -0800, Andy Lutomirski wrote:
> On Thu, Jan 25, 2018 at 1:49 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > On 01/25/2018 01:12 PM, Andy Lutomirski wrote:
> >> Neil Berrington reported a double-fault on a VM with 768GB of RAM that
> >> uses large amounts of vmalloc space with PTI enabled.
> >>
> >> The cause is that load_new_mm_cr3() was never fixed to take the
> >> 5-level pgd folding code into account, so, on a 4-level kernel, the
> >> pgd synchronization logic compiles away to exactly nothing.
> >
> > You don't mention it, but we can normally handle vmalloc() faults in the
> > kernel that are due to unsynchronized page tables.  The thing that kills
> > us here is that we have an unmapped stack and we try to use that stack
> > when entering the page fault handler, which double faults.  The double
> > fault handler gets a new stack and saves us enough to get an oops out.
> >
> > Right?
> 
> Exactly.
> 
> There are two special code paths that can't use vmalloc_fault(): this
> one and switch_to().  The latter avoids explicit page table fiddling
> and just touches the new stack before loading it into rsp.
> 
> >
> >> +static void sync_current_stack_to_mm(struct mm_struct *mm)
> >> +{
> >> +     unsigned long sp = current_stack_pointer;
> >> +     pgd_t *pgd = pgd_offset(mm, sp);
> >> +
> >> +     if (CONFIG_PGTABLE_LEVELS > 4) {
> >> +             if (unlikely(pgd_none(*pgd))) {
> >> +                     pgd_t *pgd_ref = pgd_offset_k(sp);
> >> +
> >> +                     set_pgd(pgd, *pgd_ref);
> >> +             }
> >> +     } else {
> >> +             /*
> >> +              * "pgd" is faked.  The top level entries are "p4d"s, so sync
> >> +              * the p4d.  This compiles to approximately the same code as
> >> +              * the 5-level case.
> >> +              */
> >> +             p4d_t *p4d = p4d_offset(pgd, sp);
> >> +
> >> +             if (unlikely(p4d_none(*p4d))) {
> >> +                     pgd_t *pgd_ref = pgd_offset_k(sp);
> >> +                     p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
> >> +
> >> +                     set_p4d(p4d, *p4d_ref);
> >> +             }
> >> +     }
> >> +}
> >
> > We keep having to add these.  It seems like a real deficiency in the
> > mechanism that we're using for pgd folding.  Can't we get a warning or
> > something when we try to do a set_pgd() that's (silently) not doing
> > anything?  This exact same pattern bit me more than once with the
> > KPTI/KAISER patches.
> 
> Hmm, maybe.
> 
> What I'd really like to see is an entirely different API.  Maybe:
> 
> typedef struct {
>   opaque, but probably includes:
>   int depth;  /* 0 is root */
>   void *table;
> } ptbl_ptr;
> 
> ptbl_ptr root_table = mm_root_ptbl(mm);
> 
> set_ptbl_entry(root_table, pa, prot);
> 
> /* walk tables */
> ptbl_ptr pt = ...;
> ptentry_ptr entry;
> while (ptbl_has_children(pt)) {
>   pt = pt_next(pt, addr);
> }
> entry = pt_entry_at(pt, addr);
> /* do something with entry */
> 
> etc.

I thought about very similar design, but never got time to try it really.
It's not a one-weekend type of project :/

-- 
 Kirill A. Shutemov


* Re: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
  2018-01-26 18:51   ` [PATCH v2 1/2] " Kirill A. Shutemov
@ 2018-01-26 19:02     ` Andy Lutomirski
  2018-01-26 20:50       ` Kirill A. Shutemov
  0 siblings, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2018-01-26 19:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Konstantin Khlebnikov, Dave Hansen, X86 ML,
	Borislav Petkov, Neil Berrington, LKML, stable

On Fri, Jan 26, 2018 at 10:51 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Thu, Jan 25, 2018 at 01:12:14PM -0800, Andy Lutomirski wrote:
>> Neil Berrington reported a double-fault on a VM with 768GB of RAM that
>> uses large amounts of vmalloc space with PTI enabled.
>>
>> The cause is that load_new_mm_cr3() was never fixed to take the
>> 5-level pgd folding code into account, so, on a 4-level kernel, the
>> pgd synchronization logic compiles away to exactly nothing.
>
> Ouch. Sorry for this.
>
>>
>> Interestingly, the problem doesn't trigger with nopti.  I assume this
>> is because the kernel is mapped with global pages if we boot with
>> nopti.  The sequence of operations when we create a new task is that
>> we first load its mm while still running on the old stack (which
>> crashes if the old stack is unmapped in the new mm unless the TLB
>> saves us), then we call prepare_switch_to(), and then we switch to the
>> new stack.  prepare_switch_to() pokes the new stack directly, which
>> will populate the mapping through vmalloc_fault().  I assume that
>> we're getting lucky on non-PTI systems -- the old stack's TLB entry
>> stays alive long enough to make it all the way through
>> prepare_switch_to() and switch_to() so that we make it to a valid
>> stack.
>>
>> Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
>> Cc: stable@vger.kernel.org
>> Reported-and-tested-by: Neil Berrington <neil.berrington@datacore.com>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/mm/tlb.c | 34 +++++++++++++++++++++++++++++-----
>>  1 file changed, 29 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index a1561957dccb..5bfe61a5e8e3 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>       local_irq_restore(flags);
>>  }
>>
>> +static void sync_current_stack_to_mm(struct mm_struct *mm)
>> +{
>> +     unsigned long sp = current_stack_pointer;
>> +     pgd_t *pgd = pgd_offset(mm, sp);
>> +
>> +     if (CONFIG_PGTABLE_LEVELS > 4) {
>
> Can we have
>
>         if (PTRS_PER_P4D > 1)
>
> here instead? This way I wouldn't need to touch the code again for
> boot-time switching support.

Want to send a patch?

(Also, I haven't noticed a patch to fix up the SYSRET checking for
boot-time switching.  Have I just missed it?)

--Andy


* Re: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
  2018-01-26 19:02     ` Andy Lutomirski
@ 2018-01-26 20:50       ` Kirill A. Shutemov
  0 siblings, 0 replies; 12+ messages in thread
From: Kirill A. Shutemov @ 2018-01-26 20:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Konstantin Khlebnikov, Dave Hansen, X86 ML, Borislav Petkov,
	Neil Berrington, LKML, stable

On Fri, Jan 26, 2018 at 11:02:08AM -0800, Andy Lutomirski wrote:
> On Fri, Jan 26, 2018 at 10:51 AM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> > On Thu, Jan 25, 2018 at 01:12:14PM -0800, Andy Lutomirski wrote:
> >> Neil Berrington reported a double-fault on a VM with 768GB of RAM that
> >> uses large amounts of vmalloc space with PTI enabled.
> >>
> >> The cause is that load_new_mm_cr3() was never fixed to take the
> >> 5-level pgd folding code into account, so, on a 4-level kernel, the
> >> pgd synchronization logic compiles away to exactly nothing.
> >
> > Ouch. Sorry for this.
> >
> >>
> >> Interestingly, the problem doesn't trigger with nopti.  I assume this
> >> is because the kernel is mapped with global pages if we boot with
> >> nopti.  The sequence of operations when we create a new task is that
> >> we first load its mm while still running on the old stack (which
> >> crashes if the old stack is unmapped in the new mm unless the TLB
> >> saves us), then we call prepare_switch_to(), and then we switch to the
> >> new stack.  prepare_switch_to() pokes the new stack directly, which
> >> will populate the mapping through vmalloc_fault().  I assume that
> >> we're getting lucky on non-PTI systems -- the old stack's TLB entry
> >> stays alive long enough to make it all the way through
> >> prepare_switch_to() and switch_to() so that we make it to a valid
> >> stack.
> >>
> >> Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
> >> Cc: stable@vger.kernel.org
> >> Reported-and-tested-by: Neil Berrington <neil.berrington@datacore.com>
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> ---
> >>  arch/x86/mm/tlb.c | 34 +++++++++++++++++++++++++++++-----
> >>  1 file changed, 29 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> >> index a1561957dccb..5bfe61a5e8e3 100644
> >> --- a/arch/x86/mm/tlb.c
> >> +++ b/arch/x86/mm/tlb.c
> >> @@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >>       local_irq_restore(flags);
> >>  }
> >>
> >> +static void sync_current_stack_to_mm(struct mm_struct *mm)
> >> +{
> >> +     unsigned long sp = current_stack_pointer;
> >> +     pgd_t *pgd = pgd_offset(mm, sp);
> >> +
> >> +     if (CONFIG_PGTABLE_LEVELS > 4) {
> >
> > Can we have
> >
> >         if (PTRS_PER_P4D > 1)
> >
> > here instead? This way I wouldn't need to touch the code again for
> > boot-time switching support.
> 
> Want to send a patch?

I'll send it with the rest of boot-time switching stuff.

> (Also, I haven't noticed a patch to fix up the SYSRET checking for
> boot-time switching.  Have I just missed it?)

It's not upstream yet.

There are two patches: initial boot-time switching support and optimization on
top of it.

https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/commit/?h=la57/boot-switching/wip&id=c35fc0af7a4fe9b5369134d7485d95427a0a039b
https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/commit/?h=la57/boot-switching/wip&id=fae0e6c3eb253e63532f4ecfa6705aac2c5d710c

-- 
 Kirill A. Shutemov



Thread overview: 12+ messages
-- links below jump to the message on this page --
2018-01-25 21:12 [PATCH v2 0/2] x86/mm/64: vmalloc pgd synchronization cleanups/fixes Andy Lutomirski
2018-01-25 21:12 ` [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems Andy Lutomirski
2018-01-25 21:49   ` Dave Hansen
2018-01-25 22:00     ` Andy Lutomirski
2018-01-26  9:30       ` Ingo Molnar
2018-01-26 18:54       ` Kirill A. Shutemov
2018-01-26 15:06   ` [tip:x86/urgent] " tip-bot for Andy Lutomirski
2018-01-26 18:51   ` [PATCH v2 1/2] " Kirill A. Shutemov
2018-01-26 19:02     ` Andy Lutomirski
2018-01-26 20:50       ` Kirill A. Shutemov
2018-01-25 21:12 ` [PATCH v2 2/2] x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels Andy Lutomirski
2018-01-26 15:07   ` [tip:x86/urgent] " tip-bot for Andy Lutomirski
