From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751505AbeAYVMn (ORCPT <rfc822;w@1wt.eu>);
        Thu, 25 Jan 2018 16:12:43 -0500
Received: from mail.kernel.org ([198.145.29.99]:50862 "EHLO mail.kernel.org"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751377AbeAYVMS (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 25 Jan 2018 16:12:18 -0500
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A1BA021795
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=luto@kernel.org
From: Andy Lutomirski <luto@kernel.org>
To: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>,
        Dave Hansen <dave.hansen@intel.com>, X86 ML <x86@kernel.org>,
        Borislav Petkov <bp@alien8.de>
Cc: Neil Berrington <neil.berrington@datacore.com>,
        LKML <linux-kernel@vger.kernel.org>, Andy Lutomirski <luto@kernel.org>,
        stable@vger.kernel.org
Subject: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
Date: Thu, 25 Jan 2018 13:12:14 -0800
Message-Id: <346541c56caed61abbe693d7d2742b4a380c5001.1516914529.git.luto@kernel.org>
X-Mailer: git-send-email 2.14.3
In-Reply-To: <cover.1516914529.git.luto@kernel.org>
References: <cover.1516914529.git.luto@kernel.org>
In-Reply-To: <cover.1516914529.git.luto@kernel.org>
References: <cover.1516914529.git.luto@kernel.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Neil Berrington reported a double-fault on a VM with 768GB of RAM that
uses large amounts of vmalloc space with PTI enabled.

The cause is that load_new_mm_cr3() was never fixed to take the
5-level pgd folding code into account, so, on a 4-level kernel, the
pgd synchronization logic compiles away to exactly nothing.

Interestingly, the problem doesn't trigger with nopti.  I assume this
is because the kernel is mapped with global pages if we boot with
nopti.  The sequence of operations when we create a new task is that
we first load its mm while still running on the old stack (which
crashes if the old stack is unmapped in the new mm unless the TLB
saves us), then we call prepare_switch_to(), and then we switch to the
new stack.  prepare_switch_to() pokes the new stack directly, which
will populate the mapping through vmalloc_fault().  I assume that
we're getting lucky on non-PTI systems -- the old stack's TLB entry
stays alive long enough to make it all the way through
prepare_switch_to() and switch_to() so that we make it to a valid
stack.

Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Cc: stable@vger.kernel.org
Reported-and-tested-by: Neil Berrington <neil.berrington@datacore.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a1561957dccb..5bfe61a5e8e3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
+		if (unlikely(pgd_none(*pgd))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+
+			set_pgd(pgd, *pgd_ref);
+		}
+	} else {
+		/*
+		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
+		 * the p4d.  This compiles to approximately the same code as
+		 * the 5-level case.
+		 */
+		p4d_t *p4d = p4d_offset(pgd, sp);
+
+		if (unlikely(p4d_none(*p4d))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
+
+			set_p4d(p4d, *p4d_ref);
+		}
+	}
+}
+
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
@@ -226,11 +254,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			 * mapped in the new pgd, we'll double-fault.  Forcibly
 			 * map it.
 			 */
-			unsigned int index = pgd_index(current_stack_pointer);
-			pgd_t *pgd = next->pgd + index;
-
-			if (unlikely(pgd_none(*pgd)))
-				set_pgd(pgd, init_mm.pgd[index]);
+			sync_current_stack_to_mm(next);
 		}
 
 		/* Stop remote flushes for the previous mm */
-- 
2.14.3