linux-efi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GIT PULL v4 0/6] EFI page table isolation
@ 2015-11-27 21:09 Matt Fleming
  2015-11-27 21:09 ` [PATCH v4 1/6] x86: Page align _end to avoid pfn conversion bugs Matt Fleming
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Matt Fleming @ 2015-11-27 21:09 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H . Peter Anvin
  Cc: Toshi Kani, Matt Fleming, linux-kernel, linux-efi, Andrew Morton,
	Andy Lutomirski, Borislav Petkov, Dave Hansen, Dave Jones,
	Denys Vlasenko, Linus Torvalds, Sai Praneeth Prakhya,
	Stephen Smalley

Folks,

This patch series is a response to the report that the EFI region
mappings trigger warnings when booting with CONFIG_DEBUG_WX enabled.
They allocate a new page table structure and ensure that all the
mappings we require during EFI runtime calls are only setup there.

It turns out that it still makes sense to share some page table
entries with 'swapper_pg_dir', just not the entries where we need to
allow security lax permissions. Sharing entries is useful for memory
hotplug, for example.

When writing this series I discovered a number of bugs in the existing
code that only became apparent when we stopped using 'trampoline_pgd'
which already mapped a bunch of things for us. I've put those bug
fixes at the start of the series.

Further testing would be very much appreciated as this is a
notoriously funky area of the EFI code.

Changes in v4:

 - Drop the PFN_ALIGN() patch in favour of the linker script change

Changes in v3:

 - PFN_ALIGN() _text and _end when calculating npages to map

 - Drop the hunk that mapped the stack in efi_setup_page_tables(), now
   that we're mapping RAM we know the stack is already mapped.

 - Update the wording in Documentation/<..>/mm.txt to make it clear we
   carve out a 64Gb *size* of virtual address space, we don't use the
   first 64Gb virtual addresses.

Changes in v2:

 - Folded PATCH 1 and 2 together because they both fall under the
   umbrella of "making sure cpa->pfn is really a page frame number".

 - Fixed some checkpatch warnings about mixing spaces and tabs and
   made some stylistic changes per Borislav's comments.

 - Moved efi_alloc_page_tables() earlier in __efi_enter_virtual_mode()
   so that we fail early if we can't allocate memory for the page
   tables.

The following changes since commit 8005c49d9aea74d382f474ce11afbbc7d7130bec:

  Linux 4.4-rc1 (2015-11-15 17:00:27 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git tags/efi-pgtable

for you to fetch changes up to 73e727544ad2d9e1bb7421396b9043b776e3fcf4:

  Documentation/x86: Update EFI memory region description (2015-11-27 20:57:21 +0000)

----------------------------------------------------------------
 * Use completely separate page tables for EFI runtime service calls
   so that the security-lax mapping permissions (RWX) do not leak into
   the standard kernel page tables and trigger warnings when
   CONFIG_DEBUG_WX is enabled.

----------------------------------------------------------------
Matt Fleming (6):
      x86: Page align _end to avoid pfn conversion bugs
      x86/mm/pageattr: Ensure cpa->pfn only contains page frame numbers
      x86/efi: Map RAM into the identity page table for mixed mode
      x86/efi: Hoist page table switching code into efi_call_virt()
      x86/efi: Build our own page table structures
      Documentation/x86: Update EFI memory region description

 Documentation/x86/x86_64/mm.txt     |  12 +--
 arch/x86/include/asm/efi.h          |  26 ++++++
 arch/x86/kernel/vmlinux.lds.S       |   1 +
 arch/x86/mm/pageattr.c              |  17 ++--
 arch/x86/platform/efi/efi.c         |  39 ++++-----
 arch/x86/platform/efi/efi_32.c      |   5 ++
 arch/x86/platform/efi/efi_64.c      | 157 ++++++++++++++++++++++++++++--------
 arch/x86/platform/efi/efi_stub_64.S |  43 ----------
 8 files changed, 181 insertions(+), 119 deletions(-)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v4 1/6] x86: Page align _end to avoid pfn conversion bugs
  2015-11-27 21:09 [GIT PULL v4 0/6] EFI page table isolation Matt Fleming
@ 2015-11-27 21:09 ` Matt Fleming
  2015-11-27 21:09 ` [PATCH v4 2/6] x86/mm/pageattr: Ensure cpa->pfn only contains page frame numbers Matt Fleming
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Matt Fleming @ 2015-11-27 21:09 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H . Peter Anvin
  Cc: Toshi Kani, Matt Fleming, linux-kernel, linux-efi,
	Sai Praneeth Prakhya, Dave Hansen, Borislav Petkov

Ingo noted that if we can guarantee _end is aligned to PAGE_SIZE we
can automatically avoid bugs along the lines of,

	size = _end - _text >> PAGE_SHIFT

which is missing a call to PFN_ALIGN(). The EFI mixed mode contains
this bug, for example.

_text is already aligned to PAGE_SIZE through the use of
LOAD_PHYSICAL_ADDR, and the BSS and BRK sections are explicitly
aligned in the linker script, so it makes sense to align _end to
match.

Reported-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
---
 arch/x86/kernel/vmlinux.lds.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 74e4bf11f562..4f1994257a18 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -325,6 +325,7 @@ SECTIONS
 		__brk_limit = .;
 	}
 
+	. = ALIGN(PAGE_SIZE);
 	_end = .;
 
         STABS_DEBUG
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v4 2/6] x86/mm/pageattr: Ensure cpa->pfn only contains page frame numbers
  2015-11-27 21:09 [GIT PULL v4 0/6] EFI page table isolation Matt Fleming
  2015-11-27 21:09 ` [PATCH v4 1/6] x86: Page align _end to avoid pfn conversion bugs Matt Fleming
@ 2015-11-27 21:09 ` Matt Fleming
       [not found] ` <1448658575-17029-1-git-send-email-matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Matt Fleming @ 2015-11-27 21:09 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H . Peter Anvin
  Cc: Toshi Kani, Matt Fleming, linux-kernel, linux-efi,
	Sai Praneeth Prakhya, Dave Hansen, Borislav Petkov

The x86 pageattr code is confused about the data that is stored in
cpa->pfn, sometimes it's treated as a page frame number, sometimes
it's treated as an unshifted physical address, and in one place it's
treated as a pte.

The result of this is that the mapping functions do not map the
intended physical address.

This isn't a problem in practice because most of the addresses we're
mapping in the EFI code paths are already mapped in 'trampoline_pgd'
and so the pageattr mapping functions don't actually do anything in
this case. But when we move to using a separate page table for the EFI
runtime this will be an issue.

Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
---
 arch/x86/mm/pageattr.c         | 17 ++++++-----------
 arch/x86/platform/efi/efi_64.c | 16 ++++++++++------
 2 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a3137a4feed1..c70e42014101 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -905,15 +905,10 @@ static void populate_pte(struct cpa_data *cpa,
 	pte = pte_offset_kernel(pmd, start);
 
 	while (num_pages-- && start < end) {
-
-		/* deal with the NX bit */
-		if (!(pgprot_val(pgprot) & _PAGE_NX))
-			cpa->pfn &= ~_PAGE_NX;
-
-		set_pte(pte, pfn_pte(cpa->pfn >> PAGE_SHIFT, pgprot));
+		set_pte(pte, pfn_pte(cpa->pfn, pgprot));
 
 		start	 += PAGE_SIZE;
-		cpa->pfn += PAGE_SIZE;
+		cpa->pfn++;
 		pte++;
 	}
 }
@@ -969,11 +964,11 @@ static int populate_pmd(struct cpa_data *cpa,
 
 		pmd = pmd_offset(pud, start);
 
-		set_pmd(pmd, __pmd(cpa->pfn | _PAGE_PSE |
+		set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
 				   massage_pgprot(pmd_pgprot)));
 
 		start	  += PMD_SIZE;
-		cpa->pfn  += PMD_SIZE;
+		cpa->pfn  += PMD_SIZE >> PAGE_SHIFT;
 		cur_pages += PMD_SIZE >> PAGE_SHIFT;
 	}
 
@@ -1042,11 +1037,11 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
 	 * Map everything starting from the Gb boundary, possibly with 1G pages
 	 */
 	while (end - start >= PUD_SIZE) {
-		set_pud(pud, __pud(cpa->pfn | _PAGE_PSE |
+		set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
 				   massage_pgprot(pud_pgprot)));
 
 		start	  += PUD_SIZE;
-		cpa->pfn  += PUD_SIZE;
+		cpa->pfn  += PUD_SIZE >> PAGE_SHIFT;
 		cur_pages += PUD_SIZE >> PAGE_SHIFT;
 		pud++;
 	}
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index a0ac0f9c307f..5aa186db59e3 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -143,7 +143,7 @@ void efi_sync_low_kernel_mappings(void)
 
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
-	unsigned long text;
+	unsigned long pfn, text;
 	struct page *page;
 	unsigned npages;
 	pgd_t *pgd;
@@ -160,7 +160,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	 * and ident-map those pages containing the map before calling
 	 * phys_efi_set_virtual_address_map().
 	 */
-	if (kernel_map_pages_in_pgd(pgd, pa_memmap, pa_memmap, num_pages, _PAGE_NX)) {
+	pfn = pa_memmap >> PAGE_SHIFT;
+	if (kernel_map_pages_in_pgd(pgd, pfn, pa_memmap, num_pages, _PAGE_NX)) {
 		pr_err("Error ident-mapping new memmap (0x%lx)!\n", pa_memmap);
 		return 1;
 	}
@@ -185,8 +186,9 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 
 	npages = (_end - _text) >> PAGE_SHIFT;
 	text = __pa(_text);
+	pfn = text >> PAGE_SHIFT;
 
-	if (kernel_map_pages_in_pgd(pgd, text >> PAGE_SHIFT, text, npages, 0)) {
+	if (kernel_map_pages_in_pgd(pgd, pfn, text, npages, 0)) {
 		pr_err("Failed to map kernel text 1:1\n");
 		return 1;
 	}
@@ -204,12 +206,14 @@ void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 static void __init __map_region(efi_memory_desc_t *md, u64 va)
 {
 	pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
-	unsigned long pf = 0;
+	unsigned long flags = 0;
+	unsigned long pfn;
 
 	if (!(md->attribute & EFI_MEMORY_WB))
-		pf |= _PAGE_PCD;
+		flags |= _PAGE_PCD;
 
-	if (kernel_map_pages_in_pgd(pgd, md->phys_addr, va, md->num_pages, pf))
+	pfn = md->phys_addr >> PAGE_SHIFT;
+	if (kernel_map_pages_in_pgd(pgd, pfn, va, md->num_pages, flags))
 		pr_warn("Error mapping PA 0x%llx -> VA 0x%llx!\n",
 			   md->phys_addr, va);
 }
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v4 3/6] x86/efi: Map RAM into the identity page table for mixed mode
       [not found] ` <1448658575-17029-1-git-send-email-matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
@ 2015-11-27 21:09   ` Matt Fleming
  2015-11-27 21:09   ` [PATCH v4 4/6] x86/efi: Hoist page table switching code into efi_call_virt() Matt Fleming
  1 sibling, 0 replies; 8+ messages in thread
From: Matt Fleming @ 2015-11-27 21:09 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H . Peter Anvin
  Cc: Toshi Kani, Matt Fleming, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-efi-u79uwXL29TY76Z2rM5mHXA, Sai Praneeth Prakhya,
	Borislav Petkov

We are relying on the pre-existing mappings in 'trampoline_pgd' when
accessing function arguments in the EFI mixed mode thunking code.

Instead let's map memory explicitly so that things will continue to
work when we move to a separate page table in the future.

Reviewed-by: Borislav Petkov <bp-l3A5Bk7waGM@public.gmane.org>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Matt Fleming <matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
---
 arch/x86/platform/efi/efi_64.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 5aa186db59e3..102976dda8c4 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -144,6 +144,7 @@ void efi_sync_low_kernel_mappings(void)
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
 	unsigned long pfn, text;
+	efi_memory_desc_t *md;
 	struct page *page;
 	unsigned npages;
 	pgd_t *pgd;
@@ -177,6 +178,25 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	if (!IS_ENABLED(CONFIG_EFI_MIXED))
 		return 0;
 
+	/*
+	 * Map all of RAM so that we can access arguments in the 1:1
+	 * mapping when making EFI runtime calls.
+	 */
+	for_each_efi_memory_desc(&memmap, md) {
+		if (md->type != EFI_CONVENTIONAL_MEMORY &&
+		    md->type != EFI_LOADER_DATA &&
+		    md->type != EFI_LOADER_CODE)
+			continue;
+
+		pfn = md->phys_addr >> PAGE_SHIFT;
+		npages = md->num_pages;
+
+		if (kernel_map_pages_in_pgd(pgd, pfn, md->phys_addr, npages, 0)) {
+			pr_err("Failed to map 1:1 memory\n");
+			return 1;
+		}
+	}
+
 	page = alloc_page(GFP_KERNEL|__GFP_DMA32);
 	if (!page)
 		panic("Unable to allocate EFI runtime stack < 4GB\n");
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v4 4/6] x86/efi: Hoist page table switching code into efi_call_virt()
       [not found] ` <1448658575-17029-1-git-send-email-matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
  2015-11-27 21:09   ` [PATCH v4 3/6] x86/efi: Map RAM into the identity page table for mixed mode Matt Fleming
@ 2015-11-27 21:09   ` Matt Fleming
  1 sibling, 0 replies; 8+ messages in thread
From: Matt Fleming @ 2015-11-27 21:09 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H . Peter Anvin
  Cc: Toshi Kani, Matt Fleming, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-efi-u79uwXL29TY76Z2rM5mHXA, Sai Praneeth Prakhya,
	Linus Torvalds, Dave Jones, Andrew Morton, Andy Lutomirski,
	Denys Vlasenko, Stephen Smalley, Borislav Petkov

This change is a prerequisite for pending patches that switch to a
dedicated EFI page table, instead of using 'trampoline_pgd' which
shares PGD entries with 'swapper_pg_dir'. The pending patches make it
impossible to dereference the runtime service function pointer without
first switching %cr3.

It's true that we now have duplicated switching code in
efi_call_virt() and efi_call_phys_{prolog,epilog}() but we are
sacrificing code duplication for a little more clarity and the ease of
writing the page table switching code in C instead of asm.

Reviewed-by: Borislav Petkov <bp-l3A5Bk7waGM@public.gmane.org>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Ingo Molnar <mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Dave Jones <davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL@public.gmane.org>
Cc: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
Cc: H. Peter Anvin <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Denys Vlasenko <dvlasenk-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
Cc: Stephen Smalley <sds-+05T5uksL2qpZYMLLGbcSA@public.gmane.org>
Signed-off-by: Matt Fleming <matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
---
 arch/x86/include/asm/efi.h          | 25 +++++++++++++++++++++
 arch/x86/platform/efi/efi_64.c      | 24 ++++++++++-----------
 arch/x86/platform/efi/efi_stub_64.S | 43 -------------------------------------
 3 files changed, 36 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 0010c78c4998..63a15de789d5 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -3,6 +3,7 @@
 
 #include <asm/fpu/api.h>
 #include <asm/pgtable.h>
+#include <asm/tlb.h>
 
 /*
  * We map the EFI regions needed for runtime services non-contiguously,
@@ -64,6 +65,17 @@ extern u64 asmlinkage efi_call(void *fp, ...);
 
 #define efi_call_phys(f, args...)		efi_call((f), args)
 
+/*
+ * Scratch space used for switching the pagetable in the EFI stub
+ */
+struct efi_scratch {
+	u64 r15;
+	u64 prev_cr3;
+	pgd_t *efi_pgt;
+	bool use_pgd;
+	u64 phys_stack;
+} __packed;
+
 #define efi_call_virt(f, ...)						\
 ({									\
 	efi_status_t __s;						\
@@ -71,7 +83,20 @@ extern u64 asmlinkage efi_call(void *fp, ...);
 	efi_sync_low_kernel_mappings();					\
 	preempt_disable();						\
 	__kernel_fpu_begin();						\
+									\
+	if (efi_scratch.use_pgd) {					\
+		efi_scratch.prev_cr3 = read_cr3();			\
+		write_cr3((unsigned long)efi_scratch.efi_pgt);		\
+		__flush_tlb_all();					\
+	}								\
+									\
 	__s = efi_call((void *)efi.systab->runtime->f, __VA_ARGS__);	\
+									\
+	if (efi_scratch.use_pgd) {					\
+		write_cr3(efi_scratch.prev_cr3);			\
+		__flush_tlb_all();					\
+	}								\
+									\
 	__kernel_fpu_end();						\
 	preempt_enable();						\
 	__s;								\
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 102976dda8c4..b19cdac959b2 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -47,16 +47,7 @@
  */
 static u64 efi_va = EFI_VA_START;
 
-/*
- * Scratch space used for switching the pagetable in the EFI stub
- */
-struct efi_scratch {
-	u64 r15;
-	u64 prev_cr3;
-	pgd_t *efi_pgt;
-	bool use_pgd;
-	u64 phys_stack;
-} __packed;
+struct efi_scratch efi_scratch;
 
 static void __init early_code_mapping_set_exec(int executable)
 {
@@ -83,8 +74,11 @@ pgd_t * __init efi_call_phys_prolog(void)
 	int pgd;
 	int n_pgds;
 
-	if (!efi_enabled(EFI_OLD_MEMMAP))
-		return NULL;
+	if (!efi_enabled(EFI_OLD_MEMMAP)) {
+		save_pgd = (pgd_t *)read_cr3();
+		write_cr3((unsigned long)efi_scratch.efi_pgt);
+		goto out;
+	}
 
 	early_code_mapping_set_exec(1);
 
@@ -96,6 +90,7 @@ pgd_t * __init efi_call_phys_prolog(void)
 		vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
 		set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
 	}
+out:
 	__flush_tlb_all();
 
 	return save_pgd;
@@ -109,8 +104,11 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
 	int pgd_idx;
 	int nr_pgds;
 
-	if (!save_pgd)
+	if (!efi_enabled(EFI_OLD_MEMMAP)) {
+		write_cr3((unsigned long)save_pgd);
+		__flush_tlb_all();
 		return;
+	}
 
 	nr_pgds = DIV_ROUND_UP((max_pfn << PAGE_SHIFT) , PGDIR_SIZE);
 
diff --git a/arch/x86/platform/efi/efi_stub_64.S b/arch/x86/platform/efi/efi_stub_64.S
index 86d0f9e08dd9..32020cb8bb08 100644
--- a/arch/x86/platform/efi/efi_stub_64.S
+++ b/arch/x86/platform/efi/efi_stub_64.S
@@ -38,41 +38,6 @@
 	mov %rsi, %cr0;			\
 	mov (%rsp), %rsp
 
-	/* stolen from gcc */
-	.macro FLUSH_TLB_ALL
-	movq %r15, efi_scratch(%rip)
-	movq %r14, efi_scratch+8(%rip)
-	movq %cr4, %r15
-	movq %r15, %r14
-	andb $0x7f, %r14b
-	movq %r14, %cr4
-	movq %r15, %cr4
-	movq efi_scratch+8(%rip), %r14
-	movq efi_scratch(%rip), %r15
-	.endm
-
-	.macro SWITCH_PGT
-	cmpb $0, efi_scratch+24(%rip)
-	je 1f
-	movq %r15, efi_scratch(%rip)		# r15
-	# save previous CR3
-	movq %cr3, %r15
-	movq %r15, efi_scratch+8(%rip)		# prev_cr3
-	movq efi_scratch+16(%rip), %r15		# EFI pgt
-	movq %r15, %cr3
-	1:
-	.endm
-
-	.macro RESTORE_PGT
-	cmpb $0, efi_scratch+24(%rip)
-	je 2f
-	movq efi_scratch+8(%rip), %r15
-	movq %r15, %cr3
-	movq efi_scratch(%rip), %r15
-	FLUSH_TLB_ALL
-	2:
-	.endm
-
 ENTRY(efi_call)
 	SAVE_XMM
 	mov (%rsp), %rax
@@ -83,16 +48,8 @@ ENTRY(efi_call)
 	mov %r8, %r9
 	mov %rcx, %r8
 	mov %rsi, %rcx
-	SWITCH_PGT
 	call *%rdi
-	RESTORE_PGT
 	addq $48, %rsp
 	RESTORE_XMM
 	ret
 ENDPROC(efi_call)
-
-	.data
-ENTRY(efi_scratch)
-	.fill 3,8,0
-	.byte 0
-	.quad 0
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v4 5/6] x86/efi: Build our own page table structures
  2015-11-27 21:09 [GIT PULL v4 0/6] EFI page table isolation Matt Fleming
                   ` (2 preceding siblings ...)
       [not found] ` <1448658575-17029-1-git-send-email-matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
@ 2015-11-27 21:09 ` Matt Fleming
  2015-11-27 21:09 ` [PATCH v4 6/6] Documentation/x86: Update EFI memory region description Matt Fleming
  2015-11-29  8:18 ` [GIT PULL v4 0/6] EFI page table isolation Ingo Molnar
  5 siblings, 0 replies; 8+ messages in thread
From: Matt Fleming @ 2015-11-27 21:09 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H . Peter Anvin
  Cc: Toshi Kani, Matt Fleming, linux-kernel, linux-efi,
	Sai Praneeth Prakhya, Linus Torvalds, Dave Jones, Andrew Morton,
	Andy Lutomirski, Denys Vlasenko, Stephen Smalley,
	Borislav Petkov

With commit e1a58320a38d ("x86/mm: Warn on W^X mappings") all users
booting on 64-bit UEFI machines see the following warning,

  ------------[ cut here ]------------
  WARNING: CPU: 7 PID: 1 at arch/x86/mm/dump_pagetables.c:225 note_page+0x5dc/0x780()
  x86/mm: Found insecure W+X mapping at address ffff88000005f000/0xffff88000005f000
  ...
  x86/mm: Checked W+X mappings: FAILED, 165660 W+X pages found.
  ...

This is caused by mapping EFI regions with RWX permissions. There
isn't much we can do to restrict the permissions for these regions due
to the way the firmware toolchains mix code and data, but we can at
least isolate these mappings so that they do not appear in the regular
kernel page tables.

In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping")
we started using 'trampoline_pgd' to map the EFI regions because there
was an existing identity mapping there which we use during the
SetVirtualAddressMap() call and for broken firmware that accesses
those addresses.

But 'trampoline_pgd' shares some PGD entries with 'swapper_pg_dir' and
does not provide the isolation we require. Notably the virtual address
for __START_KERNEL_map and MODULES_START are mapped by the same PGD
entry so we need to be more careful when copying changes over in
efi_sync_low_kernel_mappings().

This patch doesn't go the full mile, we still want to share some PGD
entries with 'swapper_pg_dir'. Having completely separate page tables
brings its own issues such as synchronising new mappings after memory
hotplug and module loading. Sharing also keeps memory usage down.

Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>,
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
---
 arch/x86/include/asm/efi.h     |  1 +
 arch/x86/platform/efi/efi.c    | 39 ++++++-----------
 arch/x86/platform/efi/efi_32.c |  5 +++
 arch/x86/platform/efi/efi_64.c | 97 +++++++++++++++++++++++++++++++++++-------
 4 files changed, 102 insertions(+), 40 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 63a15de789d5..e49a2d239c29 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -136,6 +136,7 @@ extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
 extern void __init efi_map_region_fixed(efi_memory_desc_t *md);
 extern void efi_sync_low_kernel_mappings(void);
+extern int __init efi_alloc_page_tables(void);
 extern int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages);
 extern void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages);
 extern void __init old_map_region(efi_memory_desc_t *md);
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index ad285404ea7f..3c1f3cd7b2ba 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -869,7 +869,7 @@ static void __init kexec_enter_virtual_mode(void)
  * This function will switch the EFI runtime services to virtual mode.
  * Essentially, we look through the EFI memmap and map every region that
  * has the runtime attribute bit set in its memory descriptor into the
- * ->trampoline_pgd page table using a top-down VA allocation scheme.
+ * efi_pgd page table.
  *
  * The old method which used to update that memory descriptor with the
  * virtual address obtained from ioremap() is still supported when the
@@ -879,8 +879,8 @@ static void __init kexec_enter_virtual_mode(void)
  *
  * The new method does a pagetable switch in a preemption-safe manner
  * so that we're in a different address space when calling a runtime
- * function. For function arguments passing we do copy the PGDs of the
- * kernel page table into ->trampoline_pgd prior to each call.
+ * function. For function arguments passing we do copy the PUDs of the
+ * kernel page table into efi_pgd prior to each call.
  *
  * Specially for kexec boot, efi runtime maps in previous kernel should
  * be passed in via setup_data. In that case runtime ranges will be mapped
@@ -895,6 +895,12 @@ static void __init __efi_enter_virtual_mode(void)
 
 	efi.systab = NULL;
 
+	if (efi_alloc_page_tables()) {
+		pr_err("Failed to allocate EFI page tables\n");
+		clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
+		return;
+	}
+
 	efi_merge_regions();
 	new_memmap = efi_map_regions(&count, &pg_shift);
 	if (!new_memmap) {
@@ -954,28 +960,11 @@ static void __init __efi_enter_virtual_mode(void)
 	efi_runtime_mkexec();
 
 	/*
-	 * We mapped the descriptor array into the EFI pagetable above but we're
-	 * not unmapping it here. Here's why:
-	 *
-	 * We're copying select PGDs from the kernel page table to the EFI page
-	 * table and when we do so and make changes to those PGDs like unmapping
-	 * stuff from them, those changes appear in the kernel page table and we
-	 * go boom.
-	 *
-	 * From setup_real_mode():
-	 *
-	 * ...
-	 * trampoline_pgd[0] = init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd;
-	 *
-	 * In this particular case, our allocation is in PGD 0 of the EFI page
-	 * table but we've copied that PGD from PGD[272] of the EFI page table:
-	 *
-	 *	pgd_index(__PAGE_OFFSET = 0xffff880000000000) = 272
-	 *
-	 * where the direct memory mapping in kernel space is.
-	 *
-	 * new_memmap's VA comes from that direct mapping and thus clearing it,
-	 * it would get cleared in the kernel page table too.
+	 * We mapped the descriptor array into the EFI pagetable above
+	 * but we're not unmapping it here because if we're running in
+	 * EFI mixed mode we need all of memory to be accessible when
+	 * we pass parameters to the EFI runtime services in the
+	 * thunking code.
 	 *
 	 * efi_cleanup_page_tables(__pa(new_memmap), 1 << pg_shift);
 	 */
diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
index ed5b67338294..58d669bc8250 100644
--- a/arch/x86/platform/efi/efi_32.c
+++ b/arch/x86/platform/efi/efi_32.c
@@ -38,6 +38,11 @@
  * say 0 - 3G.
  */
 
+int __init efi_alloc_page_tables(void)
+{
+	return 0;
+}
+
 void efi_sync_low_kernel_mappings(void) {}
 void __init efi_dump_pagetable(void) {}
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index b19cdac959b2..4897f518760f 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -40,6 +40,7 @@
 #include <asm/fixmap.h>
 #include <asm/realmode.h>
 #include <asm/time.h>
+#include <asm/pgalloc.h>
 
 /*
  * We allocate runtime services regions bottom-up, starting from -4G, i.e.
@@ -121,22 +122,92 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
 	early_code_mapping_set_exec(0);
 }
 
+static pgd_t *efi_pgd;
+
+/*
+ * We need our own copy of the higher levels of the page tables
+ * because we want to avoid inserting EFI region mappings (EFI_VA_END
+ * to EFI_VA_START) into the standard kernel page tables. Everything
+ * else can be shared, see efi_sync_low_kernel_mappings().
+ */
+int __init efi_alloc_page_tables(void)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	gfp_t gfp_mask;
+
+	if (efi_enabled(EFI_OLD_MEMMAP))
+		return 0;
+
+	gfp_mask = GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO;
+	efi_pgd = (pgd_t *)__get_free_page(gfp_mask);
+	if (!efi_pgd)
+		return -ENOMEM;
+
+	pgd = efi_pgd + pgd_index(EFI_VA_END);
+
+	pud = pud_alloc_one(NULL, 0);
+	if (!pud) {
+		free_page((unsigned long)efi_pgd);
+		return -ENOMEM;
+	}
+
+	pgd_populate(NULL, pgd, pud);
+
+	return 0;
+}
+
 /*
  * Add low kernel mappings for passing arguments to EFI functions.
  */
 void efi_sync_low_kernel_mappings(void)
 {
-	unsigned num_pgds;
-	pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
+	unsigned num_entries;
+	pgd_t *pgd_k, *pgd_efi;
+	pud_t *pud_k, *pud_efi;
 
 	if (efi_enabled(EFI_OLD_MEMMAP))
 		return;
 
-	num_pgds = pgd_index(MODULES_END - 1) - pgd_index(PAGE_OFFSET);
+	/*
+	 * We can share all PGD entries apart from the one entry that
+	 * covers the EFI runtime mapping space.
+	 *
+	 * Make sure the EFI runtime region mappings are guaranteed to
+	 * only span a single PGD entry and that the entry also maps
+	 * other important kernel regions.
+	 */
+	BUILD_BUG_ON(pgd_index(EFI_VA_END) != pgd_index(MODULES_END));
+	BUILD_BUG_ON((EFI_VA_START & PGDIR_MASK) !=
+			(EFI_VA_END & PGDIR_MASK));
+
+	pgd_efi = efi_pgd + pgd_index(PAGE_OFFSET);
+	pgd_k = pgd_offset_k(PAGE_OFFSET);
+
+	num_entries = pgd_index(EFI_VA_END) - pgd_index(PAGE_OFFSET);
+	memcpy(pgd_efi, pgd_k, sizeof(pgd_t) * num_entries);
 
-	memcpy(pgd + pgd_index(PAGE_OFFSET),
-		init_mm.pgd + pgd_index(PAGE_OFFSET),
-		sizeof(pgd_t) * num_pgds);
+	/*
+	 * We share all the PUD entries apart from those that map the
+	 * EFI regions. Copy around them.
+	 */
+	BUILD_BUG_ON((EFI_VA_START & ~PUD_MASK) != 0);
+	BUILD_BUG_ON((EFI_VA_END & ~PUD_MASK) != 0);
+
+	pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
+	pud_efi = pud_offset(pgd_efi, 0);
+
+	pgd_k = pgd_offset_k(EFI_VA_END);
+	pud_k = pud_offset(pgd_k, 0);
+
+	num_entries = pud_index(EFI_VA_END);
+	memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
+
+	pud_efi = pud_offset(pgd_efi, EFI_VA_START);
+	pud_k = pud_offset(pgd_k, EFI_VA_START);
+
+	num_entries = PTRS_PER_PUD - pud_index(EFI_VA_START);
+	memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
 }
 
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
@@ -150,8 +221,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	if (efi_enabled(EFI_OLD_MEMMAP))
 		return 0;
 
-	efi_scratch.efi_pgt = (pgd_t *)(unsigned long)real_mode_header->trampoline_pgd;
-	pgd = __va(efi_scratch.efi_pgt);
+	efi_scratch.efi_pgt = (pgd_t *)__pa(efi_pgd);
+	pgd = efi_pgd;
 
 	/*
 	 * It can happen that the physical address of new_memmap lands in memory
@@ -216,16 +287,14 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 
 void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
-	pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
-
-	kernel_unmap_pages_in_pgd(pgd, pa_memmap, num_pages);
+	kernel_unmap_pages_in_pgd(efi_pgd, pa_memmap, num_pages);
 }
 
 static void __init __map_region(efi_memory_desc_t *md, u64 va)
 {
-	pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
 	unsigned long flags = 0;
 	unsigned long pfn;
+	pgd_t *pgd = efi_pgd;
 
 	if (!(md->attribute & EFI_MEMORY_WB))
 		flags |= _PAGE_PCD;
@@ -334,9 +403,7 @@ void __init efi_runtime_mkexec(void)
 void __init efi_dump_pagetable(void)
 {
 #ifdef CONFIG_EFI_PGT_DUMP
-	pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
-
-	ptdump_walk_pgd_level(NULL, pgd);
+	ptdump_walk_pgd_level(NULL, efi_pgd);
 #endif
 }
 
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v4 6/6] Documentation/x86: Update EFI memory region description
  2015-11-27 21:09 [GIT PULL v4 0/6] EFI page table isolation Matt Fleming
                   ` (3 preceding siblings ...)
  2015-11-27 21:09 ` [PATCH v4 5/6] x86/efi: Build our own page table structures Matt Fleming
@ 2015-11-27 21:09 ` Matt Fleming
  2015-11-29  8:18 ` [GIT PULL v4 0/6] EFI page table isolation Ingo Molnar
  5 siblings, 0 replies; 8+ messages in thread
From: Matt Fleming @ 2015-11-27 21:09 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H . Peter Anvin
  Cc: Toshi Kani, Matt Fleming, linux-kernel, linux-efi,
	Sai Praneeth Prakhya, Linus Torvalds, Dave Jones, Andrew Morton,
	Andy Lutomirski, Denys Vlasenko, Stephen Smalley,
	Borislav Petkov

Make it clear that the EFI page tables are only available during EFI
runtime calls since that subject has come up a fair numbers of times
in the past.

Additionally, add the EFI region start and end addresses to the table
so that it's possible to see at a glance where they fall in relation
to other regions.

Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>,
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
---
 Documentation/x86/x86_64/mm.txt | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 05712ac83e38..c518dce7da4d 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -16,6 +16,8 @@ ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB)
 ... unused hole ...
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
+ffffffef00000000 - ffffffff00000000 (=64 GB) EFI region mapping space
+... unused hole ...
 ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
 ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
 ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
@@ -32,11 +34,9 @@ reference.
 Current X86-64 implementations only support 40 bits of address space,
 but we support up to 46 bits. This expands into MBZ space in the page tables.
 
-->trampoline_pgd:
-
-We map EFI runtime services in the aforementioned PGD in the virtual
-range of 64Gb (arbitrarily set, can be raised if needed)
-
-0xffffffef00000000 - 0xffffffff00000000
+We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual
+memory window (this size is arbitrary, it can be raised later if needed).
+The mappings are not part of any other kernel PGD and are only available
+during EFI runtime calls.
 
 -Andi Kleen, Jul 2004
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [GIT PULL v4 0/6] EFI page table isolation
  2015-11-27 21:09 [GIT PULL v4 0/6] EFI page table isolation Matt Fleming
                   ` (4 preceding siblings ...)
  2015-11-27 21:09 ` [PATCH v4 6/6] Documentation/x86: Update EFI memory region description Matt Fleming
@ 2015-11-29  8:18 ` Ingo Molnar
  5 siblings, 0 replies; 8+ messages in thread
From: Ingo Molnar @ 2015-11-29  8:18 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Thomas Gleixner, H . Peter Anvin, Toshi Kani, linux-kernel,
	linux-efi, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Dave Hansen, Dave Jones, Denys Vlasenko, Linus Torvalds,
	Sai Praneeth Prakhya, Stephen Smalley


* Matt Fleming <matt@codeblueprint.co.uk> wrote:

> Folks,
> 
> This patch series is a response to the report that the EFI region
> mappings trigger warnings when booting with CONFIG_DEBUG_WX enabled.
> They allocate a new page table structure and ensure that all the
> mappings we require during EFI runtime calls are only setup there.
> 
> It turns out that it still makes sense to share some page table
> entries with 'swapper_pg_dir', just not the entries where we need to
> allow security lax permissions. Sharing entries is useful for memory
> hotplug, for example.
> 
> When writing this series I discovered a number of bugs in the existing
> code that only became apparent when we stopped using 'trampoline_pgd'
> which already mapped a bunch of things for us. I've put those bug
> fixes at the start of the series.
> 
> Further testing would be very much appreciated as this is a
> notoriously funky area of the EFI code.

Ok, this series looks great to me - I've applied this to tip:x86/efi and will push 
it out to linux-next after it passes some local testing.

There should be time enough before v4.5 to figure out potential bugs.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-11-29  8:18 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-27 21:09 [GIT PULL v4 0/6] EFI page table isolation Matt Fleming
2015-11-27 21:09 ` [PATCH v4 1/6] x86: Page align _end to avoid pfn conversion bugs Matt Fleming
2015-11-27 21:09 ` [PATCH v4 2/6] x86/mm/pageattr: Ensure cpa->pfn only contains page frame numbers Matt Fleming
     [not found] ` <1448658575-17029-1-git-send-email-matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2015-11-27 21:09   ` [PATCH v4 3/6] x86/efi: Map RAM into the identity page table for mixed mode Matt Fleming
2015-11-27 21:09   ` [PATCH v4 4/6] x86/efi: Hoist page table switching code into efi_call_virt() Matt Fleming
2015-11-27 21:09 ` [PATCH v4 5/6] x86/efi: Build our own page table structures Matt Fleming
2015-11-27 21:09 ` [PATCH v4 6/6] Documentation/x86: Update EFI memory region description Matt Fleming
2015-11-29  8:18 ` [GIT PULL v4 0/6] EFI page table isolation Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).