kernel-hardening.lists.openwall.com archive mirror
* [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership
@ 2019-02-14  0:01 Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 01/14] mm: add MAP_HUGETLB support to vm_mmap Khalid Aziz
                   ` (13 more replies)
  0 siblings, 14 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Khalid Aziz, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

I am continuing to build on the work Juerg, Tycho and Julian have
done on XPFO. After the last round of updates, we were seeing very
significant performance penalties when stale TLB entries were
flushed actively after an XPFO TLB update. The benchmark used to
measure performance is a kernel build with parallel make. To get
full protection from ret2dir attacks, we must flush stale TLB
entries, and the penalty for doing so goes up with the number of
cores. On a desktop class machine with only 4 cores, enabling TLB
flush for stale entries causes system time for "make -j4" to go up
by a factor of 2.61x, but on a larger machine with 96 cores, system
time with "make -j60" goes up by a factor of 26.37x! I have been
working on reducing this performance penalty.

I implemented two solutions that reduce this performance penalty
significantly. XPFO code flushes the TLB every time a page is
allocated to userspace, by sending IPIs to all processors. Back to
back allocations of pages to userspace on multiple processors result
in a storm of IPIs, each of which is handled by the receiving
processor flushing its TLB. To reduce this IPI storm, I have added a
per-CPU flag that tells a processor to flush its TLB. A processor
checks this flag on every context switch; if the flag is set, it
flushes its TLB and clears the flag. This allows multiple TLB flush
requests to a single CPU to be combined into a single flush. Unlike
the previous version of this patch, a kernel TLB entry for a page
that has been allocated to userspace is still flushed on all
processors, but a processor may hold a stale kernel TLB entry that
was removed on another processor until its next context switch. A
local userspace page allocation by the currently running process can
force the flush of such entries to happen earlier.

The other solution reduces the number of TLB flushes required by
flushing multiple pages at one time, when pages are refilled on the
per-cpu freelist. If the pages being added to the per-cpu freelist
are marked for userspace allocation, TLB entries for these pages can
be flushed upfront and the pages tagged as currently unmapped. When
any such page is later allocated to userspace, there is no need to
perform a TLB flush at that time any more. This batching of TLB
flushes reduces the performance impact further.

I measured system time for parallel make with unmodified 4.20
kernel, 4.20 with XPFO patches before these patches and then again
after applying each of these patches. Here are the results:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20					950.966s
4.20+XPFO				25073.169s	26.37x
4.20+XPFO+Deferred flush		1372.874s	1.44x
4.20+XPFO+Deferred flush+Batch update	1255.021s	1.32x


Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20					607.671s
4.20+XPFO				1588.646s	2.61x
4.20+XPFO+Deferred flush		803.989s	1.32x
4.20+XPFO+Deferred flush+Batch update	795.728s	1.31x

30+% overhead is still very high and there is room for improvement.

Performance with this patch set is good enough to use it as a
starting point for further refinement before we merge it into the
mainline kernel, hence RFC.

I have dropped the patch "mm, x86: omit TLB flushing by default for
XPFO page table modifications", since not flushing the TLB leaves
the kernel wide open to attack and there is no point in enabling
XPFO without flushing the TLB every time kernel TLB entries for
pages are removed. I have also dropped the patch "EXPERIMENTAL:
xpfo, mm: optimize spin lock usage in xpfo_kmap": it produced no
measurable improvement in performance and introduced a possible
deadlock that Laura found.

What remains to be done beyond this patch series:

1. Performance improvements. Ideas to explore: (1) add a freshly
   freed page to the per-CPU freelist without creating a kernel TLB
   entry for it, (2) kernel mappings private to an mm, (3) any others?
2. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
   from Juerg. I dropped it for now since swiotlb code for ARM has
   changed a lot in 4.20.
3. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
   CPUs" to other architectures besides x86.


---------------------------------------------------------

Juerg Haefliger (5):
  mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  swiotlb: Map the buffer if it was unmapped by XPFO
  arm64/mm: Add support for XPFO
  arm64/mm, xpfo: temporarily map dcache regions
  lkdtm: Add test for XPFO

Julian Stecklina (2):
  xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
  xpfo, mm: optimize spinlock usage in xpfo_kunmap

Khalid Aziz (2):
  xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  xpfo, mm: Optimize XPFO TLB flushes by batching them together

Tycho Andersen (5):
  mm: add MAP_HUGETLB support to vm_mmap
  x86: always set IF before oopsing from page fault
  xpfo: add primitives for mapping underlying memory
  arm64/mm: disable section/contiguous mappings if XPFO is enabled
  mm: add a user_virt_to_phys symbol

 .../admin-guide/kernel-parameters.txt         |   2 +
 arch/arm64/Kconfig                            |   1 +
 arch/arm64/mm/Makefile                        |   2 +
 arch/arm64/mm/flush.c                         |   7 +
 arch/arm64/mm/mmu.c                           |   2 +-
 arch/arm64/mm/xpfo.c                          |  64 +++++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/pgtable.h                |  26 ++
 arch/x86/include/asm/tlbflush.h               |   1 +
 arch/x86/mm/Makefile                          |   2 +
 arch/x86/mm/fault.c                           |   6 +
 arch/x86/mm/pageattr.c                        |  23 +-
 arch/x86/mm/tlb.c                             |  38 +++
 arch/x86/mm/xpfo.c                            | 181 ++++++++++++++
 drivers/misc/lkdtm/Makefile                   |   1 +
 drivers/misc/lkdtm/core.c                     |   3 +
 drivers/misc/lkdtm/lkdtm.h                    |   5 +
 drivers/misc/lkdtm/xpfo.c                     | 194 +++++++++++++++
 include/linux/highmem.h                       |  15 +-
 include/linux/mm.h                            |   2 +
 include/linux/mm_types.h                      |   8 +
 include/linux/page-flags.h                    |  18 +-
 include/linux/xpfo.h                          |  95 ++++++++
 include/trace/events/mmflags.h                |  10 +-
 kernel/dma/swiotlb.c                          |   3 +-
 mm/Makefile                                   |   1 +
 mm/mmap.c                                     |  19 +-
 mm/page_alloc.c                               |   7 +
 mm/util.c                                     |  32 +++
 mm/xpfo.c                                     | 223 ++++++++++++++++++
 security/Kconfig                              |  29 +++
 31 files changed, 977 insertions(+), 44 deletions(-)
 create mode 100644 arch/arm64/mm/xpfo.c
 create mode 100644 arch/x86/mm/xpfo.c
 create mode 100644 drivers/misc/lkdtm/xpfo.c
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC PATCH v8 01/14] mm: add MAP_HUGETLB support to vm_mmap
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 02/14] x86: always set IF before oopsing from page fault Khalid Aziz
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Tycho Andersen, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

From: Tycho Andersen <tycho@docker.com>

vm_mmap is exported, which means kernel modules can use it. In particular,
for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so
let's support it via vm_mmap.

Signed-off-by: Tycho Andersen <tycho@docker.com>
Tested-by: Marco Benatto <marco.antonio.780@gmail.com>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 include/linux/mm.h |  2 ++
 mm/mmap.c          | 19 +------------------
 mm/util.c          | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..30bddc7b3c75 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2361,6 +2361,8 @@ struct vm_unmapped_area_info {
 extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
 extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags);
+
 /*
  * Search for an unmapped address range.
  *
diff --git a/mm/mmap.c b/mm/mmap.c
index 6c04292e16a7..c668d7d27c2b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1582,24 +1582,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 		if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
 			goto out_fput;
 	} else if (flags & MAP_HUGETLB) {
-		struct user_struct *user = NULL;
-		struct hstate *hs;
-
-		hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
-		if (!hs)
-			return -EINVAL;
-
-		len = ALIGN(len, huge_page_size(hs));
-		/*
-		 * VM_NORESERVE is used because the reservations will be
-		 * taken when vm_ops->mmap() is called
-		 * A dummy user value is used because we are not locking
-		 * memory so no accounting is necessary
-		 */
-		file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
-				VM_NORESERVE,
-				&user, HUGETLB_ANONHUGE_INODE,
-				(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+		file = map_hugetlb_setup(&len, flags);
 		if (IS_ERR(file))
 			return PTR_ERR(file);
 	}
diff --git a/mm/util.c b/mm/util.c
index 8bf08b5b5760..536c14cf88ba 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,6 +357,29 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	return ret;
 }
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags)
+{
+	struct user_struct *user = NULL;
+	struct hstate *hs;
+
+	hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+	if (!hs)
+		return ERR_PTR(-EINVAL);
+
+	*len = ALIGN(*len, huge_page_size(hs));
+
+	/*
+	 * VM_NORESERVE is used because the reservations will be
+	 * taken when vm_ops->mmap() is called
+	 * A dummy user value is used because we are not locking
+	 * memory so no accounting is necessary
+	 */
+	return hugetlb_file_setup(HUGETLB_ANON_FILE, *len,
+			VM_NORESERVE,
+			&user, HUGETLB_ANONHUGE_INODE,
+			(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+}
+
 unsigned long vm_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot,
 	unsigned long flag, unsigned long offset)
@@ -366,6 +389,15 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
 	if (unlikely(offset_in_page(offset)))
 		return -EINVAL;
 
+	if (flag & MAP_HUGETLB) {
+		if (file)
+			return -EINVAL;
+
+		file = map_hugetlb_setup(&len, flag);
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+	}
+
 	return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
 }
 EXPORT_SYMBOL(vm_mmap);
-- 
2.17.1


* [RFC PATCH v8 02/14] x86: always set IF before oopsing from page fault
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 01/14] mm: add MAP_HUGETLB support to vm_mmap Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO) Khalid Aziz
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Tycho Andersen, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

From: Tycho Andersen <tycho@docker.com>

Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
that might sleep:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: lkdtm_xpfo_test Tainted: G      D         4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

To be safe, let's just always enable irqs.

The particular case I'm hitting is:

Aug 23 19:30:27 xpfo kernel: [   38.278615]  __bad_area_nosemaphore+0x1a9/0x1d0
Aug 23 19:30:27 xpfo kernel: [   38.278617]  bad_area_nosemaphore+0xf/0x20
Aug 23 19:30:27 xpfo kernel: [   38.278618]  __do_page_fault+0xd1/0x540
Aug 23 19:30:27 xpfo kernel: [   38.278620]  ? irq_work_queue+0x9b/0xb0
Aug 23 19:30:27 xpfo kernel: [   38.278623]  ? wake_up_klogd+0x36/0x40
Aug 23 19:30:27 xpfo kernel: [   38.278624]  trace_do_page_fault+0x3c/0xf0
Aug 23 19:30:27 xpfo kernel: [   38.278625]  do_async_page_fault+0x14/0x60
Aug 23 19:30:27 xpfo kernel: [   38.278627]  async_page_fault+0x28/0x30

This happens when a fault in kernel space has been triggered by XPFO.

Signed-off-by: Tycho Andersen <tycho@docker.com>
CC: x86@kernel.org
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 arch/x86/mm/fault.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 71d4b9d4d43f..ba51652fbd33 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -748,6 +748,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	/* Executive summary in case the body of the oops scrolled away */
 	printk(KERN_DEFAULT "CR2: %016lx\n", address);
 
+	/*
+	 * We're about to oops, which might kill the task. Make sure we're
+	 * allowed to sleep.
+	 */
+	flags |= X86_EFLAGS_IF;
+
 	oops_end(flags, regs, sig);
 }
 
-- 
2.17.1


* [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 01/14] mm: add MAP_HUGETLB support to vm_mmap Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 02/14] x86: always set IF before oopsing from page fault Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14 10:56   ` Peter Zijlstra
  2019-02-14  0:01 ` [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO Khalid Aziz
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel, Tycho Andersen, Marco Benatto

From: Juerg Haefliger <juerg.haefliger@canonical.com>

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's page table). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping,
specifically:
  - two flags to distinguish user vs. kernel pages and to tag unmapped
    pages.
  - a reference counter to balance kmap/kunmap operations.
  - a lock to serialize access to the XPFO fields.

This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
      the tlb entry on all CPUs when unmapping it in kunmap
    * handle lookup_page_ext()/lookup_xpfo() returning NULL
    * drop lots of BUG()s in favor of WARN()
    * don't disable irqs in xpfo_kmap/xpfo_kunmap, export
      __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
      pass it

CC: x86@kernel.org
Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@docker.com>
Signed-off-by: Marco Benatto <marco.antonio.780@gmail.com>
[jsteckli@amazon.de: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 .../admin-guide/kernel-parameters.txt         |   2 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/pgtable.h                |  26 ++
 arch/x86/mm/Makefile                          |   2 +
 arch/x86/mm/pageattr.c                        |  23 +-
 arch/x86/mm/xpfo.c                            | 119 ++++++++++
 include/linux/highmem.h                       |  15 +-
 include/linux/xpfo.h                          |  48 ++++
 mm/Makefile                                   |   1 +
 mm/page_alloc.c                               |   2 +
 mm/page_ext.c                                 |   4 +
 mm/xpfo.c                                     | 223 ++++++++++++++++++
 security/Kconfig                              |  19 ++
 13 files changed, 463 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/mm/xpfo.c
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index aefd358a5ca3..c4c62599f216 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2982,6 +2982,8 @@
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
+	noxpfo		[X86-64] Disable XPFO when CONFIG_XPFO is on.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e794a43c..d69d8cc6e57e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -207,6 +207,7 @@ config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 40616e805292..f6eeb75c8a21 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+	unsigned long	*vaddr;
+	pgd_t		*pgd;
+	pgprot_t	mask_set;
+	pgprot_t	mask_clr;
+	unsigned long	numpages;
+	int		flags;
+	unsigned long	pfn;
+	unsigned	force_split		: 1,
+			force_static_prot	: 1;
+	int		curpage;
+	struct page	**pages;
+};
+
+
+int
+should_split_large_page(pte_t *kpte, unsigned long address,
+			struct cpa_data *cpa);
+extern spinlock_t cpa_lock;
+int
+__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
+		   struct page *base);
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..93b0fdaf4a99 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+
+obj-$(CONFIG_XPFO)		+= xpfo.o
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a1bcde35db4c..84002442ab61 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -26,23 +26,6 @@
 #include <asm/pat.h>
 #include <asm/set_memory.h>
 
-/*
- * The current flushing context - we pass it instead of 5 arguments:
- */
-struct cpa_data {
-	unsigned long	*vaddr;
-	pgd_t		*pgd;
-	pgprot_t	mask_set;
-	pgprot_t	mask_clr;
-	unsigned long	numpages;
-	int		flags;
-	unsigned long	pfn;
-	unsigned	force_split		: 1,
-			force_static_prot	: 1;
-	int		curpage;
-	struct page	**pages;
-};
-
 enum cpa_warn {
 	CPA_CONFLICT,
 	CPA_PROTECT,
@@ -57,7 +40,7 @@ static const int cpa_warn_level = CPA_PROTECT;
  * entries change the page attribute in parallel to some other cpu
  * splitting a large page entry along with changing the attribute.
  */
-static DEFINE_SPINLOCK(cpa_lock);
+DEFINE_SPINLOCK(cpa_lock);
 
 #define CPA_FLUSHTLB 1
 #define CPA_ARRAY 2
@@ -869,7 +852,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
 	return 0;
 }
 
-static int should_split_large_page(pte_t *kpte, unsigned long address,
+int should_split_large_page(pte_t *kpte, unsigned long address,
 				   struct cpa_data *cpa)
 {
 	int do_split;
@@ -919,7 +902,7 @@ static void split_set_pte(struct cpa_data *cpa, pte_t *pte, unsigned long pfn,
 	set_pte(pte, pfn_pte(pfn, ref_prot));
 }
 
-static int
+int
 __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 		   struct page *base)
 {
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
new file mode 100644
index 000000000000..6c7502993351
--- /dev/null
+++ b/arch/x86/mm/xpfo.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+
+#include <asm/tlbflush.h>
+
+extern spinlock_t cpa_lock;
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+	unsigned int level;
+	pgprot_t msk_clr;
+	pte_t *pte = lookup_address((unsigned long)kaddr, &level);
+
+	if (unlikely(!pte)) {
+		WARN(1, "xpfo: invalid address %p\n", kaddr);
+		return;
+	}
+
+	switch (level) {
+	case PG_LEVEL_4K:
+		set_pte_atomic(pte, pfn_pte(page_to_pfn(page),
+			       canon_pgprot(prot)));
+		break;
+	case PG_LEVEL_2M:
+	case PG_LEVEL_1G: {
+		struct cpa_data cpa = { };
+		int do_split;
+
+		if (level == PG_LEVEL_2M)
+			msk_clr = pmd_pgprot(*(pmd_t *)pte);
+		else
+			msk_clr = pud_pgprot(*(pud_t *)pte);
+
+		cpa.vaddr = kaddr;
+		cpa.pages = &page;
+		cpa.mask_set = prot;
+		cpa.mask_clr = msk_clr;
+		cpa.numpages = 1;
+		cpa.flags = 0;
+		cpa.curpage = 0;
+		cpa.force_split = 0;
+
+
+		do_split = should_split_large_page(pte, (unsigned long)kaddr,
+						   &cpa);
+		if (do_split) {
+			struct page *base;
+
+			base = alloc_pages(GFP_ATOMIC, 0);
+			if (!base) {
+				WARN(1, "xpfo: failed to split large page\n");
+				break;
+			}
+
+			if (!debug_pagealloc_enabled())
+				spin_lock(&cpa_lock);
+			if  (__split_large_page(&cpa, pte,
+						(unsigned long)kaddr, base) < 0)
+				WARN(1, "xpfo: failed to split large page\n");
+			if (!debug_pagealloc_enabled())
+				spin_unlock(&cpa_lock);
+		}
+
+		break;
+	}
+	case PG_LEVEL_512G:
+		/* fallthrough, splitting infrastructure doesn't
+		 * support 512G pages.
+		 */
+	default:
+		WARN(1, "xpfo: unsupported page level %x\n", level);
+	}
+
+}
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+	int level;
+	unsigned long size, kaddr;
+
+	kaddr = (unsigned long)page_address(page);
+
+	if (unlikely(!lookup_address(kaddr, &level))) {
+		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr,
+		     level);
+		return;
+	}
+
+	switch (level) {
+	case PG_LEVEL_4K:
+		size = PAGE_SIZE;
+		break;
+	case PG_LEVEL_2M:
+		size = PMD_SIZE;
+		break;
+	case PG_LEVEL_1G:
+		size = PUD_SIZE;
+		break;
+	default:
+		WARN(1, "xpfo: unsupported page level %x\n", level);
+		return;
+	}
+
+	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 0690679832d4..1fdae929e38b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -8,6 +8,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -56,24 +57,34 @@ static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..b15234745fb4
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Tycho Andersen <tycho@docker.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#include <linux/types.h>
+#include <linux/dma-direction.h>
+
+struct page;
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
+void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
+				    enum dma_data_direction dir);
+void xpfo_flush_kernel_tlb(struct page *page, int order);
+
+void xpfo_kmap(void *kaddr, struct page *page);
+void xpfo_kunmap(void *kaddr, struct page *page);
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
+void xpfo_free_pages(struct page *page, int order);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_pages(struct page *page, int order) { }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..e99e1e6ae5ae 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,3 +99,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e95b5b7c9c3d..08e277790b5f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1038,6 +1038,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
+	xpfo_free_pages(page, order);
 
 	return true;
 }
@@ -1915,6 +1916,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_pages(page, order, gfp_flags);
 	set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..38e5013dcb9a 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -8,6 +8,7 @@
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
+#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
+#ifdef CONFIG_XPFO
+	&page_xpfo_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..24b33d3c20cb
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Tycho Andersen <tycho@docker.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+/* XPFO page state flags */
+enum xpfo_flags {
+	XPFO_PAGE_USER,		/* Page is allocated to user-space */
+	XPFO_PAGE_UNMAPPED,	/* Page is unmapped from the linear map */
+};
+
+/* Per-page XPFO house-keeping data */
+struct xpfo {
+	unsigned long flags;	/* Page state */
+	bool inited;		/* Map counter and lock initialized */
+	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
+	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
+};
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool xpfo_disabled __initdata;
+
+static int __init noxpfo_param(char *str)
+{
+	xpfo_disabled = true;
+
+	return 0;
+}
+
+early_param("noxpfo", noxpfo_param);
+
+static bool __init need_xpfo(void)
+{
+	if (xpfo_disabled) {
+		pr_info("XPFO disabled\n");
+		return false;
+	}
+
+	return true;
+}
+
+static void init_xpfo(void)
+{
+	pr_info("XPFO enabled\n");
+	static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+	.size = sizeof(struct xpfo),
+	.need = need_xpfo,
+	.init = init_xpfo,
+};
+
+static inline struct xpfo *lookup_xpfo(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext)) {
+		WARN(1, "xpfo: failed to get page ext");
+		return NULL;
+	}
+
+	return (void *)page_ext + page_xpfo_ops.offset;
+}
+
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
+{
+	int i, flush_tlb = 0;
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+		xpfo = lookup_xpfo(page + i);
+		if (!xpfo)
+			continue;
+
+		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+		     "xpfo: unmapped page being allocated\n");
+
+		/* Initialize the map lock and map counter */
+		if (unlikely(!xpfo->inited)) {
+			spin_lock_init(&xpfo->maplock);
+			atomic_set(&xpfo->mapcount, 0);
+			xpfo->inited = true;
+		}
+		WARN(atomic_read(&xpfo->mapcount),
+		     "xpfo: already mapped page being allocated\n");
+
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Tag the page as a user page and flush the TLB if it
+			 * was previously allocated to the kernel.
+			 */
+			if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+				flush_tlb = 1;
+		} else {
+			/* Tag the page as a non-user (kernel) page */
+			clear_bit(XPFO_PAGE_USER, &xpfo->flags);
+		}
+	}
+
+	if (flush_tlb)
+		xpfo_flush_kernel_tlb(page, order);
+}
+
+void xpfo_free_pages(struct page *page, int order)
+{
+	int i;
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		xpfo = lookup_xpfo(page + i);
+		if (!xpfo || unlikely(!xpfo->inited)) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page.
+			 */
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
+			clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+			set_kpte(page_address(page + i), page + i,
+				 PAGE_KERNEL);
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	xpfo = lookup_xpfo(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!xpfo || unlikely(!xpfo->inited) ||
+	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+		return;
+
+	spin_lock(&xpfo->maplock);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
+	    test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+		set_kpte(kaddr, page, PAGE_KERNEL);
+
+	spin_unlock(&xpfo->maplock);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	xpfo = lookup_xpfo(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!xpfo || unlikely(!xpfo->inited) ||
+	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+		return;
+
+	spin_lock(&xpfo->maplock);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&xpfo->mapcount) == 0) {
+		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+		     "xpfo: unmapping already unmapped page\n");
+		set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+		set_kpte(kaddr, page, __pgprot(0));
+		xpfo_flush_kernel_tlb(page, 0);
+	}
+
+	spin_unlock(&xpfo->maplock);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
diff --git a/security/Kconfig b/security/Kconfig
index d9aa521b5206..8d0e4e303551 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,25 @@ menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on ARCH_SUPPORTS_XPFO
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread
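[Editorial note: the map/unmap bookkeeping in mm/xpfo.c above can be modeled in a few lines. The sketch below is a user-space model with hypothetical names, not kernel code; it tracks only the map counter and the USER/UNMAPPED flags to show when xpfo_kmap()/xpfo_kunmap() would change the linear-map PTE. The real code additionally takes maplock and flushes the TLB on unmap.]

```c
#include <stdbool.h>
#include <assert.h>

/* Hypothetical model of the per-page XPFO state machine. */
struct xpfo_model {
	bool user;      /* XPFO_PAGE_USER: page is allocated to user space */
	bool unmapped;  /* XPFO_PAGE_UNMAPPED: PTE cleared in the physmap */
	int  mapcount;  /* balanced kmap/kunmap nesting */
};

/* Returns true when the PTE would be restored (set_kpte(..., PAGE_KERNEL)). */
static bool model_kmap(struct xpfo_model *x)
{
	if (!x->user)
		return false;           /* kernel page: nothing to do */
	if (++x->mapcount == 1 && x->unmapped) {
		x->unmapped = false;    /* map back into the kernel */
		return true;
	}
	return false;
}

/* Returns true when the PTE would be cleared (set_kpte(..., __pgprot(0))
 * followed by a TLB flush in the real code). */
static bool model_kunmap(struct xpfo_model *x)
{
	if (!x->user)
		return false;
	if (--x->mapcount == 0) {
		x->unmapped = true;     /* unmap from the kernel */
		return true;
	}
	return false;
}
```

Only the first kmap and the last kunmap touch the page table; nested map/unmap pairs are no-ops, which is what makes back-to-back kmaps on the same user page cheap.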

* [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (2 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO) Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14  7:47   ` Christoph Hellwig
  2019-02-14  0:01 ` [RFC PATCH v8 05/14] arm64/mm: Add support for XPFO Khalid Aziz
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel, Tycho Andersen

From: Juerg Haefliger <juerg.haefliger@canonical.com>

v6: * guard against lookup_xpfo() returning NULL

CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@docker.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 include/linux/xpfo.h |  4 ++++
 kernel/dma/swiotlb.c |  3 ++-
 mm/xpfo.c            | 15 +++++++++++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index b15234745fb4..cba37ffb09b1 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -36,6 +36,8 @@ void xpfo_kunmap(void *kaddr, struct page *page);
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
 void xpfo_free_pages(struct page *page, int order);
 
+bool xpfo_page_is_unmapped(struct page *page);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -43,6 +45,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
 static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
 static inline void xpfo_free_pages(struct page *page, int order) { }
 
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 045930e32c0e..820a54b57491 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 {
 	unsigned long pfn = PFN_DOWN(orig_addr);
 	unsigned char *vaddr = phys_to_virt(tlb_addr);
+	struct page *page = pfn_to_page(pfn);
 
-	if (PageHighMem(pfn_to_page(pfn))) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		/* The buffer does not have a mapping.  Map it in and copy */
 		unsigned int offset = orig_addr & ~PAGE_MASK;
 		char *buffer;
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 24b33d3c20cb..67884736bebe 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -221,3 +221,18 @@ void xpfo_kunmap(void *kaddr, struct page *page)
 	spin_unlock(&xpfo->maplock);
 }
 EXPORT_SYMBOL(xpfo_kunmap);
+
+bool xpfo_page_is_unmapped(struct page *page)
+{
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	xpfo = lookup_xpfo(page);
+	if (!xpfo || unlikely(!xpfo->inited))
+		return false;
+
+	return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+}
+EXPORT_SYMBOL(xpfo_page_is_unmapped);
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread
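[Editorial note: a user-space sketch of the guard ordering xpfo_page_is_unmapped() relies on. The page_ext lookup can fail (NULL) and the per-page data may be uninitialized; either condition must short-circuit to "mapped" before the flags are consulted. Names are hypothetical, not kernel code.]

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical stand-in for the per-page XPFO bookkeeping. */
struct xpfo_state {
	bool inited;    /* map counter and lock initialized */
	bool unmapped;  /* XPFO_PAGE_UNMAPPED */
};

static bool page_is_unmapped(const struct xpfo_state *x)
{
	/* Both failure modes must be checked before dereferencing flags:
	 * a NULL lookup result, or data allocated before page_ext init. */
	if (x == NULL || !x->inited)
		return false;   /* treat as a mapped kernel page */
	return x->unmapped;
}
```

swiotlb_bounce() uses this answer the same way it uses PageHighMem(): a true result means the buffer has no linear-map address and must be temporarily mapped before the copy.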

* [RFC PATCH v8 05/14] arm64/mm: Add support for XPFO
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (3 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 06/14] xpfo: add primitives for mapping underlying memory Khalid Aziz
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel, Tycho Andersen, Khalid Aziz

From: Juerg Haefliger <juerg.haefliger@canonical.com>

Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
provide a hook for updating a single kernel page table entry (which is
required by the generic XPFO code).

CC: linux-arm-kernel@lists.infradead.org
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@docker.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
v6:
	- use flush_tlb_kernel_range() instead of __flush_tlb_one()

v8:
	- Add check for NULL pte in set_kpte()

 arch/arm64/Kconfig     |  1 +
 arch/arm64/mm/Makefile |  2 ++
 arch/arm64/mm/xpfo.c   | 64 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+)
 create mode 100644 arch/arm64/mm/xpfo.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ea2ab0330e3a..f0a9c0007d23 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -171,6 +171,7 @@ config ARM64
 	select SWIOTLB
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
+	select ARCH_SUPPORTS_XPFO
 	help
 	  ARM 64-bit (AArch64) Linux support.
 
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..cca3808d9776 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
 KASAN_SANITIZE_kasan_init.o	:= n
+
+obj-$(CONFIG_XPFO)		+= xpfo.o
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
new file mode 100644
index 000000000000..1f790f7746ad
--- /dev/null
+++ b/arch/arm64/mm/xpfo.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+
+#include <asm/tlbflush.h>
+
+/*
+ * Lookup the page table entry for a virtual address and return a pointer to
+ * the entry. Based on x86 tree.
+ */
+static pte_t *lookup_address(unsigned long addr)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
+		return NULL;
+
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+
+	return pte_offset_kernel(pmd, addr);
+}
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+	pte_t *pte = lookup_address((unsigned long)kaddr);
+
+	if (unlikely(!pte)) {
+		WARN(1, "xpfo: invalid address %p\n", kaddr);
+		return;
+	}
+
+	set_pte(pte, pfn_pte(page_to_pfn(page), prot));
+}
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+	unsigned long kaddr = (unsigned long)page_address(page);
+	unsigned long size = PAGE_SIZE;
+
+	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread
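[Editorial note: the lookup_address() walk in the arm64 patch above descends one table per level and bails out as soon as a level is empty. The sketch below is a user-space model with hypothetical types (NULL table pointers play the role of pgd_none()/pud_none()/pmd_none()); it is not kernel code and ignores block mappings entirely, just like the patch.]

```c
#include <stddef.h>
#include <assert.h>

typedef unsigned long pte_t_m;                 /* leaf entry */
typedef struct { pte_t_m *ptes; } pmd_m;       /* last-level table */
typedef struct { pmd_m   *pmds; } pud_m;
typedef struct { pud_m   *puds; } pgd_m;

/* Walk pgd -> pud -> pmd and return a pointer to the PTE slot,
 * or NULL if any level is empty (the set_kpte() WARN case). */
static pte_t_m *lookup(pgd_m *pgd, size_t pud_i, size_t pmd_i, size_t pte_i)
{
	if (pgd == NULL || pgd->puds == NULL)
		return NULL;                   /* pgd_none() */
	pud_m *pud = &pgd->puds[pud_i];
	if (pud->pmds == NULL)
		return NULL;                   /* pud_none() */
	pmd_m *pmd = &pud->pmds[pmd_i];
	if (pmd->ptes == NULL)
		return NULL;                   /* pmd_none() */
	return &pmd->ptes[pte_i];              /* pte_offset_kernel() */
}
```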

* [RFC PATCH v8 06/14] xpfo: add primitives for mapping underlying memory
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (4 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 05/14] arm64/mm: Add support for XPFO Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions Khalid Aziz
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Tycho Andersen, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

From: Tycho Andersen <tycho@docker.com>

In some cases (e.g. arm64 DMA and data cache flushes) we may need to
operate on pages whose underlying memory has been unmapped by XPFO.
Add some primitives for ensuring the underlying memory is
mapped/unmapped in the face of XPFO.

Signed-off-by: Tycho Andersen <tycho@docker.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 include/linux/xpfo.h | 22 ++++++++++++++++++++++
 mm/xpfo.c            | 30 ++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index cba37ffb09b1..1ae05756344d 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -38,6 +38,15 @@ void xpfo_free_pages(struct page *page, int order);
 
 bool xpfo_page_is_unmapped(struct page *page);
 
+#define XPFO_NUM_PAGES(addr, size) \
+	(PFN_UP((unsigned long) (addr) + (size)) - \
+		PFN_DOWN((unsigned long) (addr)))
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+		   size_t mapping_len);
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+		     size_t mapping_len);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -47,6 +56,19 @@ static inline void xpfo_free_pages(struct page *page, int order) { }
 
 static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
 
+#define XPFO_NUM_PAGES(addr, size) 0
+
+static inline void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+				 size_t mapping_len)
+{
+}
+
+static inline void xpfo_temp_unmap(const void *addr, size_t size,
+				   void **mapping, size_t mapping_len)
+{
+}
+
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 67884736bebe..92ca6d1baf06 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -14,6 +14,7 @@
  * the Free Software Foundation.
  */
 
+#include <linux/highmem.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/page_ext.h>
@@ -236,3 +237,32 @@ bool xpfo_page_is_unmapped(struct page *page)
 	return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
 }
 EXPORT_SYMBOL(xpfo_page_is_unmapped);
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+		   size_t mapping_len)
+{
+	struct page *page = virt_to_page(addr);
+	int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+	memset(mapping, 0, mapping_len);
+
+	for (i = 0; i < num_pages; i++) {
+		if (page_to_virt(page + i) >= addr + size)
+			break;
+
+		if (xpfo_page_is_unmapped(page + i))
+			mapping[i] = kmap_atomic(page + i);
+	}
+}
+EXPORT_SYMBOL(xpfo_temp_map);
+
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+		     size_t mapping_len)
+{
+	int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+	for (i = 0; i < num_pages; i++)
+		if (mapping[i])
+			kunmap_atomic(mapping[i]);
+}
+EXPORT_SYMBOL(xpfo_temp_unmap);
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread
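[Editorial note: the XPFO_NUM_PAGES() macro above is plain pfn arithmetic: the number of page frames a byte range [addr, addr + size) touches, which sizes the mapping[] array passed to xpfo_temp_map(). The sketch restates it in user space with a hard-coded 4 KiB page size (an assumption for illustration only).]

```c
#include <stddef.h>
#include <assert.h>

#define PAGE_SHIFT_M 12                       /* assumed 4 KiB pages */
#define PAGE_SIZE_M  (1UL << PAGE_SHIFT_M)

#define PFN_DOWN_M(x) ((x) >> PAGE_SHIFT_M)
#define PFN_UP_M(x)   (((x) + PAGE_SIZE_M - 1) >> PAGE_SHIFT_M)

/* Frames spanned by [addr, addr + size): round the end up and the
 * start down to frame numbers and take the difference. */
static size_t xpfo_num_pages(unsigned long addr, size_t size)
{
	return PFN_UP_M(addr + size) - PFN_DOWN_M(addr);
}
```

Note that a page-sized range needs two entries when it straddles a page boundary, which is why the macro cannot simply divide size by PAGE_SIZE.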

* [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (5 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 06/14] xpfo: add primitives for mapping underlying memory Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14 15:54   ` Tycho Andersen
  2019-02-14  0:01 ` [RFC PATCH v8 08/14] arm64/mm: disable section/contiguous mappings if XPFO is enabled Khalid Aziz
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel, Tycho Andersen

From: Juerg Haefliger <juerg.haefliger@canonical.com>

If the page is unmapped by XPFO, a data cache flush results in a fatal
page fault, so let's temporarily map the region, flush the cache, and then
unmap it.

v6: actually flush in the face of xpfo, and temporarily map the underlying
    memory so it can be flushed correctly

CC: linux-arm-kernel@lists.infradead.org
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@docker.com>
---
 arch/arm64/mm/flush.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 30695a868107..fad09aafd9d5 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 #include <asm/cache.h>
@@ -28,9 +29,15 @@
 void sync_icache_aliases(void *kaddr, unsigned long len)
 {
 	unsigned long addr = (unsigned long)kaddr;
+	unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
+	void *mapping[num_pages];
 
 	if (icache_is_aliasing()) {
+		xpfo_temp_map(kaddr, len, mapping,
+			      sizeof(mapping[0]) * num_pages);
 		__clean_dcache_area_pou(kaddr, len);
+		xpfo_temp_unmap(kaddr, len, mapping,
+				sizeof(mapping[0]) * num_pages);
 		__flush_icache_all();
 	} else {
 		flush_icache_range(addr, addr + len);
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC PATCH v8 08/14] arm64/mm: disable section/contiguous mappings if XPFO is enabled
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (6 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-15 13:09   ` Mark Rutland
  2019-02-14  0:01 ` [RFC PATCH v8 09/14] mm: add a user_virt_to_phys symbol Khalid Aziz
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Tycho Andersen, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

From: Tycho Andersen <tycho@docker.com>

XPFO doesn't support section/contiguous mappings yet, so let's disable it
if XPFO is turned on.

Thanks to Laura Abbott for the simplification from v5, and Mark Rutland for
pointing out we need NO_CONT_MAPPINGS too.

CC: linux-arm-kernel@lists.infradead.org
Signed-off-by: Tycho Andersen <tycho@docker.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 arch/arm64/mm/mmu.c  | 2 +-
 include/linux/xpfo.h | 4 ++++
 mm/xpfo.c            | 6 ++++++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d1d6601b385d..f4dd27073006 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -451,7 +451,7 @@ static void __init map_mem(pgd_t *pgdp)
 	struct memblock_region *reg;
 	int flags = 0;
 
-	if (debug_pagealloc_enabled())
+	if (debug_pagealloc_enabled() || xpfo_enabled())
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 1ae05756344d..8b029918a958 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -47,6 +47,8 @@ void xpfo_temp_map(const void *addr, size_t size, void **mapping,
 void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
 		     size_t mapping_len);
 
+bool xpfo_enabled(void);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -69,6 +71,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t size,
 }
 
 
+static inline bool xpfo_enabled(void) { return false; }
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 92ca6d1baf06..150784ae0f08 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -71,6 +71,12 @@ struct page_ext_operations page_xpfo_ops = {
 	.init = init_xpfo,
 };
 
+bool __init xpfo_enabled(void)
+{
+	return !xpfo_disabled;
+}
+EXPORT_SYMBOL(xpfo_enabled);
+
 static inline struct xpfo *lookup_xpfo(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread
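[Editorial note: the reason XPFO needs NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS is granularity: unmapping one frame from the direct map requires a last-level entry for that frame, while a single block or contiguous entry covers many frames. The numbers below are illustrative for a 4 KiB granule (2 MiB blocks, 16-page contiguous ranges); they are assumptions for the sketch, not derived from the patch.]

```c
#include <assert.h>

/* How many page frames one mapping entry covers. A per-page
 * permission change is only possible when this is 1; anything
 * larger would first have to be split. */
static unsigned int frames_per_mapping(unsigned int map_shift,
				       unsigned int page_shift)
{
	return 1u << (map_shift - page_shift);
}
```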

* [RFC PATCH v8 09/14] mm: add a user_virt_to_phys symbol
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (7 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 08/14] arm64/mm: disable section/contiguous mappings if XPFO is enabled Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 10/14] lkdtm: Add test for XPFO Khalid Aziz
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Tycho Andersen, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel, Khalid Aziz

From: Tycho Andersen <tycho@docker.com>

We need something like this for testing XPFO. Since it's architecture
specific, putting it in the test code is slightly awkward, so let's make it
an arch-specific symbol and export it for use in LKDTM.

CC: linux-arm-kernel@lists.infradead.org
CC: x86@kernel.org
Signed-off-by: Tycho Andersen <tycho@docker.com>
Tested-by: Marco Benatto <marco.antonio.780@gmail.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case
v7: * make user_virt_to_phys a GPL symbol

 arch/x86/mm/xpfo.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/xpfo.h |  8 +++++++
 2 files changed, 65 insertions(+)

diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index 6c7502993351..e13b99019c47 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -117,3 +117,60 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order)
 
 	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
 }
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+	phys_addr_t phys_addr;
+	unsigned long offset;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset(current->mm, addr);
+	if (pgd_none(*pgd))
+		return 0;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return 0;
+
+	if (p4d_large(*p4d) || !p4d_present(*p4d)) {
+		phys_addr = (unsigned long)p4d_pfn(*p4d) << PAGE_SHIFT;
+		offset = addr & ~P4D_MASK;
+		goto out;
+	}
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud))
+		return 0;
+
+	if (pud_large(*pud) || !pud_present(*pud)) {
+		phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+		offset = addr & ~PUD_MASK;
+		goto out;
+	}
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return 0;
+
+	if (pmd_large(*pmd) || !pmd_present(*pmd)) {
+		phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+		offset = addr & ~PMD_MASK;
+		goto out;
+	}
+
+	pte =  pte_offset_kernel(pmd, addr);
+	phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+	offset = addr & ~PAGE_MASK;
+
+out:
+	return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(user_virt_to_phys);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 8b029918a958..117869991d5b 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -24,6 +24,10 @@ struct page;
 
 #ifdef CONFIG_XPFO
 
+#include <linux/dma-mapping.h>
+
+#include <linux/types.h>
+
 extern struct page_ext_operations page_xpfo_ops;
 
 void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
@@ -49,6 +53,8 @@ void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
 
 bool xpfo_enabled(void);
 
+phys_addr_t user_virt_to_phys(unsigned long addr);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -73,6 +79,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t size,
 
 static inline bool xpfo_enabled(void) { return false; }
 
+static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread
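[Editorial note: at every level, user_virt_to_phys() above composes its result the same way: the pfn supplies the high bits and the level's offset mask keeps the low bits of the virtual address. The sketch shows the 4 KiB (PTE) and 2 MiB (PMD) cases in user space; shift values are the usual x86-64 ones, assumed here for illustration.]

```c
#include <stdint.h>
#include <assert.h>

#define PAGE_SHIFT_M 12
#define PMD_SHIFT_M  21
#define PAGE_MASK_M  (~((1UL << PAGE_SHIFT_M) - 1))
#define PMD_MASK_M   (~((1UL << PMD_SHIFT_M) - 1))

/* Leaf PTE: offset is the low 12 bits of the virtual address. */
static uint64_t phys_from_pte(uint64_t pfn, uint64_t vaddr)
{
	return (pfn << PAGE_SHIFT_M) | (vaddr & ~PAGE_MASK_M);
}

/* Large (2 MiB) PMD entry: the pfn still names a 4 KiB frame (the
 * 2 MiB-aligned start of the region), but the offset keeps the low
 * 21 bits of the virtual address. */
static uint64_t phys_from_pmd(uint64_t pfn, uint64_t vaddr)
{
	return (pfn << PAGE_SHIFT_M) | (vaddr & ~PMD_MASK_M);
}
```

This is why the function can `goto out` as soon as it sees a large entry: only the mask used for the offset changes, not the shape of the computation.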

* [RFC PATCH v8 10/14] lkdtm: Add test for XPFO
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (8 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 09/14] mm: add a user_virt_to_phys symbol Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 11/14] xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION Khalid Aziz
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel, Tycho Andersen

From: Juerg Haefliger <juerg.haefliger@canonical.com>

This test simply reads from userspace memory via the kernel's linear
map.

v6: * drop an #ifdef, just let the test fail if XPFO is not supported
    * add XPFO_SMP test to try and test the case when one CPU does an xpfo
      unmap of an address, that it can't be used accidentally by other
      CPUs.

Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@docker.com>
Tested-by: Marco Benatto <marco.antonio.780@gmail.com>
[jsteckli@amazon.de: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 drivers/misc/lkdtm/Makefile |   1 +
 drivers/misc/lkdtm/core.c   |   3 +
 drivers/misc/lkdtm/lkdtm.h  |   5 +
 drivers/misc/lkdtm/xpfo.c   | 194 ++++++++++++++++++++++++++++++++++++
 4 files changed, 203 insertions(+)
 create mode 100644 drivers/misc/lkdtm/xpfo.c

diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile
index 951c984de61a..97c6b7818cce 100644
--- a/drivers/misc/lkdtm/Makefile
+++ b/drivers/misc/lkdtm/Makefile
@@ -9,6 +9,7 @@ lkdtm-$(CONFIG_LKDTM)		+= refcount.o
 lkdtm-$(CONFIG_LKDTM)		+= rodata_objcopy.o
 lkdtm-$(CONFIG_LKDTM)		+= usercopy.o
 lkdtm-$(CONFIG_LKDTM)		+= stackleak.o
+lkdtm-$(CONFIG_LKDTM)		+= xpfo.o
 
 KASAN_SANITIZE_stackleak.o	:= n
 KCOV_INSTRUMENT_rodata.o	:= n
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..25f4ab4ebf50 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -185,6 +185,9 @@ static const struct crashtype crashtypes[] = {
 	CRASHTYPE(USERCOPY_KERNEL),
 	CRASHTYPE(USERCOPY_KERNEL_DS),
 	CRASHTYPE(STACKLEAK_ERASING),
+	CRASHTYPE(XPFO_READ_USER),
+	CRASHTYPE(XPFO_READ_USER_HUGE),
+	CRASHTYPE(XPFO_SMP),
 };
 
 
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..6b31ff0c7f8f 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -87,4 +87,9 @@ void lkdtm_USERCOPY_KERNEL_DS(void);
 /* lkdtm_stackleak.c */
 void lkdtm_STACKLEAK_ERASING(void);
 
+/* lkdtm_xpfo.c */
+void lkdtm_XPFO_READ_USER(void);
+void lkdtm_XPFO_READ_USER_HUGE(void);
+void lkdtm_XPFO_SMP(void);
+
 #endif
diff --git a/drivers/misc/lkdtm/xpfo.c b/drivers/misc/lkdtm/xpfo.c
new file mode 100644
index 000000000000..d903063bdd0b
--- /dev/null
+++ b/drivers/misc/lkdtm/xpfo.c
@@ -0,0 +1,194 @@
+/*
+ * This is for all the tests related to XPFO (eXclusive Page Frame Ownership).
+ */
+
+#include "lkdtm.h"
+
+#include <linux/cpumask.h>
+#include <linux/mman.h>
+#include <linux/uaccess.h>
+#include <linux/xpfo.h>
+#include <linux/kthread.h>
+
+#include <linux/delay.h>
+#include <linux/sched/task.h>
+
+#define XPFO_DATA 0xdeadbeef
+
+static unsigned long do_map(unsigned long flags)
+{
+	unsigned long user_addr, user_data = XPFO_DATA;
+
+	user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
+			    PROT_READ | PROT_WRITE | PROT_EXEC,
+			    flags, 0);
+	if (user_addr >= TASK_SIZE) {
+		pr_warn("Failed to allocate user memory\n");
+		return 0;
+	}
+
+	if (copy_to_user((void __user *)user_addr, &user_data,
+			 sizeof(user_data))) {
+		pr_warn("copy_to_user failed\n");
+		goto free_user;
+	}
+
+	return user_addr;
+
+free_user:
+	vm_munmap(user_addr, PAGE_SIZE);
+	return 0;
+}
+
+static unsigned long *user_to_kernel(unsigned long user_addr)
+{
+	phys_addr_t phys_addr;
+	void *virt_addr;
+
+	phys_addr = user_virt_to_phys(user_addr);
+	if (!phys_addr) {
+		pr_warn("Failed to get physical address of user memory\n");
+		return NULL;
+	}
+
+	virt_addr = phys_to_virt(phys_addr);
+	if (phys_addr != virt_to_phys(virt_addr)) {
+		pr_warn("Physical address of user memory seems incorrect\n");
+		return NULL;
+	}
+
+	return virt_addr;
+}
+
+static void read_map(unsigned long *virt_addr)
+{
+	pr_info("Attempting bad read from kernel address %p\n", virt_addr);
+	if (*(unsigned long *)virt_addr == XPFO_DATA)
+		pr_err("FAIL: Bad read succeeded?!\n");
+	else
+		pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
+}
+
+static void read_user_with_flags(unsigned long flags)
+{
+	unsigned long user_addr, *kernel;
+
+	user_addr = do_map(flags);
+	if (!user_addr) {
+		pr_err("FAIL: map failed\n");
+		return;
+	}
+
+	kernel = user_to_kernel(user_addr);
+	if (!kernel) {
+		pr_err("FAIL: user to kernel conversion failed\n");
+		goto free_user;
+	}
+
+	read_map(kernel);
+
+free_user:
+	vm_munmap(user_addr, PAGE_SIZE);
+}
+
+/* Read from userspace via the kernel's linear map. */
+void lkdtm_XPFO_READ_USER(void)
+{
+	read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS);
+}
+
+void lkdtm_XPFO_READ_USER_HUGE(void)
+{
+	read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB);
+}
+
+struct smp_arg {
+	unsigned long *virt_addr;
+	unsigned int cpu;
+};
+
+static int smp_reader(void *parg)
+{
+	struct smp_arg *arg = parg;
+	unsigned long *virt_addr;
+
+	if (arg->cpu != smp_processor_id()) {
+		pr_err("FAIL: scheduled on wrong CPU?\n");
+		return 0;
+	}
+
+	virt_addr = smp_cond_load_acquire(&arg->virt_addr, VAL != NULL);
+	read_map(virt_addr);
+
+	return 0;
+}
+
+#ifdef CONFIG_X86
+#define XPFO_SMP_KILLED SIGKILL
+#elif defined(CONFIG_ARM64)
+#define XPFO_SMP_KILLED SIGSEGV
+#else
+#error unsupported arch
+#endif
+
+/* The idea here is to read from the kernel's map on a different thread than
+ * did the mapping (and thus the TLB flushing), to make sure that the page
+ * faults on other cores too.
+ */
+void lkdtm_XPFO_SMP(void)
+{
+	unsigned long user_addr, *virt_addr;
+	struct task_struct *thread;
+	int ret;
+	struct smp_arg arg;
+
+	if (num_online_cpus() < 2) {
+		pr_err("not enough CPUs for a multi-CPU test\n");
+		return;
+	}
+
+	arg.virt_addr = NULL;
+	arg.cpu = (smp_processor_id() + 1) % num_online_cpus();
+	thread = kthread_create(smp_reader, &arg, "lkdtm_xpfo_test");
+	if (IS_ERR(thread)) {
+		pr_err("couldn't create kthread? %ld\n", PTR_ERR(thread));
+		return;
+	}
+
+	kthread_bind(thread, arg.cpu);
+	get_task_struct(thread);
+	wake_up_process(thread);
+
+	user_addr = do_map(MAP_PRIVATE | MAP_ANONYMOUS);
+	if (!user_addr)
+		goto kill_thread;
+
+	virt_addr = user_to_kernel(user_addr);
+	if (!virt_addr) {
+		/*
+		 * let's store something that will fail, so we can unblock the
+		 * thread
+		 */
+		smp_store_release(&arg.virt_addr, &arg);
+		goto free_user;
+	}
+
+	smp_store_release(&arg.virt_addr, virt_addr);
+
+	/* there must be a better way to do this. */
+	while (1) {
+		if (thread->exit_state)
+			break;
+		msleep_interruptible(100);
+	}
+
+free_user:
+	if (user_addr)
+		vm_munmap(user_addr, PAGE_SIZE);
+
+kill_thread:
+	ret = kthread_stop(thread);
+	if (ret != XPFO_SMP_KILLED)
+		pr_err("FAIL: thread wasn't killed: %d\n", ret);
+	put_task_struct(thread);
+}
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC PATCH v8 11/14] xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (9 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 10/14] lkdtm: Add test for XPFO Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 12/14] xpfo, mm: optimize spinlock usage in xpfo_kunmap Khalid Aziz
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, oao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, peterz,
	kernel-hardening, linux-mm, x86, linux-arm-kernel, linux-kernel,
	Vasileios P . Kemerlis, Juerg Haefliger, Tycho Andersen,
	Marco Benatto, David Woodhouse

From: Julian Stecklina <jsteckli@amazon.de>

Instead of using the page extension debug feature, encode all the
information we need for XPFO in struct page. This allows us to get
rid of some checks in the hot paths, and there are no longer any
pages that are allocated before XPFO is enabled.

Also make debugging aids configurable for maximum performance.

Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Cc: x86@kernel.org
Cc: kernel-hardening@lists.openwall.com
Cc: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Cc: Juerg Haefliger <juerg.haefliger@canonical.com>
Cc: Tycho Andersen <tycho@docker.com>
Cc: Marco Benatto <marco.antonio.780@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 include/linux/mm_types.h       |   8 ++
 include/linux/page-flags.h     |  13 +++
 include/linux/xpfo.h           |   3 +-
 include/trace/events/mmflags.h |  10 ++-
 mm/page_alloc.c                |   3 +-
 mm/page_ext.c                  |   4 -
 mm/xpfo.c                      | 159 ++++++++-------------------------
 security/Kconfig               |  12 ++-
 8 files changed, 80 insertions(+), 132 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..d17d33f36a01 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -204,6 +204,14 @@ struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
 #endif
+
+#ifdef CONFIG_XPFO
+	/* Counts the number of times this page has been kmapped. */
+	atomic_t xpfo_mapcount;
+
+	/* Serialize kmap/kunmap of this page */
+	spinlock_t xpfo_lock;
+#endif
 } _struct_page_alignment;
 
 /*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50ce1bddaf56..a532063f27b5 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -101,6 +101,10 @@ enum pageflags {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
 	PG_young,
 	PG_idle,
+#endif
+#ifdef CONFIG_XPFO
+	PG_xpfo_user,		/* Page is allocated to user-space */
+	PG_xpfo_unmapped,	/* Page is unmapped from the linear map */
 #endif
 	__NR_PAGEFLAGS,
 
@@ -398,6 +402,15 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#ifdef CONFIG_XPFO
+PAGEFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY)
+PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 117869991d5b..1dd590ff1a1f 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -28,7 +28,7 @@ struct page;
 
 #include <linux/types.h>
 
-extern struct page_ext_operations page_xpfo_ops;
+void xpfo_init_single_page(struct page *page);
 
 void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
 void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
@@ -57,6 +57,7 @@ phys_addr_t user_virt_to_phys(unsigned long addr);
 
 #else /* !CONFIG_XPFO */
 
+static inline void xpfo_init_single_page(struct page *page) { }
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
 static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
 static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d43777e..6bb000bb366f 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#ifdef CONFIG_XPFO
+#define IF_HAVE_PG_XPFO(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_XPFO(flag,string)
+#endif
+
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
 	{1UL << PG_waiters,		"waiters"	},		\
@@ -105,7 +111,9 @@ IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	)		\
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
 IF_HAVE_PG_IDLE(PG_young,		"young"		)		\
-IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
+IF_HAVE_PG_IDLE(PG_idle,		"idle"		)		\
+IF_HAVE_PG_XPFO(PG_xpfo_user,		"xpfo_user"	)		\
+IF_HAVE_PG_XPFO(PG_xpfo_unmapped,	"xpfo_unmapped"	)
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 08e277790b5f..d00382b20001 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1024,6 +1024,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	if (bad)
 		return false;
 
+	xpfo_free_pages(page, order);
 	page_cpupid_reset_last(page);
 	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	reset_page_owner(page, order);
@@ -1038,7 +1039,6 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
-	xpfo_free_pages(page, order);
 
 	return true;
 }
@@ -1191,6 +1191,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 	if (!is_highmem_idx(zone))
 		set_page_address(page, __va(pfn << PAGE_SHIFT));
 #endif
+	xpfo_init_single_page(page);
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 38e5013dcb9a..ae44f7adbe07 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -8,7 +8,6 @@
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
-#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -69,9 +68,6 @@ static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
-#ifdef CONFIG_XPFO
-	&page_xpfo_ops,
-#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 150784ae0f08..dc03c423c52f 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -17,113 +17,58 @@
 #include <linux/highmem.h>
 #include <linux/mm.h>
 #include <linux/module.h>
-#include <linux/page_ext.h>
 #include <linux/xpfo.h>
 
 #include <asm/tlbflush.h>
 
-/* XPFO page state flags */
-enum xpfo_flags {
-	XPFO_PAGE_USER,		/* Page is allocated to user-space */
-	XPFO_PAGE_UNMAPPED,	/* Page is unmapped from the linear map */
-};
-
-/* Per-page XPFO house-keeping data */
-struct xpfo {
-	unsigned long flags;	/* Page state */
-	bool inited;		/* Map counter and lock initialized */
-	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
-	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
-};
-
-DEFINE_STATIC_KEY_FALSE(xpfo_inited);
-
-static bool xpfo_disabled __initdata;
+DEFINE_STATIC_KEY_TRUE(xpfo_inited);
 
 static int __init noxpfo_param(char *str)
 {
-	xpfo_disabled = true;
+	static_branch_disable(&xpfo_inited);
 
 	return 0;
 }
 
 early_param("noxpfo", noxpfo_param);
 
-static bool __init need_xpfo(void)
-{
-	if (xpfo_disabled) {
-		pr_info("XPFO disabled\n");
-		return false;
-	}
-
-	return true;
-}
-
-static void init_xpfo(void)
-{
-	pr_info("XPFO enabled\n");
-	static_branch_enable(&xpfo_inited);
-}
-
-struct page_ext_operations page_xpfo_ops = {
-	.size = sizeof(struct xpfo),
-	.need = need_xpfo,
-	.init = init_xpfo,
-};
-
 bool __init xpfo_enabled(void)
 {
-	return !xpfo_disabled;
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+	else
+		return true;
 }
-EXPORT_SYMBOL(xpfo_enabled);
 
-static inline struct xpfo *lookup_xpfo(struct page *page)
+void __meminit xpfo_init_single_page(struct page *page)
 {
-	struct page_ext *page_ext = lookup_page_ext(page);
-
-	if (unlikely(!page_ext)) {
-		WARN(1, "xpfo: failed to get page ext");
-		return NULL;
-	}
-
-	return (void *)page_ext + page_xpfo_ops.offset;
+	spin_lock_init(&page->xpfo_lock);
 }
 
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
 {
 	int i, flush_tlb = 0;
-	struct xpfo *xpfo;
 
 	if (!static_branch_unlikely(&xpfo_inited))
 		return;
 
 	for (i = 0; i < (1 << order); i++)  {
-		xpfo = lookup_xpfo(page + i);
-		if (!xpfo)
-			continue;
-
-		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
-		     "xpfo: unmapped page being allocated\n");
-
-		/* Initialize the map lock and map counter */
-		if (unlikely(!xpfo->inited)) {
-			spin_lock_init(&xpfo->maplock);
-			atomic_set(&xpfo->mapcount, 0);
-			xpfo->inited = true;
-		}
-		WARN(atomic_read(&xpfo->mapcount),
-		     "xpfo: already mapped page being allocated\n");
-
+#ifdef CONFIG_XPFO_DEBUG
+		BUG_ON(PageXpfoUser(page + i));
+		BUG_ON(PageXpfoUnmapped(page + i));
+		BUG_ON(spin_is_locked(&(page + i)->xpfo_lock));
+		BUG_ON(atomic_read(&(page + i)->xpfo_mapcount));
+#endif
 		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
 			/*
 			 * Tag the page as a user page and flush the TLB if it
 			 * was previously allocated to the kernel.
 			 */
-			if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+			if (!TestSetPageXpfoUser(page + i))
 				flush_tlb = 1;
 		} else {
 			/* Tag the page as a non-user (kernel) page */
-			clear_bit(XPFO_PAGE_USER, &xpfo->flags);
+			ClearPageXpfoUser(page + i);
 		}
 	}
 
@@ -134,27 +79,21 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
 void xpfo_free_pages(struct page *page, int order)
 {
 	int i;
-	struct xpfo *xpfo;
 
 	if (!static_branch_unlikely(&xpfo_inited))
 		return;
 
 	for (i = 0; i < (1 << order); i++) {
-		xpfo = lookup_xpfo(page + i);
-		if (!xpfo || unlikely(!xpfo->inited)) {
-			/*
-			 * The page was allocated before page_ext was
-			 * initialized, so it is a kernel page.
-			 */
-			continue;
-		}
+#ifdef CONFIG_XPFO_DEBUG
+		BUG_ON(atomic_read(&(page + i)->xpfo_mapcount));
+#endif
 
 		/*
 		 * Map the page back into the kernel if it was previously
 		 * allocated to user space.
 		 */
-		if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
-			clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+		if (TestClearPageXpfoUser(page + i)) {
+			ClearPageXpfoUnmapped(page + i);
 			set_kpte(page_address(page + i), page + i,
 				 PAGE_KERNEL);
 		}
@@ -163,84 +102,56 @@ void xpfo_free_pages(struct page *page, int order)
 
 void xpfo_kmap(void *kaddr, struct page *page)
 {
-	struct xpfo *xpfo;
-
 	if (!static_branch_unlikely(&xpfo_inited))
 		return;
 
-	xpfo = lookup_xpfo(page);
-
-	/*
-	 * The page was allocated before page_ext was initialized (which means
-	 * it's a kernel page) or it's allocated to the kernel, so nothing to
-	 * do.
-	 */
-	if (!xpfo || unlikely(!xpfo->inited) ||
-	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+	if (!PageXpfoUser(page))
 		return;
 
-	spin_lock(&xpfo->maplock);
+	spin_lock(&page->xpfo_lock);
 
 	/*
 	 * The page was previously allocated to user space, so map it back
 	 * into the kernel. No TLB flush required.
 	 */
-	if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
-	    test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
+	    TestClearPageXpfoUnmapped(page))
 		set_kpte(kaddr, page, PAGE_KERNEL);
 
-	spin_unlock(&xpfo->maplock);
+	spin_unlock(&page->xpfo_lock);
 }
 EXPORT_SYMBOL(xpfo_kmap);
 
 void xpfo_kunmap(void *kaddr, struct page *page)
 {
-	struct xpfo *xpfo;
-
 	if (!static_branch_unlikely(&xpfo_inited))
 		return;
 
-	xpfo = lookup_xpfo(page);
-
-	/*
-	 * The page was allocated before page_ext was initialized (which means
-	 * it's a kernel page) or it's allocated to the kernel, so nothing to
-	 * do.
-	 */
-	if (!xpfo || unlikely(!xpfo->inited) ||
-	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+	if (!PageXpfoUser(page))
 		return;
 
-	spin_lock(&xpfo->maplock);
+	spin_lock(&page->xpfo_lock);
 
 	/*
 	 * The page is to be allocated back to user space, so unmap it from the
 	 * kernel, flush the TLB and tag it as a user page.
 	 */
-	if (atomic_dec_return(&xpfo->mapcount) == 0) {
-		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
-		     "xpfo: unmapping already unmapped page\n");
-		set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
+#ifdef CONFIG_XPFO_DEBUG
+		BUG_ON(PageXpfoUnmapped(page));
+#endif
+		SetPageXpfoUnmapped(page);
 		set_kpte(kaddr, page, __pgprot(0));
 		xpfo_flush_kernel_tlb(page, 0);
 	}
 
-	spin_unlock(&xpfo->maplock);
+	spin_unlock(&page->xpfo_lock);
 }
 EXPORT_SYMBOL(xpfo_kunmap);
 
 bool xpfo_page_is_unmapped(struct page *page)
 {
-	struct xpfo *xpfo;
-
-	if (!static_branch_unlikely(&xpfo_inited))
-		return false;
-
-	xpfo = lookup_xpfo(page);
-	if (unlikely(!xpfo) && !xpfo->inited)
-		return false;
-
-	return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+	return PageXpfoUnmapped(page);
 }
 EXPORT_SYMBOL(xpfo_page_is_unmapped);
 
diff --git a/security/Kconfig b/security/Kconfig
index 8d0e4e303551..c7c581bac963 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -13,7 +13,6 @@ config XPFO
 	bool "Enable eXclusive Page Frame Ownership (XPFO)"
 	default n
 	depends on ARCH_SUPPORTS_XPFO
-	select PAGE_EXTENSION
 	help
 	  This option offers protection against 'ret2dir' kernel attacks.
 	  When enabled, every time a page frame is allocated to user space, it
@@ -25,6 +24,17 @@ config XPFO
 
 	  If in doubt, say "N".
 
+config XPFO_DEBUG
+	bool "Enable debugging of XPFO"
+	default n
+	depends on XPFO
+	help
+	  Enables additional checking of XPFO data structures that helps
+	  find bugs in the XPFO implementation. This option comes with a
+	  slight performance cost.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC PATCH v8 12/14] xpfo, mm: optimize spinlock usage in xpfo_kunmap
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (10 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 11/14] xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 13/14] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only) Khalid Aziz
  2019-02-14  0:01 ` [RFC PATCH v8 14/14] xpfo, mm: Optimize XPFO TLB flushes by batching them together Khalid Aziz
  13 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, oao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, peterz,
	kernel-hardening, linux-mm, x86, linux-arm-kernel, linux-kernel,
	Khalid Aziz, Vasileios P . Kemerlis, Juerg Haefliger,
	Tycho Andersen, Marco Benatto, David Woodhouse

From: Julian Stecklina <jsteckli@amazon.de>

Only the xpfo_kunmap call that actually unmaps the page needs to
be serialized. We need to be careful to handle the case where,
after the atomic decrement of the mapcount, an xpfo_kmap has
increased the mapcount again. In this case, we can safely skip
modifying the page table.

Model-checked with up to 4 concurrent callers with Spin.

Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: x86@kernel.org
Cc: kernel-hardening@lists.openwall.com
Cc: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Cc: Juerg Haefliger <juerg.haefliger@canonical.com>
Cc: Tycho Andersen <tycho@docker.com>
Cc: Marco Benatto <marco.antonio.780@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
---
 mm/xpfo.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/xpfo.c b/mm/xpfo.c
index dc03c423c52f..5157cbebce4b 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -124,28 +124,35 @@ EXPORT_SYMBOL(xpfo_kmap);
 
 void xpfo_kunmap(void *kaddr, struct page *page)
 {
+	bool flush_tlb = false;
+
 	if (!static_branch_unlikely(&xpfo_inited))
 		return;
 
 	if (!PageXpfoUser(page))
 		return;
 
-	spin_lock(&page->xpfo_lock);
-
 	/*
 	 * The page is to be allocated back to user space, so unmap it from the
 	 * kernel, flush the TLB and tag it as a user page.
 	 */
 	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
-#ifdef CONFIG_XPFO_DEBUG
-		BUG_ON(PageXpfoUnmapped(page));
-#endif
-		SetPageXpfoUnmapped(page);
-		set_kpte(kaddr, page, __pgprot(0));
-		xpfo_flush_kernel_tlb(page, 0);
+		spin_lock(&page->xpfo_lock);
+
+		/*
+		 * In the case, where we raced with kmap after the
+		 * atomic_dec_return, we must not nuke the mapping.
+		 */
+		if (atomic_read(&page->xpfo_mapcount) == 0) {
+			SetPageXpfoUnmapped(page);
+			set_kpte(kaddr, page, __pgprot(0));
+			flush_tlb = true;
+		}
+		spin_unlock(&page->xpfo_lock);
 	}
 
-	spin_unlock(&page->xpfo_lock);
+	if (flush_tlb)
+		xpfo_flush_kernel_tlb(page, 0);
 }
 EXPORT_SYMBOL(xpfo_kunmap);
 
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC PATCH v8 13/14] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (11 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 12/14] xpfo, mm: optimize spinlock usage in xpfo_kunmap Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  2019-02-14 17:42   ` Dave Hansen
  2019-02-14  0:01 ` [RFC PATCH v8 14/14] xpfo, mm: Optimize XPFO TLB flushes by batching them together Khalid Aziz
  13 siblings, 1 reply; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Khalid Aziz, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

XPFO flushes kernel space TLB entries for pages that are now mapped
in userspace, not only on the current CPU but synchronously on all
other CPUs as well. Processes on each core allocating pages cause a
flood of IPI messages to all other cores to flush TLB entries.
Many of these messages flush the entire TLB on a core, since the
number of entries being flushed from the local core exceeds
tlb_single_page_flush_ceiling. The cost of the TLB flushes caused
by unmapping pages from the physmap goes up dramatically on
machines with high core counts.

This patch flushes the relevant TLB entries for the current
process, or the entire TLB, depending upon the number of entries
for the current CPU, and posts a pending TLB flush on all other
CPUs when a page is unmapped from kernel space and mapped into
userspace. Each core checks its pending TLB flush flag on every
context switch, flushes its TLB if the flag is set, and clears it.
This potentially aggregates multiple TLB flushes into one, which
has a very significant impact, especially on machines with large
core counts. To illustrate this, the kernel was compiled with
"make -j" on two classes of machines - a server with a high core
count and a large amount of memory, and a desktop class machine
with more modest specs. System times for "make -j" with the
vanilla 4.20 kernel, 4.20 with the XPFO patches before this patch,
and 4.20 with the XPFO patches after this patch are below:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                            950.966s
4.20+XPFO                       25073.169s      26.366x
4.20+XPFO+Deferred flush        1372.874s        1.44x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                            607.671s
4.20+XPFO                       1588.646s       2.614x
4.20+XPFO+Deferred flush        803.989s        1.32x

This patch could use more optimization. Batching more TLB entry
flushes, as was suggested for an earlier version of these patches,
can help reduce these costs further. The same approach should be
implemented for other architectures as well once it is finalized.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 38 +++++++++++++++++++++++++++++++++
 arch/x86/mm/xpfo.c              |  2 +-
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f4204bf377fc..92d23629d01d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -561,6 +561,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 03b6b4c2238d..c907b643eecb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -35,6 +35,15 @@
  */
 #define LAST_USER_MM_IBPB	0x1UL
 
+/*
+ * When a full TLB flush is needed to flush stale TLB entries
+ * for pages that have been mapped into userspace and unmapped
+ * from kernel space, this TLB flush will be delayed until the
+ * task is scheduled on that CPU. Keep track of CPUs with
+ * pending full TLB flush forced by xpfo.
+ */
+static cpumask_t pending_xpfo_flush;
+
 /*
  * We get here when we do something requiring a TLB invalidation
  * but could not go invalidate all of the contexts.  We do the
@@ -319,6 +328,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
+
+	/* If there is a pending TLB flush for this CPU due to XPFO
+	 * flush, do it now.
+	 */
+	if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
+		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+		__flush_tlb_all();
+	}
+
 	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	/*
@@ -801,6 +819,26 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	}
 }
 
+void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	struct cpumask tmp_mask;
+
+	/* Balance as user space task's flush, a bit conservative */
+	if (end == TLB_FLUSH_ALL ||
+	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
+		do_flush_tlb_all(NULL);
+	} else {
+		struct flush_tlb_info info;
+
+		info.start = start;
+		info.end = end;
+		do_kernel_range_flush(&info);
+	}
+	cpumask_setall(&tmp_mask);
+	cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
+	cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info info = {
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index e13b99019c47..d3833532bfdc 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -115,7 +115,7 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order)
 		return;
 	}
 
-	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+	xpfo_flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
 }
 
 /* Convert a user space virtual address to a physical address.
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC PATCH v8 14/14] xpfo, mm: Optimize XPFO TLB flushes by batching them together
  2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (12 preceding siblings ...)
  2019-02-14  0:01 ` [RFC PATCH v8 13/14] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only) Khalid Aziz
@ 2019-02-14  0:01 ` Khalid Aziz
  13 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14  0:01 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk
  Cc: Khalid Aziz, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

When XPFO forces a TLB flush on all cores, the performance impact
is very significant. Batching as many of these TLB updates as
possible can help lower this impact. When userspace allocates a
page, the kernel tries to get that page from the per-cpu free
list. This free list is replenished in bulk when it runs low.
Since the free list is replenished for future allocation to
userspace, that is a good opportunity to update TLB entries in
batch and reduce the impact of multiple TLB flushes later. This
patch adds new tags for the page so that a page can be marked as
available for userspace allocation and unmapped from the kernel
address space. All such pages are removed from the kernel address
space in bulk at the time they are added to the per-cpu free list.
This patch, when combined with deferred TLB flushes, improves
performance further. Using the same benchmark as before of
building the kernel in parallel, here are the system times on two
differently sized systems:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20					950.966s
4.20+XPFO				25073.169s	26.366x
4.20+XPFO+Deferred flush		1372.874s	1.44x
4.20+XPFO+Deferred flush+Batch update	1255.021s	1.32x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20					607.671s
4.20+XPFO				1588.646s	2.614x
4.20+XPFO+Deferred flush		803.989s	1.32x
4.20+XPFO+Deferred flush+Batch update	795.728s	1.31x

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
---
 arch/x86/mm/xpfo.c         |  5 +++++
 include/linux/page-flags.h |  5 ++++-
 include/linux/xpfo.h       |  8 ++++++++
 mm/page_alloc.c            |  4 ++++
 mm/xpfo.c                  | 35 +++++++++++++++++++++++++++++++++--
 5 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index d3833532bfdc..fb06bb3cb718 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -87,6 +87,11 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
 
 }
 
+void xpfo_flush_tlb_all(void)
+{
+	xpfo_flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
+}
+
 inline void xpfo_flush_kernel_tlb(struct page *page, int order)
 {
 	int level;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a532063f27b5..fdf7e14cbc96 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -406,9 +406,11 @@ PAGEFLAG(Idle, idle, PF_ANY)
 PAGEFLAG(XpfoUser, xpfo_user, PF_ANY)
 TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY)
 TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY)
+#define __PG_XPFO_USER	(1UL << PG_xpfo_user)
 PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
 TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
 TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+#define __PG_XPFO_UNMAPPED	(1UL << PG_xpfo_unmapped)
 #endif
 
 /*
@@ -787,7 +789,8 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
  * alloc-free cycle to prevent from reusing the page.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP	\
-	(((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON)
+	(((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON & ~__PG_XPFO_USER & \
+					~__PG_XPFO_UNMAPPED)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 1dd590ff1a1f..c4f6c99e7380 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -34,6 +34,7 @@ void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
 void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
 				    enum dma_data_direction dir);
 void xpfo_flush_kernel_tlb(struct page *page, int order);
+void xpfo_flush_tlb_all(void);
 
 void xpfo_kmap(void *kaddr, struct page *page);
 void xpfo_kunmap(void *kaddr, struct page *page);
@@ -55,6 +56,8 @@ bool xpfo_enabled(void);
 
 phys_addr_t user_virt_to_phys(unsigned long addr);
 
+bool xpfo_pcp_refill(struct page *page, enum migratetype migratetype,
+		     int order);
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_init_single_page(struct page *page) { }
@@ -82,6 +85,11 @@ static inline bool xpfo_enabled(void) { return false; }
 
 static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
 
+static inline bool xpfo_pcp_refill(struct page *page,
+				   enum migratetype migratetype, int order)
+{
+	return false;
+}
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d00382b20001..5702b6fa435c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2478,6 +2478,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			int migratetype)
 {
 	int i, alloced = 0;
+	bool flush_tlb = false;
 
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
@@ -2503,6 +2504,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
+		flush_tlb |= xpfo_pcp_refill(page, migratetype, order);
 	}
 
 	/*
@@ -2513,6 +2515,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 	spin_unlock(&zone->lock);
+	if (flush_tlb)
+		xpfo_flush_tlb_all();
 	return alloced;
 }
 
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 5157cbebce4b..7f78d00df002 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -47,7 +47,8 @@ void __meminit xpfo_init_single_page(struct page *page)
 
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
 {
-	int i, flush_tlb = 0;
+	int i;
+	bool flush_tlb = false;
 
 	if (!static_branch_unlikely(&xpfo_inited))
 		return;
@@ -65,7 +66,7 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
 			 * was previously allocated to the kernel.
 			 */
 			if (!TestSetPageXpfoUser(page + i))
-				flush_tlb = 1;
+				flush_tlb = true;
 		} else {
 			/* Tag the page as a non-user (kernel) page */
 			ClearPageXpfoUser(page + i);
@@ -74,6 +75,8 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
 
 	if (flush_tlb)
 		xpfo_flush_kernel_tlb(page, order);
+
+	return;
 }
 
 void xpfo_free_pages(struct page *page, int order)
@@ -190,3 +193,31 @@ void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
 			kunmap_atomic(mapping[i]);
 }
 EXPORT_SYMBOL(xpfo_temp_unmap);
+
+bool xpfo_pcp_refill(struct page *page, enum migratetype migratetype,
+		     int order)
+{
+	int i;
+	bool flush_tlb = false;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	for (i = 0; i < 1 << order; i++) {
+		if (migratetype == MIGRATE_MOVABLE) {
+			/* GFP_HIGHUSER:
+			 * Tag the page as a user page, mark it as unmapped
+			 * in kernel space and flush the TLB if it was
+			 * previously allocated to the kernel.
+			 */
+			if (!TestSetPageXpfoUnmapped(page + i))
+				flush_tlb = true;
+			SetPageXpfoUser(page + i);
+		} else {
+			/* Tag the page as a non-user (kernel) page */
+			ClearPageXpfoUser(page + i);
+		}
+	}
+
+	return flush_tlb;
+}
-- 
2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO
  2019-02-14  0:01 ` [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO Khalid Aziz
@ 2019-02-14  7:47   ` Christoph Hellwig
  2019-02-14 16:56     ` Khalid Aziz
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2019-02-14  7:47 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel, Tycho Andersen

On Wed, Feb 13, 2019 at 05:01:27PM -0700, Khalid Aziz wrote:
> +++ b/kernel/dma/swiotlb.c
> @@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>  {
>  	unsigned long pfn = PFN_DOWN(orig_addr);
>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
> +	struct page *page = pfn_to_page(pfn);
>  
> -	if (PageHighMem(pfn_to_page(pfn))) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {

I think this just wants a page_unmapped or similar helper instead of
needing the xpfo_page_is_unmapped check.  We actually have quite
a few similar construct in the arch dma mapping code for architectures
that require cache flushing.

> +bool xpfo_page_is_unmapped(struct page *page)
> +{
> +	struct xpfo *xpfo;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return false;
> +
> +	xpfo = lookup_xpfo(page);
> +	if (unlikely(!xpfo) || !xpfo->inited)
> +		return false;
> +
> +	return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);

And at least for swiotlb there is no need to export this helper,
as it is always built in.
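Christoph's suggestion -- one self-describing predicate instead of
open-coding the pair of checks -- might look like the sketch below.
`page_needs_kmap` is a hypothetical name, and the highmem/XPFO checks are
stubbed out so the snippet stands alone as userspace C.

```c
#include <stdbool.h>

struct page { bool highmem; bool xpfo_unmapped; };

/* Stand-ins for the kernel's PageHighMem() and xpfo_page_is_unmapped(). */
static bool PageHighMem(const struct page *p) { return p->highmem; }
static bool xpfo_page_is_unmapped(const struct page *p) { return p->xpfo_unmapped; }

/*
 * Hypothetical helper: true whenever the page has no permanent kernel
 * mapping and therefore needs kmap()/kmap_atomic() before access.
 */
static bool page_needs_kmap(const struct page *p)
{
	return PageHighMem(p) || xpfo_page_is_unmapped(p);
}
```

Callers like swiotlb_bounce() would then test one condition and never need
to know why the page is unmapped.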

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-02-14  0:01 ` [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO) Khalid Aziz
@ 2019-02-14 10:56   ` Peter Zijlstra
  2019-02-14 16:15     ` Borislav Petkov
  2019-02-14 17:13     ` Khalid Aziz
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2019-02-14 10:56 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	kernel-hardening, linux-mm, x86, linux-arm-kernel, linux-kernel,
	Tycho Andersen, Marco Benatto

On Wed, Feb 13, 2019 at 05:01:26PM -0700, Khalid Aziz wrote:
>  static inline void *kmap_atomic(struct page *page)
>  {
> +	void *kaddr;
> +
>  	preempt_disable();
>  	pagefault_disable();
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>  
>  static inline void __kunmap_atomic(void *addr)
>  {
> +	xpfo_kunmap(addr, virt_to_page(addr));
>  	pagefault_enable();
>  	preempt_enable();
>  }

How is that supposed to work; IIRC kmap_atomic was supposed to be
IRQ-safe.

> +/* Per-page XPFO house-keeping data */
> +struct xpfo {
> +	unsigned long flags;	/* Page state */
> +	bool inited;		/* Map counter and lock initialized */

What's sizeof(_Bool) ? Why can't you use a bit in that flags word?

> +	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
> +	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
> +};

Without that bool, the structure would be 16 bytes on 64bit, which seems
like a good number.
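Peter's sizing point can be checked with a userspace model. Here
spinlock_t and atomic_t are stood in by 4-byte ints, which matches a
non-debug x86-64 kernel configuration but is an assumption, not a
guarantee.

```c
#include <stdbool.h>

/* Userspace stand-ins: assume atomic_t and a non-debug spinlock_t are 4 bytes. */
typedef struct { int counter; } fake_atomic_t;
typedef struct { int lock; } fake_spinlock_t;

struct xpfo_with_bool {
	unsigned long flags;
	bool inited;		/* 1 byte + 3 bytes of padding */
	fake_atomic_t mapcount;
	fake_spinlock_t maplock;
};				/* 20 bytes of payload, padded to 24 on LP64 */

struct xpfo_without_bool {
	unsigned long flags;	/* a bit in here replaces 'inited' */
	fake_atomic_t mapcount;
	fake_spinlock_t maplock;
};				/* exactly 16 bytes on LP64 */
```

With millions of struct page instances in play, saving 8 bytes per page of
house-keeping data is significant.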

> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	struct xpfo *xpfo;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	xpfo = lookup_xpfo(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!xpfo || unlikely(!xpfo->inited) ||
> +	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
> +		return;
> +
> +	spin_lock(&xpfo->maplock);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB flush required.
> +	 */
> +	if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
> +	    test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
> +		set_kpte(kaddr, page, PAGE_KERNEL);
> +
> +	spin_unlock(&xpfo->maplock);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	struct xpfo *xpfo;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	xpfo = lookup_xpfo(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!xpfo || unlikely(!xpfo->inited) ||
> +	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
> +		return;
> +
> +	spin_lock(&xpfo->maplock);
> +
> +	/*
> +	 * The page is to be allocated back to user space, so unmap it from the
> +	 * kernel, flush the TLB and tag it as a user page.
> +	 */
> +	if (atomic_dec_return(&xpfo->mapcount) == 0) {
> +		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
> +		     "xpfo: unmapping already unmapped page\n");
> +		set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
> +		set_kpte(kaddr, page, __pgprot(0));
> +		xpfo_flush_kernel_tlb(page, 0);
> +	}
> +
> +	spin_unlock(&xpfo->maplock);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);

And these here things are most definitely not IRQ-safe.
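One conventional answer to Peter's objection -- an assumption about a
possible fix, not something this patch does -- is to take the per-page
lock with spin_lock_irqsave() so the critical section survives use from
kmap_atomic() in interrupt context. The userspace model below keeps only
the map/unmap refcounting protocol; locking is noted in comments.

```c
#include <stdbool.h>

struct xpfo_state {
	int mapcount;	/* balanced kmap/kunmap nesting */
	bool unmapped;	/* page currently absent from kernel space */
	int tlb_flushes;
};

/*
 * Model of xpfo_kmap(): the first mapper of an unmapped page restores
 * the kernel PTE; no TLB flush is needed when adding a mapping.  In the
 * kernel, the lock around this would have to be spin_lock_irqsave() to
 * be safe against kmap_atomic() from interrupt context.
 */
static void model_kmap(struct xpfo_state *x)
{
	if (++x->mapcount == 1 && x->unmapped)
		x->unmapped = false;	/* set_kpte(..., PAGE_KERNEL) */
}

/* Model of xpfo_kunmap(): the last unmapper zaps the PTE and flushes. */
static void model_kunmap(struct xpfo_state *x)
{
	if (--x->mapcount == 0) {
		x->unmapped = true;	/* set_kpte(..., __pgprot(0)) */
		x->tlb_flushes++;	/* xpfo_flush_kernel_tlb() */
	}
}
```

Nested maps cost nothing extra; only the final unmap pays for a flush.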

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions
  2019-02-14  0:01 ` [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions Khalid Aziz
@ 2019-02-14 15:54   ` Tycho Andersen
  2019-02-14 17:29     ` Khalid Aziz
  0 siblings, 1 reply; 32+ messages in thread
From: Tycho Andersen @ 2019-02-14 15:54 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, jsteckli, ak, torvalds, liran.alon, keescook, akpm,
	mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

Hi,

On Wed, Feb 13, 2019 at 05:01:30PM -0700, Khalid Aziz wrote:
> From: Juerg Haefliger <juerg.haefliger@canonical.com>
> 
> If the page is unmapped by XPFO, a data cache flush results in a fatal
> page fault, so let's temporarily map the region, flush the cache, and then
> unmap it.
> 
> v6: actually flush in the face of xpfo, and temporarily map the underlying
>     memory so it can be flushed correctly
> 
> CC: linux-arm-kernel@lists.infradead.org
> Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
> Signed-off-by: Tycho Andersen <tycho@docker.com>
> ---
>  arch/arm64/mm/flush.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> index 30695a868107..fad09aafd9d5 100644
> --- a/arch/arm64/mm/flush.c
> +++ b/arch/arm64/mm/flush.c
> @@ -20,6 +20,7 @@
>  #include <linux/export.h>
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
> +#include <linux/xpfo.h>
>  
>  #include <asm/cacheflush.h>
>  #include <asm/cache.h>
> @@ -28,9 +29,15 @@
>  void sync_icache_aliases(void *kaddr, unsigned long len)
>  {
>  	unsigned long addr = (unsigned long)kaddr;
> +	unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
> +	void *mapping[num_pages];

What version does this build on? Presumably -Wvla will cause an error
here, but,

>  	if (icache_is_aliasing()) {
> +		xpfo_temp_map(kaddr, len, mapping,
> +			      sizeof(mapping[0]) * num_pages);
>  		__clean_dcache_area_pou(kaddr, len);

Here, we map the pages to some random address via xpfo_temp_map(),
then pass the *original* address (which may not have been mapped) to
__clean_dcache_area_pou(). So I think this whole approach is wrong.

If we want to do it this way, it may be that we need some
xpfo_map_contiguous() type thing, but since we're just going to flush
it anyway, that seems a little crazy. Maybe someone who knows more
about arm64 knows a better way?

Tycho

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-02-14 10:56   ` Peter Zijlstra
@ 2019-02-14 16:15     ` Borislav Petkov
  2019-02-14 17:19       ` Khalid Aziz
  2019-02-14 17:13     ` Khalid Aziz
  1 sibling, 1 reply; 32+ messages in thread
From: Borislav Petkov @ 2019-02-14 16:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Khalid Aziz, juergh, tycho, jsteckli, ak, torvalds, liran.alon,
	keescook, akpm, mhocko, catalin.marinas, will.deacon, jmorris,
	konrad.wilk, Juerg Haefliger, deepa.srinivasan, chris.hyser,
	tyhicks, dwmw, andrew.cooper3, jcm, boris.ostrovsky,
	kanth.ghatraju, oao.m.martins, jmattson, pradeep.vincent,
	john.haxby, tglx, kirill.shutemov, hch, steven.sistare, labbott,
	luto, dave.hansen, kernel-hardening, linux-mm, x86,
	linux-arm-kernel, linux-kernel, Tycho Andersen, Marco Benatto

On Thu, Feb 14, 2019 at 11:56:31AM +0100, Peter Zijlstra wrote:
> > +EXPORT_SYMBOL(xpfo_kunmap);
> 
> And these here things are most definitely not IRQ-safe.

Should also be EXPORT_SYMBOL_GPL.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO
  2019-02-14  7:47   ` Christoph Hellwig
@ 2019-02-14 16:56     ` Khalid Aziz
  2019-02-14 17:44       ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14 16:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel, Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 2630 bytes --]

On 2/14/19 12:47 AM, Christoph Hellwig wrote:
> On Wed, Feb 13, 2019 at 05:01:27PM -0700, Khalid Aziz wrote:
>> +++ b/kernel/dma/swiotlb.c
>> @@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>>  {
>>  	unsigned long pfn = PFN_DOWN(orig_addr);
>>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
>> +	struct page *page = pfn_to_page(pfn);
>>  
>> -	if (PageHighMem(pfn_to_page(pfn))) {
>> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> 
> I think this just wants a page_unmapped or similar helper instead of
> needing the xpfo_page_is_unmapped check.  We actually have quite
> a few similar construct in the arch dma mapping code for architectures
> that require cache flushing.

As I am not the original author of this patch, I am interpreting the
original intent. I think xpfo_page_is_unmapped() was added to account
for kernel build without CONFIG_XPFO. xpfo_page_is_unmapped() has an
alternate definition to return false if CONFIG_XPFO is not defined.
xpfo_is_unmapped() is cleaned up further in patch 11 ("xpfo, mm: remove
dependency on CONFIG_PAGE_EXTENSION") to a one-liner "return
PageXpfoUnmapped(page);". xpfo_is_unmapped() can be eliminated entirely
by adding an else clause to the following code added by that patch:

--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -398,6 +402,15 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif

+#ifdef CONFIG_XPFO
+PAGEFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY)
+PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;


Adding the following #else to above conditional:

#else
TESTPAGEFLAG_FALSE(XpfoUser)
TESTPAGEFLAG_FALSE(XpfoUnmapped)

should allow us to eliminate xpfo_is_unmapped(). Right?
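The #else stubs proposed here would expand to inline functions that are
compile-time false, letting callers test the flag directly with no
wrapper. A userspace rendering of that pattern (the macro name mirrors
the kernel's TESTPAGEFLAG_FALSE(), but the expansion is simplified):

```c
#include <stdbool.h>

struct page;	/* opaque here; flag storage is irrelevant for the stubs */

/* Simplified rendering of the kernel's TESTPAGEFLAG_FALSE() generator. */
#define TESTPAGEFLAG_FALSE(uname)				\
	static bool Page##uname(const struct page *page)	\
	{ (void)page; return false; }

TESTPAGEFLAG_FALSE(XpfoUser)
TESTPAGEFLAG_FALSE(XpfoUnmapped)

/* With the stubs, an open-coded check compiles away when !CONFIG_XPFO. */
static bool page_is_xpfo_unmapped(const struct page *page)
{
	return PageXpfoUnmapped(page);
}
```

The compiler sees a constant false, so the XPFO branch in callers like
swiotlb_bounce() disappears entirely in non-XPFO builds.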

Thanks,
Khalid

> 
>> +bool xpfo_page_is_unmapped(struct page *page)
>> +{
>> +	struct xpfo *xpfo;
>> +
>> +	if (!static_branch_unlikely(&xpfo_inited))
>> +		return false;
>> +
>> +	xpfo = lookup_xpfo(page);
>> +	if (unlikely(!xpfo) || !xpfo->inited)
>> +		return false;
>> +
>> +	return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
>> +}
>> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> 
> And at least for swiotlb there is no need to export this helper,
> as it is always built in.
> 


[-- Attachment #2: pEpkey.asc --]
[-- Type: application/pgp-keys, Size: 2501 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-02-14 10:56   ` Peter Zijlstra
  2019-02-14 16:15     ` Borislav Petkov
@ 2019-02-14 17:13     ` Khalid Aziz
  2019-02-14 19:08       ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14 17:13 UTC (permalink / raw)
  To: Peter Zijlstra, juergh, jsteckli
  Cc: tycho, ak, torvalds, liran.alon, keescook, akpm, mhocko,
	catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	kernel-hardening, linux-mm, x86, linux-arm-kernel, linux-kernel,
	Tycho Andersen, Marco Benatto

[-- Attachment #1: Type: text/plain, Size: 3803 bytes --]

On 2/14/19 3:56 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 05:01:26PM -0700, Khalid Aziz wrote:
>>  static inline void *kmap_atomic(struct page *page)
>>  {
>> +	void *kaddr;
>> +
>>  	preempt_disable();
>>  	pagefault_disable();
>> +	kaddr = page_address(page);
>> +	xpfo_kmap(kaddr, page);
>> +	return kaddr;
>>  }
>>  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>>  
>>  static inline void __kunmap_atomic(void *addr)
>>  {
>> +	xpfo_kunmap(addr, virt_to_page(addr));
>>  	pagefault_enable();
>>  	preempt_enable();
>>  }
> 
> How is that supposed to work; IIRC kmap_atomic was supposed to be
> IRQ-safe.
> 

Ah, the spin_lock in in xpfo_kmap() can be problematic in interrupt
context. I will see if I can fix that.

Juerg, you wrote the original code and understand what you were trying
to do here. If you have ideas on how to tackle this, I would very much
appreciate it.

>> +/* Per-page XPFO house-keeping data */
>> +struct xpfo {
>> +	unsigned long flags;	/* Page state */
>> +	bool inited;		/* Map counter and lock initialized */
> 
> What's sizeof(_Bool) ? Why can't you use a bit in that flags word?
> 
>> +	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
>> +	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
>> +};
> 
> Without that bool, the structure would be 16 bytes on 64bit, which seems
> like a good number.
> 

Patch 11 ("xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION") cleans
all this up. If the original authors of these two patches, Juerg
Haefliger and Julian Stecklina, are ok with it, I would like to combine
the two patches in one.

>> +void xpfo_kmap(void *kaddr, struct page *page)
>> +{
>> +	struct xpfo *xpfo;
>> +
>> +	if (!static_branch_unlikely(&xpfo_inited))
>> +		return;
>> +
>> +	xpfo = lookup_xpfo(page);
>> +
>> +	/*
>> +	 * The page was allocated before page_ext was initialized (which means
>> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
>> +	 * do.
>> +	 */
>> +	if (!xpfo || unlikely(!xpfo->inited) ||
>> +	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
>> +		return;
>> +
>> +	spin_lock(&xpfo->maplock);
>> +
>> +	/*
>> +	 * The page was previously allocated to user space, so map it back
>> +	 * into the kernel. No TLB flush required.
>> +	 */
>> +	if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
>> +	    test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
>> +		set_kpte(kaddr, page, PAGE_KERNEL);
>> +
>> +	spin_unlock(&xpfo->maplock);
>> +}
>> +EXPORT_SYMBOL(xpfo_kmap);
>> +
>> +void xpfo_kunmap(void *kaddr, struct page *page)
>> +{
>> +	struct xpfo *xpfo;
>> +
>> +	if (!static_branch_unlikely(&xpfo_inited))
>> +		return;
>> +
>> +	xpfo = lookup_xpfo(page);
>> +
>> +	/*
>> +	 * The page was allocated before page_ext was initialized (which means
>> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
>> +	 * do.
>> +	 */
>> +	if (!xpfo || unlikely(!xpfo->inited) ||
>> +	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
>> +		return;
>> +
>> +	spin_lock(&xpfo->maplock);
>> +
>> +	/*
>> +	 * The page is to be allocated back to user space, so unmap it from the
>> +	 * kernel, flush the TLB and tag it as a user page.
>> +	 */
>> +	if (atomic_dec_return(&xpfo->mapcount) == 0) {
>> +		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
>> +		     "xpfo: unmapping already unmapped page\n");
>> +		set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
>> +		set_kpte(kaddr, page, __pgprot(0));
>> +		xpfo_flush_kernel_tlb(page, 0);
>> +	}
>> +
>> +	spin_unlock(&xpfo->maplock);
>> +}
>> +EXPORT_SYMBOL(xpfo_kunmap);
> 
> And these here things are most definitely not IRQ-safe.
> 

Got it. I will work on this.

Thanks,
Khalid


[-- Attachment #2: pEpkey.asc --]
[-- Type: application/pgp-keys, Size: 2501 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-02-14 16:15     ` Borislav Petkov
@ 2019-02-14 17:19       ` Khalid Aziz
  0 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14 17:19 UTC (permalink / raw)
  To: Borislav Petkov, juergh
  Cc: Peter Zijlstra, tycho, jsteckli, ak, torvalds, liran.alon,
	keescook, akpm, mhocko, catalin.marinas, will.deacon, jmorris,
	konrad.wilk, Juerg Haefliger, deepa.srinivasan, chris.hyser,
	tyhicks, dwmw, andrew.cooper3, jcm, boris.ostrovsky,
	kanth.ghatraju, joao.m.martins, jmattson, pradeep.vincent,
	john.haxby, tglx, kirill.shutemov, hch, steven.sistare, labbott,
	luto, dave.hansen, kernel-hardening, linux-mm, x86,
	linux-arm-kernel, linux-kernel, Marco Benatto

[-- Attachment #1: Type: text/plain, Size: 454 bytes --]

On 2/14/19 9:15 AM, Borislav Petkov wrote:
> On Thu, Feb 14, 2019 at 11:56:31AM +0100, Peter Zijlstra wrote:
>>> +EXPORT_SYMBOL(xpfo_kunmap);
>>
>> And these here things are most definitely not IRQ-safe.
> 
> Should also be EXPORT_SYMBOL_GPL.
> 

Agreed. On the other hand, is there even a need to export this? It
should only be called from kunmap() or kunmap_atomic() and not from any
module directly. Same for xpfo_kmap.

Thanks,
Khalid

[-- Attachment #2: pEpkey.asc --]
[-- Type: application/pgp-keys, Size: 2501 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions
  2019-02-14 15:54   ` Tycho Andersen
@ 2019-02-14 17:29     ` Khalid Aziz
  2019-02-14 23:49       ` Tycho Andersen
  0 siblings, 1 reply; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14 17:29 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: juergh, jsteckli, ak, torvalds, liran.alon, keescook, akpm,
	mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2446 bytes --]

On 2/14/19 8:54 AM, Tycho Andersen wrote:
> Hi,
> 
> On Wed, Feb 13, 2019 at 05:01:30PM -0700, Khalid Aziz wrote:
>> From: Juerg Haefliger <juerg.haefliger@canonical.com>
>>
>> If the page is unmapped by XPFO, a data cache flush results in a fatal
>> page fault, so let's temporarily map the region, flush the cache, and then
>> unmap it.
>>
>> v6: actually flush in the face of xpfo, and temporarily map the underlying
>>     memory so it can be flushed correctly
>>
>> CC: linux-arm-kernel@lists.infradead.org
>> Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
>> Signed-off-by: Tycho Andersen <tycho@docker.com>
>> ---
>>  arch/arm64/mm/flush.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>> index 30695a868107..fad09aafd9d5 100644
>> --- a/arch/arm64/mm/flush.c
>> +++ b/arch/arm64/mm/flush.c
>> @@ -20,6 +20,7 @@
>>  #include <linux/export.h>
>>  #include <linux/mm.h>
>>  #include <linux/pagemap.h>
>> +#include <linux/xpfo.h>
>>  
>>  #include <asm/cacheflush.h>
>>  #include <asm/cache.h>
>> @@ -28,9 +29,15 @@
>>  void sync_icache_aliases(void *kaddr, unsigned long len)
>>  {
>>  	unsigned long addr = (unsigned long)kaddr;
>> +	unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
>> +	void *mapping[num_pages];
> 
> What version does this build on? Presumably -Wvla will cause an error
> here, but,
> 
>>  	if (icache_is_aliasing()) {
>> +		xpfo_temp_map(kaddr, len, mapping,
>> +			      sizeof(mapping[0]) * num_pages);
>>  		__clean_dcache_area_pou(kaddr, len);
> 
> Here, we map the pages to some random address via xpfo_temp_map(),
> then pass the *original* address (which may not have been mapped) to
> __clean_dcache_area_pou(). So I think this whole approach is wrong.
> 
> If we want to do it this way, it may be that we need some
> xpfo_map_contiguous() type thing, but since we're just going to flush
> it anyway, that seems a little crazy. Maybe someone who knows more
> about arm64 knows a better way?
> 
> Tycho
> 

Hi Tycho,

You are right. Things don't quite look right with this patch. I don't
know arm64 well enough either, so I will wait for someone more
knowledgeable to make a recommendation here.

On a side note, do you mind if I update your address in your
signed-off-by from tycho@docker.com when I send the next version of this
series?

Thanks,
Khalid

[-- Attachment #2: pEpkey.asc --]
[-- Type: application/pgp-keys, Size: 2501 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v8 13/14] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-02-14  0:01 ` [RFC PATCH v8 13/14] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only) Khalid Aziz
@ 2019-02-14 17:42   ` Dave Hansen
  2019-02-14 19:57     ` Khalid Aziz
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2019-02-14 17:42 UTC (permalink / raw)
  To: Khalid Aziz, juergh, tycho, jsteckli, ak, torvalds, liran.alon,
	keescook, akpm, mhocko, catalin.marinas, will.deacon, jmorris,
	konrad.wilk
  Cc: deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, peterz, kernel-hardening,
	linux-mm, x86, linux-arm-kernel, linux-kernel

>  #endif
> +
> +	/* If there is a pending TLB flush for this CPU due to XPFO
> +	 * flush, do it now.
> +	 */

Don't forget CodingStyle in all this, please.

> +	if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
> +		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> +		__flush_tlb_all();
> +	}

This seems to exist in parallel with all of the cpu_tlbstate
infrastructure.  Shouldn't it go in there?

Also, if we're doing full flushes like this, it seems a bit wasteful to
then go and do later things like invalidate_user_asid() when we *know*
that the asid would have been flushed by this operation.  I'm pretty
sure this isn't the only __flush_tlb_all() callsite that does this, so
it's not really criticism of this patch specifically.  It's more of a
structural issue.


> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
> +{

This is a bit lightly commented.  Please give this some good
descriptions about the logic behind the implementation and the tradeoffs
that are in play.

This is doing a local flush, but deferring the flushes on all other
processors, right?  Can you explain the logic behind that in a comment
here, please?  This also has to be called with preemption disabled, right?

> +	struct cpumask tmp_mask;
> +
> +	/* Balance as user space task's flush, a bit conservative */
> +	if (end == TLB_FLUSH_ALL ||
> +	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
> +		do_flush_tlb_all(NULL);
> +	} else {
> +		struct flush_tlb_info info;
> +
> +		info.start = start;
> +		info.end = end;
> +		do_kernel_range_flush(&info);
> +	}
> +	cpumask_setall(&tmp_mask);
> +	cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
> +	cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
> +}

Fun.  cpumask_setall() is non-atomic while cpumask_clear_cpu() and
cpumask_or() *are* atomic.  The cpumask_clear_cpu() is operating on
thread-local storage and doesn't need to be atomic.  Please make it
__cpumask_clear_cpu().
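The deferral under review -- flush locally now, mark every other CPU in a
pending mask, and let each CPU clear its bit and flush when it next enters
the TLB-flush path -- can be modeled with a plain bitmask. NCPUS and the
per-CPU flush counters are invented for the illustration; the real code
uses cpumask operations and must mind their atomicity.

```c
#include <stdint.h>

#define NCPUS 8

static uint32_t pending_xpfo_flush;	/* one bit per CPU */
static int flushes[NCPUS];		/* full flushes each CPU performed */

/* Model of xpfo_flush_tlb_kernel_range(): flush here, defer everywhere else. */
static void model_xpfo_flush(int this_cpu)
{
	flushes[this_cpu]++;		/* local flush happens immediately */
	pending_xpfo_flush |= ((1u << NCPUS) - 1) & ~(1u << this_cpu);
}

/* Model of the check on each CPU's TLB-flush entry path. */
static void model_cpu_enter(int cpu)
{
	if (pending_xpfo_flush & (1u << cpu)) {
		pending_xpfo_flush &= ~(1u << cpu);
		flushes[cpu]++;		/* __flush_tlb_all() */
	}
}
```

The win is that back-to-back XPFO flush requests coalesce: a remote CPU
pays for one full flush no matter how many requests arrived while its
pending bit was set, instead of taking one IPI per request.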

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO
  2019-02-14 16:56     ` Khalid Aziz
@ 2019-02-14 17:44       ` Christoph Hellwig
  2019-02-14 19:48         ` Khalid Aziz
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2019-02-14 17:44 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Christoph Hellwig, juergh, tycho, jsteckli, ak, torvalds,
	liran.alon, keescook, akpm, mhocko, catalin.marinas, will.deacon,
	jmorris, konrad.wilk, Juerg Haefliger, deepa.srinivasan,
	chris.hyser, tyhicks, dwmw, andrew.cooper3, jcm, boris.ostrovsky,
	kanth.ghatraju, joao.m.martins, jmattson, pradeep.vincent,
	john.haxby, tglx, kirill.shutemov, steven.sistare, labbott, luto,
	dave.hansen, peterz, kernel-hardening, linux-mm, x86,
	linux-arm-kernel, linux-kernel, Tycho Andersen

On Thu, Feb 14, 2019 at 09:56:24AM -0700, Khalid Aziz wrote:
> On 2/14/19 12:47 AM, Christoph Hellwig wrote:
> > On Wed, Feb 13, 2019 at 05:01:27PM -0700, Khalid Aziz wrote:
> >> +++ b/kernel/dma/swiotlb.c
> >> @@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
> >>  {
> >>  	unsigned long pfn = PFN_DOWN(orig_addr);
> >>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
> >> +	struct page *page = pfn_to_page(pfn);
> >>  
> >> -	if (PageHighMem(pfn_to_page(pfn))) {
> >> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> > 
> > I think this just wants a page_unmapped or similar helper instead of
> > needing the xpfo_page_is_unmapped check.  We actually have quite
> > a few similar constructs in the arch dma mapping code for architectures
> > that require cache flushing.
> 
> As I am not the original author of this patch, I am interpreting the
> original intent. I think xpfo_page_is_unmapped() was added to account
> for kernel builds without CONFIG_XPFO: it has an alternate definition
> that returns false if CONFIG_XPFO is not defined. xpfo_page_is_unmapped()
> is cleaned up further in patch 11 ("xpfo, mm: remove dependency on
> CONFIG_PAGE_EXTENSION") to a one-liner "return PageXpfoUnmapped(page);".
> xpfo_page_is_unmapped() can be eliminated entirely
> by adding an else clause to the following code added by that patch:

The point I'm making is that just about every PageHighMem() check
before code that does a kmap* later needs to account for XPFO as well.

So instead of open-coding the above, be that using xpfo_page_is_unmapped
or PageXpfoUnmapped, we really need one self-describing helper that
checks whether a page is unmapped for any reason and needs a kmap to
access it.

* Re: [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-02-14 17:13     ` Khalid Aziz
@ 2019-02-14 19:08       ` Peter Zijlstra
  2019-02-14 19:58         ` Khalid Aziz
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2019-02-14 19:08 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, jsteckli, tycho, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	oao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	kernel-hardening, linux-mm, x86, linux-arm-kernel, linux-kernel,
	Tycho Andersen, Marco Benatto

On Thu, Feb 14, 2019 at 10:13:54AM -0700, Khalid Aziz wrote:

> Patch 11 ("xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION") cleans
> all this up. If the original authors of these two patches, Juerg
> Haefliger and Julian Stecklina, are ok with it, I would like to combine
> the two patches in one.

Don't preserve broken patches because of different authorship or
whatever.

If you care you can say things like:

 Based-on-code-from:
 Co-developed-by:
 Originally-from:

or whatever other things there are. But individual patches should be
correct and complete.

* Re: [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO
  2019-02-14 17:44       ` Christoph Hellwig
@ 2019-02-14 19:48         ` Khalid Aziz
  0 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14 19:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

On 2/14/19 10:44 AM, Christoph Hellwig wrote:
> On Thu, Feb 14, 2019 at 09:56:24AM -0700, Khalid Aziz wrote:
>> On 2/14/19 12:47 AM, Christoph Hellwig wrote:
>>> On Wed, Feb 13, 2019 at 05:01:27PM -0700, Khalid Aziz wrote:
>>>> +++ b/kernel/dma/swiotlb.c
>>>> @@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>>>>  {
>>>>  	unsigned long pfn = PFN_DOWN(orig_addr);
>>>>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
>>>> +	struct page *page = pfn_to_page(pfn);
>>>>  
>>>> -	if (PageHighMem(pfn_to_page(pfn))) {
>>>> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>>>
>>> I think this just wants a page_unmapped or similar helper instead of
>>> needing the xpfo_page_is_unmapped check.  We actually have quite
>>> a few similar constructs in the arch dma mapping code for architectures
>>> that require cache flushing.
>>
>> As I am not the original author of this patch, I am interpreting the
>> original intent. I think xpfo_page_is_unmapped() was added to account
>> for kernel builds without CONFIG_XPFO: it has an alternate definition
>> that returns false if CONFIG_XPFO is not defined. xpfo_page_is_unmapped()
>> is cleaned up further in patch 11 ("xpfo, mm: remove dependency on
>> CONFIG_PAGE_EXTENSION") to a one-liner "return PageXpfoUnmapped(page);".
>> xpfo_page_is_unmapped() can be eliminated entirely
>> by adding an else clause to the following code added by that patch:
> 
> The point I'm making is that just about every PageHighMem() check
> before code that does a kmap* later needs to account for XPFO as well.
> 
> So instead of open-coding the above, be that using xpfo_page_is_unmapped
> or PageXpfoUnmapped, we really need one self-describing helper that
> checks whether a page is unmapped for any reason and needs a kmap to
> access it.
> 

Understood. XpfoUnmapped is the state for a page when it is a free
page. When this page is allocated to userspace and userspace passes it
back to the kernel in a syscall, the kernel will always go through kmap
to map it temporarily anyway. When the page is freed back to the kernel,
its mapping in the physmap is restored. If the free page is allocated to
the kernel, its physmap entry is preserved. So I am inclined to say a
page being XpfoUnmapped should not affect the need (or lack of need) for
kmap elsewhere. Does that make sense?

Thanks,
Khalid

* Re: [RFC PATCH v8 13/14] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-02-14 17:42   ` Dave Hansen
@ 2019-02-14 19:57     ` Khalid Aziz
  0 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14 19:57 UTC (permalink / raw)
  To: Dave Hansen, juergh, tycho, jsteckli, ak, torvalds, liran.alon,
	keescook, akpm, mhocko, catalin.marinas, will.deacon, jmorris,
	konrad.wilk
  Cc: deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, peterz, kernel-hardening,
	linux-mm, x86, linux-arm-kernel, linux-kernel

On 2/14/19 10:42 AM, Dave Hansen wrote:
>>  #endif
>> +
>> +	/* If there is a pending TLB flush for this CPU due to XPFO
>> +	 * flush, do it now.
>> +	 */
> 
> Don't forget CodingStyle in all this, please.

Of course. I will fix that.

> 
>> +	if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
>> +		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
>> +		__flush_tlb_all();
>> +	}
> 
> This seems to exist in parallel with all of the cpu_tlbstate
> infrastructure.  Shouldn't it go in there?

That sounds like a good idea. On the other hand, the pending flush needs
to be tracked entirely within arch/x86/mm/tlb.c, and a variable with
scope limited to just that file feels like a lighter-weight
implementation. I could go either way.

> 
> Also, if we're doing full flushes like this, it seems a bit wasteful to
> then go and do later things like invalidate_user_asid() when we *know*
> that the asid would have been flushed by this operation.  I'm pretty
> sure this isn't the only __flush_tlb_all() callsite that does this, so
> it's not really criticism of this patch specifically.  It's more of a
> structural issue.
> 
> 

That is a good point. It is not just wasteful; it is bound to have a
performance impact, even if slight.

>> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
>> +{
> 
> This is a bit lightly commented.  Please give this some good
> descriptions about the logic behind the implementation and the tradeoffs
> that are in play.
> 
> This is doing a local flush, but deferring the flushes on all other
> processors, right?  Can you explain the logic behind that in a comment
> here, please?  This also has to be called with preemption disabled, right?
> 
>> +	struct cpumask tmp_mask;
>> +
>> +	/* Balance as user space task's flush, a bit conservative */
>> +	if (end == TLB_FLUSH_ALL ||
>> +	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
>> +		do_flush_tlb_all(NULL);
>> +	} else {
>> +		struct flush_tlb_info info;
>> +
>> +		info.start = start;
>> +		info.end = end;
>> +		do_kernel_range_flush(&info);
>> +	}
>> +	cpumask_setall(&tmp_mask);
>> +	cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
>> +	cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
>> +}
> 
> Fun.  cpumask_setall() is non-atomic while cpumask_clear_cpu() and
> cpumask_or() *are* atomic.  The cpumask_clear_cpu() is operating on
> thread-local storage and doesn't need to be atomic.  Please make it
> __cpumask_clear_cpu().
> 

I will fix that. Thanks!

--
Khalid

* Re: [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-02-14 19:08       ` Peter Zijlstra
@ 2019-02-14 19:58         ` Khalid Aziz
  0 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-14 19:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: juergh, jsteckli, tycho, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	kernel-hardening, linux-mm, x86, linux-arm-kernel, linux-kernel,
	Marco Benatto

On 2/14/19 12:08 PM, Peter Zijlstra wrote:
> On Thu, Feb 14, 2019 at 10:13:54AM -0700, Khalid Aziz wrote:
> 
>> Patch 11 ("xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION") cleans
>> all this up. If the original authors of these two patches, Juerg
>> Haefliger and Julian Stecklina, are ok with it, I would like to combine
>> the two patches in one.
> 
> Don't preserve broken patches because of different authorship or
> whatever.
> 
> If you care you can say things like:
> 
>  Based-on-code-from:
>  Co-developed-by:
>  Originally-from:
> 
> or whatever other things there are. But individual patches should be
> correct and complete.
> 

That sounds reasonable. I will merge these two patches in the next version.

Thanks,
Khalid

* Re: [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions
  2019-02-14 17:29     ` Khalid Aziz
@ 2019-02-14 23:49       ` Tycho Andersen
  0 siblings, 0 replies; 32+ messages in thread
From: Tycho Andersen @ 2019-02-14 23:49 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, jsteckli, ak, torvalds, liran.alon, keescook, akpm,
	mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

On Thu, Feb 14, 2019 at 10:29:52AM -0700, Khalid Aziz wrote:
> On a side note, do you mind if I update your address in your
> signed-off-by from tycho@docker.com when I send the next version of this
> series?

Sure, that would be great, thanks. This e-mail is a good one to use.

Cheers,

Tycho

* Re: [RFC PATCH v8 08/14] arm64/mm: disable section/contiguous mappings if XPFO is enabled
  2019-02-14  0:01 ` [RFC PATCH v8 08/14] arm64/mm: disable section/contiguous mappings if XPFO is enabled Khalid Aziz
@ 2019-02-15 13:09   ` Mark Rutland
  2019-02-15 14:47     ` Khalid Aziz
  0 siblings, 1 reply; 32+ messages in thread
From: Mark Rutland @ 2019-02-15 13:09 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Tycho Andersen, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

Hi,

On Wed, Feb 13, 2019 at 05:01:31PM -0700, Khalid Aziz wrote:
> From: Tycho Andersen <tycho@docker.com>
> 
> XPFO doesn't support section/contiguous mappings yet, so let's disable it
> if XPFO is turned on.
> 
> Thanks to Laura Abbott for the simplification from v5, and Mark Rutland for
> pointing out we need NO_CONT_MAPPINGS too.
> 
> CC: linux-arm-kernel@lists.infradead.org
> Signed-off-by: Tycho Andersen <tycho@docker.com>
> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>

There should be no point in this series where it's possible to enable a
broken XPFO. Either this patch should be merged into the rest of the
arm64 bits, or it should be placed before the rest of the arm64 bits.

That's a pre-requisite for merging, and it significantly reduces the
burden on reviewers.

In general, a patch series should bisect cleanly. Could you please
restructure the series to that effect?

Thanks,
Mark.

> ---
>  arch/arm64/mm/mmu.c  | 2 +-
>  include/linux/xpfo.h | 4 ++++
>  mm/xpfo.c            | 6 ++++++
>  3 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index d1d6601b385d..f4dd27073006 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -451,7 +451,7 @@ static void __init map_mem(pgd_t *pgdp)
>  	struct memblock_region *reg;
>  	int flags = 0;
>  
> -	if (debug_pagealloc_enabled())
> +	if (debug_pagealloc_enabled() || xpfo_enabled())
>  		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>  
>  	/*
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> index 1ae05756344d..8b029918a958 100644
> --- a/include/linux/xpfo.h
> +++ b/include/linux/xpfo.h
> @@ -47,6 +47,8 @@ void xpfo_temp_map(const void *addr, size_t size, void **mapping,
>  void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
>  		     size_t mapping_len);
>  
> +bool xpfo_enabled(void);
> +
>  #else /* !CONFIG_XPFO */
>  
>  static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> @@ -69,6 +71,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t size,
>  }
>  
>  
> +static inline bool xpfo_enabled(void) { return false; }
> +
>  #endif /* CONFIG_XPFO */
>  
>  #endif /* _LINUX_XPFO_H */
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> index 92ca6d1baf06..150784ae0f08 100644
> --- a/mm/xpfo.c
> +++ b/mm/xpfo.c
> @@ -71,6 +71,12 @@ struct page_ext_operations page_xpfo_ops = {
>  	.init = init_xpfo,
>  };
>  
> +bool __init xpfo_enabled(void)
> +{
> +	return !xpfo_disabled;
> +}
> +EXPORT_SYMBOL(xpfo_enabled);
> +
>  static inline struct xpfo *lookup_xpfo(struct page *page)
>  {
>  	struct page_ext *page_ext = lookup_page_ext(page);
> -- 
> 2.17.1
> 

* Re: [RFC PATCH v8 08/14] arm64/mm: disable section/contiguous mappings if XPFO is enabled
  2019-02-15 13:09   ` Mark Rutland
@ 2019-02-15 14:47     ` Khalid Aziz
  0 siblings, 0 replies; 32+ messages in thread
From: Khalid Aziz @ 2019-02-15 14:47 UTC (permalink / raw)
  To: Mark Rutland
  Cc: juergh, tycho, jsteckli, ak, torvalds, liran.alon, keescook,
	akpm, mhocko, catalin.marinas, will.deacon, jmorris, konrad.wilk,
	Tycho Andersen, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, kernel-hardening, linux-mm, x86, linux-arm-kernel,
	linux-kernel

On 2/15/19 6:09 AM, Mark Rutland wrote:
> Hi,
> 
> On Wed, Feb 13, 2019 at 05:01:31PM -0700, Khalid Aziz wrote:
>> From: Tycho Andersen <tycho@docker.com>
>>
>> XPFO doesn't support section/contiguous mappings yet, so let's disable it
>> if XPFO is turned on.
>>
>> Thanks to Laura Abbott for the simplification from v5, and Mark Rutland for
>> pointing out we need NO_CONT_MAPPINGS too.
>>
>> CC: linux-arm-kernel@lists.infradead.org
>> Signed-off-by: Tycho Andersen <tycho@docker.com>
>> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
> 
> There should be no point in this series where it's possible to enable a
> broken XPFO. Either this patch should be merged into the rest of the
> arm64 bits, or it should be placed before the rest of the arm64 bits.
> 
> That's a pre-requisite for merging, and it significantly reduces the
> burden on reviewers.
> 
> In general, a patch series should bisect cleanly. Could you please
> restructure the series to that effect?
> 
> Thanks,
> Mark.

That sounds reasonable to me. I will merge this with patch 5 ("arm64/mm:
Add support for XPFO") for the next version unless there are objections.

Thanks,
Khalid

> 
>> ---
>>  arch/arm64/mm/mmu.c  | 2 +-
>>  include/linux/xpfo.h | 4 ++++
>>  mm/xpfo.c            | 6 ++++++
>>  3 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index d1d6601b385d..f4dd27073006 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -451,7 +451,7 @@ static void __init map_mem(pgd_t *pgdp)
>>  	struct memblock_region *reg;
>>  	int flags = 0;
>>  
>> -	if (debug_pagealloc_enabled())
>> +	if (debug_pagealloc_enabled() || xpfo_enabled())
>>  		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>  
>>  	/*
>> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
>> index 1ae05756344d..8b029918a958 100644
>> --- a/include/linux/xpfo.h
>> +++ b/include/linux/xpfo.h
>> @@ -47,6 +47,8 @@ void xpfo_temp_map(const void *addr, size_t size, void **mapping,
>>  void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
>>  		     size_t mapping_len);
>>  
>> +bool xpfo_enabled(void);
>> +
>>  #else /* !CONFIG_XPFO */
>>  
>>  static inline void xpfo_kmap(void *kaddr, struct page *page) { }
>> @@ -69,6 +71,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t size,
>>  }
>>  
>>  
>> +static inline bool xpfo_enabled(void) { return false; }
>> +
>>  #endif /* CONFIG_XPFO */
>>  
>>  #endif /* _LINUX_XPFO_H */
>> diff --git a/mm/xpfo.c b/mm/xpfo.c
>> index 92ca6d1baf06..150784ae0f08 100644
>> --- a/mm/xpfo.c
>> +++ b/mm/xpfo.c
>> @@ -71,6 +71,12 @@ struct page_ext_operations page_xpfo_ops = {
>>  	.init = init_xpfo,
>>  };
>>  
>> +bool __init xpfo_enabled(void)
>> +{
>> +	return !xpfo_disabled;
>> +}
>> +EXPORT_SYMBOL(xpfo_enabled);
>> +
>>  static inline struct xpfo *lookup_xpfo(struct page *page)
>>  {
>>  	struct page_ext *page_ext = lookup_page_ext(page);
>> -- 
>> 2.17.1
>>

end of thread, other threads:[~2019-02-15 14:47 UTC | newest]

Thread overview: 32+ messages
2019-02-14  0:01 [RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 01/14] mm: add MAP_HUGETLB support to vm_mmap Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 02/14] x86: always set IF before oopsing from page fault Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO) Khalid Aziz
2019-02-14 10:56   ` Peter Zijlstra
2019-02-14 16:15     ` Borislav Petkov
2019-02-14 17:19       ` Khalid Aziz
2019-02-14 17:13     ` Khalid Aziz
2019-02-14 19:08       ` Peter Zijlstra
2019-02-14 19:58         ` Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO Khalid Aziz
2019-02-14  7:47   ` Christoph Hellwig
2019-02-14 16:56     ` Khalid Aziz
2019-02-14 17:44       ` Christoph Hellwig
2019-02-14 19:48         ` Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 05/14] arm64/mm: Add support for XPFO Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 06/14] xpfo: add primitives for mapping underlying memory Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions Khalid Aziz
2019-02-14 15:54   ` Tycho Andersen
2019-02-14 17:29     ` Khalid Aziz
2019-02-14 23:49       ` Tycho Andersen
2019-02-14  0:01 ` [RFC PATCH v8 08/14] arm64/mm: disable section/contiguous mappings if XPFO is enabled Khalid Aziz
2019-02-15 13:09   ` Mark Rutland
2019-02-15 14:47     ` Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 09/14] mm: add a user_virt_to_phys symbol Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 10/14] lkdtm: Add test for XPFO Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 11/14] xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 12/14] xpfo, mm: optimize spinlock usage in xpfo_kunmap Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 13/14] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only) Khalid Aziz
2019-02-14 17:42   ` Dave Hansen
2019-02-14 19:57     ` Khalid Aziz
2019-02-14  0:01 ` [RFC PATCH v8 14/14] xpfo, mm: Optimize XPFO TLB flushes by batching them together Khalid Aziz
