* [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership
@ 2019-04-03 17:34 Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 01/13] mm: add MAP_HUGETLB support to vm_mmap Khalid Aziz
                   ` (14 more replies)
  0 siblings, 15 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, akpm, konrad.wilk
  Cc: Khalid Aziz, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, aaron.lu, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, khalid, iommu,
	x86, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module

This is another update to the work Juerg, Tycho and Julian have
done on XPFO. After the last round of updates, we were seeing very
significant performance penalties when stale TLB entries were
flushed actively after an XPFO TLB update. The benchmark used to
measure performance is a kernel build using parallel make. To get
full protection from ret2dir attacks, we must flush stale TLB
entries. The performance penalty from flushing stale TLB entries
goes up as the number of cores goes up. On a desktop class machine
with only 4 cores, enabling TLB flush for stale entries causes
system time for "make -j4" to go up by a factor of 2.61x, but on a
larger machine with 96 cores, system time with "make -j60" goes up
by a factor of 26.37x! I have been working on reducing this
performance penalty.

I implemented two solutions to reduce the performance penalty, and
they have had a large impact. The XPFO code flushes the TLB every
time a page is allocated to userspace. It does so by sending IPIs to
all processors to flush their TLBs. Back-to-back allocations of
pages to userspace on multiple processors result in a storm of IPIs.
Each of these incoming IPIs is handled by a processor by flushing
its TLB. To reduce this IPI storm, I have added a per-CPU flag that
can be set to tell a processor to flush its TLB. A processor checks
this flag on every context switch. If the flag is set, it flushes
its TLB and clears the flag. This allows multiple TLB flush requests
to a single CPU to be combined into a single request. Unlike the
previous version of this patch, a kernel TLB entry for a page that
has been allocated to userspace is now flushed on all processors. A
processor could hold a stale kernel TLB entry that was removed on
another processor until its next context switch. A local userspace
page allocation by the currently running process could force the TLB
flush earlier for such entries.
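
As a rough illustration of the deferred-flush idea (a sketch only;
the names below are made up for this example and are not the actual
patch code):

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <asm/tlbflush.h>

static DEFINE_PER_CPU(bool, xpfo_flush_pending);

/* Instead of IPI'ing every CPU immediately, just mark them all. */
static void xpfo_request_deferred_flush(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		per_cpu(xpfo_flush_pending, cpu) = true;
}

/* Called from the context switch path on each CPU. */
static void xpfo_do_deferred_flush(void)
{
	if (this_cpu_read(xpfo_flush_pending)) {
		this_cpu_write(xpfo_flush_pending, false);
		__flush_tlb_all();
	}
}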

The other solution reduces the number of TLB flushes required by
performing the TLB flush for multiple pages at one time, when pages
are refilled on the per-cpu freelist. If the pages being added to
the per-cpu freelist are marked for userspace allocation, the TLB
entries for these pages can be flushed up front and the pages tagged
as currently unmapped. When any such page is allocated to userspace,
there is no longer a need to perform a TLB flush at that time. This
batching of TLB flushes reduces the performance impact further.
Similarly, when these user pages are freed by userspace and added
back to the per-cpu free list, they are left unmapped and tagged
accordingly. This further optimization reduced the performance
impact from 1.32x to 1.28x for the 96-core server and from 1.31x to
1.27x for the 4-core desktop.
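
Roughly, the batching can be pictured like this (a simplified sketch
that assumes the pages are linked on their ->lru as on the per-cpu
free lists; this is not the actual patch code):

#include <linux/mm.h>
#include <linux/xpfo.h>
#include <asm/tlbflush.h>

/* Unmap a whole batch of pages destined for userspace, then flush once. */
static void xpfo_unmap_user_batch(struct list_head *pages)
{
	struct page *page;
	unsigned long start = ULONG_MAX, end = 0;

	list_for_each_entry(page, pages, lru) {
		unsigned long kaddr = (unsigned long)page_address(page);

		set_kpte((void *)kaddr, page, __pgprot(0));
		SetPageXpfoUnmapped(page);
		start = min(start, kaddr);
		end = max(end, kaddr + PAGE_SIZE);
	}

	if (end)	/* a single TLB flush covers the whole batch */
		flush_tlb_kernel_range(start, end);
}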

I measured system time for parallel make with the unmodified kernel,
with the XPFO patches before these changes, and again after applying
each of these patches. Here are the final results:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

5.0					913.862s
5.0+this patch series			1165.259s	1.28x


Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

5.0					610.642s
5.0+this patch series			773.075s	1.27x
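
(The overhead factors are simply the ratios of system times:
1165.259 / 913.862 ~ 1.28 and 773.075 / 610.642 ~ 1.27.)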

Performance with this patch set is good enough to use it as a
starting point for further refinement before we merge it into the
mainline kernel, hence the RFC.

I have restructured the patches in this version to separate out the
architecture independent code. I folded much of Julian's code
improvement to not use page extensions into patch 3.

What remains to be done beyond this patch series:

1. Performance improvements. Ideas to explore: (1) kernel mappings
   private to an mm, (2) any others?
2. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
   from Juerg. I dropped it for now since swiotlb code for ARM has
   changed a lot since this patch was written. I could use help
   from ARM experts on this.
3. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
   CPUs" to other architectures besides x86.
4. Change kmap to not map the page back into physmap; instead map it
   to a new va similar to what kmap_high does (a rough sketch of the
   idea follows this list). Mapping the page back into physmap
   re-opens the ret2dir hole for the duration of the kmap. All of the
   kmap_high and related code can be reused for this, but that will
   require restructuring that code so it can be built for 64-bit as
   well. Any objections to that?
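
A very rough sketch of that kmap idea, using vmap() purely for
illustration (a real implementation would reuse the kmap_high()/pkmap
machinery rather than vmap(), and the helper names here are made up):

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Illustration only: map the page at a temporary kernel virtual
 * address so its physmap alias can stay unmapped for the duration
 * of the kmap.
 */
static void *xpfo_kmap_alias(struct page *page)
{
	return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
}

static void xpfo_kunmap_alias(void *vaddr)
{
	vunmap(vaddr);
}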

---------------------------------------------------------

Juerg Haefliger (6):
  mm: Add support for eXclusive Page Frame Ownership (XPFO)
  xpfo, x86: Add support for XPFO for x86-64
  lkdtm: Add test for XPFO
  arm64/mm: Add support for XPFO
  swiotlb: Map the buffer if it was unmapped by XPFO
  arm64/mm, xpfo: temporarily map dcache regions

Julian Stecklina (1):
  xpfo, mm: optimize spinlock usage in xpfo_kunmap

Khalid Aziz (2):
  xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  xpfo, mm: Optimize XPFO TLB flushes by batching them together

Tycho Andersen (4):
  mm: add MAP_HUGETLB support to vm_mmap
  x86: always set IF before oopsing from page fault
  mm: add a user_virt_to_phys symbol
  xpfo: add primitives for mapping underlying memory

 .../admin-guide/kernel-parameters.txt         |   6 +
 arch/arm64/Kconfig                            |   1 +
 arch/arm64/mm/Makefile                        |   2 +
 arch/arm64/mm/flush.c                         |   7 +
 arch/arm64/mm/mmu.c                           |   2 +-
 arch/arm64/mm/xpfo.c                          |  66 ++++++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/pgtable.h                |  26 +++
 arch/x86/include/asm/tlbflush.h               |   1 +
 arch/x86/mm/Makefile                          |   2 +
 arch/x86/mm/fault.c                           |   6 +
 arch/x86/mm/pageattr.c                        |  32 +--
 arch/x86/mm/tlb.c                             |  39 ++++
 arch/x86/mm/xpfo.c                            | 185 +++++++++++++++++
 drivers/misc/lkdtm/Makefile                   |   1 +
 drivers/misc/lkdtm/core.c                     |   3 +
 drivers/misc/lkdtm/lkdtm.h                    |   5 +
 drivers/misc/lkdtm/xpfo.c                     | 196 ++++++++++++++++++
 include/linux/highmem.h                       |  34 +--
 include/linux/mm.h                            |   2 +
 include/linux/mm_types.h                      |   8 +
 include/linux/page-flags.h                    |  23 +-
 include/linux/xpfo.h                          | 191 +++++++++++++++++
 include/trace/events/mmflags.h                |  10 +-
 kernel/dma/swiotlb.c                          |   3 +-
 mm/Makefile                                   |   1 +
 mm/compaction.c                               |   2 +-
 mm/internal.h                                 |   2 +-
 mm/mmap.c                                     |  19 +-
 mm/page_alloc.c                               |  19 +-
 mm/page_isolation.c                           |   2 +-
 mm/util.c                                     |  32 +++
 mm/xpfo.c                                     | 170 +++++++++++++++
 security/Kconfig                              |  27 +++
 34 files changed, 1047 insertions(+), 79 deletions(-)
 create mode 100644 arch/arm64/mm/xpfo.c
 create mode 100644 arch/x86/mm/xpfo.c
 create mode 100644 drivers/misc/lkdtm/xpfo.c
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

-- 
2.17.1



* [RFC PATCH v9 01/13] mm: add MAP_HUGETLB support to vm_mmap
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault Khalid Aziz
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, peterz, aaron.lu,
	akpm, alexander.h.duyck, amir73il, andreyknvl, aneesh.kumar,
	anthony.yznaga, ard.biesheuvel, arnd, arunks, ben, bigeasy, bp,
	brgl, catalin.marinas, corbet, cpandya, daniel.vetter,
	dan.j.williams, gregkh, guro, hannes, hpa, iamjoonsoo.kim,
	james.morse, jannh, jgross, jkosina, jmorris, joe, jrdr.linux,
	jroedel, keith.busch, khalid.aziz, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

From: Tycho Andersen <tycho@tycho.ws>

vm_mmap is exported, which means kernel modules can use it. In particular,
for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so
let's support it via vm_mmap.
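
For example (an illustrative sketch, not part of this patch), a test
module could then create an anonymous huge-page-backed mapping in the
current process with:

#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/sizes.h>

static unsigned long map_one_huge_page(void)
{
	/* Anonymous, private, hugetlb-backed mapping of one 2 MB page. */
	return vm_mmap(NULL, 0, SZ_2M, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, 0);
}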

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Tested-by: Marco Benatto <marco.antonio.780@gmail.com>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 include/linux/mm.h |  2 ++
 mm/mmap.c          | 19 +------------------
 mm/util.c          | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..3e4f6525d06b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2412,6 +2412,8 @@ struct vm_unmapped_area_info {
 extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
 extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags);
+
 /*
  * Search for an unmapped address range.
  *
diff --git a/mm/mmap.c b/mm/mmap.c
index fc1809b1bed6..65382d942598 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1582,24 +1582,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 		if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
 			goto out_fput;
 	} else if (flags & MAP_HUGETLB) {
-		struct user_struct *user = NULL;
-		struct hstate *hs;
-
-		hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
-		if (!hs)
-			return -EINVAL;
-
-		len = ALIGN(len, huge_page_size(hs));
-		/*
-		 * VM_NORESERVE is used because the reservations will be
-		 * taken when vm_ops->mmap() is called
-		 * A dummy user value is used because we are not locking
-		 * memory so no accounting is necessary
-		 */
-		file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
-				VM_NORESERVE,
-				&user, HUGETLB_ANONHUGE_INODE,
-				(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+		file = map_hugetlb_setup(&len, flags);
 		if (IS_ERR(file))
 			return PTR_ERR(file);
 	}
diff --git a/mm/util.c b/mm/util.c
index 379319b1bcfd..86b763861828 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,6 +357,29 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	return ret;
 }
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags)
+{
+	struct user_struct *user = NULL;
+	struct hstate *hs;
+
+	hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+	if (!hs)
+		return ERR_PTR(-EINVAL);
+
+	*len = ALIGN(*len, huge_page_size(hs));
+
+	/*
+	 * VM_NORESERVE is used because the reservations will be
+	 * taken when vm_ops->mmap() is called
+	 * A dummy user value is used because we are not locking
+	 * memory so no accounting is necessary
+	 */
+	return hugetlb_file_setup(HUGETLB_ANON_FILE, *len,
+			VM_NORESERVE,
+			&user, HUGETLB_ANONHUGE_INODE,
+			(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+}
+
 unsigned long vm_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot,
 	unsigned long flag, unsigned long offset)
@@ -366,6 +389,15 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
 	if (unlikely(offset_in_page(offset)))
 		return -EINVAL;
 
+	if (flag & MAP_HUGETLB) {
+		if (file)
+			return -EINVAL;
+
+		file = map_hugetlb_setup(&len, flag);
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+	}
+
 	return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
 }
 EXPORT_SYMBOL(vm_mmap);
-- 
2.17.1



* [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 01/13] mm: add MAP_HUGETLB support to vm_mmap Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-04  0:12   ` Andy Lutomirski
  2019-04-03 17:34 ` [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO) Khalid Aziz
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, peterz, aaron.lu,
	akpm, alexander.h.duyck, amir73il, andreyknvl, aneesh.kumar,
	anthony.yznaga, ard.biesheuvel, arnd, arunks, ben, bigeasy, bp,
	brgl, catalin.marinas, corbet, cpandya, daniel.vetter,
	dan.j.williams, gregkh, guro, hannes, hpa, iamjoonsoo.kim,
	james.morse, jannh, jgross, jkosina, jmorris, joe, jrdr.linux,
	jroedel, keith.busch, khalid.aziz, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

From: Tycho Andersen <tycho@tycho.ws>

Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
that might sleep:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: lkdtm_xpfo_test Tainted: G      D         4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

To be safe, let's just always enable irqs.

The particular case I'm hitting is:

Aug 23 19:30:27 xpfo kernel: [   38.278615]  __bad_area_nosemaphore+0x1a9/0x1d0
Aug 23 19:30:27 xpfo kernel: [   38.278617]  bad_area_nosemaphore+0xf/0x20
Aug 23 19:30:27 xpfo kernel: [   38.278618]  __do_page_fault+0xd1/0x540
Aug 23 19:30:27 xpfo kernel: [   38.278620]  ? irq_work_queue+0x9b/0xb0
Aug 23 19:30:27 xpfo kernel: [   38.278623]  ? wake_up_klogd+0x36/0x40
Aug 23 19:30:27 xpfo kernel: [   38.278624]  trace_do_page_fault+0x3c/0xf0
Aug 23 19:30:27 xpfo kernel: [   38.278625]  do_async_page_fault+0x14/0x60
Aug 23 19:30:27 xpfo kernel: [   38.278627]  async_page_fault+0x28/0x30

i.e. when a fault in kernel space has been triggered by XPFO.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: x86@kernel.org
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 arch/x86/mm/fault.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9d5c75f02295..7891add0913f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	/* Executive summary in case the body of the oops scrolled away */
 	printk(KERN_DEFAULT "CR2: %016lx\n", address);
 
+	/*
+	 * We're about to oops, which might kill the task. Make sure we're
+	 * allowed to sleep.
+	 */
+	flags |= X86_EFLAGS_IF;
+
 	oops_end(flags, regs, sig);
 }
 
-- 
2.17.1



* [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 01/13] mm: add MAP_HUGETLB support to vm_mmap Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-04  7:21   ` Peter Zijlstra
                     ` (2 more replies)
  2019-04-03 17:34 ` [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64 Khalid Aziz
                   ` (11 subsequent siblings)
  14 siblings, 3 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khalid.aziz, khlebnikov,
	logang, marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

From: Juerg Haefliger <juerg.haefliger@canonical.com>

This patch adds the basic support infrastructure for XPFO, which
protects against 'ret2dir' kernel attacks. The basic idea is to
enforce exclusive ownership of page frames by either the kernel or
userspace, unless explicitly requested by the kernel. Whenever a page
destined for userspace is allocated, it is unmapped from physmap (the
kernel's direct mapping of physical memory). When such a page is
reclaimed from userspace, it is mapped back into physmap. Individual
architectures can enable full XPFO support using this infrastructure
by supplying architecture-specific pieces.

Additional fields in the page struct are used for XPFO housekeeping,
specifically:
  - two flags to distinguish user vs. kernel pages and to tag unmapped
    pages.
  - a reference counter to balance kmap/kunmap operations.
  - a lock to serialize access to the XPFO fields.

This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
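
To illustrate how the pieces added below fit together (a simplified
sketch, not part of the patch itself):

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/string.h>

static void xpfo_lifecycle_example(void)
{
	/* GFP_HIGHUSER tags the page as a user page and unmaps it
	 * from physmap. */
	struct page *page = alloc_page(GFP_HIGHUSER);
	void *kaddr;

	if (!page)
		return;

	kaddr = kmap_atomic(page);	/* xpfo_kmap() restores the physmap entry */
	memset(kaddr, 0, PAGE_SIZE);	/* the kernel may touch the page now */
	kunmap_atomic(kaddr);		/* xpfo_kunmap() unmaps it again, flushes TLB */

	__free_page(page);		/* xpfo_free_pages() maps it back for the kernel */
}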

CC: x86@kernel.org
Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Marco Benatto <marco.antonio.780@gmail.com>
[jsteckli@amazon.de: encode all XPFO info in struct page]
Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 v9: * Do not use page extensions. Encode all xpfo information in struct
      page (Julian Stecklina).
    * Split architecture specific code into its own separate patch
    * Move kmap*() to include/linux/xpfo.h for cleaner code as suggested
      for an earlier version of this patch
    * Use irq versions of spin_lock to address possible deadlock around
      xpfo_lock caused by interrupts.
    * Incorporated various feedback provided on v6 patch way back.

v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
      the tlb entry on all CPUs when unmapping it in kunmap
    * handle lookup_page_ext()/lookup_xpfo() returning NULL
    * drop lots of BUG()s in favor of WARN()
    * don't disable irqs in xpfo_kmap/xpfo_kunmap, export
      __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
      pass it

 .../admin-guide/kernel-parameters.txt         |   6 +
 include/linux/highmem.h                       |  31 +---
 include/linux/mm_types.h                      |   8 +
 include/linux/page-flags.h                    |  23 ++-
 include/linux/xpfo.h                          | 147 ++++++++++++++++++
 include/trace/events/mmflags.h                |  10 +-
 mm/Makefile                                   |   1 +
 mm/compaction.c                               |   2 +-
 mm/internal.h                                 |   2 +-
 mm/page_alloc.c                               |  10 +-
 mm/page_isolation.c                           |   2 +-
 mm/xpfo.c                                     | 106 +++++++++++++
 security/Kconfig                              |  27 ++++
 13 files changed, 337 insertions(+), 38 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 858b6c0b9a15..9b36da94760e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2997,6 +2997,12 @@
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
+	noxpfo		[XPFO] Disable eXclusive Page Frame Ownership (XPFO)
+			when CONFIG_XPFO is on. Physical pages mapped into
+			user applications will also be mapped in the
+			kernel's address space as if CONFIG_XPFO was not
+			enabled.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ea5cdbd8c2c3..59a1a5fa598d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -8,6 +8,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -77,36 +78,6 @@ static inline struct page *kmap_to_page(void *addr)
 
 static inline unsigned long totalhigh_pages(void) { return 0UL; }
 
-#ifndef ARCH_HAS_KMAP
-static inline void *kmap(struct page *page)
-{
-	might_sleep();
-	return page_address(page);
-}
-
-static inline void kunmap(struct page *page)
-{
-}
-
-static inline void *kmap_atomic(struct page *page)
-{
-	preempt_disable();
-	pagefault_disable();
-	return page_address(page);
-}
-#define kmap_atomic_prot(page, prot)	kmap_atomic(page)
-
-static inline void __kunmap_atomic(void *addr)
-{
-	pagefault_enable();
-	preempt_enable();
-}
-
-#define kmap_atomic_pfn(pfn)	kmap_atomic(pfn_to_page(pfn))
-
-#define kmap_flush_unused()	do {} while(0)
-#endif
-
 #endif /* CONFIG_HIGHMEM */
 
 #if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..d17d33f36a01 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -204,6 +204,14 @@ struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
 #endif
+
+#ifdef CONFIG_XPFO
+	/* Counts the number of times this page has been kmapped. */
+	atomic_t xpfo_mapcount;
+
+	/* Serialize kmap/kunmap of this page */
+	spinlock_t xpfo_lock;
+#endif
 } _struct_page_alignment;
 
 /*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 39b4494e29f1..3622e8c33522 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -101,6 +101,10 @@ enum pageflags {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
 	PG_young,
 	PG_idle,
+#endif
+#ifdef CONFIG_XPFO
+	PG_xpfo_user,		/* Page is allocated to user-space */
+	PG_xpfo_unmapped,	/* Page is unmapped from the linear map */
 #endif
 	__NR_PAGEFLAGS,
 
@@ -398,6 +402,22 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#ifdef CONFIG_XPFO
+PAGEFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY)
+#define __PG_XPFO_USER	(1UL << PG_xpfo_user)
+PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+#define __PG_XPFO_UNMAPPED	(1UL << PG_xpfo_unmapped)
+#else
+#define __PG_XPFO_USER		0
+PAGEFLAG_FALSE(XpfoUser)
+#define __PG_XPFO_UNMAPPED	0
+PAGEFLAG_FALSE(XpfoUnmapped)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
@@ -780,7 +800,8 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
  * alloc-free cycle to prevent from reusing the page.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP	\
-	(((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON)
+	(((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON & ~__PG_XPFO_USER & \
+	 ~__PG_XPFO_UNMAPPED)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..93a1b5aceca3
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Tycho Andersen <tycho@docker.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#include <linux/types.h>
+#include <linux/dma-direction.h>
+#include <linux/uaccess.h>
+
+struct page;
+
+#ifdef CONFIG_XPFO
+
+DECLARE_STATIC_KEY_TRUE(xpfo_inited);
+
+/* Architecture specific implementations */
+void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
+void xpfo_flush_kernel_tlb(struct page *page, int order);
+
+void xpfo_init_single_page(struct page *page);
+
+static inline void xpfo_kmap(void *kaddr, struct page *page)
+{
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	if (!PageXpfoUser(page))
+		return;
+
+	/*
+	 * The page was previously allocated to user space, so
+	 * map it back into the kernel if needed. No TLB flush required.
+	 */
+	spin_lock_irqsave(&page->xpfo_lock, flags);
+
+	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
+		TestClearPageXpfoUnmapped(page))
+		set_kpte(kaddr, page, PAGE_KERNEL);
+
+	spin_unlock_irqrestore(&page->xpfo_lock, flags);
+}
+
+static inline void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	if (!PageXpfoUser(page))
+		return;
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from
+	 * the kernel, flush the TLB and tag it as a user page.
+	 */
+	spin_lock_irqsave(&page->xpfo_lock, flags);
+
+	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
+#ifdef CONFIG_XPFO_DEBUG
+		WARN_ON(PageXpfoUnmapped(page));
+#endif
+		SetPageXpfoUnmapped(page);
+		set_kpte(kaddr, page, __pgprot(0));
+		xpfo_flush_kernel_tlb(page, 0);
+	}
+
+	spin_unlock_irqrestore(&page->xpfo_lock, flags);
+}
+
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map);
+void xpfo_free_pages(struct page *page, int order);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_init_single_page(struct page *page) { }
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp,
+				    bool will_map) { }
+static inline void xpfo_free_pages(struct page *page, int order) { }
+
+static inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) { }
+static inline void xpfo_flush_kernel_tlb(struct page *page, int order) { }
+
+#endif /* CONFIG_XPFO */
+
+#if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP))
+static inline void *kmap(struct page *page)
+{
+	void *kaddr;
+
+	might_sleep();
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
+}
+
+static inline void kunmap(struct page *page)
+{
+	xpfo_kunmap(page_address(page), page);
+}
+
+static inline void *kmap_atomic(struct page *page)
+{
+	void *kaddr;
+
+	preempt_disable();
+	pagefault_disable();
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
+}
+
+#define kmap_atomic_prot(page, prot)	kmap_atomic(page)
+
+static inline void __kunmap_atomic(void *addr)
+{
+	xpfo_kunmap(addr, virt_to_page(addr));
+	pagefault_enable();
+	preempt_enable();
+}
+
+#define kmap_atomic_pfn(pfn)	kmap_atomic(pfn_to_page(pfn))
+
+#define kmap_flush_unused()	do {} while (0)
+
+#endif
+
+#endif /* _LINUX_XPFO_H */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d43777e..6bb000bb366f 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#ifdef CONFIG_XPFO
+#define IF_HAVE_PG_XPFO(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_XPFO(flag,string)
+#endif
+
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
 	{1UL << PG_waiters,		"waiters"	},		\
@@ -105,7 +111,9 @@ IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	)		\
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
 IF_HAVE_PG_IDLE(PG_young,		"young"		)		\
-IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
+IF_HAVE_PG_IDLE(PG_idle,		"idle"		)		\
+IF_HAVE_PG_XPFO(PG_xpfo_user,		"xpfo_user"	)		\
+IF_HAVE_PG_XPFO(PG_xpfo_unmapped,	"xpfo_unmapped" ) 		\
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..e99e1e6ae5ae 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,3 +99,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/compaction.c b/mm/compaction.c
index ef29490b0f46..fdd5d9783adb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -78,7 +78,7 @@ static void map_pages(struct list_head *list)
 		order = page_private(page);
 		nr_pages = 1 << order;
 
-		post_alloc_hook(page, order, __GFP_MOVABLE);
+		post_alloc_hook(page, order, __GFP_MOVABLE, false);
 		if (order)
 			split_page(page, order);
 
diff --git a/mm/internal.h b/mm/internal.h
index f4a7bb02decf..e076e51376df 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -165,7 +165,7 @@ extern void memblock_free_pages(struct page *page, unsigned long pfn,
 					unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned int order);
 extern void post_alloc_hook(struct page *page, unsigned int order,
-					gfp_t gfp_flags);
+					gfp_t gfp_flags, bool will_map);
 extern int user_min_free_kbytes;
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0b9f577b1a2a..2e0dda1322a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1062,6 +1062,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	if (bad)
 		return false;
 
+	xpfo_free_pages(page, order);
 	page_cpupid_reset_last(page);
 	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	reset_page_owner(page, order);
@@ -1229,6 +1230,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 	if (!is_highmem_idx(zone))
 		set_page_address(page, __va(pfn << PAGE_SHIFT));
 #endif
+	xpfo_init_single_page(page);
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -1938,7 +1940,7 @@ static bool check_new_pages(struct page *page, unsigned int order)
 }
 
 inline void post_alloc_hook(struct page *page, unsigned int order,
-				gfp_t gfp_flags)
+				gfp_t gfp_flags, bool will_map)
 {
 	set_page_private(page, 0);
 	set_page_refcounted(page);
@@ -1947,6 +1949,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_pages(page, order, gfp_flags, will_map);
 	set_page_owner(page, order, gfp_flags);
 }
 
@@ -1954,10 +1957,11 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
 							unsigned int alloc_flags)
 {
 	int i;
+	bool needs_zero = !free_pages_prezeroed() && (gfp_flags & __GFP_ZERO);
 
-	post_alloc_hook(page, order, gfp_flags);
+	post_alloc_hook(page, order, gfp_flags, needs_zero);
 
-	if (!free_pages_prezeroed() && (gfp_flags & __GFP_ZERO))
+	if (needs_zero)
 		for (i = 0; i < (1 << order); i++)
 			clear_highpage(page + i);
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index ce323e56b34d..d8730dd134a9 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -137,7 +137,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 	if (isolated_page) {
-		post_alloc_hook(page, order, __GFP_MOVABLE);
+		post_alloc_hook(page, order, __GFP_MOVABLE, false);
 		__free_pages(page, order);
 	}
 }
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..b74fee0479e7
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Tycho Andersen <tycho@docker.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+DEFINE_STATIC_KEY_TRUE(xpfo_inited);
+EXPORT_SYMBOL_GPL(xpfo_inited);
+
+static int __init noxpfo_param(char *str)
+{
+	static_branch_disable(&xpfo_inited);
+
+	return 0;
+}
+
+early_param("noxpfo", noxpfo_param);
+
+bool __init xpfo_enabled(void)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+	else
+		return true;
+}
+
+void __meminit xpfo_init_single_page(struct page *page)
+{
+	spin_lock_init(&page->xpfo_lock);
+}
+
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map)
+{
+	int i;
+	bool flush_tlb = false;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+#ifdef CONFIG_XPFO_DEBUG
+		WARN_ON(PageXpfoUser(page + i));
+		WARN_ON(PageXpfoUnmapped(page + i));
+		lockdep_assert_held(&(page + i)->xpfo_lock);
+		WARN_ON(atomic_read(&(page + i)->xpfo_mapcount));
+#endif
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Tag the page as a user page and flush the TLB if it
+			 * was previously allocated to the kernel.
+			 */
+			if ((!TestSetPageXpfoUser(page + i)) || !will_map) {
+				SetPageXpfoUnmapped(page + i);
+				flush_tlb = true;
+			}
+		} else {
+			/* Tag the page as a non-user (kernel) page */
+			ClearPageXpfoUser(page + i);
+		}
+	}
+
+	if (flush_tlb) {
+		set_kpte(page_address(page), page, __pgprot(0));
+		xpfo_flush_kernel_tlb(page, order);
+	}
+}
+
+void xpfo_free_pages(struct page *page, int order)
+{
+	int i;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+#ifdef CONFIG_XPFO_DEBUG
+		WARN_ON(atomic_read(&(page + i)->xpfo_mapcount));
+#endif
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (TestClearPageXpfoUser(page + i)) {
+			ClearPageXpfoUnmapped(page + i);
+			set_kpte(page_address(page + i), page + i,
+				 PAGE_KERNEL);
+		}
+	}
+}
diff --git a/security/Kconfig b/security/Kconfig
index e4fe2f3c2c65..3636ba7e2615 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,33 @@ menu "Security options"
 
 source "security/keys/Kconfig"
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	depends on ARCH_SUPPORTS_XPFO
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
+config XPFO_DEBUG
+	bool "Enable debugging of XPFO"
+	depends on XPFO
+	help
+	  Enables additional checking of XPFO data structures that helps find
+	  bugs in the XPFO implementation. This option comes with a slight
+	  performance cost.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.17.1



* [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (2 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO) Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-04  7:52   ` Peter Zijlstra
  2019-04-03 17:34 ` [RFC PATCH v9 05/13] mm: add a user_virt_to_phys symbol Khalid Aziz
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khalid.aziz, khlebnikov,
	logang, marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

From: Juerg Haefliger <juerg.haefliger@canonical.com>

This patch adds support for XPFO for x86-64. It uses the generic
infrastructure in place for XPFO and adds the architecture specific
bits to enable XPFO on x86-64.

CC: x86@kernel.org
Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Marco Benatto <marco.antonio.780@gmail.com>
Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 .../admin-guide/kernel-parameters.txt         |  10 +-
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/pgtable.h                |  26 ++++
 arch/x86/mm/Makefile                          |   2 +
 arch/x86/mm/pageattr.c                        |  23 +---
 arch/x86/mm/xpfo.c                            | 123 ++++++++++++++++++
 include/linux/xpfo.h                          |   2 +
 7 files changed, 162 insertions(+), 25 deletions(-)
 create mode 100644 arch/x86/mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9b36da94760e..e65e3bc1efe0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2997,11 +2997,11 @@
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
-	noxpfo		[XPFO] Disable eXclusive Page Frame Ownership (XPFO)
-			when CONFIG_XPFO is on. Physical pages mapped into
-			user applications will also be mapped in the
-			kernel's address space as if CONFIG_XPFO was not
-			enabled.
+	noxpfo		[XPFO,X86-64] Disable eXclusive Page Frame
+			Ownership (XPFO) when CONFIG_XPFO is on. Physical
+			pages mapped into user applications will also be
+			mapped in the kernel's address space as if
+			CONFIG_XPFO was not enabled.
 
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68261430fe6e..122786604252 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -209,6 +209,7 @@ config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2779ace16d23..5c0e1581fa56 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+	unsigned long	*vaddr;
+	pgd_t		*pgd;
+	pgprot_t	mask_set;
+	pgprot_t	mask_clr;
+	unsigned long	numpages;
+	unsigned long	curpage;
+	unsigned long	pfn;
+	unsigned int	flags;
+	unsigned int	force_split		: 1,
+			force_static_prot	: 1;
+	struct page	**pages;
+};
+
+
+int
+should_split_large_page(pte_t *kpte, unsigned long address,
+			struct cpa_data *cpa);
+extern spinlock_t cpa_lock;
+int
+__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
+		   struct page *base);
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..93b0fdaf4a99 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+
+obj-$(CONFIG_XPFO)		+= xpfo.o
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 14e6119838a6..530b5df0617e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -28,23 +28,6 @@
 
 #include "mm_internal.h"
 
-/*
- * The current flushing context - we pass it instead of 5 arguments:
- */
-struct cpa_data {
-	unsigned long	*vaddr;
-	pgd_t		*pgd;
-	pgprot_t	mask_set;
-	pgprot_t	mask_clr;
-	unsigned long	numpages;
-	unsigned long	curpage;
-	unsigned long	pfn;
-	unsigned int	flags;
-	unsigned int	force_split		: 1,
-			force_static_prot	: 1;
-	struct page	**pages;
-};
-
 enum cpa_warn {
 	CPA_CONFLICT,
 	CPA_PROTECT,
@@ -59,7 +42,7 @@ static const int cpa_warn_level = CPA_PROTECT;
  * entries change the page attribute in parallel to some other cpu
  * splitting a large page entry along with changing the attribute.
  */
-static DEFINE_SPINLOCK(cpa_lock);
+DEFINE_SPINLOCK(cpa_lock);
 
 #define CPA_FLUSHTLB 1
 #define CPA_ARRAY 2
@@ -876,7 +859,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
 	return 0;
 }
 
-static int should_split_large_page(pte_t *kpte, unsigned long address,
+int should_split_large_page(pte_t *kpte, unsigned long address,
 				   struct cpa_data *cpa)
 {
 	int do_split;
@@ -926,7 +909,7 @@ static void split_set_pte(struct cpa_data *cpa, pte_t *pte, unsigned long pfn,
 	set_pte(pte, pfn_pte(pfn, ref_prot));
 }
 
-static int
+int
 __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 		   struct page *base)
 {
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
new file mode 100644
index 000000000000..3045bb7e4659
--- /dev/null
+++ b/arch/x86/mm/xpfo.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+
+#include <asm/tlbflush.h>
+
+extern spinlock_t cpa_lock;
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+	unsigned int level;
+	pgprot_t msk_clr;
+	pte_t *pte = lookup_address((unsigned long)kaddr, &level);
+
+	if (unlikely(!pte)) {
+		WARN(1, "xpfo: invalid address %p\n", kaddr);
+		return;
+	}
+
+	switch (level) {
+	case PG_LEVEL_4K:
+		set_pte_atomic(pte, pfn_pte(page_to_pfn(page),
+			       canon_pgprot(prot)));
+		break;
+	case PG_LEVEL_2M:
+	case PG_LEVEL_1G: {
+		struct cpa_data cpa = { };
+		int do_split;
+
+		if (level == PG_LEVEL_2M)
+			msk_clr = pmd_pgprot(*(pmd_t *)pte);
+		else
+			msk_clr = pud_pgprot(*(pud_t *)pte);
+
+		cpa.vaddr = kaddr;
+		cpa.pages = &page;
+		cpa.mask_set = prot;
+		cpa.mask_clr = msk_clr;
+		cpa.numpages = 1;
+		cpa.flags = 0;
+		cpa.curpage = 0;
+		cpa.force_split = 0;
+
+
+		do_split = should_split_large_page(pte, (unsigned long)kaddr,
+						   &cpa);
+		if (do_split) {
+			struct page *base;
+
+			base = alloc_pages(GFP_ATOMIC, 0);
+			if (!base) {
+				WARN(1, "xpfo: failed to split large page\n");
+				break;
+			}
+
+			if (!debug_pagealloc_enabled())
+				spin_lock(&cpa_lock);
+			if  (__split_large_page(&cpa, pte, (unsigned long)kaddr,
+						base) < 0) {
+				__free_page(base);
+				WARN(1, "xpfo: failed to split large page\n");
+			}
+			if (!debug_pagealloc_enabled())
+				spin_unlock(&cpa_lock);
+		}
+
+		break;
+	}
+	case PG_LEVEL_512G:
+		/* fallthrough, splitting infrastructure doesn't
+		 * support 512G pages.
+		 */
+	default:
+		WARN(1, "xpfo: unsupported page level %x\n", level);
+	}
+
+}
+EXPORT_SYMBOL_GPL(set_kpte);
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+	int level;
+	unsigned long size, kaddr;
+
+	kaddr = (unsigned long)page_address(page);
+
+	if (unlikely(!lookup_address(kaddr, &level))) {
+		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr,
+		     level);
+		return;
+	}
+
+	switch (level) {
+	case PG_LEVEL_4K:
+		size = PAGE_SIZE;
+		break;
+	case PG_LEVEL_2M:
+		size = PMD_SIZE;
+		break;
+	case PG_LEVEL_1G:
+		size = PUD_SIZE;
+		break;
+	default:
+		WARN(1, "xpfo: unsupported page level %x\n", level);
+		return;
+	}
+
+	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
+EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 93a1b5aceca3..c1d232da7ee0 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -25,6 +25,8 @@ struct page;
 
 #ifdef CONFIG_XPFO
 
+#include <linux/dma-mapping.h>
+
 DECLARE_STATIC_KEY_TRUE(xpfo_inited);
 
 /* Architecture specific implementations */
-- 
2.17.1



* [RFC PATCH v9 05/13] mm: add a user_virt_to_phys symbol
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (3 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64 Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 06/13] lkdtm: Add test for XPFO Khalid Aziz
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, peterz, aaron.lu,
	akpm, alexander.h.duyck, amir73il, andreyknvl, aneesh.kumar,
	anthony.yznaga, ard.biesheuvel, arnd, arunks, ben, bigeasy, bp,
	brgl, catalin.marinas, corbet, cpandya, daniel.vetter,
	dan.j.williams, gregkh, guro, hannes, hpa, iamjoonsoo.kim,
	james.morse, jannh, jgross, jkosina, jmorris, joe, jrdr.linux,
	jroedel, keith.busch, khalid.aziz, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

From: Tycho Andersen <tycho@tycho.ws>

We need something like this for testing XPFO. Since it's architecture
specific, putting it in the test code is slightly awkward, so let's
make it an arch-specific symbol and export it for use in LKDTM.
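
For illustration, a caller along the lines of the LKDTM test added
later in this series might do (sketch only, not the actual test
code):

#include <linux/kernel.h>
#include <linux/xpfo.h>
#include <asm/io.h>

/* Translate a user VA to a physical address, then try to read that
 * page through the kernel's linear map.  With XPFO active, the
 * linear-map alias should be unmapped and the read should fault. */
static void read_user_addr_via_physmap(unsigned long user_addr)
{
	phys_addr_t phys = user_virt_to_phys(user_addr);
	unsigned long *virt = phys_to_virt(phys);

	pr_info("xpfo test: physmap read returned %#lx\n", *virt);
}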

CC: linux-arm-kernel@lists.infradead.org
CC: x86@kernel.org
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Tested-by: Marco Benatto <marco.antonio.780@gmail.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
v7: * make user_virt_to_phys a GPL symbol

v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case

 arch/x86/mm/xpfo.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/xpfo.h |  4 ++++
 2 files changed, 61 insertions(+)

diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index 3045bb7e4659..b42513347865 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -121,3 +121,60 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order)
 	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
 }
 EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+	phys_addr_t phys_addr;
+	unsigned long offset;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset(current->mm, addr);
+	if (pgd_none(*pgd))
+		return 0;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return 0;
+
+	if (p4d_large(*p4d) || !p4d_present(*p4d)) {
+		phys_addr = (unsigned long)p4d_pfn(*p4d) << PAGE_SHIFT;
+		offset = addr & ~P4D_MASK;
+		goto out;
+	}
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud))
+		return 0;
+
+	if (pud_large(*pud) || !pud_present(*pud)) {
+		phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+		offset = addr & ~PUD_MASK;
+		goto out;
+	}
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return 0;
+
+	if (pmd_large(*pmd) || !pmd_present(*pmd)) {
+		phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+		offset = addr & ~PMD_MASK;
+		goto out;
+	}
+
+	pte =  pte_offset_kernel(pmd, addr);
+	phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+	offset = addr & ~PAGE_MASK;
+
+out:
+	return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(user_virt_to_phys);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index c1d232da7ee0..5d8d06e4b796 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -89,6 +89,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page)
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map);
 void xpfo_free_pages(struct page *page, int order);
 
+phys_addr_t user_virt_to_phys(unsigned long addr);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_init_single_page(struct page *page) { }
@@ -102,6 +104,8 @@ static inline void xpfo_free_pages(struct page *page, int order) { }
 static inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) { }
 static inline void xpfo_flush_kernel_tlb(struct page *page, int order) { }
 
+static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
+
 #endif /* CONFIG_XPFO */
 
 #if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP))
-- 
2.17.1



* [RFC PATCH v9 06/13] lkdtm: Add test for XPFO
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (4 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 05/13] mm: add a user_virt_to_phys symbol Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 07/13] arm64/mm: Add support " Khalid Aziz
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khalid.aziz, khlebnikov,
	logang, marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

From: Juerg Haefliger <juerg.haefliger@canonical.com>

This test simply reads from userspace memory via the kernel's linear
map.

v6: * drop an #ifdef, just let the test fail if XPFO is not supported
    * add an XPFO_SMP test to check that when one CPU does an xpfo unmap
      of an address, the mapping can't accidentally be used by other
      CPUs.
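
For reference, a minimal user space sketch (illustrative only, not part
of this patch) of how these crash types can be triggered, assuming
CONFIG_LKDTM is enabled and debugfs is mounted at /sys/kernel/debug so
LKDTM's usual DIRECT interface is available:

#include <stdio.h>

int main(void)
{
	/* LKDTM's DIRECT interface takes the crash type name verbatim. */
	FILE *f = fopen("/sys/kernel/debug/provoke-crash/DIRECT", "w");

	if (!f) {
		perror("provoke-crash/DIRECT");
		return 1;
	}

	/*
	 * With XPFO enabled this read is expected to oops and kill the
	 * writer; without XPFO the test logs a FAIL because the bad
	 * read succeeds.
	 */
	fputs("XPFO_READ_USER", f);
	fclose(f);
	return 0;
}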

Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Tested-by: Marco Benatto <marco.antonio.780@gmail.com>
[jsteckli@amazon.de: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 drivers/misc/lkdtm/Makefile |   1 +
 drivers/misc/lkdtm/core.c   |   3 +
 drivers/misc/lkdtm/lkdtm.h  |   5 +
 drivers/misc/lkdtm/xpfo.c   | 196 ++++++++++++++++++++++++++++++++++++
 4 files changed, 205 insertions(+)
 create mode 100644 drivers/misc/lkdtm/xpfo.c

diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile
index 951c984de61a..97c6b7818cce 100644
--- a/drivers/misc/lkdtm/Makefile
+++ b/drivers/misc/lkdtm/Makefile
@@ -9,6 +9,7 @@ lkdtm-$(CONFIG_LKDTM)		+= refcount.o
 lkdtm-$(CONFIG_LKDTM)		+= rodata_objcopy.o
 lkdtm-$(CONFIG_LKDTM)		+= usercopy.o
 lkdtm-$(CONFIG_LKDTM)		+= stackleak.o
+lkdtm-$(CONFIG_LKDTM)		+= xpfo.o
 
 KASAN_SANITIZE_stackleak.o	:= n
 KCOV_INSTRUMENT_rodata.o	:= n
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..25f4ab4ebf50 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -185,6 +185,9 @@ static const struct crashtype crashtypes[] = {
 	CRASHTYPE(USERCOPY_KERNEL),
 	CRASHTYPE(USERCOPY_KERNEL_DS),
 	CRASHTYPE(STACKLEAK_ERASING),
+	CRASHTYPE(XPFO_READ_USER),
+	CRASHTYPE(XPFO_READ_USER_HUGE),
+	CRASHTYPE(XPFO_SMP),
 };
 
 
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..6b31ff0c7f8f 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -87,4 +87,9 @@ void lkdtm_USERCOPY_KERNEL_DS(void);
 /* lkdtm_stackleak.c */
 void lkdtm_STACKLEAK_ERASING(void);
 
+/* lkdtm_xpfo.c */
+void lkdtm_XPFO_READ_USER(void);
+void lkdtm_XPFO_READ_USER_HUGE(void);
+void lkdtm_XPFO_SMP(void);
+
 #endif
diff --git a/drivers/misc/lkdtm/xpfo.c b/drivers/misc/lkdtm/xpfo.c
new file mode 100644
index 000000000000..8876128f0144
--- /dev/null
+++ b/drivers/misc/lkdtm/xpfo.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This is for all the tests related to XPFO (eXclusive Page Frame Ownership).
+ */
+
+#include "lkdtm.h"
+
+#include <linux/cpumask.h>
+#include <linux/mman.h>
+#include <linux/uaccess.h>
+#include <linux/xpfo.h>
+#include <linux/kthread.h>
+
+#include <linux/delay.h>
+#include <linux/sched/task.h>
+
+#define XPFO_DATA 0xdeadbeef
+
+static unsigned long do_map(unsigned long flags)
+{
+	unsigned long user_addr, user_data = XPFO_DATA;
+
+	user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
+			    PROT_READ | PROT_WRITE | PROT_EXEC,
+			    flags, 0);
+	if (user_addr >= TASK_SIZE) {
+		pr_warn("Failed to allocate user memory\n");
+		return 0;
+	}
+
+	if (copy_to_user((void __user *)user_addr, &user_data,
+			 sizeof(user_data))) {
+		pr_warn("copy_to_user failed\n");
+		goto free_user;
+	}
+
+	return user_addr;
+
+free_user:
+	vm_munmap(user_addr, PAGE_SIZE);
+	return 0;
+}
+
+static unsigned long *user_to_kernel(unsigned long user_addr)
+{
+	phys_addr_t phys_addr;
+	void *virt_addr;
+
+	phys_addr = user_virt_to_phys(user_addr);
+	if (!phys_addr) {
+		pr_warn("Failed to get physical address of user memory\n");
+		return NULL;
+	}
+
+	virt_addr = phys_to_virt(phys_addr);
+	if (phys_addr != virt_to_phys(virt_addr)) {
+		pr_warn("Physical address of user memory seems incorrect\n");
+		return NULL;
+	}
+
+	return virt_addr;
+}
+
+static void read_map(unsigned long *virt_addr)
+{
+	pr_info("Attempting bad read from kernel address %p\n", virt_addr);
+	if (*(unsigned long *)virt_addr == XPFO_DATA)
+		pr_err("FAIL: Bad read succeeded?!\n");
+	else
+		pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
+}
+
+static void read_user_with_flags(unsigned long flags)
+{
+	unsigned long user_addr, *kernel;
+
+	user_addr = do_map(flags);
+	if (!user_addr) {
+		pr_err("FAIL: map failed\n");
+		return;
+	}
+
+	kernel = user_to_kernel(user_addr);
+	if (!kernel) {
+		pr_err("FAIL: user to kernel conversion failed\n");
+		goto free_user;
+	}
+
+	read_map(kernel);
+
+free_user:
+	vm_munmap(user_addr, PAGE_SIZE);
+}
+
+/* Read from userspace via the kernel's linear map. */
+void lkdtm_XPFO_READ_USER(void)
+{
+	read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS);
+}
+
+void lkdtm_XPFO_READ_USER_HUGE(void)
+{
+	read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB);
+}
+
+struct smp_arg {
+	unsigned long *virt_addr;
+	unsigned int cpu;
+};
+
+static int smp_reader(void *parg)
+{
+	struct smp_arg *arg = parg;
+	unsigned long *virt_addr;
+
+	if (arg->cpu != smp_processor_id()) {
+		pr_err("FAIL: scheduled on wrong CPU?\n");
+		return 0;
+	}
+
+	virt_addr = smp_cond_load_acquire(&arg->virt_addr, VAL != NULL);
+	read_map(virt_addr);
+
+	return 0;
+}
+
+#ifdef CONFIG_X86
+#define XPFO_SMP_KILLED SIGKILL
+#elif defined(CONFIG_ARM64)
+#define XPFO_SMP_KILLED SIGSEGV
+#else
+#error unsupported arch
+#endif
+
+/* The idea here is to read from the kernel's map on a different thread than
+ * did the mapping (and thus the TLB flushing), to make sure that the page
+ * faults on other cores too.
+ */
+void lkdtm_XPFO_SMP(void)
+{
+	unsigned long user_addr, *virt_addr;
+	struct task_struct *thread;
+	int ret;
+	struct smp_arg arg;
+
+	if (num_online_cpus() < 2) {
+		pr_err("not enough CPUs to do a multi-CPU test\n");
+		return;
+	}
+
+	arg.virt_addr = NULL;
+	arg.cpu = (smp_processor_id() + 1) % num_online_cpus();
+	thread = kthread_create(smp_reader, &arg, "lkdtm_xpfo_test");
+	if (IS_ERR(thread)) {
+		pr_err("couldn't create kthread? %ld\n", PTR_ERR(thread));
+		return;
+	}
+
+	kthread_bind(thread, arg.cpu);
+	get_task_struct(thread);
+	wake_up_process(thread);
+
+	user_addr = do_map(MAP_PRIVATE | MAP_ANONYMOUS);
+	if (!user_addr)
+		goto kill_thread;
+
+	virt_addr = user_to_kernel(user_addr);
+	if (!virt_addr) {
+		/*
+		 * let's store something that will fail, so we can unblock the
+		 * thread
+		 */
+		smp_store_release(&arg.virt_addr, &arg);
+		goto free_user;
+	}
+
+	/* Store to test address to unblock the thread */
+	smp_store_release(&arg.virt_addr, virt_addr);
+
+	/* there must be a better way to do this. */
+	while (1) {
+		if (thread->exit_state)
+			break;
+		msleep_interruptible(100);
+	}
+
+free_user:
+	if (user_addr)
+		vm_munmap(user_addr, PAGE_SIZE);
+
+kill_thread:
+	ret = kthread_stop(thread);
+	if (ret != XPFO_SMP_KILLED)
+		pr_err("FAIL: thread wasn't killed: %d\n", ret);
+	put_task_struct(thread);
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [RFC PATCH v9 07/13] arm64/mm: Add support for XPFO
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (5 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 06/13] lkdtm: Add test for XPFO Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 08/13] swiotlb: Map the buffer if it was unmapped by XPFO Khalid Aziz
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khalid.aziz, khlebnikov,
	logang, marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

From: Juerg Haefliger <juerg.haefliger@canonical.com>

Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
provide a hook for updating a single kernel page table entry (which is
required by the generic XPFO code).

XPFO doesn't support section/contiguous mappings yet, so let's disable
it if XPFO is turned on.

Thanks to Laura Abbott for the simplification from v5, and Mark Rutland
for pointing out we need NO_CONT_MAPPINGS too.
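
As a rough illustration of why a per-PTE set_kpte() hook (and hence
NO_BLOCK_MAPPINGS/NO_CONT_MAPPINGS) is needed, this is approximately how
the generic XPFO code drives the two hooks when a page changes hands; a
paraphrased sketch, not the exact code in mm/xpfo.c:

#include <linux/mm.h>
#include <linux/xpfo.h>

/* Page is being handed to userspace: remove it from the linear map. */
static void sketch_unmap_from_kernel(struct page *page)
{
	set_kpte(page_address(page), page, __pgprot(0));
	xpfo_flush_kernel_tlb(page, 0);
}

/* Page is back under kernel control (or kmapped): restore the mapping. */
static void sketch_map_back_into_kernel(struct page *page)
{
	set_kpte(page_address(page), page, PAGE_KERNEL);
}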

CC: linux-arm-kernel@lists.infradead.org
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 .../admin-guide/kernel-parameters.txt         |  2 +-
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/mm/Makefile                        |  2 +
 arch/arm64/mm/mmu.c                           |  2 +-
 arch/arm64/mm/xpfo.c                          | 66 +++++++++++++++++++
 5 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm64/mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e65e3bc1efe0..9fcf8c83031a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2997,7 +2997,7 @@
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
-	noxpfo		[XPFO,X86-64] Disable eXclusive Page Frame
+	noxpfo		[XPFO,X86-64,ARM64] Disable eXclusive Page Frame
 			Ownership (XPFO) when CONFIG_XPFO is on. Physical
 			pages mapped into user applications will also be
 			mapped in the kernel's address space as if
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d366127..9a8d8e649cf8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -174,6 +174,7 @@ config ARM64
 	select SWIOTLB
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
+	select ARCH_SUPPORTS_XPFO
 	help
 	  ARM 64-bit (AArch64) Linux support.
 
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..cca3808d9776 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
 KASAN_SANITIZE_kasan_init.o	:= n
+
+obj-$(CONFIG_XPFO)		+= xpfo.o
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b6f5aa52ac67..1673f7443d62 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -453,7 +453,7 @@ static void __init map_mem(pgd_t *pgdp)
 	struct memblock_region *reg;
 	int flags = 0;
 
-	if (rodata_full || debug_pagealloc_enabled())
+	if (rodata_full || debug_pagealloc_enabled() || xpfo_enabled())
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
new file mode 100644
index 000000000000..7866c5acfffb
--- /dev/null
+++ b/arch/arm64/mm/xpfo.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+
+#include <asm/tlbflush.h>
+
+/*
+ * Lookup the page table entry for a virtual address and return a pointer to
+ * the entry. Based on x86 tree.
+ */
+static pte_t *lookup_address(unsigned long addr)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
+		return NULL;
+
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+
+	return pte_offset_kernel(pmd, addr);
+}
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+	pte_t *pte = lookup_address((unsigned long)kaddr);
+
+	if (unlikely(!pte)) {
+		WARN(1, "xpfo: invalid address %p\n", kaddr);
+		return;
+	}
+
+	set_pte(pte, pfn_pte(page_to_pfn(page), prot));
+}
+EXPORT_SYMBOL_GPL(set_kpte);
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+	unsigned long kaddr = (unsigned long)page_address(page);
+	unsigned long size = PAGE_SIZE;
+
+	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
+EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [RFC PATCH v9 08/13] swiotlb: Map the buffer if it was unmapped by XPFO
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (6 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 07/13] arm64/mm: Add support " Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 09/13] xpfo: add primitives for mapping underlying memory Khalid Aziz
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khalid.aziz, khlebnikov,
	logang, marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

From: Juerg Haefliger <juerg.haefliger@canonical.com>

XPFO can unmap a bounce buffer. Check for this and map it back in if
needed.

Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v9: * Added a generic check for whether a page is mapped or not (suggested
      by Christoph Hellwig)

v6: * guard against lookup_xpfo() returning NULL

 include/linux/highmem.h | 7 +++++++
 kernel/dma/swiotlb.c    | 3 ++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 59a1a5fa598d..cf21f023dff4 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -77,6 +77,13 @@ static inline struct page *kmap_to_page(void *addr)
 }
 
 static inline unsigned long totalhigh_pages(void) { return 0UL; }
+static inline bool page_is_unmapped(struct page *page)
+{
+	if (PageHighMem(page) || PageXpfoUnmapped(page))
+		return true;
+	else
+		return false;
+}
 
 #endif /* CONFIG_HIGHMEM */
 
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 1fb6fd68b9c7..90a1a3709b55 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -392,8 +392,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 {
 	unsigned long pfn = PFN_DOWN(orig_addr);
 	unsigned char *vaddr = phys_to_virt(tlb_addr);
+	struct page *page = pfn_to_page(pfn);
 
-	if (PageHighMem(pfn_to_page(pfn))) {
+	if (page_is_unmapped(page)) {
 		/* The buffer does not have a mapping.  Map it in and copy */
 		unsigned int offset = orig_addr & ~PAGE_MASK;
 		char *buffer;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [RFC PATCH v9 09/13] xpfo: add primitives for mapping underlying memory
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (7 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 08/13] swiotlb: Map the buffer if it was unmapped by XPFO Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 10/13] arm64/mm, xpfo: temporarily map dcache regions Khalid Aziz
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, peterz, aaron.lu,
	akpm, alexander.h.duyck, amir73il, andreyknvl, aneesh.kumar,
	anthony.yznaga, ard.biesheuvel, arnd, arunks, ben, bigeasy, bp,
	brgl, catalin.marinas, corbet, cpandya, daniel.vetter,
	dan.j.williams, gregkh, guro, hannes, hpa, iamjoonsoo.kim,
	james.morse, jannh, jgross, jkosina, jmorris, joe, jrdr.linux,
	jroedel, keith.busch, khalid.aziz, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

From: Tycho Andersen <tycho@tycho.ws>

In some cases (on arm64 DMA and data cache flushes) we may have unmapped
the underlying pages needed for something via XPFO. Here are some
primitives useful for ensuring the underlying memory is mapped/unmapped in
the face of xpfo.
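
A minimal sketch of the intended calling pattern (essentially what the
arm64 dcache patch later in this series does); the wrapper name is
illustrative:

#include <linux/xpfo.h>

static void touch_possibly_unmapped(void *kaddr, size_t len)
{
	void *mapping[XPFO_NUM_PAGES(kaddr, len)];

	/* kmap any pages in [kaddr, kaddr + len) that XPFO has unmapped. */
	xpfo_temp_map(kaddr, len, mapping, sizeof(mapping));

	/* ... the buffer is safely accessible via the linear map here ... */

	xpfo_temp_unmap(kaddr, len, mapping, sizeof(mapping));
}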

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 include/linux/xpfo.h | 21 +++++++++++++++++++++
 mm/xpfo.c            | 30 ++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 5d8d06e4b796..2318c7eb5fb7 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -91,6 +91,15 @@ void xpfo_free_pages(struct page *page, int order);
 
 phys_addr_t user_virt_to_phys(unsigned long addr);
 
+#define XPFO_NUM_PAGES(addr, size) \
+	(PFN_UP((unsigned long) (addr) + (size)) - \
+		PFN_DOWN((unsigned long) (addr)))
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+		   size_t mapping_len);
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+		     size_t mapping_len);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_init_single_page(struct page *page) { }
@@ -106,6 +115,18 @@ static inline void xpfo_flush_kernel_tlb(struct page *page, int order) { }
 
 static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
 
+#define XPFO_NUM_PAGES(addr, size) 0
+
+static inline void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+				 size_t mapping_len)
+{
+}
+
+static inline void xpfo_temp_unmap(const void *addr, size_t size,
+				   void **mapping, size_t mapping_len)
+{
+}
+
 #endif /* CONFIG_XPFO */
 
 #if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP))
diff --git a/mm/xpfo.c b/mm/xpfo.c
index b74fee0479e7..974f1b70ccd9 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -14,6 +14,7 @@
  * the Free Software Foundation.
  */
 
+#include <linux/highmem.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/xpfo.h>
@@ -104,3 +105,32 @@ void xpfo_free_pages(struct page *page, int order)
 		}
 	}
 }
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+		   size_t mapping_len)
+{
+	struct page *page = virt_to_page(addr);
+	int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+	memset(mapping, 0, mapping_len);
+
+	for (i = 0; i < num_pages; i++) {
+		if (page_to_virt(page + i) >= addr + size)
+			break;
+
+		if (PageXpfoUnmapped(page + i))
+			mapping[i] = kmap_atomic(page + i);
+	}
+}
+EXPORT_SYMBOL(xpfo_temp_map);
+
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+		     size_t mapping_len)
+{
+	int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+	for (i = 0; i < num_pages; i++)
+		if (mapping[i])
+			kunmap_atomic(mapping[i]);
+}
+EXPORT_SYMBOL(xpfo_temp_unmap);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [RFC PATCH v9 10/13] arm64/mm, xpfo: temporarily map dcache regions
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (8 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 09/13] xpfo: add primitives for mapping underlying memory Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-03 17:34 ` [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap Khalid Aziz
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khalid.aziz, khlebnikov,
	logang, marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module

From: Juerg Haefliger <juerg.haefliger@canonical.com>

If the page is unmapped by XPFO, a data cache flush results in a fatal
page fault, so let's temporarily map the region, flush the cache, and then
unmap it.

CC: linux-arm-kernel@lists.infradead.org
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
---
v6: actually flush in the face of xpfo, and temporarily map the underlying
    memory so it can be flushed correctly

 arch/arm64/mm/flush.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 5c9073bace83..114e8bc5a3dc 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 #include <asm/cache.h>
@@ -28,9 +29,15 @@
 void sync_icache_aliases(void *kaddr, unsigned long len)
 {
 	unsigned long addr = (unsigned long)kaddr;
+	unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
+	void *mapping[num_pages];
 
 	if (icache_is_aliasing()) {
+		xpfo_temp_map(kaddr, len, mapping,
+			      sizeof(mapping[0]) * num_pages);
 		__clean_dcache_area_pou(kaddr, len);
+		xpfo_temp_unmap(kaddr, len, mapping,
+				sizeof(mapping[0]) * num_pages);
 		__flush_icache_all();
 	} else {
 		/*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (9 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 10/13] arm64/mm, xpfo: temporarily map dcache regions Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-04  7:56   ` Peter Zijlstra
  2019-04-03 17:34 ` [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only) Khalid Aziz
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, peterz, aaron.lu,
	akpm, alexander.h.duyck, amir73il, andreyknvl, aneesh.kumar,
	anthony.yznaga, ard.biesheuvel, arnd, arunks, ben, bigeasy, bp,
	brgl, catalin.marinas, corbet, cpandya, daniel.vetter,
	dan.j.williams, gregkh, guro, hannes, hpa, iamjoonsoo.kim,
	james.morse, jannh, jgross, jkosina, jmorris, joe, jrdr.linux,
	jroedel, keith.busch, khalid.aziz, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz, kernel-hardening,
	Vasileios P . Kemerlis, Juerg Haefliger, David Woodhouse

From: Julian Stecklina <jsteckli@amazon.de>

Only the xpfo_kunmap call that needs to actually unmap the page
needs to be serialized. We need to be careful to handle the case
where, after the atomic decrement of the mapcount, an xpfo_kmap
increased the mapcount again. In this case, we can safely skip
modifying the page table.

Model-checked with up to 4 concurrent callers with Spin.

Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
Cc: x86@kernel.org
Cc: kernel-hardening@lists.openwall.com
Cc: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Cc: Juerg Haefliger <juerg.haefliger@canonical.com>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Marco Benatto <marco.antonio.780@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
---
 include/linux/xpfo.h | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 2318c7eb5fb7..37e7f52fa6ce 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page *page)
 static inline void xpfo_kunmap(void *kaddr, struct page *page)
 {
 	unsigned long flags;
+	bool flush_tlb = false;
 
 	if (!static_branch_unlikely(&xpfo_inited))
 		return;
@@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page)
 	 * The page is to be allocated back to user space, so unmap it from
 	 * the kernel, flush the TLB and tag it as a user page.
 	 */
-	spin_lock_irqsave(&page->xpfo_lock, flags);
-
 	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
-#ifdef CONFIG_XPFO_DEBUG
-		WARN_ON(PageXpfoUnmapped(page));
-#endif
-		SetPageXpfoUnmapped(page);
-		set_kpte(kaddr, page, __pgprot(0));
-		xpfo_flush_kernel_tlb(page, 0);
+		spin_lock_irqsave(&page->xpfo_lock, flags);
+
+		/*
+		 * In the case where we raced with kmap after the
+		 * atomic_dec_return, we must not nuke the mapping.
+		 */
+		if (atomic_read(&page->xpfo_mapcount) == 0) {
+			SetPageXpfoUnmapped(page);
+			set_kpte(kaddr, page, __pgprot(0));
+			flush_tlb = true;
+		}
+		spin_unlock_irqrestore(&page->xpfo_lock, flags);
 	}
 
-	spin_unlock_irqrestore(&page->xpfo_lock, flags);
+	if (flush_tlb)
+		xpfo_flush_kernel_tlb(page, 0);
 }
 
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (10 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-04  4:10   ` Andy Lutomirski
  2019-04-04  8:18   ` Peter Zijlstra
  2019-04-03 17:34 ` [RFC PATCH v9 13/13] xpfo, mm: Optimize XPFO TLB flushes by batching them together Khalid Aziz
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: Khalid Aziz, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

XPFO synchronously flushes kernel space TLB entries for pages that
are now mapped in userspace, not only on the current CPU but on all
other CPUs as well. Processes allocating pages on each core cause a
flood of IPI messages to all other cores to flush TLB entries.
Many of these messages are to flush the entire TLB on the core if
the number of entries being flushed from local core exceeds
tlb_single_page_flush_ceiling. The cost of TLB flush caused by
unmapping pages from physmap goes up dramatically on machines with
high core count.

This patch flushes either the relevant TLB entries or the entire
TLB for the current CPU, depending upon the number of entries, and
posts a pending TLB flush on all other CPUs when a page is
unmapped from kernel space and mapped in userspace. Each core
checks the pending TLB flush flag for itself on every context
switch, flushes its TLB if the flag is set and clears it.
This patch potentially aggregates multiple TLB flushes into one.
This has a very significant impact, especially on machines with
large core counts. To illustrate this, the kernel was compiled with
"make -j" on two classes of machines - a server with a high core
count and a large amount of memory, and a desktop class machine
with more modest specs. System times for "make -j" on the vanilla
4.20 kernel, on 4.20 with the XPFO patches but without this patch,
and on 4.20 with this patch applied are below:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                            950.966s
4.20+XPFO                       25073.169s      26.366x
4.20+XPFO+Deferred flush        1372.874s        1.44x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                            607.671s
4.20+XPFO                       1588.646s       2.614x
4.20+XPFO+Deferred flush        803.989s        1.32x

This same code should be implemented for other architectures as
well once finalized.
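
For architectures picking this up later, the scheme boils down to the
following arch-neutral sketch (names are illustrative; the real x86
code below additionally does a local range or full flush at unmap
time):

#include <linux/cpumask.h>
#include <linux/smp.h>

/* CPUs that still owe themselves a full TLB flush because of XPFO. */
static cpumask_t pending_xpfo_flush;

/* Called by the CPU that just unmapped a page from the linear map. */
static void xpfo_post_deferred_flushes(void)
{
	int this_cpu = smp_processor_id();
	int cpu;

	for_each_online_cpu(cpu)
		if (cpu != this_cpu)
			cpumask_set_cpu(cpu, &pending_xpfo_flush);
}

/* Called from the context-switch path on every CPU. */
static void xpfo_check_pending_flush(void)
{
	if (cpumask_test_and_clear_cpu(smp_processor_id(),
				       &pending_xpfo_flush))
		local_flush_tlb_all();	/* arch's local full TLB flush */
}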

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 52 +++++++++++++++++++++++++++++++++
 arch/x86/mm/xpfo.c              |  2 +-
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f4204bf377fc..92d23629d01d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -561,6 +561,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 999d6d8f0bef..cc806a01a0eb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -37,6 +37,20 @@
  */
 #define LAST_USER_MM_IBPB	0x1UL
 
+/*
+ * A TLB flush may be needed to flush stale TLB entries
+ * for pages that have been mapped into userspace and unmapped
+ * from kernel space. This TLB flush needs to be propagated to
+ * all CPUs. Asynchronous flush requests to all CPUs can cause
+ * significant performance impact. Queue a pending flush for
+ * a CPU instead. Multiple of these requests can then be handled
+ * by a CPU at a less disruptive time, like context switch, in
+ * one go and reduce performance impact significantly. Following
+ * data structure is used to keep track of CPUs with pending full
+ * TLB flush forced by xpfo.
+ */
+static cpumask_t pending_xpfo_flush;
+
 /*
  * We get here when we do something requiring a TLB invalidation
  * but could not go invalidate all of the contexts.  We do the
@@ -321,6 +335,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
+
+	/*
+	 * If there is a pending TLB flush for this CPU due to XPFO
+	 * flush, do it now.
+	 */
+	if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
+		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+		__flush_tlb_all();
+	}
+
 	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	/*
@@ -803,6 +827,34 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	}
 }
 
+void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	struct cpumask tmp_mask;
+
+	/*
+	 * Balance as user space task's flush, a bit conservative.
+	 * Do a local flush immediately and post a pending flush on all
+	 * other CPUs. Local flush can be a range flush or full flush
+	 * depending upon the number of entries to be flushed. Remote
+	 * flushes will be done by individual processors at the time of
+	 * context switch and this allows multiple flush requests from
+	 * other CPUs to be batched together.
+	 */
+	if (end == TLB_FLUSH_ALL ||
+	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
+		do_flush_tlb_all(NULL);
+	} else {
+		struct flush_tlb_info info;
+
+		info.start = start;
+		info.end = end;
+		do_kernel_range_flush(&info);
+	}
+	cpumask_setall(&tmp_mask);
+	__cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
+	cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info info = {
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index b42513347865..638eee5b1f09 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -118,7 +118,7 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order)
 		return;
 	}
 
-	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+	xpfo_flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
 }
 EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [RFC PATCH v9 13/13] xpfo, mm: Optimize XPFO TLB flushes by batching them together
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (11 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only) Khalid Aziz
@ 2019-04-03 17:34 ` Khalid Aziz
  2019-04-04 16:44 ` [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Nadav Amit
  2019-04-06  6:40 ` Jon Masters
  14 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-03 17:34 UTC (permalink / raw)
  To: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk
  Cc: Khalid Aziz, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	peterz, aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

When XPFO forces a TLB flush on all cores, the performance impact is
very significant. Batching as many of these TLB updates as
possible can help lower this impact. When userspace allocates a
page, the kernel tries to get that page from the per-cpu free list.
This free list is replenished in bulk when it runs low. Replenishing
the free list for future allocation to userspace is a good
opportunity to update TLB entries in batch and reduce the impact of
multiple TLB flushes later. This patch adds new tags for the page so
a page can be marked as available for userspace allocation and
unmapped from kernel address space. All such pages are removed from
kernel address space in bulk at the time they are added to the
per-cpu free list. This patch, when combined with deferred TLB
flushes, improves performance further. Using the same benchmark as
before of building the kernel in parallel, here are the system
times on two differently sized systems:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

5.0					913.862s
5.0+XPFO+Deferred flush+Batch update	1165.259s	1.28x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

5.0					610.642s
5.0+XPFO+Deferred flush+Batch update	773.075s	1.27x

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
---
v9:
	- Do not map a page freed by userspace back into kernel. Mark
	 it as unmapped instead and map it back in only when needed. This
	 avoids the cost of unmap and TLB flush if the page is allocated
	 back to userspace.

 arch/x86/include/asm/pgtable.h |  2 +-
 arch/x86/mm/pageattr.c         |  9 ++++--
 arch/x86/mm/xpfo.c             | 11 +++++--
 include/linux/xpfo.h           | 11 +++++++
 mm/page_alloc.c                |  9 ++++++
 mm/xpfo.c                      | 54 +++++++++++++++++++++++++++-------
 6 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5c0e1581fa56..61f64c6c687c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1461,7 +1461,7 @@ should_split_large_page(pte_t *kpte, unsigned long address,
 extern spinlock_t cpa_lock;
 int
 __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
-		   struct page *base);
+		   struct page *base, bool xpfo_split);
 
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 530b5df0617e..8fe86ac6bff0 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -911,7 +911,7 @@ static void split_set_pte(struct cpa_data *cpa, pte_t *pte, unsigned long pfn,
 
 int
 __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
-		   struct page *base)
+		   struct page *base, bool xpfo_split)
 {
 	unsigned long lpaddr, lpinc, ref_pfn, pfn, pfninc = 1;
 	pte_t *pbase = (pte_t *)page_address(base);
@@ -1008,7 +1008,10 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	 * page attribute in parallel, that also falls into the
 	 * just split large page entry.
 	 */
-	flush_tlb_all();
+	if (xpfo_split)
+		xpfo_flush_tlb_all();
+	else
+		flush_tlb_all();
 	spin_unlock(&pgd_lock);
 
 	return 0;
@@ -1027,7 +1030,7 @@ static int split_large_page(struct cpa_data *cpa, pte_t *kpte,
 	if (!base)
 		return -ENOMEM;
 
-	if (__split_large_page(cpa, kpte, address, base))
+	if (__split_large_page(cpa, kpte, address, base, false))
 		__free_page(base);
 
 	return 0;
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index 638eee5b1f09..8c482c7b54f5 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -47,7 +47,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
 
 		cpa.vaddr = kaddr;
 		cpa.pages = &page;
-		cpa.mask_set = prot;
+		cpa.mask_set = canon_pgprot(prot);
 		cpa.mask_clr = msk_clr;
 		cpa.numpages = 1;
 		cpa.flags = 0;
@@ -57,7 +57,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
 
 		do_split = should_split_large_page(pte, (unsigned long)kaddr,
 						   &cpa);
-		if (do_split) {
+		if (do_split > 0) {
 			struct page *base;
 
 			base = alloc_pages(GFP_ATOMIC, 0);
@@ -69,7 +69,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
 			if (!debug_pagealloc_enabled())
 				spin_lock(&cpa_lock);
 			if  (__split_large_page(&cpa, pte, (unsigned long)kaddr,
-						base) < 0) {
+						base, true) < 0) {
 				__free_page(base);
 				WARN(1, "xpfo: failed to split large page\n");
 			}
@@ -90,6 +90,11 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
 }
 EXPORT_SYMBOL_GPL(set_kpte);
 
+void xpfo_flush_tlb_all(void)
+{
+	xpfo_flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
+}
+
 inline void xpfo_flush_kernel_tlb(struct page *page, int order)
 {
 	int level;
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 37e7f52fa6ce..01da4bb31cd6 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -32,6 +32,7 @@ DECLARE_STATIC_KEY_TRUE(xpfo_inited);
 /* Architecture specific implementations */
 void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
 void xpfo_flush_kernel_tlb(struct page *page, int order);
+void xpfo_flush_tlb_all(void);
 
 void xpfo_init_single_page(struct page *page);
 
@@ -106,6 +107,9 @@ void xpfo_temp_map(const void *addr, size_t size, void **mapping,
 void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
 		     size_t mapping_len);
 
+bool xpfo_pcp_refill(struct page *page, enum migratetype migratetype,
+		     int order);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_init_single_page(struct page *page) { }
@@ -118,6 +122,7 @@ static inline void xpfo_free_pages(struct page *page, int order) { }
 
 static inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) { }
 static inline void xpfo_flush_kernel_tlb(struct page *page, int order) { }
+static inline void xpfo_flush_tlb_all(void) { }
 
 static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
 
@@ -133,6 +138,12 @@ static inline void xpfo_temp_unmap(const void *addr, size_t size,
 {
 }
 
+static inline bool xpfo_pcp_refill(struct page *page,
+				   enum migratetype migratetype, int order)
+{
+	return false;
+}
+
 #endif /* CONFIG_XPFO */
 
 #if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e0dda1322a2..7846b2590ef0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3031,6 +3031,8 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 			struct list_head *list)
 {
 	struct page *page;
+	struct list_head *cur;
+	bool flush_tlb = false;
 
 	do {
 		if (list_empty(list)) {
@@ -3039,6 +3041,13 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 					migratetype, alloc_flags);
 			if (unlikely(list_empty(list)))
 				return NULL;
+			list_for_each(cur, list) {
+				page = list_entry(cur, struct page, lru);
+				flush_tlb |= xpfo_pcp_refill(page,
+							     migratetype, 0);
+			}
+			if (flush_tlb)
+				xpfo_flush_tlb_all();
 		}
 
 		page = list_first_entry(list, struct page, lru);
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 974f1b70ccd9..47d400f1fc65 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -62,17 +62,22 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map)
 		WARN_ON(atomic_read(&(page + i)->xpfo_mapcount));
 #endif
 		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			bool user_page = TestSetPageXpfoUser(page + i);
+
 			/*
 			 * Tag the page as a user page and flush the TLB if it
 			 * was previously allocated to the kernel.
 			 */
-			if ((!TestSetPageXpfoUser(page + i)) || !will_map) {
-				SetPageXpfoUnmapped(page + i);
-				flush_tlb = true;
+			if (!user_page || !will_map) {
+				if (!TestSetPageXpfoUnmapped(page + i))
+					flush_tlb = true;
 			}
 		} else {
 			/* Tag the page as a non-user (kernel) page */
 			ClearPageXpfoUser(page + i);
+			if (TestClearPageXpfoUnmapped(page + i))
+				set_kpte(page_address(page + i), page + i,
+					 PAGE_KERNEL);
 		}
 	}
 
@@ -95,14 +100,12 @@ void xpfo_free_pages(struct page *page, int order)
 #endif
 
 		/*
-		 * Map the page back into the kernel if it was previously
-		 * allocated to user space.
+		 * Leave the page as unmapped from kernel. If this page
+		 * gets allocated to userspace soon again, it saves us
+		 * the cost of TLB flush at that time.
 		 */
-		if (TestClearPageXpfoUser(page + i)) {
-			ClearPageXpfoUnmapped(page + i);
-			set_kpte(page_address(page + i), page + i,
-				 PAGE_KERNEL);
-		}
+		if (PageXpfoUser(page + i))
+			SetPageXpfoUnmapped(page + i);
 	}
 }
 
@@ -134,3 +137,34 @@ void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
 			kunmap_atomic(mapping[i]);
 }
 EXPORT_SYMBOL(xpfo_temp_unmap);
+
+bool xpfo_pcp_refill(struct page *page, enum migratetype migratetype,
+		     int order)
+{
+	int i;
+	bool flush_tlb = false;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	for (i = 0; i < 1 << order; i++) {
+		if (migratetype == MIGRATE_MOVABLE) {
+			/* GFP_HIGHUSER:
+			 * Tag the page as a user page, mark it as unmapped
+			 * in kernel space and flush the TLB if it was
+			 * previously allocated to the kernel.
+			 */
+			SetPageXpfoUser(page + i);
+			if (!TestSetPageXpfoUnmapped(page + i))
+				flush_tlb = true;
+		} else {
+			/* Tag the page as a non-user (kernel) page */
+			ClearPageXpfoUser(page + i);
+		}
+	}
+
+	if (flush_tlb)
+		set_kpte(page_address(page), page, __pgprot(0));
+
+	return flush_tlb;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
  2019-04-03 17:34 ` [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault Khalid Aziz
@ 2019-04-04  0:12   ` Andy Lutomirski
  2019-04-04  1:42     ` Tycho Andersen
  0 siblings, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2019-04-04  0:12 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Juerg Haefliger, Tycho Andersen, jsteckli, Andi Kleen,
	liran.alon, Kees Cook, Konrad Rzeszutek Wilk, deepa.srinivasan,
	chris hyser, Tyler Hicks, Woodhouse, David, Andrew Cooper,
	Jon Masters, Boris Ostrovsky, kanth.ghatraju, Joao Martins,
	Jim Mattson, pradeep.vincent, John Haxby, Thomas Gleixner,
	Kirill A. Shutemov, Christoph Hellwig, steven.sistare,
	Laura Abbott, Andrew Lutomirski, Dave Hansen, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yang.shi, yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz

On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> From: Tycho Andersen <tycho@tycho.ws>
>
> Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> that might sleep:
>


> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9d5c75f02295..7891add0913f 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>         /* Executive summary in case the body of the oops scrolled away */
>         printk(KERN_DEFAULT "CR2: %016lx\n", address);
>
> +       /*
> +        * We're about to oops, which might kill the task. Make sure we're
> +        * allowed to sleep.
> +        */
> +       flags |= X86_EFLAGS_IF;
> +
>         oops_end(flags, regs, sig);
>  }
>


NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
rewind_stack_do_exit().


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
  2019-04-04  0:12   ` Andy Lutomirski
@ 2019-04-04  1:42     ` Tycho Andersen
  2019-04-04  4:12       ` Andy Lutomirski
  0 siblings, 1 reply; 70+ messages in thread
From: Tycho Andersen @ 2019-04-04  1:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Khalid Aziz, Juerg Haefliger, jsteckli, Andi Kleen, liran.alon,
	Kees Cook, Konrad Rzeszutek Wilk, deepa.srinivasan, chris hyser,
	Tyler Hicks, Woodhouse, David, Andrew Cooper, Jon Masters,
	Boris Ostrovsky, kanth.ghatraju, Joao Martins, Jim Mattson,
	pradeep.vincent, John Haxby, Thomas Gleixner, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Dave Hansen,
	Peter Zijlstra, Aaron Lu, Andrew Morton, alexander.h.duyck,
	Amir Goldstein, Andrey Konovalov, aneesh.kumar, anthony.yznaga,
	Ard Biesheuvel, Arnd Bergmann, arunks, Ben Hutchings,
	Sebastian Andrzej Siewior, Borislav Petkov, brgl,
	Catalin Marinas, Jonathan Corbet, cpandya, Daniel Vetter,
	Dan Williams, Greg KH, Roman Gushchin, Johannes Weiner,
	H. Peter Anvin, Joonsoo Kim, James Morse, Jann Horn,
	Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Serge E. Hallyn, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz

On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >
> > From: Tycho Andersen <tycho@tycho.ws>
> >
> > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> > that might sleep:
> >
> 
> 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 9d5c75f02295..7891add0913f 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> >         /* Executive summary in case the body of the oops scrolled away */
> >         printk(KERN_DEFAULT "CR2: %016lx\n", address);
> >
> > +       /*
> > +        * We're about to oops, which might kill the task. Make sure we're
> > +        * allowed to sleep.
> > +        */
> > +       flags |= X86_EFLAGS_IF;
> > +
> >         oops_end(flags, regs, sig);
> >  }
> >
> 
> 
> NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> rewind_stack_do_exit().

[I trimmed the CC list since google rejected it with E2BIG :)]

I guess the problem is really that do_exit() (or really
exit_signals()) might sleep. Maybe we should put an irq_enable() at
the beginning of do_exit() instead and fix this problem for all
arches?

Tycho


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-03 17:34 ` [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only) Khalid Aziz
@ 2019-04-04  4:10   ` Andy Lutomirski
       [not found]     ` <91f1dbce-332e-25d1-15f6-0e9cfc8b797b@oracle.com>
  2019-04-04  8:18   ` Peter Zijlstra
  1 sibling, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2019-04-04  4:10 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Juerg Haefliger, Tycho Andersen, jsteckli, Andi Kleen,
	liran.alon, Kees Cook, Konrad Rzeszutek Wilk, deepa.srinivasan,
	chris hyser, Tyler Hicks, Woodhouse, David, Andrew Cooper,
	Jon Masters, Boris Ostrovsky, kanth.ghatraju, Joao Martins,
	Jim Mattson, pradeep.vincent, John Haxby, Thomas Gleixner,
	Kirill A. Shutemov, Christoph Hellwig, steven.sistare,
	Laura Abbott, Andrew Lutomirski, Dave Hansen, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yang.shi, yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz

On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> XPFO flushes kernel space TLB entries for pages that are now mapped
> in userspace on not only the current CPU but also all other CPUs
> synchronously. Processes on each core allocating pages causes a
> flood of IPI messages to all other cores to flush TLB entries.
> Many of these messages are to flush the entire TLB on the core if
> the number of entries being flushed from local core exceeds
> tlb_single_page_flush_ceiling. The cost of TLB flush caused by
> unmapping pages from physmap goes up dramatically on machines with
> high core count.
>
> This patch flushes relevant TLB entries for current process or
> entire TLB depending upon number of entries for the current CPU
> and posts a pending TLB flush on all other CPUs when a page is
> unmapped from kernel space and mapped in userspace. Each core
> checks the pending TLB flush flag for itself on every context
> switch, flushes its TLB if the flag is set and clears it.
> This patch potentially aggregates multiple TLB flushes into one.
> This has very significant impact especially on machines with large
> core counts.

Why is this a reasonable strategy?

> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
> +{
> +       struct cpumask tmp_mask;
> +
> +       /*
> +        * Balance as user space task's flush, a bit conservative.
> +        * Do a local flush immediately and post a pending flush on all
> +        * other CPUs. Local flush can be a range flush or full flush
> +        * depending upon the number of entries to be flushed. Remote
> +        * flushes will be done by individual processors at the time of
> +        * context switch and this allows multiple flush requests from
> +        * other CPUs to be batched together.
> +        */

I don't like this function at all.  A core function like this is a
contract of sorts between the caller and the implementation.  There is
no such thing as an "xpfo" flush, and this function's behavior isn't
at all well defined.  For flush_tlb_kernel_range(), I can tell you
exactly what that function does, and the implementation is either
correct or incorrect.  With this function, I have no idea what is
actually required, and I can't possibly tell whether it's correct.

As far as I can see, xpfo_flush_tlb_kernel_range() actually means
"flush this range on this CPU right now, and flush it on remote CPUs
eventually".  It would be valid, but probably silly, to flush locally
and to never flush at all on remote CPUs.  This makes me wonder what
the point is.
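To make the objection concrete, here is a minimal sketch of how the
operation could be split into two primitives with well-defined
contracts. The names flush_tlb_kernel_range_local() and
xpfo_post_remote_kernel_tlb_flush() are hypothetical, and the bodies
only rearrange the logic from the patch under review; this is an
illustration, not a proposed implementation.

	/* Flush a kernel VA range on the local CPU only, right now. */
	static void flush_tlb_kernel_range_local(unsigned long start,
						 unsigned long end)
	{
		if (end == TLB_FLUSH_ALL ||
		    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
			do_flush_tlb_all(NULL);
		} else {
			struct flush_tlb_info info = { .start = start, .end = end };

			do_kernel_range_flush(&info);
		}
	}

	/*
	 * Mark every other CPU as needing a full kernel TLB flush; the
	 * flush is performed lazily, at that CPU's next context switch.
	 */
	static void xpfo_post_remote_kernel_tlb_flush(void)
	{
		struct cpumask tmp_mask;

		cpumask_setall(&tmp_mask);
		__cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
		cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
	}

With a split like this, a caller (or a reviewer) can at least state the
contract: the local flush is synchronous, and the remote flush is
deferred until the next context switch on each remote CPU.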


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
  2019-04-04  1:42     ` Tycho Andersen
@ 2019-04-04  4:12       ` Andy Lutomirski
  2019-04-04 15:47         ` Tycho Andersen
  0 siblings, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2019-04-04  4:12 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andy Lutomirski, Khalid Aziz, Juerg Haefliger, jsteckli,
	Andi Kleen, liran.alon, Kees Cook, Konrad Rzeszutek Wilk,
	deepa.srinivasan, chris hyser, Tyler Hicks, Woodhouse, David,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, kanth.ghatraju,
	Joao Martins, Jim Mattson, pradeep.vincent, John Haxby,
	Thomas Gleixner, Kirill A. Shutemov, Christoph Hellwig,
	steven.sistare, Laura Abbott, Dave Hansen, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Serge E. Hallyn, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz

On Wed, Apr 3, 2019 at 6:42 PM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> > On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> > >
> > > From: Tycho Andersen <tycho@tycho.ws>
> > >
> > > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> > > that might sleep:
> > >
> >
> >
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index 9d5c75f02295..7891add0913f 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> > >         /* Executive summary in case the body of the oops scrolled away */
> > >         printk(KERN_DEFAULT "CR2: %016lx\n", address);
> > >
> > > +       /*
> > > +        * We're about to oops, which might kill the task. Make sure we're
> > > +        * allowed to sleep.
> > > +        */
> > > +       flags |= X86_EFLAGS_IF;
> > > +
> > >         oops_end(flags, regs, sig);
> > >  }
> > >
> >
> >
> > NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> > rewind_stack_do_exit().
>
> [I trimmed the CC list since google rejected it with E2BIG :)]
>
> I guess the problem is really that do_exit() (or really
> exit_signals()) might sleep. Maybe we should put an irq_enable() at
> the beginning of do_exit() instead and fix this problem for all
> arches?
>

Hmm.  do_exit() isn't really meant to be "try your best to leave the
system somewhat usable without returning" -- it's a function that,
other than in OOPSes, is called from a well-defined state.  So I think
rewind_stack_do_exit() is probably a better spot.  But we need to
rewind the stack and *then* turn on IRQs, since we otherwise risk
exploding quite badly.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-03 17:34 ` [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO) Khalid Aziz
@ 2019-04-04  7:21   ` Peter Zijlstra
  2019-04-04  9:25     ` Peter Zijlstra
  2019-04-04 14:48     ` Tycho Andersen
  2019-04-04  7:43   ` Peter Zijlstra
  2019-04-17 16:15   ` Ingo Molnar
  2 siblings, 2 replies; 70+ messages in thread
From: Peter Zijlstra @ 2019-04-04  7:21 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2c471a2c43fa..d17d33f36a01 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -204,6 +204,14 @@ struct page {
>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  	int _last_cpupid;
>  #endif
> +
> +#ifdef CONFIG_XPFO
> +	/* Counts the number of times this page has been kmapped. */
> +	atomic_t xpfo_mapcount;
> +
> +	/* Serialize kmap/kunmap of this page */
> +	spinlock_t xpfo_lock;

NAK, see ALLOC_SPLIT_PTLOCKS

spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_
sure you want spinlock_t and not raw_spinlock_t ? For
CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex.

> +#endif

Growing the page-frame by 8 bytes (in the good case) is really sad,
that's a _lot_ of memory.

>  } _struct_page_alignment;


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-03 17:34 ` [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO) Khalid Aziz
  2019-04-04  7:21   ` Peter Zijlstra
@ 2019-04-04  7:43   ` Peter Zijlstra
  2019-04-04 15:15     ` Khalid Aziz
  2019-04-17 16:15   ` Ingo Molnar
  2 siblings, 1 reply; 70+ messages in thread
From: Peter Zijlstra @ 2019-04-04  7:43 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz


You must be so glad I no longer use kmap_atomic from NMI context :-)

On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> +static inline void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	if (!PageXpfoUser(page))
> +		return;
> +
> +	/*
> +	 * The page was previously allocated to user space, so
> +	 * map it back into the kernel if needed. No TLB flush required.
> +	 */
> +	spin_lock_irqsave(&page->xpfo_lock, flags);
> +
> +	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
> +		TestClearPageXpfoUnmapped(page))
> +		set_kpte(kaddr, page, PAGE_KERNEL);
> +
> +	spin_unlock_irqrestore(&page->xpfo_lock, flags);

That's a really sad sequence, not wrong, but sad. _3_ atomic operations,
2 on likely the same cacheline. And mostly all pointless.

This patch makes xpfo_mapcount an atomic, but then all modifications are
under the spinlock, what gives?

Anyway, a possibly saner sequence might be:

	if (atomic_inc_not_zero(&page->xpfo_mapcount))
		return;

	spin_lock_irqsave(&page->xpfo_lock, flag);
	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
	    TestClearPageXpfoUnmapped(page))
		set_kpte(kaddr, page, PAGE_KERNEL);
	spin_unlock_irqrestore(&page->xpfo_lock, flags);

> +}
> +
> +static inline void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	if (!PageXpfoUser(page))
> +		return;
> +
> +	/*
> +	 * The page is to be allocated back to user space, so unmap it from
> +	 * the kernel, flush the TLB and tag it as a user page.
> +	 */
> +	spin_lock_irqsave(&page->xpfo_lock, flags);
> +
> +	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
> +#ifdef CONFIG_XPFO_DEBUG
> +		WARN_ON(PageXpfoUnmapped(page));
> +#endif
> +		SetPageXpfoUnmapped(page);
> +		set_kpte(kaddr, page, __pgprot(0));
> +		xpfo_flush_kernel_tlb(page, 0);

You didn't speak about the TLB invalidation anywhere. But basically this
is one that x86 cannot do.

> +	}
> +
> +	spin_unlock_irqrestore(&page->xpfo_lock, flags);

Idem:

	if (atomic_add_unless(&page->xpfo_mapcount, -1, 1))
		return;

	....


> +}

Also I'm failing to see the point of PG_xpfo_unmapped, afaict it
is identical to !atomic_read(&page->xpfo_mapcount).
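Putting the two suggested sequences together, a rough sketch of how
xpfo_kmap()/xpfo_kunmap() could look with the fast paths Peter
describes. This only reorders the code quoted above and keeps the
existing flush and page-flag handling; it does not address the
IRQ-context TLB flush problem or the redundancy of PG_xpfo_unmapped
noted in this review, and it is untested.

	static inline void xpfo_kmap(void *kaddr, struct page *page)
	{
		unsigned long flags;

		if (!static_branch_unlikely(&xpfo_inited) || !PageXpfoUser(page))
			return;

		/* Fast path: the page is already mapped, just take a reference. */
		if (atomic_inc_not_zero(&page->xpfo_mapcount))
			return;

		spin_lock_irqsave(&page->xpfo_lock, flags);
		if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
		    TestClearPageXpfoUnmapped(page))
			set_kpte(kaddr, page, PAGE_KERNEL);
		spin_unlock_irqrestore(&page->xpfo_lock, flags);
	}

	static inline void xpfo_kunmap(void *kaddr, struct page *page)
	{
		unsigned long flags;

		if (!static_branch_unlikely(&xpfo_inited) || !PageXpfoUser(page))
			return;

		/* Fast path: other mappings remain, only drop our reference. */
		if (atomic_add_unless(&page->xpfo_mapcount, -1, 1))
			return;

		spin_lock_irqsave(&page->xpfo_lock, flags);
		if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
			SetPageXpfoUnmapped(page);
			set_kpte(kaddr, page, __pgprot(0));
			xpfo_flush_kernel_tlb(page, 0);
		}
		spin_unlock_irqrestore(&page->xpfo_lock, flags);
	}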


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64
  2019-04-03 17:34 ` [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64 Khalid Aziz
@ 2019-04-04  7:52   ` Peter Zijlstra
  2019-04-04 15:40     ` Khalid Aziz
  0 siblings, 1 reply; 70+ messages in thread
From: Peter Zijlstra @ 2019-04-04  7:52 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

On Wed, Apr 03, 2019 at 11:34:05AM -0600, Khalid Aziz wrote:
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2779ace16d23..5c0e1581fa56 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
>  	return boot_cpu_has_bug(X86_BUG_L1TF);
>  }
>  
> +/*
> + * The current flushing context - we pass it instead of 5 arguments:
> + */
> +struct cpa_data {
> +	unsigned long	*vaddr;
> +	pgd_t		*pgd;
> +	pgprot_t	mask_set;
> +	pgprot_t	mask_clr;
> +	unsigned long	numpages;
> +	unsigned long	curpage;
> +	unsigned long	pfn;
> +	unsigned int	flags;
> +	unsigned int	force_split		: 1,
> +			force_static_prot	: 1;
> +	struct page	**pages;
> +};
> +
> +
> +int
> +should_split_large_page(pte_t *kpte, unsigned long address,
> +			struct cpa_data *cpa);
> +extern spinlock_t cpa_lock;
> +int
> +__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
> +		   struct page *base);
> +

I really hate exposing all that.

>  #include <asm-generic/pgtable.h>
>  #endif	/* __ASSEMBLY__ */
>  

> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
> new file mode 100644
> index 000000000000..3045bb7e4659
> --- /dev/null
> +++ b/arch/x86/mm/xpfo.c
> @@ -0,0 +1,123 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +
> +#include <asm/tlbflush.h>
> +
> +extern spinlock_t cpa_lock;
> +
> +/* Update a single kernel page table entry */
> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
> +{
> +	unsigned int level;
> +	pgprot_t msk_clr;
> +	pte_t *pte = lookup_address((unsigned long)kaddr, &level);
> +
> +	if (unlikely(!pte)) {
> +		WARN(1, "xpfo: invalid address %p\n", kaddr);
> +		return;
> +	}
> +
> +	switch (level) {
> +	case PG_LEVEL_4K:
> +		set_pte_atomic(pte, pfn_pte(page_to_pfn(page),
> +			       canon_pgprot(prot)));

(sorry, do we also need a nikon_pgprot() ? :-)

> +		break;
> +	case PG_LEVEL_2M:
> +	case PG_LEVEL_1G: {
> +		struct cpa_data cpa = { };
> +		int do_split;
> +
> +		if (level == PG_LEVEL_2M)
> +			msk_clr = pmd_pgprot(*(pmd_t *)pte);
> +		else
> +			msk_clr = pud_pgprot(*(pud_t *)pte);
> +
> +		cpa.vaddr = kaddr;
> +		cpa.pages = &page;
> +		cpa.mask_set = prot;
> +		cpa.mask_clr = msk_clr;
> +		cpa.numpages = 1;
> +		cpa.flags = 0;
> +		cpa.curpage = 0;
> +		cpa.force_split = 0;
> +
> +
> +		do_split = should_split_large_page(pte, (unsigned long)kaddr,
> +						   &cpa);
> +		if (do_split) {
> +			struct page *base;
> +
> +			base = alloc_pages(GFP_ATOMIC, 0);
> +			if (!base) {
> +				WARN(1, "xpfo: failed to split large page\n");

You have to be fcking kidding right? A WARN when a GFP_ATOMIC allocation
fails?!

> +				break;
> +			}
> +
> +			if (!debug_pagealloc_enabled())
> +				spin_lock(&cpa_lock);
> +			if  (__split_large_page(&cpa, pte, (unsigned long)kaddr,
> +						base) < 0) {
> +				__free_page(base);
> +				WARN(1, "xpfo: failed to split large page\n");
> +			}
> +			if (!debug_pagealloc_enabled())
> +				spin_unlock(&cpa_lock);
> +		}
> +
> +		break;

Ever heard of helper functions?

> +	}
> +	case PG_LEVEL_512G:
> +		/* fallthrough, splitting infrastructure doesn't
> +		 * support 512G pages.
> +		 */

Broken comment style.

> +	default:
> +		WARN(1, "xpfo: unsupported page level %x\n", level);
> +	}
> +
> +}
> +EXPORT_SYMBOL_GPL(set_kpte);
> +
> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> +{
> +	int level;
> +	unsigned long size, kaddr;
> +
> +	kaddr = (unsigned long)page_address(page);
> +
> +	if (unlikely(!lookup_address(kaddr, &level))) {
> +		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr,
> +		     level);
> +		return;
> +	}
> +
> +	switch (level) {
> +	case PG_LEVEL_4K:
> +		size = PAGE_SIZE;
> +		break;
> +	case PG_LEVEL_2M:
> +		size = PMD_SIZE;
> +		break;
> +	case PG_LEVEL_1G:
> +		size = PUD_SIZE;
> +		break;
> +	default:
> +		WARN(1, "xpfo: unsupported page level %x\n", level);
> +		return;
> +	}
> +
> +	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> +}

You call this from IRQ/IRQ-disabled context... that _CANNOT_ be right.
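On the helper-function point, one way the large-page branch could be
pulled out of set_kpte(). This is a straight rearrangement of the code
quoted above (with the WARN on allocation failure dropped), untested:

	static void set_kpte_split_large(pte_t *pte, void *kaddr,
					 struct page *page, pgprot_t prot,
					 unsigned int level)
	{
		struct cpa_data cpa = { };
		struct page *base;

		cpa.vaddr = kaddr;
		cpa.pages = &page;
		cpa.mask_set = prot;
		cpa.mask_clr = (level == PG_LEVEL_2M) ?
				pmd_pgprot(*(pmd_t *)pte) :
				pud_pgprot(*(pud_t *)pte);
		cpa.numpages = 1;

		if (!should_split_large_page(pte, (unsigned long)kaddr, &cpa))
			return;

		base = alloc_pages(GFP_ATOMIC, 0);
		if (!base)
			return;		/* leave the large mapping as-is */

		if (!debug_pagealloc_enabled())
			spin_lock(&cpa_lock);
		if (__split_large_page(&cpa, pte, (unsigned long)kaddr, base) < 0)
			__free_page(base);
		if (!debug_pagealloc_enabled())
			spin_unlock(&cpa_lock);
	}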


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap
  2019-04-03 17:34 ` [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap Khalid Aziz
@ 2019-04-04  7:56   ` Peter Zijlstra
  2019-04-04 16:06     ` Khalid Aziz
  0 siblings, 1 reply; 70+ messages in thread
From: Peter Zijlstra @ 2019-04-04  7:56 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk,
	deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, aaron.lu, akpm,
	alexander.h.duyck, amir73il, andreyknvl, aneesh.kumar,
	anthony.yznaga, ard.biesheuvel, arnd, arunks, ben, bigeasy, bp,
	brgl, catalin.marinas, corbet, cpandya, daniel.vetter,
	dan.j.williams, gregkh, guro, hannes, hpa, iamjoonsoo.kim,
	james.morse, jannh, jgross, jkosina, jmorris, joe, jrdr.linux,
	jroedel, keith.busch, khlebnikov, logang, marco.antonio.780,
	mark.rutland, mgorman, mhocko, mhocko, mike.kravetz, mingo, mst,
	m.szyprowski, npiggin, osalvador, paulmck, pavel.tatashin,
	rdunlap, richard.weiyang, riel, rientjes, robin.murphy, rostedt,
	rppt, sai.praneeth.prakhya, serge, steve.capper, thymovanbeers,
	vbabka, will.deacon, willy, yang.shi, yaojun8558363, ying.huang,
	zhangshaokun, iommu, x86, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linux-security-module, Khalid Aziz,
	kernel-hardening, Vasileios P . Kemerlis, Juerg Haefliger,
	David Woodhouse

On Wed, Apr 03, 2019 at 11:34:12AM -0600, Khalid Aziz wrote:
> From: Julian Stecklina <jsteckli@amazon.de>
> 
> Only the xpfo_kunmap call that needs to actually unmap the page
> needs to be serialized. We need to be careful to handle the case,
> where after the atomic decrement of the mapcount, a xpfo_kmap
> increased the mapcount again. In this case, we can safely skip
> modifying the page table.
> 
> Model-checked with up to 4 concurrent callers with Spin.
> 
> Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Cc: Khalid Aziz <khalid@gonehiking.org>
> Cc: x86@kernel.org
> Cc: kernel-hardening@lists.openwall.com
> Cc: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
> Cc: Juerg Haefliger <juerg.haefliger@canonical.com>
> Cc: Tycho Andersen <tycho@tycho.ws>
> Cc: Marco Benatto <marco.antonio.780@gmail.com>
> Cc: David Woodhouse <dwmw2@infradead.org>
> ---
>  include/linux/xpfo.h | 24 +++++++++++++++---------
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> index 2318c7eb5fb7..37e7f52fa6ce 100644
> --- a/include/linux/xpfo.h
> +++ b/include/linux/xpfo.h
> @@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page *page)
>  static inline void xpfo_kunmap(void *kaddr, struct page *page)
>  {
>  	unsigned long flags;
> +	bool flush_tlb = false;
>  
>  	if (!static_branch_unlikely(&xpfo_inited))
>  		return;
> @@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page)
>  	 * The page is to be allocated back to user space, so unmap it from
>  	 * the kernel, flush the TLB and tag it as a user page.
>  	 */
> -	spin_lock_irqsave(&page->xpfo_lock, flags);
> -
>  	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
> -#ifdef CONFIG_XPFO_DEBUG
> -		WARN_ON(PageXpfoUnmapped(page));
> -#endif
> -		SetPageXpfoUnmapped(page);
> -		set_kpte(kaddr, page, __pgprot(0));
> -		xpfo_flush_kernel_tlb(page, 0);
> +		spin_lock_irqsave(&page->xpfo_lock, flags);
> +
> +		/*
> +		 * In the case, where we raced with kmap after the
> +		 * atomic_dec_return, we must not nuke the mapping.
> +		 */
> +		if (atomic_read(&page->xpfo_mapcount) == 0) {
> +			SetPageXpfoUnmapped(page);
> +			set_kpte(kaddr, page, __pgprot(0));
> +			flush_tlb = true;
> +		}
> +		spin_unlock_irqrestore(&page->xpfo_lock, flags);
>  	}
>  
> -	spin_unlock_irqrestore(&page->xpfo_lock, flags);
> +	if (flush_tlb)
> +		xpfo_flush_kernel_tlb(page, 0);
>  }

This doesn't help with the TLB invalidation issue, AFAICT this is still
completely buggered. kunmap_atomic() can be called from IRQ context.
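For readers following along: the reason IRQ context is fatal here is
that the flush path ends up doing synchronous cross-CPU work. Roughly
(an illustrative call chain, not an exact trace):

	/*
	 * kunmap_atomic()                      IRQ context / IRQs off
	 *   -> xpfo_kunmap()
	 *        -> set_kpte(..., __pgprot(0))
	 *        -> xpfo_flush_kernel_tlb()
	 *             -> flush_tlb_kernel_range()
	 *                  -> on_each_cpu(do_kernel_range_flush, ...)
	 *                       -> smp_call_function_many()
	 *                            waits for other CPUs to ack the
	 *                            IPI; doing that with interrupts
	 *                            off (or from an interrupt
	 *                            handler) can deadlock.
	 */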


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-03 17:34 ` [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only) Khalid Aziz
  2019-04-04  4:10   ` Andy Lutomirski
@ 2019-04-04  8:18   ` Peter Zijlstra
  1 sibling, 0 replies; 70+ messages in thread
From: Peter Zijlstra @ 2019-04-04  8:18 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk,
	deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, aaron.lu, akpm,
	alexander.h.duyck, amir73il, andreyknvl, aneesh.kumar,
	anthony.yznaga, ard.biesheuvel, arnd, arunks, ben, bigeasy, bp,
	brgl, catalin.marinas, corbet, cpandya, daniel.vetter,
	dan.j.williams, gregkh, guro, hannes, hpa, iamjoonsoo.kim,
	james.morse, jannh, jgross, jkosina, jmorris, joe, jrdr.linux,
	jroedel, keith.busch, khlebnikov, logang, marco.antonio.780,
	mark.rutland, mgorman, mhocko, mhocko, mike.kravetz, mingo, mst,
	m.szyprowski, npiggin, osalvador, paulmck, pavel.tatashin,
	rdunlap, richard.weiyang, riel, rientjes, robin.murphy, rostedt,
	rppt, sai.praneeth.prakhya, serge, steve.capper, thymovanbeers,
	vbabka, will.deacon, willy, yang.shi, yaojun8558363, ying.huang,
	zhangshaokun, iommu, x86, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linux-security-module, Khalid Aziz

On Wed, Apr 03, 2019 at 11:34:13AM -0600, Khalid Aziz wrote:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 999d6d8f0bef..cc806a01a0eb 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -37,6 +37,20 @@
>   */
>  #define LAST_USER_MM_IBPB	0x1UL
>  
> +/*
> + * A TLB flush may be needed to flush stale TLB entries
> + * for pages that have been mapped into userspace and unmapped
> + * from kernel space. This TLB flush needs to be propagated to
> + * all CPUs. Asynchronous flush requests to all CPUs can cause
> + * significant performance imapct. Queue a pending flush for
> + * a CPU instead. Multiple of these requests can then be handled
> + * by a CPU at a less disruptive time, like context switch, in
> + * one go and reduce performance impact significantly. Following
> + * data structure is used to keep track of CPUs with pending full
> + * TLB flush forced by xpfo.
> + */
> +static cpumask_t pending_xpfo_flush;
> +
>  /*
>   * We get here when we do something requiring a TLB invalidation
>   * but could not go invalidate all of the contexts.  We do the
> @@ -321,6 +335,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		__flush_tlb_all();
>  	}
>  #endif
> +
> +	/*
> +	 * If there is a pending TLB flush for this CPU due to XPFO
> +	 * flush, do it now.
> +	 */
> +	if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
> +		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> +		__flush_tlb_all();
> +	}

That really should be:

	if (cpumask_test_cpu(cpu, &pending_xpfo_flush)) {
		cpumask_clear_cpu(cpu, &pending_xpfo_flush);
		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
		__flush_tlb_all();
	}

test_and_clear is an unconditional RmW and can cause cacheline
contention between adjacent CPUs even if none of the bits are set.

> +
>  	this_cpu_write(cpu_tlbstate.is_lazy, false);
>  
>  	/*
> @@ -803,6 +827,34 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
>  	}
>  }
>  
> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
> +{
> +	struct cpumask tmp_mask;
> +
> +	/*
> +	 * Balance as user space task's flush, a bit conservative.
> +	 * Do a local flush immediately and post a pending flush on all
> +	 * other CPUs. Local flush can be a range flush or full flush
> +	 * depending upon the number of entries to be flushed. Remote
> +	 * flushes will be done by individual processors at the time of
> +	 * context switch and this allows multiple flush requests from
> +	 * other CPUs to be batched together.
> +	 */
> +	if (end == TLB_FLUSH_ALL ||
> +	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
> +		do_flush_tlb_all(NULL);
> +	} else {
> +		struct flush_tlb_info info;
> +
> +		info.start = start;
> +		info.end = end;
> +		do_kernel_range_flush(&info);
> +	}
> +	cpumask_setall(&tmp_mask);
> +	__cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
> +	cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
> +}
> +
>  void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  {
>  	struct flush_tlb_info info = {
> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
> index b42513347865..638eee5b1f09 100644
> --- a/arch/x86/mm/xpfo.c
> +++ b/arch/x86/mm/xpfo.c
> @@ -118,7 +118,7 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>  		return;
>  	}
>  
> -	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> +	xpfo_flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>  }
>  EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);

So this patch is the one that makes it 'work', but I'm with Andy on
hating it something fierce.

Up until this point x86_64 is completely buggered in this series, after
this it sorta works but *urgh* what crap.

All in all your changelog is complete and utter garbage; this is _NOT_ a
performance issue. It is very much a correctness issue.

Also, I distinctly dislike the inconsistent TLB states this generates.
It makes it very hard to argue for its correctness.
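To spell out the inconsistency being objected to (an illustration of
the behaviour of the quoted patch, not new code):

	/*
	 *   CPU A                              CPU B
	 *   -----                              -----
	 *   page handed to userspace
	 *   set_kpte(kaddr, page, __pgprot(0))
	 *   xpfo_flush_tlb_kernel_range()
	 *     -> flushes locally only
	 *     -> sets CPU B in pending_xpfo_flush
	 *                                      still holds a stale kernel
	 *                                      TLB entry for kaddr; a
	 *                                      ret2dir-style dereference
	 *                                      of kaddr still works here
	 *                                      ...
	 *                                      switch_mm_irqs_off()
	 *                                        -> __flush_tlb_all()
	 *                                      stale entry finally gone
	 *
	 * Until CPU B context-switches, the protection XPFO is supposed
	 * to provide does not hold on CPU B.
	 */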


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-04  7:21   ` Peter Zijlstra
@ 2019-04-04  9:25     ` Peter Zijlstra
  2019-04-04 14:48     ` Tycho Andersen
  1 sibling, 0 replies; 70+ messages in thread
From: Peter Zijlstra @ 2019-04-04  9:25 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

On Thu, Apr 04, 2019 at 09:21:52AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 2c471a2c43fa..d17d33f36a01 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -204,6 +204,14 @@ struct page {
> >  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> >  	int _last_cpupid;
> >  #endif
> > +
> > +#ifdef CONFIG_XPFO
> > +	/* Counts the number of times this page has been kmapped. */
> > +	atomic_t xpfo_mapcount;
> > +
> > +	/* Serialize kmap/kunmap of this page */
> > +	spinlock_t xpfo_lock;
> 
> NAK, see ALLOC_SPLIT_PTLOCKS
> 
> spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_
> sure you want spinlock_t and not raw_spinlock_t ? For
> CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex.
> 
> > +#endif
> 
> Growing the page-frame by 8 bytes (in the good case) is really sad,
> that's a _lot_ of memory.

So if you use the original kmap_atomic/kmap code from i386 and create
an alias per user you can do away with all that.

Now, that leaves you with the fixmap kmap_atomic code, which I also
hate, but it gets rid of a lot of the ugly you introduce in these here
patches.

As to the fixmap kmap_atomic; so fundamentally the PTEs are only used on
a single CPU and therefore CPU local TLB invalidation _should_ suffice.

However, speculation...

Another CPU can speculatively hit upon a fixmap entry for another CPU
and populate its own TLB entry. Then the TLB invalidate is
insufficient; it leaves a stale entry in a remote TLB.

If the remote CPU then re-uses that fixmap slot to alias another page,
we have two CPUs with different translations for the same VA, a
condition that AMD CPUs dislike enough to machine check on (IIRC).

Actually hitting that is incredibly difficult (we have to have
speculation, fixmap reuse and not get a full TLB invalidate in between),
but, afaict, not impossible.

Your monstrosity from the last patch avoids this particular issue by
not aliasing in this manner, but it comes at the cost of this
page-frame bloat. Also, I'm still not sure there aren't other problems
with it.

Bah..
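For reference, the per-CPU alias scheme being described is roughly the
shape below. All names here (per_cpu_alias_slot(), set_alias_pte()) are
hypothetical placeholders, not existing kernel APIs; the sketch is only
meant to show why a local-only invalidation would normally suffice, and
where the speculative-fill caveat bites:

	/* one (or a small stack of) dedicated alias slots per CPU */
	static void *xpfo_map_alias(struct page *page)
	{
		unsigned long vaddr;

		preempt_disable();
		vaddr = per_cpu_alias_slot(smp_processor_id());
		set_alias_pte(vaddr, page, PAGE_KERNEL);
		return (void *)vaddr;
	}

	static void xpfo_unmap_alias(void *kaddr)
	{
		set_alias_pte((unsigned long)kaddr, NULL, __pgprot(0));
		/*
		 * The slot was only ever used from this CPU, so in
		 * principle a local invalidation is enough ...
		 */
		__flush_tlb_one_kernel((unsigned long)kaddr);
		/*
		 * ... except that another CPU may have speculatively
		 * filled a TLB entry for this VA, so reusing the slot
		 * for a different page needs extra care (or an
		 * occasional global flush).
		 */
		preempt_enable();
	}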


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-04  7:21   ` Peter Zijlstra
  2019-04-04  9:25     ` Peter Zijlstra
@ 2019-04-04 14:48     ` Tycho Andersen
  1 sibling, 0 replies; 70+ messages in thread
From: Tycho Andersen @ 2019-04-04 14:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Khalid Aziz, juergh, jsteckli, ak, liran.alon, keescook,
	konrad.wilk, Juerg Haefliger, deepa.srinivasan, chris.hyser,
	tyhicks, dwmw, andrew.cooper3, jcm, boris.ostrovsky,
	kanth.ghatraju, joao.m.martins, jmattson, pradeep.vincent,
	john.haxby, tglx, kirill.shutemov, hch, steven.sistare, labbott,
	luto, dave.hansen, aaron.lu, akpm, alexander.h.duyck, amir73il,
	andreyknvl, aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd,
	arunks, ben, bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, vbabka, will.deacon, willy, iommu, x86, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm, linux-security-module,
	Khalid Aziz

On Thu, Apr 04, 2019 at 09:21:52AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 2c471a2c43fa..d17d33f36a01 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -204,6 +204,14 @@ struct page {
> >  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> >  	int _last_cpupid;
> >  #endif
> > +
> > +#ifdef CONFIG_XPFO
> > +	/* Counts the number of times this page has been kmapped. */
> > +	atomic_t xpfo_mapcount;
> > +
> > +	/* Serialize kmap/kunmap of this page */
> > +	spinlock_t xpfo_lock;
> 
> NAK, see ALLOC_SPLIT_PTLOCKS
> 
> spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_
> sure you want spinlock_t and not raw_spinlock_t ? For
> CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex.
> 
> > +#endif
> 
> Growing the page-frame by 8 bytes (in the good case) is really sad,
> that's a _lot_ of memory.

Originally we had this in page_ext; it's not really clear to me why we
moved it out.

Julien?

Tycho


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-04  7:43   ` Peter Zijlstra
@ 2019-04-04 15:15     ` Khalid Aziz
  2019-04-04 17:01       ` Peter Zijlstra
  0 siblings, 1 reply; 70+ messages in thread
From: Khalid Aziz @ 2019-04-04 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

On 4/4/19 1:43 AM, Peter Zijlstra wrote:
> 
> You must be so glad I no longer use kmap_atomic from NMI context :-)
> 
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
>> +static inline void xpfo_kmap(void *kaddr, struct page *page)
>> +{
>> +	unsigned long flags;
>> +
>> +	if (!static_branch_unlikely(&xpfo_inited))
>> +		return;
>> +
>> +	if (!PageXpfoUser(page))
>> +		return;
>> +
>> +	/*
>> +	 * The page was previously allocated to user space, so
>> +	 * map it back into the kernel if needed. No TLB flush required.
>> +	 */
>> +	spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
>> +		TestClearPageXpfoUnmapped(page))
>> +		set_kpte(kaddr, page, PAGE_KERNEL);
>> +
>> +	spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
> That's a really sad sequence, not wrong, but sad. _3_ atomic operations,
> 2 on likely the same cacheline. And mostly all pointless.
> 
> This patch makes xpfo_mapcount an atomic, but then all modifications are
> under the spinlock, what gives?
> 
> Anyway, a possibly saner sequence might be:
> 
> 	if (atomic_inc_not_zero(&page->xpfo_mapcount))
> 		return;
> 
> 	spin_lock_irqsave(&page->xpfo_lock, flag);
> 	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
> 	    TestClearPageXpfoUnmapped(page))
> 		set_kpte(kaddr, page, PAGE_KERNEL);
> 	spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
>> +}
>> +
>> +static inline void xpfo_kunmap(void *kaddr, struct page *page)
>> +{
>> +	unsigned long flags;
>> +
>> +	if (!static_branch_unlikely(&xpfo_inited))
>> +		return;
>> +
>> +	if (!PageXpfoUser(page))
>> +		return;
>> +
>> +	/*
>> +	 * The page is to be allocated back to user space, so unmap it from
>> +	 * the kernel, flush the TLB and tag it as a user page.
>> +	 */
>> +	spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
>> +#ifdef CONFIG_XPFO_DEBUG
>> +		WARN_ON(PageXpfoUnmapped(page));
>> +#endif
>> +		SetPageXpfoUnmapped(page);
>> +		set_kpte(kaddr, page, __pgprot(0));
>> +		xpfo_flush_kernel_tlb(page, 0);
> 
> You didn't speak about the TLB invalidation anywhere. But basically this
> is one that x86 cannot do.
> 
>> +	}
>> +
>> +	spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
> Idem:
> 
> 	if (atomic_add_unless(&page->xpfo_mapcount, -1, 1))
> 		return;
> 
> 	....
> 
> 
>> +}
> 
> Also I'm failing to see the point of PG_xpfo_unmapped, afaict it
> is identical to !atomic_read(&page->xpfo_mapcount).
> 

Thanks Peter. I really appreciate your review. Your feedback helps make
this code better and closer to where I can feel comfortable not calling
it RFC any more.

The more I look at the xpfo_kmap()/xpfo_kunmap() code, the more
uncomfortable I get with it. As you pointed out, kmap_atomic can be
called from NMI context, which makes this code look even worse. I
pointed out one problem with this code in the cover letter and
suggested a rewrite. I see these problems with this code:

1. When xpfo_kmap maps a page back into the physmap, it reopens the
ret2dir attack window, even if only for the duration of the kmap. A
kmap can stay around for some time if the page is being used for I/O.

2. This code uses a spinlock, which leads to problems. If it does not
disable IRQs, it is exposed to deadlock around xpfo_lock. If it
disables IRQs, I think it can still deadlock around pgd_lock.

I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map
the page at a new virtual address, similar to what kmap_high does for
i386. This avoids re-opening the ret2dir security hole. We could also
possibly do away with xpfo_lock, saving bytes in the page frame, and
the not-so-sane code sequence could go away.
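A very rough sketch of that direction, to make it concrete. All of the
xpfo_* helpers and xpfo_pkmap_lock below are hypothetical, and note
that the idea implies kmap() would hand back the alias address rather
than the physmap address:

	static void *xpfo_kmap_alias(struct page *page)
	{
		void *vaddr;

		if (!PageXpfoUser(page))
			return page_address(page);

		spin_lock(&xpfo_pkmap_lock);
		/* reuse an existing alias if the page is already mapped */
		vaddr = xpfo_find_alias(page);
		if (!vaddr)
			vaddr = xpfo_map_new_alias(page);  /* pkmap-style pool */
		xpfo_alias_get(page);
		spin_unlock(&xpfo_pkmap_lock);

		/* the physmap entry for this page stays unmapped throughout */
		return vaddr;
	}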

Good point about the PG_xpfo_unmapped flag. You are right that it can
be replaced with !atomic_read(&page->xpfo_mapcount).

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64
  2019-04-04  7:52   ` Peter Zijlstra
@ 2019-04-04 15:40     ` Khalid Aziz
  0 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-04 15:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, akpm,
	alexander.h.duyck, amir73il, aneesh.kumar, anthony.yznaga,
	ard.biesheuvel, arnd, bigeasy, bp, brgl, catalin.marinas, corbet,
	dan.j.williams, gregkh, guro, hannes, hpa, iamjoonsoo.kim,
	james.morse, jannh, jkosina, jmorris, joe, jrdr.linux, jroedel,
	keith.busch, khlebnikov, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, npiggin, paulmck, pavel.tatashin,
	rdunlap, richard.weiyang, riel, rientjes, rostedt, rppt,
	will.deacon, willy, yaojun8558363, ying.huang, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

[trimmed To: and Cc:; it was too large]

On 4/4/19 1:52 AM, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:05AM -0600, Khalid Aziz wrote:
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 2779ace16d23..5c0e1581fa56 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
>>  	return boot_cpu_has_bug(X86_BUG_L1TF);
>>  }
>>  
>> +/*
>> + * The current flushing context - we pass it instead of 5 arguments:
>> + */
>> +struct cpa_data {
>> +	unsigned long	*vaddr;
>> +	pgd_t		*pgd;
>> +	pgprot_t	mask_set;
>> +	pgprot_t	mask_clr;
>> +	unsigned long	numpages;
>> +	unsigned long	curpage;
>> +	unsigned long	pfn;
>> +	unsigned int	flags;
>> +	unsigned int	force_split		: 1,
>> +			force_static_prot	: 1;
>> +	struct page	**pages;
>> +};
>> +
>> +
>> +int
>> +should_split_large_page(pte_t *kpte, unsigned long address,
>> +			struct cpa_data *cpa);
>> +extern spinlock_t cpa_lock;
>> +int
>> +__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
>> +		   struct page *base);
>> +
> 
> I really hate exposing all that.

I believe this was done so set_kpte() could split large pages if needed.
I will look into creating a helper function instead so this does not
have to be exposed.

> 
>>  #include <asm-generic/pgtable.h>
>>  #endif	/* __ASSEMBLY__ */
>>  
> 
>> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
>> new file mode 100644
>> index 000000000000..3045bb7e4659
>> --- /dev/null
>> +++ b/arch/x86/mm/xpfo.c
>> @@ -0,0 +1,123 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + *
>> + * Authors:
>> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
>> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include <linux/mm.h>
>> +
>> +#include <asm/tlbflush.h>
>> +
>> +extern spinlock_t cpa_lock;
>> +
>> +/* Update a single kernel page table entry */
>> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
>> +{
>> +	unsigned int level;
>> +	pgprot_t msk_clr;
>> +	pte_t *pte = lookup_address((unsigned long)kaddr, &level);
>> +
>> +	if (unlikely(!pte)) {
>> +		WARN(1, "xpfo: invalid address %p\n", kaddr);
>> +		return;
>> +	}
>> +
>> +	switch (level) {
>> +	case PG_LEVEL_4K:
>> +		set_pte_atomic(pte, pfn_pte(page_to_pfn(page),
>> +			       canon_pgprot(prot)));
> 
> (sorry, do we also need a nikon_pgprot() ? :-)

Are we trying to encourage nikon as well to sponsor this patch :)

> 
>> +		break;
>> +	case PG_LEVEL_2M:
>> +	case PG_LEVEL_1G: {
>> +		struct cpa_data cpa = { };
>> +		int do_split;
>> +
>> +		if (level == PG_LEVEL_2M)
>> +			msk_clr = pmd_pgprot(*(pmd_t *)pte);
>> +		else
>> +			msk_clr = pud_pgprot(*(pud_t *)pte);
>> +
>> +		cpa.vaddr = kaddr;
>> +		cpa.pages = &page;
>> +		cpa.mask_set = prot;
>> +		cpa.mask_clr = msk_clr;
>> +		cpa.numpages = 1;
>> +		cpa.flags = 0;
>> +		cpa.curpage = 0;
>> +		cpa.force_split = 0;
>> +
>> +
>> +		do_split = should_split_large_page(pte, (unsigned long)kaddr,
>> +						   &cpa);
>> +		if (do_split) {
>> +			struct page *base;
>> +
>> +			base = alloc_pages(GFP_ATOMIC, 0);
>> +			if (!base) {
>> +				WARN(1, "xpfo: failed to split large page\n");
> 
> You have to be fcking kidding right? A WARN when a GFP_ATOMIC allocation
> fails?!
> 

Not sure what the reasoning was for this WARN in the original patch,
but I think it is trying to warn about the failure to split the large
page (and hence to unmap this single page), as opposed to warning about
the allocation failure itself. Nevertheless, this could be done better.

>> +				break;
>> +			}
>> +
>> +			if (!debug_pagealloc_enabled())
>> +				spin_lock(&cpa_lock);
>> +			if  (__split_large_page(&cpa, pte, (unsigned long)kaddr,
>> +						base) < 0) {
>> +				__free_page(base);
>> +				WARN(1, "xpfo: failed to split large page\n");
>> +			}
>> +			if (!debug_pagealloc_enabled())
>> +				spin_unlock(&cpa_lock);
>> +		}
>> +
>> +		break;
> 
> Ever heard of helper functions?

Good idea. I will see if this all can be done in a helper function instead.

> 
>> +	}
>> +	case PG_LEVEL_512G:
>> +		/* fallthrough, splitting infrastructure doesn't
>> +		 * support 512G pages.
>> +		 */
> 
> Broken coment style.

Slipped by me. Thanks, I will fix that.

> 
>> +	default:
>> +		WARN(1, "xpfo: unsupported page level %x\n", level);
>> +	}
>> +
>> +}
>> +EXPORT_SYMBOL_GPL(set_kpte);
>> +
>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>> +{
>> +	int level;
>> +	unsigned long size, kaddr;
>> +
>> +	kaddr = (unsigned long)page_address(page);
>> +
>> +	if (unlikely(!lookup_address(kaddr, &level))) {
>> +		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr,
>> +		     level);
>> +		return;
>> +	}
>> +
>> +	switch (level) {
>> +	case PG_LEVEL_4K:
>> +		size = PAGE_SIZE;
>> +		break;
>> +	case PG_LEVEL_2M:
>> +		size = PMD_SIZE;
>> +		break;
>> +	case PG_LEVEL_1G:
>> +		size = PUD_SIZE;
>> +		break;
>> +	default:
>> +		WARN(1, "xpfo: unsupported page level %x\n", level);
>> +		return;
>> +	}
>> +
>> +	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>> +}
> 
> You call this from IRQ/IRQ-disabled context... that _CANNOT_ be right.
> 

Another reason why the current implementation of xpfo_kmap/xpfo_kunmap
does not look right. I am leaning more and more towards rewriting this
to be similar to kmap_high, while taking into account your input on the
fixmap kmap_atomic.

Side note: jsteckli@amazon.de is bouncing. Julian wrote quite a bit of
code in these patches. If anyone has Julian's current email, it would be
appreciated. Getting his feedback on these discussions will be useful.

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
  2019-04-04  4:12       ` Andy Lutomirski
@ 2019-04-04 15:47         ` Tycho Andersen
  2019-04-04 16:23           ` Sebastian Andrzej Siewior
  2019-04-04 16:28           ` Thomas Gleixner
  0 siblings, 2 replies; 70+ messages in thread
From: Tycho Andersen @ 2019-04-04 15:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Khalid Aziz, Juerg Haefliger, jsteckli, Andi Kleen, liran.alon,
	Kees Cook, Konrad Rzeszutek Wilk, deepa.srinivasan, chris hyser,
	Tyler Hicks, Woodhouse, David, Andrew Cooper, Jon Masters,
	Boris Ostrovsky, kanth.ghatraju, Joao Martins, Jim Mattson,
	pradeep.vincent, John Haxby, Thomas Gleixner, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Dave Hansen,
	Peter Zijlstra, Aaron Lu, Andrew Morton, alexander.h.duyck,
	Amir Goldstein, Andrey Konovalov, aneesh.kumar, anthony.yznaga,
	Ard Biesheuvel, Arnd Bergmann, arunks, Ben Hutchings,
	Sebastian Andrzej Siewior, Borislav Petkov, brgl,
	Catalin Marinas, Jonathan Corbet, cpandya, Daniel Vetter,
	Dan Williams, Greg KH, Roman Gushchin, Johannes Weiner,
	H. Peter Anvin, Joonsoo Kim, James Morse, Jann Horn,
	Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Serge E. Hallyn, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz

On Wed, Apr 03, 2019 at 09:12:16PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 3, 2019 at 6:42 PM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> > > On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> > > >
> > > > From: Tycho Andersen <tycho@tycho.ws>
> > > >
> > > > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> > > > that might sleep:
> > > >
> > >
> > >
> > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > index 9d5c75f02295..7891add0913f 100644
> > > > --- a/arch/x86/mm/fault.c
> > > > +++ b/arch/x86/mm/fault.c
> > > > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> > > >         /* Executive summary in case the body of the oops scrolled away */
> > > >         printk(KERN_DEFAULT "CR2: %016lx\n", address);
> > > >
> > > > +       /*
> > > > +        * We're about to oops, which might kill the task. Make sure we're
> > > > +        * allowed to sleep.
> > > > +        */
> > > > +       flags |= X86_EFLAGS_IF;
> > > > +
> > > >         oops_end(flags, regs, sig);
> > > >  }
> > > >
> > >
> > >
> > > NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> > > rewind_stack_do_exit().
> >
> > [I trimmed the CC list since google rejected it with E2BIG :)]
> >
> > I guess the problem is really that do_exit() (or really
> > exit_signals()) might sleep. Maybe we should put an irq_enable() at
> > the beginning of do_exit() instead and fix this problem for all
> > arches?
> >
> 
> Hmm.  do_exit() isn't really meant to be "try your best to leave the
> system somewhat usable without returning" -- it's a function that,
> other than in OOPSes, is called from a well-defined state.  So I think
> rewind_stack_do_exit() is probably a better spot.  But we need to
> rewind the stack and *then* turn on IRQs, since we otherwise risk
> exploding quite badly.

Ok, sounds good. I guess we can include something like this patch in
the next series.

Thanks,

Tycho


From 34dce229a4f43f90db823671eb0b8da7c4906045 Mon Sep 17 00:00:00 2001
From: Tycho Andersen <tycho@tycho.ws>
Date: Thu, 4 Apr 2019 09:41:32 -0600
Subject: [PATCH] x86/entry: re-enable interrupts before exiting

If the kernel oopses in an interrupt, nothing re-enables interrupts:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from invalid context at
./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, pid: 1970, name:
lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: lkdtm_xpfo_test Tainted: G      D
4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

do_exit() expects to be called in a well-defined environment, so let's
re-enable interrupts after unwinding the stack, in case they were disabled.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
---
 arch/x86/entry/entry_32.S | 6 ++++++
 arch/x86/entry/entry_64.S | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..8ddb7b41669d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1507,6 +1507,12 @@ ENTRY(rewind_stack_do_exit)
 	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
 	leal	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
 
+	/*
+	 * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
+	 * them back on before going back to "normal" code.
+	 */
+	sti
+
 	call	do_exit
 1:	jmp 1b
 END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..c0759f3e3ad2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1672,5 +1672,11 @@ ENTRY(rewind_stack_do_exit)
 	leaq	-PTREGS_SIZE(%rax), %rsp
 	UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
 
+	/*
+	 * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
+	 * them back on before going back to "normal" code.
+	 */
+	sti
+
 	call	do_exit
 END(rewind_stack_do_exit)
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap
  2019-04-04  7:56   ` Peter Zijlstra
@ 2019-04-04 16:06     ` Khalid Aziz
  0 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-04 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk,
	deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, kanth.ghatraju, joao.m.martins, jmattson,
	pradeep.vincent, john.haxby, tglx, kirill.shutemov, hch,
	steven.sistare, labbott, luto, dave.hansen, akpm,
	alexander.h.duyck, aneesh.kumar, arnd, bigeasy, bp,
	catalin.marinas, corbet, dan.j.williams, gregkh, guro, hannes,
	hpa, iamjoonsoo.kim, james.morse, jannh, jkosina, jmorris, joe,
	jrdr.linux, jroedel, keith.busch, khlebnikov, mark.rutland,
	mgorman, Michal Hocko, mike.kravetz, mingo, mst, npiggin,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, rostedt, rppt, will.deacon, willy, yaojun8558363,
	ying.huang, iommu, x86, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linux-security-module, Khalid Aziz,
	kernel-hardening, Vasileios P . Kemerlis

On 4/4/19 1:56 AM, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:12AM -0600, Khalid Aziz wrote:
>> From: Julian Stecklina <jsteckli@amazon.de>
>>
>> Only the xpfo_kunmap call that needs to actually unmap the page
>> needs to be serialized. We need to be careful to handle the case,
>> where after the atomic decrement of the mapcount, a xpfo_kmap
>> increased the mapcount again. In this case, we can safely skip
>> modifying the page table.
>>
>> Model-checked with up to 4 concurrent callers with Spin.
>>
>> Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> Cc: Khalid Aziz <khalid@gonehiking.org>
>> Cc: x86@kernel.org
>> Cc: kernel-hardening@lists.openwall.com
>> Cc: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
>> Cc: Juerg Haefliger <juerg.haefliger@canonical.com>
>> Cc: Tycho Andersen <tycho@tycho.ws>
>> Cc: Marco Benatto <marco.antonio.780@gmail.com>
>> Cc: David Woodhouse <dwmw2@infradead.org>
>> ---
>>  include/linux/xpfo.h | 24 +++++++++++++++---------
>>  1 file changed, 15 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
>> index 2318c7eb5fb7..37e7f52fa6ce 100644
>> --- a/include/linux/xpfo.h
>> +++ b/include/linux/xpfo.h
>> @@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page *page)
>>  static inline void xpfo_kunmap(void *kaddr, struct page *page)
>>  {
>>  	unsigned long flags;
>> +	bool flush_tlb = false;
>>  
>>  	if (!static_branch_unlikely(&xpfo_inited))
>>  		return;
>> @@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page)
>>  	 * The page is to be allocated back to user space, so unmap it from
>>  	 * the kernel, flush the TLB and tag it as a user page.
>>  	 */
>> -	spin_lock_irqsave(&page->xpfo_lock, flags);
>> -
>>  	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
>> -#ifdef CONFIG_XPFO_DEBUG
>> -		WARN_ON(PageXpfoUnmapped(page));
>> -#endif
>> -		SetPageXpfoUnmapped(page);
>> -		set_kpte(kaddr, page, __pgprot(0));
>> -		xpfo_flush_kernel_tlb(page, 0);
>> +		spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +		/*
>> +		 * In the case, where we raced with kmap after the
>> +		 * atomic_dec_return, we must not nuke the mapping.
>> +		 */
>> +		if (atomic_read(&page->xpfo_mapcount) == 0) {
>> +			SetPageXpfoUnmapped(page);
>> +			set_kpte(kaddr, page, __pgprot(0));
>> +			flush_tlb = true;
>> +		}
>> +		spin_unlock_irqrestore(&page->xpfo_lock, flags);
>>  	}
>>  
>> -	spin_unlock_irqrestore(&page->xpfo_lock, flags);
>> +	if (flush_tlb)
>> +		xpfo_flush_kernel_tlb(page, 0);
>>  }
> 
> This doesn't help with the TLB invalidation issue, AFAICT this is still
> completely buggered. kunmap_atomic() can be called from IRQ context.
> 

OK. xpfo_kmap/xpfo_kunmap need redesign.

--
Khalid


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
  2019-04-04 15:47         ` Tycho Andersen
@ 2019-04-04 16:23           ` Sebastian Andrzej Siewior
  2019-04-04 16:28           ` Thomas Gleixner
  1 sibling, 0 replies; 70+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-04-04 16:23 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andy Lutomirski, Khalid Aziz, iommu, X86 ML, LKML, Linux-MM, Khalid Aziz

- stepping on the del button while browsing through CCs.
On 2019-04-04 09:47:27 [-0600], Tycho Andersen wrote:
> > Hmm.  do_exit() isn't really meant to be "try your best to leave the
> > system somewhat usable without returning" -- it's a function that,
> > other than in OOPSes, is called from a well-defined state.  So I think
> > rewind_stack_do_exit() is probably a better spot.  But we need to
> > rewind the stack and *then* turn on IRQs, since we otherwise risk
> > exploding quite badly.
> 
> Ok, sounds good. I guess we can include something like this patch in
> the next series.

The tracing infrastructure probably doesn't know that interrupts are
back on. Also, if you were holding a spin lock, then your preempt count
isn't 0, which means that might_sleep() will trigger a splat (in your
backtrace it was zero).

> Thanks,
> 
> Tycho
Sebastian


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
  2019-04-04 15:47         ` Tycho Andersen
  2019-04-04 16:23           ` Sebastian Andrzej Siewior
@ 2019-04-04 16:28           ` Thomas Gleixner
  2019-04-04 17:11             ` Andy Lutomirski
  1 sibling, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2019-04-04 16:28 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andy Lutomirski, Khalid Aziz, Juerg Haefliger, jsteckli,
	Andi Kleen, liran.alon, Kees Cook, Konrad Rzeszutek Wilk,
	deepa.srinivasan, chris hyser, Tyler Hicks, Woodhouse, David,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, kanth.ghatraju,
	Joao Martins, Jim Mattson, pradeep.vincent, John Haxby,
	Kirill A. Shutemov, Christoph Hellwig, steven.sistare,
	Laura Abbott, Dave Hansen, Peter Zijlstra, Aaron Lu,
	Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Serge E. Hallyn, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz

On Thu, 4 Apr 2019, Tycho Andersen wrote:
>  	leaq	-PTREGS_SIZE(%rax), %rsp
>  	UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
>  
> +	/*
> +	 * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
> +	 * them back on before going back to "normal" code.
> +	 */
> +	sti

That breaks the paravirt muck and tracing/lockdep.

ENABLE_INTERRUPTS() is what you want, plus TRACE_IRQS_ON to keep the tracer
and lockdep happy.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (12 preceding siblings ...)
  2019-04-03 17:34 ` [RFC PATCH v9 13/13] xpfo, mm: Optimize XPFO TLB flushes by batching them together Khalid Aziz
@ 2019-04-04 16:44 ` Nadav Amit
  2019-04-04 17:18   ` Khalid Aziz
  2019-04-06  6:40 ` Jon Masters
  14 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2019-04-04 16:44 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: X86 ML, linux-arm-kernel, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List

> On Apr 3, 2019, at 10:34 AM, Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
> This is another update to the work Juerg, Tycho and Julian have
> done on XPFO.

Interesting work, but note that it triggers a warning on my system due to
possible deadlock. It seems that the patch-set disables IRQs in
xpfo_kunmap() and then might flush remote TLBs when a large page is split.
This is wrong, since it might lead to deadlocks.


[  947.262208] WARNING: CPU: 6 PID: 9892 at kernel/smp.c:416 smp_call_function_many+0x92/0x250
[  947.263767] Modules linked in: sb_edac vmw_balloon crct10dif_pclmul crc32_pclmul joydev ghash_clmulni_intel input_leds intel_rapl_perf serio_raw mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core vmw_vsock_vmci_transport vsock vmw_vmci iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear hid_generic usbhid hid vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm aesni_intel psmouse aes_x86_64 crypto_simd cryptd glue_helper mptspi vmxnet3 scsi_transport_spi mptscsih ahci mptbase libahci i2c_piix4 pata_acpi
[  947.274649] CPU: 6 PID: 9892 Comm: cc1 Not tainted 5.0.0+ #7
[  947.275804] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/28/2017
[  947.277704] RIP: 0010:smp_call_function_many+0x92/0x250
[  947.278640] Code: 3b 05 66 fc 4e 01 72 26 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 8b 05 2b cc 7e 01 85 c0 75 bf 80 3d a8 99 4e 01 00 75 b6 <0f> 0b eb b2 44 89 c7 48 c7 c2 a0 9a 61 aa 4c 89 fe 44 89 45 d0 e8
[  947.281895] RSP: 0000:ffffafe04538f970 EFLAGS: 00010046
[  947.282821] RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000001
[  947.284084] RDX: 0000000000000000 RSI: ffffffffa9078d70 RDI: ffffffffaa619aa0
[  947.285343] RBP: ffffafe04538f9a8 R08: ffff9d7040000ff0 R09: 0000000000000000
[  947.286596] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa9078d70
[  947.287855] R13: 0000000000000000 R14: 0000000000000001 R15: ffffffffaa619aa0
[  947.289118] FS:  00007f668b122ac0(0000) GS:ffff9d727fd80000(0000) knlGS:0000000000000000
[  947.290550] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  947.291569] CR2: 00007f6688389004 CR3: 0000000224496006 CR4: 00000000003606e0
[  947.292861] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  947.294125] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  947.295394] Call Trace:
[  947.295854]  ? load_new_mm_cr3+0xe0/0xe0
[  947.296568]  on_each_cpu+0x2d/0x60
[  947.297191]  flush_tlb_all+0x1c/0x20
[  947.297846]  __split_large_page+0x5d9/0x640
[  947.298604]  set_kpte+0xfe/0x260
[  947.299824]  get_page_from_freelist+0x1633/0x1680
[  947.301260]  ? lookup_address+0x2d/0x30
[  947.302550]  ? set_kpte+0x1e1/0x260
[  947.303760]  __alloc_pages_nodemask+0x13f/0x2e0
[  947.305137]  alloc_pages_vma+0x7a/0x1c0
[  947.306378]  wp_page_copy+0x201/0xa30
[  947.307582]  ? generic_file_read_iter+0x96a/0xcf0
[  947.308946]  do_wp_page+0x1cc/0x420
[  947.310086]  __handle_mm_fault+0xc0d/0x1600
[  947.311331]  handle_mm_fault+0xe1/0x210
[  947.312502]  __do_page_fault+0x23a/0x4c0
[  947.313672]  ? _cond_resched+0x19/0x30
[  947.314795]  do_page_fault+0x2e/0xe0
[  947.315878]  ? page_fault+0x8/0x30
[  947.316916]  page_fault+0x1e/0x30
[  947.317930] RIP: 0033:0x76581e
[  947.318893] Code: eb 05 89 d8 48 8d 04 80 48 8d 34 c5 08 00 00 00 48 85 ff 74 04 44 8b 67 04 e8 de 80 08 00 81 e3 ff ff ff 7f 48 89 45 00 8b 10 <44> 89 60 04 81 e2 00 00 00 80 09 da 89 10 c1 ea 18 83 e2 7f 88 50
[  947.323337] RSP: 002b:00007ffde06c0e40 EFLAGS: 00010202
[  947.324663] RAX: 00007f6688389000 RBX: 0000000000000004 RCX: 0000000000000001
[  947.326317] RDX: 0000000000000000 RSI: 0000000001000001 RDI: 0000000000000017
[  947.327973] RBP: 00007f66883882d8 R08: 00000000032e05f0 R09: 00007f668b30e6f0
[  947.329619] R10: 0000000000000002 R11: 00000000032e05f0 R12: 0000000000000000
[  947.331260] R13: 00007f6688388230 R14: 00007f6688388288 R15: 00007f668ac3b0a8
[  947.332911] ---[ end trace 7d605a38c67d83ae ]---

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-04 15:15     ` Khalid Aziz
@ 2019-04-04 17:01       ` Peter Zijlstra
  0 siblings, 0 replies; 70+ messages in thread
From: Peter Zijlstra @ 2019-04-04 17:01 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, konrad.wilk,
	Juerg Haefliger, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, jcm, boris.ostrovsky, kanth.ghatraju,
	joao.m.martins, jmattson, pradeep.vincent, john.haxby, tglx,
	kirill.shutemov, hch, steven.sistare, labbott, luto, dave.hansen,
	aaron.lu, akpm, alexander.h.duyck, amir73il, andreyknvl,
	aneesh.kumar, anthony.yznaga, ard.biesheuvel, arnd, arunks, ben,
	bigeasy, bp, brgl, catalin.marinas, corbet, cpandya,
	daniel.vetter, dan.j.williams, gregkh, guro, hannes, hpa,
	iamjoonsoo.kim, james.morse, jannh, jgross, jkosina, jmorris,
	joe, jrdr.linux, jroedel, keith.busch, khlebnikov, logang,
	marco.antonio.780, mark.rutland, mgorman, mhocko, mhocko,
	mike.kravetz, mingo, mst, m.szyprowski, npiggin, osalvador,
	paulmck, pavel.tatashin, rdunlap, richard.weiyang, riel,
	rientjes, robin.murphy, rostedt, rppt, sai.praneeth.prakhya,
	serge, steve.capper, thymovanbeers, vbabka, will.deacon, willy,
	yang.shi, yaojun8558363, ying.huang, zhangshaokun, iommu, x86,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linux-security-module, Khalid Aziz

On Thu, Apr 04, 2019 at 09:15:46AM -0600, Khalid Aziz wrote:
> Thanks Peter. I really appreciate your review. Your feedback helps make
> this code better and closer to where I can feel comfortable not calling
> it RFC any more.
> 
> The more I look at the xpfo_kmap()/xpfo_kunmap() code, the more
> uncomfortable I get with it. As you pointed out about calling kmap_atomic
> from NMI context, that just makes the kmap_atomic code look even worse. I
> pointed out one problem with this code in the cover letter and suggested a
> rewrite. I see these problems with this code:

Well, I no longer use it from NMI context, but I did do that for a
while. We now have a giant heap of magic in the NMI path that allows us
to take faults from NMI context (please don't ask), which means we can
mostly do copy_from_user_inatomic() now.

> 1. When xpfo_kmap maps a page back in physmap, it opens up the ret2dir
> attack security hole again even if just for the duration of kmap. A kmap
> can stay around for some time if the page is being used for I/O.

Correct.

>> 2. This code uses a spinlock, which leads to problems. If it does not
>> disable IRQs, it is exposed to deadlock around xpfo_lock. If it disables
>> IRQs, I think it can still deadlock around pgd_lock.

I've not spotted that inversion yet, but then I didn't look at the lock
usage outside of k{,un}map_xpfo().

>> I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map
>> the page at a new virtual address, similar to what kmap_high does for
>> i386. This avoids re-opening the ret2dir security hole. We can also
>> possibly do away with xpfo_lock, saving bytes in the page frame, and the
>> not-so-sane code sequence can go away.

Right, the TLB invalidation issues are still tricky, even there :/


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
  2019-04-04 16:28           ` Thomas Gleixner
@ 2019-04-04 17:11             ` Andy Lutomirski
  0 siblings, 0 replies; 70+ messages in thread
From: Andy Lutomirski @ 2019-04-04 17:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tycho Andersen, Andy Lutomirski, Khalid Aziz, Juerg Haefliger,
	jsteckli, Andi Kleen, liran.alon, Kees Cook,
	Konrad Rzeszutek Wilk, deepa.srinivasan, chris hyser,
	Tyler Hicks, Woodhouse, David, Andrew Cooper, Jon Masters,
	Boris Ostrovsky, kanth.ghatraju, Joao Martins, Jim Mattson,
	pradeep.vincent, John Haxby, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Dave Hansen,
	Peter Zijlstra, Aaron Lu, Andrew Morton, alexander.h.duyck,
	Amir Goldstein, Andrey Konovalov, aneesh.kumar, anthony.yznaga,
	Ard Biesheuvel, Arnd Bergmann, arunks, Ben Hutchings,
	Sebastian Andrzej Siewior, Borislav Petkov, brgl,
	Catalin Marinas, Jonathan Corbet, cpandya, Daniel Vetter,
	Dan Williams, Greg KH, Roman Gushchin, Johannes Weiner,
	H. Peter Anvin, Joonsoo Kim, James Morse, Jann Horn,
	Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Serge E. Hallyn, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz


> On Apr 4, 2019, at 10:28 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
>> On Thu, 4 Apr 2019, Tycho Andersen wrote:
>>    leaq    -PTREGS_SIZE(%rax), %rsp
>>    UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
>> 
>> +    /*
>> +     * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
>> +     * them back on before going back to "normal" code.
>> +     */
>> +    sti
> 
> That breaks the paravirt muck and tracing/lockdep.
> 
> ENABLE_INTERRUPTS() is what you want plus TRACE_IRQ_ON to keep the tracer
> and lockdep happy.
> 
> 

I’m sure we’ll find some other thing we forgot to reset eventually, so let’s do this in C.  Change the "call do_exit" to "call __finish_rewind_stack_do_exit" and add the latter as a C function that does local_irq_enable() followed by do_exit().
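
(A minimal sketch of what such a C helper could look like, purely for
illustration and not a tested patch; the function name comes from the
suggestion above, and forwarding an exit code argument from the asm stub
is an assumption, mirroring how rewind_stack_do_exit hands its argument
to do_exit() today.)

	/*
	 * C landing point for rewind_stack_do_exit, so that any state we
	 * forgot to reset (IRQs today, possibly more later) can be fixed
	 * up in C before exiting.  The "code" argument is assumed to be
	 * passed through from the asm stub.
	 */
	void __noreturn __finish_rewind_stack_do_exit(long code)
	{
		/*
		 * We may have oopsed with IRQs off, e.g. in an interrupt
		 * handler.  local_irq_enable() also keeps lockdep and the
		 * irq tracer informed, unlike a bare sti.
		 */
		local_irq_enable();
		do_exit(code);
	}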

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership
  2019-04-04 16:44 ` [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Nadav Amit
@ 2019-04-04 17:18   ` Khalid Aziz
  0 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-04 17:18 UTC (permalink / raw)
  To: Nadav Amit
  Cc: X86 ML, linux-arm-kernel, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List

On 4/4/19 10:44 AM, Nadav Amit wrote:
>> On Apr 3, 2019, at 10:34 AM, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> This is another update to the work Juerg, Tycho and Julian have
>> done on XPFO.
> 
> Interesting work, but note that it triggers a warning on my system due to
> possible deadlock. It seems that the patch-set disables IRQs in
> xpfo_kunmap() and then might flush remote TLBs when a large page is split.
> This is wrong, since it might lead to deadlocks.
> 

Thanks for letting me know. xpfo_kunmap() is not quite right. It will
end up being rewritten for the next version.

--
Khalid


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
       [not found]     ` <91f1dbce-332e-25d1-15f6-0e9cfc8b797b@oracle.com>
@ 2019-04-05  7:17       ` Thomas Gleixner
  2019-04-05 14:44         ` Dave Hansen
  2019-04-05 15:24       ` Andy Lutomirski
  1 sibling, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2019-04-05  7:17 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andy Lutomirski, Juerg Haefliger, Tycho Andersen, jsteckli,
	Andi Kleen, liran.alon, Kees Cook, Konrad Rzeszutek Wilk,
	deepa.srinivasan, chris hyser, Tyler Hicks, Woodhouse, David,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, kanth.ghatraju,
	Joao Martins, Jim Mattson, pradeep.vincent, John Haxby,
	Kirill A. Shutemov, Christoph Hellwig, steven.sistare,
	Laura Abbott, Dave Hansen, Peter Zijlstra, Aaron Lu,
	Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yang.shi, yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz

On Thu, 4 Apr 2019, Khalid Aziz wrote:
> When xpfo unmaps a page from physmap only (after mapping the page in
> userspace in response to an allocation request from userspace) on one
> processor, there is a small window of opportunity for ret2dir attack on
> other cpus until the TLB entry in physmap for the unmapped pages on
> other cpus is cleared. Forcing that to happen synchronously is the
> expensive part. Many of these requests can come in over a very
> short time across multiple processors, resulting in every cpu asking
> every other cpu to flush its TLB just to close this small window of
> vulnerability in the kernel. If each request is processed synchronously,
> each CPU will do multiple TLB flushes in short order. If we could
> consolidate these TLB flush requests instead and do one TLB flush on
> each cpu at the time of context switch, we can reduce the performance
> impact significantly. This bears out in real life measuring the system
> time when doing a parallel kernel build on a large server. Without this,
> system time on 96-core server when doing "make -j60 all" went up 26x.
> After this optimization, impact went down to 1.44x.
> 
> The trade-off with this strategy is, the kernel on a cpu is vulnerable
> for a short time if the current running processor is the malicious

The "short" time to next context switch on the other CPUs is how short
exactly? Anything from 1us to seconds- think NOHZ FULL - and even w/o that
10ms on a HZ=100 kernel is plenty of time to launch an attack.

> process. Is that an acceptable trade-off?

You are not seriously asking whether creating a user controllable ret2dir
attack window is an acceptable trade-off? April 1st was a few days ago.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-05  7:17       ` Thomas Gleixner
@ 2019-04-05 14:44         ` Dave Hansen
  2019-04-05 15:24           ` Andy Lutomirski
  2019-04-05 15:44           ` Khalid Aziz
  0 siblings, 2 replies; 70+ messages in thread
From: Dave Hansen @ 2019-04-05 14:44 UTC (permalink / raw)
  To: Thomas Gleixner, Khalid Aziz
  Cc: Andy Lutomirski, Juerg Haefliger, Tycho Andersen, jsteckli,
	Andi Kleen, liran.alon, Kees Cook, Konrad Rzeszutek Wilk,
	deepa.srinivasan, chris hyser, Tyler Hicks, Woodhouse, David,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, kanth.ghatraju,
	Joao Martins, Jim Mattson, pradeep.vincent, John Haxby,
	Kirill A. Shutemov, Christoph Hellwig, steven.sistare,
	Laura Abbott, Peter Zijlstra, Aaron Lu, Andrew Morton,
	alexander.h.duyck, Amir Goldstein, Andrey Konovalov,
	aneesh.kumar, anthony.yznaga, Ard Biesheuvel, Arnd Bergmann,
	arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yang.shi, yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz

On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>> process. Is that an acceptable trade-off?
> You are not seriously asking whether creating a user controllable ret2dir
> attack window is a acceptable trade-off? April 1st was a few days ago.

Well, let's not forget that this set at least takes us from "always
vulnerable to ret2dir" to a choice between:

1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
2. slow and "mitigated against ret2dir"

Sounds like we need a mechanism that will do the deferred XPFO TLB
flushes whenever the kernel is entered, and not _just_ at context switch
time.  This permits an app to run in userspace with stale kernel TLB
entries as long as it wants... that's harmless.

But, if it enters the kernel, we could process the deferred flush there.
 The concern would be catching all the kernel entry paths (PTI does this
today obviously, but I don't think we want to muck with the PTI code for
this).  The other concern would be that the code between kernel entry
and the flush would be vulnerable.  But, that seems like a reasonable
trade-off to me.
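
(Sketching the idea below, purely as an illustration; this is not part of
the posted series and the names xpfo_flush_pending, xpfo_post_deferred_flush
and xpfo_check_deferred_flush are made up for this example.)

	/*
	 * Illustrative sketch of a deferred XPFO TLB flush.  Instead of
	 * IPI-ing every CPU when a page leaves the physmap, post a
	 * per-CPU "flush pending" flag and let each CPU consume it the
	 * next time it enters the kernel (or context switches).
	 */
	static DEFINE_PER_CPU(bool, xpfo_flush_pending);

	/* Caller has unmapped a page from the physmap; preemption disabled. */
	static void xpfo_post_deferred_flush(void)
	{
		int cpu;

		/* Real code would need to think about memory ordering here. */
		for_each_online_cpu(cpu)
			if (cpu != smp_processor_id())
				per_cpu(xpfo_flush_pending, cpu) = true;

		__flush_tlb_all();	/* flush this CPU right away */
	}

	/* Hooked into kernel entry and/or context switch. */
	static void xpfo_check_deferred_flush(void)
	{
		if (this_cpu_read(xpfo_flush_pending)) {
			this_cpu_write(xpfo_flush_pending, false);
			__flush_tlb_all();
		}
	}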


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
       [not found]     ` <91f1dbce-332e-25d1-15f6-0e9cfc8b797b@oracle.com>
  2019-04-05  7:17       ` Thomas Gleixner
@ 2019-04-05 15:24       ` Andy Lutomirski
  1 sibling, 0 replies; 70+ messages in thread
From: Andy Lutomirski @ 2019-04-05 15:24 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andy Lutomirski, Juerg Haefliger, Tycho Andersen, jsteckli,
	Andi Kleen, liran.alon, Kees Cook, Konrad Rzeszutek Wilk,
	deepa.srinivasan, chris hyser, Tyler Hicks, Woodhouse, David,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, kanth.ghatraju,
	Joao Martins, Jim Mattson, pradeep.vincent, John Haxby,
	Thomas Gleixner, Kirill A. Shutemov, Christoph Hellwig,
	steven.sistare, Laura Abbott, Dave Hansen, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport, Serge E. Hallyn,
	Steve Capper, thymovanbeers, Vlastimil Babka, Will Deacon,
	Matthew Wilcox, yaojun8558363, Huang Ying, zhangshaokun, iommu,
	X86 ML, linux-arm-kernel, open list:DOCUMENTATION, LKML,
	Linux-MM, LSM List, Khalid Aziz



>>> On Apr 4, 2019, at 4:55 PM, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>> 
>>> On 4/3/19 10:10 PM, Andy Lutomirski wrote:
>>> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>> 
>>> XPFO flushes kernel space TLB entries for pages that are now mapped
>>> in userspace on not only the current CPU but also all other CPUs
>>> synchronously. Processes on each core allocating pages causes a
>>> flood of IPI messages to all other cores to flush TLB entries.
>>> Many of these messages are to flush the entire TLB on the core if
>>> the number of entries being flushed from local core exceeds
>>> tlb_single_page_flush_ceiling. The cost of TLB flush caused by
>>> unmapping pages from physmap goes up dramatically on machines with
>>> high core count.
>>> 
>>> This patch flushes relevant TLB entries for current process or
>>> entire TLB depending upon number of entries for the current CPU
>>> and posts a pending TLB flush on all other CPUs when a page is
>>> unmapped from kernel space and mapped in userspace. Each core
>>> checks the pending TLB flush flag for itself on every context
>>> switch, flushes its TLB if the flag is set and clears it.
>>> This patch potentially aggregates multiple TLB flushes into one.
>>> This has very significant impact especially on machines with large
>>> core counts.
>> 
>> Why is this a reasonable strategy?
> 
> Ideally, when pages are unmapped from the physmap, all CPUs would be sent
> an IPI synchronously to flush the TLB entries for those pages immediately.
> This may be ideal from a correctness and consistency point of view, but it
> also results in an IPI storm and repeated TLB flushes on all processors.
> Any time a page is allocated to userspace, we are going to go through
> this, and it is very expensive. On a 96-core server, the performance
> degradation is 26x!!

Indeed. XPFO is expensive.

> 
> When xpfo unmaps a page from physmap only (after mapping the page in
> userspace in response to an allocation request from userspace) on one
> processor, there is a small window of opportunity for ret2dir attack on
> other cpus until the TLB entry in physmap for the unmapped pages on
> other cpus is cleared.

Why do you think this window is small? Intervals of seconds to months between context switches aren’t unheard of.

And why is a small window like this even helpful?  For a ret2dir attack, you just need to get CPU A to allocate a page and write the ret2dir payload and then get CPU B to return to it before context switching.  This should be doable quite reliably.

So I don’t really have a suggestion, but I think that a 44% regression to get a weak defense like this doesn’t seem worthwhile.  I bet that any of a number of CFI techniques (RAP-like or otherwise) will be cheaper and protect against ret2dir better.  And they’ll also protect against using other kernel memory as a stack buffer.  There are plenty of those — think pipe buffers, network buffers, any page cache not covered by XPFO, XMM/YMM saved state, etc.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-05 14:44         ` Dave Hansen
@ 2019-04-05 15:24           ` Andy Lutomirski
  2019-04-05 15:56             ` Tycho Andersen
                               ` (2 more replies)
  2019-04-05 15:44           ` Khalid Aziz
  1 sibling, 3 replies; 70+ messages in thread
From: Andy Lutomirski @ 2019-04-05 15:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Khalid Aziz, Andy Lutomirski, Juerg Haefliger,
	Tycho Andersen, jsteckli, Andi Kleen, liran.alon, Kees Cook,
	Konrad Rzeszutek Wilk, deepa.srinivasan, chris hyser,
	Tyler Hicks, Woodhouse, David, Andrew Cooper, Jon Masters,
	Boris Ostrovsky, kanth.ghatraju, Joao Martins, Jim Mattson,
	pradeep.vincent, John Haxby, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, LKML, Linux-MM, LSM List, Khalid Aziz



> On Apr 5, 2019, at 8:44 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>> process. Is that an acceptable trade-off?
>> You are not seriously asking whether creating a user controllable ret2dir
>> attack window is a acceptable trade-off? April 1st was a few days ago.
> 
> Well, let's not forget that this set at least takes us from "always
> vulnerable to ret2dir" to a choice between:
> 
> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
> 2. slow and "mitigated against ret2dir"
> 
> Sounds like we need a mechanism that will do the deferred XPFO TLB
> flushes whenever the kernel is entered, and not _just_ at context switch
> time.  This permits an app to run in userspace with stale kernel TLB
> entries as long as it wants... that's harmless.

I don’t think this is good enough. The bad guys can enter the kernel and arrange for the kernel to wait, *in kernel*, for long enough to set up the attack.  userfaultfd is the most obvious way, but there are plenty. I suppose we could do the flush at context switch *and* entry.  I bet that performance still utterly sucks, though — on many workloads, this turns every entry into a full flush, and we already know exactly how much that sucks — it’s identical to KPTI without PCID.  (And yes, if we go this route, we need to merge this logic together — we shouldn’t write CR3 twice on entry).

I feel like this whole approach is misguided. ret2dir is not such a game changer that fixing it is worth huge slowdowns. I think all this effort should be spent on some kind of sensible CFI. For example, we should be able to mostly squash ret2anything by inserting a check that the high bits of RSP match the value on the top of the stack before any code that pops RSP.  On an FPO build, there aren’t all that many hot POP RSP instructions, I think.

(Actually, checking the bits is suboptimal. Do:

unsigned long offset = *rsp - rsp;
offset >>= THREAD_SHIFT;
if (unlikely(offset))
BUG();
POP RSP;

This means that it’s also impossible to trick a function to return into a buffer that is on that function’s stack.)
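
(A C rendering of that pseudocode, again only as an illustration; the
helper name check_rsp_pivot is made up here, and THREAD_SHIFT is used as
in the snippet above.)

	/*
	 * Illustration of the check sketched above: before switching the
	 * stack pointer to a value popped off the stack, require that the
	 * new value is no more than THREAD_SIZE bytes above the slot it is
	 * being popped from, i.e. still within the current kernel stack.
	 */
	static __always_inline void check_rsp_pivot(unsigned long *rsp)
	{
		unsigned long offset = *rsp - (unsigned long)rsp;

		if (unlikely(offset >> THREAD_SHIFT))
			BUG();
		/* ...only now perform the actual POP RSP. */
	}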

In other words, I think that ret2dir is an insufficient justification for XPFO.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-05 14:44         ` Dave Hansen
  2019-04-05 15:24           ` Andy Lutomirski
@ 2019-04-05 15:44           ` Khalid Aziz
  1 sibling, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-05 15:44 UTC (permalink / raw)
  To: Dave Hansen, Thomas Gleixner
  Cc: Andy Lutomirski, Juerg Haefliger, Tycho Andersen, jsteckli,
	Andi Kleen, liran.alon, Kees Cook, Konrad Rzeszutek Wilk,
	deepa.srinivasan, chris hyser, Tyler Hicks, Woodhouse, David,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, kanth.ghatraju,
	Joao Martins, Jim Mattson, pradeep.vincent, John Haxby,
	Kirill A. Shutemov, Christoph Hellwig, steven.sistare,
	Laura Abbott, Peter Zijlstra, Aaron Lu, Andrew Morton,
	alexander.h.duyck, Amir Goldstein, Andrey Konovalov,
	aneesh.kumar, anthony.yznaga, Ard Biesheuvel, Arnd Bergmann,
	arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yang.shi, yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz

On 4/5/19 8:44 AM, Dave Hansen wrote:
> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>> process. Is that an acceptable trade-off?
>> You are not seriously asking whether creating a user controllable ret2dir
>> attack window is a acceptable trade-off? April 1st was a few days ago.
> 
> Well, let's not forget that this set at least takes us from "always
> vulnerable to ret2dir" to a choice between:
> 
> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
> 2. slow and "mitigated against ret2dir"
> 
> Sounds like we need a mechanism that will do the deferred XPFO TLB
> flushes whenever the kernel is entered, and not _just_ at context switch
> time.  This permits an app to run in userspace with stale kernel TLB
> entries as long as it wants... that's harmless.

That sounds like a good idea. This TLB flush only needs to be done at
kernel entry when there is a pending flush posted for that cpu. What
this does is move the cost of the TLB flush from the next context switch
to kernel entry, and it does not add any more flushes than what we are
already doing with these xpfo patches. So the overall performance impact
might not change much. It seems worth coding up.

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-05 15:24           ` Andy Lutomirski
@ 2019-04-05 15:56             ` Tycho Andersen
  2019-04-05 16:32               ` Andy Lutomirski
  2019-04-05 15:56             ` Khalid Aziz
  2019-04-05 16:01             ` Dave Hansen
  2 siblings, 1 reply; 70+ messages in thread
From: Tycho Andersen @ 2019-04-05 15:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Thomas Gleixner, Khalid Aziz, Andy Lutomirski,
	Juerg Haefliger, jsteckli, Andi Kleen, liran.alon, Kees Cook,
	Konrad Rzeszutek Wilk, deepa.srinivasan, chris hyser,
	Tyler Hicks, Woodhouse, David, Andrew Cooper, Jon Masters,
	Boris Ostrovsky, kanth.ghatraju, Joao Martins, Jim Mattson,
	pradeep.vincent, John Haxby, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, LKML, Linux-MM, LSM List, Khalid Aziz

On Fri, Apr 05, 2019 at 09:24:50AM -0600, Andy Lutomirski wrote:
> 
> 
> > On Apr 5, 2019, at 8:44 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> > On 4/5/19 12:17 AM, Thomas Gleixner wrote:
> >>> process. Is that an acceptable trade-off?
> >> You are not seriously asking whether creating a user controllable ret2dir
> >> attack window is a acceptable trade-off? April 1st was a few days ago.
> > 
> > Well, let's not forget that this set at least takes us from "always
> > vulnerable to ret2dir" to a choice between:
> > 
> > 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
> > 2. slow and "mitigated against ret2dir"
> > 
> > Sounds like we need a mechanism that will do the deferred XPFO TLB
> > flushes whenever the kernel is entered, and not _just_ at context switch
> > time.  This permits an app to run in userspace with stale kernel TLB
> > entries as long as it wants... that's harmless.
> 
> I don’t think this is good enough. The bad guys can enter the kernel and arrange for the kernel to wait, *in kernel*, for long enough to set up the attack.  userfaultfd is the most obvious way, but there are plenty. I suppose we could do the flush at context switch *and* entry.  I bet that performance still utterly sucks, though — on many workloads, this turns every entry into a full flush, and we already know exactly how much that sucks — it’s identical to KPTI without PCID.  (And yes, if we go this route, we need to merge this logic together — we shouldn’t write CR3 twice on entry).
> 
> I feel like this whole approach is misguided. ret2dir is not such a game changer that fixing it is worth huge slowdowns. I think all this effort should be spent on some kind of sensible CFI. For example, we should be able to mostly squash ret2anything by inserting a check that the high bits of RSP match the value on the top of the stack before any code that pops RSP.  On an FPO build, there aren’t all that many hot POP RSP instructions, I think.
> 
> (Actually, checking the bits is suboptimal. Do:
> 
> unsigned long offset = *rsp - rsp;
> offset >>= THREAD_SHIFT;
> if (unlikely(offset))
> BUG();
> POP RSP;

This is a neat trick, and definitely prevents going random places in
the heap. But,

> This means that it’s also impossible to trick a function to return into a buffer that is on that function’s stack.)

Why is this true? All you're checking is that you can't shift the
"location" of the stack. If you can inject stuff into a stack buffer,
can't you just inject the right frame to return to your code as well,
so you don't have to shift locations?

Thanks!

Tycho


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-05 15:24           ` Andy Lutomirski
  2019-04-05 15:56             ` Tycho Andersen
@ 2019-04-05 15:56             ` Khalid Aziz
  2019-04-05 16:01             ` Dave Hansen
  2 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-05 15:56 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: Thomas Gleixner, Andy Lutomirski, Juerg Haefliger,
	Tycho Andersen, jsteckli, Andi Kleen, liran.alon, Kees Cook,
	Konrad Rzeszutek Wilk, deepa.srinivasan, chris hyser,
	Tyler Hicks, Woodhouse, David, Andrew Cooper, Jon Masters,
	Boris Ostrovsky, kanth.ghatraju, Joao Martins, Jim Mattson,
	pradeep.vincent, John Haxby, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, LKML, Linux-MM, LSM List, Khalid Aziz

On 4/5/19 9:24 AM, Andy Lutomirski wrote:
> 
> 
>> On Apr 5, 2019, at 8:44 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>>> process. Is that an acceptable trade-off?
>>> You are not seriously asking whether creating a user controllable ret2dir
>>> attack window is a acceptable trade-off? April 1st was a few days ago.
>>
>> Well, let's not forget that this set at least takes us from "always
>> vulnerable to ret2dir" to a choice between:
>>
>> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
>> 2. slow and "mitigated against ret2dir"
>>
>> Sounds like we need a mechanism that will do the deferred XPFO TLB
>> flushes whenever the kernel is entered, and not _just_ at context switch
>> time.  This permits an app to run in userspace with stale kernel TLB
>> entries as long as it wants... that's harmless.
> 
> I don’t think this is good enough. The bad guys can enter the kernel and arrange for the kernel to wait, *in kernel*, for long enough to set up the attack.  userfaultfd is the most obvious way, but there are plenty. I suppose we could do the flush at context switch *and* entry.  I bet that performance still utterly sucks, though — on many workloads, this turns every entry into a full flush, and we already know exactly how much that sucks — it’s identical to KPTI without PCID.  (And yes, if we go this route, we need to merge this logic together — we shouldn’t write CR3 twice on entry).

The performance impact of flushing at kernel entry might not be all that
much. This flush will happen only if there is a pending flush posted to
the processor, and it will be done in lieu of the flush at the next
context switch. So we are not looking at adding more TLB flushes, rather
changing where they might happen. That can still result in some
performance impact, and measuring it with real code will be the only way
to get that number.

> 
> I feel like this whole approach is misguided. ret2dir is not such a game changer that fixing it is worth huge slowdowns. I think all this effort should be spent on some kind of sensible CFI. For example, we should be able to mostly squash ret2anything by inserting a check that the high bits of RSP match the value on the top of the stack before any code that pops RSP.  On an FPO build, there aren’t all that many hot POP RSP instructions, I think.
> 
> (Actually, checking the bits is suboptimal. Do:
> 
> unsigned long offset = *rsp - rsp;
> offset >>= THREAD_SHIFT;
> if (unlikely(offset))
> BUG();
> POP RSP;
> 
> This means that it’s also impossible to trick a function to return into a buffer that is on that function’s stack.)
> 
> In other words, I think that ret2dir is an insufficient justification for XPFO.
> 

That is something we may want to explore further. Closing down
/proc/<pid>/pagemap has already helped reduce one way to mount a ret2dir
attack. The physmap spraying technique still remains viable. The XPFO
implementation is expensive. Can we do something different to mitigate
physmap spraying?

Thanks,
Khalid



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-05 15:24           ` Andy Lutomirski
  2019-04-05 15:56             ` Tycho Andersen
  2019-04-05 15:56             ` Khalid Aziz
@ 2019-04-05 16:01             ` Dave Hansen
  2019-04-05 16:27               ` Andy Lutomirski
  2 siblings, 1 reply; 70+ messages in thread
From: Dave Hansen @ 2019-04-05 16:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Khalid Aziz, Andy Lutomirski, Juerg Haefliger,
	Tycho Andersen, jsteckli, Andi Kleen, liran.alon, Kees Cook,
	Konrad Rzeszutek Wilk, deepa.srinivasan, chris hyser,
	Tyler Hicks, Woodhouse, David, Andrew Cooper, Jon Masters,
	Boris Ostrovsky, kanth.ghatraju, Joao Martins, Jim Mattson,
	pradeep.vincent, John Haxby, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, LKML, Linux-MM, LSM List, Khalid Aziz

On 4/5/19 8:24 AM, Andy Lutomirski wrote:
>> Sounds like we need a mechanism that will do the deferred XPFO TLB 
>> flushes whenever the kernel is entered, and not _just_ at context
>> switch time.  This permits an app to run in userspace with stale
>> kernel TLB entries as long as it wants... that's harmless.
...
> I suppose we could do the flush at context switch *and*
> entry.  I bet that performance still utterly sucks, though — on many
> workloads, this turns every entry into a full flush, and we already
> know exactly how much that sucks — it’s identical to KPTI without
> PCID.  (And yes, if we go this route, we need to merge this logic
> together — we shouldn’t write CR3 twice on entry).

Yeah, probably true.

Just eyeballing this, it would mean mapping the "cpu needs deferred
flush" variable into the cpu_entry_area, which doesn't seem too awful.

I think the basic overall concern is that the deferred flush leaves too
many holes, and by the time we close them sufficiently, performance will
suck again.  Seems like a totally valid concern, but my crystal ball is
hazy on whether it will be worth it in the end to many folks.

...
> In other words, I think that ret2dir is an insufficient justification
> for XPFO.

Yeah, other things that it is good for have kinda been lost in the
noise.  I think I first started looking at this long before Meltdown and
L1TF were public.

There are hypervisors out there that simply don't (persistently) map
user data.  They can't leak user data because they don't even have
access to it in their virtual address space.  Those hypervisors had a
much easier time with L1TF mitigation than we did.  Basically, they
could flush the L1 after user data was accessible instead of before
untrusted guest code runs.

My hope is that XPFO could provide us similar protection.  But,
somebody's got to poke at it for a while to see how far they can push it.

IMNHO, XPFO is *always* going to be painful for kernel compiles.  But,
cloud providers aren't doing a lot of kernel compiles on their KVM
hosts, and they deeply care about not leaking their users' data.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-05 16:01             ` Dave Hansen
@ 2019-04-05 16:27               ` Andy Lutomirski
  2019-04-05 16:41                 ` Peter Zijlstra
  2019-04-05 17:35                 ` Khalid Aziz
  0 siblings, 2 replies; 70+ messages in thread
From: Andy Lutomirski @ 2019-04-05 16:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Khalid Aziz, Andy Lutomirski, Juerg Haefliger,
	Tycho Andersen, jsteckli, Andi Kleen, liran.alon, Kees Cook,
	Konrad Rzeszutek Wilk, deepa.srinivasan, chris hyser,
	Tyler Hicks, Woodhouse, David, Andrew Cooper, Jon Masters,
	Boris Ostrovsky, kanth.ghatraju, Joao Martins, Jim Mattson,
	pradeep.vincent, John Haxby, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, LKML, Linux-MM, LSM List, Khalid Aziz



> On Apr 5, 2019, at 10:01 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/5/19 8:24 AM, Andy Lutomirski wrote:
>>> Sounds like we need a mechanism that will do the deferred XPFO TLB 
>>> flushes whenever the kernel is entered, and not _just_ at context
>>> switch time.  This permits an app to run in userspace with stale
>>> kernel TLB entries as long as it wants... that's harmless.
> ...
>> I suppose we could do the flush at context switch *and*
>> entry.  I bet that performance still utterly sucks, though — on many
>> workloads, this turns every entry into a full flush, and we already
>> know exactly how much that sucks — it’s identical to KPTI without
>> PCID.  (And yes, if we go this route, we need to merge this logic
>> together — we shouldn’t write CR3 twice on entry).
> 
> Yeah, probably true.
> 
> Just eyeballing this, it would mean mapping the "cpu needs deferred
> flush" variable into the cpu_entry_area, which doesn't seem too awful.
> 
> I think the basic overall concern is that the deferred flush leaves too
> many holes and by the time we close them sufficiently, performance will
> suck again.  Seems like a totally valid concern, but my crystal ball is
> hazy on whether it will be worth it in the end to many folks
> 
> ...
>> In other words, I think that ret2dir is an insufficient justification
>> for XPFO.
> 
> Yeah, other things that it is good for have kinda been lost in the
> noise.  I think I first started looking at this long before Meltdown and
> L1TF were public.
> 
> There are hypervisors out there that simply don't (persistently) map
> user data.  They can't leak user data because they don't even have
> access to it in their virtual address space.  Those hypervisors had a
> much easier time with L1TF mitigation than we did.  Basically, they
> could flush the L1 after user data was accessible instead of before
> untrusted guest code runs.
> 
> My hope is that XPFO could provide us similar protection.  But,
> somebody's got to poke at it for a while to see how far they can push it.
> 
> IMNHO, XPFO is *always* going to be painful for kernel compiles.  But,
> cloud providers aren't doing a lot of kernel compiles on their KVM
> hosts, and they deeply care about leaking their users' data.

At the risk of asking stupid questions: we already have a mechanism for this: highmem.  Can we enable highmem on x86_64, maybe with some heuristics to make it work well?
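
(For concreteness, here is a minimal sketch of the deferred-flush
bookkeeping discussed in the quoted exchange above: a remote CPU only gets
a per-CPU flag set instead of being sent an IPI, and the flag is checked at
context switch and, per Dave's suggestion, possibly on kernel entry as
well. The names xpfo_flush_pending, xpfo_mark_flush_pending() and
xpfo_cond_flush() are assumptions for illustration, not identifiers from
the series.)

#include <linux/percpu.h>
#include <asm/tlbflush.h>

static DEFINE_PER_CPU(bool, xpfo_flush_pending);

/* Instead of sending a flush IPI, just mark the target CPU. */
static void xpfo_mark_flush_pending(int cpu)
{
	per_cpu(xpfo_flush_pending, cpu) = true;
}

/* Run at context switch (and potentially on kernel entry). */
static void xpfo_cond_flush(void)
{
	if (this_cpu_read(xpfo_flush_pending)) {
		this_cpu_write(xpfo_flush_pending, false);
		__flush_tlb_all();	/* drop stale kernel TLB entries locally */
	}
}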

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-05 15:56             ` Tycho Andersen
@ 2019-04-05 16:32               ` Andy Lutomirski
  0 siblings, 0 replies; 70+ messages in thread
From: Andy Lutomirski @ 2019-04-05 16:32 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Dave Hansen, Thomas Gleixner, Khalid Aziz, Andy Lutomirski,
	Juerg Haefliger, jsteckli, Andi Kleen, liran.alon, Kees Cook,
	Konrad Rzeszutek Wilk, deepa.srinivasan, chris hyser,
	Tyler Hicks, Woodhouse, David, Andrew Cooper, Jon Masters,
	Boris Ostrovsky, kanth.ghatraju, Joao Martins, Jim Mattson,
	pradeep.vincent, John Haxby, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, LKML, Linux-MM, LSM List, Khalid Aziz



> On Apr 5, 2019, at 9:56 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> 
>> On Fri, Apr 05, 2019 at 09:24:50AM -0600, Andy Lutomirski wrote:
>> 
>> 
>>> On Apr 5, 2019, at 8:44 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> 
>>> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>>>> process. Is that an acceptable trade-off?
>>>> You are not seriously asking whether creating a user controllable ret2dir
>>>> attack window is a acceptable trade-off? April 1st was a few days ago.
>>> 
>>> Well, let's not forget that this set at least takes us from "always
>>> vulnerable to ret2dir" to a choice between:
>>> 
>>> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
>>> 2. slow and "mitigated against ret2dir"
>>> 
>>> Sounds like we need a mechanism that will do the deferred XPFO TLB
>>> flushes whenever the kernel is entered, and not _just_ at context switch
>>> time.  This permits an app to run in userspace with stale kernel TLB
>>> entries as long as it wants... that's harmless.
>> 
>> I don’t think this is good enough. The bad guys can enter the kernel and arrange for the kernel to wait, *in kernel*, for long enough to set up the attack.  userfaultfd is the most obvious way, but there are plenty. I suppose we could do the flush at context switch *and* entry.  I bet that performance still utterly sucks, though — on many workloads, this turns every entry into a full flush, and we already know exactly how much that sucks — it’s identical to KPTI without PCID.  (And yes, if we go this route, we need to merge this logic together — we shouldn’t write CR3 twice on entry).
>> 
>> I feel like this whole approach is misguided. ret2dir is not such a game changer that fixing it is worth huge slowdowns. I think all this effort should be spent on some kind of sensible CFI. For example, we should be able to mostly squash ret2anything by inserting a check that the high bits of RSP match the value on the top of the stack before any code that pops RSP.  On an FPO build, there aren’t all that many hot POP RSP instructions, I think.
>> 
>> (Actually, checking the bits is suboptimal. Do:
>> 
>> unsigned long offset = *rsp - rsp;
>> offset >>= THREAD_SHIFT;
>> if (unlikely(offset))
>> BUG();
>> POP RSP;
> 
> This is a neat trick, and definitely prevents going random places in
> the heap. But,
> 
>> This means that it’s also impossible to trick a function to return into a buffer that is on that function’s stack.)
> 
> Why is this true? All you're checking is that you can't shift the
> "location" of the stack. If you can inject stuff into a stack buffer,
> can't you just inject the right frame to return to your code as well,
> so you don't have to shift locations?
> 
> 

But the injected ROP payload will be *below* RSP, so you’ll need a gadget that can decrement RSP.  This makes the attack a good deal harder.

Something like RAP on top, or CET, will make this even harder.

> 
> Tycho


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-05 16:27               ` Andy Lutomirski
@ 2019-04-05 16:41                 ` Peter Zijlstra
  2019-04-05 17:35                 ` Khalid Aziz
  1 sibling, 0 replies; 70+ messages in thread
From: Peter Zijlstra @ 2019-04-05 16:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Thomas Gleixner, Khalid Aziz, Andy Lutomirski,
	Juerg Haefliger, Tycho Andersen, jsteckli, Andi Kleen,
	liran.alon, Kees Cook, Konrad Rzeszutek Wilk, deepa.srinivasan,
	chris hyser, Tyler Hicks, Woodhouse, David, Andrew Cooper,
	Jon Masters, Boris Ostrovsky, kanth.ghatraju, Joao Martins,
	Jim Mattson, pradeep.vincent, John Haxby, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Aaron Lu,
	Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, LKML, Linux-MM, LSM List, Khalid Aziz

On Fri, Apr 05, 2019 at 10:27:05AM -0600, Andy Lutomirski wrote:
> At the risk of asking stupid questions: we already have a mechanism
> for this: highmem.  Can we enable highmem on x86_64, maybe with some
> heuristics to make it work well?

That's what I said; but note that I'm still not convinced fixmap/highmem
is actually correct TLB-wise.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  2019-04-05 16:27               ` Andy Lutomirski
  2019-04-05 16:41                 ` Peter Zijlstra
@ 2019-04-05 17:35                 ` Khalid Aziz
  1 sibling, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-05 17:35 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: Thomas Gleixner, Andy Lutomirski, Juerg Haefliger,
	Tycho Andersen, jsteckli, Andi Kleen, liran.alon, Kees Cook,
	Konrad Rzeszutek Wilk, deepa.srinivasan, chris hyser,
	Tyler Hicks, Woodhouse, David, Andrew Cooper, Jon Masters,
	Boris Ostrovsky, kanth.ghatraju, Joao Martins, Jim Mattson,
	pradeep.vincent, John Haxby, Kirill A. Shutemov,
	Christoph Hellwig, steven.sistare, Laura Abbott, Peter Zijlstra,
	Aaron Lu, Andrew Morton, alexander.h.duyck, Amir Goldstein,
	Andrey Konovalov, aneesh.kumar, anthony.yznaga, Ard Biesheuvel,
	Arnd Bergmann, arunks, Ben Hutchings, Sebastian Andrzej Siewior,
	Borislav Petkov, brgl, Catalin Marinas, Jonathan Corbet, cpandya,
	Daniel Vetter, Dan Williams, Greg KH, Roman Gushchin,
	Johannes Weiner, H. Peter Anvin, Joonsoo Kim, James Morse,
	Jann Horn, Juergen Gross, Jiri Kosina, James Morris, Joe Perches,
	Souptick Joarder, Joerg Roedel, Keith Busch,
	Konstantin Khlebnikov, Logan Gunthorpe, marco.antonio.780,
	Mark Rutland, Mel Gorman, Michal Hocko, Michal Hocko,
	Mike Kravetz, Ingo Molnar, Michael S. Tsirkin, Marek Szyprowski,
	Nicholas Piggin, osalvador, Paul E. McKenney, pavel.tatashin,
	Randy Dunlap, richard.weiyang, Rik van Riel, David Rientjes,
	Robin Murphy, Steven Rostedt, Mike Rapoport,
	Sai Praneeth Prakhya, Serge E. Hallyn, Steve Capper,
	thymovanbeers, Vlastimil Babka, Will Deacon, Matthew Wilcox,
	yaojun8558363, Huang Ying, zhangshaokun, iommu, X86 ML,
	linux-arm-kernel, LKML, Linux-MM, LSM List, Khalid Aziz

On 4/5/19 10:27 AM, Andy Lutomirski wrote:
> At the risk of asking stupid questions: we already have a mechanism for this: highmem.  Can we enable highmem on x86_64, maybe with some heuristics to make it work well?
> 

I proposed using highmem infrastructure for xpfo in my cover letter as
well as in my earlier replies discussing redesigning
xpfo_kmap/xpfo_kunmap. So that sounds like a reasonable question to me
:) Looks like we might be getting to an agreement that highmem
infrastructure is a good thing to use here.
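
(To make "use the highmem infrastructure" concrete, a minimal sketch,
assuming XPFO keeps user-owned pages out of the direct map the way 32-bit
highmem keeps high pages out of lowmem; it uses the existing kmap_atomic()
API for illustration rather than the series' xpfo_kmap()/xpfo_kunmap():)

#include <linux/highmem.h>
#include <linux/types.h>

/* The kernel takes a short-lived mapping only while it touches the page. */
static void kernel_write_byte(struct page *page, size_t offset, u8 val)
{
	u8 *vaddr = kmap_atomic(page);	/* temporary, per-CPU mapping */

	vaddr[offset] = val;
	kunmap_atomic(vaddr);		/* unmapped again; no long-lived alias */
}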

--
Khalid


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership
  2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
                   ` (13 preceding siblings ...)
  2019-04-04 16:44 ` [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Nadav Amit
@ 2019-04-06  6:40 ` Jon Masters
  14 siblings, 0 replies; 70+ messages in thread
From: Jon Masters @ 2019-04-06  6:40 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, ak, liran.alon, keescook, akpm,
	konrad.wilk, deepa.srinivasan, chris.hyser, tyhicks, dwmw,
	andrew.cooper3, boris.ostrovsky, kanth.ghatraju, joao.m.martins,
	jmattson, pradeep.vincent, john.haxby, tglx, kirill.shutemov,
	hch, steven.sistare, labbott, luto, dave.hansen, peterz,
	aaron.lu, alexander.h.duyck, amir73il, andreyknvl, aneesh.kumar,
	anthony.yznaga, ard.biesheuvel, arnd, arunks, ben, bigeasy, bp,
	brgl, catalin.marinas, corbet, cpandya, daniel.vetter,
	dan.j.williams, gregkh, guro, hannes, hpa, iamjoonsoo.kim,
	james.morse, jannh, jgross, jkosina, jmorris, joe, jrdr.linux,
	jroedel, keith.busch, khlebnikov, logang, marco.antonio.780,
	mark.rutland, mgorman, mhocko, mhocko, mike.kravetz, mingo, mst,
	m.szyprowski, npiggin, osalvador, paulmck, pavel.tatashin,
	rdunlap, richard.weiyang, riel, rientjes, robin.murphy, rostedt,
	rppt, sai.praneeth.prakhya, serge, steve.capper, thymovanbeers,
	vbabka, will.deacon, willy, yang.shi, yaojun8558363, ying.huang,
	zhangshaokun, khalid, iommu, x86, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linux-security-module

Khalid,

Thanks for these patches. We will test them on x86 and investigate the Arm pieces highlighted.

Jon.

-- 
Computer Architect


> On Apr 4, 2019, at 00:37, Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
> This is another update to the work Juerg, Tycho and Julian have
> done on XPFO. After the last round of updates, we were seeing very
> significant performance penalties when stale TLB entries were
> flushed actively after an XPFO TLB update.  Benchmark for measuring
> performance is kernel build using parallel make. To get full
> protection from ret2dir attacks, we must flush stale TLB entries.
> Performance penalty from flushing stale TLB entries goes up as the
> number of cores goes up. On a desktop class machine with only 4
> cores, enabling TLB flush for stale entries causes system time for
> "make -j4" to go up by a factor of 2.61x but on a larger machine
> with 96 cores, system time with "make -j60" goes up by a factor of
> 26.37x!  I have been working on reducing this performance penalty.
> 
> I implemented two solutions to reduce performance penalty and that
> has had large impact. XPFO code flushes TLB every time a page is
> allocated to userspace. It does so by sending IPIs to all processors
> to flush TLB. Back to back allocations of pages to userspace on
> multiple processors results in a storm of IPIs.  Each one of these
> incoming IPIs is handled by a processor by flushing its TLB. To
> reduce this IPI storm, I have added a per CPU flag that can be set
> to tell a processor to flush its TLB. A processor checks this flag
> on every context switch. If the flag is set, it flushes its TLB and
> clears the flag. This allows for multiple TLB flush requests to a
> single CPU to be combined into a single request. A kernel TLB entry
> for a page that has been allocated to userspace is flushed on all
> processors unlike the previous version of this patch. A processor
> could hold a stale kernel TLB entry that was removed on another
> processor until the next context switch. A local userspace page
> allocation by the currently running process could force the TLB
> flush earlier for such entries.
> 
> The other solution reduces the number of TLB flushes required, by
> performing TLB flush for multiple pages at one time when pages are
> refilled on the per-cpu freelist. If the pages being added to
> the per-cpu freelist are marked for userspace allocation, TLB entries
> for these pages can be flushed upfront and the pages tagged as currently
> unmapped. When any such page is allocated to userspace, there is no
> need to perform a TLB flush at that time any more. This batching of
> TLB flushes reduces the performance impact further. Similarly, when
> these user pages are freed by userspace and added back to the per-cpu
> free list, they are left unmapped and tagged so. This further
> optimization reduced the performance impact from 1.32x to 1.28x for
> a 96-core server and from 1.31x to 1.27x for a 4-core desktop.
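
(A minimal sketch of the tagging described in the paragraph above, with
assumed helper names such as TestClearPageXpfoUnmapped() and
xpfo_flush_kernel_tlb(); it is an illustration, not code from the series:
if the batch refill already flushed and tagged the page, handing it to
userspace needs no further flush or IPI.)

#include <linux/mm.h>

static void xpfo_prepare_user_page(struct page *page)
{
	/* Already unmapped and flushed when the per-cpu list was refilled? */
	if (TestClearPageXpfoUnmapped(page))	/* assumed page-flag helper */
		return;

	xpfo_flush_kernel_tlb(page);		/* assumed per-page flush helper */
}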
> 
> I measured system time for parallel make with unmodified 4.20
> kernel, 4.20 with XPFO patches before these patches and then again
> after applying each of these patches. Here are the results:
> 
> Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
> make -j60 all
> 
> 5.0                    913.862s
> 5.0+this patch series            1165.259s    1.28x
> 
> 
> Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
> make -j4 all
> 
> 5.0                    610.642s
> 5.0+this patch series            773.075s    1.27x
> 
> Performance with this patch set is good enough to use these as
> starting point for further refinement before we merge it into main
> kernel, hence RFC.
> 
> I have restructured the patches in this version to separate out
> architecture-independent code. I folded much of the code
> improvement by Julian to not use page extension into patch 3.
> 
> What remains to be done beyond this patch series:
> 
> 1. Performance improvements: Ideas to explore - (1) kernel mappings
>   private to an mm, (2) Any others??
> 2. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
>   from Juerg. I dropped it for now since swiotlb code for ARM has
>   changed a lot since this patch was written. I could use help
>   from ARM experts on this.
> 3. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
>   CPUs" to other architectures besides x86.
> 4. Change kmap to not map the page back to physmap, instead map it
>   to a new va similar to what kmap_high does. Mapping page back
>   into physmap re-opens the ret2dir security for the duration of
>   kmap. All of the kmap_high and related code can be reused for this
>   but that will require restructuring that code so it can be built for
>   64-bits as well. Any objections to that?
> 
> ---------------------------------------------------------
> 
> Juerg Haefliger (6):
>  mm: Add support for eXclusive Page Frame Ownership (XPFO)
>  xpfo, x86: Add support for XPFO for x86-64
>  lkdtm: Add test for XPFO
>  arm64/mm: Add support for XPFO
>  swiotlb: Map the buffer if it was unmapped by XPFO
>  arm64/mm, xpfo: temporarily map dcache regions
> 
> Julian Stecklina (1):
>  xpfo, mm: optimize spinlock usage in xpfo_kunmap
> 
> Khalid Aziz (2):
>  xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
>  xpfo, mm: Optimize XPFO TLB flushes by batching them together
> 
> Tycho Andersen (4):
>  mm: add MAP_HUGETLB support to vm_mmap
>  x86: always set IF before oopsing from page fault
>  mm: add a user_virt_to_phys symbol
>  xpfo: add primitives for mapping underlying memory
> 
> .../admin-guide/kernel-parameters.txt         |   6 +
> arch/arm64/Kconfig                            |   1 +
> arch/arm64/mm/Makefile                        |   2 +
> arch/arm64/mm/flush.c                         |   7 +
> arch/arm64/mm/mmu.c                           |   2 +-
> arch/arm64/mm/xpfo.c                          |  66 ++++++
> arch/x86/Kconfig                              |   1 +
> arch/x86/include/asm/pgtable.h                |  26 +++
> arch/x86/include/asm/tlbflush.h               |   1 +
> arch/x86/mm/Makefile                          |   2 +
> arch/x86/mm/fault.c                           |   6 +
> arch/x86/mm/pageattr.c                        |  32 +--
> arch/x86/mm/tlb.c                             |  39 ++++
> arch/x86/mm/xpfo.c                            | 185 +++++++++++++++++
> drivers/misc/lkdtm/Makefile                   |   1 +
> drivers/misc/lkdtm/core.c                     |   3 +
> drivers/misc/lkdtm/lkdtm.h                    |   5 +
> drivers/misc/lkdtm/xpfo.c                     | 196 ++++++++++++++++++
> include/linux/highmem.h                       |  34 +--
> include/linux/mm.h                            |   2 +
> include/linux/mm_types.h                      |   8 +
> include/linux/page-flags.h                    |  23 +-
> include/linux/xpfo.h                          | 191 +++++++++++++++++
> include/trace/events/mmflags.h                |  10 +-
> kernel/dma/swiotlb.c                          |   3 +-
> mm/Makefile                                   |   1 +
> mm/compaction.c                               |   2 +-
> mm/internal.h                                 |   2 +-
> mm/mmap.c                                     |  19 +-
> mm/page_alloc.c                               |  19 +-
> mm/page_isolation.c                           |   2 +-
> mm/util.c                                     |  32 +++
> mm/xpfo.c                                     | 170 +++++++++++++++
> security/Kconfig                              |  27 +++
> 34 files changed, 1047 insertions(+), 79 deletions(-)
> create mode 100644 arch/arm64/mm/xpfo.c
> create mode 100644 arch/x86/mm/xpfo.c
> create mode 100644 drivers/misc/lkdtm/xpfo.c
> create mode 100644 include/linux/xpfo.h
> create mode 100644 mm/xpfo.c
> 
> -- 
> 2.17.1
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-03 17:34 ` [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO) Khalid Aziz
  2019-04-04  7:21   ` Peter Zijlstra
  2019-04-04  7:43   ` Peter Zijlstra
@ 2019-04-17 16:15   ` Ingo Molnar
  2019-04-17 16:49     ` Khalid Aziz
  2 siblings, 1 reply; 70+ messages in thread
From: Ingo Molnar @ 2019-04-17 16:15 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, keescook, konrad.wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, iommu, x86, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linux-security-module, Khalid Aziz,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, Borislav Petkov, H. Peter Anvin,
	Arjan van de Ven, Greg Kroah-Hartman


[ Sorry, had to trim the Cc: list from hell. Tried to keep all the 
  mailing lists and all x86 developers. ]

* Khalid Aziz <khalid.aziz@oracle.com> wrote:

> From: Juerg Haefliger <juerg.haefliger@canonical.com>
> 
> This patch adds basic support infrastructure for XPFO which protects 
> against 'ret2dir' kernel attacks. The basic idea is to enforce 
> exclusive ownership of page frames by either the kernel or userspace, 
> unless explicitly requested by the kernel. Whenever a page destined for 
> userspace is allocated, it is unmapped from physmap (the kernel's page 
> table). When such a page is reclaimed from userspace, it is mapped back 
> to physmap. Individual architectures can enable full XPFO support using 
> this infrastructure by supplying architecture specific pieces.

I have a higher level, meta question:

Is there any updated analysis outlining why this XPFO overhead would be 
required on x86-64 kernels running on SMAP/SMEP CPUs which should be all 
recent Intel and AMD CPUs, and with kernels that mark all direct kernel 
mappings as non-executable - which should be all reasonably modern 
kernels later than v4.0 or so?

I.e. the original motivation of the XPFO patches was to prevent execution 
of direct kernel mappings. Is this motivation still present if those 
mappings are non-executable?

(Sorry if this has been asked and answered in previous discussions.)

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 16:15   ` Ingo Molnar
@ 2019-04-17 16:49     ` Khalid Aziz
  2019-04-17 17:09       ` Ingo Molnar
  2019-05-01 14:49       ` Waiman Long
  0 siblings, 2 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-17 16:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: juergh, tycho, jsteckli, keescook, konrad.wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, iommu, x86, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linux-security-module, Khalid Aziz,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, Borislav Petkov, H. Peter Anvin,
	Arjan van de Ven, Greg Kroah-Hartman

On 4/17/19 10:15 AM, Ingo Molnar wrote:
> 
> [ Sorry, had to trim the Cc: list from hell. Tried to keep all the 
>   mailing lists and all x86 developers. ]
> 
> * Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>> From: Juerg Haefliger <juerg.haefliger@canonical.com>
>>
>> This patch adds basic support infrastructure for XPFO which protects 
>> against 'ret2dir' kernel attacks. The basic idea is to enforce 
>> exclusive ownership of page frames by either the kernel or userspace, 
>> unless explicitly requested by the kernel. Whenever a page destined for 
>> userspace is allocated, it is unmapped from physmap (the kernel's page 
>> table). When such a page is reclaimed from userspace, it is mapped back 
>> to physmap. Individual architectures can enable full XPFO support using 
>> this infrastructure by supplying architecture specific pieces.
> 
> I have a higher level, meta question:
> 
> Is there any updated analysis outlining why this XPFO overhead would be 
> required on x86-64 kernels running on SMAP/SMEP CPUs which should be all 
> recent Intel and AMD CPUs, and with kernels that mark all direct kernel 
> mappings as non-executable - which should be all reasonably modern 
> kernels later than v4.0 or so?
> 
> I.e. the original motivation of the XPFO patches was to prevent execution 
> of direct kernel mappings. Is this motivation still present if those 
> mappings are non-executable?
> 
> (Sorry if this has been asked and answered in previous discussions.)

Hi Ingo,

That is a good question. Because of the cost of XPFO, we have to be very
sure we need this protection. The paper from Vasileios, Michalis and
Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
and 6.2.

Thanks,
Khalid



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 16:49     ` Khalid Aziz
@ 2019-04-17 17:09       ` Ingo Molnar
  2019-04-17 17:19         ` Nadav Amit
  2019-04-17 17:33         ` Khalid Aziz
  2019-05-01 14:49       ` Waiman Long
  1 sibling, 2 replies; 70+ messages in thread
From: Ingo Molnar @ 2019-04-17 17:09 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: juergh, tycho, jsteckli, keescook, konrad.wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, iommu, x86, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linux-security-module, Khalid Aziz,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, Borislav Petkov, H. Peter Anvin,
	Arjan van de Ven, Greg Kroah-Hartman


* Khalid Aziz <khalid.aziz@oracle.com> wrote:

> > I.e. the original motivation of the XPFO patches was to prevent execution 
> > of direct kernel mappings. Is this motivation still present if those 
> > mappings are non-executable?
> > 
> > (Sorry if this has been asked and answered in previous discussions.)
> 
> Hi Ingo,
> 
> That is a good question. Because of the cost of XPFO, we have to be very
> sure we need this protection. The paper from Vasileios, Michalis and
> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
> and 6.2.

So it would be nice if you could generally summarize external arguments 
when defending a patchset, instead of me having to dig through a PDF 
which not only causes me to spend time that you probably already spent 
reading that PDF, but I might also interpret it incorrectly. ;-)

The PDF you cited says this:

  "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
   in many platforms, including x86-64.  In our example, the content of 
   user address 0xBEEF000 is also accessible through kernel address 
   0xFFFF87FF9F080000 as plain, executable code."

Is this actually true of modern x86-64 kernels? We've locked down W^X 
protections in general.

I.e. this conclusion:

  "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and 
   triggering the kernel to dereference it, an attacker can directly 
   execute shell code with kernel privileges."

... appears to be predicated on imperfect W^X protections on the x86-64 
kernel.

Do such holes exist on the latest x86-64 kernel? If yes, is there a 
reason to believe that these W^X holes cannot be fixed, or that any fix 
would be more expensive than XPFO?

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 17:09       ` Ingo Molnar
@ 2019-04-17 17:19         ` Nadav Amit
  2019-04-17 17:26           ` Ingo Molnar
  2019-04-17 17:33         ` Khalid Aziz
  1 sibling, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2019-04-17 17:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Khalid Aziz, juergh, Tycho Andersen, jsteckli, keescook,
	Konrad Rzeszutek Wilk, Juerg Haefliger, deepa.srinivasan,
	chris.hyser, tyhicks, David Woodhouse, Andrew Cooper, jcm,
	Boris Ostrovsky, iommu, X86 ML, linux-arm-kernel,
	open list:DOCUMENTATION, Linux List Kernel Mailing, Linux-MM,
	LSM List, Khalid Aziz, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Andy Lutomirski, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman

> On Apr 17, 2019, at 10:09 AM, Ingo Molnar <mingo@kernel.org> wrote:
> 
> 
> * Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>>> I.e. the original motivation of the XPFO patches was to prevent execution 
>>> of direct kernel mappings. Is this motivation still present if those 
>>> mappings are non-executable?
>>> 
>>> (Sorry if this has been asked and answered in previous discussions.)
>> 
>> Hi Ingo,
>> 
>> That is a good question. Because of the cost of XPFO, we have to be very
>> sure we need this protection. The paper from Vasileios, Michalis and
>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>> and 6.2.
> 
> So it would be nice if you could generally summarize external arguments 
> when defending a patchset, instead of me having to dig through a PDF 
> which not only causes me to spend time that you probably already spent 
> reading that PDF, but I might also interpret it incorrectly. ;-)
> 
> The PDF you cited says this:
> 
>  "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
>   in many platforms, including x86-64.  In our example, the content of 
>   user address 0xBEEF000 is also accessible through kernel address 
>   0xFFFF87FF9F080000 as plain, executable code."
> 
> Is this actually true of modern x86-64 kernels? We've locked down W^X 
> protections in general.

As I was curious, I looked at the paper. Here is a quote from it:

"In x86-64, however, the permissions of physmap are not in sane state.
Kernels up to v3.8.13 violate the W^X property by mapping the entire region
as “readable, writeable, and executable” (RWX)—only very recent kernels
(≥v3.9) use the more conservative RW mapping.”


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 17:19         ` Nadav Amit
@ 2019-04-17 17:26           ` Ingo Molnar
  2019-04-17 17:44             ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Ingo Molnar @ 2019-04-17 17:26 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Khalid Aziz, juergh, Tycho Andersen, jsteckli, keescook,
	Konrad Rzeszutek Wilk, Juerg Haefliger, deepa.srinivasan,
	chris.hyser, tyhicks, David Woodhouse, Andrew Cooper, jcm,
	Boris Ostrovsky, iommu, X86 ML, linux-arm-kernel,
	open list:DOCUMENTATION, Linux List Kernel Mailing, Linux-MM,
	LSM List, Khalid Aziz, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Andy Lutomirski, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman


* Nadav Amit <nadav.amit@gmail.com> wrote:

> > On Apr 17, 2019, at 10:09 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > 
> > 
> > * Khalid Aziz <khalid.aziz@oracle.com> wrote:
> > 
> >>> I.e. the original motivation of the XPFO patches was to prevent execution 
> >>> of direct kernel mappings. Is this motivation still present if those 
> >>> mappings are non-executable?
> >>> 
> >>> (Sorry if this has been asked and answered in previous discussions.)
> >> 
> >> Hi Ingo,
> >> 
> >> That is a good question. Because of the cost of XPFO, we have to be very
> >> sure we need this protection. The paper from Vasileios, Michalis and
> >> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
> >> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
> >> and 6.2.
> > 
> > So it would be nice if you could generally summarize external arguments 
> > when defending a patchset, instead of me having to dig through a PDF 
> > which not only causes me to spend time that you probably already spent 
> > reading that PDF, but I might also interpret it incorrectly. ;-)
> > 
> > The PDF you cited says this:
> > 
> >  "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
> >   in many platforms, including x86-64.  In our example, the content of 
> >   user address 0xBEEF000 is also accessible through kernel address 
> >   0xFFFF87FF9F080000 as plain, executable code."
> > 
> > Is this actually true of modern x86-64 kernels? We've locked down W^X 
> > protections in general.
> 
> As I was curious, I looked at the paper. Here is a quote from it:
> 
> "In x86-64, however, the permissions of physmap are not in sane state.
> Kernels up to v3.8.13 violate the W^X property by mapping the entire region
> as “readable, writeable, and executable” (RWX)—only very recent kernels
> (≥v3.9) use the more conservative RW mapping.”

But v3.8.13 is a 5+ years old kernel, it doesn't count as a "modern" 
kernel in any sense of the word. For any proposed patchset with 
significant complexity and non-trivial costs the benchmark version 
threshold is the "current upstream kernel".

So does that quote address my followup questions:

> Is this actually true of modern x86-64 kernels? We've locked down W^X
> protections in general.
>
> I.e. this conclusion:
>
>   "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and
>    triggering the kernel to dereference it, an attacker can directly
>    execute shell code with kernel privileges."
>
> ... appears to be predicated on imperfect W^X protections on the x86-64
> kernel.
>
> Do such holes exist on the latest x86-64 kernel? If yes, is there a
> reason to believe that these W^X holes cannot be fixed, or that any fix
> would be more expensive than XPFO?

?

What you are proposing here is an XPFO patch-set against recent kernels 
with significant runtime overhead, so my questions about the W^X holes 
are warranted.

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 17:09       ` Ingo Molnar
  2019-04-17 17:19         ` Nadav Amit
@ 2019-04-17 17:33         ` Khalid Aziz
  2019-04-17 19:49           ` Andy Lutomirski
  1 sibling, 1 reply; 70+ messages in thread
From: Khalid Aziz @ 2019-04-17 17:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: juergh, tycho, jsteckli, keescook, konrad.wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, iommu, x86, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linux-security-module, Khalid Aziz,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, Borislav Petkov, H. Peter Anvin,
	Arjan van de Ven, Greg Kroah-Hartman

On 4/17/19 11:09 AM, Ingo Molnar wrote:
> 
> * Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>>> I.e. the original motivation of the XPFO patches was to prevent execution 
>>> of direct kernel mappings. Is this motivation still present if those 
>>> mappings are non-executable?
>>>
>>> (Sorry if this has been asked and answered in previous discussions.)
>>
>> Hi Ingo,
>>
>> That is a good question. Because of the cost of XPFO, we have to be very
>> sure we need this protection. The paper from Vasileios, Michalis and
>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>> and 6.2.
> 
> So it would be nice if you could generally summarize external arguments 
> when defending a patchset, instead of me having to dig through a PDF 
> which not only causes me to spend time that you probably already spent 
> reading that PDF, but I might also interpret it incorrectly. ;-)

Sorry, you are right. Even though that paper explains it well, a summary
is always useful.

> 
> The PDF you cited says this:
> 
>   "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
>    in many platforms, including x86-64.  In our example, the content of 
>    user address 0xBEEF000 is also accessible through kernel address 
>    0xFFFF87FF9F080000 as plain, executable code."
> 
> Is this actually true of modern x86-64 kernels? We've locked down W^X 
> protections in general.
> 
> I.e. this conclusion:
> 
>   "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and 
>    triggering the kernel to dereference it, an attacker can directly 
>    execute shell code with kernel privileges."
> 
> ... appears to be predicated on imperfect W^X protections on the x86-64 
> kernel.
> 
> Do such holes exist on the latest x86-64 kernel? If yes, is there a 
> reason to believe that these W^X holes cannot be fixed, or that any fix 
> would be more expensive than XPFO?

Even if physmap is not executable, return-oriented programming (ROP) can
still be used to launch an attack. Instead of placing executable code at
user address 0xBEEF000, attacker can place an ROP payload there. kfptr
is then overwritten to point to a stack-pivoting gadget. Using the
physmap address aliasing, the ROP payload becomes kernel-mode stack. The
execution can then be hijacked upon execution of ret instruction. This
is a gist of the subsection titled "Non-executable physmap" under
section 6.2 and it looked convincing enough to me. If you have a
different take on this, I am very interested in your point of view.
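
(To make the aliasing concrete, an illustration only, using the generic
page_address() helper; it shows the second, kernel-visible address of a
user-controlled frame, not an exploit:)

#include <linux/mm.h>

static void *direct_map_alias(struct page *user_page)
{
	/* Direct-map ("physmap") address of the frame backing user_page. */
	return page_address(user_page);
}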

Thanks,
Khalid



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 17:26           ` Ingo Molnar
@ 2019-04-17 17:44             ` Nadav Amit
  2019-04-17 21:19               ` Thomas Gleixner
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2019-04-17 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Khalid Aziz, juergh, Tycho Andersen, jsteckli, keescook,
	Konrad Rzeszutek Wilk, Juerg Haefliger, deepa.srinivasan,
	chris.hyser, tyhicks, David Woodhouse, Andrew Cooper, jcm,
	Boris Ostrovsky, iommu, X86 ML, linux-arm-kernel,
	open list:DOCUMENTATION, Linux List Kernel Mailing, Linux-MM,
	LSM List, Khalid Aziz, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Andy Lutomirski, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman

> On Apr 17, 2019, at 10:26 AM, Ingo Molnar <mingo@kernel.org> wrote:
> 
> 
> * Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>>> On Apr 17, 2019, at 10:09 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>> 
>>> 
>>> * Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>> 
>>>>> I.e. the original motivation of the XPFO patches was to prevent execution 
>>>>> of direct kernel mappings. Is this motivation still present if those 
>>>>> mappings are non-executable?
>>>>> 
>>>>> (Sorry if this has been asked and answered in previous discussions.)
>>>> 
>>>> Hi Ingo,
>>>> 
>>>> That is a good question. Because of the cost of XPFO, we have to be very
>>>> sure we need this protection. The paper from Vasileios, Michalis and
>>>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>>>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>>>> and 6.2.
>>> 
>>> So it would be nice if you could generally summarize external arguments 
>>> when defending a patchset, instead of me having to dig through a PDF 
>>> which not only causes me to spend time that you probably already spent 
>>> reading that PDF, but I might also interpret it incorrectly. ;-)
>>> 
>>> The PDF you cited says this:
>>> 
>>> "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
>>>  in many platforms, including x86-64.  In our example, the content of 
>>>  user address 0xBEEF000 is also accessible through kernel address 
>>>  0xFFFF87FF9F080000 as plain, executable code."
>>> 
>>> Is this actually true of modern x86-64 kernels? We've locked down W^X 
>>> protections in general.
>> 
>> As I was curious, I looked at the paper. Here is a quote from it:
>> 
>> "In x86-64, however, the permissions of physmap are not in sane state.
>> Kernels up to v3.8.13 violate the W^X property by mapping the entire region
>> as “readable, writeable, and executable” (RWX)—only very recent kernels
>> (≥v3.9) use the more conservative RW mapping.”
> 
> But v3.8.13 is a 5+ years old kernel, it doesn't count as a "modern" 
> kernel in any sense of the word. For any proposed patchset with 
> significant complexity and non-trivial costs the benchmark version 
> threshold is the "current upstream kernel".
> 
> So does that quote address my followup questions:
> 
>> Is this actually true of modern x86-64 kernels? We've locked down W^X
>> protections in general.
>> 
>> I.e. this conclusion:
>> 
>>  "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and
>>   triggering the kernel to dereference it, an attacker can directly
>>   execute shell code with kernel privileges."
>> 
>> ... appears to be predicated on imperfect W^X protections on the x86-64
>> kernel.
>> 
>> Do such holes exist on the latest x86-64 kernel? If yes, is there a
>> reason to believe that these W^X holes cannot be fixed, or that any fix
>> would be more expensive than XPFO?
> 
> ?
> 
> What you are proposing here is a XPFO patch-set against recent kernels 
> with significant runtime overhead, so my questions about the W^X holes 
> are warranted.
> 

Just to clarify - I am an innocent bystander and have no part in this work.
I was just looking (again) at the paper, as I was curious due to the recent
patches that I sent that improve W^X protection.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 17:33         ` Khalid Aziz
@ 2019-04-17 19:49           ` Andy Lutomirski
  2019-04-17 19:52             ` Tycho Andersen
  2019-04-17 20:12             ` Khalid Aziz
  0 siblings, 2 replies; 70+ messages in thread
From: Andy Lutomirski @ 2019-04-17 19:49 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Ingo Molnar, Juerg Haefliger, Tycho Andersen, jsteckli,
	Kees Cook, Konrad Rzeszutek Wilk, Juerg Haefliger,
	deepa.srinivasan, chris hyser, Tyler Hicks, Woodhouse, David,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Andy Lutomirski, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman

On Wed, Apr 17, 2019 at 10:33 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 4/17/19 11:09 AM, Ingo Molnar wrote:
> >
> > * Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >
> >>> I.e. the original motivation of the XPFO patches was to prevent execution
> >>> of direct kernel mappings. Is this motivation still present if those
> >>> mappings are non-executable?
> >>>
> >>> (Sorry if this has been asked and answered in previous discussions.)
> >>
> >> Hi Ingo,
> >>
> >> That is a good question. Because of the cost of XPFO, we have to be very
> >> sure we need this protection. The paper from Vasileios, Michalis and
> >> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
> >> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
> >> and 6.2.
> >
> > So it would be nice if you could generally summarize external arguments
> > when defending a patchset, instead of me having to dig through a PDF
> > which not only causes me to spend time that you probably already spent
> > reading that PDF, but I might also interpret it incorrectly. ;-)
>
> Sorry, you are right. Even though that paper explains it well, a summary
> is always useful.
>
> >
> > The PDF you cited says this:
> >
> >   "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced
> >    in many platforms, including x86-64.  In our example, the content of
> >    user address 0xBEEF000 is also accessible through kernel address
> >    0xFFFF87FF9F080000 as plain, executable code."
> >
> > Is this actually true of modern x86-64 kernels? We've locked down W^X
> > protections in general.
> >
> > I.e. this conclusion:
> >
> >   "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and
> >    triggering the kernel to dereference it, an attacker can directly
> >    execute shell code with kernel privileges."
> >
> > ... appears to be predicated on imperfect W^X protections on the x86-64
> > kernel.
> >
> > Do such holes exist on the latest x86-64 kernel? If yes, is there a
> > reason to believe that these W^X holes cannot be fixed, or that any fix
> > would be more expensive than XPFO?
>
> Even if physmap is not executable, return-oriented programming (ROP) can
> still be used to launch an attack. Instead of placing executable code at
> user address 0xBEEF000, attacker can place an ROP payload there. kfptr
> is then overwritten to point to a stack-pivoting gadget. Using the
> physmap address aliasing, the ROP payload becomes kernel-mode stack. The
> execution can then be hijacked upon execution of ret instruction. This
> is a gist of the subsection titled "Non-executable physmap" under
> section 6.2 and it looked convincing enough to me. If you have a
> different take on this, I am very interested in your point of view.

My issue with all this is that XPFO is really very expensive.  I think
that, if we're going to seriously consider upstreaming expensive
exploit mitigations like this, we should consider others first, in
particular CFI techniques.  grsecurity's RAP would be a great start.
I also proposed using a gcc plugin (or upstream gcc feature) to add
some instrumentation to any code that pops RSP to verify that the
resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes.
This will make ROP quite a bit harder.
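
(A rough C rendering of that check, illustrative only: the helper name is
made up and a real plugin would emit the equivalent inline before each pop
into RSP. It verifies that the value about to be loaded into RSP stays
within THREAD_SIZE bytes above the current stack pointer.)

#include <linux/bug.h>
#include <linux/thread_info.h>	/* THREAD_SIZE */

static __always_inline void check_rsp_pivot(unsigned long new_rsp,
					    unsigned long cur_rsp)
{
	/* Unsigned delta: a pivot below the current stack wraps to a huge value. */
	if (unlikely(new_rsp - cur_rsp >= THREAD_SIZE))
		BUG();
}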


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 19:49           ` Andy Lutomirski
@ 2019-04-17 19:52             ` Tycho Andersen
  2019-04-17 20:12             ` Khalid Aziz
  1 sibling, 0 replies; 70+ messages in thread
From: Tycho Andersen @ 2019-04-17 19:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Khalid Aziz, Ingo Molnar, Juerg Haefliger, jsteckli, Kees Cook,
	Konrad Rzeszutek Wilk, Juerg Haefliger, deepa.srinivasan,
	chris hyser, Tyler Hicks, Woodhouse, David, Andrew Cooper,
	Jon Masters, Boris Ostrovsky, iommu, X86 ML, linux-arm-kernel,
	open list:DOCUMENTATION, LKML, Linux-MM, LSM List, Khalid Aziz,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman

On Wed, Apr 17, 2019 at 12:49:04PM -0700, Andy Lutomirski wrote:
> I also proposed using a gcc plugin (or upstream gcc feature) to add
> some instrumentation to any code that pops RSP to verify that the
> resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes.
> This will make ROP quite a bit harder.

I've been playing around with this for a bit, and hope to have
something to post Soon :)

Tycho


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 19:49           ` Andy Lutomirski
  2019-04-17 19:52             ` Tycho Andersen
@ 2019-04-17 20:12             ` Khalid Aziz
  1 sibling, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-17 20:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Juerg Haefliger, Tycho Andersen, jsteckli,
	Kees Cook, Konrad Rzeszutek Wilk, Juerg Haefliger,
	deepa.srinivasan, chris hyser, Tyler Hicks, Woodhouse, David,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION, LKML, Linux-MM,
	LSM List, Khalid Aziz, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Dave Hansen, Borislav Petkov,
	H. Peter Anvin, Arjan van de Ven, Greg Kroah-Hartman

On 4/17/19 1:49 PM, Andy Lutomirski wrote:
> On Wed, Apr 17, 2019 at 10:33 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 4/17/19 11:09 AM, Ingo Molnar wrote:
>>>
>>> * Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>
>>>>> I.e. the original motivation of the XPFO patches was to prevent execution
>>>>> of direct kernel mappings. Is this motivation still present if those
>>>>> mappings are non-executable?
>>>>>
>>>>> (Sorry if this has been asked and answered in previous discussions.)
>>>>
>>>> Hi Ingo,
>>>>
>>>> That is a good question. Because of the cost of XPFO, we have to be very
>>>> sure we need this protection. The paper from Vasileios, Michalis and
>>>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>>>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>>>> and 6.2.
>>>
>>> So it would be nice if you could generally summarize external arguments
>>> when defending a patchset, instead of me having to dig through a PDF
>>> which not only causes me to spend time that you probably already spent
>>> reading that PDF, but I might also interpret it incorrectly. ;-)
>>
>> Sorry, you are right. Even though that paper explains it well, a summary
>> is always useful.
>>
>>>
>>> The PDF you cited says this:
>>>
>>>   "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced
>>>    in many platforms, including x86-64.  In our example, the content of
>>>    user address 0xBEEF000 is also accessible through kernel address
>>>    0xFFFF87FF9F080000 as plain, executable code."
>>>
>>> Is this actually true of modern x86-64 kernels? We've locked down W^X
>>> protections in general.
>>>
>>> I.e. this conclusion:
>>>
>>>   "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and
>>>    triggering the kernel to dereference it, an attacker can directly
>>>    execute shell code with kernel privileges."
>>>
>>> ... appears to be predicated on imperfect W^X protections on the x86-64
>>> kernel.
>>>
>>> Do such holes exist on the latest x86-64 kernel? If yes, is there a
>>> reason to believe that these W^X holes cannot be fixed, or that any fix
>>> would be more expensive than XPFO?
>>
>> Even if physmap is not executable, return-oriented programming (ROP) can
>> still be used to launch an attack. Instead of placing executable code at
>> user address 0xBEEF000, attacker can place an ROP payload there. kfptr
>> is then overwritten to point to a stack-pivoting gadget. Using the
>> physmap address aliasing, the ROP payload becomes kernel-mode stack. The
>> execution can then be hijacked upon execution of ret instruction. This
>> is a gist of the subsection titled "Non-executable physmap" under
>> section 6.2 and it looked convincing enough to me. If you have a
>> different take on this, I am very interested in your point of view.
> 
> My issue with all this is that XPFO is really very expensive.  I think
> that, if we're going to seriously consider upstreaming expensive
> exploit mitigations like this, we should consider others first, in
> particular CFI techniques.  grsecurity's RAP would be a great start.
> I also proposed using a gcc plugin (or upstream gcc feature) to add
> some instrumentation to any code that pops RSP to verify that the
> resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes.
> This will make ROP quite a bit harder.
> 

Yes, XPFO is expensive. I have been able to reduce the overhead of XPFO
from 2537% to 28% (on large servers) but 28% is still quite significant.
Alternative mitigation techniques with lower impact would easily be more
acceptable as long as they provide the same level of protection. If we have
to go with XPFO, we will continue to look for more performance
improvement to bring that number down further from 28%. Hopefully what
Tycho is working on will yield better results. I am continuing to look
for improvements to XPFO in parallel.

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 17:44             ` Nadav Amit
@ 2019-04-17 21:19               ` Thomas Gleixner
  2019-04-17 23:18                 ` Linus Torvalds
  0 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2019-04-17 21:19 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Khalid Aziz, juergh, Tycho Andersen, jsteckli,
	keescook, Konrad Rzeszutek Wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, tyhicks, David Woodhouse,
	Andrew Cooper, jcm, Boris Ostrovsky, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List, Khalid Aziz,
	Linus Torvalds, Andrew Morton, Andy Lutomirski, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman


On Wed, 17 Apr 2019, Nadav Amit wrote:
> > On Apr 17, 2019, at 10:26 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >> As I was curious, I looked at the paper. Here is a quote from it:
> >> 
> >> "In x86-64, however, the permissions of physmap are not in sane state.
> >> Kernels up to v3.8.13 violate the W^X property by mapping the entire region
> >> as “readable, writeable, and executable” (RWX)—only very recent kernels
> >> (≥v3.9) use the more conservative RW mapping.”
> > 
> > But v3.8.13 is a 5+ years old kernel, it doesn't count as a "modern" 
> > kernel in any sense of the word. For any proposed patchset with 
> > significant complexity and non-trivial costs the benchmark version 
> > threshold is the "current upstream kernel".
> > 
> > So does that quote address my followup questions:
> > 
> >> Is this actually true of modern x86-64 kernels? We've locked down W^X
> >> protections in general.
> >> 
> >> I.e. this conclusion:
> >> 
> >>  "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and
> >>   triggering the kernel to dereference it, an attacker can directly
> >>   execute shell code with kernel privileges."
> >> 
> >> ... appears to be predicated on imperfect W^X protections on the x86-64
> >> kernel.
> >> 
> >> Do such holes exist on the latest x86-64 kernel? If yes, is there a
> >> reason to believe that these W^X holes cannot be fixed, or that any fix
> >> would be more expensive than XPFO?
> > 
> > ?
> > 
> > What you are proposing here is an XPFO patch-set against recent kernels 
> > with significant runtime overhead, so my questions about the W^X holes 
> > are warranted.
> > 
> 
> Just to clarify - I am an innocent bystander and have no part in this work.
> I was just looking (again) at the paper, as I was curious due to the recent
> patches that I sent that improve W^X protection.

It's not necessarily a W+X issue. The user space text is mapped in the
kernel as well and even if it is mapped RX then this can happen. So any
kernel mappings of user space text need to be mapped NX!

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 21:19               ` Thomas Gleixner
@ 2019-04-17 23:18                 ` Linus Torvalds
  2019-04-17 23:42                   ` Thomas Gleixner
  0 siblings, 1 reply; 70+ messages in thread
From: Linus Torvalds @ 2019-04-17 23:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nadav Amit, Ingo Molnar, Khalid Aziz, juergh, Tycho Andersen,
	jsteckli, keescook, Konrad Rzeszutek Wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, tyhicks, David Woodhouse,
	Andrew Cooper, jcm, Boris Ostrovsky, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List, Khalid Aziz,
	Andrew Morton, Andy Lutomirski, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman


On Wed, Apr 17, 2019, 14:20 Thomas Gleixner <tglx@linutronix.de> wrote:

>
> It's not necessarily a W+X issue. The user space text is mapped in the
> kernel as well and even if it is mapped RX then this can happen. So any
> kernel mappings of user space text need to be mapped NX!


With SMEP, user space pages are always NX.

I really think SM[AE]P is something we can already take for granted. People
who have old CPUs without it are simply not serious about security anyway.
There is no point in saying "we can do it badly in software".

       Linus

>


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 23:18                 ` Linus Torvalds
@ 2019-04-17 23:42                   ` Thomas Gleixner
  2019-04-17 23:52                     ` Linus Torvalds
  0 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2019-04-17 23:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Ingo Molnar, Khalid Aziz, juergh, Tycho Andersen,
	jsteckli, keescook, Konrad Rzeszutek Wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, tyhicks, David Woodhouse,
	Andrew Cooper, jcm, Boris Ostrovsky, iommu, X86 ML,
	linux-arm-kernel, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List, Khalid Aziz,
	Andrew Morton, Andy Lutomirski, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman

On Wed, 17 Apr 2019, Linus Torvalds wrote:

> On Wed, Apr 17, 2019, 14:20 Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> >
> > It's not necessarily a W+X issue. The user space text is mapped in the
> > kernel as well and even if it is mapped RX then this can happen. So any
> > kernel mappings of user space text need to be mapped NX!
> 
> With SMEP, user space pages are always NX.

We talk past each other. The user space page in the ring3 valid virtual
address space (non negative) is of course protected by SMEP.

The attack utilizes the kernel linear mapping of the physical
memory. I.e. user space address 0x43210 has a kernel equivalent at
0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid
kernel address and that is mapped X --> game over. SMEP does not help
there.

From the top of my head I'd say this is a non issue as those kernel address
space mappings _should_ be NX, but we got bitten by _should_ in the past:)
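To make the aliasing concrete, here is a minimal sketch (the wrapper name
is made up purely for illustration; page_to_phys() and __va() are the
usual primitives):

#include <linux/mm.h>
#include <asm/io.h>

/*
 * Every page that backs a user mapping (e.g. the page behind user VA
 * 0x43210) also has a kernel alias in the linear map at __va(phys).
 * If that alias were ever mapped X, SMEP would not stop the kernel
 * from executing attacker-controlled bytes through it.
 */
static void *physmap_alias(struct page *page)
{
	return __va(page_to_phys(page));
}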

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 23:42                   ` Thomas Gleixner
@ 2019-04-17 23:52                     ` Linus Torvalds
  2019-04-18  4:41                       ` Andy Lutomirski
  2019-04-18  6:14                       ` Thomas Gleixner
  0 siblings, 2 replies; 70+ messages in thread
From: Linus Torvalds @ 2019-04-17 23:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nadav Amit, Ingo Molnar, Khalid Aziz, juergh, Tycho Andersen,
	jsteckli, Kees Cook, Konrad Rzeszutek Wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, Tyler Hicks, David Woodhouse,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, iommu, X86 ML,
	linux-alpha@vger.kernel.org, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List, Khalid Aziz,
	Andrew Morton, Andy Lutomirski, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman

On Wed, Apr 17, 2019 at 4:42 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, 17 Apr 2019, Linus Torvalds wrote:
>
> > With SMEP, user space pages are always NX.
>
> We talk past each other. The user space page in the ring3 valid virtual
> address space (non negative) is of course protected by SMEP.
>
> The attack utilizes the kernel linear mapping of the physical
> memory. I.e. user space address 0x43210 has a kernel equivalent at
> 0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid
> kernel address and that is mapped X --> game over. SMEP does not help
> there.

Oh, agreed.

But that would simply be a kernel bug. We should only map kernel pages
executable when we have kernel code in them, and we should certainly
not allow those pages to be mapped writably in user space.

That kind of "executable in kernel, writable in user" would be a
horrendous and major bug.

So I think it's a non-issue.

> From the top of my head I'd say this is a non issue as those kernel address
> space mappings _should_ be NX, but we got bitten by _should_ in the past:)

I do agree that bugs can happen, obviously, and we might have missed something.

But in the context of XPFO, I would argue (*very* strongly) that the
likelihood of the above kind of bug is absolutely *minuscule* compared
to the likelihood that we'd have something wrong in the software
implementation of XPFO.

So if the argument is "we might have bugs in software", then I think
that's an argument _against_ XPFO rather than for it.

                Linus


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 23:52                     ` Linus Torvalds
@ 2019-04-18  4:41                       ` Andy Lutomirski
  2019-04-18  5:41                         ` Kees Cook
  2019-04-18  6:14                       ` Thomas Gleixner
  1 sibling, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2019-04-18  4:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Nadav Amit, Ingo Molnar, Khalid Aziz,
	Juerg Haefliger, Tycho Andersen, jsteckli, Kees Cook,
	Konrad Rzeszutek Wilk, Juerg Haefliger, deepa.srinivasan,
	chris hyser, Tyler Hicks, David Woodhouse, Andrew Cooper,
	Jon Masters, Boris Ostrovsky, iommu, X86 ML,
	linux-alpha@vger.kernel.org, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List, Khalid Aziz,
	Andrew Morton, Andy Lutomirski, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman

On Wed, Apr 17, 2019 at 5:00 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Apr 17, 2019 at 4:42 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Wed, 17 Apr 2019, Linus Torvalds wrote:
> >
> > > With SMEP, user space pages are always NX.
> >
> > We talk past each other. The user space page in the ring3 valid virtual
> > address space (non negative) is of course protected by SMEP.
> >
> > The attack utilizes the kernel linear mapping of the physical
> > memory. I.e. user space address 0x43210 has a kernel equivalent at
> > 0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid
> > kernel address and that is mapped X --> game over. SMEP does not help
> > there.
>
> Oh, agreed.
>
> But that would simply be a kernel bug. We should only map kernel pages
> executable when we have kernel code in them, and we should certainly
> not allow those pages to be mapped writably in user space.
>
> That kind of "executable in kernel, writable in user" would be a
> horrendous and major bug.
>
> So I think it's a non-issue.
>
> > From the top of my head I'd say this is a non issue as those kernel address
> > space mappings _should_ be NX, but we got bitten by _should_ in the past:)
>
> I do agree that bugs can happen, obviously, and we might have missed something.
>
> But in the context of XPFO, I would argue (*very* strongly) that the
> likelihood of the above kind of bug is absolutely *minuscule* compared
> to the likelihood that we'd have something wrong in the software
> implementation of XPFO.
>
> So if the argument is "we might have bugs in software", then I think
> that's an argument _against_ XPFO rather than for it.
>

I don't think this type of NX goof was ever the argument for XPFO.
The main argument I've heard is that a malicious user program writes a
ROP payload into user memory (regular anonymous user memory) and then
gets the kernel to erroneously set RSP (*not* RIP) to point there.

I find this argument fairly weak for a couple reasons.  First, if
we're worried about this, let's do in-kernel CFI, not XPFO, to
mitigate it.  Second, I don't see why the exact same attack can't be
done using, say, page cache, and unless I'm missing something, XPFO
doesn't protect page cache.  Or network buffers, or pipe buffers, etc.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-18  4:41                       ` Andy Lutomirski
@ 2019-04-18  5:41                         ` Kees Cook
  2019-04-18 14:34                           ` Khalid Aziz
  0 siblings, 1 reply; 70+ messages in thread
From: Kees Cook @ 2019-04-18  5:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Thomas Gleixner, Nadav Amit, Ingo Molnar,
	Khalid Aziz, Juerg Haefliger, Tycho Andersen, Julian Stecklina,
	Kees Cook, Konrad Rzeszutek Wilk, Juerg Haefliger,
	deepa.srinivasan, chris hyser, Tyler Hicks, David Woodhouse,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, iommu, X86 ML,
	linux-alpha@vger.kernel.org, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List, Khalid Aziz,
	Andrew Morton, Peter Zijlstra, Dave Hansen, Borislav Petkov,
	H. Peter Anvin, Arjan van de Ven, Greg Kroah-Hartman

On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote:
> I don't think this type of NX goof was ever the argument for XPFO.
> The main argument I've heard is that a malicious user program writes a
> ROP payload into user memory (regular anonymous user memory) and then
> gets the kernel to erroneously set RSP (*not* RIP) to point there.

Well, more than just ROP. Any of the various attack primitives. The NX
stuff is about moving RIP: SMEP-bypassing. But there is still basic
SMAP-bypassing for putting a malicious structure in userspace and
having the kernel access it via the linear mapping, etc.

> I find this argument fairly weak for a couple reasons.  First, if
> we're worried about this, let's do in-kernel CFI, not XPFO, to

CFI is getting much closer. Getting the kernel happy under Clang, LTO,
and CFI is under active development. (It's functional for arm64
already, and pieces have been getting upstreamed.)

> mitigate it.  Second, I don't see why the exact same attack can't be
> done using, say, page cache, and unless I'm missing something, XPFO
> doesn't protect page cache.  Or network buffers, or pipe buffers, etc.

My understanding is that it's much easier to feel out the linear
mapping address than for the others. But yes, all of those same attack
primitives are possible in other memory areas (though most are NX),
and plenty of exploits have done such things.

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 23:52                     ` Linus Torvalds
  2019-04-18  4:41                       ` Andy Lutomirski
@ 2019-04-18  6:14                       ` Thomas Gleixner
  1 sibling, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2019-04-18  6:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Ingo Molnar, Khalid Aziz, juergh, Tycho Andersen,
	jsteckli, Kees Cook, Konrad Rzeszutek Wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, Tyler Hicks, David Woodhouse,
	Andrew Cooper, Jon Masters, Boris Ostrovsky, iommu, X86 ML,
	linux-alpha@vger.kernel.org, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List, Khalid Aziz,
	Andrew Morton, Andy Lutomirski, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman

On Wed, 17 Apr 2019, Linus Torvalds wrote:
> On Wed, Apr 17, 2019 at 4:42 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 17 Apr 2019, Linus Torvalds wrote:
> > > With SMEP, user space pages are always NX.
> >
> > We talk past each other. The user space page in the ring3 valid virtual
> > address space (non negative) is of course protected by SMEP.
> >
> > The attack utilizes the kernel linear mapping of the physical
> > memory. I.e. user space address 0x43210 has a kernel equivalent at
> > 0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid
> > kernel address and that is mapped X --> game over. SMEP does not help
> > there.
> 
> Oh, agreed.
> 
> But that would simply be a kernel bug. We should only map kernel pages
> executable when we have kernel code in them, and we should certainly
> not allow those pages to be mapped writably in user space.
> 
> That kind of "executable in kernel, writable in user" would be a
> horrendous and major bug.
> 
> So I think it's a non-issue.

Pretty much.

> > From the top of my head I'd say this is a non issue as those kernel address
> > space mappings _should_ be NX, but we got bitten by _should_ in the past:)
> 
> I do agree that bugs can happen, obviously, and we might have missed something.
>
> But in the context of XPFO, I would argue (*very* strongly) that the
> likelihood of the above kind of bug is absolutely *minuscule* compared
> to the likelihood that we'd have something wrong in the software
> implementation of XPFO.
> 
> So if the argument is "we might have bugs in software", then I think
> that's an argument _against_ XPFO rather than for it.

No argument from my side. We better spend time to make sure that a bogus
kernel side X mapping is caught, like we catch other things.
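As a rough illustration only (x86 sketch; the helper name is invented,
lookup_address() is the existing page-table walker), the kind of check
meant here:

#include <linux/mm.h>
#include <asm/io.h>
#include <asm/pgtable.h>

/*
 * Sketch: assert that the linear-map alias of a page is NX.  Something
 * along these lines could back a debug check that catches a bogus
 * kernel-side X mapping, similar to the existing W+X checks.
 */
static bool physmap_alias_is_nx(struct page *page)
{
	unsigned int level;
	pte_t *pte = lookup_address((unsigned long)__va(page_to_phys(page)),
				    &level);

	return pte && (pte_flags(*pte) & _PAGE_NX);
}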

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-18  5:41                         ` Kees Cook
@ 2019-04-18 14:34                           ` Khalid Aziz
  2019-04-22 19:30                             ` Khalid Aziz
  2019-04-22 22:23                             ` Kees Cook
  0 siblings, 2 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-18 14:34 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski
  Cc: Linus Torvalds, Thomas Gleixner, Nadav Amit, Ingo Molnar,
	Juerg Haefliger, Tycho Andersen, Julian Stecklina,
	Konrad Rzeszutek Wilk, Juerg Haefliger, deepa.srinivasan,
	chris hyser, Tyler Hicks, David Woodhouse, Andrew Cooper,
	Jon Masters, Boris Ostrovsky, iommu, X86 ML,
	linux-alpha@vger.kernel.org, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List, Khalid Aziz,
	Andrew Morton, Peter Zijlstra, Dave Hansen, Borislav Petkov,
	H. Peter Anvin, Arjan van de Ven, Greg Kroah-Hartman

On 4/17/19 11:41 PM, Kees Cook wrote:
> On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote:
>> I don't think this type of NX goof was ever the argument for XPFO.
>> The main argument I've heard is that a malicious user program writes a
>> ROP payload into user memory (regular anonymous user memory) and then
>> gets the kernel to erroneously set RSP (*not* RIP) to point there.
> 
> Well, more than just ROP. Any of the various attack primitives. The NX
> stuff is about moving RIP: SMEP-bypassing. But there is still basic
> SMAP-bypassing for putting a malicious structure in userspace and
> having the kernel access it via the linear mapping, etc.
> 
>> I find this argument fairly weak for a couple reasons.  First, if
>> we're worried about this, let's do in-kernel CFI, not XPFO, to
> 
> CFI is getting much closer. Getting the kernel happy under Clang, LTO,
> and CFI is under active development. (It's functional for arm64
> already, and pieces have been getting upstreamed.)
> 

CFI theoretically offers protection with fairly low overhead. I have not
played much with CFI in Clang. I agree with Linus that the probability of
bugs in the XPFO implementation itself is a cause of concern. If CFI in
Clang can provide us the same level of protection as XPFO does, I
wouldn't want to push for an expensive change like XPFO.

If Clang/CFI can't get us there for an extended period of time, does it
make sense to continue to poke at XPFO?

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-18 14:34                           ` Khalid Aziz
@ 2019-04-22 19:30                             ` Khalid Aziz
  2019-04-22 22:23                             ` Kees Cook
  1 sibling, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-04-22 19:30 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski, Linus Torvalds
  Cc: Thomas Gleixner, Nadav Amit, Ingo Molnar, Juerg Haefliger,
	Tycho Andersen, Julian Stecklina, Konrad Rzeszutek Wilk,
	Juerg Haefliger, deepa.srinivasan, chris hyser, Tyler Hicks,
	David Woodhouse, Andrew Cooper, Jon Masters, Boris Ostrovsky,
	iommu, X86 ML, linux-alpha@vger.kernel.org,
	open list:DOCUMENTATION, Linux List Kernel Mailing, Linux-MM,
	LSM List, Khalid Aziz, Andrew Morton, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, H. Peter Anvin, Arjan van de Ven,
	Greg Kroah-Hartman

On 4/18/19 8:34 AM, Khalid Aziz wrote:
> On 4/17/19 11:41 PM, Kees Cook wrote:
>> On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote:
>>> I don't think this type of NX goof was ever the argument for XPFO.
>>> The main argument I've heard is that a malicious user program writes a
>>> ROP payload into user memory (regular anonymous user memory) and then
>>> gets the kernel to erroneously set RSP (*not* RIP) to point there.
>>
>> Well, more than just ROP. Any of the various attack primitives. The NX
>> stuff is about moving RIP: SMEP-bypassing. But there is still basic
>> SMAP-bypassing for putting a malicious structure in userspace and
>> having the kernel access it via the linear mapping, etc.
>>
>>> I find this argument fairly weak for a couple reasons.  First, if
>>> we're worried about this, let's do in-kernel CFI, not XPFO, to
>>
>> CFI is getting much closer. Getting the kernel happy under Clang, LTO,
>> and CFI is under active development. (It's functional for arm64
>> already, and pieces have been getting upstreamed.)
>>
> 
> CFI theoretically offers protection with fairly low overhead. I have not
> played much with CFI in Clang. I agree with Linus that the probability of
> bugs in the XPFO implementation itself is a cause of concern. If CFI in
> Clang can provide us the same level of protection as XPFO does, I
> wouldn't want to push for an expensive change like XPFO.
> 
> If Clang/CFI can't get us there for an extended period of time, does it
> make sense to continue to poke at XPFO?

Any feedback on the continued XPFO effort? If it makes sense to have XPFO
available as a solution for the ret2dir issue in case Clang/CFI does not
work out, I will continue to refine it.

--
Khalid


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-18 14:34                           ` Khalid Aziz
  2019-04-22 19:30                             ` Khalid Aziz
@ 2019-04-22 22:23                             ` Kees Cook
  1 sibling, 0 replies; 70+ messages in thread
From: Kees Cook @ 2019-04-22 22:23 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andy Lutomirski, Linus Torvalds, Thomas Gleixner, Nadav Amit,
	Ingo Molnar, Juerg Haefliger, Tycho Andersen, Julian Stecklina,
	Konrad Rzeszutek Wilk, Juerg Haefliger, deepa.srinivasan,
	chris hyser, Tyler Hicks, David Woodhouse, Andrew Cooper,
	Jon Masters, Boris Ostrovsky, iommu, X86 ML,
	linux-alpha@vger.kernel.org, open list:DOCUMENTATION,
	Linux List Kernel Mailing, Linux-MM, LSM List, Khalid Aziz,
	Andrew Morton, Peter Zijlstra, Dave Hansen, Borislav Petkov,
	H. Peter Anvin, Arjan van de Ven, Greg Kroah-Hartman

On Thu, Apr 18, 2019 at 7:35 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 4/17/19 11:41 PM, Kees Cook wrote:
> > On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote:
> >> I don't think this type of NX goof was ever the argument for XPFO.
> >> The main argument I've heard is that a malicious user program writes a
> >> ROP payload into user memory (regular anonymous user memory) and then
> >> gets the kernel to erroneously set RSP (*not* RIP) to point there.
> >
> > Well, more than just ROP. Any of the various attack primitives. The NX
> > stuff is about moving RIP: SMEP-bypassing. But there is still basic
> > SMAP-bypassing for putting a malicious structure in userspace and
> > having the kernel access it via the linear mapping, etc.
> >
> >> I find this argument fairly weak for a couple reasons.  First, if
> >> we're worried about this, let's do in-kernel CFI, not XPFO, to
> >
> > CFI is getting much closer. Getting the kernel happy under Clang, LTO,
> > and CFI is under active development. (It's functional for arm64
> > already, and pieces have been getting upstreamed.)
> >
>
> CFI theoretically offers protection with fairly low overhead. I have not
> played much with CFI in Clang. I agree with Linus that the probability of
> bugs in the XPFO implementation itself is a cause of concern. If CFI in
> Clang can provide us the same level of protection as XPFO does, I
> wouldn't want to push for an expensive change like XPFO.
>
> If Clang/CFI can't get us there for an extended period of time, does it
> make sense to continue to poke at XPFO?

Well, I think CFI will certainly vastly narrow the execution paths
available to an attacker, but what I continue to see XPFO useful for
is stopping attacks that need to locate something in memory. (i.e. not
ret2dir but, like, read2dir.) It's arguable that such attacks would
just use heap, stack, etc to hold such things, but the linear map
remains relatively easy to find/target. But I agree: the protection is
getting more and more narrow (especially with CFI coming down the
pipe), and if it's still a 28% hit, that's not going to be tenable for
anyone but the truly paranoid. :)

All that said, there isn't a very good backward-edge CFI protection
(i.e. ROP defense) on x86 in Clang. The forward-edge looks decent, but
requires LTO, etc. My point is there is still a long path to gaining
CFI in upstream.

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-04-17 16:49     ` Khalid Aziz
  2019-04-17 17:09       ` Ingo Molnar
@ 2019-05-01 14:49       ` Waiman Long
  2019-05-01 15:18         ` Khalid Aziz
  1 sibling, 1 reply; 70+ messages in thread
From: Waiman Long @ 2019-05-01 14:49 UTC (permalink / raw)
  To: Khalid Aziz, Ingo Molnar
  Cc: juergh, tycho, jsteckli, keescook, konrad.wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, iommu, x86, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linux-security-module, Khalid Aziz,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, Borislav Petkov, H. Peter Anvin,
	Arjan van de Ven, Greg Kroah-Hartman

On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 858b6c0b9a15..9b36da94760e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2997,6 +2997,12 @@
>
>      nox2apic    [X86-64,APIC] Do not enable x2APIC mode.
>
> +    noxpfo        [XPFO] Disable eXclusive Page Frame Ownership (XPFO)
> +            when CONFIG_XPFO is on. Physical pages mapped into
> +            user applications will also be mapped in the
> +            kernel's address space as if CONFIG_XPFO was not
> +            enabled.
> +
>      cpu0_hotplug    [X86] Turn on CPU0 hotplug feature when
>              CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
>              Some features depend on CPU0. Known dependencies are:

Given the big performance impact that XPFO can have, it should be off by
default when configured. Instead, the xpfo option should be used to
enable it.
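As a rough sketch of such a default-off switch (the xpfo_enabled flag and
handler name below are only illustrative, not the patch's actual code):

#include <linux/init.h>
#include <linux/types.h>

/* XPFO stays off unless "xpfo" is passed on the kernel command line. */
static bool xpfo_enabled;

static int __init xpfo_param(char *str)
{
	xpfo_enabled = true;
	return 0;
}
early_param("xpfo", xpfo_param);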

Cheers,
Longman


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
  2019-05-01 14:49       ` Waiman Long
@ 2019-05-01 15:18         ` Khalid Aziz
  0 siblings, 0 replies; 70+ messages in thread
From: Khalid Aziz @ 2019-05-01 15:18 UTC (permalink / raw)
  To: Waiman Long, Ingo Molnar
  Cc: juergh, tycho, jsteckli, keescook, konrad.wilk, Juerg Haefliger,
	deepa.srinivasan, chris.hyser, tyhicks, dwmw, andrew.cooper3,
	jcm, boris.ostrovsky, iommu, x86, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linux-security-module, Khalid Aziz,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, Borislav Petkov, H. Peter Anvin,
	Arjan van de Ven, Greg Kroah-Hartman

On 5/1/19 8:49 AM, Waiman Long wrote:
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 858b6c0b9a15..9b36da94760e 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -2997,6 +2997,12 @@
>>
>>       nox2apic    [X86-64,APIC] Do not enable x2APIC mode.
>>
>> +    noxpfo        [XPFO] Disable eXclusive Page Frame Ownership (XPFO)
>> +            when CONFIG_XPFO is on. Physical pages mapped into
>> +            user applications will also be mapped in the
>> +            kernel's address space as if CONFIG_XPFO was not
>> +            enabled.
>> +
>>       cpu0_hotplug    [X86] Turn on CPU0 hotplug feature when
>>               CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
>>               Some features depend on CPU0. Known dependencies are:
> 
> Given the big performance impact that XPFO can have, it should be off by
> default when configured. Instead, the xpfo option should be used to
> enable it.

Agreed. I plan to disable it by default in the next version of the
patch. This is likely to end up being a feature for extremely
security-conscious folks only, unless I or someone else comes up with a
further significant performance boost.

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2019-05-01 15:21 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-03 17:34 [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 01/13] mm: add MAP_HUGETLB support to vm_mmap Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault Khalid Aziz
2019-04-04  0:12   ` Andy Lutomirski
2019-04-04  1:42     ` Tycho Andersen
2019-04-04  4:12       ` Andy Lutomirski
2019-04-04 15:47         ` Tycho Andersen
2019-04-04 16:23           ` Sebastian Andrzej Siewior
2019-04-04 16:28           ` Thomas Gleixner
2019-04-04 17:11             ` Andy Lutomirski
2019-04-03 17:34 ` [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO) Khalid Aziz
2019-04-04  7:21   ` Peter Zijlstra
2019-04-04  9:25     ` Peter Zijlstra
2019-04-04 14:48     ` Tycho Andersen
2019-04-04  7:43   ` Peter Zijlstra
2019-04-04 15:15     ` Khalid Aziz
2019-04-04 17:01       ` Peter Zijlstra
2019-04-17 16:15   ` Ingo Molnar
2019-04-17 16:49     ` Khalid Aziz
2019-04-17 17:09       ` Ingo Molnar
2019-04-17 17:19         ` Nadav Amit
2019-04-17 17:26           ` Ingo Molnar
2019-04-17 17:44             ` Nadav Amit
2019-04-17 21:19               ` Thomas Gleixner
2019-04-17 23:18                 ` Linus Torvalds
2019-04-17 23:42                   ` Thomas Gleixner
2019-04-17 23:52                     ` Linus Torvalds
2019-04-18  4:41                       ` Andy Lutomirski
2019-04-18  5:41                         ` Kees Cook
2019-04-18 14:34                           ` Khalid Aziz
2019-04-22 19:30                             ` Khalid Aziz
2019-04-22 22:23                             ` Kees Cook
2019-04-18  6:14                       ` Thomas Gleixner
2019-04-17 17:33         ` Khalid Aziz
2019-04-17 19:49           ` Andy Lutomirski
2019-04-17 19:52             ` Tycho Andersen
2019-04-17 20:12             ` Khalid Aziz
2019-05-01 14:49       ` Waiman Long
2019-05-01 15:18         ` Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64 Khalid Aziz
2019-04-04  7:52   ` Peter Zijlstra
2019-04-04 15:40     ` Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 05/13] mm: add a user_virt_to_phys symbol Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 06/13] lkdtm: Add test for XPFO Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 07/13] arm64/mm: Add support " Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 08/13] swiotlb: Map the buffer if it was unmapped by XPFO Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 09/13] xpfo: add primitives for mapping underlying memory Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 10/13] arm64/mm, xpfo: temporarily map dcache regions Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap Khalid Aziz
2019-04-04  7:56   ` Peter Zijlstra
2019-04-04 16:06     ` Khalid Aziz
2019-04-03 17:34 ` [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only) Khalid Aziz
2019-04-04  4:10   ` Andy Lutomirski
     [not found]     ` <91f1dbce-332e-25d1-15f6-0e9cfc8b797b@oracle.com>
2019-04-05  7:17       ` Thomas Gleixner
2019-04-05 14:44         ` Dave Hansen
2019-04-05 15:24           ` Andy Lutomirski
2019-04-05 15:56             ` Tycho Andersen
2019-04-05 16:32               ` Andy Lutomirski
2019-04-05 15:56             ` Khalid Aziz
2019-04-05 16:01             ` Dave Hansen
2019-04-05 16:27               ` Andy Lutomirski
2019-04-05 16:41                 ` Peter Zijlstra
2019-04-05 17:35                 ` Khalid Aziz
2019-04-05 15:44           ` Khalid Aziz
2019-04-05 15:24       ` Andy Lutomirski
2019-04-04  8:18   ` Peter Zijlstra
2019-04-03 17:34 ` [RFC PATCH v9 13/13] xpfo, mm: Optimize XPFO TLB flushes by batching them together Khalid Aziz
2019-04-04 16:44 ` [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership Nadav Amit
2019-04-04 17:18   ` Khalid Aziz
2019-04-06  6:40 ` Jon Masters

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).