* [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
@ 2019-10-27 10:17 Mike Rapoport
  2019-10-27 10:17 ` Mike Rapoport
                   ` (3 more replies)
  0 siblings, 4 replies; 60+ messages in thread
From: Mike Rapoport @ 2019-10-27 10:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Mike Rapoport,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

The patch below aims to allow applications to create mappings that have
pages visible only to the owning process. Such mappings could be used to
store secrets so that these secrets are visible neither to other
processes nor to the kernel.

I've only tested the basic functionality; the changes should still be
verified against THP/migration/compaction. Yet, I'd appreciate early
feedback.

Mike Rapoport (1):
  mm: add MAP_EXCLUSIVE to create exclusive user mappings

 arch/x86/mm/fault.c                    | 14 ++++++++++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  9 +++++++
 include/linux/page-flags.h             |  7 +++++
 include/linux/page_excl.h              | 49 ++++++++++++++++++++++++++++++++++
 include/trace/events/mmflags.h         |  9 ++++++-
 include/uapi/asm-generic/mman-common.h |  1 +
 kernel/fork.c                          |  3 ++-
 mm/Kconfig                             |  3 +++
 mm/gup.c                               |  8 ++++++
 mm/memory.c                            |  3 +++
 mm/mmap.c                              | 16 +++++++++++
 mm/page_alloc.c                        |  5 ++++
 13 files changed, 126 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/page_excl.h

-- 
2.7.4


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 10:17 [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Mike Rapoport
@ 2019-10-27 10:17 ` Mike Rapoport
  2019-10-28 12:31   ` Kirill A. Shutemov
                     ` (4 more replies)
  2019-10-27 10:30 ` Florian Weimer
                   ` (2 subsequent siblings)
  3 siblings, 5 replies; 60+ messages in thread
From: Mike Rapoport @ 2019-10-27 10:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Mike Rapoport,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

The mappings created with MAP_EXCLUSIVE are visible only in the context of
the owning process and can be used by applications to store secret
information that will be visible neither to other processes nor to the
kernel.

The pages in these mappings are removed from the kernel direct map and
marked with PG_user_exclusive flag. When the exclusive area is unmapped,
the pages are mapped back into the direct map.

The MAP_EXCLUSIVE flag implies MAP_POPULATE and MAP_LOCKED.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/x86/mm/fault.c                    | 14 ++++++++++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  9 +++++++
 include/linux/page-flags.h             |  7 +++++
 include/linux/page_excl.h              | 49 ++++++++++++++++++++++++++++++++++
 include/trace/events/mmflags.h         |  9 ++++++-
 include/uapi/asm-generic/mman-common.h |  1 +
 kernel/fork.c                          |  3 ++-
 mm/Kconfig                             |  3 +++
 mm/gup.c                               |  8 ++++++
 mm/memory.c                            |  3 +++
 mm/mmap.c                              | 16 +++++++++++
 mm/page_alloc.c                        |  5 ++++
 13 files changed, 126 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/page_excl.h

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9ceacd1..8f73a75 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -17,6 +17,7 @@
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
+#include <linux/page_excl.h>		/* page_is_user_exclusive()	*/
 #include <linux/mm_types.h>
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
@@ -1218,6 +1219,13 @@ static int fault_in_kernel_space(unsigned long address)
 	return address >= TASK_SIZE_MAX;
 }
 
+static bool fault_in_user_exclusive_page(unsigned long address)
+{
+	struct page *page = virt_to_page(address);
+
+	return page_is_user_exclusive(page);
+}
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space.  Might get called for faults that originate from *code* that
@@ -1261,6 +1269,12 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	if (spurious_kernel_fault(hw_error_code, address))
 		return;
 
+	/* FIXME: warn and handle gracefully */
+	if (unlikely(fault_in_user_exclusive_page(address))) {
+		pr_err("page fault in user exclusive page at %lx", address);
+		force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address);
+	}
+
 	/* kprobes don't want to hook the spurious faults: */
 	if (kprobe_page_fault(regs, X86_TRAP_PF))
 		return;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9442631..99e14d1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -655,6 +655,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_X86_INTEL_MPX
 		[ilog2(VM_MPX)]		= "mp",
 #endif
+		[ilog2(VM_EXCLUSIVE)]	= "xl",
 		[ilog2(VM_LOCKED)]	= "lo",
 		[ilog2(VM_IO)]		= "io",
 		[ilog2(VM_SEQ_READ)]	= "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc29227..9c43375 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -298,11 +298,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_NONE
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+# define VM_EXCLUSIVE	VM_HIGH_ARCH_5
+#else
+# define VM_EXCLUSIVE	VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
@@ -2594,6 +2602,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
+#define FOLL_EXCLUSIVE	0x40000	/* mapping is exclusive to owning mm */
 
 /*
  * NOTE on FOLL_LONGTERM:
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb88..32d0aee 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -131,6 +131,9 @@ enum pageflags {
 	PG_young,
 	PG_idle,
 #endif
+#if defined(CONFIG_EXCLUSIVE_USER_PAGES)
+	PG_user_exclusive,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -431,6 +434,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#ifdef CONFIG_EXCLUSIVE_USER_PAGES
+__PAGEFLAG(UserExclusive, user_exclusive, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/page_excl.h b/include/linux/page_excl.h
new file mode 100644
index 0000000..b7ea3ce
--- /dev/null
+++ b/include/linux/page_excl.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MM_PAGE_EXCLUSIVE_H
+#define _LINUX_MM_PAGE_EXCLUSIVE_H
+
+#include <linux/bitops.h>
+#include <linux/page-flags.h>
+#include <linux/set_memory.h>
+#include <asm/tlbflush.h>
+
+#ifdef CONFIG_EXCLUSIVE_USER_PAGES
+
+static inline bool page_is_user_exclusive(struct page *page)
+{
+	return PageUserExclusive(page);
+}
+
+static inline void __set_page_user_exclusive(struct page *page)
+{
+	unsigned long addr = (unsigned long)page_address(page);
+
+	__SetPageUserExclusive(page);
+	set_direct_map_invalid_noflush(page);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+}
+
+static inline void __clear_page_user_exclusive(struct page *page)
+{
+	__ClearPageUserExclusive(page);
+	set_direct_map_default_noflush(page);
+}
+
+#else /* !CONFIG_EXCLUSIVE_USER_PAGES */
+
+static inline bool page_is_user_exclusive(struct page *page)
+{
+	return false;
+}
+
+static inline void __set_page_user_exclusive(struct page *page)
+{
+}
+
+static inline void __clear_page_user_exclusive(struct page *page)
+{
+}
+
+#endif /* CONFIG_EXCLUSIVE_USER_PAGES */
+
+#endif /* _LINUX_MM_PAGE_EXCLUSIVE_H */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d4..2d3c14a 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#ifdef CONFIG_EXCLUSIVE_USER_PAGES
+#define IF_HAVE_PG_USER_EXCLUSIVE(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_USER_EXCLUSIVE(flag,string)
+#endif
+
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
 	{1UL << PG_waiters,		"waiters"	},		\
@@ -105,7 +111,8 @@ IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	)		\
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
 IF_HAVE_PG_IDLE(PG_young,		"young"		)		\
-IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
+IF_HAVE_PG_IDLE(PG_idle,		"idle"		)		\
+IF_HAVE_PG_USER_EXCLUSIVE(PG_user_exclusive,	"user_exclusive" )
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index c160a53..bf8f23e 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -27,6 +27,7 @@
 #define MAP_HUGETLB		0x040000	/* create a huge page mapping */
 #define MAP_SYNC		0x080000 /* perform synchronous page faults for the mapping */
 #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_EXCLUSIVE		0x200000	/* mapping is exclusive to the owning task; the pages in it are dropped from the direct map */
 
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
diff --git a/kernel/fork.c b/kernel/fork.c
index bcdf531..d63adec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -518,7 +518,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;
 
-		if (mpnt->vm_flags & VM_DONTCOPY) {
+		if (mpnt->vm_flags & VM_DONTCOPY ||
+		    mpnt->vm_flags & VM_EXCLUSIVE) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
 		}
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a..9d60141 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,7 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config EXCLUSIVE_USER_PAGES
+        def_bool ARCH_USES_HIGH_VMA_FLAGS && ARCH_HAS_SET_DIRECT_MAP
+
 endmenu
diff --git a/mm/gup.c b/mm/gup.c
index 8f236a3..a98c0ca0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -17,6 +17,7 @@
 #include <linux/migrate.h>
 #include <linux/mm_inline.h>
 #include <linux/sched/mm.h>
+#include <linux/page_excl.h>
 
 #include <asm/mmu_context.h>
 #include <asm/pgtable.h>
@@ -868,6 +869,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			ret = PTR_ERR(page);
 			goto out;
 		}
+
+		if (gup_flags & FOLL_EXCLUSIVE)
+			__set_page_user_exclusive(page);
+
 		if (pages) {
 			pages[i] = page;
 			flush_anon_page(vma, page, start);
@@ -1216,6 +1221,9 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
 		gup_flags |= FOLL_FORCE;
 
+	if (vma->vm_flags & VM_EXCLUSIVE)
+		gup_flags |= FOLL_EXCLUSIVE;
+
 	/*
 	 * We made sure addr is within a VMA, so the following will
 	 * not result in a stack expansion that recurses back here.
diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a..a4b4cff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -71,6 +71,7 @@
 #include <linux/dax.h>
 #include <linux/oom.h>
 #include <linux/numa.h>
+#include <linux/page_excl.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -1062,6 +1063,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
+			if (page_is_user_exclusive(page))
+				__clear_page_user_exclusive(page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
 				force_flush = 1;
 				addr += PAGE_SIZE;
diff --git a/mm/mmap.c b/mm/mmap.c
index a7d8c84..d8cc82d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1574,6 +1574,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
+	if (flags & MAP_EXCLUSIVE)
+		vm_flags |= VM_EXCLUSIVE;
+
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
@@ -1591,6 +1594,19 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 
 	addr = untagged_addr(addr);
 
+	if (flags & MAP_EXCLUSIVE) {
+		/*
+		 * MAP_EXCLUSIVE is only supported for private
+		 * anonymous memory not backed by hugetlbfs
+		 */
+		if (!(flags & MAP_ANONYMOUS) || !(flags & MAP_PRIVATE) ||
+		    (flags & MAP_HUGETLB))
+			return -EINVAL;
+
+		/* and implies MAP_LOCKED and MAP_POPULATE */
+		flags |= (MAP_LOCKED | MAP_POPULATE);
+	}
+
 	if (!(flags & MAP_ANONYMOUS)) {
 		audit_mmap_fd(fd, flags);
 		file = fget(fd);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ecc3dba..2f1de9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
 #include <linux/psi.h>
+#include <linux/page_excl.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -4779,6 +4780,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 		page = NULL;
 	}
 
+	/* FIXME: should not happen! */
+	if (WARN_ON(page_is_user_exclusive(page)))
+		__clear_page_user_exclusive(page);
+
 	trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);
 
 	return page;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 10:17 [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Mike Rapoport
  2019-10-27 10:17 ` Mike Rapoport
@ 2019-10-27 10:30 ` Florian Weimer
  2019-10-27 11:00   ` Mike Rapoport
  2019-10-28 20:44 ` Andy Lutomirski
  2019-10-29 11:25 ` [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Reshetova, Elena
  3 siblings, 1 reply; 60+ messages in thread
From: Florian Weimer @ 2019-10-27 10:30 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

* Mike Rapoport:

> The patch below aims to allow applications to create mappins that have
> pages visible only to the owning process. Such mappings could be used to
> store secrets so that these secrets are not visible neither to other
> processes nor to the kernel.

How is this expected to interact with CRIU?

> I've only tested the basic functionality, the changes should be verified
> against THP/migration/compaction. Yet, I'd appreciate early feedback.

What are the expected semantics for VM migration?  Should it fail?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 10:30 ` Florian Weimer
@ 2019-10-27 11:00   ` Mike Rapoport
  2019-10-28 20:23     ` Florian Weimer
  0 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2019-10-27 11:00 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On October 27, 2019 12:30:21 PM GMT+02:00, Florian Weimer <fw@deneb.enyo.de> wrote:
>* Mike Rapoport:
>
>> The patch below aims to allow applications to create mappins that have
>> pages visible only to the owning process. Such mappings could be used to
>> store secrets so that these secrets are not visible neither to other
>> processes nor to the kernel.
>
>How is this expected to interact with CRIU?

CRIU dumps the memory contents using parasite code from inside the dumpee's address space, so it would work the same way as for the other mappings. Of course, at restore time the exclusive mapping should be recreated with the appropriate flags.

>> I've only tested the basic functionality, the changes should be verified
>> against THP/migration/compaction. Yet, I'd appreciate early feedback.
>
>What are the expected semantics for VM migration?  Should it fail?

I don't quite follow. If qemu used such mappings, it would be able to transfer them during live migration.

-- 
Sincerely yours,
Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 10:17 ` Mike Rapoport
@ 2019-10-28 12:31   ` Kirill A. Shutemov
  2019-10-28 13:00     ` Mike Rapoport
  2019-10-28 14:55   ` David Hildenbrand
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 60+ messages in thread
From: Kirill A. Shutemov @ 2019-10-28 12:31 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will not be visible not only to other processes but to the
> kernel as well.
> 
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.

I probably blind, but I don't see where you manipulate direct map...

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 12:31   ` Kirill A. Shutemov
@ 2019-10-28 13:00     ` Mike Rapoport
  2019-10-28 13:16       ` Kirill A. Shutemov
  0 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2019-10-28 13:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > the owning process and can be used by applications to store secret
> > information that will not be visible not only to other processes but to the
> > kernel as well.
> > 
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
> 
> I probably blind, but I don't see where you manipulate direct map...

__get_user_pages() calls __set_page_user_exclusive(), which in turn calls
set_direct_map_invalid_noflush(), which makes the page not present.
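
For reference, the relevant helper from the patch (abridged, comment added
here):

	static inline void __set_page_user_exclusive(struct page *page)
	{
		unsigned long addr = (unsigned long)page_address(page);

		__SetPageUserExclusive(page);
		/* drop the 4k page from the kernel linear mapping */
		set_direct_map_invalid_noflush(page);
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	}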
 
> -- 
>  Kirill A. Shutemov

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 13:00     ` Mike Rapoport
@ 2019-10-28 13:16       ` Kirill A. Shutemov
  2019-10-28 13:55         ` Peter Zijlstra
                           ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Kirill A. Shutemov @ 2019-10-28 13:16 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > 
> > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > the owning process and can be used by applications to store secret
> > > information that will not be visible not only to other processes but to the
> > > kernel as well.
> > > 
> > > The pages in these mappings are removed from the kernel direct map and
> > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > the pages are mapped back into the direct map.
> > 
> > I probably blind, but I don't see where you manipulate direct map...
> 
> __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> set_direct_map_invalid_noflush() that makes the page not present.

Ah. okay.

I think active use of this feature will lead to performance degradation of
the system over time.

Setting a single 4k page non-present in the direct mapping will require
splitting the 2M or 1G page we usually map the direct mapping with. And
it's a one-way road: we don't have any mechanism to map the memory with a
huge page again after the application has freed the page.

It might be okay if all these pages cluster together, but I don't think we
have a way to achieve that easily.
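
For a sense of scale (assuming x86-64 defaults): if the 4k page lives in a
region covered by a 1G direct-map entry, that entry must first be split into
512 2M entries and the affected 2M entry into 512 4k entries. When the
application later frees the page, the 4k PTE is made present again, but the
surrounding gigabyte of the direct map stays fragmented into small pages.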

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 13:16       ` Kirill A. Shutemov
@ 2019-10-28 13:55         ` Peter Zijlstra
  2019-10-28 19:59           ` Edgecombe, Rick P
  2019-10-29  5:43         ` Dan Williams
  2019-10-29  7:08         ` Christopher Lameter
  2 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2019-10-28 13:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote:

> I think active use of this feature will lead to performance degradation of
> the system with time.
> 
> Setting a single 4k page non-present in the direct mapping will require
> splitting 2M or 1G page we usually map direct mapping with. And it's one
> way road. We don't have any mechanism to map the memory with huge page
> again after the application has freed the page.

Right, we recently had a 'bug' where ftrace triggered something like
this and facebook ran into it as a performance regression. So yes, this
is a real concern.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 10:17 ` Mike Rapoport
  2019-10-28 12:31   ` Kirill A. Shutemov
@ 2019-10-28 14:55   ` David Hildenbrand
  2019-10-28 17:12   ` Dave Hansen
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 60+ messages in thread
From: David Hildenbrand @ 2019-10-28 14:55 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api, linux-mm, x86, Mike Rapoport

On 27.10.19 11:17, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will not be visible not only to other processes but to the
> kernel as well.
> 
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.
> 
> The MAP_EXCLUSIVE flag implies MAP_POPULATE and MAP_LOCKED.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>   arch/x86/mm/fault.c                    | 14 ++++++++++
>   fs/proc/task_mmu.c                     |  1 +
>   include/linux/mm.h                     |  9 +++++++
>   include/linux/page-flags.h             |  7 +++++
>   include/linux/page_excl.h              | 49 ++++++++++++++++++++++++++++++++++
>   include/trace/events/mmflags.h         |  9 ++++++-
>   include/uapi/asm-generic/mman-common.h |  1 +
>   kernel/fork.c                          |  3 ++-
>   mm/Kconfig                             |  3 +++
>   mm/gup.c                               |  8 ++++++
>   mm/memory.c                            |  3 +++
>   mm/mmap.c                              | 16 +++++++++++
>   mm/page_alloc.c                        |  5 ++++
>   13 files changed, 126 insertions(+), 2 deletions(-)
>   create mode 100644 include/linux/page_excl.h
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9ceacd1..8f73a75 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -17,6 +17,7 @@
>   #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
>   #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
>   #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
> +#include <linux/page_excl.h>		/* page_is_user_exclusive()	*/
>   #include <linux/mm_types.h>
>   
>   #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
> @@ -1218,6 +1219,13 @@ static int fault_in_kernel_space(unsigned long address)
>   	return address >= TASK_SIZE_MAX;
>   }
>   
> +static bool fault_in_user_exclusive_page(unsigned long address)
> +{
> +	struct page *page = virt_to_page(address);
> +
> +	return page_is_user_exclusive(page);
> +}
> +
>   /*
>    * Called for all faults where 'address' is part of the kernel address
>    * space.  Might get called for faults that originate from *code* that
> @@ -1261,6 +1269,12 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>   	if (spurious_kernel_fault(hw_error_code, address))
>   		return;
>   
> +	/* FIXME: warn and handle gracefully */
> +	if (unlikely(fault_in_user_exclusive_page(address))) {
> +		pr_err("page fault in user exclusive page at %lx", address);
> +		force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address);
> +	}
> +
>   	/* kprobes don't want to hook the spurious faults: */
>   	if (kprobe_page_fault(regs, X86_TRAP_PF))
>   		return;
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 9442631..99e14d1 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -655,6 +655,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>   #ifdef CONFIG_X86_INTEL_MPX
>   		[ilog2(VM_MPX)]		= "mp",
>   #endif
> +		[ilog2(VM_EXCLUSIVE)]	= "xl",
>   		[ilog2(VM_LOCKED)]	= "lo",
>   		[ilog2(VM_IO)]		= "io",
>   		[ilog2(VM_SEQ_READ)]	= "sr",
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cc29227..9c43375 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -298,11 +298,13 @@ extern unsigned int kobjsize(const void *objp);
>   #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
>   #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
>   #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
>   #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
>   #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
>   #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
>   #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
>   #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
>   #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>   
>   #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
>   # define VM_MPX		VM_NONE
>   #endif
>   
> +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
> +# define VM_EXCLUSIVE	VM_HIGH_ARCH_5
> +#else
> +# define VM_EXCLUSIVE	VM_NONE
> +#endif
> +
>   #ifndef VM_GROWSUP
>   # define VM_GROWSUP	VM_NONE
>   #endif
> @@ -2594,6 +2602,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>   #define FOLL_ANON	0x8000	/* don't do file mappings */
>   #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
>   #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
> +#define FOLL_EXCLUSIVE	0x40000	/* mapping is exclusive to owning mm */
>   
>   /*
>    * NOTE on FOLL_LONGTERM:
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f91cb88..32d0aee 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -131,6 +131,9 @@ enum pageflags {
>   	PG_young,
>   	PG_idle,
>   #endif
> +#if defined(CONFIG_EXCLUSIVE_USER_PAGES)
> +	PG_user_exclusive,
> +#endif

Last time I tried to introduce a new page flag I learned that this is
very much frowned upon. The best you can usually do is reuse another
flag - if it is valid in that context.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 10:17 ` Mike Rapoport
  2019-10-28 12:31   ` Kirill A. Shutemov
  2019-10-28 14:55   ` David Hildenbrand
@ 2019-10-28 17:12   ` Dave Hansen
  2019-10-28 17:32     ` Sean Christopherson
                       ` (2 more replies)
  2019-10-28 18:02   ` Andy Lutomirski
  2019-10-29 11:02   ` David Hildenbrand
  4 siblings, 3 replies; 60+ messages in thread
From: Dave Hansen @ 2019-10-28 17:12 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api, linux-mm, x86, Mike Rapoport

On 10/27/19 3:17 AM, Mike Rapoport wrote:
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.

This looks fun.  It's certainly simple.

But the description is not really calling out the pros and cons very
well.  I'm also not sure that folks will use an interface like this that
requires up-front, special code to do an allocation instead of something
like madvise().  That's why protection keys ended up the way they did: if
you do this as an mmap() replacement, you need to modify all *allocators*
to be enabled for this.  If you do it mprotect()-style, you can apply it
to existing allocations.
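
For comparison, an mprotect()/madvise()-style interface would look roughly
like this from userspace (MADV_EXCLUSIVE is hypothetical, just to show the
shape of the API):

	/* any existing, page-aligned allocation */
	void *buf = aligned_alloc(4096, size);
	/* hypothetical: drop these pages from the direct map */
	madvise(buf, size, MADV_EXCLUSIVE);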

Some other random thoughts:

 * The page flag is probably not a good idea.  It would probably be
   better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
   into the slow path.
 * This really stops being "normal" memory.  You can't do futexes on it,
   can't splice it.  Probably need a more fleshed-out list of
   incompatible features.
 * As Kirill noted, each 4k page ends up with a potential 1GB "blast
   radius" of demoted pages in the direct map.  Not cool.  This is
   probably a non-starter as it stands.
 * The global TLB flushes are going to eat you alive.  They probably
   border on a DoS on larger systems.
 * Do we really want this user interface to dictate the kernel
   implementation?  In other words, do we really want MAP_EXCLUSIVE,
   or do we want MAP_SECRET?  One tells the kernel what to *do*, the
   other tells the kernel what the memory *IS*.
 * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
   Persistent Memory, where the kernel direct map is a liability in some
   way.  We probably need some kind of overall, architected solution
   rather than five or ten things all poking at the direct map.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 17:12   ` Dave Hansen
@ 2019-10-28 17:32     ` Sean Christopherson
  2019-10-28 18:08     ` Matthew Wilcox
  2019-10-29  9:19     ` Mike Rapoport
  2 siblings, 0 replies; 60+ messages in thread
From: Sean Christopherson @ 2019-10-28 17:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> On 10/27/19 3:17 AM, Mike Rapoport wrote:
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
> 
> This looks fun.  It's certainly simple.
> 
> But, the description is not really calling out the pros and cons very
> well.  I'm also not sure that folks will use an interface like this that
> requires up-front, special code to do an allocation instead of something
> like madvise().  That's why protection keys ended up the way it did: if
> you do this as a mmap() replacement, you need to modify all *allocators*
> to be enabled for this.  If you do it with mprotect()-style, you can
> apply it to existing allocations.
> 
> Some other random thoughts:
> 
>  * The page flag is probably not a good idea.  It would be probably
>    better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
>    into the slow path.
>  * This really stops being "normal" memory.  You can't do futexes on it,
>    cant splice it.  Probably need a more fleshed-out list of
>    incompatible features.
>  * As Kirill noted, each 4k page ends up with a potential 1GB "blast
>    radius" of demoted pages in the direct map.  Not cool.  This is
>    probably a non-starter as it stands.
>  * The global TLB flushes are going to eat you alive.  They probably
>    border on a DoS on larger systems.
>  * Do we really want this user interface to dictate the kernel
>    implementation?  In other words, do we really want MAP_EXCLUSIVE,
>    or do we want MAP_SECRET?  One tells the kernel what do *do*, the
>    other tells the kernel what the memory *IS*.

If we go that route, maybe MAP_USER_SECRET so that there's wiggle room in
the event that there are different secret keepers that require different
implementations in the kernel?   E.g. MAP_GUEST_SECRET for a KVM guest to
take the userspace VMM (Qemu) out of the TCB, i.e. the mapping would be
accessible by the kernel (or just KVM?) and the KVM guest, but not
userspace.

>  * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
>    Persistent Memory, where the kernel direct map is a liability in some
>    way.  We probably need some kind of overall, architected solution
>    rather than five or ten things all poking at the direct map.
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 10:17 ` Mike Rapoport
                     ` (2 preceding siblings ...)
  2019-10-28 17:12   ` Dave Hansen
@ 2019-10-28 18:02   ` Andy Lutomirski
  2019-10-29 11:02   ` David Hildenbrand
  4 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2019-10-28 18:02 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: LKML, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Linux API, Linux-MM, X86 ML, Mike Rapoport

On Sun, Oct 27, 2019 at 3:17 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will not be visible not only to other processes but to the
> kernel as well.
>
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.
>
> The MAP_EXCLUSIVE flag implies MAP_POPULATE and MAP_LOCKED.
>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/x86/mm/fault.c                    | 14 ++++++++++
>  fs/proc/task_mmu.c                     |  1 +
>  include/linux/mm.h                     |  9 +++++++
>  include/linux/page-flags.h             |  7 +++++
>  include/linux/page_excl.h              | 49 ++++++++++++++++++++++++++++++++++
>  include/trace/events/mmflags.h         |  9 ++++++-
>  include/uapi/asm-generic/mman-common.h |  1 +
>  kernel/fork.c                          |  3 ++-
>  mm/Kconfig                             |  3 +++
>  mm/gup.c                               |  8 ++++++
>  mm/memory.c                            |  3 +++
>  mm/mmap.c                              | 16 +++++++++++
>  mm/page_alloc.c                        |  5 ++++
>  13 files changed, 126 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/page_excl.h
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9ceacd1..8f73a75 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -17,6 +17,7 @@
>  #include <linux/context_tracking.h>    /* exception_enter(), ...       */
>  #include <linux/uaccess.h>             /* faulthandler_disabled()      */
>  #include <linux/efi.h>                 /* efi_recover_from_page_fault()*/
> +#include <linux/page_excl.h>           /* page_is_user_exclusive()     */
>  #include <linux/mm_types.h>
>
>  #include <asm/cpufeature.h>            /* boot_cpu_has, ...            */
> @@ -1218,6 +1219,13 @@ static int fault_in_kernel_space(unsigned long address)
>         return address >= TASK_SIZE_MAX;
>  }
>
> +static bool fault_in_user_exclusive_page(unsigned long address)
> +{
> +       struct page *page = virt_to_page(address);
> +
> +       return page_is_user_exclusive(page);
> +}
> +
>  /*
>   * Called for all faults where 'address' is part of the kernel address
>   * space.  Might get called for faults that originate from *code* that
> @@ -1261,6 +1269,12 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>         if (spurious_kernel_fault(hw_error_code, address))
>                 return;
>
> +       /* FIXME: warn and handle gracefully */
> +       if (unlikely(fault_in_user_exclusive_page(address))) {
> +               pr_err("page fault in user exclusive page at %lx", address);
> +               force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address);
> +       }

Sending a signal here is not a reasonable thing to do in response to
an unexpected kernel fault.  You need to OOPS.  Printing a nice
message would be nice.

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 17:12   ` Dave Hansen
  2019-10-28 17:32     ` Sean Christopherson
@ 2019-10-28 18:08     ` Matthew Wilcox
  2019-10-29  9:28       ` Mike Rapoport
  2019-10-29  9:19     ` Mike Rapoport
  2 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2019-10-28 18:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> Some other random thoughts:
> 
>  * The page flag is probably not a good idea.  It would be probably
>    better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
>    into the slow path.
>  * This really stops being "normal" memory.  You can't do futexes on it,
>    cant splice it.  Probably need a more fleshed-out list of
>    incompatible features.
>  * As Kirill noted, each 4k page ends up with a potential 1GB "blast
>    radius" of demoted pages in the direct map.  Not cool.  This is
>    probably a non-starter as it stands.
>  * The global TLB flushes are going to eat you alive.  They probably
>    border on a DoS on larger systems.
>  * Do we really want this user interface to dictate the kernel
>    implementation?  In other words, do we really want MAP_EXCLUSIVE,
>    or do we want MAP_SECRET?  One tells the kernel what do *do*, the
>    other tells the kernel what the memory *IS*.
>  * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
>    Persistent Memory, where the kernel direct map is a liability in some
>    way.  We probably need some kind of overall, architected solution
>    rather than five or ten things all poking at the direct map.

Another random set of thoughts:

 - Should devices be permitted to DMA to/from MAP_SECRET pages?
 - How about GUP?  Can I ptrace my way into another process's secret pages?
 - What if I splice() the page into a pipe?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 13:55         ` Peter Zijlstra
@ 2019-10-28 19:59           ` Edgecombe, Rick P
  2019-10-28 21:00             ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Edgecombe, Rick P @ 2019-10-28 19:59 UTC (permalink / raw)
  To: kirill, peterz
  Cc: adobriyan, linux-kernel, rppt, rostedt, jejb, tglx, linux-mm,
	dave.hansen, linux-api, x86, akpm, hpa, mingo, luto, rppt, bp,
	arnd

On Mon, 2019-10-28 at 14:55 +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote:
> 
> > I think active use of this feature will lead to performance degradation of
> > the system with time.
> > 
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > way road. We don't have any mechanism to map the memory with huge page
> > again after the application has freed the page.
> 
> Right, we recently had a 'bug' where ftrace triggered something like
> this and facebook ran into it as a performance regression. So yes, this
> is a real concern.

Don't e/cBPF filters also break the direct map down to 4k pages when calling
set_memory_ro() on the filter for 64 bit x86 and arm?

I've been wondering if the page allocator should make some effort to find an
already broken-down page for anything that is known to have its direct map
permissions changed (or if it already groups them somehow). But it also makes
me wonder why any potential slowdown from 4k pages in the direct map hasn't
been noticed for apps that do a lot of insertions and removals of BPF filters,
if this is indeed the case.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 11:00   ` Mike Rapoport
@ 2019-10-28 20:23     ` Florian Weimer
  2019-10-29  9:01       ` Mike Rapoport
  0 siblings, 1 reply; 60+ messages in thread
From: Florian Weimer @ 2019-10-28 20:23 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

* Mike Rapoport:

> On October 27, 2019 12:30:21 PM GMT+02:00, Florian Weimer
> <fw@deneb.enyo.de> wrote:
>>* Mike Rapoport:
>>
>>> The patch below aims to allow applications to create mappins that have
>>> pages visible only to the owning process. Such mappings could be used to
>>> store secrets so that these secrets are not visible neither to other
>>> processes nor to the kernel.
>>
>>How is this expected to interact with CRIU?
>
> CRIU dumps the memory contents using a parasite code from inside the
> dumpee address space, so it would work the same way as for the other
> mappings. Of course, at the restore time the exclusive mapping should
> be recreated with the appropriate flags.

Hmm, so it would use a bounce buffer to perform the extraction?

>>> I've only tested the basic functionality, the changes should be verified
>>> against THP/migration/compaction. Yet, I'd appreciate early feedback.
>>
>>What are the expected semantics for VM migration?  Should it fail?
>
> I don't quite follow. If qemu would use such mappings it would be able
> to transfer them during live migration.

I was wondering if the special state is supposed to bubble up to the
host eventually.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 10:17 [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Mike Rapoport
  2019-10-27 10:17 ` Mike Rapoport
  2019-10-27 10:30 ` Florian Weimer
@ 2019-10-28 20:44 ` Andy Lutomirski
  2019-10-29  9:32   ` Mike Rapoport
  2019-10-29 11:25 ` [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Reshetova, Elena
  3 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2019-10-28 20:44 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport


> On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@kernel.org> wrote:
> 
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> The patch below aims to allow applications to create mappins that have
> pages visible only to the owning process. Such mappings could be used to
> store secrets so that these secrets are not visible neither to other
> processes nor to the kernel.
> 
> I've only tested the basic functionality, the changes should be verified
> against THP/migration/compaction. Yet, I'd appreciate early feedback.

I’ve contemplated the concept a fair amount, and I think you should consider a change to the API. In particular, rather than having it be a MAP_ flag, make it a chardev.  You can, at least at first, allow only MAP_SHARED, and admins can decide who gets to use it.  It might also play better with the VM overall, and you won’t need a VM_ flag for it — you can just wire up .fault to do the right thing.
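
A very rough sketch of that chardev shape (names hypothetical, untested;
error handling, page freeing and TLB flushing omitted):

	#include <linux/miscdevice.h>
	#include <linux/mm.h>
	#include <linux/module.h>
	#include <linux/set_memory.h>

	static vm_fault_t secretmem_fault(struct vm_fault *vmf)
	{
		struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

		if (!page)
			return VM_FAULT_OOM;
		/* hide the page from the kernel linear mapping */
		set_direct_map_invalid_noflush(page);
		return vmf_insert_page(vmf->vma, vmf->address, page);
	}

	static const struct vm_operations_struct secretmem_vm_ops = {
		.fault = secretmem_fault,
	};

	static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
	{
		vma->vm_ops = &secretmem_vm_ops;
		/* no swap, no fork inheritance; VM_MIXEDMAP for vmf_insert_page() */
		vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_MIXEDMAP;
		return 0;
	}

	static const struct file_operations secretmem_fops = {
		.owner	= THIS_MODULE,
		.mmap	= secretmem_mmap,
	};

	static struct miscdevice secretmem_dev = {
		.minor	= MISC_DYNAMIC_MINOR,
		.name	= "secretmem",
		.fops	= &secretmem_fops,
	};
	module_misc_device(secretmem_dev);
	MODULE_LICENSE("GPL");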

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 19:59           ` Edgecombe, Rick P
@ 2019-10-28 21:00             ` Peter Zijlstra
  2019-10-29 17:27               ` Edgecombe, Rick P
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2019-10-28 21:00 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kirill, adobriyan, linux-kernel, rppt, rostedt, jejb, tglx,
	linux-mm, dave.hansen, linux-api, x86, akpm, hpa, mingo, luto,
	rppt, bp, arnd

On Mon, Oct 28, 2019 at 07:59:25PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2019-10-28 at 14:55 +0100, Peter Zijlstra wrote:
> > On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote:
> > 
> > > I think active use of this feature will lead to performance degradation of
> > > the system with time.
> > > 
> > > Setting a single 4k page non-present in the direct mapping will require
> > > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > > way road. We don't have any mechanism to map the memory with huge page
> > > again after the application has freed the page.
> > 
> > Right, we recently had a 'bug' where ftrace triggered something like
> > this and facebook ran into it as a performance regression. So yes, this
> > is a real concern.
> 
> Don't e/cBPF filters also break the direct map down to 4k pages when calling
> set_memory_ro() on the filter for 64 bit x86 and arm?
> 
> I've been wondering if the page allocator should make some effort to find a
> broken down page for anything that can be known will have direct map permissions
> changed (or if it already groups them somehow). But also, why any potential
> slowdown of 4k pages on the direct map hasn't been noticed for apps that do a
> lot of insertions and removals of BPF filters, if this is indeed the case.

That should be limited to the module range. Random data maps could
shatter the world.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 13:16       ` Kirill A. Shutemov
  2019-10-28 13:55         ` Peter Zijlstra
@ 2019-10-29  5:43         ` Dan Williams
  2019-10-29  6:43           ` Kirill A. Shutemov
  2019-10-29  7:08         ` Christopher Lameter
  2 siblings, 1 reply; 60+ messages in thread
From: Dan Williams @ 2019-10-29  5:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Rapoport, Linux Kernel Mailing List, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API,
	linux-mm, the arch/x86 maintainers, Mike Rapoport

On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > >
> > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > > the owning process and can be used by applications to store secret
> > > > information that will not be visible not only to other processes but to the
> > > > kernel as well.
> > > >
> > > > The pages in these mappings are removed from the kernel direct map and
> > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > > the pages are mapped back into the direct map.
> > >
> > > I probably blind, but I don't see where you manipulate direct map...
> >
> > __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> > set_direct_map_invalid_noflush() that makes the page not present.
>
> Ah. okay.
>
> I think active use of this feature will lead to performance degradation of
> the system with time.
>
> Setting a single 4k page non-present in the direct mapping will require
> splitting 2M or 1G page we usually map direct mapping with. And it's one
> way road. We don't have any mechanism to map the memory with huge page
> again after the application has freed the page.
>
> It might be okay if all these pages cluster together, but I don't think we
> have a way to achieve it easily.

Still, it would be worth exploring what that would look like, if not
for MAP_EXCLUSIVE then for set_mce_nospec(), which wants to punch out
poison pages from the direct map. In the case of pmem, where those pages
can be repaired, it would be nice to also repair the mapping granularity
of the direct map.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29  5:43         ` Dan Williams
@ 2019-10-29  6:43           ` Kirill A. Shutemov
  2019-10-29  8:56             ` Peter Zijlstra
  2019-10-29 19:43             ` [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Dan Williams
  0 siblings, 2 replies; 60+ messages in thread
From: Kirill A. Shutemov @ 2019-10-29  6:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Mike Rapoport, Linux Kernel Mailing List, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API,
	linux-mm, the arch/x86 maintainers, Mike Rapoport

On Mon, Oct 28, 2019 at 10:43:51PM -0700, Dan Williams wrote:
> On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> > > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > >
> > > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > > > the owning process and can be used by applications to store secret
> > > > > information that will not be visible not only to other processes but to the
> > > > > kernel as well.
> > > > >
> > > > > The pages in these mappings are removed from the kernel direct map and
> > > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > > > the pages are mapped back into the direct map.
> > > >
> > > > I probably blind, but I don't see where you manipulate direct map...
> > >
> > > __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> > > set_direct_map_invalid_noflush() that makes the page not present.
> >
> > Ah. okay.
> >
> > I think active use of this feature will lead to performance degradation of
> > the system with time.
> >
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > way road. We don't have any mechanism to map the memory with huge page
> > again after the application has freed the page.
> >
> > It might be okay if all these pages cluster together, but I don't think we
> > have a way to achieve it easily.
> 
> Still, it would be worth exploring what that would look like if not
> for MAP_EXCLUSIVE then set_mce_nospec() that wants to punch out poison
> pages from the direct map. In the case of pmem, where those pages are
> able to be repaired, it would be nice to also repair the mapping
> granularity of the direct map.

The solution has to consist of two parts: finding a range to collapse and
actually collapsing the range into a huge page.

Finding the collapsible range will likely require background scanning of
the direct mapping as we do for THP with khugepaged. It should not be too
hard, but it will likely require long and tedious tuning to be effective
without being too disruptive for the system.

Alternatively, after any change to the direct mapping, we can check whether
the surrounding range is collapsible, up to 1G around the changed 4k. It
might be more taxing than scanning if the direct mapping changes often.

Collapsing itself appears to be simple: re-check if the range is
collapsible under the lock, replace the page table with the huge page and
flush the TLB.
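
In pseudo-code, something like this (all helpers hypothetical):

	/* try to restore a 2M direct map entry covering 'start' */
	lock_direct_map();
	if (direct_map_range_is_collapsible(start, start + PMD_SIZE)) {
		/* install a single 2M entry over the existing 4k page table */
		install_direct_map_pmd(start, __pa(start), PAGE_KERNEL_LARGE);
		flush_tlb_kernel_range(start, start + PMD_SIZE);
		free_collapsed_page_table(start);
	}
	unlock_direct_map();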

But some CPUs don't like to have two TLB entries for the same memory with
different sizes at the same time. See for instance AMD erratum 383.

Getting it right would require making the range not present, flush TLB and
only then install huge page. That's what we do for userspace.

It will not fly for the direct mapping. There is no reasonable way to
exclude other CPUs from accessing the range while it's not present (call
stop_machine()? :P). Moreover, the range may contain the code that is doing
the collapse or the data required for it...

BTW, it looks like the current __split_large_page() in pageattr.c is
susceptible to the erratum. Maybe we can get away with the easy way...

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 13:16       ` Kirill A. Shutemov
  2019-10-28 13:55         ` Peter Zijlstra
  2019-10-29  5:43         ` Dan Williams
@ 2019-10-29  7:08         ` Christopher Lameter
  2019-10-29  8:55           ` Mike Rapoport
  2 siblings, 1 reply; 60+ messages in thread
From: Christopher Lameter @ 2019-10-29  7:08 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport

On Mon, 28 Oct 2019, Kirill A. Shutemov wrote:

> Setting a single 4k page non-present in the direct mapping will require
> splitting 2M or 1G page we usually map direct mapping with. And it's one
> way road. We don't have any mechanism to map the memory with huge page
> again after the application has freed the page.
>
> It might be okay if all these pages cluster together, but I don't think we
> have a way to achieve it easily.

Set aside a special physical memory range for this and migrate the
page to that physical memory range when MAP_EXCLUSIVE is specified?

Maybe some processors also have hardware ranges that offer additional
protection for stuff like that?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29  7:08         ` Christopher Lameter
@ 2019-10-29  8:55           ` Mike Rapoport
  2019-10-29 10:12             ` Christopher Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2019-10-29  8:55 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Kirill A. Shutemov, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport

On Tue, Oct 29, 2019 at 07:08:42AM +0000, Christopher Lameter wrote:
> On Mon, 28 Oct 2019, Kirill A. Shutemov wrote:
> 
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > way road. We don't have any mechanism to map the memory with huge page
> > again after the application has freed the page.
> >
> > It might be okay if all these pages cluster together, but I don't think we
> > have a way to achieve it easily.
> 
> Set aside a special physical memory range for this and migrate the
> page to that physical memory range when MAP_EXCLUSIVE is specified?

I've talked with Thomas yesterday and he suggested something similar:

When the MAP_EXCLUSIVE request comes for the first time, we allocate a huge
page for it and then use this page as a pool of 4K pages for subsequent
requests. Once this huge page is full we allocate a new one and append it
to the pool. When all the 4K pages that comprise the huge page are freed
the huge page is collapsed.

And then on top of this we can look into compaction of the direct map.

Of course, this would only work if the easy way of collapsing direct map
pages that Kirill mentioned in the other mail actually works.
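
Just to make the idea concrete, the pool could look roughly like the sketch
below (the structure and function names are made up for illustration, none
of this exists in the patch):

    struct excl_pool {
        spinlock_t lock;
        struct list_head huge_pages;      /* 2M blocks backing the pool */
    };

    struct excl_huge_page {
        struct list_head node;
        struct page *base;                /* the 2M block itself        */
        DECLARE_BITMAP(used, 512);        /* which 4K subpages are out  */
        unsigned int nr_used;
    };

    /* hand out one 4K page; split a freshly allocated 2M block when empty */
    struct page *excl_pool_alloc(struct excl_pool *pool);

    /* return a 4K page; when nr_used drops to zero, collapse the 2M block
     * back into the direct map and free it */
    void excl_pool_free(struct excl_pool *pool, struct page *page);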
 
> Maybe some processors also have hardware ranges that offer additional
> protection for stuff like that?
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29  6:43           ` Kirill A. Shutemov
@ 2019-10-29  8:56             ` Peter Zijlstra
  2019-10-29 11:00               ` Kirill A. Shutemov
  2019-10-29 19:43             ` [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Dan Williams
  1 sibling, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2019-10-29  8:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Mike Rapoport, Linux Kernel Mailing List,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API,
	linux-mm, the arch/x86 maintainers, Mike Rapoport

On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote:
> But some CPUs don't like to have two TLB entries for the same memory with
> different sizes at the same time. See for instance AMD erratum 383.
> 
> Getting it right would require making the range not present, flush TLB and
> only then install huge page. That's what we do for userspace.
> 
> It will not fly for the direct mapping. There is no reasonable way to
> exclude other CPU from accessing the range while it's not present (call
> stop_machine()? :P). Moreover, the range may contain the code that doing
> the collapse or data required for it...
> 
> BTW, looks like current __split_large_page() in pageattr.c is susceptible
> to the errata. Maybe we can get away with the easy way...

As you write above, there is just no way we can have a (temporary) hole
in the direct map.

We are careful about that other errata, and make sure both translations
are identical wrt everything else.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 20:23     ` Florian Weimer
@ 2019-10-29  9:01       ` Mike Rapoport
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2019-10-29  9:01 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On Mon, Oct 28, 2019 at 09:23:17PM +0100, Florian Weimer wrote:
> * Mike Rapoport:
> 
> > On October 27, 2019 12:30:21 PM GMT+02:00, Florian Weimer
> > <fw@deneb.enyo.de> wrote:
> >>* Mike Rapoport:
> >>
> >>> The patch below aims to allow applications to create mappins that
> >>have
> >>> pages visible only to the owning process. Such mappings could be used
> >>to
> >>> store secrets so that these secrets are not visible neither to other
> >>> processes nor to the kernel.
> >>
> >>How is this expected to interact with CRIU?
> >
> > CRIU dumps the memory contents using a parasite code from inside the
> > dumpee address space, so it would work the same way as for the other
> > mappings. Of course, at the restore time the exclusive mapping should
> > be recreated with the appropriate flags.
> 
> Hmm, so it would use a bounce buffer to perform the extraction?

At first I thought that CRIU would extract the memory contents from these
mappings just as it does now using vmsplice(). But it seems that such
mappings won't play well with pipes, so CRIU will need a bounce buffer
indeed.
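
Something along these lines, I guess: the parasite copies the exclusive
pages into an ordinary anonymous buffer and vmsplice()s that instead (just
a sketch; pipe_fd, secret_area and len are assumed to come from the
existing CRIU dump path):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <string.h>

    static int dump_exclusive_area(int pipe_fd, void *secret_area, size_t len)
    {
        /* bounce buffer: ordinary memory that plays well with pipes */
        void *bounce = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct iovec iov = { .iov_base = bounce, .iov_len = len };
        ssize_t ret;

        if (bounce == MAP_FAILED)
            return -1;

        /* a plain CPU copy works: we run in the owning address space */
        memcpy(bounce, secret_area, len);

        ret = vmsplice(pipe_fd, &iov, 1, 0);
        memset(bounce, 0, len);           /* don't leave a stray copy around */
        munmap(bounce, len);
        return ret == (ssize_t)len ? 0 : -1;
    }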
 
> >>> I've only tested the basic functionality, the changes should be
> >>verified
> >>> against THP/migration/compaction. Yet, I'd appreciate early feedback.
> >>
> >>What are the expected semantics for VM migration?  Should it fail?
> >
> > I don't quite follow. If qemu would use such mappings it would be able
> > to transfer them during live migration.
> 
> I was wondering if the special state is supposed to bubble up to the
> host eventually.

Well, that was not intended.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 17:12   ` Dave Hansen
  2019-10-28 17:32     ` Sean Christopherson
  2019-10-28 18:08     ` Matthew Wilcox
@ 2019-10-29  9:19     ` Mike Rapoport
  2 siblings, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2019-10-29  9:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> On 10/27/19 3:17 AM, Mike Rapoport wrote:
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
> 
> This looks fun.  It's certainly simple.
> 
> But, the description is not really calling out the pros and cons very
> well.  I'm also not sure that folks will use an interface like this that
> requires up-front, special code to do an allocation instead of something
> like madvise().  That's why protection keys ended up the way it did: if
> you do this as a mmap() replacement, you need to modify all *allocators*
> to be enabled for this.  If you do it with mprotect()-style, you can
> apply it to existing allocations.

Actually, I started with mprotect() and then realized that mmap() would
be simpler, so I switched over to mmap().
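
For the record, the two styles look like this from userspace (MAP_EXCLUSIVE
is the flag from this RFC; MADV_EXCLUSIVE below is purely hypothetical, only
to show how an mprotect()/madvise()-style variant would read):

    /* mmap()-style: the allocator has to know about it up front */
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_EXCLUSIVE, -1, 0);

    /* madvise()-style: applied to an existing, page-aligned allocation */
    posix_memalign(&buf, page_size, len);
    madvise(buf, len, MADV_EXCLUSIVE);    /* hypothetical flag */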
 
> Some other random thoughts:
> 
>  * The page flag is probably not a good idea.  It would be probably
>    better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
>    into the slow path.

The page flag won't work on 32-bit, indeed. But do we really need such
functionality on 32-bit?
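
For reference, the fast GUP path (gup_pte_range()) already bails out on
special PTEs, so Dave's suggestion would boil down to something like the
sketch below when the PTE is installed (assuming the RFC's vma flag is
called VM_EXCLUSIVE):

    entry = mk_pte(page, vma->vm_page_prot);
    if (vma->vm_flags & VM_EXCLUSIVE)
        /* special PTEs force get_user_pages() into the slow path */
        entry = pte_mkspecial(entry);
    set_pte_at(vma->vm_mm, addr, ptep, entry);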

>  * This really stops being "normal" memory.  You can't do futexes on it,
>    cant splice it.  Probably need a more fleshed-out list of
>    incompatible features.

True, my bad. I should have mentioned more than THP/compaction/migration.

>  * As Kirill noted, each 4k page ends up with a potential 1GB "blast
>    radius" of demoted pages in the direct map.  Not cool.  This is
>    probably a non-starter as it stands.
>  * The global TLB flushes are going to eat you alive.  They probably
>    border on a DoS on larger systems.

As I wrote in another email, we could use some kind of pooling to reduce
the "blast radius", and that would reduce the amount of TLB flushes as well.
The size of MAP_EXCLUSIVE mappings obeys RLIMIT_MEMLOCK, and we can add a
system-wide limit for the size of such allocations.

>  * Do we really want this user interface to dictate the kernel
>    implementation?  In other words, do we really want MAP_EXCLUSIVE,
>    or do we want MAP_SECRET?  One tells the kernel what do *do*, the
>    other tells the kernel what the memory *IS*.

I hesitated for quite some time between EXCLUSIVE and SECRET. I settled on
EXCLUSIVE because in my view it better describes the fact that the region is
only mapped in its owner's address space. As such it can be used to store
secrets, but it can be used for other purposes as well.

>  * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
>    Persistent Memory, where the kernel direct map is a liability in some
>    way.  We probably need some kind of overall, architected solution
>    rather than five or ten things all poking at the direct map.

Agree.
 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 18:08     ` Matthew Wilcox
@ 2019-10-29  9:28       ` Mike Rapoport
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2019-10-29  9:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Hansen, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport

On Mon, Oct 28, 2019 at 11:08:08AM -0700, Matthew Wilcox wrote:
> On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> > Some other random thoughts:
> > 
> >  * The page flag is probably not a good idea.  It would be probably
> >    better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
> >    into the slow path.
> >  * This really stops being "normal" memory.  You can't do futexes on it,
> >    cant splice it.  Probably need a more fleshed-out list of
> >    incompatible features.
> >  * As Kirill noted, each 4k page ends up with a potential 1GB "blast
> >    radius" of demoted pages in the direct map.  Not cool.  This is
> >    probably a non-starter as it stands.
> >  * The global TLB flushes are going to eat you alive.  They probably
> >    border on a DoS on larger systems.
> >  * Do we really want this user interface to dictate the kernel
> >    implementation?  In other words, do we really want MAP_EXCLUSIVE,
> >    or do we want MAP_SECRET?  One tells the kernel what do *do*, the
> >    other tells the kernel what the memory *IS*.
> >  * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
> >    Persistent Memory, where the kernel direct map is a liability in some
> >    way.  We probably need some kind of overall, architected solution
> >    rather than five or ten things all poking at the direct map.
> 
> Another random set of thoughts:
> 
>  - Should devices be permitted to DMA to/from MAP_SECRET pages?

I can't say I have a clear-cut yes or no here. One possible use case for
such pages is to read secrets from storage directly into them. On the
other hand, DMA to/from a device can be used to exploit those secrets...

>  - How about GUP?

Do you mean GUP for "remote" memory? I'd say no.

>  - Can I ptrace my way into another process's secret pages?

No.

>  - What if I splice() the page into a pipe?

I think it should fail.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 20:44 ` Andy Lutomirski
@ 2019-10-29  9:32   ` Mike Rapoport
  2019-10-29 17:00     ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2019-10-29  9:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> 
> > On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@kernel.org> wrote:
> > 
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > Hi,
> > 
> > The patch below aims to allow applications to create mappins that have
> > pages visible only to the owning process. Such mappings could be used to
> > store secrets so that these secrets are not visible neither to other
> > processes nor to the kernel.
> > 
> > I've only tested the basic functionality, the changes should be verified
> > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> 
> I’ve contemplated the concept a fair amount, and I think you should
> consider a change to the API. In particular, rather than having it be a
> MAP_ flag, make it a chardev.  You can, at least at first, allow only
> MAP_SHARED, and admins can decide who gets to use it.  It might also play
> better with the VM overall, and you won’t need a VM_ flag for it — you
> can just wire up .fault to do the right thing.

I think mmap()/mprotect()/madvise() are the natural APIs for such an
interface. Switching to a chardev doesn't solve the major problem of direct
map fragmentation and defeats the ability to use exclusive memory mappings
with the existing allocators, while mprotect() and madvise() do not.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29  8:55           ` Mike Rapoport
@ 2019-10-29 10:12             ` Christopher Lameter
  2019-10-30  7:11               ` Mike Rapoport
  0 siblings, 1 reply; 60+ messages in thread
From: Christopher Lameter @ 2019-10-29 10:12 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kirill A. Shutemov, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport



On Tue, 29 Oct 2019, Mike Rapoport wrote:

> I've talked with Thomas yesterday and he suggested something similar:
>
> When the MAP_EXCLUSIVE request comes for the first time, we allocate a huge
> page for it and then use this page as a pool of 4K pages for subsequent
> requests. Once this huge page is full we allocate a new one and append it
> to the pool. When all the 4K pages that comprise the huge page are freed
> the huge page is collapsed.

Or write a device driver that allows you to mmap a secure area and avoid
all core kernel modifications?

/dev/securemem or so?

It may exist already.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29  8:56             ` Peter Zijlstra
@ 2019-10-29 11:00               ` Kirill A. Shutemov
  2019-10-29 12:39                 ` AMD TLB errata, (Was: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings) Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Kirill A. Shutemov @ 2019-10-29 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dan Williams, Mike Rapoport, Linux Kernel Mailing List,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API,
	linux-mm, the arch/x86 maintainers, Mike Rapoport

On Tue, Oct 29, 2019 at 09:56:02AM +0100, Peter Zijlstra wrote:
> On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote:
> > But some CPUs don't like to have two TLB entries for the same memory with
> > different sizes at the same time. See for instance AMD erratum 383.
> > 
> > Getting it right would require making the range not present, flush TLB and
> > only then install huge page. That's what we do for userspace.
> > 
> > It will not fly for the direct mapping. There is no reasonable way to
> > exclude other CPU from accessing the range while it's not present (call
> > stop_machine()? :P). Moreover, the range may contain the code that doing
> > the collapse or data required for it...
> > 
> > BTW, looks like current __split_large_page() in pageattr.c is susceptible
> > to the errata. Maybe we can get away with the easy way...
> 
> As you write above, there is just no way we can have a (temporary) hole
> in the direct map.
> 
> We are careful about that other errata, and make sure both translations
> are identical wrt everything else.

It's not clear if it is enough to avoid the issue. "Under a highly specific
and detailed set of conditions" is not a very specific set of conditions :P

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 10:17 ` Mike Rapoport
                     ` (3 preceding siblings ...)
  2019-10-28 18:02   ` Andy Lutomirski
@ 2019-10-29 11:02   ` David Hildenbrand
  2019-10-30  8:15     ` Mike Rapoport
  4 siblings, 1 reply; 60+ messages in thread
From: David Hildenbrand @ 2019-10-29 11:02 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api, linux-mm, x86, Mike Rapoport

On 27.10.19 11:17, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will not be visible not only to other processes but to the
> kernel as well.
> 
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.
> 

Just a thought: the kernel is still able to indirectly read the contents
of these pages by doing a kdump from a kexec environment, right? Also, I
wonder what would happen if you mapped such pages via /dev/mem into another
user space application and, e.g., used them along with kvm [1].

[1] https://lwn.net/Articles/778240/

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-27 10:17 [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Mike Rapoport
                   ` (2 preceding siblings ...)
  2019-10-28 20:44 ` Andy Lutomirski
@ 2019-10-29 11:25 ` Reshetova, Elena
  2019-10-29 15:13   ` Tycho Andersen
  2019-10-29 17:03   ` Andy Lutomirski
  3 siblings, 2 replies; 60+ messages in thread
From: Reshetova, Elena @ 2019-10-29 11:25 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api, linux-mm, x86, Mike Rapoport, Tycho Andersen,
	Alan Cox

> The patch below aims to allow applications to create mappins that have
> pages visible only to the owning process. Such mappings could be used to
> store secrets so that these secrets are not visible neither to other
> processes nor to the kernel.

Hi Mike, 

I have actually been looking into a closely related problem for the past
couple of weeks (on and off). What is common here is the need for userspace
to indicate to the kernel that some pages contain secrets. And then there are
a number of things that the kernel can do to try to protect these secrets
better. Unmapping from the direct map is one of them. Another is to map such
pages as non-cached, which can help us prevent or considerably restrict
speculation on such pages. The initial proof of concept for marking pages as
"UNCACHED" that I got from Dave Hansen was actually based on mlock2()
and a new flag for it for this purpose. Since then I have been thinking about
what interface suits the use case better and settled on a new madvise()
flag instead, because of all the possible implications for fragmentation and
performance. My logic was that we had better allocate the secret data
explicitly (using mmap()) to make sure that no other process data
accidentally gets to suffer.
Imagine I allocate a buffer to hold a secret key and signal with mlock()
to protect it, and suddenly my other high-throughput non-secret buffer
(which happened to live on the same page by chance) becomes very slow,
and I don't even have an easy way (apart from mmap()ing it!) to guarantee
that it won't be affected.

So I ended up with something like:

  secret_buffer = mmap(NULL, PAGE_SIZE, ...)
  madvise(secret_buffer, size, MADV_SECRET)

I have work in progress code here:
 https://github.com/ereshetova/linux/commits/madvise

I haven't sent it for review because it is not ready yet; I am now working
on adding the page wiping functionality. Otherwise it would be pointless
to protect the page while it is used in userspace but then allow it to be
reused by a different process later, after it has been released back and
userspace was careless enough not to wipe the contents (or was crashed on
purpose before it was able to wipe anything out).
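
For completeness, the intended usage with an explicit wipe would be roughly
as below (a sketch only; MADV_SECRET is not an upstream flag, the value here
is a placeholder standing in for whatever the WIP branch defines):

    #include <sys/mman.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef MADV_SECRET
    #define MADV_SECRET 0x20              /* placeholder, illustration only */
    #endif

    int use_secret(void)
    {
        size_t len = sysconf(_SC_PAGESIZE);
        void *secret = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (secret == MAP_FAILED || madvise(secret, len, MADV_SECRET))
            return -1;

        /* ... put the key material here and use it ... */

        memset(secret, 0, len);           /* explicit wipe until the kernel does it */
        munmap(secret, len);
        return 0;
    }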

We have also had some discussions with Tycho about applying XPFO
selectively to such "SECRET"-marked pages, and I know that he has also
done some initial prototyping on this, so I think it would be great to decide
on the userspace interface first and then see how we can assemble all
these features together.

The *very* far-reaching goal for all of this would be something that Alan Cox
suggested when I started looking into this: a new libc function to allocate
memory in a secure way, which could hide all the dancing with
mmap()/madvise() (and/or potentially the interaction with a chardev that Andy
was also suggesting) and implement an efficient allocator for such secret
pages. OpenSSL has its own version of a "secure heap", which is essentially
an mmap area with additional MLOCK_ONFAULT and MADV_DONTDUMP flags for
protection. Other apps or libraries must use something similar if they want
additional protection, which makes them reimplement the same concept again
and again. Sadly or surprisingly, other major libraries like BoringSSL and
mbedTLS, or clients like OpenSSH, do not use any mlock()/madvise() flags for
additional protection of the secrets they hold in memory. Maybe if all of it
were behind a single secure API, the situation in userspace would start to
change for the better.

Best Regards,
Elena.
 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* AMD TLB errata, (Was: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings)
  2019-10-29 11:00               ` Kirill A. Shutemov
@ 2019-10-29 12:39                 ` Peter Zijlstra
  2019-11-15 14:12                   ` Tom Lendacky
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2019-10-29 12:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Mike Rapoport, Linux Kernel Mailing List,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API,
	linux-mm, the arch/x86 maintainers, Mike Rapoport,
	thomas.lendacky

On Tue, Oct 29, 2019 at 02:00:24PM +0300, Kirill A. Shutemov wrote:
> On Tue, Oct 29, 2019 at 09:56:02AM +0100, Peter Zijlstra wrote:
> > On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote:
> > > But some CPUs don't like to have two TLB entries for the same memory with
> > > different sizes at the same time. See for instance AMD erratum 383.
> > > 
> > > Getting it right would require making the range not present, flush TLB and
> > > only then install huge page. That's what we do for userspace.
> > > 
> > > It will not fly for the direct mapping. There is no reasonable way to
> > > exclude other CPU from accessing the range while it's not present (call
> > > stop_machine()? :P). Moreover, the range may contain the code that doing
> > > the collapse or data required for it...
> > > 
> > > BTW, looks like current __split_large_page() in pageattr.c is susceptible
> > > to the errata. Maybe we can get away with the easy way...
> > 
> > As you write above, there is just no way we can have a (temporary) hole
> > in the direct map.
> > 
> > We are careful about that other errata, and make sure both translations
> > are identical wrt everything else.
> 
> It's not clear if it is enough to avoid the issue. "under a highly specific
> and detailed set of conditions" is not very specific set of conditions :P

Yeah, I know ... :/ Tom, is there any chance you could shed a little more
light on that erratum?


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29 11:25 ` [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Reshetova, Elena
@ 2019-10-29 15:13   ` Tycho Andersen
  2019-10-29 17:03   ` Andy Lutomirski
  1 sibling, 0 replies; 60+ messages in thread
From: Tycho Andersen @ 2019-10-29 15:13 UTC (permalink / raw)
  To: Reshetova, Elena
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport, Alan Cox

Hi Elena, Mike,

On Tue, Oct 29, 2019 at 11:25:12AM +0000, Reshetova, Elena wrote:
> > The patch below aims to allow applications to create mappins that have
> > pages visible only to the owning process. Such mappings could be used to
> > store secrets so that these secrets are not visible neither to other
> > processes nor to the kernel.
> 
> Hi Mike, 
> 
> I have actually been looking into the closely related problem for the past 
> couple of weeks (on and off). What is common here is the need for userspace
> to indicate to kernel that some pages contain secrets. And then there are
> actually a number of things that kernel can do to try to protect these secrets
> better. Unmap from direct map is one of them. Another thing is to map such
> pages as non-cached, which can help us to prevent or considerably restrict
> speculation on such pages. The initial proof of concept for marking pages as
> "UNCACHED" that I got from Dave Hansen was actually based on mlock2() 
> and a new flag for it for this purpose. Since then I have been thinking on what
> interface suits the use case better and actually selected going with new madvise() 
> flag instead because of all possible implications for fragmentation and performance. 
> My logic was that we better allocate the secret data explicitly (using mmap()) 
> to make sure that no other process data accidentally gets to suffer.
> Imagine I would allocate a buffer to hold a secret key, signal with mlock
>  to protect it and suddenly my other high throughput non-secret buffer 
> (which happened to live on the same page by chance) became very slow
>  and I don't even have an easy way (apart from mmap()ing it!) to guarantee
>  that it won't be affected.
> 
> So, I ended up towards smth like:
> 
>   secret_buffer =  mmap(NULL, PAGE_SIZE, ...)
>    madvise(secret_buffer, size, MADV_SECRET)
> 
> I have work in progress code here:
>  https://github.com/ereshetova/linux/commits/madvise
> 
> I haven't sent it for review, because it is not ready yet and I am now working
> on trying to add the page wiping functionality. Otherwise it would be useless
> to protect the page during the time it is used in userspace, but then allow it
> to get reused by a different process later after it has been released back and
> userspace was stupid enough not to wipe the contents (or was crashed on 
> purpose before it was able to wipe anything out). 

I was looking at this and thinking that do_exit() might be a nice place
for the wiping, but I haven't tried anything yet.

> We have also had some discussions with Tycho that XPFO can be also
> applied selectively for such "SECRET" marked pages and I know that he has also
> did some initial prototyping on this, so I think it would be great to decide
> on userspace interface first and then see how we can assemble together all
> these features. 

Yep! Here's my tree with the direct un-mapping bits ported from XPFO:
https://github.com/tych0/linux/commits/madvise

As noted in one of the commit messages I think the bit math for page
prot flags needs a bit of work, but the test passes, so :)

In any case, I'll try to look at Mike's patches later today.

Cheers,

Tycho

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29  9:32   ` Mike Rapoport
@ 2019-10-29 17:00     ` Andy Lutomirski
  2019-10-30  8:40       ` Mike Rapoport
  0 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2019-10-29 17:00 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: LKML, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Linux API, Linux-MM, X86 ML, Mike Rapoport

On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> >
> > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > >
> > > Hi,
> > >
> > > The patch below aims to allow applications to create mappins that have
> > > pages visible only to the owning process. Such mappings could be used to
> > > store secrets so that these secrets are not visible neither to other
> > > processes nor to the kernel.
> > >
> > > I've only tested the basic functionality, the changes should be verified
> > > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> >
> > I’ve contemplated the concept a fair amount, and I think you should
> > consider a change to the API. In particular, rather than having it be a
> > MAP_ flag, make it a chardev.  You can, at least at first, allow only
> > MAP_SHARED, and admins can decide who gets to use it.  It might also play
> > better with the VM overall, and you won’t need a VM_ flag for it — you
> > can just wire up .fault to do the right thing.
>
> I think mmap()/mprotect()/madvise() are the natural APIs for such
> interface.

Then you have a whole bunch of questions to answer.  For example:

What happens if you mprotect() or similar when the mapping is already
in use in a way that's incompatible with MAP_EXCLUSIVE?

Is it actually reasonable to malloc() some memory and then make it exclusive?

Are you permitted to map a file MAP_EXCLUSIVE?  What does it mean?

What does MAP_PRIVATE | MAP_EXCLUSIVE do?

How does one pass exclusive memory via SCM_RIGHTS?  (If it's a
memfd-like or chardev interface, it's trivial.  mmap(), not so much.)

And finally, there's my personal giant pet peeve: a major use of this
will be for virtualization.  I suspect that a lot of people would like
the majority of KVM guest memory to be unmapped from the host
pagetables.  But people might also like for guest memory to be
unmapped in *QEMU's* pagetables, and mmap() is a basically worthless
interface for this.  Getting fd-backed memory into a guest will take
some possibly major work in the kernel, but getting vma-backed memory
into a guest without mapping it in the host user address space seems
much, much worse.

> Switching to a chardev doesn't solve the major problem of direct
> map fragmentation and defeats the ability to use exclusive memory mappings
> with the existing allocators, while mprotect() and madvise() do not.
>

Will people really want to do malloc() and then remap it exclusive?
This sounds dubiously useful at best.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29 11:25 ` [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Reshetova, Elena
  2019-10-29 15:13   ` Tycho Andersen
@ 2019-10-29 17:03   ` Andy Lutomirski
  2019-10-29 17:37     ` Alan Cox
  2019-10-29 17:43     ` James Bottomley
  1 sibling, 2 replies; 60+ messages in thread
From: Andy Lutomirski @ 2019-10-29 17:03 UTC (permalink / raw)
  To: Reshetova, Elena
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport, Tycho Andersen, Alan Cox

On Tue, Oct 29, 2019 at 4:25 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
>
> > The patch below aims to allow applications to create mappins that have
> > pages visible only to the owning process. Such mappings could be used to
> > store secrets so that these secrets are not visible neither to other
> > processes nor to the kernel.
>
> Hi Mike,
>
> I have actually been looking into the closely related problem for the past
> couple of weeks (on and off). What is common here is the need for userspace
> to indicate to kernel that some pages contain secrets. And then there are
> actually a number of things that kernel can do to try to protect these secrets
> better. Unmap from direct map is one of them. Another thing is to map such
> pages as non-cached, which can help us to prevent or considerably restrict
> speculation on such pages. The initial proof of concept for marking pages as
> "UNCACHED" that I got from Dave Hansen was actually based on mlock2()
> and a new flag for it for this purpose. Since then I have been thinking on what
> interface suits the use case better and actually selected going with new madvise()
> flag instead because of all possible implications for fragmentation and performance.

Doing all of this with MAP_SECRET seems bad to me.  If user code wants
UC memory, it should ask for UC memory -- having the kernel involved
in the decision to use UC memory is a bad idea, because the
performance impact of using UC memory where user code wasn't expecting
it wil be so bad that the system might as well not work at all.  (For
kicks, I once added a sysctl to turn off caching in CR0.  I enabled it
in gnome-shell.  The system slowed down to such an extent that I was
unable to enter the three or so keystrokes to turn it back off.)

EXCLUSIVE makes sense.  Saying "don't ptrace this" makes sense.  UC
makes sense.  But having one flag to rule them all does not make sense
to me.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-28 21:00             ` Peter Zijlstra
@ 2019-10-29 17:27               ` Edgecombe, Rick P
  2019-10-30 10:04                 ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Edgecombe, Rick P @ 2019-10-29 17:27 UTC (permalink / raw)
  To: peterz
  Cc: adobriyan, linux-kernel, rppt, rostedt, jejb, tglx, linux-mm,
	dave.hansen, linux-api, x86, akpm, hpa, mingo, luto, kirill, bp,
	rppt, arnd

On Mon, 2019-10-28 at 22:00 +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2019 at 07:59:25PM +0000, Edgecombe, Rick P wrote:
> > On Mon, 2019-10-28 at 14:55 +0100, Peter Zijlstra wrote:
> > > On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote:
> > > 
> > > > I think active use of this feature will lead to performance degradation
> > > > of
> > > > the system with time.
> > > > 
> > > > Setting a single 4k page non-present in the direct mapping will require
> > > > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > > > way road. We don't have any mechanism to map the memory with huge page
> > > > again after the application has freed the page.
> > > 
> > > Right, we recently had a 'bug' where ftrace triggered something like
> > > this and facebook ran into it as a performance regression. So yes, this
> > > is a real concern.
> > 
> > Don't e/cBPF filters also break the direct map down to 4k pages when calling
> > set_memory_ro() on the filter for 64 bit x86 and arm?
> > 
> > I've been wondering if the page allocator should make some effort to find a
> > broken down page for anything that can be known will have direct map
> > permissions
> > changed (or if it already groups them somehow). But also, why any potential
> > slowdown of 4k pages on the direct map hasn't been noticed for apps that do
> > a
> > lot of insertions and removals of BPF filters, if this is indeed the case.
> 
> That should be limited to the module range. Random data maps could
> shatter the world.

BPF has one vmalloc-space allocation for the bytecode and one module-space
allocation for the JIT. Both also get RO set on the direct map alias of
the pages, and reset to RW when freed.

You mean shatter performance?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29 17:03   ` Andy Lutomirski
@ 2019-10-29 17:37     ` Alan Cox
  2019-10-29 17:43     ` James Bottomley
  1 sibling, 0 replies; 60+ messages in thread
From: Alan Cox @ 2019-10-29 17:37 UTC (permalink / raw)
  To: Andy Lutomirski, Reshetova, Elena
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport,
	Tycho Andersen

> Doing all of this with MAP_SECRET seems bad to me.  If user code
> wants
> UC memory, it should ask for UC memory -- having the kernel involved
> in the decision to use UC memory is a bad idea, because the
> performance impact of using UC memory where user code wasn't
> expecting

The user has no idea that they want UC memory. What this means varies by
platform. There are some systems (e.g. in-order uclinux devices, M68K, old
Atoms) for which it probably means 'no-op'; there are those where UC helps,
those where it hinders, and those where WC is probably sufficient. There are
platforms where 'secret' memory might best be implemented by using on-die
memory pools or cache locking. It might even mean 'put me in a non-HT
cgroup'.

Secret might also mean 'not accessible by thunderbolt', or 'do not swap
unless swap is encrypted' and other things.

IMHO the question is what the actual semantics are here. What are you
asking for? Does it mean "at any cost"? What does it guarantee (100%
or statistically), what level of guarantee is acceptable, and what level is
-EOPNOTSUPP or similar?

I'm also wary of the focus always being on keys. If you decrypt a file,
I'm probably just as interested in the contents, so can I mmap a file
this way, and if so what happens on the unmap? Yes, key theft lets me do
all sorts of theoretical long-term bad stuff, but frequently data theft
is sufficient to do lots of practical short-term bad stuff. Also, as an
attacker I'm probably a script, and I don't want to be exposing my
master long term because they want the footprints gone.

> in gnome-shell.  The system slowed down to such an extent that I was
> unable to enter the three or so keystrokes to turn it back off.)

Yes - and any uncached pages also need to be kept away from anything
that the kernel touches under locks, or uses in atomic user operations.
Copy-on-write of an uncached page, for example, is suddenly really
slow, and there are so many other cases we'd have to find and deal with.

> EXCLUSIVE makes sense.  Saying "don't ptrace this" makes sense.  UC
> makes sense.  But having one flag to rule them all does not make
> sense
> to me.

We already support not ptracing, and if I can ptrace any of the code I
can access all of its code/data so that one isn't hard and the LSM
interfaces can do it. That one is easy - minus the fact that malware
writers are big fans of anything that stops tracing...

Alan



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29 17:03   ` Andy Lutomirski
  2019-10-29 17:37     ` Alan Cox
@ 2019-10-29 17:43     ` James Bottomley
  2019-10-29 18:10       ` Andy Lutomirski
  1 sibling, 1 reply; 60+ messages in thread
From: James Bottomley @ 2019-10-29 17:43 UTC (permalink / raw)
  To: Andy Lutomirski, Reshetova, Elena
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api, linux-mm, x86, Mike Rapoport, Tycho Andersen,
	Alan Cox

On Tue, 2019-10-29 at 10:03 -0700, Andy Lutomirski wrote:
> On Tue, Oct 29, 2019 at 4:25 AM Reshetova, Elena
> <elena.reshetova@intel.com> wrote:
> > 
> > > The patch below aims to allow applications to create mappins that
> > > have
> > > pages visible only to the owning process. Such mappings could be
> > > used to
> > > store secrets so that these secrets are not visible neither to
> > > other
> > > processes nor to the kernel.
> > 
> > Hi Mike,
> > 
> > I have actually been looking into the closely related problem for
> > the past
> > couple of weeks (on and off). What is common here is the need for
> > userspace
> > to indicate to kernel that some pages contain secrets. And then
> > there are
> > actually a number of things that kernel can do to try to protect
> > these secrets
> > better. Unmap from direct map is one of them. Another thing is to
> > map such
> > pages as non-cached, which can help us to prevent or considerably
> > restrict
> > speculation on such pages. The initial proof of concept for marking
> > pages as
> > "UNCACHED" that I got from Dave Hansen was actually based on
> > mlock2()
> > and a new flag for it for this purpose. Since then I have been
> > thinking on what
> > interface suits the use case better and actually selected going
> > with new madvise()
> > flag instead because of all possible implications for fragmentation
> > and performance.
> 
> Doing all of this with MAP_SECRET seems bad to me.  If user code
> wants UC memory, it should ask for UC memory -- having the kernel
> involved in the decision to use UC memory is a bad idea, because the
> performance impact of using UC memory where user code wasn't
> expecting it wil be so bad that the system might as well not work at
> all.  (For kicks, I once added a sysctl to turn off caching in
> CR0.  I enabled it in gnome-shell.  The system slowed down to such an
> extent that I was unable to enter the three or so keystrokes to turn
> it back off.)
> 
> EXCLUSIVE makes sense.  Saying "don't ptrace this" makes sense.  UC
> makes sense.  But having one flag to rule them all does not make
> sense to me.

So this is a usability problem.  We have a memory flag that can be used
for "secrecy" for some userspace value of the word and we have a load
of internal properties depending on how the hardware works, including
potentially some hardware additions like SEV or TME, that can be used
to implement the property.  If we expose our hardware vagaries, the
user is really not going to know what to do ... and we have a limited
number of flags to express this, so it stands to reason that we need to
define "secrecy" for the user and then implement it using whatever
flags we have.  So I think no ptrace and no direct map make sense for
pretty much any value of "secrecy".  The UC bit seems to be an attempt
to prevent exfiltration via L1TF or other cache side channels, so it
looks like it should only be applied if the side channel mitigations
aren't active ... which would tend to indicate it's a kernel decision
as well.

In the use case in my head, I'd like MAP_EXCLUSIVE to mean that the data in
the user memory is difficult for another tenant in a virtual system to
exfiltrate, even if they break containment, so effectively I want it
protected against kernel exploitation and root in the host ... and I
suppose I need to acknowledge that "protected" means the best effort
available on the platform, not that no attacker can ever extract it.

James




^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29 17:43     ` James Bottomley
@ 2019-10-29 18:10       ` Andy Lutomirski
  0 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2019-10-29 18:10 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andy Lutomirski, Reshetova, Elena, Mike Rapoport, linux-kernel,
	Alexey Dobriyan, Andrew Morton, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport, Tycho Andersen, Alan Cox

On Tue, Oct 29, 2019 at 10:44 AM James Bottomley <jejb@linux.ibm.com> wrote:
>
> On Tue, 2019-10-29 at 10:03 -0700, Andy Lutomirski wrote:
> > On Tue, Oct 29, 2019 at 4:25 AM Reshetova, Elena
> > <elena.reshetova@intel.com> wrote:
> > >
> > > > The patch below aims to allow applications to create mappins that
> > > > have
> > > > pages visible only to the owning process. Such mappings could be
> > > > used to
> > > > store secrets so that these secrets are not visible neither to
> > > > other
> > > > processes nor to the kernel.
> > >
> > > Hi Mike,
> > >
> > > I have actually been looking into the closely related problem for
> > > the past
> > > couple of weeks (on and off). What is common here is the need for
> > > userspace
> > > to indicate to kernel that some pages contain secrets. And then
> > > there are
> > > actually a number of things that kernel can do to try to protect
> > > these secrets
> > > better. Unmap from direct map is one of them. Another thing is to
> > > map such
> > > pages as non-cached, which can help us to prevent or considerably
> > > restrict
> > > speculation on such pages. The initial proof of concept for marking
> > > pages as
> > > "UNCACHED" that I got from Dave Hansen was actually based on
> > > mlock2()
> > > and a new flag for it for this purpose. Since then I have been
> > > thinking on what
> > > interface suits the use case better and actually selected going
> > > with new madvise()
> > > flag instead because of all possible implications for fragmentation
> > > and performance.
> >
> > Doing all of this with MAP_SECRET seems bad to me.  If user code
> > wants UC memory, it should ask for UC memory -- having the kernel
> > involved in the decision to use UC memory is a bad idea, because the
> > performance impact of using UC memory where user code wasn't
> > expecting it wil be so bad that the system might as well not work at
> > all.  (For kicks, I once added a sysctl to turn off caching in
> > CR0.  I enabled it in gnome-shell.  The system slowed down to such an
> > extent that I was unable to enter the three or so keystrokes to turn
> > it back off.)
> >
> > EXCLUSIVE makes sense.  Saying "don't ptrace this" makes sense.  UC
> > makes sense.  But having one flag to rule them all does not make
> > sense to me.
>
> So this is a usability problem.  We have a memory flag that can be used
> for "secrecy" for some userspace value of the word and we have a load
> of internal properties depending on how the hardware works, including
> potentially some hardware additions like SEV or TME, that can be used
> to implement the property.  If we expose our hardware vagaries, the
> user is really not going to know what to do ... and we have a limited
> number of flags to express this, so it stands to reason that we need to
> define "secrecy" for the user and then implement it using whatever
> flags we have.  So I think no ptrace and no direct map make sense for
> pretty much any value of "secrecy".  The UC bit seems to be an attempt
> to prevent exfiltration via L1TF or other cache side channels, so it
> looks like it should only be applied if the side channel mitigations
> aren't active ... which would tend to indicate it's a kernel decision
> as well.

I just don't think this will work in practice.  Someone will say "hey,
let's keep this giant buffer we do crypto from, or maybe even the
entire data area of some critical service, secret".  It will work
*fine* at first.  But then some kernel config changes and we can't do
DMA, and now it breaks on some configs.  Someone else will say "hey, I
don't have L1TF or whatever mitigation, let's turn on UC", and
everything goes to hell.

IMO the kernel should attempt to keep *all memory* secret.  Specific
applications that want greater levels of secrecy should opt in to more
expensive things.  Here's what's already on the table:

Exclusive / XPFO / XPO: allocation might be extremely expensive.
Overuse might hurt performance due to huge page fragmentation.  DMA may
not work.  Otherwise it's peachy.

SEV: Works only in some contexts.  The current kernel implementation
is, IMO, unacceptable to the extent that I wish I could go back in
time and NAK it.

TME: it's on or it's off.  There's no room for a MAP_ flag here.

MKTME: of highly dubious value here.  The only useful thing here I can
think of would be a MAP_NOTSECRET to opt *out* of encryption for a
specific range.  Other than that, it has all the same performance
implications that EXCLUSIVE has.

UC: Performance hit is extreme.  *Also* has the perf implications of
exclusive.  I can't imagine this making any sense except where the user
application is written in the expectation that UC might be used, so
that the access patterns would be reasonable.

WC: Same issues as UC plus memory ordering issues such that
unsuspecting applications will corrupt data.

Trying to bundle these together with kernel- or admin-only config
seems like a lost cause.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29  6:43           ` Kirill A. Shutemov
  2019-10-29  8:56             ` Peter Zijlstra
@ 2019-10-29 19:43             ` Dan Williams
  2019-10-29 20:07               ` Dave Hansen
  1 sibling, 1 reply; 60+ messages in thread
From: Dan Williams @ 2019-10-29 19:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Rapoport, Linux Kernel Mailing List, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API,
	linux-mm, the arch/x86 maintainers, Mike Rapoport

On Mon, Oct 28, 2019 at 11:43 PM Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> On Mon, Oct 28, 2019 at 10:43:51PM -0700, Dan Williams wrote:
> > On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > >
> > > On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> > > > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > > > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > > >
> > > > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > > > > the owning process and can be used by applications to store secret
> > > > > > information that will not be visible not only to other processes but to the
> > > > > > kernel as well.
> > > > > >
> > > > > > The pages in these mappings are removed from the kernel direct map and
> > > > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > > > > the pages are mapped back into the direct map.
> > > > >
> > > > > I probably blind, but I don't see where you manipulate direct map...
> > > >
> > > > __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> > > > set_direct_map_invalid_noflush() that makes the page not present.
> > >
> > > Ah. okay.
> > >
> > > I think active use of this feature will lead to performance degradation of
> > > the system with time.
> > >
> > > Setting a single 4k page non-present in the direct mapping will require
> > > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > > way road. We don't have any mechanism to map the memory with huge page
> > > again after the application has freed the page.
> > >
> > > It might be okay if all these pages cluster together, but I don't think we
> > > have a way to achieve it easily.
> >
> > Still, it would be worth exploring what that would look like if not
> > for MAP_EXCLUSIVE then set_mce_nospec() that wants to punch out poison
> > pages from the direct map. In the case of pmem, where those pages are
> > able to be repaired, it would be nice to also repair the mapping
> > granularity of the direct map.
>
> The solution has to consist of two parts: finding a range to collapse and
> actually collapsing the range into a huge page.
>
> Finding the collapsible range will likely require background scanning of
> the direct mapping as we do for THP with khugepaged. It should not too
> hard, but likely require long and tedious tuning to be effective, but not
> too disturbing for the system.
>
> Alternatively, after any changes to the direct mapping, we can initiate
> checking if the range is collapsible. Up to 1G around the changed 4k.
> It might be more taxing than scanning if direct mapping changes often.
>
> Collapsing itself appears to be simple: re-check if the range is
> collapsible under the lock, replace the page table with the huge page and
> flush the TLB.
>
> But some CPUs don't like to have two TLB entries for the same memory with
> different sizes at the same time. See for instance AMD erratum 383.

That basic description would seem to defeat most (all?) interesting
huge page use cases. For example dax makes no attempt to make sure
aliased mappings of pmem are the same size between the direct map that
the driver uses, and userspace dax mappings. So I assume there are
more details than "all aliased mappings must be the same size".

> Getting it right would require making the range not present, flush TLB and
> only then install huge page. That's what we do for userspace.
>
> It will not fly for the direct mapping. There is no reasonable way to
> exclude other CPU from accessing the range while it's not present (call
> stop_machine()? :P). Moreover, the range may contain the code that doing
> the collapse or data required for it...

At least for pmem all the access points can be controlled. pmem is
never used for kernel text, at least in the dax mode, where it is
accessed via file-backed shared mappings or the pmem driver. So when
I say "direct-map repair" I mean the incidental direct-map that pmem
uses since it maps pmem with arch_add_memory(), not the typical DRAM
direct-map that may house kernel text. Poison consumed from the kernel
DRAM direct-map is fatal, poison consumed from dax mappings and the
pmem driver path is recoverable and repairable.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29 19:43             ` [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Dan Williams
@ 2019-10-29 20:07               ` Dave Hansen
  0 siblings, 0 replies; 60+ messages in thread
From: Dave Hansen @ 2019-10-29 20:07 UTC (permalink / raw)
  To: Dan Williams, Kirill A. Shutemov
  Cc: Mike Rapoport, Linux Kernel Mailing List, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API,
	linux-mm, the arch/x86 maintainers, Mike Rapoport

On 10/29/19 12:43 PM, Dan Williams wrote:
>> But some CPUs don't like to have two TLB entries for the same memory with
>> different sizes at the same time. See for instance AMD erratum 383.
> That basic description would seem to defeat most (all?) interesting
> huge page use cases. For example dax makes no attempt to make sure
> aliased mappings of pmem are the same size between the direct map that
> the driver uses, and userspace dax mappings. So I assume there are
> more details than "all aliased mappings must be the same size".

These are about when large and small TLB entries could be held in the
TLB at the same time for the same virtual address in the same process.
It doesn't matter that two *different* mappings are using different page
sizes.

Imagine you were *just* changing the page size.  Without these errata,
you could just skip flushing the TLB.  You might use the old hardware
page size for a while, but it will be functionally OK.  With these
errata, we need to ensure in software that the old TLB entries for the
old page size are flushed before the new page size is established.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29 10:12             ` Christopher Lameter
@ 2019-10-30  7:11               ` Mike Rapoport
  2019-10-30 12:09                 ` Christopher Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2019-10-30  7:11 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Kirill A. Shutemov, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport

On Tue, Oct 29, 2019 at 10:12:04AM +0000, Christopher Lameter wrote:
> 
> 
> On Tue, 29 Oct 2019, Mike Rapoport wrote:
> 
> > I've talked with Thomas yesterday and he suggested something similar:
> >
> > When the MAP_EXCLUSIVE request comes for the first time, we allocate a huge
> > page for it and then use this page as a pool of 4K pages for subsequent
> > requests. Once this huge page is full we allocate a new one and append it
> > to the pool. When all the 4K pages that comprise the huge page are freed
> > the huge page is collapsed.
> 
> Or write a device driver that allows you to mmap a secure area and avoid
> all core kernel modifications?
> 
> /dev/securemem or so?

A device driver will need to remove the secure area from the direct map and
then we are back to square one.
 
> It may exist already.
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29 11:02   ` David Hildenbrand
@ 2019-10-30  8:15     ` Mike Rapoport
  2019-10-30  8:19       ` David Hildenbrand
  0 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2019-10-30  8:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On Tue, Oct 29, 2019 at 12:02:34PM +0100, David Hildenbrand wrote:
> On 27.10.19 11:17, Mike Rapoport wrote:
> >From: Mike Rapoport <rppt@linux.ibm.com>
> >
> >The mappings created with MAP_EXCLUSIVE are visible only in the context of
> >the owning process and can be used by applications to store secret
> >information that will not be visible not only to other processes but to the
> >kernel as well.
> >
> >The pages in these mappings are removed from the kernel direct map and
> >marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> >the pages are mapped back into the direct map.
> >
> 
> Just a thought, the kernel is still able to indirectly read the contents of
> these pages by doing a kdump from kexec environment, right?

Right.

> Also, I wonder
> what would happen if you map such pages via /dev/mem into another user space
> application and e.g., use them along with kvm [1].

Do you mean that one application creates MAP_EXCLUSIVE and another
application accesses the same physical pages via /dev/mem? 

With /dev/mem all physical memory is visible...
 
> [1] https://lwn.net/Articles/778240/
> 
> -- 
> 
> Thanks,
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30  8:15     ` Mike Rapoport
@ 2019-10-30  8:19       ` David Hildenbrand
  2019-10-31 19:16         ` Mike Rapoport
  0 siblings, 1 reply; 60+ messages in thread
From: David Hildenbrand @ 2019-10-30  8:19 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On 30.10.19 09:15, Mike Rapoport wrote:
> On Tue, Oct 29, 2019 at 12:02:34PM +0100, David Hildenbrand wrote:
>> On 27.10.19 11:17, Mike Rapoport wrote:
>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>
>>> The mappings created with MAP_EXCLUSIVE are visible only in the context of
>>> the owning process and can be used by applications to store secret
>>> information that will not be visible not only to other processes but to the
>>> kernel as well.
>>>
>>> The pages in these mappings are removed from the kernel direct map and
>>> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
>>> the pages are mapped back into the direct map.
>>>
>>
>> Just a thought, the kernel is still able to indirectly read the contents of
>> these pages by doing a kdump from kexec environment, right?
> 
> Right.
> 
>> Also, I wonder
>> what would happen if you map such pages via /dev/mem into another user space
>> application and e.g., use them along with kvm [1].
> 
> Do you mean that one application creates MAP_EXCLUSIVE and another
> applications accesses the same physical pages via /dev/mem?

Exactly.

> 
> With /dev/mem all physical memory is visible...

Okay, so the statement "information that will not be visible not only to 
other processes but to the kernel as well" is not correct. There are 
easy ways to access that information if you really want to (might 
require root permissions, though).

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29 17:00     ` Andy Lutomirski
@ 2019-10-30  8:40       ` Mike Rapoport
  2019-10-30 21:28         ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2019-10-30  8:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Alexey Dobriyan, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Linux API, Linux-MM, X86 ML, Mike Rapoport

On Tue, Oct 29, 2019 at 10:00:55AM -0700, Andy Lutomirski wrote:
> On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> > >
> > > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > >
> > > > Hi,
> > > >
> > > > The patch below aims to allow applications to create mappins that have
> > > > pages visible only to the owning process. Such mappings could be used to
> > > > store secrets so that these secrets are not visible neither to other
> > > > processes nor to the kernel.
> > > >
> > > > I've only tested the basic functionality, the changes should be verified
> > > > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> > >
> > > I’ve contemplated the concept a fair amount, and I think you should
> > > consider a change to the API. In particular, rather than having it be a
> > > MAP_ flag, make it a chardev.  You can, at least at first, allow only
> > > MAP_SHARED, and admins can decide who gets to use it.  It might also play
> > > better with the VM overall, and you won’t need a VM_ flag for it — you
> > > can just wire up .fault to do the right thing.
> >
> > I think mmap()/mprotect()/madvise() are the natural APIs for such
> > interface.
> 
> Then you have a whole bunch of questions to answer.  For example:
> 
> What happens if you mprotect() or similar when the mapping is already
> in use in a way that's incompatible with MAP_EXCLUSIVE?

Then we refuse to mprotect()? Like in any other case when vm_flags are not
compatible with the requested madvise()/mprotect() operation.

> Is it actually reasonable to malloc() some memory and then make it exclusive?
> 
> Are you permitted to map a file MAP_EXCLUSIVE?  What does it mean?

I'd limit MAP_EXCLUSIVE only to anonymous memory.

> What does MAP_PRIVATE | MAP_EXCLUSIVE do?

My preference is to have only mmap() and then the semantics are clearer:

MAP_PRIVATE | MAP_EXCLUSIVE creates a pre-populated region, marks it locked
and drops the pages in this region from the direct map.
The pages are mapped back into the direct map on munmap(). 
Then there is no way to change an existing area to be exclusive or vice
versa.
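
To make the intended flow concrete, here is a minimal userspace sketch of
what such an mmap()-only interface could look like. MAP_EXCLUSIVE is still
an RFC flag, so the value below is a placeholder for illustration only:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#ifndef MAP_EXCLUSIVE
	#define MAP_EXCLUSIVE	0x40000		/* placeholder, not the real value */
	#endif

	int main(void)
	{
		size_t len = 4096;
		char *secret;

		/* Pre-populated, locked, and dropped from the direct map. */
		secret = mmap(NULL, len, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE | MAP_ANONYMOUS | MAP_EXCLUSIVE, -1, 0);
		if (secret == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		strcpy(secret, "top secret");
		/* ... use the secret ... */
		memset(secret, 0, len);

		munmap(secret, len);	/* pages go back to the direct map */
		return 0;
	}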

> How does one pass exclusive memory via SCM_RIGHTS?  (If it's a
> memfd-like or chardev interface, it's trivial.  mmap(), not so much.)

Why would passing such memory via SCM_RIGHTS be useful?
 
> And finally, there's my personal giant pet peeve: a major use of this
> will be for virtualization.  I suspect that a lot of people would like
> the majority of KVM guest memory to be unmapped from the host
> pagetables.  But people might also like for guest memory to be
> unmapped in *QEMU's* pagetables, and mmap() is a basically worthless
> interface for this.  Getting fd-backed memory into a guest will take
> some possibly major work in the kernel, but getting vma-backed memory
> into a guest without mapping it in the host user address space seems
> much, much worse.

Well, in my view, MAP_EXCLUSIVE is intended to keep small secrets rather
than to cover the entire guest memory. I even considered adding a limit on
the mapping size, but then decided that since RLIMIT_MEMLOCK is enforced
anyway there is no need for a new one.

I agree that getting fd-backed memory into a guest would be less pain than
VMA, but KVM can already use memory outside the control of the kernel via
/dev/map [1].

So unless I'm missing something here, there is no need to use MAP_EXCLUSIVE
for the guest memory.

[1] https://lwn.net/Articles/778240/

> > Switching to a chardev doesn't solve the major problem of direct
> > map fragmentation and defeats the ability to use exclusive memory mappings
> > with the existing allocators, while mprotect() and madvise() do not.
> >
> 
> Will people really want to do malloc() and then remap it exclusive?
> This sounds dubiously useful at best.

Again, my preference is to have mmap() only, but I see a value in this use
case as well. Application developers allocate memory and then sometimes
change its properties rather than go mmap() something. For such usage
mprotect() may be useful.


-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-29 17:27               ` Edgecombe, Rick P
@ 2019-10-30 10:04                 ` Peter Zijlstra
  2019-10-30 15:35                   ` Alexei Starovoitov
  2019-10-30 17:48                   ` Edgecombe, Rick P
  0 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2019-10-30 10:04 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: adobriyan, linux-kernel, rppt, rostedt, jejb, tglx, linux-mm,
	dave.hansen, linux-api, x86, akpm, hpa, mingo, luto, kirill, bp,
	rppt, arnd

On Tue, Oct 29, 2019 at 05:27:43PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2019-10-28 at 22:00 +0100, Peter Zijlstra wrote:

> > That should be limited to the module range. Random data maps could
> > shatter the world.
> 
> BPF has one vmalloc space allocation for the byte code and one for the module
> space allocation for the JIT. Both get RO also set on the direct map alias of
> the pages, and reset RW when freed.

Argh, I didn't know they mapped the bytecode RO; why does it do that? It
can throw out the bytecode once it's JIT'ed.

> You mean shatter performance?

Shatter (all) large pages.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30  7:11               ` Mike Rapoport
@ 2019-10-30 12:09                 ` Christopher Lameter
  0 siblings, 0 replies; 60+ messages in thread
From: Christopher Lameter @ 2019-10-30 12:09 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kirill A. Shutemov, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport


On Wed, 30 Oct 2019, Mike Rapoport wrote:

> > /dev/securemem or so?
>
> A device driver will need to remove the secure area from the direct map and
> then we back to square one.

We have avoided the need for modifications to kernel core code. And it's a
natural thing to treat this like special memory provided by a device
driver.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30 10:04                 ` Peter Zijlstra
@ 2019-10-30 15:35                   ` Alexei Starovoitov
  2019-10-30 18:39                     ` Peter Zijlstra
  2019-10-30 17:48                   ` Edgecombe, Rick P
  1 sibling, 1 reply; 60+ messages in thread
From: Alexei Starovoitov @ 2019-10-30 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Edgecombe, Rick P, adobriyan, linux-kernel, rppt, rostedt, jejb,
	tglx, linux-mm, dave.hansen, linux-api, x86, akpm, hpa, mingo,
	luto, kirill, bp, rppt, arnd, Daniel Borkmann, bpf

On Wed, Oct 30, 2019 at 3:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Oct 29, 2019 at 05:27:43PM +0000, Edgecombe, Rick P wrote:
> > On Mon, 2019-10-28 at 22:00 +0100, Peter Zijlstra wrote:
>
> > > That should be limited to the module range. Random data maps could
> > > shatter the world.
> >
> > BPF has one vmalloc space allocation for the byte code and one for the module
> > space allocation for the JIT. Both get RO also set on the direct map alias of
> > the pages, and reset RW when freed.
>
> Argh, I didn't know they mapped the bytecode RO; why does it do that? It
> can throw out the bytecode once it's JIT'ed.

because of endless security "concerns" that some folks had.
Like, what if something can exploit another bug in the kernel
and modify bytecode that was already verified; then the
interpreter will execute that modified bytecode.
Sort of similar reasoning why .text is read-only.
I think it's not a realistic attack, but I didn't bother to argue back then.
The mere presence of interpreter itself is a real security concern.
People that care about speculation attacks should
have CONFIG_BPF_JIT_ALWAYS_ON=y,
so modifying bytecode via another exploit will be pointless.
Getting rid of RO for bytecode will save a ton of memory too,
since we won't need to allocate a full page for each small program.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30 10:04                 ` Peter Zijlstra
  2019-10-30 15:35                   ` Alexei Starovoitov
@ 2019-10-30 17:48                   ` Edgecombe, Rick P
  2019-10-30 17:58                     ` Dave Hansen
  1 sibling, 1 reply; 60+ messages in thread
From: Edgecombe, Rick P @ 2019-10-30 17:48 UTC (permalink / raw)
  To: peterz
  Cc: adobriyan, linux-kernel, rppt, rostedt, jejb, tglx, linux-mm,
	dave.hansen, linux-api, x86, akpm, hpa, mingo, luto, kirill, bp,
	rppt, arnd

On Wed, 2019-10-30 at 11:04 +0100, Peter Zijlstra wrote:
> > You mean shatter performance?
> 
> Shatter (all) large pages.

So it looks like this is already happening then to some degree. It's not just
BPF either, any module_alloc() user is going to do something similar with the
direct map alias of the page they got for the text.

So there must be at least some usages where breaking the direct map down, for
like a page to store a key or something, isn't totally horrible.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30 17:48                   ` Edgecombe, Rick P
@ 2019-10-30 17:58                     ` Dave Hansen
  2019-10-30 18:01                       ` Dave Hansen
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Hansen @ 2019-10-30 17:58 UTC (permalink / raw)
  To: Edgecombe, Rick P, peterz
  Cc: adobriyan, linux-kernel, rppt, rostedt, jejb, tglx, linux-mm,
	dave.hansen, linux-api, x86, akpm, hpa, mingo, luto, kirill, bp,
	rppt, arnd

On 10/30/19 10:48 AM, Edgecombe, Rick P wrote:
> On Wed, 2019-10-30 at 11:04 +0100, Peter Zijlstra wrote:
>>> You mean shatter performance?
>>
>> Shatter (all) large pages.
> 
> So it looks like this is already happening then to some degree. It's not just
> BPF either, any module_alloc() user is going to do something similar with the
> direct map alias of the page they got for the text.
> 
> So there must be at least some usages where breaking the direct map down, for
> like a page to store a key or something, isn't totally horrible.

The systems that really need large pages are the large ones.  They have
the same TLBs and data structures as really little systems, but orders
of magnitude more address space.  Modules and BPF are a (hopefully) drop
in the bucket on small systems and they're really inconsequential on
really big systems.

Modules also require privilege.

Allowing random user apps to fracture the direct map for every page of
their memory or *lots* of pages of their memory is an entirely different
kind of problem from modules.  It takes a "drop in the bucket"
fracturing and turns it into the common case.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30 17:58                     ` Dave Hansen
@ 2019-10-30 18:01                       ` Dave Hansen
  0 siblings, 0 replies; 60+ messages in thread
From: Dave Hansen @ 2019-10-30 18:01 UTC (permalink / raw)
  To: Edgecombe, Rick P, peterz
  Cc: adobriyan, linux-kernel, rppt, rostedt, jejb, tglx, linux-mm,
	dave.hansen, linux-api, x86, akpm, hpa, mingo, luto, kirill, bp,
	rppt, arnd

On 10/30/19 10:58 AM, Dave Hansen wrote:
> Modules also require privilege.

IMNHO, if BPF is fracturing large swaths of the direct map with no
privilege, it's only a matter of time until it starts to cause problems.
 The fact that we do it today is only evidence that we have a ticking
time bomb, not that it's OK.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30 15:35                   ` Alexei Starovoitov
@ 2019-10-30 18:39                     ` Peter Zijlstra
  2019-10-30 18:52                       ` Alexei Starovoitov
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2019-10-30 18:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Edgecombe, Rick P, adobriyan, linux-kernel, rppt, rostedt, jejb,
	tglx, linux-mm, dave.hansen, linux-api, x86, akpm, hpa, mingo,
	luto, kirill, bp, rppt, arnd, Daniel Borkmann, bpf

On Wed, Oct 30, 2019 at 08:35:09AM -0700, Alexei Starovoitov wrote:
> On Wed, Oct 30, 2019 at 3:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Oct 29, 2019 at 05:27:43PM +0000, Edgecombe, Rick P wrote:
> > > On Mon, 2019-10-28 at 22:00 +0100, Peter Zijlstra wrote:
> >
> > > > That should be limited to the module range. Random data maps could
> > > > shatter the world.
> > >
> > > BPF has one vmalloc space allocation for the byte code and one for the module
> > > space allocation for the JIT. Both get RO also set on the direct map alias of
> > > the pages, and reset RW when freed.
> >
> > Argh, I didn't know they mapped the bytecode RO; why does it do that? It
> > can throw out the bytecode once it's JIT'ed.
> 
> because of endless security "concerns" that some folks had.
> Like what if something can exploit another bug in the kernel
> and modify bytecode that was already verified
> then interpreter will execute that modified bytecode.

But when it's JIT'ed the bytecode is no longer of relevance, right? So
any scenario with a JIT on can then toss the bytecode and certainly
doesn't need to map it RO.

> Sort of similar reasoning why .text is read-only.
> I think it's not a realistic attack, but I didn't bother to argue back then.
> The mere presence of interpreter itself is a real security concern.
> People that care about speculation attacks should
> have CONFIG_BPF_JIT_ALWAYS_ON=y,

This isn't about speculation attacks, it is about breaking buffer limits
and being able to write to memory. And in that respect being able to
change the current task state (write its effective PID to 0) is much
simpler than writing to text or bytecode, but if you cannot reach/find
the task struct but can reach/find text..

> so modifying bytecode via another exploit will be pointless.
> Getting rid of RO for bytecode will save a ton of memory too,
> since we won't need to allocate full page for each small programs.

So I'm thinking we can get rid of that for any scenario that has the JIT
enabled -- not only JIT_ALWAYS_ON.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30 18:39                     ` Peter Zijlstra
@ 2019-10-30 18:52                       ` Alexei Starovoitov
  0 siblings, 0 replies; 60+ messages in thread
From: Alexei Starovoitov @ 2019-10-30 18:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Edgecombe, Rick P, adobriyan, linux-kernel, rppt, rostedt, jejb,
	tglx, linux-mm, dave.hansen, linux-api, x86, akpm, hpa, mingo,
	luto, kirill, bp, rppt, arnd, Daniel Borkmann, bpf

On Wed, Oct 30, 2019 at 11:39 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 30, 2019 at 08:35:09AM -0700, Alexei Starovoitov wrote:
> > On Wed, Oct 30, 2019 at 3:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Tue, Oct 29, 2019 at 05:27:43PM +0000, Edgecombe, Rick P wrote:
> > > > On Mon, 2019-10-28 at 22:00 +0100, Peter Zijlstra wrote:
> > >
> > > > > That should be limited to the module range. Random data maps could
> > > > > shatter the world.
> > > >
> > > > BPF has one vmalloc space allocation for the byte code and one for the module
> > > > space allocation for the JIT. Both get RO also set on the direct map alias of
> > > > the pages, and reset RW when freed.
> > >
> > > Argh, I didn't know they mapped the bytecode RO; why does it do that? It
> > > can throw out the bytecode once it's JIT'ed.
> >
> > because of endless security "concerns" that some folks had.
> > Like what if something can exploit another bug in the kernel
> > and modify bytecode that was already verified
> > then interpreter will execute that modified bytecode.
>
> But when it's JIT'ed the bytecode is no longer of relevance, right? So
> any scenario with a JIT on can then toss the bytecode and certainly
> doesn't need to map it RO.

We keep so-called "xlated" bytecode around for debugging.
It's the one that is actually running. It was modified through
several stages of the verifier before being runnable by the interpreter.
When folks debug stuff in production they want to see
the whole thing: both x86 asm and xlated bytecode.
The xlated bytecode is also sanitized before it's returned
back to user space.

> > Sort of similar reasoning why .text is read-only.
> > I think it's not a realistic attack, but I didn't bother to argue back then.
> > The mere presence of interpreter itself is a real security concern.
> > People that care about speculation attacks should
> > have CONFIG_BPF_JIT_ALWAYS_ON=y,
>
> This isn't about speculation attacks, it is about breaking buffer limits
> and being able to write to memory. And in that respect being able to
> change the current task state (write it's effective PID to 0) is much
> simpler than writing to text or bytecode, but if you cannot reach/find
> the task struct but can reach/find text..

exactly. that's why RO bytecode was dubious to me from the beginning.
For an attacker to write meaningful bytecode they need to know
quite a few other kernel internal pointers.
If an exploit can write into memory there are plenty of easier targets.

> > so modifying bytecode via another exploit will be pointless.
> > Getting rid of RO for bytecode will save a ton of memory too,
> > since we won't need to allocate full page for each small programs.
>
> So I'm thinking we can get rid of that for any scenario that has the JIT
> enabled -- not only JIT_ALWAYS_ON.

Sounds good to me. Happy to do that. Will add it to our todo list.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30  8:40       ` Mike Rapoport
@ 2019-10-30 21:28         ` Andy Lutomirski
  2019-10-31  7:21           ` Mike Rapoport
  2019-12-05 15:34           ` Mike Rapoport
  0 siblings, 2 replies; 60+ messages in thread
From: Andy Lutomirski @ 2019-10-30 21:28 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andy Lutomirski, LKML, Alexey Dobriyan, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Linux API, Linux-MM, X86 ML, Mike Rapoport

On Wed, Oct 30, 2019 at 1:40 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Oct 29, 2019 at 10:00:55AM -0700, Andy Lutomirski wrote:
> > On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> > > >
> > > > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@kernel.org> wrote:
> > > > >
> > > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > >
> > > > > Hi,
> > > > >
> > > > > The patch below aims to allow applications to create mappins that have
> > > > > pages visible only to the owning process. Such mappings could be used to
> > > > > store secrets so that these secrets are not visible neither to other
> > > > > processes nor to the kernel.
> > > > >
> > > > > I've only tested the basic functionality, the changes should be verified
> > > > > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> > > >
> > > > I’ve contemplated the concept a fair amount, and I think you should
> > > > consider a change to the API. In particular, rather than having it be a
> > > > MAP_ flag, make it a chardev.  You can, at least at first, allow only
> > > > MAP_SHARED, and admins can decide who gets to use it.  It might also play
> > > > better with the VM overall, and you won’t need a VM_ flag for it — you
> > > > can just wire up .fault to do the right thing.
> > >
> > > I think mmap()/mprotect()/madvise() are the natural APIs for such
> > > interface.
> >
> > Then you have a whole bunch of questions to answer.  For example:
> >
> > What happens if you mprotect() or similar when the mapping is already
> > in use in a way that's incompatible with MAP_EXCLUSIVE?
>
> Then we refuse to mprotect()? Like in any other case when vm_flags are not
> compatible with required madvise()/mprotect() operation.
>

I'm not talking about flags.  I'm talking about the case where one
thread (or RDMA or whatever) has get_user_pages()'d a mapping and
another thread mprotect()s it MAP_EXCLUSIVE.

> > Is it actually reasonable to malloc() some memory and then make it exclusive?
> >
> > Are you permitted to map a file MAP_EXCLUSIVE?  What does it mean?
>
> I'd limit MAP_EXCLUSIVE only to anonymous memory.
>
> > What does MAP_PRIVATE | MAP_EXCLUSIVE do?
>
> My preference is to have only mmap() and then the semantics is more clear:
>
> MAP_PRIVATE | MAP_EXCLUSIVE creates a pre-populated region, marks it locked
> and drops the pages in this region from the direct map.
> The pages are returned back on munmap().
> Then there is no way to change an existing area to be exclusive or vice
> versa.

And what happens if you fork()?  Limiting it to MAP_SHARED |
MAP_EXCLUSIVE would avoid this particular nasty question.

>
> > How does one pass exclusive memory via SCM_RIGHTS?  (If it's a
> > memfd-like or chardev interface, it's trivial.  mmap(), not so much.)
>
> Why passing such memory via SCM_RIGHTS would be useful?

Suppose I want to put a secret into exclusive memory and then send
that secret to some other process.  The obvious approach would be to
SCM_RIGHTS an fd over, but you can't do that with MAP_EXCLUSIVE as
you've defined it.  In general, there are lots of use cases for memfd
and other fd-backed memory.
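
For context, passing an fd-backed region around really is just the usual
SCM_RIGHTS plumbing; a minimal sketch of the sending side (the receiver
does the matching recvmsg()):

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	/* Send one file descriptor over a connected AF_UNIX socket. */
	static int send_fd(int sock, int fd)
	{
		char dummy = '\0';
		struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
		union {
			char buf[CMSG_SPACE(sizeof(int))];
			struct cmsghdr align;
		} u;
		struct msghdr msg = { 0 };
		struct cmsghdr *cmsg;

		memset(&u, 0, sizeof(u));
		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = u.buf;
		msg.msg_controllen = sizeof(u.buf);

		cmsg = CMSG_FIRSTHDR(&msg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_RIGHTS;
		cmsg->cmsg_len = CMSG_LEN(sizeof(int));
		memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

		return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
	}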

>
> > And finally, there's my personal giant pet peeve: a major use of this
> > will be for virtualization.  I suspect that a lot of people would like
> > the majority of KVM guest memory to be unmapped from the host
> > pagetables.  But people might also like for guest memory to be
> > unmapped in *QEMU's* pagetables, and mmap() is a basically worthless
> > interface for this.  Getting fd-backed memory into a guest will take
> > some possibly major work in the kernel, but getting vma-backed memory
> > into a guest without mapping it in the host user address space seems
> > much, much worse.
>
> Well, in my view, the MAP_EXCLUSIVE is intended to keep small secrets
> rather than use it for the entire guest memory. I even considered adding a
> limit for the mapping size, but then I decided that since RLIMIT_MEMLOCK is
> anyway enforced there is no need for a new one.
>
> I agree that getting fd-backed memory into a guest would be less pain that
> VMA, but KVM can already use memory outside the control of the kernel via
> /dev/map [1].

That series doesn't address the problem I'm talking about at all.  I'm
saying that there is a legitimate use case where QEMU should *not*
have a mapping of the memory.  So QEMU would create some exclusive
memory using /dev/exclusive_memory and would tell KVM to map it into
the guest without mapping it into QEMU's address space at all.

(In fact, the way that SEV currently works is *functionally* like
this, except that there's a bogus incoherent mapping in the QEMU
process that is a giant can of worms.)


IMO a major benefit of a chardev approach is that you don't need a new
VM_ flag and you don't need to worry about wiring it up everywhere in
the core mm code.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30 21:28         ` Andy Lutomirski
@ 2019-10-31  7:21           ` Mike Rapoport
  2019-12-05 15:34           ` Mike Rapoport
  1 sibling, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2019-10-31  7:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Alexey Dobriyan, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Linux API, Linux-MM, X86 ML, Mike Rapoport

On Wed, Oct 30, 2019 at 02:28:21PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 30, 2019 at 1:40 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Oct 29, 2019 at 10:00:55AM -0700, Andy Lutomirski wrote:
> > > On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> > > > >
> > > > > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@kernel.org> wrote:
> > > > > >
> > > > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > The patch below aims to allow applications to create mappins that have
> > > > > > pages visible only to the owning process. Such mappings could be used to
> > > > > > store secrets so that these secrets are not visible neither to other
> > > > > > processes nor to the kernel.
> > > > > >
> > > > > > I've only tested the basic functionality, the changes should be verified
> > > > > > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> > > > >
> > > > > I’ve contemplated the concept a fair amount, and I think you should
> > > > > consider a change to the API. In particular, rather than having it be a
> > > > > MAP_ flag, make it a chardev.  You can, at least at first, allow only
> > > > > MAP_SHARED, and admins can decide who gets to use it.  It might also play
> > > > > better with the VM overall, and you won’t need a VM_ flag for it — you
> > > > > can just wire up .fault to do the right thing.
> > > >
> > > > I think mmap()/mprotect()/madvise() are the natural APIs for such
> > > > interface.
> > >
> > > Then you have a whole bunch of questions to answer.  For example:
> > >
> > > What happens if you mprotect() or similar when the mapping is already
> > > in use in a way that's incompatible with MAP_EXCLUSIVE?
> >
> > Then we refuse to mprotect()? Like in any other case when vm_flags are not
> > compatible with required madvise()/mprotect() operation.
> >
> 
> I'm not talking about flags.  I'm talking about the case where one
> thread (or RDMA or whatever) has get_user_pages()'d a mapping and
> another thread mprotect()s it MAP_EXCLUSIVE.
> 
> > > Is it actually reasonable to malloc() some memory and then make it exclusive?
> > >
> > > Are you permitted to map a file MAP_EXCLUSIVE?  What does it mean?
> >
> > I'd limit MAP_EXCLUSIVE only to anonymous memory.
> >
> > > What does MAP_PRIVATE | MAP_EXCLUSIVE do?
> >
> > My preference is to have only mmap() and then the semantics is more clear:
> >
> > MAP_PRIVATE | MAP_EXCLUSIVE creates a pre-populated region, marks it locked
> > and drops the pages in this region from the direct map.
> > The pages are returned back on munmap().
> > Then there is no way to change an existing area to be exclusive or vice
> > versa.
> 
> And what happens if you fork()?  Limiting it to MAP_SHARED |
> MAP_EXCLUSIVE would about this particular nasty question.
> 
> >
> > > How does one pass exclusive memory via SCM_RIGHTS?  (If it's a
> > > memfd-like or chardev interface, it's trivial.  mmap(), not so much.)
> >
> > Why passing such memory via SCM_RIGHTS would be useful?
> 
> Suppose I want to put a secret into exclusive memory and then send
> that secret to some other process.  The obvious approach would be to
> SCM_RIGHTS an fd over, but you can't do that with MAP_EXCLUSIVE as
> you've defined it.  In general, there are lots of use cases for memfd
> and other fd-backed memory.
> 
> >
> > > And finally, there's my personal giant pet peeve: a major use of this
> > > will be for virtualization.  I suspect that a lot of people would like
> > > the majority of KVM guest memory to be unmapped from the host
> > > pagetables.  But people might also like for guest memory to be
> > > unmapped in *QEMU's* pagetables, and mmap() is a basically worthless
> > > interface for this.  Getting fd-backed memory into a guest will take
> > > some possibly major work in the kernel, but getting vma-backed memory
> > > into a guest without mapping it in the host user address space seems
> > > much, much worse.
> >
> > Well, in my view, the MAP_EXCLUSIVE is intended to keep small secrets
> > rather than use it for the entire guest memory. I even considered adding a
> > limit for the mapping size, but then I decided that since RLIMIT_MEMLOCK is
> > anyway enforced there is no need for a new one.
> >
> > I agree that getting fd-backed memory into a guest would be less pain that
> > VMA, but KVM can already use memory outside the control of the kernel via
> > /dev/map [1].
> 
> That series doesn't address the problem I'm talking about at all.  I'm
> saying that there is a legitimate use case where QEMU should *not*
> have a mapping of the memory.  So QEMU would create some exclusive
> memory using /dev/exclusive_memory and would tell KVM to map it into
> the guest without mapping it into QEMU's address space at all.
> 
> (In fact, the way that SEV currently works is *functionally* like
> this, except that there's a bogus incoherent mapping in the QEMU
> process that is a giant can of worms.
> 
> 
> IMO a major benefit of a chardev approach is that you don't need a new
> VM_ flag and you don't need to worry about wiring it up everywhere in
> the core mm code.

Ok, at last I'm starting to see your and Christoph's point.

Just to reiterate, we can use fd-backed memory via a /dev/exclusive_memory
chardev (or some other name we'll pick after long bikeshedding), and then
the .mmap method of this character device can do interesting things with
the backing physical memory. Since the memory is not VMA-mapped, we do not
have to find all the places in the core that might require a check of a VM_
flag to ensure there are no clashes with the exclusive memory.

Still, whatever we do with the mapping properties of this memory, we need
a solution for the splitting of the huge pages that map the direct map, but
this is an orthogonal problem in a way.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30  8:19       ` David Hildenbrand
@ 2019-10-31 19:16         ` Mike Rapoport
  2019-10-31 21:52           ` Dan Williams
  0 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2019-10-31 19:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

On Wed, Oct 30, 2019 at 09:19:33AM +0100, David Hildenbrand wrote:
> On 30.10.19 09:15, Mike Rapoport wrote:
> >On Tue, Oct 29, 2019 at 12:02:34PM +0100, David Hildenbrand wrote:
> >>On 27.10.19 11:17, Mike Rapoport wrote:
> >>>From: Mike Rapoport <rppt@linux.ibm.com>
> >>>
> >>>The mappings created with MAP_EXCLUSIVE are visible only in the context of
> >>>the owning process and can be used by applications to store secret
> >>>information that will not be visible not only to other processes but to the
> >>>kernel as well.
> >>>
> >>>The pages in these mappings are removed from the kernel direct map and
> >>>marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> >>>the pages are mapped back into the direct map.
> >>>
> >>
> >>Just a thought, the kernel is still able to indirectly read the contents of
> >>these pages by doing a kdump from kexec environment, right?
> >
> >Right.
> >
> >>Also, I wonder
> >>what would happen if you map such pages via /dev/mem into another user space
> >>application and e.g., use them along with kvm [1].
> >
> >Do you mean that one application creates MAP_EXCLUSIVE and another
> >applications accesses the same physical pages via /dev/mem?
> 
> Exactly.
> 
> >
> >With /dev/mem all physical memory is visible...
> 
> Okay, so the statement "information that will not be visible not only to
> other processes but to the kernel as well" is not correct. There are easy
> ways to access that information if you really want to (might require root
> permissions, though).

Right, but /dev/mem is an easy way to extract any information in any
environment if one has root permissions...
 
> -- 
> 
> Thanks,
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-31 19:16         ` Mike Rapoport
@ 2019-10-31 21:52           ` Dan Williams
  0 siblings, 0 replies; 60+ messages in thread
From: Dan Williams @ 2019-10-31 21:52 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David Hildenbrand, Linux Kernel Mailing List, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API,
	linux-mm, the arch/x86 maintainers, Mike Rapoport

On Thu, Oct 31, 2019 at 12:17 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, Oct 30, 2019 at 09:19:33AM +0100, David Hildenbrand wrote:
> > On 30.10.19 09:15, Mike Rapoport wrote:
> > >On Tue, Oct 29, 2019 at 12:02:34PM +0100, David Hildenbrand wrote:
> > >>On 27.10.19 11:17, Mike Rapoport wrote:
> > >>>From: Mike Rapoport <rppt@linux.ibm.com>
> > >>>
> > >>>The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > >>>the owning process and can be used by applications to store secret
> > >>>information that will not be visible not only to other processes but to the
> > >>>kernel as well.
> > >>>
> > >>>The pages in these mappings are removed from the kernel direct map and
> > >>>marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > >>>the pages are mapped back into the direct map.
> > >>>
> > >>
> > >>Just a thought, the kernel is still able to indirectly read the contents of
> > >>these pages by doing a kdump from kexec environment, right?
> > >
> > >Right.
> > >
> > >>Also, I wonder
> > >>what would happen if you map such pages via /dev/mem into another user space
> > >>application and e.g., use them along with kvm [1].
> > >
> > >Do you mean that one application creates MAP_EXCLUSIVE and another
> > >applications accesses the same physical pages via /dev/mem?
> >
> > Exactly.
> >
> > >
> > >With /dev/mem all physical memory is visible...
> >
> > Okay, so the statement "information that will not be visible not only to
> > other processes but to the kernel as well" is not correct. There are easy
> > ways to access that information if you really want to (might require root
> > permissions, though).
>
> Right, but /dev/mem is an easy way to extract any information in any
> environment if one has root permissions...
>

I don't understand this concern with /dev/mem. Just add these pages to
the growing list of the things /dev/mem is not allowed to touch.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: AMD TLB errata, (Was: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings)
  2019-10-29 12:39                 ` AMD TLB errata, (Was: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings) Peter Zijlstra
@ 2019-11-15 14:12                   ` Tom Lendacky
  2019-11-15 14:31                     ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Tom Lendacky @ 2019-11-15 14:12 UTC (permalink / raw)
  To: Peter Zijlstra, Kirill A. Shutemov
  Cc: Dan Williams, Mike Rapoport, Linux Kernel Mailing List,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API,
	linux-mm, the arch/x86 maintainers, Mike Rapoport

On 10/29/19 7:39 AM, Peter Zijlstra wrote:
> On Tue, Oct 29, 2019 at 02:00:24PM +0300, Kirill A. Shutemov wrote:
>> On Tue, Oct 29, 2019 at 09:56:02AM +0100, Peter Zijlstra wrote:
>>> On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote:
>>>> But some CPUs don't like to have two TLB entries for the same memory with
>>>> different sizes at the same time. See for instance AMD erratum 383.
>>>>
>>>> Getting it right would require making the range not present, flush TLB and
>>>> only then install huge page. That's what we do for userspace.
>>>>
>>>> It will not fly for the direct mapping. There is no reasonable way to
>>>> exclude other CPU from accessing the range while it's not present (call
>>>> stop_machine()? :P). Moreover, the range may contain the code that doing
>>>> the collapse or data required for it...
>>>>
>>>> BTW, looks like current __split_large_page() in pageattr.c is susceptible
>>>> to the errata. Maybe we can get away with the easy way...
>>>
>>> As you write above, there is just no way we can have a (temporary) hole
>>> in the direct map.
>>>
>>> We are careful about that other errata, and make sure both translations
>>> are identical wrt everything else.
>>
>> It's not clear if it is enough to avoid the issue. "under a highly specific
>> and detailed set of conditions" is not very specific set of conditions :P
> 
> Yeah, I know ... :/ Tom is there any chance you could shed a little more
> light on that errata?

I talked with some of the hardware folks and if you maintain the same bits
in the large and small pages (aside from the large page bit) until the
flush, then the errata should not occur.

The errata really applies to mappings that end up with different attribute
bits being set. Even then, it doesn't fail every time. There are other
conditions required to make it fail.
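
To make the ordering concrete, a sketch of the sequence this implies (the
helpers below are illustrative stubs, not real kernel APIs; only the
ordering matters):

	/* Illustrative stubs -- not real kernel interfaces. */
	static void install_4k_ptes_with_same_attrs(unsigned long pmd_va) { (void)pmd_va; }
	static void flush_tlb_kernel_range_stub(unsigned long s, unsigned long e) { (void)s; (void)e; }
	static void set_4k_pte_not_present(unsigned long va) { (void)va; }

	#define PMD_SIZE_STUB	(2UL << 20)
	#define PAGE_SIZE_STUB	4096UL

	/*
	 * Split a 2M direct-map page so a single 4k page can be unmapped,
	 * following the rule above: keep all attribute bits identical across
	 * the split (only the page-size bit differs), flush the stale
	 * large-page TLB entry, and only then change per-4k attributes.
	 */
	static void split_then_unmap_one_page(unsigned long pmd_va, unsigned long page_va)
	{
		install_4k_ptes_with_same_attrs(pmd_va);
		flush_tlb_kernel_range_stub(pmd_va, pmd_va + PMD_SIZE_STUB);

		set_4k_pte_not_present(page_va);
		flush_tlb_kernel_range_stub(page_va, page_va + PAGE_SIZE_STUB);
	}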

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: AMD TLB errata, (Was: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings)
  2019-11-15 14:12                   ` Tom Lendacky
@ 2019-11-15 14:31                     ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2019-11-15 14:31 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Kirill A. Shutemov, Dan Williams, Mike Rapoport,
	Linux Kernel Mailing List, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Linux API, linux-mm, the arch/x86 maintainers,
	Mike Rapoport

On Fri, Nov 15, 2019 at 08:12:52AM -0600, Tom Lendacky wrote:
> I talked with some of the hardware folks and if you maintain the same bits
> in the large and small pages (aside from the large page bit) until the
> flush, then the errata should not occur.

Excellent!

Thanks for digging that out Tom.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
  2019-10-30 21:28         ` Andy Lutomirski
  2019-10-31  7:21           ` Mike Rapoport
@ 2019-12-05 15:34           ` Mike Rapoport
  2019-12-08 14:10             ` [PATCH] mm: extend memfd with ability to create secret memory kbuild test robot
  1 sibling, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2019-12-05 15:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Alexey Dobriyan, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Linux API, Linux-MM, X86 ML, Mike Rapoport, Alan Cox, Reshetova,
	Elena, Tycho Andersen, Christopher Lameter

On Wed, Oct 30, 2019 at 02:28:21PM -0700, Andy Lutomirski wrote:
> 
> IMO a major benefit of a chardev approach is that you don't need a new
> VM_ flag and you don't need to worry about wiring it up everywhere in
> the core mm code.

I've done a couple of experiments with secret/exclusive/whatever
memory backed by a file descriptor, using a chardev and the memfd_create()
syscall. There is indeed no need for a VM_ flag, but there are still places
that would require special care, e.g. vm_normal_page() and madvise(DO_FORK),
so it won't be completely free of core mm modifications.

Below is a POC that implements an extension to memfd_create() that allows
mapping of "secret" memory. The "secrecy" mode should be explicitly set
using ioctl(); for now I've implemented exclusive and uncached mappings.

The POC is primarily intended to illustrate a possible userspace API for
fd-based secret memory. The idea is that the user creates a file
descriptor using a system call. The user then has to use ioctl() to define
the desired mode of operation, and only when the mode is set is it possible
to mmap() the memory. I.e. something like

	fd = memfd_create("secret", MFD_SECRET);
	ioctl(fd, MFD_SECRET_UNCACHED);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, 0);


The ioctl() allows a lot of flexibility in how the secrecy should be
defined. It could be either a request for a particular protection (e.g.
exclusive, uncached) or something like "secrecy level" from "a bit more
secret than normally" to "do your best even at the expense of performance".
The POC implements the first option and the modes are mutually exclusive
for now, but there is no fundamental reason they cannot be mixed.
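
A slightly fuller (hypothetical) usage example against this POC, with the
MFD_SECRET* definitions copied from the patch below since they do not exist
in any released headers and a POC kernel is required for this to work:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/ioctl.h>

	/* Copied from the POC's uapi additions; not part of any released ABI. */
	#define MFD_SECRET		0x0008U
	#define MFD_SECRET_IOCTL	'-'
	#define MFD_SECRET_EXCLUSIVE	_IOW(MFD_SECRET_IOCTL, 0x13, unsigned long)
	#define MFD_SECRET_UNCACHED	_IOW(MFD_SECRET_IOCTL, 0x14, unsigned long)

	int main(void)
	{
		size_t len = 4096;
		char *ptr;
		int fd;

		fd = memfd_create("secret", MFD_SECRET);
		if (fd < 0) {
			perror("memfd_create");
			return 1;
		}

		/* The secrecy mode must be set before mmap() can succeed. */
		if (ioctl(fd, MFD_SECRET_EXCLUSIVE) < 0) {
			perror("ioctl");
			return 1;
		}

		ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (ptr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		strcpy(ptr, "top secret");
		munmap(ptr, len);
		return 0;
	}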

I've chosen memfd over a chardev as it seems to play more neatly with
anon_inodes and would allow simple (ab)use of the page cache for tracking
pages allocated for the "secret" mappings as well as using
address_space_operations for e.g. page migration callbacks.

The POC implementation uses set_memory/pageattr APIs to manipulate the
direct map and does not address the direct map fragmentation issue.

Of course this is something that must be addressed, as well as the
modifications to core mm required to keep the secret memory secret, but I'd
really like to focus on the userspace ABI first.

From 0e1bd1b63f38685ffc5aab8c89b31086bde3da7b Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt@linux.ibm.com>
Date: Mon, 18 Nov 2019 09:32:22 +0200
Subject: [PATCH] mm: extend memfd with ability to create secret memory

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 include/linux/memfd.h      |   9 ++
 include/uapi/linux/magic.h |   1 +
 include/uapi/linux/memfd.h |   6 +
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/memfd.c                 |  10 +-
 mm/secretmem.c             | 233 +++++++++++++++++++++++++++++++++++++
 7 files changed, 261 insertions(+), 2 deletions(-)
 create mode 100644 mm/secretmem.c

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..d3ca7285f51a 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -13,4 +13,13 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 }
 #endif
 
+#ifdef CONFIG_MEMFD_SECRETMEM
+extern struct file *secretmem_file_create(const char *name, unsigned int flags);
+#else
+static inline struct file *secretmem_file_create(const char *name, unsigned int flags)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 903cc2d2750b..3dad6208c8de 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -94,5 +94,6 @@
 #define ZSMALLOC_MAGIC		0x58295829
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define Z3FOLD_MAGIC		0x33
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..3320a79b638d 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,12 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_SECRET		0x0008U
+
+/* ioctls for secret memory */
+#define MFD_SECRET_IOCTL '-'
+#define MFD_SECRET_EXCLUSIVE	_IOW(MFD_SECRET_IOCTL, 0x13, unsigned long)
+#define MFD_SECRET_UNCACHED	_IOW(MFD_SECRET_IOCTL, 0x14, unsigned long)
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb51..aa828f240287 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,7 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config MEMFD_SECRETMEM
+        def_bool MEMFD_CREATE && ARCH_HAS_SET_DIRECT_MAP
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d996846697ef..54cb8a60d698 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,3 +107,4 @@ obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_SECRETMEM) += secretmem.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 2647c898990c..3e1cc37e0389 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -245,7 +245,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_SECRET_MASK (MFD_CLOEXEC | MFD_SECRET)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_SECRET)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -257,6 +258,9 @@ SYSCALL_DEFINE2(memfd_create,
 	char *name;
 	long len;
 
+	if (flags & ~(unsigned int)MFD_SECRET_MASK)
+		return -EINVAL;
+
 	if (!(flags & MFD_HUGETLB)) {
 		if (flags & ~(unsigned int)MFD_ALL_FLAGS)
 			return -EINVAL;
@@ -296,7 +300,9 @@ SYSCALL_DEFINE2(memfd_create,
 		goto err_name;
 	}
 
-	if (flags & MFD_HUGETLB) {
+	if (flags & MFD_SECRET) {
+		file = secretmem_file_create(name, flags);
+	} else if (flags & MFD_HUGETLB) {
 		struct user_struct *user = NULL;
 
 		file = hugetlb_file_setup(name, 0, VM_NORESERVE, &user,
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..e787b8dc925b
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,233 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/printk.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/pseudo_fs.h>
+#include <linux/set_memory.h>
+#include <uapi/linux/memfd.h>
+#include <uapi/linux/magic.h>
+
+#include <asm/tlb.h>
+
+#define SECRETMEM_EXCLUSIVE	0x1
+#define SECRETMEM_UNCACHED	0x2
+
+struct secretmem_state {
+	unsigned int mode;
+};
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct secretmem_state *state = vmf->vma->vm_file->private_data;
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	pgoff_t offset = vmf->pgoff;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	page = find_get_page(mapping, offset);
+	if (!page) {
+		page = pagecache_get_page(mapping, offset,
+					  FGP_CREAT|FGP_FOR_MMAP,
+					  vmf->gfp_mask);
+		if (!page)
+			return vmf_error(-ENOMEM);
+
+		__SetPageUptodate(page);
+	}
+
+	if (state->mode == SECRETMEM_EXCLUSIVE)
+		err = set_direct_map_invalid_noflush(page);
+	else if (state->mode == SECRETMEM_UNCACHED)
+		err = set_pages_array_uc(&page, 1);
+	else
+		BUG();
+
+	if (err) {
+		delete_from_page_cache(page);
+		return vmf_error(err);
+	}
+
+	addr = (unsigned long)page_address(page);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+	vmf->page = page;
+	return  0;
+}
+
+static void secretmem_close(struct vm_area_struct *vma)
+{
+	struct secretmem_state *state = vma->vm_file->private_data;
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct page *page;
+	pgoff_t index;
+
+	xa_for_each(&mapping->i_pages, index, page) {
+		get_page(page);
+		lock_page(page);
+
+		if (state->mode == SECRETMEM_EXCLUSIVE)
+			set_direct_map_default_noflush(page);
+		else if (state->mode == SECRETMEM_UNCACHED)
+			set_pages_array_wb(&page, 1);
+		else
+			BUG();
+
+		__ClearPageDirty(page);
+		delete_from_page_cache(page);
+
+		unlock_page(page);
+		put_page(page);
+	}
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+	.close = secretmem_close,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct secretmem_state *state = file->private_data;
+	unsigned long mode = state->mode;
+
+	if (!mode)
+		return -EINVAL;
+
+	switch (mode) {
+	case SECRETMEM_UNCACHED:
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+		/* fallthrough */
+	case SECRETMEM_EXCLUSIVE:
+		vma->vm_ops = &secretmem_vm_ops;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static long secretmem_ioctl(struct file *file, unsigned cmd, unsigned long arg)
+{
+	struct secretmem_state *state = file->private_data;
+	unsigned long mode = state->mode;
+
+	if (mode)
+		return -EINVAL;
+
+	switch (cmd) {
+	case MFD_SECRET_EXCLUSIVE:
+		mode = SECRETMEM_EXCLUSIVE;
+		break;
+	case MFD_SECRET_UNCACHED:
+		mode = SECRETMEM_UNCACHED;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	state->mode = mode;
+
+	return 0;
+}
+
+static int secretmem_release(struct inode *inode, struct file *file)
+{
+	struct secretmem_state *state = file->private_data;
+
+	kfree(state);
+
+	return 0;
+}
+
+static const struct file_operations secretmem_fops = {
+	.release	= secretmem_release,
+	.mmap		= secretmem_mmap,
+	.unlocked_ioctl = secretmem_ioctl,
+	.compat_ioctl	= secretmem_ioctl,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_putback_page(struct page *page)
+{
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+	.putback_page	= secretmem_putback_page,
+};
+
+static struct vfsmount *secretmem_mnt;
+
+struct file *secretmem_file_create(const char *name, unsigned int flags)
+{
+	struct inode *inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct secretmem_state *state;
+
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
+	if (!state)
+		goto err_free_inode;
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_state;
+
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->private_data = state;
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	file->private_data = state;
+
+	return file;
+
+err_free_state:
+	kfree(state);
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);
-- 
2.24.0

-- 
Sincerely yours,
Mike.
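
For readers following along, here is a minimal user-space sketch (illustrative
only, not part of the patch) of how the proposed interface could be exercised.
It assumes the MFD_SECRET flag and the MFD_SECRET_EXCLUSIVE ioctl command added
by this series are visible through the installed uapi headers, and that the C
library exposes memfd_create():

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/memfd.h>

int main(void)
{
	/* Create a secret memfd; MFD_SECRET comes from the patched uapi header. */
	int fd = memfd_create("secret", MFD_CLOEXEC | MFD_SECRET);
	if (fd < 0) {
		perror("memfd_create");
		return 1;
	}

	/*
	 * Select the backing mode before the first mmap(); in exclusive mode
	 * the kernel drops faulted-in pages from its direct map.
	 */
	if (ioctl(fd, MFD_SECRET_EXCLUSIVE) < 0) {
		perror("ioctl(MFD_SECRET_EXCLUSIVE)");
		return 1;
	}

	if (ftruncate(fd, 4096) < 0) {
		perror("ftruncate");
		return 1;
	}

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	strcpy(p, "only visible through this mapping");

	munmap(p, 4096);
	close(fd);
	return 0;
}

With the exclusive mode selected, each page is removed from the direct map in
secretmem_fault() and restored in secretmem_close() when the mapping goes away.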

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH] mm: extend memfd with ability to create secret memory
  2019-12-05 15:34           ` Mike Rapoport
@ 2019-12-08 14:10             ` kbuild test robot
  0 siblings, 0 replies; 60+ messages in thread
From: kbuild test robot @ 2019-12-08 14:10 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: kbuild-all, Andy Lutomirski, LKML, Alexey Dobriyan,
	Andrew Morton, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Linux API, Linux-MM, X86 ML,
	Mike Rapoport, Alan Cox, Reshetova, Elena, Tycho Andersen,
	Christopher Lameter

[-- Attachment #1: Type: text/plain, Size: 3891 bytes --]

Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on mmotm/master]
[also build test ERROR on linux/master v5.4]
[cannot apply to arm-soc/for-next linus/master next-20191208]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. We also suggest using the '--base' option of git format-patch
to specify the base tree; please see https://stackoverflow.com/a/37406982]
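
For example, a command along these lines records the base tree in the generated
patch (illustrative; it assumes the work sits on a local branch whose configured
upstream is the tree the series was written against):

    git format-patch --base=auto -1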

url:    https://github.com/0day-ci/linux/commits/Mike-Rapoport/mm-extend-memfd-with-ability-to-create-secret-memory/20191207-130906
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: arm64-randconfig-a001-20191208 (attached as .config)
compiler: aarch64-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=arm64 

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/secretmem.c: In function 'secretmem_fault':
>> mm/secretmem.c:45:9: error: implicit declaration of function 'set_pages_array_uc'; did you mean 'set_page_dirty_lock'? [-Werror=implicit-function-declaration]
      err = set_pages_array_uc(&page, 1);
            ^~~~~~~~~~~~~~~~~~
            set_page_dirty_lock
   mm/secretmem.c: In function 'secretmem_close':
>> mm/secretmem.c:75:4: error: implicit declaration of function 'set_pages_array_wb'; did you mean 'set_page_dirty'? [-Werror=implicit-function-declaration]
       set_pages_array_wb(&page, 1);
       ^~~~~~~~~~~~~~~~~~
       set_page_dirty
   cc1: some warnings being treated as errors
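
The two failing calls appear to be x86-only helpers: set_pages_array_uc() and
set_pages_array_wb() are declared in arch/x86/include/asm/set_memory.h and have
no generic fallback, which is why the arm64 build of mm/secretmem.c cannot
resolve them. One possible, untested way to keep non-x86 randconfigs building
until a generic interface exists would be to narrow the Kconfig dependency, for
example:

	config MEMFD_SECRETMEM
		def_bool MEMFD_CREATE && ARCH_HAS_SET_DIRECT_MAP && X86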

vim +45 mm/secretmem.c

    21	
    22	static vm_fault_t secretmem_fault(struct vm_fault *vmf)
    23	{
    24		struct secretmem_state *state = vmf->vma->vm_file->private_data;
    25		struct address_space *mapping = vmf->vma->vm_file->f_mapping;
    26		pgoff_t offset = vmf->pgoff;
    27		unsigned long addr;
    28		struct page *page;
    29		int err;
    30	
    31		page = find_get_page(mapping, offset);
    32		if (!page) {
    33			page = pagecache_get_page(mapping, offset,
    34						  FGP_CREAT|FGP_FOR_MMAP,
    35						  vmf->gfp_mask);
    36			if (!page)
    37				return vmf_error(-ENOMEM);
    38	
    39			__SetPageUptodate(page);
    40		}
    41	
    42		if (state->mode == SECRETMEM_EXCLUSIVE)
    43			err = set_direct_map_invalid_noflush(page);
    44		else if (state->mode == SECRETMEM_UNCACHED)
  > 45			err = set_pages_array_uc(&page, 1);
    46		else
    47			BUG();
    48	
    49		if (err) {
    50			delete_from_page_cache(page);
    51			return vmf_error(err);
    52		}
    53	
    54		addr = (unsigned long)page_address(page);
    55		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
    56	
    57		vmf->page = page;
    58		return  0;
    59	}
    60	
    61	static void secretmem_close(struct vm_area_struct *vma)
    62	{
    63		struct secretmem_state *state = vma->vm_file->private_data;
    64		struct address_space *mapping = vma->vm_file->f_mapping;
    65		struct page *page;
    66		pgoff_t index;
    67	
    68		xa_for_each(&mapping->i_pages, index, page) {
    69			get_page(page);
    70			lock_page(page);
    71	
    72			if (state->mode == SECRETMEM_EXCLUSIVE)
    73				set_direct_map_default_noflush(page);
    74			else if (state->mode == SECRETMEM_UNCACHED)
  > 75				set_pages_array_wb(&page, 1);
    76			else
    77				BUG();
    78	
    79			__ClearPageDirty(page);
    80			delete_from_page_cache(page);
    81	
    82			unlock_page(page);
    83			put_page(page);
    84		}
    85	}
    86	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37879 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2019-12-08 14:11 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-27 10:17 [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Mike Rapoport
2019-10-27 10:17 ` Mike Rapoport
2019-10-28 12:31   ` Kirill A. Shutemov
2019-10-28 13:00     ` Mike Rapoport
2019-10-28 13:16       ` Kirill A. Shutemov
2019-10-28 13:55         ` Peter Zijlstra
2019-10-28 19:59           ` Edgecombe, Rick P
2019-10-28 21:00             ` Peter Zijlstra
2019-10-29 17:27               ` Edgecombe, Rick P
2019-10-30 10:04                 ` Peter Zijlstra
2019-10-30 15:35                   ` Alexei Starovoitov
2019-10-30 18:39                     ` Peter Zijlstra
2019-10-30 18:52                       ` Alexei Starovoitov
2019-10-30 17:48                   ` Edgecombe, Rick P
2019-10-30 17:58                     ` Dave Hansen
2019-10-30 18:01                       ` Dave Hansen
2019-10-29  5:43         ` Dan Williams
2019-10-29  6:43           ` Kirill A. Shutemov
2019-10-29  8:56             ` Peter Zijlstra
2019-10-29 11:00               ` Kirill A. Shutemov
2019-10-29 12:39                 ` AMD TLB errata, (Was: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings) Peter Zijlstra
2019-11-15 14:12                   ` Tom Lendacky
2019-11-15 14:31                     ` Peter Zijlstra
2019-10-29 19:43             ` [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Dan Williams
2019-10-29 20:07               ` Dave Hansen
2019-10-29  7:08         ` Christopher Lameter
2019-10-29  8:55           ` Mike Rapoport
2019-10-29 10:12             ` Christopher Lameter
2019-10-30  7:11               ` Mike Rapoport
2019-10-30 12:09                 ` Christopher Lameter
2019-10-28 14:55   ` David Hildenbrand
2019-10-28 17:12   ` Dave Hansen
2019-10-28 17:32     ` Sean Christopherson
2019-10-28 18:08     ` Matthew Wilcox
2019-10-29  9:28       ` Mike Rapoport
2019-10-29  9:19     ` Mike Rapoport
2019-10-28 18:02   ` Andy Lutomirski
2019-10-29 11:02   ` David Hildenbrand
2019-10-30  8:15     ` Mike Rapoport
2019-10-30  8:19       ` David Hildenbrand
2019-10-31 19:16         ` Mike Rapoport
2019-10-31 21:52           ` Dan Williams
2019-10-27 10:30 ` Florian Weimer
2019-10-27 11:00   ` Mike Rapoport
2019-10-28 20:23     ` Florian Weimer
2019-10-29  9:01       ` Mike Rapoport
2019-10-28 20:44 ` Andy Lutomirski
2019-10-29  9:32   ` Mike Rapoport
2019-10-29 17:00     ` Andy Lutomirski
2019-10-30  8:40       ` Mike Rapoport
2019-10-30 21:28         ` Andy Lutomirski
2019-10-31  7:21           ` Mike Rapoport
2019-12-05 15:34           ` Mike Rapoport
2019-12-08 14:10             ` [PATCH] mm: extend memfd with ability to create secret memory kbuild test robot
2019-10-29 11:25 ` [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings Reshetova, Elena
2019-10-29 15:13   ` Tycho Andersen
2019-10-29 17:03   ` Andy Lutomirski
2019-10-29 17:37     ` Alan Cox
2019-10-29 17:43     ` James Bottomley
2019-10-29 18:10       ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).