linux-mm.kvack.org archive mirror
* [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping
@ 2017-12-14 11:27 Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted() Peter Zijlstra
                   ` (17 more replies)
  0 siblings, 18 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

So here's a second posting of the VMA based LDT implementation; now without
most of the crazy.

I took out the write fault handler and the magic LAR touching code.

Additionally there are a bunch of patches that address generic vm issues:

 - gup() access control; in particular I looked at accessing !_PAGE_USER pages
   because these patches rely on that not being possible.

 - special mappings; a whole bunch of mmap ops don't make sense on special
   mappings, so disallow them.

Both things make sense independent of the rest of the series. Similarly, the
patches that kill that ridiculous LDT inheritance on exec() are also
unquestionably good.

So I think at least the first 6 patches are good, irrespective of the
VMA approach.

On the whole VMA approach: Andy, I know you hate it with a passion, but I
really rather like how it ties the LDT to the process it belongs to and how
it reduces the number of 'special' pages in the whole PTI mapping.

I'm not the one who will make the decision on this, but I figured I'd at
least post a version without the obvious crap parts of the last one.

Note: if we were to also disallow munmap() for special mappings (which I
suppose makes perfect sense) then we could further reduce the actual LDT
code (we'd no longer need the sm::close callback and related things).


* [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 12:41   ` Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 02/17] mm: Exempt special mappings from mlock(), mprotect() and madvise() Peter Zijlstra
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: peterz-vm-fix-gup.patch --]
[-- Type: text/plain, Size: 2637 bytes --]

The gup_*_range() functions which implement __get_user_pages_fast() do a
p*_access_permitted() test to see if the memory is at all accessible (this
tests both _PAGE_USER|_PAGE_RW and architectural things like pkeys).

But the follow_*() functions which implement __get_user_pages() do not
have this test. Recently, commit:

  5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup paths")

added it to a few specific write paths, but it failed to consistently
apply it (I've not audited anything outside of gup).

Revert the change from that patch and insert the tests in the right
locations such that they cover all READ / WRITE accesses for all
pte/pmd/pud levels.

In particular I care about the _PAGE_USER test: we should never allow
access to pages not marked with it. It also makes the pkey accesses more
consistent.
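
For reference, a minimal sketch of what pte_access_permitted() boils down
to on x86 (simplified; the real implementation additionally consults the
protection keys):

	/* Sketch only -- simplified x86 semantics, ignoring pkeys. */
	static inline bool pte_access_permitted_sketch(pte_t pte, bool write)
	{
		unsigned long need = _PAGE_PRESENT | _PAGE_USER;

		if (write)
			need |= _PAGE_RW;

		return (pte_flags(pte) & need) == need;
	}

The FOLL_WRITE bit of the gup flags maps onto the @write argument, which
is what the !!(flags & FOLL_WRITE) in the hunks below expresses.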

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/gup.c |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -66,7 +66,7 @@ static int follow_pfn_pte(struct vm_area
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_access_permitted(pte, WRITE) ||
+	return pte_write(pte) ||
 		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
 }
 
@@ -153,6 +153,11 @@ static struct page *follow_page_pte(stru
 	}
 
 	if (flags & FOLL_GET) {
+		if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
+			page = ERR_PTR(-EFAULT);
+			goto out;
+		}
+
 		get_page(page);
 
 		/* drop the pgmap reference now that we hold the page */
@@ -244,6 +249,15 @@ static struct page *follow_pmd_mask(stru
 			pmd_migration_entry_wait(mm, pmd);
 		goto retry;
 	}
+
+	if (flags & FOLL_GET) {
+		if (!pmd_access_permitted(*pmd, !!(flags & FOLL_WRITE))) {
+			page = ERR_PTR(-EFAULT);
+			spin_unlock(ptr);
+			return page;
+		}
+	}
+
 	if (pmd_devmap(*pmd)) {
 		ptl = pmd_lock(mm, pmd);
 		page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -326,6 +340,15 @@ static struct page *follow_pud_mask(stru
 			return page;
 		return no_page_table(vma, flags);
 	}
+
+	if (flags & FOLL_GET) {
+		if (!pud_access_permitted(*pud, !!(flags & FOLL_WRITE))) {
+			page = ERR_PTR(-EFAULT);
+			spin_unlock(ptr);
+			return page;
+		}
+	}
+
 	if (pud_devmap(*pud)) {
 		ptl = pud_lock(mm, pud);
 		page = follow_devmap_pud(vma, address, pud, flags);



* [PATCH v2 02/17] mm: Exempt special mappings from mlock(), mprotect() and madvise()
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted() Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 16:19   ` Andy Lutomirski
  2017-12-14 11:27 ` [PATCH v2 03/17] arch: Allow arch_dup_mmap() to fail Peter Zijlstra
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: peterz-vm-no-special-mapping.patch --]
[-- Type: text/plain, Size: 1650 bytes --]

It makes no sense to ever prod at special mappings with any of these
syscalls.

XXX should we include munmap()?
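
A sketch of how such a predicate could look (hypothetical: the helper's
definition is not part of this hunk, and mainline's
vma_is_special_mapping() takes the vm_special_mapping as a second
argument):

	static bool vma_is_special_mapping(const struct vm_area_struct *vma)
	{
		/*
		 * VMAs installed by _install_special_mapping() are
		 * recognisable by their vm_ops.
		 */
		return vma->vm_ops == &special_mapping_vmops ||
		       vma->vm_ops == &legacy_special_mapping_vmops;
	}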

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/madvise.c  |    3 +++
 mm/mlock.c    |    3 ++-
 mm/mprotect.c |    3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -678,6 +678,9 @@ static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
 {
+	if (vma_is_special_mapping(vma))
+		return -EINVAL;
+
 	switch (behavior) {
 	case MADV_REMOVE:
 		return madvise_remove(vma, prev, start, end);
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -521,7 +521,8 @@ static int mlock_fixup(struct vm_area_st
 	vm_flags_t old_flags = vma->vm_flags;
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
-	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
+	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
+	    vma_is_special_mapping(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -307,6 +307,9 @@ mprotect_fixup(struct vm_area_struct *vm
 		return 0;
 	}
 
+	if (vma_is_special_mapping(vma))
+		return -ENOMEM;
+
 	/*
 	 * If we make a private mapping writable we increase our commit;
 	 * but (without finer accounting) cannot reduce our commit if we



* [PATCH v2 03/17] arch: Allow arch_dup_mmap() to fail
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted() Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 02/17] mm: Exempt special mappings from mlock(), mprotect() and madvise() Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 16:22   ` Andy Lutomirski
  2017-12-14 11:27 ` [PATCH v2 04/17] x86/ldt: Rework locking Peter Zijlstra
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: arch--Allow-arch_dup_mmap---to-fail.patch --]
[-- Type: text/plain, Size: 3289 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

In order to sanitize the LDT initialization on x86, arch_dup_mmap() must
be allowed to fail. Fix up all instances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/powerpc/include/asm/mmu_context.h   |    5 +++--
 arch/um/include/asm/mmu_context.h        |    3 ++-
 arch/unicore32/include/asm/mmu_context.h |    5 +++--
 arch/x86/include/asm/mmu_context.h       |    4 ++--
 include/asm-generic/mm_hooks.h           |    5 +++--
 kernel/fork.c                            |    3 +--
 6 files changed, 14 insertions(+), 11 deletions(-)

--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -160,9 +160,10 @@ static inline void enter_lazy_tlb(struct
 #endif
 }
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+				struct mm_struct *mm)
 {
+	return 0;
 }
 
 #ifndef CONFIG_PPC_BOOK3S_64
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -15,9 +15,10 @@ extern void uml_setup_stubs(struct mm_st
 /*
  * Needed since we do not use the asm-generic/mm_hooks.h:
  */
-static inline void arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	uml_setup_stubs(mm);
+	return 0;
 }
 extern void arch_exit_mmap(struct mm_struct *mm);
 static inline void arch_unmap(struct mm_struct *mm,
--- a/arch/unicore32/include/asm/mmu_context.h
+++ b/arch/unicore32/include/asm/mmu_context.h
@@ -81,9 +81,10 @@ do { \
 	} \
 } while (0)
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+				struct mm_struct *mm)
 {
+	return 0;
 }
 
 static inline void arch_unmap(struct mm_struct *mm,
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -176,10 +176,10 @@ do {						\
 } while (0)
 #endif
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	paravirt_arch_dup_mmap(oldmm, mm);
+	return 0;
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/include/asm-generic/mm_hooks.h
+++ b/include/asm-generic/mm_hooks.h
@@ -7,9 +7,10 @@
 #ifndef _ASM_GENERIC_MM_HOOKS_H
 #define _ASM_GENERIC_MM_HOOKS_H
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+				struct mm_struct *mm)
 {
+	return 0;
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -721,8 +721,7 @@ static __latent_entropy int dup_mmap(str
 			goto out;
 	}
 	/* a new mm has just been created */
-	arch_dup_mmap(oldmm, mm);
-	retval = 0;
+	retval = arch_dup_mmap(oldmm, mm);
 out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);



* [PATCH v2 04/17] x86/ldt: Rework locking
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 03/17] arch: Allow arch_dup_mmap() to fail Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 05/17] x86/ldt: Prevent ldt inheritance on exec Peter Zijlstra
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: x86-ldt--Rework-locking.patch --]
[-- Type: text/plain, Size: 4715 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The LDT is duplicated on fork() and on exec(), which is wrong as exec()
should start from a clean state, i.e. without LDT. To fix this, the LDT
duplication code will be moved into arch_dup_mmap(), which is only called
for fork().

This introduces a locking problem. arch_dup_mmap() holds mmap_sem of the
parent process, but the LDT duplication code needs to acquire
mm->context.lock to access the LDT data safely, which is the reverse lock
order of write_ldt() where mmap_sem nests into context.lock.

Solve this by introducing a new rw_semaphore which serializes the
read_ldt()/write_ldt() syscall operations, and use context.lock to protect
the actual installation of the LDT descriptor.

So context.lock stabilizes mm->context.ldt and can nest inside the new
semaphore or mmap_sem.
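
To illustrate the inversion being resolved (simplified call chains, for
exposition only):

	/*
	 * fork()/dup_mmap():                  write_ldt():
	 *   down_write(&oldmm->mmap_sem)        mutex_lock(&mm->context.lock)
	 *   arch_dup_mmap()                     ...
	 *     mutex_lock(&mm->context.lock)       down_read(&mm->mmap_sem)
	 *
	 * With ldt_usr_sem serializing read_ldt()/write_ldt(), context.lock
	 * only covers the actual LDT install and may nest inside either
	 * mmap_sem or ldt_usr_sem.
	 */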

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu.h         |    4 +++-
 arch/x86/include/asm/mmu_context.h |    2 ++
 arch/x86/kernel/ldt.c              |   33 +++++++++++++++++++++------------
 3 files changed, 26 insertions(+), 13 deletions(-)

--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_MMU_H
 
 #include <linux/spinlock.h>
+#include <linux/rwsem.h>
 #include <linux/mutex.h>
 #include <linux/atomic.h>
 
@@ -27,7 +28,8 @@ typedef struct {
 	atomic64_t tlb_gen;
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
-	struct ldt_struct *ldt;
+	struct rw_semaphore	ldt_usr_sem;
+	struct ldt_struct	*ldt;
 #endif
 
 #ifdef CONFIG_X86_64
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -132,6 +132,8 @@ void enter_lazy_tlb(struct mm_struct *mm
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
 {
+	mutex_init(&mm->context.lock);
+
 	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
 	atomic64_set(&mm->context.tlb_gen, 0);
 
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -5,6 +5,11 @@
  * Copyright (C) 2002 Andi Kleen
  *
  * This handles calls from both 32bit and 64bit mode.
+ *
+ * Lock order:
+ *	context.ldt_usr_sem
+ *	  mmap_sem
+ *	    context.lock
  */
 
 #include <linux/errno.h>
@@ -42,7 +47,7 @@ static void refresh_ldt_segments(void)
 #endif
 }
 
-/* context.lock is held for us, so we don't need any locking. */
+/* context.lock is held by the task which issued the smp function call */
 static void flush_ldt(void *__mm)
 {
 	struct mm_struct *mm = __mm;
@@ -99,15 +104,17 @@ static void finalize_ldt_struct(struct l
 	paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
 }
 
-/* context.lock is held */
-static void install_ldt(struct mm_struct *current_mm,
-			struct ldt_struct *ldt)
+static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt)
 {
+	mutex_lock(&mm->context.lock);
+
 	/* Synchronizes with READ_ONCE in load_mm_ldt. */
-	smp_store_release(&current_mm->context.ldt, ldt);
+	smp_store_release(&mm->context.ldt, ldt);
 
-	/* Activate the LDT for all CPUs using current_mm. */
-	on_each_cpu_mask(mm_cpumask(current_mm), flush_ldt, current_mm, true);
+	/* Activate the LDT for all CPUs using current's mm. */
+	on_each_cpu_mask(mm_cpumask(mm), flush_ldt, mm, true);
+
+	mutex_unlock(&mm->context.lock);
 }
 
 static void free_ldt_struct(struct ldt_struct *ldt)
@@ -133,7 +140,8 @@ int init_new_context_ldt(struct task_str
 	struct mm_struct *old_mm;
 	int retval = 0;
 
-	mutex_init(&mm->context.lock);
+	init_rwsem(&mm->context.ldt_usr_sem);
+
 	old_mm = current->mm;
 	if (!old_mm) {
 		mm->context.ldt = NULL;
@@ -180,7 +188,7 @@ static int read_ldt(void __user *ptr, un
 	unsigned long entries_size;
 	int retval;
 
-	mutex_lock(&mm->context.lock);
+	down_read(&mm->context.ldt_usr_sem);
 
 	if (!mm->context.ldt) {
 		retval = 0;
@@ -209,7 +217,7 @@ static int read_ldt(void __user *ptr, un
 	retval = bytecount;
 
 out_unlock:
-	mutex_unlock(&mm->context.lock);
+	up_read(&mm->context.ldt_usr_sem);
 	return retval;
 }
 
@@ -269,7 +277,8 @@ static int write_ldt(void __user *ptr, u
 			ldt.avl = 0;
 	}
 
-	mutex_lock(&mm->context.lock);
+	if (down_write_killable(&mm->context.ldt_usr_sem))
+		return -EINTR;
 
 	old_ldt       = mm->context.ldt;
 	old_nr_entries = old_ldt ? old_ldt->nr_entries : 0;
@@ -291,7 +300,7 @@ static int write_ldt(void __user *ptr, u
 	error = 0;
 
 out_unlock:
-	mutex_unlock(&mm->context.lock);
+	up_write(&mm->context.ldt_usr_sem);
 out:
 	return error;
 }



* [PATCH v2 05/17] x86/ldt: Prevent ldt inheritance on exec
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (3 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 04/17] x86/ldt: Rework locking Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 16:32   ` Andy Lutomirski
  2017-12-14 11:27 ` [PATCH v2 06/17] x86/ldt: Do not install LDT for kernel threads Peter Zijlstra
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: x86-ldt--Prevent-ldt-inheritance-on-exec.patch --]
[-- Type: text/plain, Size: 4540 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The LDT is inherited across fork() and exec(), but that makes no sense
at all because exec() is supposed to start the process clean.

The reason why this happens is that init_new_context_ldt() is called from
init_new_context() which obviously needs to be called for both fork() and
exec().
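
For reference, the call paths involved (simplified sketch):

	fork():  dup_mm()   -> mm_init() -> init_new_context() -> init_new_context_ldt()
	exec():  mm_alloc() -> mm_init() -> init_new_context() -> init_new_context_ldt()

which is why the LDT duplication has to move out of init_new_context()
and into the fork()-only arch_dup_mmap().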

It would be surprising if anything relies on that behaviour, so it seems to
be safe to remove that misfeature.

Split the context initialization into two parts. Clear the ldt pointer and
initialize the mutex from the general context init and move the LDT
duplication to arch_dup_mmap() which is only called on fork().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu_context.h    |   21 ++++++++++++++-------
 arch/x86/kernel/ldt.c                 |   18 +++++-------------
 tools/testing/selftests/x86/ldt_gdt.c |    9 +++------
 3 files changed, 22 insertions(+), 26 deletions(-)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -57,11 +57,17 @@ struct ldt_struct {
 /*
  * Used for LDT copy/destruction.
  */
-int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm);
+static inline void init_new_context_ldt(struct mm_struct *mm)
+{
+	mm->context.ldt = NULL;
+	init_rwsem(&mm->context.ldt_usr_sem);
+}
+int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
 void destroy_context_ldt(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
-static inline int init_new_context_ldt(struct task_struct *tsk,
-				       struct mm_struct *mm)
+static inline void init_new_context_ldt(struct mm_struct *mm) { }
+static inline int ldt_dup_context(struct mm_struct *oldmm,
+				  struct mm_struct *mm)
 {
 	return 0;
 }
@@ -137,15 +143,16 @@ static inline int init_new_context(struc
 	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
 	atomic64_set(&mm->context.tlb_gen, 0);
 
-	#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
 		/* pkey 0 is the default and always allocated */
 		mm->context.pkey_allocation_map = 0x1;
 		/* -1 means unallocated or invalid */
 		mm->context.execute_only_pkey = -1;
 	}
-	#endif
-	return init_new_context_ldt(tsk, mm);
+#endif
+	init_new_context_ldt(mm);
+	return 0;
 }
 static inline void destroy_context(struct mm_struct *mm)
 {
@@ -181,7 +188,7 @@ do {						\
 static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	paravirt_arch_dup_mmap(oldmm, mm);
-	return 0;
+	return ldt_dup_context(oldmm, mm);
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -131,28 +131,20 @@ static void free_ldt_struct(struct ldt_s
 }
 
 /*
- * we do not have to muck with descriptors here, that is
- * done in switch_mm() as needed.
+ * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
+ * the new task is not running, so nothing can be installed.
  */
-int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
+int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
 {
 	struct ldt_struct *new_ldt;
-	struct mm_struct *old_mm;
 	int retval = 0;
 
-	init_rwsem(&mm->context.ldt_usr_sem);
-
-	old_mm = current->mm;
-	if (!old_mm) {
-		mm->context.ldt = NULL;
+	if (!old_mm)
 		return 0;
-	}
 
 	mutex_lock(&old_mm->context.lock);
-	if (!old_mm->context.ldt) {
-		mm->context.ldt = NULL;
+	if (!old_mm->context.ldt)
 		goto out_unlock;
-	}
 
 	new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
 	if (!new_ldt) {
--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -627,13 +627,10 @@ static void do_multicpu_tests(void)
 static int finish_exec_test(void)
 {
 	/*
-	 * In a sensible world, this would be check_invalid_segment(0, 1);
-	 * For better or for worse, though, the LDT is inherited across exec.
-	 * We can probably change this safely, but for now we test it.
+	 * Older kernel versions did inherit the LDT on exec() which is
+	 * wrong because exec() starts from a clean state.
 	 */
-	check_valid_segment(0, 1,
-			    AR_DPL3 | AR_TYPE_XRCODE | AR_S | AR_P | AR_DB,
-			    42, true);
+	check_invalid_segment(0, 1);
 
 	return nerrs ? 1 : 0;
 }



* [PATCH v2 06/17] x86/ldt: Do not install LDT for kernel threads
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (4 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 05/17] x86/ldt: Prevent ldt inheritance on exec Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 19:43   ` Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 07/17] mm/softdirty: Move VM_SOFTDIRTY into high bits Peter Zijlstra
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: x86-ldt--Do-not-install-LDT-for-kernel-threads.patch --]
[-- Type: text/plain, Size: 975 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Kernel threads can use the mm of a user process temporarily via use_mm(),
but there is no point in installing the LDT associated with that mm for
the kernel thread.
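
For illustration, the pattern in question looks roughly like this (a
sketch; vhost is the typical user):

	use_mm(mm);				/* ends up loading mm's LDT */
	rc = copy_from_user(buf, uptr, len);	/* needs the page tables only */
	unuse_mm(mm);

The kernel thread never executes user code, so it never loads a segment
selector that references the LDT; skipping set_ldt() is therefore safe.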

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu_context.h |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -95,8 +95,7 @@ static inline void load_mm_ldt(struct mm
 	 * the local LDT after an IPI loaded a newer value than the one
 	 * that we can see.
 	 */
-
-	if (unlikely(ldt))
+	if (unlikely(ldt && !(current->flags & PF_KTHREAD)))
 		set_ldt(ldt->entries, ldt->nr_entries);
 	else
 		clear_LDT();



* [PATCH v2 07/17] mm/softdirty: Move VM_SOFTDIRTY into high bits
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (5 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 06/17] x86/ldt: Do not install LDT for kernel threads Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 08/17] mm/x86: Allow special mappings with user access cleared Peter Zijlstra
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: mm-softdirty--Move-VM_SOFTDIRTY-into-high-bits.patch --]
[-- Type: text/plain, Size: 2791 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Only 64-bit architectures (x86_64, s390, PPC_BOOK3S_64) have support for
HAVE_ARCH_SOFT_DIRTY, so ensure they all select ARCH_USES_HIGH_VMA_FLAGS
and move the VM_SOFTDIRTY flag into the high flags. This frees up bit
0x08000000 for an architecture-specific flag, which the next patch uses.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/powerpc/platforms/Kconfig.cputype |    1 +
 arch/s390/Kconfig                      |    1 +
 include/linux/mm.h                     |   17 +++++++++++------
 3 files changed, 13 insertions(+), 6 deletions(-)

--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -76,6 +76,7 @@ config PPC_BOOK3S_64
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select IRQ_WORK
 	select HAVE_KERNEL_XZ
+	select ARCH_USES_HIGH_VMA_FLAGS
 
 config PPC_BOOK3E_64
 	bool "Embedded processors"
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -131,6 +131,7 @@ config S390
 	select CPU_NO_EFFICIENT_FFS if !HAVE_MARCH_Z9_109_FEATURES
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_SOFT_DIRTY
+	select ARCH_USES_HIGH_VMA_FLAGS
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -194,12 +194,6 @@ extern unsigned int kobjsize(const void
 #define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
 
-#ifdef CONFIG_MEM_SOFT_DIRTY
-# define VM_SOFTDIRTY	0x08000000	/* Not soft dirty clean area */
-#else
-# define VM_SOFTDIRTY	0
-#endif
-
 #define VM_MIXEDMAP	0x10000000	/* Can contain "struct page" and pure PFN pages */
 #define VM_HUGEPAGE	0x20000000	/* MADV_HUGEPAGE marked this vma */
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
@@ -216,8 +210,19 @@ extern unsigned int kobjsize(const void
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+
+#define VM_HIGH_SOFTDIRTY_BIT	37	/* bit only usable on 64-bit architectures */
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#ifdef CONFIG_MEM_SOFT_DIRTY
+# ifndef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#  error MEM_SOFT_DIRTY depends on ARCH_USES_HIGH_VMA_FLAGS
+# endif
+# define VM_SOFTDIRTY		BIT(VM_HIGH_SOFTDIRTY_BIT) /* Not soft dirty clean area */
+#else
+# define VM_SOFTDIRTY		VM_NONE
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)



* [PATCH v2 08/17] mm/x86: Allow special mappings with user access cleared
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (6 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 07/17] mm/softdirty: Move VM_SOFTDIRTY into high bits Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 09/17] mm: Provide vm_special_mapping::close Peter Zijlstra
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: mm--Allow-special-mappings-with-user-access-cleared.patch --]
[-- Type: text/plain, Size: 3277 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

In order to create VMAs that are not accessible to userspace, create a new
VM_NOUSER flag. This can be used in conjunction with
install_special_mapping() to inject 'kernel' data into the userspace map.

Similar to how arch_vm_get_page_prot() allows adding _PAGE_* flags to the
pgprot_t, introduce arch_vm_get_page_prot_excl(), which masks _PAGE_*
flags out of the pgprot_t, and use this to implement VM_NOUSER for x86.

get_user_pages() will allow things like FOLL_POPULATE but will fail
FOLL_GET (with or without FOLL_WRITE).
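
As a worked example (x86 sketch, using the flags from this patch):

	pgprot_t prot = vm_get_page_prot(VM_READ | VM_NOUSER);
	/*
	 * protection_map[VM_READ] yields a user-readable protection
	 * (_PAGE_PRESENT|_PAGE_USER|...); arch_vm_get_page_prot_excl()
	 * then masks _PAGE_USER back out, so the resulting PTEs are
	 * present but not user-accessible and userspace access faults.
	 */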

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/uapi/asm/mman.h |    4 ++++
 include/linux/mm.h               |    2 ++
 include/linux/mman.h             |    4 ++++
 mm/mmap.c                        |   12 ++++++++++--
 4 files changed, 20 insertions(+), 2 deletions(-)

--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -26,6 +26,10 @@
 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
+#define arch_vm_get_page_prot_excl(vm_flags) __pgprot(		\
+		((vm_flags) & VM_NOUSER ? _PAGE_USER : 0)	\
+		)
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -193,6 +193,7 @@ extern unsigned int kobjsize(const void
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
 #define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
+#define VM_ARCH_0	0x08000000	/* Architecture-specific flag */
 
 #define VM_MIXEDMAP	0x10000000	/* Can contain "struct page" and pure PFN pages */
 #define VM_HUGEPAGE	0x20000000	/* MADV_HUGEPAGE marked this vma */
@@ -224,6 +225,7 @@ extern unsigned int kobjsize(const void
 #endif
 
 #if defined(CONFIG_X86)
+# define VM_NOUSER	VM_ARCH_0	/* Not accessible by userspace */
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -43,6 +43,10 @@ static inline void vm_unacct_memory(long
 #define arch_vm_get_page_prot(vm_flags) __pgprot(0)
 #endif
 
+#ifndef arch_vm_get_page_prot_excl
+#define arch_vm_get_page_prot_excl(vm_flags) __pgprot(0)
+#endif
+
 #ifndef arch_validate_prot
 /*
  * This is called from mprotect().  PROT_GROWSDOWN and PROT_GROWSUP have
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -102,9 +102,17 @@ pgprot_t protection_map[16] __ro_after_i
 
 pgprot_t vm_get_page_prot(unsigned long vm_flags)
 {
-	return __pgprot(pgprot_val(protection_map[vm_flags &
-				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
+	pgprot_t prot;
+
+	prot = protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+
+	prot = __pgprot(pgprot_val(prot) |
 			pgprot_val(arch_vm_get_page_prot(vm_flags)));
+
+	prot = __pgprot(pgprot_val(prot) &
+			~pgprot_val(arch_vm_get_page_prot_excl(vm_flags)));
+
+	return prot;
 }
 EXPORT_SYMBOL(vm_get_page_prot);
 



* [PATCH v2 09/17] mm: Provide vm_special_mapping::close
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (7 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 08/17] mm/x86: Allow special mappings with user access cleared Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 10/17] selftest/x86: Implement additional LDT selftests Peter Zijlstra
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: mm--Provide-vm_special_mapping--close.patch --]
[-- Type: text/plain, Size: 1377 bytes --]

From: Peter Zijlstra  <peterz@infradead.org>

Userspace can (maliciously) munmap() the VMAs injected into its memory
map through install_special_mapping(). In order to ensure there are no
hardware resources tied to the mapping, we need a close callback.
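
A sketch of a possible user of the new callback (hypothetical names; the
LDT patches later in this series are the intended consumer):

	static void ldt_mapping_close(const struct vm_special_mapping *sm,
				      struct vm_area_struct *vma)
	{
		/* Release any hardware / LDT state tied to this mapping. */
	}

	static const struct vm_special_mapping ldt_mapping = {
		.name	= "[ldt]",
		.close	= ldt_mapping_close,
	};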

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/mm_types.h |    3 +++
 mm/mmap.c                |    4 ++++
 2 files changed, 7 insertions(+)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -644,6 +644,9 @@ struct vm_special_mapping {
 
 	int (*mremap)(const struct vm_special_mapping *sm,
 		     struct vm_area_struct *new_vma);
+
+	void (*close)(const struct vm_special_mapping *sm,
+		      struct vm_area_struct *vma);
 };
 
 enum tlb_flush_reason {
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3206,6 +3206,10 @@ static int special_mapping_fault(struct
  */
 static void special_mapping_close(struct vm_area_struct *vma)
 {
+	struct vm_special_mapping *sm = vma->vm_private_data;
+
+	if (sm->close)
+		sm->close(sm, vma);
 }
 
 static const char *special_mapping_name(struct vm_area_struct *vma)



* [PATCH v2 10/17] selftest/x86: Implement additional LDT selftests
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (8 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 09/17] mm: Provide vm_special_mapping::close Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced Peter Zijlstra
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: selftest-x86-Implement-additional-LDT-selftests.patch --]
[-- Type: text/plain, Size: 3164 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

do_ldt_ss_test() - tests modifying the SS segment while it is in use;
this tends to come apart with RO LDT maps.

do_ldt_unmap_test() - tests the mechanics of unmapping the (future)
LDT VMA. Additional tests would make sense, like unmapping it while in
use (TODO).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/testing/selftests/x86/ldt_gdt.c |   71 +++++++++++++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)

--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -242,6 +242,72 @@ static void fail_install(struct user_des
 	}
 }
 
+static void do_ldt_ss_test(void)
+{
+	unsigned short prev_sel, sel = (2 << 3) | (1 << 2) | 3;
+	struct user_desc *ldt_desc = low_user_desc + 2;
+	int ret;
+
+	ldt_desc->entry_number	= 2;
+	ldt_desc->base_addr	= (unsigned long)&counter_page[1];
+	ldt_desc->limit		= 0xfffff;
+	ldt_desc->seg_32bit	= 1;
+	ldt_desc->contents		= 0; /* Data, grow-up*/
+	ldt_desc->read_exec_only	= 0;
+	ldt_desc->limit_in_pages	= 1;
+	ldt_desc->seg_not_present	= 0;
+	ldt_desc->useable		= 0;
+
+	ret = safe_modify_ldt(1, ldt_desc, sizeof(*ldt_desc));
+	if (ret)
+		perror("ponies");
+
+	/*
+	 * syscall (eax) 123 - modify_ldt / return value
+	 *         (ebx)     - func
+	 *         (ecx)     - ptr
+	 *         (edx)     - bytecount
+	 */
+
+	int eax = 123;
+	int ebx = 1;
+	int ecx = (unsigned int)(unsigned long)ldt_desc;
+	int edx = sizeof(struct user_desc);
+
+	asm volatile ("movw %%ss, %[prev_sel]\n\t"
+		      "movw %[sel], %%ss\n\t"
+		      "int $0x80\n\t"
+		      "movw %[prev_sel], %%ss"
+		      : [prev_sel] "=&R" (prev_sel), "+a" (eax)
+		      : [sel] "R" (sel), "b" (ebx), "c" (ecx), "d" (edx)
+		      : INT80_CLOBBERS);
+
+	printf("[OK]\tSS modify_ldt()\n");
+}
+
+static void do_ldt_unmap_test(void)
+{
+	FILE *file = fopen("/proc/self/maps", "r");
+	char *line = NULL;
+	size_t len = 0;
+	ssize_t nread;
+	unsigned long start, end;
+
+	while ((nread = getline(&line, &len, file)) != -1) {
+		if (strstr(line, "[ldt]")) {
+			if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
+				munmap((void *)start, end-start);
+				printf("[OK]\tmunmap LDT\n");
+				break;
+			}
+		}
+	}
+
+	free(line);
+	fclose(file);
+
+}
+
 static void do_simple_tests(void)
 {
 	struct user_desc desc = {
@@ -696,7 +762,7 @@ static int invoke_set_thread_area(void)
 
 static void setup_low_user_desc(void)
 {
-	low_user_desc = mmap(NULL, 2 * sizeof(struct user_desc),
+	low_user_desc = mmap(NULL, 3 * sizeof(struct user_desc),
 			     PROT_READ | PROT_WRITE,
 			     MAP_ANONYMOUS | MAP_PRIVATE | MAP_32BIT, -1, 0);
 	if (low_user_desc == MAP_FAILED)
@@ -916,6 +982,9 @@ int main(int argc, char **argv)
 	setup_counter_page();
 	setup_low_user_desc();
 
+	do_ldt_ss_test();
+	do_ldt_unmap_test();
+
 	do_simple_tests();
 
 	do_multicpu_tests();



* [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (9 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 10/17] selftest/x86: Implement additional LDT selftests Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 16:20   ` Andy Lutomirski
  2017-12-14 11:27 ` [PATCH v2 12/17] mm: Make populate_vma_page_range() available Peter Zijlstra
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: selftests-x86-ldt_gdt--Prepare-for-access-bit-forced.patch --]
[-- Type: text/plain, Size: 1105 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

In order to make the LDT mapping RO, the accessed bit needs to be forced
by the kernel. Adjust the test case so it handles that gracefully.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/testing/selftests/x86/ldt_gdt.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -122,8 +122,7 @@ static void check_valid_segment(uint16_t
 	 * NB: Different Linux versions do different things with the
 	 * accessed bit in set_thread_area().
 	 */
-	if (ar != expected_ar &&
-	    (ldt || ar != (expected_ar | AR_ACCESSED))) {
+	if (ar != expected_ar && ar != (expected_ar | AR_ACCESSED)) {
 		printf("[FAIL]\t%s entry %hu has AR 0x%08X but expected 0x%08X\n",
 		       (ldt ? "LDT" : "GDT"), index, ar, expected_ar);
 		nerrs++;



* [PATCH v2 12/17] mm: Make populate_vma_page_range() available
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (10 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 13/17] x86/mm: Force LDT desc accessed bit Peter Zijlstra
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: mm--Make-populate_vma_page_range---available.patch --]
[-- Type: text/plain, Size: 1510 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Make populate_vma_page_range() available outside of mm/, so special
mappings can be populated in dup_mmap().
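
A sketch of the intended use (hypothetical; the actual caller is a later
patch in this series):

	/* dup_mmap() sketch: pre-populate a special mapping in the child. */
	populate_vma_page_range(new_vma, new_vma->vm_start,
				new_vma->vm_end, NULL);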

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/mm.h |    2 ++
 mm/internal.h      |    2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2159,6 +2159,8 @@ do_mmap_pgoff(struct file *file, unsigne
 }
 
 #ifdef CONFIG_MMU
+extern long populate_vma_page_range(struct vm_area_struct *vma,
+		unsigned long start, unsigned long end, int *nonblocking);
 extern int __mm_populate(unsigned long addr, unsigned long len,
 			 int ignore_errors);
 static inline void mm_populate(unsigned long addr, unsigned long len)
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -284,8 +284,6 @@ void __vma_link_list(struct mm_struct *m
 		struct vm_area_struct *prev, struct rb_node *rb_parent);
 
 #ifdef CONFIG_MMU
-extern long populate_vma_page_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, int *nonblocking);
 extern void munlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
 static inline void munlock_vma_pages_all(struct vm_area_struct *vma)



* [PATCH v2 13/17] x86/mm: Force LDT desc accessed bit
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (11 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 12/17] mm: Make populate_vma_page_range() available Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 16:21   ` Andy Lutomirski
  2017-12-14 11:27 ` [PATCH v2 14/17] x86/ldt: Reshuffle code Peter Zijlstra
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: peterz-ldt-force-accessed.patch --]
[-- Type: text/plain, Size: 1415 bytes --]

In preparation for mapping the LDT RO, unconditionally set the accessed
bit.
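
Background, as a sketch: on the first load of a selector the CPU sets the
'accessed' bit (bit 0 of the descriptor type field) with a write to the
descriptor table. If that table lives in a read-only mapping, the write
faults; hence pre-set the bit:

	/*
	 * Data segment type field (sketch):
	 *   bit 0: accessed -- set by the CPU on first selector load
	 *   bit 1: writable
	 *   bit 2: expand-down
	 */
	desc->type |= 1;	/* the CPU never needs to touch the RO page */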

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/desc.h |    5 +++++
 arch/x86/kernel/tls.c       |   11 ++---------
 2 files changed, 7 insertions(+), 9 deletions(-)

--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -20,6 +20,11 @@ static inline void fill_ldt(struct desc_
 
 	desc->type		= (info->read_exec_only ^ 1) << 1;
 	desc->type	       |= info->contents << 2;
+	/*
+	 * Always set the accessed bit so that the CPU
+	 * doesn't try to write to the (read-only) GDT/LDT.
+	 */
+	desc->type             |= 1;
 
 	desc->s			= 1;
 	desc->dpl		= 0x3;
--- a/arch/x86/kernel/tls.c
+++ b/arch/x86/kernel/tls.c
@@ -93,17 +93,10 @@ static void set_tls_desc(struct task_str
 	cpu = get_cpu();
 
 	while (n-- > 0) {
-		if (LDT_empty(info) || LDT_zero(info)) {
+		if (LDT_empty(info) || LDT_zero(info))
 			memset(desc, 0, sizeof(*desc));
-		} else {
+		else
 			fill_ldt(desc, info);
-
-			/*
-			 * Always set the accessed bit so that the CPU
-			 * doesn't try to write to the (read-only) GDT.
-			 */
-			desc->type |= 1;
-		}
 		++info;
 		++desc;
 	}



* [PATCH v2 14/17] x86/ldt: Reshuffle code
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (12 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 13/17] x86/mm: Force LDT desc accessed bit Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 16:23   ` Andy Lutomirski
  2017-12-14 11:27 ` [PATCH v2 15/17] x86/ldt: Prepare for VMA mapping Peter Zijlstra
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: x86-ldt--Reshuffle-code.patch --]
[-- Type: text/plain, Size: 5857 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Restructure the code so that the following VMA changes do not create an
unreadable mess. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu_context.h |    4 +
 arch/x86/kernel/ldt.c              |  118 +++++++++++++++++--------------------
 2 files changed, 59 insertions(+), 63 deletions(-)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -39,6 +39,10 @@ static inline void load_mm_cr4(struct mm
 #endif
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
+#include <asm/ldt.h>
+
+#define LDT_ENTRIES_MAP_SIZE	(LDT_ENTRIES * LDT_ENTRY_SIZE)
+
 /*
  * ldt_structs can be allocated, used, and freed, but they are never
  * modified while live.
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -28,6 +28,12 @@
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
 
+/* After calling this, the LDT is immutable. */
+static void finalize_ldt_struct(struct ldt_struct *ldt)
+{
+	paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
+}
+
 static void refresh_ldt_segments(void)
 {
 #ifdef CONFIG_X86_64
@@ -48,18 +54,31 @@ static void refresh_ldt_segments(void)
 }
 
 /* context.lock is held by the task which issued the smp function call */
-static void flush_ldt(void *__mm)
+static void __ldt_install(void *__mm)
 {
 	struct mm_struct *mm = __mm;
-	mm_context_t *pc;
+	struct ldt_struct *ldt = mm->context.ldt;
 
-	if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
-		return;
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
+	    !(current->flags & PF_KTHREAD)) {
+		unsigned int nentries = ldt ? ldt->nr_entries : 0;
+
+		set_ldt(ldt->entries, nentries);
+		refresh_ldt_segments();
+	}
+}
 
-	pc = &mm->context;
-	set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
+static void ldt_install_mm(struct mm_struct *mm, struct ldt_struct *ldt)
+{
+	mutex_lock(&mm->context.lock);
 
-	refresh_ldt_segments();
+	/* Synchronizes with READ_ONCE in load_mm_ldt. */
+	smp_store_release(&mm->context.ldt, ldt);
+
+	/* Activate the LDT for all CPUs using current's mm. */
+	on_each_cpu_mask(mm_cpumask(mm), __ldt_install, mm, true);
+
+	mutex_unlock(&mm->context.lock);
 }
 
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
@@ -98,25 +118,6 @@ static struct ldt_struct *alloc_ldt_stru
 	return new_ldt;
 }
 
-/* After calling this, the LDT is immutable. */
-static void finalize_ldt_struct(struct ldt_struct *ldt)
-{
-	paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
-}
-
-static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt)
-{
-	mutex_lock(&mm->context.lock);
-
-	/* Synchronizes with READ_ONCE in load_mm_ldt. */
-	smp_store_release(&mm->context.ldt, ldt);
-
-	/* Activate the LDT for all CPUs using current's mm. */
-	on_each_cpu_mask(mm_cpumask(mm), flush_ldt, mm, true);
-
-	mutex_unlock(&mm->context.lock);
-}
-
 static void free_ldt_struct(struct ldt_struct *ldt)
 {
 	if (likely(!ldt))
@@ -131,6 +132,18 @@ static void free_ldt_struct(struct ldt_s
 }
 
 /*
+ * This can run unlocked because the mm is no longer in use. No need to
+ * clear LDT on the CPU either because that's called from __mm_drop() and
+ * the task which owned the mm is already dead. The context switch code has
+ * either cleared LDT or installed a new one.
+ */
+void destroy_context_ldt(struct mm_struct *mm)
+{
+	free_ldt_struct(mm->context.ldt);
+	mm->context.ldt = NULL;
+}
+
+/*
  * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
  * the new task is not running, so nothing can be installed.
  */
@@ -163,54 +176,33 @@ int ldt_dup_context(struct mm_struct *ol
 	return retval;
 }
 
-/*
- * No need to lock the MM as we are the last user
- *
- * 64bit: Don't touch the LDT register - we're already in the next thread.
- */
-void destroy_context_ldt(struct mm_struct *mm)
-{
-	free_ldt_struct(mm->context.ldt);
-	mm->context.ldt = NULL;
-}
-
-static int read_ldt(void __user *ptr, unsigned long bytecount)
+static int read_ldt(void __user *ptr, unsigned long nbytes)
 {
 	struct mm_struct *mm = current->mm;
-	unsigned long entries_size;
-	int retval;
+	struct ldt_struct *ldt;
+	unsigned long tocopy;
+	int ret = 0;
 
 	down_read(&mm->context.ldt_usr_sem);
 
-	if (!mm->context.ldt) {
-		retval = 0;
+	ldt = mm->context.ldt;
+	if (!ldt)
 		goto out_unlock;
-	}
 
-	if (bytecount > LDT_ENTRY_SIZE * LDT_ENTRIES)
-		bytecount = LDT_ENTRY_SIZE * LDT_ENTRIES;
+	if (nbytes > LDT_ENTRIES_MAP_SIZE)
+		nbytes = LDT_ENTRIES_MAP_SIZE;
 
-	entries_size = mm->context.ldt->nr_entries * LDT_ENTRY_SIZE;
-	if (entries_size > bytecount)
-		entries_size = bytecount;
-
-	if (copy_to_user(ptr, mm->context.ldt->entries, entries_size)) {
-		retval = -EFAULT;
+	ret = -EFAULT;
+	tocopy = min((unsigned long)ldt->nr_entries * LDT_ENTRY_SIZE, nbytes);
+	if (tocopy < nbytes && clear_user(ptr + tocopy, nbytes - tocopy))
 		goto out_unlock;
-	}
-
-	if (entries_size != bytecount) {
-		/* Zero-fill the rest and pretend we read bytecount bytes. */
-		if (clear_user(ptr + entries_size, bytecount - entries_size)) {
-			retval = -EFAULT;
-			goto out_unlock;
-		}
-	}
-	retval = bytecount;
 
+	if (copy_to_user(ptr, ldt->entries, tocopy))
+		goto out_unlock;
+	ret = nbytes;
 out_unlock:
 	up_read(&mm->context.ldt_usr_sem);
-	return retval;
+	return ret;
 }
 
 static int read_default_ldt(void __user *ptr, unsigned long bytecount)
@@ -287,7 +279,7 @@ static int write_ldt(void __user *ptr, u
 	new_ldt->entries[ldt_info.entry_number] = ldt;
 	finalize_ldt_struct(new_ldt);
 
-	install_ldt(mm, new_ldt);
+	ldt_install_mm(mm, new_ldt);
 	free_ldt_struct(old_ldt);
 	error = 0;
 



* [PATCH v2 15/17] x86/ldt: Prepare for VMA mapping
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (13 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 14/17] x86/ldt: Reshuffle code Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 16/17] x86/ldt: Add VMA management code Peter Zijlstra
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: x86-ldt--Prepare-for-VMA-mapping.patch --]
[-- Type: text/plain, Size: 5230 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Implement the infrastructure to manage LDT information with backing
pages. Preparatory patch for VMA based LDT mapping. Split out for ease of
review.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu.h         |    3 +
 arch/x86/include/asm/mmu_context.h |    9 ++-
 arch/x86/kernel/ldt.c              |  107 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 116 insertions(+), 3 deletions(-)

--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -7,6 +7,8 @@
 #include <linux/mutex.h>
 #include <linux/atomic.h>
 
+struct ldt_mapping;
+
 /*
  * x86 has arch-specific MMU state beyond what lives in mm_struct.
  */
@@ -29,6 +31,7 @@ typedef struct {
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct rw_semaphore	ldt_usr_sem;
+	struct ldt_mapping	*ldt_mapping;
 	struct ldt_struct	*ldt;
 #endif
 
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -42,6 +42,8 @@ static inline void load_mm_cr4(struct mm
 #include <asm/ldt.h>
 
 #define LDT_ENTRIES_MAP_SIZE	(LDT_ENTRIES * LDT_ENTRY_SIZE)
+#define LDT_ENTRIES_PAGES	(LDT_ENTRIES_MAP_SIZE / PAGE_SIZE)
+#define LDT_ENTRIES_PER_PAGE	(PAGE_SIZE / LDT_ENTRY_SIZE)
 
 /*
  * ldt_structs can be allocated, used, and freed, but they are never
@@ -54,8 +56,10 @@ struct ldt_struct {
 	 * call gates.  On native, we could merge the ldt_struct and LDT
 	 * allocations, but it's not worth trying to optimize.
 	 */
-	struct desc_struct *entries;
-	unsigned int nr_entries;
+	struct desc_struct	*entries;
+	struct page		*pages[LDT_ENTRIES_PAGES];
+	unsigned int		nr_entries;
+	unsigned int		pages_allocated;
 };
 
 /*
@@ -64,6 +68,7 @@ struct ldt_struct {
 static inline void init_new_context_ldt(struct mm_struct *mm)
 {
 	mm->context.ldt = NULL;
+	mm->context.ldt_mapping = NULL;
 	init_rwsem(&mm->context.ldt_usr_sem);
 }
 int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -28,6 +28,11 @@
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
 
+struct ldt_mapping {
+	struct ldt_struct		ldts[2];
+	unsigned int			ldt_index;
+};
+
 /* After calling this, the LDT is immutable. */
 static void finalize_ldt_struct(struct ldt_struct *ldt)
 {
@@ -82,6 +87,97 @@ static void ldt_install_mm(struct mm_str
 	mutex_unlock(&mm->context.lock);
 }
 
+static void ldt_free_pages(struct ldt_struct *ldt)
+{
+	int i;
+
+	for (i = 0; i < ldt->pages_allocated; i++)
+		__free_page(ldt->pages[i]);
+}
+
+static void ldt_free_lmap(struct ldt_mapping *lmap)
+{
+	if (!lmap)
+		return;
+	ldt_free_pages(&lmap->ldts[0]);
+	ldt_free_pages(&lmap->ldts[1]);
+	kfree(lmap);
+}
+
+static int ldt_alloc_pages(struct ldt_struct *ldt, unsigned int nentries)
+{
+	unsigned int npages, idx;
+
+	npages = DIV_ROUND_UP(nentries * LDT_ENTRY_SIZE, PAGE_SIZE);
+
+	for (idx = ldt->pages_allocated; idx < npages; idx++) {
+		if (WARN_ON_ONCE(ldt->pages[idx]))
+			continue;
+
+		ldt->pages[idx] = alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!ldt->pages[idx])
+			return -ENOMEM;
+
+		ldt->pages_allocated++;
+	}
+	return 0;
+}
+
+static struct ldt_mapping *ldt_alloc_lmap(struct mm_struct *mm,
+					  unsigned int nentries)
+{
+	struct ldt_mapping *lmap = kzalloc(sizeof(*lmap), GFP_KERNEL);
+
+	if (!lmap)
+		return ERR_PTR(-ENOMEM);
+
+	if (ldt_alloc_pages(&lmap->ldts[0], nentries)) {
+		ldt_free_lmap(lmap);
+		return ERR_PTR(-ENOMEM);
+	}
+	return lmap;
+}
+
+static void ldt_set_entry(struct ldt_struct *ldt, struct desc_struct *ldtdesc,
+			  unsigned int offs)
+{
+	unsigned int dstidx;
+
+	offs *= LDT_ENTRY_SIZE;
+	dstidx = offs / PAGE_SIZE;
+	offs %= PAGE_SIZE;
+	memcpy(page_address(ldt->pages[dstidx]) + offs, ldtdesc,
+	       sizeof(*ldtdesc));
+}
+
+static void ldt_clone_entries(struct ldt_struct *dst, struct ldt_struct *src,
+			      unsigned int nent)
+{
+	unsigned long tocopy;
+	unsigned int i;
+
+	for (i = 0, tocopy = nent * LDT_ENTRY_SIZE; tocopy; i++) {
+		unsigned long n = min(PAGE_SIZE, tocopy);
+
+		memcpy(page_address(dst->pages[i]),
+		       page_address(src->pages[i]), n);
+		tocopy -= n;
+	}
+}
+
+static void cleanup_ldt_struct(struct ldt_struct *ldt)
+{
+	static struct desc_struct zero_desc;
+	unsigned int i;
+
+	if (!ldt)
+		return;
+	paravirt_free_ldt(ldt->entries, ldt->nr_entries);
+	for (i = 0; i < ldt->nr_entries; i++)
+		ldt_set_entry(ldt, &zero_desc, i);
+	ldt->nr_entries = 0;
+}
+
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
 static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 {
@@ -139,8 +235,17 @@ static void free_ldt_struct(struct ldt_s
  */
 void destroy_context_ldt(struct mm_struct *mm)
 {
-	free_ldt_struct(mm->context.ldt);
+	struct ldt_mapping *lmap = mm->context.ldt_mapping;
+	struct ldt_struct *ldt = mm->context.ldt;
+
+	free_ldt_struct(ldt);
 	mm->context.ldt = NULL;
+
+	if (!lmap)
+		return;
+
+	mm->context.ldt_mapping = NULL;
+	ldt_free_lmap(lmap);
 }
 
 /*


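For concreteness, with the x86 values (LDT_ENTRIES = 8192, LDT_ENTRY_SIZE
= 8) and 4K pages this gives LDT_ENTRIES_MAP_SIZE = 64KiB,
LDT_ENTRIES_PAGES = 16 and LDT_ENTRIES_PER_PAGE = 512. The page indexed
addressing done by ldt_set_entry() then reduces to the following; a
restatement for clarity, not part of the patch:

/* Entry n lives n * LDT_ENTRY_SIZE bytes into the page array. */
static void *ldt_entry_ptr(struct ldt_struct *ldt, unsigned int n)
{
	unsigned int byte_off = n * LDT_ENTRY_SIZE;	/* 0 .. 65528 */

	return page_address(ldt->pages[byte_off / PAGE_SIZE]) +
	       (byte_off % PAGE_SIZE);
}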

* [PATCH v2 16/17] x86/ldt: Add VMA management code
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (14 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 15/17] x86/ldt: Prepare for VMA mapping Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 11:27 ` [PATCH v2 17/17] x86/ldt: Make it read only VMA mapped Peter Zijlstra
  2017-12-14 12:03 ` [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Thomas Gleixner
  17 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: x86-ldt--Add-VMA-management-code.patch --]
[-- Type: text/plain, Size: 3885 bytes --]

Add the VMA management code to LDT, which allows the LDT to be installed
as a special mapping, like the VDSO and uprobes. The mapping is in the
user address space, but read only and without the user bit set. Split out
for ease of
review.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/ldt.c |  103 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -31,6 +31,7 @@
 struct ldt_mapping {
 	struct ldt_struct		ldts[2];
 	unsigned int			ldt_index;
+	unsigned int			ldt_mapped;
 };
 
 /* After calling this, the LDT is immutable. */
@@ -177,6 +178,105 @@ static void cleanup_ldt_struct(struct ld
 	ldt->nr_entries = 0;
 }
 
+static int ldt_fault(const struct vm_special_mapping *sm,
+		     struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct ldt_mapping *lmap = vma->vm_mm->context.ldt_mapping;
+	struct ldt_struct *ldt = lmap->ldts;
+	pgoff_t pgo = vmf->pgoff;
+	struct page *page;
+
+	if (pgo >= LDT_ENTRIES_PAGES) {
+		pgo -= LDT_ENTRIES_PAGES;
+		ldt++;
+	}
+	if (pgo >= LDT_ENTRIES_PAGES)
+		return VM_FAULT_SIGBUS;
+
+	page = ldt->pages[pgo];
+	if (!page)
+		return VM_FAULT_SIGBUS;
+	get_page(page);
+	vmf->page = page;
+	return 0;
+}
+
+static int ldt_mremap(const struct vm_special_mapping *sm,
+		      struct vm_area_struct *new_vma)
+{
+	return -EINVAL;
+}
+
+static void ldt_close(const struct vm_special_mapping *sm,
+		      struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct ldt_struct *ldt;
+
+	/*
+	 * Orders against ldt_install().
+	 */
+	mutex_lock(&mm->context.lock);
+	ldt = mm->context.ldt;
+	ldt_install_mm(mm, NULL);
+	cleanup_ldt_struct(ldt);
+	mm->context.ldt_mapping->ldt_mapped = 0;
+	mutex_unlock(&mm->context.lock);
+}
+
+static const struct vm_special_mapping ldt_special_mapping = {
+	.name	= "[ldt]",
+	.fault	= ldt_fault,
+	.mremap	= ldt_mremap,
+	.close	= ldt_close,
+};
+
+static struct vm_area_struct *ldt_alloc_vma(struct mm_struct *mm,
+					    struct ldt_mapping *lmap)
+{
+	unsigned long vm_flags, size;
+	struct vm_area_struct *vma;
+	unsigned long addr;
+
+	size = 2 * LDT_ENTRIES_MAP_SIZE;
+	addr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, size, 0, 0);
+	if (IS_ERR_VALUE(addr))
+		return ERR_PTR(addr);
+
+	vm_flags = VM_READ | VM_WIPEONFORK | VM_NOUSER | VM_SHARED;
+	vma = _install_special_mapping(mm, addr, size, vm_flags,
+				       &ldt_special_mapping);
+	if (IS_ERR(vma))
+		return vma;
+
+	lmap->ldts[0].entries = (struct desc_struct *) addr;
+	addr += LDT_ENTRIES_MAP_SIZE;
+	lmap->ldts[1].entries = (struct desc_struct *) addr;
+	return vma;
+}
+
+static int ldt_mmap(struct mm_struct *mm, struct ldt_mapping *lmap)
+{
+	struct vm_area_struct *vma;
+	int ret = 0;
+
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+	vma = ldt_alloc_vma(mm, lmap);
+	if (IS_ERR(vma)) {
+		ret = PTR_ERR(vma);
+	} else {
+		/*
+	 * The moment mmap_sem is released, munmap() can observe
+	 * the mapping and make it go away through ldt_close(). But
+	 * for now there is a mapping.
+		 */
+		lmap->ldt_mapped = 1;
+	}
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
 static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 {
@@ -289,7 +389,8 @@ static int read_ldt(void __user *ptr, un
 
 	down_read(&mm->context.ldt_usr_sem);
 
-	ldt = mm->context.ldt;
+	/* Might race against vm_unmap, which installs a NULL LDT */
+	ldt = READ_ONCE(mm->context.ldt);
 	if (!ldt)
 		goto out_unlock;
 


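The layout ldt_fault() above relies on is a single VMA covering
2 * LDT_ENTRIES_MAP_SIZE: the pages of ldts[0] back the first
LDT_ENTRIES_PAGES page offsets, the pages of ldts[1] the second half.
Restated as a hypothetical helper (the name is illustrative, not from
the series):

static struct page *ldt_page_for(struct ldt_mapping *lmap, pgoff_t pgoff)
{
	unsigned int slot = 0;

	if (pgoff >= LDT_ENTRIES_PAGES) {	/* second half: ldts[1] */
		pgoff -= LDT_ENTRIES_PAGES;
		slot = 1;
	}
	if (pgoff >= LDT_ENTRIES_PAGES)		/* beyond both slots */
		return NULL;			/* -> VM_FAULT_SIGBUS */

	return lmap->ldts[slot].pages[pgoff];	/* NULL if not allocated */
}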

* [PATCH v2 17/17] x86/ldt: Make it read only VMA mapped
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (15 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 16/17] x86/ldt: Add VMA management code Peter Zijlstra
@ 2017-12-14 11:27 ` Peter Zijlstra
  2017-12-14 12:03 ` [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Thomas Gleixner
  17 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 11:27 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

[-- Attachment #1: x86-ldt--Make-it-VMA-mapped.patch --]
[-- Type: text/plain, Size: 10399 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Replace the existing LDT allocation and installation code by the new VMA
based mapping code. The mapping is exposed read only to user space so it is
accessible when the CPU executes in ring 3. The backing pages are not
accessed through a linear kernel VA range, which avoids an extra alias
mapping and the allocation of higher order pages.

The special write fault handler and the touch mechanism on exit to user
space make sure that the expectations of the CPU are met.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -67,24 +67,48 @@ static void __ldt_install(void *__mm)
 
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
 	    !(current->flags & PF_KTHREAD)) {
-		unsigned int nentries = ldt ? ldt->nr_entries : 0;
-
-		set_ldt(ldt->entries, nentries);
-		refresh_ldt_segments();
+		if (ldt) {
+			set_ldt(ldt->entries, ldt->nr_entries);
+			refresh_ldt_segments();
+		} else {
+			set_ldt(NULL, 0);
+		}
 	}
 }
 
 static void ldt_install_mm(struct mm_struct *mm, struct ldt_struct *ldt)
 {
-	mutex_lock(&mm->context.lock);
+	lockdep_assert_held(&mm->context.lock);
 
 	/* Synchronizes with READ_ONCE in load_mm_ldt. */
 	smp_store_release(&mm->context.ldt, ldt);
 
 	/* Activate the LDT for all CPUs using current's mm. */
 	on_each_cpu_mask(mm_cpumask(mm), __ldt_install, mm, true);
+}
 
-	mutex_unlock(&mm->context.lock);
+static int ldt_populate(struct ldt_struct *ldt)
+{
+	unsigned long len, start = (unsigned long)ldt->entries;
+
+	len = round_up(ldt->nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
+	return __mm_populate(start, len, 0);
+}
+
+/* Install the new LDT after populating the user space mapping. */
+static int ldt_install(struct mm_struct *mm, struct ldt_struct *ldt)
+{
+	int ret = ldt ? ldt_populate(ldt) : 0;
+
+	if (!ret) {
+		mutex_lock(&mm->context.lock);
+		if (mm->context.ldt_mapping->ldt_mapped)
+			ldt_install_mm(mm, ldt);
+		else
+			ret = -EINVAL;
+		mutex_unlock(&mm->context.lock);
+	}
+	return ret;
 }
 
 static void ldt_free_pages(struct ldt_struct *ldt)
@@ -277,53 +301,68 @@ static int ldt_mmap(struct mm_struct *mm
 	return ret;
 }
 
-/* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
-static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
+/*
+ * Called on fork from arch_dup_mmap(). Just copy the current LDT state;
+ * the new task is not running, so nothing can be installed.
+ */
+int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
 {
-	struct ldt_struct *new_ldt;
-	unsigned int alloc_size;
+	struct ldt_mapping *old_lmap, *lmap;
+	struct vm_area_struct *vma;
+	struct ldt_struct *old_ldt;
+	unsigned long addr, len;
+	int nentries, ret = 0;
 
-	if (num_entries > LDT_ENTRIES)
-		return NULL;
+	if (!old_mm)
+		return 0;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
-	if (!new_ldt)
-		return NULL;
+	mutex_lock(&old_mm->context.lock);
+	old_lmap = old_mm->context.ldt_mapping;
+	if (!old_lmap || !old_mm->context.ldt)
+		goto out_unlock;
+
+	old_ldt = old_mm->context.ldt;
+	nentries = old_ldt->nr_entries;
+	if (!nentries)
+		goto out_unlock;
+	lmap = ldt_alloc_lmap(mm, nentries);
+	if (IS_ERR(lmap)) {
+		ret = PTR_ERR(lmap);
+		goto out_unlock;
+	}
 
-	BUILD_BUG_ON(LDT_ENTRY_SIZE != sizeof(struct desc_struct));
-	alloc_size = num_entries * LDT_ENTRY_SIZE;
+	addr = (unsigned long)old_mm->context.ldt_mapping->ldts[0].entries;
+	vma = find_vma(mm, addr);
+	if (!vma)
+		goto out_lmap;
 
+	mm->context.ldt_mapping = lmap;
 	/*
-	 * Xen is very picky: it requires a page-aligned LDT that has no
-	 * trailing nonzero bytes in any page that contains LDT descriptors.
-	 * Keep it simple: zero the whole allocation and never allocate less
-	 * than PAGE_SIZE.
+	 * Copy the current settings over. Save the number of entries and
+	 * the data.
 	 */
-	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
-	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+	lmap->ldts[0].entries = (struct desc_struct *)addr;
+	lmap->ldts[1].entries = (struct desc_struct *)(addr + LDT_ENTRIES_MAP_SIZE);
 
-	if (!new_ldt->entries) {
-		kfree(new_ldt);
-		return NULL;
-	}
+	lmap->ldts[0].nr_entries = nentries;
+	ldt_clone_entries(&lmap->ldts[0], old_ldt, nentries);
 
-	new_ldt->nr_entries = num_entries;
-	return new_ldt;
-}
+	len = ALIGN(nentries * LDT_ENTRY_SIZE, PAGE_SIZE);
+	ret = populate_vma_page_range(vma, addr, addr + len, NULL);
+	if (ret != len / PAGE_SIZE)
+		goto out_lmap;
+	finalize_ldt_struct(&lmap->ldts[0]);
+	mm->context.ldt = &lmap->ldts[0];
+	ret = 0;
 
-static void free_ldt_struct(struct ldt_struct *ldt)
-{
-	if (likely(!ldt))
-		return;
-
-	paravirt_free_ldt(ldt->entries, ldt->nr_entries);
-	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
-		vfree_atomic(ldt->entries);
-	else
-		free_page((unsigned long)ldt->entries);
-	kfree(ldt);
+out_unlock:
+	mutex_unlock(&old_mm->context.lock);
+	return ret;
+out_lmap:
+	mm->context.ldt_mapping = NULL;
+	mutex_unlock(&old_mm->context.lock);
+	ldt_free_lmap(lmap);
+	return -ENOMEM;
 }
 
 /*
@@ -337,55 +376,21 @@ void destroy_context_ldt(struct mm_struc
 	struct ldt_mapping *lmap = mm->context.ldt_mapping;
 	struct ldt_struct *ldt = mm->context.ldt;
 
-	free_ldt_struct(ldt);
-	mm->context.ldt = NULL;
-
 	if (!lmap)
 		return;
-
+	if (ldt)
+		paravirt_free_ldt(ldt->entries, ldt->nr_entries);
+	mm->context.ldt = NULL;
 	mm->context.ldt_mapping = NULL;
 	ldt_free_lmap(lmap);
 }
 
-/*
- * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
- * the new task is not running, so nothing can be installed.
- */
-int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
-{
-	struct ldt_struct *new_ldt;
-	int retval = 0;
-
-	if (!old_mm)
-		return 0;
-
-	mutex_lock(&old_mm->context.lock);
-	if (!old_mm->context.ldt)
-		goto out_unlock;
-
-	new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
-	if (!new_ldt) {
-		retval = -ENOMEM;
-		goto out_unlock;
-	}
-
-	memcpy(new_ldt->entries, old_mm->context.ldt->entries,
-	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
-	finalize_ldt_struct(new_ldt);
-
-	mm->context.ldt = new_ldt;
-
-out_unlock:
-	mutex_unlock(&old_mm->context.lock);
-	return retval;
-}
-
 static int read_ldt(void __user *ptr, unsigned long nbytes)
 {
 	struct mm_struct *mm = current->mm;
 	struct ldt_struct *ldt;
 	unsigned long tocopy;
-	int ret = 0;
+	int i, ret = 0;
 
 	down_read(&mm->context.ldt_usr_sem);
 
@@ -402,8 +407,14 @@ static int read_ldt(void __user *ptr, un
 	if (tocopy < nbytes && clear_user(ptr + tocopy, nbytes - tocopy))
 		goto out_unlock;
 
-	if (copy_to_user(ptr, ldt->entries, tocopy))
-		goto out_unlock;
+	for (i = 0; tocopy; i++) {
+		unsigned long n = min(PAGE_SIZE, tocopy);
+
+		if (copy_to_user(ptr, page_address(ldt->pages[i]), n))
+			goto out_unlock;
+		tocopy -= n;
+		ptr += n;
+	}
 	ret = nbytes;
 out_unlock:
 	up_read(&mm->context.ldt_usr_sem);
@@ -427,12 +438,13 @@ static int read_default_ldt(void __user
 
 static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 {
-	struct mm_struct *mm = current->mm;
 	struct ldt_struct *new_ldt, *old_ldt;
-	unsigned int old_nr_entries, new_nr_entries;
+	unsigned int nold, nentries, ldtidx;
+	struct mm_struct *mm = current->mm;
 	struct user_desc ldt_info;
-	struct desc_struct ldt;
-	int error;
+	struct ldt_mapping *lmap;
+	struct desc_struct entry;
+	int error, mapped;
 
 	error = -EINVAL;
 	if (bytecount != sizeof(ldt_info))
@@ -454,39 +466,68 @@ static int write_ldt(void __user *ptr, u
 	if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) ||
 	    LDT_empty(&ldt_info)) {
 		/* The user wants to clear the entry. */
-		memset(&ldt, 0, sizeof(ldt));
+		memset(&entry, 0, sizeof(entry));
 	} else {
-		if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
-			error = -EINVAL;
+		if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit)
 			goto out;
-		}
-
-		fill_ldt(&ldt, &ldt_info);
+		fill_ldt(&entry, &ldt_info);
 		if (oldmode)
-			ldt.avl = 0;
+			entry.avl = 0;
 	}
 
 	if (down_write_killable(&mm->context.ldt_usr_sem))
 		return -EINTR;
 
-	old_ldt       = mm->context.ldt;
-	old_nr_entries = old_ldt ? old_ldt->nr_entries : 0;
-	new_nr_entries = max(ldt_info.entry_number + 1, old_nr_entries);
-
-	error = -ENOMEM;
-	new_ldt = alloc_ldt_struct(new_nr_entries);
-	if (!new_ldt)
+	lmap = mm->context.ldt_mapping;
+	old_ldt = mm->context.ldt;
+	ldtidx = lmap ? lmap->ldt_index ^ 1 : 0;
+
+	if (!lmap) {
+		/* First invocation, install it. */
+		nentries = ldt_info.entry_number + 1;
+		lmap = ldt_alloc_lmap(mm, nentries);
+		if (IS_ERR(lmap)) {
+			error = PTR_ERR(lmap);
+			goto out_unlock;
+		}
+		mm->context.ldt_mapping = lmap;
+	}
+
+	/*
+	 * ldt_close() can clear lmap->ldt_mapped under context.lock, so
+	 * lmap->ldt_mapped needs to be read under that lock as well.
+	 *
+	 * If !mapped, try to establish the mapping; this code is fully
+	 * serialized under ldt_usr_sem. If the VMA vanishes after dropping
+	 * the lock, then ldt_install() will fail later on.
+	 */
+	mutex_lock(&mm->context.lock);
+	mapped = lmap->ldt_mapped;
+	mutex_unlock(&mm->context.lock);
+	if (!mapped) {
+		error = ldt_mmap(mm, lmap);
+		if (error)
+			goto out_unlock;
+	}
+
+	nold = old_ldt ? old_ldt->nr_entries : 0;
+	nentries = max(ldt_info.entry_number + 1, nold);
+	/* Select the new ldt and allocate pages if necessary */
+	new_ldt = &lmap->ldts[ldtidx];
+	error = ldt_alloc_pages(new_ldt, nentries);
+	if (error)
 		goto out_unlock;
 
-	if (old_ldt)
-		memcpy(new_ldt->entries, old_ldt->entries, old_nr_entries * LDT_ENTRY_SIZE);
+	if (nold)
+		ldt_clone_entries(new_ldt, old_ldt, nold);
 
-	new_ldt->entries[ldt_info.entry_number] = ldt;
+	ldt_set_entry(new_ldt, &entry, ldt_info.entry_number);
+	new_ldt->nr_entries = nentries;
+	lmap->ldt_index = ldtidx;
 	finalize_ldt_struct(new_ldt);
-
-	ldt_install_mm(mm, new_ldt);
-	free_ldt_struct(old_ldt);
-	error = 0;
+	/* Install the new LDT. Might fail due to vm_unmap() or ENOMEM */
+	error = ldt_install(mm, new_ldt);
+	cleanup_ldt_struct(error ? new_ldt : old_ldt);
 
 out_unlock:
 	up_write(&mm->context.ldt_usr_sem);


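With this in place the first modify_ldt() write is what creates and
populates the [ldt] VMA. A minimal user space sketch, a new-mode write
of a flat 32bit data segment into entry 0 (the descriptor values are an
arbitrary example):

#include <asm/ldt.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct user_desc d;

	memset(&d, 0, sizeof(d));
	d.entry_number   = 0;
	d.limit          = 0xfffff;
	d.seg_32bit      = 1;
	d.limit_in_pages = 1;
	d.useable        = 1;

	/* func 0x11 is write_ldt() in new mode; func 1 (old mode) would
	 * additionally clear the avl bit */
	return syscall(SYS_modify_ldt, 0x11, &d, sizeof(d)) != 0;
}

After a successful call, /proc/self/maps should show the read only [ldt]
mapping installed by ldt_mmap().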

* Re: [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping
  2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
                   ` (16 preceding siblings ...)
  2017-12-14 11:27 ` [PATCH v2 17/17] x86/ldt: Make it read only VMA mapped Peter Zijlstra
@ 2017-12-14 12:03 ` Thomas Gleixner
  2017-12-14 12:08   ` Peter Zijlstra
  17 siblings, 1 reply; 76+ messages in thread
From: Thomas Gleixner @ 2017-12-14 12:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, x86, Linus Torvalds, Andy Lutomirsky, Dave Hansen,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Thu, 14 Dec 2017, Peter Zijlstra wrote:
> So here's a second posting of the VMA based LDT implementation; now without
> most of the crazy.
> 
> I took out the write fault handler and the magic LAR touching code.
> 
> Additionally there are a bunch of patches that address generic vm issue.
> 
>  - gup() access control; In specific I looked at accessing !_PAGE_USER pages
>    because these patches rely on not being able to do that.
> 
>  - special mappings; A whole bunch of mmap ops don't make sense on special
>    mappings so disallow them.
> 
> Both things make sense independent of the rest of the series. Similarly, the
> patches that kill that ridiculous LDT inherit on exec() are also unquestionably
> good.
> 
> So I think at least the first 6 patches are good, irrespective of the
> VMA approach.
> 
> On the whole VMA approach, Andy I know you hate it with a passion, but I really
> rather like how it ties the LDT to the process that it belongs to and it
> reduces the amount of 'special' pages in the whole PTI mapping.
> 
> I'm not the one going to make the decision on this; but I figured I at least
> post a version without the obvious crap parts of the last one.
> 
> Note: if we were to also disallow munmap() for special mappings (which I
> suppose makes perfect sense) then we could further reduce the actual LDT
> code (we'd no longer need the sm::close callback and related things).

That makes a lot of sense for the other special mapping users like VDSO and
uprobes.

Thanks,

	tglx


* Re: [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping
  2017-12-14 12:03 ` [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Thomas Gleixner
@ 2017-12-14 12:08   ` Peter Zijlstra
  2017-12-14 16:35     ` Andy Lutomirski
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 12:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, Linus Torvalds, Andy Lutomirsky, Dave Hansen,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Thu, Dec 14, 2017 at 01:03:37PM +0100, Thomas Gleixner wrote:
> On Thu, 14 Dec 2017, Peter Zijlstra wrote:
> > So here's a second posting of the VMA based LDT implementation; now without
> > most of the crazy.
> > 
> > I took out the write fault handler and the magic LAR touching code.
> > 
> > Additionally there are a bunch of patches that address generic vm issue.
> > 
> >  - gup() access control; In specific I looked at accessing !_PAGE_USER pages
> >    because these patches rely on not being able to do that.
> > 
> >  - special mappings; A whole bunch of mmap ops don't make sense on special
> >    mappings so disallow them.
> > 
> > Both things make sense independent of the rest of the series. Similarly, the
> > patches that kill that ridiculous LDT inherit on exec() are also unquestionably
> > good.
> > 
> > So I think at least the first 6 patches are good, irrespective of the
> > VMA approach.
> > 
> > On the whole VMA approach, Andy I know you hate it with a passion, but I really
> > rather like how it ties the LDT to the process that it belongs to and it
> > reduces the amount of 'special' pages in the whole PTI mapping.
> > 
> > I'm not the one going to make the decision on this; but I figured I at least
> > post a version without the obvious crap parts of the last one.
> > 
> > Note: if we were to also disallow munmap() for special mappings (which I
> > suppose makes perfect sense) then we could further reduce the actual LDT
> > code (we'd no longer need the sm::close callback and related things).
> 
> That makes a lot of sense for the other special mapping users like VDSO and
> kprobes.

Right, and while looking at that I also figured it might make sense to
unconditionally disallow splitting special mappings.


--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2698,6 +2698,9 @@ int do_munmap(struct mm_struct *mm, unsi
 	}
 	vma = prev ? prev->vm_next : mm->mmap;
 
+	if (vma_is_special_mapping(vma))
+		return -EINVAL;
+
 	if (unlikely(uf)) {
 		/*
 		 * If userfaultfd_unmap_prep returns an error the vmas
@@ -3223,10 +3226,11 @@ static int special_mapping_fault(struct
  */
 static void special_mapping_close(struct vm_area_struct *vma)
 {
-	struct vm_special_mapping *sm = vma->vm_private_data;
+}
 
-	if (sm->close)
-		sm->close(sm, vma);
+static int special_mapping_split(struct vm_area_struct *vma, unsigned long addr)
+{
+	return -EINVAL;
 }
 
 static const char *special_mapping_name(struct vm_area_struct *vma)
@@ -3252,6 +3256,7 @@ static const struct vm_operations_struct
 	.fault = special_mapping_fault,
 	.mremap = special_mapping_mremap,
 	.name = special_mapping_name,
+	.split = special_mapping_split,
 };
 
 static const struct vm_operations_struct legacy_special_mapping_vmops = {

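Note that mainline's vma_is_special_mapping() takes the vm_special_mapping
as a second argument, so the one argument form used above presumably
relies on a predicate introduced earlier in the series. A plausible
sketch of such a helper (hypothetical, keyed off the shared vmops):

static bool vma_is_special_mapping(const struct vm_area_struct *vma)
{
	return vma->vm_ops == &special_mapping_vmops ||
	       vma->vm_ops == &legacy_special_mapping_vmops;
}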

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-14 11:27 ` [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted() Peter Zijlstra
@ 2017-12-14 12:41   ` Peter Zijlstra
  2017-12-14 14:37     ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 12:41 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Dave Hansen,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Thu, Dec 14, 2017 at 12:27:27PM +0100, Peter Zijlstra wrote:
> The gup_*_range() functions which implement __get_user_pages_fast() do
> a p*_access_permitted() test to see if the memory is at all accessible
> (tests both _PAGE_USER|_PAGE_RW as well as architectural things like
> pkeys).
> 
> But the follow_*() functions which implement __get_user_pages() do not
> have this test. Recently, commit:
> 
>   5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup paths")
> 
> added it to a few specific write paths, but it failed to consistently
> apply it (I've not audited anything outside of gup).
> 
> Revert the change from that patch and insert the tests in the right
> locations such that they cover all READ / WRITE accesses for all
> pte/pmd/pud levels.
> 
> In particular I care about the _PAGE_USER test; we should not ever
> allow access to pages not marked with it, but it also makes the pkey
> accesses more consistent.

This should probably go on top. These are now all superfluous and
slightly wrong.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f2f5e774902..1797368cc83a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -870,9 +870,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 */
 	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
 
-	if (!pmd_access_permitted(*pmd, flags & FOLL_WRITE))
-		return NULL;
-
 	if (pmd_present(*pmd) && pmd_devmap(*pmd))
 		/* pass */;
 	else
@@ -1012,9 +1009,6 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 
 	assert_spin_locked(pud_lockptr(mm, pud));
 
-	if (!pud_access_permitted(*pud, flags & FOLL_WRITE))
-		return NULL;
-
 	if (pud_present(*pud) && pud_devmap(*pud))
 		/* pass */;
 	else
@@ -1386,7 +1380,7 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_access_permitted(pmd, WRITE) ||
+	return pmd_write(pmd) ||
 	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
 }
 

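For reference, the predicate being made consistent here looks roughly
like the following on x86 (a simplified restatement, not the literal
header; the real version additionally consults PKRU for protection
keys):

/* simplified sketch of x86's pte_access_permitted() */
static inline bool pte_access_permitted_sketch(pte_t pte, bool write)
{
	unsigned long need = _PAGE_PRESENT | _PAGE_USER;

	if (write)
		need |= _PAGE_RW;

	return (pte_flags(pte) & need) == need;
}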

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-14 12:41   ` Peter Zijlstra
@ 2017-12-14 14:37     ` Peter Zijlstra
  2017-12-14 20:44       ` Dave Hansen
  2017-12-15 14:04       ` Peter Zijlstra
  0 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 14:37 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Dave Hansen,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Thu, Dec 14, 2017 at 01:41:17PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 12:27:27PM +0100, Peter Zijlstra wrote:
> > The gup_*_range() functions which implement __get_user_pages_fast() do
> > a p*_access_permitted() test to see if the memory is at all accessible
> > (tests both _PAGE_USER|_PAGE_RW as well as architectural things like
> > pkeys).
> > 
> > But the follow_*() functions which implement __get_user_pages() do not
> > have this test. Recently, commit:
> > 
> >   5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup paths")
> > 
> > added it to a few specific write paths, but it failed to consistently
> > apply it (I've not audited anything outside of gup).
> > 
> > Revert the change from that patch and insert the tests in the right
> > locations such that they cover all READ / WRITE accesses for all
> > pte/pmd/pud levels.
> > 
> > In particular I care about the _PAGE_USER test; we should not ever
> > allow access to pages not marked with it, but it also makes the pkey
> > accesses more consistent.
> 
> This should probably go on top. These are now all superfluous and
> slightly wrong.

I also cannot explain dax_mapping_entry_mkclean(): why would we not
clean those pages that are not pkey writable (but clearly are writable
and dirty)? That doesn't make any sense at all.

Kirill did point out that my patch(es) break FOLL_DUMP in that it would
now exclude pkey protected pages from core-dumps.

My counter argument is that it will now properly exclude !_PAGE_USER
pages.

If we change p??_access_permitted() to pass the full follow flags
instead of just the write part we could fix that.

I'm also looking at pte_access_permitted() in handle_pte_fault(); that
looks very dodgy to me. How does that not result in endlessly CoW'ing
the same page over and over when we have a PKEY disallowing write access
on that page?

Bah... /me grumpy


* Re: [PATCH v2 02/17] mm: Exempt special mappings from mlock(), mprotect() and madvise()
  2017-12-14 11:27 ` [PATCH v2 02/17] mm: Exempt special mappings from mlock(), mprotect() and madvise() Peter Zijlstra
@ 2017-12-14 16:19   ` Andy Lutomirski
  2017-12-14 17:36     ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 16:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, X86 ML, Linus Torvalds,
	Andy Lutomirsky, Dave Hansen, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> It makes no sense to ever prod at special mappings with any of these
> syscalls.
>
> XXX should we include munmap() ?

This is an ABI break for the vdso.  Maybe that's okay, but mremap() on
the vdso is certainly used, and I can imagine debuggers using
mprotect().


* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 11:27 ` [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced Peter Zijlstra
@ 2017-12-14 16:20   ` Andy Lutomirski
  2017-12-14 19:43     ` Linus Torvalds
  0 siblings, 1 reply; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 16:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, X86 ML, Linus Torvalds,
	Andy Lutomirsky, Dave Hansen, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> In order to make the LDT mapping RO the access bit needs to be forced by
> the kernel. Adjust the test case so it handles that gracefully.

If this turns out to need reverting because it breaks Wine or
something, we're really going to regret it.


* Re: [PATCH v2 13/17] x86/mm: Force LDT desc accessed bit
  2017-12-14 11:27 ` [PATCH v2 13/17] x86/mm: Force LDT desc accessed bit Peter Zijlstra
@ 2017-12-14 16:21   ` Andy Lutomirski
  0 siblings, 0 replies; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, X86 ML, Linus Torvalds,
	Andy Lutomirsky, Dave Hansen, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> In preparation to mapping the LDT RO, unconditionally set the accessed
> bit.

I like this patch, but only as part of a standalone series to harden
the kernel by making the LDT be RO so we can revert the series if
needed.  I don't like it as a prerequisite to having a working kernel.


* Re: [PATCH v2 03/17] arch: Allow arch_dup_mmap() to fail
  2017-12-14 11:27 ` [PATCH v2 03/17] arch: Allow arch_dup_mmap() to fail Peter Zijlstra
@ 2017-12-14 16:22   ` Andy Lutomirski
  0 siblings, 0 replies; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, X86 ML, Linus Torvalds,
	Andy Lutomirsky, Dave Hansen, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
> allowed to fail. Fix up all instances.

Reviewed-by: Andy Lutomirski <luto@kernel.org>


* Re: [PATCH v2 14/17] x86/ldt: Reshuffle code
  2017-12-14 11:27 ` [PATCH v2 14/17] x86/ldt: Reshuffle code Peter Zijlstra
@ 2017-12-14 16:23   ` Andy Lutomirski
  2017-12-14 16:31     ` Thomas Gleixner
  0 siblings, 1 reply; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, X86 ML, Linus Torvalds,
	Andy Lutomirsky, Dave Hansen, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Restructure the code, so the following VMA changes do not create an
> unreadable mess. No functional change.

Can the PF_KTHREAD thing be its own patch so it can be reviewed on its own?


* Re: [PATCH v2 14/17] x86/ldt: Reshuffle code
  2017-12-14 16:23   ` Andy Lutomirski
@ 2017-12-14 16:31     ` Thomas Gleixner
  2017-12-14 16:32       ` Thomas Gleixner
  0 siblings, 1 reply; 76+ messages in thread
From: Thomas Gleixner @ 2017-12-14 16:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, linux-kernel, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams


On Thu, 14 Dec 2017, Andy Lutomirski wrote:

> On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> >
> > Restructure the code, so the following VMA changes do not create an
> > unreadable mess. No functional change.
> 
> Can the PF_KTHREAD thing be its own patch so it can be reviewed on its own?

I had that as a separate patch at some point.

Thanks,

	tglx


* Re: [PATCH v2 14/17] x86/ldt: Reshuffle code
  2017-12-14 16:31     ` Thomas Gleixner
@ 2017-12-14 16:32       ` Thomas Gleixner
  2017-12-14 16:34         ` Andy Lutomirski
  0 siblings, 1 reply; 76+ messages in thread
From: Thomas Gleixner @ 2017-12-14 16:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, linux-kernel, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, 14 Dec 2017, Thomas Gleixner wrote:

> 
> On Thu, 14 Dec 2017, Andy Lutomirski wrote:
> 
> > On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > From: Thomas Gleixner <tglx@linutronix.de>
> > >
> > > Restructure the code, so the following VMA changes do not create an
> > > unreadable mess. No functional change.
> > 
> > Can the PF_KTHREAD thing be its own patch so it can be reviewed on its own?
> 
> I had that as a separate patch at some point.

See 5/N


* Re: [PATCH v2 05/17] x86/ldt: Prevent ldt inheritance on exec
  2017-12-14 11:27 ` [PATCH v2 05/17] x86/ldt: Prevent ldt inheritance on exec Peter Zijlstra
@ 2017-12-14 16:32   ` Andy Lutomirski
  0 siblings, 0 replies; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, X86 ML, Linus Torvalds,
	Andy Lutomirsky, Dave Hansen, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> The LDT is inherited independent of fork or exec, but that makes no sense
> at all because exec is supposed to start the process clean.
>
> The reason why this happens is that init_new_context_ldt() is called from
> init_new_context() which obviously needs to be called for both fork() and
> exec().
>
> It would be surprising if anything relies on that behaviour, so it seems to
> be safe to remove that misfeature.
>
> Split the context initialization into two parts. Clear the ldt pointer and
> initialize the mutex from the general context init and move the LDT
> duplication to arch_dup_mmap() which is only called on fork().

I like this one.


* Re: [PATCH v2 14/17] x86/ldt: Reshuffle code
  2017-12-14 16:32       ` Thomas Gleixner
@ 2017-12-14 16:34         ` Andy Lutomirski
  2017-12-14 17:47           ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 16:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Peter Zijlstra, linux-kernel, X86 ML,
	Linus Torvalds, Dave Hansen, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm, Kirill A. Shutemov,
	Dan Williams

On Thu, Dec 14, 2017 at 8:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 14 Dec 2017, Thomas Gleixner wrote:
>
>>
>> On Thu, 14 Dec 2017, Andy Lutomirski wrote:
>>
>> > On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > From: Thomas Gleixner <tglx@linutronix.de>
>> > >
>> > > Restructure the code, so the following VMA changes do not create an
>> > > unreadable mess. No functional change.
>> >
>> > Can the PF_KTHREAD thing be its own patch so it can be reviewed on its own?
>>
>> I had that as a separate patch at some point.
>
> See 5/N

It looks like a little bit of it got mixed in to this patch, though.


* Re: [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping
  2017-12-14 12:08   ` Peter Zijlstra
@ 2017-12-14 16:35     ` Andy Lutomirski
  2017-12-14 17:50       ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, X86 ML, Linus Torvalds,
	Andy Lutomirsky, Dave Hansen, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 4:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Dec 14, 2017 at 01:03:37PM +0100, Thomas Gleixner wrote:
>> On Thu, 14 Dec 2017, Peter Zijlstra wrote:
>> > So here's a second posting of the VMA based LDT implementation; now without
>> > most of the crazy.
>> >
>> > I took out the write fault handler and the magic LAR touching code.
>> >
>> > Additionally there are a bunch of patches that address generic vm issue.
>> >
>> >  - gup() access control; In specific I looked at accessing !_PAGE_USER pages
>> >    because these patches rely on not being able to do that.
>> >
>> >  - special mappings; A whole bunch of mmap ops don't make sense on special
>> >    mappings so disallow them.
>> >
>> > Both things make sense independent of the rest of the series. Similarly, the
>> > patches that kill that ridiculous LDT inherit on exec() are also unquestionably
>> > good.
>> >
>> > So I think at least the first 6 patches are good, irrespective of the
>> > VMA approach.
>> >
>> > On the whole VMA approach, Andy I know you hate it with a passion, but I really
>> > rather like how it ties the LDT to the process that it belongs to and it
>> > reduces the amount of 'special' pages in the whole PTI mapping.
>> >
>> > I'm not the one going to make the decision on this; but I figured I at least
>> > post a version without the obvious crap parts of the last one.
>> >
>> > Note: if we were to also disallow munmap() for special mappings (which I
>> > suppose makes perfect sense) then we could further reduce the actual LDT
>> > code (we'd no longer need the sm::close callback and related things).
>>
>> That makes a lot of sense for the other special mapping users like VDSO and
>> kprobes.
>
> Right, and while looking at that I also figured it might make sense to
> unconditionally disallow splitting special mappings.
>
>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2698,6 +2698,9 @@ int do_munmap(struct mm_struct *mm, unsi
>         }
>         vma = prev ? prev->vm_next : mm->mmap;
>
> +       if (vma_is_special_mapping(vma))
> +               return -EINVAL;
> +
>         if (unlikely(uf)) {
>                 /*
>                  * If userfaultfd_unmap_prep returns an error the vmas
> @@ -3223,10 +3226,11 @@ static int special_mapping_fault(struct
>   */
>  static void special_mapping_close(struct vm_area_struct *vma)
>  {
> -       struct vm_special_mapping *sm = vma->vm_private_data;
> +}
>
> -       if (sm->close)
> -               sm->close(sm, vma);
> +static int special_mapping_split(struct vm_area_struct *vma, unsigned long addr)
> +{
> +       return -EINVAL;
>  }
>
>  static const char *special_mapping_name(struct vm_area_struct *vma)
> @@ -3252,6 +3256,7 @@ static const struct vm_operations_struct
>         .fault = special_mapping_fault,
>         .mremap = special_mapping_mremap,
>         .name = special_mapping_name,
> +       .split = special_mapping_split,
>  };
>
>  static const struct vm_operations_struct legacy_special_mapping_vmops = {

Disallowing splitting seems fine.  Disallowing munmap might not be.
Certainly CRIU relies on being able to mremap() the VDSO.


* Re: [PATCH v2 02/17] mm: Exempt special mappings from mlock(), mprotect() and madvise()
  2017-12-14 16:19   ` Andy Lutomirski
@ 2017-12-14 17:36     ` Peter Zijlstra
  2018-01-02 16:44       ` Dmitry Safonov
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 17:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Thomas Gleixner, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 08:19:36AM -0800, Andy Lutomirski wrote:
> On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > It makes no sense to ever prod at special mappings with any of these
> > syscalls.
> >
> > XXX should we include munmap() ?
> 
> This is an ABI break for the vdso.  Maybe that's okay, but mremap() on
> the vdso is certainly used, and I can imagine debuggers using
> mprotect().

*groan*, ok so mremap() will actually still work after this, but yes,
mprotect() will not. I hadn't figured people would muck with the VDSO
like that.


* Re: [PATCH v2 14/17] x86/ldt: Reshuffle code
  2017-12-14 16:34         ` Andy Lutomirski
@ 2017-12-14 17:47           ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 17:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-kernel, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 08:34:00AM -0800, Andy Lutomirski wrote:
> On Thu, Dec 14, 2017 at 8:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:

> >> > Can the PF_KTHREAD thing be its own patch so it can be reviewed on its own?
> >>
> >> I had that as a separate patch at some point.
> >
> > See 5/N
> 
> It looks like a little bit of it got mixed in to this patch, though.

I can fix that..


* Re: [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping
  2017-12-14 16:35     ` Andy Lutomirski
@ 2017-12-14 17:50       ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 17:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-kernel, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 08:35:50AM -0800, Andy Lutomirski wrote:

> > @@ -3252,6 +3256,7 @@ static const struct vm_operations_struct
> >         .fault = special_mapping_fault,
> >         .mremap = special_mapping_mremap,
> >         .name = special_mapping_name,
> > +       .split = special_mapping_split,
> >  };
> >
> >  static const struct vm_operations_struct legacy_special_mapping_vmops = {
> 
> Disallowing splitting seems fine.  Disallowing munmap might not be.
> Certainly CRIU relies on being able to mremap() the VDSO.

And for mremap() we need do_munmap() to work, argh.



* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 16:20   ` Andy Lutomirski
@ 2017-12-14 19:43     ` Linus Torvalds
  2017-12-14 21:22       ` Andy Lutomirski
  0 siblings, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2017-12-14 19:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, X86 ML,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 8:20 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> If this turns out to need reverting because it breaks Wine or
> something, we're really going to regret it.

I really don't see that as very likely. We already play other (much
more fundamental) games with segments.

But I do agree that it would be good to consider this "turn LDT
read-only" a separate series just in case.

                   Linus


* Re: [PATCH v2 06/17] x86/ldt: Do not install LDT for kernel threads
  2017-12-14 11:27 ` [PATCH v2 06/17] x86/ldt: Do not install LDT for kernel threads Peter Zijlstra
@ 2017-12-14 19:43   ` Peter Zijlstra
  2017-12-14 21:27     ` Andy Lutomirski
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 19:43 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Dave Hansen,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Thu, Dec 14, 2017 at 12:27:32PM +0100, Peter Zijlstra wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Kernel threads can use the mm of a user process temporarily via use_mm(),
> but there is no point in installing the LDT which is associated to that mm
> for the kernel thread.

So thinking about this a bit more; I fear it's not correct.

Suppose a kthread does use_mm() and we then schedule to a task of that
process, we'll not pass through switch_mm() and we'll not install the
LDT and bad things happen.

Or am I missing something?
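
For reference, the relevant paths, condensed (loosely after
context_switch() and x86's switch_mm_irqs_off() of this era; not
verbatim kernel code):

	/* context_switch(), condensed: */
	if (!next->mm) {			/* kernel thread */
		next->active_mm = prev->active_mm; /* e.g. mm adopted via use_mm() */
		mmgrab(prev->active_mm);
		enter_lazy_tlb(prev->active_mm, next);
	} else {
		switch_mm_irqs_off(prev->active_mm, next->mm, next);
	}

	/* x86 switch_mm_irqs_off(), condensed: */
	if (real_prev == next)
		return;	/* same mm: CR3 stays, and nothing (re)installs
			 * the LDT for the incoming user task either */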


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-14 14:37     ` Peter Zijlstra
@ 2017-12-14 20:44       ` Dave Hansen
  2017-12-14 20:54         ` Peter Zijlstra
  2017-12-15 14:04       ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Dave Hansen @ 2017-12-14 20:44 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Borislav Petkov, Greg KH,
	keescook, hughd, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

On 12/14/2017 06:37 AM, Peter Zijlstra wrote:
> I'm also looking at pte_access_permitted() in handle_pte_fault(); that
> looks very dodgy to me. How does that not result in endlessly CoW'ing
> the same page over and over when we have a PKEY disallowing write access
> on that page?

I'm not seeing the pte_access_permitted() in handle_pte_fault().  I
assume that's something you added in this series.

But, one of the ways that we keep pkeys from causing these kinds of
repeating loops when interacting with other things is this hunk in the
page fault code:

> static inline int
> access_error(unsigned long error_code, struct vm_area_struct *vma)
> {
...
>         /*
>          * Read or write was blocked by protection keys.  This is
>          * always an unconditional error and can never result in
>          * a follow-up action to resolve the fault, like a COW.
>          */
>         if (error_code & PF_PK)
>                 return 1;

That short-circuits the page fault pretty quickly.  So, basically, the
rule is: if the hardware says you tripped over pkey permissions, you
die.  We don't try to do anything to the underlying page *before* saying
that you die.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-14 20:44       ` Dave Hansen
@ 2017-12-14 20:54         ` Peter Zijlstra
  2017-12-14 21:18           ` Peter Zijlstra
  2017-12-15  5:04           ` Dave Hansen
  0 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 20:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, tglx, x86, Linus Torvalds, Andy Lutomirsky,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Thu, Dec 14, 2017 at 12:44:58PM -0800, Dave Hansen wrote:
> On 12/14/2017 06:37 AM, Peter Zijlstra wrote:
> > I'm also looking at pte_access_permitted() in handle_pte_fault(); that
> > looks very dodgy to me. How does that not result in endlessly CoW'ing
> > the same page over and over when we have a PKEY disallowing write access
> > on that page?
> 
> I'm not seeing the pte_access_permitted() in handle_pte_fault().  I
> assume that's something you added in this series.

No, Dan did in 5c9d2d5c269c4.

> But, one of the ways that we keep pkeys from causing these kinds of
> repeating loops when interacting with other things is this hunk in the
> page fault code:
> 
> > static inline int
> > access_error(unsigned long error_code, struct vm_area_struct *vma)
> > {
> ...
> >         /*
> >          * Read or write was blocked by protection keys.  This is
> >          * always an unconditional error and can never result in
> >          * a follow-up action to resolve the fault, like a COW.
> >          */
> >         if (error_code & PF_PK)
> >                 return 1;
> 
> That short-circuits the page fault pretty quickly.  So, basically, the
> rule is: if the hardware says you tripped over pkey permissions, you
> die.  We don't try to do anything to the underlying page *before* saying
> that you die.

That only works when you trip the fault from hardware. Not if you do a
software fault using gup().

AFAIK __get_user_pages(FOLL_FORCE|FOLL_WRITE|FOLL_GET) will loop
indefinitely on the case I described.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-14 20:54         ` Peter Zijlstra
@ 2017-12-14 21:18           ` Peter Zijlstra
  2017-12-15  5:04           ` Dave Hansen
  1 sibling, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 21:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, tglx, x86, Linus Torvalds, Andy Lutomirsky,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Thu, Dec 14, 2017 at 09:54:50PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 12:44:58PM -0800, Dave Hansen wrote:
> > On 12/14/2017 06:37 AM, Peter Zijlstra wrote:
> > > I'm also looking at pte_access_permitted() in handle_pte_fault(); that
> > > looks very dodgy to me. How does that not result in endlessly CoW'ing
> > > the same page over and over when we have a PKEY disallowing write access
> > > on that page?
> > 
> > I'm not seeing the pte_access_permitted() in handle_pte_fault().  I
> > assume that's something you added in this series.
> 
> No, Dan did in 5c9d2d5c269c4.
> 
> > But, one of the ways that we keep pkeys from causing these kinds of
> > repeating loops when interacting with other things is this hunk in the
> > page fault code:
> > 
> > > static inline int
> > > access_error(unsigned long error_code, struct vm_area_struct *vma)
> > > {
> > ...
> > >         /*
> > >          * Read or write was blocked by protection keys.  This is
> > >          * always an unconditional error and can never result in
> > >          * a follow-up action to resolve the fault, like a COW.
> > >          */
> > >         if (error_code & PF_PK)
> > >                 return 1;
> > 
> > That short-circuits the page fault pretty quickly.  So, basically, the
> > rule is: if the hardware says you tripped over pkey permissions, you
> > die.  We don't try to do anything to the underlying page *before* saying
> > that you die.
> 
> That only works when you trip the fault from hardware. Not if you do a
> software fault using gup().
> 
> AFAIK __get_user_pages(FOLL_FORCE|FOLL_WRITE|FOLL_GET) will loop
> indefinitely on the case I described.

Note that my patch actually fixes this by making can_follow_write_pte()
not return NULL (we'll take the CoW fault irrespective of PKEYs) and
then on the second go-around, we'll find a writable PTE but return
-EFAULT from follow_page_mask() because of PKEY and terminate.

But as is, follow_page_mask() will return NULL because of either !write
or PKEY, faultin_page()->handle_mm_fault() will see !write because of
PKEY and go into the CoW path, we retry follow_page_mask() and it will
_still_ return NULL because of PKEY, again to the fault, again retry,
again ....
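
In condensed pseudo-C (simplified from __get_user_pages() in mm/gup.c;
the real loop handles more cases):

	for (;;) {
		page = follow_page_mask(vma, address, foll_flags, &page_mask);
		if (page)
			break;	/* got a page, done */
		/*
		 * NULL because the pkey forbids the write; faultin_page()
		 * dutifully does a CoW, but the fresh PTE fails the very
		 * same pkey test, so the next pass gets NULL again,
		 * forever.
		 */
		if (faultin_page(tsk, vma, address, &foll_flags, nonblocking))
			break;	/* a hard error would stop us, but the
				 * CoW "succeeds" every time */
	}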


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 19:43     ` Linus Torvalds
@ 2017-12-14 21:22       ` Andy Lutomirski
  2017-12-14 21:44         ` Linus Torvalds
  0 siblings, 1 reply; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 21:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Peter Zijlstra, linux-kernel, Thomas Gleixner,
	X86 ML, Dave Hansen, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm, Kirill A. Shutemov,
	Dan Williams

On Thu, Dec 14, 2017 at 11:43 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Dec 14, 2017 at 8:20 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> If this turns out to need reverting because it breaks Wine or
>> something, we're really going to regret it.
>
> I really don't see that as very likely. We already play other (much
> more fundamental) games with segments.
>

I dunno.  Maybe Wine or DOSEMU apps expect to be able to create a
non-accessed segment and then read out the accessed bit using LAR or
modify_ldt() later.

> But I do agree that it would be good to consider this "turn LDT
> read-only" a separate series just in case.

Which kind of kills the whole thing.  There's no way the idea of
putting the LDT in a VMA is okay if it's RW.  You just get the kernel
to put_user() a call gate into it and it's game over.

I have a competing patch that just aliases the LDT high up in kernel
land and shares it in the user tables.  I like a lot of the cleanups
in this series, but I don't like the actual LDT-in-a-VMA part.
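
To spell out the "game over" above: with an RW LDT, a single put_user()
of a 16-byte x86-64 call-gate descriptor along these lines (field names
are descriptive, not kernel identifiers) hands user space a controlled
ring-0 entry point:

	struct call_gate64 {
		u16 offset_low;		/* target RIP  15:0 */
		u16 selector;		/* e.g. __KERNEL_CS */
		u8  ist  : 3,
		    zero : 5;
		u8  type : 4,		/* 0xc: 64-bit call gate */
		    s    : 1,		/* 0: system descriptor */
		    dpl  : 2,		/* 3: callable from CPL 3 */
		    p    : 1;		/* present */
		u16 offset_mid;		/* target RIP 31:16 */
		u32 offset_high;	/* target RIP 63:32 */
		u32 reserved;
	} __attribute__((packed));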


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 06/17] x86/ldt: Do not install LDT for kernel threads
  2017-12-14 19:43   ` Peter Zijlstra
@ 2017-12-14 21:27     ` Andy Lutomirski
  0 siblings, 0 replies; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 21:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, X86 ML, Linus Torvalds,
	Andy Lutomirsky, Dave Hansen, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 11:43 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Dec 14, 2017 at 12:27:32PM +0100, Peter Zijlstra wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>>
>> Kernel threads can use the mm of a user process temporarily via use_mm(),
>> but there is no point in installing the LDT which is associated to that mm
>> for the kernel thread.
>
> So thinking about this a bit more; I fear its not correct.
>
> Suppose a kthread does use_mm() and we then schedule to a task of that
> process, we'll not pass through switch_mm() and we'll not install the
> LDT and bad things happen.
>
> Or am I missing something?
>

Nah, you're probably right.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 21:22       ` Andy Lutomirski
@ 2017-12-14 21:44         ` Linus Torvalds
  2017-12-14 21:48           ` Linus Torvalds
  2017-12-14 22:23           ` Thomas Gleixner
  0 siblings, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2017-12-14 21:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, X86 ML,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 1:22 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> Which kind of kills the whole thing.  There's no way the idea of
> putting the LDT in a VMA is okay if it's RW.

Sure there is.

I really don't understand why you guys think it has to be RO.

All it has to be is not _user_ accessible. And that's a requirement
regardless, because no way in hell should users be able to read the
damn thing.

So it clearly needs to have the PAGE_USER bit clear (to avoid users
accessing it directly), and it needs to be marked somehow for
get_user_pages() to refuse it too, and access_ok() needs to fail it so
that we can't do get_user/put_user on it.

But the whole RO vs RW is not fundamentally critical.

Now, I do agree that RO is much much better in general, and it avoids
the requirement to play games with "access_ok()" and friends (assuming
we're just ok with users reading it), but I disagree with the whole
"this is fundamental".

                 Linus


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 21:44         ` Linus Torvalds
@ 2017-12-14 21:48           ` Linus Torvalds
  2017-12-14 22:02             ` Peter Zijlstra
  2017-12-14 22:11             ` Andy Lutomirski
  2017-12-14 22:23           ` Thomas Gleixner
  1 sibling, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2017-12-14 21:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, X86 ML,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 1:44 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So it clearly needs to have the PAGE_USER bit clear (to avoid users
> accessing it directly), and it needs to be marked somehow for
> get_user_pages() to refuse it too, and access_ok() needs to fail it so
> that we can't do get_user/put_user on it.

Actually, just clearing PAGE_USER should make gup avoid it automatically.

So really the only other thing it needs is to have access_ok() avoid
it so that the kernel can't be fooled into accessing it for the user.

That does probably mean having to put it at the top of the user
address space and playing games with user_addr_max(). Which is not
wonderful, but certainly not rocket surgery either.
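
Roughly this kind of thing (LDT_BASE is an illustrative name for
wherever the mapping ends up, not an existing symbol):

	#define LDT_BASE	TASK_SIZE_MAX	/* assumed placement */

	static inline bool sketch_access_ok(unsigned long addr,
					    unsigned long size)
	{
		/*
		 * user_addr_max() stops short of LDT_BASE, so
		 * get_user()/put_user() refuse the mapping while the
		 * CPU can still reach it through the user page tables.
		 */
		return size <= LDT_BASE && addr <= LDT_BASE - size;
	}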

              Linus


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 21:48           ` Linus Torvalds
@ 2017-12-14 22:02             ` Peter Zijlstra
  2017-12-14 22:14               ` Linus Torvalds
  2017-12-14 22:11             ` Andy Lutomirski
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 22:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, linux-kernel, Thomas Gleixner, X86 ML,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 01:48:50PM -0800, Linus Torvalds wrote:
> Actually, just clearing PAGE_USER should make gup avoid it automatically.

_Should_ being the operative word, because I cannot currently see it
DTRT. But maybe I'm missing the obvious -- I tend to do that at times.

We don't appear to have pte_bad() and pte_access_permitted() is
currently all sorts of wrong.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 21:48           ` Linus Torvalds
  2017-12-14 22:02             ` Peter Zijlstra
@ 2017-12-14 22:11             ` Andy Lutomirski
  2017-12-14 22:15               ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 22:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Peter Zijlstra, linux-kernel, Thomas Gleixner,
	X86 ML, Dave Hansen, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm, Kirill A. Shutemov,
	Dan Williams


> On Dec 14, 2017, at 1:48 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Thu, Dec 14, 2017 at 1:44 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> 
>> So it clearly needs to have the PAGE_USER bit clear (to avoid users
>> accessing it directly), and it needs to be marked somehow for
>> get_user_pages() to refuse it too, and access_ok() needs to fail it so
>> that we can't do get_user/put_user on it.
> 
> Actually, just clearing PAGE_USER should make gup avoid it automatically.
> 
> So really the only other thing it needs is to have access_ok() avoid
> it so that the kernel can't be fooled into accessing it for the user.
> 
> That does probably mean having to put it at the top of the user
> address space and playing games with user_addr_max(). Which is not
> wonderful, but certainly not rocket surgery either.

That seems to rather defeat the point of using a VMA, though.  And it means we still have to do a full cmp instead of just checking a sign bit in access_ok if we ever manage to kill set_fs().

Again, I have an apparently fully functional patch to alias the LDT at a high (kernel) address where we can cleanly map it in the user pagetables without any of this VMA stuff.  It's much less code.
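
For the record, the sign-bit check meant above would be something like
this (illustrative only; it assumes user addresses stay in the lower
canonical half once set_fs() is gone):

	static inline bool sign_bit_access_ok(unsigned long addr,
					      unsigned long size)
	{
		return !((addr | size | (addr + size)) >> 63);
	}

Any lower-half VMA passes such a check, so fencing the LDT off would be
back to a full compare against some boundary.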


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 22:02             ` Peter Zijlstra
@ 2017-12-14 22:14               ` Linus Torvalds
  2017-12-14 22:24                 ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2017-12-14 22:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, linux-kernel, Thomas Gleixner, X86 ML,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 2:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> _Should_ being the operative word, because I cannot currently see it
> DTRT. But maybe I'm missing the obvious -- I tend to do that at times.

At least the old get_user_pages_fast() code used to check the USER bit:

        unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;

        if (write)
                need_pte_bits |= _PAGE_RW;

but that may have been lost when we converted over to the generic code.

It shouldn't actually _matter_, since we'd need to change access_ok()
anyway (and gup had better check that!)

             Linus


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 22:11             ` Andy Lutomirski
@ 2017-12-14 22:15               ` Linus Torvalds
  2017-12-14 22:30                 ` Andy Lutomirski
  0 siblings, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2017-12-14 22:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, linux-kernel, Thomas Gleixner,
	X86 ML, Dave Hansen, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm, Kirill A. Shutemov,
	Dan Williams

On Thu, Dec 14, 2017 at 2:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> That seems to rather defeat the point of using a VMA, though.

There never was any point in using a VMA per se.

The point was always to just map the damn thing in the user page
tables, wasn't it?

The vma bit was just an implementation detail.

           Linus


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 21:44         ` Linus Torvalds
  2017-12-14 21:48           ` Linus Torvalds
@ 2017-12-14 22:23           ` Thomas Gleixner
  2017-12-14 22:50             ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Thomas Gleixner @ 2017-12-14 22:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Peter Zijlstra, linux-kernel, X86 ML,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, 14 Dec 2017, Linus Torvalds wrote:

> On Thu, Dec 14, 2017 at 1:22 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > Which kind of kills the whole thing.  There's no way the idea of
> > putting the LDT in a VMA is okay if it's RW.
> 
> Sure there is.
> 
> I really don't understand why you guys think it has to be RO.
> 
> All it has to be is not _user_ accessible. And that's a requirement
> regardless, because no way in hell should users be able to read the
> damn thing.

The user knows the LDT contents because he put it there and it can be read
via modify_ldt(0, ) anyway. Or am I misunderstanding what you are trying to
say?

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 22:14               ` Linus Torvalds
@ 2017-12-14 22:24                 ` Peter Zijlstra
  2017-12-14 22:52                   ` Linus Torvalds
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-14 22:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, linux-kernel, Thomas Gleixner, X86 ML,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 02:14:00PM -0800, Linus Torvalds wrote:
> On Thu, Dec 14, 2017 at 2:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > _Should_ being the operative word, because I cannot currently see it
> > DTRT. But maybe I'm missing the obvious -- I tend to do that at times.
> 
> At least the old get_user_pages_fast() code used to check the USER bit:
> 
>         unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
> 
>         if (write)
>                 need_pte_bits |= _PAGE_RW;
> 
> but that may have been lost when we converted over to the generic code.

The generic gup_pte_range() has pte_access_permitted() (which has the
above test) in the right place.

> It shouldn't actually _matter_, since we'd need to change access_ok()
> anyway (and gup had better check that!)

get_user_pages_fast() (both of them) do indeed test access_ok(), but the
regular get_user_pages() does not, I suspect because it can operate on a
foreign mm.

And it's the regular old get_user_pages() that's all sorts of broken wrt
!PAGE_USER too.
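
For reference, the fast path's test (paraphrased from the
__get_user_pages_fast() of this era):

	start &= PAGE_MASK;
	len = (unsigned long)nr_pages << PAGE_SHIFT;
	end = start + len;

	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
				(void __user *)start, len)))
		return 0;

whereas __get_user_pages() goes straight to find_extend_vma() and the
page walk, with no equivalent test.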


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 22:15               ` Linus Torvalds
@ 2017-12-14 22:30                 ` Andy Lutomirski
  0 siblings, 0 replies; 76+ messages in thread
From: Andy Lutomirski @ 2017-12-14 22:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Peter Zijlstra, linux-kernel, Thomas Gleixner,
	X86 ML, Dave Hansen, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm, Kirill A. Shutemov,
	Dan Williams



> On Dec 14, 2017, at 2:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Thu, Dec 14, 2017 at 2:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>> That seems to rather defeat the point of using a VMA, though.
> 
> There never was any point in using a VMA per se.
> 
> The point was always to just map the damn thing in the user page
> tables, wasn't it?
> 
> The vma bit was just an implementation detail.

And all this is why I dislike using a VMA.  My patch puts it at a negative address. We could just as easily put it just above TASK_SIZE_MAX, but I'm a bit nervous about bugs that overrun an access_ok check by a small amount.  IIRC I found one of those in the net code once, and I didn't look very hard.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 22:23           ` Thomas Gleixner
@ 2017-12-14 22:50             ` Linus Torvalds
  0 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2017-12-14 22:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Peter Zijlstra, linux-kernel, X86 ML,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 2:23 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> The user knows the LDT contents because he put it there and it can be read
> via modify_ldt(0, ) anyway. Or am I misunderstanding what you are trying to
> say?

I don't think they are secret, it's more of a "if they can read it,
they can write it" kind of thing.

The whole "it should be RO" makes no sense. The first choice should be
"it should be inaccessible".

And that actually seems the _simpler_ choice.

             Linus


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-14 22:24                 ` Peter Zijlstra
@ 2017-12-14 22:52                   ` Linus Torvalds
  0 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2017-12-14 22:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, linux-kernel, Thomas Gleixner, X86 ML,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 2:24 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> get_user_pages_fast() (both of them) do indeed test access_ok(), but the
> regular get_user_pages() does not, I suspect because it can operate on a
> foreign mm.

That sounds wrong.

We actually had some very serious reasons why get_user_pages_fast()
needed to check access_ok().

I happen to forget what those reasons were, though.

My mind may be going.

But I think it was something like "you could walk off the page tables
because the undefined address range generates nonsensical values for
the pgd_offset() functions" etc.

But maybe the regular get_user_pages() has some other way to protect
against that.
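
The offset helpers really are unchecked arithmetic (shape as in the
generic headers, simplified):

	#define pgd_index(addr)		(((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
	#define pgd_offset(mm, addr)	((mm)->pgd + pgd_index(addr))

The mask keeps the index inside the pgd page, but a kernel or
non-canonical address happily selects an entry there, typically in the
kernel half, and the walk then marches through whatever that maps. So
whatever keeps the address sane has to run before the walk starts.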

          Linus


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-14 20:54         ` Peter Zijlstra
  2017-12-14 21:18           ` Peter Zijlstra
@ 2017-12-15  5:04           ` Dave Hansen
  2017-12-15  6:09             ` Linus Torvalds
  2017-12-15  8:00             ` Peter Zijlstra
  1 sibling, 2 replies; 76+ messages in thread
From: Dave Hansen @ 2017-12-15  5:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, x86, Linus Torvalds, Andy Lutomirsky,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On 12/14/2017 12:54 PM, Peter Zijlstra wrote:
>> That short-circuits the page fault pretty quickly.  So, basically, the
>> rule is: if the hardware says you tripped over pkey permissions, you
>> die.  We don't try to do anything to the underlying page *before* saying
>> that you die.
> That only works when you trip the fault from hardware. Not if you do a
> software fault using gup().
> 
> AFAIK __get_user_pages(FOLL_FORCE|FOLL_WRITE|FOLL_GET) will loop
> indefinitely on the case I described.

So, the underlying bug here is that we now do a get_user_pages_remote() and
then go ahead and do the p*_access_permitted() checks against the
current PKRU.  This was introduced recently with the addition of the new
p??_access_permitted() calls.

We have checks in the VMA path for the "remote" gups and we avoid
consulting PKRU for them.  This got missed in the pkeys selftests
because I did a ptrace read, but not a *write*.  I also didn't
explicitly test it against something where a COW needed to be done.

I've got some additions to the selftests and a fix where we pass FOLL_*
flags around a bit more instead of just 'write'.  I'll get those out as
soon as I do a bit more testing.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-15  5:04           ` Dave Hansen
@ 2017-12-15  6:09             ` Linus Torvalds
  2017-12-15  7:51               ` Peter Zijlstra
  2017-12-15  8:00             ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2017-12-15  6:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Linux Kernel Mailing List, tglx, x86,
	Andy Lutomirsky, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams


On Dec 14, 2017 21:04, "Dave Hansen" <dave.hansen@intel.com> wrote:

> On 12/14/2017 12:54 PM, Peter Zijlstra wrote:
> >> That short-circuits the page fault pretty quickly.  So, basically, the
> >> rule is: if the hardware says you tripped over pkey permissions, you
> >> die.  We don't try to do anything to the underlying page *before* saying
> >> that you die.
> > That only works when you trip the fault from hardware. Not if you do a
> > software fault using gup().
> >
> > AFAIK __get_user_pages(FOLL_FORCE|FOLL_WRITE|FOLL_GET) will loop
> > indefinitely on the case I described.
>
> So, the underlying bug here is that we now do a get_user_pages_remote() and
> then go ahead and do the p*_access_permitted() checks against the
> current PKRU.  This was introduced recently with the addition of the new
> p??_access_permitted() calls.


Can we please just undo that broken crap instead of trying to "fix" it?

It was wrong. We absolutely do not want to complicate the gup path.

Let's get rid of those broken p??_access_permitted() things.

Please.

         Linus


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-15  6:09             ` Linus Torvalds
@ 2017-12-15  7:51               ` Peter Zijlstra
  2017-12-16  0:20                 ` Linus Torvalds
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-15  7:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Linux Kernel Mailing List, tglx, x86,
	Andy Lutomirsky, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm, kirill.shutemov, dan.j.williams

On Thu, Dec 14, 2017 at 10:09:24PM -0800, Linus Torvalds wrote:
> On Dec 14, 2017 21:04, "Dave Hansen" <dave.hansen@intel.com> wrote:
> Can we please just undo that broken crap instead of trying to "fix" it?
> 
> It was wrong. We absolutely do not want to complicate the gup path.
> 
> Let's get rid of those broken p??_access_permitted() things.

So we actually need the pte_access_permitted() stuff if we want to
ensure we're not stepping on !PAGE_USER things.



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-15  5:04           ` Dave Hansen
  2017-12-15  6:09             ` Linus Torvalds
@ 2017-12-15  8:00             ` Peter Zijlstra
  2017-12-15 10:25               ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-15  8:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, tglx, x86, Linus Torvalds, Andy Lutomirsky,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Thu, Dec 14, 2017 at 09:04:56PM -0800, Dave Hansen wrote:
> 
> I've got some additions to the selftests and a fix where we pass FOLL_*
> flags around a bit more instead of just 'write'.  I'll get those out as
> soon as I do a bit more testing.

Try the below; I have more in the works, but this already fixes a whole
bunch of obvious fail and should fix the case I described.

The thing is, you should _never_ return NULL for an access error, that's
complete crap.

You should also not blindly change every pte_write() test to
pte_access_permitted(), that's also wrong, because then you're missing
the read-access tests.

Basically you need to very carefully audit each and every
p??_access_permitted() call; they're currently mostly wrong.

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -66,7 +66,7 @@ static int follow_pfn_pte(struct vm_area
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_access_permitted(pte, WRITE) ||
+	return pte_write(pte) ||
 		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
 }
 
@@ -153,6 +153,11 @@ static struct page *follow_page_pte(stru
 	}
 
 	if (flags & FOLL_GET) {
+		if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
+			page = ERR_PTR(-EFAULT);
+			goto out;
+		}
+
 		get_page(page);
 
 		/* drop the pgmap reference now that we hold the page */
@@ -244,6 +249,15 @@ static struct page *follow_pmd_mask(stru
 			pmd_migration_entry_wait(mm, pmd);
 		goto retry;
 	}
+
+	if (flags & FOLL_GET) {
+		if (!pmd_access_permitted(*pmd, !!(flags & FOLL_WRITE))) {
+			page = ERR_PTR(-EFAULT);
+			spin_unlock(ptl);
+			return page;
+		}
+	}
+
 	if (pmd_devmap(*pmd)) {
 		ptl = pmd_lock(mm, pmd);
 		page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -326,6 +340,15 @@ static struct page *follow_pud_mask(stru
 			return page;
 		return no_page_table(vma, flags);
 	}
+
+	if (flags & FOLL_GET) {
+		if (!pud_access_permitted(*pud, !!(flags & FOLL_WRITE))) {
+			page = ERR_PTR(-EFAULT);
+			spin_unlock(ptl);
+			return page;
+		}
+	}
+
 	if (pud_devmap(*pud)) {
 		ptl = pud_lock(mm, pud);
 		page = follow_devmap_pud(vma, address, pud, flags);
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -870,9 +870,6 @@ struct page *follow_devmap_pmd(struct vm
 	 */
 	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
 
-	if (!pmd_access_permitted(*pmd, flags & FOLL_WRITE))
-		return NULL;
-
 	if (pmd_present(*pmd) && pmd_devmap(*pmd))
 		/* pass */;
 	else
@@ -1012,9 +1009,6 @@ struct page *follow_devmap_pud(struct vm
 
 	assert_spin_locked(pud_lockptr(mm, pud));
 
-	if (!pud_access_permitted(*pud, flags & FOLL_WRITE))
-		return NULL;
-
 	if (pud_present(*pud) && pud_devmap(*pud))
 		/* pass */;
 	else
@@ -1386,7 +1380,7 @@ int do_huge_pmd_wp_page(struct vm_fault
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_access_permitted(pmd, WRITE) ||
+	return pmd_write(pmd) ||
 	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
 }
 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-15  8:00             ` Peter Zijlstra
@ 2017-12-15 10:25               ` Peter Zijlstra
  2017-12-15 11:38                 ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-15 10:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, tglx, x86, Linus Torvalds, Andy Lutomirsky,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Fri, Dec 15, 2017 at 09:00:41AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 09:04:56PM -0800, Dave Hansen wrote:
> > 
> > I've got some additions to the selftests and a fix where we pass FOLL_*
> > flags around a bit more instead of just 'write'.  I'll get those out as
> > soon as I do a bit more testing.
> 
> Try the below; I have more in the works, but this already fixes a whole
> bunch of obvious fail and should fix the case I described.
> 
> The thing is, you should _never_ return NULL for an access error, that's
> complete crap.
> 
> You should also not blindly change every pte_write() test to
> pte_access_permitted(), that's also wrong, because then you're missing
> the read-access tests.
> 
> Basically you need to very carefully audit each and every
> p??_access_permitted() call; they're currently mostly wrong.

I think we also want this:

diff --git a/fs/dax.c b/fs/dax.c
index 78b72c48374e..95981591977a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -627,8 +627,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 
 			if (pfn != pmd_pfn(*pmdp))
 				goto unlock_pmd;
-			if (!pmd_dirty(*pmdp)
-					&& !pmd_access_permitted(*pmdp, WRITE))
+			if (!pmd_dirty(*pmdp) && !pmd_write(*pmdp))
 				goto unlock_pmd;
 
 			flush_cache_page(vma, address, pfn);
diff --git a/mm/memory.c b/mm/memory.c
index 5eb3d2524bdc..6ce3ba12e07d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3948,7 +3948,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
 	if (unlikely(!pte_same(*vmf->pte, entry)))
 		goto unlock;
 	if (vmf->flags & FAULT_FLAG_WRITE) {
-		if (!pte_access_permitted(entry, WRITE))
+		if (!pte_write(entry))
 			return do_wp_page(vmf);
 		entry = pte_mkdirty(entry);
 	}


the DAX one is both inconsistent (only the PMD case, not also the PTE
case) and just wrong; you don't want PKEYs to avoid cleaning pages,
that's crazy.

The memory one is also clearly wrong, not having access does not a write
fault make. If we have pte_write() set we should not do_wp_page() just
because we don't have access. This falls under the "doing anything other
than hard failure for !access is crazy" header.


Still looking at __handle_mm_fault(), they smell bad, but I can't get my
brain started today :/


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-15 10:25               ` Peter Zijlstra
@ 2017-12-15 11:38                 ` Peter Zijlstra
  2017-12-15 16:38                   ` Dan Williams
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-15 11:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, tglx, x86, Linus Torvalds, Andy Lutomirsky,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Fri, Dec 15, 2017 at 11:25:29AM +0100, Peter Zijlstra wrote:
> The memory one is also clearly wrong, not having access does not a write
> fault make. If we have pte_write() set we should not do_wp_page() just
> because we don't have access. This falls under the "doing anything other
> than hard failure for !access is crazy" header.

So per the very same reasoning I think the below is warranted too; also
rename that @dirty variable, because it's also wrong.

diff --git a/mm/memory.c b/mm/memory.c
index 5eb3d2524bdc..0d43b347eb0a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3987,7 +3987,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
 	};
-	unsigned int dirty = flags & FAULT_FLAG_WRITE;
+	unsigned int write = flags & FAULT_FLAG_WRITE;
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4013,7 +4013,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 
 			/* NUMA case for anonymous PUDs would go here */
 
-			if (dirty && !pud_access_permitted(orig_pud, WRITE)) {
+			if (write && !pud_write(orig_pud)) {
 				ret = wp_huge_pud(&vmf, orig_pud);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
@@ -4046,7 +4046,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
 
-			if (dirty && !pmd_access_permitted(orig_pmd, WRITE)) {
+			if (write && !pmd_write(orig_pmd)) {
 				ret = wp_huge_pmd(&vmf, orig_pmd);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;


I still cannot make sense of what the intention behind these changes
was; the Changelog that went with them is utter crap, it doesn't
explain anything.


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-14 14:37     ` Peter Zijlstra
  2017-12-14 20:44       ` Dave Hansen
@ 2017-12-15 14:04       ` Peter Zijlstra
  1 sibling, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-15 14:04 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Dave Hansen,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, dan.j.williams

On Thu, Dec 14, 2017 at 03:37:30PM +0100, Peter Zijlstra wrote:

> Kirill did point out that my patch(es) break FOLL_DUMP in that it would
> now exclude pkey protected pages from core-dumps.
> 
> My counter argument is that it will now properly exclude !_PAGE_USER
> pages.
> 
> If we change p??_access_permitted() to pass the full follow flags
> instead of just the write part we could fix that.

Something like the completely untested below would do that.

Then we'd need this on top:

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1232,7 +1232,10 @@ __pte_access_permitted(unsigned long pte
 		need_pte_bits |= _PAGE_RW;
 
 	if ((pteval & need_pte_bits) != need_pte_bits)
-		return 0;
+		return false;
+
+	if (foll_flags & FOLL_DUMP)
+		return true;
 
 	return __pkru_allows_pkey(pte_flags_pkey(pteval), write);
 }

But it is rather ugly... :/


---

--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -11,6 +11,7 @@
 #define _ASMARM_PGTABLE_H
 
 #include <linux/const.h>
+#include <linux/foll_flags.h>
 #include <asm/proc-fns.h>
 
 #ifndef CONFIG_MMU
@@ -232,12 +233,12 @@ static inline pte_t *pmd_page_vaddr(pmd_
 #define pte_valid_user(pte)	\
 	(pte_valid(pte) && pte_isset((pte), L_PTE_USER) && pte_young(pte))
 
-static inline bool pte_access_permitted(pte_t pte, bool write)
+static inline bool pte_access_permitted(pte_t pte, unsigned int foll_flags)
 {
 	pteval_t mask = L_PTE_PRESENT | L_PTE_USER;
 	pteval_t needed = mask;
 
-	if (write)
+	if (foll_flags & FOLL_WRITE)
 		mask |= L_PTE_RDONLY;
 
 	return (pte_val(pte) & mask) == needed;
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -16,6 +16,8 @@
 #ifndef __ASM_PGTABLE_H
 #define __ASM_PGTABLE_H
 
+#include <linux/foll_flags.h>
+
 #include <asm/bug.h>
 #include <asm/proc-fns.h>
 
@@ -114,12 +116,23 @@ extern unsigned long empty_zero_page[PAG
  * write permission check) other than user execute-only which do not have the
  * PTE_USER bit set. PROT_NONE mappings do not have the PTE_VALID bit set.
  */
-#define pte_access_permitted(pte, write) \
-	(pte_valid_user(pte) && (!(write) || pte_write(pte)))
-#define pmd_access_permitted(pmd, write) \
-	(pte_access_permitted(pmd_pte(pmd), (write)))
-#define pud_access_permitted(pud, write) \
-	(pte_access_permitted(pud_pte(pud), (write)))
+static inline bool __pte_access_permitted(pte_t pte, unsigned int foll_flags)
+{
+	if (!pte_valid_user(pte))
+		return false;
+
+	if (foll_flags & FOLL_WRITE)
+		return !!pte_write(pte);
+
+	return true;
+}
+
+#define pte_access_permitted(pte, foll_flags) \
+	(__pte_access_permitted((pte), (foll_flags)))
+#define pmd_access_permitted(pmd, foll_flags) \
+	(pte_access_permitted(pmd_pte(pmd), (foll_flags)))
+#define pud_access_permitted(pud, foll_flags) \
+	(pte_access_permitted(pud_pte(pud), (foll_flags)))
 
 static inline pte_t clear_pte_bit(pte_t pte, pgprot_t prot)
 {
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -75,7 +75,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd
 	if (!(pmd_val(pmd) & _PAGE_VALID))
 		return 0;
 
-	if (!pmd_access_permitted(pmd, write))
+	if (!pmd_access_permitted(pmd, !!write * FOLL_WRITE))
 		return 0;
 
 	refs = 0;
@@ -114,7 +114,7 @@ static int gup_huge_pud(pud_t *pudp, pud
 	if (!(pud_val(pud) & _PAGE_VALID))
 		return 0;
 
-	if (!pud_access_permitted(pud, write))
+	if (!pud_access_permitted(pud, !!write * FOLL_WRITE))
 		return 0;
 
 	refs = 0;
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -2,6 +2,7 @@
 #ifndef _ASM_X86_PGTABLE_H
 #define _ASM_X86_PGTABLE_H
 
+#include <linux/foll_flags.h>
 #include <linux/mem_encrypt.h>
 #include <asm/page.h>
 #include <asm/pgtable_types.h>
@@ -1221,9 +1222,11 @@ static inline bool __pkru_allows_pkey(u1
  * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
  * same value on all 3 types.
  */
-static inline bool __pte_access_permitted(unsigned long pteval, bool write)
+static inline bool
+__pte_access_permitted(unsigned long pteval, unsigned int foll_flags)
 {
 	unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
+	bool write = foll_flags & FOLL_WRITE;
 
 	if (write)
 		need_pte_bits |= _PAGE_RW;
@@ -1235,21 +1238,21 @@ static inline bool __pte_access_permitte
 }
 
 #define pte_access_permitted pte_access_permitted
-static inline bool pte_access_permitted(pte_t pte, bool write)
+static inline bool pte_access_permitted(pte_t pte, unsigned int foll_flags)
 {
-	return __pte_access_permitted(pte_val(pte), write);
+	return __pte_access_permitted(pte_val(pte), foll_flags);
 }
 
 #define pmd_access_permitted pmd_access_permitted
-static inline bool pmd_access_permitted(pmd_t pmd, bool write)
+static inline bool pmd_access_permitted(pmd_t pmd, unsigned int foll_flags)
 {
-	return __pte_access_permitted(pmd_val(pmd), write);
+	return __pte_access_permitted(pmd_val(pmd), foll_flags);
 }
 
 #define pud_access_permitted pud_access_permitted
-static inline bool pud_access_permitted(pud_t pud, bool write)
+static inline bool pud_access_permitted(pud_t pud, unsigned int foll_flags)
 {
-	return __pte_access_permitted(pud_val(pud), write);
+	return __pte_access_permitted(pud_val(pud), foll_flags);
 }
 
 #include <asm-generic/pgtable.h>
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -2,6 +2,7 @@
 #ifndef _ASM_GENERIC_PGTABLE_H
 #define _ASM_GENERIC_PGTABLE_H
 
+#include <linux/foll_flags.h>
 #include <linux/pfn.h>
 
 #ifndef __ASSEMBLY__
@@ -343,28 +344,28 @@ static inline int pte_unused(pte_t pte)
 #endif
 
 #ifndef pte_access_permitted
-#define pte_access_permitted(pte, write) \
-	(pte_present(pte) && (!(write) || pte_write(pte)))
+#define pte_access_permitted(pte, foll_flags) \
+	(pte_present(pte) && (!(foll_flags & FOLL_WRITE) || pte_write(pte)))
 #endif
 
 #ifndef pmd_access_permitted
-#define pmd_access_permitted(pmd, write) \
-	(pmd_present(pmd) && (!(write) || pmd_write(pmd)))
+#define pmd_access_permitted(pmd, foll_flags) \
+	(pmd_present(pmd) && (!(foll_flags & FOLL_WRITE) || pmd_write(pmd)))
 #endif
 
 #ifndef pud_access_permitted
-#define pud_access_permitted(pud, write) \
-	(pud_present(pud) && (!(write) || pud_write(pud)))
+#define pud_access_permitted(pud, foll_flags) \
+	(pud_present(pud) && (!(foll_flags & FOLL_WRITE) || pud_write(pud)))
 #endif
 
 #ifndef p4d_access_permitted
-#define p4d_access_permitted(p4d, write) \
-	(p4d_present(p4d) && (!(write) || p4d_write(p4d)))
+#define p4d_access_permitted(p4d, foll_flags) \
+	(p4d_present(p4d) && (!(foll_flags & FOLL_WRITE) || p4d_write(p4d)))
 #endif
 
 #ifndef pgd_access_permitted
-#define pgd_access_permitted(pgd, write) \
-	(pgd_present(pgd) && (!(write) || pgd_write(pgd)))
+#define pgd_access_permitted(pgd, foll_flags) \
+	(pgd_present(pgd) && (!(foll_flags & FOLL_WRITE) || pgd_write(pgd)))
 #endif
 
 #ifndef __HAVE_ARCH_PMD_SAME
--- /dev/null
+++ b/include/linux/foll_flags.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_FOLL_FLAGS_H
+#define _LINUX_FOLL_FLAGS_H
+
+#define FOLL_WRITE	0x01	/* check pte is writable */
+#define FOLL_TOUCH	0x02	/* mark page accessed */
+#define FOLL_GET	0x04	/* do get_page on page */
+#define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
+#define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
+				 * and return without waiting upon it */
+#define FOLL_POPULATE	0x40	/* fault in page */
+#define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
+#define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
+#define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
+#define FOLL_MLOCK	0x1000	/* lock present pages */
+#define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
+#define FOLL_COW	0x4000	/* internal GUP flag */
+
+#endif /* _LINUX_FOLL_FLAGS_H */
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -6,6 +6,7 @@
 
 #ifdef __KERNEL__
 
+#include <linux/foll_flags.h>
 #include <linux/mmdebug.h>
 #include <linux/gfp.h>
 #include <linux/bug.h>
@@ -2419,23 +2420,6 @@ static inline struct page *follow_page(s
 	return follow_page_mask(vma, address, foll_flags, &unused_page_mask);
 }
 
-#define FOLL_WRITE	0x01	/* check pte is writable */
-#define FOLL_TOUCH	0x02	/* mark page accessed */
-#define FOLL_GET	0x04	/* do get_page on page */
-#define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
-#define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
-				 * and return without waiting upon it */
-#define FOLL_POPULATE	0x40	/* fault in page */
-#define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
-#define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
-#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
-#define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
-#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
-#define FOLL_MLOCK	0x1000	/* lock present pages */
-#define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
-#define FOLL_COW	0x4000	/* internal GUP flag */
-
 static inline int vm_fault_to_errno(int vm_fault, int foll_flags)
 {
 	if (vm_fault & VM_FAULT_OOM)
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -153,7 +153,7 @@ static struct page *follow_page_pte(stru
 	}
 
 	if (flags & FOLL_GET) {
-		if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
+		if (!pte_access_permitted(pte, flags)) {
 			page = ERR_PTR(-EFAULT);
 			goto out;
 		}
@@ -251,7 +251,7 @@ static struct page *follow_pmd_mask(stru
 	}
 
 	if (flags & FOLL_GET) {
-		if (!pmd_access_permitted(*pmd, !!(flags & FOLL_WRITE))) {
+		if (!pmd_access_permitted(*pmd, flags)) {
 			page = ERR_PTR(-EFAULT);
 			spin_unlock(ptl);
 			return page;
@@ -342,7 +342,7 @@ static struct page *follow_pud_mask(stru
 	}
 
 	if (flags & FOLL_GET) {
-		if (!pud_access_permitted(*pud, !!(flags & FOLL_WRITE))) {
+		if (!pud_access_permitted(*pud, flags)) {
 			page = ERR_PTR(-EFAULT);
 			spin_unlock(ptl);
 			return page;
@@ -1407,7 +1407,7 @@ static int gup_pte_range(pmd_t pmd, unsi
 		if (pte_protnone(pte))
 			goto pte_unmap;
 
-		if (!pte_access_permitted(pte, write))
+		if (!pte_access_permitted(pte, !!write * FOLL_WRITE))
 			goto pte_unmap;
 
 		if (pte_devmap(pte)) {
@@ -1528,7 +1528,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_
 	struct page *head, *page;
 	int refs;
 
-	if (!pmd_access_permitted(orig, write))
+	if (!pmd_access_permitted(orig, !!write * FOLL_WRITE))
 		return 0;
 
 	if (pmd_devmap(orig))
@@ -1566,7 +1566,7 @@ static int gup_huge_pud(pud_t orig, pud_
 	struct page *head, *page;
 	int refs;
 
-	if (!pud_access_permitted(orig, write))
+	if (!pud_access_permitted(orig, !!write * FOLL_WRITE))
 		return 0;
 
 	if (pud_devmap(orig))
@@ -1605,7 +1605,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_
 	int refs;
 	struct page *head, *page;
 
-	if (!pgd_access_permitted(orig, write))
+	if (!pgd_access_permitted(orig, !!write * FOLL_WRITE))
 		return 0;
 
 	BUILD_BUG_ON(pgd_devmap(orig));
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -391,11 +391,11 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (pmd_protnone(pmd))
 			return hmm_vma_walk_clear(start, end, walk);
 
-		if (!pmd_access_permitted(pmd, write_fault))
+		if (!pmd_access_permitted(pmd, !!write_fault * FOLL_WRITE))
 			return hmm_vma_walk_clear(start, end, walk);
 
 		pfn = pmd_pfn(pmd) + pte_index(addr);
-		flag |= pmd_access_permitted(pmd, WRITE) ? HMM_PFN_WRITE : 0;
+		flag |= pmd_access_permitted(pmd, FOLL_WRITE) ? HMM_PFN_WRITE : 0;
 		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
 			pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
 		return 0;
@@ -456,11 +456,11 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			continue;
 		}
 
-		if (!pte_access_permitted(pte, write_fault))
+		if (!pte_access_permitted(pte, !!write_fault * FOLL_WRITE))
 			goto fault;
 
 		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
-		pfns[i] |= pte_access_permitted(pte, WRITE) ? HMM_PFN_WRITE : 0;
+		pfns[i] |= pte_access_permitted(pte, FOLL_WRITE) ? HMM_PFN_WRITE : 0;
 		continue;
 
 fault:
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4336,7 +4336,7 @@ int follow_phys(struct vm_area_struct *v
 		goto out;
 	pte = *ptep;
 
-	if (!pte_access_permitted(pte, flags & FOLL_WRITE))
+	if (!pte_access_permitted(pte, flags))
 		goto unlock;
 
 	*prot = pgprot_val(pte_pgprot(pte));


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-15 11:38                 ` Peter Zijlstra
@ 2017-12-15 16:38                   ` Dan Williams
  2017-12-18 11:54                     ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Dan Williams @ 2017-12-15 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, linux-kernel, Thomas Gleixner, X86 ML,
	Linus Torvalds, Andy Lutomirsky, Borislav Petkov, Greg KH,
	keescook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, Linux MM,
	Kirill A. Shutemov

On Fri, Dec 15, 2017 at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Dec 15, 2017 at 11:25:29AM +0100, Peter Zijlstra wrote:
>> The memory one is also clearly wrong: not having access does not a write
>> fault make. If we have pte_write() set, we should not do_wp_page() just
>> because we don't have access. This falls under the "doing anything other
>> than hard failure for !access is crazy" header.
>
> So per the very same reasoning, I think the below is warranted too; also
> rename that @dirty variable, because it's also wrong.
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5eb3d2524bdc..0d43b347eb0a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3987,7 +3987,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>                 .pgoff = linear_page_index(vma, address),
>                 .gfp_mask = __get_fault_gfp_mask(vma),
>         };
> -       unsigned int dirty = flags & FAULT_FLAG_WRITE;
> +       unsigned int write = flags & FAULT_FLAG_WRITE;
>         struct mm_struct *mm = vma->vm_mm;
>         pgd_t *pgd;
>         p4d_t *p4d;
> @@ -4013,7 +4013,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>
>                         /* NUMA case for anonymous PUDs would go here */
>
> -                       if (dirty && !pud_access_permitted(orig_pud, WRITE)) {
> +                       if (write && !pud_write(orig_pud)) {
>                                 ret = wp_huge_pud(&vmf, orig_pud);
>                                 if (!(ret & VM_FAULT_FALLBACK))
>                                         return ret;
> @@ -4046,7 +4046,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>                         if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
>                                 return do_huge_pmd_numa_page(&vmf, orig_pmd);
>
> -                       if (dirty && !pmd_access_permitted(orig_pmd, WRITE)) {
> +                       if (write && !pmd_write(orig_pmd)) {
>                                 ret = wp_huge_pmd(&vmf, orig_pmd);
>                                 if (!(ret & VM_FAULT_FALLBACK))
>                                         return ret;
>
>
> I still cannot make sense of what the intention behind these changes
> was; the Changelog that went with them is utter crap and doesn't
> explain anything.

The motivation was that I noticed that get_user_pages_fast() was doing
a full pud_access_permitted() check, but the get_user_pages() slow
path was only doing a pud_write() check. That was inconsistent, so I
went to resolve that across all the pte types and ended up making a
mess of things. I'm fine if the answer is that we should have gone the
other way to only do write checks. However, when I was investigating
which way to go, the aspect that persuaded me to start sprinkling
p??_access_permitted checks around was that the application behavior
changed between mmap access and direct-i/o access to the same buffer.
I assumed that different access behavior between those would be an
inconsistent surprise to userspace. Although infinitely looping in
handle_mm_fault is an even worse surprise; apologies for that.
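
Concretely, the mismatch looked roughly like this (a condensed sketch,
not the exact code of either path at the time):

	/* gup-fast path: the full check (write, _PAGE_USER, pkeys) */
	if (!pud_access_permitted(orig, write))
		return 0;

	/* gup slow path: only the write bit */
	if ((flags & FOLL_WRITE) && !pud_write(*pud))
		return NULL;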


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-15  7:51               ` Peter Zijlstra
@ 2017-12-16  0:20                 ` Linus Torvalds
  2017-12-16  0:29                   ` Dan Williams
  2017-12-16  0:31                   ` Al Viro
  0 siblings, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2017-12-16  0:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Linux Kernel Mailing List, Thomas Gleixner,
	the arch/x86 maintainers, Andy Lutomirsky, Borislav Petkov,
	Greg KH, Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Dan Williams

On Thu, Dec 14, 2017 at 11:51 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So we actually need the pte_access_permitted() stuff if we want to
> ensure we're not stepping on !PAGE_USER things.

We really don't. Not in that complex and broken format, and not for every level.

Also, while I think we *should* check the PAGE_USER bit when walking
the page tables, like we used to, we should

 (a) do it much more simply, not with that broken interface that takes
insane and pointless flags

 (b) not tie it together with this issue at all, since the PAGE_USER
thing really is largely immaterial.

The fact is, if we have non-user mappings in the user part of the
address space, we _need_ to teach access_ok() about them, because
fundamentally any "get_user()/put_user()" will happily ignore the lack
of PAGE_USER (since those happen from kernel space).
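
To spell out why: access_ok() is a pure address-range check; it never
looks at the page tables. Roughly (a simplified sketch of the x86
version, ignoring the VERIFY_READ/VERIFY_WRITE argument and the
overflow handling):

	/* Range check against the thread's address limit only;
	 * the PTE's _PAGE_USER bit is never consulted here. */
	#define access_ok(addr, size) \
		((unsigned long)(addr) + (size) <= user_addr_max())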

So I'd like to check PAGE_USER in GUP simply because it's a simple
sanity check, not because it is important.

And that whole "p??_access_permitted() checks against the current
PKRU" is just incredible shit. It's currently broken, exactly because
"current PKRU" isn't even well-defined when you do it across different
threads, much less different address spaces.

This is why I'm 100% convinced that the current
"p??_access_permitted()" is just pure and utter garbage. And it's
garbage at a _fundamental_ level, not because of some small
implementation detail.

            Linus


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  0:20                 ` Linus Torvalds
@ 2017-12-16  0:29                   ` Dan Williams
  2017-12-16  1:10                     ` Linus Torvalds
  2017-12-16  0:31                   ` Al Viro
  1 sibling, 1 reply; 76+ messages in thread
From: Dan Williams @ 2017-12-16  0:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Dave Hansen, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Andy Lutomirsky,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, Liguori, Anthony, Will Deacon,
	linux-mm, Kirill A. Shutemov

On Fri, Dec 15, 2017 at 4:20 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Dec 14, 2017 at 11:51 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So we actually need the pte_access_permitted() stuff if we want to
> > ensure we're not stepping on !PAGE_USER things.
>
> We really don't. Not in that complex and broken format, and not for every level.
>
> Also, while I think we *should* check the PAGE_USER bit when walking
> the page tables, like we used to, we should
>
>  (a) do it much more simply, not with that broken interface that takes
> insane and pointless flags
>
>  (b) not tie it together with this issue at all, since the PAGE_USER
> thing really is largely immaterial.
>
> The fact is, if we have non-user mappings in the user part of the
> address space, we _need_ to teach access_ok() about them, because
> fundamentally any "get_user()/put_user()" will happily ignore the lack
> of PAGE_USER (since those happen from kernel space).
>
> So I'd like to check PAGE_USER in GUP simply because it's a simple
> sanity check, not because it is important.
>
> And that whole "p??_access_permitted() checks against the current
> PKRU" is just incredible shit. It's currently broken, exactly because
> "current PKRU" isn't even well-defined when you do it across different
> threads, much less different address spaces.
>
> This is why I'm 100% convinced that the current
> "p??_access_permitted()" is just pure and utter garbage. And it's
> garbage at a _fundamental_ level, not because of some small
> implementation detail.

So do you want to do a straight revert of these that went in for 4.15:

5c9d2d5c269c mm: replace pte_write with pte_access_permitted in fault + gup paths
c7da82b894e9 mm: replace pmd_write with pmd_access_permitted in fault + gup paths
e7fe7b5cae90 mm: replace pud_write with pud_access_permitted in fault + gup paths

...or take Peter's patches that are trying to fix up the damage?


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  0:20                 ` Linus Torvalds
  2017-12-16  0:29                   ` Dan Williams
@ 2017-12-16  0:31                   ` Al Viro
  2017-12-16  1:05                     ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Al Viro @ 2017-12-16  0:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Dave Hansen, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Andy Lutomirsky,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, Liguori, Anthony, Will Deacon,
	linux-mm, Kirill A. Shutemov, Dan Williams

On Fri, Dec 15, 2017 at 04:20:31PM -0800, Linus Torvalds wrote:
> On Thu, Dec 14, 2017 at 11:51 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So we actually need the pte_access_permitted() stuff if we want to
> > ensure we're not stepping on !PAGE_USER things.
> 
> We really don't. Not in that complex and broken format, and not for every level.
> 
> Also, while I think we *should* check the PAGE_USER bit when walking
> the page tables, like we used to, we should
> 
>  (a) do it much more simply, not with that broken interface that takes
> insane and pointless flags
> 
>  (b) not tie it together with this issue at all, since the PAGE_USER
> thing really is largely immaterial.
> 
> The fact is, if we have non-user mappings in the user part of the
> address space, we _need_ to teach access_ok() about them, because
> fundamentally any "get_user()/put_user()" will happily ignore the lack
> of PAGE_USER (since those happen from kernel space).

Details, please - how *can* access_ok() be taught about that?


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  0:31                   ` Al Viro
@ 2017-12-16  1:05                     ` Linus Torvalds
  0 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2017-12-16  1:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Peter Zijlstra, Dave Hansen, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Andy Lutomirsky,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, Liguori, Anthony, Will Deacon,
	linux-mm, Kirill A. Shutemov, Dan Williams

On Fri, Dec 15, 2017 at 4:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> The fact is, if we have non-user mappings in the user part of the
>> address space, we _need_ to teach access_ok() about them, because
>> fundamentally any "get_user()/put_user()" will happily ignore the lack
>> of PAGE_USER (since those happen from kernel space).
>
> Details, please - how *can* access_ok() be taught about that?

We'd have to do something like put the !PAGE_USER mapping at the top
of the user address space, and then simply make user_addr_max()
smaller than the actual user page table size.

Or some other silly hack.
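
Purely as an illustration of that layout (hypothetical, not what we do
today):

	/*
	 * 0 .. user_addr_max()-1        normal PAGE_USER mappings
	 * user_addr_max() .. TASK_SIZE  the !PAGE_USER special mapping
	 *
	 * access_ok() then rejects the special range, so
	 * get_user()/put_user() can never reach it, while the page
	 * tables still map it.
	 */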

I do not believe there is any sane way to have !PAGE_USER in
_general_, if you actually want to limit access to it.

(We _could_ use !PAGE_USER for things that aren't really strictly
about security - ie  we could have used it for the NUMA balancing
instead of using the P bit, and just let put_user/get_user blow
through them).

             Linus


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  0:29                   ` Dan Williams
@ 2017-12-16  1:10                     ` Linus Torvalds
  2017-12-16  1:25                       ` Dave Hansen
  2017-12-16  1:29                       ` Dan Williams
  0 siblings, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2017-12-16  1:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Peter Zijlstra, Dave Hansen, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Andy Lutomirsky,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, Liguori, Anthony, Will Deacon,
	linux-mm, Kirill A. Shutemov

On Fri, Dec 15, 2017 at 4:29 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> So do you want to do a straight revert of these that went in for 4.15:

I think that's the right thing to do, but would want to verify that
there are no *other* issues beyond just the attempt at PKRU.

The commit message does talk about PAGE_USER, and as mentioned I do
think that's a good thing to check; I just don't think it should be
done this way.

Was there something else going on behind these commits? Because if not,
let's revert and then perhaps later introduce a more targeted thing.

Also, aren't the protection keys encoded in the vma?

Because *if* we want to check protection keys, I think we should do
that at the vma layer, partly exactly because the exact implementation
of protection keys is so architecture-specific, and partly because I
don't think it makes sense to check them for every page anyway.

               Linus


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  1:10                     ` Linus Torvalds
@ 2017-12-16  1:25                       ` Dave Hansen
  2017-12-16  2:28                         ` Linus Torvalds
  2017-12-16  1:29                       ` Dan Williams
  1 sibling, 1 reply; 76+ messages in thread
From: Dave Hansen @ 2017-12-16  1:25 UTC (permalink / raw)
  To: Linus Torvalds, Dan Williams
  Cc: Peter Zijlstra, Linux Kernel Mailing List, Thomas Gleixner,
	the arch/x86 maintainers, Andy Lutomirsky, Borislav Petkov,
	Greg KH, Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov

On 12/15/2017 05:10 PM, Linus Torvalds wrote:
> Because *if* we want to check protection keys, I think we should do
> that at the vma layer, partly exactly because the exact implementation
> of protection keys is so architecture-specific, and partly because I
> don't think it makes sense to check them for every page anyway.

So, there are VMA checks against protection keys.  The problem _here_ is
that we are checking against the VMA (and correctly skipping the PKRU
checks) and then _mistakenly_ applying the PTE checks against PKRU.

I think the reason we needed VMA and PTE checks was the
get_user_pages_fast() path not having a VMA.

I need to go re-read the commits, though.


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  1:10                     ` Linus Torvalds
  2017-12-16  1:25                       ` Dave Hansen
@ 2017-12-16  1:29                       ` Dan Williams
  1 sibling, 0 replies; 76+ messages in thread
From: Dan Williams @ 2017-12-16  1:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Dave Hansen, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Andy Lutomirsky,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, Liguori, Anthony, Will Deacon,
	linux-mm, Kirill A. Shutemov

On Fri, Dec 15, 2017 at 5:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Dec 15, 2017 at 4:29 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> So do you want to do a straight revert of these that went in for 4.15:
>
> I think that's the right thing to do, but would want to verify that
> there are no *other* issues beyond just the attempt at PKRU.
>
> The commit message does talk about PAGE_USER, and as mentioned I do
> think that's a good thing to check; I just don't think it should be
> done this way.
>
> Was there something else going on behind these commits? Because if not,
> let's revert and then perhaps later introduce a more targeted thing.

Yes, these three can be safely reverted.

    5c9d2d5c269c mm: replace pte_write with pte_access_permitted...
    c7da82b894e9 mm: replace pmd_write with pmd_access_permitted...
    e7fe7b5cae90 mm: replace pud_write with pud_access_permitted...

They were part of a 4-patch series where this lead patch below is the
one fix we actually need.

    1501899a898d mm: fix device-dax pud write-faults triggered by...

---

Now, the original access_permitted was born out of a cleanup that
introduced pte_allows_gup(); this is where the PAGE_USER check came
from:

    1874f6895c92 x86/mm/gup: Simplify get_user_pages() PTE bit handling

...and that helper later grew pkey check support here:

   33a709b25a76 mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys

...sometime later it was all renamed and made kernel-global here when
the x86 gup implementation was converted to use the common
implementation:

    e7884f8ead4a mm/gup: Move permission checks into helpers

All this to say that these are not revert candidates and need
incremental patches if we want to back out the pkey checking for the
gup-fast path and re-work the PAGE_USER checking.


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  1:25                       ` Dave Hansen
@ 2017-12-16  2:28                         ` Linus Torvalds
  2017-12-16  2:48                           ` Al Viro
  0 siblings, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2017-12-16  2:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dan Williams, Peter Zijlstra, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Andy Lutomirsky,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, Liguori, Anthony, Will Deacon,
	linux-mm, Kirill A. Shutemov

On Fri, Dec 15, 2017 at 5:25 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> I think the reason we needed VMA and PTE checks was the
> get_user_pages_fast() path not having a VMA.

That is indeed the point of get_user_pages_fast(): no vma lookup, no
locking, just "do the default case as streamlined as possible".

But part of it is also that we should fall back to the slow case if
the fast case doesn't work (eg because the page isn't there or
whatever).

So what we could do - perhaps - is to just make get_user_pages_fast()
check whether any of the protection key bits are set, and fail for
that case.

If we care, that is.
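
The check could be as simple as this in the fast-path walkers (a
sketch; x86-specific as written, assuming _PAGE_PKEY_MASK covers the
four pkey PTE bits):

	/* Any non-zero protection key: punt to the slow path, which
	 * has the VMA and can do a real permission check. */
	if (pte_flags(pte) & _PAGE_PKEY_MASK)
		goto pte_unmap;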

               Linus


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  2:28                         ` Linus Torvalds
@ 2017-12-16  2:48                           ` Al Viro
  2017-12-16  2:52                             ` Linus Torvalds
  0 siblings, 1 reply; 76+ messages in thread
From: Al Viro @ 2017-12-16  2:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Dan Williams, Peter Zijlstra,
	Linux Kernel Mailing List, Thomas Gleixner,
	the arch/x86 maintainers, Andy Lutomirsky, Borislav Petkov,
	Greg KH, Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov

On Fri, Dec 15, 2017 at 06:28:36PM -0800, Linus Torvalds wrote:
> On Fri, Dec 15, 2017 at 5:25 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > I think the reason we needed VMA and PTE checks was the
> > get_user_pages_fast() path not having a VMA.
> 
> That is indeed the point of get_user_pages_fast(): no vma lookup, no
> locking, just "do the default case as streamlined as possible".
> 
> But part of it is also that we should fall back to the slow case if
> the fast case doesn't work (eg because the page isn't there or
> whatever).
> 
> So what we could do - perhaps - is to just make get_user_pages_fast()
> check whether any of the protection key bits are set, and fail for
> that case.

FWIW, a good description of the fast path in get_user_pages_fast() is
"simulate a TLB miss", the slow path being "... and go for a simulated
page fault if the TLB miss would have escalated to #PF".

Treating protection key bits as "escalate to page fault and let that
deal with the checks" should be fine - the page fault handler must
cope with the page actually being present in the page tables anyway,
for obvious reasons...
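
In code terms, the shape is roughly (condensed from the generic
get_user_pages_fast() of this era):

	nr = __get_user_pages_fast(start, nr_pages, write, pages);
	if (nr < nr_pages) {
		/* the "TLB miss" escalated: take the full fault path */
		ret = get_user_pages_unlocked(start + nr * PAGE_SIZE,
					      nr_pages - nr, pages + nr,
					      write ? FOLL_WRITE : 0);
	}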


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  2:48                           ` Al Viro
@ 2017-12-16  2:52                             ` Linus Torvalds
  2017-12-16  3:00                               ` Linus Torvalds
  2017-12-16  3:21                               ` Dave Hansen
  0 siblings, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2017-12-16  2:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Hansen, Dan Williams, Peter Zijlstra,
	Linux Kernel Mailing List, Thomas Gleixner,
	the arch/x86 maintainers, Andy Lutomirsky, Borislav Petkov,
	Greg KH, Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov

On Fri, Dec 15, 2017 at 6:48 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Treating protection key bits as "escalate to page fault and let that
> deal with the checks" should be fine

Well, it's *semantically* fine and I think it's the right model from
that standpoint.

However, since the main use case of protection keys is probably
databases (Dave?) and since those also might be performance-sensitive
about direct-IO doing page table lookups, it might not be great in
practice.

             Linus


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  2:52                             ` Linus Torvalds
@ 2017-12-16  3:00                               ` Linus Torvalds
  2017-12-16  3:21                               ` Dave Hansen
  1 sibling, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2017-12-16  3:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Hansen, Dan Williams, Peter Zijlstra,
	Linux Kernel Mailing List, Thomas Gleixner,
	the arch/x86 maintainers, Andy Lutomirsky, Borislav Petkov,
	Greg KH, Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov

On Fri, Dec 15, 2017 at 6:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> However, since the main use case of protection keys is probably
> databases (Dave?) and since those also might be performance-sensitive
> about direct-IO doing page table lookups, it might not be great in
> practice.

Anyway, I reverted the three commits Dan pointed to, and we can always
revisit the exact path forward later.

Because regardless of what the eventual future checks will be, for
4.15 I think we'll just go back to doing what we used to do.

If somebody sees problems, holler.

                  Linus


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-16  2:52                             ` Linus Torvalds
  2017-12-16  3:00                               ` Linus Torvalds
@ 2017-12-16  3:21                               ` Dave Hansen
  1 sibling, 0 replies; 76+ messages in thread
From: Dave Hansen @ 2017-12-16  3:21 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: Dan Williams, Peter Zijlstra, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Andy Lutomirsky,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, Liguori, Anthony, Will Deacon,
	linux-mm, Kirill A. Shutemov

On 12/15/2017 06:52 PM, Linus Torvalds wrote:
> On Fri, Dec 15, 2017 at 6:48 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> Treating protection key bits as "escalate to page fault and let that
>> deal with the checks" should be fine
> 
> Well, it's *semantically* fine and I think it's the right model from
> that standpoint.

It's _close_ to fine.  :)

Practically, we're going to have two classes of things in the world:
1. Things that are protected with protection keys and have non-zero bits
   in the pkey PTE bits.
2. Things that are _not_ protected and have zeros in there.

But, in the hardware, *everything* has a pkey.  0 is the default,
obviously, but the hardware treats it the same as all the other values.
So, if we go checking for the "pkey bits being set", and have behavior
diverge when they are set, we end up with pkey=0 being even more special
compared to the rest.

This might be OK, but it's going to be interesting to document and write
tests for it.  I'm already dreading the manpage updates.

> However, since the main use case of protection keys is probably
> databases (Dave?) and since those also might be performance-sensitive
> about direct-IO doing page table lookups, it might not be great in
> practice.

Yeah, databases are definitely the heavy-hitters that care about it.

But, these PKRU checks are cheap.  I forget the actual cycle counts, but
I remember thinking that it's pretty darn cheap to read PKRU.  In the
grand scheme of doing a page table walk and incrementing an atomic, it's
surely in the noise for direct I/O to large pages, which is basically
guaranteed for the database guys.

I did some get_user_pages() torture tests (on small pages IIRC) before I
put the code in and could not detect a delta from the code being there
or not.


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-15 16:38                   ` Dan Williams
@ 2017-12-18 11:54                     ` Peter Zijlstra
  2017-12-18 18:42                       ` Dan Williams
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2017-12-18 11:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, linux-kernel, Thomas Gleixner, X86 ML,
	Linus Torvalds, Andy Lutomirsky, Borislav Petkov, Greg KH,
	keescook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, Linux MM,
	Kirill A. Shutemov

On Fri, Dec 15, 2017 at 08:38:02AM -0800, Dan Williams wrote:

> The motivation was that I noticed that get_user_pages_fast() was doing
> a full pud_access_permitted() check, but the get_user_pages() slow
> path was only doing a pud_write() check. That was inconsistent, so I
> went to resolve that across all the pte types and ended up making a
> mess of things.

> I'm fine if the answer is that we should have gone the
> other way to only do write checks. However, when I was investigating
> which way to go, the aspect that persuaded me to start sprinkling
> p??_access_permitted checks around was that the application behavior
> changed between mmap access and direct-i/o access to the same buffer.

> I assumed that different access behavior between those would be an
> inconsistent surprise to userspace. Although infinitely looping in
> handle_mm_fault is an even worse surprise; apologies for that.

Well, we all make a mess of things at times. I'm certainly guilty of
that, so no worries there. But it really helps if your Changelogs at
least describe what you're trying to do and why.

So I think I covered what you set out to do. In any case, Linus took the
whole lot back out, so we can look at this afresh.


* Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()
  2017-12-18 11:54                     ` Peter Zijlstra
@ 2017-12-18 18:42                       ` Dan Williams
  0 siblings, 0 replies; 76+ messages in thread
From: Dan Williams @ 2017-12-18 18:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, linux-kernel, Thomas Gleixner, X86 ML,
	Linus Torvalds, Andy Lutomirsky, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, Linux MM,
	Kirill A. Shutemov

On Mon, Dec 18, 2017 at 3:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Dec 15, 2017 at 08:38:02AM -0800, Dan Williams wrote:
>
>> The motivation was that I noticed that get_user_pages_fast() was doing
>> a full pud_access_permitted() check, but the get_user_pages() slow
>> path was only doing a pud_write() check. That was inconsistent, so I
>> went to resolve that across all the pte types and ended up making a
>> mess of things.
>
>> I'm fine if the answer is that we should have gone the
>> other way to only do write checks. However, when I was investigating
>> which way to go, the aspect that persuaded me to start sprinkling
>> p??_access_permitted checks around was that the application behavior
>> changed between mmap access and direct-i/o access to the same buffer.
>
>> I assumed that different access behavior between those would be an
>> inconsistent surprise to userspace. Although infinitely looping in
>> handle_mm_fault is an even worse surprise; apologies for that.
>
> Well, we all make a mess of things at times. I'm certainly guilty of
> that, so no worries there. But it really helps if your Changelogs at
> least describe what you're trying to do and why.

Yes, agreed. Unfortunately in this case all those details were
included in the lead-in patch, and should have been duplicated in the
follow-on cleanups. See:

1501899a898d mm: fix device-dax pud write-faults triggered by get_user_pages()

    "For now this just implements a simple check for the _PAGE_RW bit similar
    to pmd_write.  However, this implies that the gup-slow-path check is
    missing the extra checks that the gup-fast-path performs with
    pud_access_permitted.  Later patches will align all checks to use the
    'access_permitted' helper if the architecture provides it."

...and that paragraph would have saved you some wondering.

> So I think I covered what you set out to do. In any case, Linus took the
> whole lot back out, so we can look at this afresh.

Thanks Peter, appreciate it.


* Re: [PATCH v2 02/17] mm: Exempt special mappings from mlock(), mprotect() and madvise()
  2017-12-14 17:36     ` Peter Zijlstra
@ 2018-01-02 16:44       ` Dmitry Safonov
  0 siblings, 0 replies; 76+ messages in thread
From: Dmitry Safonov @ 2018-01-02 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, linux-kernel, Thomas Gleixner, X86 ML,
	Linus Torvalds, Dave Hansen, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm, Kirill A. Shutemov,
	Dan Williams, crml

Hi, sorry for the late reply,

2017-12-14 17:36 GMT+00:00 Peter Zijlstra <peterz@infradead.org>:
> On Thu, Dec 14, 2017 at 08:19:36AM -0800, Andy Lutomirski wrote:
>> On Thu, Dec 14, 2017 at 3:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > It makes no sense to ever prod at special mappings with any of these
>> > syscalls.
>> >
>> > XXX should we include munmap() ?
>>
>> This is an ABI break for the vdso.  Maybe that's okay, but mremap() on
>> the vdso is certainly used, and I can imagine debuggers using
>> mprotect().
>
> *groan*, ok so mremap() will actually still work after this, but yes,
> mprotect() will not. I hadn't figured people would muck with the VDSO
> like that.

mremap() is needed for CRIU, at least.

Please don't restrict munmap(), as ARCH_MAP_VDSO_* allows mapping the
vdso iff it's not already mapped.
We don't need a +w vdso mapping, but I guess that may break gdb
breakpoints on the vdso.

Also, AFAICS, vma_is_special_mapping() has two parameters in linux-next,
and your patch set doesn't change that.

Thanks,
             Dmitry


end of thread

Thread overview: 76+ messages
2017-12-14 11:27 [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted() Peter Zijlstra
2017-12-14 12:41   ` Peter Zijlstra
2017-12-14 14:37     ` Peter Zijlstra
2017-12-14 20:44       ` Dave Hansen
2017-12-14 20:54         ` Peter Zijlstra
2017-12-14 21:18           ` Peter Zijlstra
2017-12-15  5:04           ` Dave Hansen
2017-12-15  6:09             ` Linus Torvalds
2017-12-15  7:51               ` Peter Zijlstra
2017-12-16  0:20                 ` Linus Torvalds
2017-12-16  0:29                   ` Dan Williams
2017-12-16  1:10                     ` Linus Torvalds
2017-12-16  1:25                       ` Dave Hansen
2017-12-16  2:28                         ` Linus Torvalds
2017-12-16  2:48                           ` Al Viro
2017-12-16  2:52                             ` Linus Torvalds
2017-12-16  3:00                               ` Linus Torvalds
2017-12-16  3:21                               ` Dave Hansen
2017-12-16  1:29                       ` Dan Williams
2017-12-16  0:31                   ` Al Viro
2017-12-16  1:05                     ` Linus Torvalds
2017-12-15  8:00             ` Peter Zijlstra
2017-12-15 10:25               ` Peter Zijlstra
2017-12-15 11:38                 ` Peter Zijlstra
2017-12-15 16:38                   ` Dan Williams
2017-12-18 11:54                     ` Peter Zijlstra
2017-12-18 18:42                       ` Dan Williams
2017-12-15 14:04       ` Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 02/17] mm: Exempt special mappings from mlock(), mprotect() and madvise() Peter Zijlstra
2017-12-14 16:19   ` Andy Lutomirski
2017-12-14 17:36     ` Peter Zijlstra
2018-01-02 16:44       ` Dmitry Safonov
2017-12-14 11:27 ` [PATCH v2 03/17] arch: Allow arch_dup_mmap() to fail Peter Zijlstra
2017-12-14 16:22   ` Andy Lutomirski
2017-12-14 11:27 ` [PATCH v2 04/17] x86/ldt: Rework locking Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 05/17] x86/ldt: Prevent ldt inheritance on exec Peter Zijlstra
2017-12-14 16:32   ` Andy Lutomirski
2017-12-14 11:27 ` [PATCH v2 06/17] x86/ldt: Do not install LDT for kernel threads Peter Zijlstra
2017-12-14 19:43   ` Peter Zijlstra
2017-12-14 21:27     ` Andy Lutomirski
2017-12-14 11:27 ` [PATCH v2 07/17] mm/softdirty: Move VM_SOFTDIRTY into high bits Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 08/17] mm/x86: Allow special mappings with user access cleared Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 09/17] mm: Provide vm_special_mapping::close Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 10/17] selftest/x86: Implement additional LDT selftests Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 11/17] selftests/x86/ldt_gdt: Prepare for access bit forced Peter Zijlstra
2017-12-14 16:20   ` Andy Lutomirski
2017-12-14 19:43     ` Linus Torvalds
2017-12-14 21:22       ` Andy Lutomirski
2017-12-14 21:44         ` Linus Torvalds
2017-12-14 21:48           ` Linus Torvalds
2017-12-14 22:02             ` Peter Zijlstra
2017-12-14 22:14               ` Linus Torvalds
2017-12-14 22:24                 ` Peter Zijlstra
2017-12-14 22:52                   ` Linus Torvalds
2017-12-14 22:11             ` Andy Lutomirski
2017-12-14 22:15               ` Linus Torvalds
2017-12-14 22:30                 ` Andy Lutomirski
2017-12-14 22:23           ` Thomas Gleixner
2017-12-14 22:50             ` Linus Torvalds
2017-12-14 11:27 ` [PATCH v2 12/17] mm: Make populate_vma_page_range() available Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 13/17] x86/mm: Force LDT desc accessed bit Peter Zijlstra
2017-12-14 16:21   ` Andy Lutomirski
2017-12-14 11:27 ` [PATCH v2 14/17] x86/ldt: Reshuffle code Peter Zijlstra
2017-12-14 16:23   ` Andy Lutomirski
2017-12-14 16:31     ` Thomas Gleixner
2017-12-14 16:32       ` Thomas Gleixner
2017-12-14 16:34         ` Andy Lutomirski
2017-12-14 17:47           ` Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 15/17] x86/ldt: Prepare for VMA mapping Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 16/17] x86/ldt: Add VMA management code Peter Zijlstra
2017-12-14 11:27 ` [PATCH v2 17/17] x86/ldt: Make it read only VMA mapped Peter Zijlstra
2017-12-14 12:03 ` [PATCH v2 00/17] x86/ldt: Use a VMA based read only mapping Thomas Gleixner
2017-12-14 12:08   ` Peter Zijlstra
2017-12-14 16:35     ` Andy Lutomirski
2017-12-14 17:50       ` Peter Zijlstra
