* [patch 00/16] x86/ldt: Use a VMA based read only mapping
@ 2017-12-12 17:32 ` Thomas Gleixner
  0 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

Peter and I spent quite some time figuring out how to make CPUs cope
with a read-only (RO) mapped LDT.

While the initial trick of setting the ACCESSED bit from a special fault
handler covers most cases, the tricky problem of CS/SS on return to user
space (IRET ...) gave us quite a headache.

Peter finally found a way to solve it: touching the CS/SS selectors with
LAR on the way out to user space makes it work without trouble.

Contrary to the approach Andy was taking of storing the LDT in a special
map area, the following series uses a special mapping which is mapped
without the user bit and read only. This ties the LDT to the process,
which is the most natural way to do it, removes the requirement for special
pagetable code and works independently of pagetable isolation.
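
As a rough sketch of the idea (simplified, with made-up names, and not the
actual code from this series; VM_NOUSER is only introduced in patch 5), the
LDT pages become a special mapping in the process address space which the
CPU can reach through the descriptor tables but which user space cannot
touch:

static struct page *ldt_pages[1];	/* filled with the LDT's backing pages */

static const struct vm_special_mapping ldt_special_mapping = {
	.name	= "[ldt]",
	.pages	= ldt_pages,
};

/* Caller is expected to hold mm->mmap_sem for writing */
static int map_ldt_readonly(struct mm_struct *mm, unsigned long addr,
			    unsigned long size)
{
	struct vm_area_struct *vma;

	vma = _install_special_mapping(mm, addr, size,
				       VM_READ | VM_NOUSER,
				       &ldt_special_mapping);
	return PTR_ERR_OR_ZERO(vma);
}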

This was tested on quite a range of Intel and AMD machines, but the test
coverage on 32bit is still meager. I'll resurrect a few dusty bricks
tomorrow.

The patch series also includes an updated version of the 'do not inherit
LDT on exec' changes.

There are some extensions to the VMA code which need scrutiny from the mm
folks.

Thanks,

	tglx
---
 arch/powerpc/include/asm/mmu_context.h     |    5 
 arch/powerpc/platforms/Kconfig.cputype     |    1 
 arch/s390/Kconfig                          |    1 
 arch/x86/entry/common.c                    |    8 
 arch/x86/include/asm/desc.h                |    2 
 arch/x86/include/asm/mmu.h                 |    7 
 arch/x86/include/asm/thread_info.h         |    4 
 arch/x86/include/uapi/asm/mman.h           |    4 
 arch/x86/kernel/cpu/common.c               |    4 
 arch/x86/kernel/ldt.c                      |  573 ++++++++++++++++++++++-------
 arch/x86/mm/fault.c                        |   19 
 arch/x86/mm/tlb.c                          |    2 
 arch/x86/power/cpu.c                       |    2 
 b/arch/um/include/asm/mmu_context.h        |    3 
 b/arch/unicore32/include/asm/mmu_context.h |    5 
 b/arch/x86/include/asm/mmu_context.h       |   93 +++-
 b/include/linux/mman.h                     |    4 
 include/asm-generic/mm_hooks.h             |    5 
 include/linux/mm.h                         |   21 -
 include/linux/mm_types.h                   |    3 
 kernel/fork.c                              |    3 
 mm/internal.h                              |    2 
 mm/mmap.c                                  |   16 
 tools/testing/selftests/x86/ldt_gdt.c      |   83 +++-
 24 files changed, 673 insertions(+), 197 deletions(-)

^ permalink raw reply	[flat|nested] 134+ messages in thread


* [patch 01/16] arch: Allow arch_dup_mmap() to fail
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: arch--Allow-arch_dup_mmap---to-fail.patch --]
[-- Type: text/plain, Size: 3028 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

In order to sanitize the LDT initialization on x86, arch_dup_mmap() must be
allowed to fail. Fix up all instances.
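
A minimal sketch of the new contract (illustrative only; the arch_state
member, ARCH_STATE_SIZE and the copy itself are made up): an architecture
whose copy can actually fail now returns an error code, and dup_mmap()
hands it back to fork() as the kernel/fork.c hunk below shows.

static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
{
	/* hypothetical per-mm state which needs an allocation to copy */
	mm->context.arch_state = kmemdup(oldmm->context.arch_state,
					 ARCH_STATE_SIZE, GFP_KERNEL);
	if (!mm->context.arch_state)
		return -ENOMEM;
	return 0;
}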

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/include/asm/mmu_context.h   |    5 +++--
 arch/um/include/asm/mmu_context.h        |    3 ++-
 arch/unicore32/include/asm/mmu_context.h |    5 +++--
 arch/x86/include/asm/mmu_context.h       |    4 ++--
 include/asm-generic/mm_hooks.h           |    5 +++--
 kernel/fork.c                            |    3 +--
 6 files changed, 14 insertions(+), 11 deletions(-)

--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -114,9 +114,10 @@ static inline void enter_lazy_tlb(struct
 #endif
 }
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+				struct mm_struct *mm)
 {
+	return 0;
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -15,9 +15,10 @@ extern void uml_setup_stubs(struct mm_st
 /*
  * Needed since we do not use the asm-generic/mm_hooks.h:
  */
-static inline void arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	uml_setup_stubs(mm);
+	return 0;
 }
 extern void arch_exit_mmap(struct mm_struct *mm);
 static inline void arch_unmap(struct mm_struct *mm,
--- a/arch/unicore32/include/asm/mmu_context.h
+++ b/arch/unicore32/include/asm/mmu_context.h
@@ -81,9 +81,10 @@ do { \
 	} \
 } while (0)
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+				struct mm_struct *mm)
 {
+	return 0;
 }
 
 static inline void arch_unmap(struct mm_struct *mm,
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -176,10 +176,10 @@ do {						\
 } while (0)
 #endif
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	paravirt_arch_dup_mmap(oldmm, mm);
+	return 0;
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/include/asm-generic/mm_hooks.h
+++ b/include/asm-generic/mm_hooks.h
@@ -7,9 +7,10 @@
 #ifndef _ASM_GENERIC_MM_HOOKS_H
 #define _ASM_GENERIC_MM_HOOKS_H
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+				struct mm_struct *mm)
 {
+	return 0;
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -721,8 +721,7 @@ static __latent_entropy int dup_mmap(str
 			goto out;
 	}
 	/* a new mm has just been created */
-	arch_dup_mmap(oldmm, mm);
-	retval = 0;
+	retval = arch_dup_mmap(oldmm, mm);
 out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);

^ permalink raw reply	[flat|nested] 134+ messages in thread


* [patch 02/16] x86/ldt: Rework locking
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: x86-ldt--Rework-locking.patch --]
[-- Type: text/plain, Size: 4488 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The LDT is duplicated on fork() and on exec(), which is wrong as exec()
should start from a clean state, i.e. without LDT. To fix this the LDT
duplication code will be moved into arch_dup_mmap() which is only called
for fork().

This introduces a locking problem. arch_dup_mmap() holds mmap_sem of the
parent process, but the LDT duplication code needs to acquire
mm->context.lock to access the LDT data safely, which is the reverse lock
order of write_ldt() where mmap_sem nests into context.lock.

Solve this by introducing a new rw semaphore which serializes the
read/write_ldt() syscall operations and use context.lock to protect the
actual installation of the LDT descriptor.

So context.lock stabilizes mm->context.ldt and can nest inside of the new
semaphore or mmap_sem.
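
The resulting nesting, sketched with the outermost lock first (illustrative
only, not part of the patch):

/*
 *  read_ldt()/write_ldt():	ldt_usr_sem -> mmap_sem -> context.lock
 *  arch_dup_mmap() on fork:	mmap_sem -> context.lock
 */
static int write_ldt_sketch(struct mm_struct *mm)
{
	if (down_write_killable(&mm->context.ldt_usr_sem))
		return -EINTR;

	/* ... allocate and fill the new LDT, which may take mmap_sem ... */

	mutex_lock(&mm->context.lock);
	/* publish mm->context.ldt via smp_store_release() and send the IPIs */
	mutex_unlock(&mm->context.lock);

	up_write(&mm->context.ldt_usr_sem);
	return 0;
}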

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu.h         |    4 +++-
 arch/x86/include/asm/mmu_context.h |    2 ++
 arch/x86/kernel/ldt.c              |   33 +++++++++++++++++++++------------
 3 files changed, 26 insertions(+), 13 deletions(-)

--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_MMU_H
 
 #include <linux/spinlock.h>
+#include <linux/rwsem.h>
 #include <linux/mutex.h>
 #include <linux/atomic.h>
 
@@ -27,7 +28,8 @@ typedef struct {
 	atomic64_t tlb_gen;
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
-	struct ldt_struct *ldt;
+	struct rw_semaphore	ldt_usr_sem;
+	struct ldt_struct	*ldt;
 #endif
 
 #ifdef CONFIG_X86_64
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -132,6 +132,8 @@ void enter_lazy_tlb(struct mm_struct *mm
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
 {
+	mutex_init(&mm->context.lock);
+
 	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
 	atomic64_set(&mm->context.tlb_gen, 0);
 
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -5,6 +5,11 @@
  * Copyright (C) 2002 Andi Kleen
  *
  * This handles calls from both 32bit and 64bit mode.
+ *
+ * Lock order:
+ *	contex.ldt_usr_sem
+ *	  mmap_sem
+ *	    context.lock
  */
 
 #include <linux/errno.h>
@@ -42,7 +47,7 @@ static void refresh_ldt_segments(void)
 #endif
 }
 
-/* context.lock is held for us, so we don't need any locking. */
+/* context.lock is held by the task which issued the smp function call */
 static void flush_ldt(void *__mm)
 {
 	struct mm_struct *mm = __mm;
@@ -99,15 +104,17 @@ static void finalize_ldt_struct(struct l
 	paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
 }
 
-/* context.lock is held */
-static void install_ldt(struct mm_struct *current_mm,
-			struct ldt_struct *ldt)
+static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt)
 {
+	mutex_lock(&mm->context.lock);
+
 	/* Synchronizes with READ_ONCE in load_mm_ldt. */
-	smp_store_release(&current_mm->context.ldt, ldt);
+	smp_store_release(&mm->context.ldt, ldt);
 
-	/* Activate the LDT for all CPUs using current_mm. */
-	on_each_cpu_mask(mm_cpumask(current_mm), flush_ldt, current_mm, true);
+	/* Activate the LDT for all CPUs using currents mm. */
+	on_each_cpu_mask(mm_cpumask(mm), flush_ldt, mm, true);
+
+	mutex_unlock(&mm->context.lock);
 }
 
 static void free_ldt_struct(struct ldt_struct *ldt)
@@ -133,7 +140,8 @@ int init_new_context_ldt(struct task_str
 	struct mm_struct *old_mm;
 	int retval = 0;
 
-	mutex_init(&mm->context.lock);
+	init_rwsem(&mm->context.ldt_usr_sem);
+
 	old_mm = current->mm;
 	if (!old_mm) {
 		mm->context.ldt = NULL;
@@ -180,7 +188,7 @@ static int read_ldt(void __user *ptr, un
 	unsigned long entries_size;
 	int retval;
 
-	mutex_lock(&mm->context.lock);
+	down_read(&mm->context.ldt_usr_sem);
 
 	if (!mm->context.ldt) {
 		retval = 0;
@@ -209,7 +217,7 @@ static int read_ldt(void __user *ptr, un
 	retval = bytecount;
 
 out_unlock:
-	mutex_unlock(&mm->context.lock);
+	up_read(&mm->context.ldt_usr_sem);
 	return retval;
 }
 
@@ -269,7 +277,8 @@ static int write_ldt(void __user *ptr, u
 			ldt.avl = 0;
 	}
 
-	mutex_lock(&mm->context.lock);
+	if (down_write_killable(&mm->context.ldt_usr_sem))
+		return -EINTR;
 
 	old_ldt       = mm->context.ldt;
 	old_nr_entries = old_ldt ? old_ldt->nr_entries : 0;
@@ -291,7 +300,7 @@ static int write_ldt(void __user *ptr, u
 	error = 0;
 
 out_unlock:
-	mutex_unlock(&mm->context.lock);
+	up_write(&mm->context.ldt_usr_sem);
 out:
 	return error;
 }

^ permalink raw reply	[flat|nested] 134+ messages in thread


* [patch 03/16] x86/ldt: Prevent ldt inheritance on exec
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: x86-ldt--Prevent-ldt-inheritance-on-exec.patch --]
[-- Type: text/plain, Size: 4313 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The LDT is inherited independently of fork or exec, but that makes no sense
at all because exec is supposed to start the process clean.

The reason why this happens is that init_new_context_ldt() is called from
init_new_context() which obviously needs to be called for both fork() and
exec().

It would be surprising if anything relies on that behaviour, so it seems to
be safe to remove that misfeature.

Split the context initialization into two parts. Clear the ldt pointer and
initialize the mutex from the general context init and move the LDT
duplication to arch_dup_mmap() which is only called on fork().
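
A quick user space check of the new behaviour (hypothetical test, not part
of the patch): modify_ldt(0, ...) reads back the current LDT, so after this
change a freshly exec()ed image should see zero bytes even if the previous
image had installed entries, while a fork()ed child still sees them.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	char buf[512];
	long n = syscall(SYS_modify_ldt, 0, buf, sizeof(buf));

	printf("LDT bytes visible: %ld\n", n);	/* expect 0 right after exec() */
	return 0;
}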

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu_context.h    |   21 ++++++++++++++-------
 arch/x86/kernel/ldt.c                 |   18 +++++-------------
 tools/testing/selftests/x86/ldt_gdt.c |    9 +++------
 3 files changed, 22 insertions(+), 26 deletions(-)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -57,11 +57,17 @@ struct ldt_struct {
 /*
  * Used for LDT copy/destruction.
  */
-int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm);
+static inline void init_new_context_ldt(struct mm_struct *mm)
+{
+	mm->context.ldt = NULL;
+	init_rwsem(&mm->context.ldt_usr_sem);
+}
+int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
 void destroy_context_ldt(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
-static inline int init_new_context_ldt(struct task_struct *tsk,
-				       struct mm_struct *mm)
+static inline void init_new_context_ldt(struct mm_struct *mm) { }
+static inline int ldt_dup_context(struct mm_struct *oldmm,
+				  struct mm_struct *mm)
 {
 	return 0;
 }
@@ -137,15 +143,16 @@ static inline int init_new_context(struc
 	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
 	atomic64_set(&mm->context.tlb_gen, 0);
 
-	#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
 		/* pkey 0 is the default and always allocated */
 		mm->context.pkey_allocation_map = 0x1;
 		/* -1 means unallocated or invalid */
 		mm->context.execute_only_pkey = -1;
 	}
-	#endif
-	return init_new_context_ldt(tsk, mm);
+#endif
+	init_new_context_ldt(mm);
+	return 0;
 }
 static inline void destroy_context(struct mm_struct *mm)
 {
@@ -181,7 +188,7 @@ do {						\
 static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	paravirt_arch_dup_mmap(oldmm, mm);
-	return 0;
+	return ldt_dup_context(oldmm, mm);
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -131,28 +131,20 @@ static void free_ldt_struct(struct ldt_s
 }
 
 /*
- * we do not have to muck with descriptors here, that is
- * done in switch_mm() as needed.
+ * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
+ * the new task is not running, so nothing can be installed.
  */
-int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
+int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
 {
 	struct ldt_struct *new_ldt;
-	struct mm_struct *old_mm;
 	int retval = 0;
 
-	init_rwsem(&mm->context.ldt_usr_sem);
-
-	old_mm = current->mm;
-	if (!old_mm) {
-		mm->context.ldt = NULL;
+	if (!old_mm)
 		return 0;
-	}
 
 	mutex_lock(&old_mm->context.lock);
-	if (!old_mm->context.ldt) {
-		mm->context.ldt = NULL;
+	if (!old_mm->context.ldt)
 		goto out_unlock;
-	}
 
 	new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
 	if (!new_ldt) {
--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -627,13 +627,10 @@ static void do_multicpu_tests(void)
 static int finish_exec_test(void)
 {
 	/*
-	 * In a sensible world, this would be check_invalid_segment(0, 1);
-	 * For better or for worse, though, the LDT is inherited across exec.
-	 * We can probably change this safely, but for now we test it.
+	 * Older kernel versions did inherit the LDT on exec() which is
+	 * wrong because exec() starts from a clean state.
 	 */
-	check_valid_segment(0, 1,
-			    AR_DPL3 | AR_TYPE_XRCODE | AR_S | AR_P | AR_DB,
-			    42, true);
+	check_invalid_segment(0, 1);
 
 	return nerrs ? 1 : 0;
 }

^ permalink raw reply	[flat|nested] 134+ messages in thread


* [patch 04/16] mm/softdirty: Move VM_SOFTDIRTY into high bits
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: mm-softdirty--Move-VM_SOFTDIRTY-into-high-bits.patch --]
[-- Type: text/plain, Size: 2564 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Only 64bit architectures (x86_64, s390, PPC_BOOK3S_64) have support for
HAVE_ARCH_SOFT_DIRTY, so ensure they all select ARCH_USES_HIGH_VMA_FLAGS
and move the VM_SOFTDIRTY flag into the high flags.
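
For illustration only (the values follow directly from the include/linux/mm.h
hunk below): with ARCH_USES_HIGH_VMA_FLAGS selected, vm_flags is 64 bit wide,
so VM_SOFTDIRTY can live above bit 31 and its old slot becomes reusable.

#define VM_HIGH_SOFTDIRTY_BIT	37
#define VM_SOFTDIRTY		BIT(VM_HIGH_SOFTDIRTY_BIT)	/* 0x0000002000000000 */
/*
 * The old location, bit 27 (0x08000000), is freed up and reused later in
 * this series for VM_ARCH_0 / VM_NOUSER.
 */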

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/powerpc/platforms/Kconfig.cputype |    1 +
 arch/s390/Kconfig                      |    1 +
 include/linux/mm.h                     |   17 +++++++++++------
 3 files changed, 13 insertions(+), 6 deletions(-)

--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -76,6 +76,7 @@ config PPC_BOOK3S_64
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select IRQ_WORK
 	select HAVE_KERNEL_XZ
+	select ARCH_USES_HIGH_VMA_FLAGS
 
 config PPC_BOOK3E_64
 	bool "Embedded processors"
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -131,6 +131,7 @@ config S390
 	select CPU_NO_EFFICIENT_FFS if !HAVE_MARCH_Z9_109_FEATURES
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_SOFT_DIRTY
+	select ARCH_USES_HIGH_VMA_FLAGS
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -194,12 +194,6 @@ extern unsigned int kobjsize(const void
 #define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
 
-#ifdef CONFIG_MEM_SOFT_DIRTY
-# define VM_SOFTDIRTY	0x08000000	/* Not soft dirty clean area */
-#else
-# define VM_SOFTDIRTY	0
-#endif
-
 #define VM_MIXEDMAP	0x10000000	/* Can contain "struct page" and pure PFN pages */
 #define VM_HUGEPAGE	0x20000000	/* MADV_HUGEPAGE marked this vma */
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
@@ -216,8 +210,19 @@ extern unsigned int kobjsize(const void
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+
+#define VM_HIGH_SOFTDIRTY_BIT	37	/* bit only usable on 64-bit architectures */
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#ifdef CONFIG_MEM_SOFT_DIRTY
+# ifndef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#  error MEM_SOFT_DIRTY depends on ARCH_USES_HIGH_VMA_FLAGS
+# endif
+# define VM_SOFTDIRTY		BIT(VM_HIGH_SOFTDIRTY_BIT) /* Not soft dirty clean area */
+#else
+# define VM_SOFTDIRTY		VM_NONE
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)

^ permalink raw reply	[flat|nested] 134+ messages in thread


* [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: mm--Allow-special-mappings-with-user-access-cleared.patch --]
[-- Type: text/plain, Size: 2943 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

In order to create VMAs that are not accessible to userspace, create a new
VM_NOUSER flag. This can be used in conjunction with
install_special_mapping() to inject 'kernel' data into the userspace map.

Similar to how arch_vm_get_page_prot() allows adding _PAGE_* flags to the
pgprot_t, introduce arch_vm_get_page_prot_excl(), which masks _PAGE_* flags
out of the pgprot_t, and use this to implement VM_NOUSER for x86.
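
A minimal illustration of the hook's effect on x86 (sketch only, not part of
the patch): VM_NOUSER strips _PAGE_USER from whatever protection the generic
code computed, everything else stays the same.

static void show_nouser_effect(void)
{
	pgprot_t normal = vm_get_page_prot(VM_READ);
	pgprot_t nouser = vm_get_page_prot(VM_READ | VM_NOUSER);

	/* A plain read-only user mapping carries _PAGE_USER ... */
	WARN_ON(!(pgprot_val(normal) & _PAGE_USER));
	/* ... a VM_NOUSER mapping does not. */
	WARN_ON(pgprot_val(nouser) & _PAGE_USER);
}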

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/uapi/asm/mman.h |    4 ++++
 include/linux/mm.h               |    2 ++
 include/linux/mman.h             |    4 ++++
 mm/mmap.c                        |   12 ++++++++++--
 4 files changed, 20 insertions(+), 2 deletions(-)

--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -26,6 +26,10 @@
 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
+#define arch_vm_get_page_prot_excl(vm_flags) __pgprot(		\
+		((vm_flags) & VM_NOUSER ? _PAGE_USER : 0)	\
+		)
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -193,6 +193,7 @@ extern unsigned int kobjsize(const void
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
 #define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
+#define VM_ARCH_0	0x08000000	/* Architecture-specific flag */
 
 #define VM_MIXEDMAP	0x10000000	/* Can contain "struct page" and pure PFN pages */
 #define VM_HUGEPAGE	0x20000000	/* MADV_HUGEPAGE marked this vma */
@@ -224,6 +225,7 @@ extern unsigned int kobjsize(const void
 #endif
 
 #if defined(CONFIG_X86)
+# define VM_NOUSER	VM_ARCH_0	/* Not accessible by userspace */
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -43,6 +43,10 @@ static inline void vm_unacct_memory(long
 #define arch_vm_get_page_prot(vm_flags) __pgprot(0)
 #endif
 
+#ifndef arch_vm_get_page_prot_excl
+#define arch_vm_get_page_prot_excl(vm_flags) __pgprot(0)
+#endif
+
 #ifndef arch_validate_prot
 /*
  * This is called from mprotect().  PROT_GROWSDOWN and PROT_GROWSUP have
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -102,9 +102,17 @@ pgprot_t protection_map[16] __ro_after_i
 
 pgprot_t vm_get_page_prot(unsigned long vm_flags)
 {
-	return __pgprot(pgprot_val(protection_map[vm_flags &
-				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
+	pgprot_t prot;
+
+	prot = protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+
+	prot = __pgprot(pgprot_val(prot) |
 			pgprot_val(arch_vm_get_page_prot(vm_flags)));
+
+	prot = __pgprot(pgprot_val(prot) &
+			~pgprot_val(arch_vm_get_page_prot_excl(vm_flags)));
+
+	return prot;
 }
 EXPORT_SYMBOL(vm_get_page_prot);
 

^ permalink raw reply	[flat|nested] 134+ messages in thread


* [patch 06/16] mm: Provide vm_special_mapping::close
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: mm--Provide-vm_special_mapping--close.patch --]
[-- Type: text/plain, Size: 1154 bytes --]

From: Peter Zijlstra  <peterz@infradead.org>

Userspace can (maliciously) munmap() the VMAs injected into its memory
map through install_special_mapping(). In order to ensure that no hardware
resources remain tied to the mapping when that happens, we need a close
callback.
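
A sketch of how a user of install_special_mapping() would hook this up (the
names are made up, not part of the patch):

static void hwbuf_mapping_close(const struct vm_special_mapping *sm,
				struct vm_area_struct *vma)
{
	/* release whatever hardware resource was tied to this VMA */
}

static const struct vm_special_mapping hwbuf_mapping = {
	.name	= "[hwbuf]",
	.close	= hwbuf_mapping_close,
};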

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/mm_types.h |    3 +++
 mm/mmap.c                |    4 ++++
 2 files changed, 7 insertions(+)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -644,6 +644,9 @@ struct vm_special_mapping {
 
 	int (*mremap)(const struct vm_special_mapping *sm,
 		     struct vm_area_struct *new_vma);
+
+	void (*close)(const struct vm_special_mapping *sm,
+		      struct vm_area_struct *vma);
 };
 
 enum tlb_flush_reason {
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3206,6 +3206,10 @@ static int special_mapping_fault(struct
  */
 static void special_mapping_close(struct vm_area_struct *vma)
 {
+	struct vm_special_mapping *sm = vma->vm_private_data;
+
+	if (sm->close)
+		sm->close(sm, vma);
 }
 
 static const char *special_mapping_name(struct vm_area_struct *vma)

^ permalink raw reply	[flat|nested] 134+ messages in thread


* [patch 07/16] selftest/x86: Implement additional LDT selftests
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: selftest-x86-Implement-additional-LDT-selftests.patch --]
[-- Type: text/plain, Size: 2937 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

do_ldt_ss_test() - tests modifying the SS segment while it is in use; this
tends to come apart with RO LDT maps.

do_ldt_unmap_test() - tests the mechanics of unmapping the (future)
LDT VMA. Additional tests would make sense, like unmapping it while in
use (TODO).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/testing/selftests/x86/ldt_gdt.c |   71 +++++++++++++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)

--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -242,6 +242,72 @@ static void fail_install(struct user_des
 	}
 }
 
+static void do_ldt_ss_test(void)
+{
+	unsigned short prev_sel, sel = (2 << 3) | (1 << 2) | 3;
+	struct user_desc *ldt_desc = low_user_desc + 2;
+	int ret;
+
+	ldt_desc->entry_number	= 2;
+	ldt_desc->base_addr	= (unsigned long)&counter_page[1];
+	ldt_desc->limit		= 0xfffff;
+	ldt_desc->seg_32bit	= 1;
+	ldt_desc->contents		= 0; /* Data, grow-up*/
+	ldt_desc->read_exec_only	= 0;
+	ldt_desc->limit_in_pages	= 1;
+	ldt_desc->seg_not_present	= 0;
+	ldt_desc->useable		= 0;
+
+	ret = safe_modify_ldt(1, ldt_desc, sizeof(*ldt_desc));
+	if (ret)
+		perror("ponies");
+
+	/*
+	 * syscall (eax) 123 - modify_ldt / return value
+	 *         (ebx)     - func
+	 *         (ecx)     - ptr
+	 *         (edx)     - bytecount
+	 */
+
+	int eax = 123;
+	int ebx = 1;
+	int ecx = (unsigned int)(unsigned long)ldt_desc;
+	int edx = sizeof(struct user_desc);
+
+	asm volatile ("movw %%ss, %[prev_sel]\n\t"
+		      "movw %[sel], %%ss\n\t"
+		      "int $0x80\n\t"
+		      "movw %[prev_sel], %%ss"
+		      : [prev_sel] "=&R" (prev_sel), "+a" (eax)
+		      : [sel] "R" (sel), "b" (ebx), "c" (ecx), "d" (edx)
+		      : INT80_CLOBBERS);
+
+	printf("[OK]\tSS modify_ldt()\n");
+}
+
+static void do_ldt_unmap_test(void)
+{
+	FILE *file = fopen("/proc/self/maps", "r");
+	char *line = NULL;
+	size_t len = 0;
+	ssize_t nread;
+	unsigned long start, end;
+
+	while ((nread = getline(&line, &len, file)) != -1) {
+		if (strstr(line, "[ldt]")) {
+			if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
+				munmap((void *)start, end-start);
+				printf("[OK]\tmunmap LDT\n");
+				break;
+			}
+		}
+	}
+
+	free(line);
+	fclose(file);
+
+}
+
 static void do_simple_tests(void)
 {
 	struct user_desc desc = {
@@ -696,7 +762,7 @@ static int invoke_set_thread_area(void)
 
 static void setup_low_user_desc(void)
 {
-	low_user_desc = mmap(NULL, 2 * sizeof(struct user_desc),
+	low_user_desc = mmap(NULL, 3 * sizeof(struct user_desc),
 			     PROT_READ | PROT_WRITE,
 			     MAP_ANONYMOUS | MAP_PRIVATE | MAP_32BIT, -1, 0);
 	if (low_user_desc == MAP_FAILED)
@@ -916,6 +982,9 @@ int main(int argc, char **argv)
 	setup_counter_page();
 	setup_low_user_desc();
 
+	do_ldt_ss_test();
+	do_ldt_unmap_test();
+
 	do_simple_tests();
 
 	do_multicpu_tests();

^ permalink raw reply	[flat|nested] 134+ messages in thread


* [patch 08/16] selftests/x86/ldt_gdt: Prepare for access bit forced
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: selftests-x86-ldt_gdt--Prepare-for-access-bit-forced.patch --]
[-- Type: text/plain, Size: 878 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

In order to make the LDT mapping RO, the accessed bit needs to be forced
by the kernel. Adjust the test case so that it handles this gracefully.
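
For illustration, the check the test ends up with is equivalent to the
helper below (sketch only; AR_ACCESSED is assumed to be the selftest's
(1 << 8), i.e. the accessed bit of the type field as reported by LAR):

#include <stdint.h>

#define AR_ACCESSED	(1 << 8)	/* assumed value, as in the selftest */

/* Accept the expected access rights with or without the accessed bit,
 * for LDT and GDT entries alike, since the kernel now presets it.
 */
static int ar_matches(uint32_t ar, uint32_t expected_ar)
{
	return ar == expected_ar || ar == (expected_ar | AR_ACCESSED);
}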

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/testing/selftests/x86/ldt_gdt.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -122,8 +122,7 @@ static void check_valid_segment(uint16_t
 	 * NB: Different Linux versions do different things with the
 	 * accessed bit in set_thread_area().
 	 */
-	if (ar != expected_ar &&
-	    (ldt || ar != (expected_ar | AR_ACCESSED))) {
+	if (ar != expected_ar && ar != (expected_ar | AR_ACCESSED)) {
 		printf("[FAIL]\t%s entry %hu has AR 0x%08X but expected 0x%08X\n",
 		       (ldt ? "LDT" : "GDT"), index, ar, expected_ar);
 		nerrs++;

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [patch 09/16] mm: Make populate_vma_page_range() available
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: mm--Make-populate_vma_page_range---available.patch --]
[-- Type: text/plain, Size: 1283 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Make populate_vma_page_range() available outside of mm/, so special
mappings can be populated in dup_mmap().
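
A rough sketch of the intended kind of caller (the helper name is
hypothetical; the actual dup_mmap() hook comes later in the series):

/*
 * Hypothetical illustration only: with the declaration moved to
 * include/linux/mm.h, code outside mm/ can pre-populate a freshly
 * duplicated special mapping instead of relying on later faults.
 */
static long prepopulate_special_vma(struct vm_area_struct *vma)
{
	/* nonblocking == NULL: the caller already holds mmap_sem */
	return populate_vma_page_range(vma, vma->vm_start, vma->vm_end, NULL);
}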

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/mm.h |    2 ++
 mm/internal.h      |    2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2159,6 +2159,8 @@ do_mmap_pgoff(struct file *file, unsigne
 }
 
 #ifdef CONFIG_MMU
+extern long populate_vma_page_range(struct vm_area_struct *vma,
+		unsigned long start, unsigned long end, int *nonblocking);
 extern int __mm_populate(unsigned long addr, unsigned long len,
 			 int ignore_errors);
 static inline void mm_populate(unsigned long addr, unsigned long len)
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -284,8 +284,6 @@ void __vma_link_list(struct mm_struct *m
 		struct vm_area_struct *prev, struct rb_node *rb_parent);
 
 #ifdef CONFIG_MMU
-extern long populate_vma_page_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, int *nonblocking);
 extern void munlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
 static inline void munlock_vma_pages_all(struct vm_area_struct *vma)

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [patch 10/16] x86/ldt: Do not install LDT for kernel threads
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: x86-ldt--Do-not-install-LDT-for-kernel-threads.patch --]
[-- Type: text/plain, Size: 747 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Kernel threads can use the mm of a user process temporarily via use_mm(),
but there is no point in installing the LDT associated with that mm for
the kernel thread.
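
For context, the use_mm() pattern in question looks roughly like this
(sketch, not taken from this series):

/*
 * Sketch: a kernel thread temporarily adopting a user mm, as e.g. AIO or
 * vhost workers do. The thread accesses user memory through that mm but
 * never returns to user space, so loading the mm's LDT buys nothing.
 */
static void kworker_touch_user_mm(struct mm_struct *mm)
{
	use_mm(mm);		/* current->mm = mm for copy_{to,from}_user() */
	/* ... operate on the user mappings ... */
	unuse_mm(mm);
}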

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu_context.h |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -95,8 +95,7 @@ static inline void load_mm_ldt(struct mm
 	 * the local LDT after an IPI loaded a newer value than the one
 	 * that we can see.
 	 */
-
-	if (unlikely(ldt))
+	if (unlikely(ldt && !(current->flags & PF_KTHREAD)))
 		set_ldt(ldt->entries, ldt->nr_entries);
 	else
 		clear_LDT();

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: x86-ldt--Force-access-bit-for-CS-SS.patch --]
[-- Type: text/plain, Size: 9581 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

When mapping the LDT RO, the hardware will typically generate write faults
on first use. These faults can be trapped and the backing pages can be
modified by the kernel.

There is one exception: IRET will immediately load CS/SS and unrecoverably
#GP. To avoid this issue, access the LDT descriptors used by CS/SS before
the IRET to userspace.

For this, use LAR, which is a safe operation in that it will happily
consume an invalid LDT descriptor without trapping. It gets the CPU to
load the descriptor and observe the (preset) ACCESSED bit.

So far none of the obvious candidates like dosemu/wine/etc. care about
the ACCESSED bit at all, so it should be rather safe to enforce it.
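
For reference, the accessed bit is the low bit of the descriptor type
field, which is what the fill_ldt() hunk below ORs in (layout sketch, not
part of the patch; the define is illustrative, not a kernel name):

/*
 * Type field of a user (code/data) segment descriptor, 4 bits:
 *   bit 0: accessed              - normally set by the CPU on first load
 *   bit 1: writable (data) / readable (code)
 *   bit 2: expand-down (data) / conforming (code)
 *   bit 3: 0 = data segment, 1 = code segment
 */
#define DESC_TYPE_ACCESSED	0x1

static inline bool desc_type_has_accessed(unsigned int type)
{
	return type & DESC_TYPE_ACCESSED;
}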

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/common.c            |    8 ++++-
 arch/x86/include/asm/desc.h        |    2 +
 arch/x86/include/asm/mmu_context.h |   53 +++++++++++++++++++++++--------------
 arch/x86/include/asm/thread_info.h |    4 ++
 arch/x86/kernel/cpu/common.c       |    4 +-
 arch/x86/kernel/ldt.c              |   30 ++++++++++++++++++++
 arch/x86/mm/tlb.c                  |    2 -
 arch/x86/power/cpu.c               |    2 -
 8 files changed, 78 insertions(+), 27 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -30,6 +30,7 @@
 #include <asm/vdso.h>
 #include <linux/uaccess.h>
 #include <asm/cpufeature.h>
+#include <asm/mmu_context.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -130,8 +131,8 @@ static long syscall_trace_enter(struct p
 	return ret ?: regs->orig_ax;
 }
 
-#define EXIT_TO_USERMODE_LOOP_FLAGS				\
-	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
+#define EXIT_TO_USERMODE_LOOP_FLAGS					\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | _TIF_LDT |\
 	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
@@ -171,6 +172,9 @@ static void exit_to_usermode_loop(struct
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
+		if (cached_flags & _TIF_LDT)
+			ldt_exit_user(regs);
+
 		cached_flags = READ_ONCE(current_thread_info()->flags);
 
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -20,6 +20,8 @@ static inline void fill_ldt(struct desc_
 
 	desc->type		= (info->read_exec_only ^ 1) << 1;
 	desc->type	       |= info->contents << 2;
+	/* Set ACCESS bit */
+	desc->type	       |= 1;
 
 	desc->s			= 1;
 	desc->dpl		= 0x3;
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -57,24 +57,34 @@ struct ldt_struct {
 /*
  * Used for LDT copy/destruction.
  */
-static inline void init_new_context_ldt(struct mm_struct *mm)
+static inline void init_new_context_ldt(struct task_struct *task,
+					struct mm_struct *mm)
 {
 	mm->context.ldt = NULL;
 	init_rwsem(&mm->context.ldt_usr_sem);
+	/*
+	 * Set the TIF flag unconditionally because in ldt_dup_context() the new
+	 * task pointer is not available. In case there is no LDT this is a
+	 * nop on the first exit to user space.
+	 */
+	set_tsk_thread_flag(task, TIF_LDT);
 }
 int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
+void ldt_exit_user(struct pt_regs *regs);
 void destroy_context_ldt(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
-static inline void init_new_context_ldt(struct mm_struct *mm) { }
+static inline void init_new_context_ldt(struct task_struct *task,
+					struct mm_struct *mm) { }
 static inline int ldt_dup_context(struct mm_struct *oldmm,
 				  struct mm_struct *mm)
 {
 	return 0;
 }
-static inline void destroy_context_ldt(struct mm_struct *mm) {}
+static inline void ldt_exit_user(struct pt_regs *regs) { }
+static inline void destroy_context_ldt(struct mm_struct *mm) { }
 #endif
 
-static inline void load_mm_ldt(struct mm_struct *mm)
+static inline void load_mm_ldt(struct mm_struct *mm, struct task_struct *tsk)
 {
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct ldt_struct *ldt;
@@ -83,28 +93,31 @@ static inline void load_mm_ldt(struct mm
 	ldt = READ_ONCE(mm->context.ldt);
 
 	/*
-	 * Any change to mm->context.ldt is followed by an IPI to all
-	 * CPUs with the mm active.  The LDT will not be freed until
-	 * after the IPI is handled by all such CPUs.  This means that,
-	 * if the ldt_struct changes before we return, the values we see
-	 * will be safe, and the new values will be loaded before we run
-	 * any user code.
+	 * Clear LDT if the mm does not have it set or if this is a kernel
+	 * thread which might temporarily use the mm of a user process via
+	 * use_mm(). If the next task uses LDT then set it up and set
+	 * TIF_LDT so it will touch the new LDT on exit to user space.
 	 *
-	 * NB: don't try to convert this to use RCU without extreme care.
-	 * We would still need IRQs off, because we don't want to change
-	 * the local LDT after an IPI loaded a newer value than the one
-	 * that we can see.
+	 * This code is run with interrupts disabled so it is serialized
+	 * against the IPI from ldt_install_mm().
 	 */
-	if (unlikely(ldt && !(current->flags & PF_KTHREAD)))
-		set_ldt(ldt->entries, ldt->nr_entries);
-	else
+	if (likely(!ldt || (tsk->flags & PF_KTHREAD))) {
 		clear_LDT();
+	} else {
+		set_ldt(ldt->entries, ldt->nr_entries);
+		set_tsk_thread_flag(tsk, TIF_LDT);
+	}
 #else
+	/*
+	 * FIXME: This wants a comment why this actually does anything at
+	 * all when the syscall is disabled.
+	 */
 	clear_LDT();
 #endif
 }
 
-static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
+static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next,
+			      struct task_struct *tsk)
 {
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	/*
@@ -126,7 +139,7 @@ static inline void switch_ldt(struct mm_
 	 */
 	if (unlikely((unsigned long)prev->context.ldt |
 		     (unsigned long)next->context.ldt))
-		load_mm_ldt(next);
+		load_mm_ldt(next, tsk);
 #endif
 
 	DEBUG_LOCKS_WARN_ON(preemptible());
@@ -150,7 +163,7 @@ static inline int init_new_context(struc
 		mm->context.execute_only_pkey = -1;
 	}
 #endif
-	init_new_context_ldt(mm);
+	init_new_context_ldt(tsk, mm);
 	return 0;
 }
 static inline void destroy_context(struct mm_struct *mm)
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
 #define TIF_SYSCALL_EMU		6	/* syscall emulation active */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
+#define TIF_LDT			9	/* Populate LDT after fork */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_PATCH_PENDING	13	/* pending live patching update */
@@ -109,6 +110,7 @@ struct thread_info {
 #define _TIF_SYSCALL_EMU	(1 << TIF_SYSCALL_EMU)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
+#define _TIF_LDT		(1 << TIF_LDT)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
@@ -141,7 +143,7 @@ struct thread_info {
 	 _TIF_NEED_RESCHED | _TIF_SINGLESTEP | _TIF_SYSCALL_EMU |	\
 	 _TIF_SYSCALL_AUDIT | _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE |	\
 	 _TIF_PATCH_PENDING | _TIF_NOHZ | _TIF_SYSCALL_TRACEPOINT |	\
-	 _TIF_FSCHECK)
+	 _TIF_FSCHECK | _TIF_LDT)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1602,7 +1602,7 @@ void cpu_init(void)
 	set_tss_desc(cpu, t);
 	load_TR_desc();
 
-	load_mm_ldt(&init_mm);
+	load_mm_ldt(&init_mm, current);
 
 	clear_all_debug_regs();
 	dbg_restore_debug_regs();
@@ -1660,7 +1660,7 @@ void cpu_init(void)
 	set_tss_desc(cpu, t);
 	load_TR_desc();
 
-	load_mm_ldt(&init_mm);
+	load_mm_ldt(&init_mm, current);
 
 	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
 
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -164,6 +164,36 @@ int ldt_dup_context(struct mm_struct *ol
 }
 
 /*
+ * Touching the LDT entries with LAR makes sure that the CPU "caches" the
+ * ACCESSED bit in the LDT entry which is already set when the entry is
+ * stored.
+ */
+static inline void ldt_touch_seg(unsigned long seg)
+{
+	u16 ar, sel = (u16)seg & ~SEGMENT_RPL_MASK;
+
+	if (!(seg & SEGMENT_LDT))
+		return;
+
+	asm volatile ("lar %[sel], %[ar]"
+			: [ar] "=R" (ar)
+			: [sel] "R" (sel));
+}
+
+void ldt_exit_user(struct pt_regs *regs)
+{
+	struct mm_struct *mm = current->mm;
+
+	clear_tsk_thread_flag(current, TIF_LDT);
+
+	if (!mm || !mm->context.ldt)
+		return;
+
+	ldt_touch_seg(regs->cs);
+	ldt_touch_seg(regs->ss);
+}
+
+/*
  * No need to lock the MM as we are the last user
  *
  * 64bit: Don't touch the LDT register - we're already in the next thread.
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -219,7 +219,7 @@ void switch_mm_irqs_off(struct mm_struct
 	}
 
 	load_mm_cr4(next);
-	switch_ldt(real_prev, next);
+	switch_ldt(real_prev, next, tsk);
 }
 
 /*
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -180,7 +180,7 @@ static void fix_processor_context(void)
 	syscall_init();				/* This sets MSR_*STAR and related */
 #endif
 	load_TR_desc();				/* This does ltr */
-	load_mm_ldt(current->active_mm);	/* This does lldt */
+	load_mm_ldt(current->active_mm, current);
 	initialize_tlbstate_and_flush();
 
 	fpu__resume_cpu();

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [patch 12/16] x86/ldt: Reshuffle code
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: x86-ldt--Reshuffle-code.patch --]
[-- Type: text/plain, Size: 5701 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Restructure the code so that the following VMA changes do not create an
unreadable mess. No functional change.
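
As a reminder of the user-visible contract which the read_ldt() rework
below preserves, reading the LDT from user space looks roughly like this
(sketch; modify_ldt() has no glibc wrapper, hence syscall(2)):

#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	unsigned char buf[4096];
	long ret;

	memset(buf, 0xff, sizeof(buf));
	/*
	 * func 0 = read: returns the requested size (capped at the maximum
	 * LDT size, 0 if no LDT is installed), zero-filling everything past
	 * the installed entries.
	 */
	ret = syscall(SYS_modify_ldt, 0, buf, sizeof(buf));
	printf("modify_ldt(read) returned %ld\n", ret);
	return 0;
}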

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu_context.h |    4 +
 arch/x86/kernel/ldt.c              |  118 +++++++++++++++++--------------------
 2 files changed, 59 insertions(+), 63 deletions(-)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -39,6 +39,10 @@ static inline void load_mm_cr4(struct mm
 #endif
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
+#include <asm/ldt.h>
+
+#define LDT_ENTRIES_MAP_SIZE	(LDT_ENTRIES * LDT_ENTRY_SIZE)
+
 /*
  * ldt_structs can be allocated, used, and freed, but they are never
  * modified while live.
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -28,6 +28,12 @@
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
 
+/* After calling this, the LDT is immutable. */
+static void finalize_ldt_struct(struct ldt_struct *ldt)
+{
+	paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
+}
+
 static void refresh_ldt_segments(void)
 {
 #ifdef CONFIG_X86_64
@@ -48,18 +54,32 @@ static void refresh_ldt_segments(void)
 }
 
 /* context.lock is held by the task which issued the smp function call */
-static void flush_ldt(void *__mm)
+static void __ldt_install(void *__mm)
 {
 	struct mm_struct *mm = __mm;
-	mm_context_t *pc;
+	struct ldt_struct *ldt = mm->context.ldt;
 
-	if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
-		return;
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
+	    !(current->flags & PF_KTHREAD)) {
+		unsigned int nentries = ldt ? ldt->nr_entries : 0;
+
+		set_ldt(ldt ? ldt->entries : NULL, nentries);
+		refresh_ldt_segments();
+		set_tsk_thread_flag(current, TIF_LDT);
+	}
+}
 
-	pc = &mm->context;
-	set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
+static void ldt_install_mm(struct mm_struct *mm, struct ldt_struct *ldt)
+{
+	mutex_lock(&mm->context.lock);
 
-	refresh_ldt_segments();
+	/* Synchronizes with READ_ONCE in load_mm_ldt. */
+	smp_store_release(&mm->context.ldt, ldt);
+
+	/* Activate the LDT for all CPUs using currents mm. */
+	on_each_cpu_mask(mm_cpumask(mm), __ldt_install, mm, true);
+
+	mutex_unlock(&mm->context.lock);
 }
 
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
@@ -98,25 +118,6 @@ static struct ldt_struct *alloc_ldt_stru
 	return new_ldt;
 }
 
-/* After calling this, the LDT is immutable. */
-static void finalize_ldt_struct(struct ldt_struct *ldt)
-{
-	paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
-}
-
-static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt)
-{
-	mutex_lock(&mm->context.lock);
-
-	/* Synchronizes with READ_ONCE in load_mm_ldt. */
-	smp_store_release(&mm->context.ldt, ldt);
-
-	/* Activate the LDT for all CPUs using currents mm. */
-	on_each_cpu_mask(mm_cpumask(mm), flush_ldt, mm, true);
-
-	mutex_unlock(&mm->context.lock);
-}
-
 static void free_ldt_struct(struct ldt_struct *ldt)
 {
 	if (likely(!ldt))
@@ -164,6 +165,18 @@ int ldt_dup_context(struct mm_struct *ol
 }
 
 /*
+ * This can run unlocked because the mm is no longer in use. No need to
+ * clear LDT on the CPU either because that's called from __mmdrop() and
+ * the task which owned the mm is already dead. The context switch code has
+ * either cleared LDT or installed a new one.
+ */
+void destroy_context_ldt(struct mm_struct *mm)
+{
+	free_ldt_struct(mm->context.ldt);
+	mm->context.ldt = NULL;
+}
+
+/*
  * Touching the LDT entries with LAR makes sure that the CPU "caches" the
  * ACCESSED bit in the LDT entry which is already set when the entry is
  * stored.
@@ -193,54 +206,33 @@ void ldt_exit_user(struct pt_regs *regs)
 	ldt_touch_seg(regs->ss);
 }
 
-/*
- * No need to lock the MM as we are the last user
- *
- * 64bit: Don't touch the LDT register - we're already in the next thread.
- */
-void destroy_context_ldt(struct mm_struct *mm)
-{
-	free_ldt_struct(mm->context.ldt);
-	mm->context.ldt = NULL;
-}
-
-static int read_ldt(void __user *ptr, unsigned long bytecount)
+static int read_ldt(void __user *ptr, unsigned long nbytes)
 {
 	struct mm_struct *mm = current->mm;
-	unsigned long entries_size;
-	int retval;
+	struct ldt_struct *ldt;
+	unsigned long tocopy;
+	int ret = 0;
 
 	down_read(&mm->context.ldt_usr_sem);
 
-	if (!mm->context.ldt) {
-		retval = 0;
+	ldt = mm->context.ldt;
+	if (!ldt)
 		goto out_unlock;
-	}
 
-	if (bytecount > LDT_ENTRY_SIZE * LDT_ENTRIES)
-		bytecount = LDT_ENTRY_SIZE * LDT_ENTRIES;
+	if (nbytes > LDT_ENTRIES_MAP_SIZE)
+		nbytes = LDT_ENTRIES_MAP_SIZE;
 
-	entries_size = mm->context.ldt->nr_entries * LDT_ENTRY_SIZE;
-	if (entries_size > bytecount)
-		entries_size = bytecount;
-
-	if (copy_to_user(ptr, mm->context.ldt->entries, entries_size)) {
-		retval = -EFAULT;
+	ret = -EFAULT;
+	tocopy = min((unsigned long)ldt->nr_entries * LDT_ENTRY_SIZE, nbytes);
+	if (tocopy < nbytes && clear_user(ptr + tocopy, nbytes - tocopy))
 		goto out_unlock;
-	}
-
-	if (entries_size != bytecount) {
-		/* Zero-fill the rest and pretend we read bytecount bytes. */
-		if (clear_user(ptr + entries_size, bytecount - entries_size)) {
-			retval = -EFAULT;
-			goto out_unlock;
-		}
-	}
-	retval = bytecount;
 
+	if (copy_to_user(ptr, ldt->entries, tocopy))
+		goto out_unlock;
+	ret = nbytes;
 out_unlock:
 	up_read(&mm->context.ldt_usr_sem);
-	return retval;
+	return ret;
 }
 
 static int read_default_ldt(void __user *ptr, unsigned long bytecount)
@@ -317,7 +309,7 @@ static int write_ldt(void __user *ptr, u
 	new_ldt->entries[ldt_info.entry_number] = ldt;
 	finalize_ldt_struct(new_ldt);
 
-	install_ldt(mm, new_ldt);
+	ldt_install_mm(mm, new_ldt);
 	free_ldt_struct(old_ldt);
 	error = 0;
 

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: x86-ldt--Introduce-LDT-fault-handler.patch --]
[-- Type: text/plain, Size: 3970 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

When the LDT is mapped RO, the CPU will take a write fault the first time
it uses a segment descriptor, in order to set the ACCESSED bit (for some
reason it does not always observe that the bit is already preset). Catch
the fault and set the ACCESSED bit in the handler.
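
For reference, the error code which the handler keys on decodes as follows
(sketch using the usual X86_PF_* bit values):

/*
 * Page fault error code bits relevant here:
 *   X86_PF_PROT  (1 << 0): the page was present; protection violation
 *   X86_PF_WRITE (1 << 1): the faulting access was a write
 *   X86_PF_USER  (1 << 2): the access originated from user mode
 *
 * The accessed-bit update is an implicit, kernel-privileged write to a
 * present read-only mapping, so the error code is exactly
 * X86_PF_WRITE | X86_PF_PROT, with X86_PF_USER clear.
 */
static inline bool is_ldt_access_bit_ecode(unsigned long ecode)
{
	return ecode == (X86_PF_WRITE | X86_PF_PROT);
}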

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu_context.h |    7 +++++++
 arch/x86/kernel/ldt.c              |   30 ++++++++++++++++++++++++++++++
 arch/x86/mm/fault.c                |   19 +++++++++++++++++++
 3 files changed, 56 insertions(+)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -76,6 +76,11 @@ static inline void init_new_context_ldt(
 int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
 void ldt_exit_user(struct pt_regs *regs);
 void destroy_context_ldt(struct mm_struct *mm);
+bool __ldt_write_fault(unsigned long address);
+static inline bool ldt_is_active(struct mm_struct *mm)
+{
+	return mm && mm->context.ldt != NULL;
+}
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
 static inline void init_new_context_ldt(struct task_struct *task,
 					struct mm_struct *mm) { }
@@ -86,6 +91,8 @@ static inline int ldt_dup_context(struct
 }
 static inline void ldt_exit_user(struct pt_regs *regs) { }
 static inline void destroy_context_ldt(struct mm_struct *mm) { }
+static inline bool __ldt_write_fault(unsigned long address) { return false; }
+static inline bool ldt_is_active(struct mm_struct *mm)  { return false; }
 #endif
 
 static inline void load_mm_ldt(struct mm_struct *mm, struct task_struct *tsk)
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -82,6 +82,36 @@ static void ldt_install_mm(struct mm_str
 	mutex_unlock(&mm->context.lock);
 }
 
+/*
+ * ldt_write_fault() already checked whether there is an ldt installed in
+ * __do_page_fault(), so it's safe to access it here because interrupts are
+ * disabled and any ipi which would change it is blocked until this
+ * returns.  The underlying page mapping cannot change as long as the ldt
+ * is the active one in the context.
+ *
+ * The fault error code is X86_PF_WRITE | X86_PF_PROT and checked in
+ * __do_page_fault() already. This happens when a segment is selected and
+ * the CPU tries to set the accessed bit in desc_struct.type because the
+ * LDT entries are mapped RO. Set it manually.
+ */
+bool __ldt_write_fault(unsigned long address)
+{
+	struct ldt_struct *ldt = current->mm->context.ldt;
+	unsigned long start, end, entry;
+	struct desc_struct *desc;
+
+	start = (unsigned long) ldt->entries;
+	end = start + ldt->nr_entries * LDT_ENTRY_SIZE;
+
+	if (address < start || address >= end)
+		return false;
+
+	desc = (struct desc_struct *) ldt->entries;
+	entry = (address - start) / LDT_ENTRY_SIZE;
+	desc[entry].type |= 0x01;
+	return true;
+}
+
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
 static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 {
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1234,6 +1234,22 @@ static inline bool smap_violation(int er
 }
 
 /*
+ * Handles the case where the CPU fails to set the accessed bit in a LDT
+ * entry because the entries are mapped RO.
+ */
+static inline bool ldt_write_fault(unsigned long ecode, unsigned long address,
+				   struct pt_regs *regs)
+{
+	if (!IS_ENABLED(CONFIG_MODIFY_LDT_SYSCALL))
+		return false;
+	if (!ldt_is_active(current->mm))
+		return false;
+	if (ecode != (X86_PF_WRITE | X86_PF_PROT))
+		return false;
+	return __ldt_write_fault(address);
+}
+
+/*
  * This routine handles page faults.  It determines the address,
  * and the problem, and then passes it off to one of the appropriate
  * routines.
@@ -1305,6 +1321,9 @@ static noinline void
 	if (unlikely(kprobes_fault(regs)))
 		return;
 
+	if (unlikely(ldt_write_fault(error_code, address, regs)))
+		return;
+
 	if (unlikely(error_code & X86_PF_RSVD))
 		pgtable_bad(regs, error_code, address);
 

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [patch 14/16] x86/ldt: Prepare for VMA mapping
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: x86-ldt--Prepare-for-VMA-mapping.patch --]
[-- Type: text/plain, Size: 4992 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Implement the infrastructure to manage LDT information with backing
pages. Preparatory patch for the VMA based LDT mapping. Split out for ease
of review.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu.h         |    3 +
 arch/x86/include/asm/mmu_context.h |    9 ++-
 arch/x86/kernel/ldt.c              |  107 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 116 insertions(+), 3 deletions(-)

--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -7,6 +7,8 @@
 #include <linux/mutex.h>
 #include <linux/atomic.h>
 
+struct ldt_mapping;
+
 /*
  * x86 has arch-specific MMU state beyond what lives in mm_struct.
  */
@@ -29,6 +31,7 @@ typedef struct {
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct rw_semaphore	ldt_usr_sem;
+	struct ldt_mapping	*ldt_mapping;
 	struct ldt_struct	*ldt;
 #endif
 
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -42,6 +42,8 @@ static inline void load_mm_cr4(struct mm
 #include <asm/ldt.h>
 
 #define LDT_ENTRIES_MAP_SIZE	(LDT_ENTRIES * LDT_ENTRY_SIZE)
+#define LDT_ENTRIES_PAGES	(LDT_ENTRIES_MAP_SIZE / PAGE_SIZE)
+#define LDT_ENTRIES_PER_PAGE	(PAGE_SIZE / LDT_ENTRY_SIZE)
 
 /*
  * ldt_structs can be allocated, used, and freed, but they are never
@@ -54,8 +56,10 @@ struct ldt_struct {
 	 * call gates.  On native, we could merge the ldt_struct and LDT
 	 * allocations, but it's not worth trying to optimize.
 	 */
-	struct desc_struct *entries;
-	unsigned int nr_entries;
+	struct desc_struct	*entries;
+	struct page		*pages[LDT_ENTRIES_PAGES];
+	unsigned int		nr_entries;
+	unsigned int		pages_allocated;
 };
 
 /*
@@ -65,6 +69,7 @@ static inline void init_new_context_ldt(
 					struct mm_struct *mm)
 {
 	mm->context.ldt = NULL;
+	mm->context.ldt_mapping = NULL;
 	init_rwsem(&mm->context.ldt_usr_sem);
 	/*
 	 * Set the TIF flag unconditonally as in ldt_dup_context() the new
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -28,6 +28,11 @@
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
 
+struct ldt_mapping {
+	struct ldt_struct		ldts[2];
+	unsigned int			ldt_index;
+};
+
 /* After calling this, the LDT is immutable. */
 static void finalize_ldt_struct(struct ldt_struct *ldt)
 {
@@ -82,6 +87,97 @@ static void install_ldt(struct mm_struct
 	mutex_unlock(&mm->context.lock);
 }
 
+static void ldt_free_pages(struct ldt_struct *ldt)
+{
+	int i;
+
+	for (i = 0; i < ldt->pages_allocated; i++)
+		__free_page(ldt->pages[i]);
+}
+
+static void ldt_free_lmap(struct ldt_mapping *lmap)
+{
+	if (!lmap)
+		return;
+	ldt_free_pages(&lmap->ldts[0]);
+	ldt_free_pages(&lmap->ldts[1]);
+	kfree(lmap);
+}
+
+static int ldt_alloc_pages(struct ldt_struct *ldt, unsigned int nentries)
+{
+	unsigned int npages, idx;
+
+	npages = DIV_ROUND_UP(nentries * LDT_ENTRY_SIZE, PAGE_SIZE);
+
+	for (idx = ldt->pages_allocated; idx < npages; idx++) {
+		if (WARN_ON_ONCE(ldt->pages[idx]))
+			continue;
+
+		ldt->pages[idx] = alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!ldt->pages[idx])
+			return -ENOMEM;
+
+		ldt->pages_allocated++;
+	}
+	return 0;
+}
+
+static struct ldt_mapping *ldt_alloc_lmap(struct mm_struct *mm,
+					  unsigned int nentries)
+{
+	struct ldt_mapping *lmap = kzalloc(sizeof(*lmap), GFP_KERNEL);
+
+	if (!lmap)
+		return ERR_PTR(-ENOMEM);
+
+	if (ldt_alloc_pages(&lmap->ldts[0], nentries)) {
+		ldt_free_lmap(lmap);
+		return ERR_PTR(-ENOMEM);
+	}
+	return lmap;
+}
+
+static void ldt_set_entry(struct ldt_struct *ldt, struct desc_struct *ldtdesc,
+			  unsigned int offs)
+{
+	unsigned int dstidx;
+
+	offs *= LDT_ENTRY_SIZE;
+	dstidx = offs / PAGE_SIZE;
+	offs %= PAGE_SIZE;
+	memcpy(page_address(ldt->pages[dstidx]) + offs, ldtdesc,
+	       sizeof(*ldtdesc));
+}
+
+static void ldt_clone_entries(struct ldt_struct *dst, struct ldt_struct *src,
+			      unsigned int nent)
+{
+	unsigned long tocopy;
+	unsigned int i;
+
+	for (i = 0, tocopy = nent * LDT_ENTRY_SIZE; tocopy; i++) {
+		unsigned long n = min(PAGE_SIZE, tocopy);
+
+		memcpy(page_address(dst->pages[i]),
+		       page_address(src->pages[i]), n);
+		tocopy -= n;
+	}
+}
+
+static void cleanup_ldt_struct(struct ldt_struct *ldt)
+{
+	static struct desc_struct zero_desc;
+	unsigned int i;
+
+	if (!ldt)
+		return;
+	paravirt_free_ldt(ldt->entries, ldt->nr_entries);
+	for (i = 0; i < ldt->nr_entries; i++)
+		ldt_set_entry(ldt, &zero_desc, i);
+	ldt->nr_entries = 0;
+}
+
 /*
  * ldt_write_fault() already checked whether there is an ldt installed in
  * __do_page_fault(), so it's safe to access it here because interrupts are
@@ -202,8 +298,17 @@ int ldt_dup_context(struct mm_struct *ol
  */
 void destroy_context_ldt(struct mm_struct *mm)
 {
-	free_ldt_struct(mm->context.ldt);
+	struct ldt_mapping *lmap = mm->context.ldt_mapping;
+	struct ldt_struct *ldt = mm->context.ldt;
+
+	free_ldt_struct(ldt);
 	mm->context.ldt = NULL;
+
+	if (!lmap)
+		return;
+
+	mm->context.ldt_mapping = NULL;
+	ldt_free_lmap(lmap);
 }
 
 /*
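
For reference, a standalone sketch (plain user space C, assuming the usual
x86 values of 8 byte descriptors and 4K pages) of the entry-to-page
arithmetic that ldt_set_entry() relies on once the descriptors live in
individual backing pages:

#include <stdio.h>

#define LDT_ENTRY_SIZE	8	/* assuming the usual x86 descriptor size */
#define PAGE_SIZE	4096	/* assuming 4K pages */

/* Mirrors the index split done by ldt_set_entry(). */
static void ldt_entry_location(unsigned int entry,
			       unsigned int *page, unsigned int *offs)
{
	unsigned int byte_offs = entry * LDT_ENTRY_SIZE;

	*page = byte_offs / PAGE_SIZE;
	*offs = byte_offs % PAGE_SIZE;
}

int main(void)
{
	unsigned int page, offs;

	/* 513 * 8 = 4104 bytes -> second page, 8 bytes in */
	ldt_entry_location(513, &page, &offs);
	printf("entry 513 -> page %u, offset %u\n", page, offs);
	return 0;
}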

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [patch 15/16] x86/ldt: Add VMA management code
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: x86-ldt--Add-VMA-management-code.patch --]
[-- Type: text/plain, Size: 3645 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add the VMA management code to the LDT code, which allows installing the
LDT as a special mapping, like the VDSO and uprobes. The mapping is in the
user address space, but read only and without the user bit set. Split out
for ease of review.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/ldt.c |  103 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -31,6 +31,7 @@
 struct ldt_mapping {
 	struct ldt_struct		ldts[2];
 	unsigned int			ldt_index;
+	unsigned int			ldt_mapped;
 };
 
 /* After calling this, the LDT is immutable. */
@@ -208,6 +209,105 @@ bool __ldt_write_fault(unsigned long add
 	return true;
 }
 
+static int ldt_fault(const struct vm_special_mapping *sm,
+		     struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct ldt_mapping *lmap = vma->vm_mm->context.ldt_mapping;
+	struct ldt_struct *ldt = lmap->ldts;
+	pgoff_t pgo = vmf->pgoff;
+	struct page *page;
+
+	if (pgo >= LDT_ENTRIES_PAGES) {
+		pgo -= LDT_ENTRIES_PAGES;
+		ldt++;
+	}
+	if (pgo >= LDT_ENTRIES_PAGES)
+		return VM_FAULT_SIGBUS;
+
+	page = ldt->pages[pgo];
+	if (!page)
+		return VM_FAULT_SIGBUS;
+	get_page(page);
+	vmf->page = page;
+	return 0;
+}
+
+static int ldt_mremap(const struct vm_special_mapping *sm,
+		      struct vm_area_struct *new_vma)
+{
+	return -EINVAL;
+}
+
+static void ldt_close(const struct vm_special_mapping *sm,
+		      struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct ldt_struct *ldt;
+
+	/*
+	 * Orders against ldt_install().
+	 */
+	mutex_lock(&mm->context.lock);
+	ldt = mm->context.ldt;
+	ldt_install_mm(mm, NULL);
+	cleanup_ldt_struct(ldt);
+	mm->context.ldt_mapping->ldt_mapped = 0;
+	mutex_unlock(&mm->context.lock);
+}
+
+static const struct vm_special_mapping ldt_special_mapping = {
+	.name	= "[ldt]",
+	.fault	= ldt_fault,
+	.mremap	= ldt_mremap,
+	.close	= ldt_close,
+};
+
+static struct vm_area_struct *ldt_alloc_vma(struct mm_struct *mm,
+					    struct ldt_mapping *lmap)
+{
+	unsigned long vm_flags, size;
+	struct vm_area_struct *vma;
+	unsigned long addr;
+
+	size = 2 * LDT_ENTRIES_MAP_SIZE;
+	addr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, size, 0, 0);
+	if (IS_ERR_VALUE(addr))
+		return ERR_PTR(addr);
+
+	vm_flags = VM_READ | VM_LOCKED | VM_WIPEONFORK | VM_NOUSER | VM_SHARED;
+	vma = _install_special_mapping(mm, addr, size, vm_flags,
+				       &ldt_special_mapping);
+	if (IS_ERR(vma))
+		return vma;
+
+	lmap->ldts[0].entries = (struct desc_struct *) addr;
+	addr += LDT_ENTRIES_MAP_SIZE;
+	lmap->ldts[1].entries = (struct desc_struct *) addr;
+	return vma;
+}
+
+static int ldt_mmap(struct mm_struct *mm, struct ldt_mapping *lmap)
+{
+	struct vm_area_struct *vma;
+	int ret = 0;
+
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+	vma = ldt_alloc_vma(mm, lmap);
+	if (IS_ERR(vma)) {
+		ret = PTR_ERR(vma);
+	} else {
+		/*
+		 * The moment mmap_sem() is released munmap() can observe
+		 * the mapping and make it go away through ldt_close(). But
+		 * for now there is mapping.
+		 */
+		lmap->ldt_mapped = 1;
+	}
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
 static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 {
@@ -350,7 +450,8 @@ static int read_ldt(void __user *ptr, un
 
 	down_read(&mm->context.ldt_usr_sem);
 
-	ldt = mm->context.ldt;
+	/* Might race against vm_unmap, which installs a NULL LDT */
+	ldt = READ_ONCE(mm->context.ldt);
 	if (!ldt)
 		goto out_unlock;
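
A standalone sketch of how ldt_fault() above resolves a page offset within
the doubled mapping: the VMA covers two LDT slots back to back, so the first
LDT_ENTRIES_PAGES page offsets belong to ldts[0], the next LDT_ENTRIES_PAGES
to ldts[1], and anything beyond that is a SIGBUS. Assuming the usual x86
values of 8192 entries, 8 byte descriptors and 4K pages:

#include <stdbool.h>
#include <stdio.h>

#define LDT_ENTRIES		8192
#define LDT_ENTRY_SIZE		8
#define PAGE_SIZE		4096
#define LDT_ENTRIES_MAP_SIZE	(LDT_ENTRIES * LDT_ENTRY_SIZE)
#define LDT_ENTRIES_PAGES	(LDT_ENTRIES_MAP_SIZE / PAGE_SIZE)

/* Mirrors the pgoff handling in ldt_fault(). */
static bool resolve_pgoff(unsigned long pgoff, unsigned int *slot,
			  unsigned long *page_idx)
{
	*slot = 0;
	if (pgoff >= LDT_ENTRIES_PAGES) {
		pgoff -= LDT_ENTRIES_PAGES;
		*slot = 1;
	}
	if (pgoff >= LDT_ENTRIES_PAGES)
		return false;		/* ldt_fault() returns SIGBUS here */
	*page_idx = pgoff;
	return true;
}

int main(void)
{
	unsigned int slot;
	unsigned long page;

	/* LDT_ENTRIES_PAGES is 16 with the values above */
	if (resolve_pgoff(17, &slot, &page))
		printf("pgoff 17 -> ldts[%u], page %lu\n", slot, page);
	return 0;
}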
 

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [patch 16/16] x86/ldt: Make it read only VMA mapped
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 17:32   ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 17:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

[-- Attachment #1: x86-ldt--Make-it-VMA-mapped.patch --]
[-- Type: text/plain, Size: 11789 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Replace the existing LDT allocation and installation code with the new VMA
based mapping code. The mapping is exposed read only to user space so it is
accessible when the CPU executes in ring 3. The backing pages are not
accessed through a linear VA space, which avoids an extra alias mapping or
the allocation of higher order pages.

The special write fault handler and the touch mechanism on exit to user
space make sure that the expectations of the CPU are met.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/ldt.c |  282 +++++++++++++++++++++++++++++---------------------
 1 file changed, 165 insertions(+), 117 deletions(-)

--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -67,25 +67,49 @@ static void __ldt_install(void *__mm)
 
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
 	    !(current->flags & PF_KTHREAD)) {
-		unsigned int nentries = ldt ? ldt->nr_entries : 0;
-
-		set_ldt(ldt->entries, nentries);
-		refresh_ldt_segments();
-		set_tsk_thread_flag(current, TIF_LDT);
+		if (ldt) {
+			set_ldt(ldt->entries, ldt->nr_entries);
+			refresh_ldt_segments();
+			set_tsk_thread_flag(current, TIF_LDT);
+		} else {
+			set_ldt(NULL, 0);
+		}
 	}
 }
 
 static void ldt_install_mm(struct mm_struct *mm, struct ldt_struct *ldt)
 {
-	mutex_lock(&mm->context.lock);
+	lockdep_assert_held(&mm->context.lock);
 
 	/* Synchronizes with READ_ONCE in load_mm_ldt. */
 	smp_store_release(&mm->context.ldt, ldt);
 
 	/* Activate the LDT for all CPUs using currents mm. */
 	on_each_cpu_mask(mm_cpumask(mm), __ldt_install, mm, true);
+}
 
-	mutex_unlock(&mm->context.lock);
+static int ldt_populate(struct ldt_struct *ldt)
+{
+	unsigned long len, start = (unsigned long)ldt->entries;
+
+	len = round_up(ldt->nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
+	return __mm_populate(start, len, 0);
+}
+
+/* Install the new LDT after populating the user space mapping. */
+static int ldt_install(struct mm_struct *mm, struct ldt_struct *ldt)
+{
+	int ret = ldt ? ldt_populate(ldt) : 0;
+
+	if (!ret) {
+		mutex_lock(&mm->context.lock);
+		if (mm->context.ldt_mapping->ldt_mapped)
+			ldt_install_mm(mm, ldt);
+		else
+			ret = -EINVAL;
+		mutex_unlock(&mm->context.lock);
+	}
+	return ret;
 }
 
 static void ldt_free_pages(struct ldt_struct *ldt)
@@ -193,9 +217,11 @@ static void cleanup_ldt_struct(struct ld
  */
 bool __ldt_write_fault(unsigned long address)
 {
-	struct ldt_struct *ldt = current->mm->context.ldt;
+	struct ldt_mapping *lmap = current->mm->context.ldt_mapping;
+	struct ldt_struct *ldt = lmap->ldts;
 	unsigned long start, end, entry;
 	struct desc_struct *desc;
+	struct page *page;
 
 	start = (unsigned long) ldt->entries;
 	end = start + ldt->nr_entries * LDT_ENTRY_SIZE;
@@ -203,8 +229,12 @@ bool __ldt_write_fault(unsigned long add
 	if (address < start || address >= end)
 		return false;
 
-	desc = (struct desc_struct *) ldt->entries;
-	entry = (address - start) / LDT_ENTRY_SIZE;
+	page = ldt->pages[(address - start) / PAGE_SIZE];
+	if (!page)
+		return false;
+
+	desc = page_address(page);
+	entry = ((address - start) % PAGE_SIZE) / LDT_ENTRY_SIZE;
 	desc[entry].type |= 0x01;
 	return true;
 }
@@ -308,107 +338,69 @@ static int ldt_mmap(struct mm_struct *mm
 	return ret;
 }
 
-/* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
-static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
-{
-	struct ldt_struct *new_ldt;
-	unsigned int alloc_size;
-
-	if (num_entries > LDT_ENTRIES)
-		return NULL;
-
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
-	if (!new_ldt)
-		return NULL;
-
-	BUILD_BUG_ON(LDT_ENTRY_SIZE != sizeof(struct desc_struct));
-	alloc_size = num_entries * LDT_ENTRY_SIZE;
-
-	/*
-	 * Xen is very picky: it requires a page-aligned LDT that has no
-	 * trailing nonzero bytes in any page that contains LDT descriptors.
-	 * Keep it simple: zero the whole allocation and never allocate less
-	 * than PAGE_SIZE.
-	 */
-	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
-	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
-
-	if (!new_ldt->entries) {
-		kfree(new_ldt);
-		return NULL;
-	}
-
-	new_ldt->nr_entries = num_entries;
-	return new_ldt;
-}
-
-static void free_ldt_struct(struct ldt_struct *ldt)
-{
-	if (likely(!ldt))
-		return;
-
-	paravirt_free_ldt(ldt->entries, ldt->nr_entries);
-	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
-		vfree_atomic(ldt->entries);
-	else
-		free_page((unsigned long)ldt->entries);
-	kfree(ldt);
-}
-
 /*
  * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
  * the new task is not running, so nothing can be installed.
  */
 int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
 {
-	struct ldt_struct *new_ldt;
-	int retval = 0;
+	struct ldt_mapping *old_lmap, *lmap;
+	struct vm_area_struct *vma;
+	struct ldt_struct *old_ldt;
+	unsigned long addr, len;
+	int nentries, ret = 0;
 
 	if (!old_mm)
 		return 0;
 
 	mutex_lock(&old_mm->context.lock);
-	if (!old_mm->context.ldt)
+	old_lmap = old_mm->context.ldt_mapping;
+	if (!old_lmap || !old_mm->context.ldt)
 		goto out_unlock;
 
-	new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
-	if (!new_ldt) {
-		retval = -ENOMEM;
+	old_ldt = old_mm->context.ldt;
+	nentries = old_ldt->nr_entries;
+	if (!nentries)
 		goto out_unlock;
-	}
 
-	memcpy(new_ldt->entries, old_mm->context.ldt->entries,
-	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
-	finalize_ldt_struct(new_ldt);
-
-	mm->context.ldt = new_ldt;
+	lmap = ldt_alloc_lmap(mm, nentries);
+	if (IS_ERR(lmap)) {
+		ret = PTR_ERR(lmap);
+		goto out_unlock;
+	}
 
-out_unlock:
-	mutex_unlock(&old_mm->context.lock);
-	return retval;
-}
+	addr = (unsigned long)old_mm->context.ldt_mapping->ldts[0].entries;
+	vma = find_vma(mm, addr);
+	if (!vma)
+		goto out_lmap;
 
-/*
- * This can run unlocked because the mm is no longer in use. No need to
- * clear LDT on the CPU either because that's called from __mm_drop() and
- * the task which owned the mm is already dead. The context switch code has
- * either cleared LDT or installed a new one.
- */
-void destroy_context_ldt(struct mm_struct *mm)
-{
-	struct ldt_mapping *lmap = mm->context.ldt_mapping;
-	struct ldt_struct *ldt = mm->context.ldt;
+	mm->context.ldt_mapping = lmap;
+	/*
+	 * Copy the current settings over. Save the number of entries and
+	 * the data.
+	 */
+	lmap->ldts[0].entries = (struct desc_struct *)addr;
+	lmap->ldts[1].entries = (struct desc_struct *)(addr + LDT_ENTRIES_MAP_SIZE);
 
-	free_ldt_struct(ldt);
-	mm->context.ldt = NULL;
+	lmap->ldts[0].nr_entries = nentries;
+	ldt_clone_entries(&lmap->ldts[0], old_ldt, nentries);
 
-	if (!lmap)
-		return;
+	len = ALIGN(nentries * LDT_ENTRY_SIZE, PAGE_SIZE);
+	ret = populate_vma_page_range(vma, addr, addr + len, NULL);
+	if (ret != len / PAGE_SIZE)
+		goto out_lmap;
+	finalize_ldt_struct(&lmap->ldts[0]);
+	mm->context.ldt = &lmap->ldts[0];
+	ret = 0;
 
+out_unlock:
+	mutex_unlock(&old_mm->context.lock);
+	return ret;
+out_lmap:
 	mm->context.ldt_mapping = NULL;
+	mutex_unlock(&old_mm->context.lock);
 	ldt_free_lmap(lmap);
+	return -ENOMEM;
 }
 
 /*
@@ -441,12 +433,32 @@ void ldt_exit_user(struct pt_regs *regs)
 	ldt_touch_seg(regs->ss);
 }
 
+/*
+ * This can run unlocked because the mm is no longer in use. No need to
+ * clear LDT on the CPU either because that's called from __mm_drop() and
+ * the task which owned the mm is already dead. The context switch code has
+ * either cleared LDT or installed a new one.
+ */
+void destroy_context_ldt(struct mm_struct *mm)
+{
+	struct ldt_mapping *lmap = mm->context.ldt_mapping;
+	struct ldt_struct *ldt = mm->context.ldt;
+
+	if (!lmap)
+		return;
+	if (ldt)
+		paravirt_free_ldt(ldt->entries, ldt->nr_entries);
+	mm->context.ldt = NULL;
+	mm->context.ldt_mapping = NULL;
+	ldt_free_lmap(lmap);
+}
+
 static int read_ldt(void __user *ptr, unsigned long nbytes)
 {
 	struct mm_struct *mm = current->mm;
 	struct ldt_struct *ldt;
 	unsigned long tocopy;
-	int ret = 0;
+	int i, ret = 0;
 
 	down_read(&mm->context.ldt_usr_sem);
 
@@ -463,8 +475,14 @@ static int read_ldt(void __user *ptr, un
 	if (tocopy < nbytes && clear_user(ptr + tocopy, nbytes - tocopy))
 		goto out_unlock;
 
-	if (copy_to_user(ptr, ldt->entries, tocopy))
-		goto out_unlock;
+	for (i = 0; tocopy; i++) {
+		unsigned long n = min(PAGE_SIZE, tocopy);
+
+		if (copy_to_user(ptr, page_address(ldt->pages[i]), n))
+			goto out_unlock;
+		tocopy -= n;
+		ptr += n;
+	}
 	ret = nbytes;
 out_unlock:
 	up_read(&mm->context.ldt_usr_sem);
@@ -488,12 +506,13 @@ static int read_default_ldt(void __user
 
 static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 {
-	struct mm_struct *mm = current->mm;
 	struct ldt_struct *new_ldt, *old_ldt;
-	unsigned int old_nr_entries, new_nr_entries;
+	unsigned int nold, nentries, ldtidx;
+	struct mm_struct *mm = current->mm;
 	struct user_desc ldt_info;
-	struct desc_struct ldt;
-	int error;
+	struct ldt_mapping *lmap;
+	struct desc_struct entry;
+	int error, mapped;
 
 	error = -EINVAL;
 	if (bytecount != sizeof(ldt_info))
@@ -515,39 +534,68 @@ static int write_ldt(void __user *ptr, u
 	if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) ||
 	    LDT_empty(&ldt_info)) {
 		/* The user wants to clear the entry. */
-		memset(&ldt, 0, sizeof(ldt));
+		memset(&entry, 0, sizeof(entry));
 	} else {
-		if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
-			error = -EINVAL;
+		if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit)
 			goto out;
-		}
-
-		fill_ldt(&ldt, &ldt_info);
+		fill_ldt(&entry, &ldt_info);
 		if (oldmode)
-			ldt.avl = 0;
+			entry.avl = 0;
 	}
 
 	if (down_write_killable(&mm->context.ldt_usr_sem))
 		return -EINTR;
 
-	old_ldt       = mm->context.ldt;
-	old_nr_entries = old_ldt ? old_ldt->nr_entries : 0;
-	new_nr_entries = max(ldt_info.entry_number + 1, old_nr_entries);
-
-	error = -ENOMEM;
-	new_ldt = alloc_ldt_struct(new_nr_entries);
-	if (!new_ldt)
+	lmap = mm->context.ldt_mapping;
+	old_ldt = mm->context.ldt;
+	ldtidx = lmap ? lmap->ldt_index ^ 1 : 0;
+
+	if (!lmap) {
+		/* First invocation, install it. */
+		nentries = ldt_info.entry_number + 1;
+		lmap = ldt_alloc_lmap(mm, nentries);
+		if (IS_ERR(lmap)) {
+			error = PTR_ERR(lmap);
+			goto out_unlock;
+		}
+		mm->context.ldt_mapping = lmap;
+	}
+
+	/*
+	 * ldt_close() can clear lmap->ldt_mapped under context.lock, so
+	 * lmap->ldt_mapped needs to be read under that lock as well.
+	 *
+	 * If !mapped, try and establish the mapping; this code is fully
+	 * serialized under ldt_usr_sem. If the VMA vanishes after dropping
+	 * the lock, then ldt_install() will fail later on.
+	 */
+	mutex_lock(&mm->context.lock);
+	mapped = lmap->ldt_mapped;
+	mutex_unlock(&mm->context.lock);
+	if (!mapped) {
+		error = ldt_mmap(mm, lmap);
+		if (error)
+			goto out_unlock;
+	}
+
+	nold = old_ldt ? old_ldt->nr_entries : 0;
+	nentries = max(ldt_info.entry_number + 1, nold);
+	/* Select the new ldt and allocate pages if necessary */
+	new_ldt = &lmap->ldts[ldtidx];
+	error = ldt_alloc_pages(new_ldt, nentries);
+	if (error)
 		goto out_unlock;
 
-	if (old_ldt)
-		memcpy(new_ldt->entries, old_ldt->entries, old_nr_entries * LDT_ENTRY_SIZE);
+	if (nold)
+		ldt_clone_entries(new_ldt, old_ldt, nold);
 
-	new_ldt->entries[ldt_info.entry_number] = ldt;
+	ldt_set_entry(new_ldt, &entry, ldt_info.entry_number);
+	new_ldt->nr_entries = nentries;
+	lmap->ldt_index = ldtidx;
 	finalize_ldt_struct(new_ldt);
-
-	ldt_install_mm(mm, new_ldt);
-	free_ldt_struct(old_ldt);
-	error = 0;
+	/* Install the new LDT. Might fail due to vm_unmap() or ENOMEM */
+	error = ldt_install(mm, new_ldt);
+	cleanup_ldt_struct(error ? new_ldt : old_ldt);
 
 out_unlock:
 	up_write(&mm->context.ldt_usr_sem);
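
A standalone sketch (plain C, no kernel interfaces assumed) of the chunked
copy pattern that read_ldt() and ldt_clone_entries() use above: because the
descriptors now live in individual backing pages rather than one virtually
contiguous buffer, every copy walks the page array at most a page at a time.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096		/* assuming 4K pages */

/* Copy len bytes out of an array of page-sized source buffers. */
static void copy_from_pages(void *dst, void * const *pages, size_t len)
{
	char *out = dst;
	size_t i;

	for (i = 0; len; i++) {
		size_t n = len < PAGE_SIZE ? len : PAGE_SIZE;

		memcpy(out, pages[i], n);	/* one source page per step */
		out += n;
		len -= n;
	}
}

int main(void)
{
	char *pages[2] = { malloc(PAGE_SIZE), malloc(PAGE_SIZE) };
	char dst[PAGE_SIZE + 8];

	memset(pages[0], 'a', PAGE_SIZE);
	memset(pages[1], 'b', PAGE_SIZE);

	/* 4104 bytes span two source pages, like 513 LDT entries would. */
	copy_from_pages(dst, (void * const *)pages, sizeof(dst));
	printf("last byte: %c\n", dst[sizeof(dst) - 1]);

	free(pages[0]);
	free(pages[1]);
	return 0;
}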

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 10/16] x86/ldt: Do not install LDT for kernel threads
  2017-12-12 17:32   ` Thomas Gleixner
@ 2017-12-12 17:57     ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-12 17:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Kernel threads can use the mm of a user process temporarily via use_mm(),
> but there is no point in installing the LDT which is associated to that mm
> for the kernel thread.
>

I like this one.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 17:32   ` Thomas Gleixner
@ 2017-12-12 17:58     ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-12 17:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> When the LDT is mapped RO, the CPU will write fault the first time it uses
> a segment descriptor in order to set the ACCESS bit (for some reason it
> doesn't always observe that it already preset). Catch the fault and set the
> ACCESS bit in the handler.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/include/asm/mmu_context.h |    7 +++++++
>  arch/x86/kernel/ldt.c              |   30 ++++++++++++++++++++++++++++++
>  arch/x86/mm/fault.c                |   19 +++++++++++++++++++
>  3 files changed, 56 insertions(+)
>
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -76,6 +76,11 @@ static inline void init_new_context_ldt(
>  int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
>  void ldt_exit_user(struct pt_regs *regs);
>  void destroy_context_ldt(struct mm_struct *mm);
> +bool __ldt_write_fault(unsigned long address);
> +static inline bool ldt_is_active(struct mm_struct *mm)
> +{
> +       return mm && mm->context.ldt != NULL;
> +}
>  #else  /* CONFIG_MODIFY_LDT_SYSCALL */
>  static inline void init_new_context_ldt(struct task_struct *task,
>                                         struct mm_struct *mm) { }
> @@ -86,6 +91,8 @@ static inline int ldt_dup_context(struct
>  }
>  static inline void ldt_exit_user(struct pt_regs *regs) { }
>  static inline void destroy_context_ldt(struct mm_struct *mm) { }
> +static inline bool __ldt_write_fault(unsigned long address) { return false; }
> +static inline bool ldt_is_active(struct mm_struct *mm)  { return false; }
>  #endif
>
>  static inline void load_mm_ldt(struct mm_struct *mm, struct task_struct *tsk)
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -82,6 +82,36 @@ static void ldt_install_mm(struct mm_str
>         mutex_unlock(&mm->context.lock);
>  }
>
> +/*
> + * ldt_write_fault() already checked whether there is an ldt installed in
> + * __do_page_fault(), so it's safe to access it here because interrupts are
> + * disabled and any ipi which would change it is blocked until this
> + * returns.  The underlying page mapping cannot change as long as the ldt
> + * is the active one in the context.
> + *
> + * The fault error code is X86_PF_WRITE | X86_PF_PROT and checked in
> + * __do_page_fault() already. This happens when a segment is selected and
> + * the CPU tries to set the accessed bit in desc_struct.type because the
> + * LDT entries are mapped RO. Set it manually.
> + */
> +bool __ldt_write_fault(unsigned long address)
> +{
> +       struct ldt_struct *ldt = current->mm->context.ldt;
> +       unsigned long start, end, entry;
> +       struct desc_struct *desc;
> +
> +       start = (unsigned long) ldt->entries;
> +       end = start + ldt->nr_entries * LDT_ENTRY_SIZE;
> +
> +       if (address < start || address >= end)
> +               return false;
> +
> +       desc = (struct desc_struct *) ldt->entries;
> +       entry = (address - start) / LDT_ENTRY_SIZE;
> +       desc[entry].type |= 0x01;

You have another patch that unconditionally sets the accessed bit on
installation.  What gives?

Also, this patch is going to die a horrible death if IRET ever hits
this condition.  Or load gs.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-12 17:32   ` Thomas Gleixner
@ 2017-12-12 18:00     ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-12 18:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Peter Zijstra <peterz@infradead.org>
>
> In order to create VMAs that are not accessible to userspace create a new
> VM_NOUSER flag. This can be used in conjunction with
> install_special_mapping() to inject 'kernel' data into the userspace map.
>
> Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
> pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
> _PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.

How does this interact with get_user_pages(), etc?
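
[ A minimal sketch of the mechanism the changelog describes, assuming
  VM_NOUSER maps onto a spare VM flag bit; this is an illustration, not
  the actual hunk from patch 05/16: ]

	/* x86: protection bits to *clear* for VM_NOUSER mappings. */
	static inline pgprot_t arch_vm_get_page_prot_excl(unsigned long vm_flags)
	{
		return (vm_flags & VM_NOUSER) ? __pgprot(_PAGE_USER) : __pgprot(0);
	}

	/*
	 * vm_get_page_prot() would then mask these bits out of the final
	 * pgprot -- the mirror image of what arch_vm_get_page_prot() adds.
	 */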

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 17:32   ` Thomas Gleixner
@ 2017-12-12 18:03     ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-12 18:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> When mapping the LDT RO the hardware will typically generate write faults
> on first use. These faults can be trapped and the backing pages can be
> modified by the kernel.
>
> There is one exception; IRET will immediately load CS/SS and unrecoverably
> #GP. To avoid this issue access the LDT descriptors used by CS/SS before
> the IRET to userspace.
>
> For this use LAR, which is a safe operation in that it will happily consume
> an invalid LDT descriptor without traps. It gets the CPU to load the
> descriptor and observes the (preset) ACCESS bit.
>
> So far none of the obvious candidates like dosemu/wine/etc. do care about
> the ACCESS bit at all, so it should be rather safe to enforce it.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/entry/common.c            |    8 ++++-
>  arch/x86/include/asm/desc.h        |    2 +
>  arch/x86/include/asm/mmu_context.h |   53 +++++++++++++++++++++++--------------
>  arch/x86/include/asm/thread_info.h |    4 ++
>  arch/x86/kernel/cpu/common.c       |    4 +-
>  arch/x86/kernel/ldt.c              |   30 ++++++++++++++++++++
>  arch/x86/mm/tlb.c                  |    2 -
>  arch/x86/power/cpu.c               |    2 -
>  8 files changed, 78 insertions(+), 27 deletions(-)
>
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -30,6 +30,7 @@
>  #include <asm/vdso.h>
>  #include <linux/uaccess.h>
>  #include <asm/cpufeature.h>
> +#include <asm/mmu_context.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/syscalls.h>
> @@ -130,8 +131,8 @@ static long syscall_trace_enter(struct p
>         return ret ?: regs->orig_ax;
>  }
>
> -#define EXIT_TO_USERMODE_LOOP_FLAGS                            \
> -       (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |   \
> +#define EXIT_TO_USERMODE_LOOP_FLAGS                                    \
> +       (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | _TIF_LDT |\
>          _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)
>
>  static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
> @@ -171,6 +172,9 @@ static void exit_to_usermode_loop(struct
>                 /* Disable IRQs and retry */
>                 local_irq_disable();
>
> +               if (cached_flags & _TIF_LDT)
> +                       ldt_exit_user(regs);

Nope.  To the extent that this code actually does anything (which it
shouldn't since you already forced the access bit), it's racy against
flush_ldt() from another thread, and that race will be exploitable for
privilege escalation.  It needs to be outside the loopy part.

--Andy
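
[ For illustration, roughly what the LAR touch described above could look
  like; ldt_exit_user() appears in the diffstat, the body below is an
  assumption, not the code from this patch: ]

	/* Sketch: preload the CS/SS descriptors so IRET finds them cached. */
	static void __ldt_touch_user_selectors(struct pt_regs *regs)
	{
		unsigned short sels[2] = { (u16)regs->cs, (u16)regs->ss };
		unsigned int ar, i;

		for (i = 0; i < 2; i++) {
			/* Only selectors with the TI bit set live in the LDT. */
			if (!(sels[i] & 0x4))
				continue;
			/*
			 * LAR makes the CPU fetch the descriptor; it only sets
			 * or clears ZF on an invalid entry, it does not trap.
			 */
			asm volatile("lar %[sel], %[ar]"
				     : [ar] "=r" (ar)
				     : [sel] "r" ((unsigned int)sels[i])
				     : "cc");
		}
	}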

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 00/16] x86/ldt: Use a VMA based read only mapping
  2017-12-12 17:32 ` Thomas Gleixner
@ 2017-12-12 18:03   ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-12 18:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Peter and myself spent quite some time to figure out how to make CPUs cope
> with a RO mapped LDT.
>
> While the initial trick of writing the ACCESS bit in a special fault
> handler covers most cases, the tricky problem of CS/SS in return to user
> space (IRET ...) was giving us quite some headache.
>
> Peter finally found a way to do so. Touching the CS/SS selectors with LAR
> on the way out to user space makes it work w/o trouble.
>
> Contrary to the approach Andy was taking with storing the LDT in a special
> map area, the following series uses a special mapping which is mapped
> without the user bit and read only. This just ties the LDT to the process
> which is the most natural way to do it, removes the requirement for special
> pagetable code and works independent of pagetable isolation.
>
> This was tested on quite a range of Intel and AMD machines, but the test
> coverage on 32bit is quite meager. I'll resurrect a few dust bricks
> tomorrow.

I think it's neat that you got this working.  But it's like three
times the size of my patch, is *way* more intrusive, and isn't
obviously correct WRT IRET and load_gs_index().  So... how is it
better than my patch?

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-12 18:00     ` Andy Lutomirski
@ 2017-12-12 18:05       ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-12 18:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 10:00:08AM -0800, Andy Lutomirski wrote:
> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > In order to create VMAs that are not accessible to userspace create a new
> > VM_NOUSER flag. This can be used in conjunction with
> > install_special_mapping() to inject 'kernel' data into the userspace map.
> >
> > Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
> > pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
> > _PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.
> 
> How does this interact with get_user_pages(), etc?

gup would find the page. These patches do in fact rely on that through
the populate things.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-12 18:05       ` Peter Zijlstra
@ 2017-12-12 18:06         ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-12 18:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 10:05 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Dec 12, 2017 at 10:00:08AM -0800, Andy Lutomirski wrote:
>> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > From: Peter Zijlstra <peterz@infradead.org>
>> >
>> > In order to create VMAs that are not accessible to userspace create a new
>> > VM_NOUSER flag. This can be used in conjunction with
>> > install_special_mapping() to inject 'kernel' data into the userspace map.
>> >
>> > Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
>> > pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
>> > _PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.
>>
>> How does this interact with get_user_pages(), etc?
>
> gup would find the page. These patches do in fact rely on that through
> the populate things.
>

Blech.  So you can write(2) from the LDT to a file and you can even
sendfile it, perhaps.  What happens if it's get_user_page()'d when
modify_ldt() wants to free it?

This patch series scares the crap out of me.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 18:03     ` Andy Lutomirski
@ 2017-12-12 18:09       ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-12 18:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 10:03:02AM -0800, Andy Lutomirski wrote:
> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:

> > @@ -171,6 +172,9 @@ static void exit_to_usermode_loop(struct
> >                 /* Disable IRQs and retry */
> >                 local_irq_disable();
> >
> > +               if (cached_flags & _TIF_LDT)
> > +                       ldt_exit_user(regs);
> 
> Nope.  To the extent that this code actually does anything (which it
> shouldn't since you already forced the access bit),

Without this, even with the access bit set, IRET will go wobbly and
we'll #GP on the user-space side. Try it ;-)

> it's racy against
> flush_ldt() from another thread, and that race will be exploitable for
> privilege escalation.  It needs to be outside the loopy part.

The flush_ldt (__ldt_install after these patches) would re-set the TIF
flag. But sure, we can move this outside the loop I suppose.
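
[ A sketch of the 'outside the loopy part' placement, i.e. doing the touch
  once after the TIF work loop instead of inside it -- names as in the
  quoted entry/common.c hunk, only the placement is the point here: ]

	__visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
	{
		u32 cached_flags = READ_ONCE(current_thread_info()->flags);

		if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
			exit_to_usermode_loop(regs, cached_flags);

		/* Done exactly once, with IRQs disabled, right before IRET. */
		if (unlikely(test_thread_flag(TIF_LDT)))
			ldt_exit_user(regs);

		/* ... remaining exit work ... */
	}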

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 18:09       ` Peter Zijlstra
@ 2017-12-12 18:10         ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-12 18:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 10:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Dec 12, 2017 at 10:03:02AM -0800, Andy Lutomirski wrote:
>> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> > @@ -171,6 +172,9 @@ static void exit_to_usermode_loop(struct
>> >                 /* Disable IRQs and retry */
>> >                 local_irq_disable();
>> >
>> > +               if (cached_flags & _TIF_LDT)
>> > +                       ldt_exit_user(regs);
>>
>> Nope.  To the extent that this code actually does anything (which it
>> shouldn't since you already forced the access bit),
>
> Without this; even with the access bit set; IRET will go wobbly and
> we'll #GP on the user-space side. Try it ;-)

Maybe later.

But that means that we need Intel and AMD to confirm WTF is going on
before this blows up even with LAR on some other CPU.

>
>> it's racy against
>> flush_ldt() from another thread, and that race will be exploitable for
>> privilege escalation.  It needs to be outside the loopy part.
>
> The flush_ldt (__ldt_install after these patches) would re-set the TIF
> flag. But sure, we can move this outside the loop I suppose.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 17:58     ` Andy Lutomirski
@ 2017-12-12 18:19       ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-12 18:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 09:58:58AM -0800, Andy Lutomirski wrote:
> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:

> > +bool __ldt_write_fault(unsigned long address)
> > +{
> > +       struct ldt_struct *ldt = current->mm->context.ldt;
> > +       unsigned long start, end, entry;
> > +       struct desc_struct *desc;
> > +
> > +       start = (unsigned long) ldt->entries;
> > +       end = start + ldt->nr_entries * LDT_ENTRY_SIZE;
> > +
> > +       if (address < start || address >= end)
> > +               return false;
> > +
> > +       desc = (struct desc_struct *) ldt->entries;
> > +       entry = (address - start) / LDT_ENTRY_SIZE;
> > +       desc[entry].type |= 0x01;
> 
> You have another patch that unconditionally sets the accessed bit on
> installation.  What gives?

Right, initially we didn't set that unconditionally. But even when we
did, we still observed the CPU generating these write faults.

> Also, this patch is going to die a horrible death if IRET ever hits
> this condition.  Or load gs.

Touching the CS/SS descriptors with LAR should keep IRET from going off
the rails. I'm not familiar with the whole gs thing, but we could very
easily augment refresh_ldt_segments(), I suppose.

Would you care to be a little more specific and/or propose a testcase
for this situation?

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 18:10         ` Andy Lutomirski
@ 2017-12-12 18:22           ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-12 18:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm



> On Dec 12, 2017, at 10:10 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
>> On Tue, Dec 12, 2017 at 10:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Tue, Dec 12, 2017 at 10:03:02AM -0800, Andy Lutomirski wrote:
>>> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> 
>>>> @@ -171,6 +172,9 @@ static void exit_to_usermode_loop(struct
>>>>                /* Disable IRQs and retry */
>>>>                local_irq_disable();
>>>> 
>>>> +               if (cached_flags & _TIF_LDT)
>>>> +                       ldt_exit_user(regs);
>>> 
>>> Nope.  To the extent that this code actually does anything (which it
>>> shouldn't since you already forced the access bit),
>> 
>> Without this; even with the access bit set; IRET will go wobbly and
>> we'll #GP on the user-space side. Try it ;-)
> 
> Maybe later.
> 
> But that means that we need Intel and AMD to confirm WTF is going on
> before this blows up even with LAR on some other CPU.
> 
>> 
>>> it's racy against
>>> flush_ldt() from another thread, and that race will be exploitable for
>>> privilege escalation.  It needs to be outside the loopy part.
>> 
>> The flush_ldt (__ldt_install after these patches) would re-set the TIF
>> flag. But sure, we can move this outside the loop I suppose.

Also, why is LAR deferred to user exit?  And I thought that LAR didn't set the accessed bit.

If I had to guess, I'd guess that LAR is actually generating a read fault and forcing the pagetables to get populated.  If so, then it means the VMA code isn't quite right, or you're susceptible to failures under memory pressure.

Now maybe LAR will repopulate the PTE every time if you were to never clear it, but ick.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-12 18:06         ` Andy Lutomirski
@ 2017-12-12 18:25           ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-12 18:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 10:06:51AM -0800, Andy Lutomirski wrote:
> On Tue, Dec 12, 2017 at 10:05 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > gup would find the page. These patches do in fact rely on that through
> > the populate things.
> >
> 
> Blech.  So you can write(2) from the LDT to a file and you can even
> sendfile it, perhaps. 

Hmm, indeed.. I suppose I could go fix that. But how bad is it to leak
the LDT contents?

What would be far worse of course is if we could read(2) data into the
ldt, I'll look into that.

> What happens if it's get_user_page()'d when
> modify_ldt() wants to free it?

modify_ldt should never free pages, we only ever free pages when we
destroy the mm.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 18:22           ` Andy Lutomirski
@ 2017-12-12 18:29             ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-12 18:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 10:22:48AM -0800, Andy Lutomirski wrote:
> 
> Also, why is LAR deferred to user exit?  And I thought that LAR didn't
> set the accessed bit.

Indeed, LAR does not set the ACCESSED bit; we need to explicitly set that
when creating the descriptor.

It also works if you do the LAR right after LLDT (which is what I
originally had). The reason it's a TIF flag is that I originally LAR'ed
every entry in the table.

It got reduced to CS/SS, but the TIF thing stayed.

> If I had to guess, I'd guess that LAR is actually generating a read
> fault and forcing the pagetables to get populated.  If so, then it
> means the VMA code isn't quite right, or you're susceptible to
> failures under memory pressure.
> 
> Now maybe LAR will repopulate the PTE every time if you were to never
> clear it, but ick.

I did not observe #PFs from LAR, we had a giant pile of trace_printk()
in there.
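
[ The 'set it when creating the descriptor' part, as a sketch -- fill_ldt()
  is the existing helper from asm/desc.h; presetting the type bit is
  roughly what 'unconditionally set on installation' amounts to: ]

	struct desc_struct ldt_entry;

	fill_ldt(&ldt_entry, &ldt_info);   /* ldt_info: struct user_desc from modify_ldt() */
	ldt_entry.type |= 0x01;            /* segment ACCESSED bit, preset so the CPU
					      (usually) has nothing left to write back */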

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 18:29             ` Peter Zijlstra
@ 2017-12-12 18:41               ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Andy Lutomirski, LKML, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, 12 Dec 2017, Peter Zijlstra wrote:

> On Tue, Dec 12, 2017 at 10:22:48AM -0800, Andy Lutomirski wrote:
> > 
> > Also, why is LAR deferred to user exit?  And I thought that LAR didn't
> > set the accessed bit.
> 
> LAR does not set the ACCESSED bit indeed, we need to explicitly set that
> when creating the descriptor.
> 
> It also works if you do the LAR right after LLDT (which is what I
> originally had). The reason its a TIF flag is that I originally LAR'ed
> every entry in the table.
> 
> It got reduced to CS/SS, but the TIF thing stayed.
> 
> > If I had to guess, I'd guess that LAR is actually generating a read
> > fault and forcing the pagetables to get populated.  If so, then it
> > means the VMA code isn't quite right, or you're susceptible to
> > failures under memory pressure.
> > 
> > Now maybe LAR will repopulate the PTE every time if you were to never
> > clear it, but ick.
> 
> I did not observe #PFs from LAR, we had a giant pile of trace_printk()
> in there.

The pages are populated _before_ the new ldt is installed. So no memory
pressure issue, nothing. If the populate fails, then modify_ldt() returns
with an error.

Thanks,

	tglx
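
[ Sketch of the ordering described above; ldt_install_mm() is from the
  quoted ldt.c hunk, the populate helper below is a made-up placeholder: ]

	static int map_and_install_ldt(struct mm_struct *mm,
				       struct ldt_struct *new_ldt)
	{
		int ret;

		/* Populate the RO mapping backing the new LDT up front. */
		ret = ldt_populate_mapping(mm, new_ldt);  /* hypothetical helper */
		if (ret)
			return ret;	/* modify_ldt() propagates the error */

		/* Only a fully populated LDT ever becomes visible. */
		ldt_install_mm(mm, new_ldt);
		return 0;
	}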

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 18:19       ` Peter Zijlstra
@ 2017-12-12 18:43         ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 18:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, LKML, X86 ML, Linus Torvalds, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm

On Tue, 12 Dec 2017, Peter Zijlstra wrote:
> On Tue, Dec 12, 2017 at 09:58:58AM -0800, Andy Lutomirski wrote:
> > On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > > +bool __ldt_write_fault(unsigned long address)
> > > +{
> > > +       struct ldt_struct *ldt = current->mm->context.ldt;
> > > +       unsigned long start, end, entry;
> > > +       struct desc_struct *desc;
> > > +
> > > +       start = (unsigned long) ldt->entries;
> > > +       end = start + ldt->nr_entries * LDT_ENTRY_SIZE;
> > > +
> > > +       if (address < start || address >= end)
> > > +               return false;
> > > +
> > > +       desc = (struct desc_struct *) ldt->entries;
> > > +       entry = (address - start) / LDT_ENTRY_SIZE;
> > > +       desc[entry].type |= 0x01;
> > 
> > You have another patch that unconditionally sets the accessed bit on
> > installation.  What gives?
> 
> Right, initially we didn't set that unconditionally. But even when we
> did do that, we've observed the CPU generating these write faults.
> 
> > Also, this patch is going to die a horrible death if IRET ever hits
> > this condition.  Or load gs.
> 
> Us touching the CS/SS descriptors with LAR should avoid IRET going off
> the rails, I'm not familiar with the whole gs thing, but we could very
> easily augment refresh_ldt_segments() I suppose.
> 
> Would you care to be a little more specific and or propose a testcase
> for this situation?

Again: loading gs does not cause a fault at all, just like any other
segment load. The fault comes when the segment is accessed the first time
or via LAR.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 17:32   ` Thomas Gleixner
@ 2017-12-12 19:01     ` Linus Torvalds
  -1 siblings, 0 replies; 134+ messages in thread
From: Linus Torvalds @ 2017-12-12 19:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> When the LDT is mapped RO, the CPU will write fault the first time it uses
> a segment descriptor in order to set the ACCESS bit (for some reason it
doesn't always observe that it is already preset). Catch the fault and set the
> ACCESS bit in the handler.

This really scares me.

We use segments in some critical code in the kernel, like the whole
percpu data etc. Stuff that definitely shouldn't fault.

Yes, those segments should damn well be already marked accessed when
the segment is loaded, but apparently that isn't reliable.

So it potentially takes faults in random and very critical places.
It's probably dependent on microarchitecture on exactly when the
cached segment copy has the accessed bit set or not.

Also, I worry about crazy errata with TSS etc - this whole RO LDT
thing also introduces lots of possible new fault points in microcode
that nobody sane has ever done before, no?

> +       desc = (struct desc_struct *) ldt->entries;
> +       entry = (address - start) / LDT_ENTRY_SIZE;
> +       desc[entry].type |= 0x01;

This is also pretty disgusting.

Why isn't it just something like

      desc = (void *)(address & ~(LDT_ENTRY_SIZE-1));
      desc->type |= 0x01;

since the ldt should all be aligned anyway.

                Linus
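
[ With that folded in, the handler would look something like the sketch
  below -- bounds check kept, the divide replaced by the mask: ]

	bool __ldt_write_fault(unsigned long address)
	{
		struct ldt_struct *ldt = current->mm->context.ldt;
		unsigned long start = (unsigned long) ldt->entries;
		unsigned long end = start + ldt->nr_entries * LDT_ENTRY_SIZE;
		struct desc_struct *desc;

		if (address < start || address >= end)
			return false;

		/* Entries are LDT_ENTRY_SIZE aligned, so mask instead of divide. */
		desc = (struct desc_struct *)(address & ~(LDT_ENTRY_SIZE - 1));
		desc->type |= 0x01;
		return true;
	}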

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 18:41               ` Thomas Gleixner
@ 2017-12-12 19:04                 ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-12 19:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Andy Lutomirski, LKML, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 07:41:39PM +0100, Thomas Gleixner wrote:
> The pages are populated _before_ the new ldt is installed. So no memory
> pressure issue, nothing. If the populate fails, then modify_ldt() returns
> with an error.

Also, these pages are not visible to reclaim. They're not pagecache and
we didn't install a shrinker.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 17:32   ` Thomas Gleixner
@ 2017-12-12 19:05     ` Linus Torvalds
  -1 siblings, 0 replies; 134+ messages in thread
From: Linus Torvalds @ 2017-12-12 19:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> There is one exception; IRET will immediately load CS/SS and unrecoverably
> #GP. To avoid this issue access the LDT descriptors used by CS/SS before
> the IRET to userspace.

Ok, so the other patch made me nervous, this just makes me go "Hell no!".

This is exactly the kind of "now we get traps in random microcode
places that have never been tested" kind of thing that I was talking
about.

Why is the iret exception unrecoverable anyway? Does anybody even know?

                    Linus

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 19:01     ` Linus Torvalds
@ 2017-12-12 19:21       ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 19:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, the arch/x86 maintainers, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm

On Tue, 12 Dec 2017, Linus Torvalds wrote:

> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> >
> > When the LDT is mapped RO, the CPU will write fault the first time it uses
> > a segment descriptor in order to set the ACCESS bit (for some reason it
> > doesn't always observe that it is already preset). Catch the fault and set the
> > ACCESS bit in the handler.
> 
> This really scares me.
> 
> We use segments in some critical code in the kernel, like the whole
> percpu data etc. Stuff that definitely shouldn't fault.
> 
> Yes, those segments should damn well be already marked accessed when
> the segment is loaded, but apparently that isn't reliable.

That has nothing to do with the user installed LDT. The kernel does not use
and rely on LDT at all.

The only critical interaction is the return to user path (user CS/SS) and
we made sure with the LAR touching that these are precached in the CPU
before we go into fragile exit code. Luto has some concerns
vs. load_gs[_index] and we'll certainly look into that some more.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 134+ messages in thread
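
As an illustration of the LAR 'touch' described above, a minimal sketch of
poking the user CS/SS descriptors on the way out to user space (hypothetical
helper names and placement, not the actual patch code):

/*
 * Sketch: make the CPU read the LDT descriptors that the upcoming IRET
 * will need, while a fault can still be handled gracefully.
 */
static inline void touch_ldt_selector(unsigned short sel)
{
	unsigned long ar;

	if (!(sel & 0x4))	/* TI=0: GDT selector, nothing to do */
		return;

	/* LAR reads the descriptor; only the side effect matters here. */
	asm volatile("lar %[sel], %[ar]"
		     : [ar] "=r" (ar)
		     : [sel] "r" ((unsigned long)sel));
	(void)ar;
}

static inline void touch_user_cs_ss(struct pt_regs *regs)
{
	touch_ldt_selector(regs->cs);
	touch_ldt_selector(regs->ss);
}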

* Re: [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 19:05     ` Linus Torvalds
@ 2017-12-12 19:26       ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-12 19:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, LKML, the arch/x86 maintainers, Andy Lutomirsky,
	Peter Zijlstra, Dave Hansen, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm



> On Dec 12, 2017, at 11:05 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> 
>> There is one exception; IRET will immediately load CS/SS and unrecoverably
>> #GP. To avoid this issue access the LDT descriptors used by CS/SS before
>> the IRET to userspace.
> 
> Ok, so the other patch made me nervous, this just makes me go "Hell no!".
> 
> This is exactly the kind of "now we get traps in random microcode
> places that have never been tested" kind of thing that I was talking
> about.
> 
> Why is the iret exception unrecoverable anyway? Does anybody even know?
> 

Weird microcode shit aside, a fault on IRET will return to kernel code with kernel GS, and then the next time we enter the kernel we're backwards.  We could fix idtentry to get this right, but the code is already tangled enough.

This series is full of landmines, I think.  My latest patch set has a fully functional LDT with PTI on, and the only thing particularly scary about it is that it fiddles with page tables.  Other than that, there's no VMA magic, no RO magic, and no microcode magic.  And the LDT is still normal kernel memory, so we can ignore a whole pile of potential attacks. 

Also, how does it make any sense to have a cached descriptor that's not accessed?  Xen PV does weird LDT page fault shit, and it works, so I suspect we're just misunderstanding something.  The VMX spec kind of documents this...

>                    Linus

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 19:21       ` Thomas Gleixner
@ 2017-12-12 19:51         ` Linus Torvalds
  -1 siblings, 0 replies; 134+ messages in thread
From: Linus Torvalds @ 2017-12-12 19:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 11:21 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> That has nothing to do with the user installed LDT. The kernel does not use
> and rely on LDT at all.

Sure it does. We end up loading the selector for %gs and %fs, and
those selectors end up being connected with whatever user-mode has set
up for them.

We then set the FS/GS base pointer to a kernel-specific value, but
that is _separately_ from the actual accessed bit that is in the
selector.

So the kernel doesn't care, but the kernel definitely uses them.

                 Linus

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 19:21       ` Thomas Gleixner
@ 2017-12-12 20:21         ` Dave Hansen
  -1 siblings, 0 replies; 134+ messages in thread
From: Dave Hansen @ 2017-12-12 20:21 UTC (permalink / raw)
  To: Thomas Gleixner, Linus Torvalds
  Cc: LKML, the arch/x86 maintainers, Andy Lutomirsky, Peter Zijlstra,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, Liguori, Anthony, Will Deacon,
	linux-mm

On 12/12/2017 11:21 AM, Thomas Gleixner wrote:
> The only critical interaction is the return to user path (user CS/SS) and
> we made sure with the LAR touching that these are precached in the CPU
> before we go into fragile exit code.

How do we make sure that it _stays_ cached?

Surely there is weird stuff like WBINVD or SMI's that can come at very
inconvenient times and wipe it out of the cache.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 20:21         ` Dave Hansen
@ 2017-12-12 20:37           ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 20:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, LKML, the arch/x86 maintainers, Andy Lutomirsky,
	Peter Zijlstra, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm

On Tue, 12 Dec 2017, Dave Hansen wrote:

> On 12/12/2017 11:21 AM, Thomas Gleixner wrote:
> > The only critical interaction is the return to user path (user CS/SS) and
> > we made sure with the LAR touching that these are precached in the CPU
> > before we go into fragile exit code.
> 
> How do we make sure that it _stays_ cached?
> 
> Surely there is weird stuff like WBINVD or SMI's that can come at very
> inconvenient times and wipe it out of the cache.

This does not look like cache in the sense of memory cache. It seems to be
CPU internal state and I just stuffed WBINVD and alternatively CLFLUSH'ed
the entries after the 'touch' via LAR. Still works.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 134+ messages in thread
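
The experiment described above might look roughly like this on the exit path;
wbinvd() and clflush_cache_range() are existing kernel helpers, while
ldt_entry_addr() is a made-up stand-in for "address of that descriptor":

	touch_user_cs_ss(regs);		/* the LAR 'touch', see the earlier sketch */
	wbinvd();			/* variant 1: flush all caches */
	/* variant 2: clflush_cache_range(ldt_entry_addr(regs->ss), 8); */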

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 20:37           ` Thomas Gleixner
@ 2017-12-12 21:35             ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-12 21:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Linus Torvalds, LKML, the arch/x86 maintainers,
	Andy Lutomirsky, Peter Zijlstra, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 12:37 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 12 Dec 2017, Dave Hansen wrote:
>
>> On 12/12/2017 11:21 AM, Thomas Gleixner wrote:
>> > The only critical interaction is the return to user path (user CS/SS) and
>> > we made sure with the LAR touching that these are precached in the CPU
>> > before we go into fragile exit code.
>>
>> How do we make sure that it _stays_ cached?
>>
>> Surely there is weird stuff like WBINVD or SMI's that can come at very
>> inconvenient times and wipe it out of the cache.
>
> This does not look like cache in the sense of memory cache. It seems to be
> CPU internal state and I just stuffed WBINVD and alternatively CLFLUSH'ed
> the entries after the 'touch' via LAR. Still works.
>

There *must* be some weird bug in this series.  I find it very hard to
believe that x86 CPUs have a magic cache that caches any part of a
not-actually-in-a-segment-register descriptor entry.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 20:37           ` Thomas Gleixner
@ 2017-12-12 21:41             ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 21:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, LKML, the arch/x86 maintainers, Andy Lutomirsky,
	Peter Zijlstra, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm

On Tue, 12 Dec 2017, Thomas Gleixner wrote:
> On Tue, 12 Dec 2017, Dave Hansen wrote:
> 
> > On 12/12/2017 11:21 AM, Thomas Gleixner wrote:
> > > The only critical interaction is the return to user path (user CS/SS) and
> > > we made sure with the LAR touching that these are precached in the CPU
> > > before we go into fragile exit code.
> > 
> > How do we make sure that it _stays_ cached?
> > 
> > Surely there is weird stuff like WBINVD or SMI's that can come at very
> > inconvenient times and wipe it out of the cache.
> 
> This does not look like cache in the sense of memory cache. It seems to be
> CPU internal state and I just stuffed WBINVD and alternatively CLFLUSH'ed
> the entries after the 'touch' via LAR. Still works.

Dave pointed me once more to the following paragraph in the SDM, which
Peter and I looked at before and we tried that w/o success:

    If the segment descriptors in the GDT or an LDT are placed in ROM, the
    processor can enter an indefinite loop if software or the processor
    attempts to update (write to) the ROM-based segment descriptors. To
    prevent this problem, set the accessed bits for all segment descriptors
    placed in a ROM. Also, remove operating-system or executive code that
    attempts to modify segment descriptors located in ROM.

Now that made me go back to the state of the patch series which made us
make that magic 'touch' and write fault handler. The difference to the code
today is that it did not prepopulate the user visible mapping.

We added that later because we were worried about not being able to
populate it in the #PF due to memory pressure without ripping out the magic
cure again.

But I did now and actually removing both the user exit magic 'touch' code
and the write fault handler keeps it working.

Removing the prepopulate code makes it break again with a #GP in
IRET/SYSRET.

What happens there is that the IRET pops SS (with a minimal testcase) which
causes the #PF. That populates the PTE and returns happily. Right after
that the #GP comes in with IP pointing to the user space instruction right
after the syscall.

That simplifies and descaryfies that code massively.

Darn, I should have gone back and checked every part again as I usually do,
but my fried brain failed.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 134+ messages in thread
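
For reference, a minimal userspace testcase in the spirit of what is
mentioned above could look like the following. It is only a sketch along the
lines of tools/testing/selftests/x86/ldt_gdt.c, not the actual test, and it
loads the new LDT selector into %fs rather than exercising the SS/IRET path:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/ldt.h>

/* modify_ldt(2) has no glibc wrapper */
static int modify_ldt(int func, void *ptr, unsigned long bytecount)
{
	return syscall(SYS_modify_ldt, func, ptr, bytecount);
}

int main(void)
{
	struct user_desc desc;
	unsigned short sel;

	memset(&desc, 0, sizeof(desc));
	desc.entry_number   = 0;
	desc.base_addr      = 0;
	desc.limit          = 0xfffff;
	desc.seg_32bit      = 1;
	desc.contents       = 0;	/* read/write data segment */
	desc.limit_in_pages = 1;
	desc.useable        = 1;

	if (modify_ldt(0x11, &desc, sizeof(desc)) != 0) {
		perror("modify_ldt");
		return 1;
	}

	/* selector: index 0, TI=1 (LDT), RPL=3 */
	sel = (0 << 3) | 0x4 | 0x3;
	asm volatile("mov %0, %%fs" : : "r" (sel));

	printf("LDT selector %#x loaded into %%fs\n", (unsigned int)sel);
	return 0;
}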

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 21:35             ` Andy Lutomirski
@ 2017-12-12 21:42               ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 21:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Linus Torvalds, LKML, the arch/x86 maintainers,
	Peter Zijlstra, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm

On Tue, 12 Dec 2017, Andy Lutomirski wrote:

> On Tue, Dec 12, 2017 at 12:37 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Tue, 12 Dec 2017, Dave Hansen wrote:
> >
> >> On 12/12/2017 11:21 AM, Thomas Gleixner wrote:
> >> > The only critical interaction is the return to user path (user CS/SS) and
> >> > we made sure with the LAR touching that these are precached in the CPU
> >> > before we go into fragile exit code.
> >>
> >> How do we make sure that it _stays_ cached?
> >>
> >> Surely there is weird stuff like WBINVD or SMI's that can come at very
> >> inconvenient times and wipe it out of the cache.
> >
> > This does not look like cache in the sense of memory cache. It seems to be
> > CPU internal state and I just stuffed WBINVD and alternatively CLFLUSH'ed
> > the entries after the 'touch' via LAR. Still works.
> >
> 
> There *must* be some weird bug in this series.  I find it very hard to
> believe that x86 CPUs have a magic cache that caches any part of a
> not-actually-in-a-segment-register descriptor entry.

There is no bug in the code. There was just a bug in my brain which made me
fail to see the obvious. See the other mail.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 21:41             ` Thomas Gleixner
@ 2017-12-12 21:46               ` Thomas Gleixner
  -1 siblings, 0 replies; 134+ messages in thread
From: Thomas Gleixner @ 2017-12-12 21:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, LKML, the arch/x86 maintainers, Andy Lutomirsky,
	Peter Zijlstra, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm

On Tue, 12 Dec 2017, Thomas Gleixner wrote:

> On Tue, 12 Dec 2017, Thomas Gleixner wrote:
> > On Tue, 12 Dec 2017, Dave Hansen wrote:
> > 
> > > On 12/12/2017 11:21 AM, Thomas Gleixner wrote:
> > > > The only critical interaction is the return to user path (user CS/SS) and
> > > > we made sure with the LAR touching that these are precached in the CPU
> > > > before we go into fragile exit code.
> > > 
> > > How do we make sure that it _stays_ cached?
> > > 
> > > Surely there is weird stuff like WBINVD or SMI's that can come at very
> > > inconvenient times and wipe it out of the cache.
> > 
> > This does not look like cache in the sense of memory cache. It seems to be
> > CPU internal state and I just stuffed WBINVD and alternatively CLFLUSH'ed
> > the entries after the 'touch' via LAR. Still works.
> 
> Dave pointed me once more to the following paragraph in the SDM, which
> Peter and I looked at before and we tried that w/o success:
> 
>     If the segment descriptors in the GDT or an LDT are placed in ROM, the
>     processor can enter an indefinite loop if software or the processor
>     attempts to update (write to) the ROM-based segment descriptors. To
>     prevent this problem, set the accessed bits for all segment descriptors
>     placed in a ROM. Also, remove operating-system or executive code that
>     attempts to modify segment descriptors located in ROM.
> 
> Now that made me go back to the state of the patch series which made us
> make that magic 'touch' and write fault handler. The difference to the code
> today is that it did not prepopulate the user visible mapping.
> 
> We added that later because we were worried about not being able to
> populate it in the #PF due to memory pressure without ripping out the magic
> cure again.
> 
> But I did now and actually removing both the user exit magic 'touch' code
> and the write fault handler keeps it working.
> 
> Removing the prepopulate code makes it break again with a #GP in
> IRET/SYSRET.
> 
> What happens there is that the IRET pops SS (with a minimal testcase) which
> causes the #PF. That populates the PTE and returns happily. Right after
> that the #GP comes in with IP pointing to the user space instruction right
> after the syscall.
> 
> That simplifies and descaryfies that code massively.
> 
> Darn, I should have gone back and checked every part again as I usually do,
> but my fried brain failed.

The magic write ACCESS bit handler is a leftover from the early attempts
not to force ACCESS=1 when setting up the descriptor entry.

Bah. My patch stack history shows the 3 crossroads where I took the wrong
turn.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 134+ messages in thread
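
'Forcing ACCESS=1 when setting up the descriptor entry' boils down to
something like the following when a modify_ldt() user_desc is converted into
a hardware descriptor (a sketch; field and function names are illustrative,
not necessarily the real fill_ldt()):

static void fill_ldt_desc_sketch(struct desc_struct *desc,
				 const struct user_desc *info)
{
	/* base, limit, DPL=3, present bit etc. filled in as usual ... */
	desc->type  = (info->read_exec_only ^ 1) << 1;	/* readable/writable */
	desc->type |= info->contents << 2;		/* code/data, expand/conform */
	desc->type |= 1;				/* force ACCESSED=1 up front */
}

With the accessed bit always set at creation time, the CPU never has to
write into the read-only LDT mapping and the write fault handler becomes
unnecessary.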

* Re: [patch 13/16] x86/ldt: Introduce LDT write fault handler
  2017-12-12 21:41             ` Thomas Gleixner
@ 2017-12-12 22:25               ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-12 22:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Linus Torvalds, LKML, the arch/x86 maintainers,
	Andy Lutomirsky, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 10:41:03PM +0100, Thomas Gleixner wrote:
> Now that made me go back to the state of the patch series which made us
> make that magic 'touch' and write fault handler. The difference to the code
> today is that it did not prepopulate the user visible mapping.
> 
> We added that later because we were worried about not being able to
> populate it in the #PF due to memory pressure without ripping out the magic
> cure again.
> 
> But I did now and actually removing both the user exit magic 'touch' code
> and the write fault handler keeps it working.

Argh, had we really not tried that!? Bah.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-12 18:00     ` Andy Lutomirski
@ 2017-12-13 12:22       ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-13 12:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, aneesh.kumar

On Tue, Dec 12, 2017 at 10:00:08AM -0800, Andy Lutomirski wrote:
> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > In order to create VMAs that are not accessible to userspace create a new
> > VM_NOUSER flag. This can be used in conjunction with
> > install_special_mapping() to inject 'kernel' data into the userspace map.
> >
> > Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
> > pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
> > _PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.
> 
> How does this interact with get_user_pages(), etc?

So I went through that code and I think I found a bug related to this.

get_user_pages_fast() will ultimately end up doing
pte_access_permitted() before getting the page, follow_page OTOH does
not do this, which makes for a curious difference between the two.

So I'm thinking we want the below irrespective of the VM_NOUSER patch,
but with VM_NOUSER it would mean write(2) will no longer be able to
access the page.

diff --git a/mm/gup.c b/mm/gup.c
index dfcde13f289a..b852f37a2b0c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -153,6 +153,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	if (flags & FOLL_GET) {
+		if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
+			page = ERR_PTR(-EFAULT);
+			goto out;
+		}
+
 		get_page(page);
 
 		/* drop the pgmap reference now that we hold the page */

^ permalink raw reply related	[flat|nested] 134+ messages in thread
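
A rough sketch of the VM_NOUSER idea quoted above, i.e. how the x86 side of
such an arch_vm_get_page_prot_excl() hook could strip _PAGE_USER (details
are made up and may differ from the actual patch):

/* bits returned here are masked OUT of the VMA's page protection */
static inline pgprot_t arch_vm_get_page_prot_excl(unsigned long vm_flags)
{
	return __pgprot((vm_flags & VM_NOUSER) ? _PAGE_USER : 0);
}

/*
 * Usage sketch: the LDT VMA would then be created along the lines of
 *   _install_special_mapping(mm, addr, len,
 *                            VM_NOUSER | VM_READ | VM_DONTEXPAND, &spec);
 * which yields present PTEs without the user bit set, which is what makes
 * the get_user_pages()/follow_page() behaviour discussed here interesting.
 */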

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 12:22       ` Peter Zijlstra
@ 2017-12-13 12:57         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 134+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 12:57 UTC (permalink / raw)
  To: Peter Zijlstra, Dave Hansen
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, aneesh.kumar

On Wed, Dec 13, 2017 at 01:22:11PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 12, 2017 at 10:00:08AM -0800, Andy Lutomirski wrote:
> > On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > >
> > > In order to create VMAs that are not accessible to userspace create a new
> > > VM_NOUSER flag. This can be used in conjunction with
> > > install_special_mapping() to inject 'kernel' data into the userspace map.
> > >
> > > Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
> > > pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
> > > _PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.
> > 
> > How does this interact with get_user_pages(), etc?
> 
> So I went through that code and I think I found a bug related to this.
> 
> get_user_pages_fast() will ultimately end up doing
> pte_access_permitted() before getting the page, follow_page OTOH does
> not do this, which makes for a curious difference between the two.
> 
> So I'm thinking we want the below irrespective of the VM_NOUSER patch,
> but with VM_NOUSER it would mean write(2) will no longer be able to
> access the page.

Oh..

We do call pte_access_permitted(), but only for write access.
See can_follow_write_pte().

The issue seems bigger: we also need such calls for other page table levels :-/

Dave, what is effect of this on protection keys?

> 
> diff --git a/mm/gup.c b/mm/gup.c
> index dfcde13f289a..b852f37a2b0c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -153,6 +153,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	}
>  
>  	if (flags & FOLL_GET) {
> +		if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
> +			page = ERR_PTR(-EFAULT);
> +			goto out;
> +		}
> +
>  		get_page(page);
>  
>  		/* drop the pgmap reference now that we hold the page */
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 12:57         ` Kirill A. Shutemov
@ 2017-12-13 14:34           ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-13 14:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, LKML, X86 ML,
	Linus Torvalds, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, linux-mm, kirill.shutemov, aneesh.kumar

On Wed, Dec 13, 2017 at 03:57:40PM +0300, Kirill A. Shutemov wrote:
> On Wed, Dec 13, 2017 at 01:22:11PM +0100, Peter Zijlstra wrote:

> > get_user_pages_fast() will ultimately end up doing
> > pte_access_permitted() before getting the page, follow_page OTOH does
> > not do this, which makes for a curious difference between the two.
> > 
> > So I'm thinking we want the below irrespective of the VM_NOUSER patch,
> > but with VM_NOUSER it would mean write(2) will no longer be able to
> > access the page.
> 
> Oh..
> 
> We do call pte_access_permitted(), but only for write access.
> See can_follow_write_pte().

My can_follow_write_pte() looks like:

static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
{
	return pte_write(pte) ||
		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
}

am I perchance looking at the wrong tree?

> The issue seems bigger: we also need such calls for other page table levels :-/

Sure.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 14:34           ` Peter Zijlstra
@ 2017-12-13 14:43             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 134+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 14:43 UTC (permalink / raw)
  To: Peter Zijlstra, Dan Williams
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, LKML, X86 ML,
	Linus Torvalds, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, linux-mm, kirill.shutemov, aneesh.kumar

On Wed, Dec 13, 2017 at 03:34:55PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 13, 2017 at 03:57:40PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Dec 13, 2017 at 01:22:11PM +0100, Peter Zijlstra wrote:
> 
> > > get_user_pages_fast() will ultimately end up doing
> > > pte_access_permitted() before getting the page, follow_page OTOH does
> > > not do this, which makes for a curious difference between the two.
> > > 
> > > So I'm thinking we want the below irrespective of the VM_NOUSER patch,
> > > but with VM_NOUSER it would mean write(2) will no longer be able to
> > > access the page.
> > 
> > Oh..
> > 
> > We do call pte_access_permitted(), but only for write access.
> > See can_follow_write_pte().
> 
> My can_follow_write_pte() looks like:
> 
> static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> {
> 	return pte_write(pte) ||
> 		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> }
> 
> am I perchance looking at the wrong tree?

I'm looking at Linus' tree.

It was changed recently:
	5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup paths")

+Dan.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 14:43             ` Kirill A. Shutemov
@ 2017-12-13 15:00               ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-13 15:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	LKML, X86 ML, Linus Torvalds, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, aneesh.kumar

On Wed, Dec 13, 2017 at 05:43:39PM +0300, Kirill A. Shutemov wrote:
> > am I perchance looking at the wrong tree?
> 
> I'm looking at Linus' tree.

Clearly I'm not synced up enough... :/

> It was changed recently:
> 	5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup paths")
> 

Indeed. So FOLL_GET should also get these tests and, as you said, the
other levels too.

I would like FOLL_POPULATE (which doesn't have FOLL_GET) to still be
allowed 'access'.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 15:00               ` Peter Zijlstra
@ 2017-12-13 15:04                 ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-13 15:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	LKML, X86 ML, Linus Torvalds, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, aneesh.kumar

On Wed, Dec 13, 2017 at 04:00:07PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 13, 2017 at 05:43:39PM +0300, Kirill A. Shutemov wrote:
> > am I perchance looking at the wrong tree?
> > 
> > I'm looking at Linus' tree.
> 
> Clearly I'm not synced up enough... :/
> 
> > It was changed recently:
> > 	5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup paths")
> > 
> 
> Indeed. So FOLL_GET should also get these tests and, as you said, the
> other levels too.
> 
> I would like FOLL_POPULATE (doesn't have FOLL_GET) to be allowed
> 'access'.

Similarly, should we avoid arch_vma_access_permitted() if !FOLL_GET?

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 12:57         ` Kirill A. Shutemov
@ 2017-12-13 15:14           ` Dave Hansen
  -1 siblings, 0 replies; 134+ messages in thread
From: Dave Hansen @ 2017-12-13 15:14 UTC (permalink / raw)
  To: Kirill A. Shutemov, Peter Zijlstra
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon, linux-mm,
	kirill.shutemov, aneesh.kumar

On 12/13/2017 04:57 AM, Kirill A. Shutemov wrote:
> Dave, what is effect of this on protection keys?

The goal was to make pkeys-protected userspace memory access
_consistent_ with normal access.  Specifically, we want the kernel to
disallow access (or writes) to memory where the userspace mapping has a
pkey whose permissions conflict with the access.

For instance:

This will fault writing a byte to 'addr':

	char *addr = malloc(PAGE_SIZE);
	pkey_mprotect(addr, PAGE_SIZE, 13);
	pkey_deny_access(13);
	addr[0] = 'f';

But this will write one byte to addr successfully (if it uses the kernel
mapping of the physical page backing 'addr'):

	char *addr = malloc(PAGE_SIZE);
	pkey_mprotect(addr, PAGE_SIZE, 13);
	pkey_deny_access(13);
	read(fd, addr, 1);
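
For reference, pkey_deny_access() above is shorthand rather than a real
syscall; a minimal sketch of such a helper, assuming the glibc pkey_set()
wrapper and the PKEY_DISABLE_ACCESS constant (neither is part of this
patch set), might be:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>

	/* Sketch only: drop read and write rights for 'pkey' in this
	 * thread's PKRU register. */
	static void pkey_deny_access(int pkey)
	{
		if (pkey_set(pkey, PKEY_DISABLE_ACCESS))
			perror("pkey_set");
	}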

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 15:14           ` Dave Hansen
@ 2017-12-13 15:32             ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-13 15:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andy Lutomirski, Thomas Gleixner, LKML,
	X86 ML, Linus Torvalds, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, linux-mm, kirill.shutemov, aneesh.kumar

On Wed, Dec 13, 2017 at 07:14:41AM -0800, Dave Hansen wrote:
> On 12/13/2017 04:57 AM, Kirill A. Shutemov wrote:
> > Dave, what is effect of this on protection keys?
> 
> The goal was to make pkeys-protected userspace memory access
> _consistent_ with normal access.  Specifically, we want a kernel to
> disallow access (or writes) to memory where userspace mapping has a pkey
> whose permissions are in conflict with the access.
> 
> For instance:
> 
> This will fault writing a byte to 'addr':
> 
> 	char *addr = malloc(PAGE_SIZE);
> 	pkey_mprotect(addr, PAGE_SIZE, 13);
> 	pkey_deny_access(13);
> 	*addr[0] = 'f';
> 
> But this will write one byte to addr successfully (if it uses the kernel
> mapping of the physical page backing 'addr'):
> 
> 	char *addr = malloc(PAGE_SIZE);
> 	pkey_mprotect(addr, PAGE_SIZE, 13);
> 	pkey_deny_access(13);
> 	read(fd, addr, 1);
> 

This seems confused to me; why are these two cases different?

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 15:32             ` Peter Zijlstra
@ 2017-12-13 15:47               ` Dave Hansen
  -1 siblings, 0 replies; 134+ messages in thread
From: Dave Hansen @ 2017-12-13 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill A. Shutemov, Andy Lutomirski, Thomas Gleixner, LKML,
	X86 ML, Linus Torvalds, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, linux-mm, kirill.shutemov, aneesh.kumar

On 12/13/2017 07:32 AM, Peter Zijlstra wrote:
>> This will fault writing a byte to 'addr':
>>
>> 	char *addr = malloc(PAGE_SIZE);
>> 	pkey_mprotect(addr, PAGE_SIZE, 13);
>> 	pkey_deny_access(13);
>> 	addr[0] = 'f';
>>
>> But this will write one byte to addr successfully (if it uses the kernel
>> mapping of the physical page backing 'addr'):
>>
>> 	char *addr = malloc(PAGE_SIZE);
>> 	pkey_mprotect(addr, PAGE_SIZE, 13);
>> 	pkey_deny_access(13);
>> 	read(fd, addr, 1);
>>
> This seems confused to me; why are these two cases different?

Protection keys don't work in the kernel direct map, so if the read()
were implemented by writing to the direct map alias of 'addr', then it
would bypass protection keys.
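
To illustrate the alias (a conceptual sketch only, not how read() is
actually implemented, with 'page' and 'buf' standing in for the obvious
things): a kernel-side store through the direct map never consults PKRU,
whatever the pkey bits in the user PTE say.

	/* 'page' is the struct page backing the user mapping at 'addr'. */
	void *kaddr = page_address(page);		/* direct map alias */
	memcpy(kaddr + offset_in_page(addr), buf, 1);	/* pkeys not checked */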

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 15:47               ` Dave Hansen
@ 2017-12-13 15:54                 ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-13 15:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andy Lutomirski, Thomas Gleixner, LKML,
	X86 ML, Linus Torvalds, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, linux-mm, kirill.shutemov, aneesh.kumar

On Wed, Dec 13, 2017 at 07:47:46AM -0800, Dave Hansen wrote:
> On 12/13/2017 07:32 AM, Peter Zijlstra wrote:
> >> This will fault writing a byte to 'addr':
> >>
> >> 	char *addr = malloc(PAGE_SIZE);
> >> 	pkey_mprotect(addr, PAGE_SIZE, 13);
> >> 	pkey_deny_access(13);
> >> 	addr[0] = 'f';
> >>
> >> But this will write one byte to addr successfully (if it uses the kernel
> >> mapping of the physical page backing 'addr'):
> >>
> >> 	char *addr = malloc(PAGE_SIZE);
> >> 	pkey_mprotect(addr, PAGE_SIZE, 13);
> >> 	pkey_deny_access(13);
> >> 	read(fd, addr, 1);
> >>
> > This seems confused to me; why are these two cases different?
> 
> Protection keys don't work in the kernel direct map, so if the read()
> were implemented by writing to the direct map alias of 'addr', then it
> would bypass protection keys.

Which is why get_user_pages() _should_ enforce this.

What use are protection keys if you can trivially circumvent them?

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 15:54                 ` Peter Zijlstra
@ 2017-12-13 18:08                   ` Linus Torvalds
  -1 siblings, 0 replies; 134+ messages in thread
From: Linus Torvalds @ 2017-12-13 18:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Aneesh Kumar K. V

On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Which is why get_user_pages() _should_ enforce this.
>
> What use are protection keys if you can trivially circumvent them?

No, we will *not* worry about protection keys in get_user_pages().

They are not "security". They are a debug aid and safety against random mis-use.

In particular, they are very much *NOT* about "trivially circumvent
them". The user could just change their mapping thing, for chrissake!

We already allow access to PROT_NONE for gdb and friends, very much on purpose.

We're not going to make the VM more complex for something that
absolutely nobody cares about, and has zero security issues.

                        Linus

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 18:08                   ` Linus Torvalds
@ 2017-12-13 18:21                     ` Dave Hansen
  -1 siblings, 0 replies; 134+ messages in thread
From: Dave Hansen @ 2017-12-13 18:21 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra
  Cc: Kirill A. Shutemov, Andy Lutomirski, Thomas Gleixner, LKML,
	X86 ML, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, linux-mm, Kirill A. Shutemov, Aneesh Kumar K. V

On 12/13/2017 10:08 AM, Linus Torvalds wrote:
> On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> Which is why get_user_pages() _should_ enforce this.
>> 
>> What use are protection keys if you can trivially circumvent them?
> No, we will *not* worry about protection keys in get_user_pages().

We did introduce some support for it here:

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=33a709b25a760b91184bb335cf7d7c32b8123013

> They are not "security". They are a debug aid and safety against
> random mis-use.

Totally agree.  It's not about security.  As I mentioned in the commit,
the goal here was to try to make pkey-protected access behavior
consistent with mprotect().

I still think this was nice to do and probably surprises users less than
if we didn't have it.

> We already allow access to PROT_NONE for gdb and friends, very much on purpose.

Yup, exactly, and that's one of the reasons I tried to call those out as
"remote" accesses that are specifically not subject to protection keys.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 18:21                     ` Dave Hansen
@ 2017-12-13 18:23                       ` Linus Torvalds
  -1 siblings, 0 replies; 134+ messages in thread
From: Linus Torvalds @ 2017-12-13 18:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Kirill A. Shutemov, Andy Lutomirski,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Aneesh Kumar K. V

On Wed, Dec 13, 2017 at 10:21 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 12/13/2017 10:08 AM, Linus Torvalds wrote:
>> On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> Which is why get_user_pages() _should_ enforce this.
>>>
>>> What use are protection keys if you can trivially circumvent them?
>> No, we will *not* worry about protection keys in get_user_pages().
>
> We did introduce some support for it here:
>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=33a709b25a760b91184bb335cf7d7c32b8123013

Ugh. I never realized.

We should revert that, I feel. It's literally extra complexity for no
actual real gain, and there is a real downside: the extra complexity
that will cause people to get things wrong.

This thread about us getting it wrong is just the proof. I vote for
not trying to "fix" this case, let's just remove it.

                  Linus

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 18:08                   ` Linus Torvalds
@ 2017-12-13 18:31                     ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-13 18:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Aneesh Kumar K. V

On Wed, Dec 13, 2017 at 10:08 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> Which is why get_user_pages() _should_ enforce this.
>>
>> What use are protection keys if you can trivially circumvent them?
>
> No, we will *not* worry about protection keys in get_user_pages().
>

Hmm.  If I goof some pointer and pass that bogus pointer to read(2),
and I'm using a pkey to protect my mmapped database, I think I'd rather
that read(2) fail.  Sure, pkeys are trivially circumventable using
wrpkru or mprotect, but those are obviously dangerous functions.
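
To sketch the scenario (db_fd, log_fd, buf and len are made-up names, and
error handling is omitted):

	char *db = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, db_fd, 0);
	int pkey = pkey_alloc(0, 0);
	pkey_mprotect(db, len, PROT_READ | PROT_WRITE, pkey);
	pkey_set(pkey, PKEY_DISABLE_ACCESS);	/* "pkey_deny_access()" earlier */
	/* bug: 'buf' accidentally points into the protected db region;
	 * preferably this fails with EFAULT rather than scribbling on db. */
	read(log_fd, buf, 64);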

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 18:08                   ` Linus Torvalds
@ 2017-12-13 18:32                     ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-13 18:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Aneesh Kumar K. V

On Wed, Dec 13, 2017 at 10:08:30AM -0800, Linus Torvalds wrote:
> On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Which is why get_user_pages() _should_ enforce this.
> >
> > What use are protection keys if you can trivially circumvent them?
> 
> No, we will *not* worry about protection keys in get_user_pages().
> 
> They are not "security". They are a debug aid and safety against random mis-use.
> 
> In particular, they are very much *NOT* about "trivially circumvent
> them". The user could just change their mapping thing, for chrissake!
> 
> We already allow access to PROT_NONE for gdb and friends, very much on purpose.
> 
> We're not going to make the VM more complex for something that
> absolutely nobody cares about, and has zero security issues.

OK, that might have been my phrasing that was off -- mostly because I
was looking at it from the VM_NOUSER angle, but currently:

  - gup_pte_range() has pte_access_permitted()

  - follow_page_pte() has pte_access_permitted() for FOLL_WRITE only

All I'm saying is that that is inconsistent and we should change
follow_page_pte() to use pte_access_permitted() for FOLL_GET, such that
__get_user_pages_fast() and __get_user_pages() have matching semantics.

Now, if VM_NOUSER were to live, the above change would ensure write(2)
cannot read from such VMAs, where the existing test for FOLL_WRITE
already disallows read(2) from writing to them.

But even without VM_NOUSER it makes the VM more consistent than it is
today.
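
To make that concrete, a rough sketch of the follow_page_pte() change
being argued for -- hedged, since the eventual patch may well end up
looking different:

	/* In follow_page_pte(), before taking a reference on the page: */
	if ((flags & FOLL_GET) &&
	    !pte_access_permitted(pte, flags & FOLL_WRITE))
		return NULL;	/* mirror what gup_pte_range() already does */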

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 18:32                     ` Peter Zijlstra
@ 2017-12-13 18:35                       ` Linus Torvalds
  -1 siblings, 0 replies; 134+ messages in thread
From: Linus Torvalds @ 2017-12-13 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov, Aneesh Kumar K. V

On Wed, Dec 13, 2017 at 10:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Now, if VM_NOUSER were to live, the above change would ensure write(2)
> cannot read from such VMAs, where the existing test for FOLL_WRITE
> already disallows read(2) from writing to them.

So I don't mind at all the notion of disallowing access to some
special mappings at the vma level. So a VM_NOUSER flag that just
disallows get_user_pages entirely I'm ok with.

It's the protection keys in particular that I don't like having to
worry about. They are subtle and have odd architecture-specific
meaning, and need to be checked at all levels of the page table tree.

               Linus

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-12 17:32   ` Thomas Gleixner
@ 2017-12-13 21:50     ` Matthew Wilcox
  -1 siblings, 0 replies; 134+ messages in thread
From: Matthew Wilcox @ 2017-12-13 21:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Tue, Dec 12, 2017 at 06:32:26PM +0100, Thomas Gleixner wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> In order to create VMAs that are not accessible to userspace create a new
> VM_NOUSER flag. This can be used in conjunction with
> install_special_mapping() to inject 'kernel' data into the userspace map.

Maybe I misunderstand the intent behind this, but I was recently looking
at something kind of similar.  I was calling it VM_NOTLB and it wouldn't
put TLB entries into the userspace map at all.  The idea was to be able
to use the user address purely as a handle for specific kernel pages,
which were guaranteed to never be mapped into userspace, so we didn't
need to send TLB invalidations when we took those pages away from the user
process again.  But we'd be able to pass the address to read() or write().

So I was going to check the VMA flags in no_page_table() and return the
struct page that was not mapped there.  I didn't get as far as constructing
a prototype yet, and I'm not entirely sure I understand the purpose of
this patch, so perhaps there's no synergy here at all (and perhaps my
idea wouldn't have worked anyway).
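
Very roughly, and purely as a sketch of the idea above (VM_NOTLB and the
lookup helper below are hypothetical and don't exist anywhere):

	/* In no_page_table(): */
	if (vma->vm_flags & VM_NOTLB)			/* hypothetical flag   */
		return vma_notlb_page(vma, address);	/* hypothetical helper */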

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 21:50     ` Matthew Wilcox
@ 2017-12-13 22:12       ` Peter Zijlstra
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2017-12-13 22:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Thomas Gleixner, LKML, x86, Linus Torvalds, Andy Lutomirsky,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Wed, Dec 13, 2017 at 01:50:22PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 12, 2017 at 06:32:26PM +0100, Thomas Gleixner wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > In order to create VMAs that are not accessible to userspace create a new
> > VM_NOUSER flag. This can be used in conjunction with
> > install_special_mapping() to inject 'kernel' data into the userspace map.
> 
> Maybe I misunderstand the intent behind this, but I was recently looking
> at something kind of similar.  I was calling it VM_NOTLB and it wouldn't
> put TLB entries into the userspace map at all.  The idea was to be able
> to use the user address purely as a handle for specific kernel pages,
> which were guaranteed to never be mapped into userspace, so we didn't
> need to send TLB invalidations when we took those pages away from the user
> process again.  But we'd be able to pass the address to read() or write().
> 
> So I was going to check the VMA flags in no_page_table() and return the
> struct page that was notmapped there.  I didn't get as far as constructing
> a prototype yet, and I'm not entirely sure I understand the purpose of
> this patch, so perhaps there's no synergy here at all (and perhaps my
> idea wouldn't have worked anyway).

Yeah, completely different. This here actually needs the page table
entries. Currently we keep the LDT in kernel memory, but with PTI we
lose the entire kernel map.

Since the LDT is strictly per process, the idea was to actually inject
it into the userspace map. Except of course, userspace must not actually
be able to access it. So by mapping it !_PAGE_USER it's 'invisible'.

But the CPU very much needs the mapping, it will load the LDT entries
through them.
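
As a sketch of that 'invisible' mapping -- assumed protection bits, not
the actual ldt.c code -- the LDT pages would be installed in the process
page tables with a present, read-only, kernel-only protection:

	/* present + read-only; _PAGE_USER deliberately clear, and
	 * _PAGE_GLOBAL cleared because the mapping is per-mm. */
	pgprot_t ldt_prot = __pgprot(pgprot_val(PAGE_KERNEL_RO) & ~_PAGE_GLOBAL);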

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 22:12       ` Peter Zijlstra
@ 2017-12-14  0:10         ` Matthew Wilcox
  -1 siblings, 0 replies; 134+ messages in thread
From: Matthew Wilcox @ 2017-12-14  0:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, x86, Linus Torvalds, Andy Lutomirsky,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, linux-mm

On Wed, Dec 13, 2017 at 11:12:33PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 13, 2017 at 01:50:22PM -0800, Matthew Wilcox wrote:
> > On Tue, Dec 12, 2017 at 06:32:26PM +0100, Thomas Gleixner wrote:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > In order to create VMAs that are not accessible to userspace create a new
> > > VM_NOUSER flag. This can be used in conjunction with
> > > install_special_mapping() to inject 'kernel' data into the userspace map.
> > 
> > Maybe I misunderstand the intent behind this, but I was recently looking
> > at something kind of similar.  I was calling it VM_NOTLB and it wouldn't
> > put TLB entries into the userspace map at all.  The idea was to be able
> > to use the user address purely as a handle for specific kernel pages,
> > which were guaranteed to never be mapped into userspace, so we didn't
> > need to send TLB invalidations when we took those pages away from the user
> > process again.  But we'd be able to pass the address to read() or write().
> 
> Since the LDT is strictly per process, the idea was to actually inject
> it into the userspace map. Except of course, userspace must not actually
> be able to access it. So by mapping it !_PAGE_USER its 'invisible'.
> 
> But the CPU very much needs the mapping, it will load the LDT entries
> through them.

So can I use your VM_NOUSER VMAs for my purpose?  That is, can I change
the page table without flushing the TLB?  The only access to these PTEs
will be through the kernel mapping, so I don't see why I'd need to.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-14  0:10         ` Matthew Wilcox
@ 2017-12-14  0:16           ` Andy Lutomirski
  -1 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2017-12-14  0:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Andy Lutomirsky, Dave Hansen, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm

On Wed, Dec 13, 2017 at 4:10 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Dec 13, 2017 at 11:12:33PM +0100, Peter Zijlstra wrote:
>> On Wed, Dec 13, 2017 at 01:50:22PM -0800, Matthew Wilcox wrote:
>> > On Tue, Dec 12, 2017 at 06:32:26PM +0100, Thomas Gleixner wrote:
>> > > From: Peter Zijlstra <peterz@infradead.org>
>> > > In order to create VMAs that are not accessible to userspace create a new
>> > > VM_NOUSER flag. This can be used in conjunction with
>> > > install_special_mapping() to inject 'kernel' data into the userspace map.
>> >
>> > Maybe I misunderstand the intent behind this, but I was recently looking
>> > at something kind of similar.  I was calling it VM_NOTLB and it wouldn't
>> > put TLB entries into the userspace map at all.  The idea was to be able
>> > to use the user address purely as a handle for specific kernel pages,
>> > which were guaranteed to never be mapped into userspace, so we didn't
>> > need to send TLB invalidations when we took those pages away from the user
>> > process again.  But we'd be able to pass the address to read() or write().
>>
>> Since the LDT is strictly per process, the idea was to actually inject
>> it into the userspace map. Except of course, userspace must not actually
>> be able to access it. So by mapping it !_PAGE_USER its 'invisible'.
>>
>> But the CPU very much needs the mapping, it will load the LDT entries
>> through them.
>
> So can I use your VM_NOUSER VMAs for my purpose?  That is, can I change
> the page table without flushing the TLB?  The only access to these PTEs
> will be through the kernel mapping, so I don't see why I'd need to.

I doubt it, since if it's in the kernel pagetables at all, then the
mapping can be cached for kernel purposes.

But I still think this discussion is off in the weeds.  x86 does not
actually need any of this stuff.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [patch 05/16] mm: Allow special mappings with user access cleared
  2017-12-13 18:08                   ` Linus Torvalds
@ 2017-12-14  4:53                     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 134+ messages in thread
From: Aneesh Kumar K.V @ 2017-12-14  4:53 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Greg KH,
	Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, Liguori, Anthony, Will Deacon, linux-mm,
	Kirill A. Shutemov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> Which is why get_user_pages() _should_ enforce this.
>>
>> What use are protection keys if you can trivially circumvent them?
>
> No, we will *not* worry about protection keys in get_user_pages().
>
> They are not "security". They are a debug aid and safety against random mis-use.
>
> In particular, they are very much *NOT* about "trivially circumvent
> them". The user could just change their mapping thing, for chrissake!
>
> We already allow access to PROT_NONE for gdb and friends, very much on purpose.
>

Can you clarify this? We recently fixed read access on PROT_NONE via gup
for ppc64 here: https://lkml.kernel.org/r/20171204021912.25974-2-aneesh.kumar@linux.vnet.ibm.com

What is the expected behaviour of gup and get_user_pages() for
PROT_NONE?

Another issue is that we end up behaving differently for a PROT_NONE
mapping depending on whether autonuma is enabled or not, since
pte_protnone() returns true for a PROT_NONE mapping as well.
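
(For reference, the behaviour in question boils down to roughly the following
check in gup of this era; this is a simplified paraphrase with a made-up
helper name, not a verbatim copy of mm/gup.c:)

#include <linux/mm.h>

/*
 * Simplified paraphrase of the gup pte check: pte_protnone() is true both
 * for an explicit PROT_NONE mapping and for a NUMA-hinting-protected pte,
 * so what a PROT_NONE mapping "means" to gup depends on these flags.
 */
static bool gup_pte_followable(pte_t pte, unsigned int gup_flags)
{
	if (!pte_present(pte))
		return false;

	/*
	 * FOLL_NUMA is normally set unless FOLL_FORCE is used, so a
	 * protnone pte is not followed directly and gup takes the fault
	 * path instead.
	 */
	if ((gup_flags & FOLL_NUMA) && pte_protnone(pte))
		return false;

	return true;
}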

-aneesh

^ permalink raw reply	[flat|nested] 134+ messages in thread

* RE: [patch 11/16] x86/ldt: Force access bit for CS/SS
  2017-12-12 19:26       ` Andy Lutomirski
@ 2017-12-19 12:10         ` David Laight
  -1 siblings, 0 replies; 134+ messages in thread
From: David Laight @ 2017-12-19 12:10 UTC (permalink / raw)
  To: 'Andy Lutomirski', Linus Torvalds
  Cc: Thomas Gleixner, LKML, the arch/x86 maintainers, Andy Lutomirsky,
	Peter Zijlstra, Dave Hansen, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Boris Ostrovsky, Juergen Gross, Eduardo Valentin, Liguori,
	Anthony, Will Deacon, linux-mm

From: Andy Lutomirski
> Sent: 12 December 2017 19:27
...
> > Why is the iret exception unrecoverable anyway? Does anybody even know?
> >
> 
> Weird microcode shit aside, a fault on IRET will return to kernel code with kernel GS, and then the
> next time we enter the kernel we're backwards.  We could fix idtentry to get this right, but the code
> is already tangled enough.
...

Notwithstanding a readonly LDT, the iret (and the pop %ds, pop %es that probably
precede it) are all likely to fault in the kernel if the segment registers are invalid.
(Setting %fs and %gs for 32 bit processes is left to the reader.)

Unlike every other fault in the kernel code segment, gsbase will contain
the user value, not the kernel one.

The kernel code must detect this somehow and correct everything before (probably)
generating a SIGSEGV and returning to the user's signal handler with the
invalid segment registers in the signal context.

Assuming this won't happen (because the segment registers are always valid)
is likely to be a recipe for disaster (or an escalation).

I guess the problem with a readonly LDT is that you don't want to fault
setting the 'accessed' bit.
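
One way around that is to make sure the accessed bit is already set in every
descriptor that gets written into the LDT; roughly along these lines, purely
as an illustration of the idea and not of what this series actually does:

#include <asm/desc.h>

/*
 * Illustration only: if every code/data descriptor written into a
 * read-only LDT already has the accessed bit set, the CPU never needs to
 * write the descriptor back when a selector referencing it is loaded, so
 * no write fault can occur on the RO page.
 */
static void ldt_preset_accessed(struct desc_struct *ldt, unsigned int nr_entries)
{
	unsigned int i;

	for (i = 0; i < nr_entries; i++) {
		/* Only code/data descriptors (S=1) have an accessed bit. */
		if (ldt[i].s)
			ldt[i].type |= 1;	/* bit 0 of type is 'accessed' */
	}
}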

	David

^ permalink raw reply	[flat|nested] 134+ messages in thread

end of thread, other threads:[~2017-12-19 12:10 UTC | newest]

Thread overview: 134+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-12 17:32 [patch 00/16] x86/ldt: Use a VMA based read only mapping Thomas Gleixner
2017-12-12 17:32 ` Thomas Gleixner
2017-12-12 17:32 ` [patch 01/16] arch: Allow arch_dup_mmap() to fail Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 02/16] x86/ldt: Rework locking Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 03/16] x86/ldt: Prevent ldt inheritance on exec Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 04/16] mm/softdirty: Move VM_SOFTDIRTY into high bits Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 05/16] mm: Allow special mappings with user access cleared Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 18:00   ` Andy Lutomirski
2017-12-12 18:00     ` Andy Lutomirski
2017-12-12 18:05     ` Peter Zijlstra
2017-12-12 18:05       ` Peter Zijlstra
2017-12-12 18:06       ` Andy Lutomirski
2017-12-12 18:06         ` Andy Lutomirski
2017-12-12 18:25         ` Peter Zijlstra
2017-12-12 18:25           ` Peter Zijlstra
2017-12-13 12:22     ` Peter Zijlstra
2017-12-13 12:22       ` Peter Zijlstra
2017-12-13 12:57       ` Kirill A. Shutemov
2017-12-13 12:57         ` Kirill A. Shutemov
2017-12-13 14:34         ` Peter Zijlstra
2017-12-13 14:34           ` Peter Zijlstra
2017-12-13 14:43           ` Kirill A. Shutemov
2017-12-13 14:43             ` Kirill A. Shutemov
2017-12-13 15:00             ` Peter Zijlstra
2017-12-13 15:00               ` Peter Zijlstra
2017-12-13 15:04               ` Peter Zijlstra
2017-12-13 15:04                 ` Peter Zijlstra
2017-12-13 15:14         ` Dave Hansen
2017-12-13 15:14           ` Dave Hansen
2017-12-13 15:32           ` Peter Zijlstra
2017-12-13 15:32             ` Peter Zijlstra
2017-12-13 15:47             ` Dave Hansen
2017-12-13 15:47               ` Dave Hansen
2017-12-13 15:54               ` Peter Zijlstra
2017-12-13 15:54                 ` Peter Zijlstra
2017-12-13 18:08                 ` Linus Torvalds
2017-12-13 18:08                   ` Linus Torvalds
2017-12-13 18:21                   ` Dave Hansen
2017-12-13 18:21                     ` Dave Hansen
2017-12-13 18:23                     ` Linus Torvalds
2017-12-13 18:23                       ` Linus Torvalds
2017-12-13 18:31                   ` Andy Lutomirski
2017-12-13 18:31                     ` Andy Lutomirski
2017-12-13 18:32                   ` Peter Zijlstra
2017-12-13 18:32                     ` Peter Zijlstra
2017-12-13 18:35                     ` Linus Torvalds
2017-12-13 18:35                       ` Linus Torvalds
2017-12-14  4:53                   ` Aneesh Kumar K.V
2017-12-14  4:53                     ` Aneesh Kumar K.V
2017-12-13 21:50   ` Matthew Wilcox
2017-12-13 21:50     ` Matthew Wilcox
2017-12-13 22:12     ` Peter Zijlstra
2017-12-13 22:12       ` Peter Zijlstra
2017-12-14  0:10       ` Matthew Wilcox
2017-12-14  0:10         ` Matthew Wilcox
2017-12-14  0:16         ` Andy Lutomirski
2017-12-14  0:16           ` Andy Lutomirski
2017-12-12 17:32 ` [patch 06/16] mm: Provide vm_special_mapping::close Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 07/16] selftest/x86: Implement additional LDT selftests Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 08/16] selftests/x86/ldt_gdt: Prepare for access bit forced Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 09/16] mm: Make populate_vma_page_range() available Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 10/16] x86/ldt: Do not install LDT for kernel threads Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:57   ` Andy Lutomirski
2017-12-12 17:57     ` Andy Lutomirski
2017-12-12 17:32 ` [patch 11/16] x86/ldt: Force access bit for CS/SS Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 18:03   ` Andy Lutomirski
2017-12-12 18:03     ` Andy Lutomirski
2017-12-12 18:09     ` Peter Zijlstra
2017-12-12 18:09       ` Peter Zijlstra
2017-12-12 18:10       ` Andy Lutomirski
2017-12-12 18:10         ` Andy Lutomirski
2017-12-12 18:22         ` Andy Lutomirski
2017-12-12 18:22           ` Andy Lutomirski
2017-12-12 18:29           ` Peter Zijlstra
2017-12-12 18:29             ` Peter Zijlstra
2017-12-12 18:41             ` Thomas Gleixner
2017-12-12 18:41               ` Thomas Gleixner
2017-12-12 19:04               ` Peter Zijlstra
2017-12-12 19:04                 ` Peter Zijlstra
2017-12-12 19:05   ` Linus Torvalds
2017-12-12 19:05     ` Linus Torvalds
2017-12-12 19:26     ` Andy Lutomirski
2017-12-12 19:26       ` Andy Lutomirski
2017-12-19 12:10       ` David Laight
2017-12-19 12:10         ` David Laight
2017-12-12 17:32 ` [patch 12/16] x86/ldt: Reshuffle code Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 13/16] x86/ldt: Introduce LDT write fault handler Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:58   ` Andy Lutomirski
2017-12-12 17:58     ` Andy Lutomirski
2017-12-12 18:19     ` Peter Zijlstra
2017-12-12 18:19       ` Peter Zijlstra
2017-12-12 18:43       ` Thomas Gleixner
2017-12-12 18:43         ` Thomas Gleixner
2017-12-12 19:01   ` Linus Torvalds
2017-12-12 19:01     ` Linus Torvalds
2017-12-12 19:21     ` Thomas Gleixner
2017-12-12 19:21       ` Thomas Gleixner
2017-12-12 19:51       ` Linus Torvalds
2017-12-12 19:51         ` Linus Torvalds
2017-12-12 20:21       ` Dave Hansen
2017-12-12 20:21         ` Dave Hansen
2017-12-12 20:37         ` Thomas Gleixner
2017-12-12 20:37           ` Thomas Gleixner
2017-12-12 21:35           ` Andy Lutomirski
2017-12-12 21:35             ` Andy Lutomirski
2017-12-12 21:42             ` Thomas Gleixner
2017-12-12 21:42               ` Thomas Gleixner
2017-12-12 21:41           ` Thomas Gleixner
2017-12-12 21:41             ` Thomas Gleixner
2017-12-12 21:46             ` Thomas Gleixner
2017-12-12 21:46               ` Thomas Gleixner
2017-12-12 22:25             ` Peter Zijlstra
2017-12-12 22:25               ` Peter Zijlstra
2017-12-12 17:32 ` [patch 14/16] x86/ldt: Prepare for VMA mapping Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 15/16] x86/ldt: Add VMA management code Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 17:32 ` [patch 16/16] x86/ldt: Make it read only VMA mapped Thomas Gleixner
2017-12-12 17:32   ` Thomas Gleixner
2017-12-12 18:03 ` [patch 00/16] x86/ldt: Use a VMA based read only mapping Andy Lutomirski
2017-12-12 18:03   ` Andy Lutomirski
