linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/6] x86: text_poke() fixes
@ 2018-08-29  8:11 Nadav Amit
  2018-08-29  8:11 ` [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken Nadav Amit
                   ` (5 more replies)
  0 siblings, 6 replies; 34+ messages in thread
From: Nadav Amit @ 2018-08-29  8:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Nadav Amit, Andy Lutomirski, Masami Hiramatsu, Kees Cook,
	Peter Zijlstra

This patch-set addresses some issues that were raised in recent
correspondence and that might affect the security and correctness of
code patching. (Patching performance is not addressed by this
patch-set.)

The main issue the patches deal with is that the fixmap PTEs used for
patching are accessible from other cores and might be exploited. They
are not even flushed from the TLBs of remote cores, which makes the
risk even higher. Address this issue by introducing a temporary mm that
is only used during patching. Unfortunately, due to init ordering, the
fixmap is still used during boot-time patching. Future patches can
eliminate the need for it.

The second issue is a missing lockdep assertion to ensure text_mutex
is taken. It is actually not always taken, so fix the instances that
were found not to take the lock (although they should be safe even
without it).

Finally, try to be more conservative and to map a single page, instead
of two, when possible. This helps both security and performance.

In addition, there is some cleanup of the patching code to make it more
readable.
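
For reference, the post-boot patching flow that this series introduces
looks roughly like this (a condensed sketch of the single-page case;
the exact code is in patch 5):

	lockdep_assert_held(&text_mutex);		/* patch 1 */
	local_irq_save(flags);
	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
	set_pte_at(poking_mm, poking_addr, ptep,
		   mk_pte(pages[0], PAGE_KERNEL));
	prev = use_temporary_mm(poking_mm);		/* patch 2 */
	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
	pte_clear(poking_mm, poking_addr, ptep);
	__flush_tlb_one_user(poking_addr);		/* local flush only */
	unuse_temporary_mm(prev);
	pte_unmap_unlock(ptep, ptl);
	local_irq_restore(flags);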

[ Andy: please provide your SOB for your patch ]

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>

Andy Lutomirski (1):
  x86/mm: temporary mm struct

Nadav Amit (5):
  x86/alternative: assert text_mutex is taken
  fork: provide a function for copying init_mm
  x86/alternatives: initializing temporary mm for patching
  x86/alternatives: use temporary mm for text poking
  x86/alternatives: remove text_poke() return value

 arch/x86/include/asm/mmu_context.h   |  20 ++++
 arch/x86/include/asm/pgtable.h       |   4 +
 arch/x86/include/asm/text-patching.h |   4 +-
 arch/x86/kernel/alternative.c        | 157 +++++++++++++++++++++++----
 arch/x86/kernel/kgdb.c               |   9 ++
 arch/x86/mm/init_64.c                |  35 ++++++
 include/asm-generic/pgtable.h        |   4 +
 include/linux/sched/task.h           |   1 +
 init/main.c                          |   1 +
 kernel/fork.c                        |  24 +++-
 10 files changed, 230 insertions(+), 29 deletions(-)

-- 
2.17.1



* [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
  2018-08-29  8:11 [RFC PATCH 0/6] x86: text_poke() fixes Nadav Amit
@ 2018-08-29  8:11 ` Nadav Amit
  2018-08-29  8:59   ` Masami Hiramatsu
  2018-08-29  8:11 ` [RFC PATCH 2/6] x86/mm: temporary mm struct Nadav Amit
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2018-08-29  8:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Nadav Amit, Andy Lutomirski, Masami Hiramatsu, Kees Cook

Use lockdep to ensure that text_mutex is taken when text_poke() is
called.

Actually it is not always taken, specifically when it is called by kgdb,
so take the lock in these cases.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/alternative.c | 1 +
 arch/x86/kernel/kgdb.c        | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 014f214da581..916c11b410c4 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -699,6 +699,7 @@ void *text_poke(void *addr, const void *opcode, size_t len)
 	 * pages as they are not yet initialized.
 	 */
 	BUG_ON(!after_bootmem);
+	lockdep_assert_held(&text_mutex);
 
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 8e36f249646e..60b99c76086c 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -768,8 +768,12 @@ int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
 	 */
 	if (mutex_is_locked(&text_mutex))
 		return -EBUSY;
+
+	/* Take the mutex to avoid lockdep assertion failures. */
+	mutex_lock(&text_mutex);
 	text_poke((void *)bpt->bpt_addr, arch_kgdb_ops.gdb_bpt_instr,
 		  BREAK_INSTR_SIZE);
+	mutex_unlock(&text_mutex);
 	err = probe_kernel_read(opc, (char *)bpt->bpt_addr, BREAK_INSTR_SIZE);
 	if (err)
 		return err;
@@ -793,7 +797,12 @@ int kgdb_arch_remove_breakpoint(struct kgdb_bkpt *bpt)
 	 */
 	if (mutex_is_locked(&text_mutex))
 		goto knl_write;
+
+	/* Take the mutex to avoid lockdep assertion failures. */
+	mutex_lock(&text_mutex);
 	text_poke((void *)bpt->bpt_addr, bpt->saved_instr, BREAK_INSTR_SIZE);
+	mutex_unlock(&text_mutex);
+
 	err = probe_kernel_read(opc, (char *)bpt->bpt_addr, BREAK_INSTR_SIZE);
 	if (err || memcmp(opc, bpt->saved_instr, BREAK_INSTR_SIZE))
 		goto knl_write;
-- 
2.17.1



* [RFC PATCH 2/6] x86/mm: temporary mm struct
  2018-08-29  8:11 [RFC PATCH 0/6] x86: text_poke() fixes Nadav Amit
  2018-08-29  8:11 ` [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken Nadav Amit
@ 2018-08-29  8:11 ` Nadav Amit
  2018-08-29  9:49   ` Masami Hiramatsu
  2018-08-29 15:46   ` Andy Lutomirski
  2018-08-29  8:11 ` [RFC PATCH 3/6] fork: provide a function for copying init_mm Nadav Amit
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 34+ messages in thread
From: Nadav Amit @ 2018-08-29  8:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Andy Lutomirski, Masami Hiramatsu, Kees Cook, Peter Zijlstra,
	Nadav Amit

From: Andy Lutomirski <luto@kernel.org>

Sometimes we want to set temporary page-table entries (PTEs) on one of
the cores, without allowing other cores to use - even speculatively -
these mappings. There are two benefits to doing so:

(1) Security: if sensitive PTEs are set, a temporary mm prevents their
use on other cores. This hardens security, as it prevents exploiting a
dangling pointer to overwrite sensitive data using the sensitive PTE.

(2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
remote page-tables.

To do so, a temporary mm_struct can be used. Mappings which are private
to this mm can be set in the userspace part of the address-space.
During the whole time in which the temporary mm is loaded, interrupts
must be disabled.

The first use-case for temporary PTEs, which will follow, is for poking
the kernel text.
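
For illustration, the intended usage pattern looks like this (a sketch;
poking_mm here stands for an mm that a later patch in this series sets
up):

	temporary_mm_state_t prev;
	unsigned long flags;

	local_irq_save(flags);			/* the helpers assert this */
	prev = use_temporary_mm(poking_mm);	/* load the private mm */
	/* ... access mappings that exist only in poking_mm ... */
	unuse_temporary_mm(prev);		/* restore the previous mm */
	local_irq_restore(flags);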

[ Commit message was written by Nadav ]

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index eeeb9289c764..96afc8c0cf15 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
 	return cr3;
 }
 
+typedef struct {
+	struct mm_struct *prev;
+} temporary_mm_state_t;
+
+static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
+{
+	temporary_mm_state_t state;
+
+	lockdep_assert_irqs_disabled();
+	state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+	switch_mm_irqs_off(NULL, mm, current);
+	return state;
+}
+
+static inline void unuse_temporary_mm(temporary_mm_state_t prev)
+{
+	lockdep_assert_irqs_disabled();
+	switch_mm_irqs_off(NULL, prev.prev, current);
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
-- 
2.17.1



* [RFC PATCH 3/6] fork: provide a function for copying init_mm
  2018-08-29  8:11 [RFC PATCH 0/6] x86: text_poke() fixes Nadav Amit
  2018-08-29  8:11 ` [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken Nadav Amit
  2018-08-29  8:11 ` [RFC PATCH 2/6] x86/mm: temporary mm struct Nadav Amit
@ 2018-08-29  8:11 ` Nadav Amit
  2018-08-29  9:54   ` Masami Hiramatsu
  2018-08-29  8:11 ` [RFC PATCH 4/6] x86/alternatives: initializing temporary mm for patching Nadav Amit
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2018-08-29  8:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Nadav Amit, Andy Lutomirski, Masami Hiramatsu, Kees Cook,
	Peter Zijlstra

Provide a function for copying init_mm. This function will be later used
for setting a temporary mm.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 24 ++++++++++++++++++------
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 108ede99e533..ac0a675678f5 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -74,6 +74,7 @@ extern void exit_itimers(struct signal_struct *);
 extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
+struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
 extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index d896e9ca38b0..a1c637b903c1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1254,13 +1254,20 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 		complete_vfork_done(tsk);
 }
 
-/*
- * Allocate a new mm structure and copy contents from the
- * mm structure of the passed in task structure.
+/**
+ * dup_mm() - duplicates an existing mm structure
+ * @tsk: the task_struct with which the new mm will be associated.
+ * @oldmm: the mm to duplicate.
+ *
+ * Allocates a new mm structure and copy contents from the provided
+ * @oldmm structure.
+ *
+ * Return: the duplicated mm or NULL on failure.
  */
-static struct mm_struct *dup_mm(struct task_struct *tsk)
+static struct mm_struct *dup_mm(struct task_struct *tsk,
+				struct mm_struct *oldmm)
 {
-	struct mm_struct *mm, *oldmm = current->mm;
+	struct mm_struct *mm;
 	int err;
 
 	mm = allocate_mm();
@@ -1327,7 +1334,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	}
 
 	retval = -ENOMEM;
-	mm = dup_mm(tsk);
+	mm = dup_mm(tsk, current->mm);
 	if (!mm)
 		goto fail_nomem;
 
@@ -2127,6 +2134,11 @@ struct task_struct *fork_idle(int cpu)
 	return task;
 }
 
+struct mm_struct *copy_init_mm(void)
+{
+	return dup_mm(NULL, &init_mm);
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.17.1



* [RFC PATCH 4/6] x86/alternatives: initializing temporary mm for patching
  2018-08-29  8:11 [RFC PATCH 0/6] x86: text_poke() fixes Nadav Amit
                   ` (2 preceding siblings ...)
  2018-08-29  8:11 ` [RFC PATCH 3/6] fork: provide a function for copying init_mm Nadav Amit
@ 2018-08-29  8:11 ` Nadav Amit
  2018-08-29 13:21   ` Masami Hiramatsu
  2018-08-29  8:11 ` [RFC PATCH 5/6] x86/alternatives: use temporary mm for text poking Nadav Amit
  2018-08-29  8:11 ` [RFC PATCH 6/6] x86/alternatives: remove text_poke() return value Nadav Amit
  5 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2018-08-29  8:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Nadav Amit, Masami Hiramatsu, Kees Cook, Peter Zijlstra

To prevent improper use of the PTEs that are used for text patching, we
want to use a temporary mm struct. We initialize it by copying the init
mm.

The address that will be used for patching is taken from the lower area
that is usually used for task memory. Doing so avoids the need to
frequently synchronize the temporary mm (e.g., when BPF programs are
installed), since different PGDs are used for the task memory.

Finally, we randomize the address of the PTEs to harden against exploits
that use these PTEs.
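
For illustration, the PMD-boundary adjustment in the patch behaves as
follows (a sketch with made-up numbers, assuming 4KB pages and 2MB
PMDs):

	/*
	 * Suppose the randomized poking_addr lands one page below a PMD
	 * boundary, e.g. 0x5555555ff000. Then poking_addr + PAGE_SIZE ==
	 * 0x555555600000, which is 2MB-aligned, so the check below fires:
	 * the second poking page would otherwise fall into the next PMD.
	 * Bumping by one page keeps both pages under the same PMD, and
	 * reserving space for 3 pages keeps the bumped address in range.
	 */
	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
		poking_addr += PAGE_SIZE;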

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/pgtable.h       |  4 ++++
 arch/x86/include/asm/text-patching.h |  2 ++
 arch/x86/mm/init_64.c                | 35 ++++++++++++++++++++++++++++
 include/asm-generic/pgtable.h        |  4 ++++
 init/main.c                          |  1 +
 5 files changed, 46 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e4ffa565a69f..c65d2b146ff6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1022,6 +1022,10 @@ static inline void __meminit init_trampoline_default(void)
 	/* Default trampoline pgd value */
 	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
+
+void __init poking_init(void);
+#define poking_init poking_init
+
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
 # else
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..ffe7902cc326 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -38,5 +38,7 @@ extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
+extern __ro_after_init struct mm_struct *poking_mm;
+extern __ro_after_init unsigned long poking_addr;
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dd519f372169..ed4a46a89946 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -33,6 +33,7 @@
 #include <linux/nmi.h>
 #include <linux/gfp.h>
 #include <linux/kcore.h>
+#include <linux/sched/mm.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
@@ -54,6 +55,7 @@
 #include <asm/init.h>
 #include <asm/uv/uv.h>
 #include <asm/setup.h>
+#include <asm/text-patching.h>
 
 #include "mm_internal.h"
 
@@ -1389,6 +1391,39 @@ unsigned long memory_block_size_bytes(void)
 	return memory_block_size_probed;
 }
 
+/*
+ * Initialize an mm_struct to be used during poking and a pointer to be used
+ * during patching. If anything fails during initialization, poking will be done
+ * using the fixmap, which is unsafe, so warn the user about it.
+ */
+void __init poking_init(void)
+{
+	unsigned long poking_addr;
+
+	poking_mm = copy_init_mm();
+	if (!poking_mm)
+		goto error;
+
+	/*
+	 * Randomize the poking address, but make sure that the following page
+	 * will be mapped at the same PMD. We need 2 pages, so find space for 3,
+	 * and adjust the address if the PMD ends after the first one.
+	 */
+	poking_addr = TASK_UNMAPPED_BASE +
+		(kaslr_get_random_long("Poking") & PAGE_MASK) %
+		(TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
+
+	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
+		poking_addr += PAGE_SIZE;
+
+	return;
+error:
+	if (poking_mm)
+		mmput(poking_mm);
+	poking_mm = NULL;
+	pr_err("x86/mm: error setting a separate poking address space\n");
+}
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 88ebc6102c7c..c66579d0ee67 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1111,6 +1111,10 @@ static inline bool arch_has_pfn_modify_check(void)
 
 #ifndef PAGE_KERNEL_EXEC
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
+
+#ifndef poking_init
+static inline void poking_init(void) { }
+#endif
 #endif
 
 #endif /* !__ASSEMBLY__ */
diff --git a/init/main.c b/init/main.c
index 18f8f0140fa0..6754ff2687c8 100644
--- a/init/main.c
+++ b/init/main.c
@@ -725,6 +725,7 @@ asmlinkage __visible void __init start_kernel(void)
 	taskstats_init_early();
 	delayacct_init();
 
+	poking_init();
 	check_bugs();
 
 	acpi_subsystem_init();
-- 
2.17.1



* [RFC PATCH 5/6] x86/alternatives: use temporary mm for text poking
  2018-08-29  8:11 [RFC PATCH 0/6] x86: text_poke() fixes Nadav Amit
                   ` (3 preceding siblings ...)
  2018-08-29  8:11 ` [RFC PATCH 4/6] x86/alternatives: initializing temporary mm for patching Nadav Amit
@ 2018-08-29  8:11 ` Nadav Amit
  2018-08-29  9:28   ` Peter Zijlstra
  2018-08-29  8:11 ` [RFC PATCH 6/6] x86/alternatives: remove text_poke() return value Nadav Amit
  5 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2018-08-29  8:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Nadav Amit, Andy Lutomirski, Masami Hiramatsu, Kees Cook,
	Peter Zijlstra

text_poke() can potentially compromise security, as it sets temporary
PTEs in the fixmap. These PTEs might be used from other cores to
rewrite the kernel code, accidentally or maliciously, if an attacker
gains the ability to write to kernel memory.

Moreover, since remote TLBs are not flushed after the temporary PTEs are
removed, the time-window in which the code is writable is not limited if
the fixmap PTEs - maliciously or accidentally - are cached in the TLB.

To address these potential security hazards, we use a temporary mm for
patching the code. Unfortunately, the temporary mm cannot be
initialized early enough during boot, and as a result
x86_late_time_init() needs to use text_poke() before the temporary mm
is ready. text_poke() therefore keeps the two poking versions - using
the fixmap and using the temporary mm - and chooses between them
accordingly.

More adventurous developers can try to reorder the init sequence or use
text_poke_early() instead of text_poke() to remove the use of fixmap for
patching completely.

Finally, text_poke() is also not conservative enough when mapping pages,
as it always tries to map 2 pages, even when a single one is sufficient.
So try to be more conservative, and do not map more than needed.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/alternative.c | 154 +++++++++++++++++++++++++++++-----
 1 file changed, 133 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 916c11b410c4..0feac3dfabe9 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -11,6 +11,7 @@
 #include <linux/stop_machine.h>
 #include <linux/slab.h>
 #include <linux/kdebug.h>
+#include <linux/mmu_context.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -674,6 +675,113 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 	return addr;
 }
 
+/**
+ * text_poke_fixmap - poke using the fixmap.
+ *
+ * Fallback function for poking the text using the fixmap. It is used during
+ * early boot and in the rare case in which initialization of safe poking fails.
+ *
+ * Poking in this manner should be avoided, since it allows other cores to use
+ * the fixmap entries, and can be exploited by an attacker to overwrite the code
+ * (assuming he gained the write access through another bug).
+ */
+static void text_poke_fixmap(void *addr, const void *opcode, size_t len,
+			     struct page *pages[2])
+{
+	u8 *vaddr;
+
+	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
+	if (pages[1])
+		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
+	vaddr = (u8 *)fix_to_virt(FIX_TEXT_POKE0);
+	memcpy(vaddr + offset_in_page(addr), opcode, len);
+
+	/*
+	 * clear_fixmap() performs a TLB flush, so no additional TLB
+	 * flush is needed.
+	 */
+	clear_fixmap(FIX_TEXT_POKE0);
+	if (pages[1])
+		clear_fixmap(FIX_TEXT_POKE1);
+	sync_core();
+	/* Could also do a CLFLUSH here to speed up CPU recovery; but
+	   that causes hangs on some VIA CPUs. */
+}
+
+__ro_after_init struct mm_struct *poking_mm;
+__ro_after_init unsigned long poking_addr;
+
+/**
+ * text_poke_safe() - Pokes the text using a separate address space.
+ *
+ * This is the preferable way for patching the kernel after boot, as it does not
+ * allow other cores to accidentally or maliciously modify the code using the
+ * temporary PTEs.
+ */
+static void text_poke_safe(void *addr, const void *opcode, size_t len,
+			   struct page *pages[2])
+{
+	temporary_mm_state_t prev;
+	pte_t pte, *ptep;
+	spinlock_t *ptl;
+
+	/*
+	 * The lock is not really needed, but this allows to avoid open-coding.
+	 */
+	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
+
+	pte = mk_pte(pages[0], PAGE_KERNEL);
+	set_pte_at(poking_mm, poking_addr, ptep, pte);
+
+	if (pages[1]) {
+		pte = mk_pte(pages[1], PAGE_KERNEL);
+		set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
+	}
+
+	/*
+	 * Loading the temporary mm behaves as a compiler barrier, which
+	 * guarantees that the PTE will be set at the time memcpy() is done.
+	 */
+	prev = use_temporary_mm(poking_mm);
+
+	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
+
+	/*
+	 * Ensure that the PTE is only cleared after copying is done by using a
+	 * compiler barrier.
+	 */
+	barrier();
+
+	pte_clear(poking_mm, poking_addr, ptep);
+
+	/*
+	 * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
+	 * as it also flushes the corresponding "user" address spaces, which
+	 * does not exist.
+	 *
+	 * Poking, however, is already very inefficient since it does not try to
+	 * batch updates, so we ignore this problem for the time being.
+	 *
+	 * Since the PTEs do not exist in other kernel address-spaces, we do
+	 * not use __flush_tlb_one_kernel(), which when PTI is on would cause
+	 * more unwarranted TLB flushes.
+	 */
+	__flush_tlb_one_user(poking_addr);
+	if (pages[1]) {
+		pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);
+		__flush_tlb_one_user(poking_addr + PAGE_SIZE);
+	}
+
+	/*
+	 * Loading the previous page-table hierarchy requires a serializing
+	 * instruction that already allows the core to see the updated version.
+	 * Xen-PV is assumed to serialize execution in a similar manner.
+	 */
+	unuse_temporary_mm(prev);
+
+	pte_unmap_unlock(ptep, ptl);
+}
+
 /**
  * text_poke - Update instructions on a live kernel
  * @addr: address to modify
@@ -689,42 +797,46 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
  */
 void *text_poke(void *addr, const void *opcode, size_t len)
 {
+	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
+	struct page *pages[2] = {0};
 	unsigned long flags;
-	char *vaddr;
-	struct page *pages[2];
-	int i;
 
 	/*
-	 * While boot memory allocator is runnig we cannot use struct
-	 * pages as they are not yet initialized.
+	 * While boot memory allocator is running we cannot use struct pages as
+	 * they are not yet initialized.
 	 */
 	BUG_ON(!after_bootmem);
 	lockdep_assert_held(&text_mutex);
 
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
-		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
+		if (cross_page_boundary)
+			pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
 	} else {
 		pages[0] = virt_to_page(addr);
 		WARN_ON(!PageReserved(pages[0]));
-		pages[1] = virt_to_page(addr + PAGE_SIZE);
+		if (cross_page_boundary)
+			pages[1] = virt_to_page(addr + PAGE_SIZE);
 	}
 	BUG_ON(!pages[0]);
 	local_irq_save(flags);
-	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
-	if (pages[1])
-		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
-	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
-	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
-	clear_fixmap(FIX_TEXT_POKE0);
-	if (pages[1])
-		clear_fixmap(FIX_TEXT_POKE1);
-	local_flush_tlb();
-	sync_core();
-	/* Could also do a CLFLUSH here to speed up CPU recovery; but
-	   that causes hangs on some VIA CPUs. */
-	for (i = 0; i < len; i++)
-		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
+
+	/*
+	 * During initial boot, it is hard to initialize poking_mm due to
+	 * dependencies in boot order.
+	 */
+	if (poking_mm)
+		text_poke_safe(addr, opcode, len, pages);
+	else
+		text_poke_fixmap(addr, opcode, len, pages);
+
+	/*
+	 * To be on the safe side, do the comparison before enabling IRQs, as it
+	 * was done before. However, it makes more sense to allow the callers to
+	 * deal with potential failures and not to panic so easily.
+	 */
+	BUG_ON(memcmp(addr, opcode, len));
+
 	local_irq_restore(flags);
 	return addr;
 }
-- 
2.17.1



* [RFC PATCH 6/6] x86/alternatives: remove text_poke() return value
  2018-08-29  8:11 [RFC PATCH 0/6] x86: text_poke() fixes Nadav Amit
                   ` (4 preceding siblings ...)
  2018-08-29  8:11 ` [RFC PATCH 5/6] x86/alternatives: use temporary mm for text poking Nadav Amit
@ 2018-08-29  8:11 ` Nadav Amit
  2018-08-29  9:52   ` Masami Hiramatsu
  5 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2018-08-29  8:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Nadav Amit, Andy Lutomirski, Masami Hiramatsu, Kees Cook,
	Peter Zijlstra

The return value of text_poke() is meaningless - it is one of the
function inputs. One day someone may allow the callers to deal with
text_poke() failures, if those actually happen.

In the meantime, remove the return value.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/text-patching.h | 2 +-
 arch/x86/kernel/alternative.c        | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index ffe7902cc326..1f73f71b4de2 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -34,7 +34,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
  * On the local CPU you need to be protected again NMI or MCE handlers seeing an
  * inconsistent instruction while you patch.
  */
-extern void *text_poke(void *addr, const void *opcode, size_t len);
+extern void text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0feac3dfabe9..45b7fdeaed90 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -795,7 +795,7 @@ static void text_poke_safe(void *addr, const void *opcode, size_t len,
  *
  * Note: Must be called under text_mutex.
  */
-void *text_poke(void *addr, const void *opcode, size_t len)
+void text_poke(void *addr, const void *opcode, size_t len)
 {
 	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
 	struct page *pages[2] = {0};
-- 
2.17.1



* Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
  2018-08-29  8:11 ` [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken Nadav Amit
@ 2018-08-29  8:59   ` Masami Hiramatsu
  2018-08-29 17:11     ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2018-08-29  8:59 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, x86, Arnd Bergmann,
	linux-arch, Andy Lutomirski, Masami Hiramatsu, Kees Cook

On Wed, 29 Aug 2018 01:11:42 -0700
Nadav Amit <namit@vmware.com> wrote:

> Use lockdep to ensure that text_mutex is taken when text_poke() is
> called.
> 
> Actually it is not always taken, specifically when it is called by kgdb,
> so take the lock in these cases.

Can we really take a mutex in kgdb context?

kgdb_arch_remove_breakpoint
  <- dbg_deactivate_sw_breakpoints
    <- kgdb_reenter_check
       <- kgdb_handle_exception
          <- __kgdb_notify
            <- kgdb_ll_trap
              <- do_int3
            <- kgdb_notify
              <- die notifier

kgdb_arch_set_breakpoint
  <- dbg_activate_sw_breakpoints
    <- kgdb_reenter_check
       <- kgdb_handle_exception
           ...

Both seem to be called in exception context, so we cannot take a mutex
lock. I think kgdb needs a special path.
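
One possible shape for such a special path (a hypothetical sketch, not
part of this series) would be to split out the core of text_poke() and
give kgdb an entry point without the assertion, relying on kgdb having
already stopped the other CPUs:

	/* Core poking logic, shared by both entry points. */
	static void *__text_poke(void *addr, const void *opcode, size_t len);

	void *text_poke(void *addr, const void *opcode, size_t len)
	{
		lockdep_assert_held(&text_mutex);
		return __text_poke(addr, opcode, len);
	}

	/* For kgdb: no locking; all other CPUs are held by the debugger. */
	void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
	{
		return __text_poke(addr, opcode, len);
	}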

Thanks,

> 
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/kernel/alternative.c | 1 +
>  arch/x86/kernel/kgdb.c        | 9 +++++++++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 014f214da581..916c11b410c4 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -699,6 +699,7 @@ void *text_poke(void *addr, const void *opcode, size_t len)
>  	 * pages as they are not yet initialized.
>  	 */
>  	BUG_ON(!after_bootmem);
> +	lockdep_assert_held(&text_mutex);
>  
>  	if (!core_kernel_text((unsigned long)addr)) {
>  		pages[0] = vmalloc_to_page(addr);
> diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
> index 8e36f249646e..60b99c76086c 100644
> --- a/arch/x86/kernel/kgdb.c
> +++ b/arch/x86/kernel/kgdb.c
> @@ -768,8 +768,12 @@ int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
>  	 */
>  	if (mutex_is_locked(&text_mutex))
>  		return -EBUSY;
> +
> +	/* Take the mutex to avoid lockdep assertion failures. */
> +	mutex_lock(&text_mutex);
>  	text_poke((void *)bpt->bpt_addr, arch_kgdb_ops.gdb_bpt_instr,
>  		  BREAK_INSTR_SIZE);
> +	mutex_unlock(&text_mutex);
>  	err = probe_kernel_read(opc, (char *)bpt->bpt_addr, BREAK_INSTR_SIZE);
>  	if (err)
>  		return err;
> @@ -793,7 +797,12 @@ int kgdb_arch_remove_breakpoint(struct kgdb_bkpt *bpt)
>  	 */
>  	if (mutex_is_locked(&text_mutex))
>  		goto knl_write;
> +
> +	/* Take the mutex to avoid lockdep assertion failures. */
> +	mutex_lock(&text_mutex);
>  	text_poke((void *)bpt->bpt_addr, bpt->saved_instr, BREAK_INSTR_SIZE);
> +	mutex_unlock(&text_mutex);
> +
>  	err = probe_kernel_read(opc, (char *)bpt->bpt_addr, BREAK_INSTR_SIZE);
>  	if (err || memcmp(opc, bpt->saved_instr, BREAK_INSTR_SIZE))
>  		goto knl_write;
> -- 
> 2.17.1
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>


* Re: [RFC PATCH 5/6] x86/alternatives: use temporary mm for text poking
  2018-08-29  8:11 ` [RFC PATCH 5/6] x86/alternatives: use temporary mm for text poking Nadav Amit
@ 2018-08-29  9:28   ` Peter Zijlstra
  2018-08-29 15:46     ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2018-08-29  9:28 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, x86, Arnd Bergmann,
	linux-arch, Andy Lutomirski, Masami Hiramatsu, Kees Cook

On Wed, Aug 29, 2018 at 01:11:46AM -0700, Nadav Amit wrote:
> +static void text_poke_fixmap(void *addr, const void *opcode, size_t len,
> +			     struct page *pages[2])
> +{
> +	u8 *vaddr;
> +
> +	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
> +	if (pages[1])
> +		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
> +	vaddr = (u8 *)fix_to_virt(FIX_TEXT_POKE0);
> +	memcpy(vaddr + offset_in_page(addr), opcode, len);
> +
> +	/*
> +	 * clear_fixmap() performs a TLB flush, so no additional TLB
> +	 * flush is needed.
> +	 */
> +	clear_fixmap(FIX_TEXT_POKE0);
> +	if (pages[1])
> +		clear_fixmap(FIX_TEXT_POKE1);
> +	sync_core();
> +	/* Could also do a CLFLUSH here to speed up CPU recovery; but
> +	   that causes hangs on some VIA CPUs. */

Please take this opportunity to fix that comment style.

> +}
> +
> +__ro_after_init struct mm_struct *poking_mm;
> +__ro_after_init unsigned long poking_addr;
> +
> +/**
> + * text_poke_safe() - Pokes the text using a separate address space.
> + *
> + * This is the preferable way for patching the kernel after boot, as it does not
> + * allow other cores to accidentally or maliciously modify the code using the
> + * temporary PTEs.
> + */
> +static void text_poke_safe(void *addr, const void *opcode, size_t len,
> +			   struct page *pages[2])
> +{
> +	temporary_mm_state_t prev;
> +	pte_t pte, *ptep;
> +	spinlock_t *ptl;
> +
> +	/*
> +	 * The lock is not really needed, but this allows to avoid open-coding.
> +	 */
> +	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
> +
> +	pte = mk_pte(pages[0], PAGE_KERNEL);
> +	set_pte_at(poking_mm, poking_addr, ptep, pte);
> +
> +	if (pages[1]) {
> +		pte = mk_pte(pages[1], PAGE_KERNEL);
> +		set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
> +	}
> +
> +	/*
> +	 * Loading the temporary mm behaves as a compiler barrier, which
> +	 * guarantees that the PTE will be set at the time memcpy() is done.
> +	 */
> +	prev = use_temporary_mm(poking_mm);
> +
> +	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
> +
> +	/*
> +	 * Ensure that the PTE is only cleared after copying is done by using a
> +	 * compiler barrier.
> +	 */
> +	barrier();

I tripped over the use of 'done', because even with TSO the store isn't
done once the instruction retires.

All we want to ensure is that the pte_clear() store is issued after the
copy, and that is indeed guaranteed by this.
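
Put differently, a more precise comment might be (a suggestion):

	/*
	 * barrier() is a compiler barrier: it prevents the compiler from
	 * reordering the pte_clear() store before the memcpy() stores.
	 * x86's TSO model then keeps that store order globally visible.
	 */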

> +	pte_clear(poking_mm, poking_addr, ptep);
> +
> +	/*
> +	 * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
> +	 * as it also flushes the corresponding "user" address spaces, which
> +	 * does not exist.
> +	 *
> +	 * Poking, however, is already very inefficient since it does not try to
> +	 * batch updates, so we ignore this problem for the time being.
> +	 *
> +	 * Since the PTEs do not exist in other kernel address-spaces, we do
> +	 * not use __flush_tlb_one_kernel(), which when PTI is on would cause
> +	 * more unwarranted TLB flushes.
> +	 */

yuck :-), but yeah.

> +	__flush_tlb_one_user(poking_addr);
> +	if (pages[1]) {
> +		pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);
> +		__flush_tlb_one_user(poking_addr + PAGE_SIZE);
> +	}
> +	/*
> +	 * Loading the previous page-table hierarchy requires a serializing
> +	 * instruction that already allows the core to see the updated version.
> +	 * Xen-PV is assumed to serialize execution in a similar manner.
> +	 */
> +	unuse_temporary_mm(prev);
> +
> +	pte_unmap_unlock(ptep, ptl);
> +}


* Re: [RFC PATCH 2/6] x86/mm: temporary mm struct
  2018-08-29  8:11 ` [RFC PATCH 2/6] x86/mm: temporary mm struct Nadav Amit
@ 2018-08-29  9:49   ` Masami Hiramatsu
  2018-08-29 15:41     ` Andy Lutomirski
  2018-08-29 15:46   ` Andy Lutomirski
  1 sibling, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2018-08-29  9:49 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, x86, Arnd Bergmann,
	linux-arch, Andy Lutomirski, Masami Hiramatsu, Kees Cook,
	Peter Zijlstra

On Wed, 29 Aug 2018 01:11:43 -0700
Nadav Amit <namit@vmware.com> wrote:

> From: Andy Lutomirski <luto@kernel.org>
> 
> Sometimes we want to set temporary page-table entries (PTEs) on one of
> the cores, without allowing other cores to use - even speculatively -
> these mappings. There are two benefits to doing so:
> 
> (1) Security: if sensitive PTEs are set, a temporary mm prevents their
> use on other cores. This hardens security, as it prevents exploiting a
> dangling pointer to overwrite sensitive data using the sensitive PTE.
> 
> (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
> remote page-tables.
> 
> To do so, a temporary mm_struct can be used. Mappings which are private
> to this mm can be set in the userspace part of the address-space.
> During the whole time in which the temporary mm is loaded, interrupts
> must be disabled.
> 
> The first use-case for temporary PTEs, which will follow, is for poking
> the kernel text.
> 
> [ Commit message was written by Nadav ]
> 
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index eeeb9289c764..96afc8c0cf15 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
>  	return cr3;
>  }
>  
> +typedef struct {
> +	struct mm_struct *prev;
> +} temporary_mm_state_t;
> +
> +static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
> +{
> +	temporary_mm_state_t state;
> +
> +	lockdep_assert_irqs_disabled();
> +	state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> +	switch_mm_irqs_off(NULL, mm, current);
> +	return state;
> +}

Hmm, why don't we return mm_struct *prev directly?

Thank you,

> +
> +static inline void unuse_temporary_mm(temporary_mm_state_t prev)
> +{
> +	lockdep_assert_irqs_disabled();
> +	switch_mm_irqs_off(NULL, prev.prev, current);
> +}
> +
>  #endif /* _ASM_X86_MMU_CONTEXT_H */
> -- 
> 2.17.1
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>


* Re: [RFC PATCH 6/6] x86/alternatives: remove text_poke() return value
  2018-08-29  8:11 ` [RFC PATCH 6/6] x86/alternatives: remove text_poke() return value Nadav Amit
@ 2018-08-29  9:52   ` Masami Hiramatsu
  2018-08-29 17:15     ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2018-08-29  9:52 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, x86, Arnd Bergmann,
	linux-arch, Andy Lutomirski, Masami Hiramatsu, Kees Cook,
	Peter Zijlstra

On Wed, 29 Aug 2018 01:11:47 -0700
Nadav Amit <namit@vmware.com> wrote:

> The return value of text_poke() is meaningless - it is one of the
> function inputs. One day someone may allow the callers to deal with
> text_poke() failures, if those actually happen.
> 
> In the meantime, remove the return value.
> 
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/include/asm/text-patching.h | 2 +-
>  arch/x86/kernel/alternative.c        | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index ffe7902cc326..1f73f71b4de2 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -34,7 +34,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>   * On the local CPU you need to be protected again NMI or MCE handlers seeing an
>   * inconsistent instruction while you patch.
>   */
> -extern void *text_poke(void *addr, const void *opcode, size_t len);
> +extern void text_poke(void *addr, const void *opcode, size_t len);
>  extern int poke_int3_handler(struct pt_regs *regs);
>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>  extern int after_bootmem;
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 0feac3dfabe9..45b7fdeaed90 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -795,7 +795,7 @@ static void text_poke_safe(void *addr, const void *opcode, size_t len,
>   *
>   * Note: Must be called under text_mutex.
>   */
> -void *text_poke(void *addr, const void *opcode, size_t len)
> +void text_poke(void *addr, const void *opcode, size_t len)
>  {
>  	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
>  	struct page *pages[2] = {0};

Could you also remove "return addr;" in this patch?

Thank you,

> -- 
> 2.17.1
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>


* Re: [RFC PATCH 3/6] fork: provide a function for copying init_mm
  2018-08-29  8:11 ` [RFC PATCH 3/6] fork: provide a function for copying init_mm Nadav Amit
@ 2018-08-29  9:54   ` Masami Hiramatsu
  0 siblings, 0 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2018-08-29  9:54 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, x86, Arnd Bergmann,
	linux-arch, Andy Lutomirski, Masami Hiramatsu, Kees Cook,
	Peter Zijlstra

On Wed, 29 Aug 2018 01:11:44 -0700
Nadav Amit <namit@vmware.com> wrote:

> Provide a function for copying init_mm. This function will be later used
> for setting a temporary mm.

This looks good to me :)

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>

Thanks!

> 
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  include/linux/sched/task.h |  1 +
>  kernel/fork.c              | 24 ++++++++++++++++++------
>  2 files changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 108ede99e533..ac0a675678f5 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -74,6 +74,7 @@ extern void exit_itimers(struct signal_struct *);
>  extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
>  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>  struct task_struct *fork_idle(int);
> +struct mm_struct *copy_init_mm(void);
>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
>  extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d896e9ca38b0..a1c637b903c1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1254,13 +1254,20 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
>  		complete_vfork_done(tsk);
>  }
>  
> -/*
> - * Allocate a new mm structure and copy contents from the
> - * mm structure of the passed in task structure.
> +/**
> + * dup_mm() - duplicates an existing mm structure
> + * @tsk: the task_struct with which the new mm will be associated.
> + * @oldmm: the mm to duplicate.
> + *
> + * Allocates a new mm structure and copy contents from the provided
> + * @oldmm structure.
> + *
> + * Return: the duplicated mm or NULL on failure.
>   */
> -static struct mm_struct *dup_mm(struct task_struct *tsk)
> +static struct mm_struct *dup_mm(struct task_struct *tsk,
> +				struct mm_struct *oldmm)
>  {
> -	struct mm_struct *mm, *oldmm = current->mm;
> +	struct mm_struct *mm;
>  	int err;
>  
>  	mm = allocate_mm();
> @@ -1327,7 +1334,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>  	}
>  
>  	retval = -ENOMEM;
> -	mm = dup_mm(tsk);
> +	mm = dup_mm(tsk, current->mm);
>  	if (!mm)
>  		goto fail_nomem;
>  
> @@ -2127,6 +2134,11 @@ struct task_struct *fork_idle(int cpu)
>  	return task;
>  }
>  
> +struct mm_struct *copy_init_mm(void)
> +{
> +	return dup_mm(NULL, &init_mm);
> +}
> +
>  /*
>   *  Ok, this is the main fork-routine.
>   *
> -- 
> 2.17.1
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>


* Re: [RFC PATCH 4/6] x86/alternatives: initializing temporary mm for patching
  2018-08-29  8:11 ` [RFC PATCH 4/6] x86/alternatives: initializing temporary mm for patching Nadav Amit
@ 2018-08-29 13:21   ` Masami Hiramatsu
  2018-08-29 17:45     ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2018-08-29 13:21 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, x86, Arnd Bergmann,
	linux-arch, Masami Hiramatsu, Kees Cook, Peter Zijlstra

On Wed, 29 Aug 2018 01:11:45 -0700
Nadav Amit <namit@vmware.com> wrote:

> To prevent improper use of the PTEs that are used for text patching, we
> want to use a temporary mm struct. We initialize it by copying the init
> mm.
> 
> The address that will be used for patching is taken from the lower area
> that is usually used for task memory. Doing so avoids the need to
> frequently synchronize the temporary mm (e.g., when BPF programs are
> installed), since different PGDs are used for the task memory.
> 
> Finally, we randomize the address of the PTEs to harden against exploits
> that use these PTEs.
> 
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/include/asm/pgtable.h       |  4 ++++
>  arch/x86/include/asm/text-patching.h |  2 ++
>  arch/x86/mm/init_64.c                | 35 ++++++++++++++++++++++++++++
>  include/asm-generic/pgtable.h        |  4 ++++
>  init/main.c                          |  1 +
>  5 files changed, 46 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index e4ffa565a69f..c65d2b146ff6 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1022,6 +1022,10 @@ static inline void __meminit init_trampoline_default(void)
>  	/* Default trampoline pgd value */
>  	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
>  }
> +
> +void __init poking_init(void);
> +#define poking_init poking_init

Would we need this macro?

> +
>  # ifdef CONFIG_RANDOMIZE_MEMORY
>  void __meminit init_trampoline(void);
>  # else
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index e85ff65c43c3..ffe7902cc326 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -38,5 +38,7 @@ extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern int poke_int3_handler(struct pt_regs *regs);
>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>  extern int after_bootmem;
> +extern __ro_after_init struct mm_struct *poking_mm;
> +extern __ro_after_init unsigned long poking_addr;
>  
>  #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index dd519f372169..ed4a46a89946 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -33,6 +33,7 @@
>  #include <linux/nmi.h>
>  #include <linux/gfp.h>
>  #include <linux/kcore.h>
> +#include <linux/sched/mm.h>
>  
>  #include <asm/processor.h>
>  #include <asm/bios_ebda.h>
> @@ -54,6 +55,7 @@
>  #include <asm/init.h>
>  #include <asm/uv/uv.h>
>  #include <asm/setup.h>
> +#include <asm/text-patching.h>
>  
>  #include "mm_internal.h"
>  
> @@ -1389,6 +1391,39 @@ unsigned long memory_block_size_bytes(void)
>  	return memory_block_size_probed;
>  }
>  
> +/*
> + * Initialize an mm_struct to be used during poking and a pointer to be used
> + * during patching. If anything fails during initialization, poking will be done
> + * using the fixmap, which is unsafe, so warn the user about it.
> + */
> +void __init poking_init(void)
> +{
> +	unsigned long poking_addr;
> +
> +	poking_mm = copy_init_mm();
> +	if (!poking_mm)
> +		goto error;
> +
> +	/*
> +	 * Randomize the poking address, but make sure that the following page
> +	 * will be mapped at the same PMD. We need 2 pages, so find space for 3,
> +	 * and adjust the address if the PMD ends after the first one.
> +	 */
> +	poking_addr = TASK_UNMAPPED_BASE +
> +		(kaslr_get_random_long("Poking") & PAGE_MASK) %
> +		(TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
> +
> +	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
> +		poking_addr += PAGE_SIZE;
> +
> +	return;
> +error:
> +	if (poking_mm)
> +		mmput(poking_mm);
> +	poking_mm = NULL;

At this point, only the poking_mm == NULL case jumps to error, so we
don't need the above 3 lines.
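
That is, the error path could collapse to (sketch):

	error:
		pr_err("x86/mm: error setting a separate poking address space\n");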

> +	pr_err("x86/mm: error setting a separate poking address space\n");
> +}
> +
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  /*
>   * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 88ebc6102c7c..c66579d0ee67 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -1111,6 +1111,10 @@ static inline bool arch_has_pfn_modify_check(void)
>  
>  #ifndef PAGE_KERNEL_EXEC
>  # define PAGE_KERNEL_EXEC PAGE_KERNEL
> +
> +#ifndef poking_init
> +static inline void poking_init(void) { }
> +#endif

Hmm, this seems a bit tricky. Maybe we can make this a __weak function
in init/main.c.
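
E.g., something like (a sketch of that alternative):

	/* In init/main.c: a default that architectures can override. */
	void __init __weak poking_init(void) { }

which would avoid the #ifndef in the generic header.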

Thank you,

>  #endif
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/init/main.c b/init/main.c
> index 18f8f0140fa0..6754ff2687c8 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -725,6 +725,7 @@ asmlinkage __visible void __init start_kernel(void)
>  	taskstats_init_early();
>  	delayacct_init();
>  
> +	poking_init();
>  	check_bugs();
>  
>  	acpi_subsystem_init();
> -- 
> 2.17.1
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>


* Re: [RFC PATCH 2/6] x86/mm: temporary mm struct
  2018-08-29  9:49   ` Masami Hiramatsu
@ 2018-08-29 15:41     ` Andy Lutomirski
  2018-08-29 16:54       ` Nadav Amit
  2018-08-30  1:38       ` Masami Hiramatsu
  0 siblings, 2 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-08-29 15:41 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Nadav Amit, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Andy Lutomirski, Kees Cook,
	Peter Zijlstra

On Wed, Aug 29, 2018 at 2:49 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> On Wed, 29 Aug 2018 01:11:43 -0700
> Nadav Amit <namit@vmware.com> wrote:
>
>> From: Andy Lutomirski <luto@kernel.org>
>>
>> Sometimes we want to set temporary page-table entries (PTEs) on one of
>> the cores, without allowing other cores to use - even speculatively -
>> these mappings. There are two benefits to doing so:
>>
>> (1) Security: if sensitive PTEs are set, a temporary mm prevents their
>> use on other cores. This hardens security, as it prevents exploiting a
>> dangling pointer to overwrite sensitive data using the sensitive PTE.
>>
>> (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
>> remote page-tables.
>>
>> To do so, a temporary mm_struct can be used. Mappings which are private
>> to this mm can be set in the userspace part of the address-space.
>> During the whole time in which the temporary mm is loaded, interrupts
>> must be disabled.
>>
>> The first use-case for temporary PTEs, which will follow, is for poking
>> the kernel text.
>>
>> [ Commit message was written by Nadav ]
>>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>>  arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>> index eeeb9289c764..96afc8c0cf15 100644
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
>>       return cr3;
>>  }
>>
>> +typedef struct {
>> +     struct mm_struct *prev;
>> +} temporary_mm_state_t;
>> +
>> +static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
>> +{
>> +     temporary_mm_state_t state;
>> +
>> +     lockdep_assert_irqs_disabled();
>> +     state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>> +     switch_mm_irqs_off(NULL, mm, current);
>> +     return state;
>> +}
>
> Hmm, why don't we return mm_struct *prev directly?

I did it this way to make it easier to add debugging stuff later.
Also, when I first wrote this, I stashed the old CR3 instead of the old
mm_struct, and it seemed like callers should be insulated from details
like this.
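
E.g., the wrapper leaves room to grow without touching callers (a
hypothetical extension):

	typedef struct {
		struct mm_struct *prev;
	#ifdef CONFIG_DEBUG_VM
		unsigned long prev_cr3;	/* hypothetical: for sanity checks */
	#endif
	} temporary_mm_state_t;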


* Re: [RFC PATCH 5/6] x86/alternatives: use temporary mm for text poking
  2018-08-29  9:28   ` Peter Zijlstra
@ 2018-08-29 15:46     ` Andy Lutomirski
  2018-08-29 16:14       ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2018-08-29 15:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Andy Lutomirski, Masami Hiramatsu,
	Kees Cook

On Wed, Aug 29, 2018 at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Aug 29, 2018 at 01:11:46AM -0700, Nadav Amit wrote:

>> +     pte_clear(poking_mm, poking_addr, ptep);
>> +
>> +     /*
>> +      * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
>> +      * as it also flushes the corresponding "user" address spaces, which
>> +      * does not exist.
>> +      *
>> +      * Poking, however, is already very inefficient since it does not try to
>> +      * batch updates, so we ignore this problem for the time being.
>> +      *
>> +      * Since the PTEs do not exist in other kernel address-spaces, we do
>> +      * not use __flush_tlb_one_kernel(), which when PTI is on would cause
>> +      * more unwarranted TLB flushes.
>> +      */
>
> yuck :-), but yeah.

I'm sure we covered this ad nauseam when PTI was being developed, but
we were kind of in a rush, so:

Why do we do INVPCID at all?  The fallback path for non-INVPCID
systems uses invalidate_user_asid(), which should be faster than the
invpcid path.  And doesn't do a redundant flush in this case.

Can we just drop the INVPCID?  While we're at it, we could drop
X86_FEATURE_INVPCID_SINGLE entirely, since that's the only user.

--Andy


* Re: [RFC PATCH 2/6] x86/mm: temporary mm struct
  2018-08-29  8:11 ` [RFC PATCH 2/6] x86/mm: temporary mm struct Nadav Amit
  2018-08-29  9:49   ` Masami Hiramatsu
@ 2018-08-29 15:46   ` Andy Lutomirski
  1 sibling, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-08-29 15:46 UTC (permalink / raw)
  To: Nadav Amit, Rik van Riel
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Andy Lutomirski, Masami Hiramatsu, Kees Cook,
	Peter Zijlstra

Rik, this is the patch I was referring to.

On Wed, Aug 29, 2018 at 1:11 AM, Nadav Amit <namit@vmware.com> wrote:
> From: Andy Lutomirski <luto@kernel.org>
>
> Sometimes we want to set temporary page-table entries (PTEs) on one of
> the cores, without allowing other cores to use - even speculatively -
> these mappings. There are two benefits to doing so:
>
> (1) Security: if sensitive PTEs are set, a temporary mm prevents their
> use on other cores. This hardens security, as it prevents exploiting a
> dangling pointer to overwrite sensitive data using the sensitive PTE.
>
> (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
> remote page-tables.
>
> To do so, a temporary mm_struct can be used. Mappings which are private
> to this mm can be set in the userspace part of the address-space.
> During the whole time in which the temporary mm is loaded, interrupts
> must be disabled.
>
> The first use-case for temporary PTEs, which will follow, is for poking
> the kernel text.
>
> [ Commit message was written by Nadav ]
>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index eeeb9289c764..96afc8c0cf15 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
>         return cr3;
>  }
>
> +typedef struct {
> +       struct mm_struct *prev;
> +} temporary_mm_state_t;
> +
> +static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
> +{
> +       temporary_mm_state_t state;
> +
> +       lockdep_assert_irqs_disabled();
> +       state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> +       switch_mm_irqs_off(NULL, mm, current);
> +       return state;
> +}
> +
> +static inline void unuse_temporary_mm(temporary_mm_state_t prev)
> +{
> +       lockdep_assert_irqs_disabled();
> +       switch_mm_irqs_off(NULL, prev.prev, current);
> +}
> +
>  #endif /* _ASM_X86_MMU_CONTEXT_H */
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 5/6] x86/alternatives: use temporary mm for text poking
  2018-08-29 15:46     ` Andy Lutomirski
@ 2018-08-29 16:14       ` Peter Zijlstra
  2018-08-29 16:32         ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2018-08-29 16:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Masami Hiramatsu, Kees Cook,
	Dave Hansen

On Wed, Aug 29, 2018 at 08:46:04AM -0700, Andy Lutomirski wrote:
> On Wed, Aug 29, 2018 at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Aug 29, 2018 at 01:11:46AM -0700, Nadav Amit wrote:
> 
> >> +     pte_clear(poking_mm, poking_addr, ptep);
> >> +
> >> +     /*
> >> +      * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
> >> +      * as it also flushes the corresponding "user" address space, which
> >> +      * does not exist.
> >> +      *
> >> +      * Poking, however, is already very inefficient since it does not try to
> >> +      * batch updates, so we ignore this problem for the time being.
> >> +      *
> >> +      * Since the PTEs do not exist in other kernel address-spaces, we do
> >> +      * not use __flush_tlb_one_kernel(), which when PTI is on would cause
> >> +      * more unwarranted TLB flushes.
> >> +      */
> >
> > yuck :-), but yeah.
> 
> I'm sure we covered this ad nauseam when PTI was being developed, but
> we were kind of in a rush, so:
> 
> Why do we do INVPCID at all?  The fallback path for non-INVPCID
> systems uses invalidate_user_asid(), which should be faster than the
> invpcid path.  And doesn't do a redundant flush in this case.

I don't remember; and you forgot to (re)add dhansen.

Logically INVPCID_SINGLE should be faster since it pokes out a single
translation in another PCID instead of killing all user translations.

Is it just a matter of (current) chips implementing INVPCID_SINGLE
inefficiently, or something else?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 5/6] x86/alternatives: use temporary mm for text poking
  2018-08-29 16:14       ` Peter Zijlstra
@ 2018-08-29 16:32         ` Andy Lutomirski
  2018-08-29 16:37           ` Dave Hansen
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2018-08-29 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Nadav Amit, Thomas Gleixner, LKML, Ingo Molnar,
	X86 ML, Arnd Bergmann, linux-arch, Masami Hiramatsu, Kees Cook,
	Dave Hansen

On Wed, Aug 29, 2018 at 9:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Aug 29, 2018 at 08:46:04AM -0700, Andy Lutomirski wrote:
>> On Wed, Aug 29, 2018 at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Wed, Aug 29, 2018 at 01:11:46AM -0700, Nadav Amit wrote:
>>
>> >> +     pte_clear(poking_mm, poking_addr, ptep);
>> >> +
>> >> +     /*
>> >> +      * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
>> >> +      * as it also flushes the corresponding "user" address space, which
>> >> +      * does not exist.
>> >> +      *
>> >> +      * Poking, however, is already very inefficient since it does not try to
>> >> +      * batch updates, so we ignore this problem for the time being.
>> >> +      *
>> >> +      * Since the PTEs do not exist in other kernel address-spaces, we do
>> >> +      * not use __flush_tlb_one_kernel(), which when PTI is on would cause
>> >> +      * more unwarranted TLB flushes.
>> >> +      */
>> >
>> > yuck :-), but yeah.
>>
>> I'm sure we covered this ad nauseam when PTI was being developed, but
>> we were kind of in a rush, so:
>>
>> Why do we do INVPCID at all?  The fallback path for non-INVPCID
>> systems uses invalidate_user_asid(), which should be faster than the
>> invpcid path.  And doesn't do a redundant flush in this case.
>
> I don't remember; and you forgot to (re)add dhansen.
>
> Logically INVPCID_SINGLE should be faster since it pokes out a single
> translation in another PCID instead of killing all user translations.
>
> Is it just a matter of (current) chips implementing INVPCID_SINGLE
> inefficiently, or something else?

It's two things.  Current chips (or at least Skylake, but I'm pretty
sure that older chips are the same) have INVPCID being slower than
writing CR3.  (Yes, that's right, it is considerably faster to flush
the a whole PCID by writing to CR3 than it is to ask INVPCID to do
anything at all.)  But INVPCID is also serializing, whereas just
marking an ASID for future flushing is essentially free.

It's plausible that there are workloads where the current code is
faster, such as where we're munmapping a single page via syscall and
we'd prefer to only flush that one TLB entry even if the flush
operation is slower as a result.

--Andy
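
[ The "essentially free" fallback amounts to setting one bit in a
  per-cpu mask; a rough sketch of invalidate_user_asid(): ]

	static inline void invalidate_user_asid(u16 asid)
	{
		if (!static_cpu_has(X86_FEATURE_PTI))
			return;

		/*
		 * Mark the user ASID stale; no serializing instruction runs
		 * here.  The actual flush is deferred until the next return
		 * to usermode.
		 */
		__set_bit(kern_pcid(asid),
			  (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
	}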

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 5/6] x86/alternatives: use temporary mm for text poking
  2018-08-29 16:32         ` Andy Lutomirski
@ 2018-08-29 16:37           ` Dave Hansen
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Hansen @ 2018-08-29 16:37 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Nadav Amit, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Masami Hiramatsu, Kees Cook

On 08/29/2018 09:32 AM, Andy Lutomirski wrote:
> It's plausible that there are workloads where the current code is
> faster, such as where we're munmapping a single page via syscall and
> we'd prefer to only flush that one TLB entry even if the flush
> operation is slower as a result.

Yeah, I don't specifically remember testing it.  But, I know I wanted to
avoid throwing away thousands of TLB entries when we only want to rid
ourselves of one.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/6] x86/mm: temporary mm struct
  2018-08-29 15:41     ` Andy Lutomirski
@ 2018-08-29 16:54       ` Nadav Amit
  2018-08-29 21:38         ` Andy Lutomirski
  2018-08-30  1:38       ` Masami Hiramatsu
  1 sibling, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2018-08-29 16:54 UTC (permalink / raw)
  To: Andy Lutomirski, Masami Hiramatsu
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Kees Cook, Peter Zijlstra

at 8:41 AM, Andy Lutomirski <luto@kernel.org> wrote:

> On Wed, Aug 29, 2018 at 2:49 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>> On Wed, 29 Aug 2018 01:11:43 -0700
>> Nadav Amit <namit@vmware.com> wrote:
>> 
>>> From: Andy Lutomirski <luto@kernel.org>
>>> 
>>> Sometimes we want to set temporary page-table entries (PTEs) on one of
>>> the cores, without allowing other cores to use - even speculatively -
>>> these mappings. There are two benefits to doing so:
>>>
>>> (1) Security: if sensitive PTEs are set, the temporary mm prevents their
>>> use on other cores. This hardens security, as it prevents exploiting a
>>> dangling pointer to overwrite sensitive data using the sensitive PTE.
>>>
>>> (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
>>> remote page-tables.
>>>
>>> To do so, a temporary mm_struct can be used. Mappings which are private
>>> to this mm can be set in the userspace part of the address-space.
>>> During the whole time in which the temporary mm is loaded, interrupts
>>> must be disabled.
>>> 
>>> The first use-case for temporary PTEs, which will follow, is for poking
>>> the kernel text.
>>> 
>>> [ Commit message was written by Nadav ]
>>> 
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>>> Cc: Kees Cook <keescook@chromium.org>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>> ---
>>> arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
>>> 1 file changed, 20 insertions(+)
>>> 
>>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>>> index eeeb9289c764..96afc8c0cf15 100644
>>> --- a/arch/x86/include/asm/mmu_context.h
>>> +++ b/arch/x86/include/asm/mmu_context.h
>>> @@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
>>>      return cr3;
>>> }
>>> 
>>> +typedef struct {
>>> +     struct mm_struct *prev;
>>> +} temporary_mm_state_t;
>>> +
>>> +static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
>>> +{
>>> +     temporary_mm_state_t state;
>>> +
>>> +     lockdep_assert_irqs_disabled();
>>> +     state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>>> +     switch_mm_irqs_off(NULL, mm, current);
>>> +     return state;
>>> +}
>> 
>> Hmm, why don't we return mm_struct *prev directly?
> 
> I did it this way to make it easier to add future debugging stuff
> later.  Also, when I first wrote this, I stashed the old CR3 instead
> of the old mm_struct, and it seemed like callers should be insulated
> from details like this.

Andy, please let me know if you want me to change it somehow, and please
provide your signed-off-by.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
  2018-08-29  8:59   ` Masami Hiramatsu
@ 2018-08-29 17:11     ` Nadav Amit
  2018-08-29 19:36       ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2018-08-29 17:11 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Andy Lutomirski, Kees Cook

at 1:59 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:

> On Wed, 29 Aug 2018 01:11:42 -0700
> Nadav Amit <namit@vmware.com> wrote:
> 
>> Use lockdep to ensure that text_mutex is taken when text_poke() is
>> called.
>> 
>> Actually it is not always taken, specifically when it is called by kgdb,
>> so take the lock in these cases.
> 
> Can we really take a mutex in kgdb context?
> 
> kgdb_arch_remove_breakpoint
>  <- dbg_deactivate_sw_breakpoints
>    <- kgdb_reenter_check
>       <- kgdb_handle_exception
>          <- __kgdb_notify
>            <- kgdb_ll_trap
>              <- do_int3
>            <- kgdb_notify
>              <- die notifier
> 
> kgdb_arch_set_breakpoint
>  <- dbg_activate_sw_breakpoints
>    <- kgdb_reenter_check
>       <- kgdb_handle_exception
>           ...
> 
> Both seems called in exception context, so we can not take a mutex lock.
> I think kgdb needs a special path.

You are correct, but I don’t want a special path. Presumably text_mutex is
guaranteed not to be taken according to the code.

So I guess the only concern is lockdep. Do you see any problem if I change
mutex_lock() into mutex_trylock()? It should always succeed, and I can add a
warning and a failure path if it fails for some reason.
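
[ The variant being floated would have looked something like the sketch
  below; as the follow-up notes, it does not work and was dropped: ]

	/* hypothetical kgdb-safe locking, never merged */
	if (!mutex_trylock(&text_mutex)) {
		WARN_ON_ONCE(1);	/* expected to always succeed */
		return;
	}
	/* ... do the actual poking ... */
	mutex_unlock(&text_mutex);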


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 6/6] x86/alternatives: remove text_poke() return value
  2018-08-29  9:52   ` Masami Hiramatsu
@ 2018-08-29 17:15     ` Nadav Amit
  0 siblings, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2018-08-29 17:15 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Andy Lutomirski, Kees Cook, Peter Zijlstra

at 2:52 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:

>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -795,7 +795,7 @@ static void text_poke_safe(void *addr, const void *opcode, size_t len,
>>  *
>>  * Note: Must be called under text_mutex.
>>  */
>> -void *text_poke(void *addr, const void *opcode, size_t len)
>> +void text_poke(void *addr, const void *opcode, size_t len)
>> {
>> 	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
>> 	struct page *pages[2] = {0};
> 
> Could you also remove "return addr;" in this patch?

Oops. Thanks!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 4/6] x86/alternatives: initializing temporary mm for patching
  2018-08-29 13:21   ` Masami Hiramatsu
@ 2018-08-29 17:45     ` Nadav Amit
  0 siblings, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2018-08-29 17:45 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Kees Cook, Peter Zijlstra

at 6:21 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:

> On Wed, 29 Aug 2018 01:11:45 -0700
> Nadav Amit <namit@vmware.com> wrote:
> 
>> To prevent improper use of the PTEs that are used for text patching, we
>> want to use a temporary mm struct. We initailize it by copying the init
>> mm.
>> 
>> The address that will be used for patching is taken from the lower area
>> that is usually used for the task memory. Doing so prevents the need to
>> frequently synchronize the temporary-mm (e.g., when BPF programs are
>> installed), since different PGDs are used for the task memory.
>> 
>> Finally, we randomize the address of the PTEs to harden against exploits
>> that use these PTEs.
>> 
>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Suggested-by: Andy Lutomirski <luto@kernel.org>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> arch/x86/include/asm/pgtable.h       |  4 ++++
>> arch/x86/include/asm/text-patching.h |  2 ++
>> arch/x86/mm/init_64.c                | 35 ++++++++++++++++++++++++++++
>> include/asm-generic/pgtable.h        |  4 ++++
>> init/main.c                          |  1 +
>> 5 files changed, 46 insertions(+)
>> 
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index e4ffa565a69f..c65d2b146ff6 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1022,6 +1022,10 @@ static inline void __meminit init_trampoline_default(void)
>> 	/* Default trampoline pgd value */
>> 	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
>> }
>> +
>> +void __init poking_init(void);
>> +#define poking_init poking_init
> 
> Would we need this macro?
> 
>> +
>> # ifdef CONFIG_RANDOMIZE_MEMORY
>> void __meminit init_trampoline(void);
>> # else
>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
>> index e85ff65c43c3..ffe7902cc326 100644
>> --- a/arch/x86/include/asm/text-patching.h
>> +++ b/arch/x86/include/asm/text-patching.h
>> @@ -38,5 +38,7 @@ extern void *text_poke(void *addr, const void *opcode, size_t len);
>> extern int poke_int3_handler(struct pt_regs *regs);
>> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>> extern int after_bootmem;
>> +extern __ro_after_init struct mm_struct *poking_mm;
>> +extern __ro_after_init unsigned long poking_addr;
>> 
>> #endif /* _ASM_X86_TEXT_PATCHING_H */
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index dd519f372169..ed4a46a89946 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -33,6 +33,7 @@
>> #include <linux/nmi.h>
>> #include <linux/gfp.h>
>> #include <linux/kcore.h>
>> +#include <linux/sched/mm.h>
>> 
>> #include <asm/processor.h>
>> #include <asm/bios_ebda.h>
>> @@ -54,6 +55,7 @@
>> #include <asm/init.h>
>> #include <asm/uv/uv.h>
>> #include <asm/setup.h>
>> +#include <asm/text-patching.h>
>> 
>> #include "mm_internal.h"
>> 
>> @@ -1389,6 +1391,39 @@ unsigned long memory_block_size_bytes(void)
>> 	return memory_block_size_probed;
>> }
>> 
>> +/*
>> + * Initialize an mm_struct to be used during poking and a pointer to be used
>> + * during patching. If anything fails during initialization, poking will be done
>> + * using the fixmap, which is unsafe, so warn the user about it.
>> + */
>> +void __init poking_init(void)
>> +{
>> +	unsigned long poking_addr;
>> +
>> +	poking_mm = copy_init_mm();
>> +	if (!poking_mm)
>> +		goto error;
>> +
>> +	/*
>> +	 * Randomize the poking address, but make sure that the following page
>> +	 * will be mapped at the same PMD. We need 2 pages, so find space for 3,
>> +	 * and adjust the address if the PMD ends after the first one.
>> +	 */
>> +	poking_addr = TASK_UNMAPPED_BASE +
>> +		(kaslr_get_random_long("Poking") & PAGE_MASK) %
>> +		(TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
>> +
>> +	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
>> +		poking_addr += PAGE_SIZE;
>> +
>> +	return;
>> +error:
>> +	if (poking_mm)
>> +		mmput(poking_mm);
>> +	poking_mm = NULL;
> 
> At this point, only the poking_mm == NULL case jumps into error. So we
> don't need the above 3 lines.

Right. Will be fixed in the next version. 
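
[ i.e., with copy_init_mm() as the only fallible step before the label,
  the error path collapses to something like: ]

	poking_mm = copy_init_mm();
	if (!poking_mm) {
		pr_err("x86/mm: error setting a separate poking address space\n");
		return;
	}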

> 
>> +	pr_err("x86/mm: error setting a separate poking address space\n");
>> +}
>> +
>> #ifdef CONFIG_SPARSEMEM_VMEMMAP
>> /*
>>  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
>> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>> index 88ebc6102c7c..c66579d0ee67 100644
>> --- a/include/asm-generic/pgtable.h
>> +++ b/include/asm-generic/pgtable.h
>> @@ -1111,6 +1111,10 @@ static inline bool arch_has_pfn_modify_check(void)
>> 
>> #ifndef PAGE_KERNEL_EXEC
>> # define PAGE_KERNEL_EXEC PAGE_KERNEL
>> +
>> +#ifndef poking_init
>> +static inline void poking_init(void) { }
>> +#endif
> 
> Hmm, this seems a bit tricky. Maybe we can make a __weak function
> in init/main.c.

Of course - __weak is much better. Thanks!
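
[ The __weak approach amounts to roughly this: a default no-op in
  init/main.c that an architecture can override at link time: ]

	/* init/main.c */
	void __init __weak poking_init(void) { }

	/* arch/x86/mm/init_64.c keeps its non-weak poking_init(), which
	 * takes precedence over the default. */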





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
  2018-08-29 17:11     ` Nadav Amit
@ 2018-08-29 19:36       ` Nadav Amit
  2018-08-29 20:13         ` Sean Christopherson
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2018-08-29 19:36 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Andy Lutomirski, Kees Cook

at 10:11 AM, Nadav Amit <namit@vmware.com> wrote:

> at 1:59 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
>> On Wed, 29 Aug 2018 01:11:42 -0700
>> Nadav Amit <namit@vmware.com> wrote:
>> 
>>> Use lockdep to ensure that text_mutex is taken when text_poke() is
>>> called.
>>> 
>>> Actually it is not always taken, specifically when it is called by kgdb,
>>> so take the lock in these cases.
>> 
>> Can we really take a mutex in kgdb context?
>> 
>> kgdb_arch_remove_breakpoint
>> <- dbg_deactivate_sw_breakpoints
>>   <- kgdb_reenter_check
>>      <- kgdb_handle_exception
>>         <- __kgdb_notify
>>           <- kgdb_ll_trap
>>             <- do_int3
>>           <- kgdb_notify
>>             <- die notifier
>> 
>> kgdb_arch_set_breakpoint
>> <- dbg_activate_sw_breakpoints
>>   <- kgdb_reenter_check
>>      <- kgdb_handle_exception
>>          ...
>> 
>> Both seems called in exception context, so we can not take a mutex lock.
>> I think kgdb needs a special path.
> 
> You are correct, but I don’t want a special path. Presumably text_mutex is
> guaranteed not to be taken according to the code.
> 
> So I guess the only concern is lockdep. Do you see any problem if I change
> mutex_lock() into mutex_trylock()? It should always succeed, and I can add a
> warning and a failure path if it fails for some reason.

Err.. This will not work. I think I will drop this patch, since I cannot
find a proper yet simple assertion. Creating special path just for the
assertion seems wrong.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
  2018-08-29 19:36       ` Nadav Amit
@ 2018-08-29 20:13         ` Sean Christopherson
  2018-08-29 20:44           ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2018-08-29 20:13 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Masami Hiramatsu, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Andy Lutomirski, Kees Cook

On Wed, Aug 29, 2018 at 07:36:22PM +0000, Nadav Amit wrote:
> at 10:11 AM, Nadav Amit <namit@vmware.com> wrote:
> 
> > at 1:59 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > 
> >> On Wed, 29 Aug 2018 01:11:42 -0700
> >> Nadav Amit <namit@vmware.com> wrote:
> >> 
> >>> Use lockdep to ensure that text_mutex is taken when text_poke() is
> >>> called.
> >>> 
> >>> Actually it is not always taken, specifically when it is called by kgdb,
> >>> so take the lock in these cases.
> >> 
> >> Can we really take a mutex in kgdb context?
> >> 
> >> kgdb_arch_remove_breakpoint
> >> <- dbg_deactivate_sw_breakpoints
> >>   <- kgdb_reenter_check
> >>      <- kgdb_handle_exception
> >>         <- __kgdb_notify
> >>           <- kgdb_ll_trap
> >>             <- do_int3
> >>           <- kgdb_notify
> >>             <- die notifier
> >> 
> >> kgdb_arch_set_breakpoint
> >> <- dbg_activate_sw_breakpoints
> >>   <- kgdb_reenter_check
> >>      <- kgdb_handle_exception
> >>          ...
> >> 
> >> Both seems called in exception context, so we can not take a mutex lock.
> >> I think kgdb needs a special path.
> > 
> > You are correct, but I don’t want a special path. Presumably text_mutex is
> > guaranteed not to be taken according to the code.
> > 
> > So I guess the only concern is lockdep. Do you see any problem if I change
> > mutex_lock() into mutex_trylock()? It should always succeed, and I can add a
> > warning and a failure path if it fails for some reason.
> 
> Err.. This will not work. I think I will drop this patch, since I cannot
> find a proper yet simple assertion. Creating special path just for the
> assertion seems wrong.

It's probably worth expanding the comment for text_poke() to call out
the kgdb case and reference kgdb_arch_{set,remove}_breakpoint(), whose
code and comments make it explicitly clear why its safe for them to
call text_poke() without acquiring the lock.  Might prevent someone
from going down this path again in the future.
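
[ For illustration, the expanded comment might read something like: ]

	/*
	 * Note: Must be called under text_mutex.  The one exception is
	 * kgdb, which calls text_poke() from exception context, where
	 * taking the mutex is impossible; see
	 * kgdb_arch_{set,remove}_breakpoint() for why that is safe.
	 */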

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
  2018-08-29 20:13         ` Sean Christopherson
@ 2018-08-29 20:44           ` Nadav Amit
  2018-08-29 21:00             ` Sean Christopherson
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2018-08-29 20:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Masami Hiramatsu, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Andy Lutomirski, Kees Cook

at 1:13 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Wed, Aug 29, 2018 at 07:36:22PM +0000, Nadav Amit wrote:
>> at 10:11 AM, Nadav Amit <namit@vmware.com> wrote:
>> 
>>> at 1:59 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>>> 
>>>> On Wed, 29 Aug 2018 01:11:42 -0700
>>>> Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>>> Use lockdep to ensure that text_mutex is taken when text_poke() is
>>>>> called.
>>>>> 
>>>>> Actually it is not always taken, specifically when it is called by kgdb,
>>>>> so take the lock in these cases.
>>>> 
>>>> Can we really take a mutex in kgdb context?
>>>> 
>>>> kgdb_arch_remove_breakpoint
>>>> <- dbg_deactivate_sw_breakpoints
>>>>  <- kgdb_reenter_check
>>>>     <- kgdb_handle_exception
>>>>        <- __kgdb_notify
>>>>          <- kgdb_ll_trap
>>>>            <- do_int3
>>>>          <- kgdb_notify
>>>>            <- die notifier
>>>> 
>>>> kgdb_arch_set_breakpoint
>>>> <- dbg_activate_sw_breakpoints
>>>>  <- kgdb_reenter_check
>>>>     <- kgdb_handle_exception
>>>>         ...
>>>> 
>>>> Both seems called in exception context, so we can not take a mutex lock.
>>>> I think kgdb needs a special path.
>>> 
>>> You are correct, but I don’t want a special path. Presumably text_mutex is
>>> guaranteed not to be taken according to the code.
>>> 
>>> So I guess the only concern is lockdep. Do you see any problem if I change
>>> mutex_lock() into mutex_trylock()? It should always succeed, and I can add a
>>> warning and a failure path if it fails for some reason.
>> 
>> Err.. This will not work. I think I will drop this patch, since I cannot
>> find a proper yet simple assertion. Creating special path just for the
>> assertion seems wrong.
> 
> It's probably worth expanding the comment for text_poke() to call out
> the kgdb case and reference kgdb_arch_{set,remove}_breakpoint(), whose
> code and comments make it explicitly clear why it's safe for them to
> call text_poke() without acquiring the lock.  Might prevent someone
> from going down this path again in the future.

I thought that the whole point of the patch was to avoid comments, and
instead enforce the right behavior. I don’t understand well enough kgdb
code, so I cannot attest it does the right thing. What happens if
kgdb_do_roundup==0?



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
  2018-08-29 20:44           ` Nadav Amit
@ 2018-08-29 21:00             ` Sean Christopherson
  2018-08-29 22:56               ` Nadav Amit
  2018-08-30  2:26               ` Masami Hiramatsu
  0 siblings, 2 replies; 34+ messages in thread
From: Sean Christopherson @ 2018-08-29 21:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Masami Hiramatsu, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Andy Lutomirski, Kees Cook

On Wed, Aug 29, 2018 at 08:44:47PM +0000, Nadav Amit wrote:
> at 1:13 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > On Wed, Aug 29, 2018 at 07:36:22PM +0000, Nadav Amit wrote:
> >> at 10:11 AM, Nadav Amit <namit@vmware.com> wrote:
> >> 
> >>> at 1:59 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >>> 
> >>>> On Wed, 29 Aug 2018 01:11:42 -0700
> >>>> Nadav Amit <namit@vmware.com> wrote:
> >>>> 
> >>>>> Use lockdep to ensure that text_mutex is taken when text_poke() is
> >>>>> called.
> >>>>> 
> >>>>> Actually it is not always taken, specifically when it is called by kgdb,
> >>>>> so take the lock in these cases.
> >>>> 
> >>>> Can we really take a mutex in kgdb context?
> >>>> 
> >>>> kgdb_arch_remove_breakpoint
> >>>> <- dbg_deactivate_sw_breakpoints
> >>>>  <- kgdb_reenter_check
> >>>>     <- kgdb_handle_exception
> >>>>        <- __kgdb_notify
> >>>>          <- kgdb_ll_trap
> >>>>            <- do_int3
> >>>>          <- kgdb_notify
> >>>>            <- die notifier
> >>>> 
> >>>> kgdb_arch_set_breakpoint
> >>>> <- dbg_activate_sw_breakpoints
> >>>>  <- kgdb_reenter_check
> >>>>     <- kgdb_handle_exception
> >>>>         ...
> >>>> 
> >>>> Both seems called in exception context, so we can not take a mutex lock.
> >>>> I think kgdb needs a special path.
> >>> 
> >>> You are correct, but I don’t want a special path. Presumably text_mutex is
> >>> guaranteed not to be taken according to the code.
> >>> 
> >>> So I guess the only concern is lockdep. Do you see any problem if I change
> >>> mutex_lock() into mutex_trylock()? It should always succeed, and I can add a
> >>> warning and a failure path if it fails for some reason.
> >> 
> >> Err.. This will not work. I think I will drop this patch, since I cannot
> >> find a proper yet simple assertion. Creating special path just for the
> >> assertion seems wrong.
> > 
> > It's probably worth expanding the comment for text_poke() to call out
> > the kgdb case and reference kgdb_arch_{set,remove}_breakpoint(), whose
> > code and comments make it explicitly clear why it's safe for them to
> > call text_poke() without acquiring the lock.  Might prevent someone
> > from going down this path again in the future.
> 
> I thought that the whole point of the patch was to avoid comments, and
> instead enforce the right behavior. I don’t understand well enough kgdb
> code, so I cannot attest it does the right thing. What happens if
> kgdb_do_roundup==0?

As is, the comment is wrong because there are obviously cases where
text_poke() is called without text_mutex being held.  I can't attest
to the kgdb code either.  My thought was to document the exception so
that if someone does want to try and enforce the right behavior they
can dive right into the problem instead of having to learn of the kgdb
gotcha the hard way.  Maybe a FIXME is the right approach?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/6] x86/mm: temporary mm struct
  2018-08-29 16:54       ` Nadav Amit
@ 2018-08-29 21:38         ` Andy Lutomirski
  0 siblings, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-08-29 21:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Masami Hiramatsu, Thomas Gleixner, LKML,
	Ingo Molnar, X86 ML, Arnd Bergmann, linux-arch, Kees Cook,
	Peter Zijlstra

On Wed, Aug 29, 2018 at 9:54 AM, Nadav Amit <namit@vmware.com> wrote:
> at 8:41 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
>> On Wed, Aug 29, 2018 at 2:49 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>>> On Wed, 29 Aug 2018 01:11:43 -0700
>>> Nadav Amit <namit@vmware.com> wrote:
>>>
>>>> From: Andy Lutomirski <luto@kernel.org>
>>>>
>>>> Sometimes we want to set temporary page-table entries (PTEs) on one of
>>>> the cores, without allowing other cores to use - even speculatively -
>>>> these mappings. There are two benefits to doing so:
>>>>
>>>> (1) Security: if sensitive PTEs are set, the temporary mm prevents their
>>>> use on other cores. This hardens security, as it prevents exploiting a
>>>> dangling pointer to overwrite sensitive data using the sensitive PTE.
>>>>
>>>> (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
>>>> remote page-tables.
>>>>
>>>> To do so, a temporary mm_struct can be used. Mappings which are private
>>>> to this mm can be set in the userspace part of the address-space.
>>>> During the whole time in which the temporary mm is loaded, interrupts
>>>> must be disabled.
>>>>
>>>> The first use-case for temporary PTEs, which will follow, is for poking
>>>> the kernel text.
>>>>
>>>> [ Commit message was written by Nadav ]
>>>>
>>>> Cc: Andy Lutomirski <luto@kernel.org>
>>>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>>>> Cc: Kees Cook <keescook@chromium.org>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>>> ---
>>>> arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
>>>> 1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>>>> index eeeb9289c764..96afc8c0cf15 100644
>>>> --- a/arch/x86/include/asm/mmu_context.h
>>>> +++ b/arch/x86/include/asm/mmu_context.h
>>>> @@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
>>>>      return cr3;
>>>> }
>>>>
>>>> +typedef struct {
>>>> +     struct mm_struct *prev;
>>>> +} temporary_mm_state_t;
>>>> +
>>>> +static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
>>>> +{
>>>> +     temporary_mm_state_t state;
>>>> +
>>>> +     lockdep_assert_irqs_disabled();
>>>> +     state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>>>> +     switch_mm_irqs_off(NULL, mm, current);
>>>> +     return state;
>>>> +}
>>>
>>> Hmm, why don't we return mm_struct *prev directly?
>>
>> I did it this way to make it easier to add future debugging stuff
>> later.  Also, when I first wrote this, I stashed the old CR3 instead
>> of the old mm_struct, and it seemed like callers should be insulated
>> from details like this.
>
> Andy, please let me know if you want me to change it somehow, and please
> provide your signed-off-by.
>

I'm happy with it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
  2018-08-29 21:00             ` Sean Christopherson
@ 2018-08-29 22:56               ` Nadav Amit
  2018-08-30  2:26               ` Masami Hiramatsu
  1 sibling, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2018-08-29 22:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Masami Hiramatsu, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Andy Lutomirski, Kees Cook

at 2:00 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Wed, Aug 29, 2018 at 08:44:47PM +0000, Nadav Amit wrote:
>> at 1:13 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>> 
>>> On Wed, Aug 29, 2018 at 07:36:22PM +0000, Nadav Amit wrote:
>>>> at 10:11 AM, Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>>> at 1:59 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>>>>> 
>>>>>> On Wed, 29 Aug 2018 01:11:42 -0700
>>>>>> Nadav Amit <namit@vmware.com> wrote:
>>>>>> 
>>>>>>> Use lockdep to ensure that text_mutex is taken when text_poke() is
>>>>>>> called.
>>>>>>> 
>>>>>>> Actually it is not always taken, specifically when it is called by kgdb,
>>>>>>> so take the lock in these cases.
>>>>>> 
>>>>>> Can we really take a mutex in kgdb context?
>>>>>> 
>>>>>> kgdb_arch_remove_breakpoint
>>>>>> <- dbg_deactivate_sw_breakpoints
>>>>>> <- kgdb_reenter_check
>>>>>>    <- kgdb_handle_exception
>>>>>>       <- __kgdb_notify
>>>>>>         <- kgdb_ll_trap
>>>>>>           <- do_int3
>>>>>>         <- kgdb_notify
>>>>>>           <- die notifier
>>>>>> 
>>>>>> kgdb_arch_set_breakpoint
>>>>>> <- dbg_activate_sw_breakpoints
>>>>>> <- kgdb_reenter_check
>>>>>>    <- kgdb_handle_exception
>>>>>>        ...
>>>>>> 
>>>>>> Both seems called in exception context, so we can not take a mutex lock.
>>>>>> I think kgdb needs a special path.
>>>>> 
>>>>> You are correct, but I don’t want a special path. Presumably text_mutex is
>>>>> guaranteed not to be taken according to the code.
>>>>> 
>>>>> So I guess the only concern is lockdep. Do you see any problem if I change
>>>>> mutex_lock() into mutex_trylock()? It should always succeed, and I can add a
>>>>> warning and a failure path if it fails for some reason.
>>>> 
>>>> Err.. This will not work. I think I will drop this patch, since I cannot
>>>> find a proper yet simple assertion. Creating special path just for the
>>>> assertion seems wrong.
>>> 
>>> It's probably worth expanding the comment for text_poke() to call out
>>> the kgdb case and reference kgdb_arch_{set,remove}_breakpoint(), whose
>>> code and comments make it explicitly clear why it's safe for them to
>>> call text_poke() without acquiring the lock.  Might prevent someone
>>> from going down this path again in the future.
>> 
>> I thought that the whole point of the patch was to avoid comments, and
>> instead enforce the right behavior. I don’t understand well enough kgdb
>> code, so I cannot attest it does the right thing. What happens if
>> kgdb_do_roundup==0?
> 
> As is, the comment is wrong because there are obviously cases where
> text_poke() is called without text_mutex being held.  I can't attest
> to the kgdb code either.  My thought was to document the exception so
> that if someone does want to try and enforce the right behavior they
> can dive right into the problem instead of having to learn of the kgdb
> gotcha the hard way.  Maybe a FIXME is the right approach?

Ok. I’ll add a FIXME comment as you propose, but this does not deserve a
separate patch. I’ll squash it into patch 5.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/6] x86/mm: temporary mm struct
  2018-08-29 15:41     ` Andy Lutomirski
  2018-08-29 16:54       ` Nadav Amit
@ 2018-08-30  1:38       ` Masami Hiramatsu
  2018-08-30  1:59         ` Andy Lutomirski
  1 sibling, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2018-08-30  1:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Kees Cook, Peter Zijlstra

On Wed, 29 Aug 2018 08:41:00 -0700
Andy Lutomirski <luto@kernel.org> wrote:

> On Wed, Aug 29, 2018 at 2:49 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > On Wed, 29 Aug 2018 01:11:43 -0700
> > Nadav Amit <namit@vmware.com> wrote:
> >
> >> From: Andy Lutomirski <luto@kernel.org>
> >>
> >> Sometimes we want to set temporary page-table entries (PTEs) on one of
> >> the cores, without allowing other cores to use - even speculatively -
> >> these mappings. There are two benefits to doing so:
> >>
> >> (1) Security: if sensitive PTEs are set, the temporary mm prevents their
> >> use on other cores. This hardens security, as it prevents exploiting a
> >> dangling pointer to overwrite sensitive data using the sensitive PTE.
> >>
> >> (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
> >> remote page-tables.
> >>
> >> To do so, a temporary mm_struct can be used. Mappings which are private
> >> to this mm can be set in the userspace part of the address-space.
> >> During the whole time in which the temporary mm is loaded, interrupts
> >> must be disabled.
> >>
> >> The first use-case for temporary PTEs, which will follow, is for poking
> >> the kernel text.
> >>
> >> [ Commit message was written by Nadav ]
> >>
> >> Cc: Andy Lutomirski <luto@kernel.org>
> >> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> >> Cc: Kees Cook <keescook@chromium.org>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Signed-off-by: Nadav Amit <namit@vmware.com>
> >> ---
> >>  arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
> >>  1 file changed, 20 insertions(+)
> >>
> >> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> >> index eeeb9289c764..96afc8c0cf15 100644
> >> --- a/arch/x86/include/asm/mmu_context.h
> >> +++ b/arch/x86/include/asm/mmu_context.h
> >> @@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
> >>       return cr3;
> >>  }
> >>
> >> +typedef struct {
> >> +     struct mm_struct *prev;
> >> +} temporary_mm_state_t;
> >> +
> >> +static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
> >> +{
> >> +     temporary_mm_state_t state;
> >> +
> >> +     lockdep_assert_irqs_disabled();
> >> +     state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> >> +     switch_mm_irqs_off(NULL, mm, current);
> >> +     return state;
> >> +}
> >
> > Hmm, why don't we return mm_struct *prev directly?
> 
> I did it this way to make it easier to add future debugging stuff
> later. Also, when I first wrote this, I stashed the old CR3 instead
> of the old mm_struct, and it seemed like callers should be insulated
> from details like this.

Hmm, I see. But in that case, we should call it "struct temporary_mm"
and explicitly allocate (and pass) it, since we cannot return the
data structure from the stack. If we combine it with the new mm, it will
be more encapsulated, e.g.:

struct temporary_mm {
	struct mm_struct *mm;
	struct mm_struct *prev;
};

static struct temporary_mm poking_tmp_mm;

poking_init()
{
	if (init_temporary_mm(&tmp_mm, &init_mm))
		goto error;
	...
}

text_poke_safe()
{
	...
	use_temporary_mm(&tmp_mm);
	...
	unuse_temporary_mm(&tmp_mm);
}

Any thought?

Thanks,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/6] x86/mm: temporary mm struct
  2018-08-30  1:38       ` Masami Hiramatsu
@ 2018-08-30  1:59         ` Andy Lutomirski
  2018-08-31  4:42           ` Masami Hiramatsu
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2018-08-30  1:59 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andy Lutomirski, Nadav Amit, Thomas Gleixner, LKML, Ingo Molnar,
	X86 ML, Arnd Bergmann, linux-arch, Kees Cook, Peter Zijlstra



> On Aug 29, 2018, at 6:38 PM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> On Wed, 29 Aug 2018 08:41:00 -0700
> Andy Lutomirski <luto@kernel.org> wrote:
> 
>>> On Wed, Aug 29, 2018 at 2:49 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>>> On Wed, 29 Aug 2018 01:11:43 -0700
>>> Nadav Amit <namit@vmware.com> wrote:
>>> 
>>>> From: Andy Lutomirski <luto@kernel.org>
>>>> 
>>>> Sometimes we want to set temporary page-table entries (PTEs) on one of
>>>> the cores, without allowing other cores to use - even speculatively -
>>>> these mappings. There are two benefits to doing so:
>>>>
>>>> (1) Security: if sensitive PTEs are set, the temporary mm prevents their
>>>> use on other cores. This hardens security, as it prevents exploiting a
>>>> dangling pointer to overwrite sensitive data using the sensitive PTE.
>>>>
>>>> (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
>>>> remote page-tables.
>>>>
>>>> To do so, a temporary mm_struct can be used. Mappings which are private
>>>> to this mm can be set in the userspace part of the address-space.
>>>> During the whole time in which the temporary mm is loaded, interrupts
>>>> must be disabled.
>>>> 
>>>> The first use-case for temporary PTEs, which will follow, is for poking
>>>> the kernel text.
>>>> 
>>>> [ Commit message was written by Nadav ]
>>>> 
>>>> Cc: Andy Lutomirski <luto@kernel.org>
>>>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>>>> Cc: Kees Cook <keescook@chromium.org>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>>> ---
>>>> arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
>>>> 1 file changed, 20 insertions(+)
>>>> 
>>>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>>>> index eeeb9289c764..96afc8c0cf15 100644
>>>> --- a/arch/x86/include/asm/mmu_context.h
>>>> +++ b/arch/x86/include/asm/mmu_context.h
>>>> @@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
>>>>      return cr3;
>>>> }
>>>> 
>>>> +typedef struct {
>>>> +     struct mm_struct *prev;
>>>> +} temporary_mm_state_t;
>>>> +
>>>> +static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
>>>> +{
>>>> +     temporary_mm_state_t state;
>>>> +
>>>> +     lockdep_assert_irqs_disabled();
>>>> +     state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>>>> +     switch_mm_irqs_off(NULL, mm, current);
>>>> +     return state;
>>>> +}
>>> 
>>> Hmm, why don't we return mm_struct *prev directly?
>> 
>> I did it this way to make it easier to add future debugging stuff
>> later. Also, when I first wrote this, I stashed the old CR3 instead
>> of the old mm_struct, and it seemed like callers should be insulated
>> from details like this.
> 
> Hmm, I see. But in that case, we should call it "struct temporary_mm"
> and explicitly allocate (and pass) it, since we cannot return the
> data structure from the stack.

Why not?

> If we combine it with the new mm, it will
> be more encapsulated, e.g.:
> 
> struct temporary_mm {
>    struct mm_struct *mm;
>    struct mm_struct *prev;
> };
> 
> static struct temporary_mm poking_tmp_mm;
> 
> poking_init()
> {
>    if (init_temporary_mm(&tmp_mm, &init_mm))
>        goto error;
>    ...
> }
> 
> text_poke_safe()
> {
>    ...
>    use_temporary_mm(&tmp_mm);
>    ...
>    unuse_temporary_mm(&tmp_mm);
> }
> 
> Any thought?

That seems more complicated for not very much gain.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
  2018-08-29 21:00             ` Sean Christopherson
  2018-08-29 22:56               ` Nadav Amit
@ 2018-08-30  2:26               ` Masami Hiramatsu
  2018-08-30  5:23                 ` Nadav Amit
  1 sibling, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2018-08-30  2:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Nadav Amit, Masami Hiramatsu, Thomas Gleixner, LKML, Ingo Molnar,
	X86 ML, Arnd Bergmann, linux-arch, Andy Lutomirski, Kees Cook,
	Jason Wessel

On Wed, 29 Aug 2018 14:00:06 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Wed, Aug 29, 2018 at 08:44:47PM +0000, Nadav Amit wrote:
> > at 1:13 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > > On Wed, Aug 29, 2018 at 07:36:22PM +0000, Nadav Amit wrote:
> > >> at 10:11 AM, Nadav Amit <namit@vmware.com> wrote:
> > >> 
> > >>> at 1:59 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > >>> 
> > >>>> On Wed, 29 Aug 2018 01:11:42 -0700
> > >>>> Nadav Amit <namit@vmware.com> wrote:
> > >>>> 
> > >>>>> Use lockdep to ensure that text_mutex is taken when text_poke() is
> > >>>>> called.
> > >>>>> 
> > >>>>> Actually it is not always taken, specifically when it is called by kgdb,
> > >>>>> so take the lock in these cases.
> > >>>> 
> > >>>> Can we really take a mutex in kgdb context?
> > >>>> 
> > >>>> kgdb_arch_remove_breakpoint
> > >>>> <- dbg_deactivate_sw_breakpoints
> > >>>>  <- kgdb_reenter_check
> > >>>>     <- kgdb_handle_exception
> > >>>>        <- __kgdb_notify
> > >>>>          <- kgdb_ll_trap
> > >>>>            <- do_int3
> > >>>>          <- kgdb_notify
> > >>>>            <- die notifier
> > >>>> 
> > >>>> kgdb_arch_set_breakpoint
> > >>>> <- dbg_activate_sw_breakpoints
> > >>>>  <- kgdb_reenter_check
> > >>>>     <- kgdb_handle_exception
> > >>>>         ...
> > >>>> 
> > >>>> Both seems called in exception context, so we can not take a mutex lock.
> > >>>> I think kgdb needs a special path.
> > >>> 
> > >>> You are correct, but I don’t want a special path. Presumably text_mutex is
> > >>> guaranteed not to be taken according to the code.
> > >>> 
> > >>> So I guess the only concern is lockdep. Do you see any problem if I change
> > >>> mutex_lock() into mutex_trylock()? It should always succeed, and I can add a
> > >>> warning and a failure path if it fails for some reason.
> > >> 
> > >> Err.. This will not work. I think I will drop this patch, since I cannot
> > >> find a proper yet simple assertion. Creating special path just for the
> > >> assertion seems wrong.
> > > 
> > > It's probably worth expanding the comment for text_poke() to call out
> > > the kgdb case and reference kgdb_arch_{set,remove}_breakpoint(), whose
> > > code and comments make it explicitly clear why it's safe for them to
> > > call text_poke() without acquiring the lock.  Might prevent someone
> > > from going down this path again in the future.
> > 
> > I thought that the whole point of the patch was to avoid comments, and
> > instead enforce the right behavior. I don’t understand well enough kgdb
> > code, so I cannot attest it does the right thing. What happens if
> > kgdb_do_roundup==0?
> 
> As is, the comment is wrong because there are obviously cases where
> text_poke() is called without text_mutex being held.  I can't attest
> to the kgdb code either.  My thought was to document the exception so
> that if someone does want to try and enforce the right behavior they
> can dive right into the problem instead of having to learn of the kgdb
> gotcha the hard way.  Maybe a FIXME is the right approach?

No, kgdb ensures that text_mutex is not already held right before
calling text_poke(). So they also take care of text_mutex. I guess
kgdb_arch_{set,remove}_breakpoint() is supposed to be run under
special circumstances, like stopping all other threads/cores.
In that case, we can just check that text_mutex is not locked.
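
[ Such a check would be a one-liner along these lines: ]

	/* hypothetical assertion for the kgdb path */
	WARN_ON_ONCE(mutex_is_locked(&text_mutex));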
 
Anyway, kgdb is a very rare corner case. I think if CONFIG_KGDB is
enabled, lockdep and any such assertion should be disabled, since kgdb
can tweak anything in the kernel in unexpected ways...

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
  2018-08-30  2:26               ` Masami Hiramatsu
@ 2018-08-30  5:23                 ` Nadav Amit
  0 siblings, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2018-08-30  5:23 UTC (permalink / raw)
  To: Masami Hiramatsu, Sean Christopherson
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Andy Lutomirski, Kees Cook, Jason Wessel

at 7:26 PM, Masami Hiramatsu <mhiramat@kernel.org> wrote:

> On Wed, 29 Aug 2018 14:00:06 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
>> On Wed, Aug 29, 2018 at 08:44:47PM +0000, Nadav Amit wrote:
>>> at 1:13 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>> 
>>>> On Wed, Aug 29, 2018 at 07:36:22PM +0000, Nadav Amit wrote:
>>>>> at 10:11 AM, Nadav Amit <namit@vmware.com> wrote:
>>>>> 
>>>>>> at 1:59 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>>>>>> 
>>>>>>> On Wed, 29 Aug 2018 01:11:42 -0700
>>>>>>> Nadav Amit <namit@vmware.com> wrote:
>>>>>>> 
>>>>>>>> Use lockdep to ensure that text_mutex is taken when text_poke() is
>>>>>>>> called.
>>>>>>>> 
>>>>>>>> Actually it is not always taken, specifically when it is called by kgdb,
>>>>>>>> so take the lock in these cases.
>>>>>>> 
>>>>>>> Can we really take a mutex in kgdb context?
>>>>>>> 
>>>>>>> kgdb_arch_remove_breakpoint
>>>>>>> <- dbg_deactivate_sw_breakpoints
>>>>>>> <- kgdb_reenter_check
>>>>>>>    <- kgdb_handle_exception
>>>>>>>       <- __kgdb_notify
>>>>>>>         <- kgdb_ll_trap
>>>>>>>           <- do_int3
>>>>>>>         <- kgdb_notify
>>>>>>>           <- die notifier
>>>>>>> 
>>>>>>> kgdb_arch_set_breakpoint
>>>>>>> <- dbg_activate_sw_breakpoints
>>>>>>> <- kgdb_reenter_check
>>>>>>>    <- kgdb_handle_exception
>>>>>>>        ...
>>>>>>> 
>>>>>>> Both seems called in exception context, so we can not take a mutex lock.
>>>>>>> I think kgdb needs a special path.
>>>>>> 
>>>>>> You are correct, but I don’t want a special path. Presumably text_mutex is
>>>>>> guaranteed not to be taken according to the code.
>>>>>> 
>>>>>> So I guess the only concern is lockdep. Do you see any problem if I change
>>>>>> mutex_lock() into mutex_trylock()? It should always succeed, and I can add a
>>>>>> warning and a failure path if it fails for some reason.
>>>>> 
>>>>> Err.. This will not work. I think I will drop this patch, since I cannot
>>>>> find a proper yet simple assertion. Creating special path just for the
>>>>> assertion seems wrong.
>>>> 
>>>> It's probably worth expanding the comment for text_poke() to call out
>>>> the kgdb case and reference kgdb_arch_{set,remove}_breakpoint(), whose
> >>>> code and comments make it explicitly clear why it's safe for them to
>>>> call text_poke() without acquiring the lock.  Might prevent someone
>>>> from going down this path again in the future.
>>> 
>>> I thought that the whole point of the patch was to avoid comments, and
>>> instead enforce the right behavior. I don’t understand well enough kgdb
>>> code, so I cannot attest it does the right thing. What happens if
>>> kgdb_do_roundup==0?
>> 
>> As is, the comment is wrong because there are obviously cases where
>> text_poke() is called without text_mutex being held.  I can't attest
>> to the kgdb code either.  My thought was to document the exception so
>> that if someone does want to try and enforce the right behavior they
>> can dive right into the problem instead of having to learn of the kgdb
>> gotcha the hard way.  Maybe a FIXME is the right approach?
> 
> No, kgdb ensures that text_mutex is not already held right before
> calling text_poke(). So they also take care of text_mutex. I guess
> kgdb_arch_{set,remove}_breakpoint() is supposed to be run under
> special circumstances, like stopping all other threads/cores.
> In that case, we can just check that text_mutex is not locked.

I assumed so too, but after looking at the code, I am not sure that this is
the case when kgdb_do_roundup==0.

> Anyway, kgdb is a very rare corner case. I think if CONFIG_KGDB is
> enabled, lockdep and any such assertion should be disabled, since kgdb
> can tweak anything in the kernel in unexpected ways...

Call me lazy, but I really do not want to debug syzkaller failures due to
this issue (now or in the future). If the assertion is known to be
incorrect, even in a corner case, I see no reason to have it and I certainly
do not want to be the one that added it…


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/6] x86/mm: temporary mm struct
  2018-08-30  1:59         ` Andy Lutomirski
@ 2018-08-31  4:42           ` Masami Hiramatsu
  0 siblings, 0 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2018-08-31  4:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Nadav Amit, Thomas Gleixner, LKML, Ingo Molnar,
	X86 ML, Arnd Bergmann, linux-arch, Kees Cook, Peter Zijlstra

On Wed, 29 Aug 2018 18:59:52 -0700
Andy Lutomirski <luto@amacapital.net> wrote:

> 
> 
> > On Aug 29, 2018, at 6:38 PM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > 
> > On Wed, 29 Aug 2018 08:41:00 -0700
> > Andy Lutomirski <luto@kernel.org> wrote:
> > 
> >>> On Wed, Aug 29, 2018 at 2:49 AM, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >>> On Wed, 29 Aug 2018 01:11:43 -0700
> >>> Nadav Amit <namit@vmware.com> wrote:
> >>> 
> >>>> From: Andy Lutomirski <luto@kernel.org>
> >>>> 
> >>>> Sometimes we want to set temporary page-table entries (PTEs) on one of
> >>>> the cores, without allowing other cores to use - even speculatively -
> >>>> these mappings. There are two benefits to doing so:
> >>>>
> >>>> (1) Security: if sensitive PTEs are set, the temporary mm prevents their
> >>>> use on other cores. This hardens security, as it prevents exploiting a
> >>>> dangling pointer to overwrite sensitive data using the sensitive PTE.
> >>>>
> >>>> (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
> >>>> remote page-tables.
> >>>>
> >>>> To do so, a temporary mm_struct can be used. Mappings which are private
> >>>> to this mm can be set in the userspace part of the address-space.
> >>>> During the whole time in which the temporary mm is loaded, interrupts
> >>>> must be disabled.
> >>>> 
> >>>> The first use-case for temporary PTEs, which will follow, is for poking
> >>>> the kernel text.
> >>>> 
> >>>> [ Commit message was written by Nadav ]
> >>>> 
> >>>> Cc: Andy Lutomirski <luto@kernel.org>
> >>>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> >>>> Cc: Kees Cook <keescook@chromium.org>
> >>>> Cc: Peter Zijlstra <peterz@infradead.org>
> >>>> Signed-off-by: Nadav Amit <namit@vmware.com>
> >>>> ---
> >>>> arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
> >>>> 1 file changed, 20 insertions(+)
> >>>> 
> >>>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> >>>> index eeeb9289c764..96afc8c0cf15 100644
> >>>> --- a/arch/x86/include/asm/mmu_context.h
> >>>> +++ b/arch/x86/include/asm/mmu_context.h
> >>>> @@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
> >>>>      return cr3;
> >>>> }
> >>>> 
> >>>> +typedef struct {
> >>>> +     struct mm_struct *prev;
> >>>> +} temporary_mm_state_t;
> >>>> +
> >>>> +static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
> >>>> +{
> >>>> +     temporary_mm_state_t state;
> >>>> +
> >>>> +     lockdep_assert_irqs_disabled();
> >>>> +     state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> >>>> +     switch_mm_irqs_off(NULL, mm, current);
> >>>> +     return state;
> >>>> +}
> >>> 
> >>> Hmm, why don't we return mm_struct *prev directly?
> >> 
> >> I did it this way to make it easier to add future debugging stuff
> >> later. Also, when I first wrote this, I stashed the old CR3 instead
> >> of the old mm_struct, and it seemed like callers should be insulated
> >> from details like this.
> > 
> > Hmm, I see. But in that case, we should call it "struct temporary_mm"
> > and explicitly allocate (and pass) it, since we cannot return the
> > data structure from the stack.
> 
> Why not?

Ah, OK, as long as it returns the data structure by value.
(I don't recommend it, because it hides a copy.)

> 
> > If we combine it with the new mm, it will
> > be more encapsulated, e.g.:
> > 
> > struct temporary_mm {
> >    struct mm_struct *mm;
> >    struct mm_struct *prev;
> > };
> > 
> > static struct temporary_mm poking_tmp_mm;
> > 
> > poking_init()
> > {
> >    if (init_temporary_mm(&tmp_mm, &init_mm))
> >        goto error;
> >    ...
> > }
> > 
> > text_poke_safe()
> > {
> >    ...
> >    use_temporary_mm(&tmp_mm);
> >    ...
> >    unuse_temporary_mm(&tmp_mm);
> > }
> > 
> > Any thought?
> 
> That seems more complicated for not very much gain.

Hmm, OK. Anyway, that is just a style note. The code itself looks good to me.

Thank you,

> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread

Thread overview: 34+ messages
2018-08-29  8:11 [RFC PATCH 0/6] x86: text_poke() fixes Nadav Amit
2018-08-29  8:11 ` [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken Nadav Amit
2018-08-29  8:59   ` Masami Hiramatsu
2018-08-29 17:11     ` Nadav Amit
2018-08-29 19:36       ` Nadav Amit
2018-08-29 20:13         ` Sean Christopherson
2018-08-29 20:44           ` Nadav Amit
2018-08-29 21:00             ` Sean Christopherson
2018-08-29 22:56               ` Nadav Amit
2018-08-30  2:26               ` Masami Hiramatsu
2018-08-30  5:23                 ` Nadav Amit
2018-08-29  8:11 ` [RFC PATCH 2/6] x86/mm: temporary mm struct Nadav Amit
2018-08-29  9:49   ` Masami Hiramatsu
2018-08-29 15:41     ` Andy Lutomirski
2018-08-29 16:54       ` Nadav Amit
2018-08-29 21:38         ` Andy Lutomirski
2018-08-30  1:38       ` Masami Hiramatsu
2018-08-30  1:59         ` Andy Lutomirski
2018-08-31  4:42           ` Masami Hiramatsu
2018-08-29 15:46   ` Andy Lutomirski
2018-08-29  8:11 ` [RFC PATCH 3/6] fork: provide a function for copying init_mm Nadav Amit
2018-08-29  9:54   ` Masami Hiramatsu
2018-08-29  8:11 ` [RFC PATCH 4/6] x86/alternatives: initializing temporary mm for patching Nadav Amit
2018-08-29 13:21   ` Masami Hiramatsu
2018-08-29 17:45     ` Nadav Amit
2018-08-29  8:11 ` [RFC PATCH 5/6] x86/alternatives: use temporary mm for text poking Nadav Amit
2018-08-29  9:28   ` Peter Zijlstra
2018-08-29 15:46     ` Andy Lutomirski
2018-08-29 16:14       ` Peter Zijlstra
2018-08-29 16:32         ` Andy Lutomirski
2018-08-29 16:37           ` Dave Hansen
2018-08-29  8:11 ` [RFC PATCH 6/6] x86/alternatives: remove text_poke() return value Nadav Amit
2018-08-29  9:52   ` Masami Hiramatsu
2018-08-29 17:15     ` Nadav Amit
