linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] x86/alternatives: text_poke() fixes
@ 2018-09-02 17:32 Nadav Amit
  2018-09-02 17:32 ` [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()" Nadav Amit
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Nadav Amit @ 2018-09-02 17:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Dave Hansen, Nadav Amit, Nadav Amit, Jiri Kosina,
	Andy Lutomirski, Masami Hiramatsu, Kees Cook, Peter Zijlstra

This patch-set addresses some issues that were raised in a recent
correspondence and might affect the security and the correctness of code
patching. (Note that patching performance is not addressed by this
patch-set).

The main issue that the patches deal with is the fact that the fixmap
PTEs that are used for patching are available for access from other
cores and might be exploited. They are not even flushed from the TLB in
remote cores, so the risk is even higher. Address this issue by
introducing a temporary mm that is only used during patching.
Unfortunately, due to init ordering, fixmap is still used during
boot-time patching. Future patches can eliminate the need for it.

The second issue is the missing lockdep assertion to ensure text_mutex
is taken. It is actually not always taken, so fix the instances that
were found not to take the lock (although they should be safe even
without taking the lock).

Finally, try to be more conservative and to map a single page, instead
of two, when possible. This helps both security and performance.

In addition, there is some cleanup of the patching code to make it more
readable.

v1->v2:
- Partial revert of 9222f606506c added to 1/6 [masami]
- Added Masami's reviewed-by tag

RFC->v1:
- Added handling of error in get_locked_pte()
- Remove lockdep assertion, clarify text_mutex use instead [masami]
- Comment fix [peterz]
- Removed remainders of text_poke return value [masami]
- Use __weak for poking_init instead of macros [masami]
- Simplify error handling in poking_init [masami]  

Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.org/lkml/2018/8/24/586

Andy Lutomirski (1):
  x86/mm: temporary mm struct

Nadav Amit (5):
  Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  fork: provide a function for copying init_mm
  x86/alternatives: initializing temporary mm for patching
  x86/alternatives: use temporary mm for text poking
  x86/alternatives: remove text_poke() return value

 arch/x86/include/asm/mmu_context.h   |  20 +++
 arch/x86/include/asm/pgtable.h       |   3 +
 arch/x86/include/asm/text-patching.h |   4 +-
 arch/x86/kernel/alternative.c        | 175 +++++++++++++++++++++++----
 arch/x86/mm/init_64.c                |  29 +++++
 include/linux/sched/task.h           |   1 +
 init/main.c                          |   3 +
 kernel/fork.c                        |  24 +++-
 8 files changed, 227 insertions(+), 32 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  2018-09-02 17:32 [PATCH v2 0/6] x86/alternatives: text_poke() fixes Nadav Amit
@ 2018-09-02 17:32 ` Nadav Amit
  2018-09-06 19:40   ` Peter Zijlstra
  2018-09-02 17:32 ` [PATCH v2 2/6] x86/mm: temporary mm struct Nadav Amit
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2018-09-02 17:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Dave Hansen, Nadav Amit, Nadav Amit, Jiri Kosina,
	Andy Lutomirski, Kees Cook, Dave Hansen

text_mutex is expected to be held before text_poke() is called, but we
cannot add a lockdep assertion since kgdb does not take it, and instead
*supposedly* ensures the lock is not taken and will not be acquired by
any other core while text_poke() is running.

The reason for the "supposedly" comment is that it is not entirely clear
that this would be the case if kgdb_do_roundup is zero.

Add a comment to clarify this behavior, and restore the assertions as
they were before the recent commit.

This partially reverts commit 9222f606506c ("x86/alternatives:
Lockdep-enforce text_mutex in text_poke*()")

Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Fixes: 9222f606506c ("x86/alternatives: Lockdep-enforce text_mutex in text_poke*()")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/alternative.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index b9d5e7c9ef43..e9be18245698 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -684,6 +684,11 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
  * It means the size must be writable atomically and the address must be aligned
  * in a way that permits an atomic write. It also makes sure we fit on a single
  * page.
+ *
+ * Context: Must be called under text_mutex. kgdb is an exception: it does not
+ *	    hold the mutex, as it *supposedly* ensures that no other core is
+ *	    holding the mutex and ensures that none of them will acquire the
+ *	    mutex while the code runs.
  */
 void *text_poke(void *addr, const void *opcode, size_t len)
 {
@@ -698,8 +703,6 @@ void *text_poke(void *addr, const void *opcode, size_t len)
 	 */
 	BUG_ON(!after_bootmem);
 
-	lockdep_assert_held(&text_mutex);
-
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
 		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 2/6] x86/mm: temporary mm struct
  2018-09-02 17:32 [PATCH v2 0/6] x86/alternatives: text_poke() fixes Nadav Amit
  2018-09-02 17:32 ` [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()" Nadav Amit
@ 2018-09-02 17:32 ` Nadav Amit
  2018-09-02 17:32 ` [PATCH v2 3/6] fork: provide a function for copying init_mm Nadav Amit
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Nadav Amit @ 2018-09-02 17:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Dave Hansen, Nadav Amit, Andy Lutomirski, Kees Cook,
	Peter Zijlstra, Dave Hansen, Nadav Amit

From: Andy Lutomirski <luto@kernel.org>

Sometimes we want to set temporary page-table entries (PTEs) on one of
the cores, without allowing other cores to use these mappings - even
speculatively. There are two benefits in doing so:

(1) Security: if sensitive PTEs are set, the temporary mm prevents their
use by other cores. This hardens security, as it prevents exploiting a
dangling pointer to overwrite sensitive data using the sensitive PTE.

(2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
remote page-tables.

To do so a temporary mm_struct can be used. Mappings which are private
for this mm can be set in the userspace part of the address-space.
During the whole time in which the temporary mm is loaded, interrupts
must be disabled.

The first use-case for temporary PTEs, which will follow, is for poking
the kernel text.

[ Commit message was written by Nadav ]

Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index eeeb9289c764..96afc8c0cf15 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
 	return cr3;
 }
 
+typedef struct {
+	struct mm_struct *prev;
+} temporary_mm_state_t;
+
+static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
+{
+	temporary_mm_state_t state;
+
+	lockdep_assert_irqs_disabled();
+	state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+	switch_mm_irqs_off(NULL, mm, current);
+	return state;
+}
+
+static inline void unuse_temporary_mm(temporary_mm_state_t prev)
+{
+	lockdep_assert_irqs_disabled();
+	switch_mm_irqs_off(NULL, prev.prev, current);
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 3/6] fork: provide a function for copying init_mm
  2018-09-02 17:32 [PATCH v2 0/6] x86/alternatives: text_poke() fixes Nadav Amit
  2018-09-02 17:32 ` [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()" Nadav Amit
  2018-09-02 17:32 ` [PATCH v2 2/6] x86/mm: temporary mm struct Nadav Amit
@ 2018-09-02 17:32 ` Nadav Amit
  2018-09-02 17:32 ` [PATCH v2 4/6] x86/alternatives: initializing temporary mm for patching Nadav Amit
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Nadav Amit @ 2018-09-02 17:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Dave Hansen, Nadav Amit, Nadav Amit, Andy Lutomirski, Kees Cook,
	Peter Zijlstra, Dave Hansen

Provide a function for copying init_mm. This function will be later used
for setting a temporary mm.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 24 ++++++++++++++++++------
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 108ede99e533..ac0a675678f5 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -74,6 +74,7 @@ extern void exit_itimers(struct signal_struct *);
 extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
+struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
 extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index d896e9ca38b0..a1c637b903c1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1254,13 +1254,20 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 		complete_vfork_done(tsk);
 }
 
-/*
- * Allocate a new mm structure and copy contents from the
- * mm structure of the passed in task structure.
+/**
+ * dup_mm() - duplicates an existing mm structure
+ * @tsk: the task_struct with which the new mm will be associated.
+ * @oldmm: the mm to duplicate.
+ *
+ * Allocates a new mm structure and copy contents from the provided
+ * @oldmm structure.
+ *
+ * Return: the duplicated mm or NULL on failure.
  */
-static struct mm_struct *dup_mm(struct task_struct *tsk)
+static struct mm_struct *dup_mm(struct task_struct *tsk,
+				struct mm_struct *oldmm)
 {
-	struct mm_struct *mm, *oldmm = current->mm;
+	struct mm_struct *mm;
 	int err;
 
 	mm = allocate_mm();
@@ -1327,7 +1334,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	}
 
 	retval = -ENOMEM;
-	mm = dup_mm(tsk);
+	mm = dup_mm(tsk, current->mm);
 	if (!mm)
 		goto fail_nomem;
 
@@ -2127,6 +2134,11 @@ struct task_struct *fork_idle(int cpu)
 	return task;
 }
 
+struct mm_struct *copy_init_mm(void)
+{
+	return dup_mm(NULL, &init_mm);
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 4/6] x86/alternatives: initializing temporary mm for patching
  2018-09-02 17:32 [PATCH v2 0/6] x86/alternatives: text_poke() fixes Nadav Amit
                   ` (2 preceding siblings ...)
  2018-09-02 17:32 ` [PATCH v2 3/6] fork: provide a function for copying init_mm Nadav Amit
@ 2018-09-02 17:32 ` Nadav Amit
  2018-09-06  9:01   ` Peter Zijlstra
  2018-09-02 17:32 ` [PATCH v2 5/6] x86/alternatives: use temporary mm for text poking Nadav Amit
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2018-09-02 17:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Dave Hansen, Nadav Amit, Nadav Amit, Kees Cook, Peter Zijlstra,
	Dave Hansen

To prevent improper use of the PTEs that are used for text patching, we
want to use a temporary mm struct. We initialize it by copying the init
mm.

The address that will be used for patching is taken from the lower area
that is usually used for task memory. Doing so avoids the need to
frequently synchronize the temporary mm (e.g., when BPF programs are
installed), since different PGDs are used for task memory.

Finally, we randomize the address of the PTEs to harden against exploits
that use these PTEs.

Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/pgtable.h       |  3 +++
 arch/x86/include/asm/text-patching.h |  2 ++
 arch/x86/mm/init_64.c                | 29 ++++++++++++++++++++++++++++
 init/main.c                          |  3 +++
 4 files changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e4ffa565a69f..3de9a1fb7a9a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1022,6 +1022,9 @@ static inline void __meminit init_trampoline_default(void)
 	/* Default trampoline pgd value */
 	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
+
+void __init poking_init(void);
+
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
 # else
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..ffe7902cc326 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -38,5 +38,7 @@ extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
+extern __ro_after_init struct mm_struct *poking_mm;
+extern __ro_after_init unsigned long poking_addr;
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dd519f372169..db33a724a054 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -54,6 +54,7 @@
 #include <asm/init.h>
 #include <asm/uv/uv.h>
 #include <asm/setup.h>
+#include <asm/text-patching.h>
 
 #include "mm_internal.h"
 
@@ -1389,6 +1390,34 @@ unsigned long memory_block_size_bytes(void)
 	return memory_block_size_probed;
 }
 
+/*
+ * Initialize an mm_struct to be used during poking and a pointer to be used
+ * during patching. If anything fails during initialization, poking will be done
+ * using the fixmap, which is unsafe, so warn the user about it.
+ */
+void __init poking_init(void)
+{
+	unsigned long poking_addr;
+
+	poking_mm = copy_init_mm();
+	if (!poking_mm) {
+		pr_err("x86/mm: error setting a separate poking address space");
+		return;
+	}
+
+	/*
+	 * Randomize the poking address, but make sure that the following page
+	 * will be mapped at the same PMD. We need 2 pages, so find space for 3,
+	 * and adjust the address if the PMD ends after the first one.
+	 */
+	poking_addr = TASK_UNMAPPED_BASE +
+		(kaslr_get_random_long("Poking") & PAGE_MASK) %
+		(TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
+
+	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
+		poking_addr += PAGE_SIZE;
+}
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
diff --git a/init/main.c b/init/main.c
index 18f8f0140fa0..8c6dd8d88fca 100644
--- a/init/main.c
+++ b/init/main.c
@@ -496,6 +496,8 @@ void __init __weak thread_stack_cache_init(void)
 
 void __init __weak mem_encrypt_init(void) { }
 
+void __init __weak poking_init(void) { }
+
 bool initcall_debug;
 core_param(initcall_debug, initcall_debug, bool, 0644);
 
@@ -725,6 +727,7 @@ asmlinkage __visible void __init start_kernel(void)
 	taskstats_init_early();
 	delayacct_init();
 
+	poking_init();
 	check_bugs();
 
 	acpi_subsystem_init();
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 5/6] x86/alternatives: use temporary mm for text poking
  2018-09-02 17:32 [PATCH v2 0/6] x86/alternatives: text_poke() fixes Nadav Amit
                   ` (3 preceding siblings ...)
  2018-09-02 17:32 ` [PATCH v2 4/6] x86/alternatives: initializing temporary mm for patching Nadav Amit
@ 2018-09-02 17:32 ` Nadav Amit
  2018-09-02 17:32 ` [PATCH v2 6/6] x86/alternatives: remove text_poke() return value Nadav Amit
  2018-09-05 18:56 ` [PATCH v2 0/6] x86/alternatives: text_poke() fixes Peter Zijlstra
  6 siblings, 0 replies; 31+ messages in thread
From: Nadav Amit @ 2018-09-02 17:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Dave Hansen, Nadav Amit, Nadav Amit, Andy Lutomirski, Kees Cook,
	Peter Zijlstra, Dave Hansen

text_poke() can potentially compromise security as it sets temporary
PTEs in the fixmap. These PTEs might be used to rewrite the kernel code
from other cores, accidentally or maliciously, if an attacker gains the
ability to write to kernel memory.

Moreover, since remote TLBs are not flushed after the temporary PTEs are
removed, the time-window in which the code is writable is not limited if
the fixmap PTEs - maliciously or accidentally - are cached in the TLB.

To address these potential security hazards, we use a temporary mm for
patching the code. Unfortunately, the temporary mm cannot be initialized
early enough during init, and as a result x86_late_time_init() calls
text_poke() before the temporary mm is ready. text_poke() therefore
keeps the two poking versions - using the fixmap and using the temporary
mm - and uses them accordingly.

More adventurous developers can try to reorder the init sequence or use
text_poke_early() instead of text_poke() to remove the use of fixmap for
patching completely.

Finally, text_poke() is also not conservative enough when mapping pages,
as it always tries to map 2 pages, even when a single one is sufficient.
So try to be more conservative, and do not map more than needed.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/alternative.c | 165 +++++++++++++++++++++++++++++-----
 1 file changed, 144 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index e9be18245698..edca599c4479 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -11,6 +11,7 @@
 #include <linux/stop_machine.h>
 #include <linux/slab.h>
 #include <linux/kdebug.h>
+#include <linux/mmu_context.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -674,6 +675,124 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 	return addr;
 }
 
+/**
+ * text_poke_fixmap - poke using the fixmap.
+ *
+ * Fallback function for poking the text using the fixmap. It is used during
+ * early boot and in the rare case in which initialization of safe poking fails.
+ *
+ * Poking in this manner should be avoided, since it allows other cores to use
+ * the fixmap entries, and can be exploited by an attacker to overwrite the code
+ * (assuming the attacker gained write access through another bug).
+ */
+static void text_poke_fixmap(void *addr, const void *opcode, size_t len,
+			     struct page *pages[2])
+{
+	u8 *vaddr;
+
+	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
+	if (pages[1])
+		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
+	vaddr = (u8 *)fix_to_virt(FIX_TEXT_POKE0);
+	memcpy(vaddr + offset_in_page(addr), opcode, len);
+
+	/*
+	 * clear_fixmap() performs a TLB flush, so no additional TLB
+	 * flush is needed.
+	 */
+	clear_fixmap(FIX_TEXT_POKE0);
+	if (pages[1])
+		clear_fixmap(FIX_TEXT_POKE1);
+	sync_core();
+
+	/*
+	 * Could also do a CLFLUSH here to speed up CPU recovery; but
+	 * that causes hangs on some VIA CPUs.
+	 */
+}
+
+__ro_after_init struct mm_struct *poking_mm;
+__ro_after_init unsigned long poking_addr;
+
+/**
+ * text_poke_safe() - Pokes the text using a separate address space.
+ *
+ * This is the preferable way for patching the kernel after boot, as it does not
+ * allow other cores to accidentally or maliciously modify the code using the
+ * temporary PTEs.
+ */
+static void text_poke_safe(void *addr, const void *opcode, size_t len,
+			   struct page *pages[2])
+{
+	temporary_mm_state_t prev;
+	pte_t pte, *ptep;
+	spinlock_t *ptl;
+
+	/*
+	 * The lock is not really needed, but this allows us to avoid open-coding.
+	 */
+	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
+
+	/*
+	 * If we failed to allocate a PTE, fail silently. The caller (text_poke)
+	 * will detect that the write failed when it compares the memory with
+	 * the new opcode.
+	 */
+	if (unlikely(!ptep))
+		return;
+
+	pte = mk_pte(pages[0], PAGE_KERNEL);
+	set_pte_at(poking_mm, poking_addr, ptep, pte);
+
+	if (pages[1]) {
+		pte = mk_pte(pages[1], PAGE_KERNEL);
+		set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
+	}
+
+	/*
+	 * Loading the temporary mm behaves as a compiler barrier, which
+	 * guarantees that the PTE will be set at the time memcpy() is done.
+	 */
+	prev = use_temporary_mm(poking_mm);
+
+	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
+
+	/*
+	 * Ensure that the PTE is only cleared after the instructions of memcpy
+	 * were issued by using a compiler barrier.
+	 */
+	barrier();
+
+	pte_clear(poking_mm, poking_addr, ptep);
+
+	/*
+	 * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
+	 * as it also flushes the corresponding "user" address space, which
+	 * does not exist.
+	 *
+	 * Poking, however, is already very inefficient since it does not try to
+	 * batch updates, so we ignore this problem for the time being.
+	 *
+	 * Since the PTEs do not exist in other kernel address-spaces, we do
+	 * not use __flush_tlb_one_kernel(), which when PTI is on would cause
+	 * more unwarranted TLB flushes.
+	 */
+	__flush_tlb_one_user(poking_addr);
+	if (pages[1]) {
+		pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);
+		__flush_tlb_one_user(poking_addr + PAGE_SIZE);
+	}
+
+	/*
+	 * Loading the previous page-table hierarchy requires a serializing
+	 * instruction that already allows the core to see the updated version.
+	 * Xen-PV is assumed to serialize execution in a similar manner.
+	 */
+	unuse_temporary_mm(prev);
+
+	pte_unmap_unlock(ptep, ptl);
+}
+
 /**
  * text_poke - Update instructions on a live kernel
  * @addr: address to modify
@@ -692,41 +811,45 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
  */
 void *text_poke(void *addr, const void *opcode, size_t len)
 {
+	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
+	struct page *pages[2] = {0};
 	unsigned long flags;
-	char *vaddr;
-	struct page *pages[2];
-	int i;
 
 	/*
-	 * While boot memory allocator is runnig we cannot use struct
-	 * pages as they are not yet initialized.
+	 * While boot memory allocator is running we cannot use struct pages as
+	 * they are not yet initialized.
 	 */
 	BUG_ON(!after_bootmem);
 
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
-		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
+		if (cross_page_boundary)
+			pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
 	} else {
 		pages[0] = virt_to_page(addr);
 		WARN_ON(!PageReserved(pages[0]));
-		pages[1] = virt_to_page(addr + PAGE_SIZE);
+		if (cross_page_boundary)
+			pages[1] = virt_to_page(addr + PAGE_SIZE);
 	}
 	BUG_ON(!pages[0]);
 	local_irq_save(flags);
-	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
-	if (pages[1])
-		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
-	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
-	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
-	clear_fixmap(FIX_TEXT_POKE0);
-	if (pages[1])
-		clear_fixmap(FIX_TEXT_POKE1);
-	local_flush_tlb();
-	sync_core();
-	/* Could also do a CLFLUSH here to speed up CPU recovery; but
-	   that causes hangs on some VIA CPUs. */
-	for (i = 0; i < len; i++)
-		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
+
+	/*
+	 * During initial boot, it is hard to initialize poking_mm due to
+	 * dependencies in boot order.
+	 */
+	if (poking_mm)
+		text_poke_safe(addr, opcode, len, pages);
+	else
+		text_poke_fixmap(addr, opcode, len, pages);
+
+	/*
+	 * To be on the safe side, do the comparison before enabling IRQs, as it
+	 * was done before. However, it makes more sense to allow the callers to
+	 * deal with potential failures and not to panic so easily.
+	 */
+	BUG_ON(memcmp(addr, opcode, len));
+
 	local_irq_restore(flags);
 	return addr;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 6/6] x86/alternatives: remove text_poke() return value
  2018-09-02 17:32 [PATCH v2 0/6] x86/alternatives: text_poke() fixes Nadav Amit
                   ` (4 preceding siblings ...)
  2018-09-02 17:32 ` [PATCH v2 5/6] x86/alternatives: use temporary mm for text poking Nadav Amit
@ 2018-09-02 17:32 ` Nadav Amit
  2018-09-05 18:56 ` [PATCH v2 0/6] x86/alternatives: text_poke() fixes Peter Zijlstra
  6 siblings, 0 replies; 31+ messages in thread
From: Nadav Amit @ 2018-09-02 17:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, x86, Arnd Bergmann, linux-arch,
	Dave Hansen, Nadav Amit, Nadav Amit, Andy Lutomirski, Kees Cook,
	Peter Zijlstra

The return value of text_poke() is meaningless - it is one of the
function inputs. One day someone may allow the callers to deal with
text_poke() failures, if those actually happen.

In the meanwhile, remove the return value.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/text-patching.h | 2 +-
 arch/x86/kernel/alternative.c        | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index ffe7902cc326..1f73f71b4de2 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -34,7 +34,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
  * On the local CPU you need to be protected again NMI or MCE handlers seeing an
  * inconsistent instruction while you patch.
  */
-extern void *text_poke(void *addr, const void *opcode, size_t len);
+extern void text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index edca599c4479..7a30c5a3ca37 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -809,7 +809,7 @@ static void text_poke_safe(void *addr, const void *opcode, size_t len,
  *	    holding the mutex and ensures that none of them will acquire the
  *	    mutex while the code runs.
  */
-void *text_poke(void *addr, const void *opcode, size_t len)
+void text_poke(void *addr, const void *opcode, size_t len)
 {
 	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
 	struct page *pages[2] = {0};
@@ -851,7 +851,6 @@ void *text_poke(void *addr, const void *opcode, size_t len)
 	BUG_ON(memcmp(addr, opcode, len));
 
 	local_irq_restore(flags);
-	return addr;
 }
 
 static void do_sync_core(void *info)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-02 17:32 [PATCH v2 0/6] x86/alternatives: text_poke() fixes Nadav Amit
                   ` (5 preceding siblings ...)
  2018-09-02 17:32 ` [PATCH v2 6/6] x86/alternatives: remove text_poke() return value Nadav Amit
@ 2018-09-05 18:56 ` Peter Zijlstra
  2018-09-05 19:02   ` Nadav Amit
  6 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-05 18:56 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, x86, Arnd Bergmann,
	linux-arch, Dave Hansen, Nadav Amit, Jiri Kosina,
	Andy Lutomirski, Masami Hiramatsu, Kees Cook

On Sun, Sep 02, 2018 at 10:32:18AM -0700, Nadav Amit wrote:
> This patch-set addresses some issues that were raised in a recent
> correspondence and might affect the security and the correctness of code
> patching. (Note that patching performance is not addressed by this
> patch-set).
> 
> The main issue that the patches deal with is the fact that the fixmap
> PTEs that are used for patching are available for access from other
> cores and might be exploited. They are not even flushed from the TLB in
> remote cores, so the risk is even higher. Address this issue by
> introducing a temporary mm that is only used during patching.
> Unfortunately, due to init ordering, fixmap is still used during
> boot-time patching. Future patches can eliminate the need for it.
> 

Remind me; why are we doing it like this instead of fixing fixmap?
Because while this fixes the text_poke crud, it does leave fixmap
broken.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-05 18:56 ` [PATCH v2 0/6] x86/alternatives: text_poke() fixes Peter Zijlstra
@ 2018-09-05 19:02   ` Nadav Amit
  2018-09-05 19:10     ` Nadav Amit
  0 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2018-09-05 19:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

at 11:56 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Sep 02, 2018 at 10:32:18AM -0700, Nadav Amit wrote:
>> This patch-set addresses some issues that were raised in a recent
>> correspondence and might affect the security and the correctness of code
>> patching. (Note that patching performance is not addressed by this
>> patch-set).
>> 
>> The main issue that the patches deal with is the fact that the fixmap
>> PTEs that are used for patching are available for access from other
>> cores and might be exploited. They are not even flushed from the TLB in
>> remote cores, so the risk is even higher. Address this issue by
>> introducing a temporary mm that is only used during patching.
>> Unfortunately, due to init ordering, fixmap is still used during
>> boot-time patching. Future patches can eliminate the need for it.
> 
> Remind me; why are we doing it like this instead of fixing fixmap?
> Because while this fixes the text_poke crud, it does leave fixmap
> broken.

Do you have other fixmap mappings in mind that are modified after boot?



* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-05 19:02   ` Nadav Amit
@ 2018-09-05 19:10     ` Nadav Amit
  2018-09-06  8:13       ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2018-09-05 19:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

at 12:02 PM, Nadav Amit <namit@vmware.com> wrote:

> at 11:56 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Sun, Sep 02, 2018 at 10:32:18AM -0700, Nadav Amit wrote:
>>> This patch-set addresses some issues that were raised in a recent
>>> correspondence and might affect the security and the correctness of code
>>> patching. (Note that patching performance is not addressed by this
>>> patch-set).
>>> 
>>> The main issue that the patches deal with is the fact that the fixmap
>>> PTEs that are used for patching are available for access from other
>>> cores and might be exploited. They are not even flushed from the TLB in
>>> remote cores, so the risk is even higher. Address this issue by
>>> introducing a temporary mm that is only used during patching.
>>> Unfortunately, due to init ordering, fixmap is still used during
>>> boot-time patching. Future patches can eliminate the need for it.
>> 
>> Remind me; why are we doing it like this instead of fixing fixmap?
>> Because while this fixes the text_poke crud, it does leave fixmap
>> broken.
> 
> Do you have other fixmap mappings in mind that are modified after boot?

Oh.. I misunderstood you. You mean: why not make the fixmap mappings that
are used for text_poke() private ones.

Well, the main reason is that it can require synchronization of the
different page-tables whenever a module is loaded/unloaded. The fixmap
region shares a PGD and PUD with the modules area on x86-64.

In contrast, the proposed solution uses a different PGD, so no
synchronization between page-tables is needed when modules are loaded.
Remember that module memory is allocated even when BPF programs are
installed, which can be a rather common scenario.



* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-05 19:10     ` Nadav Amit
@ 2018-09-06  8:13       ` Peter Zijlstra
  2018-09-06  8:42         ` Peter Zijlstra
                           ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06  8:13 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

On Wed, Sep 05, 2018 at 07:10:46PM +0000, Nadav Amit wrote:
> at 12:02 PM, Nadav Amit <namit@vmware.com> wrote:
> 
> > at 11:56 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> >> On Sun, Sep 02, 2018 at 10:32:18AM -0700, Nadav Amit wrote:
> >>> This patch-set addresses some issues that were raised in a recent
> >>> correspondence and might affect the security and the correctness of code
> >>> patching. (Note that patching performance is not addressed by this
> >>> patch-set).
> >>> 
> >>> The main issue that the patches deal with is the fact that the fixmap
> >>> PTEs that are used for patching are available for access from other
> >>> cores and might be exploited. They are not even flushed from the TLB in
> >>> remote cores, so the risk is even higher. Address this issue by
> >>> introducing a temporary mm that is only used during patching.
> >>> Unfortunately, due to init ordering, fixmap is still used during
> >>> boot-time patching. Future patches can eliminate the need for it.
> >> 
> >> Remind me; why are we doing it like this instead of fixing fixmap?
> >> Because while this fixes the text_poke crud, it does leave fixmap
> >> broken.
> > 
> > Do you have other fixmap mappings in mind that are modified after boot?
> 
> Oh.. I misunderstood you. You mean: why not to make the fixmap mappings that
> are used for text_poke() as private ones.

No, you got it the first time. There are in fact more fixmap abusers;
see drivers/acpi/apei/ghes.c.  Also, as long as set_fixmap() allows
overwriting a _PAGE_PRESENT pte and has that dodgy
__flush_tlb_one_kernel() in it, the broken remains (and can return).

So we need to fix fixmap, to either disallow overwriting a _PAGE_PRESENT
pte, or to issue a full TLB invalidate.

Either fix will terminally break GHES, but that's OK, they've known
about this issue since 2015 and haven't cared, so I can't be bothered
about them.



* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06  8:13       ` Peter Zijlstra
@ 2018-09-06  8:42         ` Peter Zijlstra
  2018-09-06  9:18         ` Peter Zijlstra
  2018-09-06 10:16         ` Peter Zijlstra
  2 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06  8:42 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

On Thu, Sep 06, 2018 at 10:13:00AM +0200, Peter Zijlstra wrote:

> No, you got it the first time. There are in fact more fixmap abusers;
> see drivers/acpi/apei/ghes.c.  Also, as long as set_fixmap() allows
> overwriting a _PAGE_PRESENT pte and has that dodgy
> __flush_tlb_one_kernel() in it, the broken remains (and can return).
> 
> So we need to fix fixmap, to either disallow overwriting a _PAGE_PRESENT
> pte, or to issue a full TLB invalidate.
> 
> Either fix will terminally break GHES, but that's OK, they've known
> about this issue since 2015 and haven't cared, so I can't be bothered
> about them.

Worse, when I boot with the below patch in, I get hits on text_poke (we
knew about this), but I also get hits from:

  release_ds_buffer()

Because the PEBS/BTS hardware writes through the page-tables (I know,
right), their buffers need to be in the PTI user map, and we do that by
mapping the pages in the CEA (cpu_entry_area), and that uses fixmap.

---
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dd519f372169..20dcd6fa690e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -259,9 +259,16 @@ static pte_t *fill_pte(pmd_t *pmd, unsigned long vaddr)
 static void __set_pte_vaddr(pud_t *pud, unsigned long vaddr, pte_t new_pte)
 {
 	pmd_t *pmd = fill_pmd(pud, vaddr);
-	pte_t *pte = fill_pte(pmd, vaddr);
+	pte_t *ptep = fill_pte(pmd, vaddr);
 
-	set_pte(pte, new_pte);
+	pte_t pte = *ptep;
+
+	set_pte(ptep, new_pte);
+
+	WARN((pte_val(pte) & _PAGE_PRESENT) && (pte_val(pte) != pte_val(new_pte)),
+		"set_fixmap(%d, %lx), was: %lx",
+		(int)__virt_to_fix(vaddr),
+		pte_val(new_pte), pte_val(pte));
 
 	/*
 	 * It's enough to flush this one mapping.


* Re: [PATCH v2 4/6] x86/alternatives: initializing temporary mm for patching
  2018-09-02 17:32 ` [PATCH v2 4/6] x86/alternatives: initializing temporary mm for patching Nadav Amit
@ 2018-09-06  9:01   ` Peter Zijlstra
  2018-09-07 20:52     ` Nadav Amit
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06  9:01 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, x86, Arnd Bergmann,
	linux-arch, Dave Hansen, Nadav Amit, Kees Cook, Dave Hansen

On Sun, Sep 02, 2018 at 10:32:22AM -0700, Nadav Amit wrote:
> +void __init poking_init(void)
> +{
> +	unsigned long poking_addr;
> +
> +	poking_mm = copy_init_mm();
> +	if (!poking_mm) {
> +		pr_err("x86/mm: error setting a separate poking address space");
> +		return;
> +	}
> +
> +	/*
> +	 * Randomize the poking address, but make sure that the following page
> +	 * will be mapped at the same PMD. We need 2 pages, so find space for 3,
> +	 * and adjust the address if the PMD ends after the first one.
> +	 */
> +	poking_addr = TASK_UNMAPPED_BASE +
> +		(kaslr_get_random_long("Poking") & PAGE_MASK) %
> +		(TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
> +
> +	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
> +		poking_addr += PAGE_SIZE;
> +}

This fails to compile for me.. I don't have kaslr_get_random_long().


* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06  8:13       ` Peter Zijlstra
  2018-09-06  8:42         ` Peter Zijlstra
@ 2018-09-06  9:18         ` Peter Zijlstra
  2018-09-06 10:16         ` Peter Zijlstra
  2 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06  9:18 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

On Thu, Sep 06, 2018 at 10:13:00AM +0200, Peter Zijlstra wrote:
> No, you got it the first time. There are in fact more fixmap abusers;
> see drivers/acpi/apei/ghes.c.  Also, as long as set_fixmap() allows
> overwriting a _PAGE_PRESENT pte and has that dodgy
> __flush_tlb_one_kernel() in it, the broken remains (and can return).
> 
> So we need to fix fixmap, to either disallow overwriting a _PAGE_PRESENT
> pte, or to issue a full TLB invalidate.
> 
> Either fix will terminally break GHES, but that's OK, they've known
> about this issue since 2015 and haven't cared, so I can't be bothered
> about them.

The below, combined with your text_poke() patches (text_poke() disables
IRQs, so the below still complains, and I was lazy) makes my machine
(mostly) happy.

There's still some complaining before the poking_mm exists. Let me see
if I can't cure that.

---

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dd519f372169..8df85fe66332 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -259,15 +260,27 @@ static pte_t *fill_pte(pmd_t *pmd, unsigned long vaddr)
 static void __set_pte_vaddr(pud_t *pud, unsigned long vaddr, pte_t new_pte)
 {
 	pmd_t *pmd = fill_pmd(pud, vaddr);
-	pte_t *pte = fill_pte(pmd, vaddr);
+	pte_t *ptep = fill_pte(pmd, vaddr);
 
-	set_pte(pte, new_pte);
+	pte_t pte = *ptep;
 
-	/*
-	 * It's enough to flush this one mapping.
-	 * (PGE mappings get flushed as well)
-	 */
-	__flush_tlb_one_kernel(vaddr);
+	set_pte(ptep, new_pte);
+
+	if (irqs_disabled()) {
+		WARN((pte_val(pte) & _PAGE_PRESENT) && (pte_val(pte) != pte_val(new_pte)),
+				"set_fixmap(%d, %lx), was: %lx",
+				(int)__virt_to_fix(vaddr),
+				pte_val(new_pte), pte_val(pte));
+
+		/*
+		 * It is _NOT_ enough to flush just the local mapping;
+		 * it might mostly work, but there is no guarantee a remote
+		 * CPU didn't load this entry into its TLB.
+		 */
+		__flush_tlb_one_kernel(vaddr);
+	} else {
+		flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+	}
 }
 
 void set_pte_vaddr_p4d(p4d_t *p4d_page, unsigned long vaddr, pte_t new_pte)
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index 52ae5438edeb..e3c415bdbfbf 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -19,6 +19,7 @@ config ACPI_APEI
 
 config ACPI_APEI_GHES
 	bool "APEI Generic Hardware Error Source"
+	depends on BROKEN
 	depends on ACPI_APEI
 	select ACPI_HED
 	select IRQ_WORK


* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06  8:13       ` Peter Zijlstra
  2018-09-06  8:42         ` Peter Zijlstra
  2018-09-06  9:18         ` Peter Zijlstra
@ 2018-09-06 10:16         ` Peter Zijlstra
  2018-09-06 17:01           ` Nadav Amit
  2 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06 10:16 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

On Thu, Sep 06, 2018 at 10:13:00AM +0200, Peter Zijlstra wrote:
> No, you got it the first time. There are in fact more fixmap abusers;
> see drivers/acpi/apei/ghes.c.  Also, as long as set_fixmap() allows
> overwriting a _PAGE_PRESENT pte and has that dodgy
> __flush_tlb_one_kernel() in it, the broken remains (and can return).
> 
> So we need to fix fixmap, to either disallow overwriting a _PAGE_PRESENT
> pte, or to issue a full TLB invalidate.
> 
> Either fix will terminally break GHES, but that's OK, they've known
> about this issue since 2015 and haven't cared, so I can't be bothered
> about them.

In fact, the below seems to cure all woes..

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index b9d5e7c9ef43..00f1c9e4f0a3 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -709,22 +709,25 @@ void *text_poke(void *addr, const void *opcode, size_t len)
 		pages[1] = virt_to_page(addr + PAGE_SIZE);
 	}
 	BUG_ON(!pages[0]);
-	local_irq_save(flags);
+
 	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
 	if (pages[1])
 		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
+
+	local_irq_save(flags);
 	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
 	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
-	clear_fixmap(FIX_TEXT_POKE0);
-	if (pages[1])
-		clear_fixmap(FIX_TEXT_POKE1);
-	local_flush_tlb();
 	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */
 	for (i = 0; i < len; i++)
 		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
 	local_irq_restore(flags);
+
+	clear_fixmap(FIX_TEXT_POKE0);
+	if (pages[1])
+		clear_fixmap(FIX_TEXT_POKE1);
+
 	return addr;
 }
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dd519f372169..fd6af66bc400 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -260,14 +260,28 @@ static void __set_pte_vaddr(pud_t *pud, unsigned long vaddr, pte_t new_pte)
 {
 	pmd_t *pmd = fill_pmd(pud, vaddr);
 	pte_t *pte = fill_pte(pmd, vaddr);
+	pte_t old_pte = *pte;
 
 	set_pte(pte, new_pte);
 
 	/*
-	 * It's enough to flush this one mapping.
-	 * (PGE mappings get flushed as well)
+	 * If it was present and changes, we need to invalidate TLBs.
 	 */
-	__flush_tlb_one_kernel(vaddr);
+	if (!(pte_present(old_pte) && !pte_same(old_pte, new_pte)))
+		return;
+
+	if (WARN(irqs_disabled(), "broken set_fixmap(%d, %lx), was: %lx",
+				(int)__virt_to_fix(vaddr),
+				pte_val(new_pte), pte_val(old_pte))) {
+		/*
+		 * It is _NOT_ enough to flush just the local mapping;
+		 * it might mostly work, but there is no guarantee a remote
+		 * CPU didn't load this entry into its TLB.
+		 */
+		__flush_tlb_one_kernel(vaddr);
+	} else {
+		flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+	}
 }
 
 void set_pte_vaddr_p4d(p4d_t *p4d_page, unsigned long vaddr, pte_t new_pte)
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index 52ae5438edeb..e3c415bdbfbf 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -19,6 +19,7 @@ config ACPI_APEI
 
 config ACPI_APEI_GHES
 	bool "APEI Generic Hardware Error Source"
+	depends on BROKEN if X86
 	depends on ACPI_APEI
 	select ACPI_HED
 	select IRQ_WORK


* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06 10:16         ` Peter Zijlstra
@ 2018-09-06 17:01           ` Nadav Amit
  2018-09-06 17:17             ` Peter Zijlstra
  2018-09-06 17:23             ` Peter Zijlstra
  0 siblings, 2 replies; 31+ messages in thread
From: Nadav Amit @ 2018-09-06 17:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

at 3:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 06, 2018 at 10:13:00AM +0200, Peter Zijlstra wrote:
>> No, you got it the first time. There are in fact more fixmap abusers;
>> see drivers/acpi/apei/ghes.c.  Also, as long as set_fixmap() allows
>> overwriting a _PAGE_PRESENT pte and has that dodgy
>> __flush_tlb_one_kernel() in it, the broken remains (and can return).
>> 
>> So we need to fix fixmap, to either disallow overwriting a _PAGE_PRESENT
>> pte, or to issue a full TLB invalidate.
>> 
>> Either fix will terminally break GHES, but that's OK, they've known
>> about this issue since 2015 and haven't cared, so I can't be bothered
>> about them.
> 
> In fact, the below seems to cure all woes..
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index b9d5e7c9ef43..00f1c9e4f0a3 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -709,22 +709,25 @@ void *text_poke(void *addr, const void *opcode, size_t len)
> 		pages[1] = virt_to_page(addr + PAGE_SIZE);
> 	}
> 	BUG_ON(!pages[0]);
> -	local_irq_save(flags);
> +
> 	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
> 	if (pages[1])
> 		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
> +
> +	local_irq_save(flags);
> 	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
> 	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
> -	clear_fixmap(FIX_TEXT_POKE0);
> -	if (pages[1])
> -		clear_fixmap(FIX_TEXT_POKE1);
> -	local_flush_tlb();
> 	sync_core();
> 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
> 	   that causes hangs on some VIA CPUs. */
> 	for (i = 0; i < len; i++)
> 		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
> 	local_irq_restore(flags);
> +
> +	clear_fixmap(FIX_TEXT_POKE0);
> +	if (pages[1])
> +		clear_fixmap(FIX_TEXT_POKE1);
> +
> 	return addr;
> }
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index dd519f372169..fd6af66bc400 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -260,14 +260,28 @@ static void __set_pte_vaddr(pud_t *pud, unsigned long vaddr, pte_t new_pte)
> {
> 	pmd_t *pmd = fill_pmd(pud, vaddr);
> 	pte_t *pte = fill_pte(pmd, vaddr);
> +	pte_t old_pte = *pte;
> 
> 	set_pte(pte, new_pte);
> 
> 	/*
> -	 * It's enough to flush this one mapping.
> -	 * (PGE mappings get flushed as well)
> +	 * If it was present and changes, we need to invalidate TLBs.
> 	 */
> -	__flush_tlb_one_kernel(vaddr);
> +	if (!(pte_present(old_pte) && !pte_same(old_pte, new_pte)))
> +		return;
> +
> +	if (WARN(irqs_disabled(), "broken set_fixmap(%d, %lx), was: %lx",
> +				(int)__virt_to_fix(vaddr),
> +				pte_val(new_pte), pte_val(old_pte))) {
> +		/*
> +		 * It is _NOT_ enough to flush just the local mapping;
> +		 * it might mostly work, but there is no guarantee a remote
> +		 * CPU didn't load this entry into its TLB.
> +		 */
> +		__flush_tlb_one_kernel(vaddr);
> +	} else {
> +		flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> +	}
> }
> 
> void set_pte_vaddr_p4d(p4d_t *p4d_page, unsigned long vaddr, pte_t new_pte)
> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
> index 52ae5438edeb..e3c415bdbfbf 100644
> --- a/drivers/acpi/apei/Kconfig
> +++ b/drivers/acpi/apei/Kconfig
> @@ -19,6 +19,7 @@ config ACPI_APEI
> 
> config ACPI_APEI_GHES
> 	bool "APEI Generic Hardware Error Source"
> +	depends on BROKEN if X86
> 	depends on ACPI_APEI
> 	select ACPI_HED
> 	select IRQ_WORK

Indeed, I missed these other instances that use the fixmap.

I’ll give your patch a try once my server goes back online. I was (and still
am) worried that interrupts would be disabled when __set_pte_vaddr() is
called, which would make the fix more complicated.

In addition, there might be a couple of issues with your fix:

1. __set_pte_vaddr() is not used exclusively by set_fixmap(). This means
the warning might be wrong, but also means that these code paths (Xen’s
set_pte_mfn(), CPU-entry-area setup) need to be checked. And as you said
before, someone might use this function for other purposes as well.

2. Printing the virtual address can break KASLR.

3. The WARN() can introduce some overhead, since checking whether IRQs
are disabled takes a considerably long time. Perhaps VM_WARN() or
something similar is better. I realize this code-path is not on the
hot-path, though...

4. I guess flush_tlb_kernel_range() should also have something like
VM_WARN_ON(irqs_disabled()), just as an additional general sanity check.

Let me know if you want me to make these changes and include your patch in
the set.

Regards,
Nadav



* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06 17:01           ` Nadav Amit
@ 2018-09-06 17:17             ` Peter Zijlstra
  2018-09-06 17:58               ` Nadav Amit
  2018-09-06 17:23             ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06 17:17 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

On Thu, Sep 06, 2018 at 05:01:25PM +0000, Nadav Amit wrote:
> In addition, there might be a couple of issues with your fix:

It boots on my box ;-)

> 1. __set_pte_vaddr() is not used exclusive by set_fixmap(). This means
> the warning might be wrong, but also means that these code patches (Xen’s
> set_pte_mfn(), CPU-entry-area setup) needs to be checked. And as you said
> before, someone might use this function for other purposes as well.

CEA is fine, that actually needs it too.

The one thing I missed out on earlier is the below chunk, that is no
longer needed now that cea_set_pte() actually does the right thing.

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index b7b01d762d32..14ad97fa0749 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -293,12 +293,6 @@ static void ds_update_cea(void *cea, void *addr, size_t size, pgprot_t prot)
 	preempt_disable();
 	for (; msz < size; msz += PAGE_SIZE, pa += PAGE_SIZE, cea += PAGE_SIZE)
 		cea_set_pte(cea, pa, prot);
-
-	/*
-	 * This is a cross-CPU update of the cpu_entry_area, we must shoot down
-	 * all TLB entries for it.
-	 */
-	flush_tlb_kernel_range(start, start + size);
 	preempt_enable();
 }
 
@@ -310,8 +304,6 @@ static void ds_clear_cea(void *cea, size_t size)
 	preempt_disable();
 	for (; msz < size; msz += PAGE_SIZE, cea += PAGE_SIZE)
 		cea_set_pte(cea, 0, PAGE_NONE);
-
-	flush_tlb_kernel_range(start, start + size);
 	preempt_enable();
 }
 

> 2. Printing the virtual address can break KASLR.

Local KASLR is a myth.. but sure, we can fix the print.

> 3. The WARN() can introduce some overhead since checking if IRQs are
> disabled takes considerably long time. Perhaps VM_WARN() or something is
> better. I realize this code-path is not on the hot-path though...

Yeah, if it triggers you have bigger problems. We can make it
WARN_ONCE() I suppose.

> 4. I guess flush_tlb_kernel_range() should also have something like
> VM_WARN_ON(irqs_disabled()), just as an additional general sanity check.

It has, it's hidden in kernel/smp.c:smp_call_function_many().

> Let me know if you want me to make these changes and include your patch in
> the set.

The set is no longer needed. text_poke() is fine and correct with this
one patch.


* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06 17:01           ` Nadav Amit
  2018-09-06 17:17             ` Peter Zijlstra
@ 2018-09-06 17:23             ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06 17:23 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

On Thu, Sep 06, 2018 at 05:01:25PM +0000, Nadav Amit wrote:
> I’ll give your patch a try once my server goes back online. I was (and still
> am) worried that interrupts would be disabled when __set_pte_vaddr() is
> called, which would make the fix more complicated.

Thing is, we only need the TLB invalidate if the previous PTE was
present and the new PTE is different. If we write the 'first' PTE, all
is fine.

The code as presented WARNs if we do __set_pte_vaddr() that needs a TLB
invalidate and we have IRQs disabled. And aside from the GHES
trainwreck, the patch as given boots and runs fine on my machine.

And no, if there is a caller that has interrupts disabled and needs a TLB
invalidate, the patch is still right. It just means that caller is
terminally broken and needs fixing (like GHES).

There is no way x86 can do what needs to be done with IRQs disabled.


* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06 17:17             ` Peter Zijlstra
@ 2018-09-06 17:58               ` Nadav Amit
  2018-09-06 18:09                 ` Andy Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2018-09-06 17:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

at 10:17 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 06, 2018 at 05:01:25PM +0000, Nadav Amit wrote:
>> In addition, there might be a couple of issues with your fix:
> 
> It boots on my box ;-)
> 
>> 1. __set_pte_vaddr() is not used exclusive by set_fixmap(). This means
>> the warning might be wrong, but also means that these code patches (Xen’s
>> set_pte_mfn(), CPU-entry-area setup) needs to be checked. And as you said
>> before, someone might use this function for other purposes as well.
> 
> CEA is fine, that actually needs it too.
> 
> The one thing I missed out on earlier is the below chunk, that is no
> longer needed now that cea_set_pte() actually does the right thing.
> 
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index b7b01d762d32..14ad97fa0749 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -293,12 +293,6 @@ static void ds_update_cea(void *cea, void *addr, size_t size, pgprot_t prot)
> preempt_disable();
> for (; msz < size; msz += PAGE_SIZE, pa += PAGE_SIZE, cea += PAGE_SIZE)
> cea_set_pte(cea, pa, prot);
> -
> - /*
> -  * This is a cross-CPU update of the cpu_entry_area, we must shoot down
> -  * all TLB entries for it.
> -  */
> - flush_tlb_kernel_range(start, start + size);
> preempt_enable();
> }
> 
> @@ -310,8 +304,6 @@ static void ds_clear_cea(void *cea, size_t size)
> preempt_disable();
> for (; msz < size; msz += PAGE_SIZE, cea += PAGE_SIZE)
> cea_set_pte(cea, 0, PAGE_NONE);
> -
> - flush_tlb_kernel_range(start, start + size);
> preempt_enable();
> }
> 
> 
>> 2. Printing the virtual address can break KASLR.
> 
> Local KASLR is a myth.. but sure, we can fix the print.
> 
>> 3. The WARN() can introduce some overhead since checking if IRQs are
>> disabled takes considerably long time. Perhaps VM_WARN() or something is
>> better. I realize this code-path is not on the hot-path though...
> 
> Yeah, if it triggers you have bigger problems. We can make it
> WARN_ONCE() I suppose.
> 
>> 4. I guess flush_tlb_kernel_range() should also have something like
>> VM_WARN_ON(irqs_disabled()), just as an additional general sanity check.
> 
> It has, it's hidden in kernel/smp.c:smp_call_function_many().

Right. Thanks.

> 
>> Let me know if you want me to make these changes and include your patch in
>> the set.
> 
> The set is no longer needed. text_poke() is fine and correct with this
> one patch.

It depends on what security you want. Some may consider even the short
time-window in which the kernel code is writable from other cores
insufficient for security.

In addition, the set removes the need for remote TLB shootdowns that
text_poke() - with this fix - requires.



* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06 17:58               ` Nadav Amit
@ 2018-09-06 18:09                 ` Andy Lutomirski
  2018-09-06 18:31                   ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2018-09-06 18:09 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Dave Hansen, Jiri Kosina,
	Andy Lutomirski, Masami Hiramatsu, Kees Cook



> On Sep 6, 2018, at 10:58 AM, Nadav Amit <namit@vmware.com> wrote:
> 
> at 10:17 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>>> On Thu, Sep 06, 2018 at 05:01:25PM +0000, Nadav Amit wrote:
>>> In addition, there might be a couple of issues with your fix:
>> 
>> It boots on my box ;-)
>> 
>>> 1. __set_pte_vaddr() is not used exclusive by set_fixmap(). This means
>>> the warning might be wrong, but also means that these code patches (Xen’s
>>> set_pte_mfn(), CPU-entry-area setup) needs to be checked. And as you said
>>> before, someone might use this function for other purposes as well.
>> 
>> CEA is fine, that actually needs it too.
>> 
>> The one thing I missed out on earlier is the below chunk, that is no
>> longer needed now that cea_set_pte() actually does the right thing.
>> 
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index b7b01d762d32..14ad97fa0749 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -293,12 +293,6 @@ static void ds_update_cea(void *cea, void *addr, size_t size, pgprot_t prot)
>> preempt_disable();
>> for (; msz < size; msz += PAGE_SIZE, pa += PAGE_SIZE, cea += PAGE_SIZE)
>> cea_set_pte(cea, pa, prot);
>> -
>> - /*
>> -  * This is a cross-CPU update of the cpu_entry_area, we must shoot down
>> -  * all TLB entries for it.
>> -  */
>> - flush_tlb_kernel_range(start, start + size);
>> preempt_enable();
>> }
>> 
>> @@ -310,8 +304,6 @@ static void ds_clear_cea(void *cea, size_t size)
>> preempt_disable();
>> for (; msz < size; msz += PAGE_SIZE, cea += PAGE_SIZE)
>> cea_set_pte(cea, 0, PAGE_NONE);
>> -
>> - flush_tlb_kernel_range(start, start + size);
>> preempt_enable();
>> }
>> 
>> 
>>> 2. Printing the virtual address can break KASLR.
>> 
>> Local KASLR is a myth.. but sure, we can fix the print.
>> 
>>> 3. The WARN() can introduce some overhead since checking if IRQs are
>>> disabled takes considerably long time. Perhaps VM_WARN() or something is
>>> better. I realize this code-path is not on the hot-path though...
>> 
>> Yeah, if it triggers you have bigger problems. We can make it
>> WARN_ONCE() I suppose.
>> 
>>> 4. I guess flush_tlb_kernel_range() should also have something like
>>> VM_WARN_ON(irqs_disabled()), just as an additional general sanity check.
>> 
>> It has, it's hidden in kernel/smp.c:smp_call_function_many().
> 
> Right. Thanks.
> 
>> 
>>> Let me know if you want me to make these changes and include your patch in
>>> the set.
>> 
>> The set is no longer needed. text_poke() is fine and correct with this
>> one patch.
> 
> It depends what security you want. Some may consider even the short
> time-window in which the kernel code is writable from other cores as
> insufficient for security.
> 
> In addition, the set removes the need for remote TLB shootdowns that
> text_poke() - with this fix - requires.
> 

I’m personally in favor of not needing a global broadcast flush to install kprobes.


* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06 18:09                 ` Andy Lutomirski
@ 2018-09-06 18:31                   ` Peter Zijlstra
  2018-09-06 18:38                     ` Nadav Amit
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06 18:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Dave Hansen, Jiri Kosina,
	Andy Lutomirski, Masami Hiramatsu, Kees Cook

On Thu, Sep 06, 2018 at 11:09:23AM -0700, Andy Lutomirski wrote:
> > On Sep 6, 2018, at 10:58 AM, Nadav Amit <namit@vmware.com> wrote:
> > It depends on what security you want. Some may consider even the short
> > time-window in which the kernel code is writable from other cores as
> > insufficient for security.
> > 
> > In addition, the set removes the need for remote TLB shootdowns that
> > text_poke() - with this fix - requires.
> > 
> 
> I’m personally in favor of not needing a global broadcast flush to install kprobes.

That's fine. But at that point it's an optimization, not a correctness
issue.


* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06 18:31                   ` Peter Zijlstra
@ 2018-09-06 18:38                     ` Nadav Amit
  2018-09-06 19:19                       ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2018-09-06 18:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski,
	Masami Hiramatsu, Kees Cook

at 11:31 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 06, 2018 at 11:09:23AM -0700, Andy Lutomirski wrote:
>>> On Sep 6, 2018, at 10:58 AM, Nadav Amit <namit@vmware.com> wrote:
>>> It depends on what security you want. Some may consider even the short
>>> time-window in which the kernel code is writable from other cores as
>>> insufficient for security.
>>> 
>>> In addition, the set removes the need for remote TLB shootdowns that
>>> text_poke() - with this fix - requires.
>> 
>> I’m personally in favor of not needing a global broadcast flush to install kprobes.
> 
> That's fine. But at that point it's an optimization, not a correctness
> issue.

Note that patch 1/6 is still needed to fix false lockdep splats caused by a
recent patch.



* Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes
  2018-09-06 18:38                     ` Nadav Amit
@ 2018-09-06 19:19                       ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06 19:19 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, Ingo Molnar, X86 ML,
	Arnd Bergmann, linux-arch, Dave Hansen, Jiri Kosina,
	Andy Lutomirski, Masami Hiramatsu, Kees Cook

On Thu, Sep 06, 2018 at 06:38:30PM +0000, Nadav Amit wrote:
> Note that patch 1/6 is still needed to fix false lockdep splats caused by a
> recent patch.

For some reason I do not appear to have 1/6 in my inbox. Let me dig
through lkml.


* Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  2018-09-02 17:32 ` [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()" Nadav Amit
@ 2018-09-06 19:40   ` Peter Zijlstra
  2018-09-06 19:42     ` Nadav Amit
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06 19:40 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, x86, Arnd Bergmann,
	linux-arch, Dave Hansen, Nadav Amit, Jiri Kosina,
	Andy Lutomirski, Kees Cook, Dave Hansen

On Sun, Sep 02, 2018 at 10:32:19AM -0700, Nadav Amit wrote:
> text_mutex is expected to be held before text_poke() is called, but we
> cannot add a lockdep assertion since kgdb does not take it, and instead
> *supposedly* ensures the lock is not taken and will not be acquired by
> any other core while text_poke() is running.
> 
> The reason for the "supposedly" comment is that it is not entirely clear
> that this would be the case if kgdb_do_roundup is zero.

Argh, that's pretty shit code...

Not only is that text_mutex abuse ugly, so too is the fixmap usage from
IRQ context. I suppose this really does require your alternative mm
patches for text_poke().


* Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  2018-09-06 19:40   ` Peter Zijlstra
@ 2018-09-06 19:42     ` Nadav Amit
  2018-09-06 19:53       ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2018-09-06 19:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski, Kees Cook,
	Dave Hansen

at 12:40 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Sep 02, 2018 at 10:32:19AM -0700, Nadav Amit wrote:
>> text_mutex is expected to be held before text_poke() is called, but we
>> cannot add a lockdep assertion since kgdb does not take it, and instead
>> *supposedly* ensures the lock is not taken and will not be acquired by
>> any other core while text_poke() is running.
>> 
>> The reason for the "supposedly" comment is that it is not entirely clear
>> that this would be the case if kgdb_do_roundup is zero.
> 
> Argh, that's pretty shit code...
> 
> Not only is that text_mutex abuse ugly, so too is the fixmap usage from
> IRQ context. I suppose this really does require your alternative mm
> patches for text_poke().

Right, I forgot about that…



* Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  2018-09-06 19:42     ` Nadav Amit
@ 2018-09-06 19:53       ` Peter Zijlstra
  2018-09-06 19:58         ` Nadav Amit
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06 19:53 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski, Kees Cook,
	Dave Hansen

On Thu, Sep 06, 2018 at 07:42:14PM +0000, Nadav Amit wrote:
> at 12:40 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Sun, Sep 02, 2018 at 10:32:19AM -0700, Nadav Amit wrote:
> >> text_mutex is expected to be held before text_poke() is called, but we
> >> cannot add a lockdep assertion since kgdb does not take it, and instead
> >> *supposedly* ensures the lock is not taken and will not be acquired by
> >> any other core while text_poke() is running.
> >> 
> >> The reason for the "supposedly" comment is that it is not entirely clear
> >> that this would be the case if kgdb_do_roundup is zero.
> > 
> > Argh, that's pretty shit code...
> > 
> > Not only is that text_mutex abuse ugly, so too is the fixmap usage from
> > IRQ context. I suppose this really does require your alternative mm
> > patches for text_poke().
> 
> Right, I forgot about that…

With that CR3 trickery, we can rid ourselves of the text_mutex
requirement, since concurrent text_poke is 'safe'. That would clean up
the kgdb code quite a bit.


* Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  2018-09-06 19:53       ` Peter Zijlstra
@ 2018-09-06 19:58         ` Nadav Amit
  2018-09-06 20:25           ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2018-09-06 19:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski, Kees Cook,
	Dave Hansen

at 12:53 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 06, 2018 at 07:42:14PM +0000, Nadav Amit wrote:
>> at 12:40 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>>> On Sun, Sep 02, 2018 at 10:32:19AM -0700, Nadav Amit wrote:
>>>> text_mutex is expected to be held before text_poke() is called, but we
>>>> cannot add a lockdep assertion since kgdb does not take it, and instead
>>>> *supposedly* ensures the lock is not taken and will not be acquired by
>>>> any other core while text_poke() is running.
>>>> 
>>>> The reason for the "supposedly" comment is that it is not entirely clear
>>>> that this would be the case if kgdb_do_roundup is zero.
>>> 
>>> Argh, that's pretty shit code...
>>> 
>>> Not only is that text_mutex abuse ugly, so too is the fixmap usage from
>>> IRQ context. I suppose this really does require your alternative mm
>>> patches for text_poke().
>> 
>> Right, I forgot about that…
> 
> With that CR3 trickery, we can rid ourselves of the text_mutex
> requirement, since concurrent text_poke is 'safe'. That would clean up
> the kgdb code quite a bit.

I don’t know. I’m somewhat worried about multiple mechanisms potentially
changing the same code at the same time - and maybe ending up with a
mess.



* Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  2018-09-06 19:58         ` Nadav Amit
@ 2018-09-06 20:25           ` Peter Zijlstra
  2018-09-06 20:57             ` Nadav Amit
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06 20:25 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski, Kees Cook,
	Dave Hansen

On Thu, Sep 06, 2018 at 07:58:40PM +0000, Nadav Amit wrote:
> > With that CR3 trickery, we can rid ourselves of the text_mutex
> > requirement, since concurrent text_poke is 'safe'. That would clean up
> > the kgdb code quite a bit.
> 
> I don’t know. I’m somewhat worried with multiple mechanisms potentially
> changing the same code at the same time - and maybe ending up with some
> mess.

kgdb only pokes INT3, that should be pretty safe.


* Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  2018-09-06 20:25           ` Peter Zijlstra
@ 2018-09-06 20:57             ` Nadav Amit
  2018-09-06 21:41               ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2018-09-06 20:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski, Kees Cook,
	Dave Hansen

at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 06, 2018 at 07:58:40PM +0000, Nadav Amit wrote:
>>> With that CR3 trickery, we can rid ourselves of the text_mutex
>>> requirement, since concurrent text_poke is 'safe'. That would clean up
>>> the kgdb code quite a bit.
>> 
>> I don’t know. I’m somewhat worried about multiple mechanisms potentially
>> changing the same code at the same time - and maybe ending up with a
>> mess.
> 
> kgdb only pokes INT3, that should be pretty safe.

Maybe I misunderstand your point. If you want me to get rid of text_mutex
completely, I am afraid it could cause a mess, with kprobes and the
static-keys mechanism changing the same piece of code concurrently.

I doubt it would work today without failing, but getting rid of text_mutex
is likely to make it even worse.



* Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  2018-09-06 20:57             ` Nadav Amit
@ 2018-09-06 21:41               ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2018-09-06 21:41 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Jiri Kosina, Andy Lutomirski, Kees Cook,
	Dave Hansen

On Thu, Sep 06, 2018 at 08:57:38PM +0000, Nadav Amit wrote:
> at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Sep 06, 2018 at 07:58:40PM +0000, Nadav Amit wrote:
> >>> With that CR3 trickery, we can rid ourselves of the text_mutex
> >>> requirement, since concurrent text_poke is 'safe'. That would clean up
> >>> the kgdb code quite a bit.
> >> 
> >> I don’t know. I’m somewhat worried about multiple mechanisms potentially
> >> changing the same code at the same time - and maybe ending up with a
> >> mess.
> > 
> > kgdb only pokes INT3, that should be pretty safe.
> 
> Maybe I misunderstand your point. If you want me to get rid of text_mutex
> completely,

No, just the ugly things kgdb does with it.


* Re: [PATCH v2 4/6] x86/alternatives: initializing temporary mm for patching
  2018-09-06  9:01   ` Peter Zijlstra
@ 2018-09-07 20:52     ` Nadav Amit
  0 siblings, 0 replies; 31+ messages in thread
From: Nadav Amit @ 2018-09-07 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Ingo Molnar, X86 ML, Arnd Bergmann,
	linux-arch, Dave Hansen, Kees Cook, Dave Hansen

at 2:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Sep 02, 2018 at 10:32:22AM -0700, Nadav Amit wrote:
>> +void __init poking_init(void)
>> +{
>> +	unsigned long poking_addr;
>> +
>> +	poking_mm = copy_init_mm();
>> +	if (!poking_mm) {
>> +		pr_err("x86/mm: error setting a separate poking address space");
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Randomize the poking address, but make sure that the following page
>> +	 * will be mapped at the same PMD. We need 2 pages, so find space for 3,
>> +	 * and adjust the address if the PMD ends after the first one.
>> +	 */
>> +	poking_addr = TASK_UNMAPPED_BASE +
>> +		(kaslr_get_random_long("Poking") & PAGE_MASK) %
>> +		(TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
>> +
>> +	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
>> +		poking_addr += PAGE_SIZE;
>> +}
> 
> This fails to compile for me.. I don't have kaslr_get_random_long().

Will be fixed in v3. Thanks.
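[Editor's note: the PMD-straddle adjustment in the quoted poking_init() can be illustrated outside the kernel. The sketch below assumes x86-64 defaults (4 KiB pages, 2 MiB PMDs); the helper name is illustrative and not part of the patch.]

```python
PAGE_SIZE = 1 << 12   # 4 KiB page (x86-64 default)
PMD_SIZE = 1 << 21    # one PMD entry covers 2 MiB
PMD_MASK = ~(PMD_SIZE - 1)

def adjust_poking_addr(poking_addr):
    # Mirrors the check in the quoted poking_init(): if the *second*
    # poking page would start exactly at a PMD boundary, the two pages
    # straddle two PMDs. Shifting the address by one page keeps both
    # pages under a single PMD - which is why the randomization above
    # reserves space for 3 pages while only 2 are used.
    if ((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0:
        poking_addr += PAGE_SIZE
    return poking_addr

# After adjustment, both poking pages share one PMD:
addr = adjust_poking_addr(PMD_SIZE - PAGE_SIZE)  # worst case: straddling input
assert (addr & PMD_MASK) == ((addr + PAGE_SIZE) & PMD_MASK)
```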



Thread overview: 31+ messages
2018-09-02 17:32 [PATCH v2 0/6] x86/alternatives: text_poke() fixes Nadav Amit
2018-09-02 17:32 ` [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()" Nadav Amit
2018-09-06 19:40   ` Peter Zijlstra
2018-09-06 19:42     ` Nadav Amit
2018-09-06 19:53       ` Peter Zijlstra
2018-09-06 19:58         ` Nadav Amit
2018-09-06 20:25           ` Peter Zijlstra
2018-09-06 20:57             ` Nadav Amit
2018-09-06 21:41               ` Peter Zijlstra
2018-09-02 17:32 ` [PATCH v2 2/6] x86/mm: temporary mm struct Nadav Amit
2018-09-02 17:32 ` [PATCH v2 3/6] fork: provide a function for copying init_mm Nadav Amit
2018-09-02 17:32 ` [PATCH v2 4/6] x86/alternatives: initializing temporary mm for patching Nadav Amit
2018-09-06  9:01   ` Peter Zijlstra
2018-09-07 20:52     ` Nadav Amit
2018-09-02 17:32 ` [PATCH v2 5/6] x86/alternatives: use temporary mm for text poking Nadav Amit
2018-09-02 17:32 ` [PATCH v2 6/6] x86/alternatives: remove text_poke() return value Nadav Amit
2018-09-05 18:56 ` [PATCH v2 0/6] x86/alternatives: text_poke() fixes Peter Zijlstra
2018-09-05 19:02   ` Nadav Amit
2018-09-05 19:10     ` Nadav Amit
2018-09-06  8:13       ` Peter Zijlstra
2018-09-06  8:42         ` Peter Zijlstra
2018-09-06  9:18         ` Peter Zijlstra
2018-09-06 10:16         ` Peter Zijlstra
2018-09-06 17:01           ` Nadav Amit
2018-09-06 17:17             ` Peter Zijlstra
2018-09-06 17:58               ` Nadav Amit
2018-09-06 18:09                 ` Andy Lutomirski
2018-09-06 18:31                   ` Peter Zijlstra
2018-09-06 18:38                     ` Nadav Amit
2018-09-06 19:19                       ` Peter Zijlstra
2018-09-06 17:23             ` Peter Zijlstra
