kernel-hardening.lists.openwall.com archive mirror
* [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns
@ 2019-01-29  0:34 Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 01/20] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()" Rick Edgecombe
                   ` (19 more replies)
  0 siblings, 20 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Rick Edgecombe

This patchset addresses several overlapping issues around stale TLB
entries and W^X violations. It combines a slightly tweaked
"x86/alternative: text_poke() enhancements v7" [1] with the next version
of the "Don’t leave executable TLB entries to freed pages v2" [2]
patchset; the two previously conflicted with each other.

The related issues that this series fixes:
1. Fixmap PTEs that are used for patching are accessible from other
   cores and might be exploited. They are not even flushed from the TLB
   on remote cores, so the risk is even higher. Address this issue by
   introducing a temporary mm that is only used during patching (see
   the sketch after this list). Unfortunately, due to init ordering,
   the fixmap is still used during boot-time patching. Future patches
   can eliminate the need for it.
2. Missing lockdep assertion to ensure text_mutex is taken. It is
   actually not always taken, so fix the instances that were found not
   to take the lock (although they should be safe even without taking
   the lock).
3. module_alloc() returning memory that is RWX until a module has
   finished loading.
4. Sometimes when memory is freed via the module subsystem, a TLB entry
   with executable permissions can remain pointing to a freed page. If
   the page is re-used to back an address that will receive data from
   userspace, user data can end up mapped as executable in the kernel.
   The root cause is that vfree() lazily flushes the TLB but does not
   lazily free the underlying pages.
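
For reference, here is a minimal sketch of the temporary-mm patching
flow that patches 3-6 of this series introduce (simplified from the real
__text_poke(); the helper name is hypothetical, and error handling,
cross-page writes and KASAN handling are omitted):

	/* Patch one page of core kernel text through the temporary mm. */
	static void poke_via_temporary_mm(void *addr, const void *opcode,
					  size_t len)
	{
		temporary_mm_state_t prev;
		unsigned long flags;
		spinlock_t *ptl;
		pte_t *ptep;

		local_irq_save(flags);
		ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
		set_pte_at(poking_mm, poking_addr, ptep,
			   mk_pte(virt_to_page(addr), PAGE_KERNEL));

		/* Map and write on this core only; others never see the PTE. */
		prev = use_temporary_mm(poking_mm);
		memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
		unuse_temporary_mm(prev);

		pte_clear(poking_mm, poking_addr, ptep);
		/* poking_mm is no longer loaded anywhere, so no IPIs are needed. */
		flush_tlb_mm_range(poking_mm, poking_addr,
				   poking_addr + PAGE_SIZE, PAGE_SHIFT, false);
		pte_unmap_unlock(ptep, ptl);
		local_irq_restore(flags);
	}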

Changes for v2:
 - Add "Reviewed-by" tag [Masami]
 - Comment instead of code to warn against module removal while patching [Masami]
 - Avoid open-coded TLB flush [Andy]
 - Remove "This patch" [Borislav Petkov]
 - Do not set the global bit during text poking [Andy, hpa]
 - Add Ack from Pavel Machek
 - Split patch 16 "Plug in new special vfree flag" into 4 patches (16-19)
   to make it easier to review. There were no code changes.

The changes from "Don’t leave executable TLB entries to freed pages
v2" to v1 of this merged series:
 - Add support for the case of hibernate trying to save an unmapped page
   on the direct map (Ard Biesheuvel)
 - No weak arch breakout for vfree-ing special memory (Andy Lutomirski)
 - Avoid changing the deferred free code by moving module init free to a
   work queue (Andy Lutomirski)
 - Plug in the new flag for kprobes and ftrace
 - More arch-generic names for the set_pages functions (Ard Biesheuvel)
 - Fix for the TLB not always being flushed for the direct map (Nadav Amit)
 
Changes from "x86/alternative: text_poke() enhancements v7" to v1
 - Fix build failure on CONFIG_RANDOMIZE_BASE=n (Rick)
 - Remove text_poke usage from ftrace (Nadav)
 
[1] https://lkml.org/lkml/2018/12/5/200
[2] https://lkml.org/lkml/2018/12/11/1571

Andy Lutomirski (1):
  x86/mm: temporary mm struct

Nadav Amit (12):
  Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  x86/jump_label: Use text_poke_early() during early init
  fork: provide a function for copying init_mm
  x86/alternative: initializing temporary mm for patching
  x86/alternative: use temporary mm for text poking
  x86/kgdb: avoid redundant comparison of patched code
  x86/ftrace: set trampoline pages as executable
  x86/kprobes: instruction pages initialization enhancements
  x86: avoid W^X being broken during modules loading
  x86/jump-label: remove support for custom poker
  x86/alternative: Remove the return value of text_poke_*()
  x86/alternative: comment about module removal races

Rick Edgecombe (7):
  Add set_alias_ function and x86 implementation
  mm: Make hibernate handle unmapped pages
  vmalloc: New flags for safe vfree on special perms
  modules: Use vmalloc special flag
  bpf: Use vmalloc special flag
  x86/ftrace: Use vmalloc special flag
  x86/kprobes: Use vmalloc special flag

 arch/Kconfig                         |   4 +
 arch/x86/Kconfig                     |   1 +
 arch/x86/include/asm/fixmap.h        |   2 -
 arch/x86/include/asm/mmu_context.h   |  32 +++++
 arch/x86/include/asm/pgtable.h       |   3 +
 arch/x86/include/asm/set_memory.h    |   3 +
 arch/x86/include/asm/text-patching.h |   7 +-
 arch/x86/kernel/alternative.c        | 199 ++++++++++++++++++++-------
 arch/x86/kernel/ftrace.c             |  14 +-
 arch/x86/kernel/jump_label.c         |  19 ++-
 arch/x86/kernel/kgdb.c               |  25 +---
 arch/x86/kernel/kprobes/core.c       |  19 ++-
 arch/x86/kernel/module.c             |   2 +-
 arch/x86/mm/init_64.c                |  36 +++++
 arch/x86/mm/pageattr.c               |  16 ++-
 arch/x86/xen/mmu_pv.c                |   2 -
 include/linux/filter.h               |  18 +--
 include/linux/mm.h                   |  18 +--
 include/linux/sched/task.h           |   1 +
 include/linux/set_memory.h           |  10 ++
 include/linux/vmalloc.h              |  13 ++
 init/main.c                          |   3 +
 kernel/bpf/core.c                    |   1 -
 kernel/fork.c                        |  24 +++-
 kernel/module.c                      |  82 ++++++-----
 mm/page_alloc.c                      |   7 +-
 mm/vmalloc.c                         | 122 +++++++++++++---
 27 files changed, 494 insertions(+), 189 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 01/20] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 02/20] x86/jump_label: Use text_poke_early() during early init Rick Edgecombe
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Masami Hiramatsu,
	Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

text_mutex is currently expected to be held before text_poke() is
called, but kgdb does not take the mutex, and instead *supposedly*
ensures the lock is not taken and will not be acquired by any other core
while text_poke() is running.

The reason for the "supposedly" comment is that it is not entirely clear
that this would be the case if kgdb_do_roundup is zero.

Create two wrapper functions, text_poke() and text_poke_kgdb(), which do
and do not run the lockdep assertion, respectively.

While we are at it, change the return code of text_poke() to something
meaningful. One day, callers might actually respect it and the existing
BUG_ON() when patching fails could be removed. For kgdb, the return
value can actually be used.
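
For illustration, a sketch of how the two wrappers are meant to be used
after this patch (caller context only; not taken verbatim from the
patch):

	/* Regular patching paths: the caller must hold text_mutex. */
	mutex_lock(&text_mutex);
	text_poke(addr, opcode, len);
	mutex_unlock(&text_mutex);

	/*
	 * kgdb: all other cores are stopped, so the mutex cannot be taken
	 * concurrently; bail out if it is already held (see kgdb.c below).
	 */
	if (mutex_is_locked(&text_mutex))
		return -EBUSY;
	text_poke_kgdb(addr, opcode, len);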

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 9222f606506c ("x86/alternatives: Lockdep-enforce text_mutex in text_poke*()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c        | 52 ++++++++++++++++++++--------
 arch/x86/kernel/kgdb.c               | 11 +++---
 3 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..f8fc8e86cf01 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -35,6 +35,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
  * inconsistent instruction while you patch.
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
+extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d458c7973c56..12fddbc8c55b 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -678,18 +678,7 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 	return addr;
 }
 
-/**
- * text_poke - Update instructions on a live kernel
- * @addr: address to modify
- * @opcode: source of the copy
- * @len: length to copy
- *
- * Only atomic text poke/set should be allowed when not doing early patching.
- * It means the size must be writable atomically and the address must be aligned
- * in a way that permits an atomic write. It also makes sure we fit on a single
- * page.
- */
-void *text_poke(void *addr, const void *opcode, size_t len)
+static void *__text_poke(void *addr, const void *opcode, size_t len)
 {
 	unsigned long flags;
 	char *vaddr;
@@ -702,8 +691,6 @@ void *text_poke(void *addr, const void *opcode, size_t len)
 	 */
 	BUG_ON(!after_bootmem);
 
-	lockdep_assert_held(&text_mutex);
-
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
 		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
@@ -732,6 +719,43 @@ void *text_poke(void *addr, const void *opcode, size_t len)
 	return addr;
 }
 
+/**
+ * text_poke - Update instructions on a live kernel
+ * @addr: address to modify
+ * @opcode: source of the copy
+ * @len: length to copy
+ *
+ * Only atomic text poke/set should be allowed when not doing early patching.
+ * It means the size must be writable atomically and the address must be aligned
+ * in a way that permits an atomic write. It also makes sure we fit on a single
+ * page.
+ */
+void *text_poke(void *addr, const void *opcode, size_t len)
+{
+	lockdep_assert_held(&text_mutex);
+
+	return __text_poke(addr, opcode, len);
+}
+
+/**
+ * text_poke_kgdb - Update instructions on a live kernel by kgdb
+ * @addr: address to modify
+ * @opcode: source of the copy
+ * @len: length to copy
+ *
+ * Only atomic text poke/set should be allowed when not doing early patching.
+ * It means the size must be writable atomically and the address must be aligned
+ * in a way that permits an atomic write. It also makes sure we fit on a single
+ * page.
+ *
+ * Context: should only be used by kgdb, which ensures no other core is running,
+ *	    despite the fact it does not hold the text_mutex.
+ */
+void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
+{
+	return __text_poke(addr, opcode, len);
+}
+
 static void do_sync_core(void *info)
 {
 	sync_core();
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 5db08425063e..1461544cba8b 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -758,13 +758,13 @@ int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
 	if (!err)
 		return err;
 	/*
-	 * It is safe to call text_poke() because normal kernel execution
+	 * It is safe to call text_poke_kgdb() because normal kernel execution
 	 * is stopped on all cores, so long as the text_mutex is not locked.
 	 */
 	if (mutex_is_locked(&text_mutex))
 		return -EBUSY;
-	text_poke((void *)bpt->bpt_addr, arch_kgdb_ops.gdb_bpt_instr,
-		  BREAK_INSTR_SIZE);
+	text_poke_kgdb((void *)bpt->bpt_addr, arch_kgdb_ops.gdb_bpt_instr,
+		       BREAK_INSTR_SIZE);
 	err = probe_kernel_read(opc, (char *)bpt->bpt_addr, BREAK_INSTR_SIZE);
 	if (err)
 		return err;
@@ -783,12 +783,13 @@ int kgdb_arch_remove_breakpoint(struct kgdb_bkpt *bpt)
 	if (bpt->type != BP_POKE_BREAKPOINT)
 		goto knl_write;
 	/*
-	 * It is safe to call text_poke() because normal kernel execution
+	 * It is safe to call text_poke_kgdb() because normal kernel execution
 	 * is stopped on all cores, so long as the text_mutex is not locked.
 	 */
 	if (mutex_is_locked(&text_mutex))
 		goto knl_write;
-	text_poke((void *)bpt->bpt_addr, bpt->saved_instr, BREAK_INSTR_SIZE);
+	text_poke_kgdb((void *)bpt->bpt_addr, bpt->saved_instr,
+		       BREAK_INSTR_SIZE);
 	err = probe_kernel_read(opc, (char *)bpt->bpt_addr, BREAK_INSTR_SIZE);
 	if (err || memcmp(opc, bpt->saved_instr, BREAK_INSTR_SIZE))
 		goto knl_write;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 02/20] x86/jump_label: Use text_poke_early() during early init
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 01/20] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()" Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 03/20] x86/mm: temporary mm struct Rick Edgecombe
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Masami Hiramatsu,
	Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

There is no apparent reason not to use text_poke_early() while we are
still in early init, as long as we do not patch code that might be on
the stack (i.e., code that we will return to in the middle of the
patched sequence). This appears to be the case for jump labels, so do
so.

This is required for the next patches, which set up a temporary mm for
patching; that mm is initialized only after some static keys have
already been enabled/disabled.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/jump_label.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index f99bd26bd3f1..e36cfec0f35e 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -50,7 +50,12 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 	jmp.offset = jump_entry_target(entry) -
 		     (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
 
-	if (early_boot_irqs_disabled)
+	/*
+	 * As long as we're UP and not yet marked RO, we can use
+	 * text_poke_early; SYSTEM_BOOTING guarantees both, as we switch to
+	 * SYSTEM_SCHEDULING before going either.
+	 */
+	if (system_state == SYSTEM_BOOTING)
 		poker = text_poke_early;
 
 	if (type == JUMP_LABEL_JMP) {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 03/20] x86/mm: temporary mm struct
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 01/20] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()" Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 02/20] x86/jump_label: Use text_poke_early() during early init Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-01-31 11:29   ` Borislav Petkov
  2019-01-29  0:34 ` [PATCH v2 04/20] fork: provide a function for copying init_mm Rick Edgecombe
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Kees Cook, Dave Hansen, Nadav Amit, Rick Edgecombe

From: Andy Lutomirski <luto@kernel.org>

Sometimes we want to set temporary page-table entries (PTEs) on one of
the cores, without allowing other cores to use - even speculatively -
these mappings. There are two benefits to doing so:

(1) Security: if sensitive PTEs are set, the temporary mm prevents their
use on other cores. This hardens security, as it prevents exploiting a
dangling pointer to overwrite sensitive data using the sensitive PTE.

(2) Avoiding TLB shootdowns: the PTEs do not need to be flushed from
remote page tables.

To do so, a temporary mm_struct can be used. Mappings that are private
to this mm can be set in the userspace part of the address space.
Interrupts must be disabled during the whole time in which the temporary
mm is loaded.

The first use case for temporary PTEs, which will follow, is poking the
kernel text.
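
A minimal usage sketch of the helpers added below (illustrative only;
my_mm stands for whatever mm holds the private mappings):

	temporary_mm_state_t prev;
	unsigned long flags;

	local_irq_save(flags);		/* the helpers assert IRQs are off */
	prev = use_temporary_mm(my_mm);

	/* ... use mappings that exist only in my_mm ... */

	unuse_temporary_mm(prev);
	local_irq_restore(flags);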

[ Commit message was written by Nadav ]

Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/mmu_context.h | 32 ++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 19d18fae6ec6..cd0c29e494a6 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -356,4 +356,36 @@ static inline unsigned long __get_current_cr3_fast(void)
 	return cr3;
 }
 
+typedef struct {
+	struct mm_struct *prev;
+} temporary_mm_state_t;
+
+/*
+ * Using a temporary mm allows to set temporary mappings that are not accessible
+ * by other cores. Such mappings are needed to perform sensitive memory writes
+ * that override the kernel memory protections (e.g., W^X), without exposing the
+ * temporary page-table mappings that are required for these write operations to
+ * other cores.
+ *
+ * Context: The temporary mm needs to be used exclusively by a single core. To
+ *          harden security IRQs must be disabled while the temporary mm is
+ *          loaded, thereby preventing interrupt handler bugs from override the
+ *          kernel memory protection.
+ */
+static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
+{
+	temporary_mm_state_t state;
+
+	lockdep_assert_irqs_disabled();
+	state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+	switch_mm_irqs_off(NULL, mm, current);
+	return state;
+}
+
+static inline void unuse_temporary_mm(temporary_mm_state_t prev)
+{
+	lockdep_assert_irqs_disabled();
+	switch_mm_irqs_off(NULL, prev.prev, current);
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 04/20] fork: provide a function for copying init_mm
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (2 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 03/20] x86/mm: temporary mm struct Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-02-05  8:53   ` Borislav Petkov
  2019-01-29  0:34 ` [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching Rick Edgecombe
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

Provide a function for copying init_mm. This function will later be
used for setting up a temporary mm.
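
A minimal usage sketch, as done later in this series in poking_init():

	struct mm_struct *poking_mm;

	poking_mm = copy_init_mm();
	BUG_ON(!poking_mm);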

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 24 ++++++++++++++++++------
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 44c6f15800ff..c5a00a7b3beb 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -76,6 +76,7 @@ extern void exit_itimers(struct signal_struct *);
 extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
+struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
 extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index b69248e6f0e0..d7b156c49f29 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1299,13 +1299,20 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 		complete_vfork_done(tsk);
 }
 
-/*
- * Allocate a new mm structure and copy contents from the
- * mm structure of the passed in task structure.
+/**
+ * dup_mm() - duplicates an existing mm structure
+ * @tsk: the task_struct with which the new mm will be associated.
+ * @oldmm: the mm to duplicate.
+ *
+ * Allocates a new mm structure and copy contents from the provided
+ * @oldmm structure.
+ *
+ * Return: the duplicated mm or NULL on failure.
  */
-static struct mm_struct *dup_mm(struct task_struct *tsk)
+static struct mm_struct *dup_mm(struct task_struct *tsk,
+				struct mm_struct *oldmm)
 {
-	struct mm_struct *mm, *oldmm = current->mm;
+	struct mm_struct *mm;
 	int err;
 
 	mm = allocate_mm();
@@ -1372,7 +1379,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	}
 
 	retval = -ENOMEM;
-	mm = dup_mm(tsk);
+	mm = dup_mm(tsk, current->mm);
 	if (!mm)
 		goto fail_nomem;
 
@@ -2187,6 +2194,11 @@ struct task_struct *fork_idle(int cpu)
 	return task;
 }
 
+struct mm_struct *copy_init_mm(void)
+{
+	return dup_mm(NULL, &init_mm);
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (3 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 04/20] fork: provide a function for copying init_mm Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-02-05  9:18   ` Borislav Petkov
  2019-02-11  0:39   ` Nadav Amit
  2019-01-29  0:34 ` [PATCH v2 06/20] x86/alternative: use temporary mm for text poking Rick Edgecombe
                   ` (14 subsequent siblings)
  19 siblings, 2 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

To prevent improper use of the PTEs that are used for text patching, we
want to use a temporary mm struct. We initialize it by copying the init
mm.

The address that will be used for patching is taken from the lower area
that is usually used for task memory. Doing so prevents the need to
frequently synchronize the temporary mm (e.g., when BPF programs are
installed), since different PGDs are used for task memory.

Finally, we randomize the address of the PTEs to harden against exploits
that use these PTEs.
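
To illustrate the PMD-boundary adjustment done in poking_init() below
(assuming 4KB pages and 2MB PMD coverage; the address is made up): if
the randomized address comes out as 0x7f555ffff000, the second poking
page would start at 0x7f5560000000, i.e. in a different PMD. The check

	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
		poking_addr += PAGE_SIZE;

then moves poking_addr forward by one page so that both poking pages are
covered by the same PMD; reserving room for 3 pages in the modulo keeps
the adjusted address in range.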

Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/pgtable.h       |  3 +++
 arch/x86/include/asm/text-patching.h |  2 ++
 arch/x86/kernel/alternative.c        |  3 +++
 arch/x86/mm/init_64.c                | 36 ++++++++++++++++++++++++++++
 init/main.c                          |  3 +++
 5 files changed, 47 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 40616e805292..e8f630d9a2ed 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1021,6 +1021,9 @@ static inline void __meminit init_trampoline_default(void)
 	/* Default trampoline pgd value */
 	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
+
+void __init poking_init(void);
+
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
 # else
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index f8fc8e86cf01..a75eed841eed 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,5 +39,7 @@ extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
+extern __ro_after_init struct mm_struct *poking_mm;
+extern __ro_after_init unsigned long poking_addr;
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 12fddbc8c55b..ae05fbb50171 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -678,6 +678,9 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 	return addr;
 }
 
+__ro_after_init struct mm_struct *poking_mm;
+__ro_after_init unsigned long poking_addr;
+
 static void *__text_poke(void *addr, const void *opcode, size_t len)
 {
 	unsigned long flags;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bccff68e3267..125c8c48aa24 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -53,6 +53,7 @@
 #include <asm/init.h>
 #include <asm/uv/uv.h>
 #include <asm/setup.h>
+#include <asm/text-patching.h>
 
 #include "mm_internal.h"
 
@@ -1383,6 +1384,41 @@ unsigned long memory_block_size_bytes(void)
 	return memory_block_size_probed;
 }
 
+/*
+ * Initialize an mm_struct to be used during poking and a pointer to be used
+ * during patching.
+ */
+void __init poking_init(void)
+{
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	poking_mm = copy_init_mm();
+	BUG_ON(!poking_mm);
+
+	/*
+	 * Randomize the poking address, but make sure that the following page
+	 * will be mapped at the same PMD. We need 2 pages, so find space for 3,
+	 * and adjust the address if the PMD ends after the first one.
+	 */
+	poking_addr = TASK_UNMAPPED_BASE;
+	if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
+		poking_addr += (kaslr_get_random_long("Poking") & PAGE_MASK) %
+			(TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
+
+	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
+		poking_addr += PAGE_SIZE;
+
+	/*
+	 * We need to trigger the allocation of the page-tables that will be
+	 * needed for poking now. Later, poking may be performed in an atomic
+	 * section, which might cause allocation to fail.
+	 */
+	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
+	BUG_ON(!ptep);
+	pte_unmap_unlock(ptep, ptl);
+}
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
diff --git a/init/main.c b/init/main.c
index e2e80ca3165a..f5947ba53bb4 100644
--- a/init/main.c
+++ b/init/main.c
@@ -496,6 +496,8 @@ void __init __weak thread_stack_cache_init(void)
 
 void __init __weak mem_encrypt_init(void) { }
 
+void __init __weak poking_init(void) { }
+
 bool initcall_debug;
 core_param(initcall_debug, initcall_debug, bool, 0644);
 
@@ -730,6 +732,7 @@ asmlinkage __visible void __init start_kernel(void)
 	taskstats_init_early();
 	delayacct_init();
 
+	poking_init();
 	check_bugs();
 
 	acpi_subsystem_init();
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 06/20] x86/alternative: use temporary mm for text poking
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (4 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-02-05  9:58   ` Borislav Petkov
  2019-01-29  0:34 ` [PATCH v2 07/20] x86/kgdb: avoid redundant comparison of patched code Rick Edgecombe
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Masami Hiramatsu,
	Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

text_poke() can potentially compromise security, as it sets temporary
PTEs in the fixmap. These PTEs might be used to rewrite the kernel code
from other cores, accidentally or maliciously, if an attacker gains the
ability to write to kernel memory.

Moreover, since remote TLBs are not flushed after the temporary PTEs are
removed, the time window in which the code is writable is not limited if
the fixmap PTEs - maliciously or accidentally - are cached in the TLB.
To address these potential security hazards, we use a temporary mm for
patching the code.

Finally, text_poke() is also not conservative enough when mapping pages,
as it always tries to map 2 pages, even when a single one is sufficient.
So try to be more conservative, and do not map more than needed.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/fixmap.h |   2 -
 arch/x86/kernel/alternative.c | 106 +++++++++++++++++++++++++++-------
 arch/x86/xen/mmu_pv.c         |   2 -
 3 files changed, 84 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 50ba74a34a37..9da8cccdf3fb 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -103,8 +103,6 @@ enum fixed_addresses {
 #ifdef CONFIG_PARAVIRT
 	FIX_PARAVIRT_BOOTMAP,
 #endif
-	FIX_TEXT_POKE1,	/* reserve 2 pages for text_poke() */
-	FIX_TEXT_POKE0, /* first page is last, because allocation is backward */
 #ifdef	CONFIG_X86_INTEL_MID
 	FIX_LNW_VRTC,
 #endif
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ae05fbb50171..76d482a2b716 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -11,6 +11,7 @@
 #include <linux/stop_machine.h>
 #include <linux/slab.h>
 #include <linux/kdebug.h>
+#include <linux/mmu_context.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -683,41 +684,102 @@ __ro_after_init unsigned long poking_addr;
 
 static void *__text_poke(void *addr, const void *opcode, size_t len)
 {
+	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
+	temporary_mm_state_t prev;
+	struct page *pages[2] = {NULL};
 	unsigned long flags;
-	char *vaddr;
-	struct page *pages[2];
-	int i;
+	pte_t pte, *ptep;
+	spinlock_t *ptl;
+	pgprot_t prot;
 
 	/*
-	 * While boot memory allocator is runnig we cannot use struct
-	 * pages as they are not yet initialized.
+	 * While boot memory allocator is running we cannot use struct pages as
+	 * they are not yet initialized.
 	 */
 	BUG_ON(!after_bootmem);
 
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
-		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
+		if (cross_page_boundary)
+			pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
 	} else {
 		pages[0] = virt_to_page(addr);
 		WARN_ON(!PageReserved(pages[0]));
-		pages[1] = virt_to_page(addr + PAGE_SIZE);
+		if (cross_page_boundary)
+			pages[1] = virt_to_page(addr + PAGE_SIZE);
 	}
-	BUG_ON(!pages[0]);
+	BUG_ON(!pages[0] || (cross_page_boundary && !pages[1]));
+
 	local_irq_save(flags);
-	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
-	if (pages[1])
-		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
-	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
-	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
-	clear_fixmap(FIX_TEXT_POKE0);
-	if (pages[1])
-		clear_fixmap(FIX_TEXT_POKE1);
-	local_flush_tlb();
-	sync_core();
-	/* Could also do a CLFLUSH here to speed up CPU recovery; but
-	   that causes hangs on some VIA CPUs. */
-	for (i = 0; i < len; i++)
-		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
+
+	/*
+	 * The lock is not really needed, but this allows to avoid open-coding.
+	 */
+	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
+
+	/*
+	 * This must not fail; preallocated in poking_init().
+	 */
+	VM_BUG_ON(!ptep);
+
+	/*
+	 * flush_tlb_mm_range() would be called when the poking_mm is not
+	 * loaded. When PCID is in use, the flush would be deferred to the time
+	 * the poking_mm is loaded again. Set the PTE as non-global to prevent
+	 * it from being used when we are done.
+	 */
+	prot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);
+
+	pte = mk_pte(pages[0], prot);
+	set_pte_at(poking_mm, poking_addr, ptep, pte);
+
+	if (cross_page_boundary) {
+		pte = mk_pte(pages[1], prot);
+		set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
+	}
+
+	/*
+	 * Loading the temporary mm behaves as a compiler barrier, which
+	 * guarantees that the PTE will be set at the time memcpy() is done.
+	 */
+	prev = use_temporary_mm(poking_mm);
+
+	kasan_disable_current();
+	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
+	kasan_enable_current();
+
+	/*
+	 * Ensure that the PTE is only cleared after the instructions of memcpy
+	 * were issued by using a compiler barrier.
+	 */
+	barrier();
+
+	pte_clear(poking_mm, poking_addr, ptep);
+	if (cross_page_boundary)
+		pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);
+
+	/*
+	 * Loading the previous page-table hierarchy requires a serializing
+	 * instruction that already allows the core to see the updated version.
+	 * Xen-PV is assumed to serialize execution in a similar manner.
+	 */
+	unuse_temporary_mm(prev);
+
+	/*
+	 * Flushing the TLB might involve IPIs, which would require enabled
+	 * IRQs, but not if the mm is not used, as it is in this point.
+	 */
+	flush_tlb_mm_range(poking_mm, poking_addr, poking_addr +
+			   (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
+			   PAGE_SHIFT, false);
+
+	pte_unmap_unlock(ptep, ptl);
+	/*
+	 * If the text doesn't match what we just wrote; something is
+	 * fundamentally screwy, there's nothing we can really do about that.
+	 */
+	BUG_ON(memcmp(addr, opcode, len));
+
 	local_irq_restore(flags);
 	return addr;
 }
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 0f4fe206dcc2..82b181fcefe5 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2319,8 +2319,6 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 #elif defined(CONFIG_X86_VSYSCALL_EMULATION)
 	case VSYSCALL_PAGE:
 #endif
-	case FIX_TEXT_POKE0:
-	case FIX_TEXT_POKE1:
 		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 07/20] x86/kgdb: avoid redundant comparison of patched code
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (5 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 06/20] x86/alternative: use temporary mm for text poking Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 08/20] x86/ftrace: set trampoline pages as executable Rick Edgecombe
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

text_poke() already ensures that the written value is the correct one
and fails if that is not the case. There is no need for an additional
comparison. Remove it.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/kgdb.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 1461544cba8b..057af9187a04 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -746,7 +746,6 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long ip)
 int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
 {
 	int err;
-	char opc[BREAK_INSTR_SIZE];
 
 	bpt->type = BP_BREAKPOINT;
 	err = probe_kernel_read(bpt->saved_instr, (char *)bpt->bpt_addr,
@@ -765,11 +764,6 @@ int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
 		return -EBUSY;
 	text_poke_kgdb((void *)bpt->bpt_addr, arch_kgdb_ops.gdb_bpt_instr,
 		       BREAK_INSTR_SIZE);
-	err = probe_kernel_read(opc, (char *)bpt->bpt_addr, BREAK_INSTR_SIZE);
-	if (err)
-		return err;
-	if (memcmp(opc, arch_kgdb_ops.gdb_bpt_instr, BREAK_INSTR_SIZE))
-		return -EINVAL;
 	bpt->type = BP_POKE_BREAKPOINT;
 
 	return err;
@@ -777,9 +771,6 @@ int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
 
 int kgdb_arch_remove_breakpoint(struct kgdb_bkpt *bpt)
 {
-	int err;
-	char opc[BREAK_INSTR_SIZE];
-
 	if (bpt->type != BP_POKE_BREAKPOINT)
 		goto knl_write;
 	/*
@@ -790,10 +781,7 @@ int kgdb_arch_remove_breakpoint(struct kgdb_bkpt *bpt)
 		goto knl_write;
 	text_poke_kgdb((void *)bpt->bpt_addr, bpt->saved_instr,
 		       BREAK_INSTR_SIZE);
-	err = probe_kernel_read(opc, (char *)bpt->bpt_addr, BREAK_INSTR_SIZE);
-	if (err || memcmp(opc, bpt->saved_instr, BREAK_INSTR_SIZE))
-		goto knl_write;
-	return err;
+	return 0;
 
 knl_write:
 	return probe_kernel_write((char *)bpt->bpt_addr,
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 08/20] x86/ftrace: set trampoline pages as executable
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (6 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 07/20] x86/kgdb: avoid redundant comparison of patched code Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 09/20] x86/kprobes: instruction pages initialization enhancements Rick Edgecombe
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Steven Rostedt, Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

Since module_alloc() will soon no longer set the pages as executable,
we need to do so for the ftrace trampoline pages after they are
allocated.

For the time being, we do not change ftrace to use the text_poke()
interface. As a result, ftrace still breaks W^X.

Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/ftrace.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 8257a59704ae..13c8249b197f 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -742,6 +742,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	unsigned long end_offset;
 	unsigned long op_offset;
 	unsigned long offset;
+	unsigned long npages;
 	unsigned long size;
 	unsigned long retq;
 	unsigned long *ptr;
@@ -774,6 +775,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 		return 0;
 
 	*tramp_size = size + RET_SIZE + sizeof(void *);
+	npages = DIV_ROUND_UP(*tramp_size, PAGE_SIZE);
 
 	/* Copy ftrace_caller onto the trampoline memory */
 	ret = probe_kernel_read(trampoline, (void *)start_offset, size);
@@ -818,6 +820,12 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	/* ALLOC_TRAMP flags lets us know we created it */
 	ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;
 
+	/*
+	 * Module allocation needs to be completed by making the page
+	 * executable. The page is still writable, which is a security hazard,
+	 * but anyhow ftrace breaks W^X completely.
+	 */
+	set_memory_x((unsigned long)trampoline, npages);
 	return (unsigned long)trampoline;
 fail:
 	tramp_free(trampoline, *tramp_size);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 09/20] x86/kprobes: instruction pages initialization enhancements
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (7 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 08/20] x86/ftrace: set trampoline pages as executable Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-02-11 18:22   ` Borislav Petkov
  2019-01-29  0:34 ` [PATCH v2 10/20] x86: avoid W^X being broken during modules loading Rick Edgecombe
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

Make kprobes instruction pages read-only (and executable) after they
are set up, in order to protect them from mistaken or malicious
modification.

This is a preparatory patch for a following patch that makes
module-allocated pages non-executable and sets the page as executable
after allocation.

While at it, do some small cleanup of what appears to be unnecessary
masking.

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/kprobes/core.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 4ba75afba527..fac692e36833 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -431,8 +431,20 @@ void *alloc_insn_page(void)
 	void *page;
 
 	page = module_alloc(PAGE_SIZE);
-	if (page)
-		set_memory_ro((unsigned long)page & PAGE_MASK, 1);
+	if (page == NULL)
+		return NULL;
+
+	/*
+	 * First make the page read-only, and then only then make it executable
+	 * to prevent it from being W+X in between.
+	 */
+	set_memory_ro((unsigned long)page, 1);
+
+	/*
+	 * TODO: Once additional kernel code protection mechanisms are set, ensure
+	 * that the page was not maliciously altered and it is still zeroed.
+	 */
+	set_memory_x((unsigned long)page, 1);
 
 	return page;
 }
@@ -440,8 +452,12 @@ void *alloc_insn_page(void)
 /* Recover page to RW mode before releasing it */
 void free_insn_page(void *page)
 {
-	set_memory_nx((unsigned long)page & PAGE_MASK, 1);
-	set_memory_rw((unsigned long)page & PAGE_MASK, 1);
+	/*
+	 * First make the page non-executable, and then only then make it
+	 * writable to prevent it from being W+X in between.
+	 */
+	set_memory_nx((unsigned long)page, 1);
+	set_memory_rw((unsigned long)page, 1);
 	module_memfree(page);
 }
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (8 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 09/20] x86/kprobes: instruction pages initialization enhancements Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-02-11 18:29   ` Borislav Petkov
  2019-01-29  0:34 ` [PATCH v2 11/20] x86/jump-label: remove support for custom poker Rick Edgecombe
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Masami Hiramatsu,
	Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

When modules and BPF filters are loaded, there is a time window in
which some memory is both writable and executable. An attacker that has
already found another vulnerability (e.g., a dangling pointer) might be
able to exploit this behavior to overwrite kernel code.

Prevent having writable and executable PTEs during this stage. In
addition, avoiding W+X mappings also slightly simplifies the patching of
module code during initialization (e.g., by alternatives and static
keys), as is done in the next patch.

To avoid W+X mappings, set the pages initially as RW (NX), and only
after they are set RO, set them as X as well. Setting them executable is
done as a separate step to avoid a window in which one core still has
the old PTE cached (hence writable) while another already sees the
updated PTE (executable), which would break the W^X protection.
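
A minimal sketch of the intended page lifecycle after this patch
(simplified; frob_text() applies the given set_memory_*() helper to a
module text region):

	p = module_alloc(size);			/* mapped RW, NX */
	/* ... relocations, alternatives, static keys patched while RW ... */
	frob_text(layout, set_memory_ro);	/* RW -> RO */
	frob_text(layout, set_memory_x);	/* RO -> RO+X, never W+X */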

Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/alternative.c | 28 +++++++++++++++++++++-------
 arch/x86/kernel/module.c      |  2 +-
 include/linux/filter.h        |  2 +-
 kernel/module.c               |  5 +++++
 4 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 76d482a2b716..69f3e650ada8 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -667,15 +667,29 @@ void __init alternative_instructions(void)
  * handlers seeing an inconsistent instruction while you patch.
  */
 void *__init_or_module text_poke_early(void *addr, const void *opcode,
-					      size_t len)
+				       size_t len)
 {
 	unsigned long flags;
-	local_irq_save(flags);
-	memcpy(addr, opcode, len);
-	local_irq_restore(flags);
-	sync_core();
-	/* Could also do a CLFLUSH here to speed up CPU recovery; but
-	   that causes hangs on some VIA CPUs. */
+
+	if (static_cpu_has(X86_FEATURE_NX) &&
+	    is_module_text_address((unsigned long)addr)) {
+		/*
+		 * Modules text is marked initially as non-executable, so the
+		 * code cannot be running and speculative code-fetches are
+		 * prevented. We can just change the code.
+		 */
+		memcpy(addr, opcode, len);
+	} else {
+		local_irq_save(flags);
+		memcpy(addr, opcode, len);
+		local_irq_restore(flags);
+		sync_core();
+
+		/*
+		 * Could also do a CLFLUSH here to speed up CPU recovery; but
+		 * that causes hangs on some VIA CPUs.
+		 */
+	}
 	return addr;
 }
 
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b052e883dd8c..cfa3106faee4 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -87,7 +87,7 @@ void *module_alloc(unsigned long size)
 	p = __vmalloc_node_range(size, MODULE_ALIGN,
 				    MODULES_VADDR + get_module_load_offset(),
 				    MODULES_END, GFP_KERNEL,
-				    PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
+				    PAGE_KERNEL, 0, NUMA_NO_NODE,
 				    __builtin_return_address(0));
 	if (p && (kasan_module_alloc(p, size) < 0)) {
 		vfree(p);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index d531d4250bff..9cdfab7f383c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -681,7 +681,6 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default)
 
 static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
 {
-	fp->undo_set_mem = 1;
 	set_memory_ro((unsigned long)fp, fp->pages);
 }
 
@@ -694,6 +693,7 @@ static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
 static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
 {
 	set_memory_ro((unsigned long)hdr, hdr->pages);
+	set_memory_x((unsigned long)hdr, hdr->pages);
 }
 
 static inline void bpf_jit_binary_unlock_ro(struct bpf_binary_header *hdr)
diff --git a/kernel/module.c b/kernel/module.c
index 2ad1b5239910..ae1b77da6a20 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1950,8 +1950,13 @@ void module_enable_ro(const struct module *mod, bool after_init)
 		return;
 
 	frob_text(&mod->core_layout, set_memory_ro);
+	frob_text(&mod->core_layout, set_memory_x);
+
 	frob_rodata(&mod->core_layout, set_memory_ro);
+
 	frob_text(&mod->init_layout, set_memory_ro);
+	frob_text(&mod->init_layout, set_memory_x);
+
 	frob_rodata(&mod->init_layout, set_memory_ro);
 
 	if (after_init)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 11/20] x86/jump-label: remove support for custom poker
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (9 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 10/20] x86: avoid W^X being broken during modules loading Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-02-11 18:37   ` Borislav Petkov
  2019-01-29  0:34 ` [PATCH v2 12/20] x86/alternative: Remove the return value of text_poke_*() Rick Edgecombe
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Masami Hiramatsu,
	Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

There are only two types of poking: early and breakpoint-based. The use
of a function pointer to perform the poking complicates the code and is
probably inefficient due to the use of indirect branches. Remove the
custom-poker argument and choose between text_poke_early() and
text_poke_bp() directly.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/jump_label.c | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e36cfec0f35e..427facef8aff 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -37,7 +37,6 @@ static void bug_at(unsigned char *ip, int line)
 
 static void __ref __jump_label_transform(struct jump_entry *entry,
 					 enum jump_label_type type,
-					 void *(*poker)(void *, const void *, size_t),
 					 int init)
 {
 	union jump_code_union jmp;
@@ -50,14 +49,6 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 	jmp.offset = jump_entry_target(entry) -
 		     (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
 
-	/*
-	 * As long as we're UP and not yet marked RO, we can use
-	 * text_poke_early; SYSTEM_BOOTING guarantees both, as we switch to
-	 * SYSTEM_SCHEDULING before going either.
-	 */
-	if (system_state == SYSTEM_BOOTING)
-		poker = text_poke_early;
-
 	if (type == JUMP_LABEL_JMP) {
 		if (init) {
 			expect = default_nop; line = __LINE__;
@@ -80,16 +71,17 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 		bug_at((void *)jump_entry_code(entry), line);
 
 	/*
-	 * Make text_poke_bp() a default fallback poker.
+	 * As long as we're UP and not yet marked RO, we can use
+	 * text_poke_early; SYSTEM_BOOTING guarantees both, as we switch to
+	 * SYSTEM_SCHEDULING before going either.
 	 *
 	 * At the time the change is being done, just ignore whether we
 	 * are doing nop -> jump or jump -> nop transition, and assume
 	 * always nop being the 'currently valid' instruction
-	 *
 	 */
-	if (poker) {
-		(*poker)((void *)jump_entry_code(entry), code,
-			 JUMP_LABEL_NOP_SIZE);
+	if (init || system_state == SYSTEM_BOOTING) {
+		text_poke_early((void *)jump_entry_code(entry), code,
+				JUMP_LABEL_NOP_SIZE);
 		return;
 	}
 
@@ -101,7 +93,7 @@ void arch_jump_label_transform(struct jump_entry *entry,
 			       enum jump_label_type type)
 {
 	mutex_lock(&text_mutex);
-	__jump_label_transform(entry, type, NULL, 0);
+	__jump_label_transform(entry, type, 0);
 	mutex_unlock(&text_mutex);
 }
 
@@ -131,5 +123,5 @@ __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry,
 			jlstate = JL_STATE_NO_UPDATE;
 	}
 	if (jlstate == JL_STATE_UPDATE)
-		__jump_label_transform(entry, type, text_poke_early, 1);
+		__jump_label_transform(entry, type, 1);
 }
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 12/20] x86/alternative: Remove the return value of text_poke_*()
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (10 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 11/20] x86/jump-label: remove support for custom poker Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 13/20] Add set_alias_ function and x86 implementation Rick Edgecombe
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Masami Hiramatsu,
	Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

The return value of text_poke_early() and text_poke_bp() is useless.
Remove it.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/text-patching.h |  4 ++--
 arch/x86/kernel/alternative.c        | 11 ++++-------
 2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index a75eed841eed..c90678fd391a 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -18,7 +18,7 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 #define __parainstructions_end	NULL
 #endif
 
-extern void *text_poke_early(void *addr, const void *opcode, size_t len);
+extern void text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
  * Clear and restore the kernel write-protection flag on the local CPU.
@@ -37,7 +37,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
-extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 69f3e650ada8..81876e3ef3fd 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -264,7 +264,7 @@ static void __init_or_module add_nops(void *insns, unsigned int len)
 
 extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
 extern s32 __smp_locks[], __smp_locks_end[];
-void *text_poke_early(void *addr, const void *opcode, size_t len);
+void text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
  * Are we looking at a near JMP with a 1 or 4-byte displacement.
@@ -666,8 +666,8 @@ void __init alternative_instructions(void)
  * instructions. And on the local CPU you need to be protected again NMI or MCE
  * handlers seeing an inconsistent instruction while you patch.
  */
-void *__init_or_module text_poke_early(void *addr, const void *opcode,
-				       size_t len)
+void __init_or_module text_poke_early(void *addr, const void *opcode,
+				      size_t len)
 {
 	unsigned long flags;
 
@@ -690,7 +690,6 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 		 * that causes hangs on some VIA CPUs.
 		 */
 	}
-	return addr;
 }
 
 __ro_after_init struct mm_struct *poking_mm;
@@ -890,7 +889,7 @@ int poke_int3_handler(struct pt_regs *regs)
  *	  replacing opcode
  *	- sync cores
  */
-void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 {
 	unsigned char int3 = 0xcc;
 
@@ -932,7 +931,5 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	 * the writing of the new instruction.
 	 */
 	bp_patching_in_progress = false;
-
-	return addr;
 }
 
-- 
2.17.1


* [PATCH v2 13/20] Add set_alias_ function and x86 implementation
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (11 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 12/20] x86/alternative: Remove the return value of text_poke_*() Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-02-11 19:09   ` Borislav Petkov
  2019-01-29  0:34 ` [PATCH v2 14/20] mm: Make hibernate handle unmapped pages Rick Edgecombe
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Rick Edgecombe

This adds two new functions, set_alias_default_noflush and
set_alias_nv_noflush, for setting the alias mapping of a page to its
default valid permissions and to an invalid state that cannot be cached in
a TLB, respectively. These functions do not flush the TLB.

Note, __kernel_map_pages does something similar but flushes the TLB and
doesn't reset the permission bits to default on all architectures.

There is also an ARCH config ARCH_HAS_SET_ALIAS for specifying whether
these have an actual implementation or a default empty one.
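
As a rough usage sketch (illustrative only, not taken from this patch), a
caller that wants to temporarily invalidate the direct map alias of a page
without leaving a stale-TLB window would pair the two functions around a
single flush, e.g. for an assumed 'struct page *page':

	unsigned long addr = (unsigned long)page_address(page);

	/* Make the alias invalid so it can no longer be cached in the TLB. */
	set_alias_nv_noflush(page);

	/* One flush covers the whole operation. */
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

	/* Restore the default (valid) direct map permissions. */
	set_alias_default_noflush(page);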

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/Kconfig                      |  4 ++++
 arch/x86/Kconfig                  |  1 +
 arch/x86/include/asm/set_memory.h |  3 +++
 arch/x86/mm/pageattr.c            | 14 +++++++++++---
 include/linux/set_memory.h        | 10 ++++++++++
 5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de48f79..4ef9db190f2d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -249,6 +249,10 @@ config ARCH_HAS_FORTIFY_SOURCE
 config ARCH_HAS_SET_MEMORY
 	bool
 
+# Select if arch has all set_alias_nv/default() functions
+config ARCH_HAS_SET_ALIAS
+	bool
+
 # Select if arch init_task must go in the __init_task_data section
 config ARCH_TASK_STRUCT_ON_STACK
        bool
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 26387c7bf305..42bb1df4ea94 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -66,6 +66,7 @@ config X86
 	select ARCH_HAS_UACCESS_FLUSHCACHE	if X86_64
 	select ARCH_HAS_UACCESS_MCSAFE		if X86_64 && X86_MCE
 	select ARCH_HAS_SET_MEMORY
+	select ARCH_HAS_SET_ALIAS
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
 	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 07a25753e85c..2ef4e4222df1 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -85,6 +85,9 @@ int set_pages_nx(struct page *page, int numpages);
 int set_pages_ro(struct page *page, int numpages);
 int set_pages_rw(struct page *page, int numpages);
 
+int set_alias_nv_noflush(struct page *page);
+int set_alias_default_noflush(struct page *page);
+
 extern int kernel_set_to_readonly;
 void set_kernel_text_rw(void);
 void set_kernel_text_ro(void);
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 4f8972311a77..3a51915a1410 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -2209,8 +2209,6 @@ int set_pages_rw(struct page *page, int numpages)
 	return set_memory_rw(addr, numpages);
 }
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
-
 static int __set_pages_p(struct page *page, int numpages)
 {
 	unsigned long tempaddr = (unsigned long) page_address(page);
@@ -2249,6 +2247,17 @@ static int __set_pages_np(struct page *page, int numpages)
 	return __change_page_attr_set_clr(&cpa, 0);
 }
 
+int set_alias_nv_noflush(struct page *page)
+{
+	return __set_pages_np(page, 1);
+}
+
+int set_alias_default_noflush(struct page *page)
+{
+	return __set_pages_p(page, 1);
+}
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
 	if (PageHighMem(page))
@@ -2282,7 +2291,6 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 }
 
 #ifdef CONFIG_HIBERNATION
-
 bool kernel_page_present(struct page *page)
 {
 	unsigned int level;
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 2a986d282a97..d19481ac6a8f 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -10,6 +10,16 @@
 
 #ifdef CONFIG_ARCH_HAS_SET_MEMORY
 #include <asm/set_memory.h>
+#ifndef CONFIG_ARCH_HAS_SET_ALIAS
+static inline int set_alias_nv_noflush(struct page *page)
+{
+	return 0;
+}
+static inline int set_alias_default_noflush(struct page *page)
+{
+	return 0;
+}
+#endif
 #else
 static inline int set_memory_ro(unsigned long addr, int numpages) { return 0; }
 static inline int set_memory_rw(unsigned long addr, int numpages) { return 0; }
-- 
2.17.1


* [PATCH v2 14/20] mm: Make hibernate handle unmapped pages
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (12 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 13/20] Add set_alias_ function and x86 implementation Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-02-19 11:04   ` Borislav Petkov
  2019-01-29  0:34 ` [PATCH v2 15/20] vmalloc: New flags for safe vfree on special perms Rick Edgecombe
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Rick Edgecombe, Rafael J. Wysocki, Pavel Machek

For architectures with CONFIG_ARCH_HAS_SET_ALIAS, pages can be unmapped
briefly on the directmap, even when CONFIG_DEBUG_PAGEALLOC is not
configured. So this changes kernel_map_pages and kernel_page_present to be
defined when CONFIG_ARCH_HAS_SET_ALIAS is defined as well. It also changes
places (page_alloc.c) where those functions are assumed to only be
implemented when CONFIG_DEBUG_PAGEALLOC is defined.

So now when CONFIG_ARCH_HAS_SET_ALIAS=y, hibernate will handle not-present
pages when saving. Previously this was already done when
CONFIG_DEBUG_PAGEALLOC was configured. It does not appear to have a big
impact on hibernation performance.

Before:
[    4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s)

After:
[    4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s)
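
For reference, the hibernate code this enables is roughly the existing
safe_copy_page() logic in kernel/power/snapshot.c (sketched below, not new
code): a page that is not present in the direct map is temporarily mapped
around the copy.

	if (kernel_page_present(s_page)) {
		do_copy_page(dst, page_address(s_page));
	} else {
		/* Map the page, copy it, then unmap it again. */
		kernel_map_pages(s_page, 1, 1);
		do_copy_page(dst, page_address(s_page));
		kernel_map_pages(s_page, 1, 0);
	}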

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Pavel Machek <pavel@ucw.cz>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/mm/pageattr.c |  4 ----
 include/linux/mm.h     | 18 ++++++------------
 mm/page_alloc.c        |  7 +++++--
 3 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 3a51915a1410..717bdc188aab 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -2257,7 +2257,6 @@ int set_alias_default_noflush(struct page *page)
 	return __set_pages_p(page, 1);
 }
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
 	if (PageHighMem(page))
@@ -2302,11 +2301,8 @@ bool kernel_page_present(struct page *page)
 	pte = lookup_address((unsigned long)page_address(page), &level);
 	return (pte_val(*pte) & _PAGE_PRESENT);
 }
-
 #endif /* CONFIG_HIBERNATION */
 
-#endif /* CONFIG_DEBUG_PAGEALLOC */
-
 int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
 				   unsigned numpages, unsigned long page_flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..b362a280a919 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2642,37 +2642,31 @@ static inline void kernel_poison_pages(struct page *page, int numpages,
 					int enable) { }
 #endif
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
 extern bool _debug_pagealloc_enabled;
-extern void __kernel_map_pages(struct page *page, int numpages, int enable);
 
 static inline bool debug_pagealloc_enabled(void)
 {
-	return _debug_pagealloc_enabled;
+	return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) && _debug_pagealloc_enabled;
 }
 
+#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_ALIAS)
+extern void __kernel_map_pages(struct page *page, int numpages, int enable);
+
 static inline void
 kernel_map_pages(struct page *page, int numpages, int enable)
 {
-	if (!debug_pagealloc_enabled())
-		return;
-
 	__kernel_map_pages(page, numpages, enable);
 }
 #ifdef CONFIG_HIBERNATION
 extern bool kernel_page_present(struct page *page);
 #endif	/* CONFIG_HIBERNATION */
-#else	/* CONFIG_DEBUG_PAGEALLOC */
+#else	/* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_ALIAS */
 static inline void
 kernel_map_pages(struct page *page, int numpages, int enable) {}
 #ifdef CONFIG_HIBERNATION
 static inline bool kernel_page_present(struct page *page) { return true; }
 #endif	/* CONFIG_HIBERNATION */
-static inline bool debug_pagealloc_enabled(void)
-{
-	return false;
-}
-#endif	/* CONFIG_DEBUG_PAGEALLOC */
+#endif	/* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_ALIAS */
 
 #ifdef __HAVE_ARCH_GATE_AREA
 extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d295c9bc01a8..92d0a0934274 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1074,7 +1074,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	}
 	arch_free_page(page, order);
 	kernel_poison_pages(page, 1 << order, 0);
-	kernel_map_pages(page, 1 << order, 0);
+	if (debug_pagealloc_enabled())
+		kernel_map_pages(page, 1 << order, 0);
+
 	kasan_free_nondeferred_pages(page, order);
 
 	return true;
@@ -1944,7 +1946,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	set_page_refcounted(page);
 
 	arch_alloc_page(page, order);
-	kernel_map_pages(page, 1 << order, 1);
+	if (debug_pagealloc_enabled())
+		kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
 	set_page_owner(page, order, gfp_flags);
-- 
2.17.1


* [PATCH v2 15/20] vmalloc: New flags for safe vfree on special perms
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (13 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 14/20] mm: Make hibernate handle unmapped pages Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-02-19 12:48   ` Borislav Petkov
  2019-01-29  0:34 ` [PATCH v2 16/20] modules: Use vmalloc special flag Rick Edgecombe
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Rick Edgecombe

This adds a new flag, VM_HAS_SPECIAL_PERMS, which enables vfree operations
to immediately clear executable TLB entries for freed pages and to handle
freeing memory with special permissions. It also takes care of resetting
the direct map permissions for the pages being unmapped. So this flag is
useful for any kind of memory with elevated permissions, or where there can
be related permission changes on the directmap. Today this is RO+X and RO
memory.

Although this now enables vfreeing RO memory directly, RO memory cannot be
freed in an interrupt because the allocation itself is used as a node on
the deferred free list. So when RO memory needs to be freed in an
interrupt, the code doing the vfree needs to have its own work queue, as
was the case before the deferred vfree list handling was added. Today
there is only one case where this happens.

For architectures with set_alias_ implementations this whole operation
can be done with one TLB flush when centralized like this. For others with
directmap permissions, currently only arm64, a backup method using
set_memory functions is used to reset the directmap. When arm64 adds
set_alias_ functions, this backup can be removed.

When the TLB is flushed to remove both the TLB entries for the vmalloc
range mapping and those for the direct map permissions, the lazy purge
operation can be done at the same time to try to save a TLB flush later.
However, today vm_unmap_aliases could flush a TLB range that does not
include the directmap. So a helper is added with extra parameters that
allow both the vmalloc address range and the direct mapping to be flushed
during this operation. The behavior of the normal vm_unmap_aliases
function is unchanged.
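
As a hedged sketch of the caller-side pattern (not part of this patch;
later patches in the series set the flag via set_vm_special() instead),
memory that will get special permissions can be allocated with the flag and
later simply vfree()d without resetting permissions first. 'npages' below
is an assumed page count for the allocation:

	void *buf = __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
					 GFP_KERNEL, PAGE_KERNEL_EXEC,
					 VM_HAS_SPECIAL_PERMS, NUMA_NO_NODE,
					 __builtin_return_address(0));

	if (buf) {
		set_memory_ro((unsigned long)buf, npages);
		/* ... use the RO+X memory ... */
		vfree(buf);	/* resets the direct map and flushes the TLB */
	}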

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/vmalloc.h |  13 +++++
 mm/vmalloc.c            | 122 +++++++++++++++++++++++++++++++++-------
 2 files changed, 116 insertions(+), 19 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..9f643f917360 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -21,6 +21,11 @@ struct notifier_block;		/* in notifier.h */
 #define VM_UNINITIALIZED	0x00000020	/* vm_struct is not fully initialized */
 #define VM_NO_GUARD		0x00000040      /* don't add guard page */
 #define VM_KASAN		0x00000080      /* has allocated kasan shadow memory */
+/*
+ * Memory with VM_HAS_SPECIAL_PERMS cannot be freed in an interrupt or with
+ * vfree_atomic.
+ */
+#define VM_HAS_SPECIAL_PERMS	0x00000200      /* Reset directmap and flush TLB on unmap */
 /* bits [20..32] reserved for arch specific ioremap internals */
 
 /*
@@ -135,6 +140,14 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
+static inline void set_vm_special(void *addr)
+{
+	struct vm_struct *vm = find_vm_area(addr);
+
+	if (vm)
+		vm->flags |= VM_HAS_SPECIAL_PERMS;
+}
+
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
 			struct page **pages);
 #ifdef CONFIG_MMU
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 871e41c55e23..d459b5b9649b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -18,6 +18,7 @@
 #include <linux/interrupt.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/set_memory.h>
 #include <linux/debugobjects.h>
 #include <linux/kallsyms.h>
 #include <linux/list.h>
@@ -1055,24 +1056,11 @@ static void vb_free(const void *addr, unsigned long size)
 		spin_unlock(&vb->lock);
 }
 
-/**
- * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer
- *
- * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
- * to amortize TLB flushing overheads. What this means is that any page you
- * have now, may, in a former life, have been mapped into kernel virtual
- * address by the vmap layer and so there might be some CPUs with TLB entries
- * still referencing that page (additional to the regular 1:1 kernel mapping).
- *
- * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can
- * be sure that none of the pages we have control over will have any aliases
- * from the vmap layer.
- */
-void vm_unmap_aliases(void)
+static void _vm_unmap_aliases(unsigned long start, unsigned long end,
+				int must_flush)
 {
-	unsigned long start = ULONG_MAX, end = 0;
 	int cpu;
-	int flush = 0;
+	int flush = must_flush;
 
 	if (unlikely(!vmap_initialized))
 		return;
@@ -1109,6 +1097,27 @@ void vm_unmap_aliases(void)
 		flush_tlb_kernel_range(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
+
+/**
+ * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer
+ *
+ * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
+ * to amortize TLB flushing overheads. What this means is that any page you
+ * have now, may, in a former life, have been mapped into kernel virtual
+ * address by the vmap layer and so there might be some CPUs with TLB entries
+ * still referencing that page (additional to the regular 1:1 kernel mapping).
+ *
+ * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can
+ * be sure that none of the pages we have control over will have any aliases
+ * from the vmap layer.
+ */
+void vm_unmap_aliases(void)
+{
+	unsigned long start = ULONG_MAX, end = 0;
+	int must_flush = 0;
+
+	_vm_unmap_aliases(start, end, must_flush);
+}
 EXPORT_SYMBOL_GPL(vm_unmap_aliases);
 
 /**
@@ -1494,6 +1503,79 @@ struct vm_struct *remove_vm_area(const void *addr)
 	return NULL;
 }
 
+static inline void set_area_alias(const struct vm_struct *area,
+			int (*set_alias)(struct page *page))
+{
+	int i;
+
+	for (i = 0; i < area->nr_pages; i++) {
+		unsigned long addr =
+			(unsigned long)page_address(area->pages[i]);
+
+		if (addr)
+			set_alias(area->pages[i]);
+	}
+}
+
+/* This handles removing and resetting vm mappings related to the vm_struct. */
+static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
+{
+	unsigned long addr = (unsigned long)area->addr;
+	unsigned long start = ULONG_MAX, end = 0;
+	int special = area->flags & VM_HAS_SPECIAL_PERMS;
+	int i;
+
+	/*
+	 * The below block can be removed when all architectures that have
+	 * direct map permissions also have set_alias_ implementations. This is
+	 * to do resetting on the directmap for any special permissions (today
+	 * only X), without leaving a RW+X window.
+	 */
+	if (special && !IS_ENABLED(CONFIG_ARCH_HAS_SET_ALIAS)) {
+		set_memory_nx(addr, area->nr_pages);
+		set_memory_rw(addr, area->nr_pages);
+	}
+
+	remove_vm_area(area->addr);
+
+	/* If this is not special memory, we can skip the below. */
+	if (!special)
+		return;
+
+	/*
+	 * If we are not deallocating pages, we can just do the flush of the VM
+	 * area and return.
+	 */
+	if (!deallocate_pages) {
+		vm_unmap_aliases();
+		return;
+	}
+
+	/*
+	 * If we are here, we need to flush the vm mapping and reset the direct
+	 * map.
+	 * First find the start and end range of the direct mappings to make
+	 * sure the vm_unmap_aliases flush includes the direct map.
+	 */
+	for (i = 0; i < area->nr_pages; i++) {
+		unsigned long addr =
+			(unsigned long)page_address(area->pages[i]);
+		if (addr) {
+			start = min(addr, start);
+			end = max(addr, end);
+		}
+	}
+
+	/*
+	 * First we set direct map to something not valid so that it won't be
+	 * cached if there are any accesses after the TLB flush, then we flush
+	 * the TLB, and reset the directmap permissions to the default.
+	 */
+	set_area_alias(area, set_alias_nv_noflush);
+	_vm_unmap_aliases(start, end, 1);
+	set_area_alias(area, set_alias_default_noflush);
+}
+
 static void __vunmap(const void *addr, int deallocate_pages)
 {
 	struct vm_struct *area;
@@ -1515,7 +1597,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
 	debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
 	debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
 
-	remove_vm_area(addr);
+	vm_remove_mappings(area, deallocate_pages);
+
 	if (deallocate_pages) {
 		int i;
 
@@ -1925,8 +2008,9 @@ EXPORT_SYMBOL(vzalloc_node);
 
 void *vmalloc_exec(unsigned long size)
 {
-	return __vmalloc_node(size, 1, GFP_KERNEL, PAGE_KERNEL_EXEC,
-			      NUMA_NO_NODE, __builtin_return_address(0));
+	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_HAS_SPECIAL_PERMS,
+			NUMA_NO_NODE, __builtin_return_address(0));
 }
 
 #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
-- 
2.17.1


* [PATCH v2 16/20] modules: Use vmalloc special flag
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (14 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 15/20] vmalloc: New flags for safe vfree on special perms Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 17/20] bpf: " Rick Edgecombe
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Rick Edgecombe, Jessica Yu, Steven Rostedt

Use the new flag for handling the freeing of special-permissioned memory
in vmalloc, and remove the places where memory was set RW before freeing,
which is no longer needed.

Since vfreeing of VM_HAS_SPECIAL_PERMS memory is not supported in an
interrupt by vmalloc, the freeing of init sections is moved to a work
queue. Instead of call_rcu it now uses synchronize_rcu() in the work
queue.

Lastly, there is now a WARN_ON in module_memfree, since it must not be
called in an interrupt for memory with special permissions, as is required
for VM_HAS_SPECIAL_PERMS.
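
For completeness, a caller that could otherwise end up freeing such memory
from interrupt context would need to defer the vfree() to process context,
roughly like the sketch below (illustrative only; the names are made up and
not part of this patch):

	struct deferred_free {
		struct work_struct work;
		void *addr;
	};

	static void deferred_free_fn(struct work_struct *work)
	{
		struct deferred_free *df = container_of(work, struct deferred_free, work);

		vfree(df->addr);	/* process context, so this is allowed */
		kfree(df);
	}

	/* May be called from interrupt context (error handling elided). */
	static void free_special_mem(void *addr)
	{
		struct deferred_free *df = kmalloc(sizeof(*df), GFP_ATOMIC);

		if (!df)
			return;
		df->addr = addr;
		INIT_WORK(&df->work, deferred_free_fn);
		schedule_work(&df->work);
	}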

Cc: Jessica Yu <jeyu@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 kernel/module.c | 77 +++++++++++++++++++++++++------------------------
 1 file changed, 39 insertions(+), 38 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index ae1b77da6a20..1af5c8e19086 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -98,6 +98,10 @@ DEFINE_MUTEX(module_mutex);
 EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
 
+/* Work queue for freeing init sections in success case */
+static struct work_struct init_free_wq;
+static struct llist_head init_free_list;
+
 #ifdef CONFIG_MODULES_TREE_LOOKUP
 
 /*
@@ -1949,6 +1953,8 @@ void module_enable_ro(const struct module *mod, bool after_init)
 	if (!rodata_enabled)
 		return;
 
+	set_vm_special(mod->core_layout.base);
+	set_vm_special(mod->init_layout.base);
 	frob_text(&mod->core_layout, set_memory_ro);
 	frob_text(&mod->core_layout, set_memory_x);
 
@@ -1972,15 +1978,6 @@ static void module_enable_nx(const struct module *mod)
 	frob_writable_data(&mod->init_layout, set_memory_nx);
 }
 
-static void module_disable_nx(const struct module *mod)
-{
-	frob_rodata(&mod->core_layout, set_memory_x);
-	frob_ro_after_init(&mod->core_layout, set_memory_x);
-	frob_writable_data(&mod->core_layout, set_memory_x);
-	frob_rodata(&mod->init_layout, set_memory_x);
-	frob_writable_data(&mod->init_layout, set_memory_x);
-}
-
 /* Iterate through all modules and set each module's text as RW */
 void set_all_modules_text_rw(void)
 {
@@ -2024,23 +2021,8 @@ void set_all_modules_text_ro(void)
 	}
 	mutex_unlock(&module_mutex);
 }
-
-static void disable_ro_nx(const struct module_layout *layout)
-{
-	if (rodata_enabled) {
-		frob_text(layout, set_memory_rw);
-		frob_rodata(layout, set_memory_rw);
-		frob_ro_after_init(layout, set_memory_rw);
-	}
-	frob_rodata(layout, set_memory_x);
-	frob_ro_after_init(layout, set_memory_x);
-	frob_writable_data(layout, set_memory_x);
-}
-
 #else
-static void disable_ro_nx(const struct module_layout *layout) { }
 static void module_enable_nx(const struct module *mod) { }
-static void module_disable_nx(const struct module *mod) { }
 #endif
 
 #ifdef CONFIG_LIVEPATCH
@@ -2120,6 +2102,11 @@ static void free_module_elf(struct module *mod)
 
 void __weak module_memfree(void *module_region)
 {
+	/*
+	 * This memory may be RO, and freeing RO memory in an interrupt is not
+	 * supported by vmalloc.
+	 */
+	WARN_ON(in_interrupt());
 	vfree(module_region);
 }
 
@@ -2171,7 +2158,6 @@ static void free_module(struct module *mod)
 	mutex_unlock(&module_mutex);
 
 	/* This may be empty, but that's OK */
-	disable_ro_nx(&mod->init_layout);
 	module_arch_freeing_init(mod);
 	module_memfree(mod->init_layout.base);
 	kfree(mod->args);
@@ -2181,7 +2167,6 @@ static void free_module(struct module *mod)
 	lockdep_free_key_range(mod->core_layout.base, mod->core_layout.size);
 
 	/* Finally, free the core (containing the module structure) */
-	disable_ro_nx(&mod->core_layout);
 	module_memfree(mod->core_layout.base);
 }
 
@@ -3424,17 +3409,34 @@ static void do_mod_ctors(struct module *mod)
 
 /* For freeing module_init on success, in case kallsyms traversing */
 struct mod_initfree {
-	struct rcu_head rcu;
+	struct llist_node node;
 	void *module_init;
 };
 
-static void do_free_init(struct rcu_head *head)
+static void do_free_init(struct work_struct *w)
 {
-	struct mod_initfree *m = container_of(head, struct mod_initfree, rcu);
-	module_memfree(m->module_init);
-	kfree(m);
+	struct llist_node *pos, *n, *list;
+	struct mod_initfree *initfree;
+
+	list = llist_del_all(&init_free_list);
+
+	synchronize_rcu();
+
+	llist_for_each_safe(pos, n, list) {
+		initfree = container_of(pos, struct mod_initfree, node);
+		module_memfree(initfree->module_init);
+		kfree(initfree);
+	}
 }
 
+static int __init modules_wq_init(void)
+{
+	INIT_WORK(&init_free_wq, do_free_init);
+	init_llist_head(&init_free_list);
+	return 0;
+}
+module_init(modules_wq_init);
+
 /*
  * This is where the real work happens.
  *
@@ -3511,7 +3513,6 @@ static noinline int do_init_module(struct module *mod)
 #endif
 	module_enable_ro(mod, true);
 	mod_tree_remove_init(mod);
-	disable_ro_nx(&mod->init_layout);
 	module_arch_freeing_init(mod);
 	mod->init_layout.base = NULL;
 	mod->init_layout.size = 0;
@@ -3522,14 +3523,18 @@ static noinline int do_init_module(struct module *mod)
 	 * We want to free module_init, but be aware that kallsyms may be
 	 * walking this with preempt disabled.  In all the failure paths, we
 	 * call synchronize_rcu(), but we don't want to slow down the success
-	 * path, so use actual RCU here.
+	 * path. We can't do module_memfree in an interrupt, so we do the work
+	 * and call synchronize_rcu() in a work queue.
+	 *
 	 * Note that module_alloc() on most architectures creates W+X page
 	 * mappings which won't be cleaned up until do_free_init() runs.  Any
 	 * code such as mark_rodata_ro() which depends on those mappings to
 	 * be cleaned up needs to sync with the queued work - ie
 	 * rcu_barrier()
 	 */
-	call_rcu(&freeinit->rcu, do_free_init);
+	if (llist_add(&freeinit->node, &init_free_list))
+		schedule_work(&init_free_wq);
+
 	mutex_unlock(&module_mutex);
 	wake_up_all(&module_wq);
 
@@ -3826,10 +3831,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	module_bug_cleanup(mod);
 	mutex_unlock(&module_mutex);
 
-	/* we can't deallocate the module until we clear memory protection */
-	module_disable_ro(mod);
-	module_disable_nx(mod);
-
  ddebug_cleanup:
 	ftrace_release_mod(mod);
 	dynamic_debug_remove(mod, info->debug);
-- 
2.17.1


* [PATCH v2 17/20] bpf: Use vmalloc special flag
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (15 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 16/20] modules: Use vmalloc special flag Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 18/20] x86/ftrace: " Rick Edgecombe
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Rick Edgecombe, Daniel Borkmann, Alexei Starovoitov

Use the new flag VM_HAS_SPECIAL_PERMS for handling the freeing of
special-permissioned memory in vmalloc, and remove the places where memory
was set RW before freeing, which is no longer needed. A bit to track
whether the memory is RO is also no longer needed, because this is now
tracked in vmalloc.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/filter.h | 16 +++-------------
 kernel/bpf/core.c      |  1 -
 2 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9cdfab7f383c..cc9581dd9c58 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -20,6 +20,7 @@
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
 #include <linux/if_vlan.h>
+#include <linux/vmalloc.h>
 
 #include <net/sch_generic.h>
 
@@ -483,7 +484,6 @@ struct bpf_prog {
 	u16			pages;		/* Number of allocated pages */
 	u16			jited:1,	/* Is our filter JIT'ed? */
 				jit_requested:1,/* archs need to JIT the prog */
-				undo_set_mem:1,	/* Passed set_memory_ro() checkpoint */
 				gpl_compatible:1, /* Is filter GPL compatible? */
 				cb_access:1,	/* Is control block accessed? */
 				dst_needed:1,	/* Do we need dst entry? */
@@ -681,26 +681,17 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default)
 
 static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
 {
+	set_vm_special(fp);
 	set_memory_ro((unsigned long)fp, fp->pages);
 }
 
-static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
-{
-	if (fp->undo_set_mem)
-		set_memory_rw((unsigned long)fp, fp->pages);
-}
-
 static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
 {
+	set_vm_special(hdr);
 	set_memory_ro((unsigned long)hdr, hdr->pages);
 	set_memory_x((unsigned long)hdr, hdr->pages);
 }
 
-static inline void bpf_jit_binary_unlock_ro(struct bpf_binary_header *hdr)
-{
-	set_memory_rw((unsigned long)hdr, hdr->pages);
-}
-
 static inline struct bpf_binary_header *
 bpf_jit_binary_hdr(const struct bpf_prog *fp)
 {
@@ -735,7 +726,6 @@ void __bpf_prog_free(struct bpf_prog *fp);
 
 static inline void bpf_prog_unlock_free(struct bpf_prog *fp)
 {
-	bpf_prog_unlock_ro(fp);
 	__bpf_prog_free(fp);
 }
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 19c49313c709..465c1c3623e8 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -804,7 +804,6 @@ void __weak bpf_jit_free(struct bpf_prog *fp)
 	if (fp->jited) {
 		struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
 
-		bpf_jit_binary_unlock_ro(hdr);
 		bpf_jit_binary_free(hdr);
 
 		WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
-- 
2.17.1


* [PATCH v2 18/20] x86/ftrace: Use vmalloc special flag
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (16 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 17/20] bpf: " Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 19/20] x86/kprobes: " Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 20/20] x86/alternative: comment about module removal races Rick Edgecombe
  19 siblings, 0 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Rick Edgecombe, Steven Rostedt

Use the new flag VM_HAS_SPECIAL_PERMS for handling the freeing of
special-permissioned memory in vmalloc, and remove the places where memory
was set NX and RW before freeing, which is no longer needed.

Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/ftrace.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 13c8249b197f..cf30594a2032 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -692,10 +692,6 @@ static inline void *alloc_tramp(unsigned long size)
 }
 static inline void tramp_free(void *tramp, int size)
 {
-	int npages = PAGE_ALIGN(size) >> PAGE_SHIFT;
-
-	set_memory_nx((unsigned long)tramp, npages);
-	set_memory_rw((unsigned long)tramp, npages);
 	module_memfree(tramp);
 }
 #else
@@ -820,6 +816,8 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	/* ALLOC_TRAMP flags lets us know we created it */
 	ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;
 
+	set_vm_special(trampoline);
+
 	/*
 	 * Module allocation needs to be completed by making the page
 	 * executable. The page is still writable, which is a security hazard,
-- 
2.17.1


* [PATCH v2 19/20] x86/kprobes: Use vmalloc special flag
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (17 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 18/20] x86/ftrace: " Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  2019-01-29  0:34 ` [PATCH v2 20/20] x86/alternative: comment about module removal races Rick Edgecombe
  19 siblings, 0 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Rick Edgecombe, Masami Hiramatsu

Use the new flag VM_HAS_SPECIAL_PERMS for handling the freeing of
special-permissioned memory in vmalloc, and remove the places where memory
was set NX and RW before freeing, which is no longer needed.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/kprobes/core.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index fac692e36833..f2fab35bcb82 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -434,6 +434,7 @@ void *alloc_insn_page(void)
 	if (page == NULL)
 		return NULL;
 
+	set_vm_special(page);
 	/*
 	 * First make the page read-only, and then only then make it executable
 	 * to prevent it from being W+X in between.
@@ -452,12 +453,6 @@ void *alloc_insn_page(void)
 /* Recover page to RW mode before releasing it */
 void free_insn_page(void *page)
 {
-	/*
-	 * First make the page non-executable, and then only then make it
-	 * writable to prevent it from being W+X in between.
-	 */
-	set_memory_nx((unsigned long)page, 1);
-	set_memory_rw((unsigned long)page, 1);
 	module_memfree(page);
 }
 
-- 
2.17.1


* [PATCH v2 20/20] x86/alternative: comment about module removal races
  2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
                   ` (18 preceding siblings ...)
  2019-01-29  0:34 ` [PATCH v2 19/20] x86/kprobes: " Rick Edgecombe
@ 2019-01-29  0:34 ` Rick Edgecombe
  19 siblings, 0 replies; 71+ messages in thread
From: Rick Edgecombe @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, x86, hpa, Thomas Gleixner, Borislav Petkov,
	Nadav Amit, Dave Hansen, Peter Zijlstra, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Masami Hiramatsu, Rick Edgecombe

From: Nadav Amit <namit@vmware.com>

Add a comment to clarify that users of text_poke() must ensure that
no races with module removal take place.
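
A hedged sketch of what such ordering could look like for a text_poke()
user follows (names are made up for illustration; existing users such as
kprobes and ftrace have their own equivalents):

	static DEFINE_MUTEX(patch_mutex);

	static int patch_module_notify(struct notifier_block *nb,
				       unsigned long action, void *data)
	{
		if (action == MODULE_STATE_GOING) {
			mutex_lock(&patch_mutex);
			/* Invalidate any patch sites inside the departing module. */
			mutex_unlock(&patch_mutex);
		}
		return NOTIFY_DONE;
	}

	static struct notifier_block patch_module_nb = {
		.notifier_call = patch_module_notify,
	};

	/* register_module_notifier(&patch_module_nb) at init time. */

	static void patch_site(void *addr, const void *opcode, size_t len)
	{
		mutex_lock(&patch_mutex);
		mutex_lock(&text_mutex);	/* required by text_poke() */
		/* The module's text cannot go away while patch_mutex is held. */
		text_poke(addr, opcode, len);
		mutex_unlock(&text_mutex);
		mutex_unlock(&patch_mutex);
	}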

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/alternative.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 81876e3ef3fd..cc3b6222857a 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -807,6 +807,11 @@ static void *__text_poke(void *addr, const void *opcode, size_t len)
  * It means the size must be writable atomically and the address must be aligned
  * in a way that permits an atomic write. It also makes sure we fit on a single
  * page.
+ *
+ * Note that the caller must ensure that if the modified code is part of a
+ * module, the module would not be removed during poking. This can be achieved
+ * by registering a module notifier, and ordering module removal and patching
+ * through a mutex.
  */
 void *text_poke(void *addr, const void *opcode, size_t len)
 {
-- 
2.17.1


* Re: [PATCH v2 03/20] x86/mm: temporary mm struct
  2019-01-29  0:34 ` [PATCH v2 03/20] x86/mm: temporary mm struct Rick Edgecombe
@ 2019-01-31 11:29   ` Borislav Petkov
  2019-01-31 22:19     ` Nadav Amit
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-01-31 11:29 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, x86, hpa,
	Thomas Gleixner, Nadav Amit, Dave Hansen, Peter Zijlstra,
	linux_dti, linux-integrity, linux-security-module, akpm,
	kernel-hardening, linux-mm, will.deacon, ard.biesheuvel, kristen,
	deneen.t.dock, Kees Cook, Dave Hansen, Nadav Amit

> Subject: Re: [PATCH v2 03/20] x86/mm: temporary mm struct

Subject needs a verb: "Add a temporary... "

On Mon, Jan 28, 2019 at 04:34:05PM -0800, Rick Edgecombe wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> Sometimes we want to set a temporary page-table entries (PTEs) in one of

s/a //

Also, drop the "we" and make it impartial and passive:

 "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
  to do frotz", as if you are giving orders to the codebase to change
  its behaviour."

> the cores, without allowing other cores to use - even speculatively -
> these mappings. There are two benefits for doing so:
> 
> (1) Security: if sensitive PTEs are set, temporary mm prevents their use
> in other cores. This hardens the security as it prevents exploding a

exploding or exploiting? Or exposing? :)

> dangling pointer to overwrite sensitive data using the sensitive PTE.
> 
> (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
> remote page-tables.

Those belong in the code comments below, explaining what it is going to
be used for.

> To do so a temporary mm_struct can be used. Mappings which are private
> for this mm can be set in the userspace part of the address-space.
> During the whole time in which the temporary mm is loaded, interrupts
> must be disabled.
> 
> The first use-case for temporary PTEs, which will follow, is for poking
> the kernel text.
> 
> [ Commit message was written by Nadav ]
> 
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
> Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/include/asm/mmu_context.h | 32 ++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 19d18fae6ec6..cd0c29e494a6 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -356,4 +356,36 @@ static inline unsigned long __get_current_cr3_fast(void)
>  	return cr3;
>  }
>  
> +typedef struct {

Why does it have to be a typedef?

That prev.prev below looks unnecessary, instead of just using prev.

> +	struct mm_struct *prev;

Why "prev"?

> +} temporary_mm_state_t;

That's kinda long - it is longer than the function name below.
temp_mm_state_t not enough?

> +
> +/*
> + * Using a temporary mm allows to set temporary mappings that are not accessible
> + * by other cores. Such mappings are needed to perform sensitive memory writes
> + * that override the kernel memory protections (e.g., W^X), without exposing the
> + * temporary page-table mappings that are required for these write operations to
> + * other cores.
> + *
> + * Context: The temporary mm needs to be used exclusively by a single core. To
> + *          harden security IRQs must be disabled while the temporary mm is
			      ^
			      ,

> + *          loaded, thereby preventing interrupt handler bugs from override the

s/override/overriding/

> + *          kernel memory protection.
> + */
> +static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
> +{
> +	temporary_mm_state_t state;
> +
> +	lockdep_assert_irqs_disabled();
> +	state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> +	switch_mm_irqs_off(NULL, mm, current);
> +	return state;
> +}
> +
> +static inline void unuse_temporary_mm(temporary_mm_state_t prev)
> +{
> +	lockdep_assert_irqs_disabled();
> +	switch_mm_irqs_off(NULL, prev.prev, current);
> +}
> +
>  #endif /* _ASM_X86_MMU_CONTEXT_H */
> -- 
> 2.17.1
> 

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH v2 03/20] x86/mm: temporary mm struct
  2019-01-31 11:29   ` Borislav Petkov
@ 2019-01-31 22:19     ` Nadav Amit
  2019-02-01  0:08       ` Borislav Petkov
  2019-02-04 14:28       ` Borislav Petkov
  0 siblings, 2 replies; 71+ messages in thread
From: Nadav Amit @ 2019-01-31 22:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen

> On Jan 31, 2019, at 3:29 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
>> Subject: Re: [PATCH v2 03/20] x86/mm: temporary mm struct
> 
> Subject needs a verb: "Add a temporary... "
> 
> On Mon, Jan 28, 2019 at 04:34:05PM -0800, Rick Edgecombe wrote:
>> From: Andy Lutomirski <luto@kernel.org>
>> 
>> Sometimes we want to set a temporary page-table entries (PTEs) in one of
> 
> s/a //
> 
> Also, drop the "we" and make it impartial and passive:
> 
> "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
>  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
>  to do frotz", as if you are giving orders to the codebase to change
>  its behaviour."
> 
>> the cores, without allowing other cores to use - even speculatively -
>> these mappings. There are two benefits for doing so:
>> 
>> (1) Security: if sensitive PTEs are set, temporary mm prevents their use
>> in other cores. This hardens the security as it prevents exploding a
> 
> exploding or exploiting? Or exposing? :)
> 
>> dangling pointer to overwrite sensitive data using the sensitive PTE.
>> 
>> (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
>> remote page-tables.
> 
> Those belong in the code comments below, explaining what it is going to
> be used for.

I will add it to the code as well.

> 
>> To do so a temporary mm_struct can be used. Mappings which are private
>> for this mm can be set in the userspace part of the address-space.
>> During the whole time in which the temporary mm is loaded, interrupts
>> must be disabled.
>> 
>> The first use-case for temporary PTEs, which will follow, is for poking
>> the kernel text.
>> 
>> [ Commit message was written by Nadav ]
>> 
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
>> Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>> ---
>> arch/x86/include/asm/mmu_context.h | 32 ++++++++++++++++++++++++++++++
>> 1 file changed, 32 insertions(+)
>> 
>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>> index 19d18fae6ec6..cd0c29e494a6 100644
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -356,4 +356,36 @@ static inline unsigned long __get_current_cr3_fast(void)
>> 	return cr3;
>> }
>> 
>> +typedef struct {
> 
> Why does it have to be a typedef?

Having a different struct can prevent the misuse of using mm_structs in
unuse_temporary_mm() that were not “used” using use_temporary_mm. The
typedef, I presume, can deter users from starting to play with the internal
“private” fields.

> That prev.prev below looks unnecessary, instead of just using prev.
> 
>> +	struct mm_struct *prev;
> 
> Why "prev”?

This is obviously the previous active mm. Feel free to suggest an
alternative name.

>> +} temporary_mm_state_t;
> 
> That's kinda long - it is longer than the function name below.
> temp_mm_state_t not enough?

I will change it.

> 
>> +
>> +/*
>> + * Using a temporary mm allows to set temporary mappings that are not accessible
>> + * by other cores. Such mappings are needed to perform sensitive memory writes
>> + * that override the kernel memory protections (e.g., W^X), without exposing the
>> + * temporary page-table mappings that are required for these write operations to
>> + * other cores.
>> + *
>> + * Context: The temporary mm needs to be used exclusively by a single core. To
>> + *          harden security IRQs must be disabled while the temporary mm is
> 			      ^
> 			      ,
> 
>> + *          loaded, thereby preventing interrupt handler bugs from override the
> 
> s/override/overriding/

I will fix all of these typos, comment. Thank you.

Meta-question: could you please review the entire patch-set? This is
actually v9 of this particular patch - it was part of a separate patch-set
before. I don’t think that the patch has changed since (the real) v1.

These sporadic comments after each version really makes it hard to get this
work completed.



* Re: [PATCH v2 03/20] x86/mm: temporary mm struct
  2019-01-31 22:19     ` Nadav Amit
@ 2019-02-01  0:08       ` Borislav Petkov
  2019-02-01  0:25         ` Nadav Amit
  2019-02-04 14:28       ` Borislav Petkov
  1 sibling, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-02-01  0:08 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen

On Thu, Jan 31, 2019 at 10:19:54PM +0000, Nadav Amit wrote:
> Meta-question: could you please review the entire patch-set? This is
> actually v9 of this particular patch - it was part of a separate patch-set
> before. I don’t think that the patch has changed since (the real) v1.
> 
> These sporadic comments after each version really makes it hard to get this
> work completed.

Sorry but where I am the day has only 24 hours and this patchset is not
the only one in my overflowing mbox. If my sporadic comments are making
it hard to finish your work, I better not interfere then.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH v2 03/20] x86/mm: temporary mm struct
  2019-02-01  0:08       ` Borislav Petkov
@ 2019-02-01  0:25         ` Nadav Amit
  0 siblings, 0 replies; 71+ messages in thread
From: Nadav Amit @ 2019-02-01  0:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen

> On Jan 31, 2019, at 4:08 PM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Thu, Jan 31, 2019 at 10:19:54PM +0000, Nadav Amit wrote:
>> Meta-question: could you please review the entire patch-set? This is
>> actually v9 of this particular patch - it was part of a separate patch-set
>> before. I don’t think that the patch has changed since (the real) v1.
>> 
>> These sporadic comments after each version really makes it hard to get this
>> work completed.
> 
> Sorry but where I am the day has only 24 hours and this patchset is not
> the only one in my overflowing mbox. If my sporadic comments are making
> it hard to finish your work, I better not interfere then.

I certainly did not intend for it to sound this way, and your feedback is
obviously valuable.

Just let me know when you are done reviewing the patch-set, so I will not
overflow your mailbox with even unnecessary versions of these patches. :)



* Re: [PATCH v2 03/20] x86/mm: temporary mm struct
  2019-01-31 22:19     ` Nadav Amit
  2019-02-01  0:08       ` Borislav Petkov
@ 2019-02-04 14:28       ` Borislav Petkov
  1 sibling, 0 replies; 71+ messages in thread
From: Borislav Petkov @ 2019-02-04 14:28 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen

On Thu, Jan 31, 2019 at 10:19:54PM +0000, Nadav Amit wrote:
> Having a different struct can prevent the misuse of using mm_structs in
> unuse_temporary_mm() that were not “used” using use_temporary_mm. The
> typedef, I presume, can deter users from starting to play with the internal
> “private” fields.

Ok, makes sense.

> > That prev.prev below looks unnecessary, instead of just using prev.
> > 
> >> +	struct mm_struct *prev;
> > 
> > Why "prev”?
> 
> This is obviously the previous active mm. Feel free to suggest an
> alternative name.

Well, when I look at the typedef I'm wondering why is it called "prev"
but I guess this is to mean that it will be saving the previously used
mm, so ack.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH v2 04/20] fork: provide a function for copying init_mm
  2019-01-29  0:34 ` [PATCH v2 04/20] fork: provide a function for copying init_mm Rick Edgecombe
@ 2019-02-05  8:53   ` Borislav Petkov
  2019-02-05  9:03     ` Nadav Amit
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-02-05  8:53 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, x86, hpa,
	Thomas Gleixner, Nadav Amit, Dave Hansen, Peter Zijlstra,
	linux_dti, linux-integrity, linux-security-module, akpm,
	kernel-hardening, linux-mm, will.deacon, ard.biesheuvel, kristen,
	deneen.t.dock, Nadav Amit, Kees Cook, Dave Hansen

On Mon, Jan 28, 2019 at 04:34:06PM -0800, Rick Edgecombe wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Provide a function for copying init_mm. This function will be later used
> for setting a temporary mm.
> 
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
> Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  include/linux/sched/task.h |  1 +
>  kernel/fork.c              | 24 ++++++++++++++++++------
>  2 files changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 44c6f15800ff..c5a00a7b3beb 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -76,6 +76,7 @@ extern void exit_itimers(struct signal_struct *);
>  extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
>  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>  struct task_struct *fork_idle(int);
> +struct mm_struct *copy_init_mm(void);
>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
>  extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b69248e6f0e0..d7b156c49f29 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1299,13 +1299,20 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
>  		complete_vfork_done(tsk);
>  }
>  
> -/*
> - * Allocate a new mm structure and copy contents from the
> - * mm structure of the passed in task structure.
> +/**
> + * dup_mm() - duplicates an existing mm structure
> + * @tsk: the task_struct with which the new mm will be associated.
> + * @oldmm: the mm to duplicate.
> + *
> + * Allocates a new mm structure and copy contents from the provided

s/copy/copies/

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 04/20] fork: provide a function for copying init_mm
  2019-02-05  8:53   ` Borislav Petkov
@ 2019-02-05  9:03     ` Nadav Amit
  0 siblings, 0 replies; 71+ messages in thread
From: Nadav Amit @ 2019-02-05  9:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen

> On Feb 5, 2019, at 12:53 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Mon, Jan 28, 2019 at 04:34:06PM -0800, Rick Edgecombe wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> - * Allocate a new mm structure and copy contents from the
>> - * mm structure of the passed in task structure.
>> +/**
>> + * dup_mm() - duplicates an existing mm structure
>> + * @tsk: the task_struct with which the new mm will be associated.
>> + * @oldmm: the mm to duplicate.
>> + *
>> + * Allocates a new mm structure and copy contents from the provided
> 
> s/copy/copies/

Thanks, applied (I revised this sentence a bit).

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching
  2019-01-29  0:34 ` [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching Rick Edgecombe
@ 2019-02-05  9:18   ` Borislav Petkov
  2019-02-11  0:39   ` Nadav Amit
  1 sibling, 0 replies; 71+ messages in thread
From: Borislav Petkov @ 2019-02-05  9:18 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, x86, hpa,
	Thomas Gleixner, Nadav Amit, Dave Hansen, Peter Zijlstra,
	linux_dti, linux-integrity, linux-security-module, akpm,
	kernel-hardening, linux-mm, will.deacon, ard.biesheuvel, kristen,
	deneen.t.dock, Nadav Amit, Kees Cook, Dave Hansen

Just nitpicks:

Subject: [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching

s/initailizing/Initialize/

On Mon, Jan 28, 2019 at 04:34:07PM -0800, Rick Edgecombe wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> To prevent improper use of the PTEs that are used for text patching, we
> want to use a temporary mm struct. We initailize it by copying the init

Please remove the "we" from commit messages and use impartial, passive
formulations.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/20] x86/alternative: use temporary mm for text poking
  2019-01-29  0:34 ` [PATCH v2 06/20] x86/alternative: use temporary mm for text poking Rick Edgecombe
@ 2019-02-05  9:58   ` Borislav Petkov
  2019-02-05 11:31     ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-02-05  9:58 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, x86, hpa,
	Thomas Gleixner, Nadav Amit, Dave Hansen, Peter Zijlstra,
	linux_dti, linux-integrity, linux-security-module, akpm,
	kernel-hardening, linux-mm, will.deacon, ard.biesheuvel, kristen,
	deneen.t.dock, Nadav Amit, Kees Cook, Dave Hansen,
	Masami Hiramatsu

On Mon, Jan 28, 2019 at 04:34:08PM -0800, Rick Edgecombe wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> text_poke() can potentially compromise the security as it sets temporary

s/the //

> PTEs in the fixmap. These PTEs might be used to rewrite the kernel code
> from other cores accidentally or maliciously, if an attacker gains the
> ability to write onto kernel memory.

Eww, sneaky. That would be a really nasty attack.

> Moreover, since remote TLBs are not flushed after the temporary PTEs are
> removed, the time-window in which the code is writable is not limited if
> the fixmap PTEs - maliciously or accidentally - are cached in the TLB.
> To address these potential security hazards, we use a temporary mm for
> patching the code.
> 
> Finally, text_poke() is also not conservative enough when mapping pages,
> as it always tries to map 2 pages, even when a single one is sufficient.
> So try to be more conservative, and do not map more than needed.
> 
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/include/asm/fixmap.h |   2 -
>  arch/x86/kernel/alternative.c | 106 +++++++++++++++++++++++++++-------
>  arch/x86/xen/mmu_pv.c         |   2 -
>  3 files changed, 84 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index 50ba74a34a37..9da8cccdf3fb 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -103,8 +103,6 @@ enum fixed_addresses {
>  #ifdef CONFIG_PARAVIRT
>  	FIX_PARAVIRT_BOOTMAP,
>  #endif
> -	FIX_TEXT_POKE1,	/* reserve 2 pages for text_poke() */
> -	FIX_TEXT_POKE0, /* first page is last, because allocation is backward */

Two fixmap slots less - good riddance. :)

>  #ifdef	CONFIG_X86_INTEL_MID
>  	FIX_LNW_VRTC,
>  #endif
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index ae05fbb50171..76d482a2b716 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -11,6 +11,7 @@
>  #include <linux/stop_machine.h>
>  #include <linux/slab.h>
>  #include <linux/kdebug.h>
> +#include <linux/mmu_context.h>
>  #include <asm/text-patching.h>
>  #include <asm/alternative.h>
>  #include <asm/sections.h>
> @@ -683,41 +684,102 @@ __ro_after_init unsigned long poking_addr;
>  
>  static void *__text_poke(void *addr, const void *opcode, size_t len)
>  {
> +	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
> +	temporary_mm_state_t prev;
> +	struct page *pages[2] = {NULL};
>  	unsigned long flags;
> -	char *vaddr;
> -	struct page *pages[2];
> -	int i;
> +	pte_t pte, *ptep;
> +	spinlock_t *ptl;
> +	pgprot_t prot;
>  
>  	/*
> -	 * While boot memory allocator is runnig we cannot use struct
> -	 * pages as they are not yet initialized.
> +	 * While boot memory allocator is running we cannot use struct pages as
> +	 * they are not yet initialized.
>  	 */
>  	BUG_ON(!after_bootmem);
>  
>  	if (!core_kernel_text((unsigned long)addr)) {
>  		pages[0] = vmalloc_to_page(addr);
> -		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
> +		if (cross_page_boundary)
> +			pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
>  	} else {
>  		pages[0] = virt_to_page(addr);
>  		WARN_ON(!PageReserved(pages[0]));
> -		pages[1] = virt_to_page(addr + PAGE_SIZE);
> +		if (cross_page_boundary)
> +			pages[1] = virt_to_page(addr + PAGE_SIZE);
>  	}
> -	BUG_ON(!pages[0]);
> +	BUG_ON(!pages[0] || (cross_page_boundary && !pages[1]));

checkpatch fires a lot for this patchset and I think we should tone down
the BUG_ON() use.

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
#116: FILE: arch/x86/kernel/alternative.c:711:
+       BUG_ON(!pages[0] || (cross_page_boundary && !pages[1]));

While the below BUG_ON makes sense, this here could be a WARN_ON or so.

Which begs the next question: AFAICT, nothing looks at text_poke*()'s
retval. So why are we even bothering returning something?

> +
>  	local_irq_save(flags);
> -	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
> -	if (pages[1])
> -		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
> -	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
> -	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
> -	clear_fixmap(FIX_TEXT_POKE0);
> -	if (pages[1])
> -		clear_fixmap(FIX_TEXT_POKE1);
> -	local_flush_tlb();
> -	sync_core();
> -	/* Could also do a CLFLUSH here to speed up CPU recovery; but
> -	   that causes hangs on some VIA CPUs. */
> -	for (i = 0; i < len; i++)
> -		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
> +
> +	/*
> +	 * The lock is not really needed, but this allows to avoid open-coding.
> +	 */
> +	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
> +
> +	/*
> +	 * This must not fail; preallocated in poking_init().
> +	 */
> +	VM_BUG_ON(!ptep);
> +
> +	/*
> +	 * flush_tlb_mm_range() would be called when the poking_mm is not
> +	 * loaded. When PCID is in use, the flush would be deferred to the time
> +	 * the poking_mm is loaded again. Set the PTE as non-global to prevent
> +	 * it from being used when we are done.
> +	 */
> +	prot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);

So

				_KERNPG_TABLE | _PAGE_NX

as this is pagetable page, AFAICT.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/20] x86/alternative: use temporary mm for text poking
  2019-02-05  9:58   ` Borislav Petkov
@ 2019-02-05 11:31     ` Peter Zijlstra
  2019-02-05 12:35       ` Borislav Petkov
  2019-02-05 13:29       ` Peter Zijlstra
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2019-02-05 11:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, linux-kernel, x86,
	hpa, Thomas Gleixner, Nadav Amit, Dave Hansen, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Masami Hiramatsu

On Tue, Feb 05, 2019 at 10:58:53AM +0100, Borislav Petkov wrote:
> > @@ -683,41 +684,102 @@ __ro_after_init unsigned long poking_addr;
> >  
> >  static void *__text_poke(void *addr, const void *opcode, size_t len)
> >  {
> > +	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
> > +	temporary_mm_state_t prev;
> > +	struct page *pages[2] = {NULL};
> >  	unsigned long flags;
> > -	char *vaddr;
> > -	struct page *pages[2];
> > -	int i;
> > +	pte_t pte, *ptep;
> > +	spinlock_t *ptl;
> > +	pgprot_t prot;
> >  
> >  	/*
> > -	 * While boot memory allocator is runnig we cannot use struct
> > -	 * pages as they are not yet initialized.
> > +	 * While boot memory allocator is running we cannot use struct pages as
> > +	 * they are not yet initialized.
> >  	 */
> >  	BUG_ON(!after_bootmem);
> >  
> >  	if (!core_kernel_text((unsigned long)addr)) {
> >  		pages[0] = vmalloc_to_page(addr);
> > -		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
> > +		if (cross_page_boundary)
> > +			pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
> >  	} else {
> >  		pages[0] = virt_to_page(addr);
> >  		WARN_ON(!PageReserved(pages[0]));
> > -		pages[1] = virt_to_page(addr + PAGE_SIZE);
> > +		if (cross_page_boundary)
> > +			pages[1] = virt_to_page(addr + PAGE_SIZE);
> >  	}
> > -	BUG_ON(!pages[0]);
> > +	BUG_ON(!pages[0] || (cross_page_boundary && !pages[1]));
> 
> checkpatch fires a lot for this patchset and I think we should tone down
> the BUG_ON() use.

I've been pushing for BUG_ON() in this patch set; sod checkpatch.

Maybe not this BUG_ON in particular, but a number of them introduced
here are really situations where we can't do anything sane.

This BUG_ON() in particular is the choice between corrupted text or an
instantly dead machine; what would you do?

In general, text_poke() cannot fail:

 - suppose changing a single jump label requires poking multiple sites
   (not uncommon), we fail halfway through and then have to undo the
   first pokes, but those pokes fail again.

 - this then leaves us no way forward and no way back, we've got
   inconsistent text state -> FAIL.

So even an 'early' fail (like here) doesn't work in the rollback
scenario if you combine them.

So while in general I agree with BUG_ON() being undesirable, I think
liberal sprinkling in text_poke() is fine; you really _REALLY_ want this
to work or fail loudly. Text corruption is just painful.
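
As an illustration of the rollback problem in code form (poke_site() and
old_insn() are hypothetical helpers, not kernel API):

	/* Poke a batch of sites for one jump label; a failing poke is fatal. */
	static int poke_label_sites(void **site, const void *opcode, size_t len,
				    int nr)
	{
		int i;

		for (i = 0; i < nr; i++) {
			if (poke_site(site[i], opcode, len))
				goto rollback;
		}
		return 0;

	rollback:
		/*
		 * Undoing the earlier pokes needs the very machinery that just
		 * failed; if a rollback poke also fails, the text is left
		 * inconsistent, with no way forward and no way back.
		 */
		while (i--)
			poke_site(site[i], old_insn(site[i]), len);
		return -EFAULT;
	}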

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/20] x86/alternative: use temporary mm for text poking
  2019-02-05 11:31     ` Peter Zijlstra
@ 2019-02-05 12:35       ` Borislav Petkov
  2019-02-05 13:25         ` Peter Zijlstra
  2019-02-05 17:54         ` Nadav Amit
  2019-02-05 13:29       ` Peter Zijlstra
  1 sibling, 2 replies; 71+ messages in thread
From: Borislav Petkov @ 2019-02-05 12:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, linux-kernel, x86,
	hpa, Thomas Gleixner, Nadav Amit, Dave Hansen, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Masami Hiramatsu

On Tue, Feb 05, 2019 at 12:31:46PM +0100, Peter Zijlstra wrote:
> ...
>
> So while in general I agree with BUG_ON() being undesirable, I think
> liberal sprinkling in text_poke() is fine; you really _REALLY_ want this
> to work or fail loudly. Text corruption is just painful.

Ok. It would be good to have the gist of this sentiment in a comment
above it so that it is absolutely clear why we're doing it.

And since text_poke() can't fail, then it doesn't need a retval too.
AFAICT, nothing is actually using it.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/20] x86/alternative: use temporary mm for text poking
  2019-02-05 12:35       ` Borislav Petkov
@ 2019-02-05 13:25         ` Peter Zijlstra
  2019-02-05 17:54         ` Nadav Amit
  1 sibling, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2019-02-05 13:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, linux-kernel, x86,
	hpa, Thomas Gleixner, Nadav Amit, Dave Hansen, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Masami Hiramatsu

On Tue, Feb 05, 2019 at 01:35:33PM +0100, Borislav Petkov wrote:
> On Tue, Feb 05, 2019 at 12:31:46PM +0100, Peter Zijlstra wrote:
> > ...
> >
> > So while in general I agree with BUG_ON() being undesirable, I think
> > liberal sprinkling in text_poke() is fine; you really _REALLY_ want this
> > to work or fail loudly. Text corruption is just painful.
> 
> Ok. It would be good to have the gist of this sentiment in a comment
> above it so that it is absolutely clear why we're doing it.
> 
> And since text_poke() can't fail, then it doesn't need a retval too.
> AFAICT, nothing is actually using it.

See patch 12, that removes the return value (after fixing the few users
that currently 'rely' on it).

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/20] x86/alternative: use temporary mm for text poking
  2019-02-05 11:31     ` Peter Zijlstra
  2019-02-05 12:35       ` Borislav Petkov
@ 2019-02-05 13:29       ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2019-02-05 13:29 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, linux-kernel, x86,
	hpa, Thomas Gleixner, Nadav Amit, Dave Hansen, linux_dti,
	linux-integrity, linux-security-module, akpm, kernel-hardening,
	linux-mm, will.deacon, ard.biesheuvel, kristen, deneen.t.dock,
	Nadav Amit, Kees Cook, Dave Hansen, Masami Hiramatsu

On Tue, Feb 05, 2019 at 12:31:46PM +0100, Peter Zijlstra wrote:
> In general, text_poke() cannot fail:
> 
>  - suppose changing a single jump label requires poking multiple sites
>    (not uncommon), we fail halfway through and then have to undo the
>    first pokes, but those pokes fail again.
> 
>  - this then leaves us no way forward and no way back, we've got
>    inconsistent text state -> FAIL.

Note that this exact fail scenario still exists in the CPU hotplug code.
See kernel/cpu.c:cpuhp_thread_fun():

		/*
		 * If we fail on a rollback, we're up a creek without no
		 * paddle, no way forward, no way back. We loose, thanks for
		 * playing.
		 */
		WARN_ON_ONCE(st->rollback);

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/20] x86/alternative: use temporary mm for text poking
  2019-02-05 12:35       ` Borislav Petkov
  2019-02-05 13:25         ` Peter Zijlstra
@ 2019-02-05 17:54         ` Nadav Amit
  1 sibling, 0 replies; 71+ messages in thread
From: Nadav Amit @ 2019-02-05 17:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Rick Edgecombe, Andy Lutomirski, Ingo Molnar,
	LKML, X86 ML, H. Peter Anvin, Thomas Gleixner, Dave Hansen,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

> On Feb 5, 2019, at 4:35 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Tue, Feb 05, 2019 at 12:31:46PM +0100, Peter Zijlstra wrote:
>> ...
>> 
>> So while in general I agree with BUG_ON() being undesirable, I think
>> liberal sprinkling in text_poke() is fine; you really _REALLY_ want this
>> to work or fail loudly. Text corruption is just painful.
> 
> Ok. It would be good to have the gist of this sentiment in a comment
> above it so that it is absolutely clear why we're doing it.

I added a short comment for v3 above each BUG_ON().

> And since text_poke() can't fail, then it doesn't need a retval too.
> AFAICT, nothing is actually using it.

As Peter said, this is addressed in a separate patch (one patch per logical
change).

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching
  2019-01-29  0:34 ` [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching Rick Edgecombe
  2019-02-05  9:18   ` Borislav Petkov
@ 2019-02-11  0:39   ` Nadav Amit
  2019-02-11  5:18     ` Andy Lutomirski
  1 sibling, 1 reply; 71+ messages in thread
From: Nadav Amit @ 2019-02-11  0:39 UTC (permalink / raw)
  To: Rick Edgecombe, Andy Lutomirski
  Cc: Ingo Molnar, LKML, X86 ML, H. Peter Anvin, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Peter Zijlstra, Damian Tometzki,
	linux-integrity, LSM List, Andrew Morton, Kernel Hardening,
	Linux-MM, Will Deacon, Ard Biesheuvel, Kristen Carlson Accardi,
	Dock, Deneen T, Kees Cook, Dave Hansen

> On Jan 28, 2019, at 4:34 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> 
> From: Nadav Amit <namit@vmware.com>
> 
> To prevent improper use of the PTEs that are used for text patching, we
> want to use a temporary mm struct. We initailize it by copying the init
> mm.
> 
> The address that will be used for patching is taken from the lower area
> that is usually used for the task memory. Doing so prevents the need to
> frequently synchronize the temporary-mm (e.g., when BPF programs are
> installed), since different PGDs are used for the task memory.
> 
> Finally, we randomize the address of the PTEs to harden against exploits
> that use these PTEs.
> 
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
> Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> arch/x86/include/asm/pgtable.h       |  3 +++
> arch/x86/include/asm/text-patching.h |  2 ++
> arch/x86/kernel/alternative.c        |  3 +++
> arch/x86/mm/init_64.c                | 36 ++++++++++++++++++++++++++++
> init/main.c                          |  3 +++
> 5 files changed, 47 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 40616e805292..e8f630d9a2ed 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1021,6 +1021,9 @@ static inline void __meminit init_trampoline_default(void)
> 	/* Default trampoline pgd value */
> 	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
> }
> +
> +void __init poking_init(void);
> +
> # ifdef CONFIG_RANDOMIZE_MEMORY
> void __meminit init_trampoline(void);
> # else
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index f8fc8e86cf01..a75eed841eed 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -39,5 +39,7 @@ extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> extern int after_bootmem;
> +extern __ro_after_init struct mm_struct *poking_mm;
> +extern __ro_after_init unsigned long poking_addr;
> 
> #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 12fddbc8c55b..ae05fbb50171 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -678,6 +678,9 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
> 	return addr;
> }
> 
> +__ro_after_init struct mm_struct *poking_mm;
> +__ro_after_init unsigned long poking_addr;
> +
> static void *__text_poke(void *addr, const void *opcode, size_t len)
> {
> 	unsigned long flags;
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index bccff68e3267..125c8c48aa24 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -53,6 +53,7 @@
> #include <asm/init.h>
> #include <asm/uv/uv.h>
> #include <asm/setup.h>
> +#include <asm/text-patching.h>
> 
> #include "mm_internal.h"
> 
> @@ -1383,6 +1384,41 @@ unsigned long memory_block_size_bytes(void)
> 	return memory_block_size_probed;
> }
> 
> +/*
> + * Initialize an mm_struct to be used during poking and a pointer to be used
> + * during patching.
> + */
> +void __init poking_init(void)
> +{
> +	spinlock_t *ptl;
> +	pte_t *ptep;
> +
> +	poking_mm = copy_init_mm();
> +	BUG_ON(!poking_mm);
> +
> +	/*
> +	 * Randomize the poking address, but make sure that the following page
> +	 * will be mapped at the same PMD. We need 2 pages, so find space for 3,
> +	 * and adjust the address if the PMD ends after the first one.
> +	 */
> +	poking_addr = TASK_UNMAPPED_BASE;
> +	if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
> +		poking_addr += (kaslr_get_random_long("Poking") & PAGE_MASK) %
> +			(TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
> +
> +	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
> +		poking_addr += PAGE_SIZE;

Further thinking about it, I think that allocating the virtual address for
poking from user address-range is problematic. The user can set watchpoints
on different addresses, cause some static-keys to be enabled/disabled, and
monitor the signals to derandomize the poking address.

Andy, I think you were pushing this change. Can I go back to use a vmalloc’d
address instead, or do you have a better solution? I prefer not to
save/restore DR7, of course.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching
  2019-02-11  0:39   ` Nadav Amit
@ 2019-02-11  5:18     ` Andy Lutomirski
  2019-02-11 18:04       ` Nadav Amit
  0 siblings, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2019-02-11  5:18 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Damian Tometzki, linux-integrity, LSM List,
	Andrew Morton, Kernel Hardening, Linux-MM, Will Deacon,
	Ard Biesheuvel, Kristen Carlson Accardi, Dock, Deneen T,
	Kees Cook, Dave Hansen



On Feb 10, 2019, at 4:39 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

>> On Jan 28, 2019, at 4:34 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
>> 
>> From: Nadav Amit <namit@vmware.com>
>> 
>> To prevent improper use of the PTEs that are used for text patching, we
>> want to use a temporary mm struct. We initailize it by copying the init
>> mm.
>> 
>> The address that will be used for patching is taken from the lower area
>> that is usually used for the task memory. Doing so prevents the need to
>> frequently synchronize the temporary-mm (e.g., when BPF programs are
>> installed), since different PGDs are used for the task memory.
>> 
>> Finally, we randomize the address of the PTEs to harden against exploits
>> that use these PTEs.
>> 
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
>> Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
>> Suggested-by: Andy Lutomirski <luto@kernel.org>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>> ---
>> arch/x86/include/asm/pgtable.h       |  3 +++
>> arch/x86/include/asm/text-patching.h |  2 ++
>> arch/x86/kernel/alternative.c        |  3 +++
>> arch/x86/mm/init_64.c                | 36 ++++++++++++++++++++++++++++
>> init/main.c                          |  3 +++
>> 5 files changed, 47 insertions(+)
>> 
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 40616e805292..e8f630d9a2ed 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1021,6 +1021,9 @@ static inline void __meminit init_trampoline_default(void)
>>    /* Default trampoline pgd value */
>>    trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
>> }
>> +
>> +void __init poking_init(void);
>> +
>> # ifdef CONFIG_RANDOMIZE_MEMORY
>> void __meminit init_trampoline(void);
>> # else
>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
>> index f8fc8e86cf01..a75eed841eed 100644
>> --- a/arch/x86/include/asm/text-patching.h
>> +++ b/arch/x86/include/asm/text-patching.h
>> @@ -39,5 +39,7 @@ extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
>> extern int poke_int3_handler(struct pt_regs *regs);
>> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>> extern int after_bootmem;
>> +extern __ro_after_init struct mm_struct *poking_mm;
>> +extern __ro_after_init unsigned long poking_addr;
>> 
>> #endif /* _ASM_X86_TEXT_PATCHING_H */
>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>> index 12fddbc8c55b..ae05fbb50171 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -678,6 +678,9 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
>>    return addr;
>> }
>> 
>> +__ro_after_init struct mm_struct *poking_mm;
>> +__ro_after_init unsigned long poking_addr;
>> +
>> static void *__text_poke(void *addr, const void *opcode, size_t len)
>> {
>>    unsigned long flags;
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index bccff68e3267..125c8c48aa24 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -53,6 +53,7 @@
>> #include <asm/init.h>
>> #include <asm/uv/uv.h>
>> #include <asm/setup.h>
>> +#include <asm/text-patching.h>
>> 
>> #include "mm_internal.h"
>> 
>> @@ -1383,6 +1384,41 @@ unsigned long memory_block_size_bytes(void)
>>    return memory_block_size_probed;
>> }
>> 
>> +/*
>> + * Initialize an mm_struct to be used during poking and a pointer to be used
>> + * during patching.
>> + */
>> +void __init poking_init(void)
>> +{
>> +    spinlock_t *ptl;
>> +    pte_t *ptep;
>> +
>> +    poking_mm = copy_init_mm();
>> +    BUG_ON(!poking_mm);
>> +
>> +    /*
>> +     * Randomize the poking address, but make sure that the following page
>> +     * will be mapped at the same PMD. We need 2 pages, so find space for 3,
>> +     * and adjust the address if the PMD ends after the first one.
>> +     */
>> +    poking_addr = TASK_UNMAPPED_BASE;
>> +    if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
>> +        poking_addr += (kaslr_get_random_long("Poking") & PAGE_MASK) %
>> +            (TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
>> +
>> +    if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
>> +        poking_addr += PAGE_SIZE;
> 
> Further thinking about it, I think that allocating the virtual address for
> poking from user address-range is problematic. The user can set watchpoints
> on different addresses, cause some static-keys to be enabled/disabled, and
> monitor the signals to derandomize the poking address.
> 

Hmm, I hadn’t thought about watchpoints. I’m not sure how much we care
about possible derandomization like this, but we certainly don’t want to
send signals or otherwise malfunction.

> Andy, I think you were pushing this change. Can I go back to use a vmalloc’d
> address instead, or do you have a better solution?

Hmm. If we use a vmalloc address, we have to make sure it’s not actually
allocated. I suppose we could allocate one once at boot and use that. We
also have the problem that the usual APIs for handling “user” addresses
might assume they’re actually in the user range, although this seems
unlikely to be a problem in practice. More seriously, though, the code
that manipulates per-mm paging structures assumes that *all* of the
structures up to the top level are per-mm, and, if we use anything less
than a private pgd, this isn’t the case.

> I prefer not to
> save/restore DR7, of course.
> 

I suspect we may want to use the temporary mm concept for EFI, too, so we
may want to just suck it up and save/restore DR7. But only if a watchpoint
is in use, of course. I have an old patch I could dust off that tracks DR7
to make things like this efficient.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching
  2019-02-11  5:18     ` Andy Lutomirski
@ 2019-02-11 18:04       ` Nadav Amit
  2019-02-11 19:07         ` Andy Lutomirski
  0 siblings, 1 reply; 71+ messages in thread
From: Nadav Amit @ 2019-02-11 18:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Damian Tometzki, linux-integrity, LSM List,
	Andrew Morton, Kernel Hardening, Linux-MM, Will Deacon,
	Ard Biesheuvel, Kristen Carlson Accardi, Dock, Deneen T,
	Kees Cook, Dave Hansen

> On Feb 10, 2019, at 9:18 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
> On Feb 10, 2019, at 4:39 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>>> On Jan 28, 2019, at 4:34 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
>>> 
>>> From: Nadav Amit <namit@vmware.com>
>>> 
>>> To prevent improper use of the PTEs that are used for text patching, we
>>> want to use a temporary mm struct. We initailize it by copying the init
>>> mm.
>>> 
>>> The address that will be used for patching is taken from the lower area
>>> that is usually used for the task memory. Doing so prevents the need to
>>> frequently synchronize the temporary-mm (e.g., when BPF programs are
>>> installed), since different PGDs are used for the task memory.
>>> 
>>> Finally, we randomize the address of the PTEs to harden against exploits
>>> that use these PTEs.
>>> 
>>> Cc: Kees Cook <keescook@chromium.org>
>>> Cc: Dave Hansen <dave.hansen@intel.com>
>>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
>>> Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
>>> Suggested-by: Andy Lutomirski <luto@kernel.org>
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> ---
>>> arch/x86/include/asm/pgtable.h       |  3 +++
>>> arch/x86/include/asm/text-patching.h |  2 ++
>>> arch/x86/kernel/alternative.c        |  3 +++
>>> arch/x86/mm/init_64.c                | 36 ++++++++++++++++++++++++++++
>>> init/main.c                          |  3 +++
>>> 5 files changed, 47 insertions(+)
>>> 
>>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>>> index 40616e805292..e8f630d9a2ed 100644
>>> --- a/arch/x86/include/asm/pgtable.h
>>> +++ b/arch/x86/include/asm/pgtable.h
>>> @@ -1021,6 +1021,9 @@ static inline void __meminit init_trampoline_default(void)
>>>   /* Default trampoline pgd value */
>>>   trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
>>> }
>>> +
>>> +void __init poking_init(void);
>>> +
>>> # ifdef CONFIG_RANDOMIZE_MEMORY
>>> void __meminit init_trampoline(void);
>>> # else
>>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
>>> index f8fc8e86cf01..a75eed841eed 100644
>>> --- a/arch/x86/include/asm/text-patching.h
>>> +++ b/arch/x86/include/asm/text-patching.h
>>> @@ -39,5 +39,7 @@ extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
>>> extern int poke_int3_handler(struct pt_regs *regs);
>>> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>>> extern int after_bootmem;
>>> +extern __ro_after_init struct mm_struct *poking_mm;
>>> +extern __ro_after_init unsigned long poking_addr;
>>> 
>>> #endif /* _ASM_X86_TEXT_PATCHING_H */
>>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>>> index 12fddbc8c55b..ae05fbb50171 100644
>>> --- a/arch/x86/kernel/alternative.c
>>> +++ b/arch/x86/kernel/alternative.c
>>> @@ -678,6 +678,9 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
>>>   return addr;
>>> }
>>> 
>>> +__ro_after_init struct mm_struct *poking_mm;
>>> +__ro_after_init unsigned long poking_addr;
>>> +
>>> static void *__text_poke(void *addr, const void *opcode, size_t len)
>>> {
>>>   unsigned long flags;
>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>> index bccff68e3267..125c8c48aa24 100644
>>> --- a/arch/x86/mm/init_64.c
>>> +++ b/arch/x86/mm/init_64.c
>>> @@ -53,6 +53,7 @@
>>> #include <asm/init.h>
>>> #include <asm/uv/uv.h>
>>> #include <asm/setup.h>
>>> +#include <asm/text-patching.h>
>>> 
>>> #include "mm_internal.h"
>>> 
>>> @@ -1383,6 +1384,41 @@ unsigned long memory_block_size_bytes(void)
>>>   return memory_block_size_probed;
>>> }
>>> 
>>> +/*
>>> + * Initialize an mm_struct to be used during poking and a pointer to be used
>>> + * during patching.
>>> + */
>>> +void __init poking_init(void)
>>> +{
>>> +    spinlock_t *ptl;
>>> +    pte_t *ptep;
>>> +
>>> +    poking_mm = copy_init_mm();
>>> +    BUG_ON(!poking_mm);
>>> +
>>> +    /*
>>> +     * Randomize the poking address, but make sure that the following page
>>> +     * will be mapped at the same PMD. We need 2 pages, so find space for 3,
>>> +     * and adjust the address if the PMD ends after the first one.
>>> +     */
>>> +    poking_addr = TASK_UNMAPPED_BASE;
>>> +    if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
>>> +        poking_addr += (kaslr_get_random_long("Poking") & PAGE_MASK) %
>>> +            (TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
>>> +
>>> +    if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
>>> +        poking_addr += PAGE_SIZE;
>> 
>> Further thinking about it, I think that allocating the virtual address for
>> poking from user address-range is problematic. The user can set watchpoints
>> on different addresses, cause some static-keys to be enabled/disabled, and
>> monitor the signals to derandomize the poking address.
> 
> Hmm, I hadn’t thought about watchpoints. I’m not sure how much we care
> about possible derandomization like this, but we certainly don’t want to
> send signals or otherwise malfunction.
> 
>> Andy, I think you were pushing this change. Can I go back to use a vmalloc’d
>> address instead, or do you have a better solution?
> 
> Hmm. If we use a vmalloc address, we have to make sure it’s not actually
> allocated. I suppose we could allocate one once at boot and use that. We
> also have the problem that the usual APIs for handling “user” addresses
> might assume they’re actually in the user range, although this seems
> unlikely to be a problem in practice. More seriously, though, the code
> that manipulates per-mm paging structures assumes that *all* of the
> structures up to the top level are per-mm, and, if we use anything less
> than a private pgd, this isn’t the case.

I forgot that I only had this conversation in my mind ;-)

Well, I did write some code that kept some vmalloc’d area private, and it
did require more synchronization between the pgd’s. It is still possible
to use another top-level PGD, but … (continued below)

> 
>> I prefer not to
>> save/restore DR7, of course.
> 
> I suspect we may want to use the temporary mm concept for EFI, too, so we
> may want to just suck it up and save/restore DR7. But only if a watchpoint
> is in use, of course. I have an old patch I could dust off that tracks DR7
> to make things like this efficient.

… but, if this is the case, then I will just make (un)use_temporary_mm()
save/restore DR7. I guess you are ok with such a solution. I will
incorporate it into Rick’s v3.
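
A minimal sketch of that idea, assuming the existing hw_breakpoint helpers
from <asm/debugreg.h> (fragments only, not the actual v3 code):

	/* In use_temporary_mm(), after switching to the poking mm: */
	if (hw_breakpoint_active())	/* checks the per-CPU DR7 shadow */
		hw_breakpoint_disable();	/* keep user watchpoints from firing */

	/* ... and in unuse_temporary_mm(), after switching back: */
	if (hw_breakpoint_active())
		hw_breakpoint_restore();	/* re-arm whatever was disabled above */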

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/20] x86/kprobes: instruction pages initialization enhancements
  2019-01-29  0:34 ` [PATCH v2 09/20] x86/kprobes: instruction pages initialization enhancements Rick Edgecombe
@ 2019-02-11 18:22   ` Borislav Petkov
  2019-02-11 19:36     ` Nadav Amit
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-02-11 18:22 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, x86, hpa,
	Thomas Gleixner, Nadav Amit, Dave Hansen, Peter Zijlstra,
	linux_dti, linux-integrity, linux-security-module, akpm,
	kernel-hardening, linux-mm, will.deacon, ard.biesheuvel, kristen,
	deneen.t.dock, Nadav Amit

Only nitpicks:

> Subject: Re: [PATCH v2 09/20] x86/kprobes: instruction pages initialization enhancements

Subject needs a verb.

On Mon, Jan 28, 2019 at 04:34:11PM -0800, Rick Edgecombe wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Make kprobes instruction pages read-only (and executable) after they are
> set to prevent them from mistaken or malicious modifications.
> 
> This is a preparatory patch for a following patch that makes module
> allocated pages non-executable and sets the page as executable after
> allocation.
> 
> While at it, do some small cleanup of what appears to be unnecessary
> masking.
> 
> Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kernel/kprobes/core.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index 4ba75afba527..fac692e36833 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -431,8 +431,20 @@ void *alloc_insn_page(void)
>  	void *page;
>  
>  	page = module_alloc(PAGE_SIZE);
> -	if (page)
> -		set_memory_ro((unsigned long)page & PAGE_MASK, 1);
> +	if (page == NULL)
> +		return NULL;

Null tests we generally do like this:

	if (! ...


like in the rest of this file.

> +
> +	/*
> +	 * First make the page read-only, and then only then make it executable

 s/then only then/only then/

ditto below.

> +	 * to prevent it from being W+X in between.
> +	 */
> +	set_memory_ro((unsigned long)page, 1);
> +
> +	/*
> +	 * TODO: Once additional kernel code protection mechanisms are set, ensure
> +	 * that the page was not maliciously altered and it is still zeroed.
> +	 */
> +	set_memory_x((unsigned long)page, 1);
>  
>  	return page;
>  }
> @@ -440,8 +452,12 @@ void *alloc_insn_page(void)
>  /* Recover page to RW mode before releasing it */
>  void free_insn_page(void *page)
>  {
> -	set_memory_nx((unsigned long)page & PAGE_MASK, 1);
> -	set_memory_rw((unsigned long)page & PAGE_MASK, 1);
> +	/*
> +	 * First make the page non-executable, and then only then make it
> +	 * writable to prevent it from being W+X in between.
> +	 */
> +	set_memory_nx((unsigned long)page, 1);
> +	set_memory_rw((unsigned long)page, 1);
>  	module_memfree(page);
>  }
>  
> -- 

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-01-29  0:34 ` [PATCH v2 10/20] x86: avoid W^X being broken during modules loading Rick Edgecombe
@ 2019-02-11 18:29   ` Borislav Petkov
  2019-02-11 18:45     ` Nadav Amit
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-02-11 18:29 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, x86, hpa,
	Thomas Gleixner, Nadav Amit, Dave Hansen, Peter Zijlstra,
	linux_dti, linux-integrity, linux-security-module, akpm,
	kernel-hardening, linux-mm, will.deacon, ard.biesheuvel, kristen,
	deneen.t.dock, Nadav Amit, Kees Cook, Dave Hansen,
	Masami Hiramatsu

> Subject: Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading

For your next submission, please fix all your subjects:

The tip tree preferred format for patch subject prefixes is
'subsys/component:', e.g. 'x86/apic:', 'x86/mm/fault:', 'sched/fair:',
'genirq/core:'. Please do not use file names or complete file paths as
prefix. 'git log path/to/file' should give you a reasonable hint in most
cases.

The condensed patch description in the subject line should start with an
uppercase letter and should be written in imperative tone.


On Mon, Jan 28, 2019 at 04:34:12PM -0800, Rick Edgecombe wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When modules and BPF filters are loaded, there is a time window in
> which some memory is both writable and executable. An attacker that has
> already found another vulnerability (e.g., a dangling pointer) might be
> able to exploit this behavior to overwrite kernel code.
> 
> Prevent having writable executable PTEs in this stage. In addition,
> avoiding having W+X mappings can also slightly simplify the patching of
> modules code on initialization (e.g., by alternatives and static-key),
> as would be done in the next patch.
> 
> To avoid having W+X mappings, set them initially as RW (NX) and after
> they are set as RO set them as X as well. Setting them as executable is
> done as a separate step to avoid one core in which the old PTE is cached
> (hence writable), and another which sees the updated PTE (executable),
> which would break the W^X protection.
> 
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Suggested-by: Andy Lutomirski <luto@amacapital.net>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kernel/alternative.c | 28 +++++++++++++++++++++-------
>  arch/x86/kernel/module.c      |  2 +-
>  include/linux/filter.h        |  2 +-
>  kernel/module.c               |  5 +++++
>  4 files changed, 28 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 76d482a2b716..69f3e650ada8 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -667,15 +667,29 @@ void __init alternative_instructions(void)
>   * handlers seeing an inconsistent instruction while you patch.
>   */
>  void *__init_or_module text_poke_early(void *addr, const void *opcode,
> -					      size_t len)
> +				       size_t len)
>  {
>  	unsigned long flags;
> -	local_irq_save(flags);
> -	memcpy(addr, opcode, len);
> -	local_irq_restore(flags);
> -	sync_core();
> -	/* Could also do a CLFLUSH here to speed up CPU recovery; but
> -	   that causes hangs on some VIA CPUs. */
> +
> +	if (static_cpu_has(X86_FEATURE_NX) &&

Not a fast path - boot_cpu_has() is fine here.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 11/20] x86/jump-label: remove support for custom poker
  2019-01-29  0:34 ` [PATCH v2 11/20] x86/jump-label: remove support for custom poker Rick Edgecombe
@ 2019-02-11 18:37   ` Borislav Petkov
  0 siblings, 0 replies; 71+ messages in thread
From: Borislav Petkov @ 2019-02-11 18:37 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, x86, hpa,
	Thomas Gleixner, Nadav Amit, Dave Hansen, Peter Zijlstra,
	linux_dti, linux-integrity, linux-security-module, akpm,
	kernel-hardening, linux-mm, will.deacon, ard.biesheuvel, kristen,
	deneen.t.dock, Nadav Amit, Kees Cook, Dave Hansen,
	Masami Hiramatsu

On Mon, Jan 28, 2019 at 04:34:13PM -0800, Rick Edgecombe wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> There are only two types of poking: early and breakpoint based. The use
> of a function pointer to perform poking complicates the code and is
> probably inefficient due to the use of indirect branches.
> 
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kernel/jump_label.c | 24 ++++++++----------------
>  1 file changed, 8 insertions(+), 16 deletions(-)

...

> @@ -80,16 +71,17 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
>  		bug_at((void *)jump_entry_code(entry), line);
>  
>  	/*
> -	 * Make text_poke_bp() a default fallback poker.
> +	 * As long as we're UP and not yet marked RO, we can use
> +	 * text_poke_early; SYSTEM_BOOTING guarantees both, as we switch to
> +	 * SYSTEM_SCHEDULING before going either.

s/going/doing/ ?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-02-11 18:29   ` Borislav Petkov
@ 2019-02-11 18:45     ` Nadav Amit
  2019-02-11 19:01       ` Borislav Petkov
  0 siblings, 1 reply; 71+ messages in thread
From: Nadav Amit @ 2019-02-11 18:45 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

> On Feb 11, 2019, at 10:29 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>> index 76d482a2b716..69f3e650ada8 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -667,15 +667,29 @@ void __init alternative_instructions(void)
>>  * handlers seeing an inconsistent instruction while you patch.
>>  */
>> void *__init_or_module text_poke_early(void *addr, const void *opcode,
>> -					      size_t len)
>> +				       size_t len)
>> {
>> 	unsigned long flags;
>> -	local_irq_save(flags);
>> -	memcpy(addr, opcode, len);
>> -	local_irq_restore(flags);
>> -	sync_core();
>> -	/* Could also do a CLFLUSH here to speed up CPU recovery; but
>> -	   that causes hangs on some VIA CPUs. */
>> +
>> +	if (static_cpu_has(X86_FEATURE_NX) &&
> 
> Not a fast path - boot_cpu_has() is fine here.

Are you sure about that? This path is still used when modules are loaded.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-02-11 18:45     ` Nadav Amit
@ 2019-02-11 19:01       ` Borislav Petkov
  2019-02-11 19:09         ` Nadav Amit
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-02-11 19:01 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

On Mon, Feb 11, 2019 at 10:45:26AM -0800, Nadav Amit wrote:
> Are you sure about that? This path is still used when modules are loaded.

Yes, I'm sure. Loading a module does a gazillion things so saving a
couple of insns - yes, boot_cpu_has() is usually a RIP-relative MOV and a
TEST - doesn't show even as a blip on any radar.
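
As a sketch of the suggested form (the helper name here is hypothetical,
not taken from the patch):

	static bool __init_or_module poking_needs_nx_handling(void)
	{
		/*
		 * Not a fast path: boot_cpu_has() compiles to a RIP-relative
		 * MOV plus TEST, which is plenty for module loading; the
		 * alternatives-patched static_cpu_has() only pays off in hot
		 * paths.
		 */
		return boot_cpu_has(X86_FEATURE_NX);
	}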

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching
  2019-02-11 18:04       ` Nadav Amit
@ 2019-02-11 19:07         ` Andy Lutomirski
  2019-02-11 19:18           ` Nadav Amit
  0 siblings, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2019-02-11 19:07 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Damian Tometzki, linux-integrity, LSM List,
	Andrew Morton, Kernel Hardening, Linux-MM, Will Deacon,
	Ard Biesheuvel, Kristen Carlson Accardi, Dock, Deneen T,
	Kees Cook, Dave Hansen

On Mon, Feb 11, 2019 at 10:05 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > On Feb 10, 2019, at 9:18 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >
> >
> > On Feb 10, 2019, at 4:39 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> >
> >>> On Jan 28, 2019, at 4:34 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> >>>
> >>> From: Nadav Amit <namit@vmware.com>
> >>>
> >>> To prevent improper use of the PTEs that are used for text patching, we
> >>> want to use a temporary mm struct. We initailize it by copying the init
> >>> mm.
> >>>
> >>> The address that will be used for patching is taken from the lower area
> >>> that is usually used for the task memory. Doing so prevents the need to
> >>> frequently synchronize the temporary-mm (e.g., when BPF programs are
> >>> installed), since different PGDs are used for the task memory.
> >>>
> >>> Finally, we randomize the address of the PTEs to harden against exploits
> >>> that use these PTEs.
> >>>
> >>> Cc: Kees Cook <keescook@chromium.org>
> >>> Cc: Dave Hansen <dave.hansen@intel.com>
> >>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >>> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
> >>> Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
> >>> Suggested-by: Andy Lutomirski <luto@kernel.org>
> >>> Signed-off-by: Nadav Amit <namit@vmware.com>
> >>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> >>> ---
> >>> arch/x86/include/asm/pgtable.h       |  3 +++
> >>> arch/x86/include/asm/text-patching.h |  2 ++
> >>> arch/x86/kernel/alternative.c        |  3 +++
> >>> arch/x86/mm/init_64.c                | 36 ++++++++++++++++++++++++++++
> >>> init/main.c                          |  3 +++
> >>> 5 files changed, 47 insertions(+)
> >>>
> >>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> >>> index 40616e805292..e8f630d9a2ed 100644
> >>> --- a/arch/x86/include/asm/pgtable.h
> >>> +++ b/arch/x86/include/asm/pgtable.h
> >>> @@ -1021,6 +1021,9 @@ static inline void __meminit init_trampoline_default(void)
> >>>   /* Default trampoline pgd value */
> >>>   trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
> >>> }
> >>> +
> >>> +void __init poking_init(void);
> >>> +
> >>> # ifdef CONFIG_RANDOMIZE_MEMORY
> >>> void __meminit init_trampoline(void);
> >>> # else
> >>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> >>> index f8fc8e86cf01..a75eed841eed 100644
> >>> --- a/arch/x86/include/asm/text-patching.h
> >>> +++ b/arch/x86/include/asm/text-patching.h
> >>> @@ -39,5 +39,7 @@ extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> >>> extern int poke_int3_handler(struct pt_regs *regs);
> >>> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> >>> extern int after_bootmem;
> >>> +extern __ro_after_init struct mm_struct *poking_mm;
> >>> +extern __ro_after_init unsigned long poking_addr;
> >>>
> >>> #endif /* _ASM_X86_TEXT_PATCHING_H */
> >>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> >>> index 12fddbc8c55b..ae05fbb50171 100644
> >>> --- a/arch/x86/kernel/alternative.c
> >>> +++ b/arch/x86/kernel/alternative.c
> >>> @@ -678,6 +678,9 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
> >>>   return addr;
> >>> }
> >>>
> >>> +__ro_after_init struct mm_struct *poking_mm;
> >>> +__ro_after_init unsigned long poking_addr;
> >>> +
> >>> static void *__text_poke(void *addr, const void *opcode, size_t len)
> >>> {
> >>>   unsigned long flags;
> >>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >>> index bccff68e3267..125c8c48aa24 100644
> >>> --- a/arch/x86/mm/init_64.c
> >>> +++ b/arch/x86/mm/init_64.c
> >>> @@ -53,6 +53,7 @@
> >>> #include <asm/init.h>
> >>> #include <asm/uv/uv.h>
> >>> #include <asm/setup.h>
> >>> +#include <asm/text-patching.h>
> >>>
> >>> #include "mm_internal.h"
> >>>
> >>> @@ -1383,6 +1384,41 @@ unsigned long memory_block_size_bytes(void)
> >>>   return memory_block_size_probed;
> >>> }
> >>>
> >>> +/*
> >>> + * Initialize an mm_struct to be used during poking and a pointer to be used
> >>> + * during patching.
> >>> + */
> >>> +void __init poking_init(void)
> >>> +{
> >>> +    spinlock_t *ptl;
> >>> +    pte_t *ptep;
> >>> +
> >>> +    poking_mm = copy_init_mm();
> >>> +    BUG_ON(!poking_mm);
> >>> +
> >>> +    /*
> >>> +     * Randomize the poking address, but make sure that the following page
> >>> +     * will be mapped at the same PMD. We need 2 pages, so find space for 3,
> >>> +     * and adjust the address if the PMD ends after the first one.
> >>> +     */
> >>> +    poking_addr = TASK_UNMAPPED_BASE;
> >>> +    if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
> >>> +        poking_addr += (kaslr_get_random_long("Poking") & PAGE_MASK) %
> >>> +            (TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
> >>> +
> >>> +    if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
> >>> +        poking_addr += PAGE_SIZE;
> >>
> >> Further thinking about it, I think that allocating the virtual address for
> >> poking from user address-range is problematic. The user can set watchpoints
> >> on different addresses, cause some static-keys to be enabled/disabled, and
> >> monitor the signals to derandomize the poking address.
> >
> > Hmm, I hadn’t thought about watchpoints. I’m not sure how much we care
> > about possible derandomization like this, but we certainly don’t want to
> > send signals or otherwise malfunction.
> >
> >> Andy, I think you were pushing this change. Can I go back to use a vmalloc’d
> >> address instead, or do you have a better solution?
> >
> > Hmm. If we use a vmalloc address, we have to make sure it’s not actually
> > allocated. I suppose we could allocate one once at boot and use that. We
> > also have the problem that the usual APIs for handling “user” addresses
> > might assume they’re actually in the user range, although this seems
> > unlikely to be a problem in practice. More seriously, though, the code
> > that manipulates per-mm paging structures assumes that *all* of the
> > structures up to the top level are per-mm, and, if we use anything less
> > than a private pgd, this isn’t the case.
>
> I forgot that I only had this conversation in my mind ;-)
>
> Well, I did write some code that kept some vmalloc’d area private, and it
> did require more synchronization between the pgd’s. It is still possible
> to use another top-level PGD, but … (continued below)
>
> >
> >> I prefer not to
> >> save/restore DR7, of course.
> >
> > I suspect we may want to use the temporary mm concept for EFI, too, so we
> > may want to just suck it up and save/restore DR7. But only if a watchpoint
> > is in use, of course. I have an old patch I could dust off that tracks DR7
> > to make things like this efficient.
>
> … but, if this is the case, then I will just make (un)use_temporary_mm()
> save/restore DR7. I guess you are ok with such a solution. I will
> incorporate it into Rick’s v3.
>

I'm certainly amenable to other solutions, but this one does seem the
least messy.  I looked at my old patch, and it doesn't do what you
want.  I'd suggest you just add a percpu variable like cpu_dr7 and rig
up some accessors so that it stays up to date.  Then you can skip the
dr7 writes if there are no watchpoints set.
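
A rough sketch of that idea (illustrative names only; the shadow variable is
assumed to be kept in sync by whatever arms the watchpoints, and in practice
the kernel's existing cpu_dr7 per-CPU copy and the hw_breakpoint_*() helpers
already provide this, as the patch further down ends up using):

#include <linux/percpu.h>
#include <asm/debugreg.h>

/*
 * Illustrative sketch only: keep a per-CPU shadow of DR7 so the debug
 * register writes around text poking can be skipped when no watchpoints
 * are armed on this CPU.
 */
static DEFINE_PER_CPU(unsigned long, dr7_shadow);

static inline unsigned long poke_disable_watchpoints(void)
{
	unsigned long dr7 = this_cpu_read(dr7_shadow);

	if (dr7)
		set_debugreg(0UL, 7);	/* only touch DR7 if needed */
	return dr7;
}

static inline void poke_restore_watchpoints(unsigned long dr7)
{
	if (dr7)
		set_debugreg(dr7, 7);
}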

Also, EFI is probably a less interesting example than rare_write.
With rare_write, especially the dynamically allocated variants that
people keep coming up with, we'll need a swath of address space fully
as large as the vmalloc area, and getting *that* right while still
using the kernel address range might be more of a mess than we really
want to deal with.

--Andy

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 13/20] Add set_alias_ function and x86 implementation
  2019-01-29  0:34 ` [PATCH v2 13/20] Add set_alias_ function and x86 implementation Rick Edgecombe
@ 2019-02-11 19:09   ` Borislav Petkov
  2019-02-11 19:27     ` Edgecombe, Rick P
  2019-02-11 22:59     ` Andy Lutomirski
  0 siblings, 2 replies; 71+ messages in thread
From: Borislav Petkov @ 2019-02-11 19:09 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, x86, hpa,
	Thomas Gleixner, Nadav Amit, Dave Hansen, Peter Zijlstra,
	linux_dti, linux-integrity, linux-security-module, akpm,
	kernel-hardening, linux-mm, will.deacon, ard.biesheuvel, kristen,
	deneen.t.dock

On Mon, Jan 28, 2019 at 04:34:15PM -0800, Rick Edgecombe wrote:
> This adds two new functions set_alias_default_noflush and

s/This adds/Add/

> set_alias_nv_noflush for setting the alias mapping for the page to its

Please end function names with parentheses, below too.

> default valid permissions and to an invalid state that cannot be cached in
> a TLB, respectively. These functions to not flush the TLB.

s/to/do/

Also, pls put that description as comments over the functions in the
code. Otherwise that "nv" as part of the name doesn't really explain
what it does.

Actually, you could just as well call the function

set_alias_invalid_noflush()

All the other words are written in full, no need to have "nv" there.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-02-11 19:01       ` Borislav Petkov
@ 2019-02-11 19:09         ` Nadav Amit
  2019-02-11 19:10           ` Borislav Petkov
  0 siblings, 1 reply; 71+ messages in thread
From: Nadav Amit @ 2019-02-11 19:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

> On Feb 11, 2019, at 11:01 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Mon, Feb 11, 2019 at 10:45:26AM -0800, Nadav Amit wrote:
>> Are you sure about that? This path is still used when modules are loaded.
> 
> Yes, I'm sure. Loading a module does a gazillion things so saving a
> couple of insns - yes, boot_cpu_has() is usually a RIP-relative MOV and a
> TEST - doesn't show even as a blip on any radar.

I fully agree, if that is the standard.

It is just that I find the use of static_cpu_has()/boot_cpu_has() to be very
inconsistent. I doubt that show_cpuinfo_misc(), copy_fpstate_to_sigframe(),
or i915_memcpy_init_early() that use static_cpu_has() are any hotter than
text_poke_early().

Anyhow, I’ll use boot_cpu_has() as you said.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-02-11 19:09         ` Nadav Amit
@ 2019-02-11 19:10           ` Borislav Petkov
  2019-02-11 19:27             ` Nadav Amit
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-02-11 19:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

On Mon, Feb 11, 2019 at 11:09:25AM -0800, Nadav Amit wrote:
> It is just that I find the use of static_cpu_has()/boot_cpu_has() to be very
> inconsistent. I doubt that show_cpuinfo_misc(), copy_fpstate_to_sigframe(),
> or i915_memcpy_init_early() that use static_cpu_has() are any hotter than
> text_poke_early().

Would some beefing of the comment over it help?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching
  2019-02-11 19:07         ` Andy Lutomirski
@ 2019-02-11 19:18           ` Nadav Amit
  2019-02-11 22:47             ` Andy Lutomirski
  0 siblings, 1 reply; 71+ messages in thread
From: Nadav Amit @ 2019-02-11 19:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rick Edgecombe, Ingo Molnar, LKML, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen

> On Feb 11, 2019, at 11:07 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> I'm certainly amenable to other solutions, but this one does seem the
> least messy.  I looked at my old patch, and it doesn't do what you
> want.  I'd suggest you just add a percpu variable like cpu_dr7 and rig
> up some accessors so that it stays up to date.  Then you can skip the
> dr7 writes if there are no watchpoints set.
> 
> Also, EFI is probably a less interesting example than rare_write.
> With rare_write, especially the dynamically allocated variants that
> people keep coming up with, we'll need a swath of address space fully
> as large as the vmalloc area, and getting *that* right while still
> using the kernel address range might be more of a mess than we really
> want to deal with.

As long as you feel comfortable with this solution, I’m fine with it.

Here is what I have (untested). I prefer to save/restore all the DRs,
because IIRC DR6 indications are updated even if breakpoints are disabled
(in DR7). And anyhow, that is the standard interface.


-- >8 --

From: Nadav Amit <namit@vmware.com>
Date: Mon, 11 Feb 2019 03:07:08 -0800
Subject: [PATCH] mm: save DRs when loading temporary mm

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/mmu_context.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index d684b954f3c0..4f92ec3df149 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -13,6 +13,7 @@
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
 #include <asm/mpx.h>
+#include <asm/debugreg.h>
 
 extern atomic64_t last_mm_ctx_id;
 
@@ -358,6 +359,7 @@ static inline unsigned long __get_current_cr3_fast(void)
 
 typedef struct {
 	struct mm_struct *prev;
+	unsigned short bp_enabled : 1;
 } temp_mm_state_t;
 
 /*
@@ -380,6 +382,15 @@ static inline temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
 	lockdep_assert_irqs_disabled();
 	state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	switch_mm_irqs_off(NULL, mm, current);
+
+	/*
+	 * If breakpoints are enabled, disable them while the temporary mm is
+	 * used - they do not belong and might cause wrong signals or crashes.
+	 */
+	state.bp_enabled = hw_breakpoint_active();
+	if (state.bp_enabled)
+		hw_breakpoint_disable();
+
 	return state;
 }
 
@@ -387,6 +398,13 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev)
 {
 	lockdep_assert_irqs_disabled();
 	switch_mm_irqs_off(NULL, prev.prev, current);
+
+	/*
+	 * Restore the breakpoints if they were disabled before the temporary mm
+	 * was loaded.
+	 */
+	if (prev.bp_enabled)
+		hw_breakpoint_restore();
 }
 
 #endif /* _ASM_X86_MMU_CONTEXT_H */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 13/20] Add set_alias_ function and x86 implementation
  2019-02-11 19:09   ` Borislav Petkov
@ 2019-02-11 19:27     ` Edgecombe, Rick P
  2019-02-11 22:59     ` Andy Lutomirski
  1 sibling, 0 replies; 71+ messages in thread
From: Edgecombe, Rick P @ 2019-02-11 19:27 UTC (permalink / raw)
  To: bp
  Cc: linux-kernel, peterz, linux-integrity, ard.biesheuvel, tglx,
	linux-mm, dave.hansen, nadav.amit, Dock, Deneen T,
	linux-security-module, x86, akpm, hpa, kristen, mingo, linux_dti,
	luto, will.deacon, kernel-hardening

On Mon, 2019-02-11 at 20:09 +0100, Borislav Petkov wrote:
> On Mon, Jan 28, 2019 at 04:34:15PM -0800, Rick Edgecombe wrote:
> > This adds two new functions set_alias_default_noflush and
> 
> s/This adds/Add/
> 
> > set_alias_nv_noflush for setting the alias mapping for the page to its
> 
> Please end function names with parentheses, below too.
Ok.
> > default valid permissions and to an invalid state that cannot be cached in
> > a TLB, respectively. These functions to not flush the TLB.
> 
> s/to/do/
> 
Argh, thanks.
> Also, pls put that description as comments over the functions in the
> code. Otherwise that "nv" as part of the name doesn't really explain
> what it does.
> 
> Actually, you could just as well call the function
> 
> set_alias_invalid_noflush()
> 
> All the other words are written in full, no need to have "nv" there.
> 
> Thx.
Yes, that seems better.

Thanks,

Rick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-02-11 19:10           ` Borislav Petkov
@ 2019-02-11 19:27             ` Nadav Amit
  2019-02-11 19:42               ` Borislav Petkov
  0 siblings, 1 reply; 71+ messages in thread
From: Nadav Amit @ 2019-02-11 19:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

> On Feb 11, 2019, at 11:10 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Mon, Feb 11, 2019 at 11:09:25AM -0800, Nadav Amit wrote:
>> It is just that I find the use of static_cpu_has()/boot_cpu_has() to be very
>> inconsistent. I doubt that show_cpuinfo_misc(), copy_fpstate_to_sigframe(),
>> or i915_memcpy_init_early() that use static_cpu_has() are any hotter than
>> text_poke_early().
> 
> Would some beefing of the comment over it help?

Is there any comment over static_cpu_has()? ;-)

Anyhow, obviously a comment would be useful.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/20] x86/kprobes: instruction pages initialization enhancements
  2019-02-11 18:22   ` Borislav Petkov
@ 2019-02-11 19:36     ` Nadav Amit
  0 siblings, 0 replies; 71+ messages in thread
From: Nadav Amit @ 2019-02-11 19:36 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, deneen.t.dock

> On Feb 11, 2019, at 10:22 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
> Only nitpicks:

Thanks for the feedback. Applied.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-02-11 19:27             ` Nadav Amit
@ 2019-02-11 19:42               ` Borislav Petkov
  2019-02-11 20:32                 ` Nadav Amit
  2019-03-07  7:29                 ` [PATCH v2 10/20] x86: avoid W^X being broken during modules loading Borislav Petkov
  0 siblings, 2 replies; 71+ messages in thread
From: Borislav Petkov @ 2019-02-11 19:42 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

On Mon, Feb 11, 2019 at 11:27:03AM -0800, Nadav Amit wrote:
> Is there any comment over static_cpu_has()? ;-)

Almost:

/*
 * Static testing of CPU features.  Used the same as boot_cpu_has().
 * These will statically patch the target code for additional
 * performance.
 */
static __always_inline __pure bool _static_cpu_has(u16 bit)

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-02-11 19:42               ` Borislav Petkov
@ 2019-02-11 20:32                 ` Nadav Amit
  2019-03-07 15:10                   ` [PATCH] x86/cpufeature: Remove __pure attribute to _static_cpu_has() Borislav Petkov
  2019-03-07  7:29                 ` [PATCH v2 10/20] x86: avoid W^X being broken during modules loading Borislav Petkov
  1 sibling, 1 reply; 71+ messages in thread
From: Nadav Amit @ 2019-02-11 20:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

> On Feb 11, 2019, at 11:42 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Mon, Feb 11, 2019 at 11:27:03AM -0800, Nadav Amit wrote:
>> Is there any comment over static_cpu_has()? ;-)
> 
> Almost:
> 
> /*
> * Static testing of CPU features.  Used the same as boot_cpu_has().
> * These will statically patch the target code for additional
> * performance.
> */
> static __always_inline __pure bool _static_cpu_has(u16 bit)

Oh, I missed this comment.

BTW: the “__pure” attribute is useless when “__always_inline” is used.
Unless it is intended to be some sort of comment, of course.
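
A toy illustration of that point (plain GCC C, not kernel code): once the body
is force-inlined, the optimizer already sees it at every call site, so the pure
hint cannot enable anything new; __pure only matters for true out-of-line calls,
where it lets the compiler merge repeated calls with the same arguments.

/* Toy example: __pure is redundant on an always_inline function. After
 * inlining, both calls below fold to a single bit test with or without
 * the pure attribute. */
#define my_pure          __attribute__((pure))
#define my_always_inline inline __attribute__((always_inline))

static my_always_inline my_pure int bit_is_set(unsigned long word, int bit)
{
	return (word >> bit) & 1;
}

int count_twice(unsigned long w)
{
	return bit_is_set(w, 3) + bit_is_set(w, 3);
}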

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching
  2019-02-11 19:18           ` Nadav Amit
@ 2019-02-11 22:47             ` Andy Lutomirski
  2019-02-12 18:23               ` Nadav Amit
  0 siblings, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2019-02-11 22:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Rick Edgecombe, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Damian Tometzki, linux-integrity, LSM List,
	Andrew Morton, Kernel Hardening, Linux-MM, Will Deacon,
	Ard Biesheuvel, Kristen Carlson Accardi, Dock, Deneen T,
	Kees Cook, Dave Hansen

On Mon, Feb 11, 2019 at 11:18 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > On Feb 11, 2019, at 11:07 AM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > I'm certainly amenable to other solutions, but this one does seem the
> > least messy.  I looked at my old patch, and it doesn't do what you
> > want.  I'd suggest you just add a percpu variable like cpu_dr7 and rig
> > up some accessors so that it stays up to date.  Then you can skip the
> > dr7 writes if there are no watchpoints set.
> >
> > Also, EFI is probably a less interesting example than rare_write.
> > With rare_write, especially the dynamically allocated variants that
> > people keep coming up with, we'll need a swath of address space fully
> > as large as the vmalloc area, and getting *that* right while still
> > using the kernel address range might be more of a mess than we really
> > want to deal with.
>
> As long as you feel comfortable with this solution, I’m fine with it.
>
> Here is what I have (untested). I prefer to save/restore all the DRs,
> because IIRC DR6 indications are updated even if breakpoints are disabled
> (in DR7). And anyhow, that is the standard interface.

Seems reasonable, but:

>
>
> -- >8 --
>
> From: Nadav Amit <namit@vmware.com>
> Date: Mon, 11 Feb 2019 03:07:08 -0800
> Subject: [PATCH] mm: save DRs when loading temporary mm
>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/include/asm/mmu_context.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index d684b954f3c0..4f92ec3df149 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -13,6 +13,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/paravirt.h>
>  #include <asm/mpx.h>
> +#include <asm/debugreg.h>
>
>  extern atomic64_t last_mm_ctx_id;
>
> @@ -358,6 +359,7 @@ static inline unsigned long __get_current_cr3_fast(void)
>
>  typedef struct {
>         struct mm_struct *prev;
> +       unsigned short bp_enabled : 1;
>  } temp_mm_state_t;
>
>  /*
> @@ -380,6 +382,15 @@ static inline temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
>         lockdep_assert_irqs_disabled();
>         state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>         switch_mm_irqs_off(NULL, mm, current);
> +
> +       /*
> +        * If breakpoints are enabled, disable them while the temporary mm is
> +        * used - they do not belong and might cause wrong signals or crashes.
> +        */

Maybe clarify this?  Add some mention that the specific problem is
that user code could set a watchpoint on an address that is also used
in the temporary mm.

Arguably we should not disable *kernel* breakpoints a la perf, but
that seems like quite a minor issue, at least as long as
use_temporary_mm() doesn't get wider use.  But a comment that this
also disables perf breakpoints and that this could be undesirable
might be in order as well.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 13/20] Add set_alias_ function and x86 implementation
  2019-02-11 19:09   ` Borislav Petkov
  2019-02-11 19:27     ` Edgecombe, Rick P
@ 2019-02-11 22:59     ` Andy Lutomirski
  2019-02-12  0:01       ` Edgecombe, Rick P
  1 sibling, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2019-02-11 22:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Nadav Amit, Dave Hansen,
	Peter Zijlstra, linux_dti, linux-integrity, LSM List,
	Andrew Morton, Kernel Hardening, Linux-MM, Will Deacon,
	Ard Biesheuvel, Kristen Carlson Accardi, Dock, Deneen T

On Mon, Feb 11, 2019 at 11:09 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Mon, Jan 28, 2019 at 04:34:15PM -0800, Rick Edgecombe wrote:
> > This adds two new functions set_alias_default_noflush and
>
> s/This adds/Add/
>
> > set_alias_nv_noflush for setting the alias mapping for the page to its
>
> Please end function names with parentheses, below too.
>
> > default valid permissions and to an invalid state that cannot be cached in
> > a TLB, respectively. These functions to not flush the TLB.
>
> s/to/do/
>
> Also, pls put that description as comments over the functions in the
> code. Otherwise that "nv" as part of the name doesn't really explain
> what it does.
>
> Actually, you could just as well call the function
>
> set_alias_invalid_noflush()
>
> All the other words are written in full, no need to have "nv" there.

Why are you calling this an "alias"?  You're modifying the direct map.
Your patches are thinking of the direct map as an alias of the vmap
mapping, but that does seem a bit backwards.  How about
set_direct_map_invalid_noflush(), etc?

--Andy

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 13/20] Add set_alias_ function and x86 implementation
  2019-02-11 22:59     ` Andy Lutomirski
@ 2019-02-12  0:01       ` Edgecombe, Rick P
  0 siblings, 0 replies; 71+ messages in thread
From: Edgecombe, Rick P @ 2019-02-12  0:01 UTC (permalink / raw)
  To: luto, bp
  Cc: linux-kernel, peterz, linux-integrity, ard.biesheuvel, tglx,
	linux-mm, nadav.amit, dave.hansen, Dock, Deneen T,
	linux-security-module, x86, akpm, hpa, kristen, mingo, linux_dti,
	will.deacon, kernel-hardening

On Mon, 2019-02-11 at 14:59 -0800, Andy Lutomirski wrote:
> On Mon, Feb 11, 2019 at 11:09 AM Borislav Petkov <bp@alien8.de> wrote:
> > 
> > On Mon, Jan 28, 2019 at 04:34:15PM -0800, Rick Edgecombe wrote:
> > > This adds two new functions set_alias_default_noflush and
> > 
> > s/This adds/Add/
> > 
> > > set_alias_nv_noflush for setting the alias mapping for the page to its
> > 
> > Please end function names with parentheses, below too.
> > 
> > > default valid permissions and to an invalid state that cannot be cached in
> > > a TLB, respectively. These functions to not flush the TLB.
> > 
> > s/to/do/
> > 
> > Also, pls put that description as comments over the functions in the
> > code. Otherwise that "nv" as part of the name doesn't really explain
> > what it does.
> > 
> > Actually, you could just as well call the function
> > 
> > set_alias_invalid_noflush()
> > 
> > All the other words are written in full, no need to have "nv" there.
> 
> Why are you calling this an "alias"?  You're modifying the direct map.
> Your patches are thinking of the direct map as an alias of the vmap
> mapping, but that does seem a bit backwards.  How about
> set_direct_map_invalid_noflush(), etc?
> 
I picked it up from some of the names in arch/x86/mm/pageattr.c:
CPA_NO_CHECK_ALIAS, set_memory_np_noalias(), etc. In that file the directmap
address seems to be the "alias". For 32 bit with highmem though, this would also
set permissions for a kmap mapping as well (if one existed), since that address
will be returned from page_address().

Yea, in vmalloc, vm_unmap_aliases talks about the vmap address "alias". So I
guess calling it "alias" is ambiguous. But does set_direct_map_invalid_noflush
make sense in the highmem case?

I couldn't think of any names that I loved, which is why I ran the
set_alias_*_noflush names by people in an earlier version, although looking back
only Ard chimed in on that. "set_direct_map_invalid_noflush" is fine with me if
nobody objects.

Thanks,

Rick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching
  2019-02-11 22:47             ` Andy Lutomirski
@ 2019-02-12 18:23               ` Nadav Amit
  0 siblings, 0 replies; 71+ messages in thread
From: Nadav Amit @ 2019-02-12 18:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rick Edgecombe, Ingo Molnar, LKML, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen

> On Feb 11, 2019, at 2:47 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Feb 11, 2019 at 11:18 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> +
>> +       /*
>> +        * If breakpoints are enabled, disable them while the temporary mm is
>> +        * used - they do not belong and might cause wrong signals or crashes.
>> +        */
> 
> Maybe clarify this?  Add some mention that the specific problem is
> that user code could set a watchpoint on an address that is also used
> in the temporary mm.
> 
> Arguably we should not disable *kernel* breakpoints a la perf, but
> that seems like quite a minor issue, at least as long as
> use_temporary_mm() doesn't get wider use.  But a comment that this
> also disables perf breakpoints and that this could be undesirable
> might be in order as well.

I think that in the future there may also be security benefits for disabling
breakpoints when you are in a sensitive code-block, for instance when you
poke text, to prevent the control flow from being hijacked (by exploiting a
bug in the debug exception handler). Some additional steps need to be taken
for that to be beneficial so I leave it out of the comment for now.

Anyhow, how about this:

-- >8 --

From: Nadav Amit <namit@vmware.com>
Date: Mon, 11 Feb 2019 03:07:08 -0800
Subject: [PATCH] x86/mm: Save DRs when loading a temporary mm

Prevent user watchpoints from mistakenly firing while the temporary mm
is being used. As the addresses of the temporary mm might overlap
those of the user-process, this is necessary to prevent wrong signals
or worse things from happening.

Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/mmu_context.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index d684b954f3c0..0d6c72ece750 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -13,6 +13,7 @@
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
 #include <asm/mpx.h>
+#include <asm/debugreg.h>
 
 extern atomic64_t last_mm_ctx_id;
 
@@ -358,6 +359,7 @@ static inline unsigned long __get_current_cr3_fast(void)
 
 typedef struct {
 	struct mm_struct *prev;
+	unsigned short bp_enabled : 1;
 } temp_mm_state_t;
 
 /*
@@ -380,6 +382,22 @@ static inline temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
 	lockdep_assert_irqs_disabled();
 	state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	switch_mm_irqs_off(NULL, mm, current);
+
+	/*
+	 * If breakpoints are enabled, disable them while the temporary mm is
+	 * used. Userspace might set up watchpoints on addresses that are used
+	 * in the temporary mm, which would lead to wrong signals being sent or
+	 * crashes.
+	 *
+	 * Note that breakpoints are not disabled selectively, which also causes
+	 * kernel breakpoints (e.g., perf's) to be disabled. This might be
+	 * undesirable, but still seems reasonable as the code that runs in the
+	 * temporary mm should be short.
+	 */
+	state.bp_enabled = hw_breakpoint_active();
+	if (state.bp_enabled)
+		hw_breakpoint_disable();
+
 	return state;
 }
 
@@ -387,6 +405,13 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev)
 {
 	lockdep_assert_irqs_disabled();
 	switch_mm_irqs_off(NULL, prev.prev, current);
+
+	/*
+	 * Restore the breakpoints if they were disabled before the temporary mm
+	 * was loaded.
+	 */
+	if (prev.bp_enabled)
+		hw_breakpoint_restore();
 }
 
 #endif /* _ASM_X86_MMU_CONTEXT_H */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/20] mm: Make hibernate handle unmapped pages
  2019-01-29  0:34 ` [PATCH v2 14/20] mm: Make hibernate handle unmapped pages Rick Edgecombe
@ 2019-02-19 11:04   ` Borislav Petkov
  2019-02-19 21:28     ` Edgecombe, Rick P
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-02-19 11:04 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, x86, hpa,
	Thomas Gleixner, Nadav Amit, Dave Hansen, Peter Zijlstra,
	linux_dti, linux-integrity, linux-security-module, akpm,
	kernel-hardening, linux-mm, will.deacon, ard.biesheuvel, kristen,
	deneen.t.dock, Rafael J. Wysocki, Pavel Machek

On Mon, Jan 28, 2019 at 04:34:16PM -0800, Rick Edgecombe wrote:
> For architectures with CONFIG_ARCH_HAS_SET_ALIAS, pages can be unmapped
> briefly on the directmap, even when CONFIG_DEBUG_PAGEALLOC is not
> configured. So this changes kernel_map_pages and kernel_page_present to be

s/this changes/change/

From Documentation/process/submitting-patches.rst:

 "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
  to do frotz", as if you are giving orders to the codebase to change
  its behaviour."

Also, please end function names with parentheses.

> defined when CONFIG_ARCH_HAS_SET_ALIAS is defined as well. It also changes
> places (page_alloc.c) where those functions are assumed to only be
> implemented when CONFIG_DEBUG_PAGEALLOC is defined.

The commit message doesn't need to say "what" you're doing - that should
be obvious from the diff below. It should rather say "why" you're doing
it.

> So now when CONFIG_ARCH_HAS_SET_ALIAS=y, hibernate will handle not present
> page when saving. Previously this was already done when

pages

> CONFIG_DEBUG_PAGEALLOC was configured. It does not appear to have a big
> hibernating performance impact.

Comment over safe_copy_page() needs updating I guess.

> Before:
> [    4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s)
> 
> After:
> [    4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s)

IINM, that's like 1734 pages more (178932 - 171996 = 6936 kbytes, i.e. 1734 4K pages). How am I to understand this number?

Code has called set_alias_nv_noflush() on them and safe_copy_page() now
maps them one by one to copy them to the hibernation image?

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 15/20] vmalloc: New flags for safe vfree on special perms
  2019-01-29  0:34 ` [PATCH v2 15/20] vmalloc: New flags for safe vfree on special perms Rick Edgecombe
@ 2019-02-19 12:48   ` Borislav Petkov
  2019-02-19 19:42     ` Edgecombe, Rick P
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-02-19 12:48 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, x86, hpa,
	Thomas Gleixner, Nadav Amit, Dave Hansen, Peter Zijlstra,
	linux_dti, linux-integrity, linux-security-module, akpm,
	kernel-hardening, linux-mm, will.deacon, ard.biesheuvel, kristen,
	deneen.t.dock

On Mon, Jan 28, 2019 at 04:34:17PM -0800, Rick Edgecombe wrote:
> This adds a new flags VM_HAS_SPECIAL_PERMS, for enabling vfree operations

s/This adds/add/ - you get the idea. :)

s/flags/flag/

> to immediately clear executable TLB entries to freed pages, and handle
> freeing memory with special permissions. It also takes care of resetting
> the direct map permissions for the pages being unmapped. So this flag is
> useful for any kind of memory with elevated permissions, or where there can
> be related permissions changes on the directmap. Today this is RO+X and RO
> memory.
> 
> Although this enables directly vfreeing RO memory now, RO memory cannot be
> freed in an interrupt because the allocation itself is used as a node on
> deferred free list. So when RO memory needs to be freed in an interrupt
> the code doing the vfree needs to have its own work queue, as was the case
> before the deferred vfree list handling was added. Today there is only one
> case where this happens.
> 
> For architectures with set_alias_ implementations this whole operation
> can be done with one TLB flush when centralized like this. For others with
> directmap permissions, currently only arm64, a backup method using
> set_memory functions is used to reset the directmap. When arm64 adds
> set_alias_ functions, this backup can be removed.
> 
> When the TLB is flushed to both remove TLB entries for the vmalloc range
> mapping and the direct map permissions, the lazy purge operation could be
> done to try to save a TLB flush later. However today vm_unmap_aliases
> could flush a TLB range that does not include the directmap. So a helper
> is added with extra parameters that can allow both the vmalloc address and
> the direct mapping to be flushed during this operation. The behavior of the
> normal vm_unmap_aliases function is unchanged.
> 
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Suggested-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  include/linux/vmalloc.h |  13 +++++
>  mm/vmalloc.c            | 122 +++++++++++++++++++++++++++++++++-------
>  2 files changed, 116 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 398e9c95cd61..9f643f917360 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -21,6 +21,11 @@ struct notifier_block;		/* in notifier.h */
>  #define VM_UNINITIALIZED	0x00000020	/* vm_struct is not fully initialized */
>  #define VM_NO_GUARD		0x00000040      /* don't add guard page */
>  #define VM_KASAN		0x00000080      /* has allocated kasan shadow memory */
> +/*
> + * Memory with VM_HAS_SPECIAL_PERMS cannot be freed in an interrupt or with
> + * vfree_atomic.

Please end function names with parentheses. You should go over the whole
patchset - there are a bunch of places.

> + */
> +#define VM_HAS_SPECIAL_PERMS	0x00000200      /* Reset directmap and flush TLB on unmap */

After 0x00000080 comes 0x00000100. 0x00000010 is free too. What's up?

>  /* bits [20..32] reserved for arch specific ioremap internals */
>  
>  /*
> @@ -135,6 +140,14 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
>  extern struct vm_struct *remove_vm_area(const void *addr);
>  extern struct vm_struct *find_vm_area(const void *addr);
>  
> +static inline void set_vm_special(void *addr)

You need a different name than "special" for a vm which needs to flush
and clear mapping perms on removal. VM_RESET_PERMS or whatever is more
to the point than "special", for example, which could mean a lot of
things.

> +{
> +	struct vm_struct *vm = find_vm_area(addr);
> +
> +	if (vm)
> +		vm->flags |= VM_HAS_SPECIAL_PERMS;
> +}
> +
>  extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
>  			struct page **pages);
>  #ifdef CONFIG_MMU
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 871e41c55e23..d459b5b9649b 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -18,6 +18,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/proc_fs.h>
>  #include <linux/seq_file.h>
> +#include <linux/set_memory.h>
>  #include <linux/debugobjects.h>
>  #include <linux/kallsyms.h>
>  #include <linux/list.h>
> @@ -1055,24 +1056,11 @@ static void vb_free(const void *addr, unsigned long size)
>  		spin_unlock(&vb->lock);
>  }
>  
> -/**
> - * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer
> - *
> - * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
> - * to amortize TLB flushing overheads. What this means is that any page you
> - * have now, may, in a former life, have been mapped into kernel virtual
> - * address by the vmap layer and so there might be some CPUs with TLB entries
> - * still referencing that page (additional to the regular 1:1 kernel mapping).
> - *
> - * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can
> - * be sure that none of the pages we have control over will have any aliases
> - * from the vmap layer.
> - */
> -void vm_unmap_aliases(void)
> +static void _vm_unmap_aliases(unsigned long start, unsigned long end,
> +				int must_flush)

Align arguments on the opening brace. There's more places below, pls fix
them all.

>  {
> -	unsigned long start = ULONG_MAX, end = 0;
>  	int cpu;
> -	int flush = 0;
> +	int flush = must_flush;

You can't use must_flush directly because...?

gcc produces the same asm here, with or without the local "flush" var.

>  
>  	if (unlikely(!vmap_initialized))
>  		return;
> @@ -1109,6 +1097,27 @@ void vm_unmap_aliases(void)
>  		flush_tlb_kernel_range(start, end);
>  	mutex_unlock(&vmap_purge_lock);
>  }
> +
> +/**
> + * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer
> + *
> + * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
> + * to amortize TLB flushing overheads. What this means is that any page you
> + * have now, may, in a former life, have been mapped into kernel virtual
> + * address by the vmap layer and so there might be some CPUs with TLB entries
> + * still referencing that page (additional to the regular 1:1 kernel mapping).
> + *
> + * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can
> + * be sure that none of the pages we have control over will have any aliases
> + * from the vmap layer.
> + */
> +void vm_unmap_aliases(void)
> +{
> +	unsigned long start = ULONG_MAX, end = 0;
> +	int must_flush = 0;
> +
> +	_vm_unmap_aliases(start, end, must_flush);
> +}
>  EXPORT_SYMBOL_GPL(vm_unmap_aliases);
>  
>  /**
> @@ -1494,6 +1503,79 @@ struct vm_struct *remove_vm_area(const void *addr)
>  	return NULL;
>  }
>  
> +static inline void set_area_alias(const struct vm_struct *area,
> +			int (*set_alias)(struct page *page))
> +{
> +	int i;
> +
> +	for (i = 0; i < area->nr_pages; i++) {
> +		unsigned long addr =
> +			(unsigned long)page_address(area->pages[i]);
> +
> +		if (addr)
> +			set_alias(area->pages[i]);

What's wrong with simply:

        for (i = 0; i < area->nr_pages; i++) {
                if (page_address(area->pages[i]))
                        set_alias(area->pages[i]);
        }

?

> +	}
> +}
> +
> +/* This handles removing and resetting vm mappings related to the vm_struct. */

s/This handles/Handle/

> +static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
> +{
> +	unsigned long addr = (unsigned long)area->addr;
> +	unsigned long start = ULONG_MAX, end = 0;
> +	int special = area->flags & VM_HAS_SPECIAL_PERMS;
> +	int i;
> +
> +	/*
> +	 * The below block can be removed when all architectures that have
> +	 * direct map permissions also have set_alias_ implementations. This is
> +	 * to do resetting on the directmap for any special permissions (today
> +	 * only X), without leaving a RW+X window.
> +	 */
> +	if (special && !IS_ENABLED(CONFIG_ARCH_HAS_SET_ALIAS)) {
> +		set_memory_nx(addr, area->nr_pages);
> +		set_memory_rw(addr, area->nr_pages);

That's two not very cheap calls to the underlying worker function, for
example change_memory_common() on ARM64, instead of calling it once with
the respective flags. You allude to that in the commit message but you
might wanna run it by ARM folks first.

> +	}
> +
> +	remove_vm_area(area->addr);
> +
> +	/* If this is not special memory, we can skip the below. */
> +	if (!special)
> +		return;
> +
> +	/*
> +	 * If we are not deallocating pages, we can just do the flush of the VM
> +	 * area and return.
> +	 */
> +	if (!deallocate_pages) {
> +		vm_unmap_aliases();
> +		return;
> +	}
> +
> +	/*
> +	 * If we are here, we need to flush the vm mapping and reset the direct
> +	 * map.
> +	 * First find the start and end range of the direct mappings to make
> +	 * sure the vm_unmap_aliases flush includes the direct map.
> +	 */
> +	for (i = 0; i < area->nr_pages; i++) {
> +		unsigned long addr =
> +			(unsigned long)page_address(area->pages[i]);
> +		if (addr) {

		if (page_address(area->pages[i]))

as above.

> +			start = min(addr, start);
> +			end = max(addr, end);
> +		}
> +	}
> +
> +	/*
> +	 * First we set direct map to something not valid so that it won't be

Above comment says "First" too. In general, all those "we" formulations
do not make the comments as easy to read as when you make them
impersonal and imperative:

	/*
	 * Set the direct map to something invalid...

Just like Documentation/process/submitting-patches.rst says:

 "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
  to do frotz", as if you are giving orders to the codebase to change
  its behaviour."

you simply order your code to do stuff. :-)

> +	 * cached if there are any accesses after the TLB flush, then we flush
> +	 * the TLB, and reset the directmap permissions to the default.
> +	 */
> +	set_area_alias(area, set_alias_nv_noflush);
> +	_vm_unmap_aliases(start, end, 1);
> +	set_area_alias(area, set_alias_default_noflush);
> +}
> +
>  static void __vunmap(const void *addr, int deallocate_pages)
>  {
>  	struct vm_struct *area;

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 15/20] vmalloc: New flags for safe vfree on special perms
  2019-02-19 12:48   ` Borislav Petkov
@ 2019-02-19 19:42     ` Edgecombe, Rick P
  2019-02-20 16:14       ` Borislav Petkov
  0 siblings, 1 reply; 71+ messages in thread
From: Edgecombe, Rick P @ 2019-02-19 19:42 UTC (permalink / raw)
  To: ard.biesheuvel, bp, will.deacon
  Cc: linux-kernel, peterz, linux-integrity, Dock, Deneen T, tglx,
	linux-mm, dave.hansen, nadav.amit, linux-security-module, x86,
	akpm, hpa, kristen, mingo, linux_dti, luto, kernel-hardening

Thanks Boris.

Ard, Will: An arm question came up below. Any thoughts?

On Tue, 2019-02-19 at 13:48 +0100, Borislav Petkov wrote:
> On Mon, Jan 28, 2019 at 04:34:17PM -0800, Rick Edgecombe wrote:
> > This adds a new flags VM_HAS_SPECIAL_PERMS, for enabling vfree operations
> 
> s/This adds/add/ - you get the idea. :)
Yes, thanks. Fixed after your comments on other patches.

> s/flags/flag/
> 
> > to immediately clear executable TLB entries to freed pages, and handle
> > freeing memory with special permissions. It also takes care of resetting
> > the direct map permissions for the pages being unmapped. So this flag is
> > useful for any kind of memory with elevated permissions, or where there can
> > be related permissions changes on the directmap. Today this is RO+X and RO
> > memory.
> > 
> > Although this enables directly vfreeing RO memory now, RO memory cannot be
> > freed in an interrupt because the allocation itself is used as a node on
> > deferred free list. So when RO memory needs to be freed in an interrupt
> > the code doing the vfree needs to have its own work queue, as was the case
> > before the deferred vfree list handling was added. Today there is only one
> > case where this happens.
> > 
> > For architectures with set_alias_ implementations this whole operation
> > can be done with one TLB flush when centralized like this. For others with
> > directmap permissions, currently only arm64, a backup method using
> > set_memory functions is used to reset the directmap. When arm64 adds
> > set_alias_ functions, this backup can be removed.
> > 
> > When the TLB is flushed to both remove TLB entries for the vmalloc range
> > mapping and the direct map permissions, the lazy purge operation could be
> > done to try to save a TLB flush later. However today vm_unmap_aliases
> > could flush a TLB range that does not include the directmap. So a helper
> > is added with extra parameters that can allow both the vmalloc address and
> > the direct mapping to be flushed during this operation. The behavior of the
> > normal vm_unmap_aliases function is unchanged.
> > 
> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > Suggested-by: Will Deacon <will.deacon@arm.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> >  include/linux/vmalloc.h |  13 +++++
> >  mm/vmalloc.c            | 122 +++++++++++++++++++++++++++++++++-------
> >  2 files changed, 116 insertions(+), 19 deletions(-)
> > 
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index 398e9c95cd61..9f643f917360 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -21,6 +21,11 @@ struct notifier_block;		/* in notifier.h */
> >  #define VM_UNINITIALIZED	0x00000020	/* vm_struct is not fully
> > initialized */
> >  #define VM_NO_GUARD		0x00000040      /* don't add guard page
> > */
> >  #define VM_KASAN		0x00000080      /* has allocated kasan shadow
> > memory */
> > +/*
> > + * Memory with VM_HAS_SPECIAL_PERMS cannot be freed in an interrupt or with
> > + * vfree_atomic.
> 
> Please end function names with parentheses. You should go over the whole
> patchset - there are a bunch of places.
Ok.

> > + */
> > +#define VM_HAS_SPECIAL_PERMS	0x00000200      /* Reset directmap and
> > flush TLB on unmap */
> 
> After 0x00000080 comes 0x00000100. 0x00000010 is free too. What's up?
I was just trying to follow the pattern noticed from the gap at 0x00000010. I'll
add it at 0x00000100 in case there is some reason for the other gap.

> >  /* bits [20..32] reserved for arch specific ioremap internals */
> >  
> >  /*
> > @@ -135,6 +140,14 @@ extern struct vm_struct *__get_vm_area_caller(unsigned
> > long size,
> >  extern struct vm_struct *remove_vm_area(const void *addr);
> >  extern struct vm_struct *find_vm_area(const void *addr);
> >  
> > +static inline void set_vm_special(void *addr)
> 
> You need a different name than "special" for a vm which needs to flush
> and clear mapping perms on removal. VM_RESET_PERMS or whatever is more
> to the point than "special", for example, which could mean a lot of
> things.
I don't have a strong opinion about the name, but it has gone through some
revisions, so I'll summarize the history.

There are two intentions - not leaving a stale TLB entry to a freed page, and
then resetting the direct map for architectures that need it. Andy had pointed
out that you can do this with only one TLB flush if all the cleanup is
centralized here. The original point of this flag was the TLB flush of the
vmalloc alias, and it's something that makes sense for all architectures that
have something like NX, regardless of what they are doing with their direct map.

At one point there were two flags, one for "immediate flush" and one for "reset
direct map", but Andy had also pointed out that you would mostly use them
together, so why complicate things.

So to capture both of those intentions, maybe I'll slightly tweak your
suggestion to VM_FLUSH_RESET_PERMS?
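
For illustration, a rough usage sketch under that naming; set_vm_flush_reset_perms()
and the flag name just follow the suggestion above, so the final interface in the
series may differ:

#include <linux/vmalloc.h>
#include <linux/set_memory.h>
#include <linux/string.h>
#include <linux/pfn.h>

/*
 * Sketch only, using the names suggested above. The allocator tags the
 * area so that a later vfree() flushes any lingering executable TLB
 * aliases and resets the direct map before the pages go back to the
 * page allocator.
 */
static void *alloc_sealed_exec(const void *insns, unsigned long size)
{
	unsigned long npages = PFN_UP(size);
	void *p = __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);

	if (!p)
		return NULL;

	set_vm_flush_reset_perms(p);	/* hypothetical helper name */

	memcpy(p, insns, size);
	set_memory_ro((unsigned long)p, npages);
	set_memory_x((unsigned long)p, npages);
	return p;
}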

> > +{
> > +	struct vm_struct *vm = find_vm_area(addr);
> > +
> > +	if (vm)
> > +		vm->flags |= VM_HAS_SPECIAL_PERMS;
> > +}
> > +
> >  extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
> >  			struct page **pages);
> >  #ifdef CONFIG_MMU
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 871e41c55e23..d459b5b9649b 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/interrupt.h>
> >  #include <linux/proc_fs.h>
> >  #include <linux/seq_file.h>
> > +#include <linux/set_memory.h>
> >  #include <linux/debugobjects.h>
> >  #include <linux/kallsyms.h>
> >  #include <linux/list.h>
> > @@ -1055,24 +1056,11 @@ static void vb_free(const void *addr, unsigned long
> > size)
> >  		spin_unlock(&vb->lock);
> >  }
> >  
> > -/**
> > - * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer
> > - *
> > - * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
> > - * to amortize TLB flushing overheads. What this means is that any page you
> > - * have now, may, in a former life, have been mapped into kernel virtual
> > - * address by the vmap layer and so there might be some CPUs with TLB
> > entries
> > - * still referencing that page (additional to the regular 1:1 kernel
> > mapping).
> > - *
> > - * vm_unmap_aliases flushes all such lazy mappings. After it returns, we
> > can
> > - * be sure that none of the pages we have control over will have any
> > aliases
> > - * from the vmap layer.
> > - */
> > -void vm_unmap_aliases(void)
> > +static void _vm_unmap_aliases(unsigned long start, unsigned long end,
> > +				int must_flush)
> 
> Align arguments on the opening brace. There's more places below, pls fix
> them all.
Ok.

> >  {
> > -	unsigned long start = ULONG_MAX, end = 0;
> >  	int cpu;
> > -	int flush = 0;
> > +	int flush = must_flush;
> 
> You can't use must_flush directly because...?
> 
> gcc produces the same asm here, with or without the local "flush" var.
I had thought it was easier to read. If it's not the case, I'll change it as you
suggest.

> >  
> >  	if (unlikely(!vmap_initialized))
> >  		return;
> > @@ -1109,6 +1097,27 @@ void vm_unmap_aliases(void)
> >  		flush_tlb_kernel_range(start, end);
> >  	mutex_unlock(&vmap_purge_lock);
> >  }
> > +
> > +/**
> > + * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer
> > + *
> > + * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
> > + * to amortize TLB flushing overheads. What this means is that any page you
> > + * have now, may, in a former life, have been mapped into kernel virtual
> > + * address by the vmap layer and so there might be some CPUs with TLB
> > entries
> > + * still referencing that page (additional to the regular 1:1 kernel
> > mapping).
> > + *
> > + * vm_unmap_aliases flushes all such lazy mappings. After it returns, we
> > can
> > + * be sure that none of the pages we have control over will have any
> > aliases
> > + * from the vmap layer.
> > + */
> > +void vm_unmap_aliases(void)
> > +{
> > +	unsigned long start = ULONG_MAX, end = 0;
> > +	int must_flush = 0;
> > +
> > +	_vm_unmap_aliases(start, end, must_flush);
> > +}
> >  EXPORT_SYMBOL_GPL(vm_unmap_aliases);
> >  
> >  /**
> > @@ -1494,6 +1503,79 @@ struct vm_struct *remove_vm_area(const void *addr)
> >  	return NULL;
> >  }
> >  
> > +static inline void set_area_alias(const struct vm_struct *area,
> > +			int (*set_alias)(struct page *page))
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < area->nr_pages; i++) {
> > +		unsigned long addr =
> > +			(unsigned long)page_address(area->pages[i]);
> > +
> > +		if (addr)
> > +			set_alias(area->pages[i]);
> 
> What's wrong with simply:
> 
>         for (i = 0; i < area->nr_pages; i++) {
>                 if (page_address(area->pages[i]))
>                         set_alias(area->pages[i]);
>         }
> 
> ?
Yes, I see that's better.

I think at one point I had played with having the set_alias()'s take an unsigned
long address, and so the address was used twice and was worth assigning to
something, so this should be fixed. Thanks.

> > +	}
> > +}
> > +
> > +/* This handles removing and resetting vm mappings related to the
> > vm_struct. */
> 
> s/This handles/Handle/
> 
> > +static void vm_remove_mappings(struct vm_struct *area, int
> > deallocate_pages)
> > +{
> > +	unsigned long addr = (unsigned long)area->addr;
> > +	unsigned long start = ULONG_MAX, end = 0;
> > +	int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > +	int i;
> > +
> > +	/*
> > +	 * The below block can be removed when all architectures that have
> > +	 * direct map permissions also have set_alias_ implementations. This is
> > +	 * to do resetting on the directmap for any special permissions (today
> > +	 * only X), without leaving a RW+X window.
> > +	 */
> > +	if (special && !IS_ENABLED(CONFIG_ARCH_HAS_SET_ALIAS)) {
> > +		set_memory_nx(addr, area->nr_pages);
> > +		set_memory_rw(addr, area->nr_pages);
> 
> That's two not very cheap calls to the underlying worker function, for
> example change_memory_common() on ARM64, instead of calling it once with
> the respective flags. You allude to that in the commit message but you
> might wanna run it by ARM folks first.
These calls are basically getting moved to be centralized here from other
places. In the later patches they get removed from where they were, so it's a net
zero (except BPF on ARM I think).

Ard had expressed interest in having the set_alias_() functions for Arm, and the
names were chosen to be arch agnostic. He didn't explicitly commit but I was
under the impression he might create an implementation for ARM and we could
remove this block.

Ard, did I misinterpret that?

> > +	}
> > +
> > +	remove_vm_area(area->addr);
> > +
> > +	/* If this is not special memory, we can skip the below. */
> > +	if (!special)
> > +		return;
> > +
> > +	/*
> > +	 * If we are not deallocating pages, we can just do the flush of the VM
> > +	 * area and return.
> > +	 */
> > +	if (!deallocate_pages) {
> > +		vm_unmap_aliases();
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * If we are here, we need to flush the vm mapping and reset the direct
> > +	 * map.
> > +	 * First find the start and end range of the direct mappings to make
> > +	 * sure the vm_unmap_aliases flush includes the direct map.
> > +	 */
> > +	for (i = 0; i < area->nr_pages; i++) {
> > +		unsigned long addr =
> > +			(unsigned long)page_address(area->pages[i]);
> > +		if (addr) {
> 
> 		if (page_address(area->pages[i]))
> 
> as above.
Right. Thanks.

> > +			start = min(addr, start);
> > +			end = max(addr, end);
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * First we set direct map to something not valid so that it won't be
> 
> Above comment says "First" too. In general, all those "we" formulations
> do not make the comments as easy to read as when you make them
> impersonal and imperative:
> 
> 	/*
> 	 * Set the direct map to something invalid...
> 
> Just like Documentation/process/submitting-patches.rst says:
> 
>  "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
>   instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
>   to do frotz", as if you are giving orders to the codebase to change
>   its behaviour."
> 
> you simply order your code to do stuff. :-)
I see, I will make sure to fully apply this grammar rule to commit messages and
comments for the next version.

> > +	 * cached if there are any accesses after the TLB flush, then we flush
> > +	 * the TLB, and reset the directmap permissions to the default.
> > +	 */
> > +	set_area_alias(area, set_alias_nv_noflush);
> > +	_vm_unmap_aliases(start, end, 1);
> > +	set_area_alias(area, set_alias_default_noflush);
> > +}
> > +
> >  static void __vunmap(const void *addr, int deallocate_pages)
> >  {
> >  	struct vm_struct *area;
> 
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/20] mm: Make hibernate handle unmapped pages
  2019-02-19 11:04   ` Borislav Petkov
@ 2019-02-19 21:28     ` Edgecombe, Rick P
  2019-02-20 16:07       ` Borislav Petkov
  0 siblings, 1 reply; 71+ messages in thread
From: Edgecombe, Rick P @ 2019-02-19 21:28 UTC (permalink / raw)
  To: bp
  Cc: linux-kernel, peterz, linux-integrity, ard.biesheuvel, tglx,
	linux-mm, dave.hansen, nadav.amit, Dock, Deneen T, pavel,
	linux-security-module, x86, akpm, hpa, kristen, mingo, linux_dti,
	luto, will.deacon, kernel-hardening, rjw

On Tue, 2019-02-19 at 12:04 +0100, Borislav Petkov wrote:
> On Mon, Jan 28, 2019 at 04:34:16PM -0800, Rick Edgecombe wrote:
> > For architectures with CONFIG_ARCH_HAS_SET_ALIAS, pages can be unmapped
> > briefly on the directmap, even when CONFIG_DEBUG_PAGEALLOC is not
> > configured. So this changes kernel_map_pages and kernel_page_present to be
> 
> s/this changes/change/
> 
> From Documentation/process/submitting-patches.rst:
> 
>  "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
>   instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
>   to do frotz", as if you are giving orders to the codebase to change
>   its behaviour."
> 
> Also, please end function names with parentheses.
Yes, gotcha.

> > defined when CONFIG_ARCH_HAS_SET_ALIAS is defined as well. It also changes
> > places (page_alloc.c) where those functions are assumed to only be
> > implemented when CONFIG_DEBUG_PAGEALLOC is defined.
> 
> The commit message doesn't need to say "what" you're doing - that should
> be obvious from the diff below. It should rather say "why" you're doing
> it.
Ok, sorry. I'll change this to be more concise.

> > So now when CONFIG_ARCH_HAS_SET_ALIAS=y, hibernate will handle not present
> > page when saving. Previously this was already done when
> 
> pages
> 
> > CONFIG_DEBUG_PAGEALLOC was configured. It does not appear to have a big
> > hibernating performance impact.
> 
> Comment over safe_copy_page
Oh, yes you are right.

> > Before:
> > [    4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s)
> > 
> > After:
> > [    4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s)
> 
> IINM, that's like 1734 pages more. How am I to understand this number?
> 
> Code has called set_alias_nv_noflush() on them and safe_copy_page() now
> maps them one by one to copy them to the hibernation image?
> 
> Thx.
> 
These are from logs hibernate generates. The concern was that hibernate could be
slightly slower because of the checking of whether the pages are mapped. The
bandwidth number can be used to compare, 819.02->813.32 MB/s. Some randomness
must have resulted in different amounts of memory used between tests. I can just
remove the log lines and include the bandwidth numbers.

Thanks,

Rick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/20] mm: Make hibernate handle unmapped pages
  2019-02-19 21:28     ` Edgecombe, Rick P
@ 2019-02-20 16:07       ` Borislav Petkov
  0 siblings, 0 replies; 71+ messages in thread
From: Borislav Petkov @ 2019-02-20 16:07 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, peterz, linux-integrity, ard.biesheuvel, tglx,
	linux-mm, dave.hansen, nadav.amit, Dock, Deneen T, pavel,
	linux-security-module, x86, akpm, hpa, kristen, mingo, linux_dti,
	luto, will.deacon, kernel-hardening, rjw

On Tue, Feb 19, 2019 at 09:28:55PM +0000, Edgecombe, Rick P wrote:
> These are from logs hibernate generates. The concern was that hibernate could be
> slightly slower because of the checking of whether the pages are mapped. The
> bandwidth number can be used to compare, 819.02->813.32 MB/s. Some randomness
> must have resulted in different amounts of memory used between tests. I can just
> remove the log lines and include the bandwidth numbers.

Nah, I'm just trying to get an idea of the slowdown it would cause. I'm
thinking these pages are, as you call them, "special", so they should not
be a huge chunk of all the system's pages, even on a machine with a lot of
memory, so I guess it ain't that bad. We should keep an eye on it
though... :-)

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 15/20] vmalloc: New flags for safe vfree on special perms
  2019-02-19 19:42     ` Edgecombe, Rick P
@ 2019-02-20 16:14       ` Borislav Petkov
  0 siblings, 0 replies; 71+ messages in thread
From: Borislav Petkov @ 2019-02-20 16:14 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: ard.biesheuvel, will.deacon, linux-kernel, peterz,
	linux-integrity, Dock, Deneen T, tglx, linux-mm, dave.hansen,
	nadav.amit, linux-security-module, x86, akpm, hpa, kristen,
	mingo, linux_dti, luto, kernel-hardening

On Tue, Feb 19, 2019 at 07:42:53PM +0000, Edgecombe, Rick P wrote:
> So to capture both of those intentions, maybe I'll slightly tweak your
> suggestion to VM_FLUSH_RESET_PERMS?

Yeah, sure, better.

VM_HAS_SPECIAL_PERMS doesn't tell me what those special permissions are,
while "flush and reset permissions" makes a lot more sense, thx.

> I had thought it was easier to read. If it's not the case, I'll change it as you
> suggest.

My logic is, the fewer local vars, the easier it is to scan the code quickly.

> Ard had expressed interest in having the set_alias_() functions for Arm, and the
> names were chosen to be arch agnostic. He didn't explicitly commit but I was
> under the impression he might create an implementation for ARM and we could
> remove this block.

Yeah, Will has those on his radar too so we should be good here.

Thx.
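
For illustration only, a caller-side sketch of how such a flag could be
used, assuming it ends up as a vm_struct flag named VM_FLUSH_RESET_PERMS
and is reached via find_vm_area(); this is not the actual patch:

#include <linux/vmalloc.h>
#include <linux/gfp.h>

static void *alloc_exec_sketch(unsigned long size)
{
	void *p = __vmalloc(size, GFP_KERNEL, PAGE_KERNEL_EXEC);

	/* Ask vfree() to flush the TLB and reset the direct map. */
	if (p)
		find_vm_area(p)->flags |= VM_FLUSH_RESET_PERMS;

	return p;
}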

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-02-11 19:42               ` Borislav Petkov
  2019-02-11 20:32                 ` Nadav Amit
@ 2019-03-07  7:29                 ` Borislav Petkov
  2019-03-07 16:53                   ` hpa
  1 sibling, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-03-07  7:29 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

On Mon, Feb 11, 2019 at 08:42:51PM +0100, Borislav Petkov wrote:
> On Mon, Feb 11, 2019 at 11:27:03AM -0800, Nadav Amit wrote:
> > Is there any comment over static_cpu_has()? ;-)
> 
> Almost:
> 
> /*
>  * Static testing of CPU features.  Used the same as boot_cpu_has().
>  * These will statically patch the target code for additional
>  * performance.
>  */
> static __always_inline __pure bool _static_cpu_has(u16 bit)

Ok, I guess something like that along with converting the obvious slow
path callers to boot_cpu_has():

---
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index ce95b8cbd229..e25d11ad7a88 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -155,9 +155,12 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int bit);
 #else
 
 /*
- * Static testing of CPU features.  Used the same as boot_cpu_has().
- * These will statically patch the target code for additional
- * performance.
+ * Static testing of CPU features. Used the same as boot_cpu_has(). It
+ * statically patches the target code for additional performance. Use
+ * static_cpu_has() only in fast paths, where every cycle counts. Which
+ * means that the boot_cpu_has() variant is already fast enough for the
+ * majority of cases and you should stick to using it as it is generally
+ * only two instructions: a RIP-relative MOV and a TEST.
  */
 static __always_inline __pure bool _static_cpu_has(u16 bit)
 {
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index fa2c93cb42a2..c525b053b3b3 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -291,7 +291,7 @@ static inline void copy_xregs_to_kernel_booting(struct xregs_state *xstate)
 
 	WARN_ON(system_state != SYSTEM_BOOTING);
 
-	if (static_cpu_has(X86_FEATURE_XSAVES))
+	if (boot_cpu_has(X86_FEATURE_XSAVES))
 		XSTATE_OP(XSAVES, xstate, lmask, hmask, err);
 	else
 		XSTATE_OP(XSAVE, xstate, lmask, hmask, err);
@@ -313,7 +313,7 @@ static inline void copy_kernel_to_xregs_booting(struct xregs_state *xstate)
 
 	WARN_ON(system_state != SYSTEM_BOOTING);
 
-	if (static_cpu_has(X86_FEATURE_XSAVES))
+	if (boot_cpu_has(X86_FEATURE_XSAVES))
 		XSTATE_OP(XRSTORS, xstate, lmask, hmask, err);
 	else
 		XSTATE_OP(XRSTOR, xstate, lmask, hmask, err);
@@ -528,8 +528,7 @@ static inline void fpregs_activate(struct fpu *fpu)
  *  - switch_fpu_finish() restores the new state as
  *    necessary.
  */
-static inline void
-switch_fpu_prepare(struct fpu *old_fpu, int cpu)
+static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
 {
 	if (static_cpu_has(X86_FEATURE_FPU) && old_fpu->initialized) {
 		if (!copy_fpregs_to_fpstate(old_fpu))
diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index 78778b54f904..a5464b8b6c46 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -175,7 +175,7 @@ static void fixup_cpu_id(struct cpuinfo_x86 *c, int node)
 	this_cpu_write(cpu_llc_id, node);
 
 	/* Account for nodes per socket in multi-core-module processors */
-	if (static_cpu_has(X86_FEATURE_NODEID_MSR)) {
+	if (boot_cpu_has(X86_FEATURE_NODEID_MSR)) {
 		rdmsrl(MSR_FAM10H_NODE_ID, val);
 		nodes = ((val >> 3) & 7) + 1;
 	}
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 804c49493938..64d5aec24203 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -83,7 +83,7 @@ unsigned int aperfmperf_get_khz(int cpu)
 	if (!cpu_khz)
 		return 0;
 
-	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
 		return 0;
 
 	aperfmperf_snapshot_cpu(cpu, ktime_get(), true);
@@ -99,7 +99,7 @@ void arch_freq_prepare_all(void)
 	if (!cpu_khz)
 		return;
 
-	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
 		return;
 
 	for_each_online_cpu(cpu)
@@ -115,7 +115,7 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 	if (!cpu_khz)
 		return 0;
 
-	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
 		return 0;
 
 	if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cb28e98a0659..95a5faf3a6a0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1668,7 +1668,7 @@ static void setup_getcpu(int cpu)
 	unsigned long cpudata = vdso_encode_cpunode(cpu, early_cpu_to_node(cpu));
 	struct desc_struct d = { };
 
-	if (static_cpu_has(X86_FEATURE_RDTSCP))
+	if (boot_cpu_has(X86_FEATURE_RDTSCP))
 		write_rdtscp_aux(cpudata);
 
 	/* Store CPU and node number in limit. */
diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index 8492ef7d9015..3da9a8823e47 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -528,7 +528,7 @@ static void do_inject(void)
 	 * only on the node base core. Refer to D18F3x44[NbMcaToMstCpuEn] for
 	 * Fam10h and later BKDGs.
 	 */
-	if (static_cpu_has(X86_FEATURE_AMD_DCM) &&
+	if (boot_cpu_has(X86_FEATURE_AMD_DCM) &&
 	    b == 4 &&
 	    boot_cpu_data.x86 < 0x17) {
 		toggle_nb_mca_mst_cpu(amd_get_nb_id(cpu));
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 2c8522a39ed5..cb2e49810d68 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -35,11 +35,11 @@ static void show_cpuinfo_misc(struct seq_file *m, struct cpuinfo_x86 *c)
 		   "fpu_exception\t: %s\n"
 		   "cpuid level\t: %d\n"
 		   "wp\t\t: yes\n",
-		   static_cpu_has_bug(X86_BUG_FDIV) ? "yes" : "no",
-		   static_cpu_has_bug(X86_BUG_F00F) ? "yes" : "no",
-		   static_cpu_has_bug(X86_BUG_COMA) ? "yes" : "no",
-		   static_cpu_has(X86_FEATURE_FPU) ? "yes" : "no",
-		   static_cpu_has(X86_FEATURE_FPU) ? "yes" : "no",
+		   boot_cpu_has_bug(X86_BUG_FDIV) ? "yes" : "no",
+		   boot_cpu_has_bug(X86_BUG_F00F) ? "yes" : "no",
+		   boot_cpu_has_bug(X86_BUG_COMA) ? "yes" : "no",
+		   boot_cpu_has(X86_FEATURE_FPU) ? "yes" : "no",
+		   boot_cpu_has(X86_FEATURE_FPU) ? "yes" : "no",
 		   c->cpuid_level);
 }
 #else
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 6135ae8ce036..b2463fcb20a8 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -113,7 +113,7 @@ static void do_sanity_check(struct mm_struct *mm,
 		 * tables.
 		 */
 		WARN_ON(!had_kernel_mapping);
-		if (static_cpu_has(X86_FEATURE_PTI))
+		if (boot_cpu_has(X86_FEATURE_PTI))
 			WARN_ON(!had_user_mapping);
 	} else {
 		/*
@@ -121,7 +121,7 @@ static void do_sanity_check(struct mm_struct *mm,
 		 * Sync the pgd to the usermode tables.
 		 */
 		WARN_ON(had_kernel_mapping);
-		if (static_cpu_has(X86_FEATURE_PTI))
+		if (boot_cpu_has(X86_FEATURE_PTI))
 			WARN_ON(had_user_mapping);
 	}
 }
@@ -156,7 +156,7 @@ static void map_ldt_struct_to_user(struct mm_struct *mm)
 	k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
 	u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
 
-	if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
+	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
 		set_pmd(u_pmd, *k_pmd);
 }
 
@@ -181,7 +181,7 @@ static void map_ldt_struct_to_user(struct mm_struct *mm)
 {
 	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
 
-	if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
+	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
 		set_pgd(kernel_to_user_pgdp(pgd), *pgd);
 }
 
@@ -208,7 +208,7 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 	spinlock_t *ptl;
 	int i, nr_pages;
 
-	if (!static_cpu_has(X86_FEATURE_PTI))
+	if (!boot_cpu_has(X86_FEATURE_PTI))
 		return 0;
 
 	/*
@@ -271,7 +271,7 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
 		return;
 
 	/* LDT map/unmap is only required for PTI */
-	if (!static_cpu_has(X86_FEATURE_PTI))
+	if (!boot_cpu_has(X86_FEATURE_PTI))
 		return;
 
 	nr_pages = DIV_ROUND_UP(ldt->nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
@@ -311,7 +311,7 @@ static void free_ldt_pgtables(struct mm_struct *mm)
 	unsigned long start = LDT_BASE_ADDR;
 	unsigned long end = LDT_END_ADDR;
 
-	if (!static_cpu_has(X86_FEATURE_PTI))
+	if (!boot_cpu_has(X86_FEATURE_PTI))
 		return;
 
 	tlb_gather_mmu(&tlb, mm, start, end);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c0e0101133f3..7bbaa6baf37f 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -121,7 +121,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
 
 void __init native_pv_lock_init(void)
 {
-	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		static_branch_disable(&virt_spin_lock_key);
 }
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 58ac7be52c7a..16a7113e91c5 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -236,7 +236,7 @@ static int get_cpuid_mode(void)
 
 static int set_cpuid_mode(struct task_struct *task, unsigned long cpuid_enabled)
 {
-	if (!static_cpu_has(X86_FEATURE_CPUID_FAULT))
+	if (!boot_cpu_has(X86_FEATURE_CPUID_FAULT))
 		return -ENODEV;
 
 	if (cpuid_enabled)
@@ -666,7 +666,7 @@ static int prefer_mwait_c1_over_halt(const struct cpuinfo_x86 *c)
 	if (c->x86_vendor != X86_VENDOR_INTEL)
 		return 0;
 
-	if (!cpu_has(c, X86_FEATURE_MWAIT) || static_cpu_has_bug(X86_BUG_MONITOR))
+	if (!cpu_has(c, X86_FEATURE_MWAIT) || boot_cpu_has_bug(X86_BUG_MONITOR))
 		return 0;
 
 	return 1;
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 725624b6c0c0..d62ebbc5ec78 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -108,7 +108,7 @@ void __noreturn machine_real_restart(unsigned int type)
 	write_cr3(real_mode_header->trampoline_pgd);
 
 	/* Exiting long mode will fail if CR4.PCIDE is set. */
-	if (static_cpu_has(X86_FEATURE_PCID))
+	if (boot_cpu_has(X86_FEATURE_PCID))
 		cr4_clear_bits(X86_CR4_PCIDE);
 #endif
 
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index a092b6b40c6b..6a38717d179c 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -369,7 +369,7 @@ static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus)
 	preempt_disable();
 	tsk->thread.sp0 += 16;
 
-	if (static_cpu_has(X86_FEATURE_SEP)) {
+	if (boot_cpu_has(X86_FEATURE_SEP)) {
 		tsk->thread.sysenter_cs = 0;
 		refresh_sysenter_cs(&tsk->thread);
 	}
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f13a3a24d360..5ed039bf1b58 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -835,7 +835,7 @@ static void svm_init_erratum_383(void)
 	int err;
 	u64 val;
 
-	if (!static_cpu_has_bug(X86_BUG_AMD_TLB_MMATCH))
+	if (!boot_cpu_has_bug(X86_BUG_AMD_TLB_MMATCH))
 		return;
 
 	/* Use _safe variants to not break nested virtualization */
@@ -889,7 +889,7 @@ static int has_svm(void)
 static void svm_hardware_disable(void)
 {
 	/* Make sure we clean up behind us */
-	if (static_cpu_has(X86_FEATURE_TSCRATEMSR))
+	if (boot_cpu_has(X86_FEATURE_TSCRATEMSR))
 		wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT);
 
 	cpu_svm_disable();
@@ -931,7 +931,7 @@ static int svm_hardware_enable(void)
 
 	wrmsrl(MSR_VM_HSAVE_PA, page_to_pfn(sd->save_area) << PAGE_SHIFT);
 
-	if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+	if (boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
 		wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT);
 		__this_cpu_write(current_tsc_ratio, TSC_RATIO_DEFAULT);
 	}
@@ -2247,7 +2247,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
 		rdmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
 
-	if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+	if (boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
 		u64 tsc_ratio = vcpu->arch.tsc_scaling_ratio;
 		if (tsc_ratio != __this_cpu_read(current_tsc_ratio)) {
 			__this_cpu_write(current_tsc_ratio, tsc_ratio);
@@ -2255,7 +2255,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		}
 	}
 	/* This assumes that the kernel never uses MSR_TSC_AUX */
-	if (static_cpu_has(X86_FEATURE_RDTSCP))
+	if (boot_cpu_has(X86_FEATURE_RDTSCP))
 		wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
 
 	if (sd->current_vmcb != svm->vmcb) {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 30a6bcd735ec..0ec24853a0e6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6553,7 +6553,7 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
 		vmx_set_interrupt_shadow(vcpu, 0);
 
-	if (static_cpu_has(X86_FEATURE_PKU) &&
+	if (boot_cpu_has(X86_FEATURE_PKU) &&
 	    kvm_read_cr4_bits(vcpu, X86_CR4_PKE) &&
 	    vcpu->arch.pkru != vmx->host_pkru)
 		__write_pkru(vcpu->arch.pkru);
@@ -6633,7 +6633,7 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	 * back on host, so it is safe to read guest PKRU from current
 	 * XSAVE.
 	 */
-	if (static_cpu_has(X86_FEATURE_PKU) &&
+	if (boot_cpu_has(X86_FEATURE_PKU) &&
 	    kvm_read_cr4_bits(vcpu, X86_CR4_PKE)) {
 		vcpu->arch.pkru = __read_pkru();
 		if (vcpu->arch.pkru != vmx->host_pkru)
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index e3cdc85ce5b6..b596ac1eed1c 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -579,7 +579,7 @@ void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
 {
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
-	if (user && static_cpu_has(X86_FEATURE_PTI))
+	if (user && boot_cpu_has(X86_FEATURE_PTI))
 		pgd = kernel_to_user_pgdp(pgd);
 #endif
 	ptdump_walk_pgd_level_core(m, pgd, false, false);
@@ -592,7 +592,7 @@ void ptdump_walk_user_pgd_level_checkwx(void)
 	pgd_t *pgd = INIT_PGD;
 
 	if (!(__supported_pte_mask & _PAGE_NX) ||
-	    !static_cpu_has(X86_FEATURE_PTI))
+	    !boot_cpu_has(X86_FEATURE_PTI))
 		return;
 
 	pr_info("x86/mm: Checking user space page tables\n");
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7bd01709a091..3dbf440d4114 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -190,7 +190,7 @@ static void pgd_dtor(pgd_t *pgd)
  * when PTI is enabled. We need them to map the per-process LDT into the
  * user-space page-table.
  */
-#define PREALLOCATED_USER_PMDS	 (static_cpu_has(X86_FEATURE_PTI) ? \
+#define PREALLOCATED_USER_PMDS	 (boot_cpu_has(X86_FEATURE_PTI) ? \
 					KERNEL_PGD_PTRS : 0)
 #define MAX_PREALLOCATED_USER_PMDS KERNEL_PGD_PTRS
 
@@ -292,7 +292,7 @@ static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)
 
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 
-	if (!static_cpu_has(X86_FEATURE_PTI))
+	if (!boot_cpu_has(X86_FEATURE_PTI))
 		return;
 
 	pgdp = kernel_to_user_pgdp(pgdp);
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 4fee5c3003ed..8c9a54ebda60 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -626,7 +626,7 @@ void pti_set_kernel_image_nonglobal(void)
  */
 void __init pti_init(void)
 {
-	if (!static_cpu_has(X86_FEATURE_PTI))
+	if (!boot_cpu_has(X86_FEATURE_PTI))
 		return;
 
 	pr_info("enabled\n");
diff --git a/drivers/cpufreq/amd_freq_sensitivity.c b/drivers/cpufreq/amd_freq_sensitivity.c
index 4ac7c3cf34be..6927a8c0e748 100644
--- a/drivers/cpufreq/amd_freq_sensitivity.c
+++ b/drivers/cpufreq/amd_freq_sensitivity.c
@@ -124,7 +124,7 @@ static int __init amd_freq_sensitivity_init(void)
 			PCI_DEVICE_ID_AMD_KERNCZ_SMBUS, NULL);
 
 	if (!pcidev) {
-		if (!static_cpu_has(X86_FEATURE_PROC_FEEDBACK))
+		if (!boot_cpu_has(X86_FEATURE_PROC_FEEDBACK))
 			return -ENODEV;
 	}
 
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index dd66decf2087..9bbc3dfdebe3 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -520,7 +520,7 @@ static s16 intel_pstate_get_epb(struct cpudata *cpu_data)
 	u64 epb;
 	int ret;
 
-	if (!static_cpu_has(X86_FEATURE_EPB))
+	if (!boot_cpu_has(X86_FEATURE_EPB))
 		return -ENXIO;
 
 	ret = rdmsrl_on_cpu(cpu_data->cpu, MSR_IA32_ENERGY_PERF_BIAS, &epb);
@@ -534,7 +534,7 @@ static s16 intel_pstate_get_epp(struct cpudata *cpu_data, u64 hwp_req_data)
 {
 	s16 epp;
 
-	if (static_cpu_has(X86_FEATURE_HWP_EPP)) {
+	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
 		/*
 		 * When hwp_req_data is 0, means that caller didn't read
 		 * MSR_HWP_REQUEST, so need to read and get EPP.
@@ -559,7 +559,7 @@ static int intel_pstate_set_epb(int cpu, s16 pref)
 	u64 epb;
 	int ret;
 
-	if (!static_cpu_has(X86_FEATURE_EPB))
+	if (!boot_cpu_has(X86_FEATURE_EPB))
 		return -ENXIO;
 
 	ret = rdmsrl_on_cpu(cpu, MSR_IA32_ENERGY_PERF_BIAS, &epb);
@@ -607,7 +607,7 @@ static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data)
 	if (epp < 0)
 		return epp;
 
-	if (static_cpu_has(X86_FEATURE_HWP_EPP)) {
+	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
 		if (epp == HWP_EPP_PERFORMANCE)
 			return 1;
 		if (epp <= HWP_EPP_BALANCE_PERFORMANCE)
@@ -616,7 +616,7 @@ static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data)
 			return 3;
 		else
 			return 4;
-	} else if (static_cpu_has(X86_FEATURE_EPB)) {
+	} else if (boot_cpu_has(X86_FEATURE_EPB)) {
 		/*
 		 * Range:
 		 *	0x00-0x03	:	Performance
@@ -644,7 +644,7 @@ static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data,
 
 	mutex_lock(&intel_pstate_limits_lock);
 
-	if (static_cpu_has(X86_FEATURE_HWP_EPP)) {
+	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
 		u64 value;
 
 		ret = rdmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, &value);
@@ -819,7 +819,7 @@ static void intel_pstate_hwp_set(unsigned int cpu)
 		epp = cpu_data->epp_powersave;
 	}
 update_epp:
-	if (static_cpu_has(X86_FEATURE_HWP_EPP)) {
+	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
 		value &= ~GENMASK_ULL(31, 24);
 		value |= (u64)epp << 24;
 	} else {
@@ -844,7 +844,7 @@ static void intel_pstate_hwp_force_min_perf(int cpu)
 	value |= HWP_MIN_PERF(min_perf);
 
 	/* Set EPP/EPB to min */
-	if (static_cpu_has(X86_FEATURE_HWP_EPP))
+	if (boot_cpu_has(X86_FEATURE_HWP_EPP))
 		value |= HWP_ENERGY_PERF_PREFERENCE(HWP_EPP_POWERSAVE);
 	else
 		intel_pstate_set_epb(cpu, HWP_EPP_BALANCE_POWERSAVE);
@@ -1191,7 +1191,7 @@ static void __init intel_pstate_sysfs_expose_params(void)
 static void intel_pstate_hwp_enable(struct cpudata *cpudata)
 {
 	/* First disable HWP notification interrupt as we don't process them */
-	if (static_cpu_has(X86_FEATURE_HWP_NOTIFY))
+	if (boot_cpu_has(X86_FEATURE_HWP_NOTIFY))
 		wrmsrl_on_cpu(cpudata->cpu, MSR_HWP_INTERRUPT, 0x00);
 
 	wrmsrl_on_cpu(cpudata->cpu, MSR_PM_ENABLE, 0x1);
diff --git a/drivers/cpufreq/powernow-k8.c b/drivers/cpufreq/powernow-k8.c
index fb77b39a4ce3..3c12e03fa343 100644
--- a/drivers/cpufreq/powernow-k8.c
+++ b/drivers/cpufreq/powernow-k8.c
@@ -1178,7 +1178,7 @@ static int powernowk8_init(void)
 	unsigned int i, supported_cpus = 0;
 	int ret;
 
-	if (static_cpu_has(X86_FEATURE_HW_PSTATE)) {
+	if (boot_cpu_has(X86_FEATURE_HW_PSTATE)) {
 		__request_acpi_cpufreq();
 		return -ENODEV;
 	}

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH] x86/cpufeature: Remove __pure attribute to _static_cpu_has()
  2019-02-11 20:32                 ` Nadav Amit
@ 2019-03-07 15:10                   ` Borislav Petkov
  2019-03-07 16:43                     ` hpa
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-03-07 15:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

On Mon, Feb 11, 2019 at 12:32:41PM -0800, Nadav Amit wrote:
> BTW: the “__pure” attribute is useless when “__always_inline” is used.
> Unless it is intended to be some sort of comment, of course.

---
From: Borislav Petkov <bp@suse.de>
Date: Thu, 7 Mar 2019 15:54:51 +0100

__pure is used to make gcc do Common Subexpression Elimination (CSE)
and thus save subsequent invocations of a function which does a complex
computation (without side effects). As a simple example:

  bool a = _static_cpu_has(x);
  bool b = _static_cpu_has(x);

gets turned into

  bool a = _static_cpu_has(x);
  bool b = a;

However, gcc doesn't do CSE with asm()s when those get inlined - like it
is done with _static_cpu_has() - because, for example, the t_yes/t_no
labels are different for each inlined function body and thus cannot be
detected as equivalent anymore for the CSE heuristic to hit.

However, this all is beside the point because it is best to avoid having
more than one call to _static_cpu_has(X) in the same function in the
first place: each such call is an alternatives patch site, so duplicating
it is simply pointless.

Therefore, drop the __pure attribute as it is not doing anything.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
---
 arch/x86/include/asm/cpufeature.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index e25d11ad7a88..6d6d5cc4302b 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -162,7 +162,7 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int bit);
  * majority of cases and you should stick to using it as it is generally
  * only two instructions: a RIP-relative MOV and a TEST.
  */
-static __always_inline __pure bool _static_cpu_has(u16 bit)
+static __always_inline bool _static_cpu_has(u16 bit)
 {
 	asm_volatile_goto("1: jmp 6f\n"
 		 "2:\n"
-- 
2.21.0

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH] x86/cpufeature: Remove __pure attribute to _static_cpu_has()
  2019-03-07 15:10                   ` [PATCH] x86/cpufeature: Remove __pure attribute to _static_cpu_has() Borislav Petkov
@ 2019-03-07 16:43                     ` hpa
  2019-03-07 17:02                       ` Borislav Petkov
  0 siblings, 1 reply; 71+ messages in thread
From: hpa @ 2019-03-07 16:43 UTC (permalink / raw)
  To: Borislav Petkov, Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	Thomas Gleixner, Dave Hansen, Peter Zijlstra, Damian Tometzki,
	linux-integrity, LSM List, Andrew Morton, Kernel Hardening,
	Linux-MM, Will Deacon, Ard Biesheuvel, Kristen Carlson Accardi,
	Dock, Deneen T, Kees Cook, Dave Hansen, Masami Hiramatsu

On March 7, 2019 7:10:36 AM PST, Borislav Petkov <bp@alien8.de> wrote:
>On Mon, Feb 11, 2019 at 12:32:41PM -0800, Nadav Amit wrote:
>> BTW: the “__pure” attribute is useless when “__always_inline” is
>used.
>> Unless it is intended to be some sort of comment, of course.
>
>---
>From: Borislav Petkov <bp@suse.de>
>Date: Thu, 7 Mar 2019 15:54:51 +0100
>
>__pure is used to make gcc do Common Subexpression Elimination (CSE)
>and thus save subsequent invocations of a function which does a complex
>computation (without side effects). As a simple example:
>
>  bool a = _static_cpu_has(x);
>  bool b = _static_cpu_has(x);
>
>gets turned into
>
>  bool a = _static_cpu_has(x);
>  bool b = a;
>
>However, gcc doesn't do CSE with asm()s when those get inlined - like
>it
>is done with _static_cpu_has() - because, for example, the t_yes/t_no
>labels are different for each inlined function body and thus cannot be
>detected as equivalent anymore for the CSE heuristic to hit.
>
>However, this all is beside the point because it is best to avoid
>having more than one call to _static_cpu_has(X) in the same function in
>the first place: each such call is an alternatives patch site, so
>duplicating it is simply pointless.
>
>Therefore, drop the __pure attribute as it is not doing anything.
>
>Reported-by: Nadav Amit <nadav.amit@gmail.com>
>Signed-off-by: Borislav Petkov <bp@suse.de>
>Cc: Peter Zijlstra <peterz@infradead.org>
>Cc: x86@kernel.org
>---
> arch/x86/include/asm/cpufeature.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/arch/x86/include/asm/cpufeature.h
>b/arch/x86/include/asm/cpufeature.h
>index e25d11ad7a88..6d6d5cc4302b 100644
>--- a/arch/x86/include/asm/cpufeature.h
>+++ b/arch/x86/include/asm/cpufeature.h
>@@ -162,7 +162,7 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c,
>unsigned int bit);
>* majority of cases and you should stick to using it as it is generally
>  * only two instructions: a RIP-relative MOV and a TEST.
>  */
>-static __always_inline __pure bool _static_cpu_has(u16 bit)
>+static __always_inline bool _static_cpu_has(u16 bit)
> {
> 	asm_volatile_goto("1: jmp 6f\n"
> 		 "2:\n"

Uhm... (a) it is correct, even if the compiler doesn't use it now, it allows the compiler to CSE it in the future; (b) it is documentation; (c) there is an actual bug here: the "volatile" implies a side effect, which in reality is not present, inhibiting CSE.

So the correct fix is to remove "volatile", not remove "__pure".
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-03-07  7:29                 ` [PATCH v2 10/20] x86: avoid W^X being broken during modules loading Borislav Petkov
@ 2019-03-07 16:53                   ` hpa
  2019-03-07 17:06                     ` Borislav Petkov
  0 siblings, 1 reply; 71+ messages in thread
From: hpa @ 2019-03-07 16:53 UTC (permalink / raw)
  To: Borislav Petkov, Nadav Amit
  Cc: Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML, X86 ML,
	Thomas Gleixner, Dave Hansen, Peter Zijlstra, Damian Tometzki,
	linux-integrity, LSM List, Andrew Morton, Kernel Hardening,
	Linux-MM, Will Deacon, Ard Biesheuvel, Kristen Carlson Accardi,
	Dock, Deneen T, Kees Cook, Dave Hansen, Masami Hiramatsu

On March 6, 2019 11:29:47 PM PST, Borislav Petkov <bp@alien8.de> wrote:
>On Mon, Feb 11, 2019 at 08:42:51PM +0100, Borislav Petkov wrote:
>> On Mon, Feb 11, 2019 at 11:27:03AM -0800, Nadav Amit wrote:
>> > Is there any comment over static_cpu_has()? ;-)
>> 
>> Almost:
>> 
>> /*
>>  * Static testing of CPU features.  Used the same as boot_cpu_has().
>>  * These will statically patch the target code for additional
>>  * performance.
>>  */
>> static __always_inline __pure bool _static_cpu_has(u16 bit)
>
>Ok, I guess something like that along with converting the obvious slow
>path callers to boot_cpu_has():
>
>---
>diff --git a/arch/x86/include/asm/cpufeature.h
>b/arch/x86/include/asm/cpufeature.h
>index ce95b8cbd229..e25d11ad7a88 100644
>--- a/arch/x86/include/asm/cpufeature.h
>+++ b/arch/x86/include/asm/cpufeature.h
>@@ -155,9 +155,12 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c,
>unsigned int bit);
> #else
> 
> /*
>- * Static testing of CPU features.  Used the same as boot_cpu_has().
>- * These will statically patch the target code for additional
>- * performance.
>+ * Static testing of CPU features. Used the same as boot_cpu_has(). It
>+ * statically patches the target code for additional performance. Use
>+ * static_cpu_has() only in fast paths, where every cycle counts.
>Which
>+ * means that the boot_cpu_has() variant is already fast enough for
>the
>+ * majority of cases and you should stick to using it as it is
>generally
>+ * only two instructions: a RIP-relative MOV and a TEST.
>  */
> static __always_inline __pure bool _static_cpu_has(u16 bit)
> {
>diff --git a/arch/x86/include/asm/fpu/internal.h
>b/arch/x86/include/asm/fpu/internal.h
>index fa2c93cb42a2..c525b053b3b3 100644
>--- a/arch/x86/include/asm/fpu/internal.h
>+++ b/arch/x86/include/asm/fpu/internal.h
>@@ -291,7 +291,7 @@ static inline void
>copy_xregs_to_kernel_booting(struct xregs_state *xstate)
> 
> 	WARN_ON(system_state != SYSTEM_BOOTING);
> 
>-	if (static_cpu_has(X86_FEATURE_XSAVES))
>+	if (boot_cpu_has(X86_FEATURE_XSAVES))
> 		XSTATE_OP(XSAVES, xstate, lmask, hmask, err);
> 	else
> 		XSTATE_OP(XSAVE, xstate, lmask, hmask, err);
>@@ -313,7 +313,7 @@ static inline void
>copy_kernel_to_xregs_booting(struct xregs_state *xstate)
> 
> 	WARN_ON(system_state != SYSTEM_BOOTING);
> 
>-	if (static_cpu_has(X86_FEATURE_XSAVES))
>+	if (boot_cpu_has(X86_FEATURE_XSAVES))
> 		XSTATE_OP(XRSTORS, xstate, lmask, hmask, err);
> 	else
> 		XSTATE_OP(XRSTOR, xstate, lmask, hmask, err);
>@@ -528,8 +528,7 @@ static inline void fpregs_activate(struct fpu *fpu)
>  *  - switch_fpu_finish() restores the new state as
>  *    necessary.
>  */
>-static inline void
>-switch_fpu_prepare(struct fpu *old_fpu, int cpu)
>+static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
> {
> 	if (static_cpu_has(X86_FEATURE_FPU) && old_fpu->initialized) {
> 		if (!copy_fpregs_to_fpstate(old_fpu))
>diff --git a/arch/x86/kernel/apic/apic_numachip.c
>b/arch/x86/kernel/apic/apic_numachip.c
>index 78778b54f904..a5464b8b6c46 100644
>--- a/arch/x86/kernel/apic/apic_numachip.c
>+++ b/arch/x86/kernel/apic/apic_numachip.c
>@@ -175,7 +175,7 @@ static void fixup_cpu_id(struct cpuinfo_x86 *c, int
>node)
> 	this_cpu_write(cpu_llc_id, node);
> 
> 	/* Account for nodes per socket in multi-core-module processors */
>-	if (static_cpu_has(X86_FEATURE_NODEID_MSR)) {
>+	if (boot_cpu_has(X86_FEATURE_NODEID_MSR)) {
> 		rdmsrl(MSR_FAM10H_NODE_ID, val);
> 		nodes = ((val >> 3) & 7) + 1;
> 	}
>diff --git a/arch/x86/kernel/cpu/aperfmperf.c
>b/arch/x86/kernel/cpu/aperfmperf.c
>index 804c49493938..64d5aec24203 100644
>--- a/arch/x86/kernel/cpu/aperfmperf.c
>+++ b/arch/x86/kernel/cpu/aperfmperf.c
>@@ -83,7 +83,7 @@ unsigned int aperfmperf_get_khz(int cpu)
> 	if (!cpu_khz)
> 		return 0;
> 
>-	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
>+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> 		return 0;
> 
> 	aperfmperf_snapshot_cpu(cpu, ktime_get(), true);
>@@ -99,7 +99,7 @@ void arch_freq_prepare_all(void)
> 	if (!cpu_khz)
> 		return;
> 
>-	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
>+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> 		return;
> 
> 	for_each_online_cpu(cpu)
>@@ -115,7 +115,7 @@ unsigned int arch_freq_get_on_cpu(int cpu)
> 	if (!cpu_khz)
> 		return 0;
> 
>-	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
>+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> 		return 0;
> 
> 	if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
>diff --git a/arch/x86/kernel/cpu/common.c
>b/arch/x86/kernel/cpu/common.c
>index cb28e98a0659..95a5faf3a6a0 100644
>--- a/arch/x86/kernel/cpu/common.c
>+++ b/arch/x86/kernel/cpu/common.c
>@@ -1668,7 +1668,7 @@ static void setup_getcpu(int cpu)
>	unsigned long cpudata = vdso_encode_cpunode(cpu,
>early_cpu_to_node(cpu));
> 	struct desc_struct d = { };
> 
>-	if (static_cpu_has(X86_FEATURE_RDTSCP))
>+	if (boot_cpu_has(X86_FEATURE_RDTSCP))
> 		write_rdtscp_aux(cpudata);
> 
> 	/* Store CPU and node number in limit. */
>diff --git a/arch/x86/kernel/cpu/mce/inject.c
>b/arch/x86/kernel/cpu/mce/inject.c
>index 8492ef7d9015..3da9a8823e47 100644
>--- a/arch/x86/kernel/cpu/mce/inject.c
>+++ b/arch/x86/kernel/cpu/mce/inject.c
>@@ -528,7 +528,7 @@ static void do_inject(void)
> 	 * only on the node base core. Refer to D18F3x44[NbMcaToMstCpuEn] for
> 	 * Fam10h and later BKDGs.
> 	 */
>-	if (static_cpu_has(X86_FEATURE_AMD_DCM) &&
>+	if (boot_cpu_has(X86_FEATURE_AMD_DCM) &&
> 	    b == 4 &&
> 	    boot_cpu_data.x86 < 0x17) {
> 		toggle_nb_mca_mst_cpu(amd_get_nb_id(cpu));
>diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
>index 2c8522a39ed5..cb2e49810d68 100644
>--- a/arch/x86/kernel/cpu/proc.c
>+++ b/arch/x86/kernel/cpu/proc.c
>@@ -35,11 +35,11 @@ static void show_cpuinfo_misc(struct seq_file *m,
>struct cpuinfo_x86 *c)
> 		   "fpu_exception\t: %s\n"
> 		   "cpuid level\t: %d\n"
> 		   "wp\t\t: yes\n",
>-		   static_cpu_has_bug(X86_BUG_FDIV) ? "yes" : "no",
>-		   static_cpu_has_bug(X86_BUG_F00F) ? "yes" : "no",
>-		   static_cpu_has_bug(X86_BUG_COMA) ? "yes" : "no",
>-		   static_cpu_has(X86_FEATURE_FPU) ? "yes" : "no",
>-		   static_cpu_has(X86_FEATURE_FPU) ? "yes" : "no",
>+		   boot_cpu_has_bug(X86_BUG_FDIV) ? "yes" : "no",
>+		   boot_cpu_has_bug(X86_BUG_F00F) ? "yes" : "no",
>+		   boot_cpu_has_bug(X86_BUG_COMA) ? "yes" : "no",
>+		   boot_cpu_has(X86_FEATURE_FPU) ? "yes" : "no",
>+		   boot_cpu_has(X86_FEATURE_FPU) ? "yes" : "no",
> 		   c->cpuid_level);
> }
> #else
>diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
>index 6135ae8ce036..b2463fcb20a8 100644
>--- a/arch/x86/kernel/ldt.c
>+++ b/arch/x86/kernel/ldt.c
>@@ -113,7 +113,7 @@ static void do_sanity_check(struct mm_struct *mm,
> 		 * tables.
> 		 */
> 		WARN_ON(!had_kernel_mapping);
>-		if (static_cpu_has(X86_FEATURE_PTI))
>+		if (boot_cpu_has(X86_FEATURE_PTI))
> 			WARN_ON(!had_user_mapping);
> 	} else {
> 		/*
>@@ -121,7 +121,7 @@ static void do_sanity_check(struct mm_struct *mm,
> 		 * Sync the pgd to the usermode tables.
> 		 */
> 		WARN_ON(had_kernel_mapping);
>-		if (static_cpu_has(X86_FEATURE_PTI))
>+		if (boot_cpu_has(X86_FEATURE_PTI))
> 			WARN_ON(had_user_mapping);
> 	}
> }
>@@ -156,7 +156,7 @@ static void map_ldt_struct_to_user(struct mm_struct
>*mm)
> 	k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
> 	u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
> 
>-	if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
>+	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
> 		set_pmd(u_pmd, *k_pmd);
> }
> 
>@@ -181,7 +181,7 @@ static void map_ldt_struct_to_user(struct mm_struct
>*mm)
> {
> 	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
> 
>-	if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
>+	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
> 		set_pgd(kernel_to_user_pgdp(pgd), *pgd);
> }
> 
>@@ -208,7 +208,7 @@ map_ldt_struct(struct mm_struct *mm, struct
>ldt_struct *ldt, int slot)
> 	spinlock_t *ptl;
> 	int i, nr_pages;
> 
>-	if (!static_cpu_has(X86_FEATURE_PTI))
>+	if (!boot_cpu_has(X86_FEATURE_PTI))
> 		return 0;
> 
> 	/*
>@@ -271,7 +271,7 @@ static void unmap_ldt_struct(struct mm_struct *mm,
>struct ldt_struct *ldt)
> 		return;
> 
> 	/* LDT map/unmap is only required for PTI */
>-	if (!static_cpu_has(X86_FEATURE_PTI))
>+	if (!boot_cpu_has(X86_FEATURE_PTI))
> 		return;
> 
> 	nr_pages = DIV_ROUND_UP(ldt->nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
>@@ -311,7 +311,7 @@ static void free_ldt_pgtables(struct mm_struct *mm)
> 	unsigned long start = LDT_BASE_ADDR;
> 	unsigned long end = LDT_END_ADDR;
> 
>-	if (!static_cpu_has(X86_FEATURE_PTI))
>+	if (!boot_cpu_has(X86_FEATURE_PTI))
> 		return;
> 
> 	tlb_gather_mmu(&tlb, mm, start, end);
>diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>index c0e0101133f3..7bbaa6baf37f 100644
>--- a/arch/x86/kernel/paravirt.c
>+++ b/arch/x86/kernel/paravirt.c
>@@ -121,7 +121,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
> 
> void __init native_pv_lock_init(void)
> {
>-	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
> 		static_branch_disable(&virt_spin_lock_key);
> }
> 
>diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>index 58ac7be52c7a..16a7113e91c5 100644
>--- a/arch/x86/kernel/process.c
>+++ b/arch/x86/kernel/process.c
>@@ -236,7 +236,7 @@ static int get_cpuid_mode(void)
> 
>static int set_cpuid_mode(struct task_struct *task, unsigned long
>cpuid_enabled)
> {
>-	if (!static_cpu_has(X86_FEATURE_CPUID_FAULT))
>+	if (!boot_cpu_has(X86_FEATURE_CPUID_FAULT))
> 		return -ENODEV;
> 
> 	if (cpuid_enabled)
>@@ -666,7 +666,7 @@ static int prefer_mwait_c1_over_halt(const struct
>cpuinfo_x86 *c)
> 	if (c->x86_vendor != X86_VENDOR_INTEL)
> 		return 0;
> 
>-	if (!cpu_has(c, X86_FEATURE_MWAIT) ||
>static_cpu_has_bug(X86_BUG_MONITOR))
>+	if (!cpu_has(c, X86_FEATURE_MWAIT) ||
>boot_cpu_has_bug(X86_BUG_MONITOR))
> 		return 0;
> 
> 	return 1;
>diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
>index 725624b6c0c0..d62ebbc5ec78 100644
>--- a/arch/x86/kernel/reboot.c
>+++ b/arch/x86/kernel/reboot.c
>@@ -108,7 +108,7 @@ void __noreturn machine_real_restart(unsigned int
>type)
> 	write_cr3(real_mode_header->trampoline_pgd);
> 
> 	/* Exiting long mode will fail if CR4.PCIDE is set. */
>-	if (static_cpu_has(X86_FEATURE_PCID))
>+	if (boot_cpu_has(X86_FEATURE_PCID))
> 		cr4_clear_bits(X86_CR4_PCIDE);
> #endif
> 
>diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
>index a092b6b40c6b..6a38717d179c 100644
>--- a/arch/x86/kernel/vm86_32.c
>+++ b/arch/x86/kernel/vm86_32.c
>@@ -369,7 +369,7 @@ static long do_sys_vm86(struct vm86plus_struct
>__user *user_vm86, bool plus)
> 	preempt_disable();
> 	tsk->thread.sp0 += 16;
> 
>-	if (static_cpu_has(X86_FEATURE_SEP)) {
>+	if (boot_cpu_has(X86_FEATURE_SEP)) {
> 		tsk->thread.sysenter_cs = 0;
> 		refresh_sysenter_cs(&tsk->thread);
> 	}
>diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>index f13a3a24d360..5ed039bf1b58 100644
>--- a/arch/x86/kvm/svm.c
>+++ b/arch/x86/kvm/svm.c
>@@ -835,7 +835,7 @@ static void svm_init_erratum_383(void)
> 	int err;
> 	u64 val;
> 
>-	if (!static_cpu_has_bug(X86_BUG_AMD_TLB_MMATCH))
>+	if (!boot_cpu_has_bug(X86_BUG_AMD_TLB_MMATCH))
> 		return;
> 
> 	/* Use _safe variants to not break nested virtualization */
>@@ -889,7 +889,7 @@ static int has_svm(void)
> static void svm_hardware_disable(void)
> {
> 	/* Make sure we clean up behind us */
>-	if (static_cpu_has(X86_FEATURE_TSCRATEMSR))
>+	if (boot_cpu_has(X86_FEATURE_TSCRATEMSR))
> 		wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT);
> 
> 	cpu_svm_disable();
>@@ -931,7 +931,7 @@ static int svm_hardware_enable(void)
> 
> 	wrmsrl(MSR_VM_HSAVE_PA, page_to_pfn(sd->save_area) << PAGE_SHIFT);
> 
>-	if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) {
>+	if (boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
> 		wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT);
> 		__this_cpu_write(current_tsc_ratio, TSC_RATIO_DEFAULT);
> 	}
>@@ -2247,7 +2247,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu,
>int cpu)
> 	for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
> 		rdmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
> 
>-	if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) {
>+	if (boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
> 		u64 tsc_ratio = vcpu->arch.tsc_scaling_ratio;
> 		if (tsc_ratio != __this_cpu_read(current_tsc_ratio)) {
> 			__this_cpu_write(current_tsc_ratio, tsc_ratio);
>@@ -2255,7 +2255,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu,
>int cpu)
> 		}
> 	}
> 	/* This assumes that the kernel never uses MSR_TSC_AUX */
>-	if (static_cpu_has(X86_FEATURE_RDTSCP))
>+	if (boot_cpu_has(X86_FEATURE_RDTSCP))
> 		wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
> 
> 	if (sd->current_vmcb != svm->vmcb) {
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 30a6bcd735ec..0ec24853a0e6 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -6553,7 +6553,7 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
> 	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> 		vmx_set_interrupt_shadow(vcpu, 0);
> 
>-	if (static_cpu_has(X86_FEATURE_PKU) &&
>+	if (boot_cpu_has(X86_FEATURE_PKU) &&
> 	    kvm_read_cr4_bits(vcpu, X86_CR4_PKE) &&
> 	    vcpu->arch.pkru != vmx->host_pkru)
> 		__write_pkru(vcpu->arch.pkru);
>@@ -6633,7 +6633,7 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
> 	 * back on host, so it is safe to read guest PKRU from current
> 	 * XSAVE.
> 	 */
>-	if (static_cpu_has(X86_FEATURE_PKU) &&
>+	if (boot_cpu_has(X86_FEATURE_PKU) &&
> 	    kvm_read_cr4_bits(vcpu, X86_CR4_PKE)) {
> 		vcpu->arch.pkru = __read_pkru();
> 		if (vcpu->arch.pkru != vmx->host_pkru)
>diff --git a/arch/x86/mm/dump_pagetables.c
>b/arch/x86/mm/dump_pagetables.c
>index e3cdc85ce5b6..b596ac1eed1c 100644
>--- a/arch/x86/mm/dump_pagetables.c
>+++ b/arch/x86/mm/dump_pagetables.c
>@@ -579,7 +579,7 @@ void ptdump_walk_pgd_level(struct seq_file *m,
>pgd_t *pgd)
>void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool
>user)
> {
> #ifdef CONFIG_PAGE_TABLE_ISOLATION
>-	if (user && static_cpu_has(X86_FEATURE_PTI))
>+	if (user && boot_cpu_has(X86_FEATURE_PTI))
> 		pgd = kernel_to_user_pgdp(pgd);
> #endif
> 	ptdump_walk_pgd_level_core(m, pgd, false, false);
>@@ -592,7 +592,7 @@ void ptdump_walk_user_pgd_level_checkwx(void)
> 	pgd_t *pgd = INIT_PGD;
> 
> 	if (!(__supported_pte_mask & _PAGE_NX) ||
>-	    !static_cpu_has(X86_FEATURE_PTI))
>+	    !boot_cpu_has(X86_FEATURE_PTI))
> 		return;
> 
> 	pr_info("x86/mm: Checking user space page tables\n");
>diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>index 7bd01709a091..3dbf440d4114 100644
>--- a/arch/x86/mm/pgtable.c
>+++ b/arch/x86/mm/pgtable.c
>@@ -190,7 +190,7 @@ static void pgd_dtor(pgd_t *pgd)
>* when PTI is enabled. We need them to map the per-process LDT into the
>  * user-space page-table.
>  */
>-#define PREALLOCATED_USER_PMDS	 (static_cpu_has(X86_FEATURE_PTI) ? \
>+#define PREALLOCATED_USER_PMDS	 (boot_cpu_has(X86_FEATURE_PTI) ? \
> 					KERNEL_PGD_PTRS : 0)
> #define MAX_PREALLOCATED_USER_PMDS KERNEL_PGD_PTRS
> 
>@@ -292,7 +292,7 @@ static void pgd_mop_up_pmds(struct mm_struct *mm,
>pgd_t *pgdp)
> 
> #ifdef CONFIG_PAGE_TABLE_ISOLATION
> 
>-	if (!static_cpu_has(X86_FEATURE_PTI))
>+	if (!boot_cpu_has(X86_FEATURE_PTI))
> 		return;
> 
> 	pgdp = kernel_to_user_pgdp(pgdp);
>diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>index 4fee5c3003ed..8c9a54ebda60 100644
>--- a/arch/x86/mm/pti.c
>+++ b/arch/x86/mm/pti.c
>@@ -626,7 +626,7 @@ void pti_set_kernel_image_nonglobal(void)
>  */
> void __init pti_init(void)
> {
>-	if (!static_cpu_has(X86_FEATURE_PTI))
>+	if (!boot_cpu_has(X86_FEATURE_PTI))
> 		return;
> 
> 	pr_info("enabled\n");
>diff --git a/drivers/cpufreq/amd_freq_sensitivity.c
>b/drivers/cpufreq/amd_freq_sensitivity.c
>index 4ac7c3cf34be..6927a8c0e748 100644
>--- a/drivers/cpufreq/amd_freq_sensitivity.c
>+++ b/drivers/cpufreq/amd_freq_sensitivity.c
>@@ -124,7 +124,7 @@ static int __init amd_freq_sensitivity_init(void)
> 			PCI_DEVICE_ID_AMD_KERNCZ_SMBUS, NULL);
> 
> 	if (!pcidev) {
>-		if (!static_cpu_has(X86_FEATURE_PROC_FEEDBACK))
>+		if (!boot_cpu_has(X86_FEATURE_PROC_FEEDBACK))
> 			return -ENODEV;
> 	}
> 
>diff --git a/drivers/cpufreq/intel_pstate.c
>b/drivers/cpufreq/intel_pstate.c
>index dd66decf2087..9bbc3dfdebe3 100644
>--- a/drivers/cpufreq/intel_pstate.c
>+++ b/drivers/cpufreq/intel_pstate.c
>@@ -520,7 +520,7 @@ static s16 intel_pstate_get_epb(struct cpudata
>*cpu_data)
> 	u64 epb;
> 	int ret;
> 
>-	if (!static_cpu_has(X86_FEATURE_EPB))
>+	if (!boot_cpu_has(X86_FEATURE_EPB))
> 		return -ENXIO;
> 
> 	ret = rdmsrl_on_cpu(cpu_data->cpu, MSR_IA32_ENERGY_PERF_BIAS, &epb);
>@@ -534,7 +534,7 @@ static s16 intel_pstate_get_epp(struct cpudata
>*cpu_data, u64 hwp_req_data)
> {
> 	s16 epp;
> 
>-	if (static_cpu_has(X86_FEATURE_HWP_EPP)) {
>+	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
> 		/*
> 		 * When hwp_req_data is 0, means that caller didn't read
> 		 * MSR_HWP_REQUEST, so need to read and get EPP.
>@@ -559,7 +559,7 @@ static int intel_pstate_set_epb(int cpu, s16 pref)
> 	u64 epb;
> 	int ret;
> 
>-	if (!static_cpu_has(X86_FEATURE_EPB))
>+	if (!boot_cpu_has(X86_FEATURE_EPB))
> 		return -ENXIO;
> 
> 	ret = rdmsrl_on_cpu(cpu, MSR_IA32_ENERGY_PERF_BIAS, &epb);
>@@ -607,7 +607,7 @@ static int
>intel_pstate_get_energy_pref_index(struct cpudata *cpu_data)
> 	if (epp < 0)
> 		return epp;
> 
>-	if (static_cpu_has(X86_FEATURE_HWP_EPP)) {
>+	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
> 		if (epp == HWP_EPP_PERFORMANCE)
> 			return 1;
> 		if (epp <= HWP_EPP_BALANCE_PERFORMANCE)
>@@ -616,7 +616,7 @@ static int
>intel_pstate_get_energy_pref_index(struct cpudata *cpu_data)
> 			return 3;
> 		else
> 			return 4;
>-	} else if (static_cpu_has(X86_FEATURE_EPB)) {
>+	} else if (boot_cpu_has(X86_FEATURE_EPB)) {
> 		/*
> 		 * Range:
> 		 *	0x00-0x03	:	Performance
>@@ -644,7 +644,7 @@ static int
>intel_pstate_set_energy_pref_index(struct cpudata *cpu_data,
> 
> 	mutex_lock(&intel_pstate_limits_lock);
> 
>-	if (static_cpu_has(X86_FEATURE_HWP_EPP)) {
>+	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
> 		u64 value;
> 
> 		ret = rdmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, &value);
>@@ -819,7 +819,7 @@ static void intel_pstate_hwp_set(unsigned int cpu)
> 		epp = cpu_data->epp_powersave;
> 	}
> update_epp:
>-	if (static_cpu_has(X86_FEATURE_HWP_EPP)) {
>+	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
> 		value &= ~GENMASK_ULL(31, 24);
> 		value |= (u64)epp << 24;
> 	} else {
>@@ -844,7 +844,7 @@ static void intel_pstate_hwp_force_min_perf(int
>cpu)
> 	value |= HWP_MIN_PERF(min_perf);
> 
> 	/* Set EPP/EPB to min */
>-	if (static_cpu_has(X86_FEATURE_HWP_EPP))
>+	if (boot_cpu_has(X86_FEATURE_HWP_EPP))
> 		value |= HWP_ENERGY_PERF_PREFERENCE(HWP_EPP_POWERSAVE);
> 	else
> 		intel_pstate_set_epb(cpu, HWP_EPP_BALANCE_POWERSAVE);
>@@ -1191,7 +1191,7 @@ static void __init
>intel_pstate_sysfs_expose_params(void)
> static void intel_pstate_hwp_enable(struct cpudata *cpudata)
> {
>	/* First disable HWP notification interrupt as we don't process them
>*/
>-	if (static_cpu_has(X86_FEATURE_HWP_NOTIFY))
>+	if (boot_cpu_has(X86_FEATURE_HWP_NOTIFY))
> 		wrmsrl_on_cpu(cpudata->cpu, MSR_HWP_INTERRUPT, 0x00);
> 
> 	wrmsrl_on_cpu(cpudata->cpu, MSR_PM_ENABLE, 0x1);
>diff --git a/drivers/cpufreq/powernow-k8.c
>b/drivers/cpufreq/powernow-k8.c
>index fb77b39a4ce3..3c12e03fa343 100644
>--- a/drivers/cpufreq/powernow-k8.c
>+++ b/drivers/cpufreq/powernow-k8.c
>@@ -1178,7 +1178,7 @@ static int powernowk8_init(void)
> 	unsigned int i, supported_cpus = 0;
> 	int ret;
> 
>-	if (static_cpu_has(X86_FEATURE_HW_PSTATE)) {
>+	if (boot_cpu_has(X86_FEATURE_HW_PSTATE)) {
> 		__request_acpi_cpufreq();
> 		return -ENODEV;
> 	}

I'm confused here, and as I'm on my phone on an aircraft I can't check the back thread, but am I gathering that this tries to unbreak W^X during module loading by removing functions which use alternatives?

The right thing to do is to apply alternatives before the pages are marked +x-w, like we do with the kernel proper during early boot, if this isn't already the case (sorry again, see above.)

If we *do*, what is the issue here? Although boot_cpu_has() isn't slow (it should in general be possible to reduce to one testb instruction followed by a conditional jump)  it seems that "avoiding an alternatives slot" *should* be a *very* weak reason, and seems to me to look like papering over some other problem.
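
To make the ordering I mean concrete, a simplified sketch (the real
module loader passes explicit alt_instr ranges and page counts; the
helper names are the existing kernel ones, the function itself is made
up):

#include <asm/alternative.h>	/* apply_alternatives() */
#include <asm/set_memory.h>	/* set_memory_ro(), set_memory_x() */

static void finalize_module_text_sketch(void *base, int npages,
					struct alt_instr *start,
					struct alt_instr *end)
{
	apply_alternatives(start, end);			/* text is still RW here */
	set_memory_ro((unsigned long)base, npages);	/* then lock it down */
	set_memory_x((unsigned long)base, npages);
}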


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH] x86/cpufeature: Remove __pure attribute to _static_cpu_has()
  2019-03-07 16:43                     ` hpa
@ 2019-03-07 17:02                       ` Borislav Petkov
  0 siblings, 0 replies; 71+ messages in thread
From: Borislav Petkov @ 2019-03-07 17:02 UTC (permalink / raw)
  To: hpa
  Cc: Nadav Amit, Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML,
	X86 ML, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu, Michael Matz

Lemme preface this by saying that I've talked to gcc guys before doing
this.

On Thu, Mar 07, 2019 at 08:43:50AM -0800, hpa@zytor.com wrote:
> Uhm... (a) it is correct, even if the compiler doesn't use it now, it
> allows the compiler to CSE it in the future;

Well, the compiler won't CSE asm blocks due to the difference in the
labels, for example, so the heuristic won't detect them as equivalent
blocks.

Also, the compiler guys said that they might consider inlining pure
functions later, in the IPA stage, but that's future stuff.

This is how I understood it, at least.

> (b) it is documentation;

That could be a comment instead. Otherwise we will wonder again why this
is marked pure.

> (c) there is an actual bug here: the "volatile" implies a side effect,
> which in reality is not present, inhibiting CSE.
>
> So the correct fix is to remove "volatile", not remove "__pure".

There's not really a volatile there:

/*
 * GCC 'asm goto' miscompiles certain code sequences:
 *
 *   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670
 *
 * Work it around via a compiler barrier quirk suggested by Jakub Jelinek.
 *
 * (asm goto is automatically volatile - the naming reflects this.)
 */
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-03-07 16:53                   ` hpa
@ 2019-03-07 17:06                     ` Borislav Petkov
  2019-03-07 20:02                       ` Andy Lutomirski
  0 siblings, 1 reply; 71+ messages in thread
From: Borislav Petkov @ 2019-03-07 17:06 UTC (permalink / raw)
  To: hpa
  Cc: Nadav Amit, Rick Edgecombe, Andy Lutomirski, Ingo Molnar, LKML,
	X86 ML, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

On Thu, Mar 07, 2019 at 08:53:34AM -0800, hpa@zytor.com wrote:
> If we *do*, what is the issue here? Although boot_cpu_has() isn't
> slow (it should in general be possible to reduce to one testb
> instruction followed by a conditional jump) it seems that "avoiding an
> alternatives slot" *should* be a *very* weak reason, and seems to me
> to look like papering over some other problem.

Forget the current thread: this is simply trying to document when to use
static_cpu_has() and when to use boot_cpu_has(). I get asked about it at
least once a month.

And then it is replacing clear slow paths using static_cpu_has() with
boot_cpu_has() because there's purely no need to patch there. And having
a RIP-relative MOV and a JMP is good enough for slow paths.

Makes sense?
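
Roughly, the split I have in mind (sketch only; do_avx2_path() and
do_generic_path() are placeholders):

#include <linux/init.h>
#include <linux/errno.h>
#include <asm/cpufeature.h>

static void do_avx2_path(void) { }	/* placeholder */
static void do_generic_path(void) { }	/* placeholder */

static void hot_loop_sketch(void)
{
	/* Fast path: alternatives patch the test, zero cost after boot. */
	if (static_cpu_has(X86_FEATURE_AVX2))
		do_avx2_path();
	else
		do_generic_path();
}

static int __init some_driver_init_sketch(void)
{
	/* Slow path, runs once: a RIP-relative MOV + TEST is plenty. */
	if (!boot_cpu_has(X86_FEATURE_AVX2))
		return -ENODEV;

	return 0;
}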

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-03-07 17:06                     ` Borislav Petkov
@ 2019-03-07 20:02                       ` Andy Lutomirski
  2019-03-07 20:25                         ` Borislav Petkov
  0 siblings, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2019-03-07 20:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Nadav Amit, Rick Edgecombe, Andy Lutomirski,
	Ingo Molnar, LKML, X86 ML, Thomas Gleixner, Dave Hansen,
	Peter Zijlstra, Damian Tometzki, linux-integrity, LSM List,
	Andrew Morton, Kernel Hardening, Linux-MM, Will Deacon,
	Ard Biesheuvel, Kristen Carlson Accardi, Dock, Deneen T,
	Kees Cook, Dave Hansen, Masami Hiramatsu

On Thu, Mar 7, 2019 at 9:06 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Mar 07, 2019 at 08:53:34AM -0800, hpa@zytor.com wrote:
> > If we *do*, what is the issue here? Although boot_cpu_has() isn't
> > slow (it should in general be possible to reduce to one testb
> > instruction followed by a conditional jump) it seems that "avoiding an
> > alternatives slot" *should* be a *very* weak reason, and seems to me
> > to look like papering over some other problem.
>
> Forget the current thread: this is simply trying to document when to use
> static_cpu_has() and when to use boot_cpu_has(). I get asked about it at
> least once a month.
>
> And then it is replacing clear slow paths using static_cpu_has() with
> boot_cpu_has() because there's purely no need to patch there. And having
> a RIP-relative MOV and a JMP is good enough for slow paths.
>

Should we maybe rename these functions?  static_cpu_has() is at least
reasonably obvious.  But cpu_feature_enabled() is different for
reasons I've never understood, and boot_cpu_has() is IMO terribly
named.  It's not about the boot cpu -- it's about doing the same thing
but with less bloat and less performance.

(And can we maybe collapse cpu_feature_enabled() and static_cpu_has()
into the same function?)

--Andy

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/20] x86: avoid W^X being broken during modules loading
  2019-03-07 20:02                       ` Andy Lutomirski
@ 2019-03-07 20:25                         ` Borislav Petkov
  0 siblings, 0 replies; 71+ messages in thread
From: Borislav Petkov @ 2019-03-07 20:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Nadav Amit, Rick Edgecombe, Ingo Molnar, LKML,
	X86 ML, Thomas Gleixner, Dave Hansen, Peter Zijlstra,
	Damian Tometzki, linux-integrity, LSM List, Andrew Morton,
	Kernel Hardening, Linux-MM, Will Deacon, Ard Biesheuvel,
	Kristen Carlson Accardi, Dock, Deneen T, Kees Cook, Dave Hansen,
	Masami Hiramatsu

On Thu, Mar 07, 2019 at 12:02:13PM -0800, Andy Lutomirski wrote:
> Should we maybe rename these functions?  static_cpu_has() is at least
> reasonably obvious.  But cpu_feature_enabled() is different for
> reasons I've never understood, and boot_cpu_has() is IMO terribly
> named.  It's not about the boot CPU -- it's about doing the same check
> with less bloat but lower performance.

Well, it does test bits in boot_cpu_data. I don't care about "boot" in
the name, though, so feel free to suggest something better.
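
(For reference, paraphrasing rather than quoting asm/cpufeature.h: the
check is an ordinary runtime bit test against the capability words
recorded for the boot CPU, with no code patching involved. Roughly:

/* sketch, not the literal header definition */
#define boot_cpu_has(bit)	cpu_has(&boot_cpu_data, (bit))

so the "boot" really only refers to where the bits are stored.)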

> (And can we maybe collapse cpu_feature_enabled() and static_cpu_has()
> into the same function?)

I'm not sure it would always be OK to involve the DISABLED_MASK*
build-time stuff in the checks. It probably is, but it would need
careful auditing first to be sure.
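
Roughly what "involving the DISABLED_MASK* build-time stuff" means, as
a paraphrase with a made-up name rather than the exact kernel macro:

/*
 * If Kconfig compiles the feature out, its bit is set in the
 * DISABLED_MASK* words and the whole check folds to 0 at build time;
 * otherwise fall back to the alternatives-patched static_cpu_has().
 */
#define my_feature_enabled(bit)						\
	(__builtin_constant_p(bit) && DISABLED_MASK_BIT_SET(bit) ?	\
	 0 : static_cpu_has(bit))

Folding checks to constant 0 for configured-out features is the part
that every current static_cpu_has() user would need to be audited for.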

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2019-03-07 20:25 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-29  0:34 [PATCH v2 00/20] Merge text_poke fixes and executable lockdowns Rick Edgecombe
2019-01-29  0:34 ` [PATCH v2 01/20] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()" Rick Edgecombe
2019-01-29  0:34 ` [PATCH v2 02/20] x86/jump_label: Use text_poke_early() during early init Rick Edgecombe
2019-01-29  0:34 ` [PATCH v2 03/20] x86/mm: temporary mm struct Rick Edgecombe
2019-01-31 11:29   ` Borislav Petkov
2019-01-31 22:19     ` Nadav Amit
2019-02-01  0:08       ` Borislav Petkov
2019-02-01  0:25         ` Nadav Amit
2019-02-04 14:28       ` Borislav Petkov
2019-01-29  0:34 ` [PATCH v2 04/20] fork: provide a function for copying init_mm Rick Edgecombe
2019-02-05  8:53   ` Borislav Petkov
2019-02-05  9:03     ` Nadav Amit
2019-01-29  0:34 ` [PATCH v2 05/20] x86/alternative: initializing temporary mm for patching Rick Edgecombe
2019-02-05  9:18   ` Borislav Petkov
2019-02-11  0:39   ` Nadav Amit
2019-02-11  5:18     ` Andy Lutomirski
2019-02-11 18:04       ` Nadav Amit
2019-02-11 19:07         ` Andy Lutomirski
2019-02-11 19:18           ` Nadav Amit
2019-02-11 22:47             ` Andy Lutomirski
2019-02-12 18:23               ` Nadav Amit
2019-01-29  0:34 ` [PATCH v2 06/20] x86/alternative: use temporary mm for text poking Rick Edgecombe
2019-02-05  9:58   ` Borislav Petkov
2019-02-05 11:31     ` Peter Zijlstra
2019-02-05 12:35       ` Borislav Petkov
2019-02-05 13:25         ` Peter Zijlstra
2019-02-05 17:54         ` Nadav Amit
2019-02-05 13:29       ` Peter Zijlstra
2019-01-29  0:34 ` [PATCH v2 07/20] x86/kgdb: avoid redundant comparison of patched code Rick Edgecombe
2019-01-29  0:34 ` [PATCH v2 08/20] x86/ftrace: set trampoline pages as executable Rick Edgecombe
2019-01-29  0:34 ` [PATCH v2 09/20] x86/kprobes: instruction pages initialization enhancements Rick Edgecombe
2019-02-11 18:22   ` Borislav Petkov
2019-02-11 19:36     ` Nadav Amit
2019-01-29  0:34 ` [PATCH v2 10/20] x86: avoid W^X being broken during modules loading Rick Edgecombe
2019-02-11 18:29   ` Borislav Petkov
2019-02-11 18:45     ` Nadav Amit
2019-02-11 19:01       ` Borislav Petkov
2019-02-11 19:09         ` Nadav Amit
2019-02-11 19:10           ` Borislav Petkov
2019-02-11 19:27             ` Nadav Amit
2019-02-11 19:42               ` Borislav Petkov
2019-02-11 20:32                 ` Nadav Amit
2019-03-07 15:10                   ` [PATCH] x86/cpufeature: Remove __pure attribute to _static_cpu_has() Borislav Petkov
2019-03-07 16:43                     ` hpa
2019-03-07 17:02                       ` Borislav Petkov
2019-03-07  7:29                 ` [PATCH v2 10/20] x86: avoid W^X being broken during modules loading Borislav Petkov
2019-03-07 16:53                   ` hpa
2019-03-07 17:06                     ` Borislav Petkov
2019-03-07 20:02                       ` Andy Lutomirski
2019-03-07 20:25                         ` Borislav Petkov
2019-01-29  0:34 ` [PATCH v2 11/20] x86/jump-label: remove support for custom poker Rick Edgecombe
2019-02-11 18:37   ` Borislav Petkov
2019-01-29  0:34 ` [PATCH v2 12/20] x86/alternative: Remove the return value of text_poke_*() Rick Edgecombe
2019-01-29  0:34 ` [PATCH v2 13/20] Add set_alias_ function and x86 implementation Rick Edgecombe
2019-02-11 19:09   ` Borislav Petkov
2019-02-11 19:27     ` Edgecombe, Rick P
2019-02-11 22:59     ` Andy Lutomirski
2019-02-12  0:01       ` Edgecombe, Rick P
2019-01-29  0:34 ` [PATCH v2 14/20] mm: Make hibernate handle unmapped pages Rick Edgecombe
2019-02-19 11:04   ` Borislav Petkov
2019-02-19 21:28     ` Edgecombe, Rick P
2019-02-20 16:07       ` Borislav Petkov
2019-01-29  0:34 ` [PATCH v2 15/20] vmalloc: New flags for safe vfree on special perms Rick Edgecombe
2019-02-19 12:48   ` Borislav Petkov
2019-02-19 19:42     ` Edgecombe, Rick P
2019-02-20 16:14       ` Borislav Petkov
2019-01-29  0:34 ` [PATCH v2 16/20] modules: Use vmalloc special flag Rick Edgecombe
2019-01-29  0:34 ` [PATCH v2 17/20] bpf: " Rick Edgecombe
2019-01-29  0:34 ` [PATCH v2 18/20] x86/ftrace: " Rick Edgecombe
2019-01-29  0:34 ` [PATCH v2 19/20] x86/kprobes: " Rick Edgecombe
2019-01-29  0:34 ` [PATCH v2 20/20] x86/alternative: comment about module removal races Rick Edgecombe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).