* [RFC PATCH v2 00/18] livepatch: hybrid consistency model
@ 2016-04-28 20:44 Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 01/18] x86/asm/head: clean up initial stack variable Josh Poimboeuf
                   ` (17 more replies)
  0 siblings, 18 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

This is v2 of the livepatch hybrid consistency model, based on
linux-next/master.

v1 of this patch set was posted over a year ago:

  https://lkml.kernel.org/r/cover.1423499826.git.jpoimboe@redhat.com

The biggest complaint at that time was that stack traces are unreliable.
Since CONFIG_STACK_VALIDATION was merged, that issue has been addressed.
I've also tried to address all other outstanding complaints and issues.

Ingo and Peter, note that I'm using task_rq_lock() in patch 17/18 to
make sure a task stays asleep while its stack gets checked.  I'm not
sure if there's a better way to achieve that goal -- any suggestions
there would be greatly appreciated.

Patches 1-7 create a mechanism for detecting whether a given stack trace
can be deemed reliable.

Patches 8-18 add the consistency model.  See patch 17/18 for more
details about the consistency model itself.
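
Here's a rough sketch of how the two halves fit together (this is not
the actual code -- see patch 17/18 for that -- and klp_try_switch_task()
and MAX_STACK_ENTRIES are illustrative names only):

#define MAX_STACK_ENTRIES 64	/* arbitrary for this sketch */

/* Sketch: try to switch a sleeping task to the target patch state. */
static bool klp_try_switch_task(struct task_struct *task)
{
	/* static to stay off the kernel stack; assumes callers serialize */
	static unsigned long entries[MAX_STACK_ENTRIES];
	struct stack_trace trace = {
		.entries	= entries,
		.max_entries	= MAX_STACK_ENTRIES,
	};

	/* fails for running/preempted tasks and questionable stacks */
	if (save_stack_trace_tsk_reliable(task, &trace))
		return false;

	/*
	 * (not shown) verify that none of the to-be-patched functions
	 * appear in trace.entries[], then switch the task over.
	 */
	klp_patch_task(task);
	return true;
}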

Remaining TODOs:
- how to patch kthreads without RELIABLE_STACKTRACE?
- safe patch module removal
- fake signal facility
- allow user to force a task to the patched state
- enable the patching of kthreads which are sleeping on affected
  functions, via the livepatch ftrace handler
- WARN on certain stack error conditions

v2:
- "universe" -> "patch state"
- rename klp_update_task_universe() -> klp_patch_task()
- add preempt IRQ tracking (TF_PREEMPT_IRQ)
- fix print_context_stack_reliable() bug
- improve print_context_stack_reliable() comments
- klp_ftrace_handler comment fixes
- add "patch_state" proc file to tid_base_stuff
- schedule work even for !RELIABLE_STACKTRACE
- forked child inherits patch state from parent
- add detailed comment to livepatch.h klp_func definition about the
  klp_func patched/transition state transitions
- update exit_to_usermode_loop() comment
- clear all TIF_KLP_NEED_UPDATE flags in klp_complete_transition()
- remove unnecessary function externs
- add livepatch documentation, sysfs documentation, /proc documentation
- /proc/pid/patch_state: -1 means no patch is currently being applied/reverted
- "TIF_KLP_NEED_UPDATE" -> "TIF_PATCH_PENDING"
- support for s390 and powerpc-le
- don't assume stacks with dynamic ftrace trampolines are reliable
- add _TIF_ALLWORK_MASK info to commit log

v1.9:
- revive from the dead and rebased
- reliable stacks!
- add support for immediate consistency model
- add a ton of comments
- fix up memory barriers
- remove "allow patch modules to be removed" patch for now, it still 
  needs more discussion and thought - it can be done with something
- "proc/pid/universe" -> "proc/pid/patch_status"
- remove WARN_ON_ONCE from !func condition in ftrace handler -- can
  happen because of RCU
- keep klp_mutex private by putting the work_fn in core.c
- convert states from int to boolean
- remove obsolete '@state' comments
- several header file and include improvements suggested by Jiri S
- change kallsyms_lookup_size_offset() errors from EINVAL -> ENOENT
- change proc file permissions S_IRUGO -> S_IRUSR
- use klp_for_each_object/func helpers


Jiri Slaby (1):
  livepatch/s390: reorganize TIF thread flag bits

Josh Poimboeuf (16):
  x86/asm/head: clean up initial stack variable
  x86/asm/head: use a common function for starting CPUs
  x86/asm/head: standardize the bottom of the stack for idle tasks
  x86: move _stext marker before head code
  sched: add task flag for preempt IRQ tracking
  x86: dump_trace() error handling
  stacktrace/x86: function for detecting reliable stack traces
  livepatch: temporary stubs for klp_patch_pending() and
    klp_patch_task()
  livepatch/x86: add TIF_PATCH_PENDING thread flag
  livepatch/powerpc: add TIF_PATCH_PENDING thread flag
  livepatch: separate enabled and patched states
  livepatch: remove unnecessary object loaded check
  livepatch: move patching functions into patch.c
  livepatch: store function sizes
  livepatch: change to a per-task consistency model
  livepatch: add /proc/<pid>/patch_state

Miroslav Benes (1):
  livepatch/s390: add TIF_PATCH_PENDING thread flag

 Documentation/ABI/testing/sysfs-kernel-livepatch |   8 +
 Documentation/filesystems/proc.txt               |  18 +
 Documentation/livepatch/livepatch.txt            | 132 ++++++-
 arch/Kconfig                                     |   6 +
 arch/powerpc/include/asm/thread_info.h           |   4 +-
 arch/powerpc/kernel/signal.c                     |   4 +
 arch/s390/include/asm/thread_info.h              |  24 +-
 arch/s390/kernel/entry.S                         |  31 +-
 arch/x86/Kconfig                                 |   1 +
 arch/x86/entry/common.c                          |   9 +-
 arch/x86/include/asm/realmode.h                  |   2 +-
 arch/x86/include/asm/smp.h                       |   3 -
 arch/x86/include/asm/stacktrace.h                |  36 +-
 arch/x86/include/asm/thread_info.h               |   2 +
 arch/x86/kernel/acpi/sleep.c                     |   2 +-
 arch/x86/kernel/dumpstack.c                      | 108 +++++-
 arch/x86/kernel/dumpstack_32.c                   |  22 +-
 arch/x86/kernel/dumpstack_64.c                   |  53 ++-
 arch/x86/kernel/head_32.S                        |   8 +-
 arch/x86/kernel/head_64.S                        |  34 +-
 arch/x86/kernel/smpboot.c                        |   2 +-
 arch/x86/kernel/stacktrace.c                     |  24 ++
 arch/x86/kernel/vmlinux.lds.S                    |   2 +-
 fs/proc/base.c                                   |  15 +
 include/linux/init_task.h                        |   9 +
 include/linux/kernel.h                           |   1 +
 include/linux/livepatch.h                        |  57 ++-
 include/linux/sched.h                            |   4 +
 include/linux/stacktrace.h                       |  20 +-
 kernel/extable.c                                 |   2 +-
 kernel/fork.c                                    |   5 +-
 kernel/livepatch/Makefile                        |   2 +-
 kernel/livepatch/core.c                          | 342 +++++-----------
 kernel/livepatch/patch.c                         | 254 ++++++++++++
 kernel/livepatch/patch.h                         |  33 ++
 kernel/livepatch/transition.c                    | 474 +++++++++++++++++++++++
 kernel/livepatch/transition.h                    |  14 +
 kernel/sched/core.c                              |   4 +
 kernel/sched/idle.c                              |   4 +
 kernel/stacktrace.c                              |   4 +-
 lib/Kconfig.debug                                |   6 +
 41 files changed, 1413 insertions(+), 372 deletions(-)
 create mode 100644 kernel/livepatch/patch.c
 create mode 100644 kernel/livepatch/patch.h
 create mode 100644 kernel/livepatch/transition.c
 create mode 100644 kernel/livepatch/transition.h

-- 
2.4.11


* [RFC PATCH v2 01/18] x86/asm/head: clean up initial stack variable
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 02/18] x86/asm/head: use a common function for starting CPUs Josh Poimboeuf
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

The stack_start variable is similar in usage to initial_code and
initial_gs: they're all stored in head_64.S and they're all updated by
SMP and suspend/resume before starting a CPU.

Rename stack_start to initial_stack to be more consistent with the
others.

Also do a few other related cleanups:

- Remove the unused init_rsp variable declaration.

- Remove the ".word 0" statement after the initial_stack definition
  because it has no apparent function.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/realmode.h |  2 +-
 arch/x86/include/asm/smp.h      |  3 ---
 arch/x86/kernel/acpi/sleep.c    |  2 +-
 arch/x86/kernel/head_32.S       |  8 ++++----
 arch/x86/kernel/head_64.S       | 10 ++++------
 arch/x86/kernel/smpboot.c       |  2 +-
 6 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 9c6b890..677a671 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -44,9 +44,9 @@ struct trampoline_header {
 extern struct real_mode_header *real_mode_header;
 extern unsigned char real_mode_blob_end[];
 
-extern unsigned long init_rsp;
 extern unsigned long initial_code;
 extern unsigned long initial_gs;
+extern unsigned long initial_stack;
 
 extern unsigned char real_mode_blob[];
 extern unsigned char real_mode_relocs[];
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 66b0573..a9ac31b 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -38,9 +38,6 @@ DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(int, x86_cpu_to_logical_apicid);
 #endif
 
-/* Static state in head.S used to set up a CPU */
-extern unsigned long stack_start; /* Initial stack pointer address */
-
 struct task_struct;
 
 struct smp_ops {
diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index adb3eaf..4858733 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -99,7 +99,7 @@ int x86_acpi_suspend_lowlevel(void)
 	saved_magic = 0x12345678;
 #else /* CONFIG_64BIT */
 #ifdef CONFIG_SMP
-	stack_start = (unsigned long)temp_stack + sizeof(temp_stack);
+	initial_stack = (unsigned long)temp_stack + sizeof(temp_stack);
 	early_gdt_descr.address =
 			(unsigned long)get_cpu_gdt_table(smp_processor_id());
 	initial_gs = per_cpu_offset(smp_processor_id());
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 6770865..da840be 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -94,7 +94,7 @@ RESERVE_BRK(pagetables, INIT_MAP_SIZE)
  */
 __HEAD
 ENTRY(startup_32)
-	movl pa(stack_start),%ecx
+	movl pa(initial_stack),%ecx
 	
 	/* test KEEP_SEGMENTS flag to see if the bootloader is asking
 		us to not reload segments */
@@ -286,7 +286,7 @@ num_subarch_entries = (. - subarch_entries) / 4
  * start_secondary().
  */
 ENTRY(start_cpu0)
-	movl stack_start, %ecx
+	movl initial_stack, %ecx
 	movl %ecx, %esp
 	jmp  *(initial_code)
 ENDPROC(start_cpu0)
@@ -307,7 +307,7 @@ ENTRY(startup_32_smp)
 	movl %eax,%es
 	movl %eax,%fs
 	movl %eax,%gs
-	movl pa(stack_start),%ecx
+	movl pa(initial_stack),%ecx
 	movl %eax,%ss
 	leal -__PAGE_OFFSET(%ecx),%esp
 
@@ -709,7 +709,7 @@ ENTRY(initial_page_table)
 
 .data
 .balign 4
-ENTRY(stack_start)
+ENTRY(initial_stack)
 	.long init_thread_union+THREAD_SIZE
 
 __INITRODATA
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 5df831e..792c3bb 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -226,7 +226,7 @@ ENTRY(secondary_startup_64)
 	movq	%rax, %cr0
 
 	/* Setup a boot time stack */
-	movq stack_start(%rip), %rsp
+	movq initial_stack(%rip), %rsp
 
 	/* zero EFLAGS after setting rsp */
 	pushq $0
@@ -309,7 +309,7 @@ ENTRY(secondary_startup_64)
  * start_secondary().
  */
 ENTRY(start_cpu0)
-	movq stack_start(%rip),%rsp
+	movq initial_stack(%rip),%rsp
 	movq	initial_code(%rip),%rax
 	pushq	$0		# fake return address to stop unwinder
 	pushq	$__KERNEL_CS	# set correct cs
@@ -318,17 +318,15 @@ ENTRY(start_cpu0)
 ENDPROC(start_cpu0)
 #endif
 
-	/* SMP bootup changes these two */
+	/* SMP bootup changes these variables */
 	__REFDATA
 	.balign	8
 	GLOBAL(initial_code)
 	.quad	x86_64_start_kernel
 	GLOBAL(initial_gs)
 	.quad	INIT_PER_CPU_VAR(irq_stack_union)
-
-	GLOBAL(stack_start)
+	GLOBAL(initial_stack)
 	.quad  init_thread_union+THREAD_SIZE-8
-	.word  0
 	__FINITDATA
 
 bad_address:
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 1fe4130..503682a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -950,7 +950,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
 	initial_code = (unsigned long)start_secondary;
-	stack_start  = idle->thread.sp;
+	initial_stack  = idle->thread.sp;
 
 	/*
 	 * Enable the espfix hack for this CPU
-- 
2.4.11


* [RFC PATCH v2 02/18] x86/asm/head: use a common function for starting CPUs
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 01/18] x86/asm/head: clean up initial stack variable Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks Josh Poimboeuf
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

There are two different pieces of code for starting a CPU: start_cpu0()
and the end of secondary_startup_64().  They're identical except for the
stack setup.  Combine the common parts into a shared start_cpu()
function.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/head_64.S | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 792c3bb..6dbd2c0 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -264,13 +264,15 @@ ENTRY(secondary_startup_64)
 	movl	$MSR_GS_BASE,%ecx
 	movl	initial_gs(%rip),%eax
 	movl	initial_gs+4(%rip),%edx
-	wrmsr	
+	wrmsr
 
 	/* rsi is pointer to real mode structure with interesting info.
 	   pass it to C */
 	movq	%rsi, %rdi
-	
-	/* Finally jump to run C code and to be on real kernel address
+
+ENTRY(start_cpu)
+	/*
+	 * Jump to run C code and to be on a real kernel address.
 	 * Since we are running on identity-mapped space we have to jump
 	 * to the full 64bit address, this is only possible as indirect
 	 * jump.  In addition we need to ensure %cs is set so we make this
@@ -299,6 +301,7 @@ ENTRY(secondary_startup_64)
 	pushq	$__KERNEL_CS	# set correct cs
 	pushq	%rax		# target address in negative space
 	lretq
+ENDPROC(start_cpu)
 
 #include "verify_cpu.S"
 
@@ -306,15 +309,11 @@ ENTRY(secondary_startup_64)
 /*
  * Boot CPU0 entry point. It's called from play_dead(). Everything has been set
  * up already except stack. We just set up stack here. Then call
- * start_secondary().
+ * start_secondary() via start_cpu().
  */
 ENTRY(start_cpu0)
-	movq initial_stack(%rip),%rsp
-	movq	initial_code(%rip),%rax
-	pushq	$0		# fake return address to stop unwinder
-	pushq	$__KERNEL_CS	# set correct cs
-	pushq	%rax		# target address in negative space
-	lretq
+	movq	initial_stack(%rip), %rsp
+	jmp	start_cpu
 ENDPROC(start_cpu0)
 #endif
 
-- 
2.4.11


* [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 01/18] x86/asm/head: clean up initial stack variable Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 02/18] x86/asm/head: use a common function for starting CPUs Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-29 18:46   ` Brian Gerst
  2016-04-29 19:39   ` Andy Lutomirski
  2016-04-28 20:44 ` [RFC PATCH v2 04/18] x86: move _stext marker before head code Josh Poimboeuf
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

Thanks to all the recent x86 entry code refactoring, most tasks' kernel
stacks start at the same offset right above their saved pt_regs,
regardless of which syscall was used to enter the kernel.  That creates
a nice convention which makes it straightforward to identify the
"bottom" of the stack, which can be useful for stack walking code which
needs to verify the stack is sane.

However there are still a few types of tasks which don't yet follow that
convention:

1) CPU idle tasks, aka the "swapper" tasks

2) freshly forked TIF_FORK tasks which don't have a stack at all

Make the idle tasks conform to the new stack bottom convention by
starting their stack at a sizeof(pt_regs) offset from the end of the
stack page.
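
As an illustration (this helper is not part of the patch; the real user
is the reliable stack walker added later in the series), the convention
lets a stack walker verify that it actually reached the bottom:

/*
 * Sketch: with the stack bottom standardized at a pt_regs-sized offset,
 * the last frame of a complete walk sits immediately below the task's
 * saved pt_regs.
 */
static bool reached_stack_bottom(struct task_struct *task,
				 struct stack_frame *last_frame)
{
	return (void *)(last_frame + 1) == (void *)task_pt_regs(task);
}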

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/head_64.S | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 6dbd2c0..0b12311 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -296,8 +296,9 @@ ENTRY(start_cpu)
 	 *	REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
 	 *		address given in m16:64.
 	 */
-	movq	initial_code(%rip),%rax
-	pushq	$0		# fake return address to stop unwinder
+	call	1f		# put return address on stack for unwinder
+1:	xorq	%rbp, %rbp	# clear frame pointer
+	movq	initial_code(%rip), %rax
 	pushq	$__KERNEL_CS	# set correct cs
 	pushq	%rax		# target address in negative space
 	lretq
@@ -325,7 +326,7 @@ ENDPROC(start_cpu0)
 	GLOBAL(initial_gs)
 	.quad	INIT_PER_CPU_VAR(irq_stack_union)
 	GLOBAL(initial_stack)
-	.quad  init_thread_union+THREAD_SIZE-8
+	.quad  init_thread_union + THREAD_SIZE - SIZEOF_PTREGS
 	__FINITDATA
 
 bad_address:
-- 
2.4.11


* [RFC PATCH v2 04/18] x86: move _stext marker before head code
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (2 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking Josh Poimboeuf
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

When core_kernel_text() is used to determine whether an address on a
task's stack trace is a kernel text address, it incorrectly returns
false for early text addresses for the head code between the _text and
_stext markers.

Head code is text code too, so mark it as such.  This seems to match the
intent of other users of the _stext symbol, and it also seems consistent
with what other architectures are already doing.
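
For reference, core_kernel_text() does roughly the following range check
(simplified here), which is why head code placed before _stext was being
rejected:

/* simplified sketch of the check in kernel/extable.c */
int core_kernel_text(unsigned long addr)
{
	if (addr >= (unsigned long)_stext &&
	    addr < (unsigned long)_etext)
		return 1;
	/* (init text is also accepted during boot) */
	return 0;
}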

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/vmlinux.lds.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 4c941f8..79e15ef 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -91,10 +91,10 @@ SECTIONS
 	/* Text and read-only data */
 	.text :  AT(ADDR(.text) - LOAD_OFFSET) {
 		_text = .;
+		_stext = .;
 		/* bootstrapping code */
 		HEAD_TEXT
 		. = ALIGN(8);
-		_stext = .;
 		TEXT_TEXT
 		SCHED_TEXT
 		LOCK_TEXT
-- 
2.4.11


* [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (3 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 04/18] x86: move _stext marker before head code Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-29 18:06   ` Andy Lutomirski
  2016-04-28 20:44 ` [RFC PATCH v2 06/18] x86: dump_trace() error handling Josh Poimboeuf
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

A preempted function might not have had a chance to save the frame
pointer to the stack yet, which can result in its caller getting skipped
on a stack trace.

Add a flag to indicate when the task has been preempted so that stack
dump code can determine whether the stack trace is reliable.
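
The flag's consumer is added later in the series; roughly, the reliable
stack walker will bail out with a check along these lines (sketch only,
the helper name is made up):

/*
 * Sketch: refuse to trust a stack if the task was preempted from an
 * interrupt, since the preempted function may not have saved its frame
 * pointer yet and its caller could be missing from the trace.
 */
static int check_not_irq_preempted(struct task_struct *task)
{
	if (task->flags & PF_PREEMPT_IRQ)
		return -EINVAL;
	return 0;
}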

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 include/linux/sched.h | 1 +
 kernel/fork.c         | 2 +-
 kernel/sched/core.c   | 4 ++++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3d31572..fb364a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2137,6 +2137,7 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_NO_SETAFFINITY 0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
+#define PF_PREEMPT_IRQ	0x10000000	/* Thread is preempted by an irq */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
diff --git a/kernel/fork.c b/kernel/fork.c
index b73a539..d2fe04a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1373,7 +1373,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_count;
 
 	delayacct_tsk_init(p);	/* Must remain after dup_task_struct() */
-	p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
+	p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_PREEMPT_IRQ);
 	p->flags |= PF_FORKNOEXEC;
 	INIT_LIST_HEAD(&p->children);
 	INIT_LIST_HEAD(&p->sibling);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d84d60..7594267 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3422,6 +3422,8 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
 
 	prev_state = exception_enter();
 
+	current->flags |= PF_PREEMPT_IRQ;
+
 	do {
 		preempt_disable();
 		local_irq_enable();
@@ -3430,6 +3432,8 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
 		sched_preempt_enable_no_resched();
 	} while (need_resched());
 
+	current->flags &= ~PF_PREEMPT_IRQ;
+
 	exception_exit(prev_state);
 }
 
-- 
2.4.11


* [RFC PATCH v2 06/18] x86: dump_trace() error handling
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (4 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-29 13:45   ` Minfei Huang
  2016-04-28 20:44 ` [RFC PATCH v2 07/18] stacktrace/x86: function for detecting reliable stack traces Josh Poimboeuf
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

In preparation for being able to determine whether a given stack trace
is reliable, allow the stacktrace_ops functions to propagate errors to
dump_trace().
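
For example (hypothetical user, not part of this patch; the my_* names
are made up), a stacktrace_ops whose ->stack() callback refuses to cross
onto IRQ/exception stacks can now have that refusal reported to the
caller through dump_trace()'s return value:

struct my_trace {
	unsigned long entries[16];	/* arbitrary size for this example */
	unsigned int nr;
};

static int my_stack(void *data, char *name)
{
	return -EINVAL;		/* refuse to cross onto IRQ/exception stacks */
}

static int my_address(void *data, unsigned long addr, int reliable)
{
	struct my_trace *t = data;

	if (t->nr < ARRAY_SIZE(t->entries))
		t->entries[t->nr++] = addr;
	return 0;
}

static const struct stacktrace_ops my_ops = {
	.stack		= my_stack,
	.address	= my_address,
	.walk_stack	= print_context_stack,
};

static int my_save_trace(struct my_trace *t)
{
	/* the return value now says whether the walk completed */
	return dump_trace(current, NULL, NULL, 0, &my_ops, t);
}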

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/stacktrace.h | 36 +++++++++++++++-----------
 arch/x86/kernel/dumpstack.c       | 31 +++++++++++------------
 arch/x86/kernel/dumpstack_32.c    | 22 ++++++++++------
 arch/x86/kernel/dumpstack_64.c    | 53 ++++++++++++++++++++++++++-------------
 4 files changed, 87 insertions(+), 55 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 7c247e7..a64523f3 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -14,26 +14,32 @@ extern int kstack_depth_to_print;
 struct thread_info;
 struct stacktrace_ops;
 
-typedef unsigned long (*walk_stack_t)(struct thread_info *tinfo,
-				      unsigned long *stack,
-				      unsigned long bp,
-				      const struct stacktrace_ops *ops,
-				      void *data,
-				      unsigned long *end,
-				      int *graph);
-
-extern unsigned long
+typedef int (*walk_stack_t)(struct thread_info *tinfo,
+			    unsigned long *stack,
+			    unsigned long *bp,
+			    const struct stacktrace_ops *ops,
+			    void *data,
+			    unsigned long *end,
+			    int *graph);
+
+extern int
 print_context_stack(struct thread_info *tinfo,
-		    unsigned long *stack, unsigned long bp,
+		    unsigned long *stack, unsigned long *bp,
 		    const struct stacktrace_ops *ops, void *data,
 		    unsigned long *end, int *graph);
 
-extern unsigned long
+extern int
 print_context_stack_bp(struct thread_info *tinfo,
-		       unsigned long *stack, unsigned long bp,
+		       unsigned long *stack, unsigned long *bp,
 		       const struct stacktrace_ops *ops, void *data,
 		       unsigned long *end, int *graph);
 
+extern int
+print_context_stack_reliable(struct thread_info *tinfo,
+			     unsigned long *stack, unsigned long *bp,
+			     const struct stacktrace_ops *ops, void *data,
+			     unsigned long *end, int *graph);
+
 /* Generic stack tracer with callbacks */
 
 struct stacktrace_ops {
@@ -43,9 +49,9 @@ struct stacktrace_ops {
 	walk_stack_t	walk_stack;
 };
 
-void dump_trace(struct task_struct *tsk, struct pt_regs *regs,
-		unsigned long *stack, unsigned long bp,
-		const struct stacktrace_ops *ops, void *data);
+int dump_trace(struct task_struct *tsk, struct pt_regs *regs,
+	       unsigned long *stack, unsigned long bp,
+	       const struct stacktrace_ops *ops, void *data);
 
 #ifdef CONFIG_X86_32
 #define STACKSLOTS_PER_LINE 8
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2bb25c3..13d240c 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -92,23 +92,22 @@ static inline int valid_stack_ptr(struct thread_info *tinfo,
 	return p > t && p < t + THREAD_SIZE - size;
 }
 
-unsigned long
-print_context_stack(struct thread_info *tinfo,
-		unsigned long *stack, unsigned long bp,
-		const struct stacktrace_ops *ops, void *data,
-		unsigned long *end, int *graph)
+int print_context_stack(struct thread_info *tinfo,
+			unsigned long *stack, unsigned long *bp,
+			const struct stacktrace_ops *ops, void *data,
+			unsigned long *end, int *graph)
 {
-	struct stack_frame *frame = (struct stack_frame *)bp;
+	struct stack_frame *frame = (struct stack_frame *)*bp;
 
 	while (valid_stack_ptr(tinfo, stack, sizeof(*stack), end)) {
 		unsigned long addr;
 
 		addr = *stack;
 		if (__kernel_text_address(addr)) {
-			if ((unsigned long) stack == bp + sizeof(long)) {
+			if ((unsigned long) stack == *bp + sizeof(long)) {
 				ops->address(data, addr, 1);
 				frame = frame->next_frame;
-				bp = (unsigned long) frame;
+				*bp = (unsigned long) frame;
 			} else {
 				ops->address(data, addr, 0);
 			}
@@ -116,17 +115,16 @@ print_context_stack(struct thread_info *tinfo,
 		}
 		stack++;
 	}
-	return bp;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(print_context_stack);
 
-unsigned long
-print_context_stack_bp(struct thread_info *tinfo,
-		       unsigned long *stack, unsigned long bp,
-		       const struct stacktrace_ops *ops, void *data,
-		       unsigned long *end, int *graph)
+int print_context_stack_bp(struct thread_info *tinfo,
+			   unsigned long *stack, unsigned long *bp,
+			   const struct stacktrace_ops *ops, void *data,
+			   unsigned long *end, int *graph)
 {
-	struct stack_frame *frame = (struct stack_frame *)bp;
+	struct stack_frame *frame = (struct stack_frame *)*bp;
 	unsigned long *ret_addr = &frame->return_address;
 
 	while (valid_stack_ptr(tinfo, ret_addr, sizeof(*ret_addr), end)) {
@@ -142,7 +140,8 @@ print_context_stack_bp(struct thread_info *tinfo,
 		print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
 	}
 
-	return (unsigned long)frame;
+	*bp = (unsigned long)frame;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(print_context_stack_bp);
 
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 464ffd6..e710bab 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -38,13 +38,14 @@ static void *is_softirq_stack(unsigned long *stack, int cpu)
 	return is_irq_stack(stack, irq);
 }
 
-void dump_trace(struct task_struct *task, struct pt_regs *regs,
-		unsigned long *stack, unsigned long bp,
-		const struct stacktrace_ops *ops, void *data)
+int dump_trace(struct task_struct *task, struct pt_regs *regs,
+	       unsigned long *stack, unsigned long bp,
+	       const struct stacktrace_ops *ops, void *data)
 {
 	const unsigned cpu = get_cpu();
 	int graph = 0;
 	u32 *prev_esp;
+	int ret;
 
 	if (!task)
 		task = current;
@@ -69,8 +70,10 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 			end_stack = is_softirq_stack(stack, cpu);
 
 		context = task_thread_info(task);
-		bp = ops->walk_stack(context, stack, bp, ops, data,
-				     end_stack, &graph);
+		ret = ops->walk_stack(context, stack, &bp, ops, data,
+				      end_stack, &graph);
+		if (ret)
+			goto out;
 
 		/* Stop if not on irq stack */
 		if (!end_stack)
@@ -82,11 +85,16 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 		if (!stack)
 			break;
 
-		if (ops->stack(data, "IRQ") < 0)
-			break;
+		ret = ops->stack(data, "IRQ");
+		if (ret)
+			goto out;
+
 		touch_nmi_watchdog();
 	}
+
+out:
 	put_cpu();
+	return ret;
 }
 EXPORT_SYMBOL(dump_trace);
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 5f1c626..0c810ba 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -148,9 +148,9 @@ analyze_stack(int cpu, struct task_struct *task, unsigned long *stack,
  * severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
  */
 
-void dump_trace(struct task_struct *task, struct pt_regs *regs,
-		unsigned long *stack, unsigned long bp,
-		const struct stacktrace_ops *ops, void *data)
+int dump_trace(struct task_struct *task, struct pt_regs *regs,
+	       unsigned long *stack, unsigned long bp,
+	       const struct stacktrace_ops *ops, void *data)
 {
 	const unsigned cpu = get_cpu();
 	struct thread_info *tinfo;
@@ -159,6 +159,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	unsigned used = 0;
 	int graph = 0;
 	int done = 0;
+	int ret;
 
 	if (!task)
 		task = current;
@@ -198,13 +199,18 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 			break;
 
 		case STACK_IS_EXCEPTION:
-
-			if (ops->stack(data, id) < 0)
-				break;
-
-			bp = ops->walk_stack(tinfo, stack, bp, ops,
-					     data, stack_end, &graph);
-			ops->stack(data, "<EOE>");
+			ret = ops->stack(data, id);
+			if (ret)
+				goto out;
+
+			ret = ops->walk_stack(tinfo, stack, &bp, ops, data,
+					      stack_end, &graph);
+			if (ret)
+				goto out;
+
+			ret = ops->stack(data, "<EOE>");
+			if (ret)
+				goto out;
 			/*
 			 * We link to the next stack via the
 			 * second-to-last pointer (index -2 to end) in the
@@ -215,11 +221,15 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 			break;
 
 		case STACK_IS_IRQ:
+			ret = ops->stack(data, "IRQ");
+			if (ret)
+				goto out;
+
+			ret = ops->walk_stack(tinfo, stack, &bp, ops, data,
+					      stack_end, &graph);
+			if (ret)
+				goto out;
 
-			if (ops->stack(data, "IRQ") < 0)
-				break;
-			bp = ops->walk_stack(tinfo, stack, bp,
-				     ops, data, stack_end, &graph);
 			/*
 			 * We link to the next stack (which would be
 			 * the process stack normally) the last
@@ -227,12 +237,18 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 			 */
 			stack = (unsigned long *) (stack_end[-1]);
 			irq_stack = NULL;
-			ops->stack(data, "EOI");
+
+			ret = ops->stack(data, "EOI");
+			if (ret)
+				goto out;
+
 			done = 0;
 			break;
 
 		case STACK_IS_UNKNOWN:
-			ops->stack(data, "UNK");
+			ret = ops->stack(data, "UNK");
+			if (ret)
+				goto out;
 			break;
 		}
 	}
@@ -240,8 +256,11 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	/*
 	 * This handles the process stack:
 	 */
-	bp = ops->walk_stack(tinfo, stack, bp, ops, data, NULL, &graph);
+	ret = ops->walk_stack(tinfo, stack, &bp, ops, data, NULL, &graph);
+
+out:
 	put_cpu();
+	return ret;
 }
 EXPORT_SYMBOL(dump_trace);
 
-- 
2.4.11


* [RFC PATCH v2 07/18] stacktrace/x86: function for detecting reliable stack traces
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (5 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 06/18] x86: dump_trace() error handling Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 08/18] livepatch: temporary stubs for klp_patch_pending() and klp_patch_task() Josh Poimboeuf
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

For live patching and possibly other use cases, a stack trace is only
useful if it can be assured that it's completely reliable.  Add a new
save_stack_trace_tsk_reliable() function to achieve that.

Scenarios which indicate that a stack trace may be unreliable:

- running tasks
- interrupt stacks
- preemption
- corrupted stack data
- the stack grows the wrong way
- the stack walk doesn't reach the bottom
- the user didn't provide a large enough entries array

Also add a config option so arch-independent code can determine at build
time whether the function is implemented.
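
A hypothetical caller (illustrative only -- the livepatch user shows up
later in the series, and stack_is_reliable() and NR_ENTRIES are made-up
names) would use it along these lines, treating -ENOSYS from the
!CONFIG_RELIABLE_STACKTRACE stub as "not supported on this arch":

#define NR_ENTRIES 64			/* arbitrary for this example */

/* The task must be sleeping or be the current task (see the comment in
 * the patch below). */
static int stack_is_reliable(struct task_struct *task)
{
	/* static to stay off the kernel stack; assumes callers serialize */
	static unsigned long entries[NR_ENTRIES];
	struct stack_trace trace = {
		.entries	= entries,
		.max_entries	= NR_ENTRIES,
	};
	int ret;

	ret = save_stack_trace_tsk_reliable(task, &trace);
	if (ret)
		return ret;	/* -ENOSYS, or the trace can't be trusted */

	/* trace.entries[0 .. trace.nr_entries-1] can now be trusted */
	return 0;
}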

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/Kconfig                 |  6 ++++
 arch/x86/Kconfig             |  1 +
 arch/x86/kernel/dumpstack.c  | 77 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/stacktrace.c | 24 ++++++++++++++
 include/linux/kernel.h       |  1 +
 include/linux/stacktrace.h   | 20 +++++++++---
 kernel/extable.c             |  2 +-
 kernel/stacktrace.c          |  4 +--
 lib/Kconfig.debug            |  6 ++++
 9 files changed, 134 insertions(+), 7 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8f84fd2..ec4d480 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -598,6 +598,12 @@ config HAVE_STACK_VALIDATION
 	  Architecture supports the 'objtool check' host tool command, which
 	  performs compile-time stack metadata validation.
 
+config HAVE_RELIABLE_STACKTRACE
+	bool
+	help
+	  Architecture has a save_stack_trace_tsk_reliable() function which
+	  only returns a stack trace if it can guarantee the trace is reliable.
+
 #
 # ABI hall of shame
 #
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0b128b4..78c4e00 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -140,6 +140,7 @@ config X86
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_RELIABLE_STACKTRACE		if X86_64 && FRAME_POINTER
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UID16			if X86_32 || IA32_EMULATION
 	select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 13d240c..70d0013 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -145,6 +145,83 @@ int print_context_stack_bp(struct thread_info *tinfo,
 }
 EXPORT_SYMBOL_GPL(print_context_stack_bp);
 
+#ifdef CONFIG_RELIABLE_STACKTRACE
+/*
+ * Only succeeds if the stack trace is deemed reliable.  This relies on the
+ * fact that frame pointers are reliable thanks to CONFIG_STACK_VALIDATION.
+ *
+ * The caller must ensure that the task is either sleeping or is the current
+ * task.
+ */
+int print_context_stack_reliable(struct thread_info *tinfo,
+				 unsigned long *stack, unsigned long *bp,
+				 const struct stacktrace_ops *ops,
+				 void *data, unsigned long *end, int *graph)
+{
+	struct stack_frame *frame = (struct stack_frame *)*bp;
+	struct stack_frame *last_frame = NULL;
+	unsigned long *ret_addr = &frame->return_address;
+
+	/*
+	 * If the kernel was preempted by an IRQ, we can't trust the stack
+	 * because the preempted function might not have gotten the chance to
+	 * save the frame pointer on the stack before it was interrupted.
+	 */
+	if (tinfo->task->flags & PF_PREEMPT_IRQ)
+		return -EINVAL;
+
+	/*
+	 * A freshly forked task has an empty stack trace.  We can consider
+	 * that to be reliable.
+	 */
+	if (test_ti_thread_flag(tinfo, TIF_FORK))
+		return 0;
+
+	while (valid_stack_ptr(tinfo, ret_addr, sizeof(*ret_addr), end)) {
+		unsigned long addr = *ret_addr;
+
+		/*
+		 * Make sure the stack only grows down.
+		 */
+		if (frame <= last_frame)
+			return -EINVAL;
+
+		/*
+		 * Make sure the frame refers to a valid kernel function.
+		 */
+		if (!core_kernel_text(addr) && !init_kernel_text(addr) &&
+		    !is_module_text_address(addr))
+			return -EINVAL;
+
+		/*
+		 * Save the kernel text address and make sure the entries array
+		 * isn't full.
+		 */
+		if (ops->address(data, addr, 1))
+			return -EINVAL;
+
+		/*
+		 * If the function graph tracer is in effect, save the real
+		 * function address.
+		 */
+		print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
+
+		last_frame = frame;
+		frame = frame->next_frame;
+		ret_addr = &frame->return_address;
+	}
+
+	/*
+	 * Make sure we reached the bottom of the stack.
+	 */
+	if (last_frame + 1 != (void *)task_pt_regs(tinfo->task))
+		return -EINVAL;
+
+	*bp = (unsigned long)frame;
+	return 0;
+}
+#endif /* CONFIG_RELIABLE_STACKTRACE */
+
 static int print_trace_stack(void *data, char *name)
 {
 	printk("%s <%s> ", (char *)data, name);
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 9ee98ee..10882e4 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -148,3 +148,27 @@ void save_stack_trace_user(struct stack_trace *trace)
 		trace->entries[trace->nr_entries++] = ULONG_MAX;
 }
 
+#ifdef CONFIG_RELIABLE_STACKTRACE
+
+static int save_stack_stack_reliable(void *data, char *name)
+{
+	return -EINVAL;
+}
+
+static const struct stacktrace_ops save_stack_ops_reliable = {
+	.stack		= save_stack_stack_reliable,
+	.address	= save_stack_address,
+	.walk_stack	= print_context_stack_reliable,
+};
+
+/*
+ * Returns 0 if the stack trace is deemed reliable.  The caller must ensure
+ * that the task is either sleeping or is the current task.
+ */
+int save_stack_trace_tsk_reliable(struct task_struct *tsk,
+				  struct stack_trace *trace)
+{
+	return dump_trace(tsk, NULL, NULL, 0, &save_stack_ops_reliable, trace);
+}
+
+#endif /* CONFIG_RELIABLE_STACKTRACE */
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index cc73982..6be1e82 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -429,6 +429,7 @@ extern char *get_options(const char *str, int nints, int *ints);
 extern unsigned long long memparse(const char *ptr, char **retptr);
 extern bool parse_option_str(const char *str, const char *option);
 
+extern int init_kernel_text(unsigned long addr);
 extern int core_kernel_text(unsigned long addr);
 extern int core_kernel_data(unsigned long addr);
 extern int __kernel_text_address(unsigned long addr);
diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 0a34489..527e4cc 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -2,17 +2,18 @@
 #define __LINUX_STACKTRACE_H
 
 #include <linux/types.h>
+#include <linux/errno.h>
 
 struct task_struct;
 struct pt_regs;
 
-#ifdef CONFIG_STACKTRACE
 struct stack_trace {
 	unsigned int nr_entries, max_entries;
 	unsigned long *entries;
 	int skip;	/* input argument: How many entries to skip */
 };
 
+#ifdef CONFIG_STACKTRACE
 extern void save_stack_trace(struct stack_trace *trace);
 extern void save_stack_trace_regs(struct pt_regs *regs,
 				  struct stack_trace *trace);
@@ -29,12 +30,23 @@ extern void save_stack_trace_user(struct stack_trace *trace);
 # define save_stack_trace_user(trace)              do { } while (0)
 #endif
 
-#else
+#else /* !CONFIG_STACKTRACE */
 # define save_stack_trace(trace)			do { } while (0)
 # define save_stack_trace_tsk(tsk, trace)		do { } while (0)
 # define save_stack_trace_user(trace)			do { } while (0)
 # define print_stack_trace(trace, spaces)		do { } while (0)
 # define snprint_stack_trace(buf, size, trace, spaces)	do { } while (0)
-#endif
+#endif /* CONFIG_STACKTRACE */
 
-#endif
+#ifdef CONFIG_RELIABLE_STACKTRACE
+extern int save_stack_trace_tsk_reliable(struct task_struct *tsk,
+					 struct stack_trace *trace);
+#else
+static inline int save_stack_trace_tsk_reliable(struct task_struct *tsk,
+						struct stack_trace *trace)
+{
+	return -ENOSYS;
+}
+#endif /* CONFIG_RELIABLE_STACKTRACE */
+
+#endif /* __LINUX_STACKTRACE_H */
diff --git a/kernel/extable.c b/kernel/extable.c
index e820cce..c085844 100644
--- a/kernel/extable.c
+++ b/kernel/extable.c
@@ -58,7 +58,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
 	return e;
 }
 
-static inline int init_kernel_text(unsigned long addr)
+int init_kernel_text(unsigned long addr)
 {
 	if (addr >= (unsigned long)_sinittext &&
 	    addr < (unsigned long)_einittext)
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index b6e4c16..f35bc5d 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -58,8 +58,8 @@ int snprint_stack_trace(char *buf, size_t size,
 EXPORT_SYMBOL_GPL(snprint_stack_trace);
 
 /*
- * Architectures that do not implement save_stack_trace_tsk or
- * save_stack_trace_regs get this weak alias and a once-per-bootup warning
+ * Architectures that do not implement save_stack_trace_*()
+ * get this weak alias and a once-per-bootup warning
  * (whenever this facility is utilized - for example by procfs):
  */
 __weak void
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5d57177..189a2d7 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1164,6 +1164,12 @@ config STACKTRACE
 	  It is also used by various kernel debugging features that require
 	  stack trace generation.
 
+config RELIABLE_STACKTRACE
+	def_bool y
+	depends on HAVE_RELIABLE_STACKTRACE
+	depends on STACKTRACE
+	depends on STACK_VALIDATION
+
 config DEBUG_KOBJECT
 	bool "kobject debugging"
 	depends on DEBUG_KERNEL
-- 
2.4.11


* [RFC PATCH v2 08/18] livepatch: temporary stubs for klp_patch_pending() and klp_patch_task()
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (6 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 07/18] stacktrace/x86: function for detecting reliable stack traces Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 09/18] livepatch/x86: add TIF_PATCH_PENDING thread flag Josh Poimboeuf
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

Create temporary stubs for klp_patch_pending() and klp_patch_task() so
we can add TIF_PATCH_PENDING to different architectures in separate
patches without breaking build bisectability.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 include/linux/livepatch.h | 7 ++++++-
 kernel/livepatch/core.c   | 3 +++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 0933ca4..a8c6c9c 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -118,10 +118,15 @@ int klp_disable_patch(struct klp_patch *);
 int klp_module_coming(struct module *mod);
 void klp_module_going(struct module *mod);
 
+static inline bool klp_patch_pending(struct task_struct *task) { return false; }
+void klp_patch_task(struct task_struct *task);
+
 #else /* !CONFIG_LIVEPATCH */
 
 static inline int klp_module_coming(struct module *mod) { return 0; }
-static inline void klp_module_going(struct module *mod) { }
+static inline void klp_module_going(struct module *mod) {}
+static inline bool klp_patch_pending(struct task_struct *task) { return false; }
+static inline void klp_patch_task(struct task_struct *task) {}
 
 #endif /* CONFIG_LIVEPATCH */
 
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index a19f195..6ea6880 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -64,6 +64,9 @@ static LIST_HEAD(klp_ops);
 
 static struct kobject *klp_root_kobj;
 
+/* TODO: temporary stub */
+void klp_patch_task(struct task_struct *task) {}
+
 static struct klp_ops *klp_find_ops(unsigned long old_addr)
 {
 	struct klp_ops *ops;
-- 
2.4.11


* [RFC PATCH v2 09/18] livepatch/x86: add TIF_PATCH_PENDING thread flag
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (7 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 08/18] livepatch: temporary stubs for klp_patch_pending() and klp_patch_task() Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-29 18:08   ` Andy Lutomirski
  2016-04-28 20:44 ` [RFC PATCH v2 10/18] livepatch/powerpc: " Josh Poimboeuf
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
per-task consistency model for x86_64.  The bit getting set indicates
the thread has a pending patch which needs to be applied when the thread
exits the kernel.

The bit is placed in the least-significant word of the thread_info flags
so that it gets automatically included in the _TIF_ALLWORK_MASK macro.
This results in exit_to_usermode_loop() and klp_patch_task() getting
called when the bit is set.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/entry/common.c            | 9 ++++++---
 arch/x86/include/asm/thread_info.h | 2 ++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ec138e5..0eaa1d9 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/livepatch.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -202,14 +203,13 @@ long syscall_trace_enter(struct pt_regs *regs)
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
-	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
+	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 {
 	/*
 	 * In order to return to user mode, we need to have IRQs off with
-	 * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
-	 * _TIF_UPROBE, or _TIF_NEED_RESCHED set.  Several of these flags
+	 * none of EXIT_TO_USERMODE_LOOP_FLAGS set.  Several of these flags
 	 * can be set at any time on preemptable kernels if we have IRQs on,
 	 * so we need to loop.  Disabling preemption wouldn't help: doing the
 	 * work to clear some of the flags can sleep.
@@ -236,6 +236,9 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		if (cached_flags & _TIF_PATCH_PENDING)
+			klp_patch_task(current);
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 30c133a..4e4f50e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -97,6 +97,7 @@ struct thread_info {
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
+#define TIF_PATCH_PENDING	13	/* pending live patching update */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -121,6 +122,7 @@ struct thread_info {
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
+#define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
-- 
2.4.11


* [RFC PATCH v2 10/18] livepatch/powerpc: add TIF_PATCH_PENDING thread flag
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (8 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 09/18] livepatch/x86: add TIF_PATCH_PENDING thread flag Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-05-03  9:07   ` Petr Mladek
  2016-04-28 20:44 ` [RFC PATCH v2 11/18] livepatch/s390: reorganize TIF thread flag bits Josh Poimboeuf
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
per-task consistency model for powerpc.  The bit getting set indicates
the thread has a pending patch which needs to be applied when the thread
exits the kernel.

The bit is included in the _TIF_USER_WORK_MASK macro so that
do_notify_resume() and klp_patch_task() get called when the bit is set.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/powerpc/include/asm/thread_info.h | 4 +++-
 arch/powerpc/kernel/signal.c           | 4 ++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 8febc3f..df262ca 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -88,6 +88,7 @@ static inline struct thread_info *current_thread_info(void)
 					   TIF_NEED_RESCHED */
 #define TIF_32BIT		4	/* 32 bit binary */
 #define TIF_RESTORE_TM		5	/* need to restore TM FP/VEC/VSX */
+#define TIF_PATCH_PENDING	6	/* pending live patching update */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SINGLESTEP		8	/* singlestepping active */
 #define TIF_NOHZ		9	/* in adaptive nohz mode */
@@ -111,6 +112,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_32BIT		(1<<TIF_32BIT)
 #define _TIF_RESTORE_TM		(1<<TIF_RESTORE_TM)
+#define _TIF_PATCH_PENDING	(1<<TIF_PATCH_PENDING)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
 #define _TIF_SINGLESTEP		(1<<TIF_SINGLESTEP)
 #define _TIF_SECCOMP		(1<<TIF_SECCOMP)
@@ -127,7 +129,7 @@ static inline struct thread_info *current_thread_info(void)
 
 #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
 				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
-				 _TIF_RESTORE_TM)
+				 _TIF_RESTORE_TM | _TIF_PATCH_PENDING)
 #define _TIF_PERSYSCALL_MASK	(_TIF_RESTOREALL|_TIF_NOERROR)
 
 /* Bits in local_flags */
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index cb64d6f..844497b 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -14,6 +14,7 @@
 #include <linux/uprobes.h>
 #include <linux/key.h>
 #include <linux/context_tracking.h>
+#include <linux/livepatch.h>
 #include <asm/hw_breakpoint.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -159,6 +160,9 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
 		tracehook_notify_resume(regs);
 	}
 
+	if (thread_info_flags & _TIF_PATCH_PENDING)
+		klp_patch_task(current);
+
 	user_enter();
 }
 
-- 
2.4.11


* [RFC PATCH v2 11/18] livepatch/s390: reorganize TIF thread flag bits
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (9 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 10/18] livepatch/powerpc: " Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 12/18] livepatch/s390: add TIF_PATCH_PENDING thread flag Josh Poimboeuf
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

From: Jiri Slaby <jslaby@suse.cz>

Group the TIF thread flag bits by their inclusion in the _TIF_WORK and
_TIF_TRACE macros.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/s390/include/asm/thread_info.h | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index 2fffc2c..8642c1d 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -70,14 +70,12 @@ void arch_release_task_struct(struct task_struct *tsk);
 /*
  * thread information flags bit numbers
  */
+/* _TIF_WORK bits */
 #define TIF_NOTIFY_RESUME	0	/* callback before returning to user */
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
-#define TIF_SYSCALL_TRACE	3	/* syscall trace active */
-#define TIF_SYSCALL_AUDIT	4	/* syscall auditing active */
-#define TIF_SECCOMP		5	/* secure computing */
-#define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
-#define TIF_UPROBE		7	/* breakpointed or single-stepping */
+#define TIF_UPROBE		3	/* breakpointed or single-stepping */
+
 #define TIF_31BIT		16	/* 32bit process */
 #define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal() */
@@ -85,15 +83,23 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_BLOCK_STEP		20	/* This task is block stepped */
 #define TIF_UPROBE_SINGLESTEP	21	/* This task is uprobe single stepped */
 
+/* _TIF_TRACE bits */
+#define TIF_SYSCALL_TRACE	24	/* syscall trace active */
+#define TIF_SYSCALL_AUDIT	25	/* syscall auditing active */
+#define TIF_SECCOMP		26	/* secure computing */
+#define TIF_SYSCALL_TRACEPOINT	27	/* syscall tracepoint instrumentation */
+
 #define _TIF_NOTIFY_RESUME	_BITUL(TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		_BITUL(TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	_BITUL(TIF_NEED_RESCHED)
+#define _TIF_UPROBE		_BITUL(TIF_UPROBE)
+
+#define _TIF_31BIT		_BITUL(TIF_31BIT)
+#define _TIF_SINGLE_STEP	_BITUL(TIF_SINGLE_STEP)
+
 #define _TIF_SYSCALL_TRACE	_BITUL(TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	_BITUL(TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		_BITUL(TIF_SECCOMP)
 #define _TIF_SYSCALL_TRACEPOINT	_BITUL(TIF_SYSCALL_TRACEPOINT)
-#define _TIF_UPROBE		_BITUL(TIF_UPROBE)
-#define _TIF_31BIT		_BITUL(TIF_31BIT)
-#define _TIF_SINGLE_STEP	_BITUL(TIF_SINGLE_STEP)
 
 #endif /* _ASM_THREAD_INFO_H */
-- 
2.4.11

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [RFC PATCH v2 12/18] livepatch/s390: add TIF_PATCH_PENDING thread flag
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (10 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 11/18] livepatch/s390: reorganize TIF thread flag bits Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 13/18] livepatch: separate enabled and patched states Josh Poimboeuf
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

From: Miroslav Benes <mbenes@suse.cz>

Update a task's patch state when returning from a system call or user
space interrupt, or after handling a signal.

This greatly increases the chances of a patch operation succeeding.  If
a task is I/O bound, it can be patched when returning from a system
call.  If a task is CPU bound, it can be patched when returning from an
interrupt.  If a task is sleeping on a to-be-patched function, the user
can send SIGSTOP and SIGCONT to force it to switch.

Since there are two ways the syscall can be restarted on return from
signal handling, it is important to clear the flag before do_signal()
is called. Otherwise we could miss the migration when using the
SIGSTOP/SIGCONT procedure or a fake signal to migrate tasks which are
blocking the patching. Placing our hook at the sysc_work label in the
entry code, before TIF_SIGPENDING is evaluated, kills two birds with
one stone: the task is correctly migrated in all return paths from a
syscall.
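
For reference, a minimal C sketch of what the exit-path hook amounts to
(illustration only -- the real s390 hook is the entry.S assembly below,
and the powerpc do_notify_resume() hunk earlier in the series is the C
equivalent):

  #include <linux/sched.h>
  #include <linux/livepatch.h>

  /* Sketch: on the way back to user space, migrate the task if a patch
   * operation is pending, before signals are handled. */
  static void exit_to_user_patch_check(struct task_struct *task,
                                       unsigned long ti_flags)
  {
      if (ti_flags & _TIF_PATCH_PENDING)
          klp_patch_task(task);    /* switch task to the target patch state */

      /* ... then signal delivery, notify-resume, etc. */
  }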

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/s390/include/asm/thread_info.h |  2 ++
 arch/s390/kernel/entry.S            | 31 ++++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index 8642c1d..b69d538 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -75,6 +75,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_UPROBE		3	/* breakpointed or single-stepping */
+#define TIF_PATCH_PENDING	4	/* pending live patching update */
 
 #define TIF_31BIT		16	/* 32bit process */
 #define TIF_MEMDIE		17	/* is terminating due to OOM killer */
@@ -93,6 +94,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_SIGPENDING		_BITUL(TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	_BITUL(TIF_NEED_RESCHED)
 #define _TIF_UPROBE		_BITUL(TIF_UPROBE)
+#define _TIF_PATCH_PENDING	_BITUL(TIF_PATCH_PENDING)
 
 #define _TIF_31BIT		_BITUL(TIF_31BIT)
 #define _TIF_SINGLE_STEP	_BITUL(TIF_SINGLE_STEP)
diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S
index 2d47f9c..5db5959 100644
--- a/arch/s390/kernel/entry.S
+++ b/arch/s390/kernel/entry.S
@@ -46,7 +46,7 @@ STACK_SIZE  = 1 << STACK_SHIFT
 STACK_INIT = STACK_SIZE - STACK_FRAME_OVERHEAD - __PT_SIZE
 
 _TIF_WORK	= (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \
-		   _TIF_UPROBE)
+		   _TIF_UPROBE | _TIF_PATCH_PENDING)
 _TIF_TRACE	= (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | \
 		   _TIF_SYSCALL_TRACEPOINT)
 _CIF_WORK	= (_CIF_MCCK_PENDING | _CIF_ASCE | _CIF_FPU)
@@ -329,6 +329,11 @@ ENTRY(system_call)
 #endif
 	TSTMSK	__PT_FLAGS(%r11),_PIF_PER_TRAP
 	jo	.Lsysc_singlestep
+#ifdef CONFIG_LIVEPATCH
+	TSTMSK	__TI_flags(%r12),_TIF_PATCH_PENDING
+	jo	.Lsysc_patch_pending	# handle live patching just before
+					# signals and possible syscall restart
+#endif
 	TSTMSK	__TI_flags(%r12),_TIF_SIGPENDING
 	jo	.Lsysc_sigpending
 	TSTMSK	__TI_flags(%r12),_TIF_NOTIFY_RESUME
@@ -404,6 +409,16 @@ ENTRY(system_call)
 #endif
 
 #
+# _TIF_PATCH_PENDING is set, call klp_patch_task
+#
+#ifdef CONFIG_LIVEPATCH
+.Lsysc_patch_pending:
+	lg	%r2,__TI_task(%r12)
+	larl	%r14,.Lsysc_return
+	jg	klp_patch_task
+#endif
+
+#
 # _PIF_PER_TRAP is set, call do_per_trap
 #
 .Lsysc_singlestep:
@@ -652,6 +667,10 @@ ENTRY(io_int_handler)
 	jo	.Lio_mcck_pending
 	TSTMSK	__TI_flags(%r12),_TIF_NEED_RESCHED
 	jo	.Lio_reschedule
+#ifdef CONFIG_LIVEPATCH
+	TSTMSK	__TI_flags(%r12),_TIF_PATCH_PENDING
+	jo	.Lio_patch_pending
+#endif
 	TSTMSK	__TI_flags(%r12),_TIF_SIGPENDING
 	jo	.Lio_sigpending
 	TSTMSK	__TI_flags(%r12),_TIF_NOTIFY_RESUME
@@ -698,6 +717,16 @@ ENTRY(io_int_handler)
 	j	.Lio_return
 
 #
+# _TIF_PATCH_PENDING is set, call klp_patch_task
+#
+#ifdef CONFIG_LIVEPATCH
+.Lio_patch_pending:
+	lg	%r2,__TI_task(%r12)
+	larl	%r14,.Lio_return
+	jg	klp_patch_task
+#endif
+
+#
 # _TIF_SIGPENDING or is set, call do_signal
 #
 .Lio_sigpending:
-- 
2.4.11

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [RFC PATCH v2 13/18] livepatch: separate enabled and patched states
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (11 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 12/18] livepatch/s390: add TIF_PATCH_PENDING thread flag Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-05-03  9:30   ` Petr Mladek
  2016-04-28 20:44 ` [RFC PATCH v2 14/18] livepatch: remove unnecessary object loaded check Josh Poimboeuf
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

Once we have a consistency model, patches and their objects will be
enabled and disabled at different times.  For example, when a patch is
disabled, its loaded objects' funcs can remain registered with ftrace
indefinitely until the unpatching operation is complete and they're no
longer in use.

It's less confusing if we give them different names: patches can be
enabled or disabled; objects (and their funcs) can be patched or
unpatched:

- Enabled means that a patch is logically enabled (but not necessarily
  fully applied).

- Patched means that an object's funcs are registered with ftrace and
  added to the klp_ops func stack.

Also, since these states are binary, represent them with booleans
instead of ints.
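
To illustrate the distinction, a hypothetical helper (not part of this
patch) could tell "logically enabled" apart from "fully applied":

  #include <linux/livepatch.h>

  /* Hypothetical, sketched against core.c internals: a patch can be
   * enabled while some of its objects are not (yet) patched, e.g. while
   * their modules are still unloaded. */
  static bool klp_patch_fully_applied(struct klp_patch *patch)
  {
      struct klp_object *obj;

      if (!patch->enabled)
          return false;

      klp_for_each_object(patch, obj)
          if (klp_is_object_loaded(obj) && !obj->patched)
              return false;

      return true;
  }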

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 include/linux/livepatch.h | 17 ++++-------
 kernel/livepatch/core.c   | 72 +++++++++++++++++++++++------------------------
 2 files changed, 42 insertions(+), 47 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index a8c6c9c..9ba26c5 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -28,11 +28,6 @@
 
 #include <asm/livepatch.h>
 
-enum klp_state {
-	KLP_DISABLED,
-	KLP_ENABLED
-};
-
 /**
  * struct klp_func - function structure for live patching
  * @old_name:	name of the function to be patched
@@ -41,8 +36,8 @@ enum klp_state {
  *		can be found (optional)
  * @old_addr:	the address of the function being patched
  * @kobj:	kobject for sysfs resources
- * @state:	tracks function-level patch application state
  * @stack_node:	list node for klp_ops func_stack list
+ * @patched:	the func has been added to the klp_ops list
  */
 struct klp_func {
 	/* external */
@@ -60,8 +55,8 @@ struct klp_func {
 	/* internal */
 	unsigned long old_addr;
 	struct kobject kobj;
-	enum klp_state state;
 	struct list_head stack_node;
+	bool patched;
 };
 
 /**
@@ -71,7 +66,7 @@ struct klp_func {
  * @kobj:	kobject for sysfs resources
  * @mod:	kernel module associated with the patched object
  * 		(NULL for vmlinux)
- * @state:	tracks object-level patch application state
+ * @patched:	the object's funcs have been added to the klp_ops list
  */
 struct klp_object {
 	/* external */
@@ -81,7 +76,7 @@ struct klp_object {
 	/* internal */
 	struct kobject kobj;
 	struct module *mod;
-	enum klp_state state;
+	bool patched;
 };
 
 /**
@@ -90,7 +85,7 @@ struct klp_object {
  * @objs:	object entries for kernel objects to be patched
  * @list:	list node for global list of registered patches
  * @kobj:	kobject for sysfs resources
- * @state:	tracks patch-level application state
+ * @enabled:	the patch is enabled (but operation may be incomplete)
  */
 struct klp_patch {
 	/* external */
@@ -100,7 +95,7 @@ struct klp_patch {
 	/* internal */
 	struct list_head list;
 	struct kobject kobj;
-	enum klp_state state;
+	bool enabled;
 };
 
 #define klp_for_each_object(patch, obj) \
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 6ea6880..2b59230 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -350,11 +350,11 @@ static unsigned long klp_get_ftrace_location(unsigned long faddr)
 }
 #endif
 
-static void klp_disable_func(struct klp_func *func)
+static void klp_unpatch_func(struct klp_func *func)
 {
 	struct klp_ops *ops;
 
-	if (WARN_ON(func->state != KLP_ENABLED))
+	if (WARN_ON(!func->patched))
 		return;
 	if (WARN_ON(!func->old_addr))
 		return;
@@ -380,10 +380,10 @@ static void klp_disable_func(struct klp_func *func)
 		list_del_rcu(&func->stack_node);
 	}
 
-	func->state = KLP_DISABLED;
+	func->patched = false;
 }
 
-static int klp_enable_func(struct klp_func *func)
+static int klp_patch_func(struct klp_func *func)
 {
 	struct klp_ops *ops;
 	int ret;
@@ -391,7 +391,7 @@ static int klp_enable_func(struct klp_func *func)
 	if (WARN_ON(!func->old_addr))
 		return -EINVAL;
 
-	if (WARN_ON(func->state != KLP_DISABLED))
+	if (WARN_ON(func->patched))
 		return -EINVAL;
 
 	ops = klp_find_ops(func->old_addr);
@@ -439,7 +439,7 @@ static int klp_enable_func(struct klp_func *func)
 		list_add_rcu(&func->stack_node, &ops->func_stack);
 	}
 
-	func->state = KLP_ENABLED;
+	func->patched = true;
 
 	return 0;
 
@@ -450,36 +450,36 @@ err:
 	return ret;
 }
 
-static void klp_disable_object(struct klp_object *obj)
+static void klp_unpatch_object(struct klp_object *obj)
 {
 	struct klp_func *func;
 
 	klp_for_each_func(obj, func)
-		if (func->state == KLP_ENABLED)
-			klp_disable_func(func);
+		if (func->patched)
+			klp_unpatch_func(func);
 
-	obj->state = KLP_DISABLED;
+	obj->patched = false;
 }
 
-static int klp_enable_object(struct klp_object *obj)
+static int klp_patch_object(struct klp_object *obj)
 {
 	struct klp_func *func;
 	int ret;
 
-	if (WARN_ON(obj->state != KLP_DISABLED))
+	if (WARN_ON(obj->patched))
 		return -EINVAL;
 
 	if (WARN_ON(!klp_is_object_loaded(obj)))
 		return -EINVAL;
 
 	klp_for_each_func(obj, func) {
-		ret = klp_enable_func(func);
+		ret = klp_patch_func(func);
 		if (ret) {
-			klp_disable_object(obj);
+			klp_unpatch_object(obj);
 			return ret;
 		}
 	}
-	obj->state = KLP_ENABLED;
+	obj->patched = true;
 
 	return 0;
 }
@@ -490,17 +490,17 @@ static int __klp_disable_patch(struct klp_patch *patch)
 
 	/* enforce stacking: only the last enabled patch can be disabled */
 	if (!list_is_last(&patch->list, &klp_patches) &&
-	    list_next_entry(patch, list)->state == KLP_ENABLED)
+	    list_next_entry(patch, list)->enabled)
 		return -EBUSY;
 
 	pr_notice("disabling patch '%s'\n", patch->mod->name);
 
 	klp_for_each_object(patch, obj) {
-		if (obj->state == KLP_ENABLED)
-			klp_disable_object(obj);
+		if (obj->patched)
+			klp_unpatch_object(obj);
 	}
 
-	patch->state = KLP_DISABLED;
+	patch->enabled = false;
 
 	return 0;
 }
@@ -524,7 +524,7 @@ int klp_disable_patch(struct klp_patch *patch)
 		goto err;
 	}
 
-	if (patch->state == KLP_DISABLED) {
+	if (!patch->enabled) {
 		ret = -EINVAL;
 		goto err;
 	}
@@ -542,12 +542,12 @@ static int __klp_enable_patch(struct klp_patch *patch)
 	struct klp_object *obj;
 	int ret;
 
-	if (WARN_ON(patch->state != KLP_DISABLED))
+	if (WARN_ON(patch->enabled))
 		return -EINVAL;
 
 	/* enforce stacking: only the first disabled patch can be enabled */
 	if (patch->list.prev != &klp_patches &&
-	    list_prev_entry(patch, list)->state == KLP_DISABLED)
+	    !list_prev_entry(patch, list)->enabled)
 		return -EBUSY;
 
 	pr_notice_once("tainting kernel with TAINT_LIVEPATCH\n");
@@ -559,12 +559,12 @@ static int __klp_enable_patch(struct klp_patch *patch)
 		if (!klp_is_object_loaded(obj))
 			continue;
 
-		ret = klp_enable_object(obj);
+		ret = klp_patch_object(obj);
 		if (ret)
 			goto unregister;
 	}
 
-	patch->state = KLP_ENABLED;
+	patch->enabled = true;
 
 	return 0;
 
@@ -622,20 +622,20 @@ static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr,
 	if (ret)
 		return -EINVAL;
 
-	if (val != KLP_DISABLED && val != KLP_ENABLED)
+	if (val > 1)
 		return -EINVAL;
 
 	patch = container_of(kobj, struct klp_patch, kobj);
 
 	mutex_lock(&klp_mutex);
 
-	if (val == patch->state) {
+	if (patch->enabled == val) {
 		/* already in requested state */
 		ret = -EINVAL;
 		goto err;
 	}
 
-	if (val == KLP_ENABLED) {
+	if (val) {
 		ret = __klp_enable_patch(patch);
 		if (ret)
 			goto err;
@@ -660,7 +660,7 @@ static ssize_t enabled_show(struct kobject *kobj,
 	struct klp_patch *patch;
 
 	patch = container_of(kobj, struct klp_patch, kobj);
-	return snprintf(buf, PAGE_SIZE-1, "%d\n", patch->state);
+	return snprintf(buf, PAGE_SIZE-1, "%d\n", patch->enabled);
 }
 
 static struct kobj_attribute enabled_kobj_attr = __ATTR_RW(enabled);
@@ -751,7 +751,7 @@ static void klp_free_patch(struct klp_patch *patch)
 static int klp_init_func(struct klp_object *obj, struct klp_func *func)
 {
 	INIT_LIST_HEAD(&func->stack_node);
-	func->state = KLP_DISABLED;
+	func->patched = false;
 
 	/* The format for the sysfs directory is <function,sympos> where sympos
 	 * is the nth occurrence of this symbol in kallsyms for the patched
@@ -794,7 +794,7 @@ static int klp_init_object(struct klp_patch *patch, struct klp_object *obj)
 	if (!obj->funcs)
 		return -EINVAL;
 
-	obj->state = KLP_DISABLED;
+	obj->patched = false;
 	obj->mod = NULL;
 
 	klp_find_object_module(obj);
@@ -835,7 +835,7 @@ static int klp_init_patch(struct klp_patch *patch)
 
 	mutex_lock(&klp_mutex);
 
-	patch->state = KLP_DISABLED;
+	patch->enabled = false;
 
 	ret = kobject_init_and_add(&patch->kobj, &klp_ktype_patch,
 				   klp_root_kobj, "%s", patch->mod->name);
@@ -881,7 +881,7 @@ int klp_unregister_patch(struct klp_patch *patch)
 		goto out;
 	}
 
-	if (patch->state == KLP_ENABLED) {
+	if (patch->enabled) {
 		ret = -EBUSY;
 		goto out;
 	}
@@ -968,13 +968,13 @@ int klp_module_coming(struct module *mod)
 				goto err;
 			}
 
-			if (patch->state == KLP_DISABLED)
+			if (!patch->enabled)
 				break;
 
 			pr_notice("applying patch '%s' to loading module '%s'\n",
 				  patch->mod->name, obj->mod->name);
 
-			ret = klp_enable_object(obj);
+			ret = klp_patch_object(obj);
 			if (ret) {
 				pr_warn("failed to apply patch '%s' to module '%s' (%d)\n",
 					patch->mod->name, obj->mod->name, ret);
@@ -1025,10 +1025,10 @@ void klp_module_going(struct module *mod)
 			if (!klp_is_module(obj) || strcmp(obj->name, mod->name))
 				continue;
 
-			if (patch->state != KLP_DISABLED) {
+			if (patch->enabled) {
 				pr_notice("reverting patch '%s' on unloading module '%s'\n",
 					  patch->mod->name, obj->mod->name);
-				klp_disable_object(obj);
+				klp_unpatch_object(obj);
 			}
 
 			klp_free_object_loaded(obj);
-- 
2.4.11

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [RFC PATCH v2 14/18] livepatch: remove unnecessary object loaded check
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (12 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 13/18] livepatch: separate enabled and patched states Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 15/18] livepatch: move patching functions into patch.c Josh Poimboeuf
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

klp_patch_object()'s callers already ensure that the object is loaded,
so its call to klp_is_object_loaded() is unnecessary.

This will also make it possible to move the patching code into a
separate file.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 kernel/livepatch/core.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 2b59230..2ad7892 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -469,9 +469,6 @@ static int klp_patch_object(struct klp_object *obj)
 	if (WARN_ON(obj->patched))
 		return -EINVAL;
 
-	if (WARN_ON(!klp_is_object_loaded(obj)))
-		return -EINVAL;
-
 	klp_for_each_func(obj, func) {
 		ret = klp_patch_func(func);
 		if (ret) {
-- 
2.4.11

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [RFC PATCH v2 15/18] livepatch: move patching functions into patch.c
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (13 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 14/18] livepatch: remove unnecessary object loaded check Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-05-03  9:39   ` Petr Mladek
  2016-04-28 20:44 ` [RFC PATCH v2 16/18] livepatch: store function sizes Josh Poimboeuf
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

Move functions related to the actual patching of functions and objects
into a new patch.c file.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 kernel/livepatch/Makefile |   2 +-
 kernel/livepatch/core.c   | 202 +------------------------------------------
 kernel/livepatch/patch.c  | 213 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/livepatch/patch.h  |  32 +++++++
 4 files changed, 247 insertions(+), 202 deletions(-)
 create mode 100644 kernel/livepatch/patch.c
 create mode 100644 kernel/livepatch/patch.h

diff --git a/kernel/livepatch/Makefile b/kernel/livepatch/Makefile
index e8780c0..e136dad 100644
--- a/kernel/livepatch/Makefile
+++ b/kernel/livepatch/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_LIVEPATCH) += livepatch.o
 
-livepatch-objs := core.o
+livepatch-objs := core.o patch.o
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 2ad7892..f28504d 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -24,32 +24,13 @@
 #include <linux/kernel.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
-#include <linux/ftrace.h>
 #include <linux/list.h>
 #include <linux/kallsyms.h>
 #include <linux/livepatch.h>
 #include <linux/elf.h>
 #include <linux/moduleloader.h>
 #include <asm/cacheflush.h>
-
-/**
- * struct klp_ops - structure for tracking registered ftrace ops structs
- *
- * A single ftrace_ops is shared between all enabled replacement functions
- * (klp_func structs) which have the same old_addr.  This allows the switch
- * between function versions to happen instantaneously by updating the klp_ops
- * struct's func_stack list.  The winner is the klp_func at the top of the
- * func_stack (front of the list).
- *
- * @node:	node for the global klp_ops list
- * @func_stack:	list head for the stack of klp_func's (active func is on top)
- * @fops:	registered ftrace ops struct
- */
-struct klp_ops {
-	struct list_head node;
-	struct list_head func_stack;
-	struct ftrace_ops fops;
-};
+#include "patch.h"
 
 /*
  * The klp_mutex protects the global lists and state transitions of any
@@ -60,28 +41,12 @@ struct klp_ops {
 static DEFINE_MUTEX(klp_mutex);
 
 static LIST_HEAD(klp_patches);
-static LIST_HEAD(klp_ops);
 
 static struct kobject *klp_root_kobj;
 
 /* TODO: temporary stub */
 void klp_patch_task(struct task_struct *task) {}
 
-static struct klp_ops *klp_find_ops(unsigned long old_addr)
-{
-	struct klp_ops *ops;
-	struct klp_func *func;
-
-	list_for_each_entry(ops, &klp_ops, node) {
-		func = list_first_entry(&ops->func_stack, struct klp_func,
-					stack_node);
-		if (func->old_addr == old_addr)
-			return ops;
-	}
-
-	return NULL;
-}
-
 static bool klp_is_module(struct klp_object *obj)
 {
 	return obj->name;
@@ -316,171 +281,6 @@ static int klp_write_object_relocations(struct module *pmod,
 	return ret;
 }
 
-static void notrace klp_ftrace_handler(unsigned long ip,
-				       unsigned long parent_ip,
-				       struct ftrace_ops *fops,
-				       struct pt_regs *regs)
-{
-	struct klp_ops *ops;
-	struct klp_func *func;
-
-	ops = container_of(fops, struct klp_ops, fops);
-
-	rcu_read_lock();
-	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
-				      stack_node);
-	if (WARN_ON_ONCE(!func))
-		goto unlock;
-
-	klp_arch_set_pc(regs, (unsigned long)func->new_func);
-unlock:
-	rcu_read_unlock();
-}
-
-/*
- * Convert a function address into the appropriate ftrace location.
- *
- * Usually this is just the address of the function, but on some architectures
- * it's more complicated so allow them to provide a custom behaviour.
- */
-#ifndef klp_get_ftrace_location
-static unsigned long klp_get_ftrace_location(unsigned long faddr)
-{
-	return faddr;
-}
-#endif
-
-static void klp_unpatch_func(struct klp_func *func)
-{
-	struct klp_ops *ops;
-
-	if (WARN_ON(!func->patched))
-		return;
-	if (WARN_ON(!func->old_addr))
-		return;
-
-	ops = klp_find_ops(func->old_addr);
-	if (WARN_ON(!ops))
-		return;
-
-	if (list_is_singular(&ops->func_stack)) {
-		unsigned long ftrace_loc;
-
-		ftrace_loc = klp_get_ftrace_location(func->old_addr);
-		if (WARN_ON(!ftrace_loc))
-			return;
-
-		WARN_ON(unregister_ftrace_function(&ops->fops));
-		WARN_ON(ftrace_set_filter_ip(&ops->fops, ftrace_loc, 1, 0));
-
-		list_del_rcu(&func->stack_node);
-		list_del(&ops->node);
-		kfree(ops);
-	} else {
-		list_del_rcu(&func->stack_node);
-	}
-
-	func->patched = false;
-}
-
-static int klp_patch_func(struct klp_func *func)
-{
-	struct klp_ops *ops;
-	int ret;
-
-	if (WARN_ON(!func->old_addr))
-		return -EINVAL;
-
-	if (WARN_ON(func->patched))
-		return -EINVAL;
-
-	ops = klp_find_ops(func->old_addr);
-	if (!ops) {
-		unsigned long ftrace_loc;
-
-		ftrace_loc = klp_get_ftrace_location(func->old_addr);
-		if (!ftrace_loc) {
-			pr_err("failed to find location for function '%s'\n",
-				func->old_name);
-			return -EINVAL;
-		}
-
-		ops = kzalloc(sizeof(*ops), GFP_KERNEL);
-		if (!ops)
-			return -ENOMEM;
-
-		ops->fops.func = klp_ftrace_handler;
-		ops->fops.flags = FTRACE_OPS_FL_SAVE_REGS |
-				  FTRACE_OPS_FL_DYNAMIC |
-				  FTRACE_OPS_FL_IPMODIFY;
-
-		list_add(&ops->node, &klp_ops);
-
-		INIT_LIST_HEAD(&ops->func_stack);
-		list_add_rcu(&func->stack_node, &ops->func_stack);
-
-		ret = ftrace_set_filter_ip(&ops->fops, ftrace_loc, 0, 0);
-		if (ret) {
-			pr_err("failed to set ftrace filter for function '%s' (%d)\n",
-			       func->old_name, ret);
-			goto err;
-		}
-
-		ret = register_ftrace_function(&ops->fops);
-		if (ret) {
-			pr_err("failed to register ftrace handler for function '%s' (%d)\n",
-			       func->old_name, ret);
-			ftrace_set_filter_ip(&ops->fops, ftrace_loc, 1, 0);
-			goto err;
-		}
-
-
-	} else {
-		list_add_rcu(&func->stack_node, &ops->func_stack);
-	}
-
-	func->patched = true;
-
-	return 0;
-
-err:
-	list_del_rcu(&func->stack_node);
-	list_del(&ops->node);
-	kfree(ops);
-	return ret;
-}
-
-static void klp_unpatch_object(struct klp_object *obj)
-{
-	struct klp_func *func;
-
-	klp_for_each_func(obj, func)
-		if (func->patched)
-			klp_unpatch_func(func);
-
-	obj->patched = false;
-}
-
-static int klp_patch_object(struct klp_object *obj)
-{
-	struct klp_func *func;
-	int ret;
-
-	if (WARN_ON(obj->patched))
-		return -EINVAL;
-
-	klp_for_each_func(obj, func) {
-		ret = klp_patch_func(func);
-		if (ret) {
-			klp_unpatch_object(obj);
-			return ret;
-		}
-	}
-	obj->patched = true;
-
-	return 0;
-}
-
 static int __klp_disable_patch(struct klp_patch *patch)
 {
 	struct klp_object *obj;
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
new file mode 100644
index 0000000..782fbb5
--- /dev/null
+++ b/kernel/livepatch/patch.c
@@ -0,0 +1,213 @@
+/*
+ * patch.c - livepatch patching functions
+ *
+ * Copyright (C) 2014 Seth Jennings <sjenning@redhat.com>
+ * Copyright (C) 2014 SUSE
+ * Copyright (C) 2015 Josh Poimboeuf <jpoimboe@redhat.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/livepatch.h>
+#include <linux/list.h>
+#include <linux/ftrace.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/bug.h>
+#include <linux/printk.h>
+#include "patch.h"
+
+static LIST_HEAD(klp_ops);
+
+struct klp_ops *klp_find_ops(unsigned long old_addr)
+{
+	struct klp_ops *ops;
+	struct klp_func *func;
+
+	list_for_each_entry(ops, &klp_ops, node) {
+		func = list_first_entry(&ops->func_stack, struct klp_func,
+					stack_node);
+		if (func->old_addr == old_addr)
+			return ops;
+	}
+
+	return NULL;
+}
+
+static void notrace klp_ftrace_handler(unsigned long ip,
+				       unsigned long parent_ip,
+				       struct ftrace_ops *fops,
+				       struct pt_regs *regs)
+{
+	struct klp_ops *ops;
+	struct klp_func *func;
+
+	ops = container_of(fops, struct klp_ops, fops);
+
+	rcu_read_lock();
+	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
+				      stack_node);
+	if (WARN_ON_ONCE(!func))
+		goto unlock;
+
+	klp_arch_set_pc(regs, (unsigned long)func->new_func);
+unlock:
+	rcu_read_unlock();
+}
+
+/*
+ * Convert a function address into the appropriate ftrace location.
+ *
+ * Usually this is just the address of the function, but on some architectures
+ * it's more complicated so allow them to provide a custom behaviour.
+ */
+#ifndef klp_get_ftrace_location
+static unsigned long klp_get_ftrace_location(unsigned long faddr)
+{
+	return faddr;
+}
+#endif
+
+static void klp_unpatch_func(struct klp_func *func)
+{
+	struct klp_ops *ops;
+
+	if (WARN_ON(!func->patched))
+		return;
+	if (WARN_ON(!func->old_addr))
+		return;
+
+	ops = klp_find_ops(func->old_addr);
+	if (WARN_ON(!ops))
+		return;
+
+	if (list_is_singular(&ops->func_stack)) {
+		unsigned long ftrace_loc;
+
+		ftrace_loc = klp_get_ftrace_location(func->old_addr);
+		if (WARN_ON(!ftrace_loc))
+			return;
+
+		WARN_ON(unregister_ftrace_function(&ops->fops));
+		WARN_ON(ftrace_set_filter_ip(&ops->fops, ftrace_loc, 1, 0));
+
+		list_del_rcu(&func->stack_node);
+		list_del(&ops->node);
+		kfree(ops);
+	} else {
+		list_del_rcu(&func->stack_node);
+	}
+
+	func->patched = false;
+}
+
+static int klp_patch_func(struct klp_func *func)
+{
+	struct klp_ops *ops;
+	int ret;
+
+	if (WARN_ON(!func->old_addr))
+		return -EINVAL;
+
+	if (WARN_ON(func->patched))
+		return -EINVAL;
+
+	ops = klp_find_ops(func->old_addr);
+	if (!ops) {
+		unsigned long ftrace_loc;
+
+		ftrace_loc = klp_get_ftrace_location(func->old_addr);
+		if (!ftrace_loc) {
+			pr_err("failed to find location for function '%s'\n",
+				func->old_name);
+			return -EINVAL;
+		}
+
+		ops = kzalloc(sizeof(*ops), GFP_KERNEL);
+		if (!ops)
+			return -ENOMEM;
+
+		ops->fops.func = klp_ftrace_handler;
+		ops->fops.flags = FTRACE_OPS_FL_SAVE_REGS |
+				  FTRACE_OPS_FL_DYNAMIC |
+				  FTRACE_OPS_FL_IPMODIFY;
+
+		list_add(&ops->node, &klp_ops);
+
+		INIT_LIST_HEAD(&ops->func_stack);
+		list_add_rcu(&func->stack_node, &ops->func_stack);
+
+		ret = ftrace_set_filter_ip(&ops->fops, ftrace_loc, 0, 0);
+		if (ret) {
+			pr_err("failed to set ftrace filter for function '%s' (%d)\n",
+			       func->old_name, ret);
+			goto err;
+		}
+
+		ret = register_ftrace_function(&ops->fops);
+		if (ret) {
+			pr_err("failed to register ftrace handler for function '%s' (%d)\n",
+			       func->old_name, ret);
+			ftrace_set_filter_ip(&ops->fops, ftrace_loc, 1, 0);
+			goto err;
+		}
+
+
+	} else {
+		list_add_rcu(&func->stack_node, &ops->func_stack);
+	}
+
+	func->patched = true;
+
+	return 0;
+
+err:
+	list_del_rcu(&func->stack_node);
+	list_del(&ops->node);
+	kfree(ops);
+	return ret;
+}
+
+void klp_unpatch_object(struct klp_object *obj)
+{
+	struct klp_func *func;
+
+	klp_for_each_func(obj, func)
+		if (func->patched)
+			klp_unpatch_func(func);
+
+	obj->patched = false;
+}
+
+int klp_patch_object(struct klp_object *obj)
+{
+	struct klp_func *func;
+	int ret;
+
+	if (WARN_ON(obj->patched))
+		return -EINVAL;
+
+	klp_for_each_func(obj, func) {
+		ret = klp_patch_func(func);
+		if (ret) {
+			klp_unpatch_object(obj);
+			return ret;
+		}
+	}
+	obj->patched = true;
+
+	return 0;
+}
diff --git a/kernel/livepatch/patch.h b/kernel/livepatch/patch.h
new file mode 100644
index 0000000..2d0cce0
--- /dev/null
+++ b/kernel/livepatch/patch.h
@@ -0,0 +1,32 @@
+#ifndef _LIVEPATCH_PATCH_H
+#define _LIVEPATCH_PATCH_H
+
+#include <linux/livepatch.h>
+#include <linux/list.h>
+#include <linux/ftrace.h>
+
+/**
+ * struct klp_ops - structure for tracking registered ftrace ops structs
+ *
+ * A single ftrace_ops is shared between all enabled replacement functions
+ * (klp_func structs) which have the same old_addr.  This allows the switch
+ * between function versions to happen instantaneously by updating the klp_ops
+ * struct's func_stack list.  The winner is the klp_func at the top of the
+ * func_stack (front of the list).
+ *
+ * @node:	node for the global klp_ops list
+ * @func_stack:	list head for the stack of klp_func's (active func is on top)
+ * @fops:	registered ftrace ops struct
+ */
+struct klp_ops {
+	struct list_head node;
+	struct list_head func_stack;
+	struct ftrace_ops fops;
+};
+
+struct klp_ops *klp_find_ops(unsigned long old_addr);
+
+int klp_patch_object(struct klp_object *obj);
+void klp_unpatch_object(struct klp_object *obj);
+
+#endif /* _LIVEPATCH_PATCH_H */
-- 
2.4.11

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [RFC PATCH v2 16/18] livepatch: store function sizes
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (14 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 15/18] livepatch: move patching functions into patch.c Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
  2016-04-28 20:44 ` [RFC PATCH v2 18/18] livepatch: add /proc/<pid>/patch_state Josh Poimboeuf
  17 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

For the consistency model we'll need to know the sizes of the old and
new functions to determine if they're on the stacks of any tasks.
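
For example, a minimal sketch of such a check, as a hypothetical helper
using only the fields added by this patch:

  #include <linux/livepatch.h>

  /* Hypothetical: true if 'addr' falls inside either the old or the new
   * implementation of 'func', using the recorded sizes. */
  static bool klp_func_covers_addr(struct klp_func *func, unsigned long addr)
  {
      unsigned long old_start = func->old_addr;
      unsigned long new_start = (unsigned long)func->new_func;

      return (addr >= old_start && addr < old_start + func->old_size) ||
             (addr >= new_start && addr < new_start + func->new_size);
  }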

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 include/linux/livepatch.h |  3 +++
 kernel/livepatch/core.c   | 16 ++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 9ba26c5..c38c694 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -37,6 +37,8 @@
  * @old_addr:	the address of the function being patched
  * @kobj:	kobject for sysfs resources
  * @stack_node:	list node for klp_ops func_stack list
+ * @old_size:	size of the old function
+ * @new_size:	size of the new function
  * @patched:	the func has been added to the klp_ops list
  */
 struct klp_func {
@@ -56,6 +58,7 @@ struct klp_func {
 	unsigned long old_addr;
 	struct kobject kobj;
 	struct list_head stack_node;
+	unsigned long old_size, new_size;
 	bool patched;
 };
 
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index f28504d..aa3dbdf 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -577,6 +577,22 @@ static int klp_init_object_loaded(struct klp_patch *patch,
 					     &func->old_addr);
 		if (ret)
 			return ret;
+
+		ret = kallsyms_lookup_size_offset(func->old_addr,
+						  &func->old_size, NULL);
+		if (!ret) {
+			pr_err("kallsyms size lookup failed for '%s'\n",
+			       func->old_name);
+			return -ENOENT;
+		}
+
+		ret = kallsyms_lookup_size_offset((unsigned long)func->new_func,
+						  &func->new_size, NULL);
+		if (!ret) {
+			pr_err("kallsyms size lookup failed for '%s' replacement\n",
+			       func->old_name);
+			return -ENOENT;
+		}
 	}
 
 	return 0;
-- 
2.4.11

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (15 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 16/18] livepatch: store function sizes Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  2016-05-04  8:42   ` Petr Mladek
                     ` (7 more replies)
  2016-04-28 20:44 ` [RFC PATCH v2 18/18] livepatch: add /proc/<pid>/patch_state Josh Poimboeuf
  17 siblings, 8 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

Change livepatch to use a basic per-task consistency model.  This is the
foundation which will eventually enable us to patch those ~10% of
security patches which change function or data semantics.  This is the
biggest remaining piece needed to make livepatch more generally useful.

This code stems from the design proposal made by Vojtech [1] in November
2014.  It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
consistency and syscall barrier switching combined with kpatch's stack
trace switching.  There are also a number of fallback options which make
it quite flexible.

Patches are applied on a per-task basis, when the task is deemed safe to
switch over.  When a patch is enabled, livepatch enters into a
transition state where tasks are converging to the patched state.
Usually this transition state can complete in a few seconds.  The same
sequence occurs when a patch is disabled, except the tasks converge from
the patched state to the unpatched state.

An interrupt handler inherits the patched state of the task it
interrupts.  The same is true for forked tasks: the child inherits the
patched state of the parent.

Livepatch uses several complementary approaches to determine when it's
safe to patch tasks:

1. The first and most effective approach is stack checking of sleeping
   tasks.  If no affected functions are on the stack of a given task,
   the task is patched.  In most cases this will patch most or all of
   the tasks on the first try.  Otherwise it'll keep trying
   periodically.  This option is only available if the architecture has
   reliable stacks (CONFIG_RELIABLE_STACKTRACE and
   CONFIG_STACK_VALIDATION).

2. The second approach, if needed, is kernel exit switching.  A
   task is switched when it returns to user space from a system call, a
   user space IRQ, or a signal.  It's useful in the following cases:

   a) Patching I/O-bound user tasks which are sleeping on an affected
      function.  In this case you have to send SIGSTOP and SIGCONT to
      force it to exit the kernel and be patched.
   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
      then it will get patched the next time it gets interrupted by an
      IRQ.
   c) Applying patches for architectures which don't yet have
      CONFIG_RELIABLE_STACKTRACE.  In this case you'll have to signal
      most of the tasks on the system.  However this isn't a complete
      solution, because there's currently no way to patch kthreads
      without CONFIG_RELIABLE_STACKTRACE.

   Note: since idle "swapper" tasks don't ever exit the kernel, they
   instead have a klp_patch_task() call in the idle loop which allows
   them to be patched before the CPU enters the idle state.

3. A third approach (not yet implemented) is planned for the case where
   a kthread is sleeping on an affected function.  In that case we could
   kick the kthread with a signal and then try to patch the task from
   the to-be-patched function's livepatch ftrace handler when it
   re-enters the function.  This will require
   CONFIG_RELIABLE_STACKTRACE.
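
To make approach 1 concrete, here is a rough sketch of the per-task stack
check (hypothetical code reusing the klp_for_each_* iterators and the
old_addr/old_size fields from this series, not the actual transition.c
implementation):

  /* Returns 0 if the task is safe to switch, -EBUSY if one of the patched
   * functions still appears in its reliable stack trace 'entries'. */
  static int klp_check_task_stack(struct klp_patch *patch,
                                  unsigned long *entries, unsigned int nr)
  {
      struct klp_object *obj;
      struct klp_func *func;
      unsigned int i;

      klp_for_each_object(patch, obj) {
          klp_for_each_func(obj, func) {
              for (i = 0; i < nr; i++) {
                  if (entries[i] >= func->old_addr &&
                      entries[i] <  func->old_addr + func->old_size)
                      return -EBUSY;
              }
          }
      }

      return 0;
  }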

All the above approaches may be skipped by setting the 'immediate' flag
in the 'klp_patch' struct, which will patch all tasks immediately.  This
can be useful if the patch doesn't change any function or data
semantics.  Note that, even with this flag set, it's possible that some
tasks may still be running with an old version of the function, until
that function returns.

There's also an 'immediate' flag in the 'klp_func' struct which allows
you to specify that certain functions in the patch can be applied
without per-task consistency.  This might be useful if you want to patch
a common function like schedule(), and the function change doesn't need
consistency but the rest of the patch does.
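
For illustration, a hypothetical patch module might set these flags as
follows (livepatch_schedule is a made-up replacement function; the struct
fields are the ones defined by this series):

  #include <linux/module.h>
  #include <linux/livepatch.h>

  /* Made-up replacement whose body would carry the actual fix. */
  static void livepatch_schedule(void)
  {
  }

  static struct klp_func funcs[] = {
      {
          .old_name  = "schedule",
          .new_func  = livepatch_schedule,
          .immediate = true,    /* this func needs no per-task consistency */
      },
      { }
  };

  static struct klp_object objs[] = {
      {
          /* name == NULL means vmlinux */
          .funcs = funcs,
      },
      { }
  };

  static struct klp_patch patch = {
      .mod  = THIS_MODULE,
      .objs = objs,
      /* .immediate = true here would bypass the consistency model for
       * the whole patch */
  };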

For architectures which don't have CONFIG_RELIABLE_STACKTRACE, there
are two options:

a) the user can set the patch->immediate flag which causes all tasks to
   be patched immediately.  This option should be used with care, only
   when the patch doesn't change any function or data semantics; or

b) use the kernel exit switching approach (this is the default).
   Note that the patching will never complete because there's currently
   no way to patch kthreads without CONFIG_RELIABLE_STACKTRACE.

The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
is in transition.  Only a single patch (the topmost patch on the stack)
can be in transition at a given time.  A patch can remain in transition
indefinitely, if any of the tasks are stuck in the initial patch state.

A transition can be reversed and effectively canceled by writing the
opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
the transition is in progress.  Then all the tasks will attempt to
converge back to the original patch state.

[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 Documentation/ABI/testing/sysfs-kernel-livepatch |   8 +
 Documentation/livepatch/livepatch.txt            | 132 ++++++-
 include/linux/init_task.h                        |   9 +
 include/linux/livepatch.h                        |  34 +-
 include/linux/sched.h                            |   3 +
 kernel/fork.c                                    |   3 +
 kernel/livepatch/Makefile                        |   2 +-
 kernel/livepatch/core.c                          |  98 +++--
 kernel/livepatch/patch.c                         |  43 +-
 kernel/livepatch/patch.h                         |   1 +
 kernel/livepatch/transition.c                    | 474 +++++++++++++++++++++++
 kernel/livepatch/transition.h                    |  14 +
 kernel/sched/idle.c                              |   4 +
 13 files changed, 781 insertions(+), 44 deletions(-)
 create mode 100644 kernel/livepatch/transition.c
 create mode 100644 kernel/livepatch/transition.h

diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch
index da87f43..24ca6df 100644
--- a/Documentation/ABI/testing/sysfs-kernel-livepatch
+++ b/Documentation/ABI/testing/sysfs-kernel-livepatch
@@ -25,6 +25,14 @@ Description:
 		code is currently applied.  Writing 0 will disable the patch
 		while writing 1 will re-enable the patch.
 
+What:		/sys/kernel/livepatch/<patch>/transition
+Date:		May 2016
+KernelVersion:	4.7.0
+Contact:	live-patching@vger.kernel.org
+Description:
+		An attribute which indicates whether the patch is currently in
+		transition.
+
 What:		/sys/kernel/livepatch/<patch>/<object>
 Date:		Nov 2014
 KernelVersion:	3.19.0
diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt
index 6c43f6e..bee86d0 100644
--- a/Documentation/livepatch/livepatch.txt
+++ b/Documentation/livepatch/livepatch.txt
@@ -72,7 +72,8 @@ example, they add a NULL pointer or a boundary check, fix a race by adding
 a missing memory barrier, or add some locking around a critical section.
 Most of these changes are self contained and the function presents itself
 the same way to the rest of the system. In this case, the functions might
-be updated independently one by one.
+be updated independently one by one.  (This can be done by setting the
+'immediate' flag in the klp_patch struct.)
 
 But there are more complex fixes. For example, a patch might change
 ordering of locking in multiple functions at the same time. Or a patch
@@ -86,20 +87,103 @@ or no data are stored in the modified structures at the moment.
 The theory about how to apply functions a safe way is rather complex.
 The aim is to define a so-called consistency model. It attempts to define
 conditions when the new implementation could be used so that the system
-stays consistent. The theory is not yet finished. See the discussion at
-http://thread.gmane.org/gmane.linux.kernel/1823033/focus=1828189
-
-The current consistency model is very simple. It guarantees that either
-the old or the new function is called. But various functions get redirected
-one by one without any synchronization.
-
-In other words, the current implementation _never_ modifies the behavior
-in the middle of the call. It is because it does _not_ rewrite the entire
-function in the memory. Instead, the function gets redirected at the
-very beginning. But this redirection is used immediately even when
-some other functions from the same patch have not been redirected yet.
-
-See also the section "Limitations" below.
+stays consistent.
+
+Livepatch has a consistency model which is a hybrid of kGraft and
+kpatch:  it uses kGraft's per-task consistency and syscall barrier
+switching combined with kpatch's stack trace switching.  There are also
+a number of fallback options which make it quite flexible.
+
+Patches are applied on a per-task basis, when the task is deemed safe to
+switch over.  When a patch is enabled, livepatch enters into a
+transition state where tasks are converging to the patched state.
+Usually this transition state can complete in a few seconds.  The same
+sequence occurs when a patch is disabled, except the tasks converge from
+the patched state to the unpatched state.
+
+An interrupt handler inherits the patched state of the task it
+interrupts.  The same is true for forked tasks: the child inherits the
+patched state of the parent.
+
+Livepatch uses several complementary approaches to determine when it's
+safe to patch tasks:
+
+1. The first and most effective approach is stack checking of sleeping
+   tasks.  If no affected functions are on the stack of a given task,
+   the task is patched.  In most cases this will patch most or all of
+   the tasks on the first try.  Otherwise it'll keep trying
+   periodically.  This option is only available if the architecture has
+   reliable stacks (CONFIG_RELIABLE_STACKTRACE and
+   CONFIG_STACK_VALIDATION).
+
+2. The second approach, if needed, is kernel exit switching.  A
+   task is switched when it returns to user space from a system call, a
+   user space IRQ, or a signal.  It's useful in the following cases:
+
+   a) Patching I/O-bound user tasks which are sleeping on an affected
+      function.  In this case you have to send SIGSTOP and SIGCONT to
+      force it to exit the kernel and be patched.
+   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
+      then it will get patched the next time it gets interrupted by an
+      IRQ.
+   c) Applying patches for architectures which don't yet have
+      CONFIG_RELIABLE_STACKTRACE.  In this case you'll have to signal
+      most of the tasks on the system.  However this isn't a complete
+      solution, because there's currently no way to patch kthreads
+      without CONFIG_RELIABLE_STACKTRACE.
+
+   Note: since idle "swapper" tasks don't ever exit the kernel, they
+   instead have a klp_patch_task() call in the idle loop which allows
+   them to be patched before the CPU enters the idle state.
+
+3. A third approach (not yet implemented) is planned for the case where
+   a kthread is sleeping on an affected function.  In that case we could
+   kick the kthread with a signal and then try to patch the task from
+   the to-be-patched function's livepatch ftrace handler when it
+   re-enters the function.  This will require
+   CONFIG_RELIABLE_STACKTRACE.
+
+All the above approaches may be skipped by setting the 'immediate' flag
+in the 'klp_patch' struct, which will patch all tasks immediately.  This
+can be useful if the patch doesn't change any function or data
+semantics.  Note that, even with this flag set, it's possible that some
+tasks may still be running with an old version of the function, until
+that function returns.
+
+There's also an 'immediate' flag in the 'klp_func' struct which allows
+you to specify that certain functions in the patch can be applied
+without per-task consistency.  This might be useful if you want to patch
+a common function like schedule(), and the function change doesn't need
+consistency but the rest of the patch does.
+
+For architectures which don't have CONFIG_RELIABLE_STACKTRACE, there
+are two options:
+
+a) the user can set the patch->immediate flag which causes all tasks to
+   be patched immediately.  This option should be used with care, only
+   when the patch doesn't change any function or data semantics; or
+
+b) use the kernel exit switching approach (this is the default).
+   Note that the patching will never complete because there's currently
+   no way to patch kthreads without CONFIG_RELIABLE_STACKTRACE.
+
+The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
+is in transition.  Only a single patch (the topmost patch on the stack)
+can be in transition at a given time.  A patch can remain in transition
+indefinitely, if any of the tasks are stuck in the initial patch state.
+
+A transition can be reversed and effectively canceled by writing the
+opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
+the transition is in progress.  Then all the tasks will attempt to
+converge back to the original patch state.
+
+There's also a /proc/<pid>/patch_state file which can be used to
+determine which tasks are blocking completion of a patching operation.
+If a patch is in transition, this file shows 0 to indicate the task is
+unpatched and 1 to indicate it's patched.  Otherwise, if no patch is in
+transition, it shows -1.  Any tasks which are blocking the transition
+can be signaled with SIGSTOP and SIGCONT to force them to change their
+patched state.
 
 
 4. Livepatch module
@@ -239,9 +323,15 @@ Registered patches might be enabled either by calling klp_enable_patch() or
 by writing '1' to /sys/kernel/livepatch/<name>/enabled. The system will
 start using the new implementation of the patched functions at this stage.
 
-In particular, if an original function is patched for the first time, a
-function specific struct klp_ops is created and an universal ftrace handler
-is registered.
+When a patch is enabled, livepatch enters into a transition state where
+tasks are converging to the patched state.  This is indicated by a value
+of '1' in /sys/kernel/livepatch/<name>/transition.  Once all tasks have
+been patched, the 'transition' value changes to '0'.  For more
+information about this process, see the "Consistency model" section.
+
+If an original function is patched for the first time, a function
+specific struct klp_ops is created and a universal ftrace handler is
+registered.
 
 Functions might be patched multiple times. The ftrace handler is registered
 only once for the given function. Further patches just add an entry to the
@@ -261,6 +351,12 @@ by writing '0' to /sys/kernel/livepatch/<name>/enabled. At this stage
 either the code from the previously enabled patch or even the original
 code gets used.
 
+When a patch is disabled, livepatch enters into a transition state where
+tasks are converging to the unpatched state.  This is indicated by a
+value of '1' in /sys/kernel/livepatch/<name>/transition.  Once all tasks
+have been unpatched, the 'transition' value changes to '0'.  For more
+information about this process, see the "Consistency model" section.
+
 Here all the functions (struct klp_func) associated with the to-be-disabled
 patch are removed from the corresponding struct klp_ops. The ftrace handler
 is unregistered and the struct klp_ops is freed when the func_stack list
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f2cb8d4..12199ef 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -14,6 +14,7 @@
 #include <linux/rbtree.h>
 #include <net/net_namespace.h>
 #include <linux/sched/rt.h>
+#include <linux/livepatch.h>
 
 #ifdef CONFIG_SMP
 # define INIT_PUSHABLE_TASKS(tsk)					\
@@ -183,6 +184,13 @@ extern struct task_group root_task_group;
 # define INIT_KASAN(tsk)
 #endif
 
+#ifdef CONFIG_LIVEPATCH
+#define INIT_LIVEPATCH(tsk)						\
+	.patch_state = KLP_UNDEFINED,
+#else
+#define INIT_LIVEPATCH(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -260,6 +268,7 @@ extern struct task_group root_task_group;
 	INIT_VTIME(tsk)							\
 	INIT_NUMA_BALANCING(tsk)					\
 	INIT_KASAN(tsk)							\
+	INIT_LIVEPATCH(tsk)						\
 }
 
 
diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index c38c694..6ec50ff 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -28,18 +28,40 @@
 
 #include <asm/livepatch.h>
 
+/* task patch states */
+#define KLP_UNDEFINED	-1
+#define KLP_UNPATCHED	0
+#define KLP_PATCHED	1
+
 /**
  * struct klp_func - function structure for live patching
  * @old_name:	name of the function to be patched
  * @new_func:	pointer to the patched function code
  * @old_sympos: a hint indicating which symbol position the old function
  *		can be found (optional)
+ * @immediate:  patch the func immediately, bypassing backtrace safety checks
  * @old_addr:	the address of the function being patched
  * @kobj:	kobject for sysfs resources
  * @stack_node:	list node for klp_ops func_stack list
  * @old_size:	size of the old function
  * @new_size:	size of the new function
  * @patched:	the func has been added to the klp_ops list
+ * @transition:	the func is currently being applied or reverted
+ *
+ * The patched and transition variables define the func's patching state.  When
+ * patching, a func is always in one of the following states:
+ *
+ *   patched=0 transition=0: unpatched
+ *   patched=0 transition=1: unpatched, temporary starting state
+ *   patched=1 transition=1: patched, may be visible to some tasks
+ *   patched=1 transition=0: patched, visible to all tasks
+ *
+ * And when unpatching, it goes in the reverse order:
+ *
+ *   patched=1 transition=0: patched, visible to all tasks
+ *   patched=1 transition=1: patched, may be visible to some tasks
+ *   patched=0 transition=1: unpatched, temporary ending state
+ *   patched=0 transition=0: unpatched
  */
 struct klp_func {
 	/* external */
@@ -53,6 +75,7 @@ struct klp_func {
 	 * in kallsyms for the given object is used.
 	 */
 	unsigned long old_sympos;
+	bool immediate;
 
 	/* internal */
 	unsigned long old_addr;
@@ -60,6 +83,7 @@ struct klp_func {
 	struct list_head stack_node;
 	unsigned long old_size, new_size;
 	bool patched;
+	bool transition;
 };
 
 /**
@@ -86,6 +110,7 @@ struct klp_object {
  * struct klp_patch - patch structure for live patching
  * @mod:	reference to the live patch module
  * @objs:	object entries for kernel objects to be patched
+ * @immediate:  patch all funcs immediately, bypassing safety mechanisms
  * @list:	list node for global list of registered patches
  * @kobj:	kobject for sysfs resources
  * @enabled:	the patch is enabled (but operation may be incomplete)
@@ -94,6 +119,7 @@ struct klp_patch {
 	/* external */
 	struct module *mod;
 	struct klp_object *objs;
+	bool immediate;
 
 	/* internal */
 	struct list_head list;
@@ -116,15 +142,21 @@ int klp_disable_patch(struct klp_patch *);
 int klp_module_coming(struct module *mod);
 void klp_module_going(struct module *mod);
 
-static inline bool klp_patch_pending(struct task_struct *task) { return false; }
+void klp_copy_process(struct task_struct *child);
 void klp_patch_task(struct task_struct *task);
 
+static inline bool klp_patch_pending(struct task_struct *task)
+{
+	return test_tsk_thread_flag(task, TIF_PATCH_PENDING);
+}
+
 #else /* !CONFIG_LIVEPATCH */
 
 static inline int klp_module_coming(struct module *mod) { return 0; }
 static inline void klp_module_going(struct module *mod) {}
 static inline bool klp_patch_pending(struct task_struct *task) { return false; }
 static inline void klp_patch_task(struct task_struct *task) {}
+static inline void klp_copy_process(struct task_struct *child) {}
 
 #endif /* CONFIG_LIVEPATCH */
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fb364a0..7fc8b49 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1860,6 +1860,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_LIVEPATCH
+	int patch_state;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index d2fe04a..a12e3b0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,7 @@
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/livepatch.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1586,6 +1587,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		p->parent_exec_id = current->self_exec_id;
 	}
 
+	klp_copy_process(p);
+
 	spin_lock(&current->sighand->siglock);
 
 	/*
diff --git a/kernel/livepatch/Makefile b/kernel/livepatch/Makefile
index e136dad..2b8bdb1 100644
--- a/kernel/livepatch/Makefile
+++ b/kernel/livepatch/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_LIVEPATCH) += livepatch.o
 
-livepatch-objs := core.o patch.o
+livepatch-objs := core.o patch.o transition.o
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index aa3dbdf..0be352f 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -31,12 +31,15 @@
 #include <linux/moduleloader.h>
 #include <asm/cacheflush.h>
 #include "patch.h"
+#include "transition.h"
 
 /*
- * The klp_mutex protects the global lists and state transitions of any
- * structure reachable from them.  References to any structure must be obtained
- * under mutex protection (except in klp_ftrace_handler(), which uses RCU to
- * ensure it gets consistent data).
+ * klp_mutex is a coarse lock which serializes access to klp data.  All
+ * accesses to klp-related variables and structures must have mutex protection,
+ * except within the following functions which carefully avoid the need for it:
+ *
+ * - klp_ftrace_handler()
+ * - klp_patch_task()
  */
 static DEFINE_MUTEX(klp_mutex);
 
@@ -44,8 +47,28 @@ static LIST_HEAD(klp_patches);
 
 static struct kobject *klp_root_kobj;
 
-/* TODO: temporary stub */
-void klp_patch_task(struct task_struct *task) {}
+static void klp_work_fn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(klp_work, klp_work_fn);
+
+static void klp_schedule_work(void)
+{
+	schedule_delayed_work(&klp_work, round_jiffies_relative(HZ));
+}
+
+/*
+ * This work can be performed periodically to finish patching or unpatching any
+ * "straggler" tasks which failed to switch in the initial transition attempt.
+ */
+static void klp_work_fn(struct work_struct *work)
+{
+	mutex_lock(&klp_mutex);
+
+	if (klp_transition_patch)
+		if (!klp_try_complete_transition())
+			klp_schedule_work();
+
+	mutex_unlock(&klp_mutex);
+}
 
 static bool klp_is_module(struct klp_object *obj)
 {
@@ -85,7 +108,6 @@ static void klp_find_object_module(struct klp_object *obj)
 	mutex_unlock(&module_mutex);
 }
 
-/* klp_mutex must be held by caller */
 static bool klp_is_patch_registered(struct klp_patch *patch)
 {
 	struct klp_patch *mypatch;
@@ -283,19 +305,18 @@ static int klp_write_object_relocations(struct module *pmod,
 
 static int __klp_disable_patch(struct klp_patch *patch)
 {
-	struct klp_object *obj;
+	if (klp_transition_patch)
+		return -EBUSY;
 
 	/* enforce stacking: only the last enabled patch can be disabled */
 	if (!list_is_last(&patch->list, &klp_patches) &&
 	    list_next_entry(patch, list)->enabled)
 		return -EBUSY;
 
-	pr_notice("disabling patch '%s'\n", patch->mod->name);
-
-	klp_for_each_object(patch, obj) {
-		if (obj->patched)
-			klp_unpatch_object(obj);
-	}
+	klp_init_transition(patch, KLP_UNPATCHED);
+	klp_start_transition();
+	if (!klp_try_complete_transition())
+		klp_schedule_work();
 
 	patch->enabled = false;
 
@@ -339,6 +360,9 @@ static int __klp_enable_patch(struct klp_patch *patch)
 	struct klp_object *obj;
 	int ret;
 
+	if (klp_transition_patch)
+		return -EBUSY;
+
 	if (WARN_ON(patch->enabled))
 		return -EINVAL;
 
@@ -350,24 +374,32 @@ static int __klp_enable_patch(struct klp_patch *patch)
 	pr_notice_once("tainting kernel with TAINT_LIVEPATCH\n");
 	add_taint(TAINT_LIVEPATCH, LOCKDEP_STILL_OK);
 
-	pr_notice("enabling patch '%s'\n", patch->mod->name);
+	klp_init_transition(patch, KLP_PATCHED);
 
 	klp_for_each_object(patch, obj) {
 		if (!klp_is_object_loaded(obj))
 			continue;
 
 		ret = klp_patch_object(obj);
-		if (ret)
-			goto unregister;
+		if (ret) {
+			pr_warn("failed to enable patch '%s'\n",
+				patch->mod->name);
+
+			klp_unpatch_objects(patch);
+			klp_complete_transition();
+
+			return ret;
+		}
 	}
 
+	klp_start_transition();
+
+	if (!klp_try_complete_transition())
+		klp_schedule_work();
+
 	patch->enabled = true;
 
 	return 0;
-
-unregister:
-	WARN_ON(__klp_disable_patch(patch));
-	return ret;
 }
 
 /**
@@ -404,6 +436,7 @@ EXPORT_SYMBOL_GPL(klp_enable_patch);
  * /sys/kernel/livepatch
  * /sys/kernel/livepatch/<patch>
  * /sys/kernel/livepatch/<patch>/enabled
+ * /sys/kernel/livepatch/<patch>/transition
  * /sys/kernel/livepatch/<patch>/<object>
  * /sys/kernel/livepatch/<patch>/<object>/<function,sympos>
  */
@@ -432,7 +465,9 @@ static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr,
 		goto err;
 	}
 
-	if (val) {
+	if (patch == klp_transition_patch) {
+		klp_reverse_transition();
+	} else if (val) {
 		ret = __klp_enable_patch(patch);
 		if (ret)
 			goto err;
@@ -460,9 +495,21 @@ static ssize_t enabled_show(struct kobject *kobj,
 	return snprintf(buf, PAGE_SIZE-1, "%d\n", patch->enabled);
 }
 
+static ssize_t transition_show(struct kobject *kobj,
+			       struct kobj_attribute *attr, char *buf)
+{
+	struct klp_patch *patch;
+
+	patch = container_of(kobj, struct klp_patch, kobj);
+	return snprintf(buf, PAGE_SIZE-1, "%d\n",
+			patch == klp_transition_patch);
+}
+
 static struct kobj_attribute enabled_kobj_attr = __ATTR_RW(enabled);
+static struct kobj_attribute transition_kobj_attr = __ATTR_RO(transition);
 static struct attribute *klp_patch_attrs[] = {
 	&enabled_kobj_attr.attr,
+	&transition_kobj_attr.attr,
 	NULL
 };
 
@@ -549,6 +596,7 @@ static int klp_init_func(struct klp_object *obj, struct klp_func *func)
 {
 	INIT_LIST_HEAD(&func->stack_node);
 	func->patched = false;
+	func->transition = false;
 
 	/* The format for the sysfs directory is <function,sympos> where sympos
 	 * is the nth occurrence of this symbol in kallsyms for the patched
@@ -781,7 +829,11 @@ int klp_module_coming(struct module *mod)
 				goto err;
 			}
 
-			if (!patch->enabled)
+			/*
+			 * Only patch the module if the patch is enabled or is
+			 * in transition.
+			 */
+			if (!patch->enabled && patch != klp_transition_patch)
 				break;
 
 			pr_notice("applying patch '%s' to loading module '%s'\n",
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index 782fbb5..b3b8639 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -29,6 +29,7 @@
 #include <linux/bug.h>
 #include <linux/printk.h>
 #include "patch.h"
+#include "transition.h"
 
 static LIST_HEAD(klp_ops);
 
@@ -58,11 +59,42 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 	ops = container_of(fops, struct klp_ops, fops);
 
 	rcu_read_lock();
+
 	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
 				      stack_node);
-	if (WARN_ON_ONCE(!func))
+
+	if (!func)
 		goto unlock;
 
+	/*
+	 * See the comment for the 2nd smp_wmb() in klp_init_transition() for
+	 * an explanation of why this read barrier is needed.
+	 */
+	smp_rmb();
+
+	if (unlikely(func->transition)) {
+
+		/*
+		 * See the comment for the 1st smp_wmb() in
+		 * klp_init_transition() for an explanation of why this read
+		 * barrier is needed.
+		 */
+		smp_rmb();
+
+		if (current->patch_state == KLP_UNPATCHED) {
+			/*
+			 * Use the previously patched version of the function.
+			 * If no previous patches exist, use the original
+			 * function.
+			 */
+			func = list_entry_rcu(func->stack_node.next,
+					      struct klp_func, stack_node);
+
+			if (&func->stack_node == &ops->func_stack)
+				goto unlock;
+		}
+	}
+
 	klp_arch_set_pc(regs, (unsigned long)func->new_func);
 unlock:
 	rcu_read_unlock();
@@ -211,3 +243,12 @@ int klp_patch_object(struct klp_object *obj)
 
 	return 0;
 }
+
+void klp_unpatch_objects(struct klp_patch *patch)
+{
+	struct klp_object *obj;
+
+	klp_for_each_object(patch, obj)
+		if (obj->patched)
+			klp_unpatch_object(obj);
+}
diff --git a/kernel/livepatch/patch.h b/kernel/livepatch/patch.h
index 2d0cce0..0db2271 100644
--- a/kernel/livepatch/patch.h
+++ b/kernel/livepatch/patch.h
@@ -28,5 +28,6 @@ struct klp_ops *klp_find_ops(unsigned long old_addr);
 
 int klp_patch_object(struct klp_object *obj);
 void klp_unpatch_object(struct klp_object *obj);
+void klp_unpatch_objects(struct klp_patch *patch);
 
 #endif /* _LIVEPATCH_PATCH_H */
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
new file mode 100644
index 0000000..92819bb
--- /dev/null
+++ b/kernel/livepatch/transition.c
@@ -0,0 +1,474 @@
+/*
+ * transition.c - Kernel Live Patching transition functions
+ *
+ * Copyright (C) 2015 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cpu.h>
+#include <linux/stacktrace.h>
+#include "../sched/sched.h"
+
+#include "patch.h"
+#include "transition.h"
+
+#define MAX_STACK_ENTRIES 100
+
+struct klp_patch *klp_transition_patch;
+
+static int klp_target_state;
+
+/* called from copy_process() during fork */
+void klp_copy_process(struct task_struct *child)
+{
+	child->patch_state = current->patch_state;
+
+	/* TIF_PATCH_PENDING gets copied in setup_thread_stack() */
+}
+
+/*
+ * klp_patch_task() - change the patched state of a task
+ * @task:	The task to change
+ *
+ * Switches the patched state of the task to the set of functions in the target
+ * patch state.
+ */
+void klp_patch_task(struct task_struct *task)
+{
+	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
+
+	/*
+	 * The corresponding write barriers are in klp_init_transition() and
+	 * klp_reverse_transition().  See the comments there for an explanation.
+	 */
+	smp_rmb();
+
+	task->patch_state = klp_target_state;
+}
+
+/*
+ * Initialize the global target patch state and all tasks to the initial patch
+ * state, and initialize all function transition states to true in preparation
+ * for patching or unpatching.
+ */
+void klp_init_transition(struct klp_patch *patch, int state)
+{
+	struct task_struct *g, *task;
+	unsigned int cpu;
+	struct klp_object *obj;
+	struct klp_func *func;
+	int initial_state = !state;
+
+	klp_transition_patch = patch;
+
+	/*
+	 * If the patch can be applied or reverted immediately, skip the
+	 * per-task transitions.
+	 */
+	if (patch->immediate)
+		return;
+
+	/*
+	 * Initialize all tasks to the initial patch state to prepare them for
+	 * switching to the target state.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, task)
+		task->patch_state = initial_state;
+	read_unlock(&tasklist_lock);
+
+	/*
+	 * Ditto for the idle "swapper" tasks.
+	 */
+	get_online_cpus();
+	for_each_online_cpu(cpu)
+		idle_task(cpu)->patch_state = initial_state;
+	put_online_cpus();
+
+	/*
+	 * Ensure klp_ftrace_handler() sees the task->patch_state updates
+	 * before the func->transition updates.  Otherwise it could read an
+	 * out-of-date task state and pick the wrong function.
+	 */
+	smp_wmb();
+
+	/*
+	 * Set the func transition states so klp_ftrace_handler() will know to
+	 * switch to the transition logic.
+	 *
+	 * When patching, the funcs aren't yet in the func_stack and will be
+	 * made visible to the ftrace handler shortly by the calls to
+	 * klp_patch_object().
+	 *
+	 * When unpatching, the funcs are already in the func_stack and so are
+	 * already visible to the ftrace handler.
+	 */
+	klp_for_each_object(patch, obj)
+		klp_for_each_func(obj, func)
+			func->transition = true;
+
+	/*
+	 * Set the global target patch state which tasks will switch to.  This
+	 * has no effect until the TIF_PATCH_PENDING flags get set later.
+	 */
+	klp_target_state = state;
+
+	/*
+	 * For the enable path, ensure klp_ftrace_handler() will see the
+	 * func->transition updates before the funcs become visible to the
+	 * handler.  Otherwise the handler may wrongly pick the new func before
+	 * the task switches to the patched state.
+	 *
+	 * For the disable path, the funcs are already visible to the handler.
+	 * But we still need to ensure the ftrace handler will see the
+	 * func->transition updates before the tasks start switching to the
+	 * unpatched state.  Otherwise the handler can miss a task patch state
+	 * change which would result in it wrongly picking the new function.
+	 *
+	 * This barrier also ensures that if another CPU goes through the
+	 * syscall barrier, sees the TIF_PATCH_PENDING writes in
+	 * klp_start_transition(), and calls klp_patch_task(), it also sees the
+	 * above write to the target state.  Otherwise it can put the task in
+	 * the wrong patch state.
+	 */
+	smp_wmb();
+}
+
+/*
+ * Start the transition to the specified target patch state so tasks can begin
+ * switching to it.
+ */
+void klp_start_transition(void)
+{
+	struct task_struct *g, *task;
+	unsigned int cpu;
+
+	pr_notice("'%s': %s...\n", klp_transition_patch->mod->name,
+		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
+
+	/*
+	 * If the patch can be applied or reverted immediately, skip the
+	 * per-task transitions.
+	 */
+	if (klp_transition_patch->immediate)
+		return;
+
+	/*
+	 * Mark all normal tasks as needing a patch state update.  As they pass
+	 * through the syscall barrier they'll switch over to the target state
+	 * (unless we switch them in klp_try_complete_transition() first).
+	 */
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, task)
+		set_tsk_thread_flag(task, TIF_PATCH_PENDING);
+	read_unlock(&tasklist_lock);
+
+	/*
+	 * Ditto for the idle "swapper" tasks, though they never cross the
+	 * syscall barrier.  Instead they switch over in cpu_idle_loop().
+	 */
+	get_online_cpus();
+	for_each_online_cpu(cpu)
+		set_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
+	put_online_cpus();
+}
+
+/*
+ * The transition to the target patch state is complete.  Clean up the data
+ * structures.
+ */
+void klp_complete_transition(void)
+{
+	struct klp_object *obj;
+	struct klp_func *func;
+	struct task_struct *g, *task;
+	unsigned int cpu;
+
+	if (klp_transition_patch->immediate)
+		goto done;
+
+	klp_for_each_object(klp_transition_patch, obj)
+		klp_for_each_func(obj, func)
+			func->transition = false;
+
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, task) {
+		clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
+		task->patch_state = KLP_UNDEFINED;
+	}
+	read_unlock(&tasklist_lock);
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		task = idle_task(cpu);
+		clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
+		task->patch_state = KLP_UNDEFINED;
+	}
+	put_online_cpus();
+
+done:
+	klp_transition_patch = NULL;
+}
+
+/*
+ * Determine whether the given stack trace includes any references to a
+ * to-be-patched or to-be-unpatched function.
+ */
+static int klp_check_stack_func(struct klp_func *func,
+				struct stack_trace *trace)
+{
+	unsigned long func_addr, func_size, address;
+	struct klp_ops *ops;
+	int i;
+
+	if (func->immediate)
+		return 0;
+
+	for (i = 0; i < trace->nr_entries; i++) {
+		address = trace->entries[i];
+
+		if (klp_target_state == KLP_UNPATCHED) {
+			 /*
+			  * Check for the to-be-unpatched function
+			  * (the func itself).
+			  */
+			func_addr = (unsigned long)func->new_func;
+			func_size = func->new_size;
+		} else {
+			/*
+			 * Check for the to-be-patched function
+			 * (the previous func).
+			 */
+			ops = klp_find_ops(func->old_addr);
+
+			if (list_is_singular(&ops->func_stack)) {
+				/* original function */
+				func_addr = func->old_addr;
+				func_size = func->old_size;
+			} else {
+				/* previously patched function */
+				struct klp_func *prev;
+
+				prev = list_next_entry(func, stack_node);
+				func_addr = (unsigned long)prev->new_func;
+				func_size = prev->new_size;
+			}
+		}
+
+		if (address >= func_addr && address < func_addr + func_size)
+			return -EAGAIN;
+	}
+
+	return 0;
+}
+
+/*
+ * Determine whether it's safe to transition the task to the target patch state
+ * by looking for any to-be-patched or to-be-unpatched functions on its stack.
+ */
+static int klp_check_stack(struct task_struct *task)
+{
+	static unsigned long entries[MAX_STACK_ENTRIES];
+	struct stack_trace trace;
+	struct klp_object *obj;
+	struct klp_func *func;
+	int ret;
+
+	trace.skip = 0;
+	trace.nr_entries = 0;
+	trace.max_entries = MAX_STACK_ENTRIES;
+	trace.entries = entries;
+	ret = save_stack_trace_tsk_reliable(task, &trace);
+	WARN_ON_ONCE(ret == -ENOSYS);
+	if (ret) {
+		pr_debug("%s: pid %d (%s) has an unreliable stack\n",
+			 __func__, task->pid, task->comm);
+		return ret;
+	}
+
+	klp_for_each_object(klp_transition_patch, obj) {
+		if (!obj->patched)
+			continue;
+		klp_for_each_func(obj, func) {
+			ret = klp_check_stack_func(func, &trace);
+			if (ret) {
+				pr_debug("%s: pid %d (%s) is sleeping on function %s\n",
+					 __func__, task->pid, task->comm,
+					 func->old_name);
+				return ret;
+			}
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Try to safely switch a task to the target patch state.  If it's currently
+ * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
+ * if the stack is unreliable, return false.
+ */
+static bool klp_try_switch_task(struct task_struct *task)
+{
+	struct rq *rq;
+	unsigned long flags;
+	int ret;
+	bool success = false;
+
+	/* check if this task has already switched over */
+	if (task->patch_state == klp_target_state)
+		return true;
+
+	/*
+	 * For arches which don't have reliable stack traces, we have to rely
+	 * on other methods (e.g., switching tasks at the syscall barrier).
+	 */
+	if (!IS_ENABLED(CONFIG_RELIABLE_STACKTRACE))
+		return false;
+
+	/*
+	 * Now try to check the stack for any to-be-patched or to-be-unpatched
+	 * functions.  If all goes well, switch the task to the target patch
+	 * state.
+	 */
+	rq = task_rq_lock(task, &flags);
+
+	if (task_running(rq, task) && task != current) {
+		pr_debug("%s: pid %d (%s) is running\n", __func__, task->pid,
+			 task->comm);
+		goto done;
+	}
+
+	ret = klp_check_stack(task);
+	if (ret)
+		goto done;
+
+	success = true;
+
+	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
+	task->patch_state = klp_target_state;
+
+done:
+	task_rq_unlock(rq, task, &flags);
+	return success;
+}
+
+/*
+ * Try to switch all remaining tasks to the target patch state by walking the
+ * stacks of sleeping tasks and looking for any to-be-patched or
+ * to-be-unpatched functions.  If such functions are found, the task can't be
+ * switched yet.
+ *
+ * If any tasks are still stuck in the initial patch state, schedule a retry.
+ */
+bool klp_try_complete_transition(void)
+{
+	unsigned int cpu;
+	struct task_struct *g, *task;
+	bool complete = true;
+
+	/*
+	 * If the patch can be applied or reverted immediately, skip the
+	 * per-task transitions.
+	 */
+	if (klp_transition_patch->immediate)
+		goto success;
+
+	/*
+	 * Try to switch the tasks to the target patch state by walking their
+	 * stacks and looking for any to-be-patched or to-be-unpatched
+	 * functions.  If such functions are found on a stack, or if the stack
+	 * is deemed unreliable, the task can't be switched yet.
+	 *
+	 * Usually this will transition most (or all) of the tasks on a system
+	 * unless the patch includes changes to a very common function.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, task)
+		if (!klp_try_switch_task(task))
+			complete = false;
+	read_unlock(&tasklist_lock);
+
+	/*
+	 * Ditto for the idle "swapper" tasks.
+	 */
+	get_online_cpus();
+	for_each_online_cpu(cpu)
+		if (!klp_try_switch_task(idle_task(cpu)))
+			complete = false;
+	put_online_cpus();
+
+	/*
+	 * Some tasks weren't able to be switched over.  Try again later and/or
+	 * wait for other methods like syscall barrier switching.
+	 */
+	if (!complete)
+		return false;
+
+success:
+	/*
+	 * When unpatching, all tasks have transitioned to KLP_UNPATCHED so we
+	 * can now remove the new functions from the func_stack.
+	 */
+	if (klp_target_state == KLP_UNPATCHED) {
+		klp_unpatch_objects(klp_transition_patch);
+
+		/*
+		 * Don't allow any existing instances of ftrace handlers to
+		 * access any obsolete funcs before we reset the func
+		 * transition states to false.  Otherwise the handler may see
+		 * the deleted "new" func, see that it's not in transition, and
+		 * wrongly pick the new version of the function.
+		 */
+		synchronize_rcu();
+	}
+
+	pr_notice("'%s': %s complete\n", klp_transition_patch->mod->name,
+		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
+
+	/* we're done, now cleanup the data structures */
+	klp_complete_transition();
+
+	return true;
+}
+
+/*
+ * This function can be called in the middle of an existing transition to
+ * reverse the direction of the target patch state.  This can be done to
+ * effectively cancel an existing enable or disable operation if there are any
+ * tasks which are stuck in the initial patch state.
+ */
+void klp_reverse_transition(void)
+{
+	struct klp_patch *patch = klp_transition_patch;
+
+	klp_target_state = !klp_target_state;
+
+	/*
+	 * Ensure that if another CPU goes through the syscall barrier, sees
+	 * the TIF_PATCH_PENDING writes in klp_start_transition(), and calls
+	 * klp_patch_task(), it also sees the above write to the target state.
+	 * Otherwise it can put the task in the wrong patch state.
+	 */
+	smp_wmb();
+
+	klp_start_transition();
+	klp_try_complete_transition();
+
+	patch->enabled = !patch->enabled;
+}
+
diff --git a/kernel/livepatch/transition.h b/kernel/livepatch/transition.h
new file mode 100644
index 0000000..5191b96
--- /dev/null
+++ b/kernel/livepatch/transition.h
@@ -0,0 +1,14 @@
+#ifndef _LIVEPATCH_TRANSITION_H
+#define _LIVEPATCH_TRANSITION_H
+
+#include <linux/livepatch.h>
+
+extern struct klp_patch *klp_transition_patch;
+
+void klp_init_transition(struct klp_patch *patch, int state);
+void klp_start_transition(void);
+void klp_reverse_transition(void);
+bool klp_try_complete_transition(void);
+void klp_complete_transition(void);
+
+#endif /* _LIVEPATCH_TRANSITION_H */
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index bd12c6c..60d633f 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -9,6 +9,7 @@
 #include <linux/mm.h>
 #include <linux/stackprotector.h>
 #include <linux/suspend.h>
+#include <linux/livepatch.h>
 
 #include <asm/tlb.h>
 
@@ -266,6 +267,9 @@ static void cpu_idle_loop(void)
 
 		sched_ttwu_pending();
 		schedule_preempt_disabled();
+
+		if (unlikely(klp_patch_pending(current)))
+			klp_patch_task(current);
 	}
 }
 
-- 
2.4.11


* [RFC PATCH v2 18/18] livepatch: add /proc/<pid>/patch_state
  2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
                   ` (16 preceding siblings ...)
  2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
@ 2016-04-28 20:44 ` Josh Poimboeuf
  17 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-28 20:44 UTC (permalink / raw)
  To: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens
  Cc: live-patching, linux-kernel, x86, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges,
	Andy Lutomirski

Expose the per-task patch state value so users can determine which tasks
are holding up completion of a patching operation.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
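As an illustration only (not part of this patch), a user-space helper to
spot the stragglers while a patch is being enabled could simply scan
/proc for tasks whose patch_state is still '0':

	#include <ctype.h>
	#include <dirent.h>
	#include <stdio.h>

	int main(void)
	{
		DIR *proc = opendir("/proc");
		struct dirent *de;
		char path[64];
		FILE *f;
		int state;

		if (!proc)
			return 1;

		while ((de = readdir(proc))) {
			if (!isdigit(de->d_name[0]))
				continue;
			snprintf(path, sizeof(path), "/proc/%s/patch_state",
				 de->d_name);
			f = fopen(path, "r");
			if (!f)
				continue;
			/* '0' means unpatched while a transition is in progress */
			if (fscanf(f, "%d", &state) == 1 && state == 0)
				printf("pid %s not yet patched\n", de->d_name);
			fclose(f);
		}
		closedir(proc);
		return 0;
	}
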
 Documentation/filesystems/proc.txt | 18 ++++++++++++++++++
 fs/proc/base.c                     | 15 +++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e8d0075..0b09495 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -44,6 +44,7 @@ Table of Contents
   3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
   3.9   /proc/<pid>/map_files - Information about memory mapped files
   3.10  /proc/<pid>/timerslack_ns - Task timerslack value
+  3.11	/proc/<pid>/patch_state - Livepatch patch operation state
 
   4	Configuring procfs
   4.1	Mount options
@@ -1880,6 +1881,23 @@ Valid values are from 0 - ULLONG_MAX
 An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level
 permissions on the task specified to change its timerslack_ns value.
 
+3.11	/proc/<pid>/patch_state - Livepatch patch operation state
+-----------------------------------------------------------------
+When CONFIG_LIVEPATCH is enabled, this file displays the value of the
+patch state for the task.
+
+A value of '-1' indicates that no patch is in transition.
+
+A value of '0' indicates that a patch is in transition and the task is
+unpatched.  If the patch is being enabled, then the task hasn't been
+patched yet.  If the patch is being disabled, then the task has already
+been unpatched.
+
+A value of '1' indicates that a patch is in transition and the task is
+patched.  If the patch is being enabled, then the task has already been
+patched.  If the patch is being disabled, then the task hasn't been
+unpatched yet.
+
 
 ------------------------------------------------------------------------------
 Configuring procfs
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2868cdf..c485450 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2801,6 +2801,15 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
 	return err;
 }
 
+#ifdef CONFIG_LIVEPATCH
+static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns,
+				struct pid *pid, struct task_struct *task)
+{
+	seq_printf(m, "%d\n", task->patch_state);
+	return 0;
+}
+#endif /* CONFIG_LIVEPATCH */
+
 /*
  * Thread groups
  */
@@ -2900,6 +2909,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("timers",	  S_IRUGO, proc_timers_operations),
 #endif
 	REG("timerslack_ns", S_IRUGO|S_IWUGO, proc_pid_set_timerslack_ns_operations),
+#ifdef CONFIG_LIVEPATCH
+	ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3282,6 +3294,9 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
 	REG("setgroups",  S_IRUGO|S_IWUSR, proc_setgroups_operations),
 #endif
+#ifdef CONFIG_LIVEPATCH
+	ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
-- 
2.4.11


* Re: [RFC PATCH v2 06/18] x86: dump_trace() error handling
  2016-04-28 20:44 ` [RFC PATCH v2 06/18] x86: dump_trace() error handling Josh Poimboeuf
@ 2016-04-29 13:45   ` Minfei Huang
  2016-04-29 14:00     ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Minfei Huang @ 2016-04-29 13:45 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On 04/28/16 at 03:44P, Josh Poimboeuf wrote:
> In preparation for being able to determine whether a given stack trace
> is reliable, allow the stacktrace_ops functions to propagate errors to
> dump_trace().

Hi, Josh.

Have you considered making the walk_stack function return void, since
there is no obvious error case when detecting the frame pointers?

Thanks
Minfei


* Re: [RFC PATCH v2 06/18] x86: dump_trace() error handling
  2016-04-29 13:45   ` Minfei Huang
@ 2016-04-29 14:00     ` Josh Poimboeuf
  0 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-29 14:00 UTC (permalink / raw)
  To: Minfei Huang
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, Apr 29, 2016 at 09:45:58PM +0800, Minfei Huang wrote:
> On 04/28/16 at 03:44P, Josh Poimboeuf wrote:
> > In preparation for being able to determine whether a given stack trace
> > is reliable, allow the stacktrace_ops functions to propagate errors to
> > dump_trace().
> 
> Hi, Josh.
> 
> Have you considered making the walk_stack function return void, since
> there is no obvious error case when detecting the frame pointers?

If you look at the next patch 07/18, there are several cases where
walk_stack (print_context_stack_reliable) returns an error.

For example, if a function gets preempted before it gets a chance to
save the frame pointer, the function's caller would get skipped on the
stack trace.  So for preempted tasks, we always have to consider their
stacks unreliable.
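
To make that concrete (a minimal sketch, not code from this series):
with frame pointers the compiler starts each function with roughly
"push %rbp; mov %rsp, %rbp", and the danger is the window before that
prologue runs:

	static void bar(void)
	{
		/*
		 * If the task is preempted after foo()'s call instruction
		 * but before bar()'s prologue has run, the saved %rbp still
		 * points at foo()'s frame record.  A frame pointer walk of
		 * the preempted stack then goes from bar() straight to
		 * foo()'s caller, so foo() is silently skipped.
		 */
	}

	static void foo(void)
	{
		bar();		/* the window opens right after this call */
	}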

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-28 20:44 ` [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking Josh Poimboeuf
@ 2016-04-29 18:06   ` Andy Lutomirski
  2016-04-29 20:11     ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-04-29 18:06 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> A preempted function might not have had a chance to save the frame
> pointer to the stack yet, which can result in its caller getting skipped
> on a stack trace.
>
> Add a flag to indicate when the task has been preempted so that stack
> dump code can determine whether the stack trace is reliable.

I think I like this, but how do you handle the rather similar case in
which a task goes to sleep because it's waiting on IO that happened in
response to get_user, put_user, copy_from_user, etc?

--Andy


* Re: [RFC PATCH v2 09/18] livepatch/x86: add TIF_PATCH_PENDING thread flag
  2016-04-28 20:44 ` [RFC PATCH v2 09/18] livepatch/x86: add TIF_PATCH_PENDING thread flag Josh Poimboeuf
@ 2016-04-29 18:08   ` Andy Lutomirski
  2016-04-29 20:18     ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-04-29 18:08 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
> per-task consistency model for x86_64.  The bit getting set indicates
> the thread has a pending patch which needs to be applied when the thread
> exits the kernel.
>
> The bit is placed in the least-significant word of the thread_info flags

NAK to that part.

The least-significant word thing is a huge hack that has gotten out of
control.  Please add the thing explicitly to all relevant masks.

--Andy


* Re: [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks
  2016-04-28 20:44 ` [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks Josh Poimboeuf
@ 2016-04-29 18:46   ` Brian Gerst
  2016-04-29 20:28     ` Josh Poimboeuf
  2016-04-29 19:39   ` Andy Lutomirski
  1 sibling, 1 reply; 121+ messages in thread
From: Brian Gerst @ 2016-04-29 18:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	linuxppc-dev, linux-s390, Vojtech Pavlik, Jiri Slaby,
	Petr Mladek, Chris J Arges, Andy Lutomirski

On Thu, Apr 28, 2016 at 4:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> Thanks to all the recent x86 entry code refactoring, most tasks' kernel
> stacks start at the same offset right above their saved pt_regs,
> regardless of which syscall was used to enter the kernel.  That creates
> a nice convention which makes it straightforward to identify the
> "bottom" of the stack, which can be useful for stack walking code which
> needs to verify the stack is sane.
>
> However there are still a few types of tasks which don't yet follow that
> convention:
>
> 1) CPU idle tasks, aka the "swapper" tasks
>
> 2) freshly forked TIF_FORK tasks which don't have a stack at all
>
> Make the idle tasks conform to the new stack bottom convention by
> starting their stack at a sizeof(pt_regs) offset from the end of the
> stack page.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> ---
>  arch/x86/kernel/head_64.S | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 6dbd2c0..0b12311 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -296,8 +296,9 @@ ENTRY(start_cpu)
>          *      REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
>          *              address given in m16:64.
>          */
> -       movq    initial_code(%rip),%rax
> -       pushq   $0              # fake return address to stop unwinder
> +       call    1f              # put return address on stack for unwinder
> +1:     xorq    %rbp, %rbp      # clear frame pointer
> +       movq    initial_code(%rip), %rax
>         pushq   $__KERNEL_CS    # set correct cs
>         pushq   %rax            # target address in negative space
>         lretq

This chunk looks like it should be a separate patch.

--
Brian Gerst


* Re: [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks
  2016-04-28 20:44 ` [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks Josh Poimboeuf
  2016-04-29 18:46   ` Brian Gerst
@ 2016-04-29 19:39   ` Andy Lutomirski
  2016-04-29 20:50     ` Josh Poimboeuf
  1 sibling, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-04-29 19:39 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> Thanks to all the recent x86 entry code refactoring, most tasks' kernel
> stacks start at the same offset right above their saved pt_regs,
> regardless of which syscall was used to enter the kernel.  That creates
> a nice convention which makes it straightforward to identify the
> "bottom" of the stack, which can be useful for stack walking code which
> needs to verify the stack is sane.
>
> However there are still a few types of tasks which don't yet follow that
> convention:
>
> 1) CPU idle tasks, aka the "swapper" tasks
>
> 2) freshly forked TIF_FORK tasks which don't have a stack at all
>
> Make the idle tasks conform to the new stack bottom convention by
> starting their stack at a sizeof(pt_regs) offset from the end of the
> stack page.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> ---
>  arch/x86/kernel/head_64.S | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 6dbd2c0..0b12311 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -296,8 +296,9 @@ ENTRY(start_cpu)
>          *      REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
>          *              address given in m16:64.
>          */
> -       movq    initial_code(%rip),%rax
> -       pushq   $0              # fake return address to stop unwinder
> +       call    1f              # put return address on stack for unwinder
> +1:     xorq    %rbp, %rbp      # clear frame pointer
> +       movq    initial_code(%rip), %rax
>         pushq   $__KERNEL_CS    # set correct cs
>         pushq   %rax            # target address in negative space
>         lretq
> @@ -325,7 +326,7 @@ ENDPROC(start_cpu0)
>         GLOBAL(initial_gs)
>         .quad   INIT_PER_CPU_VAR(irq_stack_union)
>         GLOBAL(initial_stack)
> -       .quad  init_thread_union+THREAD_SIZE-8
> +       .quad  init_thread_union + THREAD_SIZE - SIZEOF_PTREGS

As long as you're doing this, could you also set orig_ax to -1?  I
remember running into some oddities resulting from orig_ax containing
garbage at some point.

--Andy


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 18:06   ` Andy Lutomirski
@ 2016-04-29 20:11     ` Josh Poimboeuf
  2016-04-29 20:19       ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-29 20:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > A preempted function might not have had a chance to save the frame
> > pointer to the stack yet, which can result in its caller getting skipped
> > on a stack trace.
> >
> > Add a flag to indicate when the task has been preempted so that stack
> > dump code can determine whether the stack trace is reliable.
> 
> I think I like this, but how do you handle the rather similar case in
> which a task goes to sleep because it's waiting on IO that happened in
> response to get_user, put_user, copy_from_user, etc?

Hm, good question.  I was thinking that page faults had a dedicated
stack, but now looking at the entry and traps code, that doesn't seem to
be the case.

Anyway I think it shouldn't be a problem if we make sure that any kernel
function which might trigger a valid page fault (e.g.,
copy_user_generic_string) do the proper frame pointer setup first.  Then
the stack should still be reliable.

In fact I might be able to teach objtool to enforce that: any function
which uses an exception table should create a stack frame.

Or alternatively, maybe set some kind of flag for page faults, similar
to what I did with this patch.

-- 
Josh


* Re: [RFC PATCH v2 09/18] livepatch/x86: add TIF_PATCH_PENDING thread flag
  2016-04-29 18:08   ` Andy Lutomirski
@ 2016-04-29 20:18     ` Josh Poimboeuf
  0 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-29 20:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, Apr 29, 2016 at 11:08:04AM -0700, Andy Lutomirski wrote:
> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
> > per-task consistency model for x86_64.  The bit getting set indicates
> > the thread has a pending patch which needs to be applied when the thread
> > exits the kernel.
> >
> > The bit is placed in the least-significant word of the thread_info flags
> 
> NAK to that part.
> 
> The least-significant word thing is a huge hack that has gotten out of
> control.  Please add the thing explicitly to all relevant masks.

Yeah, it is quite dangerous.  I'll make it explicit, and make all the
other _TIF_ALLWORK_MASK flags explicit while I'm at it.

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 20:11     ` Josh Poimboeuf
@ 2016-04-29 20:19       ` Andy Lutomirski
  2016-04-29 20:27         ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-04-29 20:19 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
>> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > A preempted function might not have had a chance to save the frame
>> > pointer to the stack yet, which can result in its caller getting skipped
>> > on a stack trace.
>> >
>> > Add a flag to indicate when the task has been preempted so that stack
>> > dump code can determine whether the stack trace is reliable.
>>
>> I think I like this, but how do you handle the rather similar case in
>> which a task goes to sleep because it's waiting on IO that happened in
>> response to get_user, put_user, copy_from_user, etc?
>
> Hm, good question.  I was thinking that page faults had a dedicated
> stack, but now looking at the entry and traps code, that doesn't seem to
> be the case.
>
> Anyway I think it shouldn't be a problem if we make sure that any kernel
> function which might trigger a valid page fault (e.g.,
> copy_user_generic_string) do the proper frame pointer setup first.  Then
> the stack should still be reliable.
>
> In fact I might be able to teach objtool to enforce that: any function
> which uses an exception table should create a stack frame.
>
> Or alternatively, maybe set some kind of flag for page faults, similar
> to what I did with this patch.
>

How about doing it the other way around: teach the unwinder to detect
when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
and use some reasonable heuristic as to whether it's okay to keep
unwinding.  You should be able to handle preemption like that, too --
the unwind process will end up in an IRQ frame.

--Andy


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 20:19       ` Andy Lutomirski
@ 2016-04-29 20:27         ` Josh Poimboeuf
  2016-04-29 20:32           ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-29 20:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > A preempted function might not have had a chance to save the frame
> >> > pointer to the stack yet, which can result in its caller getting skipped
> >> > on a stack trace.
> >> >
> >> > Add a flag to indicate when the task has been preempted so that stack
> >> > dump code can determine whether the stack trace is reliable.
> >>
> >> I think I like this, but how do you handle the rather similar case in
> >> which a task goes to sleep because it's waiting on IO that happened in
> >> response to get_user, put_user, copy_from_user, etc?
> >
> > Hm, good question.  I was thinking that page faults had a dedicated
> > stack, but now looking at the entry and traps code, that doesn't seem to
> > be the case.
> >
> > Anyway I think it shouldn't be a problem if we make sure that any kernel
> > function which might trigger a valid page fault (e.g.,
> > copy_user_generic_string) do the proper frame pointer setup first.  Then
> > the stack should still be reliable.
> >
> > In fact I might be able to teach objtool to enforce that: any function
> > which uses an exception table should create a stack frame.
> >
> > Or alternatively, maybe set some kind of flag for page faults, similar
> > to what I did with this patch.
> >
> 
> How about doing it the other way around: teach the unwinder to detect
> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
> and use some reasonable heuristic as to whether it's okay to keep
> unwinding.  You should be able to handle preemption like that, too --
> the unwind process will end up in an IRQ frame.

How exactly would the unwinder detect if a text address is in an
idtentry?  Maybe put all the idt entries in a special ELF section?

-- 
Josh


* Re: [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks
  2016-04-29 18:46   ` Brian Gerst
@ 2016-04-29 20:28     ` Josh Poimboeuf
  0 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-29 20:28 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	linuxppc-dev, linux-s390, Vojtech Pavlik, Jiri Slaby,
	Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, Apr 29, 2016 at 02:46:10PM -0400, Brian Gerst wrote:
> On Thu, Apr 28, 2016 at 4:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > Thanks to all the recent x86 entry code refactoring, most tasks' kernel
> > stacks start at the same offset right above their saved pt_regs,
> > regardless of which syscall was used to enter the kernel.  That creates
> > a nice convention which makes it straightforward to identify the
> > "bottom" of the stack, which can be useful for stack walking code which
> > needs to verify the stack is sane.
> >
> > However there are still a few types of tasks which don't yet follow that
> > convention:
> >
> > 1) CPU idle tasks, aka the "swapper" tasks
> >
> > 2) freshly forked TIF_FORK tasks which don't have a stack at all
> >
> > Make the idle tasks conform to the new stack bottom convention by
> > starting their stack at a sizeof(pt_regs) offset from the end of the
> > stack page.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > ---
> >  arch/x86/kernel/head_64.S | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> > index 6dbd2c0..0b12311 100644
> > --- a/arch/x86/kernel/head_64.S
> > +++ b/arch/x86/kernel/head_64.S
> > @@ -296,8 +296,9 @@ ENTRY(start_cpu)
> >          *      REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
> >          *              address given in m16:64.
> >          */
> > -       movq    initial_code(%rip),%rax
> > -       pushq   $0              # fake return address to stop unwinder
> > +       call    1f              # put return address on stack for unwinder
> > +1:     xorq    %rbp, %rbp      # clear frame pointer
> > +       movq    initial_code(%rip), %rax
> >         pushq   $__KERNEL_CS    # set correct cs
> >         pushq   %rax            # target address in negative space
> >         lretq
> 
> This chunk looks like it should be a separate patch.

Agreed, thanks.

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 20:27         ` Josh Poimboeuf
@ 2016-04-29 20:32           ` Andy Lutomirski
  2016-04-29 21:25             ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-04-29 20:32 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
>> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
>> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > A preempted function might not have had a chance to save the frame
>> >> > pointer to the stack yet, which can result in its caller getting skipped
>> >> > on a stack trace.
>> >> >
>> >> > Add a flag to indicate when the task has been preempted so that stack
>> >> > dump code can determine whether the stack trace is reliable.
>> >>
>> >> I think I like this, but how do you handle the rather similar case in
>> >> which a task goes to sleep because it's waiting on IO that happened in
>> >> response to get_user, put_user, copy_from_user, etc?
>> >
>> > Hm, good question.  I was thinking that page faults had a dedicated
>> > stack, but now looking at the entry and traps code, that doesn't seem to
>> > be the case.
>> >
>> > Anyway I think it shouldn't be a problem if we make sure that any kernel
>> > function which might trigger a valid page fault (e.g.,
>> > copy_user_generic_string) do the proper frame pointer setup first.  Then
>> > the stack should still be reliable.
>> >
>> > In fact I might be able to teach objtool to enforce that: any function
>> > which uses an exception table should create a stack frame.
>> >
>> > Or alternatively, maybe set some kind of flag for page faults, similar
>> > to what I did with this patch.
>> >
>>
>> How about doing it the other way around: teach the unwinder to detect
>> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
>> and use some reasonable heuristic as to whether it's okay to keep
>> unwinding.  You should be able to handle preemption like that, too --
>> the unwind process will end up in an IRQ frame.
>
> How exactly would the unwinder detect if a text address is in an
> idtentry?  Maybe put all the idt entries in a special ELF section?
>

Hmm.

What actually happens when you unwind all the way into the entry code?
 Don't you end up in something that isn't in an ELF function?  Can you
detect that?  Ideally, the unwinder could actually detect that it's
hit a pt_regs struct and report that.  If used for stack dumps, it
could display some indication of this and then continue its unwinding
by decoding the pt_regs.  If used for patching, it could take some
other appropriate action.
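
Roughly like this, I mean (hypothetical sketch -- the helper and the
unwinder state are made-up names, not an existing API):

	static bool unwind_through_entry_regs(struct unwind_state *st)
	{
		struct pt_regs *regs = (struct pt_regs *)st->stack_addr;

		if (!addr_is_entry_pt_regs(st->stack_addr))
			return false;

		/* report the interrupted context... */
		printk("  <entry regs, ip: %pS>\n", (void *)regs->ip);

		/* ...and keep unwinding from the regs it saved */
		st->bp = regs->bp;
		st->sp = regs->sp;
		return true;
	}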

I would have no objection to annotating all the pt_regs-style entry
code, whether by putting it in a separate section or by making a table
of addresses.

There are a couple of nasty cases if NMI or MCE is involved but, as of
4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should
always be a complete pt_regs on the stack before interrupts get
enabled for each entry.  Of course, finding the thing may be
nontrivial in case other things were pushed.  I suppose we could try
to rejigger the code so that rbp points to pt_regs or similar.

--Andy


* Re: [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks
  2016-04-29 19:39   ` Andy Lutomirski
@ 2016-04-29 20:50     ` Josh Poimboeuf
  2016-04-29 21:38       ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-29 20:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges

On Fri, Apr 29, 2016 at 12:39:16PM -0700, Andy Lutomirski wrote:
> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > Thanks to all the recent x86 entry code refactoring, most tasks' kernel
> > stacks start at the same offset right above their saved pt_regs,
> > regardless of which syscall was used to enter the kernel.  That creates
> > a nice convention which makes it straightforward to identify the
> > "bottom" of the stack, which can be useful for stack walking code which
> > needs to verify the stack is sane.
> >
> > However there are still a few types of tasks which don't yet follow that
> > convention:
> >
> > 1) CPU idle tasks, aka the "swapper" tasks
> >
> > 2) freshly forked TIF_FORK tasks which don't have a stack at all
> >
> > Make the idle tasks conform to the new stack bottom convention by
> > starting their stack at a sizeof(pt_regs) offset from the end of the
> > stack page.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > ---
> >  arch/x86/kernel/head_64.S | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> > index 6dbd2c0..0b12311 100644
> > --- a/arch/x86/kernel/head_64.S
> > +++ b/arch/x86/kernel/head_64.S
> > @@ -296,8 +296,9 @@ ENTRY(start_cpu)
> >          *      REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
> >          *              address given in m16:64.
> >          */
> > -       movq    initial_code(%rip),%rax
> > -       pushq   $0              # fake return address to stop unwinder
> > +       call    1f              # put return address on stack for unwinder
> > +1:     xorq    %rbp, %rbp      # clear frame pointer
> > +       movq    initial_code(%rip), %rax
> >         pushq   $__KERNEL_CS    # set correct cs
> >         pushq   %rax            # target address in negative space
> >         lretq
> > @@ -325,7 +326,7 @@ ENDPROC(start_cpu0)
> >         GLOBAL(initial_gs)
> >         .quad   INIT_PER_CPU_VAR(irq_stack_union)
> >         GLOBAL(initial_stack)
> > -       .quad  init_thread_union+THREAD_SIZE-8
> > +       .quad  init_thread_union + THREAD_SIZE - SIZEOF_PTREGS
> 
> As long as you're doing this, could you also set orig_ax to -1?  I
> remember running into some oddities resulting from orig_ax containing
> garbage at some point.

I assume you mean to initialize the orig_rax value in the pt_regs at the
bottom of the stack of the idle task?

How could that cause a problem?  Since the idle task never returns from
a system call, I'd assume that memory never gets accessed?

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 20:32           ` Andy Lutomirski
@ 2016-04-29 21:25             ` Josh Poimboeuf
  2016-04-29 21:37               ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-29 21:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, Apr 29, 2016 at 01:32:53PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
> >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
> >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> > A preempted function might not have had a chance to save the frame
> >> >> > pointer to the stack yet, which can result in its caller getting skipped
> >> >> > on a stack trace.
> >> >> >
> >> >> > Add a flag to indicate when the task has been preempted so that stack
> >> >> > dump code can determine whether the stack trace is reliable.
> >> >>
> >> >> I think I like this, but how do you handle the rather similar case in
> >> >> which a task goes to sleep because it's waiting on IO that happened in
> >> >> response to get_user, put_user, copy_from_user, etc?
> >> >
> >> > Hm, good question.  I was thinking that page faults had a dedicated
> >> > stack, but now looking at the entry and traps code, that doesn't seem to
> >> > be the case.
> >> >
> >> > Anyway I think it shouldn't be a problem if we make sure that any kernel
> >> > function which might trigger a valid page fault (e.g.,
> >> > copy_user_generic_string) do the proper frame pointer setup first.  Then
> >> > the stack should still be reliable.
> >> >
> >> > In fact I might be able to teach objtool to enforce that: any function
> >> > which uses an exception table should create a stack frame.
> >> >
> >> > Or alternatively, maybe set some kind of flag for page faults, similar
> >> > to what I did with this patch.
> >> >
> >>
> >> How about doing it the other way around: teach the unwinder to detect
> >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
> >> and use some reasonable heuristic as to whether it's okay to keep
> >> unwinding.  You should be able to handle preemption like that, too --
> >> the unwind process will end up in an IRQ frame.
> >
> > How exactly would the unwinder detect if a text address is in an
> > idtentry?  Maybe put all the idt entries in a special ELF section?
> >
> 
> Hmm.
> 
> What actually happens when you unwind all the way into the entry code?
>  Don't you end up in something that isn't in an ELF function?  Can you
> detect that?

For entry from user space (e.g., syscalls), it's easy to detect because
there's always a pt_regs at the bottom of the stack.  So if the unwinder
reaches the stack address at (thread.sp0 - sizeof(pt_regs)), it knows
it's done.
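
That check is really just pointer arithmetic.  A rough sketch (the
helper name is made up; thread.sp0 and pt_regs are the real x86_64
pieces, no error handling):

static bool at_task_stack_bottom(struct task_struct *task, unsigned long sp)
{
	/* thread.sp0 points just past the saved user-space pt_regs */
	unsigned long regs_start = task->thread.sp0 - sizeof(struct pt_regs);

	return sp == regs_start;
}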

But for nested entry (e.g. in-kernel irqs/exceptions like preemption and
page faults which don't have dedicated stacks), where the pt_regs is
stored somewhere in the middle of the stack instead of the bottom,
there's no reliable way to detect that.

> Ideally, the unwinder could actually detect that it's
> hit a pt_regs struct and report that.  If used for stack dumps, it
> could display some indication of this and then continue its unwinding
> by decoding the pt_regs.  If used for patching, it could take some
> other appropriate action.
>
> I would have no objection to annotating all the pt_regs-style entry
> code, whether by putting it in a separate section or by making a table
> of addresses.

I think the easiest way to make it work would be to modify the idtentry
macro to put all the idt entries in a dedicated section.  Then the
unwinder could easily detect any calls from that code.

> There are a couple of nasty cases if NMI or MCE is involved but, as of
> 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should
> always be a complete pt_regs on the stack before interrupts get
> enabled for each entry.  Of course, finding the thing may be
> nontrivial in case other things were pushed.

NMI, MCE and interrupts aren't a problem because they have dedicated
stacks, which are easy to detect.  If the task's stack is on an
exception stack or an irq stack, we consider it unreliable.

And also, they don't sleep.  The stack of any running task (other than
current) is automatically considered unreliable anyway, since the task
could be modifying it while we're reading it.

> I suppose we could try to rejigger the code so that rbp points to
> pt_regs or similar.

I think we should avoid doing something like that because it would break
gdb and all the other unwinders who don't know about it.

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 21:25             ` Josh Poimboeuf
@ 2016-04-29 21:37               ` Andy Lutomirski
  2016-04-29 22:11                 ` Jiri Kosina
  2016-04-29 22:41                 ` Josh Poimboeuf
  0 siblings, 2 replies; 121+ messages in thread
From: Andy Lutomirski @ 2016-04-29 21:37 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Apr 29, 2016 at 01:32:53PM -0700, Andy Lutomirski wrote:
>> On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
>> >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> >> > A preempted function might not have had a chance to save the frame
>> >> >> > pointer to the stack yet, which can result in its caller getting skipped
>> >> >> > on a stack trace.
>> >> >> >
>> >> >> > Add a flag to indicate when the task has been preempted so that stack
>> >> >> > dump code can determine whether the stack trace is reliable.
>> >> >>
>> >> >> I think I like this, but how do you handle the rather similar case in
>> >> >> which a task goes to sleep because it's waiting on IO that happened in
>> >> >> response to get_user, put_user, copy_from_user, etc?
>> >> >
>> >> > Hm, good question.  I was thinking that page faults had a dedicated
>> >> > stack, but now looking at the entry and traps code, that doesn't seem to
>> >> > be the case.
>> >> >
>> >> > Anyway I think it shouldn't be a problem if we make sure that any kernel
>> >> > function which might trigger a valid page fault (e.g.,
>> >> > copy_user_generic_string) do the proper frame pointer setup first.  Then
>> >> > the stack should still be reliable.
>> >> >
>> >> > In fact I might be able to teach objtool to enforce that: any function
>> >> > which uses an exception table should create a stack frame.
>> >> >
>> >> > Or alternatively, maybe set some kind of flag for page faults, similar
>> >> > to what I did with this patch.
>> >> >
>> >>
>> >> How about doing it the other way around: teach the unwinder to detect
>> >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
>> >> and use some reasonable heuristic as to whether it's okay to keep
>> >> unwinding.  You should be able to handle preemption like that, too --
>> >> the unwind process will end up in an IRQ frame.
>> >
>> > How exactly would the unwinder detect if a text address is in an
>> > idtentry?  Maybe put all the idt entries in a special ELF section?
>> >
>>
>> Hmm.
>>
>> What actually happens when you unwind all the way into the entry code?
>>  Don't you end up in something that isn't in an ELF function?  Can you
>> detect that?
>
> For entry from user space (e.g., syscalls), it's easy to detect because
> there's always a pt_regs at the bottom of the stack.  So if the unwinder
> reaches the stack address at (thread.sp0 - sizeof(pt_regs)), it knows
> it's done.
>
> But for nested entry (e.g. in-kernel irqs/exceptions like preemption and
> page faults which don't have dedicated stacks), where the pt_regs is
> stored somewhere in the middle of the stack instead of the bottom,
> there's no reliable way to detect that.

>
>> Ideally, the unwinder could actually detect that it's
>> hit a pt_regs struct and report that.  If used for stack dumps, it
>> could display some indication of this and then continue its unwinding
>> by decoding the pt_regs.  If used for patching, it could take some
>> other appropriate action.
>>
>> I would have no objection to annotating all the pt_regs-style entry
>> code, whether by putting it in a separate section or by making a table
>> of addresses.
>
> I think the easiest way to make it work would be to modify the idtentry
> macro to put all the idt entries in a dedicated section.  Then the
> unwinder could easily detect any calls from that code.

That would work.  Would it make sense to do the same for the irq entries?

I'd be glad to review a patch.  It should be straightforward.

>
>> There are a couple of nasty cases if NMI or MCE is involved but, as of
>> 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should
>> always be a complete pt_regs on the stack before interrupts get
>> enabled for each entry.  Of course, finding the thing may be
>> nontrivial in case other things were pushed.
>
> NMI, MCE and interrupts aren't a problem because they have dedicated
> stacks, which are easy to detect.  If the tasks' stack is on an
> exception stack or an irq stack, we consider it unreliable.

Only on x86_64.

>
> And also, they don't sleep.  The stack of any running task (other than
> current) is automatically considered unreliable anyway, since they could
> be modifying it while we're reading it.

True.

>
>> I suppose we could try to rejigger the code so that rbp points to
>> pt_regs or similar.
>
> I think we should avoid doing something like that because it would break
> gdb and all the other unwinders who don't know about it.

How so?

Currently, rbp in the entry code is meaningless.  I'm suggesting that,
when we do, for example, 'call \do_sym' in idtentry, we point rbp to
the pt_regs.  Currently it points to something stale (which the
dump_stack code might be relying on.  Hmm.)  But it's probably also
safe to assume that if you unwind to the 'call \do_sym', then pt_regs
is the next thing on the stack, so just doing the section thing would
work.

We should really re-add DWARF some day.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks
  2016-04-29 20:50     ` Josh Poimboeuf
@ 2016-04-29 21:38       ` Andy Lutomirski
  2016-04-29 23:27         ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-04-29 21:38 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Jessica Yu, Jiri Kosina, Miroslav Benes,
	Ingo Molnar, Peter Zijlstra, Michael Ellerman, Heiko Carstens,
	live-patching, linux-kernel, X86 ML, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges

On Fri, Apr 29, 2016 at 1:50 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Apr 29, 2016 at 12:39:16PM -0700, Andy Lutomirski wrote:
>> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > Thanks to all the recent x86 entry code refactoring, most tasks' kernel
>> > stacks start at the same offset right above their saved pt_regs,
>> > regardless of which syscall was used to enter the kernel.  That creates
>> > a nice convention which makes it straightforward to identify the
>> > "bottom" of the stack, which can be useful for stack walking code which
>> > needs to verify the stack is sane.
>> >
>> > However there are still a few types of tasks which don't yet follow that
>> > convention:
>> >
>> > 1) CPU idle tasks, aka the "swapper" tasks
>> >
>> > 2) freshly forked TIF_FORK tasks which don't have a stack at all
>> >
>> > Make the idle tasks conform to the new stack bottom convention by
>> > starting their stack at a sizeof(pt_regs) offset from the end of the
>> > stack page.
>> >
>> > Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
>> > ---
>> >  arch/x86/kernel/head_64.S | 7 ++++---
>> >  1 file changed, 4 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
>> > index 6dbd2c0..0b12311 100644
>> > --- a/arch/x86/kernel/head_64.S
>> > +++ b/arch/x86/kernel/head_64.S
>> > @@ -296,8 +296,9 @@ ENTRY(start_cpu)
>> >          *      REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
>> >          *              address given in m16:64.
>> >          */
>> > -       movq    initial_code(%rip),%rax
>> > -       pushq   $0              # fake return address to stop unwinder
>> > +       call    1f              # put return address on stack for unwinder
>> > +1:     xorq    %rbp, %rbp      # clear frame pointer
>> > +       movq    initial_code(%rip), %rax
>> >         pushq   $__KERNEL_CS    # set correct cs
>> >         pushq   %rax            # target address in negative space
>> >         lretq
>> > @@ -325,7 +326,7 @@ ENDPROC(start_cpu0)
>> >         GLOBAL(initial_gs)
>> >         .quad   INIT_PER_CPU_VAR(irq_stack_union)
>> >         GLOBAL(initial_stack)
>> > -       .quad  init_thread_union+THREAD_SIZE-8
>> > +       .quad  init_thread_union + THREAD_SIZE - SIZEOF_PTREGS
>>
>> As long as you're doing this, could you also set orig_ax to -1?  I
>> remember running into some oddities resulting from orig_ax containing
>> garbage at some point.
>
> I assume you mean to initialize the orig_rax value in the pt_regs at the
> bottom of the stack of the idle task?
>
> How could that cause a problem?  Since the idle task never returns from
> a system call, I'd assume that memory never gets accessed?
>

Look at collect_syscall in lib/syscall.c

> --
> Josh



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 21:37               ` Andy Lutomirski
@ 2016-04-29 22:11                 ` Jiri Kosina
  2016-04-29 22:57                   ` Josh Poimboeuf
  2016-04-30  0:09                   ` Andy Lutomirski
  2016-04-29 22:41                 ` Josh Poimboeuf
  1 sibling, 2 replies; 121+ messages in thread
From: Jiri Kosina @ 2016-04-29 22:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Jessica Yu, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, 29 Apr 2016, Andy Lutomirski wrote:

> > NMI, MCE and interrupts aren't a problem because they have dedicated
> > stacks, which are easy to detect.  If the tasks' stack is on an
> > exception stack or an irq stack, we consider it unreliable.
> 
> Only on x86_64.

Well, MCEs are more or less x86-specific as well. But otherwise good 
point, thanks Andy.

So, what does the stack layout generally look like when an NMI is
actually running on the proper kernel stack? I thought it was guaranteed
to contain pt_regs in all cases anyway. Is that not the case?

Thanks,

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 21:37               ` Andy Lutomirski
  2016-04-29 22:11                 ` Jiri Kosina
@ 2016-04-29 22:41                 ` Josh Poimboeuf
  2016-04-30  0:08                   ` Andy Lutomirski
  1 sibling, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-29 22:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > I think the easiest way to make it work would be to modify the idtentry
> > macro to put all the idt entries in a dedicated section.  Then the
> > unwinder could easily detect any calls from that code.
> 
> That would work.  Would it make sense to do the same for the irq entries?

Yes, I think so.

> >> I suppose we could try to rejigger the code so that rbp points to
> >> pt_regs or similar.
> >
> > I think we should avoid doing something like that because it would break
> > gdb and all the other unwinders who don't know about it.
> 
> How so?
> 
> Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> the pt_regs.  Currently it points to something stale (which the
> dump_stack code might be relying on.  Hmm.)  But it's probably also
> safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> is the next thing on the stack, so just doing the section thing would
> work.

Yes, rbp is meaningless on the entry from user space.  But if an
in-kernel interrupt occurs (e.g. page fault, preemption) and you have
nested entry, rbp keeps its old value, right?  So the unwinder can walk
past the nested entry frame and keep going until it gets to the original
entry.
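
In other words the usual frame-pointer walk still works across it.
Roughly (bounds and validity checks omitted, names illustrative):

struct stack_frame {			/* (saved rbp, return address) */
	struct stack_frame *next_frame;
	unsigned long return_address;
};

static void walk_frames(unsigned long bp)
{
	struct stack_frame *frame = (struct stack_frame *)bp;

	while (frame) {
		pr_info(" %pS\n", (void *)frame->return_address);
		/* a nested entry's pt_regs is silently stepped over here */
		frame = frame->next_frame;
	}
}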

> We should really re-add DWARF some day.

Working on it :-)

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 22:11                 ` Jiri Kosina
@ 2016-04-29 22:57                   ` Josh Poimboeuf
  2016-04-30  0:09                   ` Andy Lutomirski
  1 sibling, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-29 22:57 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Andy Lutomirski, Jessica Yu, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, X86 ML, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Petr Mladek, Chris J Arges, Andy Lutomirski

On Sat, Apr 30, 2016 at 12:11:45AM +0200, Jiri Kosina wrote:
> On Fri, 29 Apr 2016, Andy Lutomirski wrote:
> > > NMI, MCE and interrupts aren't a problem because they have dedicated
> > > stacks, which are easy to detect.  If the tasks' stack is on an
> > > exception stack or an irq stack, we consider it unreliable.
> > 
> > Only on x86_64.
> 
> Well, MCEs are more or less x86-specific as well. But otherwise good 
> point, thanks Andy.
> 
> So, how does stack layout generally look like in case when NMI is actually 
> running on proper kernel stack? I thought it's guaranteed to contain 
> pt_regs anyway in all cases. Is that not guaranteed to be the case?

If the NMI were using the normal kernel stack and it interrupted kernel
space, pt_regs would be placed in the "middle" of the stack rather than
the bottom, and there's currently no way to detect that.

However, NMIs don't sleep, and we only consider sleeping tasks for stack
reliability, so it wouldn't be an issue anyway.

-- 
Josh


* Re: [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks
  2016-04-29 21:38       ` Andy Lutomirski
@ 2016-04-29 23:27         ` Josh Poimboeuf
  2016-04-30  0:10           ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-04-29 23:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Jessica Yu, Jiri Kosina, Miroslav Benes,
	Ingo Molnar, Peter Zijlstra, Michael Ellerman, Heiko Carstens,
	live-patching, linux-kernel, X86 ML, linuxppc-dev, linux-s390,
	Vojtech Pavlik, Jiri Slaby, Petr Mladek, Chris J Arges

On Fri, Apr 29, 2016 at 02:38:02PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 29, 2016 at 1:50 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Apr 29, 2016 at 12:39:16PM -0700, Andy Lutomirski wrote:
> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > Thanks to all the recent x86 entry code refactoring, most tasks' kernel
> >> > stacks start at the same offset right above their saved pt_regs,
> >> > regardless of which syscall was used to enter the kernel.  That creates
> >> > a nice convention which makes it straightforward to identify the
> >> > "bottom" of the stack, which can be useful for stack walking code which
> >> > needs to verify the stack is sane.
> >> >
> >> > However there are still a few types of tasks which don't yet follow that
> >> > convention:
> >> >
> >> > 1) CPU idle tasks, aka the "swapper" tasks
> >> >
> >> > 2) freshly forked TIF_FORK tasks which don't have a stack at all
> >> >
> >> > Make the idle tasks conform to the new stack bottom convention by
> >> > starting their stack at a sizeof(pt_regs) offset from the end of the
> >> > stack page.
> >> >
> >> > Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> >> > ---
> >> >  arch/x86/kernel/head_64.S | 7 ++++---
> >> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >> >
> >> > diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> >> > index 6dbd2c0..0b12311 100644
> >> > --- a/arch/x86/kernel/head_64.S
> >> > +++ b/arch/x86/kernel/head_64.S
> >> > @@ -296,8 +296,9 @@ ENTRY(start_cpu)
> >> >          *      REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
> >> >          *              address given in m16:64.
> >> >          */
> >> > -       movq    initial_code(%rip),%rax
> >> > -       pushq   $0              # fake return address to stop unwinder
> >> > +       call    1f              # put return address on stack for unwinder
> >> > +1:     xorq    %rbp, %rbp      # clear frame pointer
> >> > +       movq    initial_code(%rip), %rax
> >> >         pushq   $__KERNEL_CS    # set correct cs
> >> >         pushq   %rax            # target address in negative space
> >> >         lretq
> >> > @@ -325,7 +326,7 @@ ENDPROC(start_cpu0)
> >> >         GLOBAL(initial_gs)
> >> >         .quad   INIT_PER_CPU_VAR(irq_stack_union)
> >> >         GLOBAL(initial_stack)
> >> > -       .quad  init_thread_union+THREAD_SIZE-8
> >> > +       .quad  init_thread_union + THREAD_SIZE - SIZEOF_PTREGS
> >>
> >> As long as you're doing this, could you also set orig_ax to -1?  I
> >> remember running into some oddities resulting from orig_ax containing
> >> garbage at some point.
> >
> > I assume you mean to initialize the orig_rax value in the pt_regs at the
> > bottom of the stack of the idle task?
> >
> > How could that cause a problem?  Since the idle task never returns from
> > a system call, I'd assume that memory never gets accessed?
> >
> 
> Look at collect_syscall in lib/syscall.c

I don't see how collect_syscall() can be called for the per-cpu idle
"swapper" tasks (which is what the above code affects).  They don't have
pids or /proc entries so you can't do /proc/<pid>/syscall on them.

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 22:41                 ` Josh Poimboeuf
@ 2016-04-30  0:08                   ` Andy Lutomirski
  2016-05-02 13:52                     ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-04-30  0:08 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>
> On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > I think the easiest way to make it work would be to modify the idtentry
> > > macro to put all the idt entries in a dedicated section.  Then the
> > > unwinder could easily detect any calls from that code.
> >
> > That would work.  Would it make sense to do the same for the irq entries?
>
> Yes, I think so.
>
> > >> I suppose we could try to rejigger the code so that rbp points to
> > >> pt_regs or similar.
> > >
> > > I think we should avoid doing something like that because it would break
> > > gdb and all the other unwinders who don't know about it.
> >
> > How so?
> >
> > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> > the pt_regs.  Currently it points to something stale (which the
> > dump_stack code might be relying on.  Hmm.)  But it's probably also
> > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> > is the next thing on the stack, so just doing the section thing would
> > work.
>
> Yes, rbp is meaningless on the entry from user space.  But if an
> in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> nested entry, rbp keeps its old value, right?  So the unwinder can walk
> past the nested entry frame and keep going until it gets to the original
> entry.

Yes.

It would be nice if we could do better, though, and actually notice
the pt_regs and identify the entry.  For example, I'd love to see
"page fault, RIP=xyz" printed in the middle of a stack dump on a
crash.  Also, I think that just following rbp links will lose the
actual function that took the page fault (or whatever function
pt_regs->ip actually points to).

>
> > We should really re-add DWARF some day.
>
> Working on it :-)

Excellent.

Have you looked at my vdso unwinding test at all?  If we could do
something similar for the kernel, IMO it would make testing much more
pleasant.

--Andy


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-29 22:11                 ` Jiri Kosina
  2016-04-29 22:57                   ` Josh Poimboeuf
@ 2016-04-30  0:09                   ` Andy Lutomirski
  1 sibling, 0 replies; 121+ messages in thread
From: Andy Lutomirski @ 2016-04-30  0:09 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Ingo Molnar, Josh Poimboeuf, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Apr 29, 2016 3:11 PM, "Jiri Kosina" <jikos@kernel.org> wrote:
>
> On Fri, 29 Apr 2016, Andy Lutomirski wrote:
>
> > > NMI, MCE and interrupts aren't a problem because they have dedicated
> > > stacks, which are easy to detect.  If the tasks' stack is on an
> > > exception stack or an irq stack, we consider it unreliable.
> >
> > Only on x86_64.
>
> Well, MCEs are more or less x86-specific as well. But otherwise good
> point, thanks Andy.
>
> So, how does stack layout generally look like in case when NMI is actually
> running on proper kernel stack? I thought it's guaranteed to contain
> pt_regs anyway in all cases. Is that not guaranteed to be the case?
>

On x86, at least, there will still be pt_regs for the NMI.  For the
interrupted state, though, there might not be pt_regs, as the NMI
might have happened while still populating pt_regs.  In fact, the NMI
stack could overlap task_pt_regs.

For x86_32, there's no guarantee that pt_regs contains sp due to
hardware silliness.  You need to parse it more carefully: if
!user_mode(regs), then the old sp is just above pt_regs.
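
Roughly (ignoring vm86 and the segment quirks -- the helper name is
made up):

static unsigned long old_stack_pointer(struct pt_regs *regs)
{
	if (user_mode(regs))
		return regs->sp;	/* real saved user sp */

	/*
	 * Kernel-mode trap: the CPU didn't push ss/sp, so the old
	 * stack continues right where the sp slot would have been.
	 */
	return (unsigned long)&regs->sp;
}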

--Andy


* Re: [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks
  2016-04-29 23:27         ` Josh Poimboeuf
@ 2016-04-30  0:10           ` Andy Lutomirski
  0 siblings, 0 replies; 121+ messages in thread
From: Andy Lutomirski @ 2016-04-30  0:10 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Apr 29, 2016 4:27 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>
> On Fri, Apr 29, 2016 at 02:38:02PM -0700, Andy Lutomirski wrote:
> > On Fri, Apr 29, 2016 at 1:50 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > On Fri, Apr 29, 2016 at 12:39:16PM -0700, Andy Lutomirski wrote:
> > >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >> > Thanks to all the recent x86 entry code refactoring, most tasks' kernel
> > >> > stacks start at the same offset right above their saved pt_regs,
> > >> > regardless of which syscall was used to enter the kernel.  That creates
> > >> > a nice convention which makes it straightforward to identify the
> > >> > "bottom" of the stack, which can be useful for stack walking code which
> > >> > needs to verify the stack is sane.
> > >> >
> > >> > However there are still a few types of tasks which don't yet follow that
> > >> > convention:
> > >> >
> > >> > 1) CPU idle tasks, aka the "swapper" tasks
> > >> >
> > >> > 2) freshly forked TIF_FORK tasks which don't have a stack at all
> > >> >
> > >> > Make the idle tasks conform to the new stack bottom convention by
> > >> > starting their stack at a sizeof(pt_regs) offset from the end of the
> > >> > stack page.
> > >> >
> > >> > Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > >> > ---
> > >> >  arch/x86/kernel/head_64.S | 7 ++++---
> > >> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > >> >
> > >> > diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> > >> > index 6dbd2c0..0b12311 100644
> > >> > --- a/arch/x86/kernel/head_64.S
> > >> > +++ b/arch/x86/kernel/head_64.S
> > >> > @@ -296,8 +296,9 @@ ENTRY(start_cpu)
> > >> >          *      REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
> > >> >          *              address given in m16:64.
> > >> >          */
> > >> > -       movq    initial_code(%rip),%rax
> > >> > -       pushq   $0              # fake return address to stop unwinder
> > >> > +       call    1f              # put return address on stack for unwinder
> > >> > +1:     xorq    %rbp, %rbp      # clear frame pointer
> > >> > +       movq    initial_code(%rip), %rax
> > >> >         pushq   $__KERNEL_CS    # set correct cs
> > >> >         pushq   %rax            # target address in negative space
> > >> >         lretq
> > >> > @@ -325,7 +326,7 @@ ENDPROC(start_cpu0)
> > >> >         GLOBAL(initial_gs)
> > >> >         .quad   INIT_PER_CPU_VAR(irq_stack_union)
> > >> >         GLOBAL(initial_stack)
> > >> > -       .quad  init_thread_union+THREAD_SIZE-8
> > >> > +       .quad  init_thread_union + THREAD_SIZE - SIZEOF_PTREGS
> > >>
> > >> As long as you're doing this, could you also set orig_ax to -1?  I
> > >> remember running into some oddities resulting from orig_ax containing
> > >> garbage at some point.
> > >
> > > I assume you mean to initialize the orig_rax value in the pt_regs at the
> > > bottom of the stack of the idle task?
> > >
> > > How could that cause a problem?  Since the idle task never returns from
> > > a system call, I'd assume that memory never gets accessed?
> > >
> >
> > Look at collect_syscall in lib/syscall.c
>
> I don't see how collect_syscall() can be called for the per-cpu idle
> "swapper" tasks (which is what the above code affects).  They don't have
> pids or /proc entries so you can't do /proc/<pid>/syscall on them.

If so, then never mind.

--Andy

>
> --
> Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-04-30  0:08                   ` Andy Lutomirski
@ 2016-05-02 13:52                     ` Josh Poimboeuf
  2016-05-02 15:52                       ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-02 13:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> >
> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > >> I suppose we could try to rejigger the code so that rbp points to
> > > >> pt_regs or similar.
> > > >
> > > > I think we should avoid doing something like that because it would break
> > > > gdb and all the other unwinders who don't know about it.
> > >
> > > How so?
> > >
> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> > > the pt_regs.  Currently it points to something stale (which the
> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> > > is the next thing on the stack, so just doing the section thing would
> > > work.
> >
> > Yes, rbp is meaningless on the entry from user space.  But if an
> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> > past the nested entry frame and keep going until it gets to the original
> > entry.
> 
> Yes.
> 
> It would be nice if we could do better, though, and actually notice
> the pt_regs and identify the entry.  For example, I'd love to see
> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> crash.
>
> Also, I think that just following rbp links will lose the
> actual function that took the page fault (or whatever function
> pt_regs->ip actually points to).

Hm.  I think we could fix all that in a more standard way.  Whenever a
new pt_regs frame gets saved on entry, we could also create a new stack
frame which points to a fake kernel_entry() function.  That would tell
the unwinder there's a pt_regs frame without otherwise breaking frame
pointers across the frame.

Then I guess we wouldn't need my other solution of putting the idt
entries in a special section.

How does that sound?

> Have you looked at my vdso unwinding test at all?  If we could do
> something similar for the kernel, IMO it would make testing much more
> pleasant.

I found it, but I'm not sure what it would mean to do something similar
for the kernel.  Do you mean doing something like an NMI sampling-based
approach where we periodically do a random stack sanity check?

(If so, I do have something like that planned.)

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-02 13:52                     ` Josh Poimboeuf
@ 2016-05-02 15:52                       ` Andy Lutomirski
  2016-05-02 17:31                         ` Josh Poimboeuf
  2016-05-19 23:15                         ` Josh Poimboeuf
  0 siblings, 2 replies; 121+ messages in thread
From: Andy Lutomirski @ 2016-05-02 15:52 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>> >
>> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > > >> I suppose we could try to rejigger the code so that rbp points to
>> > > >> pt_regs or similar.
>> > > >
>> > > > I think we should avoid doing something like that because it would break
>> > > > gdb and all the other unwinders who don't know about it.
>> > >
>> > > How so?
>> > >
>> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> > > the pt_regs.  Currently it points to something stale (which the
>> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> > > is the next thing on the stack, so just doing the section thing would
>> > > work.
>> >
>> > Yes, rbp is meaningless on the entry from user space.  But if an
>> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> > past the nested entry frame and keep going until it gets to the original
>> > entry.
>>
>> Yes.
>>
>> It would be nice if we could do better, though, and actually notice
>> the pt_regs and identify the entry.  For example, I'd love to see
>> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> crash.
>>
>> Also, I think that just following rbp links will lose the
>> actual function that took the page fault (or whatever function
>> pt_regs->ip actually points to).
>
> Hm.  I think we could fix all that in a more standard way.  Whenever a
> new pt_regs frame gets saved on entry, we could also create a new stack
> frame which points to a fake kernel_entry() function.  That would tell
> the unwinder there's a pt_regs frame without otherwise breaking frame
> pointers across the frame.
>
> Then I guess we wouldn't need my other solution of putting the idt
> entries in a special section.
>
> How does that sound?

Let me try to understand.

The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
points to (prev rbp, prev rip) on the stack, and you can follow the
chain back.  Right now, on a user access page fault or similar, we
have rbp (probably) pointing to the interrupted frame, and the
interrupted rip isn't saved anywhere that a naive unwinder can find
it.  (It's in pt_regs, but the rbp chain skips right over that.)

We could change the entry code so that an interrupt / idtentry does:

push pt_regs
push kernel_entry
push %rbp
mov %rsp, %rbp
call handler
pop %rbp
addq $8, %rsp

or similar.  That would make it appear that the actual C handler was
caused by a dummy function "kernel_entry".  Now the unwinder would get
to kernel_entry, but it *still* wouldn't find its way to the calling
frame, which only solves part of the problem.  We could at least teach
the unwinder how kernel_entry works and let it decode pt_regs to
continue unwinding.  This would be nice, and I think it could work.
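
Roughly, the unwinder side could look like this (kernel_entry, the
frame layout, and the helper name are all hypothetical, taken from the
push sequence above):

extern void kernel_entry(void);		/* the dummy marker function */

struct stack_frame {			/* (saved rbp, return address) */
	struct stack_frame *next_frame;
	unsigned long return_address;
};

static bool unwind_through_kernel_entry(struct stack_frame *frame,
					unsigned long *ip, unsigned long *bp)
{
	struct pt_regs *regs;

	if (frame->return_address != (unsigned long)kernel_entry)
		return false;

	/*
	 * With the push sequence above, pt_regs sits right above the
	 * fake return address.
	 */
	regs = (struct pt_regs *)(frame + 1);
	*ip = regs->ip;
	*bp = regs->bp;
	return true;
}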

I think I like this, except that, if it used a separate section, it
could potentially be faster, as, for each actual entry type, the
offset from the C handler frame to pt_regs is a foregone conclusion.
But this is pretty simple and performance is already abysmal in most
handlers.

There's an added benefit to using a separate section, though: we could
also annotate the calls with what type of entry they were so the
unwinder could print it out nicely.

I could be convinced either way.


>
>> Have you looked at my vdso unwinding test at all?  If we could do
>> something similar for the kernel, IMO it would make testing much more
>> pleasant.
>
> I found it, but I'm not sure what it would mean to do something similar
> for the kernel.  Do you mean doing something like an NMI sampling-based
> approach where we periodically do a random stack sanity check?

I was imagining something a little more strict: single-step
interesting parts of the kernel and make sure that each step unwinds
correctly.  That could detect missing frames and similar.


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-02 15:52                       ` Andy Lutomirski
@ 2016-05-02 17:31                         ` Josh Poimboeuf
  2016-05-02 18:12                           ` Andy Lutomirski
  2016-05-19 23:15                         ` Josh Poimboeuf
  1 sibling, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-02 17:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> >> >
> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> > > >> pt_regs or similar.
> >> > > >
> >> > > > I think we should avoid doing something like that because it would break
> >> > > > gdb and all the other unwinders who don't know about it.
> >> > >
> >> > > How so?
> >> > >
> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> > > the pt_regs.  Currently it points to something stale (which the
> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> > > is the next thing on the stack, so just doing the section thing would
> >> > > work.
> >> >
> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> >> > past the nested entry frame and keep going until it gets to the original
> >> > entry.
> >>
> >> Yes.
> >>
> >> It would be nice if we could do better, though, and actually notice
> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> crash.
> >>
> >> Also, I think that just following rbp links will lose the
> >> actual function that took the page fault (or whatever function
> >> pt_regs->ip actually points to).
> >
> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> > new pt_regs frame gets saved on entry, we could also create a new stack
> > frame which points to a fake kernel_entry() function.  That would tell
> > the unwinder there's a pt_regs frame without otherwise breaking frame
> > pointers across the frame.
> >
> > Then I guess we wouldn't need my other solution of putting the idt
> > entries in a special section.
> >
> > How does that sound?
> 
> Let me try to understand.
> 
> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> points to (prev rbp, prev rip) on the stack, and you can follow the
> chain back.  Right now, on a user access page fault or similar, we
> have rbp (probably) pointing to the interrupted frame, and the
> interrupted rip isn't saved anywhere that a naive unwinder can find
> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> 
> We could change the entry code so that an interrupt / idtentry does:
> 
> push pt_regs
> push kernel_entry
> push %rbp
> mov %rsp, %rbp
> call handler
> pop %rbp
> addq $8, %rsp
> 
> or similar.  That would make it appear that the actual C handler was
> caused by a dummy function "kernel_entry".  Now the unwinder would get
> to kernel_entry, but it *still* wouldn't find its way to the calling
> frame, which only solves part of the problem.  We could at least teach
> the unwinder how kernel_entry works and let it decode pt_regs to
> continue unwinding.  This would be nice, and I think it could work.

Yeah, that's about what I had in mind.

> I think I like this, except that, if it used a separate section, it
> could potentially be faster, as, for each actual entry type, the
> offset from the C handler frame to pt_regs is a foregone conclusion.

Hm, this I don't really follow.  It's true that the unwinder can easily
find RIP from pt_regs, which will always be a known offset from the
kernel_entry pointer on the stack.  But why would having the entry code
in a separate section make that faster?

> But this is pretty simple and performance is already abysmal in most
> handlers.
> 
> There's an added benefit to using a separate section, though: we could
> also annotate the calls with what type of entry they were so the
> unwinder could print it out nicely.

Yeah, that could be a nice feature... but doesn't printing the name of
the C handler pretty much already give that information?

In any case, once we have a working DWARF unwinder, I think it will show
the name of the idt entry anyway.

> >> Have you looked at my vdso unwinding test at all?  If we could do
> >> something similar for the kernel, IMO it would make testing much more
> >> pleasant.
> >
> > I found it, but I'm not sure what it would mean to do something similar
> > for the kernel.  Do you mean doing something like an NMI sampling-based
> > approach where we periodically do a random stack sanity check?
> 
> I was imagining something a little more strict: single-step
> interesting parts of the kernel and make sure that each step unwinds
> correctly.  That could detect missing frames and similar.

Interesting idea, though I wonder how hard it would be to reliably
distinguish a missing frame from the case where gcc decides to inline a
function.

Another idea to detect missing frames: for each return address on the
stack, ensure there's a corresponding "call <func>" instruction
immediately preceding the return location, where <func> matches what's
on the stack.

That wouldn't work so well for indirect calls using function pointers,
but even then maybe we could use the DWARF CFI to find the function
pointer value and validate that it matches the stack function.
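
For the direct-call case that check is pretty mechanical.  Something
like this (only the 5-byte "call rel32" form is handled; the helper
name is made up and 'func' is the callee's start address, e.g. from
kallsyms):

static bool return_addr_matches_call(unsigned long ret_addr, unsigned long func)
{
	unsigned char opcode;
	s32 rel;

	if (probe_kernel_read(&opcode, (void *)(ret_addr - 5), 1) ||
	    probe_kernel_read(&rel, (void *)(ret_addr - 4), 4))
		return false;

	/* call rel32: target = address of the next instruction + rel */
	return opcode == 0xe8 && ret_addr + rel == func;
}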

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-02 17:31                         ` Josh Poimboeuf
@ 2016-05-02 18:12                           ` Andy Lutomirski
  2016-05-02 18:34                             ` Ingo Molnar
                                               ` (3 more replies)
  0 siblings, 4 replies; 121+ messages in thread
From: Andy Lutomirski @ 2016-05-02 18:12 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>> >> >
>> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > > >> I suppose we could try to rejigger the code so that rbp points to
>> >> > > >> pt_regs or similar.
>> >> > > >
>> >> > > > I think we should avoid doing something like that because it would break
>> >> > > > gdb and all the other unwinders who don't know about it.
>> >> > >
>> >> > > How so?
>> >> > >
>> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> >> > > the pt_regs.  Currently it points to something stale (which the
>> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> >> > > is the next thing on the stack, so just doing the section thing would
>> >> > > work.
>> >> >
>> >> > Yes, rbp is meaningless on the entry from user space.  But if an
>> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> >> > past the nested entry frame and keep going until it gets to the original
>> >> > entry.
>> >>
>> >> Yes.
>> >>
>> >> It would be nice if we could do better, though, and actually notice
>> >> the pt_regs and identify the entry.  For example, I'd love to see
>> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> >> crash.
>> >>
>> >> Also, I think that just following rbp links will lose the
>> >> actual function that took the page fault (or whatever function
>> >> pt_regs->ip actually points to).
>> >
>> > Hm.  I think we could fix all that in a more standard way.  Whenever a
>> > new pt_regs frame gets saved on entry, we could also create a new stack
>> > frame which points to a fake kernel_entry() function.  That would tell
>> > the unwinder there's a pt_regs frame without otherwise breaking frame
>> > pointers across the frame.
>> >
>> > Then I guess we wouldn't need my other solution of putting the idt
>> > entries in a special section.
>> >
>> > How does that sound?
>>
>> Let me try to understand.
>>
>> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
>> points to (prev rbp, prev rip) on the stack, and you can follow the
>> chain back.  Right now, on a user access page fault or similar, we
>> have rbp (probably) pointing to the interrupted frame, and the
>> interrupted rip isn't saved anywhere that a naive unwinder can find
>> it.  (It's in pt_regs, but the rbp chain skips right over that.)
>>
>> We could change the entry code so that an interrupt / idtentry does:
>>
>> push pt_regs
>> push kernel_entry
>> push %rbp
>> mov %rsp, %rbp
>> call handler
>> pop %rbp
>> addq $8, %rsp
>>
>> or similar.  That would make it appear that the actual C handler was
>> caused by a dummy function "kernel_entry".  Now the unwinder would get
>> to kernel_entry, but it *still* wouldn't find its way to the calling
>> frame, which only solves part of the problem.  We could at least teach
>> the unwinder how kernel_entry works and let it decode pt_regs to
>> continue unwinding.  This would be nice, and I think it could work.
>
> Yeah, that's about what I had in mind.

FWIW, I just tried this:

static bool is_entry_text(unsigned long addr)
{
    return addr >= (unsigned long)__entry_text_start &&
        addr < (unsigned long)__entry_text_end;
}

it works.  So the entry code is already annotated reasonably well :)

I just hacked it up here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21

and it seems to work, at least for page faults.  A better
implementation would print out the entire contents of pt_regs so that
people reading the stack trace will know the registers at the time of
the exception, which might be helpful.
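
Something along these lines once the pt_regs frame has been located
(just a sketch, only a few of the registers shown):

static void show_entry_regs(struct pt_regs *regs)
{
	pr_info(" RIP: %pS  RSP: %016lx  EFLAGS: %08lx\n",
		(void *)regs->ip, regs->sp, regs->flags);
	pr_info(" RAX: %016lx RBX: %016lx RCX: %016lx\n",
		regs->ax, regs->bx, regs->cx);
	/* ... and so on for the rest of the GPRs ... */
}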

>
>> I think I like this, except that, if it used a separate section, it
>> could potentially be faster, as, for each actual entry type, the
>> offset from the C handler frame to pt_regs is a foregone conclusion.
>
> Hm, this I don't really follow.  It's true that the unwinder can easily
> find RIP from pt_regs, which will always be a known offset from the
> kernel_entry pointer on the stack.  But why would having the entry code
> in a separate section make that faster?

It doesn't make the unwinder faster -- it makes the entry code faster.

>
>> But this is pretty simple and performance is already abysmal in most
>> handlers.
>>
>> There's an added benefit to using a separate section, though: we could
>> also annotate the calls with what type of entry they were so the
>> unwinder could print it out nicely.
>
> Yeah, that could be a nice feature... but doesn't printing the name of
> the C handler pretty much already give that information?
>
> In any case, once we have a working DWARF unwinder, I think it will show
> the name of the idt entry anyway.

True.  And it'll automatically follow pt_regs.

>
>> >> Have you looked at my vdso unwinding test at all?  If we could do
>> >> something similar for the kernel, IMO it would make testing much more
>> >> pleasant.
>> >
>> > I found it, but I'm not sure what it would mean to do something similar
>> > for the kernel.  Do you mean doing something like an NMI sampling-based
>> > approach where we periodically do a random stack sanity check?
>>
>> I was imagining something a little more strict: single-step
>> interesting parts of the kernel and make sure that each step unwinds
>> correctly.  That could detect missing frames and similar.
>
> Interesting idea, though I wonder how hard it would be to reliably
> distinguish a missing frame from the case where gcc decides to inline a
> function.
>
> Another idea to detect missing frames: for each return address on the
> stack, ensure there's a corresponding "call <func>" instruction
> immediately preceding the return location, where <func> matches what's
> on the stack.

Hmm, interesting.

I hope your plans include rewriting the current stack unwinder
completely.  The thing in print_context_stack is (a)
hard-to-understand and hard-to-modify crap and (b) is called in a loop
from another file using totally ridiculous conventions.

--Andy


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-02 18:12                           ` Andy Lutomirski
@ 2016-05-02 18:34                             ` Ingo Molnar
  2016-05-02 19:44                             ` Josh Poimboeuf
                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2016-05-02 18:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens,
	linux-s390, live-patching, Michael Ellerman, Chris J Arges,
	linuxppc-dev, Jessica Yu, Petr Mladek, Jiri Slaby,
	Vojtech Pavlik, linux-kernel, Miroslav Benes, Peter Zijlstra,
	Frédéric Weisbecker


* Andy Lutomirski <luto@amacapital.net> wrote:

> > Another idea to detect missing frames: for each return address on the stack, 
> > ensure there's a corresponding "call <func>" instruction immediately preceding 
> > the return location, where <func> matches what's on the stack.
> 
> Hmm, interesting.
> 
> I hope your plans include rewriting the current stack unwinder completely.  The 
> thing in print_context_stack is (a) hard-to-understand and hard-to-modify crap 
> and (b) is called in a loop from another file using totally ridiculous 
> conventions.

So we had several attempts at making it better; any further improvements
(including radical rewrites) are more than welcome!

The generalization between the various stack walking methods certainly didn't make 
things easier to read - we might want to eliminate that by using better primitives 
to iterate over the stack frame.
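
I.e. something with an explicit iterator instead of the current
callback loop -- all names here are hypothetical, just to show the
shape:

struct unwind_state {
	struct task_struct *task;
	unsigned long ip;
	/* plus whatever internal cursor state is needed */
};

void unwind_start(struct unwind_state *state, struct task_struct *task);
bool unwind_next_frame(struct unwind_state *state);	/* false when done */

static void dump_task_stack(struct task_struct *task)
{
	struct unwind_state state;

	unwind_start(&state, task);
	do {
		pr_info(" %pS\n", (void *)state.ip);
	} while (unwind_next_frame(&state));
}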

Thanks,

	Ingo


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-02 18:12                           ` Andy Lutomirski
  2016-05-02 18:34                             ` Ingo Molnar
@ 2016-05-02 19:44                             ` Josh Poimboeuf
  2016-05-02 19:54                             ` Jiri Kosina
  2016-05-04 15:16                               ` David Laight
  3 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-02 19:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Mon, May 02, 2016 at 11:12:39AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> >> >> >
> >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> >> > > >> pt_regs or similar.
> >> >> > > >
> >> >> > > > I think we should avoid doing something like that because it would break
> >> >> > > > gdb and all the other unwinders who don't know about it.
> >> >> > >
> >> >> > > How so?
> >> >> > >
> >> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> >> > > the pt_regs.  Currently it points to something stale (which the
> >> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> >> > > is the next thing on the stack, so just doing the section thing would
> >> >> > > work.
> >> >> >
> >> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> >> >> > past the nested entry frame and keep going until it gets to the original
> >> >> > entry.
> >> >>
> >> >> Yes.
> >> >>
> >> >> It would be nice if we could do better, though, and actually notice
> >> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> >> crash.
> >> >>
> >> >> Also, I think that just following rbp links will lose the
> >> >> actual function that took the page fault (or whatever function
> >> >> pt_regs->ip actually points to).
> >> >
> >> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> >> > new pt_regs frame gets saved on entry, we could also create a new stack
> >> > frame which points to a fake kernel_entry() function.  That would tell
> >> > the unwinder there's a pt_regs frame without otherwise breaking frame
> >> > pointers across the frame.
> >> >
> >> > Then I guess we wouldn't need my other solution of putting the idt
> >> > entries in a special section.
> >> >
> >> > How does that sound?
> >>
> >> Let me try to understand.
> >>
> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> >> points to (prev rbp, prev rip) on the stack, and you can follow the
> >> chain back.  Right now, on a user access page fault or similar, we
> >> have rbp (probably) pointing to the interrupted frame, and the
> >> interrupted rip isn't saved anywhere that a naive unwinder can find
> >> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> >>
> >> We could change the entry code so that an interrupt / idtentry does:
> >>
> >> push pt_regs
> >> push kernel_entry
> >> push %rbp
> >> mov %rsp, %rbp
> >> call handler
> >> pop %rbp
> >> addq $8, %rsp
> >>
> >> or similar.  That would make it appear that the actual C handler was
> >> caused by a dummy function "kernel_entry".  Now the unwinder would get
> >> to kernel_entry, but it *still* wouldn't find its way to the calling
> >> frame, which only solves part of the problem.  We could at least teach
> >> the unwinder how kernel_entry works and let it decode pt_regs to
> >> continue unwinding.  This would be nice, and I think it could work.
> >
> > Yeah, that's about what I had in mind.
> 
> FWIW, I just tried this:
> 
> static bool is_entry_text(unsigned long addr)
> {
>     return addr >= (unsigned long)__entry_text_start &&
>         addr < (unsigned long)__entry_text_end;
> }
> 
> it works.  So the entry code is already annotated reasonably well :)
> 
> I just hacked it up here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21
> 
> and it seems to work, at least for page faults.  A better
> implementation would print out the entire contents of pt_regs so that
> people reading the stack trace will know the registers at the time of
> the exception, which might be helpful.

I still think we would need more specific annotations to do that
reliably: a call from entry code doesn't necessarily correlate with a
pt_regs frame.

> >> I think I like this, except that, if it used a separate section, it
> >> could potentially be faster, as, for each actual entry type, the
> >> offset from the C handler frame to pt_regs is a foregone conclusion.
> >
> > Hm, this I don't really follow.  It's true that the unwinder can easily
> > find RIP from pt_regs, which will always be a known offset from the
> > kernel_entry pointer on the stack.  But why would having the entry code
> > in a separate section make that faster?
> 
> It doesn't make the unwinder faster -- it makes the entry code faster.

Oh, right.  But I don't think a few extra frame pointer instructions are
much of an issue if you already have CONFIG_FRAME_POINTER enabled.

Anyway I'm not sure which way is better.  I'll think about it.

> I hope your plans include rewriting the current stack unwinder
> completely.  The thing in print_context_stack is (a)
> hard-to-understand and hard-to-modify crap and (b) is called in a loop
> from another file using totally ridiculous conventions.

I agree, that code is quite confusing.  I haven't really thought about
how specifically it could be improved or replaced though.

Along those lines, I think it would be awesome if we could have an
arch-independent DWARF unwinder so that most of the stack dumping code
could be shared amongst all the arches.

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-02 18:12                           ` Andy Lutomirski
  2016-05-02 18:34                             ` Ingo Molnar
  2016-05-02 19:44                             ` Josh Poimboeuf
@ 2016-05-02 19:54                             ` Jiri Kosina
  2016-05-02 20:00                               ` Jiri Kosina
  2016-05-04 15:16                               ` David Laight
  3 siblings, 1 reply; 121+ messages in thread
From: Jiri Kosina @ 2016-05-02 19:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Mon, 2 May 2016, Andy Lutomirski wrote:

> FWIW, I just tried this:
> 
> static bool is_entry_text(unsigned long addr)
> {
>     return addr >= (unsigned long)__entry_text_start &&
>         addr < (unsigned long)__entry_text_end;
> }
> 
> it works.  So the entry code is already annotated reasonably well :)
> 
> I just hacked it up here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21
> 
> and it seems to work, at least for page faults.  A better
> implementation would print out the entire contents of pt_regs so that
> people reading the stack trace will know the registers at the time of
> the exception, which might be helpful.

Sorry for being dense, but how do you distinguish here between a "real" 
kernel entry, that pushes pt_regs, and any "non-entry" function call that 
passes pt_regs around?

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-02 19:54                             ` Jiri Kosina
@ 2016-05-02 20:00                               ` Jiri Kosina
  2016-05-03  0:39                                 ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Jiri Kosina @ 2016-05-02 20:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Mon, 2 May 2016, Jiri Kosina wrote:

> > FWIW, I just tried this:
> > 
> > static bool is_entry_text(unsigned long addr)
> > {
> >     return addr >= (unsigned long)__entry_text_start &&
> >         addr < (unsigned long)__entry_text_end;
> > }
> > 
> > it works.  So the entry code is already annotated reasonably well :)
> > 
> > I just hacked it up here:
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21
> > 
> > and it seems to work, at least for page faults.  A better
> > implementation would print out the entire contents of pt_regs so that
> > people reading the stack trace will know the registers at the time of
> > the exception, which might be helpful.
> 
> Sorry for being dense, but how do you distinguish here between a "real" 
> kernel entry, that pushes pt_regs, and any "non-entry" function call that 
> passes pt_regs around?

Umm, actually, the more tricky part is the other way around -- how do you 
make sure that whenever you are calling out from code between 
__entry_text_start and __entry_text_end, pt_regs will be at the place 
you're looking for it? How's that guaranteed?

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-02 20:00                               ` Jiri Kosina
@ 2016-05-03  0:39                                 ` Andy Lutomirski
  0 siblings, 0 replies; 121+ messages in thread
From: Andy Lutomirski @ 2016-05-03  0:39 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Josh Poimboeuf, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Mon, May 2, 2016 at 1:00 PM, Jiri Kosina <jikos@kernel.org> wrote:
> On Mon, 2 May 2016, Jiri Kosina wrote:
>
>> > FWIW, I just tried this:
>> >
>> > static bool is_entry_text(unsigned long addr)
>> > {
>> >     return addr >= (unsigned long)__entry_text_start &&
>> >         addr < (unsigned long)__entry_text_end;
>> > }
>> >
>> > it works.  So the entry code is already annotated reasonably well :)
>> >
>> > I just hacked it up here:
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21
>> >
>> > and it seems to work, at least for page faults.  A better
>> > implementation would print out the entire contents of pt_regs so that
>> > people reading the stack trace will know the registers at the time of
>> > the exception, which might be helpful.
>>
>> Sorry for being dense, but how do you distinguish here between a "real"
>> kernel entry, that pushes pt_regs, and any "non-entry" function call that
>> passes pt_regs around?
>
> Umm, actually, the more tricky part is the other way around -- how do you
> make sure that whenever you are calling out from code between
> __entry_text_start and __entry_text_end, pt_regs will be at the place
> you're looking for it? How's that guaranteed?

It's not guaranteed in my code.  I think we'd want to add a little
table of call sites and their pt_regs offsets.  This was just meant to
test that the general idea works (and it does indeed generate better
traces than the stock kernel, which gets it unconditionally wrong).
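
Something along these lines, maybe (purely a sketch; the names and the way
the table gets filled are made up):

	/* one entry per call site in the entry code that has pt_regs on the stack */
	struct entry_callsite {
		unsigned long return_addr;	/* return address of the 'call handler' */
		unsigned int pt_regs_offset;	/* bytes from that stack slot to pt_regs */
	};

	static const struct entry_callsite entry_callsites[] = {
		/* filled in by annotations or generated at build time */
	};

	static const struct entry_callsite *find_entry_callsite(unsigned long addr)
	{
		unsigned int i;

		for (i = 0; i < ARRAY_SIZE(entry_callsites); i++)
			if (entry_callsites[i].return_addr == addr)
				return &entry_callsites[i];
		return NULL;
	}

The unwinder would consult the table whenever it hits a return address in
the entry text and then jump straight to the saved pt_regs.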

--Andy

>
> Thanks,
>
> --
> Jiri Kosina
> SUSE Labs
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 10/18] livepatch/powerpc: add TIF_PATCH_PENDING thread flag
  2016-04-28 20:44 ` [RFC PATCH v2 10/18] livepatch/powerpc: " Josh Poimboeuf
@ 2016-05-03  9:07   ` Petr Mladek
  2016-05-03 12:06     ` Miroslav Benes
  0 siblings, 1 reply; 121+ messages in thread
From: Petr Mladek @ 2016-05-03  9:07 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Thu 2016-04-28 15:44:41, Josh Poimboeuf wrote:
> Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
> per-task consistency model for powerpc.  The bit getting set indicates
> the thread has a pending patch which needs to be applied when the thread
> exits the kernel.
> 
> The bit is included in the _TIF_USER_WORK_MASK macro so that
> do_notify_resume() and klp_patch_task() get called when the bit is set.
> 
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> ---
>  arch/powerpc/include/asm/thread_info.h | 4 +++-
>  arch/powerpc/kernel/signal.c           | 4 ++++
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
> index 8febc3f..df262ca 100644
> --- a/arch/powerpc/include/asm/thread_info.h
> +++ b/arch/powerpc/include/asm/thread_info.h
> @@ -88,6 +88,7 @@ static inline struct thread_info *current_thread_info(void)
>  					   TIF_NEED_RESCHED */
>  #define TIF_32BIT		4	/* 32 bit binary */
>  #define TIF_RESTORE_TM		5	/* need to restore TM FP/VEC/VSX */
> +#define TIF_PATCH_PENDING	6	/* pending live patching update */
>  #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
>  #define TIF_SINGLESTEP		8	/* singlestepping active */
>  #define TIF_NOHZ		9	/* in adaptive nohz mode */
> @@ -111,6 +112,7 @@ static inline struct thread_info *current_thread_info(void)
>  #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
>  #define _TIF_32BIT		(1<<TIF_32BIT)
>  #define _TIF_RESTORE_TM		(1<<TIF_RESTORE_TM)
> +#define _TIF_PATCH_PENDING	(1<<TIF_PATCH_PENDING)
>  #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
>  #define _TIF_SINGLESTEP		(1<<TIF_SINGLESTEP)
>  #define _TIF_SECCOMP		(1<<TIF_SECCOMP)
> @@ -127,7 +129,7 @@ static inline struct thread_info *current_thread_info(void)
>  
>  #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
>  				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> -				 _TIF_RESTORE_TM)
> +				 _TIF_RESTORE_TM | _TIF_PATCH_PENDING)
>  #define _TIF_PERSYSCALL_MASK	(_TIF_RESTOREALL|_TIF_NOERROR)
>  
>  /* Bits in local_flags */
> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> index cb64d6f..844497b 100644
> --- a/arch/powerpc/kernel/signal.c
> +++ b/arch/powerpc/kernel/signal.c
> @@ -14,6 +14,7 @@
>  #include <linux/uprobes.h>
>  #include <linux/key.h>
>  #include <linux/context_tracking.h>
> +#include <linux/livepatch.h>
>  #include <asm/hw_breakpoint.h>
>  #include <asm/uaccess.h>
>  #include <asm/unistd.h>
> @@ -159,6 +160,9 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
>  		tracehook_notify_resume(regs);
>  	}
>  
> +	if (thread_info_flags & _TIF_PATCH_PENDING)
> +		klp_patch_task(current);
> +

JFYI, if we later add the fake signal to speed up migration of
sleeping tasks, we would need to move this up before calling
do_signal(). It would help to avoid cycling here twice.
Mirek surely knows more details about it.
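
I.e. roughly something like this (illustration only, not an actual patch --
the point is just that the _TIF_PATCH_PENDING check moves above the signal
delivery):

	if (thread_info_flags & _TIF_PATCH_PENDING)
		klp_patch_task(current);

	if (thread_info_flags & _TIF_SIGPENDING)
		do_signal(regs);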

Best Regards,
Petr

>  	user_enter();
>  }
>  
> -- 
> 2.4.11
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 13/18] livepatch: separate enabled and patched states
  2016-04-28 20:44 ` [RFC PATCH v2 13/18] livepatch: separate enabled and patched states Josh Poimboeuf
@ 2016-05-03  9:30   ` Petr Mladek
  2016-05-03 13:48     ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Petr Mladek @ 2016-05-03  9:30 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Thu 2016-04-28 15:44:44, Josh Poimboeuf wrote:
> Once we have a consistency model, patches and their objects will be
> enabled and disabled at different times.  For example, when a patch is
> disabled, its loaded objects' funcs can remain registered with ftrace
> indefinitely until the unpatching operation is complete and they're no
> longer in use.
> 
> It's less confusing if we give them different names: patches can be
> enabled or disabled; objects (and their funcs) can be patched or
> unpatched:
> 
> - Enabled means that a patch is logically enabled (but not necessarily
>   fully applied).
> 
> - Patched means that an object's funcs are registered with ftrace and
>   added to the klp_ops func stack.
> 
> Also, since these states are binary, represent them with booleans
> instead of ints.
> 
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> ---
>  include/linux/livepatch.h | 17 ++++-------
>  kernel/livepatch/core.c   | 72 +++++++++++++++++++++++------------------------
>  2 files changed, 42 insertions(+), 47 deletions(-)
> 
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 6ea6880..2b59230 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -622,20 +622,20 @@ static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr,
>  	if (ret)
>  		return -EINVAL;
>  
> -	if (val != KLP_DISABLED && val != KLP_ENABLED)
> +	if (val > 1)
>  		return -EINVAL;

It would be cleaner to get "val" via kstrtobool(). It guarantees that
the value is true or false. Another nice win is that it accepts
Y/y/1/N/n/0 as the input.

>  	patch = container_of(kobj, struct klp_patch, kobj);
>  
>  	mutex_lock(&klp_mutex);
>  
> -	if (val == patch->state) {
> +	if (patch->enabled == val) {

Also this check will be cleaner if "val" is a boolean.
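
I mean something like this (untested sketch, just to show the idea):

	bool enabled;
	int ret;

	ret = kstrtobool(buf, &enabled);
	if (ret)
		return ret;

	...

	if (patch->enabled == enabled) {
		/* already in requested state */
		...
	}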

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 15/18] livepatch: move patching functions into patch.c
  2016-04-28 20:44 ` [RFC PATCH v2 15/18] livepatch: move patching functions into patch.c Josh Poimboeuf
@ 2016-05-03  9:39   ` Petr Mladek
  0 siblings, 0 replies; 121+ messages in thread
From: Petr Mladek @ 2016-05-03  9:39 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Thu 2016-04-28 15:44:46, Josh Poimboeuf wrote:
> Move functions related to the actual patching of functions and objects
> into a new patch.c file.
> 
> diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> new file mode 100644
> index 0000000..782fbb5
> --- /dev/null
> +++ b/kernel/livepatch/patch.c
> @@ -0,0 +1,213 @@
> +/*
> + * patch.c - livepatch patching functions
> + *
> + * Copyright (C) 2014 Seth Jennings <sjenning@redhat.com>
> + * Copyright (C) 2014 SUSE
> + * Copyright (C) 2015 Josh Poimboeuf <jpoimboe@redhat.com
                                                         ^^^^^^

Missing '>' ;-)

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 10/18] livepatch/powerpc: add TIF_PATCH_PENDING thread flag
  2016-05-03  9:07   ` Petr Mladek
@ 2016-05-03 12:06     ` Miroslav Benes
  0 siblings, 0 replies; 121+ messages in thread
From: Miroslav Benes @ 2016-05-03 12:06 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Josh Poimboeuf, Jessica Yu, Jiri Kosina, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Tue, 3 May 2016, Petr Mladek wrote:

> On Thu 2016-04-28 15:44:41, Josh Poimboeuf wrote:
> > Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
> > per-task consistency model for powerpc.  The bit getting set indicates
> > the thread has a pending patch which needs to be applied when the thread
> > exits the kernel.
> > 
> > The bit is included in the _TIF_USER_WORK_MASK macro so that
> > do_notify_resume() and klp_patch_task() get called when the bit is set.
> > 
> > Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > ---
> >  arch/powerpc/include/asm/thread_info.h | 4 +++-
> >  arch/powerpc/kernel/signal.c           | 4 ++++
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
> > index 8febc3f..df262ca 100644
> > --- a/arch/powerpc/include/asm/thread_info.h
> > +++ b/arch/powerpc/include/asm/thread_info.h
> > @@ -88,6 +88,7 @@ static inline struct thread_info *current_thread_info(void)
> >  					   TIF_NEED_RESCHED */
> >  #define TIF_32BIT		4	/* 32 bit binary */
> >  #define TIF_RESTORE_TM		5	/* need to restore TM FP/VEC/VSX */
> > +#define TIF_PATCH_PENDING	6	/* pending live patching update */
> >  #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
> >  #define TIF_SINGLESTEP		8	/* singlestepping active */
> >  #define TIF_NOHZ		9	/* in adaptive nohz mode */
> > @@ -111,6 +112,7 @@ static inline struct thread_info *current_thread_info(void)
> >  #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
> >  #define _TIF_32BIT		(1<<TIF_32BIT)
> >  #define _TIF_RESTORE_TM		(1<<TIF_RESTORE_TM)
> > +#define _TIF_PATCH_PENDING	(1<<TIF_PATCH_PENDING)
> >  #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
> >  #define _TIF_SINGLESTEP		(1<<TIF_SINGLESTEP)
> >  #define _TIF_SECCOMP		(1<<TIF_SECCOMP)
> > @@ -127,7 +129,7 @@ static inline struct thread_info *current_thread_info(void)
> >  
> >  #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
> >  				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> > -				 _TIF_RESTORE_TM)
> > +				 _TIF_RESTORE_TM | _TIF_PATCH_PENDING)
> >  #define _TIF_PERSYSCALL_MASK	(_TIF_RESTOREALL|_TIF_NOERROR)
> >  
> >  /* Bits in local_flags */
> > diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> > index cb64d6f..844497b 100644
> > --- a/arch/powerpc/kernel/signal.c
> > +++ b/arch/powerpc/kernel/signal.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/uprobes.h>
> >  #include <linux/key.h>
> >  #include <linux/context_tracking.h>
> > +#include <linux/livepatch.h>
> >  #include <asm/hw_breakpoint.h>
> >  #include <asm/uaccess.h>
> >  #include <asm/unistd.h>
> > @@ -159,6 +160,9 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
> >  		tracehook_notify_resume(regs);
> >  	}
> >  
> > +	if (thread_info_flags & _TIF_PATCH_PENDING)
> > +		klp_patch_task(current);
> > +
> 
> JFYI, if we later add the fake signal to speed up migration of
> sleeping task, we would need to move this up before calling
> do_signal(). It would help to avoid cycling here twice.
> Mirek surely knows more details about it.

Yes, that is true if we go with a fake signal implementation we have in 
kGraft. Nevertheless it is the issue of that patch so let's discuss that 
there (when I send it).

Miroslav

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 13/18] livepatch: separate enabled and patched states
  2016-05-03  9:30   ` Petr Mladek
@ 2016-05-03 13:48     ` Josh Poimboeuf
  0 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-03 13:48 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Tue, May 03, 2016 at 11:30:12AM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:44, Josh Poimboeuf wrote:
> > Once we have a consistency model, patches and their objects will be
> > enabled and disabled at different times.  For example, when a patch is
> > disabled, its loaded objects' funcs can remain registered with ftrace
> > indefinitely until the unpatching operation is complete and they're no
> > longer in use.
> > 
> > It's less confusing if we give them different names: patches can be
> > enabled or disabled; objects (and their funcs) can be patched or
> > unpatched:
> > 
> > - Enabled means that a patch is logically enabled (but not necessarily
> >   fully applied).
> > 
> > - Patched means that an object's funcs are registered with ftrace and
> >   added to the klp_ops func stack.
> > 
> > Also, since these states are binary, represent them with booleans
> > instead of ints.
> > 
> > Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > ---
> >  include/linux/livepatch.h | 17 ++++-------
> >  kernel/livepatch/core.c   | 72 +++++++++++++++++++++++------------------------
> >  2 files changed, 42 insertions(+), 47 deletions(-)
> > 
> > diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> > index 6ea6880..2b59230 100644
> > --- a/kernel/livepatch/core.c
> > +++ b/kernel/livepatch/core.c
> > @@ -622,20 +622,20 @@ static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr,
> >  	if (ret)
> >  		return -EINVAL;
> >  
> > -	if (val != KLP_DISABLED && val != KLP_ENABLED)
> > +	if (val > 1)
> >  		return -EINVAL;
> 
> It would be cleaner to get "val" via kstrtobool(). It guarantees that
> the value is true or false. Another nice win is that it accepts
> Y/y/1/N/n/0 as the input.
> 
> >  	patch = container_of(kobj, struct klp_patch, kobj);
> >  
> >  	mutex_lock(&klp_mutex);
> >  
> > -	if (val == patch->state) {
> > +	if (patch->enabled == val) {
> 
> Also this check will be cleaner if "val" is a boolean.

Good idea, thanks.

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
@ 2016-05-04  8:42   ` Petr Mladek
  2016-05-04 15:51     ` Josh Poimboeuf
  2016-05-04 12:39   ` barriers: was: " Petr Mladek
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 121+ messages in thread
From: Petr Mladek @ 2016-05-04  8:42 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -76,6 +76,7 @@
>  #include <linux/compiler.h>
>  #include <linux/sysctl.h>
>  #include <linux/kcov.h>
> +#include <linux/livepatch.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/pgalloc.h>
> @@ -1586,6 +1587,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  		p->parent_exec_id = current->self_exec_id;
>  	}
>  
> +	klp_copy_process(p);

I have some doubts here. We copy the state from the parent here. It means
that the new process might still need to be converted. But at the same
time, print_context_stack_reliable() returns zero without printing
any stack trace when the TIF_FORK flag is set. It means that a freshly
forked task might get converted immediately. It seems that both
operations are always done when copy_process() is called, but
they contradict each other.

I guess that print_context_stack_reliable() should either return
-EINVAL when TIF_FORK is set, or it should try to print the
stack of the newly forked task.

Or am I missing something?

> +
>  	spin_lock(&current->sighand->siglock);
>  
>  	/*

[...]

> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> new file mode 100644
> index 0000000..92819bb
> --- /dev/null
> +++ b/kernel/livepatch/transition.c
> +/*
> + * This function can be called in the middle of an existing transition to
> + * reverse the direction of the target patch state.  This can be done to
> + * effectively cancel an existing enable or disable operation if there are any
> + * tasks which are stuck in the initial patch state.
> + */
> +void klp_reverse_transition(void)
> +{
> +	struct klp_patch *patch = klp_transition_patch;
> +
> +	klp_target_state = !klp_target_state;
> +
> +	/*
> +	 * Ensure that if another CPU goes through the syscall barrier, sees
> +	 * the TIF_PATCH_PENDING writes in klp_start_transition(), and calls
> +	 * klp_patch_task(), it also sees the above write to the target state.
> +	 * Otherwise it can put the task in the wrong universe.
> +	 */
> +	smp_wmb();
> +
> +	klp_start_transition();
> +	klp_try_complete_transition();

It is a bit strange that we keep the work scheduled. It might be
better to use

       mod_delayed_work(system_wq, &klp_work, 0);

Which triggers more ideas from the nitpicking department:

I would move the work definition from core.c to transition.c because
it is closely related to klp_try_complete_transition();

While at it, I would make it clearer that the work is related
to the transition. Also I would call queue_delayed_work() directly
instead of adding the klp_schedule_work() wrapper. The delay
might be defined using a constant, e.g.

#define KLP_TRANSITION_DELAY round_jiffies_relative(HZ)

queue_delayed_work(system_wq, &klp_transition_work, KLP_TRANSITION_DELAY);

Finally, the following is always called right after
klp_start_transition(), so I would call it from there.

	if (!klp_try_complete_transition())
		klp_schedule_work();


> +
> +	patch->enabled = !patch->enabled;
> +}
> +

It is really great work! I am checking this patch from left, right, top,
and even bottom and all seems to work well together.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* barriers: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
  2016-05-04  8:42   ` Petr Mladek
@ 2016-05-04 12:39   ` Petr Mladek
  2016-05-04 13:53     ` Peter Zijlstra
                       ` (2 more replies)
  2016-05-04 14:48   ` klp_task_patch: " Petr Mladek
                     ` (5 subsequent siblings)
  7 siblings, 3 replies; 121+ messages in thread
From: Petr Mladek @ 2016-05-04 12:39 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.

I spent a lot of time checking the memory barriers. They seem to be
basically correct.  Let me use my own words to describe how I understand
them; I hope that it will help others with the review.

> diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> index 782fbb5..b3b8639 100644
> --- a/kernel/livepatch/patch.c
> +++ b/kernel/livepatch/patch.c
> @@ -29,6 +29,7 @@
>  #include <linux/bug.h>
>  #include <linux/printk.h>
>  #include "patch.h"
> +#include "transition.h"
>  
>  static LIST_HEAD(klp_ops);
>  
> @@ -58,11 +59,42 @@ static void notrace klp_ftrace_handler(unsigned long ip,
>  	ops = container_of(fops, struct klp_ops, fops);
>  
>  	rcu_read_lock();
> +
>  	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
>  				      stack_node);
> -	if (WARN_ON_ONCE(!func))
> +
> +	if (!func)
>  		goto unlock;
>  
> +	/*
> +	 * See the comment for the 2nd smp_wmb() in klp_init_transition() for
> +	 * an explanation of why this read barrier is needed.
> +	 */
> +	smp_rmb();

I would prefer to be more explicit, e.g.

	/*
	 * Read the right func->transition when the struct appeared on top of
	 * func_stack.  See klp_init_transition and klp_patch_func().
	 */

Note that this barrier is not really needed when the patch is being
disabled, see below.

> +
> +	if (unlikely(func->transition)) {
> +
> +		/*
> +		 * See the comment for the 1st smp_wmb() in
> +		 * klp_init_transition() for an explanation of why this read
> +		 * barrier is needed.
> +		 */
> +		smp_rmb();

Similar here:

		/*
		 * Read the right initial state when func->transition was
		 * enabled, see klp_init_transition().
		 *
		 * Note that the task must never be migrated to the target
		 * state when being inside this ftrace handler.
		 */

We might want to move the second paragraph on top of the function.
It is a basic and important fact. It actually explains why the first
read barrier is not needed when the patch is being disabled.

There are some more details below. I started to check and comment the
barriers from klp_init_transition().


> +		if (current->patch_state == KLP_UNPATCHED) {
> +			/*
> +			 * Use the previously patched version of the function.
> +			 * If no previous patches exist, use the original
> +			 * function.
> +			 */
> +			func = list_entry_rcu(func->stack_node.next,
> +					      struct klp_func, stack_node);
> +
> +			if (&func->stack_node == &ops->func_stack)
> +				goto unlock;
> +		}
> +	}
> +
>  	klp_arch_set_pc(regs, (unsigned long)func->new_func);
>  unlock:
>  	rcu_read_unlock();
> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> new file mode 100644
> index 0000000..92819bb
> --- /dev/null
> +++ b/kernel/livepatch/transition.c
> +/*
> + * klp_patch_task() - change the patched state of a task
> + * @task:	The task to change
> + *
> + * Switches the patched state of the task to the set of functions in the target
> + * patch state.
> + */
> +void klp_patch_task(struct task_struct *task)
> +{
> +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> +
> +	/*
> +	 * The corresponding write barriers are in klp_init_transition() and
> +	 * klp_reverse_transition().  See the comments there for an explanation.
> +	 */
> +	smp_rmb();

I would prefer to be more explicit, e.g.

	/*
	 * Read the correct klp_target_state when TIF_PATCH_PENDING was set
	 * and this function was called.  See klp_init_transition() and
	 * klp_reverse_transition().
	 */
> +
> +	task->patch_state = klp_target_state;
> +}

The function name confused me a few times when klp_target_state
was KLP_UNPATCHED. I suggest renaming it to klp_update_task()
or klp_transit_task().

> +/*
> + * Initialize the global target patch state and all tasks to the initial patch
> + * state, and initialize all function transition states to true in preparation
> + * for patching or unpatching.
> + */
> +void klp_init_transition(struct klp_patch *patch, int state)
> +{
> +	struct task_struct *g, *task;
> +	unsigned int cpu;
> +	struct klp_object *obj;
> +	struct klp_func *func;
> +	int initial_state = !state;
> +
> +	klp_transition_patch = patch;
> +
> +	/*
> +	 * If the patch can be applied or reverted immediately, skip the
> +	 * per-task transitions.
> +	 */
> +	if (patch->immediate)
> +		return;
> +
> +	/*
> +	 * Initialize all tasks to the initial patch state to prepare them for
> +	 * switching to the target state.
> +	 */
> +	read_lock(&tasklist_lock);
> +	for_each_process_thread(g, task)
> +		task->patch_state = initial_state;
> +	read_unlock(&tasklist_lock);
> +
> +	/*
> +	 * Ditto for the idle "swapper" tasks.
> +	 */
> +	get_online_cpus();
> +	for_each_online_cpu(cpu)
> +		idle_task(cpu)->patch_state = initial_state;
> +	put_online_cpus();
> +
> +	/*
> +	 * Ensure klp_ftrace_handler() sees the task->patch_state updates
> +	 * before the func->transition updates.  Otherwise it could read an
> +	 * out-of-date task state and pick the wrong function.
> +	 */
> +	smp_wmb();

This barrier is needed when the patch is being disabled. In this case,
the ftrace handler is already in use and the related struct klp_func
entries are on top of func_stack. The purpose is well described above.

It is not needed when the patch is being enabled because the patch is
not yet visible to the ftrace handler at that moment. The barrier below
is enough.


> +	/*
> +	 * Set the func transition states so klp_ftrace_handler() will know to
> +	 * switch to the transition logic.
> +	 *
> +	 * When patching, the funcs aren't yet in the func_stack and will be
> +	 * made visible to the ftrace handler shortly by the calls to
> +	 * klp_patch_object().
> +	 *
> +	 * When unpatching, the funcs are already in the func_stack and so are
> +	 * already visible to the ftrace handler.
> +	 */
> +	klp_for_each_object(patch, obj)
> +		klp_for_each_func(obj, func)
> +			func->transition = true;
> +
> +	/*
> +	 * Set the global target patch state which tasks will switch to.  This
> +	 * has no effect until the TIF_PATCH_PENDING flags get set later.
> +	 */
> +	klp_target_state = state;
> +
> +	/*
> +	 * For the enable path, ensure klp_ftrace_handler() will see the
> +	 * func->transition updates before the funcs become visible to the
> +	 * handler.  Otherwise the handler may wrongly pick the new func before
> +	 * the task switches to the patched state.

In other words, it makes sure that the ftrace handler will see
the updated func->transition before the ftrace handler is registered
and/or before the struct klp_func is listed in func_stack.


> +	 * For the disable path, the funcs are already visible to the handler.
> +	 * But we still need to ensure the ftrace handler will see the
> +	 * func->transition updates before the tasks start switching to the
> +	 * unpatched state.  Otherwise the handler can miss a task patch state
> +	 * change which would result in it wrongly picking the new function.

If this is true, it would mean that the task might be switched when it
is in the middle of klp_ftrace_handler. It would mean that reading
task->patch_state would be racy against the modification by
klp_patch_task().

Note that before we call klp_patch_task(), the task will stay in the
previous state. We are disabling the patch, so the previous state
is that the patch is enabled. It means that the task should always use
the new function before klp_patch_task() is called, so it does not
matter whether it sees func->transition updated or not. In both cases,
it will use the new function.

Fortunately, task->patch_state might be set to KLP_UNPATCHED
only when the task is sleeping or at some other safe location, e.g.
when returning to userspace. In both cases, the barrier is not needed
here.

In other words, this barrier is not needed to synchronize func_stack
and func->transition when the patch is being disabled.


> +	 * This barrier also ensures that if another CPU goes through the
> +	 * syscall barrier, sees the TIF_PATCH_PENDING writes in
> +	 * klp_start_transition(), and calls klp_patch_task(), it also sees the
> +	 * above write to the target state.  Otherwise it can put the task in
> +	 * the wrong universe.
> +	 */

In other words, it makes sure that klp_patch_task() will assign the
right patch_state. Note that klp_patch_task() cannot be called
before we set TIF_PATCH_PENDING in klp_start_transition().
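
As I understand it, the pairing is roughly:

  writer (klp_init_transition() + klp_start_transition()):

	klp_target_state = state;
	smp_wmb();
	set_tsk_thread_flag(task, TIF_PATCH_PENDING);

  reader (syscall barrier -> klp_patch_task()):

	/* TIF_PATCH_PENDING has been observed set */
	clear_tsk_thread_flag(current, TIF_PATCH_PENDING);
	smp_rmb();
	current->patch_state = klp_target_state;

If the reader sees the flag set, the smp_rmb() guarantees that it also
sees the new value of klp_target_state.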

> +	smp_wmb();
> +}
> +
> +/*
> + * Start the transition to the specified target patch state so tasks can begin
> + * switching to it.
> + */
> +void klp_start_transition(void)
> +{
> +	struct task_struct *g, *task;
> +	unsigned int cpu;
> +
> +	pr_notice("'%s': %s...\n", klp_transition_patch->mod->name,
> +		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
> +
> +	/*
> +	 * If the patch can be applied or reverted immediately, skip the
> +	 * per-task transitions.
> +	 */
> +	if (klp_transition_patch->immediate)
> +		return;
> +
> +	/*
> +	 * Mark all normal tasks as needing a patch state update.  As they pass
> +	 * through the syscall barrier they'll switch over to the target state
> +	 * (unless we switch them in klp_try_complete_transition() first).
> +	 */
> +	read_lock(&tasklist_lock);
> +	for_each_process_thread(g, task)
> +		set_tsk_thread_flag(task, TIF_PATCH_PENDING);

A bad intuition might suggest that we do not need to set this flag
when klp_start_transition() is called from klp_reverse_transition()
and the task is already in the right state.

But I think that we actually must set TIF_PATCH_PENDING even in this
case to avoid a possible race. We do not know whether klp_patch_task()
is running at that moment with the previous klp_target_state.


> +	read_unlock(&tasklist_lock);
> +
> +	/*
> +	 * Ditto for the idle "swapper" tasks, though they never cross the
> +	 * syscall barrier.  Instead they switch over in cpu_idle_loop().
> +	 */
> +	get_online_cpus();
> +	for_each_online_cpu(cpu)
> +		set_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
> +	put_online_cpus();
> +}
> +
> +/*
> + * This function can be called in the middle of an existing transition to
> + * reverse the direction of the target patch state.  This can be done to
> + * effectively cancel an existing enable or disable operation if there are any
> + * tasks which are stuck in the initial patch state.
> + */
> +void klp_reverse_transition(void)
> +{
> +	struct klp_patch *patch = klp_transition_patch;
> +
> +	klp_target_state = !klp_target_state;
> +
> +	/*
> +	 * Ensure that if another CPU goes through the syscall barrier, sees
> +	 * the TIF_PATCH_PENDING writes in klp_start_transition(), and calls
> +	 * klp_patch_task(), it also sees the above write to the target state.
> +	 * Otherwise it can put the task in the wrong universe.
> +	 */
> +	smp_wmb();

Yup, it is the same reason as for the 2nd barrier in klp_init_transition()
regarding klp_target_state and klp_patch_task() that is triggered by
TIF_PATCH_PENDING.

> +
> +	klp_start_transition();
> +	klp_try_complete_transition();
> +
> +	patch->enabled = !patch->enabled;
> +}
> +

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: barriers: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 12:39   ` barriers: was: " Petr Mladek
@ 2016-05-04 13:53     ` Peter Zijlstra
  2016-05-04 16:51       ` Josh Poimboeuf
  2016-05-04 14:12     ` Petr Mladek
  2016-05-04 17:02     ` Josh Poimboeuf
  2 siblings, 1 reply; 121+ messages in thread
From: Peter Zijlstra @ 2016-05-04 13:53 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Josh Poimboeuf, Jessica Yu, Jiri Kosina, Miroslav Benes,
	Ingo Molnar, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed, May 04, 2016 at 02:39:40PM +0200, Petr Mladek wrote:
> > +	 * This barrier also ensures that if another CPU goes through the
> > +	 * syscall barrier, sees the TIF_PATCH_PENDING writes in
> > +	 * klp_start_transition(), and calls klp_patch_task(), it also sees the
> > +	 * above write to the target state.  Otherwise it can put the task in
> > +	 * the wrong universe.
> > +	 */
> 
> In other words, it makes sure that klp_patch_task() will assign the
> right patch_state. Note that klp_patch_task() cannot be called
> before we set TIF_PATCH_PENDING in klp_start_transition().
> 
> > +	smp_wmb();
> > +}

So I've not read the patch; but ending a function with an smp_wmb()
feels wrong.

A wmb orders two stores, and I feel both stores should be well visible
in the same function.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: barriers: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 12:39   ` barriers: was: " Petr Mladek
  2016-05-04 13:53     ` Peter Zijlstra
@ 2016-05-04 14:12     ` Petr Mladek
  2016-05-04 17:25       ` Josh Poimboeuf
  2016-05-04 17:02     ` Josh Poimboeuf
  2 siblings, 1 reply; 121+ messages in thread
From: Petr Mladek @ 2016-05-04 14:12 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed 2016-05-04 14:39:40, Petr Mladek wrote:
> 		 *
> 		 * Note that the task must never be migrated to the target
> 		 * state when being inside this ftrace handler.
> 		 */
> 
> We might want to move the second paragraph on top of the function.
> It is a basic and important fact. It actually explains why the first
> read barrier is not needed when the patch is being disabled.

I wrote that statement partly intuitively. I think that it really is
important, and I have slight doubts about whether we are on the safe side.

First, why is it important that the task->patch_state is not switched
when being inside the ftrace handler?

If we are inside the handler, we are kind-of inside the called
function. And the basic idea of this consistency model is that
we must not switch a task when it is inside a patched function.
This is normally decided by the stack.

The handler is a bit special because it is called right before the
function. If it was the only patched function on the stack, it would
not matter if we choose the new or old code. Both decisions would
be safe for the moment.

The fun starts when the function calls another patched function.
The other patched function must be called consistently with
the first one. If the first function was from the patch,
the other must be from the patch as well and vice versa.

This is why we must not switch task->patch_state dangerously
when being inside the ftrace handler.

Now I am not sure if this condition is fulfilled. The ftrace handler
is called as the very first instruction of the function. Doesn't
that break the stack validity? Could we sleep inside the ftrace
handler? Will the patched function be detected on the stack?

Or is my brain already too far in the fantasy world?


Best regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* klp_task_patch: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
  2016-05-04  8:42   ` Petr Mladek
  2016-05-04 12:39   ` barriers: was: " Petr Mladek
@ 2016-05-04 14:48   ` Petr Mladek
  2016-05-04 14:56     ` Jiri Kosina
  2016-05-04 17:57     ` Josh Poimboeuf
  2016-05-06 11:33   ` Petr Mladek
                     ` (4 subsequent siblings)
  7 siblings, 2 replies; 121+ messages in thread
From: Petr Mladek @ 2016-05-04 14:48 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.
> 
> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> new file mode 100644
> index 0000000..92819bb
> --- /dev/null
> +++ b/kernel/livepatch/transition.c
> +/*
> + * klp_patch_task() - change the patched state of a task
> + * @task:	The task to change
> + *
> + * Switches the patched state of the task to the set of functions in the target
> + * patch state.
> + */
> +void klp_patch_task(struct task_struct *task)
> +{
> +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> +
> +	/*
> +	 * The corresponding write barriers are in klp_init_transition() and
> +	 * klp_reverse_transition().  See the comments there for an explanation.
> +	 */
> +	smp_rmb();
> +
> +	task->patch_state = klp_target_state;
> +}
> +
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index bd12c6c..60d633f 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -9,6 +9,7 @@
>  #include <linux/mm.h>
>  #include <linux/stackprotector.h>
>  #include <linux/suspend.h>
> +#include <linux/livepatch.h>
>  
>  #include <asm/tlb.h>
>  
> @@ -266,6 +267,9 @@ static void cpu_idle_loop(void)
>  
>  		sched_ttwu_pending();
>  		schedule_preempt_disabled();
> +
> +		if (unlikely(klp_patch_pending(current)))
> +			klp_patch_task(current);
>  	}

Some more ideas from the world of crazy races. I was shaking my head
wondering whether this is safe or not.

The problem might be if the task gets rescheduled between the check
for the pending stuff or inside the klp_patch_task() function.
This will get even more important when we use this construct
in some more locations, e.g. in some kthreads.

If the task is sleeping in one of these strange locations, it might
assign strange values at strange times.

I think that it is safe only because it is called with the
'current' parameter and from safe locations. It means that
the result is always safe and consistent. Also, we could assign
an outdated value only when sleeping between reading klp_target_state
and storing task->patch_state. But if anyone modified
klp_target_state at that point, they also set TIF_PATCH_PENDING,
so the change will not get lost.

I think that we should document that klp_patch_task() must be
called only from a safe location within the affected task.

I even suggest avoiding misuse by removing the struct task_struct *task
parameter. It should always be called with current.
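
E.g. something like this (the name is just an example):

	/* always operates on current; must be called from a safe location */
	void klp_update_current(void)
	{
		clear_tsk_thread_flag(current, TIF_PATCH_PENDING);

		/* see the write barriers in klp_init_transition() */
		smp_rmb();

		current->patch_state = klp_target_state;
	}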

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: klp_task_patch: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 14:48   ` klp_task_patch: " Petr Mladek
@ 2016-05-04 14:56     ` Jiri Kosina
  2016-05-04 17:57     ` Josh Poimboeuf
  1 sibling, 0 replies; 121+ messages in thread
From: Jiri Kosina @ 2016-05-04 14:56 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Josh Poimboeuf, Jessica Yu, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed, 4 May 2016, Petr Mladek wrote:

> > +
> > +		if (unlikely(klp_patch_pending(current)))
> > +			klp_patch_task(current);
> >  	}
> 
> > Some more ideas from the world of crazy races. I was shaking my head
> > wondering whether this is safe or not.
> 
> > The problem might be if the task gets rescheduled between the check
> for the pending stuff 

The code in question is running with preemption disabled.

> or inside the klp_patch_task() function. 

We must make sure that this function doesn't go to sleep. It's only used 
to clear the task_struct flag anyway.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-02 18:12                           ` Andy Lutomirski
@ 2016-05-04 15:16                               ` David Laight
  2016-05-02 19:44                             ` Josh Poimboeuf
                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 121+ messages in thread
From: David Laight @ 2016-05-04 15:16 UTC (permalink / raw)
  To: 'Andy Lutomirski', Josh Poimboeuf
  Cc: linux-s390, Jiri Kosina, Jessica Yu, Vojtech Pavlik, Petr Mladek,
	Peter Zijlstra, X86 ML, Heiko Carstens, linux-kernel,
	Ingo Molnar, live-patching, Jiri Slaby, Miroslav Benes,
	linuxppc-dev, Chris J Arges

From: Andy Lutomirski
> Sent: 02 May 2016 19:13
...
> I hope your plans include rewriting the current stack unwinder
> completely.  The thing in print_context_stack is (a)
> hard-to-understand and hard-to-modify crap and (b) is called in a loop
> from another file using totally ridiculous conventions.

I've seen a 'stack unwinder' that parsed the instruction stream
forwards looking for 'return' instructions.
I fixed it to add the few extra instructions needed to sort out the
exit path from hardware interrupts.

It only had to understand instructions that modified %sp and %bp
and remember which branch instructions and branch targets it had
used in order to find the correct exit path from a function.

Worked reasonably well without any debug info or guaranteed frame
pointers.
It did have to fall back on scanning the stack if it was inside
an infinite loop. Even on x86 it is reasonably possible to check
for 'call' instructions in this case.

	David

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04  8:42   ` Petr Mladek
@ 2016-05-04 15:51     ` Josh Poimboeuf
  2016-05-05  9:41       ` Miroslav Benes
  2016-05-05 13:06       ` Petr Mladek
  0 siblings, 2 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-04 15:51 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed, May 04, 2016 at 10:42:23AM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > Change livepatch to use a basic per-task consistency model.  This is the
> > foundation which will eventually enable us to patch those ~10% of
> > security patches which change function or data semantics.  This is the
> > biggest remaining piece needed to make livepatch more generally useful.
> 
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -76,6 +76,7 @@
> >  #include <linux/compiler.h>
> >  #include <linux/sysctl.h>
> >  #include <linux/kcov.h>
> > +#include <linux/livepatch.h>
> >  
> >  #include <asm/pgtable.h>
> >  #include <asm/pgalloc.h>
> > @@ -1586,6 +1587,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >  		p->parent_exec_id = current->self_exec_id;
> >  	}
> >  
> > +	klp_copy_process(p);
> 
> I am in doubts here. We copy the state from the parent here. It means
> that the new process might still need to be converted. But at the same
> point print_context_stack_reliable() returns zero without printing
> any stack trace when TIF_FORK flag is set. It means that a freshly
> forked task might get be converted immediately. I seems that boot
> operations are always done when copy_process() is called. But
> they are contradicting each other.
> 
> I guess that print_context_stack_reliable() should either return
> -EINVAL when TIF_FORK is set. Or it should try to print the
> stack of the newly forked task.
> 
> Or do I miss something, please?

Ok, I admit it's confusing.

A newly forked task doesn't *have* a stack (other than the pt_regs frame
it needs for the return to user space), which is why
print_context_stack_reliable() returns success with an empty array of
addresses.

For a little background, see the second switch_to() macro in
arch/x86/include/asm/switch_to.h.  When a newly forked task runs for the
first time, it returns from __switch_to() with no stack.  It then jumps
straight to ret_from_fork in entry_64.S, calls a few C functions, and
eventually returns to user space.  So, assuming we aren't patching entry
code or the switch_to() macro in __schedule(), it should be safe to
patch the task before it does all that.

With the current code, if an unpatched task gets forked, the child will
also be unpatched.  In theory, we could go ahead and patch the child
then.  In fact, that's what I did in v1.9.

But in v1.9 discussions it was pointed out that someday maybe the
ret_from_fork stuff will get cleaned up and instead the child stack will
be copied from the parent.  In that case the child should inherit its
parent's patched state.  So we decided to make it more future-proof by
having the child inherit the parent's patched state.

So, having said all that, I'm really not sure what the best approach is
for print_context_stack_reliable().  Right now I'm thinking I'll change
it back to return -EINVAL for a newly forked task, so it'll be more
future-proof: better to have a false positive than a false negative.
Either way it will probably need to be changed again if the
ret_from_fork code gets cleaned up.
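
To illustrate, a rough sketch of that -EINVAL check (the exact placement
and argument names in print_context_stack_reliable() are just an
assumption here, not the actual patch):

	/*
	 * A freshly forked task has no kernel stack yet (only the pt_regs
	 * frame for the return to user space), so refuse to call its stack
	 * reliable instead of returning an empty trace.
	 */
	if (test_tsk_thread_flag(task, TIF_FORK))
		return -EINVAL;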

> > +
> >  	spin_lock(&current->sighand->siglock);
> >  
> >  	/*
> 
> [...]
> 
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > new file mode 100644
> > index 0000000..92819bb
> > --- /dev/null
> > +++ b/kernel/livepatch/transition.c
> > +/*
> > + * This function can be called in the middle of an existing transition to
> > + * reverse the direction of the target patch state.  This can be done to
> > + * effectively cancel an existing enable or disable operation if there are any
> > + * tasks which are stuck in the initial patch state.
> > + */
> > +void klp_reverse_transition(void)
> > +{
> > +	struct klp_patch *patch = klp_transition_patch;
> > +
> > +	klp_target_state = !klp_target_state;
> > +
> > +	/*
> > +	 * Ensure that if another CPU goes through the syscall barrier, sees
> > +	 * the TIF_PATCH_PENDING writes in klp_start_transition(), and calls
> > +	 * klp_patch_task(), it also sees the above write to the target state.
> > +	 * Otherwise it can put the task in the wrong universe.
> > +	 */
> > +	smp_wmb();
> > +
> > +	klp_start_transition();
> > +	klp_try_complete_transition();
> 
> It is a bit strange that we keep the work scheduled. It might be
> better to use
> 
>        mod_delayed_work(system_wq, &klp_work, 0);

True, I think that would be better.

> Which triggers more ideas from the nitpicking department:
> 
> I would move the work definition from core.c to transition.c because
> it is closely related to klp_try_complete_transition();

That could be good, but there's a slight problem: klp_work_fn() requires
klp_mutex, which is static to core.c.  It's kind of nice to keep the use
of the mutex in core.c only.

> When on it. I would make it more clear that the work is related
> to transition.

How would you recommend doing that?  How about:

- rename "klp_work" -> "klp_transition_work"
- rename "klp_work_fn" -> "klp_transition_work_fn" 

?

> Also I would call queue_delayed_work() directly
> instead of adding the klp_schedule_work() wrapper. The delay
> might be defined using a constant, e.g.
> 
> #define KLP_TRANSITION_DELAY round_jiffies_relative(HZ)
> 
> queue_delayed_work(system_wq, &klp_transition_work, KLP_TRANSITION_DELAY);

Sure.

> Finally, the following is always called right after
> klp_start_transition(), so I would call it from there.
> 
> 	if (!klp_try_complete_transition())
> 		klp_schedule_work();

Except for when it's called by klp_reverse_transition().  And it really
depends on whether we want to allow transition.c to use the mutex.  I
don't have a strong opinion either way, I may need to think about it
some more.
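
For reference, the suggested shape would be something like this sketch in
transition.c (names follow the proposals above and aren't final):

	#define KLP_TRANSITION_DELAY	round_jiffies_relative(HZ)

	static void klp_transition_work_fn(struct work_struct *work);
	static DECLARE_DELAYED_WORK(klp_transition_work, klp_transition_work_fn);

	/* instead of the klp_schedule_work() wrapper: */
	queue_delayed_work(system_wq, &klp_transition_work, KLP_TRANSITION_DELAY);

	/* and from klp_reverse_transition(): */
	mod_delayed_work(system_wq, &klp_transition_work, 0);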

> > +
> > +	patch->enabled = !patch->enabled;
> > +}
> > +
> 
> It is really great work! I am checking this patch from left, right, top,
> and even bottom and all seems to work well together.

Great!  Thanks a lot for the thorough review!

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: barriers: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 13:53     ` Peter Zijlstra
@ 2016-05-04 16:51       ` Josh Poimboeuf
  0 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-04 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Petr Mladek, Jessica Yu, Jiri Kosina, Miroslav Benes,
	Ingo Molnar, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed, May 04, 2016 at 03:53:29PM +0200, Peter Zijlstra wrote:
> On Wed, May 04, 2016 at 02:39:40PM +0200, Petr Mladek wrote:
> > > +	 * This barrier also ensures that if another CPU goes through the
> > > +	 * syscall barrier, sees the TIF_PATCH_PENDING writes in
> > > +	 * klp_start_transition(), and calls klp_patch_task(), it also sees the
> > > +	 * above write to the target state.  Otherwise it can put the task in
> > > +	 * the wrong universe.

(oops, missed a "universe" -> "patch state" rename)

> > > +	 */
> > 
> > By other words, it makes sure that klp_patch_task() will assign the
> > right patch_state. Where klp_patch_task() could not be called
> > before we set TIF_PATCH_PENDING in klp_start_transition().
> > 
> > > +	smp_wmb();
> > > +}
> 
> So I've not read the patch; but ending a function with an smp_wmb()
> feels wrong.
> 
> A wmb orders two stores, and I feel both stores should be well visible
> in the same function.

Yeah, I would agree with that.  And also, it's probably a red flag that
the barrier needs *three* paragraphs to describe the various cases it's
needed for.

However, there are some complications:

1) The stores are in separate functions (which is generally a good
   thing as it greatly helps the readability of the code).

2) Which stores are being ordered depends on whether the function is
   called in the enable path or the disable path.

3) Either way it actually orders *two* separate pairs of stores.

Anyway I'm thinking I should move that barrier out of
klp_init_transition() and into its callers.  The stores will still be in
separate functions but at least there will be better visibility of where
the stores are occurring, and the comments can be a little more focused.
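
Roughly like this (just a sketch of the idea, not tested):

	/* in the enable/disable paths: */
	klp_init_transition(patch, KLP_PATCHED);

	/*
	 * Order the klp_target_state write in klp_init_transition() before
	 * the TIF_PATCH_PENDING writes in klp_start_transition().
	 */
	smp_wmb();

	klp_start_transition();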

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: barriers: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 12:39   ` barriers: was: " Petr Mladek
  2016-05-04 13:53     ` Peter Zijlstra
  2016-05-04 14:12     ` Petr Mladek
@ 2016-05-04 17:02     ` Josh Poimboeuf
  2016-05-05 10:21       ` Petr Mladek
  2 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-04 17:02 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed, May 04, 2016 at 02:39:40PM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > Change livepatch to use a basic per-task consistency model.  This is the
> > foundation which will eventually enable us to patch those ~10% of
> > security patches which change function or data semantics.  This is the
> > biggest remaining piece needed to make livepatch more generally useful.
> 
> I spent a lot of time with checking the memory barriers. It seems that
> they are basically correct.  Let me use my own words to show how
> I understand it. I hope that it will help others with review.

[...snip a ton of useful comments...]

Thanks, this will help a lot!  I'll try to incorporate your barrier
comments into the code.

I also agree that kpatch_patch_task() is poorly named.  I was trying to
make it clear to external callers that "hey, the task is getting patched
now!", but it's internally inconsistent with livepatch code because we
make a distinction between patching and unpatching.

Maybe I'll do:

  klp_update_task_patch_state()

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: barriers: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 14:12     ` Petr Mladek
@ 2016-05-04 17:25       ` Josh Poimboeuf
  2016-05-05 11:21         ` Petr Mladek
  2016-05-09 15:42         ` Miroslav Benes
  0 siblings, 2 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-04 17:25 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed, May 04, 2016 at 04:12:05PM +0200, Petr Mladek wrote:
> On Wed 2016-05-04 14:39:40, Petr Mladek wrote:
> > 		 *
> > 		 * Note that the task must never be migrated to the target
> > 		 * state when being inside this ftrace handler.
> > 		 */
> > 
> > We might want to move the second paragraph on top of the function.
> > It is a basic and important fact. It actually explains why the first
> > read barrier is not needed when the patch is being disabled.
> 
> I wrote the statement partly intuitively. I think that it is really
> somehow important. And I am slightly in doubts if we are on the safe side.
> 
> First, why is it important that the task->patch_state is not switched
> when being inside the ftrace handler?
> 
> If we are inside the handler, we are kind-of inside the called
> function. And the basic idea of this consistency model is that
> we must not switch a task when it is inside a patched function.
> This is normally decided by the stack.
> 
> The handler is a bit special because it is called right before the
> function. If it was the only patched function on the stack, it would
> not matter if we choose the new or old code. Both decisions would
> be safe for the moment.
> 
> The fun starts when the function calls another patched function.
> The other patched function must be called consistently with
> the first one. If the first function was from the patch,
> the other must be from the patch as well and vice versa.
> 
> This is why we must not switch task->patch_state dangerously
> when being inside the ftrace handler.
> 
> Now I am not sure if this condition is fulfilled. The ftrace handler
> is called as the very first instruction of the function. Does not
> it break the stack validity? Could we sleep inside the ftrace
> handler? Will the patched function be detected on the stack?
> 
> Or is my brain already too far in the fantasy world?

I think this isn't a possibility.

In today's code base, this can't happen because task patch states are
only switched when sleeping or when exiting the kernel.  The ftrace
handler doesn't sleep directly.

If it were preempted, it couldn't be switched there either because we
consider preempted stacks to be unreliable.

In theory, a DWARF stack trace of a preempted task *could* be reliable.
But then the DWARF unwinder should be smart enough to see that the
original function called the ftrace handler.  Right?  So the stack would
be reliable, but then livepatch would see the original function on the
stack and wouldn't switch the task.

Does that make sense?

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: klp_task_patch: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 14:48   ` klp_task_patch: " Petr Mladek
  2016-05-04 14:56     ` Jiri Kosina
@ 2016-05-04 17:57     ` Josh Poimboeuf
  2016-05-05 11:57       ` Petr Mladek
  1 sibling, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-04 17:57 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed, May 04, 2016 at 04:48:54PM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > Change livepatch to use a basic per-task consistency model.  This is the
> > foundation which will eventually enable us to patch those ~10% of
> > security patches which change function or data semantics.  This is the
> > biggest remaining piece needed to make livepatch more generally useful.
> > 
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > new file mode 100644
> > index 0000000..92819bb
> > --- /dev/null
> > +++ b/kernel/livepatch/transition.c
> > +/*
> > + * klp_patch_task() - change the patched state of a task
> > + * @task:	The task to change
> > + *
> > + * Switches the patched state of the task to the set of functions in the target
> > + * patch state.
> > + */
> > +void klp_patch_task(struct task_struct *task)
> > +{
> > +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> > +
> > +	/*
> > +	 * The corresponding write barriers are in klp_init_transition() and
> > +	 * klp_reverse_transition().  See the comments there for an explanation.
> > +	 */
> > +	smp_rmb();
> > +
> > +	task->patch_state = klp_target_state;
> > +}
> > +
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index bd12c6c..60d633f 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/stackprotector.h>
> >  #include <linux/suspend.h>
> > +#include <linux/livepatch.h>
> >  
> >  #include <asm/tlb.h>
> >  
> > @@ -266,6 +267,9 @@ static void cpu_idle_loop(void)
> >  
> >  		sched_ttwu_pending();
> >  		schedule_preempt_disabled();
> > +
> > +		if (unlikely(klp_patch_pending(current)))
> > +			klp_patch_task(current);
> >  	}
> 
> Some more ideas from the world of crazy races. I was shaking my head
> if this was safe or not.
> 
> The problem might be if the task gets rescheduled between the check
> for the pending stuff and the call to klp_patch_task(), or inside the
> klp_patch_task() function.
> This will get even more important when we use this construct
> on some more locations, e.g. in some kthreads.
> 
> If the task is sleeping at these strange locations, it might assign
> strange values at strange times.
> 
> I think that it is safe only because it is called with the
> 'current' parameter and on a safe locations. It means that
> the result is always safe and consistent. Also we could assign
> an outdated value only when sleeping between reading klp_target_state
> and storing task->patch_state. But if anyone modified
> klp_target_state at this point, he also set TIF_PENDING_PATCH,
> so the change will not get lost.
> 
> I think that we should document that klp_patch_task() must be
> called only from a safe location within the affected task.
> 
> I even suggest to avoid misuse by removing the struct *task_struct
> parameter. It should always be called with current.

Would the race involve two tasks trying to call klp_patch_task() for the
same task at the same time?  If so I don't think that would be a problem
since they would both write the same value for task->patch_state.

(Sorry if I'm being slow, I think I've managed to reach my quota of hard
thinking for the day and I don't exactly follow what the race would be.)

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 15:51     ` Josh Poimboeuf
@ 2016-05-05  9:41       ` Miroslav Benes
  2016-05-05 13:06       ` Petr Mladek
  1 sibling, 0 replies; 121+ messages in thread
From: Miroslav Benes @ 2016-05-05  9:41 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Petr Mladek, Jessica Yu, Jiri Kosina, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed, 4 May 2016, Josh Poimboeuf wrote:

> On Wed, May 04, 2016 at 10:42:23AM +0200, Petr Mladek wrote:
> > On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > > Change livepatch to use a basic per-task consistency model.  This is the
> > > foundation which will eventually enable us to patch those ~10% of
> > > security patches which change function or data semantics.  This is the
> > > biggest remaining piece needed to make livepatch more generally useful.
> > 
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -76,6 +76,7 @@
> > >  #include <linux/compiler.h>
> > >  #include <linux/sysctl.h>
> > >  #include <linux/kcov.h>
> > > +#include <linux/livepatch.h>
> > >  
> > >  #include <asm/pgtable.h>
> > >  #include <asm/pgalloc.h>
> > > @@ -1586,6 +1587,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> > >  		p->parent_exec_id = current->self_exec_id;
> > >  	}
> > >  
> > > +	klp_copy_process(p);
> > 
> > I am in doubts here. We copy the state from the parent here. It means
> > that the new process might still need to be converted. But at the same
> > point print_context_stack_reliable() returns zero without printing
> > any stack trace when TIF_FORK flag is set. It means that a freshly
> > forked task might get converted immediately. It seems that both
> > operations are always done when copy_process() is called. But
> > they are contradicting each other.
> > 
> > I guess that print_context_stack_reliable() should either return
> > -EINVAL when TIF_FORK is set. Or it should try to print the
> > stack of the newly forked task.
> > 
> > Or do I miss something, please?
> 
> Ok, I admit it's confusing.
> 
> A newly forked task doesn't *have* a stack (other than the pt_regs frame
> it needs for the return to user space), which is why
> print_context_stack_reliable() returns success with an empty array of
> addresses.
> 
> For a little background, see the second switch_to() macro in
> arch/x86/include/asm/switch_to.h.  When a newly forked task runs for the
> first time, it returns from __switch_to() with no stack.  It then jumps
> straight to ret_from_fork in entry_64.S, calls a few C functions, and
> eventually returns to user space.  So, assuming we aren't patching entry
> code or the switch_to() macro in __schedule(), it should be safe to
> patch the task before it does all that.
> 
> With the current code, if an unpatched task gets forked, the child will
> also be unpatched.  In theory, we could go ahead and patch the child
> then.  In fact, that's what I did in v1.9.
> 
> But in v1.9 discussions it was pointed out that someday maybe the
> ret_from_fork stuff will get cleaned up and instead the child stack will
> be copied from the parent.  In that case the child should inherit its
> parent's patched state.  So we decided to make it more future-proof by
> having the child inherit the parent's patched state.
> 
> So, having said all that, I'm really not sure what the best approach is
> for print_context_stack_reliable().  Right now I'm thinking I'll change
> it back to return -EINVAL for a newly forked task, so it'll be more
> future-proof: better to have a false positive than a false negative.
> Either way it will probably need to be changed again if the
> ret_from_fork code gets cleaned up.

I'd be for returning -EINVAL. It is a safe play for now.

[...]
 
> > Finally, the following is always called right after
> > klp_start_transition(), so I would call it from there.
> > 
> > 	if (!klp_try_complete_transition())
> > 		klp_schedule_work();

On the other hand it is quite nice to see the sequence

init
start
try_complete

there. Just my 2 cents.

> Except for when it's called by klp_reverse_transition().  And it really
> depends on whether we want to allow transition.c to use the mutex.  I
> don't have a strong opinion either way, I may need to think about it
> some more.

Miroslav

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: barriers: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 17:02     ` Josh Poimboeuf
@ 2016-05-05 10:21       ` Petr Mladek
  0 siblings, 0 replies; 121+ messages in thread
From: Petr Mladek @ 2016-05-05 10:21 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed 2016-05-04 12:02:36, Josh Poimboeuf wrote:
> On Wed, May 04, 2016 at 02:39:40PM +0200, Petr Mladek wrote:
> > On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > > Change livepatch to use a basic per-task consistency model.  This is the
> > > foundation which will eventually enable us to patch those ~10% of
> > > security patches which change function or data semantics.  This is the
> > > biggest remaining piece needed to make livepatch more generally useful.
> > 
> > I spent a lot of time with checking the memory barriers. It seems that
> > they are basically correct.  Let me use my own words to show how
> > I understand it. I hope that it will help others with review.
> 
> [...snip a ton of useful comments...]
> 
> Thanks, this will help a lot!  I'll try to incorporate your barrier
> comments into the code.

Thanks a lot.

> I also agree that kpatch_patch_task() is poorly named.  I was trying to
> make it clear to external callers that "hey, the task is getting patched
> now!", but it's internally inconsistent with livepatch code because we
> make a distinction between patching and unpatching.
> 
> Maybe I'll do:
> 
>   klp_update_task_patch_state()

I like it. It is long, but it describes the purpose well.

Livepatch is using many state variables:

  + global:                klp_transition_patch, klp_target_state
  + per task specific:     TIF_PENDING_PATCH, patch_state
  + per each new function: transition, patched
  + per old function:      func_stack
  + per object:            patched, loaded
  + per patch:             enabled

The dependencies between them and the workflow are important for
creating a mental picture of livepatching. Good names
help with it.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: barriers: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 17:25       ` Josh Poimboeuf
@ 2016-05-05 11:21         ` Petr Mladek
  2016-05-09 15:42         ` Miroslav Benes
  1 sibling, 0 replies; 121+ messages in thread
From: Petr Mladek @ 2016-05-05 11:21 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed 2016-05-04 12:25:17, Josh Poimboeuf wrote:
> On Wed, May 04, 2016 at 04:12:05PM +0200, Petr Mladek wrote:
> > On Wed 2016-05-04 14:39:40, Petr Mladek wrote:
> > > 		 *
> > > 		 * Note that the task must never be migrated to the target
> > > 		 * state when being inside this ftrace handler.
> > > 		 */
> > > 
> > > We might want to move the second paragraph on top of the function.
> > > It is a basic and important fact. It actually explains why the first
> > > read barrier is not needed when the patch is being disabled.
> > 
> > I wrote the statement partly intuitively. I think that it is really
> > somehow important. And I am slightly in doubts if we are on the safe side.
> > 
> > First, why is it important that the task->patch_state is not switched
> > when being inside the ftrace handler?
> > 
> > If we are inside the handler, we are kind-of inside the called
> > function. And the basic idea of this consistency model is that
> > we must not switch a task when it is inside a patched function.
> > This is normally decided by the stack.
> > 
> > The handler is a bit special because it is called right before the
> > function. If it was the only patched function on the stack, it would
> > not matter if we choose the new or old code. Both decisions would
> > be safe for the moment.
> > 
> > The fun starts when the function calls another patched function.
> > The other patched function must be called consistently with
> > the first one. If the first function was from the patch,
> > the other must be from the patch as well and vice versa.
> > 
> > This is why we must not switch task->patch_state dangerously
> > when being inside the ftrace handler.
> > 
> > Now I am not sure if this condition is fulfilled. The ftrace handler
> > is called as the very first instruction of the function. Does not
> > it break the stack validity? Could we sleep inside the ftrace
> > handler? Will the patched function be detected on the stack?
> > 
> > Or is my brain already too far in the fantasy world?
> 
> I think this isn't a possibility.
> 
> In today's code base, this can't happen because task patch states are
> only switched when sleeping or when exiting the kernel.  The ftrace
> handler doesn't sleep directly.
> 
> If it were preempted, it couldn't be switched there either because we
> consider preempted stacks to be unreliable.

This was the missing piece.

> In theory, a DWARF stack trace of a preempted task *could* be reliable.
> But then the DWARF unwinder should be smart enough to see that the
> original function called the ftrace handler.  Right?  So the stack would
> be reliable, but then livepatch would see the original function on the
> stack and wouldn't switch the task.
> 
> Does that make sense?

Yup. I think that we are on the safe side. Thanks for the explanation.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: klp_task_patch: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 17:57     ` Josh Poimboeuf
@ 2016-05-05 11:57       ` Petr Mladek
  2016-05-06 12:38         ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Petr Mladek @ 2016-05-05 11:57 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed 2016-05-04 12:57:00, Josh Poimboeuf wrote:
> On Wed, May 04, 2016 at 04:48:54PM +0200, Petr Mladek wrote:
> > On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > > Change livepatch to use a basic per-task consistency model.  This is the
> > > foundation which will eventually enable us to patch those ~10% of
> > > security patches which change function or data semantics.  This is the
> > > biggest remaining piece needed to make livepatch more generally useful.
> > > 
> > > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > > new file mode 100644
> > > index 0000000..92819bb
> > > --- /dev/null
> > > +++ b/kernel/livepatch/transition.c
> > > +/*
> > > + * klp_patch_task() - change the patched state of a task
> > > + * @task:	The task to change
> > > + *
> > > + * Switches the patched state of the task to the set of functions in the target
> > > + * patch state.
> > > + */
> > > +void klp_patch_task(struct task_struct *task)
> > > +{
> > > +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> > > +
> > > +	/*
> > > +	 * The corresponding write barriers are in klp_init_transition() and
> > > +	 * klp_reverse_transition().  See the comments there for an explanation.
> > > +	 */
> > > +	smp_rmb();
> > > +
> > > +	task->patch_state = klp_target_state;
> > > +}
> > > +
> > > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > > index bd12c6c..60d633f 100644
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -9,6 +9,7 @@
> > >  #include <linux/mm.h>
> > >  #include <linux/stackprotector.h>
> > >  #include <linux/suspend.h>
> > > +#include <linux/livepatch.h>
> > >  
> > >  #include <asm/tlb.h>
> > >  
> > > @@ -266,6 +267,9 @@ static void cpu_idle_loop(void)
> > >  
> > >  		sched_ttwu_pending();
> > >  		schedule_preempt_disabled();
> > > +
> > > +		if (unlikely(klp_patch_pending(current)))
> > > +			klp_patch_task(current);
> > >  	}
> > 
> > Some more ideas from the world of crazy races. I was shaking my head
> > if this was safe or not.
> > 
> > The problem might be if the task gets rescheduled between the check
> > for the pending stuff and the call to klp_patch_task(), or inside the
> > klp_patch_task() function.
> > This will get even more important when we use this construct
> > on some more locations, e.g. in some kthreads.
> > 
> > If the task is sleeping at these strange locations, it might assign
> > strange values at strange times.
> > 
> > I think that it is safe only because it is called with the
> > 'current' parameter and on a safe locations. It means that
> > the result is always safe and consistent. Also we could assign
> > an outdated value only when sleeping between reading klp_target_state
> > and storing task->patch_state. But if anyone modified
> > klp_target_state at this point, he also set TIF_PENDING_PATCH,
> > so the change will not get lost.
> > 
> > I think that we should document that klp_patch_task() must be
> > called only from a safe location within the affected task.
> > 
> > I even suggest to avoid misuse by removing the struct *task_struct
> > parameter. It should always be called with current.
> 
> Would the race involve two tasks trying to call klp_patch_task() for the
> same task at the same time?  If so I don't think that would be a problem
> since they would both write the same value for task->patch_state.

I have missed that the two commands are called with preemption
disabled. So, I had the following crazy scenario in mind:


CPU0				CPU1

klp_enable_patch()

  klp_target_state = KLP_PATCHED;

  for_each_task()
     set TIF_PENDING_PATCH

				# task 123

				if (klp_patch_pending(current)
				  klp_patch_task(current)

                                    clear TIF_PENDING_PATCH

				    smp_rmb();

				    # switch to assembly of
				    # klp_patch_task()

				    mov klp_target_state, %r12

				    # interrupt and schedule
				    # another task


  klp_reverse_transition();

    klp_target_state = KLP_UNPATCHED;

    klt_try_to_complete_transition()

      task = 123;
      if (task->patch_state == klp_target_state;
         return 0;

    => task 123 is in target state and does
    not block conversion

  klp_complete_transition()


  # disable previous patch on the stack
  klp_disable_patch();

    klp_target_state = KLP_UNPATCHED;
  
  
				    # task 123 gets scheduled again
				    lea %r12, task->patch_state

				    => it happily stores an outdated
				    state


This is why the two functions should get called with preemption
disabled. We should document it at least. I imagine that we will
use them later also in another context and nobody will remember
this crazy scenario.

Well, even disabled preemption does not help. The process on
CPU1 might be also interrupted by an NMI and do some long
printk in it.

IMHO, the only safe approach is to call klp_patch_task()
only for "current" on a safe place. Then this race is harmless.
The switch happen on a safe place, so that it does not matter
into which state the process is switched.

By other words, the task state might be updated only

   + by the task itself on a safe place
   + by other task when the updated on is sleeping on a safe place

This should be well documented and the API should help to avoid
a misuse.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 15:51     ` Josh Poimboeuf
  2016-05-05  9:41       ` Miroslav Benes
@ 2016-05-05 13:06       ` Petr Mladek
  1 sibling, 0 replies; 121+ messages in thread
From: Petr Mladek @ 2016-05-05 13:06 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed 2016-05-04 10:51:21, Josh Poimboeuf wrote:
> On Wed, May 04, 2016 at 10:42:23AM +0200, Petr Mladek wrote:
> > On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > > Change livepatch to use a basic per-task consistency model.  This is the
> > > foundation which will eventually enable us to patch those ~10% of
> > > security patches which change function or data semantics.  This is the
> > > biggest remaining piece needed to make livepatch more generally useful.
> > 
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -76,6 +76,7 @@
> > >  #include <linux/compiler.h>
> > >  #include <linux/sysctl.h>
> > >  #include <linux/kcov.h>
> > > +#include <linux/livepatch.h>
> > >  
> > >  #include <asm/pgtable.h>
> > >  #include <asm/pgalloc.h>
> > > @@ -1586,6 +1587,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> > >  		p->parent_exec_id = current->self_exec_id;
> > >  	}
> > >  
> > > +	klp_copy_process(p);
> > 
> > I am in doubts here. We copy the state from the parent here. It means
> > that the new process might still need to be converted. But at the same
> > point print_context_stack_reliable() returns zero without printing
> > any stack trace when TIF_FORK flag is set. It means that a freshly
> > forked task might get converted immediately. It seems that both
> > operations are always done when copy_process() is called. But
> > they are contradicting each other.
> > 
> > I guess that print_context_stack_reliable() should either return
> > -EINVAL when TIF_FORK is set. Or it should try to print the
> > stack of the newly forked task.
> > 
> > Or do I miss something, please?
> 
> Ok, I admit it's confusing.
> 
> A newly forked task doesn't *have* a stack (other than the pt_regs frame
> it needs for the return to user space), which is why
> print_context_stack_reliable() returns success with an empty array of
> addresses.
> 
> For a little background, see the second switch_to() macro in
> arch/x86/include/asm/switch_to.h.  When a newly forked task runs for the
> first time, it returns from __switch_to() with no stack.  It then jumps
> straight to ret_from_fork in entry_64.S, calls a few C functions, and
> eventually returns to user space.  So, assuming we aren't patching entry
> code or the switch_to() macro in __schedule(), it should be safe to
> patch the task before it does all that.

This is great explanation. Thanks for it.

> So, having said all that, I'm really not sure what the best approach is
> for print_context_stack_reliable().  Right now I'm thinking I'll change
> it back to return -EINVAL for a newly forked task, so it'll be more
> future-proof: better to have a false positive than a false negative.
> Either way it will probably need to be changed again if the
> ret_from_fork code gets cleaned up.

I would prefer the -EINVAL. It might save some hair when anyone
is working on patching the switch_to stuff. Also it is not that
big a loss because most tasks will get migrated on the return to
userspace.

It might help a bit with the newly forked kthreads. But there should
be a safer location where the new kthreads might get migrated,
e.g. right before the main function gets called.
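
Something like this hypothetical kthread, just to illustrate the kind of
safe spot I mean (using the helpers from this patch set):

	static int example_kthread_fn(void *data)
	{
		/* nothing patched is on the stack yet, so migration is safe */
		if (klp_patch_pending(current))
			klp_patch_task(current);

		while (!kthread_should_stop()) {
			/* ... the kthread's real work ... */
			schedule();
		}

		return 0;
	}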


> > > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > > new file mode 100644
> > > index 0000000..92819bb
> > > --- /dev/null
> > > +++ b/kernel/livepatch/transition.c
> > > +/*
> > > + * This function can be called in the middle of an existing transition to
> > > + * reverse the direction of the target patch state.  This can be done to
> > > + * effectively cancel an existing enable or disable operation if there are any
> > > + * tasks which are stuck in the initial patch state.
> > > + */
> > > +void klp_reverse_transition(void)
> > > +{
> > > +	struct klp_patch *patch = klp_transition_patch;
> > > +
> > > +	klp_target_state = !klp_target_state;
> > > +
> > > +	/*
> > > +	 * Ensure that if another CPU goes through the syscall barrier, sees
> > > +	 * the TIF_PATCH_PENDING writes in klp_start_transition(), and calls
> > > +	 * klp_patch_task(), it also sees the above write to the target state.
> > > +	 * Otherwise it can put the task in the wrong universe.
> > > +	 */
> > > +	smp_wmb();
> > > +
> > > +	klp_start_transition();
> > > +	klp_try_complete_transition();
> > 
> > It is a bit strange that we keep the work scheduled. It might be
> > better to use
> > 
> >        mod_delayed_work(system_wq, &klp_work, 0);
> 
> True, I think that would be better.
> 
> > Which triggers more ideas from the nitpicking department:
> > 
> > I would move the work definition from core.c to transition.c because
> > it is closely related to klp_try_complete_transition();
> 
> That could be good, but there's a slight problem: klp_work_fn() requires
> klp_mutex, which is static to core.c.  It's kind of nice to keep the use
> of the mutex in core.c only.

I see and am surprised that we take the lock only in core.c ;-)

I do not have a strong opinion then. Just a small one. The lock guards
also operations from the other .c files. I think that it is only a matter
of time before we will need to access it there. But the work is clearly
transition-related. But this is real nitpicking. I am sorry for it.

> > When on it. I would make it more clear that the work is related
> > to transition.
> 
> How would you recommend doing that?  How about:
> 
> - rename "klp_work" -> "klp_transition_work"
> - rename "klp_work_fn" -> "klp_transition_work_fn" 

Yup, sounds better.

> > Also I would call queue_delayed_work() directly
> > instead of adding the klp_schedule_work() wrapper. The delay
> > might be defined using a constant, e.g.
> > 
> > #define KLP_TRANSITION_DELAY round_jiffies_relative(HZ)
> > 
> > queue_delayed_work(system_wq, &klp_transition_work, KLP_TRANSITION_DELAY);
> 
> Sure.
> 
> > Finally, the following is always called right after
> > klp_start_transition(), so I would call it from there.
> > 
> > 	if (!klp_try_complete_transition())
> > 		klp_schedule_work();
> 
> Except for when it's called by klp_reverse_transition().  And it really
> depends on whether we want to allow transition.c to use the mutex.  I
> don't have a strong opinion either way, I may need to think about it
> some more.

Ah, I had in mind that it could be replaced with this:

	mod_delayed_work(system_wq, &klp_transition_work, 0);

So that we would never call klp_try_complete_transition()
directly. Then it could be the same in all situations. But
it might look strange and be ineffective when really starting
the transition. So, maybe forget about it.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
                     ` (2 preceding siblings ...)
  2016-05-04 14:48   ` klp_task_patch: " Petr Mladek
@ 2016-05-06 11:33   ` Petr Mladek
  2016-05-06 12:44     ` Josh Poimboeuf
  2016-05-09  9:41   ` Miroslav Benes
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 121+ messages in thread
From: Petr Mladek @ 2016-05-06 11:33 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> index 782fbb5..b3b8639 100644
> --- a/kernel/livepatch/patch.c
> +++ b/kernel/livepatch/patch.c
> @@ -29,6 +29,7 @@
>  #include <linux/bug.h>
>  #include <linux/printk.h>
>  #include "patch.h"
> +#include "transition.h"
>  
>  static LIST_HEAD(klp_ops);
>  
> @@ -58,11 +59,42 @@ static void notrace klp_ftrace_handler(unsigned long ip,
>  	ops = container_of(fops, struct klp_ops, fops);
>  
>  	rcu_read_lock();
> +
>  	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
>  				      stack_node);
> -	if (WARN_ON_ONCE(!func))
> +
> +	if (!func)
>  		goto unlock;
>  
> +	/*
> +	 * See the comment for the 2nd smp_wmb() in klp_init_transition() for
> +	 * an explanation of why this read barrier is needed.
> +	 */
> +	smp_rmb();
> +
> +	if (unlikely(func->transition)) {
> +
> +		/*
> +		 * See the comment for the 1st smp_wmb() in
> +		 * klp_init_transition() for an explanation of why this read
> +		 * barrier is needed.
> +		 */
> +		smp_rmb();

I would add here:

		WARN_ON_ONCE(current->patch_state == KLP_UNDEFINED);

We do not know in which context this is called, so the printk's are
not ideal. But it will get triggered only if there is a bug in
the livepatch implementation. It should happen at random locations
and rather early when a bug is introduced.

Anyway, better to die and catch the bug than let the system run
in an undefined state and produce cryptic errors later on.


> +		if (current->patch_state == KLP_UNPATCHED) {
> +			/*
> +			 * Use the previously patched version of the function.
> +			 * If no previous patches exist, use the original
> +			 * function.
> +			 */
> +			func = list_entry_rcu(func->stack_node.next,
> +					      struct klp_func, stack_node);
> +
> +			if (&func->stack_node == &ops->func_stack)
> +				goto unlock;
> +		}
> +	}

I am staring into the code for too long now. I need to step back for a
while. I'll do another look when you send the next version. Anyway,
you did a great work. I speak mainly for the livepatch part and
I like it.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: klp_task_patch: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-05 11:57       ` Petr Mladek
@ 2016-05-06 12:38         ` Josh Poimboeuf
  2016-05-09 12:23           ` Petr Mladek
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-06 12:38 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Thu, May 05, 2016 at 01:57:01PM +0200, Petr Mladek wrote:
> I have missed that the two commands are called with preemption
> disabled. So, I had the following crazy scenario in mind:
> 
> 
> CPU0				CPU1
> 
> klp_enable_patch()
> 
>   klp_target_state = KLP_PATCHED;
> 
>   for_each_task()
>      set TIF_PENDING_PATCH
> 
> 				# task 123
> 
> 				if (klp_patch_pending(current)
> 				  klp_patch_task(current)
> 
>                                     clear TIF_PENDING_PATCH
> 
> 				    smp_rmb();
> 
> 				    # switch to assembly of
> 				    # klp_patch_task()
> 
> 				    mov klp_target_state, %r12
> 
> 				    # interrupt and schedule
> 				    # another task
> 
> 
>   klp_reverse_transition();
> 
>     klp_target_state = KLP_UNPATCHED;
> 
>     klt_try_to_complete_transition()
> 
>       task = 123;
>       if (task->patch_state == klp_target_state;
>          return 0;
> 
>     => task 123 is in target state and does
>     not block conversion
> 
>   klp_complete_transition()
> 
> 
>   # disable previous patch on the stack
>   klp_disable_patch();
> 
>     klp_target_state = KLP_UNPATCHED;
>   
>   
> 				    # task 123 gets scheduled again
> 				    lea %r12, task->patch_state
> 
> 				    => it happily stores an outdated
> 				    state
> 

Thanks for the clear explanation, this helps a lot.

> This is why the two functions should get called with preemption
> disabled. We should document it at least. I imagine that we will
> use them later also in another context and nobody will remember
> this crazy scenario.
> 
> Well, even disabled preemption does not help. The process on
> CPU1 might be also interrupted by an NMI and do some long
> printk in it.
> 
> IMHO, the only safe approach is to call klp_patch_task()
> only for "current" on a safe place. Then this race is harmless.
> The switch happen on a safe place, so that it does not matter
> into which state the process is switched.

I'm not sure about this solution.  When klp_complete_transition() is
called, we need all tasks to be patched, for good.  We don't want any of
them to randomly switch to the wrong state at some later time in the
middle of a future patch operation.  How would changing klp_patch_task()
to only use "current" prevent that?

> By other words, the task state might be updated only
> 
>    + by the task itself on a safe place
>    + by other task when the updated on is sleeping on a safe place
> 
> This should be well documented and the API should help to avoid
> a misuse.

I think we could fix it to be safe for future callers who might not have
preemption disabled with a couple of changes to klp_patch_task():
disabling preemption and testing/clearing the TIF_PATCH_PENDING flag
before changing the patch state:

  void klp_patch_task(struct task_struct *task)
  {
  	preempt_disable();
  
  	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
  		task->patch_state = READ_ONCE(klp_target_state);
  
  	preempt_enable();
  }

We would also need a synchronize_sched() after the patching is complete,
either at the end of klp_try_complete_transition() or in
klp_complete_transition().  That would make sure that all existing calls
to klp_patch_task() are done.
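
Roughly (sketch only, exact placement to be decided):

	/*
	 * Wait for all preempt-disabled regions to finish, so any in-flight
	 * klp_patch_task() calls above are done before the transition is
	 * declared complete.
	 */
	synchronize_sched();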

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-06 11:33   ` Petr Mladek
@ 2016-05-06 12:44     ` Josh Poimboeuf
  0 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-06 12:44 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Fri, May 06, 2016 at 01:33:01PM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> > index 782fbb5..b3b8639 100644
> > --- a/kernel/livepatch/patch.c
> > +++ b/kernel/livepatch/patch.c
> > @@ -29,6 +29,7 @@
> >  #include <linux/bug.h>
> >  #include <linux/printk.h>
> >  #include "patch.h"
> > +#include "transition.h"
> >  
> >  static LIST_HEAD(klp_ops);
> >  
> > @@ -58,11 +59,42 @@ static void notrace klp_ftrace_handler(unsigned long ip,
> >  	ops = container_of(fops, struct klp_ops, fops);
> >  
> >  	rcu_read_lock();
> > +
> >  	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
> >  				      stack_node);
> > -	if (WARN_ON_ONCE(!func))
> > +
> > +	if (!func)
> >  		goto unlock;
> >  
> > +	/*
> > +	 * See the comment for the 2nd smp_wmb() in klp_init_transition() for
> > +	 * an explanation of why this read barrier is needed.
> > +	 */
> > +	smp_rmb();
> > +
> > +	if (unlikely(func->transition)) {
> > +
> > +		/*
> > +		 * See the comment for the 1st smp_wmb() in
> > +		 * klp_init_transition() for an explanation of why this read
> > +		 * barrier is needed.
> > +		 */
> > +		smp_rmb();
> 
> I would add here:
> 
> 		WARN_ON_ONCE(current->patch_state == KLP_UNDEFINED);
> 
> We do not know in which context this is called, so the printk's are
> not ideal. But it will get triggered only if there is a bug in
> the livepatch implementation. It should happen at random locations
> and rather early when a bug is introduced.
> 
> Anyway, better to die and catch the bug than let the system run
> in an undefined state and produce cryptic errors later on.

Ok, makes sense.

> > +		if (current->patch_state == KLP_UNPATCHED) {
> > +			/*
> > +			 * Use the previously patched version of the function.
> > +			 * If no previous patches exist, use the original
> > +			 * function.
> > +			 */
> > +			func = list_entry_rcu(func->stack_node.next,
> > +					      struct klp_func, stack_node);
> > +
> > +			if (&func->stack_node == &ops->func_stack)
> > +				goto unlock;
> > +		}
> > +	}
> 
> I am staring into the code for too long now. I need to step back for a
> while. I'll do another look when you send the next version. Anyway,
> you did a great work. I speak mainly for the livepatch part and
> I like it.

Thanks for the helpful reviews!  I'll be on vacation again next week so
I get a break too :-)

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
                     ` (3 preceding siblings ...)
  2016-05-06 11:33   ` Petr Mladek
@ 2016-05-09  9:41   ` Miroslav Benes
  2016-05-16 17:27     ` Josh Poimboeuf
  2016-05-10 11:39   ` Miroslav Benes
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 121+ messages in thread
From: Miroslav Benes @ 2016-05-09  9:41 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Ingo Molnar, Peter Zijlstra,
	Michael Ellerman, Heiko Carstens, live-patching, linux-kernel,
	x86, linuxppc-dev, linux-s390, Vojtech Pavlik, Jiri Slaby,
	Petr Mladek, Chris J Arges, Andy Lutomirski


[...]

> +static int klp_target_state;

[...]

> +void klp_init_transition(struct klp_patch *patch, int state)
> +{
> +	struct task_struct *g, *task;
> +	unsigned int cpu;
> +	struct klp_object *obj;
> +	struct klp_func *func;
> +	int initial_state = !state;
> +
> +	klp_transition_patch = patch;
> +
> +	/*
> +	 * If the patch can be applied or reverted immediately, skip the
> +	 * per-task transitions.
> +	 */
> +	if (patch->immediate)
> +		return;
> +
> +	/*
> +	 * Initialize all tasks to the initial patch state to prepare them for
> +	 * switching to the target state.
> +	 */
> +	read_lock(&tasklist_lock);
> +	for_each_process_thread(g, task)
> +		task->patch_state = initial_state;
> +	read_unlock(&tasklist_lock);
> +
> +	/*
> +	 * Ditto for the idle "swapper" tasks.
> +	 */
> +	get_online_cpus();
> +	for_each_online_cpu(cpu)
> +		idle_task(cpu)->patch_state = initial_state;
> +	put_online_cpus();
> +
> +	/*
> +	 * Ensure klp_ftrace_handler() sees the task->patch_state updates
> +	 * before the func->transition updates.  Otherwise it could read an
> +	 * out-of-date task state and pick the wrong function.
> +	 */
> +	smp_wmb();
> +
> +	/*
> +	 * Set the func transition states so klp_ftrace_handler() will know to
> +	 * switch to the transition logic.
> +	 *
> +	 * When patching, the funcs aren't yet in the func_stack and will be
> +	 * made visible to the ftrace handler shortly by the calls to
> +	 * klp_patch_object().
> +	 *
> +	 * When unpatching, the funcs are already in the func_stack and so are
> +	 * already visible to the ftrace handler.
> +	 */
> +	klp_for_each_object(patch, obj)
> +		klp_for_each_func(obj, func)
> +			func->transition = true;
> +
> +	/*
> +	 * Set the global target patch state which tasks will switch to.  This
> +	 * has no effect until the TIF_PATCH_PENDING flags get set later.
> +	 */
> +	klp_target_state = state;

I am afraid there is a problem for (patch->immediate == true) patches. 
klp_target_state is not set for those and the comment is not entirely 
true, because klp_target_state has an effect in several places.

[...]

> +void klp_start_transition(void)
> +{
> +	struct task_struct *g, *task;
> +	unsigned int cpu;
> +
> +	pr_notice("'%s': %s...\n", klp_transition_patch->mod->name,
> +		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");

Here...

> +
> +	/*
> +	 * If the patch can be applied or reverted immediately, skip the
> +	 * per-task transitions.
> +	 */
> +	if (klp_transition_patch->immediate)
> +		return;
> +

[...]

> +bool klp_try_complete_transition(void)
> +{
> +	unsigned int cpu;
> +	struct task_struct *g, *task;
> +	bool complete = true;
> +
> +	/*
> +	 * If the patch can be applied or reverted immediately, skip the
> +	 * per-task transitions.
> +	 */
> +	if (klp_transition_patch->immediate)
> +		goto success;
> +
> +	/*
> +	 * Try to switch the tasks to the target patch state by walking their
> +	 * stacks and looking for any to-be-patched or to-be-unpatched
> +	 * functions.  If such functions are found on a stack, or if the stack
> +	 * is deemed unreliable, the task can't be switched yet.
> +	 *
> +	 * Usually this will transition most (or all) of the tasks on a system
> +	 * unless the patch includes changes to a very common function.
> +	 */
> +	read_lock(&tasklist_lock);
> +	for_each_process_thread(g, task)
> +		if (!klp_try_switch_task(task))
> +			complete = false;
> +	read_unlock(&tasklist_lock);
> +
> +	/*
> +	 * Ditto for the idle "swapper" tasks.
> +	 */
> +	get_online_cpus();
> +	for_each_online_cpu(cpu)
> +		if (!klp_try_switch_task(idle_task(cpu)))
> +			complete = false;
> +	put_online_cpus();
> +
> +	/*
> +	 * Some tasks weren't able to be switched over.  Try again later and/or
> +	 * wait for other methods like syscall barrier switching.
> +	 */
> +	if (!complete)
> +		return false;
> +
> +success:
> +	/*
> +	 * When unpatching, all tasks have transitioned to KLP_UNPATCHED so we
> +	 * can now remove the new functions from the func_stack.
> +	 */
> +	if (klp_target_state == KLP_UNPATCHED) {

Here (this is the most important one I think).

> +		klp_unpatch_objects(klp_transition_patch);
> +
> +		/*
> +		 * Don't allow any existing instances of ftrace handlers to
> +		 * access any obsolete funcs before we reset the func
> +		 * transition states to false.  Otherwise the handler may see
> +		 * the deleted "new" func, see that it's not in transition, and
> +		 * wrongly pick the new version of the function.
> +		 */
> +		synchronize_rcu();
> +	}
> +
> +	pr_notice("'%s': %s complete\n", klp_transition_patch->mod->name,
> +		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");

Here

> +
> +	/* we're done, now cleanup the data structures */
> +	klp_complete_transition();
> +
> +	return true;
> +}
> +
> +/*
> + * This function can be called in the middle of an existing transition to
> + * reverse the direction of the target patch state.  This can be done to
> + * effectively cancel an existing enable or disable operation if there are any
> + * tasks which are stuck in the initial patch state.
> + */
> +void klp_reverse_transition(void)
> +{
> +	struct klp_patch *patch = klp_transition_patch;
> +
> +	klp_target_state = !klp_target_state;

And probably here.

All other references look safe.

I guess we need to set klp_target_state even for immediate patches. Should 
we also initialize it with KLP_UNDEFINED and set it to KLP_UNDEFINED in 
klp_complete_transition()?
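
Sketch of what I mean (assuming a KLP_UNDEFINED constant next to
KLP_UNPATCHED/KLP_PATCHED):

	static int klp_target_state = KLP_UNDEFINED;

	void klp_init_transition(struct klp_patch *patch, int state)
	{
		klp_transition_patch = patch;

		/* set the target state even for patch->immediate patches */
		klp_target_state = state;

		if (patch->immediate)
			return;

		/* ... the existing per-task and per-func initialization ... */
	}

	void klp_complete_transition(void)
	{
		/* ... the existing cleanup ... */

		klp_target_state = KLP_UNDEFINED;
	}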

Miroslav

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: klp_task_patch: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-06 12:38         ` Josh Poimboeuf
@ 2016-05-09 12:23           ` Petr Mladek
  2016-05-16 18:12             ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Petr Mladek @ 2016-05-09 12:23 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Fri 2016-05-06 07:38:55, Josh Poimboeuf wrote:
> On Thu, May 05, 2016 at 01:57:01PM +0200, Petr Mladek wrote:
> > I have missed that the two commands are called with preemption
> > disabled. So, I had the following crazy scenario in mind:
> > 
> > 
> > CPU0				CPU1
> > 
> > klp_enable_patch()
> > 
> >   klp_target_state = KLP_PATCHED;
> > 
> >   for_each_task()
> >      set TIF_PENDING_PATCH
> > 
> > 				# task 123
> > 
> > 				if (klp_patch_pending(current)
> > 				  klp_patch_task(current)
> > 
> >                                     clear TIF_PENDING_PATCH
> > 
> > 				    smp_rmb();
> > 
> > 				    # switch to assembly of
> > 				    # klp_patch_task()
> > 
> > 				    mov klp_target_state, %r12
> > 
> > 				    # interrupt and schedule
> > 				    # another task
> > 
> > 
> >   klp_reverse_transition();
> > 
> >     klp_target_state = KLP_UNPATCHED;
> > 
> >     klt_try_to_complete_transition()
> > 
> >       task = 123;
> >       if (task->patch_state == klp_target_state;
> >          return 0;
> > 
> >     => task 123 is in target state and does
> >     not block conversion
> > 
> >   klp_complete_transition()
> > 
> > 
> >   # disable previous patch on the stack
> >   klp_disable_patch();
> > 
> >     klp_target_state = KLP_UNPATCHED;
> >   
> >   
> > 				    # task 123 gets scheduled again
> > 				    lea %r12, task->patch_state
> > 
> > 				    => it happily stores an outdated
> > 				    state
> > 
> 
> Thanks for the clear explanation, this helps a lot.
> 
> > This is why the two functions should get called with preemption
> > disabled. We should document it at least. I imagine that we will
> > use them later also in another context and nobody will remember
> > this crazy scenario.
> > 
> > Well, even disabled preemption does not help. The process on
> > CPU1 might be also interrupted by an NMI and do some long
> > printk in it.
> > 
> > IMHO, the only safe approach is to call klp_patch_task()
> > only for "current" on a safe place. Then this race is harmless.
> > The switch happen on a safe place, so that it does not matter
> > into which state the process is switched.
> 
> I'm not sure about this solution.  When klp_complete_transition() is
> called, we need all tasks to be patched, for good.  We don't want any of
> them to randomly switch to the wrong state at some later time in the
> middle of a future patch operation.  How would changing klp_patch_task()
> to only use "current" prevent that?

You are right that it is a pity, but it really should be safe because
it is not entirely random.

If the race happens and assigns an outdated value, there are two
situations:

1. It is assigned when there is no transition in progress.
   Then it is OK because it will be ignored by the ftrace handler.
   The right state will be set before the next transition starts.

2. It is assigned when some other transition is in progress.
   Then it is OK as long as the function is called from "current".
   The "wrong" state will be used consistently. It will switch
   to the right state at another safe place.


> > By other words, the task state might be updated only
> > 
> >    + by the task itself on a safe place
> >    + by other task when the updated on is sleeping on a safe place
> > 
> > This should be well documented and the API should help to avoid
> > a misuse.
> 
> I think we could fix it to be safe for future callers who might not have
> preemption disabled with a couple of changes to klp_patch_task():
> disabling preemption and testing/clearing the TIF_PATCH_PENDING flag
> before changing the patch state:
> 
>   void klp_patch_task(struct task_struct *task)
>   {
>   	preempt_disable();
>   
>   	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
>   		task->patch_state = READ_ONCE(klp_target_state);
>   
>   	preempt_enable();
>   }

It reduces the race window a bit but it is still there. For example,
an NMI might still add a huge delay between reading klp_target_state
and assigning task->patch_state.

What about the following?

/*
 * This function might assign an outdated value if the transaction
 * is reverted and finalized in parallel. But it is safe. If the value
 * is assigned outside of a transaction, it is ignored and the next
 * transaction will set the right one. Or if it gets assigned
 * inside another transaction, it will repeat the cycle and
 * set the right state.
 */
void klp_update_current_patch_state()
{
	while (test_and_clear_tsk_thread_flag(current, TIF_PATCH_PENDING))
		current->patch_state = READ_ONCE(klp_target_state);
}

Note that the disabled preemption helped only partially,
so I think that it was not really needed.

Hmm, it means that the task->patch_state  might be either
KLP_PATCHED or KLP_UNPATCHED outside a transition. I wonder
if the tristate really brings some advantages.
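
For reference, the three values in question -- just a sketch inferred from
the /proc/<pid>/patch_state semantics described in the documentation patch;
the exact constant names here are assumed, not quoted from the patch:

#define KLP_UNDEFINED	-1	/* outside of any transition */
#define KLP_UNPATCHED	 0	/* task still runs the old code */
#define KLP_PATCHED	 1	/* task runs the new code */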


Alternatively, we might synchronize the operation with klp_mutex.
The function is called in a slow path and in a safe context.
Well, it might cause contention on the lock when many CPUs are
trying to update their tasks.

Best Regards,
Petr


* Re: barriers: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-04 17:25       ` Josh Poimboeuf
  2016-05-05 11:21         ` Petr Mladek
@ 2016-05-09 15:42         ` Miroslav Benes
  1 sibling, 0 replies; 121+ messages in thread
From: Miroslav Benes @ 2016-05-09 15:42 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Petr Mladek, Jessica Yu, Jiri Kosina, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Wed, 4 May 2016, Josh Poimboeuf wrote:

> On Wed, May 04, 2016 at 04:12:05PM +0200, Petr Mladek wrote:
> > On Wed 2016-05-04 14:39:40, Petr Mladek wrote:
> > > 		 *
> > > 		 * Note that the task must never be migrated to the target
> > > 		 * state when being inside this ftrace handler.
> > > 		 */
> > > 
> > > We might want to move the second paragraph on top of the function.
> > > It is a basic and important fact. It actually explains why the first
> > > read barrier is not needed when the patch is being disabled.
> > 
> > I wrote the statement partly intuitively. I think that it is really
> > somewhat important. And I have slight doubts whether we are on the safe side.
> > 
> > First, why is it important that the task->patch_state is not switched
> > when being inside the ftrace handler?
> > 
> > If we are inside the handler, we are kind-of inside the called
> > function. And the basic idea of this consistency model is that
> > we must not switch a task when it is inside a patched function.
> > This is normally decided by the stack.
> > 
> > The handler is a bit special because it is called right before the
> > function. If it was the only patched function on the stack, it would
> > not matter if we choose the new or old code. Both decisions would
> > be safe for the moment.
> > 
> > The fun starts when the function calls another patched function.
> > The other patched function must be called consistently with
> > the first one. If the first function was from the patch,
> > the other must be from the patch as well and vice versa.
> > 
> > This is why we must not switch task->patch_state dangerously
> > when being inside the ftrace handler.
> > 
> > Now I am not sure if this condition is fulfilled. The ftrace handler
> > is called as the very first instruction of the function. Doesn't
> > it break the stack validity? Could we sleep inside the ftrace
> > handler? Will the patched function be detected on the stack?
> > 
> > Or is my brain already too far in the fantasy world?
> 
> I think this isn't a possibility.
> 
> In today's code base, this can't happen because task patch states are
> only switched when sleeping or when exiting the kernel.  The ftrace
> handler doesn't sleep directly.
> 
> If it were preempted, it couldn't be switched there either because we
> consider preempted stacks to be unreliable.

And IIRC ftrace handlers cannot sleep and are called with preemption
disabled as of now. The code is a bit obscure, but see
__ftrace_ops_list_func for example. This is the "main" ftrace handler that
calls all the registered ones when FTRACE_OPS_FL_DYNAMIC is set (which
is always true for handlers coming from modules) and CONFIG_PREEMPT is
on. If it is off and there is only one handler registered for a function,
a dynamic trampoline is used. See commit 12cce594fa8f ("ftrace/x86: Allow
!CONFIG_PREEMPT dynamic ops to use allocated trampolines"). I think
Steven had a plan to implement dynamic trampolines even for the
CONFIG_PREEMPT case, but he still hasn't done it. It should use the
RCU_TASKS infrastructure.

The reason for all the mess is that ftrace needs to be sure that no task 
is in the handler when the handler/trampoline is freed.

So we should be safe for now even from this side.
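
For illustration, a rough sketch of the handler shape being discussed (not
the actual code from patch 17; top_of_func_stack() and prev_func_on_stack()
are hypothetical helpers standing in for the real func_stack walking):

static void notrace klp_ftrace_handler(unsigned long ip,
				       unsigned long parent_ip,
				       struct ftrace_ops *fops,
				       struct pt_regs *regs)
{
	struct klp_func *func;

	/* atomic context: preemption disabled by ftrace, no sleeping */
	rcu_read_lock();

	func = top_of_func_stack(fops);			/* hypothetical */
	if (!func)
		goto out;

	/*
	 * During a transition, pick the old or new code based on the
	 * task's own state; the task must not be migrated to the target
	 * state while it is in here.
	 */
	if (func->transition && current->patch_state == KLP_UNPATCHED)
		func = prev_func_on_stack(func);	/* hypothetical */

	if (func)
		klp_arch_set_pc(regs, (unsigned long)func->new_func);
out:
	rcu_read_unlock();
}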

Miroslav


* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
                     ` (4 preceding siblings ...)
  2016-05-09  9:41   ` Miroslav Benes
@ 2016-05-10 11:39   ` Miroslav Benes
  2016-05-17 22:53   ` Jessica Yu
  2016-06-06 13:54   ` [RFC PATCH v2 17/18] " Petr Mladek
  7 siblings, 0 replies; 121+ messages in thread
From: Miroslav Benes @ 2016-05-10 11:39 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Ingo Molnar, Peter Zijlstra,
	Michael Ellerman, Heiko Carstens, live-patching, linux-kernel,
	x86, linuxppc-dev, linux-s390, Vojtech Pavlik, Jiri Slaby,
	Petr Mladek, Chris J Arges, Andy Lutomirski

On Thu, 28 Apr 2016, Josh Poimboeuf wrote:

> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.
> 
> This code stems from the design proposal made by Vojtech [1] in November
> 2014.  It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
> consistency and syscall barrier switching combined with kpatch's stack
> trace switching.  There are also a number of fallback options which make
> it quite flexible.
> 
> Patches are applied on a per-task basis, when the task is deemed safe to
> switch over.  When a patch is enabled, livepatch enters into a
> transition state where tasks are converging to the patched state.
> Usually this transition state can complete in a few seconds.  The same
> sequence occurs when a patch is disabled, except the tasks converge from
> the patched state to the unpatched state.
> 
> An interrupt handler inherits the patched state of the task it
> interrupts.  The same is true for forked tasks: the child inherits the
> patched state of the parent.
> 
> Livepatch uses several complementary approaches to determine when it's
> safe to patch tasks:
> 
> 1. The first and most effective approach is stack checking of sleeping
>    tasks.  If no affected functions are on the stack of a given task,
>    the task is patched.  In most cases this will patch most or all of
>    the tasks on the first try.  Otherwise it'll keep trying
>    periodically.  This option is only available if the architecture has
>    reliable stacks (CONFIG_RELIABLE_STACKTRACE and
>    CONFIG_STACK_VALIDATION).
> 
> 2. The second approach, if needed, is kernel exit switching.  A
>    task is switched when it returns to user space from a system call, a
>    user space IRQ, or a signal.  It's useful in the following cases:
> 
>    a) Patching I/O-bound user tasks which are sleeping on an affected
>       function.  In this case you have to send SIGSTOP and SIGCONT to
>       force it to exit the kernel and be patched.
>    b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
>       then it will get patched the next time it gets interrupted by an
>       IRQ.
>    c) Applying patches for architectures which don't yet have
>       CONFIG_RELIABLE_STACKTRACE.  In this case you'll have to signal
>       most of the tasks on the system.  However this isn't a complete
>       solution, because there's currently no way to patch kthreads
>       without CONFIG_RELIABLE_STACKTRACE.
> 
>    Note: since idle "swapper" tasks don't ever exit the kernel, they
>    instead have a kpatch_patch_task() call in the idle loop which allows

s/kpatch_patch_task()/klp_patch_task()/

[...]

> --- a/Documentation/livepatch/livepatch.txt
> +++ b/Documentation/livepatch/livepatch.txt
> @@ -72,7 +72,8 @@ example, they add a NULL pointer or a boundary check, fix a race by adding
>  a missing memory barrier, or add some locking around a critical section.
>  Most of these changes are self contained and the function presents itself
>  the same way to the rest of the system. In this case, the functions might
> -be updated independently one by one.
> +be updated independently one by one.  (This can be done by setting the
> +'immediate' flag in the klp_patch struct.)
>  
>  But there are more complex fixes. For example, a patch might change
>  ordering of locking in multiple functions at the same time. Or a patch
> @@ -86,20 +87,103 @@ or no data are stored in the modified structures at the moment.
>  The theory about how to apply functions a safe way is rather complex.
>  The aim is to define a so-called consistency model. It attempts to define
>  conditions when the new implementation could be used so that the system
> -stays consistent. The theory is not yet finished. See the discussion at
> -http://thread.gmane.org/gmane.linux.kernel/1823033/focus=1828189
> -
> -The current consistency model is very simple. It guarantees that either
> -the old or the new function is called. But various functions get redirected
> -one by one without any synchronization.
> -
> -In other words, the current implementation _never_ modifies the behavior
> -in the middle of the call. It is because it does _not_ rewrite the entire
> -function in the memory. Instead, the function gets redirected at the
> -very beginning. But this redirection is used immediately even when
> -some other functions from the same patch have not been redirected yet.
> -
> -See also the section "Limitations" below.
> +stays consistent.
> +
> +Livepatch has a consistency model which is a hybrid of kGraft and
> +kpatch:  it uses kGraft's per-task consistency and syscall barrier
> +switching combined with kpatch's stack trace switching.  There are also
> +a number of fallback options which make it quite flexible.
> +
> +Patches are applied on a per-task basis, when the task is deemed safe to
> +switch over.  When a patch is enabled, livepatch enters into a
> +transition state where tasks are converging to the patched state.
> +Usually this transition state can complete in a few seconds.  The same
> +sequence occurs when a patch is disabled, except the tasks converge from
> +the patched state to the unpatched state.
> +
> +An interrupt handler inherits the patched state of the task it
> +interrupts.  The same is true for forked tasks: the child inherits the
> +patched state of the parent.
> +
> +Livepatch uses several complementary approaches to determine when it's
> +safe to patch tasks:
> +
> +1. The first and most effective approach is stack checking of sleeping
> +   tasks.  If no affected functions are on the stack of a given task,
> +   the task is patched.  In most cases this will patch most or all of
> +   the tasks on the first try.  Otherwise it'll keep trying
> +   periodically.  This option is only available if the architecture has
> +   reliable stacks (CONFIG_RELIABLE_STACKTRACE and
> +   CONFIG_STACK_VALIDATION).
> +
> +2. The second approach, if needed, is kernel exit switching.  A
> +   task is switched when it returns to user space from a system call, a
> +   user space IRQ, or a signal.  It's useful in the following cases:
> +
> +   a) Patching I/O-bound user tasks which are sleeping on an affected
> +      function.  In this case you have to send SIGSTOP and SIGCONT to
> +      force it to exit the kernel and be patched.
> +   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
> +      then it will get patched the next time it gets interrupted by an
> +      IRQ.
> +   c) Applying patches for architectures which don't yet have
> +      CONFIG_RELIABLE_STACKTRACE.  In this case you'll have to signal
> +      most of the tasks on the system.  However this isn't a complete
> +      solution, because there's currently no way to patch kthreads
> +      without CONFIG_RELIABLE_STACKTRACE.
> +
> +   Note: since idle "swapper" tasks don't ever exit the kernel, they
> +   instead have a kpatch_patch_task() call in the idle loop which allows

s/kpatch_patch_task()/klp_patch_task()/

Otherwise all the code that touches livepatch looks good to me, apart from
the things mentioned in other emails.

Miroslav


* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-09  9:41   ` Miroslav Benes
@ 2016-05-16 17:27     ` Josh Poimboeuf
  0 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-16 17:27 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: Jessica Yu, Jiri Kosina, Ingo Molnar, Peter Zijlstra,
	Michael Ellerman, Heiko Carstens, live-patching, linux-kernel,
	x86, linuxppc-dev, linux-s390, Vojtech Pavlik, Jiri Slaby,
	Petr Mladek, Chris J Arges, Andy Lutomirski

On Mon, May 09, 2016 at 11:41:37AM +0200, Miroslav Benes wrote:
> > +void klp_init_transition(struct klp_patch *patch, int state)
> > +{
> > +	struct task_struct *g, *task;
> > +	unsigned int cpu;
> > +	struct klp_object *obj;
> > +	struct klp_func *func;
> > +	int initial_state = !state;
> > +
> > +	klp_transition_patch = patch;
> > +
> > +	/*
> > +	 * If the patch can be applied or reverted immediately, skip the
> > +	 * per-task transitions.
> > +	 */
> > +	if (patch->immediate)
> > +		return;
> > +
> > +	/*
> > +	 * Initialize all tasks to the initial patch state to prepare them for
> > +	 * switching to the target state.
> > +	 */
> > +	read_lock(&tasklist_lock);
> > +	for_each_process_thread(g, task)
> > +		task->patch_state = initial_state;
> > +	read_unlock(&tasklist_lock);
> > +
> > +	/*
> > +	 * Ditto for the idle "swapper" tasks.
> > +	 */
> > +	get_online_cpus();
> > +	for_each_online_cpu(cpu)
> > +		idle_task(cpu)->patch_state = initial_state;
> > +	put_online_cpus();
> > +
> > +	/*
> > +	 * Ensure klp_ftrace_handler() sees the task->patch_state updates
> > +	 * before the func->transition updates.  Otherwise it could read an
> > +	 * out-of-date task state and pick the wrong function.
> > +	 */
> > +	smp_wmb();
> > +
> > +	/*
> > +	 * Set the func transition states so klp_ftrace_handler() will know to
> > +	 * switch to the transition logic.
> > +	 *
> > +	 * When patching, the funcs aren't yet in the func_stack and will be
> > +	 * made visible to the ftrace handler shortly by the calls to
> > +	 * klp_patch_object().
> > +	 *
> > +	 * When unpatching, the funcs are already in the func_stack and so are
> > +	 * already visible to the ftrace handler.
> > +	 */
> > +	klp_for_each_object(patch, obj)
> > +		klp_for_each_func(obj, func)
> > +			func->transition = true;
> > +
> > +	/*
> > +	 * Set the global target patch state which tasks will switch to.  This
> > +	 * has no effect until the TIF_PATCH_PENDING flags get set later.
> > +	 */
> > +	klp_target_state = state;
> 
> I am afraid there is a problem for (patch->immediate == true) patches. 
> klp_target_state is not set for those and the comment is not entirely 
> true, because klp_target_state has an effect in several places.

Ah, you're right.  I moved this assignment here for v2.  It was
originally done before the patch->immediate check.  If I remember
correctly, I moved it closer to the barrier for better readability (but
I created a bug in the process).

> I guess we need to set klp_target_state even for immediate patches. Should 
> we also initialize it with KLP_UNDEFINED and set it to KLP_UNDEFINED in 
> klp_complete_transition()?

Yes, to both.
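
Roughly something like this, I think (a sketch of the direction, not the
actual follow-up code; KLP_UNDEFINED is an assumed name for the "no
transition" value):

static int klp_target_state = KLP_UNDEFINED;

void klp_init_transition(struct klp_patch *patch, int state)
{
	...
	/*
	 * Set the target state before the immediate check so that it is
	 * valid even for patch->immediate patches.
	 */
	klp_target_state = state;

	if (patch->immediate)
		return;
	...
}

void klp_complete_transition(void)
{
	...
	klp_target_state = KLP_UNDEFINED;
}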

-- 
Josh


* Re: klp_task_patch: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-09 12:23           ` Petr Mladek
@ 2016-05-16 18:12             ` Josh Poimboeuf
  2016-05-18 13:12               ` Petr Mladek
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-16 18:12 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Mon, May 09, 2016 at 02:23:03PM +0200, Petr Mladek wrote:
> On Fri 2016-05-06 07:38:55, Josh Poimboeuf wrote:
> > On Thu, May 05, 2016 at 01:57:01PM +0200, Petr Mladek wrote:
> > > I have missed that the two commands are called with preemption
> > > disabled. So, I had the following crazy scenario in mind:
> > > 
> > > 
> > > CPU0				CPU1
> > > 
> > > klp_enable_patch()
> > > 
> > >   klp_target_state = KLP_PATCHED;
> > > 
> > >   for_each_task()
> > >      set TIF_PENDING_PATCH
> > > 
> > > 				# task 123
> > > 
> > > 				if (klp_patch_pending(current)
> > > 				  klp_patch_task(current)
> > > 
> > >                                     clear TIF_PENDING_PATCH
> > > 
> > > 				    smp_rmb();
> > > 
> > > 				    # switch to assembly of
> > > 				    # klp_patch_task()
> > > 
> > > 				    mov klp_target_state, %r12
> > > 
> > > 				    # interrupt and schedule
> > > 				    # another task
> > > 
> > > 
> > >   klp_reverse_transition();
> > > 
> > >     klp_target_state = KLP_UNPATCHED;
> > > 
> > >     klt_try_to_complete_transition()
> > > 
> > >       task = 123;
> > >       if (task->patch_state == klp_target_state;
> > >          return 0;
> > > 
> > >     => task 123 is in target state and does
> > >     not block conversion
> > > 
> > >   klp_complete_transition()
> > > 
> > > 
> > >   # disable previous patch on the stack
> > >   klp_disable_patch();
> > > 
> > >     klp_target_state = KLP_UNPATCHED;
> > >   
> > >   
> > > 				    # task 123 gets scheduled again
> > > 				    lea %r12, task->patch_state
> > > 
> > > 				    => it happily stores an outdated
> > > 				    state
> > > 
> > 
> > Thanks for the clear explanation, this helps a lot.
> > 
> > > This is why the two functions should get called with preemption
> > > disabled. We should document it at least. I imagine that we will
> > > use them later also in another context and nobody will remember
> > > this crazy scenario.
> > > 
> > > Well, even disabled preemption does not help. The process on
> > > CPU1 might be also interrupted by an NMI and do some long
> > > printk in it.
> > > 
> > > IMHO, the only safe approach is to call klp_patch_task()
> > > only for "current" on a safe place. Then this race is harmless.
> > > The switch happen on a safe place, so that it does not matter
> > > into which state the process is switched.
> > 
> > I'm not sure about this solution.  When klp_complete_transition() is
> > called, we need all tasks to be patched, for good.  We don't want any of
> > them to randomly switch to the wrong state at some later time in the
> > middle of a future patch operation.  How would changing klp_patch_task()
> > to only use "current" prevent that?
> 
> You are right that it is a pity, but it really should be safe because
> it is not entirely random.
> 
> If the race happens and assigns an outdated value, there are two
> situations:
> 
> 1. It is assigned when there is no transition in progress.
>    Then it is OK because it will be ignored by the ftrace handler.
>    The right state will be set before the next transition starts.
> 
> 2. It is assigned when some other transition is in progress.
>    Then it is OK as long as the function is called from "current".
>    The "wrong" state will be used consistently. It will switch
>    to the right state at another safe place.

Maybe it would be safe, though I'm not entirely convinced.  Regardless I
think we should avoid these situations entirely because they create
windows for future bugs and races.

> > > By other words, the task state might be updated only
> > > 
> > >    + by the task itself on a safe place
> > >    + by other task when the updated on is sleeping on a safe place
> > > 
> > > This should be well documented and the API should help to avoid
> > > a misuse.
> > 
> > I think we could fix it to be safe for future callers who might not have
> > preemption disabled with a couple of changes to klp_patch_task():
> > disabling preemption and testing/clearing the TIF_PATCH_PENDING flag
> > before changing the patch state:
> > 
> >   void klp_patch_task(struct task_struct *task)
> >   {
> >   	preempt_disable();
> >   
> >   	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
> >   		task->patch_state = READ_ONCE(klp_target_state);
> >   
> >   	preempt_enable();
> >   }
> 
> It reduces the race window a bit but it is still there. For example,
> an NMI might still add a huge delay between reading klp_target_state
> and assigning task->patch_state.

Maybe you missed this paragraph from my last email:

| We would also need a synchronize_sched() after the patching is complete,
| either at the end of klp_try_complete_transition() or in
| klp_complete_transition().  That would make sure that all existing calls
| to klp_patch_task() are done.

So a huge NMI delay wouldn't be a problem here.  The call to
synchronize_sched() in klp_complete_transition() would sleep until the
NMI handler returns and the critical section of klp_patch_task()
finishes.  So once a patch is complete, we know that it's really
complete.
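
To spell out the combination being proposed (a sketch, not the final
patch):

void klp_patch_task(struct task_struct *task)
{
	/*
	 * Disabling preemption makes this a synchronize_sched() critical
	 * section.  An NMI can still interrupt it, but the
	 * synchronize_sched() in klp_complete_transition() waits for the
	 * NMI handler to return and this section to finish.
	 */
	preempt_disable();

	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
		task->patch_state = READ_ONCE(klp_target_state);

	preempt_enable();
}

void klp_complete_transition(void)
{
	...
	/* wait for all in-flight klp_patch_task() calls to finish */
	synchronize_sched();
	...
}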

> What about the following?
> 
> /*
>  * This function might assign an outdated value if the transaction
>  * is reverted and finalized in parallel. But it is safe. If the value
>  * is assigned outside of a transaction, it is ignored and the next
>  * transaction will set the right one. Or if it gets assigned
>  * inside another transaction, it will repeat the cycle and
>  * set the right state.
>  */
> void klp_update_current_patch_state()
> {
> 	while (test_and_clear_tsk_thread_flag(current, TIF_PATCH_PENDING))
> 		current->patch_state = READ_ONCE(klp_target_state);
> }

I'm not sure how this would work.  How would the thread flag get set
again after it's been cleared?

Also I really don't like the idea of randomly updating a task's patch
state after the transition has been completed.

> Note that the disabled preemption helped only partially,
> so I think that it was not really needed.
> 
> Hmm, it means that the task->patch_state  might be either
> KLP_PATCHED or KLP_UNPATCHED outside a transition. I wonder
> if the tristate really brings some advantages.
> 
> 
> Alternatively, we might synchronize the operation with klp_mutex.
> The function is called in a slow path and in a safe context.
> Well, it might cause contention on the lock when many CPUs are
> trying to update their tasks.

I don't think a mutex would work because at least the ftrace handler
(and maybe more) can't sleep.  Maybe a spinlock could work but I think
that would be overkill.

-- 
Josh


* Re: livepatch: change to a per-task consistency model
  2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
                     ` (5 preceding siblings ...)
  2016-05-10 11:39   ` Miroslav Benes
@ 2016-05-17 22:53   ` Jessica Yu
  2016-05-18  8:16     ` Jiri Kosina
  2016-06-06 13:54   ` [RFC PATCH v2 17/18] " Petr Mladek
  7 siblings, 1 reply; 121+ messages in thread
From: Jessica Yu @ 2016-05-17 22:53 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Miroslav Benes, Ingo Molnar, Peter Zijlstra,
	Michael Ellerman, Heiko Carstens, live-patching, linux-kernel,
	x86, linuxppc-dev, linux-s390, Vojtech Pavlik, Jiri Slaby,
	Petr Mladek, Chris J Arges, Andy Lutomirski

+++ Josh Poimboeuf [28/04/16 15:44 -0500]:

[snip]

>diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt
>index 6c43f6e..bee86d0 100644
>--- a/Documentation/livepatch/livepatch.txt
>+++ b/Documentation/livepatch/livepatch.txt
>@@ -72,7 +72,8 @@ example, they add a NULL pointer or a boundary check, fix a race by adding
> a missing memory barrier, or add some locking around a critical section.
> Most of these changes are self contained and the function presents itself
> the same way to the rest of the system. In this case, the functions might
>-be updated independently one by one.
>+be updated independently one by one.  (This can be done by setting the
>+'immediate' flag in the klp_patch struct.)
>
> But there are more complex fixes. For example, a patch might change
> ordering of locking in multiple functions at the same time. Or a patch
>@@ -86,20 +87,103 @@ or no data are stored in the modified structures at the moment.
> The theory about how to apply functions a safe way is rather complex.
> The aim is to define a so-called consistency model. It attempts to define
> conditions when the new implementation could be used so that the system
>-stays consistent. The theory is not yet finished. See the discussion at
>-http://thread.gmane.org/gmane.linux.kernel/1823033/focus=1828189
>-
>-The current consistency model is very simple. It guarantees that either
>-the old or the new function is called. But various functions get redirected
>-one by one without any synchronization.
>-
>-In other words, the current implementation _never_ modifies the behavior
>-in the middle of the call. It is because it does _not_ rewrite the entire
>-function in the memory. Instead, the function gets redirected at the
>-very beginning. But this redirection is used immediately even when
>-some other functions from the same patch have not been redirected yet.
>-
>-See also the section "Limitations" below.
>+stays consistent.
>+
>+Livepatch has a consistency model which is a hybrid of kGraft and
>+kpatch:  it uses kGraft's per-task consistency and syscall barrier
>+switching combined with kpatch's stack trace switching.  There are also
>+a number of fallback options which make it quite flexible.
>+
>+Patches are applied on a per-task basis, when the task is deemed safe to
>+switch over.  When a patch is enabled, livepatch enters into a
>+transition state where tasks are converging to the patched state.
>+Usually this transition state can complete in a few seconds.  The same
>+sequence occurs when a patch is disabled, except the tasks converge from
>+the patched state to the unpatched state.
>+
>+An interrupt handler inherits the patched state of the task it
>+interrupts.  The same is true for forked tasks: the child inherits the
>+patched state of the parent.
>+
>+Livepatch uses several complementary approaches to determine when it's
>+safe to patch tasks:
>+
>+1. The first and most effective approach is stack checking of sleeping
>+   tasks.  If no affected functions are on the stack of a given task,
>+   the task is patched.  In most cases this will patch most or all of
>+   the tasks on the first try.  Otherwise it'll keep trying
>+   periodically.  This option is only available if the architecture has
>+   reliable stacks (CONFIG_RELIABLE_STACKTRACE and
>+   CONFIG_STACK_VALIDATION).
>+
>+2. The second approach, if needed, is kernel exit switching.  A
>+   task is switched when it returns to user space from a system call, a
>+   user space IRQ, or a signal.  It's useful in the following cases:
>+
>+   a) Patching I/O-bound user tasks which are sleeping on an affected
>+      function.  In this case you have to send SIGSTOP and SIGCONT to
>+      force it to exit the kernel and be patched.

See below -

>+   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
>+      then it will get patched the next time it gets interrupted by an
>+      IRQ.
>+   c) Applying patches for architectures which don't yet have
>+      CONFIG_RELIABLE_STACKTRACE.  In this case you'll have to signal
>+      most of the tasks on the system.  However this isn't a complete
>+      solution, because there's currently no way to patch kthreads
>+      without CONFIG_RELIABLE_STACKTRACE.
>+
>+   Note: since idle "swapper" tasks don't ever exit the kernel, they
>+   instead have a kpatch_patch_task() call in the idle loop which allows
>+   them to patched before the CPU enters the idle state.
>+
>+3. A third approach (not yet implemented) is planned for the case where
>+   a kthread is sleeping on an affected function.  In that case we could
>+   kick the kthread with a signal and then try to patch the task from
>+   the to-be-patched function's livepatch ftrace handler when it
>+   re-enters the function.  This will require
>+   CONFIG_RELIABLE_STACKTRACE.
>+
>+All the above approaches may be skipped by setting the 'immediate' flag
>+in the 'klp_patch' struct, which will patch all tasks immediately.  This
>+can be useful if the patch doesn't change any function or data
>+semantics.  Note that, even with this flag set, it's possible that some
>+tasks may still be running with an old version of the function, until
>+that function returns.
>+
>+There's also an 'immediate' flag in the 'klp_func' struct which allows
>+you to specify that certain functions in the patch can be applied
>+without per-task consistency.  This might be useful if you want to patch
>+a common function like schedule(), and the function change doesn't need
>+consistency but the rest of the patch does.
>+
>+For architectures which don't have CONFIG_RELIABLE_STACKTRACE, there
>+are two options:
>+
>+a) the user can set the patch->immediate flag which causes all tasks to
>+   be patched immediately.  This option should be used with care, only
>+   when the patch doesn't change any function or data semantics; or
>+
>+b) use the kernel exit switching approach (this is the default).
>+   Note the patching will never complete because there's no currently no
>+   way to patch kthreads without CONFIG_RELIABLE_STACKTRACE.
>+
>+The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
>+is in transition.  Only a single patch (the topmost patch on the stack)
>+can be in transition at a given time.  A patch can remain in transition
>+indefinitely, if any of the tasks are stuck in the initial patch state.
>+
>+A transition can be reversed and effectively canceled by writing the
>+opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
>+the transition is in progress.  Then all the tasks will attempt to
>+converge back to the original patch state.
>+
>+There's also a /proc/<pid>/patch_state file which can be used to
>+determine which tasks are blocking completion of a patching operation.
>+If a patch is in transition, this file shows 0 to indicate the task is
>+unpatched and 1 to indicate it's patched.  Otherwise, if no patch is in
>+transition, it shows -1. Any tasks which are blocking the transition
>+can be signaled with SIGSTOP and SIGCONT to force them to change their
>+patched state.

What about tasks sleeping on affected functions in uninterruptible
sleep (possibly indefinitely)? Since all signals are ignored, we
wouldn't be able to patch those tasks in this way, right? Would that
be an unsupported case? Might be useful to mention this in the
documentation somewhere.

Jessica


* Re: livepatch: change to a per-task consistency model
  2016-05-17 22:53   ` Jessica Yu
@ 2016-05-18  8:16     ` Jiri Kosina
  2016-05-18 16:51       ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Jiri Kosina @ 2016-05-18  8:16 UTC (permalink / raw)
  To: Jessica Yu
  Cc: Josh Poimboeuf, Miroslav Benes, Ingo Molnar, Peter Zijlstra,
	Michael Ellerman, Heiko Carstens, live-patching, linux-kernel,
	x86, linuxppc-dev, linux-s390, Vojtech Pavlik, Jiri Slaby,
	Petr Mladek, Chris J Arges, Andy Lutomirski

On Tue, 17 May 2016, Jessica Yu wrote:

> What about tasks sleeping on affected functions in uninterruptible sleep 
> (possibly indefinitely)? Since all signals are ignored, we wouldn't be 
> able to patch those tasks in this way, right? Would that be an 
> unsupported case?

I don't think there is any better way out of this situation than
documenting that the convergence of patching could in such cases take
quite a lot of time (well, we can proactively try to detect this
situation before the patching actually starts, and warn about the possible
consequences).

But let's face it, this should be pretty uncommon, because (a) it's not 
realistic for the wait times to be really indefinite (b) the task is 
likely to be in TASK_KILLABLE rather than just plain TASK_UNINTERRUPTIBLE.
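
For reference, the difference being relied on here -- a minimal,
hypothetical example (the wait queue and condition names are made up):

	/*
	 * Killable sleep: ignores normal signals, but a fatal signal
	 * (e.g. SIGKILL) still terminates the wait, so such a task can
	 * at least be killed.
	 */
	int ret = wait_event_killable(my_wq, my_condition);

	/* Plain uninterruptible sleep: ignores all signals until woken. */
	wait_event(my_wq, my_condition);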

-- 
Jiri Kosina
SUSE Labs


* Re: klp_task_patch: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-05-16 18:12             ` Josh Poimboeuf
@ 2016-05-18 13:12               ` Petr Mladek
  0 siblings, 0 replies; 121+ messages in thread
From: Petr Mladek @ 2016-05-18 13:12 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Mon 2016-05-16 13:12:50, Josh Poimboeuf wrote:
> On Mon, May 09, 2016 at 02:23:03PM +0200, Petr Mladek wrote:
> > On Fri 2016-05-06 07:38:55, Josh Poimboeuf wrote:
> > > On Thu, May 05, 2016 at 01:57:01PM +0200, Petr Mladek wrote:
> > > > I have missed that the two commands are called with preemption
> > > > disabled. So, I had the following crazy scenario in mind:
> > > > 
> > > > 
> > > > CPU0				CPU1
> > > > 
> > > > klp_enable_patch()
> > > > 
> > > >   klp_target_state = KLP_PATCHED;
> > > > 
> > > >   for_each_task()
> > > >      set TIF_PENDING_PATCH
> > > > 
> > > > 				# task 123
> > > > 
> > > > 				if (klp_patch_pending(current)
> > > > 				  klp_patch_task(current)
> > > > 
> > > >                                     clear TIF_PENDING_PATCH
> > > > 
> > > > 				    smp_rmb();
> > > > 
> > > > 				    # switch to assembly of
> > > > 				    # klp_patch_task()
> > > > 
> > > > 				    mov klp_target_state, %r12
> > > > 
> > > > 				    # interrupt and schedule
> > > > 				    # another task
> > > > 
> > > > 
> > > >   klp_reverse_transition();
> > > > 
> > > >     klp_target_state = KLP_UNPATCHED;
> > > > 
> > > >     klt_try_to_complete_transition()
> > > > 
> > > >       task = 123;
> > > >       if (task->patch_state == klp_target_state;
> > > >          return 0;
> > > > 
> > > >     => task 123 is in target state and does
> > > >     not block conversion
> > > > 
> > > >   klp_complete_transition()
> > > > 
> > > > 
> > > >   # disable previous patch on the stack
> > > >   klp_disable_patch();
> > > > 
> > > >     klp_target_state = KLP_UNPATCHED;
> > > >   
> > > >   
> > > > 				    # task 123 gets scheduled again
> > > > 				    lea %r12, task->patch_state
> > > > 
> > > > 				    => it happily stores an outdated
> > > > 				    state
> > > > 
> > > 
> > > Thanks for the clear explanation, this helps a lot.
> > > 
> > > > This is why the two functions should get called with preemption
> > > > disabled. We should document it at least. I imagine that we will
> > > > use them later also in another context and nobody will remember
> > > > this crazy scenario.
> > > > 
> > > > Well, even disabled preemption does not help. The process on
> > > > CPU1 might be also interrupted by an NMI and do some long
> > > > printk in it.
> > > > 
> > > > IMHO, the only safe approach is to call klp_patch_task()
> > > > only for "current" on a safe place. Then this race is harmless.
> > > > The switch happen on a safe place, so that it does not matter
> > > > into which state the process is switched.
> > > 
> > > I'm not sure about this solution.  When klp_complete_transition() is
> > > called, we need all tasks to be patched, for good.  We don't want any of
> > > them to randomly switch to the wrong state at some later time in the
> > > middle of a future patch operation.  How would changing klp_patch_task()
> > > to only use "current" prevent that?
> > 
> > You are right that it is a pity, but it really should be safe because
> > it is not entirely random.
> > 
> > If the race happens and assigns an outdated value, there are two
> > situations:
> > 
> > 1. It is assigned when there is no transition in progress.
> >    Then it is OK because it will be ignored by the ftrace handler.
> >    The right state will be set before the next transition starts.
> > 
> > 2. It is assigned when some other transition is in progress.
> >    Then it is OK as long as the function is called from "current".
> >    The "wrong" state will be used consistently. It will switch
> >    to the right state at another safe place.
> 
> Maybe it would be safe, though I'm not entirely convinced.  Regardless I
> think we should avoid these situations entirely because they create
> windows for future bugs and races.

Yup, I would prefer a cleaner solution as well.

> > > > By other words, the task state might be updated only
> > > > 
> > > >    + by the task itself on a safe place
> > > >    + by other task when the updated on is sleeping on a safe place
> > > > 
> > > > This should be well documented and the API should help to avoid
> > > > a misuse.
> > > 
> > > I think we could fix it to be safe for future callers who might not have
> > > preemption disabled with a couple of changes to klp_patch_task():
> > > disabling preemption and testing/clearing the TIF_PATCH_PENDING flag
> > > before changing the patch state:
> > > 
> > >   void klp_patch_task(struct task_struct *task)
> > >   {
> > >   	preempt_disable();
> > >   
> > >   	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
> > >   		task->patch_state = READ_ONCE(klp_target_state);
> > >   
> > >   	preempt_enable();
> > >   }
> > 
> > It reduces the race window a bit but it is still there. For example,
> > an NMI might still add a huge delay between reading klp_target_state
> > and assigning task->patch_state.
> 
> Maybe you missed this paragraph from my last email:
>
> | We would also need a synchronize_sched() after the patching is complete,
> | either at the end of klp_try_complete_transition() or in
> | klp_complete_transition().  That would make sure that all existing calls
> | to klp_patch_task() are done.
> 
> So a huge NMI delay wouldn't be a problem here.  The call to
> synchronize_sched() in klp_complete_transition() would sleep until the
> NMI handler returns and the critical section of klp_patch_task()
> finishes.  So once a patch is complete, we know that it's really
> complete.

Yes, synchronize_sched() will help with preemption disabled. I did
not think it through enough last time.


> > What about the following?
> > 
> > /*
> >  * This function might assign an outdated value if the transaction
> >  * is reverted and finalized in parallel. But it is safe. If the value
> >  * is assigned outside of a transaction, it is ignored and the next
> >  * transaction will set the right one. Or if it gets assigned
> >  * inside another transaction, it will repeat the cycle and
> >  * set the right state.
> >  */
> > void klp_update_current_patch_state()
> > {
> > 	while (test_and_clear_tsk_thread_flag(current, TIF_PATCH_PENDING))
> > 		current->patch_state = READ_ONCE(klp_target_state);
> > }
> 
> I'm not sure how this would work.  How would the thread flag get set
> again after it's been cleared?

See the race described in the previous mail. The problem is when the
target state and the TIF flag get set after reading klp_target_state
into a register and before storing the value into current->patch_state.

We do not need this if we use synchronize_sched() and fix up
current->patch_state then.

> Also I really don't like the idea of randomly updating a task's patch
> state after the transition has been completed.
> 
> > Note that the disabled preemption helped only partially,
> > so I think that it was not really needed.
> > 
> > Hmm, it means that the task->patch_state  might be either
> > KLP_PATCHED or KLP_UNPATCHED outside a transition. I wonder
> > if the tristate really brings some advantages.
> > 
> > 
> > Alternatively, we might synchronize the operation with klp_mutex.
> > The function is called in a slow path and in a safe context.
> > Well, it might cause contention on the lock when many CPUs are
> > trying to update their tasks.
> 
> I don't think a mutex would work because at least the ftrace handler
> (and maybe more) can't sleep.  Maybe a spinlock could work but I think
> that would be overkill.

Sure, I had a spinlock in mind.
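
I.e. roughly something like this (just a sketch of what I had in mind, not
something I insist on given the synchronize_sched() approach above):

static DEFINE_SPINLOCK(klp_state_lock);

void klp_patch_task(struct task_struct *task)
{
	unsigned long flags;

	/*
	 * The same lock would have to be taken wherever klp_target_state
	 * is updated, e.g. in klp_reverse_transition().
	 */
	spin_lock_irqsave(&klp_state_lock, flags);

	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
		task->patch_state = klp_target_state;

	spin_unlock_irqrestore(&klp_state_lock, flags);
}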

Best Regards,
Petr


* Re: livepatch: change to a per-task consistency model
  2016-05-18  8:16     ` Jiri Kosina
@ 2016-05-18 16:51       ` Josh Poimboeuf
  2016-05-18 20:22         ` Jiri Kosina
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-18 16:51 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Jessica Yu, Miroslav Benes, Ingo Molnar, Peter Zijlstra,
	Michael Ellerman, Heiko Carstens, live-patching, linux-kernel,
	x86, linuxppc-dev, linux-s390, Vojtech Pavlik, Jiri Slaby,
	Petr Mladek, Chris J Arges, Andy Lutomirski

On Wed, May 18, 2016 at 10:16:22AM +0200, Jiri Kosina wrote:
> On Tue, 17 May 2016, Jessica Yu wrote:
> 
> > What about tasks sleeping on affected functions in uninterruptible sleep 
> > (possibly indefinitely)? Since all signals are ignored, we wouldn't be 
> > able to patch those tasks in this way, right? Would that be an 
> > unsupported case?
> 
> I don't think there is any better way out of this situation than
> documenting that the convergence of patching could in such cases take
> quite a lot of time (well, we can proactively try to detect this
> situation before the patching actually starts, and warn about the possible
> consequences).
> 
> But let's face it, this should be pretty uncommon, because (a) it's not 
> realistic for the wait times to be really indefinite (b) the task is 
> likely to be in TASK_KILLABLE rather than just plain TASK_UNINTERRUPTIBLE.

Yeah, I think this situation -- a task sleeping on an affected function
in uninterruptible state for a long period of time -- would be
exceedingly rare and not something we need to worry about for now.

-- 
Josh


* Re: livepatch: change to a per-task consistency model
  2016-05-18 16:51       ` Josh Poimboeuf
@ 2016-05-18 20:22         ` Jiri Kosina
  2016-05-23  9:42             ` David Laight
  0 siblings, 1 reply; 121+ messages in thread
From: Jiri Kosina @ 2016-05-18 20:22 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Miroslav Benes, Ingo Molnar, Peter Zijlstra,
	Michael Ellerman, Heiko Carstens, live-patching, linux-kernel,
	x86, linuxppc-dev, linux-s390, Vojtech Pavlik, Jiri Slaby,
	Petr Mladek, Chris J Arges, Andy Lutomirski

On Wed, 18 May 2016, Josh Poimboeuf wrote:

> Yeah, I think this situation -- a task sleeping on an affected function 
> in uninterruptible state for a long period of time -- would be 
> exceedingly rare and not something we need to worry about for now.

Plus, if the task were in TASK_UNINTERRUPTIBLE for more than 120s, the
hung task detector would trigger anyway.

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-02 15:52                       ` Andy Lutomirski
  2016-05-02 17:31                         ` Josh Poimboeuf
@ 2016-05-19 23:15                         ` Josh Poimboeuf
  2016-05-19 23:39                           ` Andy Lutomirski
  2016-05-23 21:34                           ` Andy Lutomirski
  1 sibling, 2 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-19 23:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> >> >
> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> > > >> pt_regs or similar.
> >> > > >
> >> > > > I think we should avoid doing something like that because it would break
> >> > > > gdb and all the other unwinders who don't know about it.
> >> > >
> >> > > How so?
> >> > >
> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> > > the pt_regs.  Currently it points to something stale (which the
> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> > > is the next thing on the stack, so just doing the section thing would
> >> > > work.
> >> >
> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> >> > past the nested entry frame and keep going until it gets to the original
> >> > entry.
> >>
> >> Yes.
> >>
> >> It would be nice if we could do better, though, and actually notice
> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> crash.
> >>
> >> Also, I think that just following rbp links will lose the
> >> actual function that took the page fault (or whatever function
> >> pt_regs->ip actually points to).
> >
> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> > new pt_regs frame gets saved on entry, we could also create a new stack
> > frame which points to a fake kernel_entry() function.  That would tell
> > the unwinder there's a pt_regs frame without otherwise breaking frame
> > pointers across the frame.
> >
> > Then I guess we wouldn't need my other solution of putting the idt
> > entries in a special section.
> >
> > How does that sound?
> 
> Let me try to understand.
> 
> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> points to (prev rbp, prev rip) on the stack, and you can follow the
> chain back.  Right now, on a user access page fault or similar, we
> have rbp (probably) pointing to the interrupted frame, and the
> interrupted rip isn't saved anywhere that a naive unwinder can find
> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> 
> We could change the entry code so that an interrupt / idtentry does:
> 
> push pt_regs
> push kernel_entry
> push %rbp
> mov %rsp, %rbp
> call handler
> pop %rbp
> addq $8, %rsp
> 
> or similar.  That would make it appear that the actual C handler was
> caused by a dummy function "kernel_entry".  Now the unwinder would get
> to kernel_entry, but it *still* wouldn't find its way to the calling
> frame, which only solves part of the problem.  We could at least teach
> the unwinder how kernel_entry works and let it decode pt_regs to
> continue unwinding.  This would be nice, and I think it could work.
> 
> I think I like this, except that, if it used a separate section, it
> could potentially be faster, as, for each actual entry type, the
> offset from the C handler frame to pt_regs is a foregone conclusion.
> But this is pretty simple and performance is already abysmal in most
> handlers.
> 
> There's an added benefit to using a separate section, though: we could
> also annotate the calls with what type of entry they were so the
> unwinder could print it out nicely.
> 
> I could be convinced either way.

Ok, I took a stab at this.  See the patch below.

In addition to annotating interrupt/exception pt_regs frames, I also
annotated all the syscall pt_regs, for consistency.

As you mentioned, it will affect performance a bit, but I think it will
be insignificant.

I think I like this approach better than putting the
interrupt/idtentries in a special section, because this is much more
precise.  Especially now that I'm annotating pt_regs syscalls.

Also I think with a few minor changes we could implement your idea of
annotating the calls with what type of entry they are.  But I don't
think that's really needed, because the name of the interrupt/idtentry
is already on the stack trace.

Before:

  [<ffffffff8143c243>] dump_stack+0x85/0xc2
  [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
  [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
  [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
  [<ffffffff81887058>] async_page_fault+0x28/0x30
  [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
  [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
  [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
  [<ffffffff81285e32>] __vfs_read+0xe2/0x140
  [<ffffffff81286378>] vfs_read+0x98/0x140
  [<ffffffff812878c8>] SyS_read+0x58/0xc0
  [<ffffffff81884dbc>] entry_SYSCALL_64_fastpath+0x1f/0xbd

After:

  [<ffffffff8143c243>] dump_stack+0x85/0xc2
  [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
  [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
  [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
  [<ffffffff81887422>] async_page_fault+0x32/0x40
  [<ffffffff81887861>] pt_regs+0x1/0x10
  [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
  [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
  [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
  [<ffffffff81285e32>] __vfs_read+0xe2/0x140
  [<ffffffff81286378>] vfs_read+0x98/0x140
  [<ffffffff812878c8>] SyS_read+0x58/0xc0
  [<ffffffff81884dc6>] entry_SYSCALL_64_fastpath+0x29/0xdb
  [<ffffffff81887861>] pt_regs+0x1/0x10

Note this example is with today's unwinder.  It could be made smarter to
get the RIP from the pt_regs so the '?' could be removed from
copy_page_to_iter().
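
For example, with the frame layout created by CREATE_PT_REGS_FRAME below
(saved rbp, then the fake $pt_regs+1 return address, then the pt_regs
pointer), the unwinder could do something along these lines (a rough
sketch only, not part of the patch):

/* the fake asm stub added by the patch; the struct tag doesn't clash */
extern void pt_regs(void);

static bool is_pt_regs_frame(unsigned long ret_addr)
{
	return ret_addr > (unsigned long)pt_regs &&
	       ret_addr < (unsigned long)pt_regs + 0x10;
}

/*
 * bp[0] = saved rbp, bp[1] = return address, bp[2] = pt_regs pointer
 * whenever bp[1] points into the fake pt_regs() function.
 */
static struct pt_regs *regs_from_frame(unsigned long *bp)
{
	if (!is_pt_regs_frame(bp[1]))
		return NULL;
	return (struct pt_regs *)bp[2];
}

The interrupted RIP would then just be regs_from_frame(bp)->ip, which
would let the unwinder drop the '?' from the next entry.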

Thoughts?

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..f54886a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel is built with
 	.byte 0xf1
 	.endm
 
+	/*
+	 * Create a stack frame for the saved pt_regs.  This allows frame
+	 * pointer based unwinders to find pt_regs on the stack.
+	 */
+	.macro CREATE_PT_REGS_FRAME regs=%rsp
+#ifdef CONFIG_FRAME_POINTER
+	pushq	\regs
+	pushq	$pt_regs+1
+	pushq	%rbp
+	movq	%rsp, %rbp
+#endif
+	.endm
+
+	.macro REMOVE_PT_REGS_FRAME
+#ifdef CONFIG_FRAME_POINTER
+	popq	%rbp
+	addq	$0x10, %rsp
+#endif
+	.endm
+
+	.macro CALL_HANDLER handler regs=%rsp
+	CREATE_PT_REGS_FRAME \regs
+	call	\handler
+	REMOVE_PT_REGS_FRAME
+	.endm
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1..8642984 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -199,6 +199,7 @@ entry_SYSCALL_64_fastpath:
 	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
 	movq	%r10, %rcx
 
+	CREATE_PT_REGS_FRAME
 	/*
 	 * This call instruction is handled specially in stub_ptregs_64.
 	 * It might end up jumping to the slow path.  If it jumps, RAX
@@ -207,6 +208,8 @@ entry_SYSCALL_64_fastpath:
 	call	*sys_call_table(, %rax, 8)
 .Lentry_SYSCALL_64_after_fastpath_call:
 
+	REMOVE_PT_REGS_FRAME
+
 	movq	%rax, RAX(%rsp)
 1:
 
@@ -238,14 +241,14 @@ entry_SYSCALL_64_fastpath:
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_EXTRA_REGS
 	movq	%rsp, %rdi
-	call	syscall_return_slowpath	/* returns with IRQs disabled */
+	CALL_HANDLER syscall_return_slowpath	/* returns with IRQs disabled */
 	jmp	return_from_SYSCALL_64
 
 entry_SYSCALL64_slow_path:
 	/* IRQs are off. */
 	SAVE_EXTRA_REGS
 	movq	%rsp, %rdi
-	call	do_syscall_64		/* returns with IRQs disabled */
+	CALL_HANDLER do_syscall_64	/* returns with IRQs disabled */
 
 return_from_SYSCALL_64:
 	RESTORE_EXTRA_REGS
@@ -344,6 +347,7 @@ ENTRY(stub_ptregs_64)
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	popq	%rax
+	REMOVE_PT_REGS_FRAME
 	jmp	entry_SYSCALL64_slow_path
 
 1:
@@ -372,7 +376,7 @@ END(ptregs_\func)
 ENTRY(ret_from_fork)
 	LOCK ; btr $TIF_FORK, TI_flags(%r8)
 
-	call	schedule_tail			/* rdi: 'prev' task parameter */
+	CALL_HANDLER schedule_tail		/* rdi: 'prev' task parameter */
 
 	testb	$3, CS(%rsp)			/* from kernel_thread? */
 	jnz	1f
@@ -385,8 +389,9 @@ ENTRY(ret_from_fork)
 	 * parameter to be passed in RBP.  The called function is permitted
 	 * to call do_execve and thereby jump to user mode.
 	 */
+	movq	RBX(%rsp), %rbx
 	movq	RBP(%rsp), %rdi
-	call	*RBX(%rsp)
+	CALL_HANDLER *%rbx
 	movl	$0, RAX(%rsp)
 
 	/*
@@ -396,7 +401,7 @@ ENTRY(ret_from_fork)
 
 1:
 	movq	%rsp, %rdi
-	call	syscall_return_slowpath	/* returns with IRQs disabled */
+	CALL_HANDLER syscall_return_slowpath	/* returns with IRQs disabled */
 	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
 	SWAPGS
 	jmp	restore_regs_and_iret
@@ -468,7 +473,7 @@ END(irq_entries_start)
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
 
-	call	\func	/* rdi points to pt_regs */
+	CALL_HANDLER \func regs=%rdi
 	.endm
 
 	/*
@@ -495,7 +500,7 @@ ret_from_intr:
 	/* Interrupt came from user space */
 GLOBAL(retint_user)
 	mov	%rsp,%rdi
-	call	prepare_exit_to_usermode
+	CALL_HANDLER prepare_exit_to_usermode
 	TRACE_IRQS_IRETQ
 	SWAPGS
 	jmp	restore_regs_and_iret
@@ -509,7 +514,7 @@ retint_kernel:
 	jnc	1f
 0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
 	jnz	1f
-	call	preempt_schedule_irq
+	CALL_HANDLER preempt_schedule_irq
 	jmp	0b
 1:
 #endif
@@ -688,8 +693,6 @@ ENTRY(\sym)
 	.endif
 	.endif
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
-
 	.if \has_error_code
 	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
 	movq	$-1, ORIG_RAX(%rsp)		/* no syscall to restart */
@@ -701,7 +704,8 @@ ENTRY(\sym)
 	subq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
 	.endif
 
-	call	\do_sym
+	movq	%rsp, %rdi			/* pt_regs pointer */
+	CALL_HANDLER \do_sym
 
 	.if \shift_ist != -1
 	addq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
@@ -728,8 +732,6 @@ ENTRY(\sym)
 	call	sync_regs
 	movq	%rax, %rsp			/* switch stack */
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
-
 	.if \has_error_code
 	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
 	movq	$-1, ORIG_RAX(%rsp)		/* no syscall to restart */
@@ -737,7 +739,8 @@ ENTRY(\sym)
 	xorl	%esi, %esi			/* no error code */
 	.endif
 
-	call	\do_sym
+	movq	%rsp, %rdi			/* pt_regs pointer */
+	CALL_HANDLER \do_sym
 
 	jmp	error_exit			/* %ebx: no swapgs flag */
 	.endif
@@ -1174,7 +1177,7 @@ ENTRY(nmi)
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
-	call	do_nmi
+	CALL_HANDLER do_nmi
 
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
@@ -1387,7 +1390,7 @@ end_repeat_nmi:
 	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
-	call	do_nmi
+	CALL_HANDLER do_nmi
 
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
@@ -1423,3 +1426,11 @@ ENTRY(ignore_sysret)
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
+
+/* fake function which allows stack unwinders to detect pt_regs frames */
+#ifdef CONFIG_FRAME_POINTER
+ENTRY(pt_regs)
+	nop
+	nop
+ENDPROC(pt_regs)
+#endif /* CONFIG_FRAME_POINTER */

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-19 23:15                         ` Josh Poimboeuf
@ 2016-05-19 23:39                           ` Andy Lutomirski
  2016-05-20 14:05                             ` Josh Poimboeuf
  2016-05-23 21:34                           ` Andy Lutomirski
  1 sibling, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-05-19 23:39 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>> >> >
>> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > > >> I suppose we could try to rejigger the code so that rbp points to
>> >> > > >> pt_regs or similar.
>> >> > > >
>> >> > > > I think we should avoid doing something like that because it would break
>> >> > > > gdb and all the other unwinders who don't know about it.
>> >> > >
>> >> > > How so?
>> >> > >
>> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> >> > > the pt_regs.  Currently it points to something stale (which the
>> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> >> > > is the next thing on the stack, so just doing the section thing would
>> >> > > work.
>> >> >
>> >> > Yes, rbp is meaningless on the entry from user space.  But if an
>> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> >> > past the nested entry frame and keep going until it gets to the original
>> >> > entry.
>> >>
>> >> Yes.
>> >>
>> >> It would be nice if we could do better, though, and actually notice
>> >> the pt_regs and identify the entry.  For example, I'd love to see
>> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> >> crash.
>> >>
>> >> Also, I think that just following rbp links will lose the
>> >> actual function that took the page fault (or whatever function
>> >> pt_regs->ip actually points to).
>> >
>> > Hm.  I think we could fix all that in a more standard way.  Whenever a
>> > new pt_regs frame gets saved on entry, we could also create a new stack
>> > frame which points to a fake kernel_entry() function.  That would tell
>> > the unwinder there's a pt_regs frame without otherwise breaking frame
>> > pointers across the frame.
>> >
>> > Then I guess we wouldn't need my other solution of putting the idt
>> > entries in a special section.
>> >
>> > How does that sound?
>>
>> Let me try to understand.
>>
>> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
>> points to (prev rbp, prev rip) on the stack, and you can follow the
>> chain back.  Right now, on a user access page fault or similar, we
>> have rbp (probably) pointing to the interrupted frame, and the
>> interrupted rip isn't saved anywhere that a naive unwinder can find
>> it.  (It's in pt_regs, but the rbp chain skips right over that.)
>>
>> We could change the entry code so that an interrupt / idtentry does:
>>
>> push pt_regs
>> push kernel_entry
>> push %rbp
>> mov %rsp, %rbp
>> call handler
>> pop %rbp
>> addq $8, %rsp
>>
>> or similar.  That would make it appear that the actual C handler was
>> caused by a dummy function "kernel_entry".  Now the unwinder would get
>> to kernel_entry, but it *still* wouldn't find its way to the calling
>> frame, which only solves part of the problem.  We could at least teach
>> the unwinder how kernel_entry works and let it decode pt_regs to
>> continue unwinding.  This would be nice, and I think it could work.
>>
>> I think I like this, except that, if it used a separate section, it
>> could potentially be faster, as, for each actual entry type, the
>> offset from the C handler frame to pt_regs is a foregone conclusion.
>> But this is pretty simple and performance is already abysmal in most
>> handlers.
>>
>> There's an added benefit to using a separate section, though: we could
>> also annotate the calls with what type of entry they were so the
>> unwinder could print it out nicely.
>>
>> I could be convinced either way.
>
> Ok, I took a stab at this.  See the patch below.
>
> In addition to annotating interrupt/exception pt_regs frames, I also
> annotated all the syscall pt_regs, for consistency.
>
> As you mentioned, it will affect performance a bit, but I think it will
> be insignificant.
>
> I think I like this approach better than putting the
> interrupt/idtentry's in a special section, because this is much more
> precise.  Especially now that I'm annotating pt_regs syscalls.
>
> Also I think with a few minor changes we could implement your idea of
> annotating the calls with what type of entry they are.  But I don't
> think that's really needed, because the name of the interrupt/idtentry
> is already on the stack trace.
>
> Before:
>
>   [<ffffffff8143c243>] dump_stack+0x85/0xc2
>   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
>   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
>   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
>   [<ffffffff81887058>] async_page_fault+0x28/0x30
>   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
>   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
>   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
>   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
>   [<ffffffff81286378>] vfs_read+0x98/0x140
>   [<ffffffff812878c8>] SyS_read+0x58/0xc0
>   [<ffffffff81884dbc>] entry_SYSCALL_64_fastpath+0x1f/0xbd
>
> After:
>
>   [<ffffffff8143c243>] dump_stack+0x85/0xc2
>   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
>   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
>   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
>   [<ffffffff81887422>] async_page_fault+0x32/0x40
>   [<ffffffff81887861>] pt_regs+0x1/0x10
>   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
>   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
>   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
>   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
>   [<ffffffff81286378>] vfs_read+0x98/0x140
>   [<ffffffff812878c8>] SyS_read+0x58/0xc0
>   [<ffffffff81884dc6>] entry_SYSCALL_64_fastpath+0x29/0xdb
>   [<ffffffff81887861>] pt_regs+0x1/0x10
>
> Note this example is with today's unwinder.  It could be made smarter to
> get the RIP from the pt_regs so the '?' could be removed from
> copy_page_to_iter().
>
> Thoughts?

I think we should do that.  The silly sample patch I sent you (or at
least that I think I sent you) did that, and it worked nicely.

>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 9a9e588..f54886a 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel is built with
>         .byte 0xf1
>         .endm
>
> +       /*
> +        * Create a stack frame for the saved pt_regs.  This allows frame
> +        * pointer based unwinders to find pt_regs on the stack.
> +        */
> +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> +#ifdef CONFIG_FRAME_POINTER
> +       pushq   \regs
> +       pushq   $pt_regs+1
> +       pushq   %rbp
> +       movq    %rsp, %rbp
> +#endif
> +       .endm

I don't love this part.  It's going to hurt performance, and, given
that we need to change the unwinder anyway to make it useful, let's
just emit a table somewhere in .rodata and use it directly.

> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -199,6 +199,7 @@ entry_SYSCALL_64_fastpath:
>         ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
>         movq    %r10, %rcx
>
> +       CREATE_PT_REGS_FRAME
>         /*
>          * This call instruction is handled specially in stub_ptregs_64.
>          * It might end up jumping to the slow path.  If it jumps, RAX
> @@ -207,6 +208,8 @@ entry_SYSCALL_64_fastpath:
>         call    *sys_call_table(, %rax, 8)
>  .Lentry_SYSCALL_64_after_fastpath_call:
>
> +       REMOVE_PT_REGS_FRAME
> +
>         movq    %rax, RAX(%rsp)
>  1:

This one is particular is quite hot, so I'd much rather avoid it.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-19 23:39                           ` Andy Lutomirski
@ 2016-05-20 14:05                             ` Josh Poimboeuf
  2016-05-20 15:41                               ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-20 14:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote:
> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > Note this example is with today's unwinder.  It could be made smarter to
> > get the RIP from the pt_regs so the '?' could be removed from
> > copy_page_to_iter().
> >
> > Thoughts?
> 
> I think we should do that.  The silly sample patch I sent you (or at
> least that I think I sent you) did that, and it worked nicely.

Yeah, we can certainly do something similar to make the unwinder
smarter.  It should be very simple with this approach: if it finds the
pt_regs() function on the stack, the (struct pt_regs *) pointer will be
right after it.
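
Roughly something like this in a frame-pointer unwinder (untested
sketch, not the actual code -- the helper name and the size check are
just illustrative):

#include <linux/ptrace.h>	/* struct pt_regs */

/* dummy marker function defined in entry_64.S */
extern void pt_regs(void);

/*
 * CREATE_PT_REGS_FRAME pushes the regs pointer, then $pt_regs+1, then
 * the old rbp.  So for a frame whose return address lands inside
 * pt_regs(), the saved (struct pt_regs *) sits one slot above it.
 */
static struct pt_regs *frame_pt_regs(unsigned long *bp)
{
	unsigned long ret = bp[1];		/* return address above saved rbp */

	if (ret <= (unsigned long)pt_regs ||
	    ret > (unsigned long)pt_regs + 0x10)
		return NULL;			/* not a pt_regs marker frame */

	return (struct pt_regs *)bp[2];		/* pushed just before the marker */
}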

> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> > index 9a9e588..f54886a 100644
> > --- a/arch/x86/entry/calling.h
> > +++ b/arch/x86/entry/calling.h
> > @@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel is built with
> >         .byte 0xf1
> >         .endm
> >
> > +       /*
> > +        * Create a stack frame for the saved pt_regs.  This allows frame
> > +        * pointer based unwinders to find pt_regs on the stack.
> > +        */
> > +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> > +#ifdef CONFIG_FRAME_POINTER
> > +       pushq   \regs
> > +       pushq   $pt_regs+1
> > +       pushq   %rbp
> > +       movq    %rsp, %rbp
> > +#endif
> > +       .endm
> 
> I don't love this part.  It's going to hurt performance, and, given
> that we need to change the unwinder anyway to make it useful, let's
> just emit a table somewhere in .rodata and use it directly.

I'm not sure about the idea of a table.  I get the feeling it would add
more complexity to both the entry code and the unwinder.  How would you
specify the pt_regs location when it's on a different stack?  (See the
interrupt macro: non-nested interrupts will place pt_regs on the task
stack before switching to the irq stack.)

If you're worried about performance, I can remove the syscall
annotations.  They're optional anyway, since the pt_regs is always at
the same place on the stack for syscalls.

I think three extra pushes wouldn't be a performance issue for
interrupts/exceptions.  And they'll go away when we finally bury
CONFIG_FRAME_POINTER.

Here's the same patch without syscall annotations:


diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..f54886a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel is built with
 	.byte 0xf1
 	.endm
 
+	/*
+	 * Create a stack frame for the saved pt_regs.  This allows frame
+	 * pointer based unwinders to find pt_regs on the stack.
+	 */
+	.macro CREATE_PT_REGS_FRAME regs=%rsp
+#ifdef CONFIG_FRAME_POINTER
+	pushq	\regs
+	pushq	$pt_regs+1
+	pushq	%rbp
+	movq	%rsp, %rbp
+#endif
+	.endm
+
+	.macro REMOVE_PT_REGS_FRAME
+#ifdef CONFIG_FRAME_POINTER
+	popq	%rbp
+	addq	$0x10, %rsp
+#endif
+	.endm
+
+	.macro CALL_HANDLER handler regs=%rsp
+	CREATE_PT_REGS_FRAME \regs
+	call	\handler
+	REMOVE_PT_REGS_FRAME
+	.endm
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1..93bc8f0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -468,7 +468,7 @@ END(irq_entries_start)
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
 
-	call	\func	/* rdi points to pt_regs */
+	CALL_HANDLER \func regs=%rdi
 	.endm
 
 	/*
@@ -495,7 +495,7 @@ ret_from_intr:
 	/* Interrupt came from user space */
 GLOBAL(retint_user)
 	mov	%rsp,%rdi
-	call	prepare_exit_to_usermode
+	CALL_HANDLER prepare_exit_to_usermode
 	TRACE_IRQS_IRETQ
 	SWAPGS
 	jmp	restore_regs_and_iret
@@ -509,7 +509,7 @@ retint_kernel:
 	jnc	1f
 0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
 	jnz	1f
-	call	preempt_schedule_irq
+	CALL_HANDLER preempt_schedule_irq
 	jmp	0b
 1:
 #endif
@@ -688,8 +688,6 @@ ENTRY(\sym)
 	.endif
 	.endif
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
-
 	.if \has_error_code
 	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
 	movq	$-1, ORIG_RAX(%rsp)		/* no syscall to restart */
@@ -701,7 +699,8 @@ ENTRY(\sym)
 	subq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
 	.endif
 
-	call	\do_sym
+	movq	%rsp, %rdi			/* pt_regs pointer */
+	CALL_HANDLER \do_sym
 
 	.if \shift_ist != -1
 	addq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
@@ -728,8 +727,6 @@ ENTRY(\sym)
 	call	sync_regs
 	movq	%rax, %rsp			/* switch stack */
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
-
 	.if \has_error_code
 	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
 	movq	$-1, ORIG_RAX(%rsp)		/* no syscall to restart */
@@ -737,7 +734,8 @@ ENTRY(\sym)
 	xorl	%esi, %esi			/* no error code */
 	.endif
 
-	call	\do_sym
+	movq	%rsp, %rdi			/* pt_regs pointer */
+	CALL_HANDLER \do_sym
 
 	jmp	error_exit			/* %ebx: no swapgs flag */
 	.endif
@@ -1174,7 +1172,7 @@ ENTRY(nmi)
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
-	call	do_nmi
+	CALL_HANDLER do_nmi
 
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
@@ -1387,7 +1385,7 @@ end_repeat_nmi:
 	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
-	call	do_nmi
+	CALL_HANDLER do_nmi
 
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
@@ -1423,3 +1421,11 @@ ENTRY(ignore_sysret)
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
+
+/* fake function which allows stack unwinders to detect pt_regs frames */
+#ifdef CONFIG_FRAME_POINTER
+ENTRY(pt_regs)
+	nop
+	nop
+ENDPROC(pt_regs)
+#endif /* CONFIG_FRAME_POINTER */

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-20 14:05                             ` Josh Poimboeuf
@ 2016-05-20 15:41                               ` Andy Lutomirski
  2016-05-20 16:41                                 ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-05-20 15:41 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote:
>> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > Note this example is with today's unwinder.  It could be made smarter to
>> > get the RIP from the pt_regs so the '?' could be removed from
>> > copy_page_to_iter().
>> >
>> > Thoughts?
>>
>> I think we should do that.  The silly sample patch I sent you (or at
>> least that I think I sent you) did that, and it worked nicely.
>
> Yeah, we can certainly do something similar to make the unwinder
> smarter.  It should be very simple with this approach: if it finds the
> pt_regs() function on the stack, the (struct pt_regs *) pointer will be
> right after it.

That seems barely easier than checking if it finds a function in
.entry that's marked on the stack, and the latter has no runtime cost.

>
>> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
>> > index 9a9e588..f54886a 100644
>> > --- a/arch/x86/entry/calling.h
>> > +++ b/arch/x86/entry/calling.h
>> > @@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel is built with
>> >         .byte 0xf1
>> >         .endm
>> >
>> > +       /*
>> > +        * Create a stack frame for the saved pt_regs.  This allows frame
>> > +        * pointer based unwinders to find pt_regs on the stack.
>> > +        */
>> > +       .macro CREATE_PT_REGS_FRAME regs=%rsp
>> > +#ifdef CONFIG_FRAME_POINTER
>> > +       pushq   \regs
>> > +       pushq   $pt_regs+1
>> > +       pushq   %rbp
>> > +       movq    %rsp, %rbp
>> > +#endif
>> > +       .endm
>>
>> I don't love this part.  It's going to hurt performance, and, given
>> that we need to change the unwinder anyway to make it useful, let's
>> just emit a table somewhere in .rodata and use it directly.
>
> I'm not sure about the idea of a table.  I get the feeling it would add
> more complexity to both the entry code and the unwinder.  How would you
> specify the pt_regs location when it's on a different stack?  (See the
> interrupt macro: non-nested interrupts will place pt_regs on the task
> stack before switching to the irq stack.)

Hmm.  I need to think about the interrupt stack case a bit.  Although
the actual top of the interrupt stack has a nearly fixed format, and I
have old patches to clean it up and make it actually be fixed.  I'll
try to dust those off and resend them soon.

>
> If you're worried about performance, I can remove the syscall
> annotations.  They're optional anyway, since the pt_regs is always at
> the same place on the stack for syscalls.
>
> I think three extra pushes wouldn't be a performance issue for
> interrupts/exceptions.  And they'll go away when we finally bury
> CONFIG_FRAME_POINTER.

I bet we'll always need to support CONFIG_FRAME_POINTER for some
embedded systems.

I'll play with this a bit.

--Andy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-20 15:41                               ` Andy Lutomirski
@ 2016-05-20 16:41                                 ` Josh Poimboeuf
  2016-05-20 16:59                                   ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-20 16:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Fri, May 20, 2016 at 08:41:00AM -0700, Andy Lutomirski wrote:
> On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote:
> >> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > Note this example is with today's unwinder.  It could be made smarter to
> >> > get the RIP from the pt_regs so the '?' could be removed from
> >> > copy_page_to_iter().
> >> >
> >> > Thoughts?
> >>
> >> I think we should do that.  The silly sample patch I sent you (or at
> >> least that I think I sent you) did that, and it worked nicely.
> >
> > Yeah, we can certainly do something similar to make the unwinder
> > smarter.  It should be very simple with this approach: if it finds the
> > pt_regs() function on the stack, the (struct pt_regs *) pointer will be
> > right after it.
> 
> That seems barely easier than checking if it finds a function in
> .entry that's marked on the stack,

Don't forget you'd also have to look up the function's pt_regs offset in
the table.

> and the latter has no runtime cost.

Well, I had been assuming the three extra pushes and one extra pop for
interrupts would be negligible, but I'm definitely no expert there.  Any
idea how I can measure it?
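
(For the syscall side I could probably just time a tight getpid() loop
before and after the patch -- quick-and-dirty userspace sketch below,
purely illustrative.  It's the interrupt path I don't know how to
measure well.)

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <x86intrin.h>

#define ITERS 10000000ULL

int main(void)
{
	unsigned long long start, end, i;

	start = __rdtsc();
	for (i = 0; i < ITERS; i++)
		syscall(SYS_getpid);	/* bypass glibc's cached getpid() */
	end = __rdtsc();

	printf("cycles per syscall: %llu\n", (end - start) / ITERS);
	return 0;
}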

> > I'm not sure about the idea of a table.  I get the feeling it would add
> > more complexity to both the entry code and the unwinder.  How would you
> > specify the pt_regs location when it's on a different stack?  (See the
> > interrupt macro: non-nested interrupts will place pt_regs on the task
> > stack before switching to the irq stack.)
> 
> Hmm.  I need to think about the interrupt stack case a bit.  Although
> the actual top of the interrupt stack has a nearly fixed format, and I
> have old patches to clean it up and make it actually be fixed.  I'll
> try to dust those off and resend them soon.

How would that solve the problem?  Would pt_regs be moved or copied to
the irq stack?

> > If you're worried about performance, I can remove the syscall
> > annotations.  They're optional anyway, since the pt_regs is always at
> > the same place on the stack for syscalls.
> >
> > I think three extra pushes wouldn't be a performance issue for
> > interrupts/exceptions.  And they'll go away when we finally bury
> > CONFIG_FRAME_POINTER.
> 
> I bet we'll always need to support CONFIG_FRAME_POINTER for some
> embedded systems.

Yeah, probably.

> I'll play with this a bit.

Thanks, looking forward to seeing what you come up with.

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-20 16:41                                 ` Josh Poimboeuf
@ 2016-05-20 16:59                                   ` Andy Lutomirski
  2016-05-20 17:49                                     ` Josh Poimboeuf
  2016-05-23 23:02                                     ` Jiri Kosina
  0 siblings, 2 replies; 121+ messages in thread
From: Andy Lutomirski @ 2016-05-20 16:59 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Fri, May 20, 2016 at 9:41 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, May 20, 2016 at 08:41:00AM -0700, Andy Lutomirski wrote:
>> On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote:
>> >> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > Note this example is with today's unwinder.  It could be made smarter to
>> >> > get the RIP from the pt_regs so the '?' could be removed from
>> >> > copy_page_to_iter().
>> >> >
>> >> > Thoughts?
>> >>
>> >> I think we should do that.  The silly sample patch I sent you (or at
>> >> least that I think I sent you) did that, and it worked nicely.
>> >
>> > Yeah, we can certainly do something similar to make the unwinder
>> > smarter.  It should be very simple with this approach: if it finds the
>> > pt_regs() function on the stack, the (struct pt_regs *) pointer will be
>> > right after it.
>>
>> That seems barely easier than checking if it finds a function in
>> .entry that's marked on the stack,
>
> Don't forget you'd also have to look up the function's pt_regs offset in
> the table.
>
>> and the latter has no runtime cost.
>
> Well, I had been assuming the three extra pushes and one extra pop for
> interrupts would be negligible, but I'm definitely no expert there.  Any
> idea how I can measure it?

I think it would be negligible, at least for interrupts, since
interrupts are already extremely expensive.  But I don't love adding
assembly code that makes them even slower.  The real thing I dislike
about this approach is that it's not a normal stack frame, so you need
code in the unwinder to unwind through it correctly, which makes me
think that you're not saving much complexity by adding the pushes.
But maybe I'm wrong.

>
>> > I'm not sure about the idea of a table.  I get the feeling it would add
>> > more complexity to both the entry code and the unwinder.  How would you
>> > specify the pt_regs location when it's on a different stack?  (See the
>> > interrupt macro: non-nested interrupts will place pt_regs on the task
>> > stack before switching to the irq stack.)
>>
>> Hmm.  I need to think about the interrupt stack case a bit.  Although
>> the actual top of the interrupt stack has a nearly fixed format, and I
>> have old patches to clean it up and make it actually be fixed.  I'll
>> try to dust those off and resend them soon.
>
> How would that solve the problem?  Would pt_regs be moved or copied to
> the irq stack?

Hmm.

Maybe the right way would be to unwind off the IRQ stack in two steps.
Step one would be to figure out that you're on the IRQ stack and pop
your way off it.  Step two would be to find pt_regs.

But that could be rather nasty to implement.  Maybe what we actually
want to do is this:

First, apply this thing:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_ist&id=2231ec7e0bcc1a2bc94a17081511ab54cc6badd1

that gives us a common format for the IRQ stack.

Second, use my idea of making a table of offsets to pt_regs, so we'd
have, roughly:

ENTER_IRQ_STACK old_rsp=%r11
call __do_softirq
ANNOTATE_IRQSTACK_PTREGS_CALL offset=0
LEAVE_IRQ_STACK

the idea here is that offset=0 means that there is no offset beyond
that implied by ENTER_IRQ_STACK.  What actually gets written to the
table is offset 8, because ENTER_IRQ_STACK itself does one push.

So far, this has no runtime overhead at all.

Then, in the unwinder, the logic becomes:

If the return address is annotated in the ptregs entry table, look up
the offset.  If the offset is in bounds, then use it.  If the offset
is out of bounds and we're on the IRQ stack, then unwind the
ENTER_IRQ_STACK as well.
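
In C the lookup would be something like this (hand-wavy sketch; the
table layout and the find_ptregs_entry() helper are made up):

#include <linux/ptrace.h>	/* struct pt_regs */

struct ptregs_annotation {
	unsigned long ret_addr;	/* address just after the annotated call */
	unsigned long offset;	/* bytes from the handler frame to pt_regs */
};

/* returns pt_regs for an annotated frame, or NULL if unknown / off-stack */
static struct pt_regs *annotated_pt_regs(unsigned long ret_addr,
					 unsigned long frame,
					 unsigned long stack_top)
{
	const struct ptregs_annotation *a = find_ptregs_entry(ret_addr);

	if (!a)
		return NULL;
	if (frame + a->offset >= stack_top)
		return NULL;	/* out of bounds: unwind ENTER_IRQ_STACK first */

	return (struct pt_regs *)(frame + a->offset);
}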

Does that seem reasonable?  I can try to implement it and see what it
looks like.

--Andy

>
>> > If you're worried about performance, I can remove the syscall
>> > annotations.  They're optional anyway, since the pt_regs is always at
>> > the same place on the stack for syscalls.
>> >
>> > I think three extra pushes wouldn't be a performance issue for
>> > interrupts/exceptions.  And they'll go away when we finally bury
>> > CONFIG_FRAME_POINTER.
>>
>> I bet we'll always need to support CONFIG_FRAME_POINTER for some
>> embedded systems.
>
> Yeah, probably.
>
>> I'll play with this a bit.
>
> Thanks, looking forward to seeing what you come up with.
>
> --
> Josh



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-20 16:59                                   ` Andy Lutomirski
@ 2016-05-20 17:49                                     ` Josh Poimboeuf
  2016-05-23 23:02                                     ` Jiri Kosina
  1 sibling, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-20 17:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Fri, May 20, 2016 at 09:59:38AM -0700, Andy Lutomirski wrote:
> On Fri, May 20, 2016 at 9:41 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, May 20, 2016 at 08:41:00AM -0700, Andy Lutomirski wrote:
> >> On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote:
> >> >> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> > Note this example is with today's unwinder.  It could be made smarter to
> >> >> > get the RIP from the pt_regs so the '?' could be removed from
> >> >> > copy_page_to_iter().
> >> >> >
> >> >> > Thoughts?
> >> >>
> >> >> I think we should do that.  The silly sample patch I sent you (or at
> >> >> least that I think I sent you) did that, and it worked nicely.
> >> >
> >> > Yeah, we can certainly do something similar to make the unwinder
> >> > smarter.  It should be very simple with this approach: if it finds the
> >> > pt_regs() function on the stack, the (struct pt_regs *) pointer will be
> >> > right after it.
> >>
> >> That seems barely easier than checking if it finds a function in
> >> .entry that's marked on the stack,
> >
> > Don't forget you'd also have to look up the function's pt_regs offset in
> > the table.
> >
> >> and the latter has no runtime cost.
> >
> > Well, I had been assuming the three extra pushes and one extra pop for
> > interrupts would be negligible, but I'm definitely no expert there.  Any
> > idea how I can measure it?
> 
> I think it would be negligible, at least for interrupts, since
> interrupts are already extremely expensive.  But I don't love adding
> assembly code that makes them even slower.

I just don't find that very convincing :-)  If the performance impact is
negligible, we should ignore it.  We should instead be focusing on
finding the simplest solution.

> The real thing I dislike about this approach is that it's not a normal
> stack frame, so you need code in the unwinder to unwind through it
> correctly, which makes me think that you're not saving much complexity
> by adding the pushes.  But maybe I'm wrong.

Hm, I view it differently.  Sure the stack frame is a bit unusual, but
as far as unwinders go, it _is_ normal.  So even a stock unwinder can
show the user that a pt_regs() -- or interrupt_frame() or whatever we
call it -- function was called.  Just knowing that an interrupt occurred
could be useful information, even without the contents of RIP.

> >> > I'm not sure about the idea of a table.  I get the feeling it would add
> >> > more complexity to both the entry code and the unwinder.  How would you
> >> > specify the pt_regs location when it's on a different stack?  (See the
> >> > interrupt macro: non-nested interrupts will place pt_regs on the task
> >> > stack before switching to the irq stack.)
> >>
> >> Hmm.  I need to think about the interrupt stack case a bit.  Although
> >> the actual top of the interrupt stack has a nearly fixed format, and I
> >> have old patches to clean it up and make it actually be fixed.  I'll
> >> try to dust those off and resend them soon.
> >
> > How would that solve the problem?  Would pt_regs be moved or copied to
> > the irq stack?
> 
> Hmm.
> 
> Maybe the right way would be to unwind off the IRQ stack in two steps.
> Step one would be to figure out that you're on the IRQ stack and pop
> your way off it.  Step two would be to find pt_regs.
> 
> But that could be rather nasty to implement.  Maybe what we actually
> want to do is this:
> 
> First, apply this thing:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_ist&id=2231ec7e0bcc1a2bc94a17081511ab54cc6badd1
> 
> that gives us a common format for the IRQ stack.
> 
> Second, use my idea of making a table of offsets to pt_regs, so we'd
> have, roughly:
> 
> ENTER_IRQ_STACK old_rsp=%r11
> call __do_softirq
> ANNOTATE_IRQSTACK_PTREGS_CALL offset=0
> LEAVE_IRQ_STACK
> 
> the idea here is that offset=0 means that there is no offset beyond
> that implied by ENTER_IRQ_STACK.  What actually gets written to the
> table is offset 8, because ENTER_IRQ_STACK itself does one push.
> 
> So far, this has no runtime overhead at all.
> 
> Then, in the unwinder, the logic becomes:
> 
> If the return address is annotated in the ptregs entry table, look up
> the offset.  If the offset is in bounds, then use it.  If the offset
> is out of bounds and we're on the IRQ stack, then unwind the
> ENTER_IRQ_STACK as well.
> 
> Does that seem reasonable?  I can try to implement it and see what it
> looks like.

To be honest I'm not crazy about it.  The ANNOTATE_IRQSTACK_PTREGS_CALL
macro needs knowledge about the implementation of ENTER_IRQ_STACK and
how many pushes it does.  I think that makes the entry code more complex
and harder to understand than my patch.

Also the unwinders would need to be quite a bit more complex, and would
need to know more of the intimate details of the irq stack.  That's
probably a less important consideration than entry code complexity, but
it's still significant when you consider the fact that the kernel has
many unwinders, both in-kernel and out-of-kernel.

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE: livepatch: change to a per-task consistency model
  2016-05-18 20:22         ` Jiri Kosina
@ 2016-05-23  9:42             ` David Laight
  0 siblings, 0 replies; 121+ messages in thread
From: David Laight @ 2016-05-23  9:42 UTC (permalink / raw)
  To: 'Jiri Kosina', Josh Poimboeuf
  Cc: linux-s390, Jessica Yu, Vojtech Pavlik, Petr Mladek,
	Peter Zijlstra, x86, Heiko Carstens, linux-kernel, Ingo Molnar,
	Andy Lutomirski, live-patching, Jiri Slaby, Miroslav Benes,
	linuxppc-dev, Chris J Arges

From: Jiri Kosina
> Sent: 18 May 2016 21:23
> On Wed, 18 May 2016, Josh Poimboeuf wrote:
> 
> > Yeah, I think this situation -- a task sleeping on an affected function
> > in uninterruptible state for a long period of time -- would be
> > exceedingly rare and not something we need to worry about for now.
> 
> Plus in case task'd be in TASK_UNINTERRUPTIBLE for more than 120s, hung
> task detector would trigger anyway.

Related, please can we have a flag for the sleep and/or process so that
an uninterruptible sleep doesn't trigger the 'hung task' detector
and also stops the process counting towards the 'load average'.

In particular some kernel threads are not signalable, and do not
want to be woken by signals (they exit on a specific request).

	David

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE: livepatch: change to a per-task consistency model
  2016-05-23  9:42             ` David Laight
  (?)
@ 2016-05-23 18:44             ` Jiri Kosina
  2016-05-24 15:06                 ` David Laight
  -1 siblings, 1 reply; 121+ messages in thread
From: Jiri Kosina @ 2016-05-23 18:44 UTC (permalink / raw)
  To: David Laight
  Cc: Josh Poimboeuf, linux-s390, Jessica Yu, Vojtech Pavlik,
	Petr Mladek, Peter Zijlstra, x86, Heiko Carstens, linux-kernel,
	Ingo Molnar, Andy Lutomirski, live-patching, Jiri Slaby,
	Miroslav Benes, linuxppc-dev, Chris J Arges

On Mon, 23 May 2016, David Laight wrote:

> Related, please can we have a flag for the sleep and/or process so that
> an uninterruptible sleep doesn't trigger the 'hung task' detector

TASK_KILLABLE

> and also stops the process counting towards the 'load average'.

TASK_NOLOAD
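
e.g. (untested sketch, names made up) instead of a plain
TASK_UNINTERRUPTIBLE sleep, something like:

#include <linux/sched.h>

/*
 * Killable wait that is skipped by the hung task detector and not
 * counted in the load average.
 */
static void wait_for_work(void)
{
	set_current_state(TASK_KILLABLE | TASK_NOLOAD);
	schedule();
	__set_current_state(TASK_RUNNING);
}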

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-19 23:15                         ` Josh Poimboeuf
  2016-05-19 23:39                           ` Andy Lutomirski
@ 2016-05-23 21:34                           ` Andy Lutomirski
  2016-05-24  2:28                             ` Josh Poimboeuf
  1 sibling, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-05-23 21:34 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>> >> >
>> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > > >> I suppose we could try to rejigger the code so that rbp points to
>> >> > > >> pt_regs or similar.
>> >> > > >
>> >> > > > I think we should avoid doing something like that because it would break
>> >> > > > gdb and all the other unwinders who don't know about it.
>> >> > >
>> >> > > How so?
>> >> > >
>> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> >> > > the pt_regs.  Currently it points to something stale (which the
>> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> >> > > is the next thing on the stack, so just doing the section thing would
>> >> > > work.
>> >> >
>> >> > Yes, rbp is meaningless on the entry from user space.  But if an
>> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> >> > past the nested entry frame and keep going until it gets to the original
>> >> > entry.
>> >>
>> >> Yes.
>> >>
>> >> It would be nice if we could do better, though, and actually notice
>> >> the pt_regs and identify the entry.  For example, I'd love to see
>> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> >> crash.
>> >>
>> >> Also, I think that just following rbp links will lose the
>> >> actual function that took the page fault (or whatever function
>> >> pt_regs->ip actually points to).
>> >
>> > Hm.  I think we could fix all that in a more standard way.  Whenever a
>> > new pt_regs frame gets saved on entry, we could also create a new stack
>> > frame which points to a fake kernel_entry() function.  That would tell
>> > the unwinder there's a pt_regs frame without otherwise breaking frame
>> > pointers across the frame.
>> >
>> > Then I guess we wouldn't need my other solution of putting the idt
>> > entries in a special section.
>> >
>> > How does that sound?
>>
>> Let me try to understand.
>>
>> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
>> points to (prev rbp, prev rip) on the stack, and you can follow the
>> chain back.  Right now, on a user access page fault or similar, we
>> have rbp (probably) pointing to the interrupted frame, and the
>> interrupted rip isn't saved anywhere that a naive unwinder can find
>> it.  (It's in pt_regs, but the rbp chain skips right over that.)
>>
>> We could change the entry code so that an interrupt / idtentry does:
>>
>> push pt_regs
>> push kernel_entry
>> push %rbp
>> mov %rsp, %rbp
>> call handler
>> pop %rbp
>> addq $8, %rsp
>>
>> or similar.  That would make it appear that the actual C handler was
>> caused by a dummy function "kernel_entry".  Now the unwinder would get
>> to kernel_entry, but it *still* wouldn't find its way to the calling
>> frame, which only solves part of the problem.  We could at least teach
>> the unwinder how kernel_entry works and let it decode pt_regs to
>> continue unwinding.  This would be nice, and I think it could work.
>>
>> I think I like this, except that, if it used a separate section, it
>> could potentially be faster, as, for each actual entry type, the
>> offset from the C handler frame to pt_regs is a foregone conclusion.
>> But this is pretty simple and performance is already abysmal in most
>> handlers.
>>
>> There's an added benefit to using a separate section, though: we could
>> also annotate the calls with what type of entry they were so the
>> unwinder could print it out nicely.
>>
>> I could be convinced either way.
>
> Ok, I took a stab at this.  See the patch below.
>
> In addition to annotating interrupt/exception pt_regs frames, I also
> annotated all the syscall pt_regs, for consistency.
>
> As you mentioned, it will affect performance a bit, but I think it will
> be insignificant.
>
> I think I like this approach better than putting the
> interrupt/idtentry's in a special section, because this is much more
> precise.  Especially now that I'm annotating pt_regs syscalls.
>
> Also I think with a few minor changes we could implement your idea of
> annotating the calls with what type of entry they are.  But I don't
> think that's really needed, because the name of the interrupt/idtentry
> is already on the stack trace.
>
> Before:
>
>   [<ffffffff8143c243>] dump_stack+0x85/0xc2
>   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
>   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
>   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
>   [<ffffffff81887058>] async_page_fault+0x28/0x30
>   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
>   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
>   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
>   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
>   [<ffffffff81286378>] vfs_read+0x98/0x140
>   [<ffffffff812878c8>] SyS_read+0x58/0xc0
>   [<ffffffff81884dbc>] entry_SYSCALL_64_fastpath+0x1f/0xbd
>
> After:
>
>   [<ffffffff8143c243>] dump_stack+0x85/0xc2
>   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
>   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
>   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
>   [<ffffffff81887422>] async_page_fault+0x32/0x40
>   [<ffffffff81887861>] pt_regs+0x1/0x10
>   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
>   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
>   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
>   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
>   [<ffffffff81286378>] vfs_read+0x98/0x140
>   [<ffffffff812878c8>] SyS_read+0x58/0xc0
>   [<ffffffff81884dc6>] entry_SYSCALL_64_fastpath+0x29/0xdb
>   [<ffffffff81887861>] pt_regs+0x1/0x10
>
> Note this example is with today's unwinder.  It could be made smarter to
> get the RIP from the pt_regs so the '?' could be removed from
> copy_page_to_iter().
>
> Thoughts?

Maybe I'm coming around to liking this idea.

In an ideal world (DWARF support, high-quality unwinder, nice pretty
printer, etc), unwinding across a kernel exception would look like:

 - some_func
 - some_other_func
 - do_page_fault
 - page_fault

After page_fault, the next unwind step takes us to the faulting RIP
(faulting_func) and reports that all GPRs are known.  It should
probably learn this fact from DWARF if DWARF is available, instead of
directly decoding pt_regs, due to a few funny cases in which pt_regs
may be incomplete.  A nice pretty printer could now print all the
regs.

 - faulting_func
 - etc.

For this to work, we don't actually need the unwinder to explicitly
know where pt_regs is.

Food for thought, though: if user code does SYSENTER with TF set,
you'll end up with partial pt_regs.  There's nothing the kernel can do
about it.  DWARF will handle it without any fanfare, but I don't know
if it's going to cause trouble for you pre-DWARF.

I'm also not sure it makes sense to apply this before the unwinder
that can consume it is ready.  Maybe, if it would be consistent with
your plans, it would make sense to rewrite the unwinder first, then
apply this and teach live patching to use the new unwinder, and *then*
add DWARF support?


>
> +       /*
> +        * Create a stack frame for the saved pt_regs.  This allows frame
> +        * pointer based unwinders to find pt_regs on the stack.
> +        */
> +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> +#ifdef CONFIG_FRAME_POINTER
> +       pushq   \regs
> +       pushq   $pt_regs+1

Why the +1?

> +       pushq   %rbp
> +       movq    %rsp, %rbp
> +#endif
> +       .endm

I keep wanting this to be only two pushes and to fudge rbp to make it
work, but I don't see a good way.  But let's call it
CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
nested_frame or similar.

> +
> +       .macro CALL_HANDLER handler regs=%rsp
> +       CREATE_PT_REGS_FRAME \regs
> +       call    \handler
> +       REMOVE_PT_REGS_FRAME
> +       .endm

I think I'd rather open-code this everywhere.  It'll make it clearer
what's going on.

> @@ -199,6 +199,7 @@ entry_SYSCALL_64_fastpath:
>         ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
>         movq    %r10, %rcx
>
> +       CREATE_PT_REGS_FRAME
>         /*
>          * This call instruction is handled specially in stub_ptregs_64.
>          * It might end up jumping to the slow path.  If it jumps, RAX
> @@ -207,6 +208,8 @@ entry_SYSCALL_64_fastpath:
>         call    *sys_call_table(, %rax, 8)
>  .Lentry_SYSCALL_64_after_fastpath_call:
>
> +       REMOVE_PT_REGS_FRAME
> +

As discussed, let's get rid of this bit.

>         movq    %rax, RAX(%rsp)
>  1:
>
> @@ -238,14 +241,14 @@ entry_SYSCALL_64_fastpath:
>         ENABLE_INTERRUPTS(CLBR_NONE)
>         SAVE_EXTRA_REGS
>         movq    %rsp, %rdi
> -       call    syscall_return_slowpath /* returns with IRQs disabled */
> +       CALL_HANDLER syscall_return_slowpath    /* returns with IRQs disabled */

and this.

>         jmp     return_from_SYSCALL_64
>
>  entry_SYSCALL64_slow_path:
>         /* IRQs are off. */
>         SAVE_EXTRA_REGS
>         movq    %rsp, %rdi
> -       call    do_syscall_64           /* returns with IRQs disabled */
> +       CALL_HANDLER do_syscall_64      /* returns with IRQs disabled */
>
>  return_from_SYSCALL_64:
>         RESTORE_EXTRA_REGS
> @@ -344,6 +347,7 @@ ENTRY(stub_ptregs_64)
>         DISABLE_INTERRUPTS(CLBR_NONE)
>         TRACE_IRQS_OFF
>         popq    %rax
> +       REMOVE_PT_REGS_FRAME

This will be less mysterious if you open-code the macros.  Also, I
think you have to, since return_from_SYSCALL_64 needs to be directly
after the actual call instruction.  (But if you get rid of the hunks
above, I think this goes away too, so this may be moot.)

>  1:
> @@ -372,7 +376,7 @@ END(ptregs_\func)
>  ENTRY(ret_from_fork)
>         LOCK ; btr $TIF_FORK, TI_flags(%r8)
>
> -       call    schedule_tail                   /* rdi: 'prev' task parameter */
> +       CALL_HANDLER schedule_tail              /* rdi: 'prev' task parameter */
>

If you end up making the unwinder smart enough to notice that rsp is
just below pt_regs, then this can go away.  It's harmless, though.

>         testb   $3, CS(%rsp)                    /* from kernel_thread? */
>         jnz     1f
> @@ -385,8 +389,9 @@ ENTRY(ret_from_fork)
>          * parameter to be passed in RBP.  The called function is permitted
>          * to call do_execve and thereby jump to user mode.
>          */
> +       movq    RBX(%rsp), %rbx
>         movq    RBP(%rsp), %rdi
> -       call    *RBX(%rsp)
> +       CALL_HANDLER *%rbx

Does using a register like this actually save any code size?
Admittedly, it's a bit cleaner.

> +
> +/* fake function which allows stack unwinders to detect pt_regs frames */
> +#ifdef CONFIG_FRAME_POINTER
> +ENTRY(pt_regs)
> +       nop
> +       nop
> +ENDPROC(pt_regs)
> +#endif /* CONFIG_FRAME_POINTER */

Why is this two bytes long?  Is there some reason it has to be more
than one byte?

--Andy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-20 16:59                                   ` Andy Lutomirski
  2016-05-20 17:49                                     ` Josh Poimboeuf
@ 2016-05-23 23:02                                     ` Jiri Kosina
  2016-05-24  1:42                                       ` Andy Lutomirski
  1 sibling, 1 reply; 121+ messages in thread
From: Jiri Kosina @ 2016-05-23 23:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Fri, 20 May 2016, Andy Lutomirski wrote:

> I think it would be negligible, at least for interrupts, since
> interrupts are already extremely expensive.  But I don't love adding
> assembly code that makes them even slower.  The real thing I dislike
> about this approach is that it's not a normal stack frame, so you need
> code in the unwinder to unwind through it correctly, which makes me
> think that you're not saving much complexity by adding the pushes.

I fail to see what is so special about the stack frame; it's in fact 
pretty normal.

It has added semantic value for "those who know", but the others will 
(pretty much correctly) consider it to be a stack frame from a function 
call, and be done with it.

What am I missing?

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-23 23:02                                     ` Jiri Kosina
@ 2016-05-24  1:42                                       ` Andy Lutomirski
  0 siblings, 0 replies; 121+ messages in thread
From: Andy Lutomirski @ 2016-05-24  1:42 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Josh Poimboeuf, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Mon, May 23, 2016 at 4:02 PM, Jiri Kosina <jikos@kernel.org> wrote:
> On Fri, 20 May 2016, Andy Lutomirski wrote:
>
>> I think it would be negligible, at least for interrupts, since
>> interrupts are already extremely expensive.  But I don't love adding
>> assembly code that makes them even slower.  The real thing I dislike
>> about this approach is that it's not a normal stack frame, so you need
>> code in the unwinder to unwind through it correctly, which makes me
>> think that you're not saving much complexity by adding the pushes.
>
> I fail to see what is so special about the stack frame; it's in fact
> pretty normal.
>
> It has added semantic value for "those who know", but the others will
> (pretty much correctly) consider it to be a stack frame from a function
> call, and be done with it.
>
> What am I missing?

In Josh's code, the stack looks like:

...
interrupted frame
pt_regs
pointer to pt_regs
address of pt_regs dummy function
rbp
handler frame

A naive unwinder won't unwind this correctly, as there's no link back
to the interrupted frame's RIP.  If the layout changed to:


...
interrupted frame
pt_regs
interrupted RIP
rbp
handler frame

then I think it would unwind correctly, but the pt_regs would be
invisible, which is IMO a bit unfortunate.  It could also be (I
think):


...
interrupted frame
pt_regs
interrupted rbp
interrupted RIP
pointer to pt_regs
address of pt_regs dummy function
pointer to "interrupted RIP" stack slot
handler frame

but now this is *five* pushes for the dummy frame, which I think is
getting a bit out of hand.

--Andy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-23 21:34                           ` Andy Lutomirski
@ 2016-05-24  2:28                             ` Josh Poimboeuf
  2016-05-24  3:52                               ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-05-24  2:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, linuxppc-dev,
	Jessica Yu, Petr Mladek, Jiri Slaby, Vojtech Pavlik,
	linux-kernel, Miroslav Benes, Peter Zijlstra

On Mon, May 23, 2016 at 02:34:56PM -0700, Andy Lutomirski wrote:
> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> >> >> >
> >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> >> > > >> pt_regs or similar.
> >> >> > > >
> >> >> > > > I think we should avoid doing something like that because it would break
> >> >> > > > gdb and all the other unwinders who don't know about it.
> >> >> > >
> >> >> > > How so?
> >> >> > >
> >> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> >> > > the pt_regs.  Currently it points to something stale (which the
> >> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> >> > > is the next thing on the stack, so just doing the section thing would
> >> >> > > work.
> >> >> >
> >> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> >> >> > past the nested entry frame and keep going until it gets to the original
> >> >> > entry.
> >> >>
> >> >> Yes.
> >> >>
> >> >> It would be nice if we could do better, though, and actually notice
> >> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> >> crash.
> >> >>
> >> >> Also, I think that just following rbp links will lose the
> >> >> actual function that took the page fault (or whatever function
> >> >> pt_regs->ip actually points to).
> >> >
> >> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> >> > new pt_regs frame gets saved on entry, we could also create a new stack
> >> > frame which points to a fake kernel_entry() function.  That would tell
> >> > the unwinder there's a pt_regs frame without otherwise breaking frame
> >> > pointers across the frame.
> >> >
> >> > Then I guess we wouldn't need my other solution of putting the idt
> >> > entries in a special section.
> >> >
> >> > How does that sound?
> >>
> >> Let me try to understand.
> >>
> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> >> points to (prev rbp, prev rip) on the stack, and you can follow the
> >> chain back.  Right now, on a user access page fault or similar, we
> >> have rbp (probably) pointing to the interrupted frame, and the
> >> interrupted rip isn't saved anywhere that a naive unwinder can find
> >> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> >>
> >> We could change the entry code so that an interrupt / idtentry does:
> >>
> >> push pt_regs
> >> push kernel_entry
> >> push %rbp
> >> mov %rsp, %rbp
> >> call handler
> >> pop %rbp
> >> addq $8, %rsp
> >>
> >> or similar.  That would make it appear that the actual C handler was
> >> caused by a dummy function "kernel_entry".  Now the unwinder would get
> >> to kernel_entry, but it *still* wouldn't find its way to the calling
> >> frame, which only solves part of the problem.  We could at least teach
> >> the unwinder how kernel_entry works and let it decode pt_regs to
> >> continue unwinding.  This would be nice, and I think it could work.
> >>
> >> I think I like this, except that, if it used a separate section, it
> >> could potentially be faster, as, for each actual entry type, the
> >> offset from the C handler frame to pt_regs is a foregone conclusion.
> >> But this is pretty simple and performance is already abysmal in most
> >> handlers.
> >>
> >> There's an added benefit to using a separate section, though: we could
> >> also annotate the calls with what type of entry they were so the
> >> unwinder could print it out nicely.
> >>
> >> I could be convinced either way.
> >
> > Ok, I took a stab at this.  See the patch below.
> >
> > In addition to annotating interrupt/exception pt_regs frames, I also
> > annotated all the syscall pt_regs, for consistency.
> >
> > As you mentioned, it will affect performance a bit, but I think it will
> > be insignificant.
> >
> > I think I like this approach better than putting the
> > interrupt/idtentry's in a special section, because this is much more
> > precise.  Especially now that I'm annotating pt_regs syscalls.
> >
> > Also I think with a few minor changes we could implement your idea of
> > annotating the calls with what type of entry they are.  But I don't
> > think that's really needed, because the name of the interrupt/idtentry
> > is already on the stack trace.
> >
> > Before:
> >
> >   [<ffffffff8143c243>] dump_stack+0x85/0xc2
> >   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
> >   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
> >   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
> >   [<ffffffff81887058>] async_page_fault+0x28/0x30
> >   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
> >   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
> >   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
> >   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
> >   [<ffffffff81286378>] vfs_read+0x98/0x140
> >   [<ffffffff812878c8>] SyS_read+0x58/0xc0
> >   [<ffffffff81884dbc>] entry_SYSCALL_64_fastpath+0x1f/0xbd
> >
> > After:
> >
> >   [<ffffffff8143c243>] dump_stack+0x85/0xc2
> >   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
> >   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
> >   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
> >   [<ffffffff81887422>] async_page_fault+0x32/0x40
> >   [<ffffffff81887861>] pt_regs+0x1/0x10
> >   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
> >   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
> >   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
> >   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
> >   [<ffffffff81286378>] vfs_read+0x98/0x140
> >   [<ffffffff812878c8>] SyS_read+0x58/0xc0
> >   [<ffffffff81884dc6>] entry_SYSCALL_64_fastpath+0x29/0xdb
> >   [<ffffffff81887861>] pt_regs+0x1/0x10
> >
> > Note this example is with today's unwinder.  It could be made smarter to
> > get the RIP from the pt_regs so the '?' could be removed from
> > copy_page_to_iter().
> >
> > Thoughts?
> 
> Maybe I'm coming around to liking this idea.

Ok, good :-)

> In an ideal world (DWARF support, high-quality unwinder, nice pretty
> printer, etc), unwinding across a kernel exception would look like:
> 
>  - some_func
>  - some_other_func
>  - do_page_fault
>  - page_fault
> 
> After page_fault, the next unwind step takes us to the faulting RIP
> (faulting_func) and reports that all GPRs are known.  It should
> probably learn this fact from DWARF if DWARF is available, instead of
> directly decoding pt_regs, due to a few funny cases in which pt_regs
> may be incomplete.  A nice pretty printer could now print all the
> regs.
> 
>  - faulting_func
>  - etc.
> 
> For this to work, we don't actually need the unwinder to explicitly
> know where pt_regs is.

That's true (but only for DWARF).

> Food for thought, though: if user code does SYSENTER with TF set,
> you'll end up with partial pt_regs.  There's nothing the kernel can do
> about it.  DWARF will handle it without any fanfare, but I don't know
> if it's going to cause trouble for you pre-DWARF.

In this case it should see the stack pointer is past the pt_regs offset,
so it would just report it as an empty stack.
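
Roughly like this (sketch only; the helper name is made up, and
task_pt_regs() is the pt_regs slot at the top of the task's kernel
stack):

	/* partial pt_regs: sp still points inside the pt_regs area */
	if (sp > (unsigned long)task_pt_regs(task))
		return 0;	/* report an empty stack */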

> I'm also not sure it makes sense to apply this before the unwinder
> that can consume it is ready.  Maybe, if it would be consistent with
> your plans, it would make sense to rewrite the unwinder first, then
> apply this and teach live patching to use the new unwinder, and *then*
> add DWARF support?

For the purposes of livepatch, the reliable unwinder needs to detect
whether an interrupt/exception pt_regs frame exists on a sleeping task
(or current).  This patch would allow us to do that.

So my preferred order of doing things would be:

1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
2) this patch for annotating pt_regs stack frames
3) reliable unwinder, similar to what I already posted, except it relies
   on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
   the new inactive task frame format of #1
4) livepatch consistency model which uses the reliable unwinder
5) rewrite unwinder, and port all users to the new interface
6) add DWARF unwinder

1-4 are pretty much already written, whereas 5 and 6 will take
considerably more work.

> > +       /*
> > +        * Create a stack frame for the saved pt_regs.  This allows frame
> > +        * pointer based unwinders to find pt_regs on the stack.
> > +        */
> > +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> > +#ifdef CONFIG_FRAME_POINTER
> > +       pushq   \regs
> > +       pushq   $pt_regs+1
> 
> Why the +1?

Some unwinders like gdb are smart enough to report the function which
contains the instruction *before* the return address.  Without the +1,
they would show the wrong function.
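
To spell out the arithmetic (standalone illustration, not kernel code --
the marker address is made up, only the [marker, marker+2) size matches
the two-nop pt_regs() function):

#include <stdbool.h>
#include <stdio.h>

static bool inside_marker(unsigned long addr, unsigned long marker)
{
	return addr >= marker && addr < marker + 2;
}

int main(void)
{
	unsigned long marker = 0x1000;		/* pretend pt_regs() starts here */
	unsigned long ret = marker + 1;		/* what "pushq $pt_regs+1" stores */

	printf("gdb-style (ret-1): %d\n", inside_marker(ret - 1, marker));	/* 1 */
	printf("naive     (ret):   %d\n", inside_marker(ret, marker));		/* 1 */

	/*
	 * Without the +1, ret would equal marker, and a gdb-style lookup
	 * of ret-1 would land in whatever function precedes pt_regs().
	 */
	return 0;
}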

> > +       pushq   %rbp
> > +       movq    %rsp, %rbp
> > +#endif
> > +       .endm
> 
> I keep wanting this to be only two pushes and to fudge rbp to make it
> work, but I don't see a good way.  But let's call it
> CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
> nested_frame or similar.

Or, if we aren't going to annotate syscall pt_regs, we could give it a
more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?

> > +
> > +       .macro CALL_HANDLER handler regs=%rsp
> > +       CREATE_PT_REGS_FRAME \regs
> > +       call    \handler
> > +       REMOVE_PT_REGS_FRAME
> > +       .endm
> 
> I think I'd rather open-code this everywhere.  It'll make it clearer
> what's going on.

Ok.

> > @@ -199,6 +199,7 @@ entry_SYSCALL_64_fastpath:
> >         ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
> >         movq    %r10, %rcx
> >
> > +       CREATE_PT_REGS_FRAME
> >         /*
> >          * This call instruction is handled specially in stub_ptregs_64.
> >          * It might end up jumping to the slow path.  If it jumps, RAX
> > @@ -207,6 +208,8 @@ entry_SYSCALL_64_fastpath:
> >         call    *sys_call_table(, %rax, 8)
> >  .Lentry_SYSCALL_64_after_fastpath_call:
> >
> > +       REMOVE_PT_REGS_FRAME
> > +
> 
> As discussed, let's get rid of this bit.

Yeah, it's fine with me to get rid of all the syscall stuff.

> 
> >         movq    %rax, RAX(%rsp)
> >  1:
> >
> > @@ -238,14 +241,14 @@ entry_SYSCALL_64_fastpath:
> >         ENABLE_INTERRUPTS(CLBR_NONE)
> >         SAVE_EXTRA_REGS
> >         movq    %rsp, %rdi
> > -       call    syscall_return_slowpath /* returns with IRQs disabled */
> > +       CALL_HANDLER syscall_return_slowpath    /* returns with IRQs disabled */
> 
> and this.

This will be gone...

> 
> >         jmp     return_from_SYSCALL_64
> >
> >  entry_SYSCALL64_slow_path:
> >         /* IRQs are off. */
> >         SAVE_EXTRA_REGS
> >         movq    %rsp, %rdi
> > -       call    do_syscall_64           /* returns with IRQs disabled */
> > +       CALL_HANDLER do_syscall_64      /* returns with IRQs disabled */
> >
> >  return_from_SYSCALL_64:
> >         RESTORE_EXTRA_REGS
> > @@ -344,6 +347,7 @@ ENTRY(stub_ptregs_64)
> >         DISABLE_INTERRUPTS(CLBR_NONE)
> >         TRACE_IRQS_OFF
> >         popq    %rax
> > +       REMOVE_PT_REGS_FRAME
> 
> This will be less mysterious if you open-code the macros.  Also, I
> think you have to, some return_from_SYSCALL_64 needs to be directly
> after the actual call instruction.  (But if you get rid of the hunks
> above, I think this goes away too, so this may be moot.)

and this...

> >  1:
> > @@ -372,7 +376,7 @@ END(ptregs_\func)
> >  ENTRY(ret_from_fork)
> >         LOCK ; btr $TIF_FORK, TI_flags(%r8)
> >
> > -       call    schedule_tail                   /* rdi: 'prev' task parameter */
> > +       CALL_HANDLER schedule_tail              /* rdi: 'prev' task parameter */
> >
> 
> If you end up making the unwinder smart enough to notice that rsp is
> just below pt_regs, then this can go away.  It's harmless, though.

and this...

> >         testb   $3, CS(%rsp)                    /* from kernel_thread? */
> >         jnz     1f
> > @@ -385,8 +389,9 @@ ENTRY(ret_from_fork)
> >          * parameter to be passed in RBP.  The called function is permitted
> >          * to call do_execve and thereby jump to user mode.
> >          */
> > +       movq    RBX(%rsp), %rbx
> >         movq    RBP(%rsp), %rdi
> > -       call    *RBX(%rsp)
> > +       CALL_HANDLER *%rbx
> 
> Does using a register like this actually save any code size?
> Admittedly, it's a bit cleaner.

and this.

(FWIW, I used a register because the assembler macro didn't seem to
support passing "*RBX(%rsp)" as an argument.)

> > +
> > +/* fake function which allows stack unwinders to detect pt_regs frames */
> > +#ifdef CONFIG_FRAME_POINTER
> > +ENTRY(pt_regs)
> > +       nop
> > +       nop
> > +ENDPROC(pt_regs)
> > +#endif /* CONFIG_FRAME_POINTER */
> 
> Why is this two bytes long?  Is there some reason it has to be more
> than one byte?

Similar to above, this is related to the need to support various
unwinders.  Whether the unwinder displays the ret_addr or the
instruction preceding it, either way the instruction needs to be inside
the function for the function to be reported.

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-24  2:28                             ` Josh Poimboeuf
@ 2016-05-24  3:52                               ` Andy Lutomirski
  2016-06-22 16:30                                 ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-05-24  3:52 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On May 23, 2016 7:28 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> > Maybe I'm coming around to liking this idea.
>
> Ok, good :-)
>
> > In an ideal world (DWARF support, high-quality unwinder, nice pretty
> > printer, etc), unwinding across a kernel exception would look like:
> >
> >  - some_func
> >  - some_other_func
> >  - do_page_fault
> >  - page_fault
> >
> > After page_fault, the next unwind step takes us to the faulting RIP
> > (faulting_func) and reports that all GPRs are known.  It should
> > probably learn this fact from DWARF if DWARF is available, instead of
> > directly decoding pt_regs, due to a few funny cases in which pt_regs
> > may be incomplete.  A nice pretty printer could now print all the
> > regs.
> >
> >  - faulting_func
> >  - etc.
> >
> > For this to work, we don't actually need the unwinder to explicitly
> > know where pt_regs is.
>
> That's true (but only for DWARF).
>
> > Food for thought, though: if user code does SYSENTER with TF set,
> > you'll end up with partial pt_regs.  There's nothing the kernel can do
> > about it.  DWARF will handle it without any fanfare, but I don't know
> > if it's going to cause trouble for you pre-DWARF.
>
> In this case it should see the stack pointer is past the pt_regs offset,
> so it would just report it as an empty stack.

OK

>
> > I'm also not sure it makes sense to apply this before the unwinder
> > that can consume it is ready.  Maybe, if it would be consistent with
> > your plans, it would make sense to rewrite the unwinder first, then
> > apply this and teach live patching to use the new unwinder, and *then*
> > add DWARF support?
>
> For the purposes of livepatch, the reliable unwinder needs to detect
> whether an interrupt/exception pt_regs frame exists on a sleeping task
> (or current).  This patch would allow us to do that.
>
> So my preferred order of doing things would be:
>
> 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
> 2) this patch for annotating pt_regs stack frames
> 3) reliable unwinder, similar to what I already posted, except it relies
>    on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
>    the new inactive task frame format of #1
> 4) livepatch consistency model which uses the reliable unwinder
> 5) rewrite unwinder, and port all users to the new interface
> 6) add DWARF unwinder
>
> 1-4 are pretty much already written, whereas 5 and 6 will take
> considerably more work.

Fair enough.

>
> > > +       /*
> > > +        * Create a stack frame for the saved pt_regs.  This allows frame
> > > +        * pointer based unwinders to find pt_regs on the stack.
> > > +        */
> > > +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> > > +#ifdef CONFIG_FRAME_POINTER
> > > +       pushq   \regs
> > > +       pushq   $pt_regs+1
> >
> > Why the +1?
>
> Some unwinders like gdb are smart enough to report the function which
> contains the instruction *before* the return address.  Without the +1,
> they would show the wrong function.

Lovely.  Want to add a comment?

>
> > > +       pushq   %rbp
> > > +       movq    %rsp, %rbp
> > > +#endif
> > > +       .endm
> >
> > I keep wanting this to be only two pushes and to fudge rbp to make it
> > work, but I don't see a good way.  But let's call it
> > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
> > nested_frame or similar.
>
> Or, if we aren't going to annotate syscall pt_regs, we could give it a
> more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?

CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
too.  CREATE_PT_REGS_FRAME is probably fine.

> > > +
> > > +/* fake function which allows stack unwinders to detect pt_regs frames */
> > > +#ifdef CONFIG_FRAME_POINTER
> > > +ENTRY(pt_regs)
> > > +       nop
> > > +       nop
> > > +ENDPROC(pt_regs)
> > > +#endif /* CONFIG_FRAME_POINTER */
> >
> > Why is this two bytes long?  Is there some reason it has to be more
> > than one byte?
>
> Similar to above, this is related to the need to support various
> unwinders.  Whether the unwinder displays the ret_addr or the
> instruction preceding it, either way the instruction needs to be inside
> the function for the function to be reported.

OK.

--Andy


* RE: livepatch: change to a per-task consistency model
  2016-05-23 18:44             ` Jiri Kosina
@ 2016-05-24 15:06                 ` David Laight
  0 siblings, 0 replies; 121+ messages in thread
From: David Laight @ 2016-05-24 15:06 UTC (permalink / raw)
  To: 'Jiri Kosina'
  Cc: Josh Poimboeuf, linux-s390, Jessica Yu, Vojtech Pavlik,
	Petr Mladek, Peter Zijlstra, x86, Heiko Carstens, linux-kernel,
	Ingo Molnar, Andy Lutomirski, live-patching, Jiri Slaby,
	Miroslav Benes, linuxppc-dev, Chris J Arges

From: Jiri Kosina 
> Sent: 23 May 2016 19:45
> > Related, please can we have a flag for the sleep and/or process so that
> > an uninterruptible sleep doesn't trigger the 'hung task' detector
> 
> TASK_KILLABLE

Not sure that does what I want.
It appears to allow some 'kill' actions to wake the process.
I'm sure I've looked at the 'hung task' code since 2007.

> > and also stops the process counting towards the 'load average'.
> 
> TASK_NOLOAD

Ah, that was added in May 2015.
Not surprising I didn't know about it.

I'll leave the code doing:
  set_current_state(signal_pending(current) ? TASK_UNINTERRUPTIBLE : TASK_INTERRUPTIBLE);
for a while longer.

	David


* RE: livepatch: change to a per-task consistency model
  2016-05-24 15:06                 ` David Laight
  (?)
@ 2016-05-24 22:45                 ` Jiri Kosina
  -1 siblings, 0 replies; 121+ messages in thread
From: Jiri Kosina @ 2016-05-24 22:45 UTC (permalink / raw)
  To: David Laight
  Cc: Josh Poimboeuf, linux-s390, Jessica Yu, Vojtech Pavlik,
	Petr Mladek, Peter Zijlstra, x86, Heiko Carstens, linux-kernel,
	Ingo Molnar, Andy Lutomirski, live-patching, Jiri Slaby,
	Miroslav Benes, linuxppc-dev, Chris J Arges

On Tue, 24 May 2016, David Laight wrote:

> > > Related, please can we have a flag for the sleep and/or process so that
> > > an uninterruptible sleep doesn't trigger the 'hung task' detector
> > 
> > TASK_KILLABLE
> 
> Not sure that does what I want.
> It appears to allow some 'kill' actions to wake the process.
> I'm sure I've looked at the 'hung task' code since 2007.

The trick is the 

	if (t->state == TASK_UNINTERRUPTIBLE)

test in check_hung_uninterruptible_tasks(). That makes sure that 
TASK_KILLABLE tasks (e.g. waiting on NFS I/O, but not limited only to 
that; feel free to set it wherever you need) are skipped. Please note 
that TASK_KILLABLE is actually a _mask_ that includes TASK_UNINTERRUPTIBLE 
as well; therefore the '==' test skips such tasks.
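
In other words (simplified copies of the kernel definitions, just to
illustrate the mask vs. '==' point):

#include <stdio.h>

#define TASK_UNINTERRUPTIBLE	2
#define TASK_WAKEKILL		128
#define TASK_KILLABLE		(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

int main(void)
{
	long state = TASK_KILLABLE;	/* e.g. a killable NFS sleep */

	/* the check_hung_uninterruptible_tasks() test: */
	if (state == TASK_UNINTERRUPTIBLE)
		printf("would be flagged as hung\n");
	else
		printf("skipped by the hung task detector\n");	/* this one */

	return 0;
}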

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
                     ` (6 preceding siblings ...)
  2016-05-17 22:53   ` Jessica Yu
@ 2016-06-06 13:54   ` Petr Mladek
  2016-06-06 14:29     ` Josh Poimboeuf
  7 siblings, 1 reply; 121+ messages in thread
From: Petr Mladek @ 2016-06-06 13:54 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.

> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> new file mode 100644
> index 0000000..92819bb
> --- /dev/null
> +++ b/kernel/livepatch/transition.c
> +/*
> + * Try to safely switch a task to the target patch state.  If it's currently
> + * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
> + * if the stack is unreliable, return false.
> + */
> +static bool klp_try_switch_task(struct task_struct *task)
> +{
> +	struct rq *rq;
> +	unsigned long flags;

This should be of type "struct rq_flags". Otherwise, I get compilation
warnings:

kernel/livepatch/transition.c: In function ‘klp_try_switch_task’:
kernel/livepatch/transition.c:349:2: warning: passing argument 2 of ‘task_rq_lock’ from incompatible pointer type [enabled by default]
  rq = task_rq_lock(task, &flags);
  ^
In file included from kernel/livepatch/transition.c:24:0:
kernel/livepatch/../sched/sched.h:1468:12: note: expected ‘struct rq_flags *’ but argument is of type ‘long unsigned int *’
 struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
            ^
kernel/livepatch/transition.c:367:2: warning: passing argument 3 of ‘task_rq_unlock’ from incompatible pointer type [enabled by default]
  task_rq_unlock(rq, task, &flags);
  ^
In file included from kernel/livepatch/transition.c:24:0:
kernel/livepatch/../sched/sched.h:1480:1: note: expected ‘struct rq_flags *’ but argument is of type ‘long unsigned int *’
 task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)


And even runtime warnings from lockdep:

[  212.847548] WARNING: CPU: 1 PID: 3847 at kernel/locking/lockdep.c:3532 lock_release+0x431/0x480
[  212.847549] releasing a pinned lock
[  212.847550] Modules linked in: livepatch_sample(E+)
[  212.847555] CPU: 1 PID: 3847 Comm: modprobe Tainted: G            E K 4.7.0-rc1-next-20160602-4-default+ #336
[  212.847556] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  212.847558]  0000000000000000 ffff880139823aa0 ffffffff814388dc ffff880139823af0
[  212.847562]  0000000000000000 ffff880139823ae0 ffffffff8106fad1 00000dcc82b11390
[  212.847565]  ffff88013fc978d8 ffffffff810eea1e ffff8800ba0ed6d0 0000000000000003
[  212.847569] Call Trace:
[  212.847572]  [<ffffffff814388dc>] dump_stack+0x85/0xc9
[  212.847575]  [<ffffffff8106fad1>] __warn+0xd1/0xf0
[  212.847578]  [<ffffffff810eea1e>] ? klp_try_switch_task.part.3+0x5e/0x2b0
[  212.847580]  [<ffffffff8106fb3f>] warn_slowpath_fmt+0x4f/0x60
[  212.847582]  [<ffffffff810cc151>] lock_release+0x431/0x480
[  212.847585]  [<ffffffff8101e258>] ? dump_trace+0x118/0x310
[  212.847588]  [<ffffffff8195d07c>] ? entry_SYSCALL_64_fastpath+0x1f/0xbd
[  212.847590]  [<ffffffff8195c8bf>] _raw_spin_unlock+0x1f/0x30
[  212.847600]  [<ffffffff810eea1e>] klp_try_switch_task.part.3+0x5e/0x2b0
[  212.847603]  [<ffffffff810ef0e4>] klp_try_complete_transition+0x84/0x190
[  212.847605]  [<ffffffff810ed370>] __klp_enable_patch+0xb0/0x130
[  212.847607]  [<ffffffff810ed445>] klp_enable_patch+0x55/0x80
[  212.847610]  [<ffffffffa0000030>] ? livepatch_cmdline_proc_show+0x30/0x30 [livepatch_sample]
[  212.847613]  [<ffffffffa0000061>] livepatch_init+0x31/0x70 [livepatch_sample]
[  212.847615]  [<ffffffffa0000030>] ? livepatch_cmdline_proc_show+0x30/0x30 [livepatch_sample]
[  212.847617]  [<ffffffff8100041d>] do_one_initcall+0x3d/0x160
[  212.847629]  [<ffffffff81196c9b>] ? do_init_module+0x27/0x1e4
[  212.847632]  [<ffffffff810e7172>] ? rcu_read_lock_sched_held+0x62/0x70
[  212.847634]  [<ffffffff811fdea2>] ? kmem_cache_alloc_trace+0x282/0x340
[  212.847636]  [<ffffffff81196cd4>] do_init_module+0x60/0x1e4
[  212.847638]  [<ffffffff81111fd2>] load_module+0x1482/0x1d40
[  212.847640]  [<ffffffff8110ea10>] ? __symbol_put+0x40/0x40
[  212.847643]  [<ffffffff81112aa9>] SYSC_finit_module+0xa9/0xd0
[  212.847645]  [<ffffffff81112aee>] SyS_finit_module+0xe/0x10
[  212.847647]  [<ffffffff8195d07c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[  212.847649] ---[ end trace e4e9f09d45443049 ]---


> +	int ret;
> +	bool success = false;
> +
> +	/* check if this task has already switched over */
> +	if (task->patch_state == klp_target_state)
> +		return true;
> +
> +	/*
> +	 * For arches which don't have reliable stack traces, we have to rely
> +	 * on other methods (e.g., switching tasks at the syscall barrier).
> +	 */
> +	if (!IS_ENABLED(CONFIG_RELIABLE_STACKTRACE))
> +		return false;
> +
> +	/*
> +	 * Now try to check the stack for any to-be-patched or to-be-unpatched
> +	 * functions.  If all goes well, switch the task to the target patch
> +	 * state.
> +	 */
> +	rq = task_rq_lock(task, &flags);
> +
> +	if (task_running(rq, task) && task != current) {
> +		pr_debug("%s: pid %d (%s) is running\n", __func__, task->pid,
> +			 task->comm);

I also thought about using printk_deferred() inside the rq_lock, but it
is not strictly needed.  Also, we only use pr_debug() here, which is a
NOP when not enabled.

Best Regards,
Petr


* Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
  2016-06-06 13:54   ` [RFC PATCH v2 17/18] " Petr Mladek
@ 2016-06-06 14:29     ` Josh Poimboeuf
  0 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-06-06 14:29 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jessica Yu, Jiri Kosina, Miroslav Benes, Ingo Molnar,
	Peter Zijlstra, Michael Ellerman, Heiko Carstens, live-patching,
	linux-kernel, x86, linuxppc-dev, linux-s390, Vojtech Pavlik,
	Jiri Slaby, Chris J Arges, Andy Lutomirski

On Mon, Jun 06, 2016 at 03:54:41PM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > Change livepatch to use a basic per-task consistency model.  This is the
> > foundation which will eventually enable us to patch those ~10% of
> > security patches which change function or data semantics.  This is the
> > biggest remaining piece needed to make livepatch more generally useful.
> 
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > new file mode 100644
> > index 0000000..92819bb
> > --- /dev/null
> > +++ b/kernel/livepatch/transition.c
> > +/*
> > + * Try to safely switch a task to the target patch state.  If it's currently
> > + * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
> > + * if the stack is unreliable, return false.
> > + */
> > +static bool klp_try_switch_task(struct task_struct *task)
> > +{
> > +	struct rq *rq;
> > +	unsigned long flags;
> 
> This should be of type "struct rq_flags". Otherwise, I get compilation
> warnings:
> 
> kernel/livepatch/transition.c: In function ‘klp_try_switch_task’:
> kernel/livepatch/transition.c:349:2: warning: passing argument 2 of ‘task_rq_lock’ from incompatible pointer type [enabled by default]
>   rq = task_rq_lock(task, &flags);
>   ^
> In file included from kernel/livepatch/transition.c:24:0:
> kernel/livepatch/../sched/sched.h:1468:12: note: expected ‘struct rq_flags *’ but argument is of type ‘long unsigned int *’
>  struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
>             ^
> kernel/livepatch/transition.c:367:2: warning: passing argument 3 of ‘task_rq_unlock’ from incompatible pointer type [enabled by default]
>   task_rq_unlock(rq, task, &flags);
>   ^
> In file included from kernel/livepatch/transition.c:24:0:
> kernel/livepatch/../sched/sched.h:1480:1: note: expected ‘struct rq_flags *’ but argument is of type ‘long unsigned int *’
>  task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> 
> 
> And even runtime warnings from lockdep:
> 
> [  212.847548] WARNING: CPU: 1 PID: 3847 at kernel/locking/lockdep.c:3532 lock_release+0x431/0x480
> [  212.847549] releasing a pinned lock
> [  212.847550] Modules linked in: livepatch_sample(E+)
> [  212.847555] CPU: 1 PID: 3847 Comm: modprobe Tainted: G            E K 4.7.0-rc1-next-20160602-4-default+ #336
> [  212.847556] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> [  212.847558]  0000000000000000 ffff880139823aa0 ffffffff814388dc ffff880139823af0
> [  212.847562]  0000000000000000 ffff880139823ae0 ffffffff8106fad1 00000dcc82b11390
> [  212.847565]  ffff88013fc978d8 ffffffff810eea1e ffff8800ba0ed6d0 0000000000000003
> [  212.847569] Call Trace:
> [  212.847572]  [<ffffffff814388dc>] dump_stack+0x85/0xc9
> [  212.847575]  [<ffffffff8106fad1>] __warn+0xd1/0xf0
> [  212.847578]  [<ffffffff810eea1e>] ? klp_try_switch_task.part.3+0x5e/0x2b0
> [  212.847580]  [<ffffffff8106fb3f>] warn_slowpath_fmt+0x4f/0x60
> [  212.847582]  [<ffffffff810cc151>] lock_release+0x431/0x480
> [  212.847585]  [<ffffffff8101e258>] ? dump_trace+0x118/0x310
> [  212.847588]  [<ffffffff8195d07c>] ? entry_SYSCALL_64_fastpath+0x1f/0xbd
> [  212.847590]  [<ffffffff8195c8bf>] _raw_spin_unlock+0x1f/0x30
> [  212.847600]  [<ffffffff810eea1e>] klp_try_switch_task.part.3+0x5e/0x2b0
> [  212.847603]  [<ffffffff810ef0e4>] klp_try_complete_transition+0x84/0x190
> [  212.847605]  [<ffffffff810ed370>] __klp_enable_patch+0xb0/0x130
> [  212.847607]  [<ffffffff810ed445>] klp_enable_patch+0x55/0x80
> [  212.847610]  [<ffffffffa0000030>] ? livepatch_cmdline_proc_show+0x30/0x30 [livepatch_sample]
> [  212.847613]  [<ffffffffa0000061>] livepatch_init+0x31/0x70 [livepatch_sample]
> [  212.847615]  [<ffffffffa0000030>] ? livepatch_cmdline_proc_show+0x30/0x30 [livepatch_sample]
> [  212.847617]  [<ffffffff8100041d>] do_one_initcall+0x3d/0x160
> [  212.847629]  [<ffffffff81196c9b>] ? do_init_module+0x27/0x1e4
> [  212.847632]  [<ffffffff810e7172>] ? rcu_read_lock_sched_held+0x62/0x70
> [  212.847634]  [<ffffffff811fdea2>] ? kmem_cache_alloc_trace+0x282/0x340
> [  212.847636]  [<ffffffff81196cd4>] do_init_module+0x60/0x1e4
> [  212.847638]  [<ffffffff81111fd2>] load_module+0x1482/0x1d40
> [  212.847640]  [<ffffffff8110ea10>] ? __symbol_put+0x40/0x40
> [  212.847643]  [<ffffffff81112aa9>] SYSC_finit_module+0xa9/0xd0
> [  212.847645]  [<ffffffff81112aee>] SyS_finit_module+0xe/0x10
> [  212.847647]  [<ffffffff8195d07c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
> [  212.847649] ---[ end trace e4e9f09d45443049 ]---

Thanks, I also saw this when rebasing onto a newer linux-next.
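
The fix is basically just the declaration type and the matching lock
calls (sketch):

	struct rq_flags flags;		/* was: unsigned long flags */
	struct rq *rq;

	rq = task_rq_lock(task, &flags);
	/* ... */
	task_rq_unlock(rq, task, &flags);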

> > +	int ret;
> > +	bool success = false;
> > +
> > +	/* check if this task has already switched over */
> > +	if (task->patch_state == klp_target_state)
> > +		return true;
> > +
> > +	/*
> > +	 * For arches which don't have reliable stack traces, we have to rely
> > +	 * on other methods (e.g., switching tasks at the syscall barrier).
> > +	 */
> > +	if (!IS_ENABLED(CONFIG_RELIABLE_STACKTRACE))
> > +		return false;
> > +
> > +	/*
> > +	 * Now try to check the stack for any to-be-patched or to-be-unpatched
> > +	 * functions.  If all goes well, switch the task to the target patch
> > +	 * state.
> > +	 */
> > +	rq = task_rq_lock(task, &flags);
> > +
> > +	if (task_running(rq, task) && task != current) {
> > +		pr_debug("%s: pid %d (%s) is running\n", __func__, task->pid,
> > +			 task->comm);
> 
> I also thought about using printk_deferred() inside the rq_lock, but it
> is not strictly needed.  Also, we only use pr_debug() here, which is a
> NOP when not enabled.

Good catch.  It's probably best to avoid it anyway.  klp_check_stack()
also has some pr_debug() calls.  I may restructure the code a bit to
release the lock before doing any of the pr_debug()'s.
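
Something along these lines, maybe (rough sketch only -- the buffer size
and the exact flow are made up here):

	char err_buf[128];
	bool success = false;

	err_buf[0] = '\0';

	rq = task_rq_lock(task, &flags);

	if (task_running(rq, task) && task != current) {
		snprintf(err_buf, sizeof(err_buf),
			 "%s: pid %d (%s) is running\n",
			 __func__, task->pid, task->comm);
		goto done;
	}

	/* ... klp_check_stack() would report its errors via err_buf too ... */

	task->patch_state = klp_target_state;
	success = true;
done:
	task_rq_unlock(rq, task, &flags);

	if (err_buf[0])
		pr_debug("%s", err_buf);

	return success;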

-- 
Josh


* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-05-24  3:52                               ` Andy Lutomirski
@ 2016-06-22 16:30                                 ` Josh Poimboeuf
  2016-06-22 17:59                                   ` Andy Lutomirski
  2016-06-23  0:09                                   ` Andy Lutomirski
  0 siblings, 2 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-06-22 16:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote:
> On May 23, 2016 7:28 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> > > Maybe I'm coming around to liking this idea.
> >
> > Ok, good :-)
> >
> > > In an ideal world (DWARF support, high-quality unwinder, nice pretty
> > > printer, etc), unwinding across a kernel exception would look like:
> > >
> > >  - some_func
> > >  - some_other_func
> > >  - do_page_fault
> > >  - page_fault
> > >
> > > After page_fault, the next unwind step takes us to the faulting RIP
> > > (faulting_func) and reports that all GPRs are known.  It should
> > > probably learn this fact from DWARF if DWARF is available, instead of
> > > directly decoding pt_regs, due to a few funny cases in which pt_regs
> > > may be incomplete.  A nice pretty printer could now print all the
> > > regs.
> > >
> > >  - faulting_func
> > >  - etc.
> > >
> > > For this to work, we don't actually need the unwinder to explicitly
> > > know where pt_regs is.
> >
> > That's true (but only for DWARF).
> >
> > > Food for thought, though: if user code does SYSENTER with TF set,
> > > you'll end up with partial pt_regs.  There's nothing the kernel can do
> > > about it.  DWARF will handle it without any fanfare, but I don't know
> > > if it's going to cause trouble for you pre-DWARF.
> >
> > In this case it should see the stack pointer is past the pt_regs offset,
> > so it would just report it as an empty stack.
> 
> OK
> 
> >
> > > I'm also not sure it makes sense to apply this before the unwinder
> > > that can consume it is ready.  Maybe, if it would be consistent with
> > > your plans, it would make sense to rewrite the unwinder first, then
> > > apply this and teach live patching to use the new unwinder, and *then*
> > > add DWARF support?
> >
> > For the purposes of livepatch, the reliable unwinder needs to detect
> > whether an interrupt/exception pt_regs frame exists on a sleeping task
> > (or current).  This patch would allow us to do that.
> >
> > So my preferred order of doing things would be:
> >
> > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
> > 2) this patch for annotating pt_regs stack frames
> > 3) reliable unwinder, similar to what I already posted, except it relies
> >    on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
> >    the new inactive task frame format of #1
> > 4) livepatch consistency model which uses the reliable unwinder
> > 5) rewrite unwinder, and port all users to the new interface
> > 6) add DWARF unwinder
> >
> > 1-4 are pretty much already written, whereas 5 and 6 will take
> > considerably more work.
> 
> Fair enough.
> 
> >
> > > > +       /*
> > > > +        * Create a stack frame for the saved pt_regs.  This allows frame
> > > > +        * pointer based unwinders to find pt_regs on the stack.
> > > > +        */
> > > > +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > +       pushq   \regs
> > > > +       pushq   $pt_regs+1
> > >
> > > Why the +1?
> >
> > Some unwinders like gdb are smart enough to report the function which
> > contains the instruction *before* the return address.  Without the +1,
> > they would show the wrong function.
> 
> Lovely.  Want to add a comment?
> 
> >
> > > > +       pushq   %rbp
> > > > +       movq    %rsp, %rbp
> > > > +#endif
> > > > +       .endm
> > >
> > > I keep wanting this to be only two pushes and to fudge rbp to make it
> > > work, but I don't see a good way.  But let's call it
> > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
> > > nested_frame or similar.
> >
> > Or, if we aren't going to annotate syscall pt_regs, we could give it a
> > more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?
> 
> CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
> too.  CREATE_PT_REGS_FRAME is probably fine.
> 
> > > > +
> > > > +/* fake function which allows stack unwinders to detect pt_regs frames */
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > +ENTRY(pt_regs)
> > > > +       nop
> > > > +       nop
> > > > +ENDPROC(pt_regs)
> > > > +#endif /* CONFIG_FRAME_POINTER */
> > >
> > > Why is this two bytes long?  Is there some reason it has to be more
> > > than one byte?
> >
> > Similar to above, this is related to the need to support various
> > unwinders.  Whether the unwinder displays the ret_addr or the
> > instruction preceding it, either way the instruction needs to be inside
> > the function for the function to be reported.
> 
> OK.

Andy,

So I got a chance to look at this some more.  I'm thinking that to make
this feature more consistently useful, we shouldn't only annotate
pt_regs frames for calls to handlers; other calls should be annotated as
well: preempt_schedule_irq, CALL_enter_from_user_mode,
prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
etc.  That way, the unwinder will always be able to find pt_regs from an
interrupt/exception, even if starting from one of these other calls.

But then, things get ugly.  You have to either setup and tear down the
frame for every possible call, or do a higher-level setup/teardown
across multiple calls, which invalidates several assumptions in the
entry code about the location of pt_regs on the stack.

Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
make assumptions about the location of pt_regs.  And they're used by
both syscall and interrupt code.  So if we didn't create a frame pointer
header for syscalls, we'd basically need two versions of the macros: one
for irqs/exceptions and one for syscalls.

So I think the cleanest way to handle this is to always allocate two
extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
entry code can assume that pt_regs is at a constant location, and all
the above problems go away.  Another benefit is that we'd only need two
saves instead of three -- the pointer to pt_regs is no longer needed
since pt_regs is always immediately after the frame header.

I worked up a patch to implement this -- see below.  It writes the frame
pointer in all entry paths, including syscalls.  This helps keep the
code simple.

The downside is a small performance penalty: with getppid()-in-a-loop on
my laptop, the average syscall went from 52ns to 53ns, which is about a
2% slowdown.  But I doubt it would be measurable in a real-world
workload.
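
For reference, the measurement was the usual getppid()-in-a-loop
microbenchmark, something like the sketch below; the exact harness
doesn't matter much, and the numbers will of course vary by machine
and config:

#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct timespec t0, t1;
	long i, iters = 10 * 1000 * 1000;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		syscall(SYS_getppid);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns per syscall\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / iters);
	return 0;
}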

It looks like about half the slowdown is due to the extra stack
allocation (which presumably adds a little d-cache pressure on the stack
memory) and the other half is due to the stack writes.

I could remove the writes from the syscall path but it would only save
about half a ns, and it would make the code less robust.  Plus it's nice
to have the consistency of having *all* pt_regs frames annotated.
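
For reference, the unwinder side then becomes almost trivial.  A sketch
(not part of the patch below; it just relies on the entry_frame_ret
label and the 2*8 PT_REGS_OFFSET defined there):

static struct pt_regs *entry_frame_regs(unsigned long *frame)
{
	extern char entry_frame_ret[];

	/* frame[0] = saved rbp, frame[1] = fake return address */
	if (frame[1] != (unsigned long)entry_frame_ret)
		return NULL;			/* not an entry frame */

	/* pt_regs sits immediately after the two-word header */
	return (struct pt_regs *)(frame + 2);
}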

Thoughts?

----

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..0f6ccfc 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -88,30 +88,69 @@ For 32-bit we have the following conventions - kernel is built with
 #define RSP		19*8
 #define SS		20*8
 
-#define SIZEOF_PTREGS	21*8
 
-	.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
-	addq	$-(15*8+\addskip), %rsp
+#ifdef CONFIG_FRAME_POINTER
+
+#define PT_REGS_OFFSET 2*8
+
+	/*
+	 * Create an entry stack frame pointer header which corresponds to the
+	 * saved pt_regs.  This allows frame pointer based unwinders to find
+	 * pt_regs on the stack.  The frame pointer and the return address of a
+	 * fake function are stored immediately before the pt_regs.
+	 */
+	.macro SAVE_ENTRY_FRAME_POINTER
+	movq	%rbp,			0*8(%rsp)
+	movq	$entry_frame_ret,	1*8(%rsp)
+	movq	%rsp, %rbp
+	.endm
+
+	.macro RESTORE_ENTRY_FRAME_POINTER
+	movq	(%rsp), %rbp
+	.endm
+
+	.macro ALLOC_AND_SAVE_ENTRY_FRAME_POINTER
+	subq	$PT_REGS_OFFSET, %rsp
+	SAVE_ENTRY_FRAME_POINTER
+	.endm
+
+#else /* !CONFIG_FRAME_POINTER */
+
+#define PT_REGS_OFFSET 0
+#define SAVE_ENTRY_FRAME_POINTER
+#define RESTORE_ENTRY_FRAME_POINTER
+#define ALLOC_AND_SAVE_ENTRY_FRAME_POINTER
+
+#endif /* CONFIG_FRAME_POINTER */
+
+#define PT_REGS_SIZE		21*8
+#define ENTRY_FRAME_SIZE	(PT_REGS_OFFSET+PT_REGS_SIZE)
+
+#define TI_FLAGS(rsp) ASM_THREAD_INFO(TI_flags, rsp, ENTRY_FRAME_SIZE)
+#define PT_REGS(reg) reg+PT_REGS_OFFSET(%rsp)
+
+	.macro ALLOC_ENTRY_FRAME addskip=0
+	addq	$-(15*8+PT_REGS_OFFSET+\addskip), %rsp
 	.endm
 
 	.macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
 	.if \r11
-	movq %r11, 6*8+\offset(%rsp)
+	movq %r11, 6*8+PT_REGS_OFFSET+\offset(%rsp)
 	.endif
 	.if \r8910
-	movq %r10, 7*8+\offset(%rsp)
-	movq %r9,  8*8+\offset(%rsp)
-	movq %r8,  9*8+\offset(%rsp)
+	movq %r10, 7*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %r9,  8*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %r8,  9*8+PT_REGS_OFFSET+\offset(%rsp)
 	.endif
 	.if \rax
-	movq %rax, 10*8+\offset(%rsp)
+	movq %rax, 10*8+PT_REGS_OFFSET+\offset(%rsp)
 	.endif
 	.if \rcx
-	movq %rcx, 11*8+\offset(%rsp)
+	movq %rcx, 11*8+PT_REGS_OFFSET+\offset(%rsp)
 	.endif
-	movq %rdx, 12*8+\offset(%rsp)
-	movq %rsi, 13*8+\offset(%rsp)
-	movq %rdi, 14*8+\offset(%rsp)
+	movq %rdx, 12*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %rsi, 13*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %rdi, 14*8+PT_REGS_OFFSET+\offset(%rsp)
 	.endm
 	.macro SAVE_C_REGS offset=0
 	SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
@@ -128,23 +167,39 @@ For 32-bit we have the following conventions - kernel is built with
 	.macro SAVE_C_REGS_EXCEPT_RAX_RCX_R11
 	SAVE_C_REGS_HELPER 0, 0, 0, 1, 0
 	.endm
+	.macro SAVE_C_REGS_EXCEPT_RAX
+	SAVE_C_REGS_HELPER rax=0
+	.endm
+
+	.macro SAVE_EXTRA_REGS offset=0 rbx=1
+	movq %r15, 0*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %r14, 1*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %r13, 2*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %r12, 3*8+PT_REGS_OFFSET+\offset(%rsp)
+#ifdef CONFIG_FRAME_POINTER
+	/* copy rbp value from entry frame header */
+	movq \offset(%rsp), %rbp
+	movq %rbp, 4*8+PT_REGS_OFFSET+\offset(%rsp)
+	leaq \offset(%rsp), %rbp
+#else
+	movq %rbp, 4*8+PT_REGS_OFFSET+\offset(%rsp)
+#endif
+	.if \rbx
+	movq %rbx, 5*8+PT_REGS_OFFSET+\offset(%rsp)
+	.endif
+	.endm
 
-	.macro SAVE_EXTRA_REGS offset=0
-	movq %r15, 0*8+\offset(%rsp)
-	movq %r14, 1*8+\offset(%rsp)
-	movq %r13, 2*8+\offset(%rsp)
-	movq %r12, 3*8+\offset(%rsp)
-	movq %rbp, 4*8+\offset(%rsp)
-	movq %rbx, 5*8+\offset(%rsp)
+	.macro SAVE_EXTRA_REGS_EXCEPT_RBX
+	SAVE_EXTRA_REGS rbx=0
 	.endm
 
 	.macro RESTORE_EXTRA_REGS offset=0
-	movq 0*8+\offset(%rsp), %r15
-	movq 1*8+\offset(%rsp), %r14
-	movq 2*8+\offset(%rsp), %r13
-	movq 3*8+\offset(%rsp), %r12
-	movq 4*8+\offset(%rsp), %rbp
-	movq 5*8+\offset(%rsp), %rbx
+	movq 0*8+PT_REGS_OFFSET+\offset(%rsp), %r15
+	movq 1*8+PT_REGS_OFFSET+\offset(%rsp), %r14
+	movq 2*8+PT_REGS_OFFSET+\offset(%rsp), %r13
+	movq 3*8+PT_REGS_OFFSET+\offset(%rsp), %r12
+	movq 4*8+PT_REGS_OFFSET+\offset(%rsp), %rbp
+	movq 5*8+PT_REGS_OFFSET+\offset(%rsp), %rbx
 	.endm
 
 	.macro ZERO_EXTRA_REGS
@@ -158,24 +213,24 @@ For 32-bit we have the following conventions - kernel is built with
 
 	.macro RESTORE_C_REGS_HELPER rstor_rax=1, rstor_rcx=1, rstor_r11=1, rstor_r8910=1, rstor_rdx=1
 	.if \rstor_r11
-	movq 6*8(%rsp), %r11
+	movq 6*8+PT_REGS_OFFSET(%rsp), %r11
 	.endif
 	.if \rstor_r8910
-	movq 7*8(%rsp), %r10
-	movq 8*8(%rsp), %r9
-	movq 9*8(%rsp), %r8
+	movq 7*8+PT_REGS_OFFSET(%rsp), %r10
+	movq 8*8+PT_REGS_OFFSET(%rsp), %r9
+	movq 9*8+PT_REGS_OFFSET(%rsp), %r8
 	.endif
 	.if \rstor_rax
-	movq 10*8(%rsp), %rax
+	movq 10*8+PT_REGS_OFFSET(%rsp), %rax
 	.endif
 	.if \rstor_rcx
-	movq 11*8(%rsp), %rcx
+	movq 11*8+PT_REGS_OFFSET(%rsp), %rcx
 	.endif
 	.if \rstor_rdx
-	movq 12*8(%rsp), %rdx
+	movq 12*8+PT_REGS_OFFSET(%rsp), %rdx
 	.endif
-	movq 13*8(%rsp), %rsi
-	movq 14*8(%rsp), %rdi
+	movq 13*8+PT_REGS_OFFSET(%rsp), %rsi
+	movq 14*8+PT_REGS_OFFSET(%rsp), %rdi
 	.endm
 	.macro RESTORE_C_REGS
 	RESTORE_C_REGS_HELPER 1,1,1,1,1
@@ -193,8 +248,8 @@ For 32-bit we have the following conventions - kernel is built with
 	RESTORE_C_REGS_HELPER 1,0,0,1,1
 	.endm
 
-	.macro REMOVE_PT_GPREGS_FROM_STACK addskip=0
-	subq $-(15*8+\addskip), %rsp
+	.macro FREE_ENTRY_FRAME addskip=0
+	subq $-(15*8+PT_REGS_OFFSET+\addskip), %rsp
 	.endm
 
 	.macro icebp
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 23e764c..ff92759 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -55,7 +55,7 @@ ENDPROC(native_usergs_sysret64)
 
 .macro TRACE_IRQS_IRETQ
 #ifdef CONFIG_TRACE_IRQFLAGS
-	bt	$9, EFLAGS(%rsp)		/* interrupts off? */
+	bt	$9, PT_REGS(EFLAGS)		/* interrupts off? */
 	jnc	1f
 	TRACE_IRQS_ON
 1:
@@ -88,7 +88,7 @@ ENDPROC(native_usergs_sysret64)
 .endm
 
 .macro TRACE_IRQS_IRETQ_DEBUG
-	bt	$9, EFLAGS(%rsp)		/* interrupts off? */
+	bt	$9, PT_REGS(EFLAGS)		/* interrupts off? */
 	jnc	1f
 	TRACE_IRQS_ON_DEBUG
 1:
@@ -164,22 +164,17 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
 	pushq	$__USER_CS			/* pt_regs->cs */
 	pushq	%rcx				/* pt_regs->ip */
 	pushq	%rax				/* pt_regs->orig_ax */
-	pushq	%rdi				/* pt_regs->di */
-	pushq	%rsi				/* pt_regs->si */
-	pushq	%rdx				/* pt_regs->dx */
-	pushq	%rcx				/* pt_regs->cx */
-	pushq	$-ENOSYS			/* pt_regs->ax */
-	pushq	%r8				/* pt_regs->r8 */
-	pushq	%r9				/* pt_regs->r9 */
-	pushq	%r10				/* pt_regs->r10 */
-	pushq	%r11				/* pt_regs->r11 */
-	sub	$(6*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
+
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
+	SAVE_C_REGS_EXCEPT_RAX
+	movq	$-ENOSYS, PT_REGS(RAX)
 
 	/*
 	 * If we need to do entry work or if we guess we'll need to do
 	 * exit work, go straight to the slow path.
 	 */
-	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TI_FLAGS(%rsp)
 	jnz	entry_SYSCALL64_slow_path
 
 entry_SYSCALL_64_fastpath:
@@ -207,7 +202,7 @@ entry_SYSCALL_64_fastpath:
 	call	*sys_call_table(, %rax, 8)
 .Lentry_SYSCALL_64_after_fastpath_call:
 
-	movq	%rax, RAX(%rsp)
+	movq	%rax, PT_REGS(RAX)
 1:
 
 	/*
@@ -217,15 +212,16 @@ entry_SYSCALL_64_fastpath:
 	 */
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	testl	$_TIF_ALLWORK_MASK, TI_FLAGS(%rsp)
 	jnz	1f
 
 	LOCKDEP_SYS_EXIT
 	TRACE_IRQS_ON		/* user mode is traced as IRQs on */
-	movq	RIP(%rsp), %rcx
-	movq	EFLAGS(%rsp), %r11
+	movq	PT_REGS(RIP), %rcx
+	movq	PT_REGS(EFLAGS), %r11
 	RESTORE_C_REGS_EXCEPT_RCX_R11
-	movq	RSP(%rsp), %rsp
+	RESTORE_ENTRY_FRAME_POINTER
+	movq	PT_REGS(RSP), %rsp
 	USERGS_SYSRET64
 
 1:
@@ -237,14 +233,14 @@ entry_SYSCALL_64_fastpath:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_EXTRA_REGS
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	jmp	return_from_SYSCALL_64
 
 entry_SYSCALL64_slow_path:
 	/* IRQs are off. */
 	SAVE_EXTRA_REGS
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
 return_from_SYSCALL_64:
@@ -255,8 +251,8 @@ return_from_SYSCALL_64:
 	 * Try to use SYSRET instead of IRET if we're returning to
 	 * a completely clean 64-bit userspace context.
 	 */
-	movq	RCX(%rsp), %rcx
-	movq	RIP(%rsp), %r11
+	movq	PT_REGS(RCX), %rcx
+	movq	PT_REGS(RIP), %r11
 	cmpq	%rcx, %r11			/* RCX == RIP */
 	jne	opportunistic_sysret_failed
 
@@ -283,8 +279,8 @@ return_from_SYSCALL_64:
 	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
 	jne	opportunistic_sysret_failed
 
-	movq	R11(%rsp), %r11
-	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
+	movq	PT_REGS(R11), %r11
+	cmpq	%r11, PT_REGS(EFLAGS)		/* R11 == RFLAGS */
 	jne	opportunistic_sysret_failed
 
 	/*
@@ -306,7 +302,7 @@ return_from_SYSCALL_64:
 
 	/* nothing to check for RSP */
 
-	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
+	cmpq	$__USER_DS, PT_REGS(SS)		/* SS must match SYSRET */
 	jne	opportunistic_sysret_failed
 
 	/*
@@ -316,7 +312,7 @@ return_from_SYSCALL_64:
 syscall_return_via_sysret:
 	/* rcx and r11 are already restored (see code above) */
 	RESTORE_C_REGS_EXCEPT_RCX_R11
-	movq	RSP(%rsp), %rsp
+	movq	PT_REGS(RSP), %rsp
 	USERGS_SYSRET64
 
 opportunistic_sysret_failed:
@@ -408,6 +404,7 @@ END(__switch_to_asm)
  * r12: kernel thread arg
  */
 ENTRY(ret_from_fork)
+	ALLOC_AND_SAVE_ENTRY_FRAME_POINTER
 	movq	%rax, %rdi
 	call	schedule_tail			/* rdi: 'prev' task parameter */
 
@@ -415,7 +412,7 @@ ENTRY(ret_from_fork)
 	jnz	1f
 
 2:
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
 	SWAPGS
@@ -430,7 +427,7 @@ ENTRY(ret_from_fork)
 	 * calling do_execve().  Exit to userspace to complete the execve()
 	 * syscall.
 	 */
-	movq	$0, RAX(%rsp)
+	movq	$0, PT_REGS(RAX)
 	jmp	2b
 END(ret_from_fork)
 
@@ -460,11 +457,12 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
 	.macro interrupt func
 	cld
-	ALLOC_PT_GPREGS_ON_STACK
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
 
-	testb	$3, CS(%rsp)
+	testb	$3, PT_REGS(CS)
 	jz	1f
 
 	/*
@@ -500,6 +498,7 @@ END(irq_entries_start)
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
 
+	addq	$PT_REGS_OFFSET, %rdi
 	call	\func	/* rdi points to pt_regs */
 	.endm
 
@@ -521,12 +520,12 @@ ret_from_intr:
 	/* Restore saved previous stack */
 	popq	%rsp
 
-	testb	$3, CS(%rsp)
+	testb	$3, PT_REGS(CS)
 	jz	retint_kernel
 
 	/* Interrupt came from user space */
 GLOBAL(retint_user)
-	mov	%rsp,%rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	call	prepare_exit_to_usermode
 	TRACE_IRQS_IRETQ
 	SWAPGS
@@ -537,7 +536,7 @@ retint_kernel:
 #ifdef CONFIG_PREEMPT
 	/* Interrupts are off */
 	/* Check if we need preemption */
-	bt	$9, EFLAGS(%rsp)		/* were interrupts off? */
+	bt	$9, PT_REGS(EFLAGS)		/* were interrupts off? */
 	jnc	1f
 0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
 	jnz	1f
@@ -558,7 +557,7 @@ GLOBAL(restore_regs_and_iret)
 	RESTORE_EXTRA_REGS
 restore_c_regs_and_iret:
 	RESTORE_C_REGS
-	REMOVE_PT_GPREGS_FROM_STACK 8
+	FREE_ENTRY_FRAME 8
 	INTERRUPT_RETURN
 
 ENTRY(native_iret)
@@ -699,11 +698,12 @@ ENTRY(\sym)
 	pushq	$-1				/* ORIG_RAX: no syscall to restart */
 	.endif
 
-	ALLOC_PT_GPREGS_ON_STACK
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
 
 	.if \paranoid
 	.if \paranoid == 1
-	testb	$3, CS(%rsp)			/* If coming from userspace, switch stacks */
+	testb	$3, PT_REGS(CS)			/* If coming from userspace, switch stacks */
 	jnz	1f
 	.endif
 	call	paranoid_entry
@@ -720,11 +720,11 @@ ENTRY(\sym)
 	.endif
 	.endif
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
+	leaq	PT_REGS_OFFSET(%rsp), %rdi	/* pt_regs pointer */
 
 	.if \has_error_code
-	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
-	movq	$-1, ORIG_RAX(%rsp)		/* no syscall to restart */
+	movq	PT_REGS(ORIG_RAX), %rsi		/* get error code */
+	movq	$-1, PT_REGS(ORIG_RAX)		/* no syscall to restart */
 	.else
 	xorl	%esi, %esi			/* no error code */
 	.endif
@@ -755,16 +755,15 @@ ENTRY(\sym)
 1:
 	call	error_entry
 
-
-	movq	%rsp, %rdi			/* pt_regs pointer */
-	call	sync_regs
+	movq	%rsp, %rdi			/* stack frame + pt_regs */
+	call	sync_entry_frame
 	movq	%rax, %rsp			/* switch stack */
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
+	leaq	PT_REGS_OFFSET(%rsp), %rdi	/* pt_regs pointer */
 
 	.if \has_error_code
-	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
-	movq	$-1, ORIG_RAX(%rsp)		/* no syscall to restart */
+	movq	PT_REGS(ORIG_RAX), %rsi		/* get error code */
+	movq	$-1, PT_REGS(ORIG_RAX)		/* no syscall to restart */
 	.else
 	xorl	%esi, %esi			/* no error code */
 	.endif
@@ -922,7 +921,8 @@ ENTRY(xen_failsafe_callback)
 	movq	8(%rsp), %r11
 	addq	$0x30, %rsp
 	pushq	$-1 /* orig_ax = -1 => not a system call */
-	ALLOC_PT_GPREGS_ON_STACK
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
 	jmp	error_exit
@@ -1003,7 +1003,7 @@ paranoid_exit_no_swapgs:
 paranoid_exit_restore:
 	RESTORE_EXTRA_REGS
 	RESTORE_C_REGS
-	REMOVE_PT_GPREGS_FROM_STACK 8
+	FREE_ENTRY_FRAME 8
 	INTERRUPT_RETURN
 END(paranoid_exit)
 
@@ -1016,7 +1016,7 @@ ENTRY(error_entry)
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
 	xorl	%ebx, %ebx
-	testb	$3, CS+8(%rsp)
+	testb	$3, CS+8+PT_REGS_OFFSET(%rsp)
 	jz	.Lerror_kernelspace
 
 .Lerror_entry_from_usermode_swapgs:
@@ -1049,12 +1049,12 @@ ENTRY(error_entry)
 .Lerror_kernelspace:
 	incl	%ebx
 	leaq	native_irq_return_iret(%rip), %rcx
-	cmpq	%rcx, RIP+8(%rsp)
+	cmpq	%rcx, RIP+8+PT_REGS_OFFSET(%rsp)
 	je	.Lerror_bad_iret
 	movl	%ecx, %eax			/* zero extend */
-	cmpq	%rax, RIP+8(%rsp)
+	cmpq	%rax, RIP+8+PT_REGS_OFFSET(%rsp)
 	je	.Lbstep_iret
-	cmpq	$.Lgs_change, RIP+8(%rsp)
+	cmpq	$.Lgs_change, RIP+8+PT_REGS_OFFSET(%rsp)
 	jne	.Lerror_entry_done
 
 	/*
@@ -1066,7 +1066,7 @@ ENTRY(error_entry)
 
 .Lbstep_iret:
 	/* Fix truncated RIP */
-	movq	%rcx, RIP+8(%rsp)
+	movq	%rcx, RIP+8+PT_REGS_OFFSET(%rsp)
 	/* fall through */
 
 .Lerror_bad_iret:
@@ -1182,29 +1182,20 @@ ENTRY(nmi)
 	pushq	2*8(%rdx)	/* pt_regs->cs */
 	pushq	1*8(%rdx)	/* pt_regs->rip */
 	pushq   $-1		/* pt_regs->orig_ax */
-	pushq   %rdi		/* pt_regs->di */
-	pushq   %rsi		/* pt_regs->si */
-	pushq   (%rdx)		/* pt_regs->dx */
-	pushq   %rcx		/* pt_regs->cx */
-	pushq   %rax		/* pt_regs->ax */
-	pushq   %r8		/* pt_regs->r8 */
-	pushq   %r9		/* pt_regs->r9 */
-	pushq   %r10		/* pt_regs->r10 */
-	pushq   %r11		/* pt_regs->r11 */
-	pushq	%rbx		/* pt_regs->rbx */
-	pushq	%rbp		/* pt_regs->rbp */
-	pushq	%r12		/* pt_regs->r12 */
-	pushq	%r13		/* pt_regs->r13 */
-	pushq	%r14		/* pt_regs->r14 */
-	pushq	%r15		/* pt_regs->r15 */
 
-	/*
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
+	movq	(%rdx), %rdx
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS
+
+	/*
 	 * At this point we no longer need to worry about stack damage
 	 * due to nesting -- we're on the normal thread stack and we're
 	 * done with the NMI stack.
 	 */
 
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	movq	$-1, %rsi
 	call	do_nmi
 
@@ -1214,6 +1205,7 @@ ENTRY(nmi)
 	 * do_nmi doesn't modify pt_regs.
 	 */
 	SWAPGS
+	RESTORE_ENTRY_FRAME_POINTER
 	jmp	restore_c_regs_and_iret
 
 .Lnmi_from_kernel:
@@ -1405,7 +1397,8 @@ end_repeat_nmi:
 	 * frame to point back to repeat_nmi.
 	 */
 	pushq	$-1				/* ORIG_RAX: no syscall to restart */
-	ALLOC_PT_GPREGS_ON_STACK
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
 
 	/*
 	 * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
@@ -1417,7 +1410,7 @@ end_repeat_nmi:
 	call	paranoid_entry
 
 	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	movq	$-1, %rsi
 	call	do_nmi
 
@@ -1430,7 +1423,7 @@ nmi_restore:
 	RESTORE_C_REGS
 
 	/* Point RSP at the "iret" frame. */
-	REMOVE_PT_GPREGS_FROM_STACK 6*8
+	FREE_ENTRY_FRAME 6*8
 
 	/*
 	 * Clear "NMI executing".  Set DF first so that we can easily
@@ -1455,3 +1448,22 @@ ENTRY(ignore_sysret)
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
+
+#ifdef CONFIG_FRAME_POINTER
+/*
+ * This is a fake function which allows stack unwinders to detect entry stack
+ * frames.  The entry_frame_ret return address is stored on the stack after the
+ * frame pointer, immediately before pt_regs.
+ *
+ * Some unwinders like gdb are smart enough to report the function which
+ * contains the instruction *before* the return address on the stack.  More
+ * primitive unwinders like the kernel's will report the function containing
+ * the return address itself.  So the address needs to be in the middle of the
+ * function in order to satisfy them both.
+ */
+ENTRY(entry_frame)
+	nop
+GLOBAL(entry_frame_ret)
+	nop
+ENDPROC(entry_frame)
+#endif /* CONFIG_FRAME_POINTER */
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index e1721da..31b4f63c 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -309,21 +309,16 @@ ENTRY(entry_INT80_compat)
 
 	/* Construct struct pt_regs on stack (iret frame is already on stack) */
 	pushq	%rax			/* pt_regs->orig_ax */
-	pushq	%rdi			/* pt_regs->di */
-	pushq	%rsi			/* pt_regs->si */
-	pushq	%rdx			/* pt_regs->dx */
-	pushq	%rcx			/* pt_regs->cx */
-	pushq	$-ENOSYS		/* pt_regs->ax */
-	pushq   $0			/* pt_regs->r8  = 0 */
-	pushq   $0			/* pt_regs->r9  = 0 */
-	pushq   $0			/* pt_regs->r10 = 0 */
-	pushq   $0			/* pt_regs->r11 = 0 */
-	pushq   %rbx                    /* pt_regs->rbx */
-	pushq   %rbp                    /* pt_regs->rbp */
-	pushq   %r12                    /* pt_regs->r12 */
-	pushq   %r13                    /* pt_regs->r13 */
-	pushq   %r14                    /* pt_regs->r14 */
-	pushq   %r15                    /* pt_regs->r15 */
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
+	movq	$-ENOSYS, %rax
+	xorq	%r8, %r8
+	xorq	%r9, %r9
+	xorq	%r10, %r10
+	xorq	%r11, %r11
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS
+
 	cld
 
 	/*
@@ -332,7 +327,7 @@ ENTRY(entry_INT80_compat)
 	 */
 	TRACE_IRQS_OFF
 
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	call	do_int80_syscall_32
 .Lsyscall_32_done:
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c3496619..bba7ece 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -70,7 +70,6 @@ dotraplinkage void do_segment_not_present(struct pt_regs *, long);
 dotraplinkage void do_stack_segment(struct pt_regs *, long);
 #ifdef CONFIG_X86_64
 dotraplinkage void do_double_fault(struct pt_regs *, long);
-asmlinkage struct pt_regs *sync_regs(struct pt_regs *);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *, long);
 dotraplinkage void do_page_fault(struct pt_regs *, unsigned long);
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 5df831e..f3c7922 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -362,24 +362,15 @@ early_idt_handler_common:
 	incl early_recursion_flag(%rip)
 
 	/* The vector number is currently in the pt_regs->di slot. */
-	pushq %rsi				/* pt_regs->si */
-	movq 8(%rsp), %rsi			/* RSI = vector number */
-	movq %rdi, 8(%rsp)			/* pt_regs->di = RDI */
-	pushq %rdx				/* pt_regs->dx */
-	pushq %rcx				/* pt_regs->cx */
-	pushq %rax				/* pt_regs->ax */
-	pushq %r8				/* pt_regs->r8 */
-	pushq %r9				/* pt_regs->r9 */
-	pushq %r10				/* pt_regs->r10 */
-	pushq %r11				/* pt_regs->r11 */
-	pushq %rbx				/* pt_regs->bx */
-	pushq %rbp				/* pt_regs->bp */
-	pushq %r12				/* pt_regs->r12 */
-	pushq %r13				/* pt_regs->r13 */
-	pushq %r14				/* pt_regs->r14 */
-	pushq %r15				/* pt_regs->r15 */
-
-	cmpq $14,%rsi		/* Page fault? */
+	ALLOC_ENTRY_FRAME addskip=-8
+	SAVE_ENTRY_FRAME_POINTER
+	movq %rbx, PT_REGS(RBX)	/* save rbx */
+	movq PT_REGS(RDI), %rbx	/* rbx = vector number */
+
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS_EXCEPT_RBX
+
+	cmpq $14, %rbx		/* page fault? */
 	jnz 10f
 	GET_CR2_INTO(%rdi)	/* Can clobber any volatile register if pv */
 	call early_make_pgtable
@@ -387,7 +378,9 @@ early_idt_handler_common:
 	jz 20f			/* All good */
 
 10:
-	movq %rsp,%rdi		/* RDI = pt_regs; RSI is already trapnr */
+	movq %rsp, %rdi
+	addq $PT_REGS_OFFSET, %rdi /* rdi = pt_regs */
+	movq %rbx, %rsi		   /* rsi = vector number */
 	call early_fixup_exception
 
 20:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 00f03d8..6954e74 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -514,22 +514,34 @@ exit:
 NOKPROBE_SYMBOL(do_int3);
 
 #ifdef CONFIG_X86_64
+
+struct entry_frame {
+#ifdef CONFIG_FRAME_POINTER
+	void *fp;
+	void *ret_addr;
+#endif
+	struct pt_regs regs;
+};
+
 /*
  * Help handler running on IST stack to switch off the IST stack if the
  * interrupted code was in user mode. The actual stack switch is done in
  * entry_64.S
  */
-asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
+asmlinkage __visible notrace
+struct entry_frame *sync_entry_frame(struct entry_frame *old)
 {
-	struct pt_regs *regs = task_pt_regs(current);
-	*regs = *eregs;
-	return regs;
+	struct entry_frame *new = container_of(task_pt_regs(current),
+					       struct entry_frame, regs);
+
+	*new = *old;
+	return new;
 }
-NOKPROBE_SYMBOL(sync_regs);
+NOKPROBE_SYMBOL(sync_entry_frame);
 
 struct bad_iret_stack {
 	void *error_entry_ret;
-	struct pt_regs regs;
+	struct entry_frame frame;
 };
 
 asmlinkage __visible notrace
@@ -544,15 +556,15 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 	 */
 	struct bad_iret_stack *new_stack =
 		container_of(task_pt_regs(current),
-			     struct bad_iret_stack, regs);
+			     struct bad_iret_stack, frame.regs);
 
 	/* Copy the IRET target to the new stack. */
-	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
+	memmove(&new_stack->frame.regs.ip, (void *)s->frame.regs.sp, 5*8);
 
 	/* Copy the remainder of the stack from the current stack. */
-	memmove(new_stack, s, offsetof(struct bad_iret_stack, regs.ip));
+	memmove(new_stack, s, offsetof(struct bad_iret_stack, frame.regs.ip));
 
-	BUG_ON(!user_mode(&new_stack->regs));
+	BUG_ON(!user_mode(&new_stack->frame.regs));
 	return new_stack;
 }
 NOKPROBE_SYMBOL(fixup_bad_iret);
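
For illustration, a minimal sketch (not from the patch) of how a
frame-pointer unwinder could consume the entry frames created above,
reusing the struct entry_frame and entry_frame_ret definitions from the
patch; the helper names are hypothetical:

/* entry_frame_ret is the label defined in entry_64.S above */
extern char entry_frame_ret[];

/* an entry frame is identified by its fake return address */
static bool is_entry_frame(struct entry_frame *frame)
{
	return frame->ret_addr == (void *)entry_frame_ret;
}

/* pt_regs sits immediately after the two-pointer frame header */
static struct pt_regs *entry_frame_regs(struct entry_frame *frame)
{
	return &frame->regs;
}

A frame-pointer walk that lands on such a frame (rbp points at the fp
field) could then hand &frame->regs to the rest of the unwinder.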

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-22 16:30                                 ` Josh Poimboeuf
@ 2016-06-22 17:59                                   ` Andy Lutomirski
  2016-06-22 18:22                                     ` Josh Poimboeuf
  2016-06-23  0:09                                   ` Andy Lutomirski
  1 sibling, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-06-22 17:59 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote:
>> On May 23, 2016 7:28 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>> > > Maybe I'm coming around to liking this idea.
>> >
>> > Ok, good :-)
>> >
>> > > In an ideal world (DWARF support, high-quality unwinder, nice pretty
>> > > printer, etc), unwinding across a kernel exception would look like:
>> > >
>> > >  - some_func
>> > >  - some_other_func
>> > >  - do_page_fault
>> > >  - page_fault
>> > >
>> > > After page_fault, the next unwind step takes us to the faulting RIP
>> > > (faulting_func) and reports that all GPRs are known.  It should
>> > > probably learn this fact from DWARF if DWARF is available, instead of
>> > > directly decoding pt_regs, due to a few funny cases in which pt_regs
>> > > may be incomplete.  A nice pretty printer could now print all the
>> > > regs.
>> > >
>> > >  - faulting_func
>> > >  - etc.
>> > >
>> > > For this to work, we don't actually need the unwinder to explicitly
>> > > know where pt_regs is.
>> >
>> > That's true (but only for DWARF).
>> >
>> > > Food for thought, though: if user code does SYSENTER with TF set,
>> > > you'll end up with partial pt_regs.  There's nothing the kernel can do
>> > > about it.  DWARF will handle it without any fanfare, but I don't know
>> > > if it's going to cause trouble for you pre-DWARF.
>> >
>> > In this case it should see the stack pointer is past the pt_regs offset,
>> > so it would just report it as an empty stack.
>>
>> OK
>>
>> >
>> > > I'm also not sure it makes sense to apply this before the unwinder
>> > > that can consume it is ready.  Maybe, if it would be consistent with
>> > > your plans, it would make sense to rewrite the unwinder first, then
>> > > apply this and teach live patching to use the new unwinder, and *then*
>> > > add DWARF support?
>> >
>> > For the purposes of livepatch, the reliable unwinder needs to detect
>> > whether an interrupt/exception pt_regs frame exists on a sleeping task
>> > (or current).  This patch would allow us to do that.
>> >
>> > So my preferred order of doing things would be:
>> >
>> > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
>> > 2) this patch for annotating pt_regs stack frames
>> > 3) reliable unwinder, similar to what I already posted, except it relies
>> >    on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
>> >    the new inactive task frame format of #1
>> > 4) livepatch consistency model which uses the reliable unwinder
>> > 5) rewrite unwinder, and port all users to the new interface
>> > 6) add DWARF unwinder
>> >
>> > 1-4 are pretty much already written, whereas 5 and 6 will take
>> > considerably more work.
>>
>> Fair enough.
>>
>> >
>> > > > +       /*
>> > > > +        * Create a stack frame for the saved pt_regs.  This allows frame
>> > > > +        * pointer based unwinders to find pt_regs on the stack.
>> > > > +        */
>> > > > +       .macro CREATE_PT_REGS_FRAME regs=%rsp
>> > > > +#ifdef CONFIG_FRAME_POINTER
>> > > > +       pushq   \regs
>> > > > +       pushq   $pt_regs+1
>> > >
>> > > Why the +1?
>> >
>> > Some unwinders like gdb are smart enough to report the function which
>> > contains the instruction *before* the return address.  Without the +1,
>> > they would show the wrong function.
>>
>> Lovely.  Want to add a comment?
>>
>> >
>> > > > +       pushq   %rbp
>> > > > +       movq    %rsp, %rbp
>> > > > +#endif
>> > > > +       .endm
>> > >
>> > > I keep wanting this to be only two pushes and to fudge rbp to make it
>> > > work, but I don't see a good way.  But let's call it
>> > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
>> > > nested_frame or similar.
>> >
>> > Or, if we aren't going to annotate syscall pt_regs, we could give it a
>> > more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?
>>
>> CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
>> too.  CREATE_PT_REGS_FRAME is probably fine.
>>
>> > > > +
>> > > > +/* fake function which allows stack unwinders to detect pt_regs frames */
>> > > > +#ifdef CONFIG_FRAME_POINTER
>> > > > +ENTRY(pt_regs)
>> > > > +       nop
>> > > > +       nop
>> > > > +ENDPROC(pt_regs)
>> > > > +#endif /* CONFIG_FRAME_POINTER */
>> > >
>> > > Why is this two bytes long?  Is there some reason it has to be more
>> > > than one byte?
>> >
>> > Similar to above, this is related to the need to support various
>> > unwinders.  Whether the unwinder displays the ret_addr or the
>> > instruction preceding it, either way the instruction needs to be inside
>> > the function for the function to be reported.
>>
>> OK.
>
> Andy,
>
> So I got a chance to look at this some more.  I'm thinking that to make
> this feature more consistently useful, we shouldn't only annotate
> pt_regs frames for calls to handlers; other calls should be annotated as
> well: preempt_schedule_irq, CALL_enter_from_user_mode,
> prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> etc.  That way, the unwinder will always be able to find pt_regs from an
> interrupt/exception, even if starting from one of these other calls.
>
> But then, things get ugly.  You have to either setup and tear down the
> frame for every possible call, or do a higher-level setup/teardown
> across multiple calls, which invalidates several assumptions in the
> entry code about the location of pt_regs on the stack.
>
> Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
> make assumptions about the location of pt_regs.  And they're used by
> both syscall and interrupt code.  So if we didn't create a frame pointer
> header for syscalls, we'd basically need two versions of the macros: one
> for irqs/exceptions and one for syscalls.
>
> So I think the cleanest way to handle this is to always allocate two
> extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
> entry code can assume that pt_regs is at a constant location, and all
> the above problems go away.  Another benefit is that we'd only need two
> saves instead of three -- the pointer to pt_regs is no longer needed
> since pt_regs is always immediately after the frame header.
>
> I worked up a patch to implement this -- see below.  It writes the frame
> pointer in all entry paths, including syscalls.  This helps keep the
> code simple.
>
> The downside is a small performance penalty: with getppid()-in-a-loop on
> my laptop, the average syscall went from 52ns to 53ns, which is about a
> 2% slowdown.  But I doubt it would be measurable in a real-world
> workload.
>
> It looks like about half the slowdown is due to the extra stack
> allocation (which presumably adds a little d-cache pressure on the stack
> memory) and the other half is due to the stack writes.
>
> I could remove the writes from the syscall path but it would only save
> about half a ns, and it would make the code less robust.  Plus it's nice
> to have the consistency of having *all* pt_regs frames annotated.

This is a bit messy, and I'm not really sure that the entry code
should have to operate under constraints like this.  Also,
convincing myself this works for NMI sounds unpleasant.

Maybe we should go back to my idea of just listing the call sites in a table.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-22 17:59                                   ` Andy Lutomirski
@ 2016-06-22 18:22                                     ` Josh Poimboeuf
  2016-06-22 18:26                                       ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-06-22 18:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:
> > So I got a chance to look at this some more.  I'm thinking that to make
> > this feature more consistently useful, we shouldn't only annotate
> > pt_regs frames for calls to handlers; other calls should be annotated as
> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> > etc.  That way, the unwinder will always be able to find pt_regs from an
> > interrupt/exception, even if starting from one of these other calls.
> >
> > But then, things get ugly.  You have to either setup and tear down the
> > frame for every possible call, or do a higher-level setup/teardown
> > across multiple calls, which invalidates several assumptions in the
> > entry code about the location of pt_regs on the stack.
> >
> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
> > make assumptions about the location of pt_regs.  And they're used by
> > both syscall and interrupt code.  So if we didn't create a frame pointer
> > header for syscalls, we'd basically need two versions of the macros: one
> > for irqs/exceptions and one for syscalls.
> >
> > So I think the cleanest way to handle this is to always allocate two
> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
> > entry code can assume that pt_regs is at a constant location, and all
> > the above problems go away.  Another benefit is that we'd only need two
> > saves instead of three -- the pointer to pt_regs is no longer needed
> > since pt_regs is always immediately after the frame header.
> >
> > I worked up a patch to implement this -- see below.  It writes the frame
> > pointer in all entry paths, including syscalls.  This helps keep the
> > code simple.
> >
> > The downside is a small performance penalty: with getppid()-in-a-loop on
> > my laptop, the average syscall went from 52ns to 53ns, which is about a
> > 2% slowdown.  But I doubt it would be measurable in a real-world
> > workload.
> >
> > It looks like about half the slowdown is due to the extra stack
> > allocation (which presumably adds a little d-cache pressure on the stack
> > memory) and the other half is due to the stack writes.
> >
> > I could remove the writes from the syscall path but it would only save
> > about half a ns, and it would make the code less robust.  Plus it's nice
> > to have the consistency of having *all* pt_regs frames annotated.
> 
> This is a bit messy, and I'm not really sure that the entry code
> should be have to operate under constraints like this.  Also,
> convincing myself this works for NMI sounds unpleasant.
> 
> Maybe we should go back to my idea of just listing the call sites in a table.

So are you suggesting something like:

	.macro ENTRY_CALL func pt_regs_offset=0
	call \func
1:	.pushsection .entry_calls, "a"
	.long 1b - .
	.long \pt_regs_offset
	.popsection
	.endm

and then change every call in the entry code to ENTRY_CALL?

-- 
Josh
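
For illustration, a rough sketch of how an unwinder might consume such a
table, written as kernel-style C (s32 is from <linux/types.h>).  It
assumes the .entry_calls entries above plus linker-provided start/stop
symbols; all names here are hypothetical, not from a patch:

struct entry_call {
	s32 ret_offset;		/* "1b - .": return address, PC-relative */
	s32 regs_offset;	/* pt_regs offset at that call site */
};

/* assumed to be provided by the linker script for the new section */
extern struct entry_call __start_entry_calls[], __stop_entry_calls[];

static const struct entry_call *find_entry_call(unsigned long ret_addr)
{
	const struct entry_call *c;

	for (c = __start_entry_calls; c < __stop_entry_calls; c++) {
		/* recover the absolute return address from "1b - ." */
		unsigned long site = (unsigned long)&c->ret_offset +
				     c->ret_offset;

		if (site == ret_addr)
			return c;
	}
	return NULL;
}

A hit would tell the unwinder that the frame belongs to entry code and
that pt_regs lives at regs_offset from the stack pointer at that point.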

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-22 18:22                                     ` Josh Poimboeuf
@ 2016-06-22 18:26                                       ` Andy Lutomirski
  2016-06-22 18:40                                         ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-06-22 18:26 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Wed, Jun 22, 2016 at 11:22 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:
>> > So I got a chance to look at this some more.  I'm thinking that to make
>> > this feature more consistently useful, we shouldn't only annotate
>> > pt_regs frames for calls to handlers; other calls should be annotated as
>> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
>> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
>> > etc.  That way, the unwinder will always be able to find pt_regs from an
>> > interrupt/exception, even if starting from one of these other calls.
>> >
>> > But then, things get ugly.  You have to either setup and tear down the
>> > frame for every possible call, or do a higher-level setup/teardown
>> > across multiple calls, which invalidates several assumptions in the
>> > entry code about the location of pt_regs on the stack.
>> >
>> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
>> > make assumptions about the location of pt_regs.  And they're used by
>> > both syscall and interrupt code.  So if we didn't create a frame pointer
>> > header for syscalls, we'd basically need two versions of the macros: one
>> > for irqs/exceptions and one for syscalls.
>> >
>> > So I think the cleanest way to handle this is to always allocate two
>> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
>> > entry code can assume that pt_regs is at a constant location, and all
>> > the above problems go away.  Another benefit is that we'd only need two
>> > saves instead of three -- the pointer to pt_regs is no longer needed
>> > since pt_regs is always immediately after the frame header.
>> >
>> > I worked up a patch to implement this -- see below.  It writes the frame
>> > pointer in all entry paths, including syscalls.  This helps keep the
>> > code simple.
>> >
>> > The downside is a small performance penalty: with getppid()-in-a-loop on
>> > my laptop, the average syscall went from 52ns to 53ns, which is about a
>> > 2% slowdown.  But I doubt it would be measurable in a real-world
>> > workload.
>> >
>> > It looks like about half the slowdown is due to the extra stack
>> > allocation (which presumably adds a little d-cache pressure on the stack
>> > memory) and the other half is due to the stack writes.
>> >
>> > I could remove the writes from the syscall path but it would only save
>> > about half a ns, and it would make the code less robust.  Plus it's nice
>> > to have the consistency of having *all* pt_regs frames annotated.
>>
>> This is a bit messy, and I'm not really sure that the entry code
>> should be have to operate under constraints like this.  Also,
>> convincing myself this works for NMI sounds unpleasant.
>>
>> Maybe we should go back to my idea of just listing the call sites in a table.
>
> So are you suggesting something like:
>
>         .macro ENTRY_CALL func pt_regs_offset=0
>         call \func
> 1:      .pushsection .entry_calls, "a"
>         .long 1b - .
>         .long \pt_regs_offset
>         .popsection
>         .endm
>
> and then change every call in the entry code to ENTRY_CALL?

Yes, exactly, modulo whether the section name is good.  hpa is
probably the authority on that.

--Andy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-22 18:26                                       ` Andy Lutomirski
@ 2016-06-22 18:40                                         ` Josh Poimboeuf
  2016-06-22 19:17                                           ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-06-22 18:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote:
> On Wed, Jun 22, 2016 at 11:22 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:
> >> > So I got a chance to look at this some more.  I'm thinking that to make
> >> > this feature more consistently useful, we shouldn't only annotate
> >> > pt_regs frames for calls to handlers; other calls should be annotated as
> >> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
> >> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> >> > etc.  That way, the unwinder will always be able to find pt_regs from an
> >> > interrupt/exception, even if starting from one of these other calls.
> >> >
> >> > But then, things get ugly.  You have to either setup and tear down the
> >> > frame for every possible call, or do a higher-level setup/teardown
> >> > across multiple calls, which invalidates several assumptions in the
> >> > entry code about the location of pt_regs on the stack.
> >> >
> >> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
> >> > make assumptions about the location of pt_regs.  And they're used by
> >> > both syscall and interrupt code.  So if we didn't create a frame pointer
> >> > header for syscalls, we'd basically need two versions of the macros: one
> >> > for irqs/exceptions and one for syscalls.
> >> >
> >> > So I think the cleanest way to handle this is to always allocate two
> >> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
> >> > entry code can assume that pt_regs is at a constant location, and all
> >> > the above problems go away.  Another benefit is that we'd only need two
> >> > saves instead of three -- the pointer to pt_regs is no longer needed
> >> > since pt_regs is always immediately after the frame header.
> >> >
> >> > I worked up a patch to implement this -- see below.  It writes the frame
> >> > pointer in all entry paths, including syscalls.  This helps keep the
> >> > code simple.
> >> >
> >> > The downside is a small performance penalty: with getppid()-in-a-loop on
> >> > my laptop, the average syscall went from 52ns to 53ns, which is about a
> >> > 2% slowdown.  But I doubt it would be measurable in a real-world
> >> > workload.
> >> >
> >> > It looks like about half the slowdown is due to the extra stack
> >> > allocation (which presumably adds a little d-cache pressure on the stack
> >> > memory) and the other half is due to the stack writes.
> >> >
> >> > I could remove the writes from the syscall path but it would only save
> >> > about half a ns, and it would make the code less robust.  Plus it's nice
> >> > to have the consistency of having *all* pt_regs frames annotated.
> >>
> >> This is a bit messy, and I'm not really sure that the entry code
> >> should be have to operate under constraints like this.  Also,
> >> convincing myself this works for NMI sounds unpleasant.
> >>
> >> Maybe we should go back to my idea of just listing the call sites in a table.
> >
> > So are you suggesting something like:
> >
> >         .macro ENTRY_CALL func pt_regs_offset=0
> >         call \func
> > 1:      .pushsection .entry_calls, "a"
> >         .long 1b - .
> >         .long \pt_regs_offset
> >         .popsection
> >         .endm
> >
> > and then change every call in the entry code to ENTRY_CALL?
> 
> Yes, exactly, modulo whether the section name is good.  hpa is
> probably the authority on that.

Well, as you probably know, I don't really like peppering ENTRY_CALL
everywhere. :-/

Also I wonder how we could annotate the hypercalls, for example
DISABLE_INTERRUPTS actually wraps the call in a push/pop pair.

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-22 18:40                                         ` Josh Poimboeuf
@ 2016-06-22 19:17                                           ` Andy Lutomirski
  2016-06-23 16:19                                             ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-06-22 19:17 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Wed, Jun 22, 2016 at 11:40 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote:
>> >
>> > So are you suggesting something like:
>> >
>> >         .macro ENTRY_CALL func pt_regs_offset=0
>> >         call \func
>> > 1:      .pushsection .entry_calls, "a"
>> >         .long 1b - .
>> >         .long \pt_regs_offset
>> >         .popsection
>> >         .endm
>> >
>> > and then change every call in the entry code to ENTRY_CALL?
>>
>> Yes, exactly, modulo whether the section name is good.  hpa is
>> probably the authority on that.
>
> Well, as you probably know, I don't really like peppering ENTRY_CALL
> everywhere. :-/

Me neither.  But at least it's less constraining on the
already-fairly-hairy code.

>
> Also I wonder how we could annotate the hypercalls, for example
> DISABLE_INTERRUPTS actually wraps the call in a push/pop pair.

Oh, yuck.  But forcing all the DISABLE_INTERRUPTS and
ENABLE_INTERRUPTS invocations to be in frame pointer regions isn't so
great either.

DWARF solves this problem completely and IMO fairly cleanly.  Maybe we
should add your task flag and then consider removing it again when
DWARF happens.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-22 16:30                                 ` Josh Poimboeuf
  2016-06-22 17:59                                   ` Andy Lutomirski
@ 2016-06-23  0:09                                   ` Andy Lutomirski
  2016-06-23 15:55                                     ` Josh Poimboeuf
  1 sibling, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-06-23  0:09 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> Andy,
>
> So I got a chance to look at this some more.  I'm thinking that to make
> this feature more consistently useful, we shouldn't only annotate
> pt_regs frames for calls to handlers; other calls should be annotated as
> well: preempt_schedule_irq, CALL_enter_from_user_mode,
> prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> etc.  That way, the unwinder will always be able to find pt_regs from an
> interrupt/exception, even if starting from one of these other calls.
>
> But then, things get ugly.  You have to either setup and tear down the
> frame for every possible call, or do a higher-level setup/teardown
> across multiple calls, which invalidates several assumptions in the
> entry code about the location of pt_regs on the stack.
>

Here's yet another harebrained idea.  Maybe it works better than my
previous harebrained ideas :)

Your patch is already creating a somewhat nonstandard stack frame:

+       movq    %rbp,                   0*8(%rsp)
+       movq    $entry_frame_ret,       1*8(%rsp)
+       movq    %rsp, %rbp

It's kind of a normal stack frame, but rbp points at something odd,
and to unwind it fully correctly, the unwinder needs to know about it.

What if we made it even more special, along the lines of:

leaq offset_to_ptregs(%rsp), %rbp
xorq $-1, %rbp

IOW, don't write anything to the stack at all, and just put a special
value into RBP that says "the next frame is pt_regs at such-and-such
address".  Do this once on entry and make sure to restore RBP (from
pt_regs) on exit.  Now the unwinder can notice that RBP has the high
bits clear *and* that the negation of it points to the stack, and it
can figure out what's going on.

What do you think?  Am I nuts or could this work?

It had better not have much risk of breaking things worse than they
currently are, given that current kernels allow user code to stick any
value it likes into the very last element of the RBP chain.

--Andy
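
For illustration, a rough sketch (not from any patch) of the check this
scheme implies on the unwinder side; stack_lo/stack_hi stand in for the
current task's stack bounds:

static bool rbp_is_encoded_ptregs(unsigned long rbp,
				  unsigned long stack_lo,
				  unsigned long stack_hi)
{
	unsigned long regs = ~rbp;	/* undo the xorq $-1, %rbp */

	/* high bits clear, and the negation points into the stack */
	return !(rbp & (1UL << 63)) &&
	       regs >= stack_lo && regs < stack_hi;
}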

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-23  0:09                                   ` Andy Lutomirski
@ 2016-06-23 15:55                                     ` Josh Poimboeuf
  0 siblings, 0 replies; 121+ messages in thread
From: Josh Poimboeuf @ 2016-06-23 15:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Wed, Jun 22, 2016 at 05:09:11PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > Andy,
> >
> > So I got a chance to look at this some more.  I'm thinking that to make
> > this feature more consistently useful, we shouldn't only annotate
> > pt_regs frames for calls to handlers; other calls should be annotated as
> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> > etc.  That way, the unwinder will always be able to find pt_regs from an
> > interrupt/exception, even if starting from one of these other calls.
> >
> > But then, things get ugly.  You have to either setup and tear down the
> > frame for every possible call, or do a higher-level setup/teardown
> > across multiple calls, which invalidates several assumptions in the
> > entry code about the location of pt_regs on the stack.
> >
> 
> Here's yet another harebrained idea.  Maybe it works better than my
> previous harebrained ideas :)
> 
> Your patch is already creating a somewhat nonstandard stack frame:
> 
> +       movq    %rbp,                   0*8(%rsp)
> +       movq    $entry_frame_ret,       1*8(%rsp)
> +       movq    %rsp, %rbp
> 
> It's kind of a normal stack frame, but rbp points at something odd,
> and to unwind it fully correctly, the unwinder needs to know about it.
> 
> What if we made it even more special, along the lines of:
> 
> leaq offset_to_ptregs(%rsp), %rbp
> xorq $-1, %rbp
> 
> IOW, don't write anything to the stack at all, and just put a special
> value into RBP that says "the next frame is pt_regs at such-and-such
> address".  Do this once on entry and make sure to restore RBP (from
> pt_regs) on exit.  Now the unwinder can notice that RBP has the high
> bits clear *and* that the negation of it points to the stack, and it
> can figure out what's going on.
> 
> What do you think?  Am I nuts or could this work?
> 
> It had better not have much risk of breaking things worse than they
> currently are, given that current kernel allow user code to stick any
> value it likes into the very last element of the RBP chain.

I think it's a good idea, and it could work... BUT it would break
external unwinders like gdb for the in-kernel entry case.

For interrupts and exceptions in kernel mode, rbp *is* valid.  Sure, it
doesn't tell you the interrupted function, but it does tell you its
caller.  A generic frame pointer unwinder skips the interrupted
function, but at least it keeps going.  If we encoded rbp on entry, that
would break.

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-22 19:17                                           ` Andy Lutomirski
@ 2016-06-23 16:19                                             ` Josh Poimboeuf
  2016-06-23 16:35                                               ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-06-23 16:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Wed, Jun 22, 2016 at 12:17:25PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 22, 2016 at 11:40 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote:
> >> >
> >> > So are you suggesting something like:
> >> >
> >> >         .macro ENTRY_CALL func pt_regs_offset=0
> >> >         call \func
> >> > 1:      .pushsection .entry_calls, "a"
> >> >         .long 1b - .
> >> >         .long \pt_regs_offset
> >> >         .popsection
> >> >         .endm
> >> >
> >> > and then change every call in the entry code to ENTRY_CALL?
> >>
> >> Yes, exactly, modulo whether the section name is good.  hpa is
> >> probably the authority on that.
> >
> > Well, as you probably know, I don't really like peppering ENTRY_CALL
> > everywhere. :-/
> 
> Me neither.  But at least it's less constraining on the
> already-fairly-hairy code.
> 
> >
> > Also I wonder how we could annotate the hypercalls, for example
> > DISABLE_INTERRUPTS actually wraps the call in a push/pop pair.
> 
> Oh, yuck.  But forcing all the DISABLE_INTERRUPTS and
> ENABLE_INTERRUPTS invocations to be in frame pointer regions isn't so
> great either.

Hm, I don't follow this statement.  Why not?  The more frame pointer
coverage, the better, especially if it doesn't add any additional
overhead.

> DWARF solves this problem completely and IMO fairly cleanly.  Maybe we
> should add your task flag and then consider removing it again when
> DWARF happens.

I tend to doubt we'd be able to remove it later.  As you said before,
many embedded platforms probably won't be able to switch to DWARF, and
they'll want to do live patching too.

So which is the least-bad option?  To summarize:

  1) task flag(s) for preemption and page faults

  2) turn pt_regs into a stack frame

  3) annotate all calls from entry code in a table

  4) encode rbp on entry

They all have their issues, though I'm partial to #2.

Any more hare-brained ideas? :-)

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-23 16:19                                             ` Josh Poimboeuf
@ 2016-06-23 16:35                                               ` Andy Lutomirski
  2016-06-23 18:31                                                 ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Andy Lutomirski @ 2016-06-23 16:35 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Thu, Jun 23, 2016 at 9:19 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Wed, Jun 22, 2016 at 12:17:25PM -0700, Andy Lutomirski wrote:
>> On Wed, Jun 22, 2016 at 11:40 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote:
>> >> >
>> >> > So are you suggesting something like:
>> >> >
>> >> >         .macro ENTRY_CALL func pt_regs_offset=0
>> >> >         call \func
>> >> > 1:      .pushsection .entry_calls, "a"
>> >> >         .long 1b - .
>> >> >         .long \pt_regs_offset
>> >> >         .popsection
>> >> >         .endm
>> >> >
>> >> > and then change every call in the entry code to ENTRY_CALL?
>> >>
>> >> Yes, exactly, modulo whether the section name is good.  hpa is
>> >> probably the authority on that.
>> >
>> > Well, as you probably know, I don't really like peppering ENTRY_CALL
>> > everywhere. :-/
>>
>> Me neither.  But at least it's less constraining on the
>> already-fairly-hairy code.
>>
>> >
>> > Also I wonder how we could annotate the hypercalls, for example
>> > DISABLE_INTERRUPTS actually wraps the call in a push/pop pair.
>>
>> Oh, yuck.  But forcing all the DISABLE_INTERRUPTS and
>> ENABLE_INTERRUPTS invocations to be in frame pointer regions isn't so
>> great either.
>
> Hm, I don't follow this statement.  Why not?  The more frame pointer
> coverage, the better, especially if it doesn't add any additional
> overhead.

Less flexibility, and it's IMO annoying to make the Xen case have
extra constraints.  It also makes it very awkward or impossible to
take advantage of the sti interrupt window, although admittedly that
doesn't work on Xen either, so maybe that's moot.

>
>> DWARF solves this problem completely and IMO fairly cleanly.  Maybe we
>> should add your task flag and then consider removing it again when
>> DWARF happens.
>
> I tend to doubt we'd be able to remove it later.  As you said before,
> many embedded platforms probably won't be able to switch to DWARF, and
> they'll want to do live patching too.
>
> So which is the least-bad option?  To summarize:
>
>   1) task flag(s) for preemption and page faults
>
>   2) turn pt_regs into a stack frame
>
>   3) annotate all calls from entry code in a table
>
>   4) encode rbp on entry
>
> They all have their issues, though I'm partial to #2.
>
> Any more hare-brained ideas? :-)

I'll try to take a closer look at #2 and see just how much I dislike
all the stack frame munging.  Also, in principle, it's only the
sleeping calls and the calls that make it into real (non-entry) kernel
code that really want to be unwindable through this mechanism.

FWIW, I don't care that much about preserving gdb's partial ability to
unwind through pt_regs, especially because gdb really ought to be able
to use DWARF, too.

--Andy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-23 16:35                                               ` Andy Lutomirski
@ 2016-06-23 18:31                                                 ` Josh Poimboeuf
  2016-06-23 20:40                                                   ` Josh Poimboeuf
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-06-23 18:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Thu, Jun 23, 2016 at 09:35:29AM -0700, Andy Lutomirski wrote:
> > So which is the least-bad option?  To summarize:
> >
> >   1) task flag(s) for preemption and page faults
> >
> >   2) turn pt_regs into a stack frame
> >
> >   3) annotate all calls from entry code in a table
> >
> >   4) encode rbp on entry
> >
> > They all have their issues, though I'm partial to #2.
> >
> > Any more hare-brained ideas? :-)
> 
> I'll try to take a closer look at #2 and see just how much I dislike
> all the stack frame munging.

Ok.

> Also, in principle, it's only the
> sleeping calls and the calls that make it into real (non-entry) kernel
> code that really want to be unwindable through this mechanism.

Yeah, that's true.  We could modify options 2 or 3 to be less absolute.
Though I think that makes them more prone to future breakage.

> FWIW, I don't care that much about preserving gdb's partial ability to
> unwind through pt_regs, especially because gdb really ought to be able
> to use DWARF, too.

Hm, that's a good point.  I really don't know if there are any other
external tools out there that would care.  Maybe we could try option 4
and then see if anybody complains.

-- 
Josh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-23 18:31                                                 ` Josh Poimboeuf
@ 2016-06-23 20:40                                                   ` Josh Poimboeuf
  2016-06-23 22:00                                                     ` Andy Lutomirski
  0 siblings, 1 reply; 121+ messages in thread
From: Josh Poimboeuf @ 2016-06-23 20:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Thu, Jun 23, 2016 at 01:31:32PM -0500, Josh Poimboeuf wrote:
> On Thu, Jun 23, 2016 at 09:35:29AM -0700, Andy Lutomirski wrote:
> > > So which is the least-bad option?  To summarize:
> > >
> > >   1) task flag(s) for preemption and page faults
> > >
> > >   2) turn pt_regs into a stack frame
> > >
> > >   3) annotate all calls from entry code in a table
> > >
> > >   4) encode rbp on entry
> > >
> > > They all have their issues, though I'm partial to #2.
> > >
> > > Any more hare-brained ideas? :-)
> > 
> > I'll try to take a closer look at #2 and see just how much I dislike
> > all the stack frame munging.
> 
> Ok.
> 
> > Also, in principle, it's only the
> > sleeping calls and the calls that make it into real (non-entry) kernel
> > code that really want to be unwindable through this mechanism.
> 
> Yeah, that's true.  We could modify options 2 or 3 to be less absolute.
> Though I think that makes them more prone to future breakage.
> 
> > FWIW, I don't care that much about preserving gdb's partial ability to
> > unwind through pt_regs, especially because gdb really ought to be able
> > to use DWARF, too.
> 
> Hm, that's a good point.  I really don't know if there are any other
> external tools out there that would care.  Maybe we could try option 4
> and then see if anybody complains.

I'm starting to think hare-brained option 4 is the way to go.  Any
external tooling should really be relying on DWARF anyway.

Here's a sneak preview.  If this general approach looks ok to you, I'll
go ahead and port all the in-tree unwinders and post a proper patch.

Instead of using xor -1 on the pt_regs pointer, I just cleared the
high-order bit.  That makes the unwinding experience much more pleasant
for a human stack walker, and also ensures that anybody trying to
dereference it gets slapped with an oops, at least in the 48-bit address
space era.

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..bf397426 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -201,6 +201,23 @@ For 32-bit we have the following conventions - kernel is built with
 	.byte 0xf1
 	.endm
 
+	/*
+	 * This is a sneaky trick to help the unwinder find pt_regs on the
+	 * stack.  The frame pointer is replaced with an encoded pointer to
+	 * pt_regs.  The encoding is just a clearing of the highest-order bit,
+	 * which makes it an invalid address and is also a signal to the
+	 * unwinder that it's a pt_regs pointer in disguise.
+	 *
+	 * NOTE: This must be called *after* SAVE_EXTRA_REGS because it
+	 * corrupts rbp.
+	 */
+	.macro ENCODE_FRAME_POINTER ptregs_offset=0
+#ifdef CONFIG_FRAME_POINTER
+	leaq \ptregs_offset(%rsp), %rbp
+	btr $63, %rbp
+#endif
+	.endm
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1..eb79652 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -431,6 +431,7 @@ END(irq_entries_start)
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
+	ENCODE_FRAME_POINTER
 
 	testb	$3, CS(%rsp)
 	jz	1f
@@ -893,6 +894,7 @@ ENTRY(xen_failsafe_callback)
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
+	ENCODE_FRAME_POINTER
 	jmp	error_exit
 END(xen_failsafe_callback)
 
@@ -936,6 +938,7 @@ ENTRY(paranoid_entry)
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
+	ENCODE_FRAME_POINTER 8
 	movl	$1, %ebx
 	movl	$MSR_GS_BASE, %ecx
 	rdmsr
@@ -983,6 +986,7 @@ ENTRY(error_entry)
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
+	ENCODE_FRAME_POINTER 8
 	xorl	%ebx, %ebx
 	testb	$3, CS+8(%rsp)
 	jz	.Lerror_kernelspace
@@ -1165,6 +1169,7 @@ ENTRY(nmi)
 	pushq	%r13		/* pt_regs->r13 */
 	pushq	%r14		/* pt_regs->r14 */
 	pushq	%r15		/* pt_regs->r15 */
+	ENCODE_FRAME_POINTER
 
 	/*
 	 * At this point we no longer need to worry about stack damage
@@ -1182,7 +1187,7 @@ ENTRY(nmi)
 	 * do_nmi doesn't modify pt_regs.
 	 */
 	SWAPGS
-	jmp	restore_c_regs_and_iret
+	jmp	restore_regs_and_iret
 
 .Lnmi_from_kernel:
 	/*
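
For illustration, a rough sketch (not part of the patch) of the
unwinder-side decode that ENCODE_FRAME_POINTER above implies; the
function names are hypothetical:

/* a real kernel frame pointer has bit 63 set; an encoded one does not */
static inline bool bp_is_encoded_ptregs(unsigned long bp)
{
	return !(bp & (1UL << 63));
}

static inline struct pt_regs *decode_frame_pointer(unsigned long bp)
{
	/* restore the cleared high-order bit */
	return (struct pt_regs *)(bp | (1UL << 63));
}

A real unwinder would presumably also sanity-check that the decoded
pointer actually lands on the task's stack before dereferencing it.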

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
  2016-06-23 20:40                                                   ` Josh Poimboeuf
@ 2016-06-23 22:00                                                     ` Andy Lutomirski
  0 siblings, 0 replies; 121+ messages in thread
From: Andy Lutomirski @ 2016-06-23 22:00 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Jiri Kosina, Ingo Molnar, X86 ML, Heiko Carstens, linux-s390,
	live-patching, Michael Ellerman, Chris J Arges, Jessica Yu,
	linuxppc-dev, Petr Mladek, Jiri Slaby, linux-kernel,
	Vojtech Pavlik, Miroslav Benes, Peter Zijlstra

On Thu, Jun 23, 2016 at 1:40 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jun 23, 2016 at 01:31:32PM -0500, Josh Poimboeuf wrote:
>> On Thu, Jun 23, 2016 at 09:35:29AM -0700, Andy Lutomirski wrote:
>> > > So which is the least-bad option?  To summarize:
>> > >
>> > >   1) task flag(s) for preemption and page faults
>> > >
>> > >   2) turn pt_regs into a stack frame
>> > >
>> > >   3) annotate all calls from entry code in a table
>> > >
>> > >   4) encode rbp on entry
>> > >
>> > > They all have their issues, though I'm partial to #2.
>> > >
>> > > Any more hare-brained ideas? :-)
>> >
>> > I'll try to take a closer look at #2 and see just how much I dislike
>> > all the stack frame munging.
>>
>> Ok.
>>
>> > Also, in principle, it's only the
>> > sleeping calls and the calls that make it into real (non-entry) kernel
>> > code that really want to be unwindable through this mechanism.
>>
>> Yeah, that's true.  We could modify options 2 or 3 to be less absolute.
>> Though I think that makes them more prone to future breakage.
>>
>> > FWIW, I don't care that much about preserving gdb's partial ability to
>> > unwind through pt_regs, especially because gdb really ought to be able
>> > to use DWARF, too.
>>
>> Hm, that's a good point.  I really don't know if there are any other
>> external tools out there that would care.  Maybe we could try option 4
>> and then see if anybody complains.
>
> I'm starting to think hare-brained option 4 is the way to go.  Any
> external tooling should really be relying on DWARF anyway.
>
> Here's a sneak preview.  If this general approach looks ok to you, I'll
> go ahead and port all the in-tree unwinders and post a proper patch.
>
> Instead of using xor -1 on the pt_regs pointer, I just cleared the
> high-order bit.  That makes the unwinding experience much more pleasant
> for a human stack walker, and also ensures that anybody trying to
> dereference it gets slapped with an oops, at least in the 48-bit address
> space era.
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 9a9e588..bf397426 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -201,6 +201,23 @@ For 32-bit we have the following conventions - kernel is built with
>         .byte 0xf1
>         .endm
>
> +       /*
> +        * This is a sneaky trick to help the unwinder find pt_regs on the
> +        * stack.  The frame pointer is replaced with an encoded pointer to
> +        * pt_regs.  The encoding is just a clearing of the highest-order bit,
> +        * which makes it an invalid address and is also a signal to the
> +        * unwinder that it's a pt_regs pointer in disguise.
> +        *
> +        * NOTE: This must be called *after* SAVE_EXTRA_REGS because it
> +        * corrupts rbp.
> +        */
> +       .macro ENCODE_FRAME_POINTER ptregs_offset=0
> +#ifdef CONFIG_FRAME_POINTER
> +       leaq \ptregs_offset(%rsp), %rbp
> +       btr $63, %rbp
> +#endif
> +       .endm
> +

Maybe optimize slightly:

.ifeq \ptregs_offset
mov %rsp, %rbp
.else
leaq \ptregs_offset(%rsp), %rbp
.endif

^ permalink raw reply	[flat|nested] 121+ messages in thread

end of thread

Thread overview: 121+ messages
2016-04-28 20:44 [RFC PATCH v2 00/18] livepatch: hybrid consistency model Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 01/18] x86/asm/head: clean up initial stack variable Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 02/18] x86/asm/head: use a common function for starting CPUs Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 03/18] x86/asm/head: standardize the bottom of the stack for idle tasks Josh Poimboeuf
2016-04-29 18:46   ` Brian Gerst
2016-04-29 20:28     ` Josh Poimboeuf
2016-04-29 19:39   ` Andy Lutomirski
2016-04-29 20:50     ` Josh Poimboeuf
2016-04-29 21:38       ` Andy Lutomirski
2016-04-29 23:27         ` Josh Poimboeuf
2016-04-30  0:10           ` Andy Lutomirski
2016-04-28 20:44 ` [RFC PATCH v2 04/18] x86: move _stext marker before head code Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking Josh Poimboeuf
2016-04-29 18:06   ` Andy Lutomirski
2016-04-29 20:11     ` Josh Poimboeuf
2016-04-29 20:19       ` Andy Lutomirski
2016-04-29 20:27         ` Josh Poimboeuf
2016-04-29 20:32           ` Andy Lutomirski
2016-04-29 21:25             ` Josh Poimboeuf
2016-04-29 21:37               ` Andy Lutomirski
2016-04-29 22:11                 ` Jiri Kosina
2016-04-29 22:57                   ` Josh Poimboeuf
2016-04-30  0:09                   ` Andy Lutomirski
2016-04-29 22:41                 ` Josh Poimboeuf
2016-04-30  0:08                   ` Andy Lutomirski
2016-05-02 13:52                     ` Josh Poimboeuf
2016-05-02 15:52                       ` Andy Lutomirski
2016-05-02 17:31                         ` Josh Poimboeuf
2016-05-02 18:12                           ` Andy Lutomirski
2016-05-02 18:34                             ` Ingo Molnar
2016-05-02 19:44                             ` Josh Poimboeuf
2016-05-02 19:54                             ` Jiri Kosina
2016-05-02 20:00                               ` Jiri Kosina
2016-05-03  0:39                                 ` Andy Lutomirski
2016-05-04 15:16                             ` David Laight
2016-05-04 15:16                               ` David Laight
2016-05-19 23:15                         ` Josh Poimboeuf
2016-05-19 23:39                           ` Andy Lutomirski
2016-05-20 14:05                             ` Josh Poimboeuf
2016-05-20 15:41                               ` Andy Lutomirski
2016-05-20 16:41                                 ` Josh Poimboeuf
2016-05-20 16:59                                   ` Andy Lutomirski
2016-05-20 17:49                                     ` Josh Poimboeuf
2016-05-23 23:02                                     ` Jiri Kosina
2016-05-24  1:42                                       ` Andy Lutomirski
2016-05-23 21:34                           ` Andy Lutomirski
2016-05-24  2:28                             ` Josh Poimboeuf
2016-05-24  3:52                               ` Andy Lutomirski
2016-06-22 16:30                                 ` Josh Poimboeuf
2016-06-22 17:59                                   ` Andy Lutomirski
2016-06-22 18:22                                     ` Josh Poimboeuf
2016-06-22 18:26                                       ` Andy Lutomirski
2016-06-22 18:40                                         ` Josh Poimboeuf
2016-06-22 19:17                                           ` Andy Lutomirski
2016-06-23 16:19                                             ` Josh Poimboeuf
2016-06-23 16:35                                               ` Andy Lutomirski
2016-06-23 18:31                                                 ` Josh Poimboeuf
2016-06-23 20:40                                                   ` Josh Poimboeuf
2016-06-23 22:00                                                     ` Andy Lutomirski
2016-06-23  0:09                                   ` Andy Lutomirski
2016-06-23 15:55                                     ` Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 06/18] x86: dump_trace() error handling Josh Poimboeuf
2016-04-29 13:45   ` Minfei Huang
2016-04-29 14:00     ` Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 07/18] stacktrace/x86: function for detecting reliable stack traces Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 08/18] livepatch: temporary stubs for klp_patch_pending() and klp_patch_task() Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 09/18] livepatch/x86: add TIF_PATCH_PENDING thread flag Josh Poimboeuf
2016-04-29 18:08   ` Andy Lutomirski
2016-04-29 20:18     ` Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 10/18] livepatch/powerpc: " Josh Poimboeuf
2016-05-03  9:07   ` Petr Mladek
2016-05-03 12:06     ` Miroslav Benes
2016-04-28 20:44 ` [RFC PATCH v2 11/18] livepatch/s390: reorganize TIF thread flag bits Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 12/18] livepatch/s390: add TIF_PATCH_PENDING thread flag Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 13/18] livepatch: separate enabled and patched states Josh Poimboeuf
2016-05-03  9:30   ` Petr Mladek
2016-05-03 13:48     ` Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 14/18] livepatch: remove unnecessary object loaded check Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 15/18] livepatch: move patching functions into patch.c Josh Poimboeuf
2016-05-03  9:39   ` Petr Mladek
2016-04-28 20:44 ` [RFC PATCH v2 16/18] livepatch: store function sizes Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model Josh Poimboeuf
2016-05-04  8:42   ` Petr Mladek
2016-05-04 15:51     ` Josh Poimboeuf
2016-05-05  9:41       ` Miroslav Benes
2016-05-05 13:06       ` Petr Mladek
2016-05-04 12:39   ` barriers: was: " Petr Mladek
2016-05-04 13:53     ` Peter Zijlstra
2016-05-04 16:51       ` Josh Poimboeuf
2016-05-04 14:12     ` Petr Mladek
2016-05-04 17:25       ` Josh Poimboeuf
2016-05-05 11:21         ` Petr Mladek
2016-05-09 15:42         ` Miroslav Benes
2016-05-04 17:02     ` Josh Poimboeuf
2016-05-05 10:21       ` Petr Mladek
2016-05-04 14:48   ` klp_task_patch: " Petr Mladek
2016-05-04 14:56     ` Jiri Kosina
2016-05-04 17:57     ` Josh Poimboeuf
2016-05-05 11:57       ` Petr Mladek
2016-05-06 12:38         ` Josh Poimboeuf
2016-05-09 12:23           ` Petr Mladek
2016-05-16 18:12             ` Josh Poimboeuf
2016-05-18 13:12               ` Petr Mladek
2016-05-06 11:33   ` Petr Mladek
2016-05-06 12:44     ` Josh Poimboeuf
2016-05-09  9:41   ` Miroslav Benes
2016-05-16 17:27     ` Josh Poimboeuf
2016-05-10 11:39   ` Miroslav Benes
2016-05-17 22:53   ` Jessica Yu
2016-05-18  8:16     ` Jiri Kosina
2016-05-18 16:51       ` Josh Poimboeuf
2016-05-18 20:22         ` Jiri Kosina
2016-05-23  9:42           ` David Laight
2016-05-23  9:42             ` David Laight
2016-05-23 18:44             ` Jiri Kosina
2016-05-24 15:06               ` David Laight
2016-05-24 15:06                 ` David Laight
2016-05-24 22:45                 ` Jiri Kosina
2016-06-06 13:54   ` [RFC PATCH v2 17/18] " Petr Mladek
2016-06-06 14:29     ` Josh Poimboeuf
2016-04-28 20:44 ` [RFC PATCH v2 18/18] livepatch: add /proc/<pid>/patch_state Josh Poimboeuf
