All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2
@ 2018-03-15 15:44 Borislav Petkov
  2018-03-15 15:44 ` [PATCH 1/9] x86/dumstack: Remove code_bytes Borislav Petkov
                   ` (9 more replies)
  0 siblings, 10 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 15:44 UTC (permalink / raw)
  To: X86 ML
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov <bp@suse.de>

Hi all,

here's v2 with the dumpstack cleanups. This one gets rid of code_bytes=
as it was discussed last time. As a result, the code got even leaner and
simpler. I like that. :)

Thx.

Borislav Petkov (9):
  x86/dumstack: Remove code_bytes
  x86/dumpstack: Unexport oops_begin()
  x86/dumpstack: Carve out Code: dumping into a function
  x86/dumpstack: Improve opcodes dumping in the Code: section
  x86/dumpstack: Add loglevel argument to show_opcodes()
  x86/fault: Dump user opcode bytes on fatal faults
  x86/dumpstack: Add a show_ip() function
  x86/dumpstack: Save first regs set for the executive summary
  x86/dumpstack: Explain the reasoning for the prologue and buffer size

 Documentation/admin-guide/kernel-parameters.txt |   5 -
 arch/x86/include/asm/stacktrace.h               |   2 +
 arch/x86/kernel/dumpstack.c                     | 138 ++++++++++++------------
 arch/x86/kernel/process_32.c                    |   4 +-
 arch/x86/mm/fault.c                             |   7 +-
 5 files changed, 78 insertions(+), 78 deletions(-)

Changelog:

v1:

Hi,

here's v2 of the dumpstack cleanups.

I've split them into more fine-grained pieces to show each change. The
relevant parts are the saving of the executive registers of the first
time we oops and dumping them in the end + opcode bytes for user faults.
I've tested splats in a 80x25 screen and the registers, RIP and opcode
bytes fit all in.

I'm adding exemplary dumps from 32-bit and 64-bit at the end of this mail.

I still have on my TODO list to experiment with console log levels and
see whether we can do a best-of-both-worlds thing there.

v0:

Hi,

so I've been thinking about doing this for a while now: be able to dump
the opcode bytes around the user rIP just like we do for kernel faults.

Why?

See patch 5's commit message. That's why I've marked it RFC.

The rest is cleanups: we're copying the opcodes byte-by-byte and that's
just wasteful.

Also, we're using probe_kernel_read() underneath and it does
__copy_from_user_inatomic() which makes copying user opcode bytes
trivial.

With that, it looks like this:

[  696.837457] strsep[1733]: segfault at 40066b ip 00007fad558fccf8 sp 00007ffc5e662520 error 7 in libc-2.26.so[7fad55876000+1ad000]
[  696.837538] Code: 1b 48 89 fd 48 89 df e8 77 99 f9 ff 48 01 d8 80 38 00 75 17 48 c7 45 00 00 00 00 00 48 83 c4 08 48 89 d8 5b 5d c3 0f 1f 44 00 00 <c6> 00 00 48 83 c0 01 48 89 45 00 48 83 c4 08 48 89 d8 5b 5d c3

and the code matches, as expected:

0000000000086cc0 <__strsep_g@@GLIBC_2.2.5>:
   86cc0:       55                      push   %rbp
   86cc1:       53                      push   %rbx
   86cc2:       48 83 ec 08             sub    $0x8,%rsp
   86cc6:       48 8b 1f                mov    (%rdi),%rbx
   86cc9:       48 85 db                test   %rbx,%rbx
   86ccc:       74 1b                   je     86ce9 <__strsep_g@@GLIBC_2.2.5+0x29>
   86cce:       48 89 fd                mov    %rdi,%rbp
   86cd1:       48 89 df                mov    %rbx,%rdi
   86cd4:       e8 77 99 f9 ff          callq  20650 <*ABS*+0x854e0@plt>
   86cd9:       48 01 d8                add    %rbx,%rax
   86cdc:       80 38 00                cmpb   $0x0,(%rax)
   86cdf:       75 17                   jne    86cf8 <__strsep_g@@GLIBC_2.2.5+0x38>
   86ce1:       48 c7 45 00 00 00 00    movq   $0x0,0x0(%rbp)
   86ce8:       00 
   86ce9:       48 83 c4 08             add    $0x8,%rsp
   86ced:       48 89 d8                mov    %rbx,%rax
   86cf0:       5b                      pop    %rbx
   86cf1:       5d                      pop    %rbp
   86cf2:       c3                      retq   
   86cf3:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   86cf8:       c6 00 00                movb   $0x0,(%rax)
   86cfb:       48 83 c0 01             add    $0x1,%rax
   86cff:       48 89 45 00             mov    %rax,0x0(%rbp)
   86d03:       48 83 c4 08             add    $0x8,%rsp
   86d07:       48 89 d8                mov    %rbx,%rax
   86d0a:       5b                      pop    %rbx
   86d0b:       5d                      pop    %rbp
   86d0c:       c3                      retq

Comments and suggestions are welcome!

Thx.

Example dumps (current version):

64-bit:

[   53.534957] sysrq: SysRq : Trigger a crash
[   53.536939] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[   53.539982] PGD 79149067 P4D 79149067 PUD 793a5067 PMD 0 
[   53.540897] Oops: 0002 [#1] PREEMPT SMP
[   53.540897] CPU: 6 PID: 3700 Comm: bash Not tainted 4.16.0-rc5+ #11
[   53.540897] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[   53.540897] RIP: 0010:sysrq_handle_crash+0x17/0x20
[   53.540897] Code: d1 e8 6d 08 b7 ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 e8 76 1f bd ff c7 05 a4 12 19 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 e8 c6 1b c2 ff fb e9 80 
[   53.540897] RSP: 0018:ffffc9000053bdf0 EFLAGS: 00010246
[   53.540897] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 0000000000000000
[   53.540897] RDX: 0000000000000000 RSI: ffffffff81101e0a RDI: 0000000000000063
[   53.540897] RBP: ffffffff822714c0 R08: 0000000000000185 R09: 00000000000829ad
[   53.540897] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000a
[   53.540897] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   53.540897] FS:  00007ffff7fdb700(0000) GS:ffff88007ed80000(0000) knlGS:0000000000000000
[   53.540897] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   53.540897] CR2: 0000000000000000 CR3: 0000000079107000 CR4: 00000000000406e0
[   53.540897] Call Trace:
[   53.540897]  __handle_sysrq+0x9e/0x160
[   53.540897]  write_sysrq_trigger+0x2b/0x30
[   53.540897]  proc_reg_write+0x38/0x70
[   53.540897]  __vfs_write+0x36/0x160
[   53.540897]  ? __fd_install+0x69/0x110
[   53.540897]  ? preempt_count_add+0x74/0xb0
[   53.540897]  ? _raw_spin_lock+0x13/0x30
[   53.540897]  ? set_close_on_exec+0x41/0x80
[   53.540897]  ? preempt_count_sub+0xa8/0x100
[   53.540897]  vfs_write+0xc0/0x190
[   53.540897]  SyS_write+0x64/0xe0
[   53.540897]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[   53.540897]  do_syscall_64+0x70/0x130
[   53.540897]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[   53.540897] RIP: 0033:0x7ffff74b9620
[   53.540897] Code: 73 01 c3 48 8b 0d 68 98 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d bd f1 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ce 8f 01 00 48 89 04 24 
[   53.540897] RSP: 002b:00007fffffffe6f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   53.540897] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007ffff74b9620
[   53.540897] RDX: 0000000000000002 RSI: 0000000000705408 RDI: 0000000000000001
[   53.540897] RBP: 0000000000705408 R08: 000000000000000a R09: 00007ffff7fdb700
[   53.540897] R10: 00007ffff77826a0 R11: 0000000000000246 R12: 00007ffff77842a0
[   53.540897] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
[   53.540897] Modules linked in:
[   53.540897] CR2: 0000000000000000
[   53.576029] ---[ end trace 9b6fe8eba592293d ]---
[   53.578109] RIP: 0010:sysrq_handle_crash+0x17/0x20
[   53.580191] Code: d1 e8 6d 08 b7 ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 e8 76 1f bd ff c7 05 a4 12 19 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 e8 c6 1b c2 ff fb e9 80 
[   53.587244] RSP: 0018:ffffc9000053bdf0 EFLAGS: 00010246
[   53.587928] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 0000000000000000
[   53.588929] RDX: 0000000000000000 RSI: ffffffff81101e0a RDI: 0000000000000063
[   53.589956] RBP: ffffffff822714c0 R08: 0000000000000185 R09: 00000000000829ad
[   53.590886] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000a
[   53.591812] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   53.592781] Kernel panic - not syncing: Fatal exception
[   53.594100] Kernel Offset: disabled
[   53.594571] ---[ end Kernel panic - not syncing: Fatal exception ]---

[   22.737752] strsep[3728]: segfault at 40066b ip 00007ffff7abe22b sp 00007fffffffea60 error 7 in libc-2.19.so[7ffff7a33000+19f000]
[   22.742487] Code: 48 89 fd 53 48 83 ec 08 48 8b 1f 48 85 db 74 67 0f b6 06 84 c0 74 33 80 7e 01 00 74 22 48 89 df e8 5a 8a ff ff 48 85 c0 74 20 <c6> 00 00 48 83 c0 01 48 89 45 00 48 89 d8 48 83 c4 08 5b 5d c3 0f


32-bit
------

[  151.053373] sysrq: SysRq : Trigger a crash
[  151.056586] BUG: unable to handle kernel NULL pointer dereference at 00000000
[  151.060237] *pde = 00000000 
[  151.060484] Oops: 0002 [#1] PREEMPT SMP
[  151.060484] CPU: 1 PID: 2070 Comm: bash Not tainted 4.16.0-rc5+ #12
[  151.060484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  151.060484] EIP: sysrq_handle_crash+0x1d/0x30
[  151.060484] Code: ff eb d6 e8 75 0f ba ff 90 8d 74 26 00 0f 1f 44 00 00 55 89 e5 e8 03 07 c0 ff c7 05 34 72 c1 c1 01 00 00 00 0f ae f8 0f 1f 00 <c6> 05 00 00 00 00 01 5d c3 8d 76 00 8d bc 27 00 00 00 00 0f 1f 44 
[  151.060484] EAX: 00000000 EBX: 0000000a ECX: 00000000 EDX: c1503f70
[  151.060484] ESI: 00000063 EDI: 00000000 EBP: f36d7e8c ESP: f36d7e8c
[  151.060484]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  151.060484] CR0: 80050033 CR2: 00000000 CR3: 33d64000 CR4: 000406d0
[  151.060484] Call Trace:
[  151.060484]  __handle_sysrq+0x93/0x130
[  151.060484]  ? sysrq_filter+0x3c0/0x3c0
[  151.060484]  write_sysrq_trigger+0x27/0x40
[  151.060484]  proc_reg_write+0x4d/0x80
[  151.060484]  ? proc_reg_poll+0x70/0x70
[  151.060484]  __vfs_write+0x38/0x160
[  151.060484]  ? preempt_count_sub+0xa0/0x110
[  151.060484]  ? __fd_install+0x51/0xd0
[  151.060484]  ? __sb_start_write+0x4c/0xc0
[  151.060484]  ? preempt_count_sub+0xa0/0x110
[  151.060484]  vfs_write+0x98/0x180
[  151.060484]  SyS_write+0x4f/0xb0
[  151.060484]  do_fast_syscall_32+0x99/0x200
[  151.060484]  entry_SYSENTER_32+0x53/0x86
[  151.060484] EIP: 0xb7f25b35
[  151.060484] Code: 89 e5 8b 55 08 8b 80 64 cd ff ff 85 d2 74 02 89 02 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 
[  151.060484] EAX: ffffffda EBX: 00000001 ECX: 08b14a08 EDX: 00000002
[  151.060484] ESI: 00000002 EDI: b7ef0d80 EBP: 08b14a08 ESP: bfc53830
[  151.060484]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[  151.060484] Modules linked in:
[  151.060484] CR2: 0000000000000000
[  151.128925] ---[ end trace 822f779813ab57e1 ]---
[  151.136624] EIP: sysrq_handle_crash+0x1d/0x30
[  151.136625] Code: ff eb d6 e8 75 0f ba ff 90 8d 74 26 00 0f 1f 44 00 00 55 89 e5 e8 03 07 c0 ff c7 05 34 72 c1 c1 01 00 00 00 0f ae f8 0f 1f 00 <c6> 05 00 00 00 00 01 5d c3 8d 76 00 8d bc 27 00 00 00 00 0f 1f 44 
[  151.136658] EAX: 00000000 EBX: 0000000a ECX: 00000000 EDX: c1503f70
[  151.136659] ESI: 00000063 EDI: 00000000 EBP: f36d7e8c ESP: c1c0d87c
[  151.136661]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  151.136662] Kernel panic - not syncing: Fatal exception
[  151.137001] Kernel Offset: disabled
[  151.140587] ---[ end Kernel panic - not syncing: Fatal exception ]---

[  103.241026] strsep32[2125]: segfault at 4336a7 ip b7df6758 sp bfc73fd0 error 7 in libc-2.26.so[b7d76000+1cd000]
[  103.252505] Code: 1d 83 ec 08 ff 74 24 1c 56 e8 14 d6 ff ff 01 f0 83 c4 10 80 38 00 75 12 c7 03 00 00 00 00 83 c4 04 89 f0 5b 5e c3 8d 74 26 00 <c6> 00 00 83 c0 01 89 03 83 c4 04 89 f0 5b 5e c3 66 90 66 90 66 90

-- 
2.13.0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/9] x86/dumstack: Remove code_bytes
  2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
@ 2018-03-15 15:44 ` Borislav Petkov
  2018-03-15 18:12   ` Josh Poimboeuf
  2018-03-15 15:44 ` [PATCH 2/9] x86/dumpstack: Unexport oops_begin() Borislav Petkov
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 15:44 UTC (permalink / raw)
  To: X86 ML
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov <bp@suse.de>

This was added by

  86c418374223 ("[PATCH] i386: add option to show more code in oops reports")

long time ago but experience shows that 64 instruction bytes are plenty
when deciphering an oops. So get rid of it.

Removing it will simplify further enhancements to the opcodes dumping
machinery coming in the following patches.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 Documentation/admin-guide/kernel-parameters.txt |  5 -----
 arch/x86/kernel/dumpstack.c                     | 27 ++++---------------------
 2 files changed, 4 insertions(+), 28 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b37c1c30c16f..2c74c1694d9d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -587,11 +587,6 @@
 			Sets the size of memory pool for coherent, atomic dma
 			allocations, by default set to 256K.
 
-	code_bytes	[X86] How many bytes of object code to print
-			in an oops report.
-			Range: 0 - 8192
-			Default: 64
-
 	com20020=	[HW,NET] ARCnet - COM20020 chipset
 			Format:
 			<io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]]
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 18fa9d74c182..593db796374d 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -22,9 +22,10 @@
 #include <asm/stacktrace.h>
 #include <asm/unwind.h>
 
+#define OPCODE_BUFSIZE 64
+
 int panic_on_unrecovered_nmi;
 int panic_on_io_nmi;
-static unsigned int code_bytes = 64;
 static int die_counter;
 
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
@@ -356,26 +357,6 @@ void die(const char *str, struct pt_regs *regs, long err)
 	oops_end(flags, regs, sig);
 }
 
-static int __init code_bytes_setup(char *s)
-{
-	ssize_t ret;
-	unsigned long val;
-
-	if (!s)
-		return -EINVAL;
-
-	ret = kstrtoul(s, 0, &val);
-	if (ret)
-		return ret;
-
-	code_bytes = val;
-	if (code_bytes > 8192)
-		code_bytes = 8192;
-
-	return 1;
-}
-__setup("code_bytes=", code_bytes_setup);
-
 void show_regs(struct pt_regs *regs)
 {
 	bool all = true;
@@ -393,8 +374,8 @@ void show_regs(struct pt_regs *regs)
 	 * time of the fault..
 	 */
 	if (!user_mode(regs)) {
-		unsigned int code_prologue = code_bytes * 43 / 64;
-		unsigned int code_len = code_bytes;
+		unsigned int code_prologue = OPCODE_BUFSIZE * 43 / 64;
+		unsigned int code_len = OPCODE_BUFSIZE;
 		unsigned char c;
 		u8 *ip;
 
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/9] x86/dumpstack: Unexport oops_begin()
  2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
  2018-03-15 15:44 ` [PATCH 1/9] x86/dumstack: Remove code_bytes Borislav Petkov
@ 2018-03-15 15:44 ` Borislav Petkov
  2018-03-15 15:44 ` [PATCH 3/9] x86/dumpstack: Carve out Code: dumping into a function Borislav Petkov
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 15:44 UTC (permalink / raw)
  To: X86 ML
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov <bp@suse.de>

The only user outside of arch/ is not a module since

  86cd47334b00 ("ACPI, APEI, GHES, Prevent GHES to be built as module")

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/dumpstack.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 593db796374d..579455c2b91e 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -268,7 +268,6 @@ unsigned long oops_begin(void)
 	bust_spinlocks(1);
 	return flags;
 }
-EXPORT_SYMBOL_GPL(oops_begin);
 NOKPROBE_SYMBOL(oops_begin);
 
 void __noreturn rewind_stack_do_exit(int signr);
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/9] x86/dumpstack: Carve out Code: dumping into a function
  2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
  2018-03-15 15:44 ` [PATCH 1/9] x86/dumstack: Remove code_bytes Borislav Petkov
  2018-03-15 15:44 ` [PATCH 2/9] x86/dumpstack: Unexport oops_begin() Borislav Petkov
@ 2018-03-15 15:44 ` Borislav Petkov
  2018-03-15 15:44 ` [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section Borislav Petkov
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 15:44 UTC (permalink / raw)
  To: X86 ML
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov <bp@suse.de>

No functionality change, carve it out into a separate function for later
changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/dumpstack.c | 57 ++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 579455c2b91e..eb9d6c00a52f 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -70,6 +70,35 @@ static void printk_stack_address(unsigned long address, int reliable,
 	printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address);
 }
 
+static void show_opcodes(u8 *rip)
+{
+	unsigned int code_prologue = OPCODE_BUFSIZE * 43 / 64;
+	unsigned int code_len = OPCODE_BUFSIZE;
+	unsigned char c;
+	u8 *ip;
+	int i;
+
+	printk(KERN_DEFAULT "Code: ");
+
+	ip = (u8 *)rip - code_prologue;
+	if (ip < (u8 *)PAGE_OFFSET || probe_kernel_address(ip, c)) {
+		/* try starting at IP */
+		ip = (u8 *)rip;
+		code_len = code_len - code_prologue + 1;
+	}
+	for (i = 0; i < code_len; i++, ip++) {
+		if (ip < (u8 *)PAGE_OFFSET || probe_kernel_address(ip, c)) {
+			pr_cont(" Bad RIP value.");
+			break;
+		}
+		if (ip == (u8 *)rip)
+			pr_cont("<%02x> ", c);
+		else
+			pr_cont("%02x ", c);
+	}
+	pr_cont("\n");
+}
+
 void show_iret_regs(struct pt_regs *regs)
 {
 	printk(KERN_DEFAULT "RIP: %04x:%pS\n", (int)regs->cs, (void *)regs->ip);
@@ -359,7 +388,6 @@ void die(const char *str, struct pt_regs *regs, long err)
 void show_regs(struct pt_regs *regs)
 {
 	bool all = true;
-	int i;
 
 	show_regs_print_info(KERN_DEFAULT);
 
@@ -373,32 +401,7 @@ void show_regs(struct pt_regs *regs)
 	 * time of the fault..
 	 */
 	if (!user_mode(regs)) {
-		unsigned int code_prologue = OPCODE_BUFSIZE * 43 / 64;
-		unsigned int code_len = OPCODE_BUFSIZE;
-		unsigned char c;
-		u8 *ip;
-
 		show_trace_log_lvl(current, regs, NULL, KERN_DEFAULT);
-
-		printk(KERN_DEFAULT "Code: ");
-
-		ip = (u8 *)regs->ip - code_prologue;
-		if (ip < (u8 *)PAGE_OFFSET || probe_kernel_address(ip, c)) {
-			/* try starting at IP */
-			ip = (u8 *)regs->ip;
-			code_len = code_len - code_prologue + 1;
-		}
-		for (i = 0; i < code_len; i++, ip++) {
-			if (ip < (u8 *)PAGE_OFFSET ||
-					probe_kernel_address(ip, c)) {
-				pr_cont(" Bad RIP value.");
-				break;
-			}
-			if (ip == (u8 *)regs->ip)
-				pr_cont("<%02x> ", c);
-			else
-				pr_cont("%02x ", c);
-		}
+		show_opcodes((u8 *)regs->ip);
 	}
-	pr_cont("\n");
 }
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section
  2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
                   ` (2 preceding siblings ...)
  2018-03-15 15:44 ` [PATCH 3/9] x86/dumpstack: Carve out Code: dumping into a function Borislav Petkov
@ 2018-03-15 15:44 ` Borislav Petkov
  2018-03-15 18:10   ` Josh Poimboeuf
                     ` (2 more replies)
  2018-03-15 15:44 ` [PATCH 5/9] x86/dumpstack: Add loglevel argument to show_opcodes() Borislav Petkov
                   ` (5 subsequent siblings)
  9 siblings, 3 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 15:44 UTC (permalink / raw)
  To: X86 ML
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov <bp@suse.de>

The code used to iterate byte-by-byte over the bytes around RIP and that
is expensive: disabling pagefaults around it, copy_from_user, etc...

Make it read the whole buffer of OPCODE_BUFSIZE size in one go. Use a
statically allocated 64 bytes buffer so that concurrent show_opcodes()
do not interleave in the output even though in the majority of the cases
we sync on die_lock. Except the #PF path which doesn't...

Also, do the PAGE_OFFSET check outside of the function because latter
will be reused in other context.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/dumpstack.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index eb9d6c00a52f..3f781a8dddb8 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -22,8 +22,6 @@
 #include <asm/stacktrace.h>
 #include <asm/unwind.h>
 
-#define OPCODE_BUFSIZE 64
-
 int panic_on_unrecovered_nmi;
 int panic_on_io_nmi;
 static int die_counter;
@@ -72,29 +70,25 @@ static void printk_stack_address(unsigned long address, int reliable,
 
 static void show_opcodes(u8 *rip)
 {
-	unsigned int code_prologue = OPCODE_BUFSIZE * 43 / 64;
-	unsigned int code_len = OPCODE_BUFSIZE;
-	unsigned char c;
+#define OPCODE_BUFSIZE 64
+	unsigned int code_prologue = OPCODE_BUFSIZE * 2 / 3;
+	u8 opcodes[OPCODE_BUFSIZE];
 	u8 *ip;
 	int i;
 
 	printk(KERN_DEFAULT "Code: ");
 
 	ip = (u8 *)rip - code_prologue;
-	if (ip < (u8 *)PAGE_OFFSET || probe_kernel_address(ip, c)) {
-		/* try starting at IP */
-		ip = (u8 *)rip;
-		code_len = code_len - code_prologue + 1;
+	if (probe_kernel_read(opcodes, ip, OPCODE_BUFSIZE)) {
+		pr_cont(" Bad RIP value.\n");
+		return;
 	}
-	for (i = 0; i < code_len; i++, ip++) {
-		if (ip < (u8 *)PAGE_OFFSET || probe_kernel_address(ip, c)) {
-			pr_cont(" Bad RIP value.");
-			break;
-		}
-		if (ip == (u8 *)rip)
-			pr_cont("<%02x> ", c);
+
+	for (i = 0; i < OPCODE_BUFSIZE; i++, ip++) {
+		if (ip == rip)
+			pr_cont("<%02x> ", opcodes[i]);
 		else
-			pr_cont("%02x ", c);
+			pr_cont("%02x ", opcodes[i]);
 	}
 	pr_cont("\n");
 }
@@ -402,6 +396,10 @@ void show_regs(struct pt_regs *regs)
 	 */
 	if (!user_mode(regs)) {
 		show_trace_log_lvl(current, regs, NULL, KERN_DEFAULT);
-		show_opcodes((u8 *)regs->ip);
+
+		if (regs->ip < PAGE_OFFSET)
+			pr_cont(" Bad RIP value.\n");
+		else
+			show_opcodes((u8 *)regs->ip);
 	}
 }
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 5/9] x86/dumpstack: Add loglevel argument to show_opcodes()
  2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
                   ` (3 preceding siblings ...)
  2018-03-15 15:44 ` [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section Borislav Petkov
@ 2018-03-15 15:44 ` Borislav Petkov
  2018-03-15 15:44 ` [PATCH 6/9] x86/fault: Dump user opcode bytes on fatal faults Borislav Petkov
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 15:44 UTC (permalink / raw)
  To: X86 ML
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov <bp@suse.de>

Will be used in the next patch.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/stacktrace.h | 1 +
 arch/x86/kernel/dumpstack.c       | 6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 133d9425fced..0630eeb18bbc 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -111,4 +111,5 @@ static inline unsigned long caller_frame_pointer(void)
 	return (unsigned long)frame;
 }
 
+void show_opcodes(u8 *rip, const char *loglvl);
 #endif /* _ASM_X86_STACKTRACE_H */
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 3f781a8dddb8..82562f327a99 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -68,7 +68,7 @@ static void printk_stack_address(unsigned long address, int reliable,
 	printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address);
 }
 
-static void show_opcodes(u8 *rip)
+void show_opcodes(u8 *rip, const char *loglvl)
 {
 #define OPCODE_BUFSIZE 64
 	unsigned int code_prologue = OPCODE_BUFSIZE * 2 / 3;
@@ -76,7 +76,7 @@ static void show_opcodes(u8 *rip)
 	u8 *ip;
 	int i;
 
-	printk(KERN_DEFAULT "Code: ");
+	printk("%sCode: ", loglvl);
 
 	ip = (u8 *)rip - code_prologue;
 	if (probe_kernel_read(opcodes, ip, OPCODE_BUFSIZE)) {
@@ -400,6 +400,6 @@ void show_regs(struct pt_regs *regs)
 		if (regs->ip < PAGE_OFFSET)
 			pr_cont(" Bad RIP value.\n");
 		else
-			show_opcodes((u8 *)regs->ip);
+			show_opcodes((u8 *)regs->ip, KERN_DEFAULT);
 	}
 }
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 6/9] x86/fault: Dump user opcode bytes on fatal faults
  2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
                   ` (4 preceding siblings ...)
  2018-03-15 15:44 ` [PATCH 5/9] x86/dumpstack: Add loglevel argument to show_opcodes() Borislav Petkov
@ 2018-03-15 15:44 ` Borislav Petkov
  2018-03-15 15:44 ` [PATCH 7/9] x86/dumpstack: Add a show_ip() function Borislav Petkov
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 15:44 UTC (permalink / raw)
  To: X86 ML
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov <bp@suse.de>

Sometimes it is useful to see which user opcode bytes RIP points to
when a fault happens: be it to rule out RIP corruption, to dump info
early during boot, when doing core dumps is impossible due to not having
writable fs yet.

Sometimes it is useful if debugging an issue and one doesn't have access
to the executable which caused the fault in order to disassemble it.

That last aspect might have some security implications so
show_unhandled_signals could be revisited for that or a new config
option added.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/mm/fault.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 26865147a507..b3c19f734442 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -850,6 +850,8 @@ static inline void
 show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 		unsigned long address, struct task_struct *tsk)
 {
+	const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
+
 	if (!unhandled_signal(tsk, SIGSEGV))
 		return;
 
@@ -857,13 +859,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 		return;
 
 	printk("%s%s[%d]: segfault at %lx ip %px sp %px error %lx",
-		task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
-		tsk->comm, task_pid_nr(tsk), address,
+		loglvl, tsk->comm, task_pid_nr(tsk), address,
 		(void *)regs->ip, (void *)regs->sp, error_code);
 
 	print_vma_addr(KERN_CONT " in ", regs->ip);
 
 	printk(KERN_CONT "\n");
+
+	show_opcodes((u8 *)regs->ip, loglvl);
 }
 
 static void
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 7/9] x86/dumpstack: Add a show_ip() function
  2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
                   ` (5 preceding siblings ...)
  2018-03-15 15:44 ` [PATCH 6/9] x86/fault: Dump user opcode bytes on fatal faults Borislav Petkov
@ 2018-03-15 15:44 ` Borislav Petkov
  2018-03-15 18:34   ` Josh Poimboeuf
  2018-03-15 15:44 ` [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary Borislav Petkov
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 15:44 UTC (permalink / raw)
  To: X86 ML
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov <bp@suse.de>

Which shows the Istruction Pointer along with the insn bytes around it.
Use it whenever we print rIP.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/stacktrace.h |  1 +
 arch/x86/kernel/dumpstack.c       | 13 ++++++++++++-
 arch/x86/kernel/process_32.c      |  4 +---
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 0630eeb18bbc..b6dc698f992a 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -112,4 +112,5 @@ static inline unsigned long caller_frame_pointer(void)
 }
 
 void show_opcodes(u8 *rip, const char *loglvl);
+void show_ip(struct pt_regs *regs, const char *loglvl);
 #endif /* _ASM_X86_STACKTRACE_H */
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 82562f327a99..8cf2e2289d50 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -93,9 +93,20 @@ void show_opcodes(u8 *rip, const char *loglvl)
 	pr_cont("\n");
 }
 
+void show_ip(struct pt_regs *regs, const char *loglvl)
+{
+#ifdef CONFIG_X86_32
+	printk("%sEIP: %pS\n", loglvl, (void *)regs->ip);
+#else
+	printk("%sRIP: %04x:%pS\n", loglvl, (int)regs->cs, (void *)regs->ip);
+#endif
+
+	show_opcodes((u8 *)regs->ip, loglvl);
+}
+
 void show_iret_regs(struct pt_regs *regs)
 {
-	printk(KERN_DEFAULT "RIP: %04x:%pS\n", (int)regs->cs, (void *)regs->ip);
+	show_ip(regs, KERN_DEFAULT);
 	printk(KERN_DEFAULT "RSP: %04x:%016lx EFLAGS: %08lx", (int)regs->ss,
 		regs->sp, regs->flags);
 }
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 5224c6099184..3d1f1226b972 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -76,9 +76,7 @@ void __show_regs(struct pt_regs *regs, int all)
 		savesegment(gs, gs);
 	}
 
-	printk(KERN_DEFAULT "EIP: %pS\n", (void *)regs->ip);
-	printk(KERN_DEFAULT "EFLAGS: %08lx CPU: %d\n", regs->flags,
-		raw_smp_processor_id());
+	show_ip(regs, KERN_DEFAULT);
 
 	printk(KERN_DEFAULT "EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n",
 		regs->ax, regs->bx, regs->cx, regs->dx);
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
                   ` (6 preceding siblings ...)
  2018-03-15 15:44 ` [PATCH 7/9] x86/dumpstack: Add a show_ip() function Borislav Petkov
@ 2018-03-15 15:44 ` Borislav Petkov
  2018-03-15 19:01   ` Josh Poimboeuf
  2018-03-15 15:44 ` [PATCH 9/9] x86/dumpstack: Explain the reasoning for the prologue and buffer size Borislav Petkov
  2018-03-15 17:51 ` [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Linus Torvalds
  9 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 15:44 UTC (permalink / raw)
  To: X86 ML
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov <bp@suse.de>

Save the regs set when we call __die() for the first time and print it
in oops_end().

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/dumpstack.c | 33 ++++++++++++---------------------
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 8cf2e2289d50..bb712ca99632 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -26,6 +26,8 @@ int panic_on_unrecovered_nmi;
 int panic_on_io_nmi;
 static int die_counter;
 
+static struct pt_regs exec_summary_regs;
+
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info)
 {
@@ -321,6 +323,9 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 	raw_local_irq_restore(flags);
 	oops_exit();
 
+	/* Executive summary in case the oops scrolled away */
+	__show_regs(&exec_summary_regs, false);
+
 	if (!signr)
 		return;
 	if (in_interrupt())
@@ -339,10 +344,10 @@ NOKPROBE_SYMBOL(oops_end);
 
 int __die(const char *str, struct pt_regs *regs, long err)
 {
-#ifdef CONFIG_X86_32
-	unsigned short ss;
-	unsigned long sp;
-#endif
+	/* Save the regs of the first oops for the executive summary later. */
+	if (!die_counter)
+		exec_summary_regs = *regs;
+
 	printk(KERN_DEFAULT
 	       "%s: %04lx [#%d]%s%s%s%s%s\n", str, err & 0xffff, ++die_counter,
 	       IS_ENABLED(CONFIG_PREEMPT) ? " PREEMPT"         : "",
@@ -352,26 +357,14 @@ int __die(const char *str, struct pt_regs *regs, long err)
 	       IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION) ?
 	       (boot_cpu_has(X86_FEATURE_PTI) ? " PTI" : " NOPTI") : "");
 
+	show_regs(regs);
+
 	if (notify_die(DIE_OOPS, str, regs, err,
 			current->thread.trap_nr, SIGSEGV) == NOTIFY_STOP)
 		return 1;
 
 	print_modules();
-	show_regs(regs);
-#ifdef CONFIG_X86_32
-	if (user_mode(regs)) {
-		sp = regs->sp;
-		ss = regs->ss;
-	} else {
-		sp = kernel_stack_pointer(regs);
-		savesegment(ss, ss);
-	}
-	printk(KERN_EMERG "EIP: %pS SS:ESP: %04x:%08lx\n",
-	       (void *)regs->ip, ss, sp);
-#else
-	/* Executive summary in case the oops scrolled away */
-	printk(KERN_ALERT "RIP: %pS RSP: %016lx\n", (void *)regs->ip, regs->sp);
-#endif
+
 	return 0;
 }
 NOKPROBE_SYMBOL(__die);
@@ -410,7 +403,5 @@ void show_regs(struct pt_regs *regs)
 
 		if (regs->ip < PAGE_OFFSET)
 			pr_cont(" Bad RIP value.\n");
-		else
-			show_opcodes((u8 *)regs->ip, KERN_DEFAULT);
 	}
 }
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 9/9] x86/dumpstack: Explain the reasoning for the prologue and buffer size
  2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
                   ` (7 preceding siblings ...)
  2018-03-15 15:44 ` [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary Borislav Petkov
@ 2018-03-15 15:44 ` Borislav Petkov
  2018-03-15 18:07   ` Josh Poimboeuf
  2018-03-15 17:51 ` [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Linus Torvalds
  9 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 15:44 UTC (permalink / raw)
  To: X86 ML
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov <bp@suse.de>

The whole reasoning behind the amount of opcode bytes dumped and
prologue length isn't very clear so let's hold down some of the reasons
for why it is done the way it is.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/dumpstack.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index bb712ca99632..7ceba3c09ad7 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -70,6 +70,25 @@ static void printk_stack_address(unsigned long address, int reliable,
 	printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address);
 }
 
+/*
+ * The are a couple of reasons for the 2/3rd prologue, courtesy of Linus:
+ *
+ * In case where we don't have the exact kernel image (which, if we did, we can
+ * simply disassemble and navigate to the RIP), the purpose of the bigger
+ * prologue is to have more context and to be able to correlate the code from
+ * the different toolchains better.
+ *
+ * In addition, it helps in recreating the register allocation of the failing
+ * kernel and thus make sense of the register dump.
+ *
+ * What is more, the additional complication of a variable length insn arch like
+ * x86 warrants having longer byte sequence before rIP so that the disassembler
+ * can "sync" up properly and find instruction boundaries when decoding the
+ * opcode bytes.
+ *
+ * Thus, the 2/3rds prologue and 64 byte OPCODE_BUFSIZE is just a random
+ * guesstimate in attempt to achieve all of the above.
+ */
 void show_opcodes(u8 *rip, const char *loglvl)
 {
 #define OPCODE_BUFSIZE 64
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2
  2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
                   ` (8 preceding siblings ...)
  2018-03-15 15:44 ` [PATCH 9/9] x86/dumpstack: Explain the reasoning for the prologue and buffer size Borislav Petkov
@ 2018-03-15 17:51 ` Linus Torvalds
  2018-04-17 14:40   ` Borislav Petkov
  9 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2018-03-15 17:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Josh Poimboeuf, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 8:44 AM, Borislav Petkov <bp@alien8.de> wrote:
>
> here's v2 with the dumpstack cleanups. This one gets rid of code_bytes=
> as it was discussed last time. As a result, the code got even leaner and
> simpler. I like that. :)

This version looks ok to me. I'm sure there's room for tweaking here,
but I'm not seeing anything alarming.

               Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 9/9] x86/dumpstack: Explain the reasoning for the prologue and buffer size
  2018-03-15 15:44 ` [PATCH 9/9] x86/dumpstack: Explain the reasoning for the prologue and buffer size Borislav Petkov
@ 2018-03-15 18:07   ` Josh Poimboeuf
  2018-03-15 18:17     ` Borislav Petkov
  0 siblings, 1 reply; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-15 18:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 04:44:48PM +0100, Borislav Petkov wrote:
> From: Borislav Petkov <bp@suse.de>
> 
> The whole reasoning behind the amount of opcode bytes dumped and
> prologue length isn't very clear so let's hold down some of the reasons
> for why it is done the way it is.
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  arch/x86/kernel/dumpstack.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index bb712ca99632..7ceba3c09ad7 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -70,6 +70,25 @@ static void printk_stack_address(unsigned long address, int reliable,
>  	printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address);
>  }
>  
> +/*
> + * The are a couple of reasons for the 2/3rd prologue, courtesy of Linus:

s/The/There/

> + *
> + * In case where we don't have the exact kernel image (which, if we did, we can
> + * simply disassemble and navigate to the RIP), the purpose of the bigger
> + * prologue is to have more context and to be able to correlate the code from
> + * the different toolchains better.
> + *
> + * In addition, it helps in recreating the register allocation of the failing
> + * kernel and thus make sense of the register dump.
> + *
> + * What is more, the additional complication of a variable length insn arch like
> + * x86 warrants having longer byte sequence before rIP so that the disassembler
> + * can "sync" up properly and find instruction boundaries when decoding the
> + * opcode bytes.
> + *
> + * Thus, the 2/3rds prologue and 64 byte OPCODE_BUFSIZE is just a random
> + * guesstimate in attempt to achieve all of the above.
> + */
>  void show_opcodes(u8 *rip, const char *loglvl)
>  {
>  #define OPCODE_BUFSIZE 64
> -- 
> 2.13.0
> 

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section
  2018-03-15 15:44 ` [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section Borislav Petkov
@ 2018-03-15 18:10   ` Josh Poimboeuf
  2018-03-15 18:16     ` Borislav Petkov
  2018-03-15 18:19   ` Josh Poimboeuf
  2018-03-15 18:23   ` Josh Poimboeuf
  2 siblings, 1 reply; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-15 18:10 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 04:44:43PM +0100, Borislav Petkov wrote:
> From: Borislav Petkov <bp@suse.de>
> 
> The code used to iterate byte-by-byte over the bytes around RIP and that
> is expensive: disabling pagefaults around it, copy_from_user, etc...
> 
> Make it read the whole buffer of OPCODE_BUFSIZE size in one go. Use a
> statically allocated 64 bytes buffer so that concurrent show_opcodes()
> do not interleave in the output even though in the majority of the cases
> we sync on die_lock. Except the #PF path which doesn't...
> 
> Also, do the PAGE_OFFSET check outside of the function because latter
> will be reused in other context.
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  arch/x86/kernel/dumpstack.c | 34 ++++++++++++++++------------------
>  1 file changed, 16 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index eb9d6c00a52f..3f781a8dddb8 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -22,8 +22,6 @@
>  #include <asm/stacktrace.h>
>  #include <asm/unwind.h>
>  
> -#define OPCODE_BUFSIZE 64
> -
>  int panic_on_unrecovered_nmi;
>  int panic_on_io_nmi;
>  static int die_counter;
> @@ -72,29 +70,25 @@ static void printk_stack_address(unsigned long address, int reliable,
>  
>  static void show_opcodes(u8 *rip)
>  {
> -	unsigned int code_prologue = OPCODE_BUFSIZE * 43 / 64;
> -	unsigned int code_len = OPCODE_BUFSIZE;
> -	unsigned char c;
> +#define OPCODE_BUFSIZE 64
> +	unsigned int code_prologue = OPCODE_BUFSIZE * 2 / 3;
> +	u8 opcodes[OPCODE_BUFSIZE];

I liked OPCODE_BUFSIZE where it was before :-)  Here it disrupts the
readability of the function a bit IMO.

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/9] x86/dumstack: Remove code_bytes
  2018-03-15 15:44 ` [PATCH 1/9] x86/dumstack: Remove code_bytes Borislav Petkov
@ 2018-03-15 18:12   ` Josh Poimboeuf
  0 siblings, 0 replies; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-15 18:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 04:44:40PM +0100, Borislav Petkov wrote:
> From: Borislav Petkov <bp@suse.de>
> 
> This was added by
> 
>   86c418374223 ("[PATCH] i386: add option to show more code in oops reports")
> 
> long time ago but experience shows that 64 instruction bytes are plenty
> when deciphering an oops. So get rid of it.
> 
> Removing it will simplify further enhancements to the opcodes dumping
> machinery coming in the following patches.
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>

s/dumstack/dumpstack/ in the subject.

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section
  2018-03-15 18:10   ` Josh Poimboeuf
@ 2018-03-15 18:16     ` Borislav Petkov
  2018-03-15 19:06       ` Josh Poimboeuf
  2018-03-16 11:57       ` David Laight
  0 siblings, 2 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 18:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 01:10:54PM -0500, Josh Poimboeuf wrote:
> I liked OPCODE_BUFSIZE where it was before :-)  Here it disrupts the
> readability of the function a bit IMO.

My thinking is have it close by so that you don't have to go search for
its definition.

But I don't have a strong opinion on where it should be so...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 9/9] x86/dumpstack: Explain the reasoning for the prologue and buffer size
  2018-03-15 18:07   ` Josh Poimboeuf
@ 2018-03-15 18:17     ` Borislav Petkov
  0 siblings, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 18:17 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 01:07:31PM -0500, Josh Poimboeuf wrote:
> > +/*
> > + * The are a couple of reasons for the 2/3rd prologue, courtesy of Linus:
> 
> s/The/There/

Fixed, thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section
  2018-03-15 15:44 ` [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section Borislav Petkov
  2018-03-15 18:10   ` Josh Poimboeuf
@ 2018-03-15 18:19   ` Josh Poimboeuf
  2018-03-15 18:23   ` Josh Poimboeuf
  2 siblings, 0 replies; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-15 18:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 04:44:43PM +0100, Borislav Petkov wrote:
> From: Borislav Petkov <bp@suse.de>
> 
> The code used to iterate byte-by-byte over the bytes around RIP and that
> is expensive: disabling pagefaults around it, copy_from_user, etc...
> 
> Make it read the whole buffer of OPCODE_BUFSIZE size in one go. Use a
> statically allocated 64 bytes buffer so that concurrent show_opcodes()
> do not interleave in the output even though in the majority of the cases
> we sync on die_lock. Except the #PF path which doesn't...
> 
> Also, do the PAGE_OFFSET check outside of the function because latter
> will be reused in other context.
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  arch/x86/kernel/dumpstack.c | 34 ++++++++++++++++------------------
>  1 file changed, 16 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index eb9d6c00a52f..3f781a8dddb8 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -22,8 +22,6 @@
>  #include <asm/stacktrace.h>
>  #include <asm/unwind.h>
>  
> -#define OPCODE_BUFSIZE 64
> -
>  int panic_on_unrecovered_nmi;
>  int panic_on_io_nmi;
>  static int die_counter;
> @@ -72,29 +70,25 @@ static void printk_stack_address(unsigned long address, int reliable,
>  
>  static void show_opcodes(u8 *rip)
>  {
> -	unsigned int code_prologue = OPCODE_BUFSIZE * 43 / 64;
> -	unsigned int code_len = OPCODE_BUFSIZE;
> -	unsigned char c;
> +#define OPCODE_BUFSIZE 64
> +	unsigned int code_prologue = OPCODE_BUFSIZE * 2 / 3;
> +	u8 opcodes[OPCODE_BUFSIZE];
>  	u8 *ip;
>  	int i;
>  
>  	printk(KERN_DEFAULT "Code: ");
>  
>  	ip = (u8 *)rip - code_prologue;
> -	if (ip < (u8 *)PAGE_OFFSET || probe_kernel_address(ip, c)) {
> -		/* try starting at IP */
> -		ip = (u8 *)rip;
> -		code_len = code_len - code_prologue + 1;
> +	if (probe_kernel_read(opcodes, ip, OPCODE_BUFSIZE)) {
> +		pr_cont(" Bad RIP value.\n");
> +		return;

As long as you're changing the code, might as well remove the leading
space here, so it shows

	Code: Bad RIP value.

instead of

	Code:  Bad RIP value.

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section
  2018-03-15 15:44 ` [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section Borislav Petkov
  2018-03-15 18:10   ` Josh Poimboeuf
  2018-03-15 18:19   ` Josh Poimboeuf
@ 2018-03-15 18:23   ` Josh Poimboeuf
  2 siblings, 0 replies; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-15 18:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 04:44:43PM +0100, Borislav Petkov wrote:
> @@ -402,6 +396,10 @@ void show_regs(struct pt_regs *regs)
>  	 */
>  	if (!user_mode(regs)) {
>  		show_trace_log_lvl(current, regs, NULL, KERN_DEFAULT);
> -		show_opcodes((u8 *)regs->ip);
> +
> +		if (regs->ip < PAGE_OFFSET)
> +			pr_cont(" Bad RIP value.\n");
> +		else
> +			show_opcodes((u8 *)regs->ip);

[ Oops, sorry about three reviews for the same patch... ]

I don't see the printk preceding the pr_cont() here.  Shouldn't it be
something like:

			printk(KERN_DEFAULT "Code: Bad RIP value.\n")

?

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 7/9] x86/dumpstack: Add a show_ip() function
  2018-03-15 15:44 ` [PATCH 7/9] x86/dumpstack: Add a show_ip() function Borislav Petkov
@ 2018-03-15 18:34   ` Josh Poimboeuf
  2018-03-15 18:55     ` Borislav Petkov
  0 siblings, 1 reply; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-15 18:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 04:44:46PM +0100, Borislav Petkov wrote:
> From: Borislav Petkov <bp@suse.de>
> 
> Which shows the Istruction Pointer along with the insn bytes around it.
> Use it whenever we print rIP.

s/Istruction/Instruction/

> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 5224c6099184..3d1f1226b972 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -76,9 +76,7 @@ void __show_regs(struct pt_regs *regs, int all)
>  		savesegment(gs, gs);
>  	}
>  
> -	printk(KERN_DEFAULT "EIP: %pS\n", (void *)regs->ip);
> -	printk(KERN_DEFAULT "EFLAGS: %08lx CPU: %d\n", regs->flags,
> -		raw_smp_processor_id());
> +	show_ip(regs, KERN_DEFAULT);

Did you intentionally remove the printing of EFLAGS?

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 7/9] x86/dumpstack: Add a show_ip() function
  2018-03-15 18:34   ` Josh Poimboeuf
@ 2018-03-15 18:55     ` Borislav Petkov
  0 siblings, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-15 18:55 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 01:34:11PM -0500, Josh Poimboeuf wrote:
> Did you intentionally remove the printing of EFLAGS?

Blergh, EFLAGS were supposed to go down in the printk block. Good catch!

---
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 3d1f1226b972..0ae659de21eb 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -82,8 +82,8 @@ void __show_regs(struct pt_regs *regs, int all)
 		regs->ax, regs->bx, regs->cx, regs->dx);
 	printk(KERN_DEFAULT "ESI: %08lx EDI: %08lx EBP: %08lx ESP: %08lx\n",
 		regs->si, regs->di, regs->bp, sp);
-	printk(KERN_DEFAULT " DS: %04x ES: %04x FS: %04x GS: %04x SS: %04x\n",
-	       (u16)regs->ds, (u16)regs->es, (u16)regs->fs, gs, ss);
+	printk(KERN_DEFAULT "DS: %04x ES: %04x FS: %04x GS: %04x SS: %04x EFLAGS: %08lx\n",
+	       (u16)regs->ds, (u16)regs->es, (u16)regs->fs, gs, ss, regs->flags);
 
 	if (!all)
 		return;


-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-15 15:44 ` [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary Borislav Petkov
@ 2018-03-15 19:01   ` Josh Poimboeuf
  2018-03-16 11:48     ` Borislav Petkov
  0 siblings, 1 reply; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-15 19:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 04:44:47PM +0100, Borislav Petkov wrote:
> @@ -321,6 +323,9 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
>  	raw_local_irq_restore(flags);
>  	oops_exit();
>  
> +	/* Executive summary in case the oops scrolled away */
> +	__show_regs(&exec_summary_regs, false);
> +

no_context() has the following line, right before it calls oops_end():

	/* Executive summary in case the body of the oops scrolled away */
	printk(KERN_DEFAULT "CR2: %016lx\n", address);

I think that line can now be removed, since the executive summary
__show_regs() will include CR2.


> @@ -352,26 +357,14 @@ int __die(const char *str, struct pt_regs *regs, long err)
>  	       IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION) ?
>  	       (boot_cpu_has(X86_FEATURE_PTI) ? " PTI" : " NOPTI") : "");
>  
> +	show_regs(regs);
> +
>  	if (notify_die(DIE_OOPS, str, regs, err,
>  			current->thread.trap_nr, SIGSEGV) == NOTIFY_STOP)
>  		return 1;
>  
>  	print_modules();
> -	show_regs(regs);

Was moving the show_regs() call intentional?  I didn't see it mentioned
in the changelog.

> @@ -410,7 +403,5 @@ void show_regs(struct pt_regs *regs)
>  
>  		if (regs->ip < PAGE_OFFSET)
>  			pr_cont(" Bad RIP value.\n");
> -		else
> -			show_opcodes((u8 *)regs->ip, KERN_DEFAULT);
>  	}
>  }

Doesn't this hunk belong in the previous patch, which added the
__show_regs -> show_ip() -> show_opcodes() call path?

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section
  2018-03-15 18:16     ` Borislav Petkov
@ 2018-03-15 19:06       ` Josh Poimboeuf
  2018-03-16 11:57       ` David Laight
  1 sibling, 0 replies; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-15 19:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 07:16:13PM +0100, Borislav Petkov wrote:
> On Thu, Mar 15, 2018 at 01:10:54PM -0500, Josh Poimboeuf wrote:
> > I liked OPCODE_BUFSIZE where it was before :-)  Here it disrupts the
> > readability of the function a bit IMO.
> 
> My thinking is have it close by so that you don't have to go search for
> its definition.

But knowing the value of OPCODE_BUFSIZE isn't needed for understanding
the rest of the function, so it seems like clutter to me.  Besides,
that's what the vim cscope plugin is for :-)

But I also don't have a strong opinion.

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-15 19:01   ` Josh Poimboeuf
@ 2018-03-16 11:48     ` Borislav Petkov
  2018-03-16 12:01       ` Josh Poimboeuf
  2018-03-16 17:22       ` Linus Torvalds
  0 siblings, 2 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-16 11:48 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 02:01:32PM -0500, Josh Poimboeuf wrote:
> no_context() has the following line, right before it calls oops_end():
> 
> 	/* Executive summary in case the body of the oops scrolled away */
> 	printk(KERN_DEFAULT "CR2: %016lx\n", address);
> 
> I think that line can now be removed, since the executive summary
> __show_regs() will include CR2.

Good idea. Done. It adds three more lines to the executive summary but I
think they're worth it.

[ 4020.804801] Modules linked in:
[ 4020.840092] ---[ end trace 13285dfd393b58bd ]---
[ 4020.840828] RIP: 0010:sysrq_handle_crash+0x17/0x20
[ 4020.841731] Code: d1 e8 6d 08 b7 ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 e8 76 1f bd ff c7 05 a4 12 19 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 e8 c6 1b c2 ff fb e9 80 
[ 4020.845056] RSP: 0018:ffffc9000085bdf0 EFLAGS: 00010246
[ 4020.845760] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 0000000000000000
[ 4020.846704] RDX: 0000000000000000 RSI: ffffffff81101e0a RDI: 0000000000000063
[ 4020.847630] RBP: ffffffff822714c0 R08: 0000000000000185 R09: 00000000000c303e
[ 4020.848658] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000a
[ 4020.849678] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 4020.850644] FS:  00007ffff7fdb700(0000) GS:ffff88007ec40000(0000) knlGS:0000000000000000
[ 4020.851688] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4020.852994] CR2: 0000000000000000 CR3: 000000007a0f9000 CR4: 00000000000406e0
[ 4020.854018] Kernel panic - not syncing: Fatal exception
[ 4020.855256] Kernel Offset: disabled
[ 4020.855730] ---[ end Kernel panic - not syncing: Fatal exception ]---

> > @@ -352,26 +357,14 @@ int __die(const char *str, struct pt_regs *regs, long err)
> >  	       IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION) ?
> >  	       (boot_cpu_has(X86_FEATURE_PTI) ? " PTI" : " NOPTI") : "");
> >  
> > +	show_regs(regs);
> > +
> >  	if (notify_die(DIE_OOPS, str, regs, err,
> >  			current->thread.trap_nr, SIGSEGV) == NOTIFY_STOP)
> >  		return 1;
> >  
> >  	print_modules();
> > -	show_regs(regs);
> 
> Was moving the show_regs() call intentional?

Yes. It'd be prudent for registers to come out unconditionally and
not some of the notifiers to make us exit early. Which kinda needs
print_modules() to go up too. Fixed.

> I didn't see it mentioned in the changelog.

Fixed.

> Doesn't this hunk belong in the previous patch, which added the
> __show_regs -> show_ip() -> show_opcodes() call path?

Yeah, and the PAGE_OFFSET check needs to happen in show_ip() too.

Thanks for the detailed review, here's the current splat format:

[   29.046500] sysrq: SysRq : Trigger a crash
[   29.048605] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[   29.051639] PGD 79afd067 P4D 79afd067 PUD 7a1a2067 PMD 0 
[   29.052557] Oops: 0002 [#1] PREEMPT SMP
[   29.052557] CPU: 7 PID: 3693 Comm: bash Not tainted 4.16.0-rc5+ #8
[   29.052557] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[   29.052557] RIP: 0010:sysrq_handle_crash+0x17/0x20
[   29.052557] Code: d1 e8 6d 08 b7 ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 e8 76 1f bd ff c7 05 a4 12 19 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 e8 c6 1b c2 ff fb e9 80 
[   29.052557] RSP: 0018:ffffc900007cbdf0 EFLAGS: 00010246
[   29.052557] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 0000000000000000
[   29.052557] RDX: 0000000000000000 RSI: ffffffff81101e0a RDI: 0000000000000063
[   29.052557] RBP: ffffffff822714c0 R08: 0000000000000185 R09: 000000000000b5a4
[   29.052557] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000a
[   29.052557] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   29.052557] FS:  00007ffff7fdb700(0000) GS:ffff88007edc0000(0000) knlGS:0000000000000000
[   29.052557] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.052557] CR2: 0000000000000000 CR3: 00000000799e1000 CR4: 00000000000406e0
[   29.052557] Call Trace:
[   29.052557]  __handle_sysrq+0x9e/0x160
[   29.052557]  write_sysrq_trigger+0x2b/0x30
[   29.052557]  proc_reg_write+0x38/0x70
[   29.052557]  __vfs_write+0x36/0x160
[   29.052557]  ? __fd_install+0x69/0x110
[   29.052557]  ? preempt_count_add+0x74/0xb0
[   29.052557]  ? _raw_spin_lock+0x13/0x30
[   29.052557]  ? set_close_on_exec+0x41/0x80
[   29.052557]  ? preempt_count_sub+0xa8/0x100
[   29.052557]  vfs_write+0xc0/0x190
[   29.052557]  SyS_write+0x64/0xe0
[   29.052557]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[   29.052557]  do_syscall_64+0x70/0x130
[   29.052557]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[   29.052557] RIP: 0033:0x7ffff74b9620
[   29.052557] Code: Bad RIP value.
[   29.052557] RSP: 002b:00007fffffffe6f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   29.052557] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007ffff74b9620
[   29.052557] RDX: 0000000000000002 RSI: 0000000000705408 RDI: 0000000000000001
[   29.052557] RBP: 0000000000705408 R08: 000000000000000a R09: 00007ffff7fdb700
[   29.052557] R10: 00007ffff77826a0 R11: 0000000000000246 R12: 00007ffff77842a0
[   29.052557] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
[   29.052557] Modules linked in:
[   29.085219] ---[ end trace c579921b8f40a393 ]---
[   29.085920] RIP: 0010:sysrq_handle_crash+0x17/0x20
[   29.086704] Code: d1 e8 6d 08 b7 ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 e8 76 1f bd ff c7 05 a4 12 19 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 e8 c6 1b c2 ff fb e9 80 
[   29.089439] RSP: 0018:ffffc900007cbdf0 EFLAGS: 00010246
[   29.090117] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 0000000000000000
[   29.091039] RDX: 0000000000000000 RSI: ffffffff81101e0a RDI: 0000000000000063
[   29.091959] RBP: ffffffff822714c0 R08: 0000000000000185 R09: 000000000000b5a4
[   29.092935] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000a
[   29.093948] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   29.094885] FS:  00007ffff7fdb700(0000) GS:ffff88007edc0000(0000) knlGS:0000000000000000
[   29.095925] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.096731] CR2: 0000000000000000 CR3: 00000000799e1000 CR4: 00000000000406e0
[   29.097784] Kernel panic - not syncing: Fatal exception
[   29.098882] Kernel Offset: disabled
[   29.099351] ---[ end Kernel panic - not syncing: Fatal exception ]---

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section
  2018-03-15 18:16     ` Borislav Petkov
  2018-03-15 19:06       ` Josh Poimboeuf
@ 2018-03-16 11:57       ` David Laight
  1 sibling, 0 replies; 36+ messages in thread
From: David Laight @ 2018-03-16 11:57 UTC (permalink / raw)
  To: 'Borislav Petkov', Josh Poimboeuf
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

From: Borislav Petkov
> Sent: 15 March 2018 18:16
> On Thu, Mar 15, 2018 at 01:10:54PM -0500, Josh Poimboeuf wrote:
> > I liked OPCODE_BUFSIZE where it was before :-)  Here it disrupts the
> > readability of the function a bit IMO.
> 
> My thinking is have it close by so that you don't have to go search for
> its definition.
> 
> But I don't have a strong opinion on where it should be so...

Is OPCODE_BUFSIZE even needed?
Maybe replace with 64 and use sizeof() and/or ARRAY_SIZE() elsewhere.
Then no one has to check that the bound is appropriate for the array.

	David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-16 11:48     ` Borislav Petkov
@ 2018-03-16 12:01       ` Josh Poimboeuf
  2018-03-16 12:11         ` Borislav Petkov
  2018-03-16 17:22       ` Linus Torvalds
  1 sibling, 1 reply; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-16 12:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Fri, Mar 16, 2018 at 12:48:49PM +0100, Borislav Petkov wrote:
> [   29.046500] sysrq: SysRq : Trigger a crash
> [   29.048605] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [   29.051639] PGD 79afd067 P4D 79afd067 PUD 7a1a2067 PMD 0 
> [   29.052557] Oops: 0002 [#1] PREEMPT SMP
> [   29.052557] CPU: 7 PID: 3693 Comm: bash Not tainted 4.16.0-rc5+ #8
> [   29.052557] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [   29.052557] RIP: 0010:sysrq_handle_crash+0x17/0x20
> [   29.052557] Code: d1 e8 6d 08 b7 ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 e8 76 1f bd ff c7 05 a4 12 19 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 e8 c6 1b c2 ff fb e9 80 
> [   29.052557] RSP: 0018:ffffc900007cbdf0 EFLAGS: 00010246
> [   29.052557] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 0000000000000000
> [   29.052557] RDX: 0000000000000000 RSI: ffffffff81101e0a RDI: 0000000000000063
> [   29.052557] RBP: ffffffff822714c0 R08: 0000000000000185 R09: 000000000000b5a4
> [   29.052557] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000a
> [   29.052557] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [   29.052557] FS:  00007ffff7fdb700(0000) GS:ffff88007edc0000(0000) knlGS:0000000000000000
> [   29.052557] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   29.052557] CR2: 0000000000000000 CR3: 00000000799e1000 CR4: 00000000000406e0
> [   29.052557] Call Trace:
> [   29.052557]  __handle_sysrq+0x9e/0x160
> [   29.052557]  write_sysrq_trigger+0x2b/0x30
> [   29.052557]  proc_reg_write+0x38/0x70
> [   29.052557]  __vfs_write+0x36/0x160
> [   29.052557]  ? __fd_install+0x69/0x110
> [   29.052557]  ? preempt_count_add+0x74/0xb0
> [   29.052557]  ? _raw_spin_lock+0x13/0x30
> [   29.052557]  ? set_close_on_exec+0x41/0x80
> [   29.052557]  ? preempt_count_sub+0xa8/0x100
> [   29.052557]  vfs_write+0xc0/0x190
> [   29.052557]  SyS_write+0x64/0xe0
> [   29.052557]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [   29.052557]  do_syscall_64+0x70/0x130
> [   29.052557]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [   29.052557] RIP: 0033:0x7ffff74b9620
> [   29.052557] Code: Bad RIP value.
> [   29.052557] RSP: 002b:00007fffffffe6f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [   29.052557] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007ffff74b9620
> [   29.052557] RDX: 0000000000000002 RSI: 0000000000705408 RDI: 0000000000000001
> [   29.052557] RBP: 0000000000705408 R08: 000000000000000a R09: 00007ffff7fdb700
> [   29.052557] R10: 00007ffff77826a0 R11: 0000000000000246 R12: 00007ffff77842a0
> [   29.052557] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000

Hm, the "Code: Bad RIP value" will always be shown for syscall regs,
which will probably cause some unnecessary confusion/worry.  Should we
just skip printing it for the "regs->ip < PAGE_OFFSET" case?

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-16 12:01       ` Josh Poimboeuf
@ 2018-03-16 12:11         ` Borislav Petkov
  2018-03-16 13:16           ` Josh Poimboeuf
  0 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2018-03-16 12:11 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Fri, Mar 16, 2018 at 07:01:12AM -0500, Josh Poimboeuf wrote:
> Hm, the "Code: Bad RIP value" will always be shown for syscall regs,
> which will probably cause some unnecessary confusion/worry.  Should we
> just skip printing it for the "regs->ip < PAGE_OFFSET" case?

How about we remove that check altogether?

I mean, __copy_from_user_inatomic() by way of probe_kernel_read() should
be able to handle every address.

And if it doesn't, it says so:

        if (probe_kernel_read(opcodes, ip, OPCODE_BUFSIZE)) {
                pr_cont("Bad RIP value.\n");


And if we *can* print opcode bytes, why not do so? It is one more hint
when debugging, who knows, might prove useful...

Hmm?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-16 12:11         ` Borislav Petkov
@ 2018-03-16 13:16           ` Josh Poimboeuf
  2018-03-16 13:44             ` Borislav Petkov
  0 siblings, 1 reply; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-16 13:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Fri, Mar 16, 2018 at 01:11:17PM +0100, Borislav Petkov wrote:
> On Fri, Mar 16, 2018 at 07:01:12AM -0500, Josh Poimboeuf wrote:
> > Hm, the "Code: Bad RIP value" will always be shown for syscall regs,
> > which will probably cause some unnecessary confusion/worry.  Should we
> > just skip printing it for the "regs->ip < PAGE_OFFSET" case?
> 
> How about we remove that check altogether?
> 
> I mean, __copy_from_user_inatomic() by way of probe_kernel_read() should
> be able to handle every address.
> 
> And if it doesn't, it says so:
> 
>         if (probe_kernel_read(opcodes, ip, OPCODE_BUFSIZE)) {
>                 pr_cont("Bad RIP value.\n");
> 
> 
> And if we *can* print opcode bytes, why not do so? It is one more hint
> when debugging, who knows, might prove useful...
> 
> Hmm?

Yeah, sounds good to me.  I think an earlier version of your patches
already printed the user space opcodes anyway.

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-16 13:16           ` Josh Poimboeuf
@ 2018-03-16 13:44             ` Borislav Petkov
  0 siblings, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2018-03-16 13:44 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, Andy Lutomirski, Linus Torvalds, Peter Zijlstra, LKML

On Fri, Mar 16, 2018 at 08:16:06AM -0500, Josh Poimboeuf wrote:
> Yeah, sounds good to me.  I think an earlier version of your patches
> already printed the user space opcodes anyway.

Right, which means it works already! :-P

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-16 11:48     ` Borislav Petkov
  2018-03-16 12:01       ` Josh Poimboeuf
@ 2018-03-16 17:22       ` Linus Torvalds
  2018-03-16 17:40         ` Josh Poimboeuf
  2018-03-16 17:45         ` Borislav Petkov
  1 sibling, 2 replies; 36+ messages in thread
From: Linus Torvalds @ 2018-03-16 17:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Josh Poimboeuf, X86 ML, Andy Lutomirski, Peter Zijlstra, LKML

On Fri, Mar 16, 2018 at 4:48 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Thu, Mar 15, 2018 at 02:01:32PM -0500, Josh Poimboeuf wrote:
>> no_context() has the following line, right before it calls oops_end():
>>
>>       /* Executive summary in case the body of the oops scrolled away */
>>       printk(KERN_DEFAULT "CR2: %016lx\n", address);
>>
>> I think that line can now be removed, since the executive summary
>> __show_regs() will include CR2.
>
> Good idea. Done.

NOOOO!

Guys, %cr2 CAN AND DOES CHANGE!

The reason we do that

        printk(KERN_DEFAULT "CR2: %016lx\n", address);

is because WE ARE NOT PRINTING OUT THE CURRENT CR2 REGISTER!

This is really damn important.

The "address" register contains the CR2 value as it was read *very*
early in the page fault case, before we enabled interrupts, and before
we did various random things that can cause further page faults and
change CR2!

So the executive summary that does __show_regs() may end up showing
something completely different than the actual faulting address,
because we might have taken a vmalloc-space exception in the meantime,
for example.

Do *NOT* get rid of that thing.

You're better off getting rid of the CR2 line from __show_regs(),
because it can be dangerously confusing. It's not actually part of the
saved register state at all, it's something entirely different. It's
like showing the current eflags rather than the eflags saved on the
faulting stack.

                     Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-16 17:22       ` Linus Torvalds
@ 2018-03-16 17:40         ` Josh Poimboeuf
  2018-03-16 17:45         ` Borislav Petkov
  1 sibling, 0 replies; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-16 17:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, X86 ML, Andy Lutomirski, Peter Zijlstra, LKML

On Fri, Mar 16, 2018 at 10:22:29AM -0700, Linus Torvalds wrote:
> On Fri, Mar 16, 2018 at 4:48 AM, Borislav Petkov <bp@alien8.de> wrote:
> > On Thu, Mar 15, 2018 at 02:01:32PM -0500, Josh Poimboeuf wrote:
> >> no_context() has the following line, right before it calls oops_end():
> >>
> >>       /* Executive summary in case the body of the oops scrolled away */
> >>       printk(KERN_DEFAULT "CR2: %016lx\n", address);
> >>
> >> I think that line can now be removed, since the executive summary
> >> __show_regs() will include CR2.
> >
> > Good idea. Done.
> 
> NOOOO!
> 
> Guys, %cr2 CAN AND DOES CHANGE!
> 
> The reason we do that
> 
>         printk(KERN_DEFAULT "CR2: %016lx\n", address);
> 
> is because WE ARE NOT PRINTING OUT THE CURRENT CR2 REGISTER!

Good point.  I missed the fact that no_context() isn't printing the
current CR2.

> This is really damn important.
> 
> The "address" register contains the CR2 value as it was read *very*
> early in the page fault case, before we enabled interrupts, and before
> we did various random things that can cause further page faults and
> change CR2!
> 
> So the executive summary that does __show_regs() may end up showing
> something completely different than the actual faulting address,
> because we might have taken a vmalloc-space exception in the meantime,
> for example.
> 
> Do *NOT* get rid of that thing.
> 
> You're better off getting rid of the CR2 line from __show_regs(),
> because it can be dangerously confusing. It's not actually part of the
> saved register state at all, it's something entirely different. It's
> like showing the current eflags rather than the eflags saved on the
> faulting stack.

True, it's probably best to remove it.  The only time we need CR2's
value is presumably when it would have already been printed in
no_context(), and so it primarily just adds confusion as you said.

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-16 17:22       ` Linus Torvalds
  2018-03-16 17:40         ` Josh Poimboeuf
@ 2018-03-16 17:45         ` Borislav Petkov
  2018-03-16 18:38           ` Josh Poimboeuf
  1 sibling, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2018-03-16 17:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josh Poimboeuf, X86 ML, Andy Lutomirski, Peter Zijlstra, LKML

On Fri, Mar 16, 2018 at 10:22:29AM -0700, Linus Torvalds wrote:
> The reason we do that
> 
>         printk(KERN_DEFAULT "CR2: %016lx\n", address);
> 
> is because WE ARE NOT PRINTING OUT THE CURRENT CR2 REGISTER!

Whoopsie!

Doh, __show_regs() reads CR2 again and there's a big fat window
in-between...

> This is really damn important.
> 
> The "address" register contains the CR2 value as it was read *very*
> early in the page fault case, before we enabled interrupts, and before
> we did various random things that can cause further page faults and
> change CR2!
> 
> So the executive summary that does __show_regs() may end up showing
> something completely different than the actual faulting address,
> because we might have taken a vmalloc-space exception in the meantime,
> for example.
> 
> Do *NOT* get rid of that thing.

Reverted.

> You're better off getting rid of the CR2 line from __show_regs(),
> because it can be dangerously confusing. It's not actually part of the
> saved register state at all, it's something entirely different. It's
> like showing the current eflags rather than the eflags saved on the
> faulting stack.

Yeah, __show_regs() goes and gets a bunch of registers at the time
__show_regs() runs. Which is ok for those which don't change in between
but CR2 is special.

We probably could improve that situation by having a struct fault_regs
or so wrapping pt_regs and adding a bunch of fields like CR2 etc. Fault
handlers would then populate fault_regs at fault time while we're atomic
and then hand this struct down to the printing path.

The printing path would fill out the rest and this way we won't have any
of that monkey business anymore.

Thoughts?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary
  2018-03-16 17:45         ` Borislav Petkov
@ 2018-03-16 18:38           ` Josh Poimboeuf
  0 siblings, 0 replies; 36+ messages in thread
From: Josh Poimboeuf @ 2018-03-16 18:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, X86 ML, Andy Lutomirski, Peter Zijlstra, LKML

On Fri, Mar 16, 2018 at 06:45:29PM +0100, Borislav Petkov wrote:
> > You're better off getting rid of the CR2 line from __show_regs(),
> > because it can be dangerously confusing. It's not actually part of the
> > saved register state at all, it's something entirely different. It's
> > like showing the current eflags rather than the eflags saved on the
> > faulting stack.
> 
> Yeah, __show_regs() goes and gets a bunch of registers at the time
> __show_regs() runs. Which is ok for those which don't change in between
> but CR2 is special.
> 
> We probably could improve that situation by having a struct fault_regs
> or so wrapping pt_regs and adding a bunch of fields like CR2 etc. Fault
> handlers would then populate fault_regs at fault time while we're atomic
> and then hand this struct down to the printing path.
> 
> The printing path would fill out the rest and this way we won't have any
> of that monkey business anymore.
> 
> Thoughts?

It would be nice if we could save *all* the printed registers before
they get a chance to change, but I don't know how feasible that is.
Some of the registers change in entry code, like CR3 and GS.

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2
  2018-03-15 17:51 ` [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Linus Torvalds
@ 2018-04-17 14:40   ` Borislav Petkov
  2018-04-17 20:16     ` Josh Poimboeuf
  0 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2018-04-17 14:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: X86 ML, Andy Lutomirski, Josh Poimboeuf, Peter Zijlstra, LKML

On Thu, Mar 15, 2018 at 10:51:06AM -0700, Linus Torvalds wrote:
> This version looks ok to me. I'm sure there's room for tweaking here,
> but I'm not seeing anything alarming.

So I'm redoing the series ontop of 17-rc1 and I see a *lot* of output
during testing. For example:

1) is from the userspace fault, 2) is the panic from sysrq but then you have 3)
which is

	WARN_ON_ONCE(!cpu_online(new_cpu));

in set_task_cpu() and to top it all off, we have 4) coming from
native_smp_send_reschedule():

static void native_smp_send_reschedule(int cpu)
{
        if (unlikely(cpu_is_offline(cpu))) {
                WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);

so all the "fine tuning" we did to try to fit the most important splat
on the screen is for shit because those loud WARNs simply pushed it all
up into oblivion.

And the executive summary and registers are just as worthless in such a
case.

We could start thinking about caching all that data from the very first
splat, when we're not tainted yet and dump it last but then we can't
even know what is going out last.

Not only because we can't guess from where stuff might warn and what
could execute - the below splats case-in-point - also, and more
importantly, we don't know how much of that data would actually go out
as there are no guarantees *when* the machine will die and stop spewing
to the serial port.

So maybe the most important splat coming out first is maybe a good thing
because it has a higher chance of coming out before the box locks up
completely.

So I guess we should keep hoping that serial console works and keeps on
working...

Hmmm.

1.
---

[  211.766286] strsep[2240]: segfault at 40066b ip 00007f9d4e0a301b sp 00007ffcd158a1e0 error 7 in libc-2.24.so[7f9d4e01a000+195000]
[  211.778147] Code: 53 48 83 ec 08 48 8b 2f 48 85 ed 74 27 0f b6 06 48 89 fb 84 c0 74 34 80 7e 01 00 74 22 48 89 ef e8 7a 68 f9 ff 48 85 c0 74 21 <c6> 00 00 48 83 c0 01 48 89 03 48 83 c4 08 48 89 e8 5b 5d c3 90 0f 

2.
---

[  218.747919] sysrq: SysRq : Trigger a crash
[  218.752258] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  218.760044] PGD 8000000439bb8067 P4D 8000000439bb8067 PUD 4381ac067 PMD 0 
[  218.766888] Oops: 0002 [#1] PREEMPT SMP PTI
[  218.771052] CPU: 5 PID: 2200 Comm: bash Not tainted 4.17.0-rc1+ #1
[  218.777203] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 05/11/2014
[  218.784562] RIP: 0010:sysrq_handle_crash+0x41/0x80
[  218.789319] Code: e8 c4 17 c4 ff 48 c7 c2 87 b6 4a 81 be 01 00 00 00 48 c7 c7 c0 07 25 82 e8 2c 23 c2 ff c7 05 82 7a 39 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 e8 61 43 c4 ff 84 c0 75 c4 48 c7 c2 50 
[  218.808148] RSP: 0018:ffffc90003d03de8 EFLAGS: 00010286
[  218.813338] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 3e2d9cbd00000000
[  218.820422] RDX: ffff88043804cd38 RSI: 000000006cb322e4 RDI: 0000000000000000
[  218.827508] RBP: ffffc90003d03e10 R08: 000000003f352d81 R09: 0000000000000001
[  218.834595] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000007
[  218.841680] R13: ffffffff822c09e0 R14: 0000000000000000 R15: 0000000000000000
[  218.848773] FS:  00007f4105bce280(0000) GS:ffff88043df40000(0000) knlGS:0000000000000000
[  218.856808] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  218.862514] CR2: 0000000000000000 CR3: 0000000439428005 CR4: 00000000000606e0
[  218.869599] Call Trace:
[  218.872041]  __handle_sysrq+0xcd/0x210
[  218.875774]  write_sysrq_trigger+0x48/0x50
[  218.879853]  proc_reg_write+0x3c/0x60
[  218.883501]  __vfs_write+0x36/0x160
[  218.886986]  ? rcu_read_lock_sched_held+0x74/0x80
[  218.891666]  ? rcu_sync_lockdep_assert+0x2e/0x60
[  218.896268]  ? __sb_start_write+0x157/0x200
[  218.900425]  ? __sb_start_write+0x16d/0x200
[  218.904585]  vfs_write+0xbe/0x1b0
[  218.907886]  ksys_write+0x55/0xc0
[  218.911193]  do_syscall_64+0x64/0x200
[  218.914858]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  218.919881] RIP: 0033:0x7f41050a9620
[  218.923429] Code: 73 01 c3 48 8b 0d 78 c8 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 29 21 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae 9b 01 00 48 89 04 24 
[  218.942261] RSP: 002b:00007ffe8c2ad378 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  218.949778] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f41050a9620
[  218.956863] RDX: 0000000000000002 RSI: 00000000008fe408 RDI: 0000000000000001
[  218.963949] RBP: 00000000008fe408 R08: 00007f4105368760 R09: 00007f4105bce280
[  218.971047] R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000002
[  218.978138] R13: 0000000000000001 R14: 00007f4105367600 R15: 0000000000000002
[  218.985240] Modules linked in: binfmt_misc xfs nls_iso8859_15 nls_cp437 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq sb_edac libcrc32c x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crc32_pclmul snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel usbhid aesni_intel aes_x86_64 snd_hda_codec_realtek snd_hda_codec_generic crypto_simd cryptd snd_hda_intel glue_helper iTCO_wdt intel_cstate snd_hda_codec intel_uncore snd_hwdep dcdbas snd_hda_core dell_smm_hwmon iTCO_vendor_support intel_rapl_perf snd_pcm efi_pstore snd_timer serio_raw lpc_ich efivars i2c_i801 mfd_core snd soundcore button
[  219.038108] CR2: 0000000000000000
[  219.041499] ---[ end trace 75992131ce321254 ]---
[  219.046095] RIP: 0010:sysrq_handle_crash+0x41/0x80
[  219.050848] Code: e8 c4 17 c4 ff 48 c7 c2 87 b6 4a 81 be 01 00 00 00 48 c7 c7 c0 07 25 82 e8 2c 23 c2 ff c7 05 82 7a 39 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 e8 61 43 c4 ff 84 c0 75 c4 48 c7 c2 50 
[  219.069577] RSP: 0018:ffffc90003d03de8 EFLAGS: 00010286
[  219.074770] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 3e2d9cbd00000000
[  219.081842] RDX: ffff88043804cd38 RSI: 000000006cb322e4 RDI: 0000000000000000
[  219.088919] RBP: ffffc90003d03e10 R08: 000000003f352d81 R09: 0000000000000001
[  219.096000] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000007
[  219.103077] R13: ffffffff822c09e0 R14: 0000000000000000 R15: 0000000000000000
[  219.110151] FS:  00007f4105bce280(0000) GS:ffff88043df40000(0000) knlGS:0000000000000000
[  219.118168] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  219.124199] CR2: 0000000000000000 CR3: 0000000439428005 CR4: 00000000000606e0
[  219.131603] Kernel panic - not syncing: Fatal exception
[  219.137135] Kernel Offset: disabled
[  219.140909] ---[ end Kernel panic - not syncing: Fatal exception ]---

3.
---

[  219.147613] WARNING: CPU: 5 PID: 2200 at kernel/sched/core.c:1163 set_task_cpu+0x24e/0x2c0
[  219.157314] Modules linked in: binfmt_misc xfs nls_iso8859_15 nls_cp437 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq sb_edac libcrc32c x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crc32_pclmul snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel usbhid aesni_intel aes_x86_64 snd_hda_codec_realtek snd_hda_codec_generic crypto_simd cryptd snd_hda_intel glue_helper iTCO_wdt intel_cstate snd_hda_codec intel_uncore snd_hwdep dcdbas snd_hda_core dell_smm_hwmon iTCO_vendor_support intel_rapl_perf snd_pcm efi_pstore snd_timer serio_raw lpc_ich efivars i2c_i801 mfd_core snd soundcore button
[  219.214593] CPU: 5 PID: 2200 Comm: bash Tainted: G      D           4.17.0-rc1+ #1
[  219.223581] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 05/11/2014
[  219.232396] RIP: 0010:set_task_cpu+0x24e/0x2c0
[  219.238281] Code: 85 c0 0f 85 2a fe ff ff 0f 0b e9 23 fe ff ff 0f 0b e9 f7 fd ff ff f7 43 60 fd ff ff ff 0f 84 01 fe ff ff 0f 0b e9 fa fd ff ff <0f> 0b e9 12 fe ff ff e8 f6 c0 04 00 85 c0 0f 85 72 ff ff ff 48 c7 
[  219.260179] RSP: 0018:ffff88043df43d78 EFLAGS: 00010046
[  219.266967] RAX: 0000000000000000 RBX: ffff880439e81700 RCX: 0000000000000000
[  219.275649] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880439e81700
[  219.284264] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  219.292805] R10: 0000000000000000 R11: ffffffff822507c0 R12: ffff880439e81e78
[  219.301344] R13: 0000000000000000 R14: 0000000000000046 R15: 0000000000021b40
[  219.309884] FS:  00007f4105bce280(0000) GS:ffff88043df40000(0000) knlGS:0000000000000000
[  219.319375] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  219.326549] CR2: 0000000000000000 CR3: 0000000439428005 CR4: 00000000000606e0
[  219.335106] Call Trace:
[  219.339018]  <IRQ>
[  219.342481]  try_to_wake_up+0x176/0x530
[  219.347751]  ? lock_acquire+0xa6/0x220
[  219.352928]  autoremove_wake_function+0xe/0x30
[  219.358773]  __wake_up_common+0x74/0x120
[  219.364114]  __wake_up_common_lock+0x7c/0xc0
[  219.369795]  ? tick_sched_do_timer+0x80/0x80
[  219.375482]  irq_work_run_list+0x49/0x70
[  219.380820]  update_process_times+0x3b/0x50
[  219.386416]  tick_sched_timer+0x37/0x70
[  219.391674]  __hrtimer_run_queues+0x114/0x590
[  219.397444]  hrtimer_interrupt+0xd5/0x220
[  219.402867]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  219.408983]  smp_apic_timer_interrupt+0x80/0x2e0
[  219.415014]  apic_timer_interrupt+0xf/0x20
[  219.420538]  </IRQ>
[  219.424076] RIP: 0010:panic+0x1fe/0x24c
[  219.429321] Code: a6 83 3d 28 97 7d 01 00 74 05 e8 c9 c9 02 00 48 c7 c6 60 31 84 82 48 c7 c7 90 e3 0a 82 e8 c4 68 07 00 e8 01 1e 06 00 fb 31 db <e8> f9 c7 0c 00 4c 39 eb 7c 1d 41 83 f4 01 48 8b 05 e9 96 7d 01 44 
[  219.450965] RSP: 0018:ffffc90003d03b90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  219.459921] RAX: 0000000000000039 RBX: 0000000000000000 RCX: 0000000000000001
[  219.468443] RDX: 0000000000000000 RSI: ffffffff81069a1f RDI: ffffffff81069a1f
[  219.476966] RBP: ffffc90003d03c08 R08: 0000000000000000 R09: 0000000000000000
[  219.485493] R10: ffffc90003d03ad0 R11: ffffffff8224e338 R12: 0000000000000000
[  219.494018] R13: 0000000000000000 R14: 000000000000000b R15: 0000000000000009
[  219.502549]  ? panic+0x1fb/0x24c
[  219.507175]  ? panic+0x1fb/0x24c
[  219.511771]  oops_end+0xbb/0xc0
[  219.516248]  no_context+0x17e/0x400
[  219.521044]  __do_page_fault+0x3f4/0x570
[  219.526260]  ? trace_hardirqs_off_caller+0x83/0xe0
[  219.532326]  do_page_fault+0x2c/0x2b0
[  219.537210]  page_fault+0x1e/0x30
[  219.541677] RIP: 0010:sysrq_handle_crash+0x41/0x80
[  219.547600] Code: e8 c4 17 c4 ff 48 c7 c2 87 b6 4a 81 be 01 00 00 00 48 c7 c7 c0 07 25 82 e8 2c 23 c2 ff c7 05 82 7a 39 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 e8 61 43 c4 ff 84 c0 75 c4 48 c7 c2 50 
[  219.568890] RSP: 0018:ffffc90003d03de8 EFLAGS: 00010286
[  219.575381] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 3e2d9cbd00000000
[  219.583774] RDX: ffff88043804cd38 RSI: 000000006cb322e4 RDI: 0000000000000000
[  219.592163] RBP: ffffc90003d03e10 R08: 000000003f352d81 R09: 0000000000000001
[  219.600544] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000007
[  219.608918] R13: ffffffff822c09e0 R14: 0000000000000000 R15: 0000000000000000
[  219.617290]  ? sysrq_handle_crash+0x34/0x80
[  219.622727]  __handle_sysrq+0xcd/0x210
[  219.627731]  write_sysrq_trigger+0x48/0x50
[  219.633075]  proc_reg_write+0x3c/0x60
[  219.637980]  __vfs_write+0x36/0x160
[  219.642686]  ? rcu_read_lock_sched_held+0x74/0x80
[  219.648556]  ? rcu_sync_lockdep_assert+0x2e/0x60
[  219.654302]  ? __sb_start_write+0x157/0x200
[  219.659595]  ? __sb_start_write+0x16d/0x200
[  219.664866]  vfs_write+0xbe/0x1b0
[  219.669268]  ksys_write+0x55/0xc0
[  219.673602]  do_syscall_64+0x64/0x200
[  219.678202]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  219.684177] RIP: 0033:0x7f41050a9620
[  219.688682] Code: 73 01 c3 48 8b 0d 78 c8 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 29 21 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae 9b 01 00 48 89 04 24 
[  219.709570] RSP: 002b:00007ffe8c2ad378 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  219.718195] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f41050a9620
[  219.726391] RDX: 0000000000000002 RSI: 00000000008fe408 RDI: 0000000000000001
[  219.734584] RBP: 00000000008fe408 R08: 00007f4105368760 R09: 00007f4105bce280
[  219.742793] R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000002
[  219.750973] R13: 0000000000000001 R14: 00007f4105367600 R15: 0000000000000002
[  219.759145] irq event stamp: 80113
[  219.763626] hardirqs last  enabled at (80113): [<ffffffff810544f9>] __do_page_fault+0x2f9/0x570
[  219.773371] hardirqs last disabled at (80112): [<ffffffff81a01096>] error_entry+0x86/0x100
[  219.782690] softirqs last  enabled at (80104): [<ffffffff81c0039e>] __do_softirq+0x39e/0x504
[  219.792196] softirqs last disabled at (80095): [<ffffffff810719d5>] irq_exit+0xd5/0xe0
[  219.801181] ---[ end trace 75992131ce321255 ]---

4.
---

[  219.806912] ------------[ cut here ]------------
[  219.812640] sched: Unexpected reschedule of offline CPU#0!
[  219.819253] WARNING: CPU: 5 PID: 2200 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x34/0x40
[  219.829732] Modules linked in: binfmt_misc xfs nls_iso8859_15 nls_cp437 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq sb_edac libcrc32c x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crc32_pclmul snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel usbhid aesni_intel aes_x86_64 snd_hda_codec_realtek snd_hda_codec_generic crypto_simd cryptd snd_hda_intel glue_helper iTCO_wdt intel_cstate snd_hda_codec intel_uncore snd_hwdep dcdbas snd_hda_core dell_smm_hwmon iTCO_vendor_support intel_rapl_perf snd_pcm efi_pstore snd_timer serio_raw lpc_ich efivars i2c_i801 mfd_core snd soundcore button
[  219.886225] CPU: 5 PID: 2200 Comm: bash Tainted: G      D W         4.17.0-rc1+ #1
[  219.895022] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 05/11/2014
[  219.903655] RIP: 0010:native_smp_send_reschedule+0x34/0x40
[  219.910397] Code: 05 21 ea 30 01 73 15 48 8b 05 78 34 13 01 be fd 00 00 00 48 8b 40 30 e9 aa 63 bc 00 89 fe 48 c7 c7 88 66 0a 82 e8 ec c5 02 00 <0f> 0b c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 8b 05 55 73 80 
[  219.931962] RSP: 0018:ffff88043df43d60 EFLAGS: 00010082
[  219.938524] RAX: 0000000000000000 RBX: ffff88043de21b40 RCX: 0000000000010004
[  219.946918] RDX: 0000000080010004 RSI: ffffffff820ebfed RDI: 00000000ffffffff
[  219.955301] RBP: ffff880439e81700 R08: 0000000000000000 R09: 0000000000000000
[  219.963686] R10: afb504000afb5041 R11: ffffffff8224e338 R12: ffff880439e81e78
[  219.972069] R13: ffff88043df43db0 R14: 0000000000000046 R15: ffff88043de21b58
[  219.980459] FS:  00007f4105bce280(0000) GS:ffff88043df40000(0000) knlGS:0000000000000000
[  219.989801] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  219.996819] CR2: 0000000000000000 CR3: 0000000439428005 CR4: 00000000000606e0
[  220.005216] Call Trace:
[  220.008992]  <IRQ>
[  220.012324]  check_preempt_curr+0x86/0xb0
[  220.017637]  ttwu_do_wakeup+0x19/0x270
[  220.022687]  try_to_wake_up+0x204/0x530
[  220.027820]  autoremove_wake_function+0xe/0x30
[  220.033553]  __wake_up_common+0x74/0x120
[  220.038771]  __wake_up_common_lock+0x7c/0xc0
[  220.044338]  ? tick_sched_do_timer+0x80/0x80
[  220.049829]  irq_work_run_list+0x49/0x70
[  220.054901]  update_process_times+0x3b/0x50
[  220.060230]  tick_sched_timer+0x37/0x70
[  220.065205]  __hrtimer_run_queues+0x114/0x590
[  220.070695]  hrtimer_interrupt+0xd5/0x220
[  220.075837]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  220.081669]  smp_apic_timer_interrupt+0x80/0x2e0
[  220.087412]  apic_timer_interrupt+0xf/0x20
[  220.092630]  </IRQ>
[  220.095903] RIP: 0010:panic+0x1fe/0x24c
[  220.100861] Code: a6 83 3d 28 97 7d 01 00 74 05 e8 c9 c9 02 00 48 c7 c6 60 31 84 82 48 c7 c7 90 e3 0a 82 e8 c4 68 07 00 e8 01 1e 06 00 fb 31 db <e8> f9 c7 0c 00 4c 39 eb 7c 1d 41 83 f4 01 48 8b 05 e9 96 7d 01 44 
[  220.122141] RSP: 0018:ffffc90003d03b90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  220.130970] RAX: 0000000000000039 RBX: 0000000000000000 RCX: 0000000000000001
[  220.139368] RDX: 0000000000000000 RSI: ffffffff81069a1f RDI: ffffffff81069a1f
[  220.147767] RBP: ffffc90003d03c08 R08: 0000000000000000 R09: 0000000000000000
[  220.156178] R10: ffffc90003d03ad0 R11: ffffffff8224e338 R12: 0000000000000000
[  220.164590] R13: 0000000000000000 R14: 000000000000000b R15: 0000000000000009
[  220.173012]  ? panic+0x1fb/0x24c
[  220.177561]  ? panic+0x1fb/0x24c
[  220.182108]  oops_end+0xbb/0xc0
[  220.186559]  no_context+0x17e/0x400
[  220.191290]  __do_page_fault+0x3f4/0x570
[  220.196374]  ? trace_hardirqs_off_caller+0x83/0xe0
[  220.202312]  do_page_fault+0x2c/0x2b0
[  220.207124]  page_fault+0x1e/0x30
[  220.211584] RIP: 0010:sysrq_handle_crash+0x41/0x80
[  220.217507] Code: e8 c4 17 c4 ff 48 c7 c2 87 b6 4a 81 be 01 00 00 00 48 c7 c7 c0 07 25 82 e8 2c 23 c2 ff c7 05 82 7a 39 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 e8 61 43 c4 ff 84 c0 75 c4 48 c7 c2 50 
[  220.238815] RSP: 0018:ffffc90003d03de8 EFLAGS: 00010286
[  220.245319] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 3e2d9cbd00000000
[  220.253709] RDX: ffff88043804cd38 RSI: 000000006cb322e4 RDI: 0000000000000000
[  220.262096] RBP: ffffc90003d03e10 R08: 000000003f352d81 R09: 0000000000000001
[  220.270477] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000007
[  220.278855] R13: ffffffff822c09e0 R14: 0000000000000000 R15: 0000000000000000
[  220.287230]  ? sysrq_handle_crash+0x34/0x80
[  220.292677]  __handle_sysrq+0xcd/0x210
[  220.297690]  write_sysrq_trigger+0x48/0x50
[  220.303037]  proc_reg_write+0x3c/0x60
[  220.307920]  __vfs_write+0x36/0x160
[  220.312595]  ? rcu_read_lock_sched_held+0x74/0x80
[  220.318443]  ? rcu_sync_lockdep_assert+0x2e/0x60
[  220.324172]  ? __sb_start_write+0x157/0x200
[  220.329464]  ? __sb_start_write+0x16d/0x200
[  220.334745]  vfs_write+0xbe/0x1b0
[  220.339116]  ksys_write+0x55/0xc0
[  220.343401]  do_syscall_64+0x64/0x200
[  220.348014]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  220.354001] RIP: 0033:0x7f41050a9620
[  220.358521] Code: 73 01 c3 48 8b 0d 78 c8 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 29 21 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae 9b 01 00 48 89 04 24 
[  220.379421] RSP: 002b:00007ffe8c2ad378 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  220.388042] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f41050a9620
[  220.396231] RDX: 0000000000000002 RSI: 00000000008fe408 RDI: 0000000000000001
[  220.404413] RBP: 00000000008fe408 R08: 00007f4105368760 R09: 00007f4105bce280
[  220.412590] R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000002
[  220.420759] R13: 0000000000000001 R14: 00007f4105367600 R15: 0000000000000002
[  220.428933] irq event stamp: 80113
[  220.433409] hardirqs last  enabled at (80113): [<ffffffff810544f9>] __do_page_fault+0x2f9/0x570
[  220.443155] hardirqs last disabled at (80112): [<ffffffff81a01096>] error_entry+0x86/0x100
[  220.452472] softirqs last  enabled at (80104): [<ffffffff81c0039e>] __do_softirq+0x39e/0x504
[  220.461970] softirqs last disabled at (80095): [<ffffffff810719d5>] irq_exit+0xd5/0xe0
[  220.470953] ---[ end trace 75992131ce321256 ]---

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2
  2018-04-17 14:40   ` Borislav Petkov
@ 2018-04-17 20:16     ` Josh Poimboeuf
  2018-04-17 21:06       ` Borislav Petkov
  0 siblings, 1 reply; 36+ messages in thread
From: Josh Poimboeuf @ 2018-04-17 20:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, X86 ML, Andy Lutomirski, Peter Zijlstra, LKML

On Tue, Apr 17, 2018 at 04:40:42PM +0200, Borislav Petkov wrote:
> On Thu, Mar 15, 2018 at 10:51:06AM -0700, Linus Torvalds wrote:
> > This version looks ok to me. I'm sure there's room for tweaking here,
> > but I'm not seeing anything alarming.
> 
> So I'm redoing the series ontop of 17-rc1 and I see a *lot* of output
> during testing. For example:
> 
> 1) is from the userspace fault, 2) is the panic from sysrq but then you have 3)
> which is
> 
> 	WARN_ON_ONCE(!cpu_online(new_cpu));
> 
> in set_task_cpu() and to top it all off, we have 4) coming from
> native_smp_send_reschedule():
> 
> static void native_smp_send_reschedule(int cpu)
> {
>         if (unlikely(cpu_is_offline(cpu))) {
>                 WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
> 
> so all the "fine tuning" we did to try to fit the most important splat
> on the screen is for shit because those loud WARNs simply pushed it all
> up into oblivion.
> 
> And the executive summary and registers are just as worthless in such a
> case.
> 
> We could start thinking about caching all that data from the very first
> splat, when we're not tainted yet and dump it last but then we can't
> even know what is going out last.
> 
> Not only because we can't guess from where stuff might warn and what
> could execute - the below splats case-in-point - also, and more
> importantly, we don't know how much of that data would actually go out
> as there are no guarantees *when* the machine will die and stop spewing
> to the serial port.
> 
> So maybe the most important splat coming out first is maybe a good thing
> because it has a higher chance of coming out before the box locks up
> completely.
> 
> So I guess we should keep hoping that serial console works and keeps on
> working...
> 
> Hmmm.

I don't think the stack tracing code could do anything better here.  #3
and #4 seem like an issue with the scheduler, it doesn't realize the
rest of the CPUs have all been taken offline due to the panic().

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2
  2018-04-17 20:16     ` Josh Poimboeuf
@ 2018-04-17 21:06       ` Borislav Petkov
  2018-04-18 13:26         ` Josh Poimboeuf
  0 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2018-04-17 21:06 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Linus Torvalds, X86 ML, Andy Lutomirski, Peter Zijlstra, LKML

On Tue, Apr 17, 2018 at 03:16:55PM -0500, Josh Poimboeuf wrote:
> I don't think the stack tracing code could do anything better here.  #3
> and #4 seem like an issue with the scheduler, it doesn't realize the
> rest of the CPUs have all been taken offline due to the panic().

So maybe teach the WARN code to check whether a panic() has happened?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2
  2018-04-17 21:06       ` Borislav Petkov
@ 2018-04-18 13:26         ` Josh Poimboeuf
  0 siblings, 0 replies; 36+ messages in thread
From: Josh Poimboeuf @ 2018-04-18 13:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, X86 ML, Andy Lutomirski, Peter Zijlstra, LKML

On Tue, Apr 17, 2018 at 11:06:50PM +0200, Borislav Petkov wrote:
> On Tue, Apr 17, 2018 at 03:16:55PM -0500, Josh Poimboeuf wrote:
> > I don't think the stack tracing code could do anything better here.  #3
> > and #4 seem like an issue with the scheduler, it doesn't realize the
> > rest of the CPUs have all been taken offline due to the panic().
> 
> So maybe teach the WARN code to check whether a panic() has happened?

I get the feeling that disabling warnings could be papering over a real
bug, but maybe we don't care about bugs in the post-panic state?

Ideally we could just leave interrupts disabled, but I don't know if
that's generally feasible since it looks like sparc is waiting for
keyboard interrupts to allow breaking out to the boot prom.

-- 
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2018-04-18 13:26 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-15 15:44 [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Borislav Petkov
2018-03-15 15:44 ` [PATCH 1/9] x86/dumstack: Remove code_bytes Borislav Petkov
2018-03-15 18:12   ` Josh Poimboeuf
2018-03-15 15:44 ` [PATCH 2/9] x86/dumpstack: Unexport oops_begin() Borislav Petkov
2018-03-15 15:44 ` [PATCH 3/9] x86/dumpstack: Carve out Code: dumping into a function Borislav Petkov
2018-03-15 15:44 ` [PATCH 4/9] x86/dumpstack: Improve opcodes dumping in the Code: section Borislav Petkov
2018-03-15 18:10   ` Josh Poimboeuf
2018-03-15 18:16     ` Borislav Petkov
2018-03-15 19:06       ` Josh Poimboeuf
2018-03-16 11:57       ` David Laight
2018-03-15 18:19   ` Josh Poimboeuf
2018-03-15 18:23   ` Josh Poimboeuf
2018-03-15 15:44 ` [PATCH 5/9] x86/dumpstack: Add loglevel argument to show_opcodes() Borislav Petkov
2018-03-15 15:44 ` [PATCH 6/9] x86/fault: Dump user opcode bytes on fatal faults Borislav Petkov
2018-03-15 15:44 ` [PATCH 7/9] x86/dumpstack: Add a show_ip() function Borislav Petkov
2018-03-15 18:34   ` Josh Poimboeuf
2018-03-15 18:55     ` Borislav Petkov
2018-03-15 15:44 ` [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary Borislav Petkov
2018-03-15 19:01   ` Josh Poimboeuf
2018-03-16 11:48     ` Borislav Petkov
2018-03-16 12:01       ` Josh Poimboeuf
2018-03-16 12:11         ` Borislav Petkov
2018-03-16 13:16           ` Josh Poimboeuf
2018-03-16 13:44             ` Borislav Petkov
2018-03-16 17:22       ` Linus Torvalds
2018-03-16 17:40         ` Josh Poimboeuf
2018-03-16 17:45         ` Borislav Petkov
2018-03-16 18:38           ` Josh Poimboeuf
2018-03-15 15:44 ` [PATCH 9/9] x86/dumpstack: Explain the reasoning for the prologue and buffer size Borislav Petkov
2018-03-15 18:07   ` Josh Poimboeuf
2018-03-15 18:17     ` Borislav Petkov
2018-03-15 17:51 ` [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2 Linus Torvalds
2018-04-17 14:40   ` Borislav Petkov
2018-04-17 20:16     ` Josh Poimboeuf
2018-04-17 21:06       ` Borislav Petkov
2018-04-18 13:26         ` Josh Poimboeuf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.