On Sat, Sep 07, 2019 at 12:05:34PM +0300, Alexander Kapshuk wrote:
> To Whom It May Concern
>
> Every kernel I have built since 5.3.0-rc2-next-20190730 and up to
> 5.3.0-rc7-next-20190903 has resulted in the kernel panic described below.
>
> The panic occurs early on in the boot process, so no records of it get
> written on disk. I resorted to taking photos and videos to get the info
> for debugging.
>
> [Kernel panic]
> Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff
>
> Kernel panic - Not syncing: Attempted to kill init! exitcode=0x0000000b.
>
> Top of call stack:
> __drm_fb_helper_initial_config_and_unlock
> drm_fb_helper_initial_config
>
> Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff
> All code
> ========
>    0:  00 48 83                add    %cl,-0x7d(%rax)
>    3:  bb f0 00 00 00          mov    $0xf0,%ebx
>    8:  00 74 16 48             add    %dh,0x48(%rsi,%rdx,1)
>    c:  83 c3 18                add    $0x18,%ebx
>    f:  b9 17 00 00 00          mov    $0x17,%ecx
>   14:  31 c0                   xor    %eax,%eax
>   16:  48 89 df                mov    %rbx,%rdi
>   19:  f3 48 ab                rep stos %rax,%es:(%rdi)
>   1c:  5b                      pop    %rbx
>   1d:  41 5c                   pop    %r12
>   1f:  5d                      pop    %rbp
>   20:  c3                      retq
>   21:  4c 89 a3 f0 00 00 00    mov    %r12,0xf0(%rbx)
>   28:  eb e1                   jmp    0xb
>   2a:* 0f 0b                   ud2             <-- trapping instruction
>   2c:  0f 1f 40 00             nopl   0x0(%rax)
>   30:  55                      push   %rbp
>   31:  48 89 e5                mov    %rsp,%rbp
>   34:  41 54                   push   %r12
>   36:  49 89 d4                mov    %rdx,%r12
>   39:  53                      push   %rbx
>   3a:  48 89 f3                mov    %rsi,%rbx
>   3d:  e8                      .byte 0xe8
>   3e:  7e ff                   jle    0x3f
>
> Code starting with the faulting instruction
> ===========================================
>    0:  0f 0b                   ud2
>    2:  0f 1f 40 00             nopl   0x0(%rax)
>    6:  55                      push   %rbp
>    7:  48 89 e5                mov    %rsp,%rbp
>    a:  41 54                   push   %r12
>    c:  49 89 d4                mov    %rdx,%r12
>    f:  53                      push   %rbx
>   10:  48 89 f3                mov    %rsi,%rbx
>   13:  e8                      .byte 0xe8
>   14:  7e ff                   jle    0x15
>
> The panic occurs after the 'Driver supports precise vblank timestamp
> query.' line gets printed to console:
> [    2.858970] Linux agpgart interface v0.103
> [    2.859308] nouveau 0000:01:00.0: NVIDIA G84 (084300a2)
> [    2.968950] nouveau 0000:01:00.0: bios: version 60.84.68.00.19
> [    2.989923] nouveau 0000:01:00.0: bios: M0203T not found
> [    2.990010] nouveau 0000:01:00.0: bios: M0203E not matched!
> [    2.990096] nouveau 0000:01:00.0: fb: 512 MiB DDR2
> [    3.062362] [TTM] Zone kernel: Available graphics memory: 2015014 KiB
> [    3.062494] [TTM] Initializing pool allocator
> [    3.062581] [TTM] Initializing DMA pool allocator
> [    3.062683] nouveau 0000:01:00.0: DRM: VRAM: 512 MiB
> [    3.062769] nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
> [    3.062859] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
> [    3.062944] nouveau 0000:01:00.0: DRM: DCB version 4.0
> [    3.063030] nouveau 0000:01:00.0: DRM: DCB outp 00: 02000300 00000028
> [    3.063117] nouveau 0000:01:00.0: DRM: DCB outp 01: 01000302 00000030
> [    3.063203] nouveau 0000:01:00.0: DRM: DCB outp 02: 04011310 00000028
> [    3.063290] nouveau 0000:01:00.0: DRM: DCB outp 03: 02011312 00c000b0
> [    3.063377] nouveau 0000:01:00.0: DRM: DCB conn 00: 1030
> [    3.063462] nouveau 0000:01:00.0: DRM: DCB conn 01: 2130
> [    3.065982] nouveau 0000:01:00.0: DRM: MM: using CRYPT for buffer copies
> [    3.066622] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> [    3.066754] [drm] Driver supports precise vblank timestamp query.
>
> I was not able to capture the value of RIP for this crash.
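
For reference, the '2a:*' offset in the decoded output above is simply
the position of the byte wrapped in angle brackets on the 'Code:' line.
The following is a minimal Python sketch of that lookup. The function
name trapping_offset is made up for illustration, and this is not meant
to describe how the decoding tool itself is implemented.

```python
def trapping_offset(code_line: str) -> int:
    """Return the zero-based index of the byte marked <xx> on a 'Code:' line."""
    tokens = code_line.split()
    if tokens and tokens[0] == "Code:":
        tokens = tokens[1:]  # drop the label, keep only the hex bytes
    for index, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            return index
    raise ValueError("no <xx> marker found on the Code: line")


# The 'Code:' line from the first panic quoted above.
CODE = (
    "Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 "
    "17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c "
    "89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 "
    "41 54 49 89 d4 53 48 89 f3 e8 7e ff"
)

offset = trapping_offset(CODE)
print(hex(offset))  # 0x2a, matching the '2a:*' line in the decoded output
```

Because the dump starts a fixed number of bytes before the trapping
instruction, the instructions decoded before offset 0x2a may be
misaligned and should be read with caution.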
>
> With drm_kms_helper.fbdev_emulation=0 enabled, as documented in
> the commentary to function drm_fb_helper_initial_config defined in
> drivers/gpu/drm/drm_fb_helper.c, I get the following output:
>
> RIP: 0010: _raw_spin_lock+0x7/0x20
> Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f
>
> Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f
> All code
> ========
>    0:  ba ff 00 00 00          mov    $0xff,%edx
>    5:  f0 0f b1 17             lock cmpxchg %edx,(%rdi)
>    9:  75 01                   jne    0xc
>    b:  c3                      retq
>    c:  55                      push   %rbp
>    d:  48 89 e5                mov    %rsp,%rbp
>   10:  e8 23 a2 6d ff          callq  0xffffffffff6da238
>   15:  5d                      pop    %rbp
>   16:  c3                      retq
>   17:  66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
>   1e:  00 00 00 00
>   22:  90                      nop
>   23:  31 c0                   xor    %eax,%eax
>   25:  ba 01 00 00 00          mov    $0x1,%edx
>   2a:* f0 0f b1 17             lock cmpxchg %edx,(%rdi)        <-- trapping instruction
>   2e:  75 01                   jne    0x31
>   30:  c3                      retq
>   31:  55                      push   %rbp
>   32:  89 c6                   mov    %eax,%esi
>   34:  40 89 e5                rex mov %esp,%ebp
>   37:  e8 e7 8f 6d ff          callq  0xffffffffff6d9023
>   3c:  5d                      pop    %rbp
>   3d:  c3                      retq
>   3e:  0f                      .byte 0xf
>   3f:  1f                      (bad)
>
> Code starting with the faulting instruction
> ===========================================
>    0:  f0 0f b1 17             lock cmpxchg %edx,(%rdi)
>    4:  75 01                   jne    0x7
>    6:  c3                      retq
>    7:  55                      push   %rbp
>    8:  89 c6                   mov    %eax,%esi
>    a:  40 89 e5                rex mov %esp,%ebp
>    d:  e8 e7 8f 6d ff          callq  0xffffffffff6d8ff9
>   12:  5d                      pop    %rbp
>   13:  c3                      retq
>   14:  0f                      .byte 0xf
>   15:  1f                      (bad)
>
> (gdb) list *(_raw_spin_lock+0x7)
> 0xffffffff81a13b27 is in _raw_spin_lock (./arch/x86/include/asm/atomic.h:200).
> 195	}
> 196
> 197	#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
> 198	static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
> 199	{
> 200		return try_cmpxchg(&v->counter, old, new);
> 201	}
> 202
> 203	static inline int arch_atomic_xchg(atomic_t *v, int new)
> 204	{
>
> (gdb) disassemble _raw_spin_lock+0x7
> Dump of assembler code for function _raw_spin_lock:
>    0xffffffff81a13b20 <+0>:     xor    %eax,%eax
>    0xffffffff81a13b22 <+2>:     mov    $0x1,%edx
>    0xffffffff81a13b27 <+7>:     lock cmpxchg %edx,(%rdi)
>    0xffffffff81a13b2b <+11>:    jne    0xffffffff81a13b2e <_raw_spin_lock+14>
>    0xffffffff81a13b2d <+13>:    retq
>    0xffffffff81a13b2e <+14>:    push   %rbp
>    0xffffffff81a13b2f <+15>:    mov    %eax,%esi
>    0xffffffff81a13b31 <+17>:    mov    %rsp,%rbp
>    0xffffffff81a13b34 <+20>:    callq  0xffffffff810ecb20
>    0xffffffff81a13b39 <+25>:    pop    %rbp
>    0xffffffff81a13b3a <+26>:    retq
> End of assembler dump.
>
> Any pointers on how to proceed with this would be appreciated.

'Git bisect' has identified the following commits as being 'bad'.

b96f3e7c8069b749a40ca3a33c97835d57dd45d2 is the first bad commit
commit b96f3e7c8069b749a40ca3a33c97835d57dd45d2
Author: Gerd Hoffmann
Date:   Mon Aug 5 16:01:10 2019 +0200

    drm/ttm: use gem vma_node

    Drop vma_node from ttm_buffer_object, use the gem struct
    (base.vma_node) instead.

    Signed-off-by: Gerd Hoffmann
    Reviewed-by: Christian König
    Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-9-kraxel@redhat.com

 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 2 +-
 drivers/gpu/drm/drm_gem_vram_helper.c      | 2 +-
 drivers/gpu/drm/nouveau/nouveau_display.c  | 2 +-
 drivers/gpu/drm/nouveau/nouveau_gem.c      | 2 +-
 drivers/gpu/drm/qxl/qxl_object.h           | 2 +-
 drivers/gpu/drm/radeon/radeon_object.h     | 2 +-
 drivers/gpu/drm/ttm/ttm_bo.c               | 8 ++++----
 drivers/gpu/drm/ttm/ttm_bo_util.c          | 2 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c            | 9 +++++----
 drivers/gpu/drm/virtio/virtgpu_drv.h       | 2 +-
 drivers/gpu/drm/virtio/virtgpu_prime.c     | 3 ---
 drivers/gpu/drm/vmwgfx/vmwgfx_bo.c         | 4 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_surface.c    | 4 ++--
 include/drm/ttm/ttm_bo_api.h               | 4 ----
 14 files changed, 21 insertions(+), 27 deletions(-)

I initially marked commit
'[1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715] drm/ttm: use gem reservation object'
as 'good' because kernel 5.3.0-rc1-00364-g1e053b10ba60 did boot, although
GUI applications displayed black artifacts across the screen.
I then edited the git-bisect log file to mark commit
1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 as 'bad' instead and ran
'git bisect replay' on it. This identified commit
1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 as the first bad commit.

1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 is the first bad commit
commit 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715
Author: Gerd Hoffmann
Date:   Mon Aug 5 16:01:09 2019 +0200

    drm/ttm: use gem reservation object

    Drop ttm_resv from ttm_buffer_object, use the gem reservation object
    (base._resv) instead.

    Signed-off-by: Gerd Hoffmann
    Reviewed-by: Christian König
    Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-8-kraxel@redhat.com

 drivers/gpu/drm/ttm/ttm_bo.c      | 39 +++++++++++++++++++++++---------------
 drivers/gpu/drm/ttm/ttm_bo_util.c |  2 +-
 include/drm/ttm/ttm_bo_api.h      |  1 -
 3 files changed, 24 insertions(+), 18 deletions(-)

During bisection I also marked the following kernels as 'bad'. They
booted fine, but the X server failed to start. I have attached the error
messages generated by Xorg.

# kernel boots; Xorg won't start. See Xorg_err.log attached.
5.3.0-rc3-01537-g6a3068065fa4
5.3.0-rc3-00782-gb0383c0653c4
5.3.0-rc1-00391-g54fc01b775fe
5.3.0-rc1-00366-g2e3c9ec4d151
5.3.0-rc1-00365-gb96f3e7c8069

Today, I upgraded the kernel to 5.3.0-next-20190919, which booted fine
with no Xorg regressions to report.

I wonder whether the earlier kernels failed to boot for me because the
changes introduced by the 'bad' commits were incomplete at that point.

Thanks to all of you for the tips on how to proceed with bisection.
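
For anyone repeating the log edit described above (changing a verdict
recorded in the bisect log before running 'git bisect replay'), here is
a minimal Python sketch of the text change involved. The helper name
flip_verdict and the sample log are made up for illustration; the sample
is not my actual bisect log, it only follows the layout that
'git bisect log' prints.

```python
def flip_verdict(log_text: str, sha_prefix: str, new_verdict: str) -> str:
    """Rewrite the 'git bisect good|bad <sha>' entry for one commit.

    Lines starting with '#' are comments in the log and are left as they were.
    """
    if new_verdict not in ("good", "bad"):
        raise ValueError("verdict must be 'good' or 'bad'")
    lines = log_text.splitlines()
    changed = False
    for i, line in enumerate(lines):
        parts = line.split()
        if (len(parts) == 4
                and parts[:2] == ["git", "bisect"]
                and parts[2] in ("good", "bad")
                and parts[3].startswith(sha_prefix)):
            lines[i] = f"git bisect {new_verdict} {parts[3]}"
            changed = True
    if not changed:
        raise ValueError(f"no verdict recorded for {sha_prefix}")
    return "\n".join(lines) + "\n"


# Illustrative sample only (hypothetical log content).
SAMPLE_LOG = """\
git bisect start
# good: [1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715] drm/ttm: use gem reservation object
git bisect good 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715
"""

edited = flip_verdict(SAMPLE_LOG, "1e053b10ba60", "bad")
print(edited)
```

The edited text would then be saved to a file and fed back with
'git bisect reset' followed by 'git bisect replay <file>'. Editing the
file by hand works just as well; the sketch only shows which line
changes.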