* general protection fault in vmx_vcpu_run
@ 2018-04-12  9:45 syzbot
  2018-04-14  1:07 ` syzbot
  0 siblings, 1 reply; 10+ messages in thread
From: syzbot @ 2018-04-12  9:45 UTC (permalink / raw)
  To: hpa, kvm, linux-kernel, mingo, pbonzini, rkrcmar, syzkaller-bugs,
	tglx, x86

Hello,

syzbot hit the following crash on upstream commit
b284d4d5a6785f8cd07eda2646a95782373cd01e (Tue Apr 10 19:25:30 2018 +0000)
Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550

So far this crash happened 2 times on upstream.
Unfortunately, I don't have any reproducer for this crash yet.
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5622460745515008
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-1223000601505858474
compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+cc483201a3c6436d3550@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 6155 Comm: syz-executor1 Not tainted 4.16.0+ #19
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:vmx_vcpu_run+0x95f/0x25f0 arch/x86/kvm/vmx.c:9746
RSP: 0018:ffff8801ac217368 EFLAGS: 00010002
RAX: ffffed003725a3a0 RBX: ffff8801c751ab00 RCX: 1ffff10035842e78
RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff8801b92d1cf8
RBP: ffff8801db107a18 R08: ffffed003725a3a0 R09: ffffed003725a39f
R10: ffffed003725a39f R11: ffff8801b92d1cfb R12: ffff8801b92d1c20
R13: 1ffff1003b620f2a R14: ffff8801d7448300 R15: ffff8801c751ab90
FS:  00007f47f8675700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001b0d04000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  save_stack+0xa9/0xd0 mm/kasan/kasan.c:454
  set_track mm/kasan/kasan.c:460 [inline]
  kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
  slab_post_alloc_hook mm/slab.h:444 [inline]
  slab_alloc mm/slab.c:3392 [inline]
  kmem_cache_alloc+0x11b/0x760 mm/slab.c:3552
  getname_flags+0xd0/0x5a0 fs/namei.c:140
  getname+0x19/0x20 fs/namei.c:211
  do_sys_open+0x38e/0x770 fs/open.c:1087
  SYSC_openat fs/open.c:1120 [inline]
  SyS_openat+0x30/0x40 fs/open.c:1114
  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x455259
RSP: 002b:00007f47f8674c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f47f86756d4 RCX: 0000000000455259
RDX: 0000000000000000 RSI: 00000000200001c0 RDI: ffffffffffffff9c
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000042f R14: 00000000006f9508 R15: 0000000000000000
Code: 8b a9 68 03 00 00 4c 8b b1 70 03 00 00 4c 8b b9 78 03 00 00 48 8b 89 08 03 00 00 75 05 0f 01 c2 eb 03 0f 01 c3 48 89 4c 24 08 59 <0f> 96 81 88 56 00 00 48 89 81 00 03 00 00 48 89 99 18 03 00 00
RIP: vmx_vcpu_run+0x95f/0x25f0 arch/x86/kvm/vmx.c:9746 RSP: ffff8801ac217368
---[ end trace c157ab9734a00941 ]---


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug report.
Note: all commands must start from beginning of the line in the email body.
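
As an illustration of the rule that commands must start at the beginning of a
line, here is a minimal sketch of how a reply body could be scanned for the
three commands listed above. This is not syzbot's actual parser; the names
SYZ_CMD and parse_syz_commands are made up for this example, and only the
"fix", "dup" and "invalid" commands from this footer are recognised.

```python
import re

# Hypothetical pattern: the command must begin in column 0, so quoted
# ("> #syz ...") or indented copies are ignored.
SYZ_CMD = re.compile(r"^#syz (fix|dup|invalid)(?::\s*(.*))?$")


def parse_syz_commands(body):
    """Return a list of (command, argument) pairs found in an email body."""
    commands = []
    for line in body.splitlines():
        match = SYZ_CMD.match(line.rstrip())
        if match:
            commands.append((match.group(1), (match.group(2) or "").strip()))
    return commands


reply = """\
> #syz invalid
  #syz invalid
#syz dup: BUG: unable to handle kernel paging request in vmx_vcpu_run
"""
print(parse_syz_commands(reply))
```

Only the last line of the sample reply is picked up; the quoted and the
indented copies do not start at the beginning of the line.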

* Re: general protection fault in vmx_vcpu_run
  2018-04-12  9:45 general protection fault in vmx_vcpu_run syzbot
@ 2018-04-14  1:07 ` syzbot
  2018-06-28  5:27   ` Dmitry Vyukov
  0 siblings, 1 reply; 10+ messages in thread
From: syzbot @ 2018-04-14  1:07 UTC (permalink / raw)
  To: hpa, kvm, linux-kernel, mingo, pbonzini, rkrcmar, syzkaller-bugs,
	tglx, x86

syzbot has found a reproducer for the following crash on upstream commit
1bad9ce155a7c010a9a5f3261ad12a6a8eccfb2c (Fri Apr 13 19:27:11 2018 +0000)
Merge tag 'sh-for-4.17' of git://git.libc.org/linux-sh
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550

So far this crash happened 4 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6257386297753600
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=4808329293463552
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=4943675322793984
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+cc483201a3c6436d3550@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
8021q: adding VLAN 0 to HW filter on device team0
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 6472 Comm: syzkaller667776 Not tainted 4.16.0+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:vmx_vcpu_run+0x95f/0x25f0 arch/x86/kvm/vmx.c:9746
RSP: 0018:ffff8801c95bf368 EFLAGS: 00010002
RAX: ffff8801b44df6e8 RBX: ffff8801ada0ec40 RCX: 1ffff100392b7e78
RDX: 0000000000000000 RSI: ffffffff81467b15 RDI: ffff8801ada0ec50
RBP: ffff8801b44df790 R08: ffff8801c4efe780 R09: fffffbfff1141218
R10: fffffbfff1141218 R11: ffffffff88a090c3 R12: ffff8801b186aa90
R13: ffff8801ae61e000 R14: dffffc0000000000 R15: ffff8801ae61e3e0
FS:  00007fa147982700(0000) GS:ffff8801db000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001d780d000 CR4: 00000000001426f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
Code: 8b a9 68 03 00 00 4c 8b b1 70 03 00 00 4c 8b b9 78 03 00 00 48 8b 89 08 03 00 00 75 05 0f 01 c2 eb 03 0f 01 c3 48 89 4c 24 08 59 <0f> 96 81 88 56 00 00 48 89 81 00 03 00 00 48 89 99 18 03 00 00
RIP: vmx_vcpu_run+0x95f/0x25f0 arch/x86/kvm/vmx.c:9746 RSP: ffff8801c95bf368
---[ end trace ffd91ebc3bb06b01 ]---
Kernel panic - not syncing: Fatal exception
Shutting down cpus with NMI
Dumping ftrace buffer:
    (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

* Re: general protection fault in vmx_vcpu_run
  2018-04-14  1:07 ` syzbot
@ 2018-06-28  5:27   ` Dmitry Vyukov
  2018-06-28 17:18     ` Jim Mattson
  0 siblings, 1 reply; 10+ messages in thread
From: Dmitry Vyukov @ 2018-06-28  5:27 UTC (permalink / raw)
  To: syzbot
  Cc: H. Peter Anvin, KVM list, LKML, Ingo Molnar, Paolo Bonzini,
	Radim Krčmář,
	syzkaller-bugs, Thomas Gleixner, the arch/x86 maintainers

On Sat, Apr 14, 2018 at 3:07 AM, syzbot
<syzbot+cc483201a3c6436d3550@syzkaller.appspotmail.com> wrote:
> syzbot has found reproducer for the following crash on upstream commit
> 1bad9ce155a7c010a9a5f3261ad12a6a8eccfb2c (Fri Apr 13 19:27:11 2018 +0000)
> Merge tag 'sh-for-4.17' of git://git.libc.org/linux-sh
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550
>
> So far this crash happened 4 times on upstream.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6257386297753600
> syzkaller reproducer:
> https://syzkaller.appspot.com/x/repro.syz?id=4808329293463552
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=4943675322793984
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+cc483201a3c6436d3550@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed.

#syz dup: BUG: unable to handle kernel paging request in vmx_vcpu_run


> IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
> IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
> 8021q: adding VLAN 0 to HW filter on device team0
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] SMP KASAN
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 6472 Comm: syzkaller667776 Not tainted 4.16.0+ #1
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:vmx_vcpu_run+0x95f/0x25f0 arch/x86/kvm/vmx.c:9746
> RSP: 0018:ffff8801c95bf368 EFLAGS: 00010002
> RAX: ffff8801b44df6e8 RBX: ffff8801ada0ec40 RCX: 1ffff100392b7e78
> RDX: 0000000000000000 RSI: ffffffff81467b15 RDI: ffff8801ada0ec50
> RBP: ffff8801b44df790 R08: ffff8801c4efe780 R09: fffffbfff1141218
> R10: fffffbfff1141218 R11: ffffffff88a090c3 R12: ffff8801b186aa90
> R13: ffff8801ae61e000 R14: dffffc0000000000 R15: ffff8801ae61e3e0
> FS:  00007fa147982700(0000) GS:ffff8801db000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 00000001d780d000 CR4: 00000000001426f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> Code: 8b a9 68 03 00 00 4c 8b b1 70 03 00 00 4c 8b b9 78 03 00 00 48 8b 89 08 03 00 00 75 05 0f 01 c2 eb 03 0f 01 c3 48 89 4c 24 08 59 <0f> 96 81 88 56 00 00 48 89 81 00 03 00 00 48 89 99 18 03 00 00
> RIP: vmx_vcpu_run+0x95f/0x25f0 arch/x86/kvm/vmx.c:9746 RSP: ffff8801c95bf368
> ---[ end trace ffd91ebc3bb06b01 ]---
> Kernel panic - not syncing: Fatal exception
> Shutting down cpus with NMI
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..

* Re: general protection fault in vmx_vcpu_run
  2018-06-28  5:27   ` Dmitry Vyukov
@ 2018-06-28 17:18     ` Jim Mattson
  2018-06-30  8:09         ` Raslan, KarimAllah
  0 siblings, 1 reply; 10+ messages in thread
From: Jim Mattson @ 2018-06-28 17:18 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, H. Peter Anvin, KVM list, LKML, Ingo Molnar,
	Paolo Bonzini, Radim Krčmář,
	syzkaller-bugs, Thomas Gleixner, the arch/x86 maintainers

  22: 0f 01 c3              vmresume
  25: 48 89 4c 24 08        mov    %rcx,0x8(%rsp)
  2a: 59                    pop    %rcx

<rip>:
  2b: 0f 96 81 88 56 00 00 setbe  0x5688(%rcx)
  32: 48 89 81 00 03 00 00 mov    %rax,0x300(%rcx)
  39: 48 89 99 18 03 00 00 mov    %rbx,0x318(%rcx)

%rcx should be pointing to the vcpu_vmx structure, but it's not even
canonical: 1ffff10035842e78.
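
For reference, a minimal sketch of the canonical-address test being applied
here. It assumes 48-bit virtual addresses (4-level paging), which is an
assumption about the machine rather than something stated in the report; the
helper name is_canonical is made up for illustration. The two values are the
RCX and RSP from the first report in this thread.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    addr &= (1 << 64) - 1
    upper = addr >> (va_bits - 1)             # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits == 48
    return upper == 0 or upper == all_ones


rcx = 0x1ffff10035842e78  # RCX from the first report
rsp = 0xffff8801ac217368  # RSP from the same report

print(hex(rcx), is_canonical(rcx))  # non-canonical
print(hex(rsp), is_canonical(rsp))  # canonical kernel address
```

A memory access through a non-canonical address raises a general protection
fault rather than a page fault, which matches the "general protection fault"
line in the report.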

* Re: general protection fault in vmx_vcpu_run
  2018-06-28 17:18     ` Jim Mattson
@ 2018-06-30  8:09         ` Raslan, KarimAllah
  0 siblings, 0 replies; 10+ messages in thread
From: Raslan, KarimAllah @ 2018-06-30  8:09 UTC (permalink / raw)
  To: jmattson, dvyukov
  Cc: kvm, linux-kernel, tglx, syzbot+cc483201a3c6436d3550, x86, hpa,
	mingo, pbonzini, syzkaller-bugs, rkrcmar

Looking also at the other crash [0]:

        msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
ffffffff811f65b7:       e8 44 cb 57 00          callq  ffffffff81773100 <__sanitizer_cov_trace_pc>
ffffffff811f65bc:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
ffffffff811f65c1:       48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
ffffffff811f65c8:       fc ff df
ffffffff811f65cb:       48 c1 ea 03             shr    $0x3,%rdx
ffffffff811f65cf:       80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)        <- fault here.
ffffffff811f65d3:       0f 85 36 19 00 00       jne    ffffffff811f7f0f <vmx_vcpu_run+0x236f>

%rdx should contain a pointer to loaded_vmcs. It is directly loaded 
from the stack [0x8(%rsp)]. This same stack location was just used 
before the inlined assembly for VMRESUME/VMLAUNCH here:

        vmx->__launched = vmx->loaded_vmcs->launched;
ffffffff811f639f:       e8 5c cd 57 00          callq  ffffffff81773100 <__sanitizer_cov_trace_pc>
ffffffff811f63a4:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
ffffffff811f63a9:       48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
ffffffff811f63b0:       fc ff df
ffffffff811f63b3:       48 c1 ea 03             shr    $0x3,%rdx
ffffffff811f63b7:       80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)        <- used here.

... and this stack location was never touched by anything in between! 
So something must have corrupted the stack itself, rather than the
kvm_vcpu struct.
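
To make the faulting check concrete: the movabs/shr/cmpb sequence in the
listings above is what an inline KASAN shadow lookup looks like, i.e. the byte
at (addr >> 3) + 0xdffffc0000000000 is compared against zero before the real
access. The sketch below only redoes that arithmetic. The function names are
made up, 48-bit canonical addresses are assumed, and the sample pointer is
simply the RSP value quoted in this mail plus 8, used as an example of a sane
stack address.

```python
KASAN_SHADOW_OFFSET = 0xdffffc0000000000  # constant loaded by the movabs
MASK64 = (1 << 64) - 1


def kasan_shadow(addr):
    """Shadow byte address for addr, as computed by shr $3 plus the offset."""
    return ((addr >> 3) + KASAN_SHADOW_OFFSET) & MASK64


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) are all equal (assumes 48-bit VAs)."""
    upper = (addr & MASK64) >> (va_bits - 1)
    return upper == 0 or upper == (1 << (64 - va_bits + 1)) - 1


null_shadow = kasan_shadow(0)
sane_shadow = kasan_shadow(0xffff8801b7d7f388)

print(hex(null_shadow), is_canonical(null_shadow))
print(hex(sane_shadow), is_canonical(sane_shadow))
```

If the reloaded value were NULL, the shadow address would be the bare offset,
which is non-canonical, so the cmpb itself would take a general protection
fault. That would be consistent with the "GPF could be caused by NULL-ptr
deref" hint KASAN prints; a garbage value whose shadow address happens to be
canonical but unmapped would show up as a page fault instead.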

Obviously the inlined assembly block is using the stack as well, but I 
can not see anything that would cause this corruption there.

That being said, looking at the %rsp and %rbp values that are dumped
in the stack trace:

RSP: ffff8801b7d7f380
RBP: ffff8801b8260140

... they are almost 4.9 MiB apart! Shouldn't these two registers be a
bit closer to each other? :)
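
Checking the arithmetic: a minimal sketch that computes the gap between the
two dumped values and tests whether they could lie on the same kernel stack.
THREAD_SIZE = 32 KiB is an assumption (x86-64 with KASAN enabled; it would be
16 KiB without KASAN), as is the premise that stacks are THREAD_SIZE-aligned.
The helper same_stack is a name made up here.

```python
RSP = 0xffff8801b7d7f380
RBP = 0xffff8801b8260140

# Assumption: 8 pages per kernel stack on x86-64 when KASAN is enabled.
THREAD_SIZE = 32 * 1024


def same_stack(a, b, size=THREAD_SIZE):
    """True if both addresses fall in the same size-aligned stack region."""
    return a // size == b // size


delta = RBP - RSP
print(hex(delta), round(delta / 2**20, 2), "MiB")
print("same stack:", same_stack(RSP, RBP))
```

The gap is 0x4e0dc0 bytes, far more than a single stack under either stack
size, so at least one of the two values cannot be a pointer into the current
task's stack if %rbp is being used as a frame pointer.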

So 2 possibilities here:

1- %rsp is wrong

That would explain why the loaded_vmcs was NULL. However, it is a bit 
harder to understand how it became wrong! It should have been restored 
during the VMEXIT from the HOST_RSP value in the VMCS!

Is this a nested setup?

2- %rbp is wrong

That would also explain why the loaded_vmcs was NULL. Whatever
corrupted the stack that caused loaded_vmcs to be NULL could have also
corrupted the %rbp saved in the stack. That would mean that it happened
during a function call. All function calls that happened between the
point when the stack was sane (just before the "asm" block for
VMLAUNCH) and the crash-site are only kcov related. Looking at kcov, I
can not see where the stack would get corrupted though! Obviously
another source of corruption can be a completely unrelated thread
directly corrupting this thread's memory.

Maybe it would be easier to just try to repro it first and see which 
one is true (if at all).

[0] https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550


On Thu, 2018-06-28 at 10:18 -0700, Jim Mattson wrote:
>   22: 0f 01 c3              vmresume
>   25: 48 89 4c 24 08        mov    %rcx,0x8(%rsp)
>   2a: 59                    pop    %rcx
> 
> <rip>:
>   2b: 0f 96 81 88 56 00 00 setbe  0x5688(%rcx)
>   32: 48 89 81 00 03 00 00 mov    %rax,0x300(%rcx)
>   39: 48 89 99 18 03 00 00 mov    %rbx,0x318(%rcx)
> 
> %rcx should be pointing to the vcpu_vmx structure, but it's not even
> canonical: 1ffff10035842e78.
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

* Re: general protection fault in vmx_vcpu_run
  2018-06-30  8:09         ` Raslan, KarimAllah
@ 2018-07-04 19:31           ` Raslan, KarimAllah
  -1 siblings, 0 replies; 10+ messages in thread
From: Raslan, KarimAllah @ 2018-07-04 19:31 UTC (permalink / raw)
  To: jmattson, dvyukov
  Cc: kvm, linux-kernel, tglx, syzbot+cc483201a3c6436d3550, x86, hpa,
	mingo, pbonzini, syzkaller-bugs, rkrcmar

Dmitry,

Can you share the host kernel version?

I can not reproduce any of these crash signatures and I think it's 
really a nested virtualization bug. So I will need the exact host 
kernel version as well.

I am currently getting all sorts of:

"KVM: entry failed, hardware error 0x7"

... instead of the crash signatures that you are posting.

Regards.

On Sat, 2018-06-30 at 08:09 +0000, Raslan, KarimAllah wrote:
> Looking also at the other crash [0]:
> 
>         msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
> ffffffff811f65b7:       e8 44 cb 57 00          callq  ffffffff81773100 <__sanitizer_cov_trace_pc>
> ffffffff811f65bc:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
> ffffffff811f65c1:       48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
> ffffffff811f65c8:       fc ff df
> ffffffff811f65cb:       48 c1 ea 03             shr    $0x3,%rdx
> ffffffff811f65cf:       80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)        <- fault here.
> ffffffff811f65d3:       0f 85 36 19 00 00       jne    ffffffff811f7f0f <vmx_vcpu_run+0x236f>
> 
> %rdx should contain a pointer to loaded_vmcs. It is directly loaded 
> from the stack [0x8(%rsp)]. This same stack location was just used 
> before the inlined assembly for VMRESUME/VMLAUNCH here:
> 
>         vmx->__launched = vmx->loaded_vmcs->launched;
> ffffffff811f639f:       e8 5c cd 57 00          callq  ffffffff81773100 <__sanitizer_cov_trace_pc>
> ffffffff811f63a4:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
> ffffffff811f63a9:       48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
> ffffffff811f63b0:       fc ff df
> ffffffff811f63b3:       48 c1 ea 03             shr    $0x3,%rdx
> ffffffff811f63b7:       80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)        <- used here.
> 
> ... and this stack location was never touched by anything in between! 
> So something must have corrupted the stack itself, rather than the
> kvm_vcpu struct.
> 
> Obviously the inlined assembly block is using the stack as well, but I 
> can not see anything that would cause this corruption there.
> 
> That being said, looking at the %rsp and %rbp values that are dumped
> in the stack trace:
> 
> RSP: ffff8801b7d7f380
> RBP: ffff8801b8260140
> 
> ... they are almost 4.9 MiB apart! Shouldn't these two registers be a
> bit closer to each other? :)
> 
> So 2 possibilities here:
> 
> 1- %rsp is wrong
> 
> That would explain why the loaded_vmcs was NULL. However, it is a bit 
> harder to understand how it became wrong! It should have been restored 
> during the VMEXIT from the HOST_RSP value in the VMCS!
> 
> Is this a nested setup?
> 
> 2- %rbp is wrong
> 
> That would also explain why the loaded_vmcs was NULL. Whatever
> corrupted the stack that caused loaded_vmcs to be NULL could have also
> corrupted the %rbp saved in the stack. That would mean that it happened
> during a function call. All function calls that happened between the
> point when the stack was sane (just before the "asm" block for
> VMLAUNCH) and the crash-site are only kcov related. Looking at kcov, I
> can not see where the stack would get corrupted though! Obviously
> another source of corruption can be a completely unrelated thread
> directly corrupting this thread's memory.
> 
> Maybe it would be easier to just try to repro it first and see which 
> one is true (if at all).
> 
> [0] https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550
> 
> 
> On Thu, 2018-06-28 at 10:18 -0700, Jim Mattson wrote:
> > 
> >   22: 0f 01 c3              vmresume
> >   25: 48 89 4c 24 08        mov    %rcx,0x8(%rsp)
> >   2a: 59                    pop    %rcx
> > 
> > <rip>:
> >   2b: 0f 96 81 88 56 00 00 setbe  0x5688(%rcx)
> >   32: 48 89 81 00 03 00 00 mov    %rax,0x300(%rcx)
> >   39: 48 89 99 18 03 00 00 mov    %rbx,0x318(%rcx)
> > 
> > %rcx should be pointing to the vcpu_vmx structure, but it's not even
> > canonical: 1ffff10035842e78.
> > 

* Re: general protection fault in vmx_vcpu_run
  2018-07-04 19:31           ` Raslan, KarimAllah
@ 2018-07-05  5:32             ` Dmitry Vyukov
  -1 siblings, 0 replies; 10+ messages in thread
From: Dmitry Vyukov @ 2018-07-05  5:32 UTC (permalink / raw)
  To: Raslan, KarimAllah
  Cc: jmattson, kvm, linux-kernel, tglx, syzbot+cc483201a3c6436d3550,
	x86, hpa, mingo, pbonzini, syzkaller-bugs, rkrcmar

On Wed, Jul 4, 2018 at 9:31 PM, Raslan, KarimAllah <karahmed@amazon.de> wrote:
> Dmitry,
>
> Can you share the host kernel version?
>
> I can not reproduce any of these crash signatures and I think it's
> really a nested virtualization bug. So I will need the exact host
> kernel version as well.
>
> I am currently getting all sorts of:
>
> "KVM: entry failed, hardware error 0x7"
>
> ... instead of the crash signatures that you are posting.


Hi Raslan,

The tested kernel runs as a GCE VM.
Jim, how can we describe the host kernel for GCE? Potentially only we
can debug this.


> On Sat, 2018-06-30 at 08:09 +0000, Raslan, KarimAllah wrote:
>> Looking also at the other crash [0]:
>>
>>         msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
>> ffffffff811f65b7:       e8 44 cb 57 00          callq  ffffffff81773100 <__sanitizer_cov_trace_pc>
>> ffffffff811f65bc:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
>> ffffffff811f65c1:       48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
>> ffffffff811f65c8:       fc ff df
>> ffffffff811f65cb:       48 c1 ea 03             shr    $0x3,%rdx
>> ffffffff811f65cf:       80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)        <- fault here.
>> ffffffff811f65d3:       0f 85 36 19 00 00       jne    ffffffff811f7f0f <vmx_vcpu_run+0x236f>
>>
>> %rdx should contain a pointer to loaded_vmcs. It is directly loaded
>> from the stack [0x8(%rsp)]. This same stack location was just used
>> before the inlined assembly for VMRESUME/VMLAUNCH here:
>>
>>         vmx->__launched = vmx->loaded_vmcs->launched;
>> ffffffff811f639f:       e8 5c cd 57 00          callq  ffffffff81773100 <__sanitizer_cov_trace_pc>
>> ffffffff811f63a4:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
>> ffffffff811f63a9:       48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
>> ffffffff811f63b0:       fc ff df
>> ffffffff811f63b3:       48 c1 ea 03             shr    $0x3,%rdx
>> ffffffff811f63b7:       80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)        <- used here.
>>
>> ... and this stack location was never touched by anything in between!
>> So something must have corrupted the stack itself rather than the
>> kvm_vcpu struct.
>>
>> Obviously the inlined assembly block is using the stack as well, but I
>> cannot see anything that would cause this corruption there.
>>
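The `shr $0x3,%rdx` / `cmpb $0x0,(%rdx,%rax,1)` pair in both listings above is the inline KASAN check: the pointer is shifted right by 3 and added to the shadow offset loaded by the `movabs`. The following is a minimal sketch of that arithmetic only, assuming the offset shown in the disassembly and one shadow byte per 8 bytes of memory. The function and constant names are illustrative and are not taken from the kernel source.

```python
# Sketch of the address arithmetic behind the inline KASAN check above.
# Assumptions: the shadow offset is the movabs immediate from the listing,
# and one shadow byte covers 8 bytes of memory. Names are illustrative.

KASAN_SHADOW_OFFSET = 0xDFFFFC0000000000  # immediate loaded into %rax
SHADOW_SCALE_SHIFT = 3                    # matches `shr $0x3,%rdx`
MASK64 = (1 << 64) - 1


def shadow_address(addr: int) -> int:
    """Address that `cmpb $0x0,(%rdx,%rax,1)` reads for a given pointer."""
    return ((addr >> SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & MASK64


# An example kernel address (the RSP value quoted in this message) and NULL.
for label, ptr in (("kernel pointer", 0xFFFF8801B7D7F380), ("NULL", 0x0)):
    print(f"{label:>14}: ptr={ptr:#018x} shadow={shadow_address(ptr):#018x}")
```

For a NULL pointer the shadow address is the offset itself, which is not a canonical x86-64 address, so the `cmpb` raises a general protection fault rather than a page fault. That is consistent with the "GPF could be caused by NULL-ptr deref" hint KASAN prints in the original report.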
>> That being said, looking at the %rsp and %rbp values that are dumped
>> in the stack trace:
>>
>> RSP: ffff8801b7d7f380
>> RBP: ffff8801b8260140
>>
>> ... they are almost 4.9 MiB apart! Shouldn't these two registers be a
>> bit closer to each other? :)
>>
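The distance between the two register values can be checked directly. This is a small sketch, assuming %rbp is being used as a frame pointer and assuming a 32 KiB kernel stack (the usual x86-64 size with KASAN enabled, which this message does not confirm). `could_share_stack` is a hypothetical helper, not a kernel function.

```python
# Distance between the RSP and RBP values quoted above.
# THREAD_SIZE is an assumption (x86-64 kernel stack with KASAN enabled);
# adjust it for other configurations.

RSP = 0xFFFF8801B7D7F380
RBP = 0xFFFF8801B8260140
THREAD_SIZE = 32 * 1024  # assumed kernel stack size in bytes


def could_share_stack(rsp: int, rbp: int, thread_size: int = THREAD_SIZE) -> bool:
    """Necessary condition only: two addresses on one stack are < thread_size apart."""
    return abs(rbp - rsp) < thread_size


gap = RBP - RSP
print(f"gap = {gap:#x} bytes ({gap / 2**20:.2f} MiB)")
print("could be on the same kernel stack:", could_share_stack(RSP, RBP))
```

The gap is 0x4e0dc0 bytes, about 4.88 MiB, which is far larger than any single kernel stack under the stated assumption. At least one of the two registers therefore does not point into the current stack.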
>> So 2 possibilities here:
>>
>> 1- %rsp is wrong
>>
>> That would explain why the loaded_vmcs was NULL. However, it is a bit
>> harder to understand how it became wrong! It should have been restored
>> during the VMEXIT from the HOST_RSP value in the VMCS!
>>
>> Is this a nested setup?
>>
>> 2- %rbp is wrong
>>
>> That would also explain why the loaded_vmcs was NULL. Whatever
>> corrupted the stack that caused loaded_vmcs to be NULL could have also
>> corrupted the %rbp saved in the stack. That would mean that it happened
>> during a function call. All function calls that happened between the
>> point when the stack was sane (just before the "asm" block for
>> VMLAUNCH) and the crash-site are only kcov related. Looking at kcov, I
>> can not see where the stack would get corrupted though! Obviously
>> another source of corruption can be a completely unrelated thread
>> directly corrupting this thread's memory.
>>
>> Maybe it would be easier to just try to repro it first and see which
>> one is true (if at all).
>>
>> [0] https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550
>>
>>
>> On Thu, 2018-06-28 at 10:18 -0700, Jim Mattson wrote:
>> >
>> >   22: 0f 01 c3              vmresume
>> >   25: 48 89 4c 24 08        mov    %rcx,0x8(%rsp)
>> >   2a: 59                    pop    %rcx
>> >
>> > <rip>:
>> >   2b: 0f 96 81 88 56 00 00 setbe  0x5688(%rcx)
>> >   32: 48 89 81 00 03 00 00 mov    %rax,0x300(%rcx)
>> >   39: 48 89 99 18 03 00 00 mov    %rbx,0x318(%rcx)
>> >
>> > %rcx should be pointing to the vcpu_vmx structure, but it's not even
>> > canonical: 1ffff10035842e78.
>> >
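The claim that this %rcx value is not canonical can be verified mechanically. The sketch below assumes 48-bit virtual addresses (4-level paging). The RSP constant is taken from the register dump in the original report at the top of this thread, and `is_canonical` is an illustrative helper.

```python
# Canonical-address check for the %rcx value quoted above.
# Assumption: 48-bit virtual addresses (4-level paging).

VA_BITS = 48
MASK64 = (1 << 64) - 1


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """True if bits 63..(va_bits - 1) are all zeros or all ones."""
    top = addr >> (va_bits - 1)
    return top in (0, (1 << (64 - va_bits + 1)) - 1)


RCX = 0x1FFFF10035842E78
RSP = 0xFFFF8801AC217368  # RSP from the original report's register dump

undone = (RCX << 3) & MASK64  # reverse a possible `shr $0x3`
print(f"rcx canonical:      {is_canonical(RCX)}")
print(f"rcx << 3:           {undone:#018x} (canonical: {is_canonical(undone)})")
print(f"offset from RSP:    {undone - RSP:#x}")
```

Shifting the value left by 3 gives a canonical address 0x58 bytes above the reported RSP. That is consistent with %rcx holding an `addr >> 3` intermediate from a KASAN check on a stack slot, but it does not prove it, and it does not by itself explain how the value ended up in %rcx.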
> Amazon Development Center Germany GmbH
> Berlin - Dresden - Aachen
> main office: Krausenstr. 38, 10117 Berlin
> Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
> Ust-ID: DE289237879
> Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Thread overview: 10+ messages
2018-04-12  9:45 general protection fault in vmx_vcpu_run syzbot
2018-04-14  1:07 ` syzbot
2018-06-28  5:27   ` Dmitry Vyukov
2018-06-28 17:18     ` Jim Mattson
2018-06-30  8:09       ` Raslan, KarimAllah
2018-06-30  8:09         ` Raslan, KarimAllah
2018-07-04 19:31         ` Raslan, KarimAllah
2018-07-04 19:31           ` Raslan, KarimAllah
2018-07-05  5:32           ` Dmitry Vyukov
2018-07-05  5:32             ` Dmitry Vyukov
