* Re: kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU
       [not found] <37952f51-7687-672c-45d9-92ba418c9133@oracle.com>
@ 2019-06-12 16:12 ` Borislav Petkov
       [not found]   ` <af0054d1-1fc8-c106-b503-ca91da5a6fee@oracle.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2019-06-12 16:12 UTC (permalink / raw)
  To: George Kennedy; +Cc: joro, pbonzini, mingo, hpa, kvm, syzkaller

On Thu, May 30, 2019 at 02:25:23PM -0400, George Kennedy wrote:
> To arch/x86/kvm/svm.c maintainers,
> 
> Syzkaller hit this bug on an AMD CPU.
> 
> The host was running Ubuntu 18.04. The VM was running 5.2.0-rc1+

Can't trigger it here on a 5.2.0-rc4+ host and guest after running it
for half an hour.

Maybe the Ubuntu host kernel is missing some fixes...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU
       [not found]   ` <af0054d1-1fc8-c106-b503-ca91da5a6fee@oracle.com>
@ 2019-06-12 19:51     ` Borislav Petkov
  2019-06-12 20:54       ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2019-06-12 19:51 UTC (permalink / raw)
  To: George Kennedy; +Cc: joro, pbonzini, mingo, hpa, kvm, syzkaller

On Wed, Jun 12, 2019 at 02:45:34PM -0400, George Kennedy wrote:
> The crash can still be reproduced with the VM running upstream 5.2.0-rc4

That's clear.

> and the host running Ubuntu on an AMD CPU.

That's the important question: why can't I trigger it with 5.2.0-rc4+ as
the host, while you can with the Ubuntu kernel (4.15 or so)? I.e., what
changed upstream, or does the Ubuntu kernel have out-of-tree stuff?

Maybe kvm folks would have a better idea. That kvm_spurious_fault thing
is for:

/*
 * Hardware virtualization extension instructions may fault if a
 * reboot turns off virtualization while processes are running.
 * Trap the fault and ignore the instruction if that happens.
 */
asmlinkage void kvm_spurious_fault(void);

but you're not rebooting...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU
  2019-06-12 19:51     ` Borislav Petkov
@ 2019-06-12 20:54       ` Sean Christopherson
  2019-06-13  7:18         ` Borislav Petkov
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2019-06-12 20:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: George Kennedy, joro, pbonzini, mingo, hpa, kvm, syzkaller

On Wed, Jun 12, 2019 at 09:51:52PM +0200, Borislav Petkov wrote:
> On Wed, Jun 12, 2019 at 02:45:34PM -0400, George Kennedy wrote:
> > The crash can still be reproduced with the VM running upstream 5.2.0-rc4
> 
> That's clear.
> 
> > and the host running Ubuntu on an AMD CPU.
> 
> That's the important question: why can't I trigger it with 5.2.0-rc4+ as
> the host, while you can with the Ubuntu kernel (4.15 or so)? I.e., what
> changed upstream, or does the Ubuntu kernel have out-of-tree stuff?
> 
> Maybe kvm folks would have a better idea. That kvm_spurious_fault thing
> is for:
> 
> /*
>  * Hardware virtualization extension instructions may fault if a
>  * reboot turns off virtualization while processes are running.
>  * Trap the fault and ignore the instruction if that happens.
>  */
> asmlinkage void kvm_spurious_fault(void);
> 
> but you're not rebooting...

The reboot thing is a red herring. The ____kvm_handle_fault_on_reboot()
macro suppresses faults that occur on VMX and SVM instructions while the
kernel is rebooting (CPUs need to leave VMX/SVM mode to recognize INIT),
i.e. kvm_spurious_fault() is reached when a VMX or SVM instruction faults
and we're *not* rebooting.

TL;DR: an SVM instruction is faulting unexpectedly.
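
In other words, the idea behind the wrapper is roughly the following (a
simplified C sketch of the concept only -- the real macro is asm with an
exception-table fixup, and insn_faults() is a hypothetical stand-in for
"the wrapped VMX/SVM instruction raised a fault"):

extern bool kvm_rebooting;	/* set once a reboot is under way */
extern bool insn_faults(void);	/* hypothetical: run the insn, true on fault */

static void run_guarded_insn(void)
{
	if (!insn_faults())	/* VMLOAD, VMRUN, VMPTRLD, ... */
		return;		/* common case: no fault */
	if (kvm_rebooting)
		return;		/* reboot turned off SVM/VMX: forgive it */
	kvm_spurious_fault();	/* not rebooting: BUG(), i.e. this splat */
}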


* Re: kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU
  2019-06-12 20:54       ` Sean Christopherson
@ 2019-06-13  7:18         ` Borislav Petkov
       [not found]           ` <df80299b-8e1f-f48b-a26b-c163b4018d01@oracle.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2019-06-13  7:18 UTC (permalink / raw)
  To: Sean Christopherson, George Kennedy
  Cc: joro, pbonzini, mingo, hpa, kvm, syzkaller

On Wed, Jun 12, 2019 at 01:54:30PM -0700, Sean Christopherson wrote:
> The reboot thing is a red herring. The ____kvm_handle_fault_on_reboot()
> macro suppresses faults that occur on VMX and SVM instructions while the
> kernel is rebooting (CPUs need to leave VMX/SVM mode to recognize INIT),
> i.e. kvm_spurious_fault() is reached when a VMX or SVM instruction faults
> and we're *not* rebooting.
> 
> TL;DR: an SVM instruction is faulting unexpectedly.

Aha, thx!

And there are a couple of places in svm_vcpu_run() which can cause that:

[  135.498208] Call Trace:
[  135.498594]  svm_vcpu_run+0xa83/0x20e0

George, can you objdump the area around offset 0xa83 within svm_vcpu_run
of the guest kernel?

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU
       [not found]           ` <df80299b-8e1f-f48b-a26b-c163b4018d01@oracle.com>
@ 2019-06-18 17:51             ` Borislav Petkov
  2019-06-18 18:01               ` Dmitry Vyukov
  0 siblings, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2019-06-18 17:51 UTC (permalink / raw)
  To: George Kennedy
  Cc: Sean Christopherson, joro, pbonzini, mingo, hpa, kvm, syzkaller,
	Boris Ostrovsky

On Mon, Jun 17, 2019 at 11:13:22AM -0400, George Kennedy wrote:
>    319f3:       0f 01 da                vmload                       <--- svm_vcpu_run+0xa83

Hmm, so VMLOAD can fault for a bunch of reasons if you look at its
description in the APM.

Looking at your Code: section and building with your config, rIP and the
insns around it point to:

All code
========
   0:	00 55 89             	add    %dl,-0x77(%rbp)
   3:	d2 45 89             	rolb   %cl,-0x77(%rbp)
   6:	c9                   	leaveq 
   7:	48 89 e5             	mov    %rsp,%rbp
   a:	8b 45 18             	mov    0x18(%rbp),%eax
   d:	50                   	push   %rax
   e:	8b 45 10             	mov    0x10(%rbp),%eax
  11:	50                   	push   %rax
  12:	e8 8a 42 6b 00       	callq  0x6b42a1
  17:	58                   	pop    %rax
  18:	5a                   	pop    %rdx
  19:	c9                   	leaveq 
  1a:	c3                   	retq   
  1b:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  21:	66 66 66 66 90       	data16 data16 data16 xchg %ax,%ax
  26:	55                   	push   %rbp
  27:	48 89 e5             	mov    %rsp,%rbp
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  31:	66 66 66 66 90       	data16 data16 data16 xchg %ax,%ax
  36:	55                   	push   %rbp
  37:	48 89 e5             	mov    %rsp,%rbp
  3a:	41 55                	push   %r13
  3c:	41 54                	push   %r12
  3e:	41                   	rex.B
  3f:	89                   	.byte 0x89

0000000000017f30 <__bpf_trace_kvm_nested_vmexit>:
   17f30:       55                      push   %rbp
   17f31:       89 d2                   mov    %edx,%edx
   17f33:       45 89 c9                mov    %r9d,%r9d
   17f36:       48 89 e5                mov    %rsp,%rbp
   17f39:       8b 45 18                mov    0x18(%rbp),%eax
   17f3c:       50                      push   %rax
   17f3d:       8b 45 10                mov    0x10(%rbp),%eax
   17f40:       50                      push   %rax
   17f41:       e8 00 00 00 00          callq  17f46 <__bpf_trace_kvm_nested_vmexit+0x16>
   17f46:       58                      pop    %rax
   17f47:       5a                      pop    %rdx
   17f48:       c9                      leaveq 
   17f49:       c3                      retq   
   17f4a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000017f50 <kvm_spurious_fault>:
   17f50:       e8 00 00 00 00          callq  17f55 <kvm_spurious_fault+0x5>
   17f55:       55                      push   %rbp
   17f56:       48 89 e5                mov    %rsp,%rbp
   17f59:       0f 0b                   ud2    
   17f5b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

so the invalid opcode splat is dumping the insn bytes around the rIP
where the UD2 happens, but in order to see which of the VMLOAD
conditions causes the fault, you'd probably need to intercept the fault
handler and dump registers in the hypervisor to check.

There's something else that's bothering me though: your splat is from
the guest yet there is svm_vcpu_run() mentioned there which should be
called by the hypervisor and not by the guest. Unless you're doing
nested virt stuff...

Anyway, sorry that I cannot be of more help - I'm sure KVM guys would
make much more sense of it.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU
  2019-06-18 17:51             ` Borislav Petkov
@ 2019-06-18 18:01               ` Dmitry Vyukov
  2019-06-18 18:27                 ` Borislav Petkov
  0 siblings, 1 reply; 10+ messages in thread
From: Dmitry Vyukov @ 2019-06-18 18:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: George Kennedy, Sean Christopherson, Joerg Roedel, Paolo Bonzini,
	Ingo Molnar, H. Peter Anvin, KVM list, syzkaller,
	Boris Ostrovsky

On Tue, Jun 18, 2019 at 7:52 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Mon, Jun 17, 2019 at 11:13:22AM -0400, George Kennedy wrote:
> >    319f3:       0f 01 da                vmload                       <--- svm_vcpu_run+0xa83
>
> Hmm, so VMLOAD can fault for a bunch of reasons if you look at its
> description in the APM.
>
> Looking at your Code: section and building with your config, rIP and the
> insns around it point to:
>
> All code
> ========
>    0:   00 55 89                add    %dl,-0x77(%rbp)
>    3:   d2 45 89                rolb   %cl,-0x77(%rbp)
>    6:   c9                      leaveq
>    7:   48 89 e5                mov    %rsp,%rbp
>    a:   8b 45 18                mov    0x18(%rbp),%eax
>    d:   50                      push   %rax
>    e:   8b 45 10                mov    0x10(%rbp),%eax
>   11:   50                      push   %rax
>   12:   e8 8a 42 6b 00          callq  0x6b42a1
>   17:   58                      pop    %rax
>   18:   5a                      pop    %rdx
>   19:   c9                      leaveq
>   1a:   c3                      retq
>   1b:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
>   21:   66 66 66 66 90          data16 data16 data16 xchg %ax,%ax
>   26:   55                      push   %rbp
>   27:   48 89 e5                mov    %rsp,%rbp
>   2a:*  0f 0b                   ud2             <-- trapping instruction
>   2c:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   31:   66 66 66 66 90          data16 data16 data16 xchg %ax,%ax
>   36:   55                      push   %rbp
>   37:   48 89 e5                mov    %rsp,%rbp
>   3a:   41 55                   push   %r13
>   3c:   41 54                   push   %r12
>   3e:   41                      rex.B
>   3f:   89                      .byte 0x89
>
> 0000000000017f30 <__bpf_trace_kvm_nested_vmexit>:
>    17f30:       55                      push   %rbp
>    17f31:       89 d2                   mov    %edx,%edx
>    17f33:       45 89 c9                mov    %r9d,%r9d
>    17f36:       48 89 e5                mov    %rsp,%rbp
>    17f39:       8b 45 18                mov    0x18(%rbp),%eax
>    17f3c:       50                      push   %rax
>    17f3d:       8b 45 10                mov    0x10(%rbp),%eax
>    17f40:       50                      push   %rax
>    17f41:       e8 00 00 00 00          callq  17f46 <__bpf_trace_kvm_nested_vmexit+0x16>
>    17f46:       58                      pop    %rax
>    17f47:       5a                      pop    %rdx
>    17f48:       c9                      leaveq
>    17f49:       c3                      retq
>    17f4a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
>
> 0000000000017f50 <kvm_spurious_fault>:
>    17f50:       e8 00 00 00 00          callq  17f55 <kvm_spurious_fault+0x5>
>    17f55:       55                      push   %rbp
>    17f56:       48 89 e5                mov    %rsp,%rbp
>    17f59:       0f 0b                   ud2
>    17f5b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>
> so the invalid opcode splat is dumping the insn bytes around the rIP
> where the UD2 happens, but in order to see which of the VMLOAD
> conditions causes the fault, you'd probably need to intercept the fault
> handler and dump registers in the hypervisor to check.
>
> There's something else that's bothering me though: your splat is from
> the guest yet there is svm_vcpu_run() mentioned there which should be
> called by the hypervisor and not by the guest. Unless you're doing
> nested virt stuff...
>
> Anyway, sorry that I cannot be of more help - I'm sure KVM guys would
> make much more sense of it.

I am not a KVM folk either, but FWIW syzkaller is capable of creating
a double-nested VM. The code is somewhat VMX-specific, but it should
be capable of at least executing some SVM instructions inside a guest.
This code sets up a VM to run a given instruction sequence (it should be generic):
https://github.com/google/syzkaller/blob/34bf9440bd06034f86b5d9ac8afbf078129cbdae/executor/common_kvm_amd64.h
The instruction generator is based on Intel XED so it may be somewhat
Intel-biased, but at least I see some mentions of SVM there:
https://raw.githubusercontent.com/google/syzkaller/34bf9440bd06034f86b5d9ac8afbf078129cbdae/pkg/ifuzz/gen/all-enc-instructions.txt
And this code can set up a double-nested VM, but it's VMX-specific:
https://github.com/google/syzkaller/blob/34bf9440bd06034f86b5d9ac8afbf078129cbdae/executor/kvm.S#L125

If there is specific interest in testing SVM, it could make sense to
improve SVM support in syzkaller to at least match VMX.


* Re: kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU
  2019-06-18 18:01               ` Dmitry Vyukov
@ 2019-06-18 18:27                 ` Borislav Petkov
  2019-06-18 19:17                   ` Paolo Bonzini
                                     ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Borislav Petkov @ 2019-06-18 18:27 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: George Kennedy, Sean Christopherson, Joerg Roedel, Paolo Bonzini,
	Ingo Molnar, H. Peter Anvin, KVM list, syzkaller,
	Boris Ostrovsky

On Tue, Jun 18, 2019 at 08:01:06PM +0200, Dmitry Vyukov wrote:
> I am not a KVM folk either, but FWIW syzkaller is capable of creating
> a double-nested VM.

Aaaha, there it is. :)

> The code is somewhat VMX-specific, but it should
> be capable of at least executing some SVM instructions inside a guest.
> This code sets up a VM to run a given instruction sequence (it should be generic):
> https://github.com/google/syzkaller/blob/34bf9440bd06034f86b5d9ac8afbf078129cbdae/executor/common_kvm_amd64.h
> The instruction generator is based on Intel XED so it may be somewhat
> Intel-biased, but at least I see some mentions of SVM there:
> https://raw.githubusercontent.com/google/syzkaller/34bf9440bd06034f86b5d9ac8afbf078129cbdae/pkg/ifuzz/gen/all-enc-instructions.txt

Right, and that right there looks wrong:

ICLASS    : VMLOAD
CPL       : 3
CATEGORY  : SYSTEM
EXTENSION : SVM
ATTRIBUTES: PROTECTED_MODE
PATTERN   : 0x0F 0x01 MOD[0b11] MOD=3 REG[0b011] RM[0b010]
OPERANDS  : REG0=OrAX():r:IMPL

That is, *if* "CPL: 3" above means in XED context that VMLOAD is
supposed to be run at CPL 3, then this is wrong because VMLOAD #GPs if
CPL is not 0. Ditto for VMRUN and a couple of others.
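
For illustration, a quick user-space check (hypothetical test, not from
this thread) is to execute the raw VMLOAD opcode bytes (0f 01 da) at
CPL 3 and expect to die: #GP (delivered as SIGSEGV) when EFER.SVME is
set, #UD (SIGILL) when it is clear.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void handler(int sig)
{
	printf("VMLOAD at CPL 3 faulted as expected (signal %d)\n", sig);
	exit(0);
}

int main(void)
{
	signal(SIGSEGV, handler);	/* #GP path (EFER.SVME = 1) */
	signal(SIGILL, handler);	/* #UD path (EFER.SVME = 0) */
	asm volatile(".byte 0x0f, 0x01, 0xda" : : "a"(0L));	/* VMLOAD */
	printf("VMLOAD did not fault\n");
	return 1;
}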

Perhaps that support was added at some point but not really run on AMD
hw yet...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU
  2019-06-18 18:27                 ` Borislav Petkov
@ 2019-06-18 19:17                   ` Paolo Bonzini
  2019-06-18 19:34                   ` George Kennedy
  2019-06-23 13:15                   ` Dmitry Vyukov
  2 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2019-06-18 19:17 UTC (permalink / raw)
  To: Borislav Petkov, Dmitry Vyukov
  Cc: George Kennedy, Sean Christopherson, Joerg Roedel, Ingo Molnar,
	H. Peter Anvin, KVM list, syzkaller, Boris Ostrovsky

On 18/06/19 20:27, Borislav Petkov wrote:
> Right, and that right there looks wrong:
> 
> ICLASS    : VMLOAD
> CPL       : 3
> CATEGORY  : SYSTEM
> EXTENSION : SVM
> ATTRIBUTES: PROTECTED_MODE
> PATTERN   : 0x0F 0x01 MOD[0b11] MOD=3 REG[0b011] RM[0b010]
> OPERANDS  : REG0=OrAX():r:IMPL
> 
> That is, *if* "CPL: 3" above means in XED context that VMLOAD is
> supposed to be run at CPL 3, then this is wrong because VMLOAD #GPs if
> CPL is not 0. Ditto for VMRUN and a couple of others.

This should not be related, though: this is what syzkaller could place in
the guest, but the reproducer is much simpler and the VMLOAD fault is
genuinely happening in the host.

In particular, syz_kvm_setup_cpu's arguments are all zero so the guest
is basically doing nothing.

Paolo


* Re: kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU
  2019-06-18 18:27                 ` Borislav Petkov
  2019-06-18 19:17                   ` Paolo Bonzini
@ 2019-06-18 19:34                   ` George Kennedy
  2019-06-23 13:15                   ` Dmitry Vyukov
  2 siblings, 0 replies; 10+ messages in thread
From: George Kennedy @ 2019-06-18 19:34 UTC (permalink / raw)
  To: Borislav Petkov, Dmitry Vyukov
  Cc: Sean Christopherson, Joerg Roedel, Paolo Bonzini, Ingo Molnar,
	H. Peter Anvin, KVM list, syzkaller, Boris Ostrovsky

[-- Attachment #1: Type: text/plain, Size: 1504 bytes --]

The guest crash can be reproduced with only the syzkaller reproducer C
program (see attached) run from the guest; syzkaller itself is not running.

George

On 6/18/2019 2:27 PM, Borislav Petkov wrote:
> On Tue, Jun 18, 2019 at 08:01:06PM +0200, Dmitry Vyukov wrote:
>> I am not a KVM folk either, but FWIW syzkaller is capable of creating
>> a double-nested VM.
> Aaaha, there it is. :)
>
>> The code is somewhat VMX-specific, but it should
>> be capable at least executing some SVM instructions inside of guest.
>> This code setups VM to run a given instruction sequences (should be generic):
>> https://github.com/google/syzkaller/blob/34bf9440bd06034f86b5d9ac8afbf078129cbdae/executor/common_kvm_amd64.h
>> The instruction generator is based on Intel XED so it may be somewhat
>> Intel-biased, but at least I see some mentions of SVM there:
>> https://raw.githubusercontent.com/google/syzkaller/34bf9440bd06034f86b5d9ac8afbf078129cbdae/pkg/ifuzz/gen/all-enc-instructions.txt
> Right, and that right there looks wrong:
>
> ICLASS    : VMLOAD
> CPL       : 3
> CATEGORY  : SYSTEM
> EXTENSION : SVM
> ATTRIBUTES: PROTECTED_MODE
> PATTERN   : 0x0F 0x01 MOD[0b11] MOD=3 REG[0b011] RM[0b010]
> OPERANDS  : REG0=OrAX():r:IMPL
>
> That is, *if* "CPL: 3" above means in XED context that VMLOAD is
> supposed to be run at CPL 3, then this is wrong because VMLOAD #GPs if
> CPL is not 0. Ditto for VMRUN and a couple of others.
>
> Perhaps that support was added at some point but not really run on AMD
> hw yet...
>


[-- Attachment #2: amd_kvm_repro.c --]
[-- Type: text/plain, Size: 37879 bytes --]

// autogenerated by syzkaller (https://github.com/google/syzkaller)
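/*
 * Rough flow: main() maps 16MB of scratch memory at 0x20000000 and
 * forks 8 worker processes.  Each worker loops forever, forking a
 * child per iteration that opens /dev/kvm, creates a VM and a vCPU,
 * calls syz_kvm_setup_cpu() with an all-zero text/flags/opts
 * description, and then issues KVM_RUN; the parent kills the child
 * after ~5 seconds and repeats.
 */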

#define _GNU_SOURCE 

#include <dirent.h>
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#include <linux/kvm.h>

unsigned long long procid;

static __thread int skip_segv;
static __thread jmp_buf segv_env;

static void segv_handler(int sig, siginfo_t* info, void* ctx)
{
	uintptr_t addr = (uintptr_t)info->si_addr;
	const uintptr_t prog_start = 1 << 20;
	const uintptr_t prog_end = 100 << 20;
	if (__atomic_load_n(&skip_segv, __ATOMIC_RELAXED) && (addr < prog_start || addr > prog_end)) {
		_longjmp(segv_env, 1);
	}
	exit(sig);
}

static void install_segv_handler(void)
{
	struct sigaction sa;
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_IGN;
	syscall(SYS_rt_sigaction, 0x20, &sa, NULL, 8);
	syscall(SYS_rt_sigaction, 0x21, &sa, NULL, 8);
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_NODEFER | SA_SIGINFO;
	sigaction(SIGSEGV, &sa, NULL);
	sigaction(SIGBUS, &sa, NULL);
}

#define NONFAILING(...) { __atomic_fetch_add(&skip_segv, 1, __ATOMIC_SEQ_CST); if (_setjmp(segv_env) == 0) { __VA_ARGS__; } __atomic_fetch_sub(&skip_segv, 1, __ATOMIC_SEQ_CST); }

static void sleep_ms(uint64_t ms)
{
	usleep(ms * 1000);
}

static uint64_t current_time_ms(void)
{
	struct timespec ts;
	if (clock_gettime(CLOCK_MONOTONIC, &ts))
		exit(1);
	return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static bool write_file(const char* file, const char* what, ...)
{
	char buf[1024];
	va_list args;
	va_start(args, what);
	vsnprintf(buf, sizeof(buf), what, args);
	va_end(args);
	buf[sizeof(buf) - 1] = 0;
	int len = strlen(buf);
	int fd = open(file, O_WRONLY | O_CLOEXEC);
	if (fd == -1)
		return false;
	if (write(fd, buf, len) != len) {
		int err = errno;
		close(fd);
		errno = err;
		return false;
	}
	close(fd);
	return true;
}

const char kvm_asm16_cpl3[] = "\x0f\x20\xc0\x66\x83\xc8\x01\x0f\x22\xc0\xb8\xa0\x00\x0f\x00\xd8\xb8\x2b\x00\x8e\xd8\x8e\xc0\x8e\xe0\x8e\xe8\xbc\x00\x01\xc7\x06\x00\x01\x1d\xba\xc7\x06\x02\x01\x23\x00\xc7\x06\x04\x01\x00\x01\xc7\x06\x06\x01\x2b\x00\xcb";
const char kvm_asm32_paged[] = "\x0f\x20\xc0\x0d\x00\x00\x00\x80\x0f\x22\xc0";
const char kvm_asm32_vm86[] = "\x66\xb8\xb8\x00\x0f\x00\xd8\xea\x00\x00\x00\x00\xd0\x00";
const char kvm_asm32_paged_vm86[] = "\x0f\x20\xc0\x0d\x00\x00\x00\x80\x0f\x22\xc0\x66\xb8\xb8\x00\x0f\x00\xd8\xea\x00\x00\x00\x00\xd0\x00";
const char kvm_asm64_enable_long[] = "\x0f\x20\xc0\x0d\x00\x00\x00\x80\x0f\x22\xc0\xea\xde\xc0\xad\x0b\x50\x00\x48\xc7\xc0\xd8\x00\x00\x00\x0f\x00\xd8";
const char kvm_asm64_init_vm[] = "\x0f\x20\xc0\x0d\x00\x00\x00\x80\x0f\x22\xc0\xea\xde\xc0\xad\x0b\x50\x00\x48\xc7\xc0\xd8\x00\x00\x00\x0f\x00\xd8\x48\xc7\xc1\x3a\x00\x00\x00\x0f\x32\x48\x83\xc8\x05\x0f\x30\x0f\x20\xe0\x48\x0d\x00\x20\x00\x00\x0f\x22\xe0\x48\xc7\xc1\x80\x04\x00\x00\x0f\x32\x48\xc7\xc2\x00\x60\x00\x00\x89\x02\x48\xc7\xc2\x00\x70\x00\x00\x89\x02\x48\xc7\xc0\x00\x5f\x00\x00\xf3\x0f\xc7\x30\x48\xc7\xc0\x08\x5f\x00\x00\x66\x0f\xc7\x30\x0f\xc7\x30\x48\xc7\xc1\x81\x04\x00\x00\x0f\x32\x48\x83\xc8\x3f\x48\x21\xd0\x48\xc7\xc2\x00\x40\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x02\x40\x00\x00\x48\xb8\x84\x9e\x99\xf3\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x1e\x40\x00\x00\x48\xc7\xc0\x81\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc1\x83\x04\x00\x00\x0f\x32\x48\x0d\xff\x6f\x03\x00\x48\x21\xd0\x48\xc7\xc2\x0c\x40\x00\x00\x0f\x79\xd0\x48\xc7\xc1\x84\x04\x00\x00\x0f\x32\x48\x0d\xff\x17\x00\x00\x48\x21\xd0\x48\xc7\xc2\x12\x40\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x04\x2c\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x00\x28\x00\x00\x48\xc7\xc0\xff\xff\xff\xff\x0f\x79\xd0\x48\xc7\xc2\x02\x0c\x00\x00\x48\xc7\xc0\x50\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc0\x58\x00\x00\x00\x48\xc7\xc2\x00\x0c\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x04\x0c\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x06\x0c\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x08\x0c\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0a\x0c\x00\x00\x0f\x79\xd0\x48\xc7\xc0\xd8\x00\x00\x00\x48\xc7\xc2\x0c\x0c\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x02\x2c\x00\x00\x48\xc7\xc0\x00\x05\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x00\x4c\x00\x00\x48\xc7\xc0\x50\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x10\x6c\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x12\x6c\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x0f\x20\xc0\x48\xc7\xc2\x00\x6c\x00\x00\x48\x89\xc0\x0f\x79\xd0\x0f\x20\xd8\x48\xc7\xc2\x02\x6c\x00\x00\x48\x89\xc0\x0f\x79\xd0\x0f\x20\xe0\x48\xc7\xc2\x04\x6c\x00\x00\x48\x89\xc0\x0f\x79\xd0\x48\xc7\xc2\x06\x6c\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x08\x6c\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0a\x6c\x00\x00\x48\xc7\xc0\x00\x3a\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0c\x6c\x00\x00\x48\xc7\xc0\x00\x10\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0e\x6c\x00\x00\x48\xc7\xc0\x00\x38\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x14\x6c\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x16\x6c\x00\x00\x48\x8b\x04\x25\x10\x5f\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x00\x00\x00\x00\x48\xc7\xc0\x01\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x02\x00\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x00\x20\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x02\x20\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x04\x20\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x06\x20\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc1\x77\x02\x00\x00\x0f\x32\x48\xc1\xe2\x20\x48\x09\xd0\x48\xc7\xc2\x00\x2c\x00\x00\x48\x89\xc0\x0f\x79\xd0\x48\xc7\xc2\x04\x40\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0a\x40\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0e\x40\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x10\x40\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x16\x40\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x14\x40\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x00\x60\x00\x00\x48\xc7\xc0\xff\xff\xff\xff\x0f\x79\xd0\x48\xc7\xc2\x02\x60\x00\x00\x48\xc7\xc0\xff\xff\xff\xff\x0f\x79\xd0\x48\xc7\xc2\x1c\x20\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x1e\x20\x00\x00\x48\xc7"
"\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x20\x20\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x22\x20\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x00\x08\x00\x00\x48\xc7\xc0\x58\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x02\x08\x00\x00\x48\xc7\xc0\x50\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x04\x08\x00\x00\x48\xc7\xc0\x58\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x06\x08\x00\x00\x48\xc7\xc0\x58\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x08\x08\x00\x00\x48\xc7\xc0\x58\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0a\x08\x00\x00\x48\xc7\xc0\x58\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0c\x08\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0e\x08\x00\x00\x48\xc7\xc0\xd8\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x12\x68\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x14\x68\x00\x00\x48\xc7\xc0\x00\x3a\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x16\x68\x00\x00\x48\xc7\xc0\x00\x10\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x18\x68\x00\x00\x48\xc7\xc0\x00\x38\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x00\x48\x00\x00\x48\xc7\xc0\xff\xff\x0f\x00\x0f\x79\xd0\x48\xc7\xc2\x02\x48\x00\x00\x48\xc7\xc0\xff\xff\x0f\x00\x0f\x79\xd0\x48\xc7\xc2\x04\x48\x00\x00\x48\xc7\xc0\xff\xff\x0f\x00\x0f\x79\xd0\x48\xc7\xc2\x06\x48\x00\x00\x48\xc7\xc0\xff\xff\x0f\x00\x0f\x79\xd0\x48\xc7\xc2\x08\x48\x00\x00\x48\xc7\xc0\xff\xff\x0f\x00\x0f\x79\xd0\x48\xc7\xc2\x0a\x48\x00\x00\x48\xc7\xc0\xff\xff\x0f\x00\x0f\x79\xd0\x48\xc7\xc2\x0c\x48\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0e\x48\x00\x00\x48\xc7\xc0\xff\x1f\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x10\x48\x00\x00\x48\xc7\xc0\xff\x1f\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x12\x48\x00\x00\x48\xc7\xc0\xff\x1f\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x14\x48\x00\x00\x48\xc7\xc0\x93\x40\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x16\x48\x00\x00\x48\xc7\xc0\x9b\x20\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x18\x48\x00\x00\x48\xc7\xc0\x93\x40\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x1a\x48\x00\x00\x48\xc7\xc0\x93\x40\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x1c\x48\x00\x00\x48\xc7\xc0\x93\x40\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x1e\x48\x00\x00\x48\xc7\xc0\x93\x40\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x20\x48\x00\x00\x48\xc7\xc0\x82\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x22\x48\x00\x00\x48\xc7\xc0\x8b\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x1c\x68\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x1e\x68\x00\x00\x48\xc7\xc0\x00\x91\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x20\x68\x00\x00\x48\xc7\xc0\x02\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x06\x28\x00\x00\x48\xc7\xc0\x00\x05\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0a\x28\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0c\x28\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x0e\x28\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x48\xc7\xc2\x10\x28\x00\x00\x48\xc7\xc0\x00\x00\x00\x00\x0f\x79\xd0\x0f\x20\xc0\x48\xc7\xc2\x00\x68\x00\x00\x48\x89\xc0\x0f\x79\xd0\x0f\x20\xd8\x48\xc7\xc2\x02\x68\x00\x00\x48\x89\xc0\x0f\x79\xd0\x0f\x20\xe0\x48\xc7\xc2\x04\x68\x00\x00\x48\x89\xc0\x0f\x79\xd0\x48\xc7\xc0\x18\x5f\x00\x00\x48\x8b\x10\x48\xc7\xc0\x20\x5f\x00\x00\x48\x8b\x08\x48\x31\xc0\x0f\x78\xd0\x48\x31\xc8\x0f\x79\xd0\x0f\x01\xc2\x48\xc7\xc2\x00\x44\x00\x00\x0f\x78\xd0\xf4";
const char kvm_asm64_vm_exit[] = "\x48\xc7\xc3\x00\x44\x00\x00\x0f\x78\xda\x48\xc7\xc3\x02\x44\x00\x00\x0f\x78\xd9\x48\xc7\xc0\x00\x64\x00\x00\x0f\x78\xc0\x48\xc7\xc3\x1e\x68\x00\x00\x0f\x78\xdb\xf4";
const char kvm_asm64_cpl3[] = "\x0f\x20\xc0\x0d\x00\x00\x00\x80\x0f\x22\xc0\xea\xde\xc0\xad\x0b\x50\x00\x48\xc7\xc0\xd8\x00\x00\x00\x0f\x00\xd8\x48\xc7\xc0\x6b\x00\x00\x00\x8e\xd8\x8e\xc0\x8e\xe0\x8e\xe8\x48\xc7\xc4\x80\x0f\x00\x00\x48\xc7\x04\x24\x1d\xba\x00\x00\x48\xc7\x44\x24\x04\x63\x00\x00\x00\x48\xc7\x44\x24\x08\x80\x0f\x00\x00\x48\xc7\x44\x24\x0c\x6b\x00\x00\x00\xcb";
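
/*
 * The kvm_asm* blobs above are raw guest machine code.  Reading the
 * bytes, kvm_asm64_init_vm enables long mode, then does VMXON
 * (f3 0f c7 30), VMCLEAR/VMPTRLD (66 0f c7 30 / 0f c7 30), a long run
 * of VMWRITEs (0f 79) to populate the VMCS, and finally VMLAUNCH
 * (0f 01 c2); kvm_asm64_vm_exit collects exit info via VMREAD (0f 78).
 * These are all VMX (Intel) operations -- there is no SVM
 * (VMRUN/VMLOAD) blob here.
 */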

#define ADDR_TEXT 0x0000
#define ADDR_GDT 0x1000
#define ADDR_LDT 0x1800
#define ADDR_PML4 0x2000
#define ADDR_PDP 0x3000
#define ADDR_PD 0x4000
#define ADDR_STACK0 0x0f80
#define ADDR_VAR_HLT 0x2800
#define ADDR_VAR_SYSRET 0x2808
#define ADDR_VAR_SYSEXIT 0x2810
#define ADDR_VAR_IDT 0x3800
#define ADDR_VAR_TSS64 0x3a00
#define ADDR_VAR_TSS64_CPL3 0x3c00
#define ADDR_VAR_TSS16 0x3d00
#define ADDR_VAR_TSS16_2 0x3e00
#define ADDR_VAR_TSS16_CPL3 0x3f00
#define ADDR_VAR_TSS32 0x4800
#define ADDR_VAR_TSS32_2 0x4a00
#define ADDR_VAR_TSS32_CPL3 0x4c00
#define ADDR_VAR_TSS32_VM86 0x4e00
#define ADDR_VAR_VMXON_PTR 0x5f00
#define ADDR_VAR_VMCS_PTR 0x5f08
#define ADDR_VAR_VMEXIT_PTR 0x5f10
#define ADDR_VAR_VMWRITE_FLD 0x5f18
#define ADDR_VAR_VMWRITE_VAL 0x5f20
#define ADDR_VAR_VMXON 0x6000
#define ADDR_VAR_VMCS 0x7000
#define ADDR_VAR_VMEXIT_CODE 0x9000
#define ADDR_VAR_USER_CODE 0x9100
#define ADDR_VAR_USER_CODE2 0x9120

#define SEL_LDT (1 << 3)
#define SEL_CS16 (2 << 3)
#define SEL_DS16 (3 << 3)
#define SEL_CS16_CPL3 ((4 << 3) + 3)
#define SEL_DS16_CPL3 ((5 << 3) + 3)
#define SEL_CS32 (6 << 3)
#define SEL_DS32 (7 << 3)
#define SEL_CS32_CPL3 ((8 << 3) + 3)
#define SEL_DS32_CPL3 ((9 << 3) + 3)
#define SEL_CS64 (10 << 3)
#define SEL_DS64 (11 << 3)
#define SEL_CS64_CPL3 ((12 << 3) + 3)
#define SEL_DS64_CPL3 ((13 << 3) + 3)
#define SEL_CGATE16 (14 << 3)
#define SEL_TGATE16 (15 << 3)
#define SEL_CGATE32 (16 << 3)
#define SEL_TGATE32 (17 << 3)
#define SEL_CGATE64 (18 << 3)
#define SEL_CGATE64_HI (19 << 3)
#define SEL_TSS16 (20 << 3)
#define SEL_TSS16_2 (21 << 3)
#define SEL_TSS16_CPL3 ((22 << 3) + 3)
#define SEL_TSS32 (23 << 3)
#define SEL_TSS32_2 (24 << 3)
#define SEL_TSS32_CPL3 ((25 << 3) + 3)
#define SEL_TSS32_VM86 (26 << 3)
#define SEL_TSS64 (27 << 3)
#define SEL_TSS64_HI (28 << 3)
#define SEL_TSS64_CPL3 ((29 << 3) + 3)
#define SEL_TSS64_CPL3_HI (30 << 3)

#define MSR_IA32_FEATURE_CONTROL 0x3a
#define MSR_IA32_VMX_BASIC 0x480
#define MSR_IA32_SMBASE 0x9e
#define MSR_IA32_SYSENTER_CS 0x174
#define MSR_IA32_SYSENTER_ESP 0x175
#define MSR_IA32_SYSENTER_EIP 0x176
#define MSR_IA32_STAR 0xC0000081
#define MSR_IA32_LSTAR 0xC0000082
#define MSR_IA32_VMX_PROCBASED_CTLS2 0x48B

#define NEXT_INSN $0xbadc0de
#define PREFIX_SIZE 0xba1d

#define KVM_SMI _IO(KVMIO, 0xb7)

#define CR0_PE 1
#define CR0_MP (1 << 1)
#define CR0_EM (1 << 2)
#define CR0_TS (1 << 3)
#define CR0_ET (1 << 4)
#define CR0_NE (1 << 5)
#define CR0_WP (1 << 16)
#define CR0_AM (1 << 18)
#define CR0_NW (1 << 29)
#define CR0_CD (1 << 30)
#define CR0_PG (1 << 31)

#define CR4_VME 1
#define CR4_PVI (1 << 1)
#define CR4_TSD (1 << 2)
#define CR4_DE (1 << 3)
#define CR4_PSE (1 << 4)
#define CR4_PAE (1 << 5)
#define CR4_MCE (1 << 6)
#define CR4_PGE (1 << 7)
#define CR4_PCE (1 << 8)
#define CR4_OSFXSR (1 << 8)
#define CR4_OSXMMEXCPT (1 << 10)
#define CR4_UMIP (1 << 11)
#define CR4_VMXE (1 << 13)
#define CR4_SMXE (1 << 14)
#define CR4_FSGSBASE (1 << 16)
#define CR4_PCIDE (1 << 17)
#define CR4_OSXSAVE (1 << 18)
#define CR4_SMEP (1 << 20)
#define CR4_SMAP (1 << 21)
#define CR4_PKE (1 << 22)

#define EFER_SCE 1
#define EFER_LME (1 << 8)
#define EFER_LMA (1 << 10)
#define EFER_NXE (1 << 11)
#define EFER_SVME (1 << 12)
#define EFER_LMSLE (1 << 13)
#define EFER_FFXSR (1 << 14)
#define EFER_TCE (1 << 15)
#define PDE32_PRESENT 1
#define PDE32_RW (1 << 1)
#define PDE32_USER (1 << 2)
#define PDE32_PS (1 << 7)
#define PDE64_PRESENT 1
#define PDE64_RW (1 << 1)
#define PDE64_USER (1 << 2)
#define PDE64_ACCESSED (1 << 5)
#define PDE64_DIRTY (1 << 6)
#define PDE64_PS (1 << 7)
#define PDE64_G (1 << 8)

struct tss16 {
	uint16_t prev;
	uint16_t sp0;
	uint16_t ss0;
	uint16_t sp1;
	uint16_t ss1;
	uint16_t sp2;
	uint16_t ss2;
	uint16_t ip;
	uint16_t flags;
	uint16_t ax;
	uint16_t cx;
	uint16_t dx;
	uint16_t bx;
	uint16_t sp;
	uint16_t bp;
	uint16_t si;
	uint16_t di;
	uint16_t es;
	uint16_t cs;
	uint16_t ss;
	uint16_t ds;
	uint16_t ldt;
} __attribute__((packed));

struct tss32 {
	uint16_t prev, prevh;
	uint32_t sp0;
	uint16_t ss0, ss0h;
	uint32_t sp1;
	uint16_t ss1, ss1h;
	uint32_t sp2;
	uint16_t ss2, ss2h;
	uint32_t cr3;
	uint32_t ip;
	uint32_t flags;
	uint32_t ax;
	uint32_t cx;
	uint32_t dx;
	uint32_t bx;
	uint32_t sp;
	uint32_t bp;
	uint32_t si;
	uint32_t di;
	uint16_t es, esh;
	uint16_t cs, csh;
	uint16_t ss, ssh;
	uint16_t ds, dsh;
	uint16_t fs, fsh;
	uint16_t gs, gsh;
	uint16_t ldt, ldth;
	uint16_t trace;
	uint16_t io_bitmap;
} __attribute__((packed));

struct tss64 {
	uint32_t reserved0;
	uint64_t rsp[3];
	uint64_t reserved1;
	uint64_t ist[7];
	uint64_t reserved2;
	uint32_t reserved3;
	uint32_t io_bitmap;
} __attribute__((packed));

/*
 * Pack a protected-mode segment descriptor from a struct kvm_segment
 * and store it into both the GDT image (dt) and the LDT image (lt).
 * The fields follow the architectural layout: limit[15:0], base[23:0],
 * type, S, DPL, P, limit[19:16], AVL, L, D/B, G, base[31:24].  Note
 * that the "(limit & 0xf0000ULL) << 48" and "(seg->base &
 * 0xff000000ULL) << 56" terms shift those bits past bit 63 of the
 * 64-bit value, so the high limit and high base fields always come
 * out as zero here.
 */
static void fill_segment_descriptor(uint64_t* dt, uint64_t* lt, struct kvm_segment* seg)
{
	uint16_t index = seg->selector >> 3;
	uint64_t limit = seg->g ? seg->limit >> 12 : seg->limit;
	uint64_t sd = (limit & 0xffff) | (seg->base & 0xffffff) << 16 | (uint64_t)seg->type << 40 | (uint64_t)seg->s << 44 | (uint64_t)seg->dpl << 45 | (uint64_t)seg->present << 47 | (limit & 0xf0000ULL) << 48 | (uint64_t)seg->avl << 52 | (uint64_t)seg->l << 53 | (uint64_t)seg->db << 54 | (uint64_t)seg->g << 55 | (seg->base & 0xff000000ULL) << 56;
	NONFAILING(dt[index] = sd);
	NONFAILING(lt[index] = sd);
}

static void fill_segment_descriptor_dword(uint64_t* dt, uint64_t* lt, struct kvm_segment* seg)
{
	fill_segment_descriptor(dt, lt, seg);
	uint16_t index = seg->selector >> 3;
	NONFAILING(dt[index + 1] = 0);
	NONFAILING(lt[index + 1] = 0);
}

static void setup_syscall_msrs(int cpufd, uint16_t sel_cs, uint16_t sel_cs_cpl3)
{
	char buf[sizeof(struct kvm_msrs) + 5 * sizeof(struct kvm_msr_entry)];
	memset(buf, 0, sizeof(buf));
	struct kvm_msrs* msrs = (struct kvm_msrs*)buf;
	struct kvm_msr_entry* entries = msrs->entries;
	msrs->nmsrs = 5;
	entries[0].index = MSR_IA32_SYSENTER_CS;
	entries[0].data = sel_cs;
	entries[1].index = MSR_IA32_SYSENTER_ESP;
	entries[1].data = ADDR_STACK0;
	entries[2].index = MSR_IA32_SYSENTER_EIP;
	entries[2].data = ADDR_VAR_SYSEXIT;
	entries[3].index = MSR_IA32_STAR;
	entries[3].data = ((uint64_t)sel_cs << 32) | ((uint64_t)sel_cs_cpl3 << 48);
	entries[4].index = MSR_IA32_LSTAR;
	entries[4].data = ADDR_VAR_SYSRET;
	ioctl(cpufd, KVM_SET_MSRS, msrs);
}

static void setup_32bit_idt(struct kvm_sregs* sregs, char* host_mem, uintptr_t guest_mem)
{
	sregs->idt.base = guest_mem + ADDR_VAR_IDT;
	sregs->idt.limit = 0x1ff;
	uint64_t* idt = (uint64_t*)(host_mem + sregs->idt.base);
	int i;
	for (i = 0; i < 32; i++) {
		struct kvm_segment gate;
		gate.selector = i << 3;
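		/* i % 6 is never 6, so the "case 6" arm below is dead code,
		 * and gate.type/gate.base stay uninitialized when i % 6 == 5. */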
		switch (i % 6) {
		case 0:
			gate.type = 6;
			gate.base = SEL_CS16;
			break;
		case 1:
			gate.type = 7;
			gate.base = SEL_CS16;
			break;
		case 2:
			gate.type = 3;
			gate.base = SEL_TGATE16;
			break;
		case 3:
			gate.type = 14;
			gate.base = SEL_CS32;
			break;
		case 4:
			gate.type = 15;
			gate.base = SEL_CS32;
			break;
		case 6:
			gate.type = 11;
			gate.base = SEL_TGATE32;
			break;
		}
		gate.limit = guest_mem + ADDR_VAR_USER_CODE2;
		gate.present = 1;
		gate.dpl = 0;
		gate.s = 0;
		gate.g = 0;
		gate.db = 0;
		gate.l = 0;
		gate.avl = 0;
		fill_segment_descriptor(idt, idt, &gate);
	}
}

static void setup_64bit_idt(struct kvm_sregs* sregs, char* host_mem, uintptr_t guest_mem)
{
	sregs->idt.base = guest_mem + ADDR_VAR_IDT;
	sregs->idt.limit = 0x1ff;
	uint64_t* idt = (uint64_t*)(host_mem + sregs->idt.base);
	int i;
	for (i = 0; i < 32; i++) {
		struct kvm_segment gate;
		gate.selector = (i * 2) << 3;
		gate.type = (i & 1) ? 14 : 15;
		gate.base = SEL_CS64;
		gate.limit = guest_mem + ADDR_VAR_USER_CODE2;
		gate.present = 1;
		gate.dpl = 0;
		gate.s = 0;
		gate.g = 0;
		gate.db = 0;
		gate.l = 0;
		gate.avl = 0;
		fill_segment_descriptor_dword(idt, idt, &gate);
	}
}

struct kvm_text {
	uintptr_t typ;
	const void* text;
	uintptr_t size;
};

struct kvm_opt {
	uint64_t typ;
	uint64_t val;
};

#define KVM_SETUP_PAGING (1 << 0)
#define KVM_SETUP_PAE (1 << 1)
#define KVM_SETUP_PROTECTED (1 << 2)
#define KVM_SETUP_CPL3 (1 << 3)
#define KVM_SETUP_VIRT86 (1 << 4)
#define KVM_SETUP_SMM (1 << 5)
#define KVM_SETUP_VM (1 << 6)
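/*
 * a0 = VM fd, a1 = vCPU fd, a2 = host mapping backing guest memory,
 * a3/a4 = kvm_text array and count, a5 = KVM_SETUP_* flags,
 * a6/a7 = kvm_opt array and count.  The reproducer below passes zero
 * for everything past a2, so the text reads fault (and are swallowed
 * by NONFAILING) and the guest ends up running just the 64-bit setup
 * prefix followed by hlt.
 */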
static long syz_kvm_setup_cpu(volatile long a0, volatile long a1, volatile long a2, volatile long a3, volatile long a4, volatile long a5, volatile long a6, volatile long a7)
{
	const int vmfd = a0;
	const int cpufd = a1;
	char* const host_mem = (char*)a2;
	const struct kvm_text* const text_array_ptr = (struct kvm_text*)a3;
	const uintptr_t text_count = a4;
	const uintptr_t flags = a5;
	const struct kvm_opt* const opt_array_ptr = (struct kvm_opt*)a6;
	uintptr_t opt_count = a7;
	const uintptr_t page_size = 4 << 10;
	const uintptr_t ioapic_page = 10;
	const uintptr_t guest_mem_size = 24 * page_size;
	const uintptr_t guest_mem = 0;
	(void)text_count;
	int text_type = 0;
	const void* text = 0;
	uintptr_t text_size = 0;
	NONFAILING(text_type = text_array_ptr[0].typ);
	NONFAILING(text = text_array_ptr[0].text);
	NONFAILING(text_size = text_array_ptr[0].size);
	uintptr_t i;
	for (i = 0; i < guest_mem_size / page_size; i++) {
		struct kvm_userspace_memory_region memreg;
		memreg.slot = i;
		memreg.flags = 0;
		memreg.guest_phys_addr = guest_mem + i * page_size;
		if (i == ioapic_page)
			memreg.guest_phys_addr = 0xfec00000;
		memreg.memory_size = page_size;
		memreg.userspace_addr = (uintptr_t)host_mem + i * page_size;
		ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
	}
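	/* One more slot in address space 1 (encoded as slot | (as_id << 16);
	 * as_id 1 is SMM) backs 64KB at 0x30000, the default SMBASE.  The SMI
	 * entry point SMBASE + 0x8000 thus maps to host_mem + 0x8000, which
	 * the KVM_SETUP_SMM paths below use as host_text. */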
	struct kvm_userspace_memory_region memreg;
	memreg.slot = 1 + (1 << 16);
	memreg.flags = 0;
	memreg.guest_phys_addr = 0x30000;
	memreg.memory_size = 64 << 10;
	memreg.userspace_addr = (uintptr_t)host_mem;
	ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
	struct kvm_sregs sregs;
	if (ioctl(cpufd, KVM_GET_SREGS, &sregs))
		return -1;
	struct kvm_regs regs;
	memset(&regs, 0, sizeof(regs));
	regs.rip = guest_mem + ADDR_TEXT;
	regs.rsp = ADDR_STACK0;
	sregs.gdt.base = guest_mem + ADDR_GDT;
	sregs.gdt.limit = 256 * sizeof(uint64_t) - 1;
	uint64_t* gdt = (uint64_t*)(host_mem + sregs.gdt.base);
	struct kvm_segment seg_ldt;
	seg_ldt.selector = SEL_LDT;
	seg_ldt.type = 2;
	seg_ldt.base = guest_mem + ADDR_LDT;
	seg_ldt.limit = 256 * sizeof(uint64_t) - 1;
	seg_ldt.present = 1;
	seg_ldt.dpl = 0;
	seg_ldt.s = 0;
	seg_ldt.g = 0;
	seg_ldt.db = 1;
	seg_ldt.l = 0;
	sregs.ldt = seg_ldt;
	uint64_t* ldt = (uint64_t*)(host_mem + sregs.ldt.base);
	struct kvm_segment seg_cs16;
	seg_cs16.selector = SEL_CS16;
	seg_cs16.type = 11;
	seg_cs16.base = 0;
	seg_cs16.limit = 0xfffff;
	seg_cs16.present = 1;
	seg_cs16.dpl = 0;
	seg_cs16.s = 1;
	seg_cs16.g = 0;
	seg_cs16.db = 0;
	seg_cs16.l = 0;
	struct kvm_segment seg_ds16 = seg_cs16;
	seg_ds16.selector = SEL_DS16;
	seg_ds16.type = 3;
	struct kvm_segment seg_cs16_cpl3 = seg_cs16;
	seg_cs16_cpl3.selector = SEL_CS16_CPL3;
	seg_cs16_cpl3.dpl = 3;
	struct kvm_segment seg_ds16_cpl3 = seg_ds16;
	seg_ds16_cpl3.selector = SEL_DS16_CPL3;
	seg_ds16_cpl3.dpl = 3;
	struct kvm_segment seg_cs32 = seg_cs16;
	seg_cs32.selector = SEL_CS32;
	seg_cs32.db = 1;
	struct kvm_segment seg_ds32 = seg_ds16;
	seg_ds32.selector = SEL_DS32;
	seg_ds32.db = 1;
	struct kvm_segment seg_cs32_cpl3 = seg_cs32;
	seg_cs32_cpl3.selector = SEL_CS32_CPL3;
	seg_cs32_cpl3.dpl = 3;
	struct kvm_segment seg_ds32_cpl3 = seg_ds32;
	seg_ds32_cpl3.selector = SEL_DS32_CPL3;
	seg_ds32_cpl3.dpl = 3;
	struct kvm_segment seg_cs64 = seg_cs16;
	seg_cs64.selector = SEL_CS64;
	seg_cs64.l = 1;
	struct kvm_segment seg_ds64 = seg_ds32;
	seg_ds64.selector = SEL_DS64;
	struct kvm_segment seg_cs64_cpl3 = seg_cs64;
	seg_cs64_cpl3.selector = SEL_CS64_CPL3;
	seg_cs64_cpl3.dpl = 3;
	struct kvm_segment seg_ds64_cpl3 = seg_ds64;
	seg_ds64_cpl3.selector = SEL_DS64_CPL3;
	seg_ds64_cpl3.dpl = 3;
	struct kvm_segment seg_tss32;
	seg_tss32.selector = SEL_TSS32;
	seg_tss32.type = 9;
	seg_tss32.base = ADDR_VAR_TSS32;
	seg_tss32.limit = 0x1ff;
	seg_tss32.present = 1;
	seg_tss32.dpl = 0;
	seg_tss32.s = 0;
	seg_tss32.g = 0;
	seg_tss32.db = 0;
	seg_tss32.l = 0;
	struct kvm_segment seg_tss32_2 = seg_tss32;
	seg_tss32_2.selector = SEL_TSS32_2;
	seg_tss32_2.base = ADDR_VAR_TSS32_2;
	struct kvm_segment seg_tss32_cpl3 = seg_tss32;
	seg_tss32_cpl3.selector = SEL_TSS32_CPL3;
	seg_tss32_cpl3.base = ADDR_VAR_TSS32_CPL3;
	struct kvm_segment seg_tss32_vm86 = seg_tss32;
	seg_tss32_vm86.selector = SEL_TSS32_VM86;
	seg_tss32_vm86.base = ADDR_VAR_TSS32_VM86;
	struct kvm_segment seg_tss16 = seg_tss32;
	seg_tss16.selector = SEL_TSS16;
	seg_tss16.base = ADDR_VAR_TSS16;
	seg_tss16.limit = 0xff;
	seg_tss16.type = 1;
	struct kvm_segment seg_tss16_2 = seg_tss16;
	seg_tss16_2.selector = SEL_TSS16_2;
	seg_tss16_2.base = ADDR_VAR_TSS16_2;
	seg_tss16_2.dpl = 0;
	struct kvm_segment seg_tss16_cpl3 = seg_tss16;
	seg_tss16_cpl3.selector = SEL_TSS16_CPL3;
	seg_tss16_cpl3.base = ADDR_VAR_TSS16_CPL3;
	seg_tss16_cpl3.dpl = 3;
	struct kvm_segment seg_tss64 = seg_tss32;
	seg_tss64.selector = SEL_TSS64;
	seg_tss64.base = ADDR_VAR_TSS64;
	seg_tss64.limit = 0x1ff;
	struct kvm_segment seg_tss64_cpl3 = seg_tss64;
	seg_tss64_cpl3.selector = SEL_TSS64_CPL3;
	seg_tss64_cpl3.base = ADDR_VAR_TSS64_CPL3;
	seg_tss64_cpl3.dpl = 3;
	struct kvm_segment seg_cgate16;
	seg_cgate16.selector = SEL_CGATE16;
	seg_cgate16.type = 4;
	seg_cgate16.base = SEL_CS16 | (2 << 16);
	seg_cgate16.limit = ADDR_VAR_USER_CODE2;
	seg_cgate16.present = 1;
	seg_cgate16.dpl = 0;
	seg_cgate16.s = 0;
	seg_cgate16.g = 0;
	seg_cgate16.db = 0;
	seg_cgate16.l = 0;
	seg_cgate16.avl = 0;
	struct kvm_segment seg_tgate16 = seg_cgate16;
	seg_tgate16.selector = SEL_TGATE16;
	seg_tgate16.type = 3;
	seg_cgate16.base = SEL_TSS16_2;
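	/* Note: this assigns seg_cgate16.base a second time; seg_tgate16.base
	 * is never set (possibly a generator typo). */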
	seg_tgate16.limit = 0;
	struct kvm_segment seg_cgate32 = seg_cgate16;
	seg_cgate32.selector = SEL_CGATE32;
	seg_cgate32.type = 12;
	seg_cgate32.base = SEL_CS32 | (2 << 16);
	struct kvm_segment seg_tgate32 = seg_cgate32;
	seg_tgate32.selector = SEL_TGATE32;
	seg_tgate32.type = 11;
	seg_tgate32.base = SEL_TSS32_2;
	seg_tgate32.limit = 0;
	struct kvm_segment seg_cgate64 = seg_cgate16;
	seg_cgate64.selector = SEL_CGATE64;
	seg_cgate64.type = 12;
	seg_cgate64.base = SEL_CS64;
	int kvmfd = open("/dev/kvm", O_RDWR);
	char buf[sizeof(struct kvm_cpuid2) + 128 * sizeof(struct kvm_cpuid_entry2)];
	memset(buf, 0, sizeof(buf));
	struct kvm_cpuid2* cpuid = (struct kvm_cpuid2*)buf;
	cpuid->nent = 128;
	ioctl(kvmfd, KVM_GET_SUPPORTED_CPUID, cpuid);
	ioctl(cpufd, KVM_SET_CPUID2, cpuid);
	close(kvmfd);
	const char* text_prefix = 0;
	int text_prefix_size = 0;
	char* host_text = host_mem + ADDR_TEXT;
	if (text_type == 8) {
		if (flags & KVM_SETUP_SMM) {
			if (flags & KVM_SETUP_PROTECTED) {
				sregs.cs = seg_cs16;
				sregs.ds = sregs.es = sregs.fs = sregs.gs = sregs.ss = seg_ds16;
				sregs.cr0 |= CR0_PE;
			} else {
				sregs.cs.selector = 0;
				sregs.cs.base = 0;
			}
			NONFAILING(*(host_mem + ADDR_TEXT) = 0xf4);
			host_text = host_mem + 0x8000;
			ioctl(cpufd, KVM_SMI, 0);
		} else if (flags & KVM_SETUP_VIRT86) {
			sregs.cs = seg_cs32;
			sregs.ds = sregs.es = sregs.fs = sregs.gs = sregs.ss = seg_ds32;
			sregs.cr0 |= CR0_PE;
			sregs.efer |= EFER_SCE;
			setup_syscall_msrs(cpufd, SEL_CS32, SEL_CS32_CPL3);
			setup_32bit_idt(&sregs, host_mem, guest_mem);
			if (flags & KVM_SETUP_PAGING) {
				uint64_t pd_addr = guest_mem + ADDR_PD;
				uint64_t* pd = (uint64_t*)(host_mem + ADDR_PD);
				NONFAILING(pd[0] = PDE32_PRESENT | PDE32_RW | PDE32_USER | PDE32_PS);
				sregs.cr3 = pd_addr;
				sregs.cr4 |= CR4_PSE;
				text_prefix = kvm_asm32_paged_vm86;
				text_prefix_size = sizeof(kvm_asm32_paged_vm86) - 1;
			} else {
				text_prefix = kvm_asm32_vm86;
				text_prefix_size = sizeof(kvm_asm32_vm86) - 1;
			}
		} else {
			sregs.cs.selector = 0;
			sregs.cs.base = 0;
		}
	} else if (text_type == 16) {
		if (flags & KVM_SETUP_CPL3) {
			sregs.cs = seg_cs16;
			sregs.ds = sregs.es = sregs.fs = sregs.gs = sregs.ss = seg_ds16;
			text_prefix = kvm_asm16_cpl3;
			text_prefix_size = sizeof(kvm_asm16_cpl3) - 1;
		} else {
			sregs.cr0 |= CR0_PE;
			sregs.cs = seg_cs16;
			sregs.ds = sregs.es = sregs.fs = sregs.gs = sregs.ss = seg_ds16;
		}
	} else if (text_type == 32) {
		sregs.cr0 |= CR0_PE;
		sregs.efer |= EFER_SCE;
		setup_syscall_msrs(cpufd, SEL_CS32, SEL_CS32_CPL3);
		setup_32bit_idt(&sregs, host_mem, guest_mem);
		if (flags & KVM_SETUP_SMM) {
			sregs.cs = seg_cs32;
			sregs.ds = sregs.es = sregs.fs = sregs.gs = sregs.ss = seg_ds32;
			NONFAILING(*(host_mem + ADDR_TEXT) = 0xf4);
			host_text = host_mem + 0x8000;
			ioctl(cpufd, KVM_SMI, 0);
		} else if (flags & KVM_SETUP_PAGING) {
			sregs.cs = seg_cs32;
			sregs.ds = sregs.es = sregs.fs = sregs.gs = sregs.ss = seg_ds32;
			uint64_t pd_addr = guest_mem + ADDR_PD;
			uint64_t* pd = (uint64_t*)(host_mem + ADDR_PD);
			NONFAILING(pd[0] = PDE32_PRESENT | PDE32_RW | PDE32_USER | PDE32_PS);
			sregs.cr3 = pd_addr;
			sregs.cr4 |= CR4_PSE;
			text_prefix = kvm_asm32_paged;
			text_prefix_size = sizeof(kvm_asm32_paged) - 1;
		} else if (flags & KVM_SETUP_CPL3) {
			sregs.cs = seg_cs32_cpl3;
			sregs.ds = sregs.es = sregs.fs = sregs.gs = sregs.ss = seg_ds32_cpl3;
		} else {
			sregs.cs = seg_cs32;
			sregs.ds = sregs.es = sregs.fs = sregs.gs = sregs.ss = seg_ds32;
		}
	} else {
		sregs.efer |= EFER_LME | EFER_SCE;
		sregs.cr0 |= CR0_PE;
		setup_syscall_msrs(cpufd, SEL_CS64, SEL_CS64_CPL3);
		setup_64bit_idt(&sregs, host_mem, guest_mem);
		sregs.cs = seg_cs32;
		sregs.ds = sregs.es = sregs.fs = sregs.gs = sregs.ss = seg_ds32;
		uint64_t pml4_addr = guest_mem + ADDR_PML4;
		uint64_t* pml4 = (uint64_t*)(host_mem + ADDR_PML4);
		uint64_t pdpt_addr = guest_mem + ADDR_PDP;
		uint64_t* pdpt = (uint64_t*)(host_mem + ADDR_PDP);
		uint64_t pd_addr = guest_mem + ADDR_PD;
		uint64_t* pd = (uint64_t*)(host_mem + ADDR_PD);
		NONFAILING(pml4[0] = PDE64_PRESENT | PDE64_RW | PDE64_USER | pdpt_addr);
		NONFAILING(pdpt[0] = PDE64_PRESENT | PDE64_RW | PDE64_USER | pd_addr);
		NONFAILING(pd[0] = PDE64_PRESENT | PDE64_RW | PDE64_USER | PDE64_PS);
		sregs.cr3 = pml4_addr;
		sregs.cr4 |= CR4_PAE;
		if (flags & KVM_SETUP_VM) {
			sregs.cr0 |= CR0_NE;
			NONFAILING(*((uint64_t*)(host_mem + ADDR_VAR_VMXON_PTR)) = ADDR_VAR_VMXON);
			NONFAILING(*((uint64_t*)(host_mem + ADDR_VAR_VMCS_PTR)) = ADDR_VAR_VMCS);
			NONFAILING(memcpy(host_mem + ADDR_VAR_VMEXIT_CODE, kvm_asm64_vm_exit, sizeof(kvm_asm64_vm_exit) - 1));
			NONFAILING(*((uint64_t*)(host_mem + ADDR_VAR_VMEXIT_PTR)) = ADDR_VAR_VMEXIT_CODE);
			text_prefix = kvm_asm64_init_vm;
			text_prefix_size = sizeof(kvm_asm64_init_vm) - 1;
		} else if (flags & KVM_SETUP_CPL3) {
			text_prefix = kvm_asm64_cpl3;
			text_prefix_size = sizeof(kvm_asm64_cpl3) - 1;
		} else {
			text_prefix = kvm_asm64_enable_long;
			text_prefix_size = sizeof(kvm_asm64_enable_long) - 1;
		}
	}
	struct tss16 tss16;
	memset(&tss16, 0, sizeof(tss16));
	tss16.ss0 = tss16.ss1 = tss16.ss2 = SEL_DS16;
	tss16.sp0 = tss16.sp1 = tss16.sp2 = ADDR_STACK0;
	tss16.ip = ADDR_VAR_USER_CODE2;
	tss16.flags = (1 << 1);
	tss16.cs = SEL_CS16;
	tss16.es = tss16.ds = tss16.ss = SEL_DS16;
	tss16.ldt = SEL_LDT;
	struct tss16* tss16_addr = (struct tss16*)(host_mem + seg_tss16_2.base);
	NONFAILING(memcpy(tss16_addr, &tss16, sizeof(tss16)));
	memset(&tss16, 0, sizeof(tss16));
	tss16.ss0 = tss16.ss1 = tss16.ss2 = SEL_DS16;
	tss16.sp0 = tss16.sp1 = tss16.sp2 = ADDR_STACK0;
	tss16.ip = ADDR_VAR_USER_CODE2;
	tss16.flags = (1 << 1);
	tss16.cs = SEL_CS16_CPL3;
	tss16.es = tss16.ds = tss16.ss = SEL_DS16_CPL3;
	tss16.ldt = SEL_LDT;
	struct tss16* tss16_cpl3_addr = (struct tss16*)(host_mem + seg_tss16_cpl3.base);
	NONFAILING(memcpy(tss16_cpl3_addr, &tss16, sizeof(tss16)));
	struct tss32 tss32;
	memset(&tss32, 0, sizeof(tss32));
	tss32.ss0 = tss32.ss1 = tss32.ss2 = SEL_DS32;
	tss32.sp0 = tss32.sp1 = tss32.sp2 = ADDR_STACK0;
	tss32.ip = ADDR_VAR_USER_CODE;
	tss32.flags = (1 << 1) | (1 << 17);
	tss32.ldt = SEL_LDT;
	tss32.cr3 = sregs.cr3;
	tss32.io_bitmap = offsetof(struct tss32, io_bitmap);
	struct tss32* tss32_addr = (struct tss32*)(host_mem + seg_tss32_vm86.base);
	NONFAILING(memcpy(tss32_addr, &tss32, sizeof(tss32)));
	memset(&tss32, 0, sizeof(tss32));
	tss32.ss0 = tss32.ss1 = tss32.ss2 = SEL_DS32;
	tss32.sp0 = tss32.sp1 = tss32.sp2 = ADDR_STACK0;
	tss32.ip = ADDR_VAR_USER_CODE;
	tss32.flags = (1 << 1);
	tss32.cr3 = sregs.cr3;
	tss32.es = tss32.ds = tss32.ss = tss32.gs = tss32.fs = SEL_DS32;
	tss32.cs = SEL_CS32;
	tss32.ldt = SEL_LDT;
	tss32.cr3 = sregs.cr3;
	tss32.io_bitmap = offsetof(struct tss32, io_bitmap);
	struct tss32* tss32_cpl3_addr = (struct tss32*)(host_mem + seg_tss32_2.base);
	NONFAILING(memcpy(tss32_cpl3_addr, &tss32, sizeof(tss32)));
	struct tss64 tss64;
	memset(&tss64, 0, sizeof(tss64));
	tss64.rsp[0] = ADDR_STACK0;
	tss64.rsp[1] = ADDR_STACK0;
	tss64.rsp[2] = ADDR_STACK0;
	tss64.io_bitmap = offsetof(struct tss64, io_bitmap);
	struct tss64* tss64_addr = (struct tss64*)(host_mem + seg_tss64.base);
	NONFAILING(memcpy(tss64_addr, &tss64, sizeof(tss64)));
	memset(&tss64, 0, sizeof(tss64));
	tss64.rsp[0] = ADDR_STACK0;
	tss64.rsp[1] = ADDR_STACK0;
	tss64.rsp[2] = ADDR_STACK0;
	tss64.io_bitmap = offsetof(struct tss64, io_bitmap);
	struct tss64* tss64_cpl3_addr = (struct tss64*)(host_mem + seg_tss64_cpl3.base);
	NONFAILING(memcpy(tss64_cpl3_addr, &tss64, sizeof(tss64)));
	if (text_size > 1000)
		text_size = 1000;
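	/* The prefix blobs embed two placeholders that are patched to real
	 * guest addresses here: the far-jump target 0x0badc0de (little-endian
	 * de c0 ad 0b, cf. NEXT_INSN) and the 16-bit magic 0xba1d
	 * (PREFIX_SIZE). */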
	if (text_prefix) {
		NONFAILING(memcpy(host_text, text_prefix, text_prefix_size));
		void* patch = 0;
		NONFAILING(patch = memmem(host_text, text_prefix_size, "\xde\xc0\xad\x0b", 4));
		if (patch)
			NONFAILING(*((uint32_t*)patch) = guest_mem + ADDR_TEXT + ((char*)patch - host_text) + 6);
		uint16_t magic = PREFIX_SIZE;
		patch = 0;
		NONFAILING(patch = memmem(host_text, text_prefix_size, &magic, sizeof(magic)));
		if (patch)
			NONFAILING(*((uint16_t*)patch) = guest_mem + ADDR_TEXT + text_prefix_size);
	}
	NONFAILING(memcpy((void*)(host_text + text_prefix_size), text, text_size));
	NONFAILING(*(host_text + text_prefix_size + text_size) = 0xf4);
	NONFAILING(memcpy(host_mem + ADDR_VAR_USER_CODE, text, text_size));
	NONFAILING(*(host_mem + ADDR_VAR_USER_CODE + text_size) = 0xf4);
	NONFAILING(*(host_mem + ADDR_VAR_HLT) = 0xf4);
	NONFAILING(memcpy(host_mem + ADDR_VAR_SYSRET, "\x0f\x07\xf4", 3));
	NONFAILING(memcpy(host_mem + ADDR_VAR_SYSEXIT, "\x0f\x35\xf4", 3));
	NONFAILING(*(uint64_t*)(host_mem + ADDR_VAR_VMWRITE_FLD) = 0);
	NONFAILING(*(uint64_t*)(host_mem + ADDR_VAR_VMWRITE_VAL) = 0);
	if (opt_count > 2)
		opt_count = 2;
	for (i = 0; i < opt_count; i++) {
		uint64_t typ = 0;
		uint64_t val = 0;
		NONFAILING(typ = opt_array_ptr[i].typ);
		NONFAILING(val = opt_array_ptr[i].val);
		switch (typ % 9) {
		case 0:
			sregs.cr0 ^= val & (CR0_MP | CR0_EM | CR0_ET | CR0_NE | CR0_WP | CR0_AM | CR0_NW | CR0_CD);
			break;
		case 1:
			sregs.cr4 ^= val & (CR4_VME | CR4_PVI | CR4_TSD | CR4_DE | CR4_MCE | CR4_PGE | CR4_PCE |
					    CR4_OSFXSR | CR4_OSXMMEXCPT | CR4_UMIP | CR4_VMXE | CR4_SMXE | CR4_FSGSBASE | CR4_PCIDE |
					    CR4_OSXSAVE | CR4_SMEP | CR4_SMAP | CR4_PKE);
			break;
		case 2:
			sregs.efer ^= val & (EFER_SCE | EFER_NXE | EFER_SVME | EFER_LMSLE | EFER_FFXSR | EFER_TCE);
			break;
		case 3:
			val &= ((1 << 8) | (1 << 9) | (1 << 10) | (1 << 12) | (1 << 13) | (1 << 14) |
				(1 << 15) | (1 << 18) | (1 << 19) | (1 << 20) | (1 << 21));
			regs.rflags ^= val;
			NONFAILING(tss16_addr->flags ^= val);
			NONFAILING(tss16_cpl3_addr->flags ^= val);
			NONFAILING(tss32_addr->flags ^= val);
			NONFAILING(tss32_cpl3_addr->flags ^= val);
			break;
		case 4:
			seg_cs16.type = val & 0xf;
			seg_cs32.type = val & 0xf;
			seg_cs64.type = val & 0xf;
			break;
		case 5:
			seg_cs16_cpl3.type = val & 0xf;
			seg_cs32_cpl3.type = val & 0xf;
			seg_cs64_cpl3.type = val & 0xf;
			break;
		case 6:
			seg_ds16.type = val & 0xf;
			seg_ds32.type = val & 0xf;
			seg_ds64.type = val & 0xf;
			break;
		case 7:
			seg_ds16_cpl3.type = val & 0xf;
			seg_ds32_cpl3.type = val & 0xf;
			seg_ds64_cpl3.type = val & 0xf;
			break;
		case 8:
			NONFAILING(*(uint64_t*)(host_mem + ADDR_VAR_VMWRITE_FLD) = (val & 0xffff));
			NONFAILING(*(uint64_t*)(host_mem + ADDR_VAR_VMWRITE_VAL) = (val >> 16));
			break;
		default:
			exit(1);
		}
	}
	regs.rflags |= 2;
	fill_segment_descriptor(gdt, ldt, &seg_ldt);
	fill_segment_descriptor(gdt, ldt, &seg_cs16);
	fill_segment_descriptor(gdt, ldt, &seg_ds16);
	fill_segment_descriptor(gdt, ldt, &seg_cs16_cpl3);
	fill_segment_descriptor(gdt, ldt, &seg_ds16_cpl3);
	fill_segment_descriptor(gdt, ldt, &seg_cs32);
	fill_segment_descriptor(gdt, ldt, &seg_ds32);
	fill_segment_descriptor(gdt, ldt, &seg_cs32_cpl3);
	fill_segment_descriptor(gdt, ldt, &seg_ds32_cpl3);
	fill_segment_descriptor(gdt, ldt, &seg_cs64);
	fill_segment_descriptor(gdt, ldt, &seg_ds64);
	fill_segment_descriptor(gdt, ldt, &seg_cs64_cpl3);
	fill_segment_descriptor(gdt, ldt, &seg_ds64_cpl3);
	fill_segment_descriptor(gdt, ldt, &seg_tss32);
	fill_segment_descriptor(gdt, ldt, &seg_tss32_2);
	fill_segment_descriptor(gdt, ldt, &seg_tss32_cpl3);
	fill_segment_descriptor(gdt, ldt, &seg_tss32_vm86);
	fill_segment_descriptor(gdt, ldt, &seg_tss16);
	fill_segment_descriptor(gdt, ldt, &seg_tss16_2);
	fill_segment_descriptor(gdt, ldt, &seg_tss16_cpl3);
	fill_segment_descriptor_dword(gdt, ldt, &seg_tss64);
	fill_segment_descriptor_dword(gdt, ldt, &seg_tss64_cpl3);
	fill_segment_descriptor(gdt, ldt, &seg_cgate16);
	fill_segment_descriptor(gdt, ldt, &seg_tgate16);
	fill_segment_descriptor(gdt, ldt, &seg_cgate32);
	fill_segment_descriptor(gdt, ldt, &seg_tgate32);
	fill_segment_descriptor_dword(gdt, ldt, &seg_cgate64);
	if (ioctl(cpufd, KVM_SET_SREGS, &sregs))
		return -1;
	if (ioctl(cpufd, KVM_SET_REGS, &regs))
		return -1;
	return 0;
}

static void kill_and_wait(int pid, int* status)
{
	kill(-pid, SIGKILL);
	kill(pid, SIGKILL);
	int i;
	for (i = 0; i < 100; i++) {
		if (waitpid(-1, status, WNOHANG | __WALL) == pid)
			return;
		usleep(1000);
	}
	DIR* dir = opendir("/sys/fs/fuse/connections");
	if (dir) {
		for (;;) {
			struct dirent* ent = readdir(dir);
			if (!ent)
				break;
			if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
				continue;
			char abort[300];
			snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort", ent->d_name);
			int fd = open(abort, O_WRONLY);
			if (fd == -1) {
				continue;
			}
			if (write(fd, abort, 1) < 0) {
			}
			close(fd);
		}
		closedir(dir);
	} else {
	}
	while (waitpid(-1, status, __WALL) != pid) {
	}
}

#define SYZ_HAVE_SETUP_TEST 1
static void setup_test()
{
	prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
	setpgrp();
	write_file("/proc/self/oom_score_adj", "1000");
}

static void execute_one(void);

#define WAIT_FLAGS __WALL

static void loop(void)
{
	int iter;
	for (iter = 0;; iter++) {
		int pid = fork();
		if (pid < 0)
			exit(1);
		if (pid == 0) {
			setup_test();
			execute_one();
			exit(0);
		}
		int status = 0;
		uint64_t start = current_time_ms();
		for (;;) {
			if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)
				break;
			sleep_ms(1);
			if (current_time_ms() - start < 5 * 1000)
				continue;
			kill_and_wait(pid, &status);
			break;
		}
	}
}

uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};

void execute_one(void)
{
	long res = 0;
	NONFAILING(memcpy((void*)0x200011c0, "/dev/kvm\000", 9));
	/* openat(AT_FDCWD, "/dev/kvm", ...) */
	res = syscall(__NR_openat, 0xffffffffffffff9c, 0x200011c0, 0x2000, 0);
	if (res != -1)
		r[0] = res;
	/* ioctl(kvmfd, KVM_CREATE_VM, 0); 0xae01 == _IO(KVMIO, 0x01) */
	res = syscall(__NR_ioctl, r[0], 0xae01, 0);
	if (res != -1)
		r[1] = res;
	/* ioctl(vmfd, KVM_CREATE_VCPU, 0); 0xae41 == _IO(KVMIO, 0x41) */
	res = syscall(__NR_ioctl, r[1], 0xae41, 0);
	if (res != -1)
		r[2] = res;
	/* guest memory at 0x20fe5000; no text, no flags, no opts */
	syz_kvm_setup_cpu(r[1], r[2], 0x20fe5000, 0, 0, 0, 0, 0);
	/* ioctl(cpufd, KVM_RUN, 0); 0xae80 == _IO(KVMIO, 0x80) */
	syscall(__NR_ioctl, r[2], 0xae80, 0);
}
int main(void)
{
	/* 16MB of scratch memory at 0x20000000 (PROT_READ|PROT_WRITE,
	 * MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS) */
	syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
	install_segv_handler();
	for (procid = 0; procid < 8; procid++) {
		if (fork() == 0) {
			loop();
		}
	}
	sleep(1000000);
	return 0;
}





* Re: kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU
  2019-06-18 18:27                 ` Borislav Petkov
  2019-06-18 19:17                   ` Paolo Bonzini
  2019-06-18 19:34                   ` George Kennedy
@ 2019-06-23 13:15                   ` Dmitry Vyukov
  2 siblings, 0 replies; 10+ messages in thread
From: Dmitry Vyukov @ 2019-06-23 13:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: George Kennedy, Sean Christopherson, Joerg Roedel, Paolo Bonzini,
	Ingo Molnar, H. Peter Anvin, KVM list, syzkaller,
	Boris Ostrovsky

On Tue, Jun 18, 2019 at 8:27 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Jun 18, 2019 at 08:01:06PM +0200, Dmitry Vyukov wrote:
> > I am not a KVM folk either, but FWIW syzkaller is capable of creating
> > a double-nested VM.
>
> Aaaha, there it is. :)
>
> > The code is somewhat VMX-specific, but it should
> > be capable of at least executing some SVM instructions inside a guest.
> > This code sets up a VM to run a given instruction sequence (it should be generic):
> > https://github.com/google/syzkaller/blob/34bf9440bd06034f86b5d9ac8afbf078129cbdae/executor/common_kvm_amd64.h
> > The instruction generator is based on Intel XED so it may be somewhat
> > Intel-biased, but at least I see some mentions of SVM there:
> > https://raw.githubusercontent.com/google/syzkaller/34bf9440bd06034f86b5d9ac8afbf078129cbdae/pkg/ifuzz/gen/all-enc-instructions.txt
>
> Right, and that right there looks wrong:
>
> ICLASS    : VMLOAD
> CPL       : 3
> CATEGORY  : SYSTEM
> EXTENSION : SVM
> ATTRIBUTES: PROTECTED_MODE
> PATTERN   : 0x0F 0x01 MOD[0b11] MOD=3 REG[0b011] RM[0b010]
> OPERANDS  : REG0=OrAX():r:IMPL
>
> That is, *if* "CPL: 3" above means in XED context that VMLOAD is
> supposed to be run at CPL 3, then this is wrong because VMLOAD #GPs if
> CPL is not 0. Ditto for VMRUN and a couple of others.
>
> Perhaps that support was added at some point but not really run on AMD
> hw yet...


Interesting. I've updated to the latest Intel XED:
https://github.com/google/syzkaller/commit/472f0082fd8a2f82b85ab0682086e10b71529a51

And I actually see a number of changes around this for the VMX instructions:

 ICLASS    : VMPTRLD
-CPL       : 3
+CPL       : 0

 ICLASS    : VMXON
-CPL       : 3
+CPL       : 0

 ICLASS    : VMLAUNCH
-CPL       : 3
+CPL       : 0

But VMLOAD is still marked as CPL 3.
Perhaps it's something to fix in Intel XED. But for syzkaller it
should not matter much: it's a fuzzer and it will do all kinds of
crazy non-conforming stuff. These CPLs are only used as hints at best,
if at all. It should certainly try CPL-0 instructions at CPL 3 regardless.


Thread overview: 10+ messages
     [not found] <37952f51-7687-672c-45d9-92ba418c9133@oracle.com>
2019-06-12 16:12 ` kernel BUG at arch/x86/kvm/x86.c:361! on AMD CPU Borislav Petkov
     [not found]   ` <af0054d1-1fc8-c106-b503-ca91da5a6fee@oracle.com>
2019-06-12 19:51     ` Borislav Petkov
2019-06-12 20:54       ` Sean Christopherson
2019-06-13  7:18         ` Borislav Petkov
     [not found]           ` <df80299b-8e1f-f48b-a26b-c163b4018d01@oracle.com>
2019-06-18 17:51             ` Borislav Petkov
2019-06-18 18:01               ` Dmitry Vyukov
2019-06-18 18:27                 ` Borislav Petkov
2019-06-18 19:17                   ` Paolo Bonzini
2019-06-18 19:34                   ` George Kennedy
2019-06-23 13:15                   ` Dmitry Vyukov
