All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: #DE
       [not found]   ` <b58b5d08-97a6-1e64-d8db-7ce74084553a@redhat.com>
@ 2020-05-25 15:13     ` Maxim Levitsky
  2020-05-25 16:45       ` #DE Paolo Bonzini
  0 siblings, 1 reply; 4+ messages in thread
From: Maxim Levitsky @ 2020-05-25 15:13 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov; +Cc: kvm

[-- Attachment #1: Type: text/plain, Size: 7500 bytes --]

On Sat, 2020-05-23 at 09:09 +0200, Paolo Bonzini wrote:
> On 23/05/20 02:40, Maxim Levitsky wrote:
> > On Fri, 2020-05-22 at 20:34 +0200, Paolo Bonzini wrote:
> > > The #DE is caused by INTERCEPT_VINTR being clear and V_IRQ being set.
> > > 
> > > I'm not really sure how that can happen, I tried to put some code in the
> > > branch that would avoid that, and who knows maybe it even made some
> > > progress on Hyper-V.  But, I might have missed some case.  This thing is
> > > growing fast...
> > I'll try to see if I can find any clues to this tomorrow.
> > One thing for sure, it all started with 
> > 
> > 'KVM: nSVM: implement check_nested_events for interrupts'
> > b518ba9fa691a3066ee935f6f317f827295453f0
> 
> Yes, but that is a bugfix.  The IRQ tests in svm.flat were all broken
> before.
> 
> Paolo
> 

I managed to get to the very bottom of this rabbit hole.

I attach the debug traces of the host, and the debug changes I made to the host and guest.
Both host and guest run mainline git with kvm/queue and kvm/nested-svm merged in.

You can find the exact kernels used on my gitlab account,
https://gitlab.com/maximlevitsky/linux.git,
branches kernel-starship-5.7 and kernel-starship-vm-5.7 respectively.

I captured and analyzed a trace, together with some help from debug code I inserted
(you can find this in these trees as well; I also attached it, together with an annotated trace from the host).

I think that some of these debug hacks can be turned into something useful, like tracing the state of
the interrupt window, printing the int_ctl, etc. I'll send patches about this soon.


Some of this I guessed, and I will note when I say something I don't have significant proof of,
since I don't know all of KVM yet.

With all this said, this is what is happening:

1. The host sets up an interrupt window. It needs the interrupt window because (I guess) the guest
disabled interrupts and it waits until interrupts are enabled to inject the interrupt.

To be honest this is a VMX limitation, and on SVM (I think, according to the spec) you can inject
interrupts whenever you want, and if the guest can't accept the interrupt, the vector written to the EVENTINJ field
will be moved to V_INTR_VECTOR and the V_INTR flag will be set, until the guest re-enables interrupts, and then
the interrupt will be injected. I don't have proof of this from the SVM spec but it looks like that, so to some
extent this is a guess as well.

2. Since SVM doesn't really have a concept of an interrupt window intercept, this is faked by setting V_INTR,
as if an injected (or, as they call it, virtual) interrupt were pending, together with an intercept of virtual interrupts,
which fires when the virtual interrupt is about to be injected. This is equivalent to intercepting the moment
the guest re-enables interrupts.
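To make this concrete, here is a minimal userspace sketch of the bit manipulation involved. The mask values follow the definitions in arch/x86/include/asm/svm.h as far as I can tell, but the helper names are mine, not KVM's, and this is a simplified model, not the actual svm_set_vintr/svm_clear_vintr code:

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of the VMCB int_ctl field, as in arch/x86/include/asm/svm.h */
#define V_IRQ_MASK        (1u << 8)
#define V_INTR_PRIO_SHIFT 16
#define V_INTR_PRIO_MASK  (0x0fu << V_INTR_PRIO_SHIFT)
#define INTERCEPT_VINTR   4 /* bit index in the intercept vector */

struct vmcb_ctl {
	uint64_t intercept;
	uint32_t int_ctl;
	uint8_t  int_vector;
};

/* Open a fake "interrupt window": pretend a virtual interrupt is
 * pending (V_IRQ) and intercept the moment the guest would take it. */
static void open_irq_window(struct vmcb_ctl *c)
{
	c->int_vector = 0;		/* KVM leaves the vector at 0 */
	c->int_ctl &= ~V_INTR_PRIO_MASK;
	c->int_ctl |= V_IRQ_MASK | (0xfu << V_INTR_PRIO_SHIFT);
	c->intercept |= 1ull << INTERCEPT_VINTR;
}

/* The VINTR intercept fired: the guest re-enabled interrupts, so drop
 * the fake pending virtual interrupt and the intercept. */
static void close_irq_window(struct vmcb_ctl *c)
{
	c->int_ctl &= ~V_IRQ_MASK;
	c->intercept &= ~(1ull << INTERCEPT_VINTR);
}
```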

3. Now we set up our interrupt window while we are about to enter a nested guest (while we handle the VMRUN intercept),
because we have an interrupt pending injection and the guest does indeed have interrupts disabled (it will re-enable them
when it returns to the nested guest).

4. After we enter the nested guest, we eventually get a VMexit for an unrelated reason, and we sync the V_INTR that *we* set
to the nested VMCB, since in theory that flag could have been set by the CPU itself, if the guest itself used EVENTINJ to inject
an event into its nested guest while the nested guest didn't have interrupts enabled (KVM doesn't do this, but we can't assume that).


5. Now we enter the guest and it re-enables interrupts as part of its VMexit handling. We intercept that with INTERCEPT_VINTR,
disable the VINTR intercept, and inject the interrupt we always wanted into the guest. In other words, our interrupt window is finally
signaled, and our abuse of VINTR is over.


6. At that point the bomb is ticking. Once the guest finishes dealing with the nested guest's VMexit and executes VMRUN,
we enter the nested guest with V_INTR enabled. The VINTR intercept is disabled since we closed our interrupt window long ago,
and the guest currently doesn't have any interrupt window open either, so we basically inject into the nested guest whatever is in
V_INTR_VECTOR in the nested guest's VMCB. Usually that is the value left over from when the guest itself last used an interrupt window
on its nested guest, which is zero. So we inject interrupt 0 into the nested guest, which was never there. Checkmate.


Now that I am thinking about this, the issue is deeper than I thought, and it stems from the abuse of V_INTR on AMD.
IMHO the best solution is to avoid interrupt windows on AMD unless really needed (like with level-triggered interrupts or so).


To try to explain the root cause of this, what happens is this:

1. The host sets up the interrupt window by setting the VINTR intercept and the VINTR pending flag.

2. Once we exit the nested guest, we are supposed to sync the VINTR pending flag to the nested guest's VMCB,
since this flag could have been set by the CPU itself, and not by us abusing it.

3. If the flag was set by us, it ends up in the nested guest's VMCB, and since the nested guest doesn't necessarily intercept
VINTR, it might end up injecting an interrupt into the guest.


Now the problem is that it is next to impossible to know the source of the VINTR pending flag. Even if we remember that
the host has currently set up an interrupt window, the guest afterwards could have used EVENTINJ plus an interrupt-disabled nested guest
to raise that flag as well, and it might need to know about it.


I have an idea on how to fix this, which is about 99% correct and will only fail if the guest attempts something that
is undefined anyway.

Let's set the vector of the fake VINTR to some reserved exception value rather than 0 (a value which the guest is not supposed to ever inject into the nested guest),
so then we will know if the VINTR is from our fake injection or from the guest itself.
If it is our VINTR, then we will not sync it to the guest.
In theory it could even stay 0, since exceptions should never be injected as interrupts anyway; that is a reserved operation too.
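As a userspace model of this idea (0x1F and the helper names are my choice for illustration; the actual kernel patch follows below):

```c
#include <assert.h>
#include <stdint.h>

#define V_IRQ_MASK        (1u << 8)
#define FAKE_VINTR_VECTOR 0x1f	/* reserved exception vector used as a tag */

struct ctl {
	uint32_t int_ctl;
	uint8_t  int_vector;
};

/* Sync V_IRQ back into the cached L1 controls, unless the pending
 * virtual interrupt carries our marker vector, i.e. it is the fake
 * one we planted to request an interrupt window. */
static void sync_v_irq(const struct ctl *hw, struct ctl *cached)
{
	if (hw->int_vector == FAKE_VINTR_VECTOR)
		return;	/* our fake V_IRQ, do not leak it to L1's VMCB */
	cached->int_ctl &= ~V_IRQ_MASK;
	cached->int_ctl |= hw->int_ctl & V_IRQ_MASK;
}
```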

What do you think?


This is the patch that implements the above and fixes the issue for me. It was just copy&pasted into my email client, so it probably won't apply;
I'll send it from git if you think that this is the right idea:


diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 134a5cf2b7d3..b8b2238af5e4 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -273,6 +273,13 @@ void sync_nested_vmcb_control(struct vcpu_svm *svm)
 
                WARN_ON(1);
        }
+
+       if (svm->vmcb->control.int_vector == 0x1F) {
+               /* This is our fake V_IRQ, which is meant for interrupt
+                * window. No need to sync this to the guest*/
+               return;
+       }
+
        svm->nested.ctl.int_ctl        &= ~mask;
        svm->nested.ctl.int_ctl        |= svm->vmcb->control.int_ctl & mask;
 }
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index a12af9467114..90616317575a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1363,7 +1363,7 @@ static void svm_set_vintr(struct vcpu_svm *svm)
         * Actual injection of virtual interrupts happens through EVENTINJ.
         */
        control = &svm->vmcb->control;
-       control->int_vector = 0x0;
+       control->int_vector = 0x1F;
        control->int_ctl &= ~V_INTR_PRIO_MASK;
        control->int_ctl |= V_IRQ_MASK |
                ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);


Best regards,
	Maxim Levitsky



[-- Attachment #2: guest_debug.diff --]
[-- Type: text/x-patch, Size: 2162 bytes --]

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b3406379697c8..487f24c478e79 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1363,7 +1363,7 @@ static void svm_set_vintr(struct vcpu_svm *svm)
 	 * Actual injection of virtual interrupts happens through EVENTINJ.
 	 */
 	control = &svm->vmcb->control;
-	control->int_vector = 0x0;
+	control->int_vector = 0x4;
 	control->int_ctl &= ~V_INTR_PRIO_MASK;
 	control->int_ctl |= V_IRQ_MASK |
 		((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
@@ -2663,6 +2663,8 @@ static int interrupt_window_interception(struct vcpu_svm *svm)
 	kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
 	svm_clear_vintr(svm);
 
+	trace_kvm_enable_window(0);
+
 	/*
 	 * For AVIC, the only reason to end up here is ExtINTs.
 	 * In this case AVIC was temporarily disabled for
@@ -3152,7 +3154,11 @@ static void enable_irq_window(struct kvm_vcpu *vcpu)
 		 */
 		svm_toggle_avic_for_irq_window(vcpu, false);
 		svm_set_vintr(svm);
+
+		trace_kvm_enable_window(2);
 	}
+
+	trace_kvm_enable_window(1);
 }
 
 static void enable_nmi_window(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 54a10c98d7466..c69a5f682e2c3 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -29,6 +29,22 @@ TRACE_EVENT(kvm_entry,
 	TP_printk("vcpu %u", __entry->vcpu_id)
 );
 
+
+TRACE_EVENT(kvm_enable_window,
+	TP_PROTO(unsigned int enabled),
+	TP_ARGS(enabled),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	enabled		)
+	),
+
+	TP_fast_assign(
+		__entry->enabled	= enabled;
+	),
+
+	TP_printk("enabled %u", __entry->enabled)
+);
+
 /*
  * Tracepoint for hypercall.
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8494fd519bda0..14525a0c39423 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10643,6 +10643,7 @@ u64 kvm_spec_ctrl_valid_bits(struct kvm_vcpu *vcpu)
 EXPORT_SYMBOL_GPL(kvm_spec_ctrl_valid_bits);
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_enable_window);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);

[-- Attachment #3: host_debug.diff --]
[-- Type: text/x-patch, Size: 6406 bytes --]

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 16ef8cfefb5f..134a5cf2b7d3 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -142,6 +142,8 @@ void recalc_intercepts(struct vcpu_svm *svm)
 	c->intercept_dr |= g->intercept_dr;
 	c->intercept_exceptions |= g->intercept_exceptions;
 	c->intercept |= g->intercept;
+
+	//c->intercept |= (1ULL << INTERCEPT_VINTR);
 }
 
 static void copy_vmcb_control_area(struct vmcb_control_area *dst,
@@ -268,6 +270,8 @@ void sync_nested_vmcb_control(struct vcpu_svm *svm)
 		 * restores int_ctl.  We can just leave it aside.
 		 */
 		mask &= ~V_IRQ_MASK;
+
+		WARN_ON(1);
 	}
 	svm->nested.ctl.int_ctl        &= ~mask;
 	svm->nested.ctl.int_ctl        |= svm->vmcb->control.int_ctl & mask;
@@ -393,6 +397,7 @@ int nested_svm_vmrun(struct vcpu_svm *svm)
 	struct vmcb *vmcb = svm->vmcb;
 	struct kvm_host_map map;
 	u64 vmcb_gpa;
+	bool was_like_that = false;
 
 	if (is_smm(&svm->vcpu)) {
 		kvm_queue_exception(&svm->vcpu, UD_VECTOR);
@@ -412,6 +417,11 @@ int nested_svm_vmrun(struct vcpu_svm *svm)
 
 	nested_vmcb = map.hva;
 
+	if (!(nested_vmcb->control.intercept & (1ULL << INTERCEPT_VINTR)) && (nested_vmcb->control.int_ctl & V_IRQ_MASK) && nested_vmcb->control.int_vector == 4) {
+		was_like_that = true;
+	}
+
+
 	if (!nested_vmcb_checks(nested_vmcb)) {
 		nested_vmcb->control.exit_code    = SVM_EXIT_ERR;
 		nested_vmcb->control.exit_code_hi = 0;
@@ -420,6 +430,9 @@ int nested_svm_vmrun(struct vcpu_svm *svm)
 		goto out;
 	}
 
+
+
+
 	trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb_gpa,
 			       nested_vmcb->save.rip,
 			       nested_vmcb->control.int_ctl,
@@ -462,6 +475,35 @@ int nested_svm_vmrun(struct vcpu_svm *svm)
 	svm->nested.nested_run_pending = 1;
 	enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb);
 
+	//
+	// vmcb->control.intercept = host.intercept | guest.intercept
+	// guest at that stage probably doesn't intercept VINTR, host probably also doesn't
+	//
+	// vmcb->control.int_vector comes from svm->nested.ctl.int_vector which comes from nested_vmcb
+	// probably didn't get updated since last interrupt window in the L1 guest
+	//
+	//
+	// vmcb->control.int_ctl & V_IRQ_MASK
+	// this seems also to be taken from the guest
+	//
+	//
+
+	if(!(vmcb->control.intercept & (1ULL << INTERCEPT_VINTR)) && (vmcb->control.int_ctl & V_IRQ_MASK) && vmcb->control.int_vector == 4) {
+
+		printk(KERN_ERR "vcpu %d IS ABOUT to receive wrong VINTR\n", svm->vcpu.vcpu_id);
+
+		if (was_like_that) {
+			printk(KERN_ERR "vcpu %d The VINTR settings come from the nested guest itself\n", svm->vcpu.vcpu_id);
+		}
+
+		svm->vmcb->control.exit_code    = SVM_EXIT_ERR;
+		svm->vmcb->control.exit_code_hi = 0;
+		svm->vmcb->control.exit_info_1  = 420;
+		svm->vmcb->control.exit_info_2  = 0;
+		nested_svm_vmexit(svm);
+		return ret;
+	}
+
 	if (!nested_svm_vmrun_msrpm(svm)) {
 		svm->vmcb->control.exit_code    = SVM_EXIT_ERR;
 		svm->vmcb->control.exit_code_hi = 0;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b3406379697c..a12af9467114 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1368,9 +1368,11 @@ static void svm_set_vintr(struct vcpu_svm *svm)
 	control->int_ctl |= V_IRQ_MASK |
 		((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
 	mark_dirty(svm->vmcb, VMCB_INTR);
+
+	trace_kvm_interrupt_window(1);
 }
 
-static void svm_clear_vintr(struct vcpu_svm *svm)
+static void svm_clear_vintr(struct vcpu_svm *svm, int from)
 {
 	clr_intercept(svm, INTERCEPT_VINTR);
 
@@ -1380,6 +1382,8 @@ static void svm_clear_vintr(struct vcpu_svm *svm)
 		svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
 
 	mark_dirty(svm->vmcb, VMCB_INTR);
+
+	trace_kvm_interrupt_window(0);
 }
 
 static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg)
@@ -2003,8 +2007,9 @@ void svm_set_gif(struct vcpu_svm *svm, bool value)
 		disable_gif(svm);
 
 		/* After a CLGI no interrupts should come */
-		if (!kvm_vcpu_apicv_active(&svm->vcpu))
-			svm_clear_vintr(svm);
+		if (!kvm_vcpu_apicv_active(&svm->vcpu)) {
+			svm_clear_vintr(svm, 1);
+		}
 	}
 }
 
@@ -2661,7 +2666,7 @@ static int msr_interception(struct vcpu_svm *svm)
 static int interrupt_window_interception(struct vcpu_svm *svm)
 {
 	kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
-	svm_clear_vintr(svm);
+	svm_clear_vintr(svm, 2);
 
 	/*
 	 * For AVIC, the only reason to end up here is ExtINTs.
@@ -2891,8 +2896,8 @@ static void svm_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
 {
 	struct vmcb_control_area *control = &to_svm(vcpu)->vmcb->control;
 
-	*info1 = control->exit_info_1;
-	*info2 = control->exit_info_2;
+	*info1 = control->int_ctl;
+	*info2 = control->int_vector;
 }
 
 static int handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
@@ -2914,8 +2919,8 @@ static int handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 		int vmexit;
 
 		trace_kvm_nested_vmexit(svm->vmcb->save.rip, exit_code,
-					svm->vmcb->control.exit_info_1,
-					svm->vmcb->control.exit_info_2,
+					svm->vmcb->control.int_ctl,
+					svm->vmcb->control.int_vector,
 					svm->vmcb->control.exit_int_info,
 					svm->vmcb->control.exit_int_info_err,
 					KVM_ISA_SVM);
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 54a10c98d746..845e2128ee72 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -11,6 +11,22 @@
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm
 
+TRACE_EVENT(kvm_interrupt_window,
+	TP_PROTO(unsigned int value),
+	TP_ARGS(value),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	value		)
+	),
+
+	TP_fast_assign(
+		__entry->value	= value;
+	),
+
+	TP_printk("value %u", __entry->value)
+);
+
+
 /*
  * Tracepoint for guest mode entry.
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b2855ab46d4..40b8828a5e7b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8226,6 +8226,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	bool req_int_win =
 		dm_request_for_irq_injection(vcpu) &&
 		kvm_cpu_accept_dm_intr(vcpu);
+
+	WARN_ON(req_int_win);
+
 	fastpath_t exit_fastpath;
 
 	bool req_immediate_exit = false;
@@ -10660,3 +10663,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_ga_log);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_apicv_update_request);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_interrupt_window);

[-- Attachment #4: host_trace.txt --]
[-- Type: text/plain, Size: 7536 bytes --]


Here we enter a nested guest.
We set up the interrupt window for a pending injection of interrupt 49.

           <...>-602596 [016] 96904.010990: kvm_exit:             reason EXIT_VMRUN rip 0xffffffffa04f722a info 30f0001 0
           <...>-602596 [016] 96904.010991: kvm_nested_vmrun:     rip: 0xffffffffa04f722a vmcb: 0x000000081a6d2000 nrip: 0x000000007fb5d347 int_ctl: 0x010f0000 event_inj: 0x00000000 npt: on
           <...>-602596 [016] 96904.010991: kvm_nested_intercepts: cr_read: 0010 cr_write: 0010 excp: 00060042 intercept: 00006e7fbd4c8027
           <...>-602537 [048] 96904.010992: kvm_msi_set_irq:      dst 0 vec 49 (Fixed|physical|edge)
           <...>-602537 [048] 96904.010993: kvm_apic_accept_irq:  apicid 0 vec 49 (Fixed|edge)
           <...>-602596 [016] 96904.010994: kvm_interrupt_window: value 1
           <...>-602596 [016] 96904.010995: kvm_entry:            vcpu 0

=====================================================================================================================================================================================

* We exit due to a real interrupt; the vector is not known yet, I think.
* VINTR is still intercepted, since the interrupt window is on; we toggle it on/off but keep it on at entry.
* We forward the VMexit to the L1 guest and exit nesting.

            <!!!!!!!!!!!!here, slightly before the kvm_exit, at the end of svm_vcpu_run, we sync 
            the current int_ctl to svm->nested.ctl, which pollutes it with VINTR!!!!!!!>
            
            sync_nested_vmcb_control()

           <...>-602596 [016] 96904.010997: kvm_exit:             reason EXIT_INTR rip 0x7fb5d347 info 10f0100 0
           
           <!!!! and here that bit propagates to the nested VMCB !!!!>
                nested_vmcb->control.int_ctl           = svm->nested.ctl.int_ctl;
           
           <...>-602596 [016] 96904.010997: kvm_nested_vmexit:    rip 7fb5d347 reason EXIT_INTR info1 10f0100 info2 0 int_info 0 int_info_err 0
           <...>-602596 [016] 96904.010998: kvm_nested_intr_vmexit: rip: 0x000000007fb5d347
                          <----- guest mode ended ------>

           <...>-602596 [016] 96904.010998: kvm_interrupt_window: value 0 # this is most likely the result of svm_set_gif(false), because on vmexit gif is set to 0
           <...>-602596 [016] 96904.010999: kvm_nested_vmexit_inject: reason EXIT_INTR info1 0 info2 0 int_info 0 int_info_err 0


           <...>-602596 [016] 96904.011000: kvm_interrupt_window: value 1   # now the interrupt window is set up again since we still have that pending injection
           <...>-602596 [016] 96904.011000: kvm_entry:            vcpu 0


Here L1 apparently re-enables interrupts so it can take the real interrupt it got.
Most likely it does STGI.
But since we intercept VINTR (we have the interrupt window ON), we now inject interrupt 49 into it.

           <...>-602596 [016] 96904.011003: kvm_exit:             reason EXIT_VINTR rip 0xffffffffa04224b0 info 30f0301 0
           <...>-602596 [016] 96904.011003: kvm_interrupt_window: value 0
           <...>-602596 [016] 96904.011004: kvm_inj_virq:         irq 49
           <...>-602596 [016] 96904.011004: kvm_entry:            vcpu 0


Here the L1 guest finishes dealing with the injected interrupt, and prepares to enter the nested guest.
           <...>-602596 [016] 96904.011014: kvm_eoi:              apicid 0 vector 49
           <...>-602596 [016] 96904.011015: kvm_pv_eoi:           apicid 0 vector 49
           <...>-602596 [016] 96904.011015: kvm_exit:             reason EXIT_MSR rip 0xffffffffa04f3d7d info 30f0201 0
           <...>-602596 [016] 96904.011015: kvm_msr:              msr_write 175 = 0xfffffe0000002200
           <...>-602596 [016] 96904.011016: kvm_entry:            vcpu 0
           <...>-602596 [016] 96904.011017: kvm_exit:             reason EXIT_MSR rip 0xffffffffa04f3d7d info 30f0201 0
           <...>-602596 [016] 96904.011017: kvm_msr:              msr_write 176 = 0xffffffff81a012d0
           <...>-602596 [016] 96904.011017: kvm_entry:            vcpu 0
           <...>-602596 [016] 96904.011018: kvm_exit:             reason EXIT_MSR rip 0xffffffffa04f3d7d info 30f0201 0
           <...>-602596 [016] 96904.011018: kvm_msr:              msr_write c0000103 = 0x0
           <...>-602596 [016] 96904.011019: kvm_entry:            vcpu 0
           <...>-602596 [016] 96904.011019: kvm_exit:             reason EXIT_WRITE_DR6 rip 0xffffffffa041c21e info 30f0201 0
           <...>-602596 [016] 96904.011020: kvm_entry:            vcpu 0
           <...>-602596 [016] 96904.011023: kvm_exit:             reason EXIT_NPF rip 0xffffffff814cea99 info 30f0201 0
           <...>-602596 [016] 96904.011023: kvm_page_fault:       address 1000003008 error_code f
           <...>-602596 [016] 96904.011024: kvm_emulate_insn:     0:ffffffff814cea99: 66 89 3e
           <...>-602596 [016] 96904.011025: vcpu_match_mmio:      gva 0xffffc9000024d008 gpa 0x1000003008 Write GPA
           <...>-602596 [016] 96904.011026: kvm_mmio:             mmio write len 2 gpa 0x1000003008 val 0x2
           <...>-602596 [016] 96904.011028: kvm_entry:            vcpu 0
           <...>-602596 [016] 96904.011031: kvm_exit:             reason EXIT_MSR rip 0xffffffffa04f4818 info 30f0201 0
           <...>-602596 [016] 96904.011031: kvm_msr:              msr_read 175 = 0xfffffe0000002200
           <...>-602596 [016] 96904.011032: kvm_entry:            vcpu 0
           <...>-602596 [016] 96904.011033: kvm_exit:             reason EXIT_MSR rip 0xffffffffa04f4818 info 30f0201 0
           <...>-602596 [016] 96904.011033: kvm_msr:              msr_read 176 = 0xffffffff81a012d0
           <...>-602596 [016] 96904.011033: kvm_entry:            vcpu 0
           <...>-602596 [016] 96904.011034: kvm_exit:             reason EXIT_MSR rip 0xffffffffa04f4818 info 30f0201 0
           <...>-602596 [016] 96904.011035: kvm_msr:              msr_read c0000103 = 0x0
           <...>-602596 [016] 96904.011035: kvm_entry:            vcpu 0
           <...>-602596 [016] 96904.011036: kvm_exit:             reason EXIT_MSR rip 0xffffffffa04f4870 info 30f0201 0
           <...>-602596 [016] 96904.011036: kvm_msr:              msr_write c0000103 = 0x0
           <...>-602596 [016] 96904.011037: kvm_entry:            vcpu 0
  
  =====================================================================================================================================================================================

  Here the L1 guest tries to enter the nested guest again.
           <...>-602596 [016] 96904.011039: kvm_exit:             reason EXIT_VMRUN rip 0xffffffffa04f722a info 30f0001 0
           <...>-602596 [016] 96904.011040: kvm_nested_vmrun:     rip: 0xffffffffa04f722a vmcb: 0x000000081a6d2000 nrip: 0x000000007fb5d347 int_ctl: 0x010f0100 event_inj: 0x00000000 npt: on
           <...>-602596 [016] 96904.011040: kvm_nested_intercepts: cr_read: 0010 cr_write: 0010 excp: 00060042 intercept: 00006e7fbd4c8027
           <...>-602537 [048] 96904.011061: kvm_msi_set_irq:      dst 0 vec 49 (Fixed|physical|edge)
           <...>-602537 [048] 96904.011062: kvm_apic_accept_irq:  apicid 0 vec 49 (Fixed|edge)
           <...>-602596 [016] 96904.012588: kvm_interrupt_window: value 0
           
           
           <!!! this is the error I planted; it fires when the error condition is about to happen !!!>
           <...>-602596 [016] 96904.012589: kvm_nested_vmexit_inject: reason EXIT_ERR info1 1a4 info2 0 int_info 0 int_info_err 0
  

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: #DE
  2020-05-25 15:13     ` #DE Maxim Levitsky
@ 2020-05-25 16:45       ` Paolo Bonzini
  2020-05-26 15:30         ` #DE Maxim Levitsky
  0 siblings, 1 reply; 4+ messages in thread
From: Paolo Bonzini @ 2020-05-25 16:45 UTC (permalink / raw)
  To: Maxim Levitsky, Vitaly Kuznetsov; +Cc: kvm

On 25/05/20 17:13, Maxim Levitsky wrote:
> With all this said, this is what is happening:
> 
> 1. The host sets up an interrupt window. It needs the interrupt window because (I guess) the guest
> disabled interrupts and it waits until interrupts are enabled to inject the interrupt.
> 
> To be honest this is a VMX limitation, and on SVM (I think, according to the spec) you can inject
> interrupts whenever you want, and if the guest can't accept the interrupt, the vector written to the EVENTINJ field
> will be moved to V_INTR_VECTOR and the V_INTR flag will be set,

Not exactly, EVENTINJ ignores the interrupt flag and everything else.  
But yes, you can inject the interrupt any time you want using V_IRQ + 
V_INTR_VECTOR and it will be injected by the processor.  This is a 
very interesting feature but it doesn't fit very well in the KVM
architecture so we're not using it.  Hyper-V does (and it is also
why it broke mercilessly).

> 2. Since SVM doesn't really have a concept of an interrupt window
> intercept, this is faked by setting V_INTR, as if an injected (or, as
> they call it, virtual) interrupt were pending, together with an intercept
> of virtual interrupts,
Correct.

> 4. After we enter the nested guest, we eventually get a VMexit due
> to an unrelated reason and we sync the V_INTR that *we* set
> to the nested VMCB, since in theory that flag could have been set
> by the CPU itself, if the guest itself used EVENTINJ to inject
> an event into its nested guest while the nested guest didn't have
> interrupts enabled (KVM doesn't do this, but we can't assume that)

I suppose you mean in sync_nested_vmcb_control.  Here, in the latest version,
we have:

        mask = V_IRQ_MASK | V_TPR_MASK;
        if (!(svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK) &&
            is_intercept(svm, SVM_EXIT_VINTR)) {
                /*
                 * In order to request an interrupt window, L0 is usurping
                 * svm->vmcb->control.int_ctl and possibly setting V_IRQ
                 * even if it was clear in L1's VMCB.  Restoring it would be
                 * wrong.  However, in this case V_IRQ will remain true until
                 * interrupt_window_interception calls svm_clear_vintr and
                 * restores int_ctl.  We can just leave it aside.
                 */
                mask &= ~V_IRQ_MASK;
        }

and svm_clear_vintr will copy V_IRQ, V_INTR_VECTOR, etc. back from the nested_vmcb
into svm->vmcb.  Is that enough to fix the bug?

> 6. At that point the bomb is ticking. Once the guest finishes dealing
> with the nested guest's VMexit, and executes VMRUN, we enter the nested
> guest with V_INTR enabled. The VINTR intercept is disabled since we
> disabled our interrupt window long ago, and the guest also currently
> doesn't enable any interrupt window, so we basically inject into the
> guest whatever is there in V_INTR_VECTOR in the nested guest's VMCB.

Yep, this sounds correct.  The question is whether it can still happen
in the latest version of the code, where I tried to think more about who
owns which int_ctl bits when.

> Now that I am thinking about this, the issue is deeper than I thought
> and it stems from the abuse of V_INTR on AMD. IMHO the best
> solution is to avoid interrupt windows on AMD unless really needed
> (like with level-triggered interrupts or so)

Yes, that is the root cause.  However I'm not sure it would be that
much simpler if we didn't abuse VINTR like that, because INT_CTL is a
very complicated field.

> Now the problem is that it is next to impossible to know the source
> of the VINTR pending flag. Even if we remember that the host has
> currently set up an interrupt window, the guest afterwards could have
> used EVENTINJ + an interrupt-disabled nested guest to raise that flag
> as well, and might need to know about it.

Actually it is possible!  is_intercept tests L0's VINTR intercept
(see get_host_vmcb in svm.h), and that will be true if and only if
we are abusing the V_IRQ/V_INTR_PRIO/V_INTR_VECTOR fields.

Furthermore, L0 should not use VINTR during nested VMRUN only if both
the following conditions are true:

- L1 is not using V_INTR_MASKING

- L0 has a pending interrupt (of course)

This is because when virtual interrupts are in use, you can inject
physical interrupts into L1 at any time by taking an EXIT_INTR vmexit.

My theory as to why the bug could happen involved a race between
the call to kvm_x86_ops.interrupt_allowed(vcpu, true) in
inject_pending_event and the call to kvm_cpu_has_injectable_intr
in vcpu_enter_guest.  Again, that one should be fixed in the
latest version on the branch, but there could be more than one bug!

> I have an idea on how to fix this, which is about 99% correct and will only fail if the guest attempts something that
> is undefined anyway.
> 
> Let's set the vector of the fake VINTR to some reserved exception value rather than 0 (which the guest is not supposed to ever inject into the nested guest),
> so then we will know if the VINTR is from our fake injection or from the guest itself.
> If it is our VINTR then we will not sync it to the guest.
> In theory it could even stay 0, since exceptions should never be injected as interrupts anyway; that is a reserved operation too.

Unfortunately these are interrupts, not exceptions.  You _can_ configure
the PIC (or probably the IOAPIC too) to inject vectors in the 0-31 range.
Are you old enough to remember INT 8 as the timer interrupt? :)

Thanks very much for the detective work though!  You made a good walkthrough
overall so you definitely understood good parts of the code.

Paolo


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: #DE
  2020-05-25 16:45       ` #DE Paolo Bonzini
@ 2020-05-26 15:30         ` Maxim Levitsky
  2020-05-26 16:26           ` #DE Paolo Bonzini
  0 siblings, 1 reply; 4+ messages in thread
From: Maxim Levitsky @ 2020-05-26 15:30 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov; +Cc: kvm

On Mon, 2020-05-25 at 18:45 +0200, Paolo Bonzini wrote:
> On 25/05/20 17:13, Maxim Levitsky wrote:
> > With all this said, this is what is happening:
> > 
> > 1. The host sets up an interrupt window. It needs the interrupt window because (I guess) the guest
> > disabled interrupts and it waits until interrupts are enabled to inject the interrupt.
> > 
> > To be honest this is a VMX limitation, and on SVM (I think, according to the spec) you can inject
> > interrupts whenever you want, and if the guest can't accept the interrupt, the vector written to the EVENTINJ field
> > will be moved to V_INTR_VECTOR and the V_INTR flag will be set,
> 
> Not exactly, EVENTINJ ignores the interrupt flag and everything else.  
Aha! I totally missed this; that is why I said I was somewhat guessing
here, since I hadn't found the place that said otherwise. Now I understand.
Moreover, this means that V_IRQ can't be _set_ by the processor as I feared
it could, and can only be cleared when the interrupt is taken; thus
this is what they mean when they say that V_IRQ is written back on VMexit.
This changes everything.


> But yes, you can inject the interrupt any time you want using V_IRQ + 
> V_INTR_VECTOR and it will be injected by the processor.  This is a 
> very interesting feature but it doesn't fit very well in the KVM
> architecture so we're not using it.  Hyper-V does (and it is also
> why it broke mercilessly).
Aha, now I understand that well. I was under the impression that V_IRQ/V_INTR_VECTOR
are more like something set by the CPU, and that EVENTINJ is the way to inject interrupts.
Now it is all clear.

> 
> > 2. Since SVM doesn't really have a concept of an interrupt window
> > intercept, this is faked by setting V_INTR, as if an injected (or, as
> > they call it, virtual) interrupt were pending, together with an
> > intercept of virtual interrupts,
> Correct.
> 
> > 4. After we enter the nested guest, we eventually get a VMexit for
> > an unrelated reason and we sync the V_INTR that *we* set
> > to the nested vmcs, since in theory that flag could have been set
> > by the CPU itself, if the guest itself used EVENTINJ to inject
> > an event into its nested guest while the nested guest didn't have
> > interrupts enabled (KVM doesn't do this, but we can't assume that)
> 
> I suppose you mean in sync_nested_vmcb_control.  Here, in the latest version,
Yep.
> we have:
> 
>         mask = V_IRQ_MASK | V_TPR_MASK;
>         if (!(svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK) &&
>             is_intercept(svm, SVM_EXIT_VINTR)) {
>                 /*
>                  * In order to request an interrupt window, L0 is usurping
>                  * svm->vmcb->control.int_ctl and possibly setting V_IRQ
>                  * even if it was clear in L1's VMCB.  Restoring it would be
>                  * wrong.  However, in this case V_IRQ will remain true until
>                  * interrupt_window_interception calls svm_clear_vintr and
>                  * restores int_ctl.  We can just leave it aside.
>                  */
>                 mask &= ~V_IRQ_MASK;
>         }
> 
> and svm_clear_vintr will copy V_IRQ, V_INTR_VECTOR, etc. back from the nested_vmcb
> into svm->vmcb.  Is that enough to fix the bug?

No, I now have that exact same version; as it turned out, for some reason
you hadn't published the changes, so I was indeed running an outdated version.
It works now, and, for reference, the fix was that you made the code
not set up an interrupt window at all when running a nested guest,
since a nested guest receiving an interrupt almost always means a VMEXIT,
which can't be masked. The only case where the window is still needed is
when the guest doesn't intercept physical interrupts, which should
be handled as well (I haven't yet studied how that works now,
but since it is a corner case almost nobody uses, it doesn't matter much).


> 
> > 5. At that point the bomb is ticking. Once the guest finishes dealing
> > with the nested guest's VMexit and executes VMRUN, we enter the nested
> > guest with V_INTR enabled. The V_INTR intercept is disabled since we
> > disabled our interrupt window long ago, and the guest also doesn't
> > currently enable any interrupt window, so we basically inject into the
> > guest whatever is in V_INTR_VECTOR in the nested guest's VMCB.
> 
> Yep, this sounds correct.  The question is whether it can still happen
> in the latest version of the code, where I tried to think more about who
> owns which int_ctl bits when.
Works now.
> 
> > Now that I am thinking about this, the issue is deeper than I thought,
> > and it stems from the abuse of V_INTR on AMD. IMHO the best
> > solution is to avoid interrupt windows on AMD unless really needed
> > (like with level-triggered interrupts or so)
> 
> Yes, that is the root cause.  However I'm not sure it would be that
> much simpler if we didn't abuse VINTR like that, because INT_CTL is a
> very complicated field.
> 
> > Now the problem is that it is next to impossible to know the source
> > of the VINTR pending flag. Even if we remember that the host has
> > currently set up an interrupt window, the guest afterwards could have
> > used EVENTINJ plus an interrupt-disabled nested guest to raise that
> > flag as well, and we might need to know about it.
> 
> Actually it is possible!  is_intercept tests L0's VINTR intercept
> (see get_host_vmcb in svm.h), and that will be true if and only if
> we are abusing the V_IRQ/V_INTR_PRIO/V_INTR_VECTOR fields.
Yep. I wasn't aware of the logic in svm_check_nested_events.
In fact I think it was added by the patch that I found via bisect,
after which nesting started to misbehave.

BTW, since nesting is broken with that #DE on mainline, should we prepare
some patches to -stable to fix that?


> 
> Furthermore, L0 should not use VINTR during nested VMRUN only if both
> the following conditions are true:
> 
> - L1 is not using V_INTR_MASKING
> 
> - L0 has a pending interrupt (of course)
> 
> This is because when virtual interrupts are in use, you can inject
> physical interrupts into L1 at any time by taking an EXIT_INTR vmexit.

I agree, and this is now done that way.

> 
> My theory as to why the bug could happen involved a race between
> the call to kvm_x86_ops.interrupt_allowed(vcpu, true) in
> inject_pending_event and the call to kvm_cpu_has_injectable_intr
> in vcpu_enter_guest.  Again, that one should be fixed in the
> latest version on the branch, but there could be more than one bug!
> 
> > I have an idea on how to fix this, which is about 99% correct and will only fail if the guest attempts something that
> > is undefined anyway.
> > 
> > Let's set the vector of the fake VINTR to some reserved exception value rather than 0 (which the guest is never supposed to inject into the nested guest),
> > so that we will know whether the VINTR is from our fake injection or from the guest itself.
> > If it is our VINTR then we will not sync it to the guest.
> > In theory it could even be 0, since exceptions should never be injected as interrupts anyway; that too is a reserved operation.
> 
> Unfortunately these are interrupts, not exceptions.  You _can_ configure
> the PIC (or probably the IOAPIC too) to inject vectors in the 0-31 range.
> Are you old enough to remember INT 8 as the timer interrupt? :)

Sadly not :-) My computing life started with a Pentium 133 and Windows 98.
But I am aware of this; I was just thinking that nobody would do it anyway.

I still remember, from my job at Intel, closing about 2-3 bug reports per year for
the user error of injecting a random interrupt vector and hitting an exception handler
that expects an error code, thus crashing the whole thing.


> 
> Thanks very much for the detective work though!  You made a good walkthrough
> overall, so you clearly understood large parts of the code.

Thank you too. I think this bug was probably the thing
I enjoyed the most since I joined Red Hat.
(closely followed by my nvme-mdev driver)

> 
> Paolo
> 

Best regards,
	Maxim Levitsky




* Re: #DE
  2020-05-26 15:30         ` #DE Maxim Levitsky
@ 2020-05-26 16:26           ` Paolo Bonzini
  0 siblings, 0 replies; 4+ messages in thread
From: Paolo Bonzini @ 2020-05-26 16:26 UTC (permalink / raw)
  To: Maxim Levitsky, Vitaly Kuznetsov; +Cc: kvm

On 26/05/20 17:30, Maxim Levitsky wrote:
>>> Now the problem is that it is next to impossible to know the source
>>> of the VINTR pending flag. Even if we remember that the host has
>>> currently set up an interrupt window, the guest afterwards could have
>>> used EVENTINJ plus an interrupt-disabled nested guest to raise that
>>> flag as well, and we might need to know about it.
>> Actually it is possible!  is_intercept tests L0's VINTR intercept
>> (see get_host_vmcb in svm.h), and that will be true if and only if
>> we are abusing the V_IRQ/V_INTR_PRIO/V_INTR_VECTOR fields.
> Yep. I wasn't aware of the logic in svm_check_nested_events.
> In fact I think it was added by the patch that I found via bisect,
> after which nesting started to misbehave.
> 
> BTW, since nesting is broken with that #DE on mainline, should we prepare
> some patches to -stable to fix that?

I think 5.7 is going to remain broken.  But for 5.8 we need to fix Hyper-V.

Paolo



end of thread, other threads:[~2020-05-26 16:27 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <0fa0acac-f3b6-96c0-6ac8-18ec4d573aab@redhat.com>
     [not found] ` <233a810765c8b026778e76e9f8828a9ad0b3716d.camel@redhat.com>
     [not found]   ` <b58b5d08-97a6-1e64-d8db-7ce74084553a@redhat.com>
2020-05-25 15:13     ` #DE Maxim Levitsky
2020-05-25 16:45       ` #DE Paolo Bonzini
2020-05-26 15:30         ` #DE Maxim Levitsky
2020-05-26 16:26           ` #DE Paolo Bonzini
