RE: event injection MACROs

From: "Dong, Eddie" <eddie.dong@intel.com>
To: Gleb Natapov <gleb@redhat.com>
Cc: Avi Kivity <avi@redhat.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"Dong, Eddie" <eddie.dong@intel.com>
Subject: RE: event injection MACROs
Date: Fri, 15 May 2009 15:57:44 +0800	[thread overview]
Message-ID: <9832F13BD22FB94A829F798DA4A8280501B96099E3@pdsmsx503.ccr.corp.intel.com> (raw)
In-Reply-To: <20090514154405.GA24776@redhat.com>

Gleb Natapov wrote:
> On Thu, May 14, 2009 at 10:34:11PM +0800, Dong, Eddie wrote:
>> Gleb Natapov wrote:
>>> On Thu, May 14, 2009 at 09:43:33PM +0800, Dong, Eddie wrote:
>>>> Avi Kivity wrote:
>>>>> Dong, Eddie wrote:
>>>>>> OK.
>>>>>> Also back to Gleb's question, the reason I want to do that is to
>>>>>> simplify event generation mechanism in current KVM.
>>>>>> 
>>>>>> Today KVM use additional layer of exception/nmi/interrupt such as
>>>>>> vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
>>>>>> vcpu->arch.nmi_injected. All those additional layer is due to
>>>>>> compete of VM_ENTRY_INTR_INFO_FIELD
>>>>>> write to inject the event. Both SVM & VMX has only one resource
>>>>>> to inject the virtual event but KVM generates 3 catagory of
>>>>>> events in parallel which further requires additional
>>>>>> logic to dictate among them.
>>>>> 
>>>>> I thought of using a queue to hold all pending events (in a common
>>>>> format), sort it by priority, and inject the head.
>>>> 
>>>> The SDM Table 5-4 requires to merge 2 events together, i.e.
>>>> convert to #DF/ Triple fault or inject serially when 2 events
>>>> happens no matter NMI, IRQ or exception. 
>>>> 
>>>> As if considering above events merging activity, that is a single
>>>> element queue.
>>> I don't know how you got to this conclusion from you previous
>>> statement. See explanation to table 5-2 for instate where it is
>>> stated that interrupt should be held pending if there is exception
>>> with higher priority. Should be held pending where? In the queue,
>>> like we do. Note that low prio exceptions are just dropped since
>>> they will be regenerated.
>> 
>> I have different understanding here.
>> My understanding is that "held" means NO INTA in HW, i.e. LAPIC
>> still hold this IRQ. 
>> 
> And what if INTA already happened and CPU is ready to fetch IDT for
> interrupt vector and at this very moment CPU faults?

If INTA happens, that means it is delivered. If its delivery triggers another 
exception, that is what Table5-4 handles.

My understanding is that it is 2 stage process. Table 5-2 talk about 
events happening before delivery, so that HW needs to prioritize them. 
Once a decision is make, the highest one is delivered but then it could 
trigger another exception when fetching IDT etc.

Current execption.pending/interrupt.pending/nmi_injected doesn't match 
either of above, interrupt/nmi is only for failed event injection, and a strange
fixed priority check when it is really injected: 
exception > failed NMI > failed IRQ > new NMI > new IRQ.

Table 5-2 looks missed in current KVM IMO except a wrong (but minor)
 exception > NMI > IRQ sequence.

> 
>>> 
>>>>  We could have either:  1) A pure SW "queue" that will be flush to
>>>> HW register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use HW
>>>> register. 
>>>> 
>>> We have three event sources 1) exceptions 2) IRQ 3) NMI. We should
>>> have queue of three elements sorted by priority. On each entry we
>>> should 
>> 
>> Table 5-4 alreadys says NMI/IRQ is BENIGN.
> Table 5-2 applies here not table 5-4 I think.
> 
>> 
>>> inject an event with highest priority. And remove it from queue on
>>> exit.
>> 
>> The problem is that we have to decide to inject only one of above 3,
>> and discard the rest. Whether priority them or merge (to one event
>> as Table 5-4) is another story. 
> Only a small number of event are merged into #DF. Most handled
> serially (SDM does not define what serially means unfortunately), so
> I don't understand where "discard the rest" is come from. We can

vmx_complete_interrupts clear all of them at next EXIT.

Even from HW point of view, if there are pending NMI/IRQ/exception,
CPU pick highest one, NMI, ignore/discard IRQ (but LAPIC still holds 
IRQ, thus it can be re-injected), completely discard exception.

I don't say discarding has any problem, but unnecessary to keep all of 3.
the only difference is when to discard the rest 2, at queue_exception/irq/nmi 
time or later on (even at next EXIT time), which is same to me.

> discard exception since it will be regenerated anyway, but IRQ and
> NMI is another story. SDM says that IRQ should be held pending (once
> again not much explanation here), nothing about NMI.
> 
>>> 
>>>> 
>>>> A potential benefit is that it can avoid duplicated code and
>>>> potential bugs in current code as following patch shows if I
>>>> understand correctly: 
>>>> 
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu
>>>>                 *vcpu, struct kvm_run *kvm_run) cr2 =
>>>>                 vmcs_readl(EXIT_QUALIFICATION);
>>>>                             KVMTRACE_3D(PAGE_FAULT, vcpu,
>>>> error_code, (u32)cr2, (u32)((u64)cr2 >> 32), handler); -
>>>> if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending )
>>>>                         + if (vcpu->arch.interrupt.pending ||
>>>>                 vcpu->arch.exception.pending  ||
>>>>         vcpu->arch.nmi_injected) kvm_mmu_unprotect_page_virt(vcpu,
>>>> cr2); return kvm_mmu_page_fault(vcpu, cr2, error_code); }
>>> This fix is already in Avi's tree (not yet pushed).
>>> 
>>>> Either way are OK and up to you. BTW Xen uses HW register directly
>>>> to representing an pending event.
>>>> 
>>> In this particular case I don't mind to use HW register either, but
>>> I don't see any advantage. 
>>> 
>>>>> 
>>>>>> One example is that exception has higher priority
>>>>>> than NMI/IRQ injection in current code which is not true in
>>>>>> reality. 
>>>>>> 
>>>>> 
>>>>> I don't think it matters in practice, since the guest will see it
>>>>> as a timing issue.  NMIs and IRQs are asynchronous (even those
>>>>> generated by the guest through the local APIC).
>>>> 
>>>> Yes. But also cause IRQ injection be delayed which may have side
>>>> effect. For example if guest exception handler is very longer or
>>>> if guest VCPU fall into recursive #GP. Within current logic, a
>>>> guest IRQ event from KDB (IPI) running on VCPU0, as an example,
>>>> can't force the dead loop VCPU1 into KDB since it is recursively
>>>> #GP. 
>>> If one #GP causes another #GP this is a #DF. If CPU has a chance to
>> 
>> Means another #GP in next instruction i.e. Beginning of #GP handler
>> in guest. 
>> No #DF here.
>> 
> In this case we will enter guest with "NMI windows open" request and
> should exit immediately before first instruction of #GP handler. At
> this moment KVM will be able to inject NMI.

If the HW NMI windows is supported, it is fine, how about SW NMI case?
The flow will then look like:

Guest #GP instruction -> VM Exit -> Inject virtual #GP -> VMRESUME ->
try to execute 1st ins of guest #GP handler -> VM Exit again (#GP) -> 
inject virtual #GP -> .....

> 
>>> executes
>>> something in between KVM will have a chance to inject NMI.
>> 
>> Could have no chance in some cases though not very common.
>> 
>>> 
>>>> 
>>>>> 
>>>>>> Another issue is that an failed event from previous injection say
>>>>>> IRQ or NMI may be discarded if an virtual exception happens in
>>>>>> the EXIT handling now. With the patch of generic double fault
>>>>>> handling, this case should be handled as normally.
>>>>>> 
>>>>> 
>>>>> Discarding an exception is usually okay as it will be regenerated.
>>>>> I don't think we discard interrupts or NMIs.
>>>> In reality (Running OS in guest), it doesn't happen so far. But
>>>> architecturally, it could. For example KVM injects an IRQ, but VM
>>>> Resume get #PF and back to KVM with IDT_VECTORING valid. Then KVM
>>>> will put back the failed IRQ to interrupt queue. But if #PF
>>>> handling generates another exception, then the interrupt queue
>>>> won't be able to be injected, since KVM inject exception first.
>>>> And the interrupt queue is discarded at next VM Exit. 
>>>> 
>>> I acknowledge the presence of the bug although I was not able to
>>> write a test case to cause it yet, but it is easy to fix this
>>> without changing code too much. Unified event queue and clearing of
>>> only injected event on exit should do the trick.
>> 
>> Yes, minor.
>> 
>> Eddie