From: Igor Druzhinin <igor.druzhinin@citrix.com>
To: Jan Beulich <JBeulich@suse.com>,
	xen-devel <xen-devel@lists.xenproject.org>
Cc: Sergey Dyasli <sergey.dyasli@citrix.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Anshul Makkar <anshul.makkar@citrix.com>,
	raistlin@linux.it, Jun Nakajima <jun.nakajima@intel.com>
Subject: Re: [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths
Date: Thu, 2 Nov 2017 19:46:56 +0000	[thread overview]
Message-ID: <5ca9f140-a574-a8d0-1231-4ce0aec0e124@citrix.com> (raw)
In-Reply-To: <b24ef6a3-a8f6-7da6-0295-6bc96c0725e8@citrix.com>

On 27/10/17 18:42, Igor Druzhinin wrote:
> On 16/02/17 11:15, Jan Beulich wrote:
>> When __context_switch() is being bypassed during original context
>> switch handling, the vCPU "owning" the VMCS partially loses control of
>> it: It will appear non-running to remote CPUs, and hence their attempt
>> to pause the owning vCPU will have no effect on it (as it already
>> looks to be paused). At the same time the "owning" CPU will re-enable
>> interrupts eventually (at the latest when entering the idle loop) and
>> hence becomes subject to IPIs from other CPUs requesting access to the
>> VMCS. As a result, when __context_switch() finally gets run, the CPU
>> may no longer have the VMCS loaded, and hence any accesses to it would
>> fail. Hence we may need to re-load the VMCS in vmx_ctxt_switch_from().
>>
>> Similarly, when __context_switch() is being bypassed also on the second
>> (switch-in) path, VMCS ownership may have been lost and hence needs
>> re-establishing. Since there's no existing hook to put this in, add a
>> new one.
>>
>> Reported-by: Kevin Mayer <Kevin.Mayer@gdata.de>
>> Reported-by: Anshul Makkar <anshul.makkar@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v2: Drop the spin loop from vmx_vmcs_reload(). Use the function in
>>     vmx_do_resume() instead of open coding it there (requiring the
>>     ASSERT()s to be adjusted/dropped). Drop the new
>>     ->ctxt_switch_same() hook.
>>
>> --- a/xen/arch/x86/hvm/vmx/vmcs.c
>> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
>> @@ -552,6 +552,20 @@ static void vmx_load_vmcs(struct vcpu *v
>>      local_irq_restore(flags);
>>  }
>>  
>> +void vmx_vmcs_reload(struct vcpu *v)
>> +{
>> +    /*
>> +     * As we may be running with interrupts disabled, we can't acquire
>> +     * v->arch.hvm_vmx.vmcs_lock here. However, with interrupts disabled
>> +     * the VMCS can't be taken away from us anymore if we still own it.
>> +     */
>> +    ASSERT(v->is_running || !local_irq_is_enabled());
>> +    if ( v->arch.hvm_vmx.vmcs_pa == this_cpu(current_vmcs) )
>> +        return;
>> +
>> +    vmx_load_vmcs(v);
>> +}
>> +
>>  int vmx_cpu_up_prepare(unsigned int cpu)
>>  {
>>      /*
>> @@ -1678,10 +1692,7 @@ void vmx_do_resume(struct vcpu *v)
>>      bool_t debug_state;
>>  
>>      if ( v->arch.hvm_vmx.active_cpu == smp_processor_id() )
>> -    {
>> -        if ( v->arch.hvm_vmx.vmcs_pa != this_cpu(current_vmcs) )
>> -            vmx_load_vmcs(v);
>> -    }
>> +        vmx_vmcs_reload(v);
>>      else
>>      {
>>          /*
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -936,6 +937,18 @@ static void vmx_ctxt_switch_from(struct
>>      if ( unlikely(!this_cpu(vmxon)) )
>>          return;
>>  
>> +    if ( !v->is_running )
>> +    {
>> +        /*
>> +         * When this vCPU isn't marked as running anymore, a remote pCPU's
>> +         * attempt to pause us (from vmx_vmcs_enter()) won't have a reason
>> +         * to spin in vcpu_sleep_sync(), and hence that pCPU might have taken
>> +         * away the VMCS from us. As we're running with interrupts disabled,
>> +         * we also can't call vmx_vmcs_enter().
>> +         */
>> +        vmx_vmcs_reload(v);
>> +    }
>> +
>>      vmx_fpu_leave(v);
>>      vmx_save_guest_msrs(v);
>>      vmx_restore_host_msrs();
>> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
>> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
>> @@ -174,6 +174,7 @@ void vmx_destroy_vmcs(struct vcpu *v);
>>  void vmx_vmcs_enter(struct vcpu *v);
>>  bool_t __must_check vmx_vmcs_try_enter(struct vcpu *v);
>>  void vmx_vmcs_exit(struct vcpu *v);
>> +void vmx_vmcs_reload(struct vcpu *v);
>>  
>>  #define CPU_BASED_VIRTUAL_INTR_PENDING        0x00000004
>>  #define CPU_BASED_USE_TSC_OFFSETING           0x00000008
>>
> 
> Hi Jan,
> 
> I'm not entirely sure whether it's related, but the end result looks
> similar to the issue that this patch solved. We are now getting reports of
> a similar race condition with the following stack trace on 4.7.1 with this
> patch backported, and I'm pretty sure the same applies to master
> as well:
> 
> (XEN) [480198.570165] Xen call trace:
> (XEN) [480198.570168]    [<ffff82d0801eb53e>] vmx.c#arch/x86/hvm/vmx/vmx.o.unlikely+0x136/0x1a8
> (XEN) [480198.570171]    [<ffff82d080160095>] domain.c#__context_switch+0x10c/0x3a4
> (XEN) [480198.570176]    [<ffff82d08016560c>] __sync_local_execstate+0x35/0x51
> (XEN) [480198.570179]    [<ffff82d08018bd82>] invalidate_interrupt+0x40/0x73
> (XEN) [480198.570183]    [<ffff82d08016ea1f>] do_IRQ+0x8c/0x5cb
> (XEN) [480198.570186]    [<ffff82d08022d93f>] common_interrupt+0x5f/0x70
> (XEN) [480198.570189]    [<ffff82d0801b32b0>] vpmu_destroy+0/0x100
> (XEN) [480198.570192]    [<ffff82d0801e7dc9>] vmx.c#vmx_vcpu_destroy+0x21/0x30
> (XEN) [480198.570195]    [<ffff82d0801c2bf6>] hvm_vcpu_destroy+0x70/0x77
> (XEN) [480198.570197]    [<ffff82d08016101e>] vcpu_destroy+0x5d/0x72
> (XEN) [480198.570201]    [<ffff82d080107510>] domain.c#complete_domain_destroy+0x49/0x182
> (XEN) [480198.570204]    [<ffff82d0801266d2>] rcupdate.c#rcu_process_callbacks+0x141/0x1a3
> (XEN) [480198.570207]    [<ffff82d080132f95>] softirq.c#__do_softirq+0x75/0x80
> (XEN) [480198.570209]    [<ffff82d080132fae>] process_pending_softirqs+0xe/0x10
> (XEN) [480198.570212]    [<ffff82d0801b256f>] mwait-idle.c#mwait_idle+0xf5/0x2c3
> (XEN) [480198.570214]    [<ffff82d0801e0d00>] vmx_intr_assist+0x3bf/0x4f2
> (XEN) [480198.570216]    [<ffff82d08015fd57>] domain.c#idle_loop+0x38/0x4d
> 
> So far all attempts to get a local repro have failed - the race is quite
> rare. It seems to happen only when (probably) the stolen-VMCS issue appears
> AND a TLB flush IPI arrives at the moment of domain destruction. For the
> issue to appear, several conditions need to be met:
> 1) the TLB flush IPI executes __sync_local_execstate() and enters the VMX
> context switch path;
> 2) this happens during the vCPU destroy loop in the RCU callback;
> 3) the VMCS pointer is invalid (possibly stolen or cleared somehow).
> 
> My idea was to provoke the crash by simulating the conditions described
> above. Using Andrew's suggestion I can now satisfy conditions 1 and 2 with
> some help from XTF, but I still have no idea how to provoke 3.
> 
> Any ideas about the root cause of the fault, and suggestions on how to
> reproduce it, would be welcome. Does this crash really have something to do
> with PML? I doubt it, because the original environment can hardly be called
> PML-heavy.
> 
> Thanks,
> Igor
> 

So we finally have a complete understanding of what's going on:

Some vCPU has just migrated to another pCPU and we switched to idle, but
per_cpu(curr_vcpu) on the current pCPU still points to it - this is how
the current logic works. While in idle we issue vcpu_destroy() for some
other domain, which eventually calls vmx_vcpu_disable_pml() and trashes
the VMCS pointer on the current pCPU. At this moment we get a TLB flush
IPI from that same vCPU, which is now context switching on another pCPU -
it wants to flush the TLB after itself. That vCPU has already been marked
is_running = 1 by the scheduler. In the IPI handler we enter
__sync_local_execstate() and try to call vmx_ctxt_switch_from() for the
migrated vCPU, which is supposed to call vmx_vmcs_reload() but doesn't,
because is_running == 1. The next VMWRITE crashes the hypervisor.

So the state transition diagram might look like:
pCPU1: vCPUx -> migrate to pCPU2 -> idle -> RCU callbacks ->
vcpu_destroy() -> vmx_vcpu_disable_pml() -> vmcs_clear()
pCPU2: context switch into vCPUx -> is_running = 1 -> TLB flush
pCPU1: IPI handler -> context switch out of vCPUx -> VMWRITE -> CRASH!

We can basically just fix the condition around the vmx_vmcs_reload() call,
but I'm not completely sure that's the right way to do it - I don't think
leaving per_cpu(curr_vcpu) pointing to a migrated vCPU is a good idea
(maybe we need to clear it). What are your thoughts?

CC-ing Dario

Thanks,
Igor



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Thread overview: 30+ messages
2017-02-16 11:10 [PATCH v2 0/2] x86: context switch handling adjustments Jan Beulich
2017-02-16 11:15 ` [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths Jan Beulich
2017-02-16 12:27   ` Andrew Cooper
2017-02-16 12:35     ` Jan Beulich
2017-02-17  3:48       ` Tian, Kevin
2017-02-17  8:40   ` Sergey Dyasli
2017-02-17  9:01     ` Jan Beulich
2017-10-27 17:42   ` Igor Druzhinin
2017-11-02 19:46     ` Igor Druzhinin [this message]
2017-11-07  8:07       ` Jan Beulich
2017-11-07 14:24         ` Igor Druzhinin
2017-11-07 14:55           ` Jan Beulich
2017-11-07 15:52             ` Igor Druzhinin
2017-11-07 16:31               ` Jan Beulich
2017-11-09 10:05               ` Jan Beulich
2017-11-09 10:36                 ` Dario Faggioli
2017-11-09 12:58                   ` Jan Beulich
2017-11-09  9:54           ` Dario Faggioli
2017-11-09 10:17             ` Jan Beulich
2017-11-09 10:36               ` Sergey Dyasli
2017-11-09 11:01                 ` Dario Faggioli
2017-11-09 13:08                   ` Jan Beulich
2017-11-09 14:16                     ` Dario Faggioli
2017-11-09 14:39                       ` Jan Beulich
2017-11-09 16:38                       ` Jan Beulich
2017-11-09 10:39               ` Dario Faggioli
2017-11-07 15:16         ` Jan Beulich
2017-02-16 11:16 ` [PATCH v2 2/2] x86: package up context switch hook pointers Jan Beulich
2017-02-16 11:23   ` Andrew Cooper
2017-02-17  3:49   ` Tian, Kevin
