From: Igor Druzhinin <igor.druzhinin@citrix.com>
To: Jan Beulich <JBeulich@suse.com>,
	xen-devel <xen-devel@lists.xenproject.org>
Cc: Sergey Dyasli <sergey.dyasli@citrix.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Anshul Makkar <anshul.makkar@citrix.com>,
	raistlin@linux.it, Jun Nakajima <jun.nakajima@intel.com>
Subject: Re: [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths
Date: Thu, 2 Nov 2017 19:46:56 +0000	[thread overview]
Message-ID: <5ca9f140-a574-a8d0-1231-4ce0aec0e124@citrix.com> (raw)
In-Reply-To: <b24ef6a3-a8f6-7da6-0295-6bc96c0725e8@citrix.com>

On 27/10/17 18:42, Igor Druzhinin wrote:
> On 16/02/17 11:15, Jan Beulich wrote:
>> When __context_switch() is being bypassed during original context
>> switch handling, the vCPU "owning" the VMCS partially loses control of
>> it: It will appear non-running to remote CPUs, and hence their attempt
>> to pause the owning vCPU will have no effect on it (as it already
>> looks to be paused). At the same time the "owning" CPU will re-enable
>> interrupts eventually (at the latest when entering the idle loop) and
>> hence becomes subject to IPIs from other CPUs requesting access to the
>> VMCS. As a result, when __context_switch() finally gets run, the CPU
>> may no longer have the VMCS loaded, and hence any accesses to it would
>> fail. Hence we may need to re-load the VMCS in vmx_ctxt_switch_from().
>>
>> Similarly, when __context_switch() is being bypassed also on the second
>> (switch-in) path, VMCS ownership may have been lost and hence needs
>> re-establishing. Since there's no existing hook to put this in, add a
>> new one.
>>
>> Reported-by: Kevin Mayer <Kevin.Mayer@gdata.de>
>> Reported-by: Anshul Makkar <anshul.makkar@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v2: Drop the spin loop from vmx_vmcs_reload(). Use the function in
>>     vmx_do_resume() instead of open coding it there (requiring the
>>     ASSERT()s to be adjusted/dropped). Drop the new
>>     ->ctxt_switch_same() hook.
>>
>> --- a/xen/arch/x86/hvm/vmx/vmcs.c
>> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
>> @@ -552,6 +552,20 @@ static void vmx_load_vmcs(struct vcpu *v
>>      local_irq_restore(flags);
>>  }
>>  
>> +void vmx_vmcs_reload(struct vcpu *v)
>> +{
>> +    /*
>> +     * As we may be running with interrupts disabled, we can't acquire
>> +     * v->arch.hvm_vmx.vmcs_lock here. However, with interrupts disabled
>> +     * the VMCS can't be taken away from us anymore if we still own it.
>> +     */
>> +    ASSERT(v->is_running || !local_irq_is_enabled());
>> +    if ( v->arch.hvm_vmx.vmcs_pa == this_cpu(current_vmcs) )
>> +        return;
>> +
>> +    vmx_load_vmcs(v);
>> +}
>> +
>>  int vmx_cpu_up_prepare(unsigned int cpu)
>>  {
>>      /*
>> @@ -1678,10 +1692,7 @@ void vmx_do_resume(struct vcpu *v)
>>      bool_t debug_state;
>>  
>>      if ( v->arch.hvm_vmx.active_cpu == smp_processor_id() )
>> -    {
>> -        if ( v->arch.hvm_vmx.vmcs_pa != this_cpu(current_vmcs) )
>> -            vmx_load_vmcs(v);
>> -    }
>> +        vmx_vmcs_reload(v);
>>      else
>>      {
>>          /*
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -936,6 +937,18 @@ static void vmx_ctxt_switch_from(struct
>>      if ( unlikely(!this_cpu(vmxon)) )
>>          return;
>>  
>> +    if ( !v->is_running )
>> +    {
>> +        /*
>> +         * When this vCPU isn't marked as running anymore, a remote pCPU's
>> +         * attempt to pause us (from vmx_vmcs_enter()) won't have a reason
>> +         * to spin in vcpu_sleep_sync(), and hence that pCPU might have taken
>> +         * away the VMCS from us. As we're running with interrupts disabled,
>> +         * we also can't call vmx_vmcs_enter().
>> +         */
>> +        vmx_vmcs_reload(v);
>> +    }
>> +
>>      vmx_fpu_leave(v);
>>      vmx_save_guest_msrs(v);
>>      vmx_restore_host_msrs();
>> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
>> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
>> @@ -174,6 +174,7 @@ void vmx_destroy_vmcs(struct vcpu *v);
>>  void vmx_vmcs_enter(struct vcpu *v);
>>  bool_t __must_check vmx_vmcs_try_enter(struct vcpu *v);
>>  void vmx_vmcs_exit(struct vcpu *v);
>> +void vmx_vmcs_reload(struct vcpu *v);
>>  
>>  #define CPU_BASED_VIRTUAL_INTR_PENDING        0x00000004
>>  #define CPU_BASED_USE_TSC_OFFSETING           0x00000008
>>
> 
> Hi Jan,
> 
> I'm not entirely sure whether it's related, but the end result looks
> similar to the issue that this patch solved. We are now getting reports of
> a similar race condition with the following stack trace on 4.7.1 with this
> patch backported, and I'm pretty sure the same applies to master
> as well:
> 
> (XEN) [480198.570165] Xen call trace:
> (XEN) [480198.570168]    [<ffff82d0801eb53e>] vmx.c#arch/x86/hvm/vmx/vmx.o.unlikely+0x136/0x1a8
> (XEN) [480198.570171]    [<ffff82d080160095>] domain.c#__context_switch+0x10c/0x3a4
> (XEN) [480198.570176]    [<ffff82d08016560c>] __sync_local_execstate+0x35/0x51
> (XEN) [480198.570179]    [<ffff82d08018bd82>] invalidate_interrupt+0x40/0x73
> (XEN) [480198.570183]    [<ffff82d08016ea1f>] do_IRQ+0x8c/0x5cb
> (XEN) [480198.570186]    [<ffff82d08022d93f>] common_interrupt+0x5f/0x70
> (XEN) [480198.570189]    [<ffff82d0801b32b0>] vpmu_destroy+0/0x100
> (XEN) [480198.570192]    [<ffff82d0801e7dc9>] vmx.c#vmx_vcpu_destroy+0x21/0x30
> (XEN) [480198.570195]    [<ffff82d0801c2bf6>] hvm_vcpu_destroy+0x70/0x77
> (XEN) [480198.570197]    [<ffff82d08016101e>] vcpu_destroy+0x5d/0x72
> (XEN) [480198.570201]    [<ffff82d080107510>] domain.c#complete_domain_destroy+0x49/0x182
> (XEN) [480198.570204]    [<ffff82d0801266d2>] rcupdate.c#rcu_process_callbacks+0x141/0x1a3
> (XEN) [480198.570207]    [<ffff82d080132f95>] softirq.c#__do_softirq+0x75/0x80
> (XEN) [480198.570209]    [<ffff82d080132fae>] process_pending_softirqs+0xe/0x10
> (XEN) [480198.570212]    [<ffff82d0801b256f>] mwait-idle.c#mwait_idle+0xf5/0x2c3
> (XEN) [480198.570214]    [<ffff82d0801e0d00>] vmx_intr_assist+0x3bf/0x4f2
> (XEN) [480198.570216]    [<ffff82d08015fd57>] domain.c#idle_loop+0x38/0x4d
> 
> So far all attempts to get a local repro have failed - the race is quite
> rare. It seems to happen only when (probably) the stolen-VMCS issue appears
> AND a TLB flush IPI arrives at the moment of domain destruction. For the
> issue to appear, several conditions need to be met:
> 1) the TLB flush IPI executes __sync_local_execstate() and enters the VMX
> context switch path;
> 2) this happens during the vCPU destroy loop in the RCU callback;
> 3) the VMCS pointer is invalid (possibly stolen or cleared somehow).
> 
> My idea was to provoke the crash by simulating the conditions described
> above. Using Andrew's suggestion I can now satisfy conditions 1 and 2 with
> some help from XTF, but I still have no idea how to provoke 3.
> 
> Any ideas about the root cause of the fault, and suggestions on how to
> reproduce it, would be welcome. Does this crash really have something to do
> with PML? I doubt it, because the original environment can hardly be called
> PML-heavy.
> 
> Thanks,
> Igor
> 

So we finally have a complete understanding of what's going on:

Some vCPU has just migrated to another pCPU and we switched to idle, but
per_cpu(curr_vcpu) on the current pCPU still points to it - this is how
the current logic works. While in idle we issue vcpu_destroy() for some
other domain, which eventually calls vmx_vcpu_disable_pml() and trashes
the VMCS pointer on the current pCPU. At this moment we get a TLB flush
IPI from that same vCPU, which is now context switching on another pCPU -
it wants to flush the TLB after itself. That vCPU has already been marked
is_running = 1 by the scheduler. In the IPI handler we enter
__sync_local_execstate() and try to call vmx_ctxt_switch_from() for the
migrated vCPU, which is supposed to call vmx_vmcs_reload() but doesn't,
because is_running == 1. The next VMWRITE crashes the hypervisor.

So the state transition diagram might look like:
pCPU1: vCPUx -> migrate to pCPU2 -> idle -> RCU callbacks ->
vcpu_destroy() -> vmx_vcpu_disable_pml() -> vmcs_clear()
pCPU2: context switch into vCPUx -> is_running = 1 -> TLB flush
pCPU1: IPI handler -> context switch out of vCPUx -> VMWRITE -> CRASH!

We can basically just fix the condition around the vmx_vmcs_reload() call,
but I'm not completely sure that's the right way to do it - I don't think
leaving per_cpu(curr_vcpu) pointing to a migrated vCPU is a good idea
(maybe we need to clear it). What are your thoughts?

CC-ing Dario

Thanks,
Igor



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Thread overview: 30+ messages
2017-02-16 11:10 [PATCH v2 0/2] x86: context switch handling adjustments Jan Beulich
2017-02-16 11:15 ` [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths Jan Beulich
2017-02-16 12:27   ` Andrew Cooper
2017-02-16 12:35     ` Jan Beulich
2017-02-17  3:48       ` Tian, Kevin
2017-02-17  8:40   ` Sergey Dyasli
2017-02-17  9:01     ` Jan Beulich
2017-10-27 17:42   ` Igor Druzhinin
2017-11-02 19:46     ` Igor Druzhinin [this message]
2017-11-07  8:07       ` Jan Beulich
2017-11-07 14:24         ` Igor Druzhinin
2017-11-07 14:55           ` Jan Beulich
2017-11-07 15:52             ` Igor Druzhinin
2017-11-07 16:31               ` Jan Beulich
2017-11-09 10:05               ` Jan Beulich
2017-11-09 10:36                 ` Dario Faggioli
2017-11-09 12:58                   ` Jan Beulich
2017-11-09  9:54           ` Dario Faggioli
2017-11-09 10:17             ` Jan Beulich
2017-11-09 10:36               ` Sergey Dyasli
2017-11-09 11:01                 ` Dario Faggioli
2017-11-09 13:08                   ` Jan Beulich
2017-11-09 14:16                     ` Dario Faggioli
2017-11-09 14:39                       ` Jan Beulich
2017-11-09 16:38                       ` Jan Beulich
2017-11-09 10:39               ` Dario Faggioli
2017-11-07 15:16         ` Jan Beulich
2017-02-16 11:16 ` [PATCH v2 2/2] x86: package up context switch hook pointers Jan Beulich
2017-02-16 11:23   ` Andrew Cooper
2017-02-17  3:49   ` Tian, Kevin
