All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Beulich <jbeulich@suse.com>
To: Tamas K Lengyel <tamas.k.lengyel@gmail.com>
Cc: "Boris Ostrovsky" <boris.ostrovsky@oracle.com>,
	"Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	xen-devel@lists.xenproject.org,
	"Tamas K Lengyel" <tamas.lengyel@intel.com>
Subject: Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
Date: Mon, 19 Sep 2022 16:56:32 +0200	[thread overview]
Message-ID: <5d1b06f0-fc20-585e-9da0-fb24c5931ad3@suse.com> (raw)
In-Reply-To: <CABfawhmrnL1HGOWS1fkEv5X4CwfkrBj-+APJ=hM1GCzzgjW4zA@mail.gmail.com>

On 19.09.2022 16:11, Tamas K Lengyel wrote:
> On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich <jbeulich@suse.com> wrote:
> 
>> On 19.09.2022 15:24, Tamas K Lengyel wrote:
>>> On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>
>>>> On 19.09.2022 14:25, Tamas K Lengyel wrote:
>>>>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>
>>>>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
>>>>>>>
>>>>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
>>>>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>>>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>>>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>>>>>>>
>>>>>>>>> The root cause of the bug appears to be the fact that the vPMU
>> subsystem
>>>>>>>>> doesn't save its state on context_switch.
>>>>
>>>> For the further reply below - is this actually true? What is the
>>>> vpmu_switch_from() (resolving to vpmu_save()) doing then early
>>>> in context_switch()?
>>>>
>>>>>>>>> The vpmu_load function will attempt
>>>>>>>>> to gather the PMU state if its still loaded two different ways:
>>>>>>>>>      1. if the current pcpu is not where the vcpu ran before doing
>> a remote save
>>>>>>>>>      2. if the current pcpu had another vcpu active before doing a
>> local save
>>>>>>>>>
>>>>>>>>> However, in case the prev vcpu is being rescheduled on another
>> pcpu its state
>>>>>>>>> has already changed and vcpu_runnable is returning true, thus #2
>> will trip the
>>>>>>>>> ASSERT. The only way to avoid this race condition is to make sure
>> the
>>>>>>>>> prev vcpu is paused while being checked and its context saved.
>> Once the prev
>>>>>>>>> vcpu is resumed and does #1 it will find its state already saved.
>>>>>>>> While I consider this explanation plausible, I'm worried:
>>>>>>>>
>>>>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
>>>>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>>>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t
>> from_guest)
>>>>>>>>>           vpmu = vcpu_vpmu(prev);
>>>>>>>>>
>>>>>>>>>           /* Someone ran here before us */
>>>>>>>>> +        vcpu_pause(prev);
>>>>>>>>>           vpmu_save_force(prev);
>>>>>>>>>           vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>>>>>>>> +        vcpu_unpause(prev);
>>>>>>>>>
>>>>>>>>>           vpmu = vcpu_vpmu(v);
>>>>>>>>>       }
>>>>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the
>> vcpu
>>>>>>>> to actually be de-scheduled. Even with IRQs on this is already a
>>>>>>>> relatively heavy operation (also including its impact on the remote
>>>>>>>> side). Additionally the function is called from context_switch(),
>> and
>>>>>>>> I'm unsure of the usability of vcpu_pause() on such a path. In
>>>>>>>> particular: Is there a risk of two CPUs doing this mutually to one
>>>>>>>> another? If so, is deadlocking excluded?
>>>>>>>>
>>>>>>>> Hence at the very least I think the description wants extending, to
>>>>>>>> discuss the safety of the change.
>>>>>>>>
>>>>>>>> Boris - any chance you could comment here? Iirc that's code you did
>>>>>>>> introduce.
>>>>>>>
>>>>>>>
>>>>>>> Is the assertion in vmx_find_msr() really needs to be for runnable
>> vcpu or can it be a check on whether vcpu is actually running (e.g.
>> RUNSTATE_running)?
>>>>>>
>>>>>> You cannot safely check for "running", as "runnable" may transition
>>>>>> to/from "running" behind your back.
>>>>>
>>>>> The more I look at this the more I think the only sensible solution is
>>>>> to have the vPMU state be saved on vmexit for all vCPUs.
>>>>
>>>> Do you really mean vmexit? It would suffice if state was reliably
>>>> saved during context-switch-out, wouldn't it? At that point the
>>>> vCPU can't be resumed on another pCPU, yet.
>>>>
>>>>> That way all
>>>>> this having to figure out where and when a context needs saving during
>>>>> scheduling goes away. Yes, it adds a bit of overhead for cases where
>>>>> the vCPU will resume on the same pCPU and that context saved could
>>>>> have been skipped,
>>>>
>>>> If you really mean vmexit, then I'm inclined to say that's more
>>>> than just "a bit of overhead". I'd agree if you really meant
>>>> context-switch-out, but as said further up it looks to me as if
>>>> that was already occurring. Apparently I'm overlooking something
>>>> crucial ...
>>>
>>> Yes, the current setup is doing exactly that, saving the vPMU context
>>> on context-switch-out, and that's where the ASSERT failure occurs
>>> because the vCPU it needs to save the context for is already runnable
>>> on another pCPU.
>>
>> Well, if that's the scenario (sorry for not understanding it that
>> way earlier on), then the assertion is too strict: While in context
>> switch, the vCPU may be runnable, but certainly won't actually run
>> anywhere. Therefore I'd be inclined to suggest to relax the
>> condition just enough to cover this case, without actually going to
>> checking for "running".
>>
> 
> What ensures the vCPU won't actually run anywhere if it's in the runnable
> state?

The fact that the vCPU is the subject of context_switch().

> And how do I relax the condition just enough to cover just this case?

That's the more difficult question. The immediate solution, passing a
boolean or flag to vpmu_switch_from(), may not be practical, as it
would likely mean passing this through many layers.

Utilizing that I have Jürgen sitting next to me, I've discussed this
with him, and he suggested to simply check for v == current. And
indeed - set_current() in context_switch() happens a few lines after
vpmu_switch_from().

However, going back to vmx_find_msr() I find that the v == current
case is already included there. Which makes me wonder again - what
exactly is the scenario that you're observing the assertion
triggering? Would you mind spelling out the call chain, perhaps by
way of the call stack from the assertion?

Jan


  reply	other threads:[~2022-09-19 14:57 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-15 14:01 [PATCH] x86/vpmu: fix race-condition in vpmu_load Tamas K Lengyel
2022-09-16 12:52 ` Jan Beulich
2022-09-16 21:35   ` Boris Ostrovsky
2022-09-19  9:27     ` Jan Beulich
2022-09-19 12:25       ` Tamas K Lengyel
2022-09-19 13:21         ` Jan Beulich
2022-09-19 13:24           ` Tamas K Lengyel
2022-09-19 13:58             ` Jan Beulich
2022-09-19 14:11               ` Tamas K Lengyel
2022-09-19 14:56                 ` Jan Beulich [this message]
2022-09-19 22:42                   ` Boris Ostrovsky
2022-09-20  8:01                     ` Jan Beulich
2022-09-20 14:26                       ` Boris Ostrovsky
2022-09-20 14:54                         ` Jan Beulich
2022-09-20 18:12                           ` Boris Ostrovsky
2022-09-20 18:27                             ` Tamas K Lengyel
2022-09-22 13:35                               ` Tamas K Lengyel
2022-09-22 19:04                                 ` Jan Beulich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5d1b06f0-fc20-585e-9da0-fb24c5931ad3@suse.com \
    --to=jbeulich@suse.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=roger.pau@citrix.com \
    --cc=tamas.k.lengyel@gmail.com \
    --cc=tamas.lengyel@intel.com \
    --cc=wl@xen.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.